Domain Adaptation and Image-to-Image Translation

Arthur Pesah
Master's student at KTH

Sebastian Bujwid
DL Engineer at Univrses + ML MSc student at KTH

Can you recognize this image?

Stockholms Slott − Anna Palm de Rosa
You may never have seen paintings of this castle...

[Four images of the castle: a photo by day, a photo at night, a black-and-white painting, and a color painting]

...but can manage to recognize it in many domains

That's what we call...

...domain adaptation

Domain adaptation

  • One task (classification, segmentation, ...)
  • Two datasets:
    • Source (photos): fully labeled
    • Target (paintings): unlabeled or semi-labeled

Image-to-image translation

  • One way of doing domain adaptation is to use image-to-image translation methods
  • Goal: translate an image from one domain to another, e.g. synthetic → real
  • But it can also be used for purposes other than domain adaptation

Relation with transfer learning

Domain adaptation can be seen as a special case of transfer learning: the task stays the same, but the data distribution changes between source and target.

Applications of domain adaptation

Calibration (physics, biology...)

Simulation vs Reality

Sentiment analysis between different categories

Adaptation between cameras

Applications of image-to-image translation: synthetic data

  • Data can be very expensive to obtain and annotate.
  • Synthetic data is relatively cheap and contains ground-truth information, but its usefulness is limited due to the domain shift.

Applications of image-to-image translation: synthetic data

  • Real data is often even impossible to annotate, because not all the information is present in the image alone:
    • Occlusions
    • No 3D information (depth)

Applications of image-to-image translation: synthetic data

  • Take synthetic data and make it real!
  • Sure, we GAN!
  • Image-to-image translation methods for the win

Different flavours of domain adaptation

  • Unsupervised domain adaptation: source fully labeled, target fully unlabeled
  • Semi-supervised domain adaptation: source fully labeled, target partially labeled
  • Few-shot domain adaptation: source fully labeled, target with only a few labeled samples

Different flavours of image-to-image translation

  • Supervised i2i translation: pairs of corresponding images available
  • Unsupervised i2i translation: no pairs

Supervised image-to-image translation

  • Supervision = known correspondence between samples in the two datasets
  • pix2pix - https://phillipi.github.io/pix2pix/
  • Conditional GAN (see the sketch below):
    • the source sample is passed to the generator
    • the discriminator sees pairs: corresponding source & target images
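A minimal sketch of this pairing idea, assuming a small PatchGAN-style discriminator in PyTorch (the class name, layer sizes, and channel counts are illustrative, not taken from the pix2pix code):

```python
import torch
import torch.nn as nn

class PairDiscriminator(nn.Module):
    """Judges (source, target) image pairs: real corresponding pairs vs. generated ones."""
    def __init__(self, in_channels=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(2 * in_channels, 64, 4, stride=2, padding=1),  # source and target stacked on channels
            nn.LeakyReLU(0.2),
            nn.Conv2d(64, 128, 4, stride=2, padding=1),
            nn.LeakyReLU(0.2),
            nn.Conv2d(128, 1, 4, padding=1),  # patch-wise real/fake scores
        )

    def forward(self, source, target):
        # The conditional part: the discriminator always sees the source image
        # together with either the real target or generator(source).
        return self.net(torch.cat([source, target], dim=1))
```

During training, the discriminator is pushed to accept (source, real target) pairs and reject (source, generator(source)) pairs, while the generator tries to make the latter indistinguishable from the former.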

Unsupervised image-to-image translation

  • More challenging - the relation between the datasets is not present in the data
  • But also more interesting - in practice it is often impossible to create datasets with one-to-one correspondences (e.g. synthetic images vs. real images)
  • What is the relation between images from different domains?

Classical model of domain adaptation

(Ben-David et al., 2010)

  • Probabilistic perspective: we consider two distributions $P(X_s, Y_s)$ and $P(X_t, Y_t)$ over source and target samples/labels
  • Discrepancy between the domains: \[ \epsilon_T(h) \leq \epsilon_S(h) + \frac{1}{2} d_{\mathcal{H}\Delta \mathcal{H}}(X_s, X_t) + \lambda \] where $h \in \mathcal{H}$ is a classifier, $\epsilon_{S/T}$ is the error on the source/target distribution, $\lambda = \min_{h' \in \mathcal{H}} [\epsilon_S(h') + \epsilon_T(h')]$ is the error of the ideal joint hypothesis (the finite-sample version adds a VC-dimension complexity term), and
    \[ d_{\mathcal{H}\Delta \mathcal{H}}(X_s, X_t) = 2 \sup_{h,h' \in \mathcal{H}} | \mathbb{E}_{x \sim X_s}[h(x) \neq h'(x)] - \mathbb{E}_{x \sim X_t}[h(x) \neq h'(x)] | \]
  • Goal: find $f: \mathcal{X}_t \rightarrow \mathcal{X}_s$ minimizing $\frac{1}{2} d_{\mathcal{H} \Delta \mathcal{H}}(X_s, f(X_t))$ and $h$ minimizing $\epsilon_S(h)$

Classical model of domain adaptation

(Ben-David et al., 2010)

\[ d_{\mathcal{H}\Delta \mathcal{H}}(X_s, X_t) = 2 \sup_{h,h' \in \mathcal{H}} | \mathbb{E}_{x \sim X_s}[h(x) \neq h'(x)] - \mathbb{E}_{x \sim X_t}[h(x) \neq h'(x)] | \]

To what extent can we find two hypotheses that are very similar in one domain but very different in the other?

Classical model of domain adaptation

  • But this theoretical distance is hard to compute, and even harder to minimize with any classical optimization algorithm... We have to find other distances
  • In probability theory, we often use divergences
  • Definition. Let $S$ be a space of probability distributions. A divergence $D: S \times S \rightarrow \mathbb{R}$ is a function such that:

    1. $D(P, Q) \geq 0$ for all $P,Q \in S$
    2. $D(P,Q) = 0 \iff P=Q$

  • Examples: KL-divergence, Wasserstein distance, JS-divergence, etc.
  • Most DA algorithms consist of choosing a divergence and minimizing it (see the toy example below)
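As a quick illustration of these properties, here is a toy check with the KL divergence on two discrete distributions (scipy's `entropy(p, q)` computes KL(p‖q)); the probability values are made up:

```python
import numpy as np
from scipy.stats import entropy  # entropy(p, q) = KL(p || q) for discrete distributions

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])

print(entropy(p, q))                             # >= 0, small because p and q are close
print(entropy(p, p))                             # == 0: D(P, P) = 0
print(np.isclose(entropy(p, q), entropy(q, p)))  # False: KL is not symmetric, hence not a metric
```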

Optimal Transport

  • Mathematical framework behind the Earth Mover's Distance, also called the Wasserstein distance: $\mathrm{W}(P_r,P_{\theta}) = \inf_{\gamma \in \Pi} \, \sum\limits_{x,y} \Vert x - y \Vert \gamma (x,y)$
  • (Courtesy Vincent Herrmann)

  • In matrix form: $\mathrm{W}(P_r,P_{\theta})=\inf_{\mathbf{\Gamma} \in \Pi} \, \langle \mathbf{D}, \mathbf{\Gamma} \rangle_\mathrm{F}$, where $\mathbf{D}$ is the pairwise cost matrix and $\mathbf{\Gamma}$ the transport plan

Optimal Transport

Special case: empirical distributions (uniform weight on every sample)

(Made with the optimal transport library POT)

$\mathrm{W}(P_r,P_{\theta}) = \inf_{\gamma \in \Pi} \, \sum\limits_{i,j} \Vert x_i - y_j \Vert \gamma (x_i,y_j)$, where the optimal plan has $\gamma (x_i,y_j) \in \{0, \tfrac{1}{n}\}$ (a scaled permutation when both distributions have $n$ samples)
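A small sketch of this special case using POT, the library credited above; the Gaussian samples are made up for illustration:

```python
import numpy as np
import ot  # POT: Python Optimal Transport

# Two toy empirical distributions with n samples each and uniform weights 1/n.
rng = np.random.default_rng(0)
n = 50
xs = rng.normal(loc=0.0, scale=1.0, size=(n, 2))   # "source" samples
xt = rng.normal(loc=3.0, scale=1.0, size=(n, 2))   # "target" samples
a, b = ot.unif(n), ot.unif(n)                      # uniform weights (1/n each)

M = ot.dist(xs, xt, metric='euclidean')            # pairwise cost matrix D
G = ot.emd(a, b, M)                                # optimal transport plan Gamma
print(np.sum(G * M))                               # Wasserstein distance <D, Gamma>_F
```

POT also exposes `ot.emd2(a, b, M)`, which returns the transport cost directly instead of the plan.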

Joint Distribution Optimal Transportation

(Courty et al., 2017)

(Made with the optimal transport library POT)

  • Idea: apply optimal transport to the joint distributions of samples and labels, with a cost that combines the distance between samples and the loss between source labels and target predictions; the transport plan and the classifier are learned jointly

Deep JDOT

(Damodaran et al., 2018)

Recent improvements over the original JDOT paper:

  • Learn optimal transport in an embedding space (instead of the input space)
  • Iterative algorithm (see the sketch below):
    • For a fixed transport plan, learn an embedding network that pulls the matched points closer, and a classifier fitted on the pseudo-labels
    • For a fixed embedding and classifier, learn the best transport plan
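A rough sketch of these two alternating steps on a mini-batch, assuming hypothetical networks `g` (embedding) and `f` (classifier), integer source labels, and made-up trade-off weights; this is an illustration of the idea, not the authors' implementation:

```python
import numpy as np
import torch
import torch.nn.functional as F
import ot  # POT

def transport_step(g, f, x_s, y_s, x_t, alpha=0.001, beta=1.0):
    # Networks fixed: build the joint cost (features + labels) and solve for the optimal plan.
    with torch.no_grad():
        z_s, z_t = g(x_s), g(x_t)
        p_t = F.softmax(f(z_t), dim=1)
        y_onehot = F.one_hot(y_s, p_t.shape[1]).float()
        cost = alpha * torch.cdist(z_s, z_t) ** 2 + beta * torch.cdist(y_onehot, p_t) ** 2
    a, b = ot.unif(len(x_s)), ot.unif(len(x_t))
    return torch.tensor(ot.emd(a, b, cost.numpy().astype(np.float64)), dtype=torch.float32)

def network_step(g, f, x_s, y_s, x_t, gamma, alpha=0.001, beta=1.0):
    # Plan fixed: classify the source, pull matched embeddings together,
    # and fit the classifier on the labels transported from the source (pseudo-labels).
    z_s, z_t = g(x_s), g(x_t)
    log_p_t = F.log_softmax(f(z_t), dim=1)
    y_onehot = F.one_hot(y_s, log_p_t.shape[1]).float()
    loss_task = F.cross_entropy(f(z_s), y_s)
    loss_align = alpha * (gamma * torch.cdist(z_s, z_t) ** 2).sum()
    loss_pseudo = -beta * (gamma * (y_onehot @ log_p_t.T)).sum()
    return loss_task + loss_align + loss_pseudo
```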

Adversarial Domain Adaptation

  • Revolution in domain adaptation starting with Ganin et al., 2015
  • Use GAN-like deep architectures to minimize the Jensen-Shannon divergence between the two distributions
  • Adversarial domain adaptation framework (see the sketch below):
    • A conditional generator takes a target input and tries to generate a source-like output
    • A discriminator tries to separate real source samples from source-like generated samples
    • The generator tries to fool the discriminator
  • Almost all recent DA papers are variations on this structure
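A minimal sketch of this structure at the feature level, in the spirit of Ganin et al., 2015, but written with an explicit alternating update instead of a gradient-reversal layer; the network sizes and learning rates are placeholders for a 2-D toy problem like the Gaussian example that follows:

```python
import torch
import torch.nn as nn

feat = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 64), nn.ReLU())  # "generator" (feature extractor)
clf = nn.Linear(64, 2)                                                            # task classifier (source labels)
disc = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 1))              # source vs. target discriminator

bce, ce = nn.BCEWithLogitsLoss(), nn.CrossEntropyLoss()
opt_g = torch.optim.Adam(list(feat.parameters()) + list(clf.parameters()), lr=1e-3)
opt_d = torch.optim.Adam(disc.parameters(), lr=1e-3)

def train_step(x_s, y_s, x_t):
    # 1) Discriminator: separate source features (label 1) from target features (label 0).
    opt_d.zero_grad()
    with torch.no_grad():
        f_s, f_t = feat(x_s), feat(x_t)
    d_loss = bce(disc(f_s), torch.ones(len(x_s), 1)) + bce(disc(f_t), torch.zeros(len(x_t), 1))
    d_loss.backward()
    opt_d.step()

    # 2) Feature extractor + classifier: classify the labeled source correctly and
    #    make target features look source-like, i.e. fool the discriminator.
    opt_g.zero_grad()
    f_s, f_t = feat(x_s), feat(x_t)
    g_loss = ce(clf(f_s), y_s) + bce(disc(f_t), torch.ones(len(x_t), 1))
    g_loss.backward()
    opt_g.step()
    return d_loss.item(), g_loss.item()
```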

Adversarial Domain Adaptation

Classical GAN architecture:

Domain Adaptation: replace the input noise with a target image

Adversarial Domain Adaptation

Toy example: 2D Gaussians

Cycle-GAN

(Zhu et al., 2017)

  • Cycle-consistency assumption (sketched below): \[ F_{TS}(F_{ST}(x_S)) \approx x_S \]
  • It is only an approximate constraint - we neither want nor can have exact equality
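A minimal sketch of the cycle-consistency term, assuming two hypothetical generator modules `F_st` (source → target) and `F_ts` (target → source); the weight `lam` is a placeholder. In Cycle-GAN this term is added to two adversarial losses, one discriminator per domain:

```python
import torch.nn.functional as F

def cycle_consistency_loss(F_st, F_ts, x_s, x_t, lam=10.0):
    # Translating to the other domain and back should approximately reconstruct the input.
    loss_s = F.l1_loss(F_ts(F_st(x_s)), x_s)   # F_TS(F_ST(x_S)) ~ x_S
    loss_t = F.l1_loss(F_st(F_ts(x_t)), x_t)   # F_ST(F_TS(x_T)) ~ x_T
    return lam * (loss_s + loss_t)
```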

UNsupervised Image-to-Image Translation (UNIT)

(Liu et al., 2017)

  • Hypothesis: a shared latent space between the two domains (see the sketch below)
  • Training: a combination of VAE and GAN objectives
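A conceptual sketch of the shared-latent-space assumption with toy linear encoders and decoders (all modules and shapes below are illustrative, not the UNIT architecture):

```python
import torch
import torch.nn as nn

# One encoder/decoder per domain; the latent code z is shared between domains.
enc_s, enc_t = nn.Linear(2, 8), nn.Linear(2, 8)
dec_s, dec_t = nn.Linear(8, 2), nn.Linear(8, 2)

x_s = torch.randn(16, 2)
z = enc_s(x_s) + torch.randn(16, 8)   # VAE-style latent: encoder output plus unit-variance noise
x_ss = dec_s(z)                       # same-domain reconstruction, trained with a VAE loss
x_st = dec_t(z)                       # cross-domain translation, judged by a GAN
                                      # discriminator in the target domain
```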

UNIT


Compared to just a target discriminator (no cycle-consistency, no shared latent space)

Extra: regularization of the latent space

UNIT


Beyond UNIT (my results) - regularizing the output space!

State-of-the-art algorithm: VADA-DIRT-T

(Shu et al., 2018)

  • Virtual Adversarial Domain Adaptation (VADA):
    • Adversarial method: find an embedding space invariant between the two domains and a hypothesis that classifies the source well in this embedding space
    • Cluster assumption: the decision boundary should not cross high-density regions ⇒ the output probability should be extreme (high confidence) ⇒ minimize the conditional entropy $\min_{\theta} \mathbb{E}_{x \sim \mathcal{D}_t}[-h_{\theta}(x) \log h_{\theta}(x)]$
    • Virtual adversarial training: the hypothesis should be invariant to slight perturbations of the input (adversarial examples). We minimize the KL divergence between $h_{\theta}(x)$ and $h_{\theta}(x+r)$ for $\Vert r \Vert < \epsilon$ (see the sketch below)
  • DIRT-T (Decision-boundary Iterative Refinement Training with a Teacher): after VADA, keep refining the decision boundary on the target domain alone, minimizing the cluster-assumption losses while staying close to the previous model (the teacher) via a KL penalty
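A sketch of the two VADA regularizers on target data, assuming a hypothetical classifier `model` that maps flat feature vectors of shape (batch, features) to logits; the one-step power-iteration approximation of the worst-case perturbation follows the usual VAT recipe, and `eps`/`xi` are placeholder values:

```python
import torch
import torch.nn.functional as F

def conditional_entropy(logits):
    # Cluster assumption: penalize low-confidence predictions on target samples.
    log_p = F.log_softmax(logits, dim=1)
    return -(log_p.exp() * log_p).sum(dim=1).mean()

def vat_loss(model, x, eps=1.0, xi=1e-6):
    # Virtual adversarial training: KL(h(x) || h(x + r)) for an approximately
    # worst-case perturbation r with ||r|| <= eps (one power-iteration step).
    with torch.no_grad():
        p = F.softmax(model(x), dim=1)
    d = torch.randn_like(x)
    d = xi * d / (d.norm(dim=1, keepdim=True) + 1e-12)
    d.requires_grad_(True)
    kl = F.kl_div(F.log_softmax(model(x + d), dim=1), p, reduction='batchmean')
    grad, = torch.autograd.grad(kl, d)
    r_adv = eps * grad / (grad.norm(dim=1, keepdim=True) + 1e-12)
    return F.kl_div(F.log_softmax(model(x + r_adv.detach()), dim=1), p, reduction='batchmean')
```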

Evolution of domain adaptation results

On the MNIST-SVHN benchmark

(Russo et al., 2017)

SVHN (source) → MNIST (target)

Year  Algorithm    Accuracy (%)
2015  SA           59.3
      DANN         73.8
2016  DRCN         82.0
      DTN          90.7
2017  UNIT         90.5
      GenToAdapt   92.4
      DA_assoc     97.6
2018  DIRT-T       99.4
      Deep JDOT    96.7

Resources