By Arthur Pesah − Master's student at KTH
Can you recognize this image?

Stockholms Slott − Anna Palm de Rosa
You might have never seen paintings of the Castle...
 Photo day Photo night Painting black and white Painting color
...but can manage to recognize it in many domains

That's what we call...

Two datasets

 Target − paintings Unlabeled or semi-labeled Source − photos Fully labelled

### Applications

 Calibration (physics, biology...) Simulation vs Reality Sentiment analysis between different categories Adaptation between cameras

### Different flavours of domain adaptation

Source Target
Unsupervised domain adaptation Fully labeled Fully unlabeled
Semi-supervised domain adaptation Fully labeled Partially labeled
Few-shot domain adaptation Fully labeled Few samples

### Classical model of domain adaptation

(Ben David et al., 2010)

• Probabilistic perspective: we consider two distributions $P(X_s, Y_s)$ and $P(X_t, Y_t)$ for source and target samples/labels
• Discrepency between the domains: $\epsilon_T(h) \leq \epsilon_S(h) + \frac{1}{2} d_{H\Delta H}(X_s, X_t) + \lambda(VC)$ where $h \in \mathcal{H}$ is a classifier, $\epsilon_{S/T}$ the error on the source/target distribution and
$d_{H\Delta H}(X_s, X_t) = 2 \sup_{h,h' \in \mathcal{H}} | \mathbb{E}_{x \sim X_s}[h(x) \neq h'(x)] - \mathbb{E}_{x \sim X_t}[h(x) \neq h'(x)] |$
• Goal: finding $f: \chi_s \rightarrow \chi_t$ to minimize $\frac{1}{2} d_{H \Delta H}(X_s, f(X_t))$ and $h$ to minimize $\epsilon_S(h)$

### Classical model of domain adaptation

(Ben David et al., 2010)

$d_{H\Delta H}(X_s, X_t) = 2 \sup_{h,h' \in \mathcal{H}} | \mathbb{E}_{x \sim X_s}[h(x) \neq h'(x)] - \mathbb{E}_{x \sim X_t}[h(x) \neq h'(x)] |$

 Hypothesis 1 Hypothesis 2

To which extent can we find two hypothesis very similar in one domain, but very different in the other

### Classical model of domain adaptation

• But this theoretical distance is hard to compute, so even harder to minimize with any classicial optimization algorithm... We have to find other distances
• In probability theory, we often use divergences
• Definition. Let $S$ be a space of probability distributions. A divergence $D: S \times S \rightarrow \mathbb{R}$ is a function such that:

1. $D(P, Q) \geq 0$ for all $P,Q \in S$
2. $D(P,Q) = 0 \iff P=Q$

• Examples: KL-divergence, Wasserstein distance, JS-divergence, etc.
• Most DA algorithms consists in choosing a divergence and minimizing it

### Optimal Transport

• Mathematical framework describing how to minimize the Earth Mover Distance, also called Wasserstein Distance $\mathrm{W}(P_r,P_{\theta}) = \inf_{\gamma \in \Pi} \, \sum\limits_{x,y} \Vert x - y \Vert \gamma (x,y)$
• (Courtesy Vincent Herrmann)

### Optimal Transport

• Mathematical framework describing how to minimize the Earth Mover Distance, also called Wasserstein Distance $\mathrm{W}(P_r,P_{\theta}) = \inf_{\gamma \in \Pi} \, \sum\limits_{x,y} \Vert x - y \Vert \gamma (x,y)$
• (Courtesy Vincent Herrmann)

• $\mathrm{W}(P_r,P_{\theta})=\inf_{\gamma \in \Pi} \, \langle \mathbf{D}, \mathbf{\Gamma} \rangle_\mathrm{F}$

### Optimal Transport

Special case: empirical distribution (uniform distribution on every sample)

(Made with the optimal transport library POT)

$\mathrm{W}(P_r,P_{\theta}) = \inf_{\gamma \in \Pi} \, \sum\limits_{i,j} \Vert x_i - y_j \Vert \gamma (x_i,y_j)$ with $\gamma (x_i,y_j) \in \{0,1\}$

### Joint Distribution Optimal Transportation

(Courty et al., 2017)

(Made with the optimal transport library POT)

### Joint Distribution Optimal Transportation

(Courty et al., 2017)

(Made with the optimal transport library POT)

### Joint Distribution Optimal Transportation

(Courty et al., 2017)

(Made with the optimal transport library POT)

### Joint Distribution Optimal Transportation

(Courty et al., 2017)

(Made with the optimal transport library POT)

### Joint Distribution Optimal Transportation

(Courty et al., 2017)

(Made with the optimal transport library POT)

### Joint Distribution Optimal Transportation

(Courty et al., 2017)

(Made with the optimal transport library POT)

### Joint Distribution Optimal Transportation

(Courty et al., 2017)

(Made with the optimal transport library POT)

### Joint Distribution Optimal Transportation

(Courty et al., 2017)

(Made with the optimal transport library POT)

### Joint Distribution Optimal Transportation

(Courty et al., 2017)

(Made with the optimal transport library POT)

### Joint Distribution Optimal Transportation

(Courty et al., 2017)

(Made with the optimal transport library POT)

### Joint Distribution Optimal Transportation

(Courty et al., 2017)

(Made with the optimal transport library POT)

### Joint Distribution Optimal Transportation

(Courty et al., 2017)

(Made with the optimal transport library POT)

### Joint Distribution Optimal Transportation

(Courty et al., 2017)

(Made with the optimal transport library POT)

• Revolution in domain adaptation starting with Ganin et al., 2015
• Using GAN-like deep architectures to minimize the Jensen-Shannon divergence between our distributions
• A conditional generator that takes a target input and tries to generate a source-like output
• A discriminator that tries to separate real source samples from source-like generated samples
• The generator tries to make the discriminator bad
• Almost all the recent DA papers are variations on that structure

Example of CycleGAN (Zhu et al., 2017)

Toy example: 2D gaussians

Toy example: 2D gaussians

Toy example: 2D gaussians

Toy example: 2D gaussians

Toy example: 2D gaussians

Toy example: 2D gaussians

Toy example: 2D gaussians

Toy example: 2D gaussians

Toy example: 2D gaussians

Toy example: 2D gaussians

Toy example: 2D gaussians

Toy example: 2D gaussians

Toy example: 2D gaussians

Toy example: 2D gaussians

Toy example: 2D gaussians

Toy example: 2D gaussians

Toy example: 2D gaussians

Toy example: 2D gaussians

Toy example: 2D gaussians

Toy example: 2D gaussians

Toy example: 2D gaussians

### UNsupervised Image-to-Image Translation (UNIT)

(Liu et al., 2017)

• Hypothesis: shared latent-space
• Training: VAE and GAN

### UNsupervised Image-to-Image Translation (UNIT)

(Liu et al., 2017)

Results

(Shu et al., 2017)

• Adversarial Method: find an embedding space invariant between the two domains and a hypothesis that classify the source in this embedding space
• Cluster Assumption: the decision-boundary should not cross high-density regions ⇒ the output probability should be extreme (high-confidence) ⇒ $\min_{\theta} \mathbb{E}_{x \in \mathcal{D_t}}[-h_{\theta}(x) \log(h_{\theta}(x))]$
• Virtual Adversarial Training: the hypothesis should be invariant to slight perturbation of the input (adversarial examples). We can minimize the KL divergence between $h_{\theta}(x)$ and $h_{\theta}(x+r)$ for $||r||<\epsilon$

(Shu et al., 2017)

### Evolution of domain adaptation results

On the MNIST-SVHN benchmark

SVHN (source) → MNIST (target)

Year Algorithm Accuracy
2015 SA 59.3
DANN 73.8
2016 DRCN 82.0
DSN 82.7
DTN 90.7
2017 UNIT 90.5