Sure, we GAN!
Image-to-image translation methods for the win
Different flavours of domain adaptation (source / target):
Unsupervised domain adaptation: fully labeled source / fully unlabeled target
Semi-supervised domain adaptation: fully labeled source / partially labeled target
Few-shot domain adaptation: fully labeled source / only a few target samples
Different flavours of image-to-image translation (presence of pairs):
Supervised i2i translation: paired samples available
Unsupervised i2i translation: no pairs
Supervised image-to-image translation
Supervision - correspondence between samples in the datasets
pix2pix - https://phillipi.github.io/pix2pix/
Conditional GAN
the source sample is passed to the generator
the discriminator sees pairs of corresponding source & target images (sketched below)
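As a rough illustration of this conditioning, here is a minimal PyTorch-style sketch; the tiny networks and tensor shapes are placeholders, not the actual pix2pix architecture (which uses a U-Net generator and a PatchGAN discriminator):

```python
import torch
import torch.nn as nn

# Placeholder networks (pix2pix itself uses a U-Net generator and a PatchGAN discriminator).
G = nn.Conv2d(3, 3, 3, padding=1)   # source image -> translated image
D = nn.Conv2d(6, 1, 3, padding=1)   # (source, target) pair -> per-patch realness score

source = torch.randn(1, 3, 256, 256)       # conditioning input
real_target = torch.randn(1, 3, 256, 256)  # paired ground-truth output

fake_target = G(source)                               # the generator is conditioned on the source
real_pair = torch.cat([source, real_target], dim=1)   # the discriminator always sees a (source, target) pair
fake_pair = torch.cat([source, fake_target], dim=1)
real_score, fake_score = D(real_pair), D(fake_pair)   # "corresponding pair" vs "generated pair"
```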
Unsupervised image-to-image translation
More challenging - the relation between the datasets is not given in the data
But also more interesting - in practice it is often impossible to create datasets with one-to-one correspondences
(Example: synthetic images vs. real images)
What is the relation between images from different domains?
Classical model of domain adaptation
(Ben-David et al., 2010)
Probabilistic perspective: we consider
two distributions $P(X_s, Y_s)$ and $P(X_t, Y_t)$ for
source and target samples/labels
Discrepancy between the domains:
\[
\epsilon_T(h) \leq \epsilon_S(h) + \frac{1}{2} d_{\mathcal{H}\Delta\mathcal{H}}(X_s, X_t) + \lambda
\]
where $h \in \mathcal{H}$ is a classifier, $\epsilon_{S/T}$ the error on the source/target distribution, $\lambda$ the error of the ideal joint hypothesis (a complexity term depending on the VC dimension of $\mathcal{H}$ appears in the finite-sample version), and
\[
d_{\mathcal{H}\Delta\mathcal{H}}(X_s, X_t) = 2 \sup_{h,h' \in \mathcal{H}}
\left| \Pr_{x \sim X_s}[h(x) \neq h'(x)] - \Pr_{x \sim X_t}[h(x) \neq h'(x)] \right|
\]
Goal: find $f: \mathcal{X}_t \rightarrow \mathcal{X}_s$ minimizing $\frac{1}{2} d_{\mathcal{H}\Delta\mathcal{H}}(X_s, f(X_t))$
and $h$ minimizing $\epsilon_S(h)$
Classical model of domain adaptation
(Ben-David et al., 2010)
\[
d_{\mathcal{H}\Delta\mathcal{H}}(X_s, X_t) = 2 \sup_{h,h' \in \mathcal{H}}
\left| \Pr_{x \sim X_s}[h(x) \neq h'(x)] - \Pr_{x \sim X_t}[h(x) \neq h'(x)] \right|
\]
Hypothesis 1 vs. Hypothesis 2:
to what extent can we find two hypotheses that are very similar in one domain
but very different in the other?
Classical model of domain adaptation
But this theoretical distance is hard to compute,
and even harder to minimize with any classical optimization algorithm...
We have to find other distances
In probability theory, we often use divergences
Definition. Let $S$ be a space of probability distributions.
A divergence $D: S \times S \rightarrow \mathbb{R}$ is a function such that:
$D(P, Q) \geq 0$ for all $P,Q \in S$
$D(P,Q) = 0 \iff P=Q$
Examples: KL-divergence, Wasserstein distance, JS-divergence, etc.
Most DA algorithms consist of choosing a divergence and minimizing it (a toy example follows below)
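For simple discrete or one-dimensional distributions these divergences can be evaluated directly; a toy example with SciPy (the distributions here are arbitrary):

```python
import numpy as np
from scipy.stats import entropy, wasserstein_distance
from scipy.spatial.distance import jensenshannon

p = np.array([0.1, 0.4, 0.5])   # two discrete distributions over 3 outcomes
q = np.array([0.3, 0.3, 0.4])

kl = entropy(p, q)               # KL(p || q): asymmetric, 0 iff p == q
js = jensenshannon(p, q) ** 2    # JS divergence (jensenshannon returns its square root)
w1 = wasserstein_distance([0, 1, 2], [0, 1, 2],
                          u_weights=p, v_weights=q)  # 1-D Wasserstein distance
print(kl, js, w1)
```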
Optimal Transport
Special case: empirical distributions (uniform weight on each sample)
(Made with the optimal transport library POT)
$\mathrm{W}(P_r,P_{\theta}) = \inf_{\gamma \in \Pi} \, \sum\limits_{i,j} \Vert x_i - y_j \Vert \, \gamma (x_i,y_j)$
with $\gamma (x_i,y_j) \in \{0,\tfrac{1}{n}\}$ for $n$ samples per distribution (the optimal coupling is a scaled permutation); see the sketch below
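A minimal sketch with the POT library, computing the optimal coupling and cost between two small empirical distributions (the point clouds are random toy data):

```python
import numpy as np
import ot  # POT: Python Optimal Transport

rng = np.random.RandomState(0)
n = 20
xs = rng.randn(n, 2)                          # source samples
xt = rng.randn(n, 2) + np.array([4.0, 4.0])   # target samples, shifted

a, b = ot.unif(n), ot.unif(n)                 # uniform weights on each empirical distribution
M = ot.dist(xs, xt, metric='euclidean')       # cost matrix ||x_i - y_j||

gamma = ot.emd(a, b, M)                       # optimal coupling (here a permutation scaled by 1/n)
W = np.sum(gamma * M)                         # corresponding transport cost
```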
Joint Distribution Optimal Transportation
(Courty et al., 2017)
Idea: transport the joint distribution of features and labels, using the classifier's predictions as proxy labels on the target
(Made with the optimal transport library POT)
Deep JDOT
(Damodaran et al., 2018)
Recent improvements over the first paper:
Learn optimal transport in an embedding space (instead of the input space)
Iterative algorithm (sketched below):
For a fixed transport plan, learn an embedding network that pulls the coupled points closer, and a classifier on the pseudo-labels
For a fixed embedding and classifier, learn the best transport plan
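A condensed sketch of this alternation, assuming a PyTorch feature extractor `g`, classifier `f`, and the POT library; the cost terms, weights, and batch handling are simplified placeholders, not the authors' implementation:

```python
import torch
import torch.nn.functional as F
import ot  # POT: Python Optimal Transport

def deep_jdot_step(g, f, optimizer, xs, ys, xt, n_classes, alpha=0.1, lam=1.0):
    """One alternation of a DeepJDOT-style scheme (simplified)."""
    ys_onehot = F.one_hot(ys, n_classes).float()

    # --- Step 1: with g and f fixed, solve for the optimal transport plan ---
    with torch.no_grad():
        zs, zt = g(xs), g(xt)
        pt = F.softmax(f(zt), dim=1)                      # target pseudo-labels
        cost = alpha * torch.cdist(zs, zt) ** 2 \
             + lam * torch.cdist(ys_onehot, pt) ** 2      # feature distance + label disagreement
    a, b = ot.unif(len(xs)), ot.unif(len(xt))
    gamma = torch.as_tensor(ot.emd(a, b, cost.cpu().numpy()), dtype=torch.float32)

    # --- Step 2: with the plan fixed, update the embedding and the classifier ---
    zs, zt = g(xs), g(xt)
    pt = F.softmax(f(zt), dim=1)
    cls_loss = F.cross_entropy(f(zs), ys)                           # source classification
    align_loss = (gamma * torch.cdist(zs, zt) ** 2).sum()           # pull coupled points together
    pseudo_loss = (gamma * torch.cdist(ys_onehot, pt) ** 2).sum()   # match source labels to coupled target predictions
    loss = cls_loss + alpha * align_loss + lam * pseudo_loss
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()
```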
Adversarial Domain Adaptation
Revolution in domain adaptation starting with Ganin et al., 2015
Using GAN-like deep architectures to minimize the Jensen-Shannon divergence
between our distributions
Adversarial domain adaptation framework:
A conditional generator that takes a target input
and tries to generate a source-like output
A discriminator that tries to separate real source samples
from source-like generated samples
The generator tries to fool the discriminator
Almost all recent DA papers are variations on this structure (a minimal training-loop sketch is given below)
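A minimal sketch of such a training loop in PyTorch; the tiny networks, image size, and optimizers are placeholders chosen for illustration only:

```python
import torch
import torch.nn as nn

# Placeholder networks: G maps a target image to a source-like image,
# D scores how "real source" an image looks (as a logit).
G = nn.Conv2d(3, 3, 3, padding=1)
D = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 1))

opt_G = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_D = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

def adversarial_da_step(x_source, x_target):
    # 1) Discriminator: separate real source images from source-like images generated from the target.
    fake_source = G(x_target).detach()
    d_loss = bce(D(x_source), torch.ones(len(x_source), 1)) + \
             bce(D(fake_source), torch.zeros(len(x_target), 1))
    opt_D.zero_grad(); d_loss.backward(); opt_D.step()

    # 2) Generator: try to make the discriminator label the generated images as "source".
    g_loss = bce(D(G(x_target)), torch.ones(len(x_target), 1))
    opt_G.zero_grad(); g_loss.backward(); opt_G.step()
    return d_loss.item(), g_loss.item()

# Usage (toy batches of 32x32 RGB images):
d_l, g_l = adversarial_da_step(torch.randn(8, 3, 32, 32), torch.randn(8, 3, 32, 32))
```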
Adversarial Domain Adaptation
Classical GAN architecture:
Domain Adaptation: replace the input noise with a target image
Adversarial Domain Adaptation
Toy example: 2D gaussians
Cycle-GAN
(Zhu et al., 2017)
Cycle-consistency assumption
\[
F_{TS}(F_{ST}(x_S)) \approx x_S
\]
It is only an approximate constraint - we neither want nor can enforce exact equality (illustrated below)
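A minimal sketch of the corresponding cycle-consistency loss; the two translators are tiny placeholders rather than the CycleGAN generators:

```python
import torch
import torch.nn as nn

# Placeholder translators: F_st maps source -> target, F_ts maps target -> source.
F_st = nn.Conv2d(3, 3, 3, padding=1)
F_ts = nn.Conv2d(3, 3, 3, padding=1)
l1 = nn.L1Loss()

def cycle_consistency_loss(x_s, x_t):
    # Translate forth and back, and penalize the reconstruction error in both directions.
    loss_s = l1(F_ts(F_st(x_s)), x_s)   # F_ts(F_st(x_s)) ≈ x_s
    loss_t = l1(F_st(F_ts(x_t)), x_t)   # F_st(F_ts(x_t)) ≈ x_t
    return loss_s + loss_t              # added (with a weight) to the two adversarial losses
```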
UNsupervised Image-to-Image Translation (UNIT)
(Liu et al., 2017)
Hypothesis: a shared latent space between the two domains
Training: combines VAE and GAN objectives (a rough sketch follows below)
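A very rough sketch of the idea, with tiny placeholder encoders/decoders sharing one latent space; this is a simplification for illustration, not the UNIT architecture or its exact losses:

```python
import torch
import torch.nn as nn

# Placeholder encoders into a shared latent space and decoders back to each domain.
E_s, E_t = nn.Linear(784, 64), nn.Linear(784, 64)
G_s, G_t = nn.Linear(64, 784), nn.Linear(64, 784)
mse = nn.MSELoss()

def unit_like_terms(x_s, x_t):
    z_s, z_t = E_s(x_s), E_t(x_t)                       # both domains map to the same latent space
    recon = mse(G_s(z_s), x_s) + mse(G_t(z_t), x_t)     # VAE-style within-domain reconstruction
    prior = (z_s ** 2).mean() + (z_t ** 2).mean()       # stand-in for the KL term towards the prior
    x_s2t, x_t2s = G_t(z_s), G_s(z_t)                   # cross-domain translations,
    return recon, prior, x_s2t, x_t2s                   # to be scored by per-domain GAN discriminators (omitted)
```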
UNsupervised Image-to-Image Translation (UNIT)
(Liu et al., 2017)
UNIT
Compared to just a target discriminator (no cycle-consistency, no shared latent space):
the extra constraints regularize the latent space
UNIT
Beyond UNIT (my results) - regularizing the output space!
State-of-the-art algorithm: VADA-DIRT-T
(Shu et al., 2018)
Virtual Adversarial Domain Adaptation (VADA)
Adversarial method: find an embedding space that is invariant between the two domains
and a hypothesis that classifies the source in this embedding space
Cluster assumption: the decision boundary should not cross high-density regions
⇒ the output probabilities should be extreme (high-confidence)
⇒ minimize the conditional entropy on the target: $\min_{\theta} \mathbb{E}_{x \sim \mathcal{D}_t}\!\left[-\sum_k h_{\theta}(x)_k \log h_{\theta}(x)_k\right]$
Virtual adversarial training: the hypothesis should be invariant to slight perturbations of the input (adversarial examples).
We can minimize the KL divergence between $h_{\theta}(x)$ and $h_{\theta}(x+r)$ for $\Vert r \Vert < \epsilon$ (both losses are sketched below)
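A condensed sketch of these two target-side losses; the classifier `h`, the perturbation scale, and the single power-iteration step are simplifications, not the authors' exact implementation:

```python
import torch
import torch.nn.functional as F

def conditional_entropy(logits):
    # Cluster assumption: push target predictions towards high confidence.
    p = F.softmax(logits, dim=1)
    return -(p * F.log_softmax(logits, dim=1)).sum(dim=1).mean()

def vat_loss(h, x, eps=1.0, xi=1e-6):
    # Virtual adversarial training: find a small perturbation r that changes the
    # prediction the most (one power-iteration step), then penalize that change.
    with torch.no_grad():
        p = F.softmax(h(x), dim=1)
    d = torch.randn_like(x)
    d = xi * d / d.flatten(1).norm(dim=1).view(-1, *[1] * (x.dim() - 1))
    d.requires_grad_(True)
    kl = F.kl_div(F.log_softmax(h(x + d), dim=1), p, reduction='batchmean')
    grad = torch.autograd.grad(kl, d)[0]
    r_adv = eps * grad / grad.flatten(1).norm(dim=1).view(-1, *[1] * (x.dim() - 1)).clamp_min(1e-12)
    return F.kl_div(F.log_softmax(h(x + r_adv), dim=1), p, reduction='batchmean')
```

In the paper, VADA combines these terms with the source classification loss and a domain-adversarial loss; DIRT-T then refines the trained model using the target-side terms while staying close to the previous solution.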
Evolution of domain adaptation results
On the MNIST-SVHN benchmark
SVHN (source) → MNIST (target)