Domain Adaptation and Image-to-Image Translation

Arthur Pesah
Master's student at KTH

Sebastian Bujwid
DL Engineer at Univrses + ML MSc student at KTH

Can you recognize this image?

Stockholms Slott − Anna Palm de Rosa
You may never have seen paintings of this castle...

[Four images of the castle: a photo by day, a photo at night, a black-and-white painting, and a color painting]

...but can manage to recognize it in many domains

That's what we call...

...domain adaptation

Domain adaptation

  • One task (classification, segmentation, ...)
  • Two datasets:
    • Source (photos): fully labeled
    • Target (paintings): unlabeled or semi-labeled

Image-to-image translation

  • One way of doing domain adaptation is to use image-to-image translation methods
  • Goal: translate an image from one domain to another, e.g. synthetic → real
  • But it can also be used for purposes other than domain adaptation

Relation with transfer learning

Domain adaptation can be seen as a special case of transfer learning: the task stays the same, but the data distribution changes between source and target.

Applications of domain adaptation

Calibration (physics, biology...)

Simulation vs Reality

Sentiment analysis between different categories

Adaptation between cameras

Applications of image-to-image translation: synthetic data

  • Data can be very expensive to obtain and annotate.
  • Synthetic data is relatively cheap and contains ground-truth information, but its usefulness is limited due to the domain shift.

Applications of image-to-image translation: synthetic data

  • Real data is often even impossible to annotate, because not all the information is present in the image alone:
    • Occlusions
    • No 3D information (depth)

Applications of image-to-image translation: synthetic data

  • Take synthetic data and make it real!
  • Sure, we GAN!
  • Image-to-image translation methods for the win

Different flavours of domain adaptation

  • Unsupervised domain adaptation: source fully labeled, target fully unlabeled
  • Semi-supervised domain adaptation: source fully labeled, target partially labeled
  • Few-shot domain adaptation: source fully labeled, target with only a few labeled samples

Different flavours of image-to-image translation

  • Supervised i2i translation: pairs of corresponding images available
  • Unsupervised i2i translation: no pairs

Supervised image-to-image translation

  • Supervision = known correspondence between samples in the two datasets
  • pix2pix - https://phillipi.github.io/pix2pix/
  • Conditional GAN (see the sketch below):
    • the source sample is passed to the generator
    • the discriminator sees pairs: corresponding source & target images
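A minimal sketch of this pairing idea, assuming a small PatchGAN-style discriminator in PyTorch (the class name, layer sizes, and channel counts are illustrative, not taken from the pix2pix code):

```python
import torch
import torch.nn as nn

class PairDiscriminator(nn.Module):
    """Judges (source, target) image pairs: real corresponding pairs vs. generated ones."""
    def __init__(self, in_channels=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(2 * in_channels, 64, 4, stride=2, padding=1),  # source and target stacked on channels
            nn.LeakyReLU(0.2),
            nn.Conv2d(64, 128, 4, stride=2, padding=1),
            nn.LeakyReLU(0.2),
            nn.Conv2d(128, 1, 4, padding=1),  # patch-wise real/fake scores
        )

    def forward(self, source, target):
        # The conditional part: the discriminator always sees the source image
        # together with either the real target or generator(source).
        return self.net(torch.cat([source, target], dim=1))
```

During training, the discriminator is pushed to accept (source, real target) pairs and reject (source, generator(source)) pairs, while the generator tries to make the latter indistinguishable from the former.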

Unsupervised image-to-image translation

  • More challenging - the relation between the datasets is not present in the data
  • But also more interesting - in practice it is often impossible to create datasets with one-to-one correspondences (e.g. synthetic images vs. real images)
  • What is the relation between images from different domains?

Classical model of domain adaptation

(Ben-David et al., 2010)

  • Probabilistic perspective: we consider two distributions $P(X_s, Y_s)$ and $P(X_t, Y_t)$ over source and target samples/labels
  • Discrepancy between the domains: \[ \epsilon_T(h) \leq \epsilon_S(h) + \frac{1}{2} d_{\mathcal{H}\Delta \mathcal{H}}(X_s, X_t) + \lambda \] where $h \in \mathcal{H}$ is a classifier, $\epsilon_{S/T}$ is the error on the source/target distribution, $\lambda = \min_{h' \in \mathcal{H}} [\epsilon_S(h') + \epsilon_T(h')]$ is the error of the ideal joint hypothesis (the finite-sample version adds a VC-dimension complexity term), and
    \[ d_{\mathcal{H}\Delta \mathcal{H}}(X_s, X_t) = 2 \sup_{h,h' \in \mathcal{H}} | \mathbb{E}_{x \sim X_s}[h(x) \neq h'(x)] - \mathbb{E}_{x \sim X_t}[h(x) \neq h'(x)] | \]
  • Goal: find $f: \mathcal{X}_t \rightarrow \mathcal{X}_s$ minimizing $\frac{1}{2} d_{\mathcal{H} \Delta \mathcal{H}}(X_s, f(X_t))$ and $h$ minimizing $\epsilon_S(h)$

Classical model of domain adaptation

(Ben-David et al., 2010)

\[ d_{\mathcal{H}\Delta \mathcal{H}}(X_s, X_t) = 2 \sup_{h,h' \in \mathcal{H}} | \mathbb{E}_{x \sim X_s}[h(x) \neq h'(x)] - \mathbb{E}_{x \sim X_t}[h(x) \neq h'(x)] | \]

To what extent can we find two hypotheses that are very similar in one domain but very different in the other?

Classical model of domain adaptation

  • But this theoretical distance is hard to compute, and even harder to minimize with any classical optimization algorithm... We have to find other distances
  • In probability theory, we often use divergences
  • Definition. Let $S$ be a space of probability distributions. A divergence $D: S \times S \rightarrow \mathbb{R}$ is a function such that:

    1. $D(P, Q) \geq 0$ for all $P,Q \in S$
    2. $D(P,Q) = 0 \iff P=Q$

  • Examples: KL-divergence, Wasserstein distance, JS-divergence, etc.
  • Most DA algorithms consist of choosing a divergence and minimizing it (see the toy example below)
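As a quick illustration of these properties, here is a toy check with the KL divergence on two discrete distributions (scipy's `entropy(p, q)` computes KL(p‖q)); the probability values are made up:

```python
import numpy as np
from scipy.stats import entropy  # entropy(p, q) = KL(p || q) for discrete distributions

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])

print(entropy(p, q))                             # >= 0, small because p and q are close
print(entropy(p, p))                             # == 0: D(P, P) = 0
print(np.isclose(entropy(p, q), entropy(q, p)))  # False: KL is not symmetric, hence not a metric
```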

Optimal Transport

  • Mathematical framework behind the Earth Mover's Distance, also called the Wasserstein distance: $\mathrm{W}(P_r,P_{\theta}) = \inf_{\gamma \in \Pi} \, \sum\limits_{x,y} \Vert x - y \Vert \gamma (x,y)$
  • (Courtesy Vincent Herrmann)

  • In matrix form: $\mathrm{W}(P_r,P_{\theta})=\inf_{\mathbf{\Gamma} \in \Pi} \, \langle \mathbf{D}, \mathbf{\Gamma} \rangle_\mathrm{F}$, where $\mathbf{D}$ is the pairwise cost matrix and $\mathbf{\Gamma}$ the transport plan

Optimal Transport

Special case: empirical distributions (uniform weight on every sample)

(Made with the optimal transport library POT)

$\mathrm{W}(P_r,P_{\theta}) = \inf_{\gamma \in \Pi} \, \sum\limits_{i,j} \Vert x_i - y_j \Vert \gamma (x_i,y_j)$, where the optimal plan has $\gamma (x_i,y_j) \in \{0, \tfrac{1}{n}\}$ (a scaled permutation when both distributions have $n$ samples)
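A small sketch of this special case using POT, the library credited above; the Gaussian samples are made up for illustration:

```python
import numpy as np
import ot  # POT: Python Optimal Transport

# Two toy empirical distributions with n samples each and uniform weights 1/n.
rng = np.random.default_rng(0)
n = 50
xs = rng.normal(loc=0.0, scale=1.0, size=(n, 2))   # "source" samples
xt = rng.normal(loc=3.0, scale=1.0, size=(n, 2))   # "target" samples
a, b = ot.unif(n), ot.unif(n)                      # uniform weights (1/n each)

M = ot.dist(xs, xt, metric='euclidean')            # pairwise cost matrix D
G = ot.emd(a, b, M)                                # optimal transport plan Gamma
print(np.sum(G * M))                               # Wasserstein distance <D, Gamma>_F
```

POT also exposes `ot.emd2(a, b, M)`, which returns the transport cost directly instead of the plan.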

Joint Distribution Optimal Transportation

(Courty et al., 2017)

(Made with the optimal transport library POT)

  • Idea: apply optimal transport to the joint distributions of samples and labels, with a cost that combines the distance between samples and the loss between source labels and target predictions; the transport plan and the classifier are learned jointly

Deep JDOT

(Damodaran et al., 2018)

Recent improvements over the original JDOT paper:

  • Learn optimal transport in an embedding space (instead of the input space)
  • Iterative algorithm (see the sketch below):
    • For a fixed transport plan, learn an embedding network that pulls the matched points closer, and a classifier fitted on the pseudo-labels
    • For a fixed embedding and classifier, learn the best transport plan
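A rough sketch of these two alternating steps on a mini-batch, assuming hypothetical networks `g` (embedding) and `f` (classifier), integer source labels, and made-up trade-off weights; this is an illustration of the idea, not the authors' implementation:

```python
import numpy as np
import torch
import torch.nn.functional as F
import ot  # POT

def transport_step(g, f, x_s, y_s, x_t, alpha=0.001, beta=1.0):
    # Networks fixed: build the joint cost (features + labels) and solve for the optimal plan.
    with torch.no_grad():
        z_s, z_t = g(x_s), g(x_t)
        p_t = F.softmax(f(z_t), dim=1)
        y_onehot = F.one_hot(y_s, p_t.shape[1]).float()
        cost = alpha * torch.cdist(z_s, z_t) ** 2 + beta * torch.cdist(y_onehot, p_t) ** 2
    a, b = ot.unif(len(x_s)), ot.unif(len(x_t))
    return torch.tensor(ot.emd(a, b, cost.numpy().astype(np.float64)), dtype=torch.float32)

def network_step(g, f, x_s, y_s, x_t, gamma, alpha=0.001, beta=1.0):
    # Plan fixed: classify the source, pull matched embeddings together,
    # and fit the classifier on the labels transported from the source (pseudo-labels).
    z_s, z_t = g(x_s), g(x_t)
    log_p_t = F.log_softmax(f(z_t), dim=1)
    y_onehot = F.one_hot(y_s, log_p_t.shape[1]).float()
    loss_task = F.cross_entropy(f(z_s), y_s)
    loss_align = alpha * (gamma * torch.cdist(z_s, z_t) ** 2).sum()
    loss_pseudo = -beta * (gamma * (y_onehot @ log_p_t.T)).sum()
    return loss_task + loss_align + loss_pseudo
```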

Adversarial Domain Adaptation

  • Revolution in domain adaptation starting with Ganin et al., 2015
  • Use GAN-like deep architectures to minimize the Jensen-Shannon divergence between the two distributions
  • Adversarial domain adaptation framework (see the sketch below):
    • A conditional generator takes a target input and tries to generate a source-like output
    • A discriminator tries to separate real source samples from source-like generated samples
    • The generator tries to fool the discriminator
  • Almost all recent DA papers are variations on this structure
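A minimal sketch of this structure at the feature level, in the spirit of Ganin et al., 2015, but written with an explicit alternating update instead of a gradient-reversal layer; the network sizes and learning rates are placeholders for a 2-D toy problem like the Gaussian example that follows:

```python
import torch
import torch.nn as nn

feat = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 64), nn.ReLU())  # "generator" (feature extractor)
clf = nn.Linear(64, 2)                                                            # task classifier (source labels)
disc = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 1))              # source vs. target discriminator

bce, ce = nn.BCEWithLogitsLoss(), nn.CrossEntropyLoss()
opt_g = torch.optim.Adam(list(feat.parameters()) + list(clf.parameters()), lr=1e-3)
opt_d = torch.optim.Adam(disc.parameters(), lr=1e-3)

def train_step(x_s, y_s, x_t):
    # 1) Discriminator: separate source features (label 1) from target features (label 0).
    opt_d.zero_grad()
    with torch.no_grad():
        f_s, f_t = feat(x_s), feat(x_t)
    d_loss = bce(disc(f_s), torch.ones(len(x_s), 1)) + bce(disc(f_t), torch.zeros(len(x_t), 1))
    d_loss.backward()
    opt_d.step()

    # 2) Feature extractor + classifier: classify the labeled source correctly and
    #    make target features look source-like, i.e. fool the discriminator.
    opt_g.zero_grad()
    f_s, f_t = feat(x_s), feat(x_t)
    g_loss = ce(clf(f_s), y_s) + bce(disc(f_t), torch.ones(len(x_t), 1))
    g_loss.backward()
    opt_g.step()
    return d_loss.item(), g_loss.item()
```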

Adversarial Domain Adaptation

Classical GAN architecture:

Domain Adaptation: replace the input noise with a target image

Adversarial Domain Adaptation

Toy example: 2D Gaussians

Cycle-GAN

(Zhu et al., 2017)

  • Cycle-consistency assumption (sketched below): \[ F_{TS}(F_{ST}(x_S)) \approx x_S \]
  • It is only an approximate constraint - we neither want nor can have exact equality
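A minimal sketch of the cycle-consistency term, assuming two hypothetical generator modules `F_st` (source → target) and `F_ts` (target → source); the weight `lam` is a placeholder. In Cycle-GAN this term is added to two adversarial losses, one discriminator per domain:

```python
import torch.nn.functional as F

def cycle_consistency_loss(F_st, F_ts, x_s, x_t, lam=10.0):
    # Translating to the other domain and back should approximately reconstruct the input.
    loss_s = F.l1_loss(F_ts(F_st(x_s)), x_s)   # F_TS(F_ST(x_S)) ~ x_S
    loss_t = F.l1_loss(F_st(F_ts(x_t)), x_t)   # F_ST(F_TS(x_T)) ~ x_T
    return lam * (loss_s + loss_t)
```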

UNsupervised Image-to-Image Translation (UNIT)

(Liu et al., 2017)

  • Hypothesis: a shared latent space between the two domains (see the sketch below)
  • Training: a combination of VAE and GAN objectives
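A conceptual sketch of the shared-latent-space assumption with toy linear encoders and decoders (all modules and shapes below are illustrative, not the UNIT architecture):

```python
import torch
import torch.nn as nn

# One encoder/decoder per domain; the latent code z is shared between domains.
enc_s, enc_t = nn.Linear(2, 8), nn.Linear(2, 8)
dec_s, dec_t = nn.Linear(8, 2), nn.Linear(8, 2)

x_s = torch.randn(16, 2)
z = enc_s(x_s) + torch.randn(16, 8)   # VAE-style latent: encoder output plus unit-variance noise
x_ss = dec_s(z)                       # same-domain reconstruction, trained with a VAE loss
x_st = dec_t(z)                       # cross-domain translation, judged by a GAN
                                      # discriminator in the target domain
```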

UNIT


Compared to just a target discriminator (no cycle-consistency, no shared latent space)

Extra: regularization of the latent space

UNIT


Beyond UNIT (my results) - regularizing the output space!

State-of-the-art algorithm: VADA-DIRT-T

(Shu et al., 2018)

  • Virtual Adversarial Domain Adaptation (VADA):
    • Adversarial method: find an embedding space invariant between the two domains and a hypothesis that classifies the source well in this embedding space
    • Cluster assumption: the decision boundary should not cross high-density regions ⇒ the output probability should be extreme (high confidence) ⇒ minimize the conditional entropy $\min_{\theta} \mathbb{E}_{x \sim \mathcal{D}_t}[-h_{\theta}(x) \log h_{\theta}(x)]$
    • Virtual adversarial training: the hypothesis should be invariant to slight perturbations of the input (adversarial examples). We minimize the KL divergence between $h_{\theta}(x)$ and $h_{\theta}(x+r)$ for $\Vert r \Vert < \epsilon$ (see the sketch below)
  • DIRT-T (Decision-boundary Iterative Refinement Training with a Teacher): after VADA, keep refining the decision boundary on the target domain alone, minimizing the cluster-assumption losses while staying close to the previous model (the teacher) via a KL penalty
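A sketch of the two VADA regularizers on target data, assuming a hypothetical classifier `model` that maps flat feature vectors of shape (batch, features) to logits; the one-step power-iteration approximation of the worst-case perturbation follows the usual VAT recipe, and `eps`/`xi` are placeholder values:

```python
import torch
import torch.nn.functional as F

def conditional_entropy(logits):
    # Cluster assumption: penalize low-confidence predictions on target samples.
    log_p = F.log_softmax(logits, dim=1)
    return -(log_p.exp() * log_p).sum(dim=1).mean()

def vat_loss(model, x, eps=1.0, xi=1e-6):
    # Virtual adversarial training: KL(h(x) || h(x + r)) for an approximately
    # worst-case perturbation r with ||r|| <= eps (one power-iteration step).
    with torch.no_grad():
        p = F.softmax(model(x), dim=1)
    d = torch.randn_like(x)
    d = xi * d / (d.norm(dim=1, keepdim=True) + 1e-12)
    d.requires_grad_(True)
    kl = F.kl_div(F.log_softmax(model(x + d), dim=1), p, reduction='batchmean')
    grad, = torch.autograd.grad(kl, d)
    r_adv = eps * grad / (grad.norm(dim=1, keepdim=True) + 1e-12)
    return F.kl_div(F.log_softmax(model(x + r_adv.detach()), dim=1), p, reduction='batchmean')
```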

Evolution of domain adaptation results

On the MNIST-SVHN benchmark

(Russo et al., 2017)

SVHN (source) → MNIST (target)

Year  Algorithm    Accuracy (%)
2015  SA           59.3
      DANN         73.8
2016  DRCN         82.0
      DTN          90.7
2017  UNIT         90.5
      GenToAdapt   92.4
      DA_assoc     97.6
2018  DIRT-T       99.4
      Deep JDOT    96.7

Resources