Institute for Advanced Study, Princeton

@TingAstro

Yuan-Sen Ting

The Emergence of Deep-Learning from a Physics Point of View

Australian National University

with Sihao Cheng and Brice Menard

Lecture at Tsinghua University, Dec 2020

Cat

Dog

Deep learning is able to distill simplicity from complex phenomena

"One of the principal objects of research in my department of knowledge is to find the point of view from which the subject appears in the greatest simplicity"

- Gibbs

Cat

Dog

Deep learning is able to distill simplicity from complex phenomena

The operations of convolutional neural networks are generic. But why do they work?

(1) Convolution

(2) Non-linearity

(or "activation function")

e.g., ReLU

(4) Iterate

A \circ N \Big( \cdots \, A \circ N \big( A \circ N \big( I(\vec{x}) * \psi \big) * \psi \big) \cdots * \psi \Big)

e.g., tanh, sigmoid

(3) Averaging

N
A

Can we understand the insights behind these operations?

I(\vec{x})
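Before unpacking each piece, here is the whole recipe in code: a schematic numpy sketch of the iterated block A∘N(I * ψ). The toy input field, the Gaussian kernel defined in Fourier space, and the choice of the modulus for N are my own illustrative assumptions, not the network discussed in the talk.

```python
import numpy as np

# Schematic sketch of the generic CNN block A ∘ N( I(x) * psi ), iterated.
# The convolution is done in Fourier space (i.e., a circular convolution).
rng = np.random.default_rng(1)
n = 128
I = rng.normal(size=(n, n))                                # a toy input field I(x)

kx = np.fft.fftfreq(n)[:, None]
ky = np.fft.fftfreq(n)[None, :]
psi_ft = np.exp(-0.5 * (kx**2 + ky**2) / 0.05**2)          # a toy kernel psi, defined in Fourier space

def block(image):
    conv = np.fft.ifft2(np.fft.fft2(image) * psi_ft).real  # (1) convolution with psi
    return np.abs(conv)                                     # (2) non-linearity N (here: the modulus)

out, summaries = I, []
for _ in range(3):                                          # (4) iterate
    out = block(out)
    summaries.append(out.mean())                            # (3) averaging A -> one summary number
print(summaries)
```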

Can you try to describe the following image?

Credit: Brice Menard

Can you try to describe the following image?

Can you try to describe the following image?

How about this?

How about this? Human heuristics are no longer sufficient

We are entering a new era of "high-definition" astronomy

COBE 1992

WMAP 2003

Planck 2013

Cosmology

Credit: Le Figaro

Cosmic Microwave Background

We are entering a new era of "high-definition" astronomy

Cosmology

Cosmic Reionization

Credit: ESO / Simulation

Data coming soon! (e.g., SKA)

We are entering a new era of "high-definition" astronomy

Large-scale structure

Credit: TNG Simulation

Cosmology

Observed since the 1990s

We are entering a new era of "high-definition" astronomy

GAIA Early Data Release 3

Galactic physics

Credit: ESA

Motions of a billion stars

2020

The Universe is both structured and stochastic

Complexity

Stochasticity

("uninformative" variances)

(or "structures")

Mathematical objects

Mandelbrot Set

p(x|\mu, \Sigma)

Credit: Brice Menard

Complexity

Stochasticity

 Object-based phenomena

My niece 

E.g.,

orientation

projecting out uninformative variability

"Physical Modeling"

I(R|b,n,R_e) \sim \exp\bigg(-b\Big[\big(\frac{R}{R_e}\big)^{1/n} -1\Big]\bigg)

Sersic profile

(or "structures")

("uninformative" variances)

How about physical fields?

(i.e., random processes)

Complexity

Stochasticity

?????

Cosmic Microwave Background

Weak Lensing

Reionization

Intergalactic Medium Tomography

("Cosmic Web")

(or "structures")

("uninformative" variances)

Credit: Sihao Cheng

Complexity

Stochasticity

Complex

Simple

 A stationary Gaussian Process

Cosmic Microwave Background

(or "structures")

("uninformative" variances)

On inferring physical parameters from random processes

p(\theta | x_1,\cdots, x_n) \sim p(x_1,\cdots, x_n | \theta) \; p(\theta)

Physical parameters

Observations

Posterior

Likelihood

p(\theta | y_1,\cdots, y_m) \sim p(y_1, \cdots, y_m | \theta) \; p(\theta)

 Very high dimension, impossible to characterize

Summary statistics

m \ll n

Prior

Complexity

Stochasticity

Cosmic Microwave Background

I(x_1,\cdots,x_n)
y_1,\cdots,y_m

Eliminating uninformative variability

(or "structures")

 A stationary Gaussian Process

("uninformative" variances)

Recall : Stationary Gaussian Process

Definition :

A random process I(\vec{x}) is a Gaussian Process iff

\forall n, \forall \{ \vec{x_1}, \cdots,\vec{x_n} \}, \quad \{ I(\vec{x_1}), I(\vec{x_2}), \cdots , I(\vec{x_n}) \} \sim \mathcal{N}(\vec{\mu},\Sigma)

Definition: A random process is stationary iff

\forall \vec{\Delta x}, \quad P(I(\vec{x_1}), I(\vec{x_2}), \cdots , I(\vec{x_n})) \sim P(I(\vec{x_1} + \vec{\Delta x}), \cdots , I(\vec{x_n} + \vec{\Delta x}))

Translational invariance

If the Gaussian process is stationary, then

\Sigma(\vec{x_i}, \vec{x_j}) = \Sigma(\vec{\Delta x}),
\vec{\mu} = \mathrm{const.}

Complete characterization
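As a quick numerical illustration (my own sketch, not from the slides; the squared-exponential covariance and 1D grid are assumptions), the constant mean and the covariance as a function of separation really are all there is to know: every realization is a draw from N(μ, Σ(Δx)), and the empirical covariance of many realizations recovers Σ(Δx).

```python
import numpy as np

# Minimal sketch: a stationary Gaussian process is fully specified by a constant
# mean and a covariance Sigma that depends only on the separation dx.
rng = np.random.default_rng(0)
n = 128
x = np.linspace(0.0, 10.0, n)
dx = x[:, None] - x[None, :]                    # all pairwise separations
Sigma = np.exp(-0.5 * (dx / 0.5) ** 2)          # Sigma(x_i, x_j) = Sigma(dx)
Sigma += 1e-8 * np.eye(n)                       # tiny jitter for numerical stability
mu = np.zeros(n)                                # constant mean

# Every realization of the process is one draw from N(mu, Sigma).
realizations = rng.multivariate_normal(mu, Sigma, size=4000)

# Check: the covariance estimated from many realizations recovers Sigma(dx),
# i.e. the mean and Sigma(dx) are a complete characterization.
emp_cov = np.cov(realizations, rowvar=False)
print(np.abs(emp_cov - Sigma).max())            # small, limited only by sampling noise
```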

Recap 1 :

The Universe is structured but stochastic.

The ultimate goal is to find the right "language" to describe structures while eliminating uninformative stochasticity.

For simple stochastic processes, such as the Cosmic Microwave Background, models based on human heuristics might be sufficient.

For complex processes, human heuristics often fail. What are the insights behind the generic operations in machine learning, and why do they work?

random processes

Designing summary statistics - attempt 1 : Power Spectrum

I(\vec{x}) \sim p(x_1,\cdots, x_n) \sim p(y_1,\cdots,y_m)

Stationary Gaussian Process

Summary statistics (estimated from a single realization, assuming ergodicity):

y(\vec{\Delta x}) \equiv \Sigma_{ij} = \Sigma(\vec{\Delta x}) \simeq \langle \; I(\vec{x} + \vec{\Delta x}) \, I(\vec{x}) \; \rangle_{\vec{x}} \equiv [I * I](\vec{\Delta x})

Two-point correlation function

Let the Fourier transform be \tilde{I}(\vec{k}) = \mathcal{F}(I(\vec{x})). By Parseval's theorem,

\mathcal{F}\big( [I * I](\vec{\Delta x}) \big) = |\tilde{I}(\vec{k})|^2 \equiv P(\vec{k})

Power spectrum
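A compact numerical sketch of this chain (my own illustration, assuming a periodic 2D map): the power spectrum is the squared modulus of the FFT, and its inverse transform reproduces the direct two-point estimator.

```python
import numpy as np

# Sketch: power spectrum and two-point function of a periodic 2D map I(x).
rng = np.random.default_rng(2)
n = 256
I = rng.normal(size=(n, n))
I -= I.mean()

I_ft = np.fft.fft2(I)
P = np.abs(I_ft) ** 2 / I.size                 # power spectrum P(k)
xi = np.fft.ifft2(P).real                      # two-point function xi(dx), via the relation above

# Cross-check against the direct estimator < I(x + dx) I(x) >_x for one offset dx.
dx = (0, 3)
direct = (np.roll(I, shift=dx, axis=(0, 1)) * I).mean()
print(xi[dx], direct)                          # the two estimates agree
```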

But these two images have the same power spectrum

Credit: Brice Menard
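You can build such a pair yourself (an illustrative numpy sketch; the blobby toy map and the skewness diagnostic are my own choices): randomize the Fourier phases of a map with localized structure. The power spectrum is untouched, but the localized, non-Gaussian structure is scrambled away.

```python
import numpy as np

# Sketch: two maps with the same power spectrum but very different structures.
rng = np.random.default_rng(3)
n = 256
y, x = np.ogrid[:n, :n]
I = np.zeros((n, n))
for cx, cy in rng.integers(20, n - 20, size=(30, 2)):   # localized "blobs"
    I += np.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2 * 3.0**2))

# Randomize the Fourier phases (using the phases of a white-noise map, so the
# result stays real); the moduli |I~(k)|, and hence P(k), are untouched.
I_ft = np.fft.fft2(I)
phases = np.angle(np.fft.fft2(rng.normal(size=(n, n))))
J = np.fft.ifft2(np.abs(I_ft) * np.exp(1j * phases)).real

P_I = np.abs(np.fft.fft2(I)) ** 2
P_J = np.abs(np.fft.fft2(J)) ** 2
skew = lambda a: ((a - a.mean()) ** 3).mean() / a.std() ** 3
print(np.abs(P_I - P_J).max() / P_I.max())     # essentially zero: identical power spectra
print(skew(I), skew(J))                        # but the localized structure (skewness) is gone in J
```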

Complexity

Stochasticity

Complex

Simple

Taking the power spectrum

Losing structural information

"Gaussian"

"Non-Gaussian"

(or "structures")

("uninformative" variances)

Vary cosmological parameters (\Omega_M, \sigma_8)

[Figure: grid of weak-lensing maps spanning \Omega_M = 0.25, 0.30, 0.35, 0.40 and \sigma_8 = 0.7, 0.8, 0.9]

Power spectrum

Real application: Imaging the dark matter cosmic web with weak lensing

Dark Matter Density

Growth Amplitude

Power spectrum fails to distinguish the intricate differences between the two maps

Cheng, YST, Menard & Bruna 2020


Our summary statistics

But why does the power spectrum lose important "non-Gaussian" structural information?

There is no locality information in a single Fourier coefficient

I(\vec{x})
\tilde{I}(\vec{k}) = \big[I(\vec{x}) * \exp(i \vec{k} \cdot \vec{x})\big]_{\vec{x}'=0}

Completely delocalized kernel in the real space

Extremely localized information in the Fourier space

\langle x^2 \rangle_{I(x)} \; \langle k^2 \rangle_{\tilde{I}(k)} > \mathrm{const.}

Uncertainty principle

We need to cross correlate more than one point in the Fourier space to define locality

\langle k^2 \rangle = 0 \; \Rightarrow \; \langle x^2 \rangle = \infty

The degeneracy in the Fourier phases defines "locality"

More "localized"

"Delocalized"

"Delta process"

Superlocalized

"Delta process"

Superlocalized

(x',y')

also has only 2 degrees of freedom

Quantifying degeneracy through cross correlation is the key

I(\vec{x})

The "locality" of a random process expresses itself in the form of the degeneracy in the Fourier phases

\mathcal{F}(\delta (x,y)) = 1
\mathcal{F}(\delta (x-x',y-y')) = \exp(i(k_x x' + k_y y'))

Phase

\omega_{x',y'}(\vec{k})
\omega_{x',y'}(\vec{k})

But for a stationary field, the expectation of any cross correlation is trivial

\Big\langle \tilde{I}(\vec{k_1}) \tilde{I}(\vec{k_2}) \Big\rangle = 0, \; \mathrm{if}\; \vec{k_1} + \vec{k_2} \neq 0

When performing a Fourier analysis, second-order moments alone are not sufficient to extract the locality information

A simpler way to think about locality: a 1D case

p(x)
x

Consider a single random variable

x \sim p(x)

In 1D, the power spectrum is equivalent to taking the second moment

\langle x^2 \rangle_{x \sim p(x)}

Variance

Skewness

But skewness defines locality

Recap 2 :

For stationary Gaussian processes, the power spectrum fully characterizes the system

However, the power spectrum discards the structural information carried by the higher moments of non-Gaussian processes

The Fourier transform is extremely delocalized in real space; hence a single component in Fourier space cannot recover locality

Quantifying distribution functions with higher-order moments

p(x)
x

Consider a single random variable

x \sim p(x)

Variance

Skewness

Skewness defines locality

Classical ideas: characterizing p(x) with all its moments

\langle x \rangle, \langle x^2 \rangle, \cdots, \langle x^n \rangle
\tilde{I}(\vec{k_1})
\tilde{I}(\vec{k_2})
B(\vec{k_1},\vec{k_2}) \equiv \langle \tilde{I}^*(\vec{k_1}+\vec{k_2}) \tilde{I}(\vec{k_1}) \tilde{I}(\vec{k_2}) \rangle

Quantifying higher order moments for random processes

Study the dependency of phases in the Fourier space

\tilde{I}^*(\vec{k_1}+\vec{k_2})
\tilde{I}(\vec{k})

E.g., Bispectrum
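For concreteness, here is a rough sketch of how such a third-order statistic could be estimated for a periodic map (my own illustration; the fixed wavevector pair, the lognormal toy field, and the averaging over realizations are assumptions; real bispectrum estimators average over many triangle configurations and treat normalization more carefully):

```python
import numpy as np

# Rough sketch: estimate the bispectrum B(k1,k2) = < I~*(k1+k2) I~(k1) I~(k2) >
# for a periodic map by averaging over independent realizations.
rng = np.random.default_rng(4)
n, n_real = 32, 2000
k1, k2 = (3, 0), (0, 5)                            # two wavevectors (FFT grid units)
k3 = (k1[0] + k2[0], k1[1] + k2[1])

def bispectrum(make_field):
    acc = 0.0
    for _ in range(n_real):
        I_ft = np.fft.fft2(make_field())
        acc += np.conj(I_ft[k3]) * I_ft[k1] * I_ft[k2]
    return acc / n_real

gauss = lambda: rng.normal(size=(n, n))            # Gaussian field
lognorm = lambda: np.exp(rng.normal(size=(n, n)))  # a simple non-Gaussian field
# ~0 for the Gaussian field, clearly non-zero for the non-Gaussian one:
print(bispectrum(gauss).real, bispectrum(lognorm).real)
```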

But higher-order moments are inefficient in compressing information

p(x_1,\cdots, x_n | \theta) \sim p(B_m(\vec{k_1},\vec{k_2}) | \theta)
m \simeq n

(1) Inefficient

p(\theta | x_1,\cdots, x_n) \sim p(x_1,\cdots, x_n | \theta) \; p(\theta)

Posterior

\tilde{I}(\vec{k})
\tilde{I}(\vec{k_1})
\tilde{I}(\vec{k_2})
p(\theta | y_1,\cdots, y_m) \sim p(y_1, \cdots, y_m | \theta) \; p(\theta)

Summary statistics

m \ll n
\tilde{I}^*(\vec{k_1}+\vec{k_2})

Reducing dimension

Complexity

Stochasticity

Complex

Simple

Power spectrum

"Non-Gaussian"

Bispectrum

(The "dimension reduction" is too inefficient)

(or "structures")

("uninformative" variances)

p(x)
x

Heavy tail
(Non-Gaussian)

Credit: Sihao Cheng

(2) Unstable

See also Carron+ 11

Higher-order moments amplify the "tail"

p(x)
x
x \cdot p(x)
x^2 \cdot p(x)
\langle x \rangle
\langle x^2 \rangle

Credit: Sihao Cheng

(2) Unstable

See also Carron+ 11

Higher-order moments amplify the "tail"

p(x)
x
x \cdot p(x)
x^2 \cdot p(x)
\langle x \rangle
\langle x^2 \rangle
p(x)
x
x \cdot p(x)
x^2 \cdot p(x)
\langle x \rangle
\langle x^2 \rangle
x^n \cdot p(x)
\langle x^n \rangle

Credit: Sihao Cheng

(2) Unstable

Higher-order moments amplify the "tail"

See also Carron+ 11

Depend critically on the "outliers" that are usually not well sampled

The estimate can be noisy

(2) Unstable

p(x)

Let's consider a 1D distribution

Classical ideas: characterizing p(x) with all its moments

\langle x \rangle, \langle x^2 \rangle, \cdots, \langle x^n \rangle

But for distributions with "heavy tails", e.g., power law distributions, moments fail

p(x) \sim x^{-\alpha}

e.g.,

then

\langle x^n \rangle = \int \mathrm{d}x \, x^n p(x) = \infty

when

n \geq \alpha -1

Higher-order moments amplify the "tail"
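The instability is easy to reproduce numerically (an illustrative sketch; the Pareto tail index and sample sizes are my own choices): low-order statistics converge, while the third sample moment is dominated by a handful of the largest values and never settles.

```python
import numpy as np

# Sketch: for a power-law (heavy-tailed) distribution, high-order sample moments
# are dominated by a few of the largest values and never settle down.
rng = np.random.default_rng(5)
# rng.pareto(a) has a tail p(x) ~ x^{-(a+1)}; with a = 2.5 this means alpha = 3.5,
# so <x^n> diverges for n >= alpha - 1 = 2.5 (the third moment is infinite).
x = rng.pareto(2.5, size=1_000_000)

for n_samples in (10_000, 100_000, 1_000_000):
    sub = x[:n_samples]
    print(n_samples, sub.mean(), (sub**3).mean())    # the mean converges; <x^3> keeps drifting

top = np.sort(x)[-10:]
print((top**3).sum() / (x**3).sum())                 # a few outliers carry a large fraction of <x^3>
```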

Recap 3 :

The classical lore is that a probability distribution function is fully characterized by all its moments

But this is only true for distributions without "heavy tails"

Most physical phenomena are highly skewed due to preferential growth and are, for example, best described by power laws

In this case, higher-order moments are inefficient and unstable

How should we characterize non-Gaussian information without being limited by the shortcomings of the power spectrum and higher-order moments?

A "wish list" for an ideal summary statistic

(1) Preserve locality

(2) Bin information effectively

Limitations of the power spectrum

(3) Keep data in the lower order ["stable"]

(4) Extract the locality information

A "wish list" for an ideal summary statistic

(3) Keep data in the lower order ["stable"]

(2) Bin information effectively

Limitations of higher-order moments

(1) Preserve locality

(4) Extract the locality information

Understanding the generic operations in convolutional neural networks

(1) Convolution

(2) Non-linearity

(or "activation function")

e.g., ReLU

(4) Iterate

*
\psi

e.g., tanh, sigmoid

(3) Averaging

N
A
I(\vec{x})

A "wish list" for an ideal summary statistic

(2) Bin information effectively

In the Fourier transform:

I(\vec{x}) * \psi_m (\vec{x})

Wavelets

"Delocalized" in real space

Preserve locality

Alternative:

(1) Preserve locality

(4) Extract the locality information

\tilde{I}(\vec{k}) = I(\vec{x}) * \exp(i \vec{k} \cdot \vec{x})

(3) Keep data in the lower order ["stable"]

Wavelet convolution = "band-pass" filtering

Real Space

Fourier Space

Wavelet

I(\vec{x}) * \psi_m(\vec{x})
\psi_m(\vec{x})

Frequency mask

\tilde{\psi}_m(\vec{k})

Wavelet

\tilde{I}(\vec{k}) \cdot \tilde{\psi}_m(\vec{k})

An agglomerate of many Fourier eigenmodes

= preserve locality
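Here is a minimal sketch of wavelet convolution as band-pass filtering, using an assumed isotropic Gaussian annulus as the frequency mask ψ̃_m(k) (real wavelet banks, e.g. Morlets, are directional, but the idea is the same):

```python
import numpy as np

# Sketch: wavelet convolution = multiplying by a band-pass mask in Fourier space.
rng = np.random.default_rng(6)
n = 256
I = rng.normal(size=(n, n))                               # a toy field I(x)

kx = np.fft.fftfreq(n)[:, None]
ky = np.fft.fftfreq(n)[None, :]
k = np.sqrt(kx**2 + ky**2)

def bandpass_mask(k0, rel_width=0.3):
    """An isotropic 'wavelet' in Fourier space: a Gaussian annulus around |k| = k0."""
    mask = np.exp(-0.5 * ((k - k0) / (rel_width * k0)) ** 2)
    mask[0, 0] = 0.0                                      # wavelets have zero mean
    return mask

# A small bank of masks at dyadic scales m: each is an agglomerate of Fourier modes.
masks = [bandpass_mask(0.25 / 2**m) for m in range(4)]
I_ft = np.fft.fft2(I)
filtered = [np.fft.ifft2(I_ft * m_).real for m_ in masks]  # the maps I * psi_m, one per scale
print([round(f.std(), 3) for f in filtered])               # fluctuation amplitude in each frequency band
```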

A "wish list" for an ideal summary statistic

(2) Bin information effectively

I(\vec{x})
I(\vec{x}) * \psi_m(\vec{x})

(1) Preserve locality

(4) Extract the locality information

(3) Keep data in the lower order ["stable"]

Credit: Sihao Cheng

Real Space

Fourier Space

Wavelet

I(\vec{x}) * \psi_m(\vec{x})
\psi_m(\vec{x})

Frequency mask

\tilde{\psi}_m(\vec{k})

Wavelet

\{ m \}
\tilde{I}(\vec{k}) \cdot \tilde{\psi}_m(\vec{k})

The finite width of band passes = a finite number of kernels

A "wish list" for an ideal summary statistic

(2) Bin information effectively

I(\vec{x})
I(\vec{x}) * \psi_m(\vec{x})
m \ll n

Number of summary statistics

(1) Preserve locality

(4) Extract the locality information

(3) Keep data in the lower order ["stable"]

Constructing summary statistics from the band-pass-filtered maps

I(\vec{x}) * \psi_m(\vec{x})

Simple averaging, to construct translation-invariant statistics

y_m

??

\langle I(\vec{x}) * \psi_m(\vec{x}) \rangle
= \langle I(\vec{x}) \rangle * \psi_m(\vec{x})

Convolution and expectation are commutative

= (\vec{\mathrm{const.}}) * \psi_m (\vec{x})

The system is stationary 

= \vec{0}

property of wavelets (zero mean)

The representation is always trivial

(both can be written as integrals)

Attempt 1:

(1) Convolution

(2) Non-linearity

(or "activation function")

e.g., ReLU

(4) Iterate

*
\psi

e.g., tanh, sigmoid

(3) Averaging

N
A
I(\vec{x})

Averaging itself will not work
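This is easy to verify numerically (a sketch building on the band-pass mask above, with the same assumed Gaussian-annulus "wavelet"): the plain average of a wavelet-filtered stationary map vanishes, while the average of its modulus does not.

```python
import numpy as np

# Sketch: for a stationary field, plain averaging of I * psi_m is trivial (~0),
# while averaging after a non-linearity (the modulus) is not.
rng = np.random.default_rng(7)
n = 256
I = rng.normal(size=(n, n))

kx = np.fft.fftfreq(n)[:, None]
ky = np.fft.fftfreq(n)[None, :]
k = np.sqrt(kx**2 + ky**2)
mask = np.exp(-0.5 * ((k - 0.125) / 0.04) ** 2)          # band-pass mask psi_m~(k)
mask[0, 0] = 0.0                                          # zero-mean wavelet: no k = 0 component

filtered = np.fft.ifft2(np.fft.fft2(I) * mask).real       # I * psi_m
print(filtered.mean())                                    # ~0: averaging alone gives a trivial representation
print(np.abs(filtered).mean())                            # clearly non-zero: non-linearity + averaging works
```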

Understanding the generic operations in convolutional neural networks

(1) Convolution

(2) Non-linearity

(or "activation function")

e.g., ReLU

(4) Iterate

*
\psi

e.g., tanh, sigmoid

(3) Averaging

N
A
I(\vec{x})

Need to couple averaging with a non-linear operation

Understanding the generic operations in convolutional neural networks

Only non-linear operations can render the representation non-trivial

Attempt 2:

Let f be a linear function

\Big\langle f(I * \psi_m) \Big\rangle
= f \Big(\langle I * \psi_m \rangle \Big)
= f (\vec{0})

Again, this leads to a trivial representation

f needs to be a non-linear function

Attempt 3:

(1) Convolution

(2) Non-linearity

(or "activation function")

e.g., ReLU

(4) Iterate

*
\psi

e.g., tanh, sigmoid

(3) Averaging

N
A
I(\vec{x})

A powerful and necessary "set" of operations

???

why iterate

Understanding the generic operations in convolutional neural networks

A single "operation block" is nothing but the binned power spectrum

Let f be the modulus squared

\Big\langle f(I * \psi_m) \Big\rangle \equiv \langle | I * \psi_m |^2 \rangle_{\vec{x}}
\equiv \int \mathrm{d}\vec{x} \; (I * \psi_m)(I * \psi_m)^*
= \int \mathrm{d}\vec{k} \; |\tilde{I}(\vec{k})|^2 |\tilde{\psi}_m(\vec{k})|^2
\equiv \int \mathrm{d}\vec{k} \; P(k) |\tilde{\psi}_m(\vec{k})|^2
\langle | I * \psi_m |^2 \rangle
= \int \mathrm{d}\vec{k} \; P(k) |\tilde{\psi}_m(\vec{k})|^2

Power spectrum

Binning/Averaging

Weight

\tilde{\psi}_m(\vec{k})

Frequency mask

The wavelet "operation block" retains the locality information, but we have not yet extracted it

A single "operation block" is nothing but the binned power spectrum
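A quick numerical check of this identity (same assumed Gaussian-annulus mask as above; this is just discrete Parseval): the spatial average of |I * ψ_m|² equals the power spectrum summed with weight |ψ̃_m(k)|².

```python
import numpy as np

# Check:  < |I * psi_m|^2 >_x  =  sum_k P(k) |psi_m~(k)|^2   (discrete Parseval).
rng = np.random.default_rng(8)
n = 256
N = n * n
I = rng.normal(size=(n, n))

kx = np.fft.fftfreq(n)[:, None]
ky = np.fft.fftfreq(n)[None, :]
k = np.sqrt(kx**2 + ky**2)
mask = np.exp(-0.5 * ((k - 0.125) / 0.03) ** 2)      # frequency mask psi_m~(k)
mask[0, 0] = 0.0                                      # zero-mean wavelet

I_ft = np.fft.fft2(I)
filtered = np.fft.ifft2(I_ft * mask).real             # I * psi_m

lhs = (filtered**2).mean()                            # < |I * psi_m|^2 >_x
P = np.abs(I_ft) ** 2 / N                             # power spectrum P(k)
rhs = (P * mask**2).sum() / N                         # binned power spectrum, weight |psi_m~|^2
print(lhs, rhs)                                       # identical up to floating point
```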

A "wish list" for an ideal summary statistic

(3) Keep data in the lower order ["stable"]

(2) Bin information effectively

(1) Preserve locality

(4) Extract the locality information

p(x)
x
x^n \cdot p(x)

In practice, choose non-linear operations that are linear w.r.t. data

f(I * \psi_m) = | I(\vec{x}) * \psi_m(\vec{x}) |^2

Higher-order moments are unstable

E.g., take the modulus instead of the modulus squared

= | I(\vec{x}) * \psi_m(\vec{x}) |

(1) Convolution

(2) Non-linearity

(or "activation function")

e.g., ReLU

(4) Iterate

*
\psi

e.g., tanh, sigmoid

(3) Averaging

N
A
I(\vec{x})

The importance of choosing a linear / sublinear function at large x

???

why iterate

Understanding the generic operations in convolutional neural networks

p(x)
x
x \cdot p(x)
x^2 \cdot p(x)
\langle x \rangle
\langle x^2 \rangle
p(x)
x
x \cdot p(x)
x^2 \cdot p(x)
\langle x \rangle
\langle x^2 \rangle
x^n \cdot p(x)
\langle x^n \rangle

Credit: Sihao Cheng

(2) Unstable

The estimates of higher-order moments are unstable

See also Carron+ 11

How to characterize "higher moments" through low-order operations

p(x)
x
s_0 = \langle x \rangle
s_0
x_1 = \big| \, x - \langle x \rangle \, \big|

"Folding"

p(x)

Credit: Sihao Cheng

"Folding" = non-linear operation + averaging

p(x)
x
s_0

"Folding"

p(x)
s_1
s_1 = \langle x_1 \rangle,

Credit: Sihao Cheng

s_0 = \langle x \rangle
x_1 = \big| \, x - \langle x \rangle \, \big|

How to characterize "higher moments" through low-order operations

"Folding" = non-linear operation + averaging

p(x)
x
s_0
s_1

"Folding"

p(x)
s_2

Credit: Sihao Cheng

s_1 = \langle x_1 \rangle,
s_0 = \langle x \rangle
x_1 = \big| \, x - \langle x \rangle \, \big|
x_2 = \big| \, x_1 - \langle x_1 \rangle \, \big|
s_2 = \langle x_2 \rangle,

Linear order with respect to x

How to characterize "higher moments" through low-order operations

Stable and robust summary statistics
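A sketch of this "folding" recursion in 1D (my own illustration; comparing a Gaussian with a Student-t sample is an assumed example): each step uses only a modulus and an average, staying first order in the data, yet the sequence s0, s1, s2, ... distinguishes distributions with different shapes.

```python
import numpy as np

# Sketch of "folding": s_0 = <x>,  x_{i+1} = |x_i - <x_i>|,  s_{i+1} = <x_{i+1}>.
# Each coefficient stays first order in the data, yet the set probes the shape of p(x).
rng = np.random.default_rng(9)

def folding_coefficients(x, depth=3):
    coeffs = []
    for _ in range(depth + 1):
        s = x.mean()
        coeffs.append(s)
        x = np.abs(x - s)                # "fold" the distribution about its mean
    return coeffs

gaussian = rng.normal(size=100_000)
heavy_tailed = rng.standard_t(df=3, size=100_000)    # same mean, heavier tail
print(folding_coefficients(gaussian))
print(folding_coefficients(heavy_tailed))            # the folded means differ -> shape information
```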

(1) Convolution

(2) Non-linearity

(or "activation function")

e.g., ReLU

(4) Iterate

*
\psi

e.g., tanh, sigmoid

(3) Averaging

N
A
I(\vec{x})

Understanding the generic operations in convolutional neural networks

* \; \psi

Convolution with wavelets

Averaging

\langle \cdot \rangle

Low-order non-linear function

| \cdot |

Iterate over these operations

(2) Bin information effectively

(1) Preserve locality

(4) Extract the locality information

(3) Keep data in the lower order ["stable"]

A "wish list" for an ideal summary statistic

Mallat & Bruna 2012

Scattering Transform: approaching deep learning without "learning"

I_0
I_1 \equiv | I_0 * \psi_{m} |
I_2 \equiv | I_1 * \psi_{m} |

Summary statistics:

\vec{y}_1 = \langle I_1 \rangle
\vec{y}_2 = \langle I_2 \rangle

Mallat & Bruna 2012
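Below is a compact sketch of the two-layer coefficients defined above, using an assumed bank of isotropic Gaussian band-pass "wavelets" (the scattering transform of Mallat & Bruna uses oriented Morlet wavelets ψ_{j,l}; this only shows the structure of the computation):

```python
import numpy as np

# Sketch of two-layer scattering coefficients with isotropic band-pass "wavelets":
#   I_1 = |I_0 * psi_m1|,  y_1[m1]     = < I_1 >
#   I_2 = |I_1 * psi_m2|,  y_2[m1,m2]  = < I_2 >
rng = np.random.default_rng(10)
n = 256
I0 = rng.normal(size=(n, n))                              # input field I_0

kx = np.fft.fftfreq(n)[:, None]
ky = np.fft.fftfreq(n)[None, :]
k = np.sqrt(kx**2 + ky**2)

def wavelet_bank(n_scales=4):
    """Assumed bank: Gaussian annuli around dyadic frequencies k0 = 0.25 / 2**m."""
    bank = []
    for m in range(n_scales):
        k0 = 0.25 / 2**m
        mask = np.exp(-0.5 * ((k - k0) / (0.4 * k0)) ** 2)
        mask[0, 0] = 0.0
        bank.append(mask)
    return bank

def wavelet_modulus(image, mask):
    """One layer: |image * psi_m|, convolving in Fourier space."""
    return np.abs(np.fft.ifft2(np.fft.fft2(image) * mask))

psi = wavelet_bank()
y1, y2 = [], []
for m1, p1 in enumerate(psi):
    I1 = wavelet_modulus(I0, p1)                          # I_1 = |I_0 * psi_m1|
    y1.append(I1.mean())                                  # first-order coefficient y_1
    for m2, p2 in enumerate(psi):
        if m2 > m1:                                       # keep only coarser second scales
            y2.append(wavelet_modulus(I1, p2).mean())     # second-order coefficient y_2
print(len(y1), len(y2), y1)                               # a handful of coefficients summarizes the field
```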

Complexity

Stochasticity

Complex

Simple

Power spectrum

"Non-Gaussian"

Bispectrum

(or "structures")

("uninformative" variances)

Scattering Transform

Is it possible to "learn" from a training sample of a few? Yes!

I_1 \equiv | I_0 * \psi_{j,l} |
I_2 \equiv | I_1 * \psi_{j,l} |
\vec{a} = \{ \vec{y}_{1,a}, \vec{y}_{2,a} \}
\vec{b} = \{ \vec{y}_{1,b}, \vec{y}_{2,b} \}
\vec{c} = \{ \vec{y}_{1,c}, \vec{y}_{2,c} \}
\vec{a} \neq \vec{b} \neq \vec{c}

Do you know how to describe these images now?

Scattering Transform

Vary cosmological parameters (\Omega_M, \sigma_8)

[Figure: grid of weak-lensing maps spanning \Omega_M = 0.25, 0.30, 0.35, 0.40 and \sigma_8 = 0.7, 0.8, 0.9]

Power spectrum

Real application: Imaging the dark matter cosmic web with weak lensing

Dark Matter Density

Growth Amplitude

Power spectrum fails to distinguish the intricate differences between the two maps

Cheng, YST, Menard & Bruna 2020

Scattering transform

Cheng, YST, Menard & Bruna 2020

[Figure: Figure of Merit for (\Omega_M, \sigma_8) (y-axis, 10 to 1000) versus galaxy number density in arcmin^{-2} (x-axis, 10 to 100, marking DES, Rubin, Roman, and noiseless depths), comparing the Power Spectrum (15 coefficients, state-of-the-art), Peak Counts (37 coefficients), and the Scattering Transform (20 coefficients, our study); improvement factors of x2, x3, and x10 are marked]

The method provides 2-10 times stronger constraints in cosmology

Weak lensing

Final Recap :

While deep learning is still mysterious in many ways, it is not a complete black box.

The core operations of convolutional neural networks stem naturally from the limitations of the power spectrum and higher-order moments.

A better understanding of deep learning has led to many powerful new analytic tools (e.g., the scattering transform) to characterize complex structures.

The intersection of statistics and machine learning is pushing the limits of deciphering how machines can learn from a sample of a few, just like humans.

Extra Slides

Weak Lensing

Credit: Wikipedia

Background galaxies

Foreground dark matter

Unlensed

Lensed

Cheng, YST, Menard & Bruna 2020

[Figure: panels over the (\Omega_M, \sigma_8) grid, spanning \Omega_M = 0.25 to 0.40 and \sigma_8 = 0.7 to 0.9]

Hierarchical clustering contains non-Gaussian information

Scattering Transform

Power spectrum

First-order S1

Second-order S2

Non-Gaussianity in the form of hierarchical clustering