*Institute for Advanced Study, Princeton*

*Australian National University*

with Sihao Cheng and Brice Ménard

*Lecture at Tsinghua University, Dec 2020*

(Figure: an image-classification illustration -- photos labeled "Cat" and "Dog".)

A convolutional neural network iterates three operations on the input image $I(\vec{x})$: convolution ($*$) with kernels $\psi$, a non-linear function $N$ (or "activation function", e.g., ReLU, tanh, sigmoid), and an averaging/pooling operation $A$:

A \circ N\Big(\psi * A \circ N\big(\psi * \cdots A \circ N(\psi * I(\vec{x})) \cdots \big)\Big)

Can we understand the *insights* behind these operations?

Cosmic Microwave Background

Cosmic Reionization

Large-scale structure

Motions of a billion stars

(Diagram: data sets placed on a plane with two axes -- *Complexity* (or *"structures"*) and *Stochasticity* (*"uninformative"* variances). Examples on the plane: the Mandelbrot Set, a Gaussian $p(x|\mu, \Sigma)$, a photo of my niece.)

E.g., a galaxy image: *projecting out* uninformative variability (such as orientation) leaves the informative structure, captured here by the Sersic profile:

I(R|b,n,R_e) \sim \exp\bigg(-b\Big[\big(\frac{R}{R_e}\big)^{1/n} -1\Big]\bigg)

Cosmological data sets sit across this complexity-stochasticity plane:

Cosmic Microwave Background

Weak Lensing

Reionization

Intergalactic Medium Tomography ("Cosmic Web")

(Diagram: the Cosmic Microwave Background sits at the *simple* end of the plane -- it is well described by *a stationary Gaussian Process*.)

Bayesian inference maps observations $x_1, \cdots, x_n$ (the pixels of $I(x_1,\cdots,x_n)$) to physical parameters $\theta$:

p( \theta | x_1,\cdots x_n) \;\sim\; p(x_1,\cdots, x_n | \theta) \; p(\theta)

The likelihood $p(x_1,\cdots, x_n | \theta)$ is of very high dimension, impossible to characterize. Instead, compress the data into summary statistics $y_1,\cdots,y_m$ with $m \ll n$, eliminating uninformative variability:

p( \theta | y_1,\cdots y_m) \;\sim\; p(y_1, \cdots, y_m | \theta) \; p(\theta)

A stationary Gaussian Process

Definition: A random process $I(\vec{x})$ is a *Gaussian Process* iff

\forall n, \forall \{ \vec{x}_1, \cdots,\vec{x}_n \}, \quad \{ I(\vec{x}_1), I(\vec{x}_2), \cdots , I(\vec{x}_n) \} \sim \mathcal{N}(\vec{\mu},\Sigma)

Definition: A random process is *stationary* iff

\forall \vec{\Delta x}, \quad P(I(\vec{x}_1), I(\vec{x}_2), \cdots , I(\vec{x}_n)) = P(I(\vec{x}_1 + \vec{\Delta x}), \cdots , I(\vec{x}_n + \vec{\Delta x}))

If the Gaussian process is *stationary*, then

\Sigma(\vec{x}_i, \vec{x}_j) = \Sigma(\vec{\Delta x}), \quad \vec{\mu} = \mathrm{const.}
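These two definitions can be checked numerically. A minimal sketch (my own illustration, not from the lecture; the squared-exponential kernel is an assumed example of a stationary covariance $\Sigma(\vec{\Delta x})$):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stationary covariance: Sigma(x_i, x_j) depends only on dx = x_i - x_j
x = np.arange(64)
dx = x[:, None] - x[None, :]
Sigma = np.exp(-0.5 * (dx / 5.0) ** 2) + 1e-8 * np.eye(64)  # squared-exponential kernel
mu = np.zeros(64)                                           # constant mean

# Gaussian process: every finite set {I(x_1), ..., I(x_n)} is ~ N(mu, Sigma)
I = rng.multivariate_normal(mu, Sigma, size=4000)

# The empirical covariance recovers Sigma up to sampling noise,
# and depends only on the separation dx -- i.e., the process is stationary
Sigma_hat = np.cov(I, rowvar=False)
print(np.abs(Sigma_hat - Sigma).max())      # small sampling error
```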

For a stationary Gaussian process, a realization determines the summary statistics:

I(\vec{x}) \sim p(x_1,\cdots, x_n) \sim p(y_1,\cdots,y_m)

The covariance reduces to the auto-correlation function, which -- assuming the process is *ergodic* -- can be estimated as a spatial average over a single realization:

y (\vec{\Delta x}) \equiv \Sigma_{ij} = \Sigma(\vec{\Delta x}) \simeq \langle \; I(\vec{x} + \vec{\Delta x}) I(\vec{x}) \; \rangle_{{\vec{x}}} \equiv [I \star I](\vec{\Delta x})

Let us Fourier transform: with $\tilde{I}(\vec{k}) = \mathcal{F}(I(\vec{x}))$, the Wiener-Khinchin theorem gives

\mathcal{F}\big( [I \star I](\vec{\Delta x}) \big) = |\tilde{I}(\vec{k})|^2 \equiv P(\vec{k})

A realization is thus compressed to a set of summary statistics: the power spectrum.
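The chain "spatial auto-correlation, then Fourier transform, equals power spectrum" can be verified in a few lines (a sketch with an assumed white-noise realization; the discrete identity is exact under circular boundary conditions):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 256
I = rng.normal(size=n)                       # one realization

# Power spectrum of the realization: P(k) = |I~(k)|^2
P = np.abs(np.fft.fft(I)) ** 2

# Auto-correlation as a spatial (circular) average over the same realization
y = np.array([np.sum(I * np.roll(I, -d)) for d in range(n)])

# Wiener-Khinchin: the Fourier transform of the auto-correlation is P(k)
P_from_corr = np.fft.fft(y).real
print(np.allclose(P, P_from_corr))           # True
```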

(Diagram: taking the power spectrum moves a field from the *complex* end of the complexity-stochasticity plane to the *simple* end -- losing structural information along the way.)

Vary cosmological parameters $(\Omega_M, \sigma_8)$: $\Omega_M$ is the dark matter density, $\sigma_8$ the growth amplitude. (Figure: simulated maps on a grid of $\Omega_M \in \{0.25, 0.30, 0.35, 0.40\}$ and $\sigma_8 \in \{0.7, 0.8, 0.9\}$.) The power spectrum fails to *distinguish the intricate differences* between the two maps.


The Fourier transform of $I(\vec{x})$,

\tilde{I}(\vec{k}) = \big[I(\vec{x}) * \exp(i \vec{k} \cdot \vec{x})\big]_{\vec{x}''=0}

uses a completely *delocalized* kernel in real space, yielding extremely *localized* information in Fourier space. This is the uncertainty principle:

\langle x^2 \rangle_{I(x)} \langle k^2 \rangle_{\tilde{I}(k)} > \mathrm{const.}, \qquad \langle k^2 \rangle = 0 \;\Rightarrow\; \langle x^2 \rangle = \infty

We need to cross-correlate *more than one point* in the Fourier space to define locality.

Consider a delta function: $\mathcal{F}(\delta (x,y)) = 1$, while a shifted delta $\delta(x-x', y-y')$ -- which also has only 2D of freedom, $(x',y')$ -- transforms to a pure phase:

\mathcal{F}(\delta (x-x',y-y')) = \exp(i(k_x x' + k_y y')) \equiv e^{i \omega_{x',y'}(\vec{k})}

The *"locality"* of a random process $I(\vec{x})$ expresses itself in the form of the *degeneracy in the Fourier phases*. The second-order moment, however, is blind to phases:

\Big\langle \tilde{I}(\vec{k_1}) \tilde{I}(\vec{k_2}) \Big\rangle = 0, \; \mathrm{if}\; \vec{k_1} + \vec{k_2} \neq 0

When performing a Fourier analysis, to extract the *locality information*, the second-order moment alone is not sufficient.

An analogy: consider a single random variable $x \sim p(x)$. In 1D, the power spectrum is equivalent to taking the second moment $\langle x^2 \rangle_{x \sim p(x)}$, i.e., the variance. But it is the *skewness* -- the asymmetry of $p(x)$ -- that defines locality.


Classical idea: characterize $p(x)$ with all its *moments* $\langle x \rangle, \langle x^2 \rangle, \cdots, \langle x^n \rangle$.

E.g., the *Bispectrum*: study the *dependency of phases* in the Fourier space,

B(\vec{k_1},\vec{k_2}) \equiv \langle \tilde{I}^*(\vec{k_1}+\vec{k_2}) \tilde{I}(\vec{k_1}) \tilde{I}(\vec{k_2}) \rangle

But as a means of reducing dimension it is inefficient: $p(x_1,\cdots, x_n | \theta) \sim p(B_m(\vec{k_1},\vec{k_2}) | \theta)$ with $m \simeq n$ bispectrum coefficients, whereas useful summary statistics require $m \ll n$.
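To make the phase-coupling idea concrete, here is a toy bispectrum estimate (my own sketch, not from the lecture; the modes $k_1, k_2$ and the squared-Gaussian non-Gaussianity are arbitrary choices). For a Gaussian field the phases are independent and $B$ averages toward zero; coupling the phases makes $B$ clearly non-zero:

```python
import numpy as np

rng = np.random.default_rng(2)
n, n_real = 128, 1000
k1, k2 = 3, 7                        # two arbitrary Fourier modes

def bispectrum(I, k1, k2):
    # One-realization estimate of B(k1, k2) = <I~*(k1 + k2) I~(k1) I~(k2)>
    It = np.fft.fft(I)
    return np.conj(It[k1 + k2]) * It[k1] * It[k2]

gauss = rng.normal(size=(n_real, n))            # Gaussian white noise
B_g = np.mean([bispectrum(I, k1, k2) for I in gauss])

nongauss = gauss ** 2                           # squared field: phases become coupled
B_ng = np.mean([bispectrum(I, k1, k2) for I in nongauss])

print(abs(B_g), abs(B_ng))   # Gaussian: consistent with 0; non-Gaussian: clearly not
```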

(Diagram: on the complexity-stochasticity plane, the bispectrum reaches only slightly further toward the *complex* end than the power spectrum -- its "dimension reduction" is too inefficient.)

(Figure: a heavy-tailed, non-Gaussian $p(x)$, together with the integrands $x \cdot p(x)$, $x^2 \cdot p(x)$, ..., $x^n \cdot p(x)$ defining the moments $\langle x \rangle, \langle x^2 \rangle, \cdots, \langle x^n \rangle$; the higher the order, the more weight the tail carries.) Higher-order moments depend critically on the "outliers" that are usually not well sampled, so the estimates can be noisy.

Let's consider an *1D distribution* $p(x)$. The classical idea is to characterize $p(x)$ with all its *moments* $\langle x \rangle, \langle x^2 \rangle, \cdots, \langle x^n \rangle$. But for distributions with "heavy tails", e.g., *power-law distributions* $p(x) \sim x^{-\alpha}$, moments fail:

\langle x^n \rangle = \int \mathrm{d}x \, x^n p(x) = \infty \quad \mathrm{when} \quad n \geq \alpha -1
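A quick numerical illustration (an assumed Pareto example with $\alpha = 3$, so $\langle x \rangle$ is finite but $\langle x^2 \rangle$ diverges): sub-sample estimates of the first moment agree, while estimates of the second moment scatter wildly, driven by the few largest samples:

```python
import numpy as np

rng = np.random.default_rng(3)
# p(x) = 2 x^{-3} for x >= 1 (alpha = 3): <x> = 2 is finite,
# but <x^n> diverges for n >= alpha - 1 = 2
x = rng.pareto(2.0, size=1_000_000) + 1.0

chunks = x.reshape(10, -1)                  # 10 independent sub-samples
m1 = chunks.mean(axis=1)                    # estimates of <x>  : stable
m2 = (chunks ** 2).mean(axis=1)             # estimates of <x^2>: outlier-dominated
print(m1.std() / m1.mean(), m2.std() / m2.mean())
```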

Limitations of the power spectrum

Limitations of higher order moments


Alternative: instead of the Fourier transform $\tilde{I}(\vec{k}) = I(\vec{x}) * \exp(i \vec{k} \cdot \vec{x})$, whose kernels are "delocalized" in real space, convolve with *wavelets*:

I(\vec{x}) * \psi_m(\vec{x})

In Fourier space, the wavelet $\psi_m(\vec{x})$ acts as a frequency mask $\tilde{\psi}_m(\vec{k})$:

\tilde{I}(\vec{k}) \cdot \tilde{\psi}_m(\vec{k})

An agglomerate of many Fourier eigenmodes = preserve locality.

(Figure: the field $I(\vec{x})$ and its wavelet convolutions $I(\vec{x}) * \psi_m(\vec{x})$ for a family of wavelets indexed by $\{ m \}$, each a frequency mask $\tilde{\psi}_m(\vec{k})$.) The number of summary statistics satisfies $m \ll n$.

*Simple averaging* -- to construct *translation invariant* statistics:

y_m = \langle I(\vec{x}) * \psi_m(\vec{x}) \rangle = \langle I(\vec{x}) \rangle * \psi_m(\vec{x}) = (\vec{\mathrm{const.}}) * \psi_m (\vec{x}) = \vec{0}

The first step holds because convolution and expectation commute (both can be written as integrals); the second because the system is stationary; the third by the zero-mean property of wavelets.
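This chain is easy to confirm numerically (a sketch; the difference-of-Gaussians "wavelet" is an arbitrary zero-mean stand-in):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 512
I = 3.0 + rng.normal(size=n)                # stationary field with constant mean 3

# A zero-mean "wavelet": difference of two Gaussians (crude stand-in)
t = np.arange(n) - n // 2
psi = np.exp(-0.5 * (t / 4.0) ** 2) - 0.5 * np.exp(-0.5 * (t / 8.0) ** 2)
psi -= psi.mean()                           # enforce zero mean exactly

# Circular convolution I * psi via the convolution theorem
conv = np.real(np.fft.ifft(np.fft.fft(I) * np.fft.fft(psi)))
print(conv.mean())                          # ~ 0: simple averaging kills everything
```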


Let $f$ be a linear function; then

\Big\langle f(I * \psi_m) \Big\rangle = f \Big(\langle I * \psi_m \rangle \Big) = f (\vec{0})

so $f$ needs to be a *non-linear* function.


Let $f$ be the modulus squared:

\Big\langle f(I * \psi_m) \Big\rangle \equiv \langle | I * \psi_m |^2 \rangle_{\vec{x}} \equiv \int \mathrm{d}\vec{x} \; (I * \psi_m)(I * \psi_m)^* = \int \mathrm{d}\vec{k} \; |\tilde{I}(\vec{k})|^2 |\tilde{\psi}_m(\vec{k})|^2 \equiv \int \mathrm{d}\vec{k} \; P(\vec{k}) |\tilde{\psi}_m(\vec{k})|^2

This is just the power spectrum, binned/averaged with the frequency-mask weight $|\tilde{\psi}_m(\vec{k})|^2$. The wavelet "operation block" *retains* the locality information, but with the modulus squared we *have not yet extracted* it.
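The identity above -- averaged modulus squared equals a mask-weighted power spectrum -- is Parseval's theorem, and can be verified numerically (a sketch with arbitrary discrete signals; the identity is exact for circular convolution):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 256
I = rng.normal(size=n)               # any field
psi = rng.normal(size=n)             # any filter: the identity is exact

It, pt = np.fft.fft(I), np.fft.fft(psi)
conv = np.fft.ifft(It * pt)          # I * psi via the convolution theorem

lhs = np.mean(np.abs(conv) ** 2)                            # < |I * psi|^2 >_x
rhs = np.sum(np.abs(It) ** 2 * np.abs(pt) ** 2) / n ** 2    # sum_k P(k) |psi~(k)|^2 / n^2
print(np.allclose(lhs, rhs))         # True
```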

Moreover, $f(I * \psi_m) = | I(\vec{x}) * \psi_m(\vec{x}) |^2$ behaves like a second moment, and higher-order moments are *unstable* for heavy-tailed fields. E.g., take the modulus, instead of the modulus squared:

f(I * \psi_m) = | I(\vec{x}) * \psi_m(\vec{x}) |


The importance of choosing a *linear */ *sublinear* function at large x

(Figure: for a heavy-tailed $p(x)$, the integrands $x \cdot p(x)$, $x^2 \cdot p(x)$, ..., $x^n \cdot p(x)$ show that $\langle x \rangle$ is dominated by well-sampled values while $\langle x^2 \rangle, \cdots, \langle x^n \rangle$ lean on the tail.)

"Folding" = non-linear operation + averaging. Consider samples $x \sim p(x)$:

s_0 = \langle x \rangle

x_1 = \big| \, x - \langle x \rangle \, \big|, \qquad s_1 = \langle x_1 \rangle

x_2 = \big| \, x_1 - \langle x_1 \rangle \, \big|, \qquad s_2 = \langle x_2 \rangle

Each fold narrows $p(x)$ further, and every $s_i$ is of *linear order* with respect to $x$: *stable* and robust summary statistics.
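The folding recipe in code (a sketch; the Gaussian input is an arbitrary choice). Note the exact linearity: scaling the input by 10 scales every $s_i$ by 10, which is what makes these statistics stable:

```python
import numpy as np

def fold_stats(x, depth=3):
    """s_0 = <x>; then fold repeatedly: x_{i+1} = |x_i - <x_i>|, s_{i+1} = <x_{i+1}>."""
    stats = []
    for _ in range(depth):
        s = x.mean()
        stats.append(s)
        x = np.abs(x - s)            # "folding" = non-linear operation + averaging
    return np.array(stats)

rng = np.random.default_rng(6)
x = rng.normal(2.0, 1.5, size=100_000)
print(fold_stats(x))                        # s_0 ~ 2, s_1 ~ 1.5*sqrt(2/pi), ...
print(fold_stats(10 * x) / fold_stats(x))   # ~ [10, 10, 10]: linear order in x
```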


Iterate over three operations:

$* \; \psi$ : convolution with wavelets

$| \cdot |$ : a low-order non-linear function

$\langle \cdot \rangle$ : averaging

Starting from the input field $I_0$:

I_1 \equiv | I_0 * \psi_{m} |, \qquad I_2 \equiv | I_1 * \psi_{m} |

\vec{y}_1 = \langle I_1 \rangle, \qquad \vec{y}_2 = \langle I_2 \rangle
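A minimal end-to-end sketch of these iterated operations (my own toy implementation, not the lecture's code: Gaussian frequency bumps stand in for proper Morlet wavelets, and all filter pairs are kept rather than only the informative scale orderings):

```python
import numpy as np

def filter_masks(n, scales=(1, 2, 3), n_angles=4):
    """Toy filter bank: Gaussian bumps |psi~_{j,l}(k)| centered at
    |k| ~ pi / 2^j and angle l * pi / n_angles (a stand-in for Morlet wavelets)."""
    ky, kx = np.meshgrid(np.fft.fftfreq(n) * 2 * np.pi,
                         np.fft.fftfreq(n) * 2 * np.pi, indexing="ij")
    masks = []
    for j in scales:
        k0 = np.pi / 2 ** j
        for l in range(n_angles):
            th = l * np.pi / n_angles
            cx, cy = k0 * np.cos(th), k0 * np.sin(th)
            masks.append(np.exp(-((kx - cx) ** 2 + (ky - cy) ** 2) / (2 * (k0 / 2) ** 2)))
    return masks

def scattering(I, masks):
    """First/second-order coefficients: I1 = |I0 * psi|, y1 = <I1>;
    I2 = |I1 * psi'|, y2 = <I2>."""
    It = np.fft.fft2(I)
    y1, y2 = [], []
    for m1 in masks:
        I1 = np.abs(np.fft.ifft2(It * m1))      # convolution + modulus
        y1.append(I1.mean())                    # averaging
        I1t = np.fft.fft2(I1)
        for m2 in masks:
            y2.append(np.abs(np.fft.ifft2(I1t * m2)).mean())
    return np.array(y1), np.array(y2)

rng = np.random.default_rng(7)
I0 = rng.normal(size=(64, 64))
y1, y2 = scattering(I0, filter_masks(64))
print(y1.shape, y2.shape)                       # (12,) (144,)
```

Because the convolution, modulus, and global mean all commute with circular translations, these coefficients are exactly translation invariant; production implementations (e.g., kymatio) use true Morlet wavelets and keep only second-order pairs with increasing scale.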

(Diagram: on the complexity-stochasticity plane, this construction reaches much further toward the *complex* end than the power spectrum or the bispectrum.)

Scattering Transform

I_1 \equiv | I_0 * \psi_{j,l} |, \qquad I_2 \equiv | I_1 * \psi_{j,l} |

(Figure: three distinct textures with scattering coefficients $\vec{a} = \{ \vec{y}_{1,a}, \vec{y}_{2,a} \}$, $\vec{b} = \{ \vec{y}_{1,b}, \vec{y}_{2,b} \}$, $\vec{c} = \{ \vec{y}_{1,c}, \vec{y}_{2,c} \}$, which the Scattering Transform separates: $\vec{a} \neq \vec{b} \neq \vec{c}$.)

Do you know how to describe these images now?


Results: constraints on $(\Omega_M,\sigma_8)$ from weak-lensing maps. (Figure: Figure of Merit vs. galaxy number density (arcmin$^{-2}$; 10, 30, 100). The Scattering Transform (20 coefficients, our study) outperforms the Power Spectrum (15 coefficients) and Peak Counts (37 coefficients, state-of-the-art), with improvement factors of x2, x10, and x3 quoted on the plot.)

Weak lensing: foreground dark matter deflects the light of background galaxies, distorting their observed shapes (unlensed vs. lensed). (Figure: simulated convergence maps over the grid $\Omega_M \in \{0.25, 0.30, 0.35, 0.40\}$, $\sigma_8 \in \{0.7, 0.8, 0.9\}$.)