The State of AI in Survey Cosmology

EuCAIFCon, Cagliari, June 2025


François Lanusse










slides at eiffl.github.io/talks/Cagliari2025

What are cosmologists going after with galaxy surveys?


















We have been dreaming about these surveys for 20 years!

Albrecht et al. (2006)
LSST forecast on dark energy parameters
Image credit: N. Jeffrey / DES Collaboration
Image credit: Euclid Consortium / Planck Collaboration / A. Mellinger



















Stage II: SDSS

Image credit: Peter Melchior



















Stage III: DES

Image credit: Peter Melchior



















Stage IV: Rubin Observatory LSST (HSC)

Image credit: Peter Melchior

Euclid Q1 Data Release: April 2025




















Image credit: ESA/Euclid/Euclid Consortium/NASA
Image processing by J.-C. Cuillandre, E. Bertin, G. Anselmi

the Vera C. Rubin Observatory Legacy Survey of Space and Time



  • 1000 images each night, 15 TB/night for 10 years

  • 18,000 square degrees, observed once every few days

  • Tens of billions of objects, each one observed $\sim1000$ times
Rubin First Look, Monday June 23rd!










Dark Energy Spectroscopic Instrument (DESI) is teasing us with new tensions!


DESI Collaboration (2025)











Ok, exciting, but where is the AI you promised???




What does a conventional cosmological analysis look like?

The limits of traditional cosmological inference

HSC cosmic shear power spectrum
HSC Y1 constraints on $(S_8, \Omega_m)$
(Hikage et al. 2018)
  • Measure the ellipticity $\epsilon = \epsilon_i + \gamma$ of all galaxies
    $\Longrightarrow$ Noisy tracer of the weak lensing shear $\gamma$

  • Compute summary statistics based on 2pt functions,
    e.g. the power spectrum

  • Run an MCMC to recover a posterior on model parameters, using an analytic likelihood $$ p(\theta | x ) \propto \underbrace{p(x | \theta)}_{\mathrm{likelihood}} \ \underbrace{p(\theta)}_{\mathrm{prior}}$$
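As a toy illustration of this last step: a Gaussian analytic likelihood over hypothetical band powers, explored with random-walk Metropolis in JAX. The theory model, data vector, and covariance below are made-up stand-ins, not a real lensing analysis.

    import jax
    import jax.numpy as jnp

    # Toy stand-ins (hypothetical): a 2-parameter "theory" for band
    # powers, a diagonal covariance, and synthetic data
    ells = jnp.arange(10.0)

    def cl_theory(theta):
        s8, om = theta
        return s8 * om / (1.0 + ells)

    cov = 1e-4 * jnp.eye(10)
    cl_data = cl_theory(jnp.array([0.8, 0.3]))

    def log_posterior(theta):
        # analytic Gaussian likelihood p(x | theta) + flat box prior
        r = cl_data - cl_theory(theta)
        loglike = -0.5 * r @ jnp.linalg.solve(cov, r)
        in_box = jnp.all((theta > 0.1) & (theta < 1.2))
        return jnp.where(in_box, loglike, -jnp.inf)

    @jax.jit
    def mh_step(key, theta):
        # random-walk Metropolis: the simplest stand-in for the MCMC step
        k1, k2 = jax.random.split(key)
        proposal = theta + 0.02 * jax.random.normal(k1, theta.shape)
        log_a = log_posterior(proposal) - log_posterior(theta)
        accept = jnp.log(jax.random.uniform(k2)) < log_a
        return jnp.where(accept, proposal, theta)

    key, theta = jax.random.PRNGKey(0), jnp.array([0.7, 0.25])
    chain = []
    for _ in range(5000):
        key, sub = jax.random.split(key)
        theta = mh_step(sub, theta)
        chain.append(theta)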
Main limitation: the need for an explicit likelihood
We can only compute likelihoods from theory for simple summary statistics, and only on large scales

$\Longrightarrow$ We are dismissing a significant fraction of the information!

Full-Field Simulation-Based Inference

  • Instead of trying to analytically evaluate the likelihood of sub-optimal summary statistics, let us build a forward model of the full observables.
    $\Longrightarrow$ The simulator becomes the physical model.

  • Each component of the model is now tractable, but at the cost of a large number of latent variables.


Benefits of a forward modeling approach
  • Fully exploits the information content of the data (aka "full field inference").

  • Easy to incorporate systematic effects.

  • Easy to combine multiple cosmological probes by joint simulations.
(Porqueres et al. 2021)

Just as a reminder, why is it classically hard to do simulation-based inference?

The Challenge of Simulation-Based Inference
$$ p(x|\theta) = \int p(x, z | \theta) dz = \int p(x | z, \theta) p(z | \theta) dz $$ where $z$ are the stochastic latent variables of the simulator.

$\Longrightarrow$ This marginal likelihood is intractable!
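To see why: the naive Monte Carlo estimator $$ \hat{p}(x | \theta) = \frac{1}{N} \sum_{i=1}^{N} p(x | z_i, \theta), \qquad z_i \sim p(z | \theta) $$ is unbiased but useless in practice. For high-dimensional $z$, essentially no prior sample $z_i$ lands where $p(x | z_i, \theta)$ is non-negligible, so the number of simulations needed grows exponentially with $\dim(z)$.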


How to perform inference over forward simulation models?

  • Implicit Inference: Treat the simulator as a black-box with only the ability to sample from the joint distribution $$(x, \theta) \sim p(x, \theta)$$ a.k.a.
    • Simulation-Based Inference (SBI)
    • Likelihood-free inference (LFI)
    • Approximate Bayesian Computation (ABC)

  • Explicit Inference: Treat the simulator as a probabilistic model and perform inference over the joint posterior $$p(\theta, z | x) \propto p(x | z, \theta) \ p(z | \theta) \ p(\theta) $$ a.k.a.
    • Bayesian Hierarchical Modeling (BHM)

$\Longrightarrow$ For a given simulation model, both methods should converge to the same posterior!

Implicit Inference


The land of Neural Density Estimation

We have converged to a standard SBI recipe


A two-step approach to Implicit Inference
  • Automatically learn an optimal low-dimensional summary statistic $$y = f_\varphi(x) $$
  • Use Neural Density Estimation to either:
    • build an estimate $p_\phi$ of the likelihood function $p(y \ | \ \theta)$ (Neural Likelihood Estimation)

    • build an estimate $p_\phi$ of the posterior distribution $p(\theta \ | \ y)$ (Neural Posterior Estimation)
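A minimal sketch of the density-estimation step, assuming a conditional-Gaussian network for $q_\phi(\theta | y)$ fit by maximum likelihood on simulated $(\theta, y)$ pairs. Real analyses use normalizing flows, and the toy "simulator" below is hypothetical.

    import jax
    import jax.numpy as jnp
    import optax

    def init_mlp(key, sizes):
        keys = jax.random.split(key, len(sizes) - 1)
        return [(jax.random.normal(k, (m, n)) / jnp.sqrt(m), jnp.zeros(n))
                for k, m, n in zip(keys, sizes[:-1], sizes[1:])]

    def mlp(params, x):
        for W, b in params[:-1]:
            x = jnp.tanh(x @ W + b)
        W, b = params[-1]
        return x @ W + b

    def log_q(params, theta, y):
        # conditional Gaussian q_phi(theta | y): predict mean and log-std
        mu, log_sig = jnp.split(mlp(params, y), 2, axis=-1)
        return jnp.sum(-0.5 * ((theta - mu) / jnp.exp(log_sig)) ** 2
                       - log_sig - 0.5 * jnp.log(2.0 * jnp.pi), axis=-1)

    def loss(params, thetas, ys):
        return -jnp.mean(log_q(params, thetas, ys))

    # Hypothetical training set: thetas ~ prior, ys from a toy "simulator"
    # (in a real analysis: y = f_varphi(simulate(theta)))
    k1, k2, k3 = jax.random.split(jax.random.PRNGKey(0), 3)
    thetas = jax.random.normal(k1, (5000, 2))
    ys = thetas + 0.1 * jax.random.normal(k2, (5000, 2))

    params = init_mlp(k3, [2, 64, 64, 4])
    opt = optax.adam(1e-3)
    state = opt.init(params)

    @jax.jit
    def train_step(params, state):
        grads = jax.grad(loss)(params, thetas, ys)
        updates, state = opt.update(grads, state)
        return optax.apply_updates(params, updates), state

    for _ in range(2000):
        params, state = train_step(params, state)
    # log_q(params, theta_grid, y_obs) now approximates log p(theta | y_obs)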

Information Point of View on Neural Summarisation



Learning Sufficient Statistics
  • A summary statistic $y$ is sufficient for $\theta$ if $$ I(Y; \Theta) = I(X; \Theta) \Leftrightarrow p(\theta | x ) = p(\theta | y) $$
  • Variational Mutual Information Maximization $$ \mathcal{L} \ = \ \mathbb{E}_{x, \theta} [ \log q_\phi(\theta | y=f_\varphi(x)) ] \leq I(Y; \Theta) $$ (Barber & Agakov variational lower bound)
    Jeffrey, Alsing, Lanusse (2021)
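Schematically, VMIM couples the compressor to the density estimator; reusing the `mlp` and `log_q` helpers from the sketch above (my notation, not the paper's code):

    # Train f_varphi and q_phi *jointly* by maximizing
    # E[log q_phi(theta | y = f_varphi(x))]: gradients flow through the
    # compressor, pushing y towards a sufficient statistic
    def vmim_loss(all_params, thetas, xs):
        comp_params, q_params = all_params
        ys = mlp(comp_params, xs)   # y = f_varphi(x); a CNN on real maps
        return -jnp.mean(log_q(q_params, thetas, ys))

    # grads = jax.grad(vmim_loss)((comp_params, q_params), thetas, xs)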

Another Approach: maximizing the Fisher information

Information Maximization Neural Network (IMNN) $$\mathcal{L} \ = \ - | \det \mathbf{F} | \ \mbox{with} \ \mathbf{F}_{\alpha, \beta} = tr[ \mu_{\alpha}^t C^{-1} \mu_{\beta} ] $$
Charnock, Lavaux, Wandelt (2018)
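A simplified sketch of this objective, using central finite differences over seed-matched simulations at $\theta_{\rm fid} \pm \delta\theta$ (the actual IMNN adds a covariance regularization term; `mlp` as in the earlier sketch):

    def fisher_loss(params, x_fid, x_plus, x_minus, delta_theta):
        # summaries at the fiducial cosmology: shape (n_sims, n_summ)
        s_fid = mlp(params, x_fid)
        C = jnp.cov(s_fid, rowvar=False)
        # d mean-summary / d theta_alpha from seed-matched sims;
        # x_plus, x_minus have shape (n_params, n_sims, data_dim)
        mu = (jnp.mean(mlp(params, x_plus), axis=1)
              - jnp.mean(mlp(params, x_minus), axis=1)) / (2 * delta_theta[:, None])
        F = mu @ jnp.linalg.solve(C, mu.T)  # F_ab = mu_a^T C^-1 mu_b
        return -jnp.linalg.slogdet(F)[1]    # maximize log |det F|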

People use a lot of variants in practice!


* grey rows are papers analyzing survey data

Optimal Neural Summarisation for Cosmological Implicit Inference

Lanzieri, Zeghal et al. (2024)
  • Asymptotically, VMIM yields a sufficient statistic
    • No reason not to use it in practice: it works well and is asymptotically optimal



  • Mean Squared Error (MSE) DOES NOT yield a sufficient statistic, even asymptotically
    • Same for Mean Absolute Error (MAE) and weighted versions of MSE

credit: Justine Zeghal

Our humble beginnings: Likelihood-Free Parameter Inference with DES SV...

Jeffrey, Alsing, Lanusse (2021)

Suite of N-body + raytracing simulations: $\mathcal{D}$

$w$CDM analysis of KiDS-1000 Weak Lensing (Fluri et al. 2022)

Fluri, Kacprzak, Lucchi, Schneider, Refregier, Hofmann (2022)


KiDS-1000 footprint and simulated data
  • Neural Compressor: Graph Convolutional Neural Network on the Sphere
    Trained by Fisher information maximization.

SIMBIG: Field-level SBI of Large Scale Structure (Lemos et al. 2023)



















BOSS CMASS galaxy sample: Data vs Simulations
  • 20,000 simulated galaxy samples at 2,000 cosmologies
Hahn et al. (2022)

Finally, SBI has reached the mainstream: official DES Year 3 SBI $w$CDM results

Jeffrey et al. (2024)




I'm calling it!

Implicit Inference is solved for cosmological surveys!
@EiffL - Cagliari, June 2025

Has it delivered everything we hoped for?

Example of unforeseen impact of shortcuts in simulations

Gatti, Jeffrey, Whiteway et al. (2023)

Is it ok to distribute lensing source galaxies randomly in simulations, or should they be clustered?

$\Longrightarrow$ An SBI analysis could be biased by this effect and you would never know it!

How much usable information is there beyond the power spectrum?

Chisari et al. (2018)

Ratio of power spectra in hydrodynamical vs. N-body simulations
Secco et al. (2021)

DES Y3 Cosmic Shear data vector

$\Longrightarrow$ Can we find non-Gaussian information that is not affected by baryons?

takeaways



Will we be able to exploit all of the information content of LSST, Euclid, DESI?
$\Longrightarrow$ Not right away, but it is not the fault of the inference methodology!

  • Deep Learning has redefined the limits of our statistical tools, creating additional demand on the accuracy of simulations far beyond the power spectrum.

  • Neural compression methods have the downside of being opaque: it is much harder to detect unknown systematics.

  • We will need a significant number of large volume, high resolution simulations.





If Implicit Inference is solved, can we still have fun solving Explicit Inference?
Credit: Yuuki Omori, Chihway Chang, Justine Zeghal, EiffL

https://github.com/EiffL/LPTLensingComparison

More seriously, Explicit Inference has some advantages:
  • More introspectable results to identify systematics
  • Allows for fitting parametric corrections/nuisances from data
  • Provides validation of statistical inference with a different method

Explicit Inference


Where the things are!

Simulators as Hierarchical Bayesian Models

  • If we have access to all latent variables $z$ of the simulator, then the likelihood $p(x | z, \theta)$ is explicit.

  • We need to infer the joint posterior $p(\theta, z | x)$ before marginalization to yield $p(\theta | x) = \int p(\theta, z | x) dz$.
    $\Longrightarrow$ Extremely difficult problem as $z$ is typically very high-dimensional.

  • Necessitates inference strategies with access to gradients of the likelihood: $$\frac{d \log p(x | z, \theta)}{d \theta} \quad ; \quad \frac{d \log p(x | z, \theta)}{d z} $$ For instance: Maximum A Posteriori estimation, Hamiltonian Monte Carlo, Variational Inference.

$\Longrightarrow$ The only hope for explicit cosmological inference is to have fully-differentiable cosmological simulations!
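To make this concrete: a self-contained toy sketch of gradient-based MAP over the latents with JAX and optax, where a made-up smoothing operator stands in for a differentiable N-body simulator.

    import jax
    import jax.numpy as jnp
    import optax

    def forward(z):
        # hypothetical differentiable "simulator" standing in for f(z)
        return z + 0.5 * jnp.roll(z, 1) + 0.1 * z**2

    k1, k2 = jax.random.split(jax.random.PRNGKey(0))
    z_true = jax.random.normal(k1, (100,))
    sigma = 0.1
    x_obs = forward(z_true) + sigma * jax.random.normal(k2, (100,))

    def neg_log_post(z):
        # -log p(x | z) - log p(z): Gaussian noise model + Gaussian prior
        return (0.5 * jnp.sum((x_obs - forward(z)) ** 2) / sigma**2
                + 0.5 * jnp.sum(z**2))

    opt = optax.adam(1e-2)
    z = jnp.zeros(100)
    state = opt.init(z)

    @jax.jit
    def step(z, state):
        grads = jax.grad(neg_log_post)(z)   # gradient w.r.t. the latents
        updates, state = opt.update(grads, state)
        return optax.apply_updates(z, updates), state

    for _ in range(2000):
        z, state = step(z, state)
    # the same gradients drive HMC or variational inference over (theta, z)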

How complicated can it be to simulate the entire Universe?

Forward Models in Cosmology

Linear Field $\xrightarrow{\ \text{N-body simulations}\ }$ Final Dark Matter $\xrightarrow{\ \text{group-finding algorithms}\ }$ Dark Matter Halos $\xrightarrow{\ \text{semi-analytic \& distribution models}\ }$ Galaxies

the Fast Particle-Mesh scheme for N-body simulations

The idea: approximate gravitational forces by estimating densities on a grid.
  • The numerical scheme:

    • Estimate the density of particles on a mesh
      $\Longrightarrow$ compute gravitational forces by FFT

    • Interpolate forces at particle positions

    • Update particle velocities and positions, and iterate

  • Fast and simple, at the cost of approximating short-range interactions.
$\Longrightarrow$ Only a series of FFTs and interpolations.
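In code, one force evaluation of this scheme is only a few lines. A minimal JAX sketch with nearest-grid-point painting (real codes use cloud-in-cell and proper units; positions are assumed in grid units here):

    import jax.numpy as jnp

    def pm_forces(pos, n_mesh=32):
        # 1. paint the particle density on the mesh (NGP for brevity)
        idx = jnp.mod(jnp.round(pos).astype(jnp.int32), n_mesh)
        delta = jnp.zeros((n_mesh,) * 3)
        delta = delta.at[idx[:, 0], idx[:, 1], idx[:, 2]].add(1.0)
        delta = delta / delta.mean() - 1.0
        # 2. solve the Poisson equation by FFT: phi_k = -delta_k / k^2
        k = jnp.fft.fftfreq(n_mesh) * 2.0 * jnp.pi
        kx, ky, kz = jnp.meshgrid(k, k, k, indexing="ij")
        k2 = kx**2 + ky**2 + kz**2
        k2_safe = jnp.where(k2 > 0, k2, 1.0)
        phi_k = jnp.where(k2 > 0, -jnp.fft.fftn(delta) / k2_safe, 0.0)
        # 3. forces F = -grad(phi): one inverse FFT per component
        fx, fy, fz = (jnp.fft.ifftn(-1j * kk * phi_k).real
                      for kk in (kx, ky, kz))
        # 4. read the forces back at the particle positions
        return jnp.stack([f[idx[:, 0], idx[:, 1], idx[:, 2]]
                          for f in (fx, fy, fz)], axis=-1)

    # kick-drift loop: update velocities with these forces, positions with
    # the velocities, and iterate; only FFTs and interpolations, hence
    # differentiable end-to-end with jax.grad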

FlowPM: Particle-Mesh Simulations in TensorFlow

Modi, Lanusse, Seljak (2020)

    import numpy as np
    import tensorflow as tf
    import flowpm

    # Define integration steps
    stages = np.linspace(0.1, 1.0, 10, endpoint=True)

    initial_conds = flowpm.linear_field(32,       # size of the cube
                                        100,      # physical size
                                        ipklin,   # initial power spectrum
                                        batch_size=16)

    # Sample particles and displace them by LPT
    state = flowpm.lpt_init(initial_conds, a0=0.1)

    # Evolve particles down to z=0
    final_state = flowpm.nbody(state, stages, 32)

    # Retrieve final density field
    final_field = flowpm.cic_paint(tf.zeros_like(initial_conds),
                                   final_state[0])
  • Seamless interfacing with deep learning components
  • Now superseded by the JAX-based pmwd and JaxPM libraries









MAP optimization in action

$$\arg\max_z \ \log p(x_{dm} | f(z)) \ + \ \log p(z| \theta) $$
credit: C. Modi


Panels: true initial conditions $z_0$; reconstructed initial conditions $z$; reconstructed dark matter distribution $x_{dm} = f(z)$; data $x_{dm} = f(z_0)$


The need to fill the gap in the accuracy-speed space of PM simulations

CAMELS simulations

PM simulations

Hybrid physical/neural differential equations

Lanzieri, Lanusse, Starck (2022)
$$\left\{ \begin{array}{ll} \frac{d \color{#6699CC}{\mathbf{x}} }{d a} & = \frac{1}{a^3 E(a)} \color{#6699CC}{\mathbf{v}} \\ \frac{d \color{#6699CC}{\mathbf{v}}}{d a} & = \frac{1}{a^2 E(a)} F_\theta( \color{#6699CC}{\mathbf{x}} , a), \\ F_\theta( \color{#6699CC}{\mathbf{x}}, a) &= \frac{3 \Omega_m}{2} \nabla \left[ \color{#669900}{\phi_{PM}} (\color{#6699CC}{\mathbf{x}}) \right] \end{array} \right. $$
  • $\mathbf{x}$ and $\mathbf{v}$ define the position and the velocity of the particles
  • $\phi_{PM}$ is the gravitational potential in the mesh

$\to$ We can use this parametrisation to complement the physical ODE with neural networks.


$$F_\theta(\mathbf{x}, a) = \frac{3 \Omega_m}{2} \nabla \left[ \phi_{PM} (\mathbf{x}) \ast \mathcal{F}^{-1} (1 + \color{#996699}{f_\theta(a,|\mathbf{k}|)}) \right] $$


Correction integrated as a Fourier-based isotropic filter $f_{\theta}$ $\to$ incorporates translation and rotation symmetries
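A sketch of what such a filter can look like; the parametric form below (a sum of Gaussians in $|k|$, modulated by $a$) is an illustrative choice of mine, not the parametrisation used in the paper:

    import jax.numpy as jnp

    def f_theta(a, k_norm, params):
        # hypothetical learnable isotropic filter: depends on |k| only,
        # so translation and rotation symmetries are built in
        amps, centers, widths = params
        return a * jnp.sum(amps * jnp.exp(
            -0.5 * ((k_norm[..., None] - centers) / widths) ** 2), axis=-1)

    def corrected_potential(phi, a, params, box_size=25.0):
        n = phi.shape[0]
        k = 2.0 * jnp.pi * jnp.fft.fftfreq(n, d=box_size / n)
        kx, ky, kz = jnp.meshgrid(k, k, k, indexing="ij")
        k_norm = jnp.sqrt(kx**2 + ky**2 + kz**2)
        # modulate the PM potential by (1 + f_theta) in Fourier space
        filt = 1.0 + f_theta(a, k_norm, params)
        return jnp.fft.ifftn(jnp.fft.fftn(phi) * filt).real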

Projections of final density field



CAMELS simulations
PM simulations
PM+NN correction

Results


  • Neural network trained using a single CAMELS simulation with a $(25\ h^{-1}\,\mathrm{Mpc})^3$ volume and $64^3$ dark matter particles, at the fiducial cosmology $\Omega_m = 0.3$


  • Hybrid N-body simulations with Field-Level Emulator

    Jamieson et al. (2023)
    Doeser et al. (2024)
    Bartlett et al. (2024)

The need for distributed differentiable programming frameworks

  • The state vector of a moderate-size cosmological simulation volume can easily require from 100 GB to several TB.
    $\Longrightarrow$ We need model parallelism! This is not currently fully supported by any mainstream autodiff framework!

(Gholami et al. 2018)
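For scale, a back-of-the-envelope count (my numbers, not from the talk): a $2048^3$-particle state with single-precision positions and velocities already occupies $$ 2048^3 \times 6 \times 4 \ \mathrm{bytes} \approx 206 \ \mathrm{GB}, $$ before counting FFT work buffers, i.e. far beyond the memory of any single GPU.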

JAX-powered differentiable HPC




  • JAX v0.4 has made a strong push to bring automated parallelization and support for multi-host GPU clusters!

  • Scientific HPC still most likely requires dedicated high-performance ops

  • jaxDecomp: Domain Decomposition and Parallel FFTs

    • JAX bindings to the high-performance cuDecomp (Romero et al. 2022) adaptive domain decomposition library.

    • Provides parallel FFTs and halo-exchange operations.

    • Supports a variety of backends: CUDA-aware MPI, NVIDIA NCCL, NVIDIA NVSHMEM.
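To illustrate the halo-exchange idea in pure JAX (a conceptual sketch of mine, not the jaxDecomp API): each GPU owns a slab of the volume and trades boundary planes with its neighbours via `ppermute`.

    import numpy as np
    import jax
    import jax.numpy as jnp
    from jax.sharding import Mesh, PartitionSpec as P
    from jax.experimental.shard_map import shard_map

    ndev = jax.device_count()
    mesh = Mesh(np.array(jax.devices()), axis_names=("z",))

    def exchange_halos(slab):
        # each rank sends its last plane "up" and its first plane "down"
        # (periodic), then pads its slab with the received planes
        up = [(i, (i + 1) % ndev) for i in range(ndev)]
        down = [(i, (i - 1) % ndev) for i in range(ndev)]
        from_prev = jax.lax.ppermute(slab[-1:], "z", up)
        from_next = jax.lax.ppermute(slab[:1], "z", down)
        return jnp.concatenate([from_prev, slab, from_next], axis=0)

    field = jnp.zeros((ndev * 16, 16, 16))  # global volume, slab-decomposed
    padded = shard_map(exchange_halos, mesh=mesh,
                       in_specs=P("z"), out_specs=P("z"))(field)
    # CiC painting on the padded slab is now correct across slab
    # boundaries; after painting, halo regions are summed back by the
    # reverse exchange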

Building PM components from these distributed operations

Kabalan, Lanusse, Boucaud (in prep.)

Distributed 3D FFT for force computation

Halo exchange for CiC painting and reading

$2048^3$ LPT field, 1.02 s on 32 H100 GPUs

Performance Benchmark


Strong scaling plots of 3D FFT

Official performance benchmark from NVIDIA with cuFFTMp

Timing of 1LPT computation

Final piece of the puzzle: Efficient Sampling Algorithms

Simon-Onfroy, Lanusse, de Mattia (2025)

ESS for different sampling algorithms

Microcanonical sampler

Conclusion

The next decade of cosmological inference

  • Stage IV surveys are here and already producing surprising results

  • SBI has become mainstream: the tools are mature and reliable

  • Differentiable computing provides new opportunities for exploration

  • The real challenge is not inference methods but simulation model accuracy

  • The future lies in learnable, adaptive simulation models that can discover new physics