Why is Weather Forecasting such a Good Problem for ML?
Thanks to Tom Beucler, Janni Yuval and Stephan Rasp for very helpful feedback and discussions. Also available at: https://notesonclimate.substack.com/p/why-is-weather-forecasting-such-a
As ML and AI march through the sciences, one place they’ve been particularly successful is weather forecasting. After just a few years of development, a number of ML models outperform the state-of-the-art European Centre for Medium-Range Weather Forecasts (ECMWF) physics-based model, at least on some metrics, and both the ECMWF and NOAA have responded by incorporating data-driven models into their experimental suites. These forecasting models are just one part of the operational pipeline. ML is advancing into data assimilation, post-processing, and the rest. It’s still unclear where this will end up, but ML will be an important part of weather forecasting from now on.
This rapid success shows that weather is a good problem for ML/AI, as does the fact that major tech companies have devoted resources to building weather models. In some ways, this success isn’t surprising: weather forecasting has the perfect infrastructure for applying ML. Government agencies have spent billions of dollars building space- and ground-based observing systems, and data sharing is routine (though there is growing interest in proprietary weather data collection). These agencies have also built excellent data archives, and weather forecasting has well-established metrics for assessing forecast skill, like the root-mean-square error (RMSE) and the continuous ranked probability score (CRPS), with clear benchmarks. Finally, we make forecasts every 6 hours, so there is a constant stream of test cases to train and validate on.
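For concreteness, here is a minimal sketch of how those two headline scores can be computed for a toy ensemble forecast. The numbers are synthetic and invented for illustration; operational verification adds area weighting, many variables, and many lead times, but the core calculations look like this:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: 1000 verification cases, a 50-member ensemble per case.
n_cases, n_members = 1000, 50
truth = rng.normal(size=n_cases)                          # observed/analysed value
ensemble = truth[:, None] + rng.normal(scale=1.0, size=(n_cases, n_members))

# RMSE of the ensemble-mean (deterministic-style) forecast.
ens_mean = ensemble.mean(axis=1)
rmse = np.sqrt(np.mean((ens_mean - truth) ** 2))

# Empirical CRPS for each case:
#   CRPS = E|X - y| - 0.5 * E|X - X'|,  with X, X' drawn from the ensemble.
abs_err = np.abs(ensemble - truth[:, None]).mean(axis=1)
spread = np.abs(ensemble[:, :, None] - ensemble[:, None, :]).mean(axis=(1, 2))
crps = (abs_err - 0.5 * spread).mean()

print(f"RMSE of ensemble mean: {rmse:.3f}")
print(f"Mean CRPS:             {crps:.3f}")
```

RMSE rewards getting the mean right; CRPS rewards getting the whole forecast distribution right, which is why it keeps coming up below.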
But this infrastructure doesn’t explain why ML systems outperform traditional “physics-based” systems. Those systems are already quite accurate, and we have a good understanding of how the atmosphere works. It’s not like ML forecasting systems are discovering completely new physics. So why are they so good? Below, I argue that their success comes from how they deal with several sources of uncertainty inherent to weather forecasting, in addition to advantages in how they use compute. Together, these two factors explain why it took less than five years to build ML systems that can beat state-of-the-art physical models.
Structural and Parametric Uncertainty
Due to computational constraints, weather (and climate) models are run at relatively coarse horizontal resolutions of tens to hundreds of km, and unresolved processes must be parameterized. In conventional models, this means starting with a physical picture of the unresolved process, converting the picture into equations (e.g., a mass-flux convection scheme), then tuning the free parameters against observations or higher-resolution simulation data¹. These parameterizations introduce two sources of uncertainty: structural (the physical picture could be wrong) and parametric (the parameters could be wrong).
ML systems are trained on data at similar horizontal resolutions, so they also include, implicitly, parameterizations of unresolved processes. However, rather than starting from physical pictures of these processes, they learn whatever mapping scores best on the objective used in training, reducing the uncertainties associated with parameterizations.
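A toy sketch of the contrast (entirely synthetic, not a real convection scheme): a “subgrid tendency” that truly depends nonlinearly on the resolved state, approximated first by a hand-written linear relaxation with one tuned free parameter, and then by a flexible data-driven fit trained on the same “high-resolution” data.

```python
import numpy as np

rng = np.random.default_rng(1)

def true_subgrid_tendency(x, rng):
    # Synthetic "truth": a saturating damping tendency plus noise, standing in
    # for what a high-resolution simulation would tell us.
    return -np.tanh(2 * x) + rng.normal(scale=0.05, size=x.shape)

# "High-resolution" training data and an independent test set.
x_train = rng.uniform(-2, 2, size=5000)
y_train = true_subgrid_tendency(x_train, rng)
x_test = rng.uniform(-2, 2, size=2000)
y_test = true_subgrid_tendency(x_test, rng)

# 1) Conventional route: assume a physical picture -- linear relaxation,
#    tendency = -x / tau -- and tune the single free parameter tau by
#    least squares against the training data (closed form for a linear model).
slope = np.sum(x_train * y_train) / np.sum(x_train**2)
tau = -1.0 / slope
y_param = -x_test / tau

# 2) Data-driven route: skip the physical picture and fit a flexible function
#    (a degree-7 polynomial here, standing in for a neural network).
coeffs = np.polyfit(x_train, y_train, deg=7)
y_ml = np.polyval(coeffs, x_test)

def rmse(y_hat):
    return np.sqrt(np.mean((y_hat - y_test) ** 2))

print(f"tuned relaxation scheme (tau = {tau:.2f}): RMSE = {rmse(y_param):.3f}")
print(f"flexible data-driven fit:                  RMSE = {rmse(y_ml):.3f}")
```

The structurally wrong (but interpretable) relaxation scheme is left with an irreducible error no matter how well tau is tuned, while the flexible fit gets close to the noise floor. The trade-off is that the polynomial tells us nothing about the physics and extrapolates badly outside the range it was trained on, which is exactly the concern raised next.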
This approach may reduce parameterization error, but it makes ML systems more opaque (“black boxes”), and we may be concerned that they get the right answers for the wrong reasons (see Bonavita, 2024 for some concerning examples). ML parameterizations also don’t generalize well outside the training climate; indeed, attempts to use ML for climate models have struggled in climates much warmer than the ones they were trained on. This is less of a concern for weather, but as the climate drifts away from the training data, the systems will have to be retrained every few years.
Initial Condition Uncertainty
Reducing structural and parametric uncertainty is one potential source of forecasting improvement, but there is another associated with the process of weather forecasting itself.
The atmosphere is a canonical example of a chaotic system, meaning it has strong sensitivity to initial conditions (ICs)². In practice, we will never eliminate IC uncertainty, so weather forecasts are always probabilistic: we want to predict the probability distribution of future weather states (the posterior distribution) given uncertain observations of current weather conditions.
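A crude back-of-the-envelope version of this, using assumed (illustrative) numbers rather than measured ones: if small forecast errors roughly double every couple of days, and the initial-condition error is a few percent of the “saturation” error (the typical difference between two unrelated atmospheric states), the useful forecast range comes out around two weeks, consistent with footnote 2.

```python
import numpy as np

# Illustrative numbers (assumptions, not measured values): small errors in
# midlatitude forecasts roughly double every ~2 days, and the initial-condition
# error is a few percent of the saturation error.
doubling_time_days = 2.0
initial_error_fraction = 0.02   # IC error as a fraction of saturation error

# Exponential growth: error(t) = error(0) * 2**(t / doubling_time).
# Predictability is lost when error(t) reaches saturation.
horizon_days = doubling_time_days * np.log2(1.0 / initial_error_fraction)
print(f"rough predictability horizon: {horizon_days:.0f} days")   # ~11 days
```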
Physics-based systems do this by running ensembles of simulations, e.g., 30 simulations with initial perturbations designed to sample IC uncertainty. The goal is to reconstruct the posterior distribution with this ensemble – most of the ensemble members should end up near the mean of the distribution, with a few members sketching out the tails. The average across the ensemble should approximate the mean of the posterior, and the ensemble mean is generally more accurate than any individual ensemble member.
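This logic is easy to demonstrate on a toy chaotic system. The sketch below uses Lorenz’s classic three-variable model as a stand-in for the atmosphere: it perturbs an uncertain “analysis” 30 times, integrates every member forward, and compares the ensemble-mean error with that of a typical member. The perturbation sizes and lead time are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

def lorenz63_step(state, dt=0.01, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    """One RK4 step of the Lorenz-63 system."""
    def f(s):
        x, y, z = s
        return np.array([sigma * (y - x), x * (rho - z) - y, x * y - beta * z])
    k1 = f(state)
    k2 = f(state + 0.5 * dt * k1)
    k3 = f(state + 0.5 * dt * k2)
    k4 = f(state + dt * k3)
    return state + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)

def integrate(state, n_steps):
    for _ in range(n_steps):
        state = lorenz63_step(state)
    return state

# "Truth": the real trajectory. "Analysis": truth plus observation error.
truth0 = integrate(np.array([1.0, 1.0, 20.0]), 2000)   # spin up onto the attractor
analysis = truth0 + rng.normal(scale=0.1, size=3)

# 30-member ensemble: perturb the analysis with IC-sized noise, run forward.
n_members, lead_steps = 30, 500
members0 = analysis + rng.normal(scale=0.1, size=(n_members, 3))

truth = integrate(truth0, lead_steps)
members = np.array([integrate(m, lead_steps) for m in members0])

member_err = np.linalg.norm(members - truth, axis=1).mean()   # typical member
ens_mean_err = np.linalg.norm(members.mean(axis=0) - truth)   # ensemble mean
print(f"mean error of individual members: {member_err:.2f}")
print(f"error of the ensemble mean:       {ens_mean_err:.2f}")
```

The ensemble mean is smoother and typically closer to the truth than any single member, which is the same averaging effect that becomes “blurring” in the next section.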
ML systems address IC uncertainty in two different ways. The first generation of ML models, like GraphCast and Pangu, were deterministic and skipped the step of generating ensembles. Instead, they learned mappings directly from uncertain ICs to the conditional mean of the posterior³. Whereas each run of a physics-based ensemble treats the ICs as perfectly accurate and tries to faithfully simulate forward in time, a deterministic ML forecast knows the ICs are uncertain and tries to predict the posterior mean directly. In principle, a large enough physics-based ensemble will reconstruct the posterior distribution, so the fact that deterministic ML forecasts are able to beat the state-of-the-art implies they reconstruct the posterior mean more efficiently.
The problem with this approach is that deterministic ML weather forecasts tend to underpredict the magnitude and frequency of extreme events, and the resulting fields often look “smoothed-out” or “blurred” compared to physics-based forecasts⁴. This is a direct consequence of targeting the conditional mean: averaging over possible futures washes out the sharp features that any single future would contain. The data might point to a hurricane developing, but deterministic ML systems will hedge their bets, whereas a traditional model will make the call much earlier. Averaged over enough events, ML models may produce more accurate forecasts, but they will miss some extreme events along the way.
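The hurricane example can be boiled down to a few lines. Suppose that, given what the model can see in the initial conditions, a damaging event of intensity 1 will occur with probability 0.3 and nothing will happen otherwise (numbers invented for illustration). A single-valued forecast trained to minimize squared error converges to the conditional mean, so it never calls the event at full strength:

```python
import numpy as np

rng = np.random.default_rng(3)

# Given the (uncertain) initial conditions, the event happens with p = 0.3:
# intensity 1 if it happens, 0 if it does not. (Illustrative numbers.)
p_event = 0.3
outcomes = rng.binomial(1, p_event, size=100_000).astype(float)

# The MSE-optimal deterministic forecast is the single number c minimizing
# mean((c - outcome)^2); the minimizer is the conditional mean, here ~0.3.
candidates = np.linspace(0, 1, 101)
mse = [np.mean((c - outcomes) ** 2) for c in candidates]
best = candidates[int(np.argmin(mse))]

print(f"MSE-optimal forecast:     {best:.2f}")   # ~= p_event, never ~= 1.0
print(f"largest observed outcome: {outcomes.max():.1f}")
```

Every individual future is either 0 or 1, but the loss-minimizing forecast is about 0.3: a hedged, “blurred” prediction that never commits to the extreme.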
To address this issue, generative ML models like GenCast and FGN inject noise into their forecasts to generate ensembles that attempt to reconstruct the full posterior distribution. There are several different ways of doing this, but all learn mappings from the ICs to the full posterior, rather than just the mean. Reconstructing the whole distribution should solve the “blurring” problem, and if you were really worried about extremes you could weight the tails of the distributions more strongly in training.
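A minimal sketch of the idea, continuing the toy event example above. The “generator” here is just a hand-written function that maps the initial-condition information and a noise draw to one possible outcome; real systems like GenCast learn a mapping of this kind (with diffusion-style training), but the role of the noise is the same.

```python
import numpy as np

rng = np.random.default_rng(4)

def generative_forecast(p_event, noise):
    """Map (initial-condition information, noise draw) -> one possible future.

    Stand-in for a learned generative model: each noise draw selects one
    sample from the forecast distribution instead of its mean.
    """
    return 1.0 if noise < p_event else 0.0

# Turn a single uncertain initial condition (summarized here by p_event = 0.3)
# into a 50-member ensemble by injecting different noise realizations.
p_event = 0.3
ensemble = np.array([generative_forecast(p_event, rng.uniform()) for _ in range(50)])

print(f"ensemble mean: {ensemble.mean():.2f}")   # ~0.3, like the deterministic forecast
print(f"members calling the event at full strength: {int(ensemble.sum())} of 50")
```

The ensemble mean still hedges, but individual members commit to the full event, so the tails of the distribution, and scores like CRPS that reward them, are recovered.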
The improved skill of generative ML models likely comes in part from having less parameterization uncertainty, but it may also come from learning better ways of dealing with IC uncertainty⁵. Much effort has gone into optimizing the IC perturbations used to generate physical ensembles, but the ML systems seem to squeeze out a little more skill.
Computational Efficiency
The final advantage for ML systems comes from their increased computational efficiency. Physical forecasts require some of the largest supercomputers on Earth to run. ML forecast models use substantial compute up front for training, but making forecasts is then cheap, and they can quickly generate much larger ensembles than is feasible with physical models.
One way of thinking about this advantage is that physical systems estimate the posterior indirectly by running a large number of forward simulations, whereas ML systems estimate it directly via a pre-learned mapping. In the limit of large enough ensembles (and perfect dynamics and IC perturbations) both methods should converge on the same result, but physical systems have to do a lot more work to get there.
At the moment, this advantage may not matter so much. A rough rule of thumb is that, for most quantities, the expected score of an M-member ensemble exceeds that of an infinite-member ensemble by a factor of roughly 1 + 1/M, where M is the ensemble size (see Leutbecher 2018 for a very thorough investigation). So going from 50 ensemble members to 200 only improves the score by a few %.
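A quick Monte Carlo check of that rule of thumb, using the mean-squared error of the ensemble mean for a toy Gaussian posterior (the exact constant in front of 1/M depends on the score, but the scaling is the point):

```python
import numpy as np

rng = np.random.default_rng(5)
n_cases = 1_000_000
truth = rng.normal(size=n_cases)   # verifying "truth": one draw from the posterior per case

def ensemble_mean_mse(m):
    """MSE of an m-member ensemble mean; members are further draws from the posterior."""
    ens_sum = np.zeros(n_cases)
    for _ in range(m):
        ens_sum += rng.normal(size=n_cases)   # add one ensemble member at a time
    return np.mean((ens_sum / m - truth) ** 2)

for m in (50, 200):
    print(f"M = {m:3d}: MSE = {ensemble_mean_mse(m):.4f}   (1 + 1/M = {1 + 1/m:.4f})")
# The infinite-ensemble limit is 1.0, so quadrupling the ensemble from 50 to
# 200 members improves this score by only about 1.5%.
```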
Ensemble size matters more for extremes, but as the blurring discussion showed, deterministic ML models are able to beat physical models even while doing worse on the tails of the distribution. As the blurring problem is fixed, the ability to generate larger ensembles should further add to ML systems’ advantage. Moreover, cutting down the time it takes to make forecasts will be beneficial independent of skill improvements.
The ML Forecasting Revolution
We are in the middle of a revolution in weather forecasting, and ML methods will soon be incorporated into all aspects of the forecasting process. This revolution builds on decades of research and billions of dollars of government funding, which laid the perfect foundation for ML to be applied. Existing physical models are accurate and highly sophisticated. ML systems are able to beat them – and will likely replace them – for two reasons: they are better at using the available data to account for uncertainties, and they are much cheaper to run, allowing them to quickly generate very large ensembles.
This doesn’t mean that we should blindly trust generative ML systems. In a changing climate we will have to be wary of moving outside the regime they were trained for (how often do we retrain?). We also need a better understanding of how to interpret ML ensembles. We know in a statistical sense how fast the individual members of a physics-based ensemble should drift apart, but we don’t know how individual ML ensemble members should behave. As FGN shows, one reason it’s hard to develop a theory for how ML ensembles diverge is that individual members may sample both IC uncertainty and physical uncertainty.
But at the same time, the success of ML forecast systems shows that traditional models still have room for improvement. In this respect, we should think of these new systems as a valuable resource to learn from. With tools to probe what the ML systems are learning and/or distill the knowledge into equations, we could learn how they are parameterizing unresolved processes and how they are treating IC uncertainty. We’ve been studying these problems for decades, but the development of ML forecasting is an opportunity to gain new insights.
1 The tuning is often fairly ad hoc, though there is a growing push to automate model tuning.
2 It’s not that chaotic – if it were, we wouldn’t be able to predict the weather at all. Nearby trajectories diverge, but slowly, giving a 10-14 day window of predictability and enough underlying structure for ML systems to usefully learn.
3 In practice, most current ML weather models are trained on reanalysis data (e.g., ERA5), which combine observations with a physics-based model through data assimilation, rather than on the raw observations, so the models learn the posterior as represented in the reanalysis. It’s an open question whether it’s better to feed the raw data directly into an ML system or to first process the data using a separate physics- or ML-based data assimilation system.
4 It’s actually a bit more subtle than this. These systems learn a compromise between a short-range transition function they can use to step forward in time and a long-range prediction tuned to a target lead time (e.g., 3 days). This combination allows them to make forecasts at arbitrary leads, and the blurring depends on how the system balances the two predictions.
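As a schematic of what such a compromise can look like in training (my own toy illustration, not the actual loss of any published model): a one-step model is rolled out autoregressively, and the objective mixes the single-step error with the error at a target lead, with a weight that controls how much the model is tuned for sharp short steps versus accurate, typically smoother, long-lead predictions.

```python
import numpy as np

def combined_loss(step_fn, x0, targets, target_lead, w_long=1.0):
    """Toy training objective mixing a one-step loss with a target-lead loss.

    step_fn:     the model's learned transition function x_t -> x_{t+1}
    x0:          initial state
    targets:     "true" states at steps 1..N from the training data
    target_lead: lead (in steps) the long-range term is tuned to
    w_long:      weight balancing short-range vs long-range accuracy
    """
    # Roll the one-step model out autoregressively.
    states, x = [], x0
    for _ in range(target_lead):
        x = step_fn(x)
        states.append(x)

    loss_short = np.mean((states[0] - targets[0]) ** 2)                        # one step ahead
    loss_long = np.mean((states[target_lead - 1] - targets[target_lead - 1]) ** 2)
    return loss_short + w_long * loss_long

# Tiny synthetic example: a damped "true" system and a slightly wrong model.
rng = np.random.default_rng(6)
x0 = rng.normal(size=5)
true_states = [x0 * 0.9 ** k for k in range(1, 13)]     # truth at steps 1..12
model = lambda x: 0.85 * x                              # candidate transition function

print(f"combined loss: {combined_loss(model, x0, true_states, target_lead=12):.4f}")
```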
5 Interestingly, FGN adds noise to the model weights, allowing it to sample both IC uncertainty and parameterization uncertainty. The latter is reminiscent of the stochastic parameterizations used in some traditional models.