
Introducing Nested Learning: A new ML paradigm for continual learning

Google Research has introduced Nested Learning, a paradigm that treats machine learning models as systems of interconnected, multi-level optimization problems rather than separate architectures and training rules. By unifying structure and optimization through varying update frequencies, this approach aims to mitigate "catastrophic forgetting," the tendency for models to lose old knowledge when acquiring new skills. The researchers validated this framework through "Hope," a self-modifying architecture that outperforms current state-of-the-art models in long-context memory and language modeling.

The Nested Learning Paradigm

This framework shifts the view of machine learning from a single continuous process to a set of coherent, nested optimization problems. Each component within a model is characterized by its own "context flow"—the specific set of information it learns from—and its own update frequency.

  • The paradigm argues that architecture (structure) and optimization (training rules) are fundamentally the same concept, differing only in computational depth and update frequency.
  • Associative memory is the core illustrative concept: the training process (backpropagation) is modeled as an associative memory that maps each data point to its local error value.
  • By assigning each component an update frequency, the researchers can order these problems into "levels," yielding a more unified and efficient learning system inspired by the neuroplasticity of the human brain.
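The idea of ordering components into levels by update frequency can be sketched in a few lines. This is an illustrative toy, not Google's implementation: the names `fast`, `slow`, and `PERIOD` are assumptions, and the model is a trivial linear regressor with two parameter groups, one updated every step (a high-frequency level) and one updated every `PERIOD` steps from accumulated gradients (a low-frequency level).

```python
import numpy as np

# Toy data stream: noiseless linear targets.
rng = np.random.default_rng(0)
X = rng.normal(size=(64, 4))
w_true = np.array([1.0, -2.0, 0.5, 3.0])
y = X @ w_true

fast = np.zeros(4)            # high-frequency level: updated every step
slow = np.zeros(4)            # low-frequency level: updated every PERIOD steps
PERIOD = 8
LR = 0.05
slow_grad_acc = np.zeros(4)   # gradients accumulated for the slow level

for t, (x, target) in enumerate(zip(X, y), start=1):
    pred = x @ (fast + slow)          # both levels contribute to the output
    err = pred - target
    grad = err * x
    fast -= LR * grad                 # inner, fast context flow
    slow_grad_acc += grad
    if t % PERIOD == 0:               # outer, slow context flow
        slow -= LR * slow_grad_acc / PERIOD
        slow_grad_acc[:] = 0.0

# The combined parameters should fit the stream better than the zero init.
resid = np.linalg.norm(X @ (fast + slow) - y)
print(resid < np.linalg.norm(y))
```

The point of the sketch is only the scheduling: each parameter group sees its own slice of the gradient stream at its own rate, which is the minimal version of ordering optimization problems into levels.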

Deep Optimizers and Refined Objectives

Nested Learning provides a principled way to improve standard optimization algorithms by viewing them through the lens of associative memory modules.

  • Existing momentum-based optimizers often rely on simple dot-product similarity, which fails to account for how different data samples relate to one another.
  • By replacing these simple similarities with standard loss functions, such as L2 regression loss, the researchers derived new momentum formulations that are more resilient to imperfect or noisy data.
  • This approach turns the optimizer itself into a deeper learning component with its own internal optimization objective.
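One way to see the optimizer-as-learner idea concretely: a momentum buffer `m` can be treated as a tiny associative memory trained by gradient descent on an internal objective. The sketch below (an assumption for illustration, not the paper's exact derivation) uses the internal objective L(m) = 0.5·||m − g||², and shows that one inner gradient step per incoming gradient recovers the familiar exponential-moving-average momentum; richer internal objectives would yield the deeper optimizers the bullet describes.

```python
import numpy as np

def momentum_as_inner_optimizer(grads, inner_lr=0.1):
    """Update m by one gradient step on L(m) = 0.5 * ||m - g||^2 per gradient."""
    m = np.zeros_like(grads[0])
    for g in grads:
        inner_grad = m - g             # d/dm of 0.5 * ||m - g||^2
        m = m - inner_lr * inner_grad  # == (1 - inner_lr)*m + inner_lr*g
    return m

def ema_momentum(grads, beta=0.9):
    """Classical momentum buffer: m <- beta*m + (1 - beta)*g."""
    m = np.zeros_like(grads[0])
    for g in grads:
        m = beta * m + (1 - beta) * g
    return m

grads = [np.array([1.0, -1.0]), np.array([0.5, 0.5]), np.array([2.0, 0.0])]
a = momentum_as_inner_optimizer(grads, inner_lr=0.1)
b = ema_momentum(grads, beta=0.9)
print(np.allclose(a, b))   # → True
```

With `inner_lr = 1 - beta` the two are algebraically identical, which is why the optimizer can legitimately be viewed as a learning component with its own objective: swapping that objective changes the momentum rule.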

Continuum Memory Systems and the "Hope" Architecture

The paradigm addresses the limitations of Large Language Models (LLMs), which are often restricted to either their immediate context window or the static knowledge absorbed during pre-training.

  • The researchers developed "Hope," a proof-of-concept architecture that utilizes multi-time-scale updates for its internal components.
  • While standard Transformers act primarily as short-term memory, the Nested Learning approach allows for "continuum memory" that manages long-context information more effectively.
  • Experimental results show that this self-modifying architecture achieves superior performance in language modeling compared to existing state-of-the-art models.
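A "continuum memory" can be caricatured as a chain of buffers, each summarizing the level below at a progressively slower update frequency, so recent inputs dominate the fast levels while long-range context persists in the slow ones. The sketch below is an assumed toy design for intuition only; it is not the Hope architecture, and the class name, periods, and blending rule are all invented for illustration.

```python
import numpy as np

class ContinuumMemory:
    """Toy multi-time-scale memory: level i updates every periods[i] steps."""

    def __init__(self, dim, periods=(1, 4, 16)):
        self.periods = periods
        self.levels = [np.zeros(dim) for _ in periods]
        self.t = 0

    def write(self, x, lr=0.5):
        self.t += 1
        signal = x
        for i, period in enumerate(self.periods):
            if self.t % period == 0:       # slower levels update rarely
                self.levels[i] += lr * (signal - self.levels[i])
            signal = self.levels[i]        # each level feeds the next one up

    def read(self):
        # Blend all time scales into a single context vector.
        return np.mean(self.levels, axis=0)

mem = ContinuumMemory(dim=3)
for step in range(32):
    token = np.full(3, float(step % 2))    # alternating toy "tokens"
    mem.write(token)
print(mem.read().shape)   # → (3,)
```

The fast level tracks the most recent token closely, while the slow levels retain a smoothed history, which is the minimal version of short-term attention coexisting with longer-lived memory.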

By recognizing that every part of a model is essentially an optimizer operating at its own frequency, Nested Learning offers a path toward AI that can adapt to new experiences in real time. This structural shift moves away from the "static pre-training" bottleneck and toward systems capable of true human-like neuroplasticity and lifelong learning.