The preprint, titled “What do Language Models Learn and When? The Implicit Curriculum Hypothesis,” starts from a gap in current scaling-law practice. Large language models can perform what the authors describe as “remarkably complex tasks,” but the authors note that the fine-grained details of how these capabilities emerge during pretraining remain poorly understood. Scaling laws fit to validation loss quantify how much a model improves with additional compute, yet they do not specify which skills are acquired or in what order, so capability development is effectively treated as a black box behind a single scalar metric.

To address this, the authors propose the Implicit Curriculum Hypothesis: during pretraining, models follow a compositional and predictable curriculum across architectures and data mixtures. Rather than assuming that skills appear idiosyncratically as loss decreases, the hypothesis asserts that capabilities come online in a structured order, and that this order can be characterized and compared across different model families. The work positions this as a direct remedy for the limitations of loss-only scaling laws, aiming to make the sequence of skill acquisition an explicit object of study.

The empirical strategy centers on a suite of “simple, composable tasks” designed to probe specific capabilities. The task set spans retrieval, morphological transformations, coreference, logical reasoning, and mathematics, giving the authors controlled handles on distinct skill types and their possible compositions. Using these tasks, they track “emergence points” across four model families ranging from 410 million to 13 billion parameters, defining emergence in terms of when a model crosses fixed accuracy thresholds on each task during pretraining.
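
The threshold-based definition of emergence can be illustrated with a minimal sketch. The function name `emergence_point`, the threshold value of 0.5, and the checkpoint data are all assumptions made for illustration; the summary above does not state the thresholds the authors use or whether they require accuracy to stay above the threshold afterward.

```python
from typing import Optional, Sequence


def emergence_point(
    steps: Sequence[int],
    accuracies: Sequence[float],
    threshold: float = 0.5,  # assumed value; the actual thresholds are not given here
) -> Optional[int]:
    """Return the first training step at which accuracy reaches the threshold.

    Returns None if the threshold is never reached.
    """
    if len(steps) != len(accuracies):
        raise ValueError("steps and accuracies must have the same length")
    for step, acc in zip(steps, accuracies):
        if acc >= threshold:
            return step
    return None


# Hypothetical checkpoint steps and accuracies for two illustrative tasks.
steps = [1000, 2000, 4000, 8000, 16000]
retrieval_acc = [0.10, 0.35, 0.62, 0.80, 0.91]
composite_acc = [0.02, 0.05, 0.20, 0.55, 0.78]

retrieval_emergence = emergence_point(steps, retrieval_acc)
composite_emergence = emergence_point(steps, composite_acc)
print(retrieval_emergence, composite_emergence)
```

In this toy data the composite task crosses the threshold at a later checkpoint than the simpler task, which is the pattern the hypothesis would predict.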

From these measurements, the authors derive “emergence orderings” for each model: the relative order in which tasks reach their accuracy thresholds. They report that these orderings are “strikingly consistent” across models, with a Spearman rank correlation of ρ = 0.81 computed over 45 model pairs. Within this framework, composite tasks “most often” emerge after their component tasks, consistent with the idea of a compositional curriculum in which simpler building-block skills are acquired before more complex combinations. The consistency across four distinct model families is presented as evidence that the curriculum is not an artifact of a single architecture or training run.
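
A pairwise comparison of emergence orderings could be sketched as follows. This is a plain-Python Spearman rank correlation applied to every pair of models; the model names and emergence steps are hypothetical, and how the authors aggregate the pairwise values is not stated in this summary. As a side note, 45 is the number of unordered pairs among 10 models, which is an inference rather than something the summary states.

```python
from itertools import combinations
from math import comb
from statistics import mean
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the rank vectors.

    Undefined (division by zero) if either input is constant.
    """
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical emergence steps for the same four tasks in three models.
emergence = {
    "model_a": [1000, 2000, 4000, 8000],
    "model_b": [2000, 4000, 8000, 16000],
    "model_c": [1000, 4000, 2000, 8000],
}

pairwise = {
    (a, b): spearman(emergence[a], emergence[b])
    for a, b in combinations(emergence, 2)
}
mean_rho = mean(pairwise.values())
print(pairwise, mean_rho)

# Inference, not stated in the summary: 10 models yield 45 unordered pairs.
print(comb(10, 2))
```

Because only ranks matter, two models that learn tasks in the same order get a correlation of 1 even if one of them reaches every threshold later in training.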

The paper then moves inside the models, arguing that this curriculum structure is encoded in their internal representations. The authors introduce “function vector” representations for tasks and find that tasks with similar function vectors tend to follow similar training trajectories. In other words, proximity in this representation space correlates with similarity in when and how performance on those tasks improves during pretraining. This link between representational geometry and learning dynamics is used to support the claim that the curriculum is not only observable at the behavioral level but also reflected in how the models internally organize different functions.
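
One way to picture the reported link between representation and learning dynamics is to compare, for each pair of tasks, how similar their vectors are with how similar their accuracy curves are. The sketch below assumes function vectors are already available as plain lists of floats and uses cosine similarity and Pearson correlation as the two similarity measures; both choices, and all of the data, are illustrative assumptions rather than the authors' procedure.

```python
from itertools import combinations
from math import sqrt
from statistics import mean
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation between two non-constant sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / sqrt(
        sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)
    )


# Hypothetical function vectors (3 dimensions for readability).
function_vectors = {
    "task_a": [1.0, 0.0, 0.2],
    "task_b": [0.9, 0.1, 0.3],
    "task_c": [0.0, 1.0, 0.1],
}

# Hypothetical accuracy at four successive checkpoints.
trajectories = {
    "task_a": [0.1, 0.3, 0.6, 0.9],
    "task_b": [0.1, 0.2, 0.6, 0.8],
    "task_c": [0.0, 0.0, 0.1, 0.7],
}

# For each task pair: (vector similarity, trajectory similarity).
comparison = {
    (a, b): (
        cosine_similarity(function_vectors[a], function_vectors[b]),
        pearson(trajectories[a], trajectories[b]),
    )
    for a, b in combinations(function_vectors, 2)
}

for pair, (vec_sim, traj_sim) in comparison.items():
    print(pair, round(vec_sim, 3), round(traj_sim, 3))
```

If the reported relationship holds, pairs with a high first value should tend to have a high second value as well.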

Leveraging this representation space, the authors test whether they can predict the training trajectories of simple held-out compositional tasks that were not directly evaluated during pretraining. They report that, using only the structure inferred from the original task set, they can forecast these held-out trajectories throughout pretraining with R² values between 0.68 and 0.84 across models. This predictive performance is presented as evidence that the implicit curriculum is sufficiently regular to be extrapolated: once the representation space is learned, it constrains how new, related tasks will evolve as training progresses.
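
For readers unfamiliar with the metric, the following sketch computes R² in its standard coefficient-of-determination form, comparing a predicted accuracy trajectory with an observed one. The summary does not say exactly how the authors compute R², and the trajectories below are made up for illustration.

```python
from typing import Sequence


def r_squared(observed: Sequence[float], predicted: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_residual / SS_total.

    Undefined (division by zero) if the observed values are all equal.
    """
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1 - ss_res / ss_tot


# Hypothetical accuracy of a held-out task at five checkpoints,
# alongside a hypothetical forecast for the same checkpoints.
observed = [0.1, 0.3, 0.5, 0.7, 0.9]
predicted = [0.2, 0.3, 0.4, 0.7, 0.9]

r2 = r_squared(observed, predicted)
print(round(r2, 3))
```

A value of 1 means the forecast matches the observed trajectory exactly, while a value of 0 means it does no better than predicting the mean observed accuracy at every checkpoint.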

Taken together, the results lead the authors to conclude that pretraining is “more structured than loss curves reveal.” In their account, skills emerge in a compositional order that is consistent across models and can be read off from model internals, in contrast to the coarse picture provided by validation-loss scaling laws alone. The Implicit Curriculum Hypothesis, supported by cross-model emergence statistics and representation-based prediction, reframes pretraining not as an undifferentiated optimization process but as one with a measurable, compositional sequence of capability acquisition.

Original source: http://arxiv.org/abs/2604.08510v1