Having generation and verification co-evolve on the same online rollouts is the key fix, and the ablation (Figure 11) shows it matters: the co-evolving variant consistently beats the non-co-evolving one by 4–6%.
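As a toy illustration of what "co-evolving on the same rollouts" means (purely hypothetical dynamics, not the paper's actual algorithm): both a generator parameter and a verifier threshold are updated from the same online rollout at each step, so the verifier's bar rises as the generator improves.

```python
def co_evolve(steps=200, lr_g=0.01, lr_v=0.005, margin=0.05):
    """Toy co-evolution sketch: generator skill g and verifier
    threshold v are both updated from the same online rollout."""
    g, v = 0.0, 0.0
    for _ in range(steps):
        rollout = g + margin               # online sample from the current generator
        accepted = rollout > v             # the verifier judges that same rollout
        g += lr_g if accepted else -lr_g   # generator trained on the verifier's signal
        v += lr_v * (rollout - v)          # verifier's bar tracks rollout quality
    return g, v
```

The point of the sketch is the coupling: because the verifier is trained on the generator's current rollouts rather than a frozen dataset, its standard moves with the policy instead of going stale.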
While SFT distillation meaningfully improves overall performance over the base model, the gap between the two approaches is most apparent when combined with test-time compute. On in-distribution tasks, SFT benefits substantially from parallel sampling (69.1 → 75.3), yet on out-of-distribution tasks the gains are negligible (59.4 → 59.6). This suggests that distillation teaches the model to imitate task-specific expert behavior, which scales well within the training distribution but fails to generalize beyond it. In contrast, KARL benefits from test-time compute both in- and out-of-distribution, indicating that RL develops more general search capabilities rather than task-specific heuristics.
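The test-time-compute comparison above is, in essence, best-of-n parallel sampling: draw n independent samples and keep the one a scorer ranks highest. A minimal sketch, where `generate` and `score` are hypothetical stand-ins rather than the paper's actual components:

```python
def best_of_n(generate, score, prompt, n=8):
    """Parallel-sampling sketch: draw n candidates independently
    from the model and return the highest-scoring one."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=score)
```

Whether extra samples help depends entirely on whether the scorer's preferences generalize, which is exactly the in- versus out-of-distribution gap described above.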
Abstract page for arXiv paper 2505.03335: Absolute Zero: Reinforced Self-play Reasoning with Zero Data
This work thus provides strong empirical evidence towards developing scaling laws for reinforcement learning.
SPRING is an LLM-based policy that outperforms reinforcement-learning algorithms in an interactive environment requiring multi-task planning and reasoning. Researchers from Carnegie Mellon University, NVIDIA, Ariel University, and Microsoft investigated the use of Large Language Models (LLMs) for understanding and reasoning with human knowledge in the context of games. They propose a two-stage approach called SPRING: first studying an academic paper, then using a question-answer (QA) framework to reason with the knowledge obtained. In the first stage, SPRING reads the LaTeX source code of the original paper by Hafner (2021).
"Finally, we believe that more powerful AI-designed hardware will fuel advances in AI, creating a symbiotic relationship between the two fields."
We are not just going to solve another reinforcement learning environment but to create one from scratch.
light on details
"DeepMind's version of reinforcement learning, which uses 'temporal value transport' to send a signal backward from the reward to shape actions, does better than alternative forms of neural networks. Here, the TVT program is compared to long short-term memory (LSTM) neural networks, with and without memory, and a basic reconstructive memory agent."
AlphaStock fully exploits the interrelationships among stocks and opens a door to solving the “black box” problem of using deep learning models in financial markets. Back-testing and simulation experiments on U.S. and Chinese stock markets showed that AlphaStock performed much better than competing strategies. Interestingly, AlphaStock suggests buying stocks with high long-term growth, low volatility, and high intrinsic value that have been recently undervalued.
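The interpreted strategy above can be caricatured as a simple cross-sectional factor score. This is only an illustration of the stated factors with arbitrary weights; AlphaStock itself is an attention-based deep reinforcement learning model, not this formula:

```python
def factor_score(stock):
    """Illustrative score from the factors named above: reward
    long-term growth, intrinsic value, and recent undervaluation;
    penalize volatility. Equal weights chosen for illustration only."""
    return (stock["growth"] + stock["value"]
            + stock["undervaluation"] - stock["volatility"])

def rank_buy_candidates(stocks, k=2):
    """Return the tickers of the top-k stocks by factor score."""
    ranked = sorted(stocks, key=factor_score, reverse=True)
    return [s["ticker"] for s in ranked[:k]]
```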