Having generation and verification co-evolve on the same online rollouts is the key fix, and the ablation (Figure 11) shows it matters: the co-evolving variant consistently beats the non-co-evolving one by 4–6%.
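As a toy illustration of what "co-evolving on the same rollouts" means (purely hypothetical dynamics, not the paper's actual algorithm): both a generator parameter and a verifier threshold are updated from the same online rollout at each step, so the verifier's bar rises as the generator improves.

```python
def co_evolve(steps=200, lr_g=0.01, lr_v=0.005, margin=0.05):
    """Toy co-evolution sketch: generator skill g and verifier
    threshold v are both updated from the same online rollout."""
    g, v = 0.0, 0.0
    for _ in range(steps):
        rollout = g + margin               # online sample from the current generator
        accepted = rollout > v             # the verifier judges that same rollout
        g += lr_g if accepted else -lr_g   # generator trained on the verifier's signal
        v += lr_v * (rollout - v)          # verifier's bar tracks rollout quality
    return g, v
```

The point of the sketch is the coupling: because the verifier is trained on the generator's current rollouts rather than a frozen dataset, its standard moves with the policy instead of going stale.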
While SFT distillation meaningfully improves overall performance over the base model, the gap between the two approaches is most apparent when combined with test-time compute. On in-distribution tasks, SFT benefits substantially from parallel sampling (69.1 → 75.3), yet on out-of-distribution tasks the gains are negligible (59.4 → 59.6). This suggests that distillation teaches the model to imitate task-specific expert behavior, which scales well within the training distribution but fails to generalize beyond it. In contrast, KARL benefits from test-time compute both in- and out-of-distribution, indicating that RL develops more general search capabilities rather than task-specific heuristics.
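The test-time-compute comparison above is, in essence, best-of-n parallel sampling: draw n independent samples and keep the one a scorer ranks highest. A minimal sketch, where `generate` and `score` are hypothetical stand-ins rather than the paper's actual components:

```python
def best_of_n(generate, score, prompt, n=8):
    """Parallel-sampling sketch: draw n candidates independently
    from the model and return the highest-scoring one."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=score)
```

Whether extra samples help depends entirely on whether the scorer's preferences generalize, which is exactly the in- versus out-of-distribution gap described above.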
Abstract page for arXiv paper 2505.03335: Absolute Zero: Reinforced Self-play Reasoning with Zero Data
This work thus provides strong empirical evidence towards developing scaling laws for reinforcement learning.
SPRING is an LLM-based policy that outperforms reinforcement-learning algorithms in an interactive environment requiring multi-task planning and reasoning. Researchers from Carnegie Mellon University, NVIDIA, Ariel University, and Microsoft investigated the use of Large Language Models (LLMs) for understanding and reasoning with human knowledge in the context of games. They propose a two-stage approach called SPRING: first studying an academic paper, then using a question-answer (QA) framework to reason with the knowledge obtained. In the first stage, SPRING reads the LaTeX source code of the original paper by Hafner (2021).
"Finally, we believe that more powerful AI-designed hardware will fuel advances in AI, creating a symbiotic relationship between the two fields."
We are not just going to solve another reinforcement learning environment but to create one from scratch.
light on details
"DeepMind's version of reinforcement learning, which uses 'temporal value transport' to send a signal backward from the reward to shape actions, does better than alternative forms of neural networks. Here, the TVT program is compared to long short-term memory (LSTM) neural networks, with and without memory, and a basic reconstructive memory agent."
AlphaStock fully exploits the interrelationships among stocks and opens a door to solving the “black box” problem of using deep learning models in financial markets. Back-testing and simulation experiments on U.S. and Chinese stock markets showed that AlphaStock performed much better than competing strategies. Interestingly, AlphaStock suggests buying stocks with high long-term growth, low volatility, and high intrinsic value that have been recently undervalued.
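The interpreted strategy above can be caricatured as a simple cross-sectional factor score. This is only an illustration of the stated factors with arbitrary weights; AlphaStock itself is an attention-based deep reinforcement learning model, not this formula:

```python
def factor_score(stock):
    """Illustrative score from the factors named above: reward
    long-term growth, intrinsic value, and recent undervaluation;
    penalize volatility. Equal weights chosen for illustration only."""
    return (stock["growth"] + stock["value"]
            + stock["undervaluation"] - stock["volatility"])

def rank_buy_candidates(stocks, k=2):
    """Return the tickers of the top-k stocks by factor score."""
    ranked = sorted(stocks, key=factor_score, reverse=True)
    return [s["ticker"] for s in ranked[:k]]
```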