If AI models can detect when they are being evaluated, the effectiveness of evaluations might be compromised. For example, models could have systematically different behavior during evaluations,...
Creating training data for software engineering agents is difficult. Until now.
Introducing SWE-smith: Generate 100s to 1000s of task instances for any GitHub repository.
We've generated 50k+ task instances for 128 popular GitHub repositories, then trained our own LM for SWE-agent.
The result? SWE-agent-LM-32B achieves 40% pass@1 on SWE-bench Verified.
Now, we've open-sourced everything, and we're excited to see what you build with it!
Check out the tutorial below to generate 100 task instances for any GitHub repository in 10 minutes.
The success of Large Language Models (LLMs) has sparked interest in various agentic applications. A key hypothesis is that LLMs, leveraging common sense and Chain-of-Thought (CoT) reasoning, can effectively explore and efficiently solve complex domains. However, LLM agents have been found to suffer from sub-optimal exploration and the knowing-doing gap, the inability to effectively act on knowledge present in the model. In this work, we systematically study why LLMs perform sub-optimally in decision-making scenarios. In particular, we closely examine three prevalent failure modes: greediness, frequency bias, and the knowing-doing gap. We propose mitigation of these shortcomings by fine-tuning via Reinforcement Learning (RL) on self-generated CoT rationales. Our experiments across multi-armed bandits, contextual bandits, and Tic-tac-toe, demonstrate that RL fine-tuning enhances the decision-making abilities of LLMs by increasing exploration and narrowing the knowing-doing gap. Finally, we study both classic exploration mechanisms, such as epsilon-greedy, and LLM-specific approaches, such as self-correction and self-consistency, to enable more effective fine-tuning of LLMs for decision-making.
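The abstract above contrasts LLM decision-making with classic exploration mechanisms such as epsilon-greedy. As a point of reference, here is a minimal sketch of an epsilon-greedy agent on a Bernoulli multi-armed bandit (the arm payoff probabilities and hyperparameters are illustrative, not taken from the paper):

```python
import random

def epsilon_greedy_bandit(true_means, steps=10_000, epsilon=0.1, seed=0):
    """Minimal epsilon-greedy agent on a Bernoulli multi-armed bandit."""
    rng = random.Random(seed)
    n_arms = len(true_means)
    counts = [0] * n_arms
    values = [0.0] * n_arms  # running mean reward estimate per arm
    total_reward = 0
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)                       # explore
        else:
            arm = max(range(n_arms), key=values.__getitem__)  # exploit
        reward = 1 if rng.random() < true_means[arm] else 0
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]   # incremental mean
        total_reward += reward
    return values, total_reward

values, total = epsilon_greedy_bandit([0.2, 0.5, 0.8])
```

A purely greedy agent (epsilon=0) can lock onto a suboptimal arm after a few lucky pulls, which is exactly the "greediness" failure mode the paper studies in LLMs; the epsilon fraction of random pulls keeps the value estimates of all arms converging.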
A trace of the history of GPT-1 and its predecessors.
Replace 'hub' with 'ingest' in any GitHub URL to get a prompt-friendly text digest of the repository.
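The substitution is a plain string rewrite on the URL; a minimal sketch (the example repository URL is illustrative):

```python
def to_ingest_url(github_url: str) -> str:
    # Replace the first "hub" so github.com becomes gitingest.com:
    # https://github.com/user/repo -> https://gitingest.com/user/repo
    return github_url.replace("hub", "ingest", 1)

print(to_ingest_url("https://github.com/SWE-agent/SWE-agent"))
# -> https://gitingest.com/SWE-agent/SWE-agent
```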
A new report reveals that OpenAI's audio transcription tool, Whisper, consistently produces "hallucinations," according to multiple studies.
Google is gearing up to unveil its latest AI language model, Gemini 2.0, in December, according to insider sources from The Verge.
Another indication of the plateau thesis: OpenAI has just confirmed that a new model, internally regarded as a potential successor to GPT-4, will not be released this year, despite looming competition from Google's Gemini 2.0.
Similarly, Anthropic is rumored to have put a previously announced version 3.5 of its flagship Opus model on hold due to a lack of significant progress, instead focusing on an improved version of Sonnet 3.5 that emphasizes agent-based AI.
Generating step-by-step "chain-of-thought" rationales improves language model performance on complex reasoning tasks like mathematics or commonsense question-answering.
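The simplest way to elicit such rationales is to append a step-by-step cue to the question before sending it to the model; a minimal sketch of that prompt wrapper (the wrapper name and example question are illustrative, and the model call itself is omitted):

```python
def cot_prompt(question: str) -> str:
    # Zero-shot chain-of-thought cue: ask the model to reason before answering.
    return f"Q: {question}\nA: Let's think step by step."

print(cot_prompt("If a train travels 60 km in 40 minutes, what is its speed in km/h?"))
```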
Hybrid LSTM models significantly outperform traditional GARCH models.
Anthropic is launching a new subscription plan for its AI chatbot, Claude, catered toward enterprise customers that want more administrative controls.
Anthropic's prompt caching lets developers reuse a long prompt prefix across API calls, with the cached portion served at a lower price.
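In Anthropic's Messages API, the cacheable prefix is marked with a `cache_control` breakpoint. A hedged sketch of the request shape (the document text is a placeholder, and the network call itself is omitted):

```python
# Shape of a Messages API request that marks a long system prompt as
# cacheable so later calls reuse it; sending the request is omitted here.
long_context = "<large document pasted once and reused across calls>"

request = {
    "model": "claude-3-5-sonnet-20240620",
    "max_tokens": 512,
    "system": [
        {
            "type": "text",
            "text": long_context,
            "cache_control": {"type": "ephemeral"},  # cache everything up to here
        }
    ],
    "messages": [{"role": "user", "content": "Summarize the document."}],
}
```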
We’re excited to offer the AI/ML community free access to Hermes 3 through Lambda’s new Chat Completions API, fully compatible with the OpenAI API. It provides endpoints for creating completions, chat completions and listing models.
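Because the API is OpenAI-compatible, a request is the familiar chat-completions JSON body; a minimal sketch (the model id below is a placeholder, not a confirmed Lambda identifier, and the HTTP call itself is omitted):

```python
import json

# OpenAI-compatible /chat/completions request body for Hermes 3.
payload = {
    "model": "hermes-3-llama-3.1-405b",  # placeholder model id
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello, Hermes!"},
    ],
}
body = json.dumps(payload)
```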
Slack's engineering team recently published how it used a large language model (LLM) to automatically convert 15,000 unit and integration tests from Enzyme to React Testing Library (RTL).
Pulumi claims it has culled bad infrastructure-as-code samples
To help users get better at crafting prompts, Google just published a crash course on the subject in the form of a 45-page handbook.