On the kernel security list we've seen a huge bump in reports. We were at maybe 2 to 3 per week two years ago, then reached probably 10 a week over the last year, with the only difference being AI slop, and since the beginning of the year we're around 5-10 per day depending on the day (Fridays and Tuesdays seem the worst). Now most of these reports are correct, to the point that we had to bring in more maintainers to help us.
Overall I think we're going to see a much higher quality of software, ironically around the same level as before 2000, when the net became usable by everyone to download fixes. When software had to be pressed to CDs or written to millions of floppies, it had to survive an amazing quantity of tests that are mostly neglected nowadays since updates are easy to distribute. But before this happens, we have to experience a huge mess that might last for a few years to come! Interesting times...
Each episode corresponds to a random combination of object generations, monster placements, and different level variants, which in turn requires using a different combination of strategies in each episode.
Typical refactor work includes: using jscpd for code duplication; using knip for dead code; running eslint's react-compiler and deprecation plugins; checking whether we introduced API routes that can be consolidated; maintaining my docs; breaking apart files that grew too large; adding tests and code comments for tricky parts; updating dependencies; upgrading tools; restructuring files; finding and rewriting slow tests; and mentioning modern React patterns while rewriting code.
Meanwhile, management leans on programmers to heavily use AI tools, with employees previously telling the FT that the company set a target for 80 percent of developers to use AI for coding tasks at least once a week.
In sum: more coding with more AI with more human oversight, but fewer humans. We’ll see how that works out.
Although AMI Labs has no plans to generate revenue for the time being, it still plans to engage with prospective customers early on.
Experiments across diverse backbone models, retrieval-based methods, and memory systems demonstrate that cognitive memory remains challenging and reveals failures not captured by existing benchmarks.
Having generation and verification co-evolve on the same online rollouts is the fix, and the ablation (Figure 11) shows it matters: co-evolving consistently beats non-co-evolving by 4–6%.
Instead, he says, business leaders should prioritize creating a culture in which their employees feel empowered to experiment with vibe coding and share their best creations. “Seeing is believing,” says Schluntz, “and I think getting non-developers in every company to use these tools to bring their ideas to life is one of the most powerful things.”
According to Anthropic researcher Eric Schluntz, vibe coding makes it so that “people are limited only by their creativity, not by the skills that they have.” Think about Apple in the 1970s; Steve Jobs was the big ideas guy, and Steve Wozniak was the technical genius who translated Jobs’ ideas into a working product. Vibe coding essentially gives everyone their own personal Woz. “If you have an image of something in your mind, you can go create it,” adds Schluntz.
TypeScript agent frameworks felt like toys. Single-threaded event loops trying to juggle concurrent agents with promises and prayer. Python agents did a little better, but over long runs they couldn't stay up. The BEAM was built for exactly this kind of work.
While SFT distillation meaningfully improves overall performance over the base model, the gap between the two approaches is most apparent when combined with test-time compute. On in-distribution tasks, SFT benefits substantially from parallel sampling (69.1 → 75.3), yet on out-of-distribution tasks the gains are negligible (59.4 → 59.6). This suggests that distillation teaches the model to imitate task-specific expert behavior, which scales well within the training distribution but fails to generalize beyond it. In contrast, KARL benefits from test-time compute both in- and out-of-distribution, indicating that RL develops more general search capabilities rather than task-specific heuristics.
Why Elixir?
Elixir is built on Erlang/BEAM/OTP, which is great for supervising long-running processes. It has an active ecosystem of tools and libraries. It also supports hot code reloading without stopping actively running subagents, which is very useful during development.
The above command drops you into a chat loop. You can talk to the model and share information like your name. Every now and then, /sleep the model to transition short-term memory to long-term memory.
The /sleep command:
Generates Q&A pairs based on the context
LoRA fine-tunes the model on the new Q&A pairs plus any from previous sessions
Resets the KV cache
After the /sleep command, the model should remember context from previous sessions even though that context is no longer in the KV cache.
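The three steps of the /sleep cycle can be sketched in Python as a toy control-flow stub. This is not the project's actual implementation: the class name `SleepyChat` and both stubs are hypothetical, the real system would use the model itself to generate fresh Q&A pairs and would run a genuine LoRA fine-tune, and the KV cache is stood in for by a plain list.

```python
# Toy sketch of a /sleep-style memory consolidation loop.
# Assumptions (not from the source): Q&A generation and LoRA fine-tuning are
# stubbed out; only the control flow of the three /sleep steps is shown.

class SleepyChat:
    def __init__(self):
        self.context = []       # short-term memory: live chat turns (KV cache stand-in)
        self.training_set = []  # long-term memory: Q&A pairs accumulated across sessions

    def say(self, user_msg, reply):
        # Record one chat turn in short-term memory.
        self.context.append((user_msg, reply))

    def sleep(self):
        # 1. Generate Q&A pairs based on the context (stub: reuse turns
        #    verbatim; the real system would have the model write the pairs).
        new_pairs = [{"q": q, "a": a} for q, a in self.context]
        # 2. Fine-tune on the new pairs plus any from previous sessions
        #    (stub: accumulate them; the real system runs a LoRA update here).
        self.training_set.extend(new_pairs)
        # 3. Reset the KV cache so the next session starts fresh.
        self.context = []
        return len(self.training_set)
```

After `sleep()`, the context list is empty but the accumulated training set persists, which mirrors the claim above: information survives across sessions without remaining in the KV cache.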
SWE-rebench: A Continuously Evolving and Decontaminated Benchmark for Software Engineering LLMs
Qwen3.5 Small models disable thinking by default. Use llama-server to enable it.
It's not chatbot psychosis, it's 'math and engineering and neuroscience'
10 documented cases of AI coding agents autonomously destroying databases, wiping hard drives, and deleting years of data — then lying about it.