WELCOME TO METAMESH.BIZ +++ Needle somehow crammed Gemini's tool-calling brain into 26M params running at 1200 tok/s on your phone (the democratization of agents begins) +++ DOD deploys Mythos to patch the entire government while awkwardly breaking up with Anthropic (national security meets vendor lock-in drama) +++ Supply chain attackers poisoning Mistral's PyPI packages because why hack models when you can own the install process +++ THE MESH SEES YOUR TABULAR FOUNDATION MODELS FINALLY ESCAPING JUPYTER NOTEBOOKS +++
via Arxiv 👤 Zekun Wu, Ze Wang, Seonglae Cho et al. 📅 2026-05-08
⚡ Score: 8.0
"When a tool-calling agent picks the wrong tool, the failure is invisible until execution: the email gets sent, the meeting gets missed. Probing 12 instruction-tuned models across Gemma 3, Qwen 3, Qwen 2.5, and Llama 3.1 (270M to 27B), we find the identity of the chosen tool is linearly readable and..."
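The kind of linear readout the abstract describes can be sketched on synthetic data. Everything below is a hypothetical stand-in (random vectors around two "tool" centroids, not the paper's actual hidden states or probes); a nearest-centroid classifier is one minimal linear probe:

```python
import random

random.seed(0)
DIM = 16

# Hypothetical stand-in for hidden states: each chosen "tool" shifts the
# mean activation direction (synthetic data, not the paper's probes).
def sample(center, n):
    return [[c + random.gauss(0, 0.3) for c in center] for _ in range(n)]

center_a = [1.0] * DIM   # mean activation when tool A is chosen
center_b = [-1.0] * DIM  # mean activation when tool B is chosen
train_a, train_b = sample(center_a, 50), sample(center_b, 50)

# A nearest-centroid readout is one minimal linear probe: w = mu_a - mu_b.
mu_a = [sum(col) / len(train_a) for col in zip(*train_a)]
mu_b = [sum(col) / len(train_b) for col in zip(*train_b)]
w = [a - b for a, b in zip(mu_a, mu_b)]
bias = -0.5 * sum(wi * (a + b) for wi, a, b in zip(w, mu_a, mu_b))

def predict(h):
    """Linearly read out which tool the (synthetic) activation encodes."""
    return "A" if sum(wi * hi for wi, hi in zip(w, h)) + bias > 0 else "B"

test_set = sample(center_a, 20) + sample(center_b, 20)
labels = ["A"] * 20 + ["B"] * 20
acc = sum(predict(h) == y for h, y in zip(test_set, labels)) / len(labels)
```

If the probe generalizes, the tool identity was linearly readable before execution, which is exactly the failure-detection opportunity the paper points at.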
📰 NEWS
Google TIG discovers hackers using AI to find zero-day exploits
2x SOURCES 📅 2026-05-11
⚡ Score: 8.0
+++ Google's Threat Intelligence Group caught hackers using AI to find and exploit vulnerabilities at scale, confirming what security researchers have quietly dreaded: the automation of exploit development is now operational, not theoretical. +++
"Why does weight decay work? We prove that, in any fixed-precision regime, the smallest weight norm of a looped neural network outputting a binary string equals the Kolmogorov complexity of that string, up to a logarithmic factor. This implies that weight decay induces a prior matching Solomonoff's u..."
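In symbols, the abstract's claim reads roughly as follows (a paraphrase, not the paper's exact statement):

```latex
% Over looped networks f_\theta at fixed numerical precision that output
% the binary string x:
\min_{\theta \,:\, f_\theta \text{ outputs } x} \lVert \theta \rVert
  \;=\; K(x) \;+\; O(\log |x|),
% where K(x) is the Kolmogorov complexity of x. Penalizing the weight
% norm therefore biases training toward low-complexity outputs, in the
% spirit of Solomonoff's universal prior 2^{-K(x)}.
```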
"As the title states, my build is indeed able to run a 1 trillion parameter model (in this case Kimi K2.5) locally at ~4 tokens/second. I thought r/LocalLLaMA would be interested in the build due to that stat line, and also due to the inclusion of an unusual part, Intel Optane Persistent Memory, whi..."
"We open-sourced Needle, a 26M parameter function-calling (tool use) model. It runs at 6000 tok/s prefill and 1200 tok/s decode on consumer devices.
We were always frustrated by the little effort made towards building agentic models that run on budget phones, so we conducted investigations that led ..."
"I've been running structured output prompts through a bunch of models on OpenRouter for the past few months – Llama 3, Mistral, Command R, DeepSeek, Qwen, and every other model on OpenRouter – alongside the usual closed-source suspects. 288 calls total. I wanted to know what actually breaks, how oft..."
💬 Reddit Discussion: 44 comments
😐 MID OR MIXED
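A harness like the one described mostly boils down to parsing each completion and tallying failure modes. A toy sketch (the completions, required keys, and category names below are invented for illustration, not the poster's data or code):

```python
import json

# Hypothetical raw completions from a structured-output benchmark run;
# a real harness would collect these from OpenRouter responses.
completions = [
    '{"name": "Ada", "age": 36}',                 # clean JSON
    '```json\n{"name": "Ada", "age": 36}\n```',   # fenced, a common failure mode
    '{"name": "Ada", "age": }',                   # truncated / invalid
]

REQUIRED_KEYS = {"name", "age"}

def check(raw):
    """Classify one completion: ok / recoverable (fenced) / broken."""
    text = raw.strip()
    if text.startswith("```"):
        text = text.strip("`")
        text = text[len("json"):] if text.startswith("json") else text
        verdict = "recoverable"
    else:
        verdict = "ok"
    try:
        obj = json.loads(text)
    except json.JSONDecodeError:
        return "broken"
    return verdict if REQUIRED_KEYS <= obj.keys() else "broken"

results = [check(c) for c in completions]
```

Counting the three buckets over a few hundred calls per model gives exactly the "what actually breaks, how often" numbers the post is after.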
"TabPFN-3 was released today, the next iteration of the tabular foundation model, originally published in Nature.
Quick recap for anyone new to TabPFN: TabPFN predicts on tabular data in a single forward pass - no training, no hyperparameter search, no tuning. Built on TabPFN-2.5 (Nov 2025) and TabP..."
"Hey fellow Llamas, keeping it short.
We just shipped **DFlash** and **PFlash** support for the AMD Ryzen AI MAX+ 395 iGPU (gfx1151, Strix Halo, 128 GiB unified memory). Same Luce DFlash stack from [the RTX 3090 post a couple weeks back](https://www.reddit.com/r/LocalLLaMA/comments/1sx8uok/luce_dfla..."
via Arxiv 👤 Arnav Arora, Natalie Schluter, Katherine Metcalf et al. 📅 2026-05-08
⚡ Score: 7.3
"Conversational Large Language Models are post-trained on language that expresses specific behavioural traits, such as curiosity, open-mindedness, and empathy, and values, such as helpfulness, harmlessness, and honesty. This is done to increase utility, ensure safety, and improve the experience of th..."
💡 AI NEWS BUT ACTUALLY GOOD
The revolution will not be televised, but Claude will email you once we hit the singularity.
Get the stories that matter in Today's AI Briefing.
Powered by Premium Technology Intelligence Algorithms • Unsubscribe anytime
"now you can evaluate your models at home, sounds like a perfect tool to compare quants and finetunes
*Datasets: AIME, AIME2025, GSM8K, GPQA*..."
💬 Reddit Discussion: 22 comments
🐝 BUZZING
📰 NEWS
US DOD deploys Anthropic's Mythos vulnerability scanner
2x SOURCES 📅 2026-05-12
⚡ Score: 7.2
+++ The Pentagon is using Anthropic's vulnerability scanner across government systems even as it plots a strategic pivot away from the company, which is either excellent compartmentalization or just how procurement works. +++
"The modern ML (LLM) compiler stack is brutal. TVM is 500K+ lines of C++. PyTorch piles Dynamo, Inductor, and Triton on top of each other. I built a hackable LLM compiler from scratch and am documenting the process. It takes a small model (TinyLlama, Qwen2.5-7B) and lowers it to a sequence of CUDA ke..."
📰 NEWS
Anthropic's Computer Use API released
2x SOURCES 📅 2026-05-12
⚡ Score: 7.1
+++ Anthropic's new Computer Use API lets Claude interact with desktop interfaces directly, trading the traditional API paradigm for something that feels less like integration and more like hiring an intern who actually uses your software. +++
via Arxiv 👤 Nikita Kezins, Urbas Ekka, Pascal Berrang et al. 📅 2026-05-11
⚡ Score: 7.0
"Guardrail Classifiers defend production language models against harmful behavior, but although results seem promising in testing, they provide no formal guarantees. Providing formal guarantees for such models is hard because "harmful behavior" has no natural specification in a discrete input space:..."
"Speculative decoding accelerates LLM inference by drafting future tokens with a small model, but drafter models degrade sharply under template perturbation and long-context inputs. We identify a previously-unreported phenomenon we call **attention drift**: as the drafter generates successive t..."
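The drafter/verifier split at the heart of speculative decoding works roughly like this greedy toy. The lookup-table "models" are invented for illustration (real systems verify token distributions with an acceptance rule, not exact greedy matches), but the accept-longest-prefix-then-correct loop is the standard shape:

```python
def drafter(prev):
    # Tiny stand-in for the small draft model (hypothetical table).
    table = {"the": "cat", "cat": "sat", "sat": "on", "on": "a"}
    return table.get(prev, "<eos>")

def target(prev):
    # Stand-in for the large target model; disagrees with the drafter at "on".
    table = {"the": "cat", "cat": "sat", "sat": "on", "on": "the"}
    return table.get(prev, "<eos>")

def speculate(prompt_tok, k=4):
    # 1) Drafter proposes k tokens autoregressively.
    draft, prev = [], prompt_tok
    for _ in range(k):
        prev = drafter(prev)
        draft.append(prev)
    # 2) Target verifies: keep the longest prefix where its greedy choice
    #    agrees, then emit the target's own correction at the first miss.
    accepted, prev = [], prompt_tok
    for tok in draft:
        want = target(prev)
        if tok != want:
            accepted.append(want)  # target's correction replaces the miss
            break
        accepted.append(tok)
        prev = tok
    return accepted

out = speculate("the")
```

Attention drift in the drafter would show up here as the draft diverging earlier and earlier in the proposed block, shrinking the accepted prefix and erasing the speedup.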
via Arxiv 👤 Zezheng Lin, Fengming Liu 📅 2026-05-08
⚡ Score: 6.9
"Mechanistic interpretability papers increasingly use causal vocabulary: circuits, mediators, causal abstraction, monosemanticity. Such claims require explicit identification assumptions. A purposive audit of 10 papers across four methodological strands finds no dedicated identification-assumptions s..."
📰 NEWS
Claude Platform on AWS general availability
2x SOURCES 📅 2026-05-11
⚡ Score: 6.8
+++ Anthropic's Claude API now lives in AWS's walled garden with managed agents, code execution, and all the bells and whistles that make enterprise procurement teams sleep soundly at night. +++
"AWS customers get the full set of Claude API features, with AWS authentication, billing, and commitment retirement.
Build and deploy agents at scale with Claude Managed Agents, or use features like the advisor strategy, code execution, web search, web fetch, the Files API, MCP connector, prompt ca..."
💬 Reddit Discussion: 10 comments
😐 MID OR MIXED
via Arxiv 👤 Amin Karimi Monsefi, Dominic Culver, Nikhil Bhendawade et al. 📅 2026-05-08
⚡ Score: 6.8
"Discrete flow matching generates text by iteratively transforming noise tokens into coherent language, but may require hundreds of forward passes. Distillation uses the multi-step trajectory to train a student to reproduce the process in a few steps. When the student underperforms, the usual explana..."
via Arxiv 👤 Shuangrui Ding, Xuanlang Dai, Long Xing et al. 📅 2026-05-11
⚡ Score: 6.8
"Large language and vision-language models increasingly power agents that act on a user's behalf through command-line interface (CLI) harnesses. However, most agent benchmarks still rely on synthetic sandboxes, short-horizon tasks, mock-service APIs, and final-answer checks, leaving open whether agen..."
via Arxiv 👤 Yaxin Du, Xiyuan Yang, Zhifan Zhou et al. 📅 2026-05-11
⚡ Score: 6.8
"As model families, training recipes, and compute budgets become increasingly standardized, further gains in machine learning systems depend increasingly on data. Yet data engineering remains largely manual and ad hoc: practitioners repeatedly search for external datasets, adapt them to existing pipe..."
via Arxiv 👤 Jiayuan Liu, Tianqin Li, Shiyi Du et al. 📅 2026-05-08
⚡ Score: 6.8
"Context window expansion is often treated as a straightforward capability upgrade for LLMs, but we find it systematically fails in multi-agent social dilemmas. Across 7 LLMs and 4 games over 500 rounds, expanding accessible history degrades cooperation in 18 of 28 model--game settings, a pattern we..."
"I spent the past 5+ months building a pipeline that creates hybrid GGUF quant mixes. I also built it to learn from Unsloth (or other) models by utilizing their quant to tensor assignment. And some architectures like Qwen3.6 27B have super weird patterns that can get genuinely lower KLD while droppin..."
💬 Reddit Discussion: 24 comments
🐐 GOATED ENERGY
via Arxiv 👤 Anmol Gulati, Hariom Gupta, Elias Lumer et al. 📅 2026-05-08
⚡ Score: 6.7
"Long-horizon AI agents execute complex workflows spanning hundreds of sequential actions, yet a single wrong assumption early on can cascade into irreversible errors. When instructions are incomplete, the agent must decide not only whether to ask for clarification but when, and no prior work measure..."
via Arxiv 👤 Tong Zheng, Haolin Liu, Chengsong Huang et al. 📅 2026-05-08
⚡ Score: 6.7
"Test-time scaling (TTS) has become an effective approach for improving large language model performance by allocating additional computation during inference. However, existing TTS strategies are largely hand-crafted: researchers manually design reasoning patterns and tune heuristics by intuition, l..."
via Arxiv 👤 Simon Yu, Derek Chong, Ananjan Nandi et al. 📅 2026-05-11
⚡ Score: 6.7
"We introduce Shepherd, a functional programming model that formalizes meta-agent operations on target agents as functions, with core operations mechanized in Lean. Shepherd records every agent-environment interaction as a typed event in a Git-like execution trace, enabling any past state to be forke..."
via Arxiv 👤 Gaotang Li, Bhavana Dalvi Mishra, Zifeng Wang et al. 📅 2026-05-11
⚡ Score: 6.7
"Training deep research agents, namely systems that plan, search, evaluate evidence, and synthesize long-form reports, pushes reinforcement learning beyond the regime of verifiable rewards. Their outputs lack ground-truth answers, their trajectories span many tool-augmented decisions, and standard po..."
via Arxiv 👤 Mingxi Zou, Zhihan Guo, Langzhang Liang et al. 📅 2026-05-11
⚡ Score: 6.6
"Long-horizon language agents must operate under limited runtime memory, yet existing memory mechanisms often organize experience around descriptive criteria such as relevance, salience, or summary quality. For an agent, however, memory is valuable not because it faithfully describes the past, but be..."
via Arxiv 👤 Viacheslav Meshchaninov, Alexander Shabalin, Egor Chimbulatov et al. 📅 2026-05-08
⚡ Score: 6.6
"Latent diffusion models offer an attractive alternative to discrete diffusion for non-autoregressive text generation by operating on continuous text representations and denoising entire sequences in parallel. The major challenge in latent diffusion modeling is constructing a suitable latent space. I..."
via Arxiv 👤 Yash Akhauri, Mohamed S. Abdelfattah 📅 2026-05-11
⚡ Score: 6.6
"Efficient LLM inference research has largely focused on reducing the cost of each decoding step (e.g., using quantization, pruning, or sparse attention), typically applying a uniform computation budget to every generated token. In practice, token difficulty varies widely, so static compression can o..."
via Arxiv 👤 Roxana Geambasu, Mariana Raykova, Pierre Tholoniat et al. 📅 2026-05-11
⚡ Score: 6.6
"The dominant paradigm for AI agents is an "on-the-fly" loop in which agents synthesize plans and execute actions within seconds or minutes in response to user prompts. We argue that this paradigm short-circuits disciplined software engineering (SE) processes -- iterative design, rigorous testing, ad..."
"r/ClaudeAI • also crossposted to r/LocalLLaMA and r/artificial
I lost $187 to this and want to save others the same headache.
**What happened**
I run Claude Code headlessly via Windows Task Scheduler. My project repo has a `.env` file with `ANTHROPIC_API_KEY` set β legitimately, for a separ..."
"Command line interface (CLI) agents are emerging as a practical paradigm for agent-computer interaction over evolving filesystems, executable command line programs, and online execution feedback. Recent work has used reinforcement learning (RL) to learn these interaction abilities from verifiable ta..."
Google detects AI-generated code bypassing 2FA with zero-day
2x SOURCES 📅 2026-05-11
⚡ Score: 6.5
+++ Turns out giving hackers access to code generation tools makes them more efficient at their jobs, which Google is now warning about with the urgency of someone discovering fire is hot. +++
via Arxiv 👤 Tz-Huan Hsu, Jheng-Hong Yang, Jimmy Lin 📅 2026-05-11
⚡ Score: 6.5
"Does a lexical retriever suffice as large language models (LLMs) become more capable in an agentic loop? This question naturally arises when building deep research systems. We revisit it by pairing BM25 with frontier LLMs that have better reasoning and tool-use abilities. To support researchers aski..."
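BM25 itself is only a few lines, which is part of the paper's point. A self-contained sketch of the standard Okapi BM25 scoring that such a system would pair with a frontier LLM (the toy corpus, query, and parameter values here are invented for illustration):

```python
import math

# Toy corpus: pre-tokenized documents (made up, not the paper's data).
docs = [
    "sparse lexical retrieval with bm25".split(),
    "dense retrieval with neural encoders".split(),
    "agents use tools in a loop".split(),
]
k1, b = 1.5, 0.75                 # common Okapi BM25 defaults
N = len(docs)
avgdl = sum(len(d) for d in docs) / N
df = {}                           # document frequency per term
for d in docs:
    for t in set(d):
        df[t] = df.get(t, 0) + 1

def bm25(query, doc):
    """Okapi BM25 score of one document against a tokenized query."""
    score = 0.0
    for t in query:
        if t not in df:
            continue
        idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5) + 1)
        tf = doc.count(t)
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avgdl))
    return score

query = "bm25 lexical retrieval".split()
best = max(range(N), key=lambda i: bm25(query, docs[i]))
```

In an agentic loop, the LLM compensates for the retriever's lexical blind spots by reformulating queries across turns, which is why the question "does BM25 suffice?" gets interesting again.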
"Morning Everyone!
Big one today (**104 changes!**): Claude Code just went async.
The new `/goal` command lets you set a completion condition ("all tests pass and the PR is ready"), then Claude keeps grinding across turns until it's hit. The new `claude agents` view shows every session you've got r..."
💬 Reddit Discussion: 43 comments
😐 MID OR MIXED
via Arxiv 👤 Ning Liu, Chuanneng Sun, Kristina Klinkner et al. 📅 2026-05-08
⚡ Score: 6.5
"Direct Preference Optimization (DPO) aligns language models using pairwise preference comparisons, offering a simple and effective alternative to Reinforcement Learning (RL) from human feedback. However, in many practical settings, training data consists of multiple rollouts per prompt, inducing ric..."
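Per preference pair, the standard DPO objective reduces to a logistic loss on an implicit reward margin between the chosen and rejected responses. A minimal sketch (the log-probabilities below are made-up numbers for illustration):

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    logp_*     : policy log-prob of the chosen (w) / rejected (l) response
    ref_logp_* : same quantities under the frozen reference model
    """
    # Implicit reward margin: how much more the policy (relative to the
    # reference) prefers the chosen response over the rejected one.
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1 / (1 + math.exp(-margin)))  # -log sigmoid(margin)

# A policy that has shifted toward the chosen response gets a lower loss
# than one indistinguishable from the reference.
improving = dpo_loss(-10.0, -14.0, -12.0, -12.0)
indifferent = dpo_loss(-12.0, -12.0, -12.0, -12.0)
```

The abstract's observation is that real data often has many rollouts per prompt, and collapsing that richer ranking structure into independent pairs throws information away.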
via Arxiv 👤 Junhao Shen, Teng Zhang, Xiaoyan Zhao et al. 📅 2026-05-11
⚡ Score: 6.5
"Large language model agents increasingly rely on external skills to solve complex tasks, where skills act as modular units that extend their capabilities beyond what parametric memory alone supports. Existing methods assume external skills either accumulate as persistent guidance or internalized int..."
via Arxiv 👤 Manish Bhattarai, Ismael Boureima, Nishath Rajiv Ranasinghe et al. 📅 2026-05-08
⚡ Score: 6.5
"We argue that decomposing reward into weighted, verifiable criteria and using an LLM judge to score them provides a partial-credit optimization signal: instead of a binary outcome or a single holistic score, each response is graded along multiple task-specific criteria. We formalize \emph{rubric-gro..."
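The decomposition reads like a weighted checklist. A toy sketch of partial-credit scoring (the criteria, weights, and string-match checks below are invented; the paper scores each criterion with an LLM judge rather than predicates like these):

```python
# Rubric-grounded reward sketch: reward decomposes into weighted,
# verifiable criteria instead of one binary or holistic score.
rubric = {
    "cites_sources":    (0.4, lambda r: "http" in r),
    "under_200_words":  (0.2, lambda r: len(r.split()) < 200),
    "answers_question": (0.4, lambda r: "42" in r),
}

def reward(response):
    """Partial credit: each satisfied criterion contributes its weight."""
    return sum(w for w, check in rubric.values() if check(response))

r = reward("The answer is 42, see http://example.com for details.")
```

Compared with a single pass/fail signal, a response that satisfies two of three criteria still moves the policy in the right direction instead of being scored as a total failure.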
via Arxiv 👤 Joel Rorseth, Parke Godfrey, Lukasz Golab et al. 📅 2026-05-11
⚡ Score: 6.4
"This paper demonstrates RUBEN, an interactive tool for discovering minimal rules to explain the outputs of retrieval-augmented large language models (LLMs) in data-driven applications. We leverage novel pruning strategies to efficiently identify a minimal set of rules that subsume all others. We fur..."
via Arxiv 👤 Mohammadreza Armandpour, Fatih Ilhan, David Harrison et al. 📅 2026-05-11
⚡ Score: 6.4
"On-policy distillation offers dense, per-token supervision for training reasoning models; however, it remains unclear under which conditions this signal is beneficial and under which it is detrimental. Which teacher model should be used, and in the case of self-distillation, which specific context s..."
via Arxiv 👤 Julie Kallini, Artidoro Pagnoni, Tomasz Limisiewicz et al. 📅 2026-05-08
⚡ Score: 6.3
"Recent byte-level language models (LMs) match the performance of token-level models without relying on subword vocabularies, yet their utility is limited by slow, byte-by-byte autoregressive generation. We address this bottleneck in the Byte Latent Transformer (BLT) through new training and generati..."
"I have analyzed some decoder transformer models using Lyapunov spectral analysis and found that the ratio of the MLP and attention spectral norms strongly indicates whether a model will eventually collapse to rank-1 or not by the final layers.
I found that the spectral ratio is best kept around 0.5..."
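Spectral norms of weight matrices can be estimated with power iteration on MᵀM. A pure-Python sketch, using small diagonal matrices as hypothetical stand-ins for one block's MLP and attention weights (the post analyzes real transformer layers; this only shows the ratio computation):

```python
import random

random.seed(1)

def matvec(M, v):
    return [sum(r[j] * v[j] for j in range(len(v))) for r in M]

def spectral_norm(M, iters=100):
    """Largest singular value of M via power iteration on M^T M."""
    n = len(M[0])
    v = [random.random() for _ in range(n)]
    Mt = [list(r) for r in zip(*M)]
    for _ in range(iters):
        w = matvec(Mt, matvec(M, v))
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    Mv = matvec(M, v)
    return sum(x * x for x in Mv) ** 0.5

# Hypothetical stand-ins whose singular values we know exactly.
mlp = [[2.0, 0.0], [0.0, 0.5]]
attn = [[4.0, 0.0], [0.0, 1.0]]
ratio = spectral_norm(mlp) / spectral_norm(attn)
```

Tracking this ratio per layer is cheap, which is what makes it usable as an early warning for the rank-collapse behavior the post describes.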
"I've been spending the last several months reading every published psychology paper I can find on AI chatbot use, and I noticed something that genuinely bothers me as both a researcher and a Claude user.
Almost every empirical study samples one of three populations: ChatGPT users, Character.AI u..."
"I was running blind watching Claude Code work, could not tell where my money was going, when it was stuck in a loop, or what it was doing with my filesystem. So I built something open source to make it visible. Works with Claude Code, Codex CLI, Gemini CLI, Cursor, and any MCP server.
A scan ..."
"Today I set up a full coding toolbox on a single RTX 5080 (with RAM offloading) that's actually viable.
**Autocomplete**: bartowski/Qwen2.5-Coder-7B-Instruct-GGUF:Q6_K_L
**Agentic**: unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q8_K_XL
---
### Why these models:
Qwen2.5 is still the best model for infill imo..."
via Arxiv 👤 Jiatao Gu, Tianrong Chen, Ying Shen et al. 📅 2026-05-08
⚡ Score: 6.1
"Diffusion-based models decompose sampling into many small Gaussian denoising steps -- an assumption that breaks down when generation is compressed to a few coarse transitions. Existing few-step methods address this through distillation, consistency training, or adversarial objectives, but sacrifice..."
via Arxiv 👤 Linus Heck, Filip Macák, Roman Andriushchenko et al. 📅 2026-05-11
⚡ Score: 6.1
"Shielding is a prominent model-based technique to ensure safety of autonomous agents. Classical shielding aims to ensure that nothing bad ever happens and comes with strong guarantees about safety and maximal permissiveness. However, shielding systems for probabilistic safety, where something bad is..."
"After a year building a production fact-checking system, the single most counter-intuitive design decision I keep defending is this: the LLM in our pipeline never produces a numeric score, never produces a true/false verdict, never produces anything that gets surfaced to the user as a judgment. The ..."
💬 Reddit Discussion: 10 comments
😐 MID OR MIXED
via Arxiv 👤 Urchade Zaratiana, Mary Newhauser, George Hurn-Maloney et al. 📅 2026-05-08
⚡ Score: 6.1
"Ensuring safe, policy-compliant outputs from large language models requires real-time content moderation that can scale across multiple safety dimensions. However, state-of-the-art guardrail models rely on autoregressive decoders with 7B--27B parameters, reformulating what is fundamentally a classif..."
via Arxiv 👤 Reza Khanmohammadi, Erfan Miahi, Simerjot Kaur et al. 📅 2026-05-11
⚡ Score: 6.1
"Large vision-language models suffer from visual ungroundedness: they can produce a fluent, confident, and even correct response driven entirely by language priors, with the image contributing nothing to the prediction. Existing confidence estimation methods cannot detect this, as they observe model..."