πŸš€ WELCOME TO METAMESH.BIZ +++ Needle somehow crammed Gemini's tool-calling brain into 26M params running at 1200 tok/s on your phone (the democratization of agents begins) +++ DOD deploys Mythos to patch the entire government while awkwardly breaking up with Anthropic (national security meets vendor lock-in drama) +++ Supply chain attackers poisoning Mistral's PyPI packages because why hack models when you can own the install process +++ THE MESH SEES YOUR TABULAR FOUNDATION MODELS FINALLY ESCAPING JUPYTER NOTEBOOKS +++ πŸš€ β€’
AI Signal - PREMIUM TECH INTELLIGENCE
πŸ“Ÿ Optimized for Netscape Navigator 4.0+
πŸ“š HISTORICAL ARCHIVE - May 12, 2026
What was happening in AI on 2026-05-12
πŸ“Š You are visitor #47291 to this AWESOME site! πŸ“Š
Archive from: 2026-05-12 | Preserved for posterity ⚑

Stories from May 12, 2026

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
πŸ”¬ RESEARCH

Tool Calling is Linearly Readable and Steerable in Language Models

"When a tool-calling agent picks the wrong tool, the failure is invisible until execution: the email gets sent, the meeting gets missed. Probing 12 instruction-tuned models across Gemma 3, Qwen 3, Qwen 2.5, and Llama 3.1 (270M to 27B), we find the identity of the chosen tool is linearly readable and..."
πŸ“° NEWS

Google TIG discovers hackers using AI to find zero-day exploits

+++ Google's Threat Intelligence Group caught hackers using AI to find and exploit vulnerabilities at scale, confirming what security researchers have quietly dreaded: the automation of exploit development is now operational, not theoretical. +++

Google's TIG reports the first known example of hackers using AI to discover and weaponize a zero-day; TIG's chief analyst says β€œthis is the tip of the iceberg”

πŸ”¬ RESEARCH

Neural Weight Norm = Kolmogorov Complexity

"Why does weight decay work? We prove that, in any fixed-precision regime, the smallest weight norm of a looped neural network outputting a binary string equals the Kolmogorov complexity of that string, up to a logarithmic factor. This implies that weight decay induces a prior matching Solomonoff's u..."
πŸ“° NEWS

Computer build using Intel Optane Persistent Memory - Can run 1 trillion parameter model at over 4 tokens/sec

"As the title states, my build is indeed able to run a 1 trillion parameter model (in this case Kimi K2.5) locally at \~4 tokens/second. I thought r/LocalLLaMA would be interested in the build due to that stat line, and also due to the inclusion of an unusual part, Intel Optane Persistent Memory, whi..."
πŸ’¬ Reddit Discussion: 115 comments 🐝 BUZZING
πŸ“° NEWS

Needle: We Distilled Gemini Tool Calling Into a 26M Model

"We open-sourced Needle, a 26M parameter function-calling (tool use) model. It runs at 6000 tok/s prefill and 1200 tok/s decode on consumer devices. We were always frustrated by the little effort made towards building agentic models that run on budget phones, so we conducted investigations that led ..."
πŸ’¬ Reddit Discussion: 23 comments 🐝 BUZZING
πŸ”¬ RESEARCH

FairyFuse: Multiplication-Free LLM Inference on CPUs via Fused Ternary Kernels

πŸ“° NEWS

I catalogued every way local models break JSON output and built a repair library, here's what I found across 288 model calls

"I've been running structured output prompts through a bunch of models on OpenRouter for the past few months β€” Llama 3, Mistral, Command R, DeepSeek, Qwen, and every other model on OpenRouter β€” alongside the usual closed-source suspects. 288 calls total. I wanted to know what actually breaks, how oft..."
πŸ’¬ Reddit Discussion: 44 comments 😐 MID OR MIXED
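The post's repair library isn't linked here, but the failure modes it catalogues (markdown fences around the payload, trailing commas, leading prose) are easy to sketch. This is an illustrative repair pass, not the author's actual code:

```python
import json
import re

def repair_json(raw: str):
    """Best-effort repair of common LLM JSON failures (illustrative, not exhaustive)."""
    s = raw.strip()
    # 1. Strip markdown code fences the model wrapped around the payload.
    s = re.sub(r"^```(?:json)?\s*|\s*```$", "", s)
    # 2. Remove trailing commas before a closing brace/bracket.
    s = re.sub(r",\s*([}\]])", r"\1", s)
    # 3. Fall back to the first brace-delimited span if there is leading prose.
    if not s.startswith(("{", "[")):
        m = re.search(r"[\{\[].*[\}\]]", s, re.DOTALL)
        if m:
            s = m.group(0)
    return json.loads(s)

print(repair_json('```json\n{"tool": "search", "args": {"q": "llms",},}\n```'))
# β†’ {'tool': 'search', 'args': {'q': 'llms'}}
```

A production repair pass would need far more cases (unescaped quotes, single quotes, truncated output), which is presumably what the 288-call survey enumerates.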
πŸ“° NEWS

TabPFN-3 just released: a pre-trained tabular foundation model for up to 1M rows [R][N]

"TabPFN-3 was released today, the next iteration of the tabular foundation model, originally published in Nature. Quick recap for anyone new to TabPFN: TabPFN predicts on tabular data in a single forward pass - no training, no hyperparameter search, no tuning. Built on TabPFN-2.5 (Nov 2025) and TabP..."
πŸ’¬ Reddit Discussion: 8 comments 🐝 BUZZING
πŸ“° NEWS

Microsoft says it is investigating a Mistral AI PyPI package v2.4.6 compromise; researchers say it is likely part of the Mini Shai-Hulud supply chain attack

πŸ“° NEWS

Luce DFlash + PFlash on AMD Strix Halo: Qwen3.6-27B at 2.23x decode and 3.05x prefill vs llama.cpp HIP

"Hey fellow Llamas, keeping it short. We just shipped **DFlash** and **PFlash** support for the AMD Ryzen AI MAX+ 395 iGPU (gfx1151, Strix Halo, 128 GiB unified memory). Same Luce DFlash stack from [the RTX 3090 post a couple weeks back](https://www.reddit.com/r/LocalLLaMA/comments/1sx8uok/luce_dfla..."
πŸ’¬ Reddit Discussion: 8 comments 🐝 BUZZING
πŸ”¬ RESEARCH

How Value Induction Reshapes LLM Behaviour

"Conversational Large Language Models are post-trained on language that expresses specific behavioural traits, such as curiosity, open-mindedness, and empathy, and values, such as helpfulness, harmlessness, and honesty. This is done to increase utility, ensure safety, and improve the experience of th..."
πŸ“° NEWS

examples : add llama-eval by ggerganov Β· Pull Request #21152 Β· ggml-org/llama.cpp

"now you can evaluate your models at home, sounds like a perfect tool to compare quants and finetunes *Datasets: AIME, AIME2025, GSM8K, GPQA*..."
πŸ’¬ Reddit Discussion: 22 comments 🐝 BUZZING
πŸ“° NEWS

US DOD deploys Anthropic's Mythos vulnerability scanner

+++ The Pentagon is using Anthropic's vulnerability scanner across government systems even as it plots a strategic pivot away from the company, which is either excellent compartmentalization or just how procurement works. +++

The US DOD says it is deploying Mythos to find and patch software vulnerabilities across the US government, even as it works on a transition away from Anthropic

πŸ“° NEWS

Why is Anthropic's training data disclosure AI-generated?

πŸ“° NEWS

Interfaze: A new model architecture built for high accuracy at scale

πŸ’¬ HackerNews Buzz: 17 comments πŸ‘ LOWKEY SLAPS
πŸ“° NEWS

A hackable compiler to generate efficient fused GPU kernels for AI models [P]

"The modern ML (LLM) compiler stack is brutal. TVM is 500K+ lines of C++. PyTorch piles Dynamo, Inductor, and Triton on top of each other. I built a hackable LLM compiler from scratch and am documenting the process. It takes a small model (TinyLlama, Qwen2.5-7B) and lowers it to a sequence of CUDA ke..."
πŸ“° NEWS

Anthropic's Computer Use API released

+++ Anthropic's new Computer Use API lets Claude interact with desktop interfaces directly, trading the traditional API paradigm for something that feels less like integration and more like hiring an intern who actually uses your software. +++

Anthropic publicly releases AI tool that can take over the mouse cursor (2024)

πŸ”¬ RESEARCH

Beyond Red-Teaming: Formal Guarantees of LLM Guardrail Classifiers

"Guardrail Classifiers defend production language models against harmful behavior, but although results seem promising in testing, they provide no formal guarantees. Providing formal guarantees for such models is hard because "harmful behavior" has no natural specification in a discrete input space:..."
πŸ”¬ RESEARCH

Attention Drift: What Autoregressive Speculative Decoding Models Learn

"Speculative decoding accelerates LLM inference by drafting future tokens with a small model, but drafter models degrade sharply under template perturbation and long-context inputs. We identify a previously-unreported phenomenon we call \\textbf{attention drift}: as the drafter generates successive t..."
πŸ“° NEWS

Agentic AI is giving cyber criminals nation-state-like powers

πŸ“° NEWS

Microsoft researchers find AI models and agents can't handle long-running tasks

πŸ“° NEWS

Natural-language messages between LLM agents are an architectural anti-pattern

πŸ’¬ HackerNews Buzz: 3 comments 😐 MID OR MIXED
πŸ”¬ RESEARCH

Position: Mechanistic Interpretability Must Disclose Identification Assumptions for Causal Claims

"Mechanistic interpretability papers increasingly use causal vocabulary: circuits, mediators, causal abstraction, monosemanticity. Such claims require explicit identification assumptions. A purposive audit of 10 papers across four methodological strands finds no dedicated identification-assumptions s..."
πŸ“° NEWS

Claude Platform on AWS general availability

+++ Anthropic's Claude API now lives in AWS's walled garden with managed agents, code execution, and all the bells and whistles that make enterprise procurement teams sleep soundly at night. +++

Claude Platform on AWS

πŸ’¬ HackerNews Buzz: 65 comments πŸ‘ LOWKEY SLAPS
πŸ”¬ RESEARCH

Trajectory as the Teacher: Few-Step Discrete Flow Matching via Energy-Navigated Distillation

"Discrete flow matching generates text by iteratively transforming noise tokens into coherent language, but may require hundreds of forward passes. Distillation uses the multi-step trajectory to train a student to reproduce the process in a few steps. When the student underperforms, the usual explana..."
πŸ”¬ RESEARCH

WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation

"Large language and vision-language models increasingly power agents that act on a user's behalf through command-line interface (CLI) harnesses. However, most agent benchmarks still rely on synthetic sandboxes, short-horizon tasks, mock-service APIs, and final-answer checks, leaving open whether agen..."
πŸ”¬ RESEARCH

DataMaster: Towards Autonomous Data Engineering for Machine Learning

"As model families, training recipes, and compute budgets become increasingly standardized, further gains in machine learning systems depend increasingly on data. Yet data engineering remains largely manual and ad hoc: practitioners repeatedly search for external datasets, adapt them to existing pipe..."
πŸ”¬ RESEARCH

The Memory Curse: How Expanded Recall Erodes Cooperative Intent in LLM Agents

"Context window expansion is often treated as a straightforward capability upgrade for LLMs, but we find it systematically fails in multi-agent social dilemmas. Across 7 LLMs and 4 games over 500 rounds, expanding accessible history degrades cooperation in 18 of 28 model--game settings, a pattern we..."
πŸ“° NEWS

"Will I be OK?" Teen died after ChatGPT pushed deadly mix of drugs, lawsuit says

πŸ“° NEWS

What breaks when you ask an LLM for JSON (288 model outputs tested)

πŸ“° NEWS

MagicQuant (v2.0) - Hybrid Mixed GGUF Models + Unsloth Dynamic Learned Quant Configurations + Benchmark table with collapsed winners and more

"I spent the past 5+ months building a pipeline that creates hybrid GGUF quant mixes. I also built it to learn from Unsloth (or other) models by utilizing their quant to tensor assignment. And some architectures like Qwen3.6 27B have super weird patterns that can get genuinely lower KLD while droppin..."
πŸ’¬ Reddit Discussion: 24 comments 🐐 GOATED ENERGY
πŸ”¬ RESEARCH

Ask Early, Ask Late, Ask Right: When Does Clarification Timing Matter for Long-Horizon Agents?

"Long-horizon AI agents execute complex workflows spanning hundreds of sequential actions, yet a single wrong assumption early on can cascade into irreversible errors. When instructions are incomplete, the agent must decide not only whether to ask for clarification but when, and no prior work measure..."
πŸ”¬ RESEARCH

LLMs Improving LLMs: Agentic Discovery for Test-Time Scaling

"Test-time scaling (TTS) has become an effective approach for improving large language model performance by allocating additional computation during inference. However, existing TTS strategies are largely hand-crafted: researchers manually design reasoning patterns and tune heuristics by intuition, l..."
πŸ”¬ RESEARCH

Shepherd: A Runtime Substrate Empowering Meta-Agents with a Formalized Execution Trace

"We introduce Shepherd, a functional programming model that formalizes meta-agent operations on target agents as functions, with core operations mechanized in Lean. Shepherd records every agent-environment interaction as a typed event in a Git-like execution trace, enabling any past state to be forke..."
πŸ”¬ RESEARCH

RubricEM: Meta-RL with Rubric-guided Policy Decomposition beyond Verifiable Rewards

"Training deep research agents, namely systems that plan, search, evaluate evidence, and synthesize long-form reports, pushes reinforcement learning beyond the regime of verifiable rewards. Their outputs lack ground-truth answers, their trajectories span many tool-augmented decisions, and standard po..."
πŸ› οΈ SHOW HN

Show HN: E2a – Open-source Email gateway for AI agents

πŸ’¬ HackerNews Buzz: 3 comments πŸ‘ LOWKEY SLAPS
πŸ”¬ RESEARCH

Remember the Decision, Not the Description: A Rate-Distortion Framework for Agent Memory

"Long-horizon language agents must operate under limited runtime memory, yet existing memory mechanisms often organize experience around descriptive criteria such as relevance, salience, or summary quality. For an agent, however, memory is valuable not because it faithfully describes the past, but be..."
πŸ”¬ RESEARCH

How to Train Your Latent Diffusion Language Model Jointly With the Latent Space

"Latent diffusion models offer an attractive alternative to discrete diffusion for non-autoregressive text generation by operating on continuous text representations and denoising entire sequences in parallel. The major challenge in latent diffusion modeling is constructing a suitable latent space. I..."
πŸ”¬ RESEARCH

Compute Where it Counts: Self Optimizing Language Models

"Efficient LLM inference research has largely focused on reducing the cost of each decoding step (e.g., using quantization, pruning, or sparse attention), typically applying a uniform computation budget to every generated token. In practice, token difficulty varies widely, so static compression can o..."
πŸ”¬ RESEARCH

Engineering Robustness into Personal Agents with the AI Workflow Store

"The dominant paradigm for AI agents is an "on-the-fly" loop in which agents synthesize plans and execute actions within seconds or minutes in response to user prompts. We argue that this paradigm short-circuits disciplined software engineering (SE) processes -- iterative design, rigorous testing, ad..."
πŸ“° NEWS

PSA: If your project has an ANTHROPIC_API_KEY in any .env file, Claude Code will silently bill your API account instead of your Max plan β€” Anthropic calls it "intentional functionality"

"r/ClaudeAI β€’ also crosspost to r/LocalLLaMA and r/artificial I lost $187 to this and want to save others the same headache. **What happened** I run Claude Code headlessly via Windows Task Scheduler. My project repo has a `.env` file with `ANTHROPIC_API_KEY` set β€” legitimately, for a separ..."
πŸ’¬ Reddit Discussion: 35 comments πŸ‘ LOWKEY SLAPS
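If the behavior is as the post describes, the defensive fix is mechanical: strip the variable from the child environment before launching. A hypothetical wrapper sketch follows; the `claude` invocation and the `ANTHROPIC_API_KEY` variable name come from the post, and nothing here is official Anthropic tooling:

```python
import os
import subprocess

def scrub_env(env):
    """Return a copy of env without ANTHROPIC_API_KEY, so the CLI can't
    silently fall back to pay-per-token API billing instead of the plan."""
    clean = dict(env)
    clean.pop("ANTHROPIC_API_KEY", None)
    return clean

def launch_claude_code(args):
    # Hypothetical wrapper: run the CLI with the scrubbed environment,
    # leaving the .env file (and any legitimate other uses of the key) intact.
    return subprocess.run(["claude", *args], env=scrub_env(os.environ))
```

For scheduled headless runs like the poster's, the same scrub belongs in the task definition rather than the repo's `.env`.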
πŸ”¬ RESEARCH

Learning CLI Agents with Structured Action Credit under Selective Observation

"Command line interface (CLI) agents are emerging as a practical paradigm for agent-computer interaction over evolving filesystems, executable command line programs, and online execution feedback. Recent work has used reinforcement learning (RL) to learn these interaction abilities from verifiable ta..."
πŸ“° NEWS

Sources: the White House's Office of the National Cyber Director and Commerce Department's CAISI are fighting over which agency should lead AI model evaluations

πŸ“° NEWS

Google detects AI-generated code bypassing 2FA with zero-day

+++ Turns out giving hackers access to code generation tools makes them more efficient at their jobs, which Google is now warning about with the urgency of someone discovering fire is hot. +++

Google detects hackers using AI-generated code to bypass 2FA with zero-day vulnerability

"External link discussion - see full content at original source."
πŸ”¬ RESEARCH

Rethinking Agentic Search with Pi-Serini: Is Lexical Retrieval Sufficient?

"Does a lexical retriever suffice as large language models (LLMs) become more capable in an agentic loop? This question naturally arises when building deep research systems. We revisit it by pairing BM25 with frontier LLMs that have better reasoning and tool-use abilities. To support researchers aski..."
πŸ“° NEWS

Claude Code just shipped a "run until done" mode. Upgrade to v2.1.139 for /goal.

"Morning Everyone! Big one today (**104 changes!**): Claude Code just went async. The new `/goal` command lets you set a completion condition ("all tests pass and the PR is ready"), then Claude keeps grinding across turns until it's hit. The new `claude agents` view shows every session you've got r..."
πŸ’¬ Reddit Discussion: 43 comments 😐 MID OR MIXED
πŸ”¬ RESEARCH

Beyond Pairs: Your Language Model is Secretly Optimizing a Preference Graph

"Direct Preference Optimization (DPO) aligns language models using pairwise preference comparisons, offering a simple and effective alternative to Reinforcement Learning (RL) from human feedback. However, in many practical settings, training data consists of multiple rollouts per prompt, inducing ric..."
πŸ› οΈ SHOW HN

Show HN: Statewright – Visual state machines that make AI agents reliable

πŸ’¬ HackerNews Buzz: 11 comments 🐐 GOATED ENERGY
πŸ”¬ RESEARCH

Dynamic Skill Lifecycle Management for Agentic Reinforcement Learning

"Large language model agents increasingly rely on external skills to solve complex tasks, where skills act as modular units that extend their capabilities beyond what parametric memory alone supports. Existing methods assume external skills either accumulate as persistent guidance or internalized int..."
πŸ”¬ RESEARCH

Rubric-Grounded RL: Structured Judge Rewards for Generalizable Reasoning

"We argue that decomposing reward into weighted, verifiable criteria and using an LLM judge to score them provides a partial-credit optimization signal: instead of a binary outcome or a single holistic score, each response is graded along multiple task-specific criteria. We formalize \emph{rubric-gro..."
πŸ”¬ RESEARCH

RUBEN: Rule-Based Explanations for Retrieval-Augmented LLM Systems

"This paper demonstrates RUBEN, an interactive tool for discovering minimal rules to explain the outputs of retrieval-augmented large language models (LLMs) in data-driven applications. We leverage novel pruning strategies to efficiently identify a minimal set of rules that subsume all others. We fur..."
πŸ› οΈ SHOW HN

Show HN: Agentic interface for mainframes and COBOL

πŸ’¬ HackerNews Buzz: 15 comments 🐝 BUZZING
πŸ”¬ RESEARCH

Unmasking On-Policy Distillation: Where It Helps, Where It Hurts, and Why

"On-policy distillation offers dense, per-token supervision for training reasoning models; however, it remains unclear under which conditions this signal is beneficial and under which it is detrimental. Which teacher model should be used, and in the case of self-distillation, which specific context s..."
πŸ“° NEWS

Google unveils Gemini Intelligence, bundling existing and new Gemini features, including task automation across apps and letting users vibe-code Android widgets

πŸ”¬ RESEARCH

Fast Byte Latent Transformer

"Recent byte-level language models (LMs) match the performance of token-level models without relying on subword vocabularies, yet their utility is limited by slow, byte-by-byte autoregressive generation. We address this bottleneck in the Byte Latent Transformer (BLT) through new training and generati..."
πŸ“° NEWS

Interaction Models

πŸ’¬ HackerNews Buzz: 26 comments 🐐 GOATED ENERGY
πŸ“° NEWS

Plumbers, electricians, and HVAC techs watching AI replace everyone except them.

"External link discussion - see full content at original source."
πŸ’¬ Reddit Discussion: 315 comments 😐 MID OR MIXED
πŸ“° NEWS

Gemma 4 running fully offline on WebGPU with Transformers.js, controlling Reachy Mini over WebSerial.

"External link discussion - see full content at original source."
πŸ’¬ Reddit Discussion: 9 comments 🐐 GOATED ENERGY
πŸ“° NEWS

CME Group and Silicon Data announce a futures market for computing capacity, with contracts based on daily GPU benchmarks for on-demand rental rates

πŸ“° NEWS

We Ran 250 AI Agent Evals to Find Out If Skills Beat Docs

πŸ› οΈ SHOW HN

Show HN: Agent FM – local, open-source radio for Claude Code and Codex agents

πŸ“° NEWS

I Found a Hidden Ratio in Transformers That Predicts Geometric Stability [R]

"I have analyzed some decoder transformer models using Lyapunov spectral analysis and found that the ratio of the MLP and attention spectral norms strongly indicates whether a model will eventually collapse to rank-1 or not by the final layers. I found that the spectral ratio is best kept around 0.5..."
πŸ“° NEWS

Why Claude users are systematically missing from AI psychology research (and what that means)

"I've been spending the last several months reading every published psychology paper I can find on AI chatbot use, and I noticed something that genuinely bothers me as both a researcher and a Claude user. Almost every empirical study samples one of three populations: ChatGPT users, Character.AI u..."
πŸ’¬ Reddit Discussion: 16 comments 🐝 BUZZING
πŸ“° NEWS

TUI to actually see what Claude Code is doing: cost, loops, tool commands…

"I was running blind watching Claude Code work, could not tell where my money was going, when it was stuck in a loop, or what it was doing with my filesystem. So i built something open source to make it visible. works with Claude Code, Codex CLI, Gemini CLI, Cursor, and any MCP server. Β Β  A scan ..."
πŸ’¬ Reddit Discussion: 14 comments πŸ‘ LOWKEY SLAPS
πŸ“° NEWS

Local LLM autocomplete + agentic coding on a single 16GB GPU + 64GB RAM

"Today I set up a full coding toolbox on a single RTX 5080 (with RAM offloading) that's actually viable. **Autocomplete**: bartowski/Qwen2.5-Coder-7B-Instruct-GGUF:Q6_K_L **Agentic**: unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q8_K_XL --- ### Why these models: Qwen2.5 is still the best model for infill imo..."
πŸ’¬ Reddit Discussion: 28 comments 🐝 BUZZING
πŸ”¬ RESEARCH

Normalizing Trajectory Models

"Diffusion-based models decompose sampling into many small Gaussian denoising steps -- an assumption that breaks down when generation is compressed to a few coarse transitions. Existing few-step methods address this through distillation, consistency training, or adversarial objectives, but sacrifice..."
πŸ“° NEWS

Through the looking glass of benchmark hacking

πŸ’¬ HackerNews Buzz: 5 comments 😐 MID OR MIXED
πŸ”¬ RESEARCH

Shields to Guarantee Probabilistic Safety in MDPs

"Shielding is a prominent model-based technique to ensure safety of autonomous agents. Classical shielding aims to ensure that nothing bad ever happens and comes with strong guarantees about safety and maximal permissiveness. However, shielding systems for probabilistic safety, where something bad is..."
πŸ“° NEWS

I run an AI-based fact-checking platform and I refuse to let the LLM produce the verdict. Here's why.

"After a year building a production fact-checking system, the single most counter-intuitive design decision I keep defending is this: the LLM in our pipeline never produces a numeric score, never produces a true/false verdict, never produces anything that gets surfaced to the user as a judgment. The ..."
πŸ’¬ Reddit Discussion: 10 comments 😐 MID OR MIXED
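The design the author defends, LLM as evidence extractor with deterministic code as judge, can be sketched structurally. The `Evidence` schema and thresholds below are invented for illustration and are not the platform's actual pipeline:

```python
from dataclasses import dataclass

@dataclass
class Evidence:
    claim: str
    supports: bool   # whether this source agrees with the claim
    source_url: str

def verdict(evidence: list[Evidence], min_sources: int = 2) -> str:
    """Deterministic verdict rule: the LLM only produces Evidence records
    upstream; the judgment shown to users comes from this auditable function,
    never from the model's free-form output."""
    if len(evidence) < min_sources:
        return "insufficient evidence"
    support = sum(e.supports for e in evidence)
    if support == len(evidence):
        return "supported"
    if support == 0:
        return "refuted"
    return "disputed"
```

The payoff of this split is that a disputed verdict can always be traced to the specific extracted records that produced it, which is the auditability argument the post makes.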
πŸ”¬ RESEARCH

GLiGuard: Schema-Conditioned Classification for LLM Safeguard

"Ensuring safe, policy-compliant outputs from large language models requires real-time content moderation that can scale across multiple safety dimensions. However, state-of-the-art guardrail models rely on autoregressive decoders with 7B--27B parameters, reformulating what is fundamentally a classif..."
πŸ”¬ RESEARCH

Grounded or Guessing? LVLM Confidence Estimation via Blind-Image Contrastive Ranking

"Large vision-language models suffer from visual ungroundedness: they can produce a fluent, confident, and even correct response driven entirely by language priors, with the image contributing nothing to the prediction. Existing confidence estimation methods cannot detect this, as they observe model..."
πŸ› οΈ SHOW HN

Show HN: Prempti – Guardrails and observability for AI coding agents
