WELCOME TO METAMESH.BIZ +++ Needle somehow crammed Gemini's tool-calling brain into 26M params running at 1200 tok/s on your phone (the democratization of agents begins) +++ DOD deploys Mythos to patch the entire government while awkwardly breaking up with Anthropic (national security meets vendor lock-in drama) +++ Supply chain attackers poisoning Mistral's PyPI packages because why hack models when you can own the install process +++ THE MESH SEES YOUR TABULAR FOUNDATION MODELS FINALLY ESCAPING JUPYTER NOTEBOOKS +++
via Arxiv 👤 Zekun Wu, Ze Wang, Seonglae Cho et al. 📅 2026-05-08
⚡ Score: 8.0
"When a tool-calling agent picks the wrong tool, the failure is invisible until execution: the email gets sent, the meeting gets missed. Probing 12 instruction-tuned models across Gemma 3, Qwen 3, Qwen 2.5, and Llama 3.1 (270M to 27B), we find the identity of the chosen tool is linearly readable and..."
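The kind of linear readout the abstract describes can be sketched on synthetic data. Everything below is a hypothetical stand-in (random vectors around two "tool" centroids, not the paper's actual hidden states or probes); a nearest-centroid classifier is one minimal linear probe:

```python
import random

random.seed(0)
DIM = 16

# Hypothetical stand-in for hidden states: each chosen "tool" shifts the
# mean activation direction (synthetic data, not the paper's probes).
def sample(center, n):
    return [[c + random.gauss(0, 0.3) for c in center] for _ in range(n)]

center_a = [1.0] * DIM   # mean activation when tool A is chosen
center_b = [-1.0] * DIM  # mean activation when tool B is chosen
train_a, train_b = sample(center_a, 50), sample(center_b, 50)

# A nearest-centroid readout is one minimal linear probe: w = mu_a - mu_b.
mu_a = [sum(col) / len(train_a) for col in zip(*train_a)]
mu_b = [sum(col) / len(train_b) for col in zip(*train_b)]
w = [a - b for a, b in zip(mu_a, mu_b)]
bias = -0.5 * sum(wi * (a + b) for wi, a, b in zip(w, mu_a, mu_b))

def predict(h):
    """Linearly read out which tool the (synthetic) activation encodes."""
    return "A" if sum(wi * hi for wi, hi in zip(w, h)) + bias > 0 else "B"

test_set = sample(center_a, 20) + sample(center_b, 20)
labels = ["A"] * 20 + ["B"] * 20
acc = sum(predict(h) == y for h, y in zip(test_set, labels)) / len(labels)
```

If the probe generalizes, the tool identity was linearly readable before execution, which is exactly the failure-detection opportunity the paper points at.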
📰 NEWS
Google TIG discovers hackers using AI to find zero-day exploits
2x SOURCES 📅 2026-05-11
⚡ Score: 8.0
+++ Google's Threat Intelligence Group caught hackers using AI to find and exploit vulnerabilities at scale, confirming what security researchers have quietly dreaded: the automation of exploit development is now operational, not theoretical. +++
"Why does weight decay work? We prove that, in any fixed-precision regime, the smallest weight norm of a looped neural network outputting a binary string equals the Kolmogorov complexity of that string, up to a logarithmic factor. This implies that weight decay induces a prior matching Solomonoff's u..."
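In symbols, the abstract's claim reads roughly as follows (a paraphrase, not the paper's exact statement):

```latex
% Over looped networks f_\theta at fixed numerical precision that output
% the binary string x:
\min_{\theta \,:\, f_\theta \text{ outputs } x} \lVert \theta \rVert
  \;=\; K(x) \;+\; O(\log |x|),
% where K(x) is the Kolmogorov complexity of x. Penalizing the weight
% norm therefore biases training toward low-complexity outputs, in the
% spirit of Solomonoff's universal prior 2^{-K(x)}.
```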
"As the title states, my build is indeed able to run a 1 trillion parameter model (in this case Kimi K2.5) locally at ~4 tokens/second. I thought r/LocalLLaMA would be interested in the build due to that stat line, and also due to the inclusion of an unusual part, Intel Optane Persistent Memory, whi..."
"We open-sourced Needle, a 26M parameter function-calling (tool use) model. It runs at 6000 tok/s prefill and 1200 tok/s decode on consumer devices.
We were always frustrated by the little effort made towards building agentic models that run on budget phones, so we conducted investigations that led ..."
"I've been running structured output prompts through a bunch of models on OpenRouter for the past few months – Llama 3, Mistral, Command R, DeepSeek, Qwen, and every other model on OpenRouter – alongside the usual closed-source suspects. 288 calls total. I wanted to know what actually breaks, how oft..."
💬 Reddit Discussion: 44 comments
😐 MID OR MIXED
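A harness like the one described mostly boils down to parsing each completion and tallying failure modes. A toy sketch (the completions, required keys, and category names below are invented for illustration, not the poster's data or code):

```python
import json

# Hypothetical raw completions from a structured-output benchmark run;
# a real harness would collect these from OpenRouter responses.
completions = [
    '{"name": "Ada", "age": 36}',                 # clean JSON
    '```json\n{"name": "Ada", "age": 36}\n```',   # fenced, a common failure mode
    '{"name": "Ada", "age": }',                   # truncated / invalid
]

REQUIRED_KEYS = {"name", "age"}

def check(raw):
    """Classify one completion: ok / recoverable (fenced) / broken."""
    text = raw.strip()
    if text.startswith("```"):
        text = text.strip("`")
        text = text[len("json"):] if text.startswith("json") else text
        verdict = "recoverable"
    else:
        verdict = "ok"
    try:
        obj = json.loads(text)
    except json.JSONDecodeError:
        return "broken"
    return verdict if REQUIRED_KEYS <= obj.keys() else "broken"

results = [check(c) for c in completions]
```

Counting the three buckets over a few hundred calls per model gives exactly the "what actually breaks, how often" numbers the post is after.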
"TabPFN-3 was released today, the next iteration of the tabular foundation model, originally published in Nature.
Quick recap for anyone new to TabPFN: TabPFN predicts on tabular data in a single forward pass - no training, no hyperparameter search, no tuning. Built on TabPFN-2.5 (Nov 2025) and TabP..."
"Hey fellow Llamas, keeping it short.
We just shipped **DFlash** and **PFlash** support for the AMD Ryzen AI MAX+ 395 iGPU (gfx1151, Strix Halo, 128 GiB unified memory). Same Luce DFlash stack from [the RTX 3090 post a couple weeks back](https://www.reddit.com/r/LocalLLaMA/comments/1sx8uok/luce_dfla..."
via Arxiv 👤 Arnav Arora, Natalie Schluter, Katherine Metcalf et al. 📅 2026-05-08
⚡ Score: 7.3
"Conversational Large Language Models are post-trained on language that expresses specific behavioural traits, such as curiosity, open-mindedness, and empathy, and values, such as helpfulness, harmlessness, and honesty. This is done to increase utility, ensure safety, and improve the experience of th..."
💡 AI NEWS BUT ACTUALLY GOOD
The revolution will not be televised, but Claude will email you once we hit the singularity.
Get the stories that matter in Today's AI Briefing.
Powered by Premium Technology Intelligence Algorithms • Unsubscribe anytime
"now you can evaluate your models at home, sounds like a perfect tool to compare quants and finetunes
*Datasets: AIME, AIME2025, GSM8K, GPQA*..."
💬 Reddit Discussion: 22 comments
🐝 BUZZING
📰 NEWS
US DOD deploys Anthropic's Mythos vulnerability scanner
2x SOURCES 📅 2026-05-12
⚡ Score: 7.2
+++ The Pentagon is using Anthropic's vulnerability scanner across government systems even as it plots a strategic pivot away from the company, which is either excellent compartmentalization or just how procurement works. +++
"The modern ML (LLM) compiler stack is brutal. TVM is 500K+ lines of C++. PyTorch piles Dynamo, Inductor, and Triton on top of each other. I built a hackable LLM compiler from scratch and am documenting the process. It takes a small model (TinyLlama, Qwen2.5-7B) and lowers it to a sequence of CUDA ke..."
📰 NEWS
Anthropic's Computer Use API released
2x SOURCES 📅 2026-05-12
⚡ Score: 7.1
+++ Anthropic's new Computer Use API lets Claude interact with desktop interfaces directly, trading the traditional API paradigm for something that feels less like integration and more like hiring an intern who actually uses your software. +++
via Arxiv 👤 Nikita Kezins, Urbas Ekka, Pascal Berrang et al. 📅 2026-05-11
⚡ Score: 7.0
"Guardrail Classifiers defend production language models against harmful behavior, but although results seem promising in testing, they provide no formal guarantees. Providing formal guarantees for such models is hard because "harmful behavior" has no natural specification in a discrete input space:..."
"Speculative decoding accelerates LLM inference by drafting future tokens with a small model, but drafter models degrade sharply under template perturbation and long-context inputs. We identify a previously-unreported phenomenon we call **attention drift**: as the drafter generates successive t..."
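The drafter/verifier split at the heart of speculative decoding works roughly like this greedy toy. The lookup-table "models" are invented for illustration (real systems verify token distributions with an acceptance rule, not exact greedy matches), but the accept-longest-prefix-then-correct loop is the standard shape:

```python
def drafter(prev):
    # Tiny stand-in for the small draft model (hypothetical table).
    table = {"the": "cat", "cat": "sat", "sat": "on", "on": "a"}
    return table.get(prev, "<eos>")

def target(prev):
    # Stand-in for the large target model; disagrees with the drafter at "on".
    table = {"the": "cat", "cat": "sat", "sat": "on", "on": "the"}
    return table.get(prev, "<eos>")

def speculate(prompt_tok, k=4):
    # 1) Drafter proposes k tokens autoregressively.
    draft, prev = [], prompt_tok
    for _ in range(k):
        prev = drafter(prev)
        draft.append(prev)
    # 2) Target verifies: keep the longest prefix where its greedy choice
    #    agrees, then emit the target's own correction at the first miss.
    accepted, prev = [], prompt_tok
    for tok in draft:
        want = target(prev)
        if tok != want:
            accepted.append(want)  # target's correction replaces the miss
            break
        accepted.append(tok)
        prev = tok
    return accepted

out = speculate("the")
```

Attention drift in the drafter would show up here as the draft diverging earlier and earlier in the proposed block, shrinking the accepted prefix and erasing the speedup.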
via Arxiv 👤 Zezheng Lin, Fengming Liu 📅 2026-05-08
⚡ Score: 6.9
"Mechanistic interpretability papers increasingly use causal vocabulary: circuits, mediators, causal abstraction, monosemanticity. Such claims require explicit identification assumptions. A purposive audit of 10 papers across four methodological strands finds no dedicated identification-assumptions s..."
📰 NEWS
Claude Platform on AWS general availability
2x SOURCES 📅 2026-05-11
⚡ Score: 6.8
+++ Anthropic's Claude API now lives in AWS's walled garden with managed agents, code execution, and all the bells and whistles that make enterprise procurement teams sleep soundly at night. +++
"AWS customers get the full set of Claude API features, with AWS authentication, billing, and commitment retirement.
Build and deploy agents at scale with Claude Managed Agents, or use features like the advisor strategy, code execution, web search, web fetch, the Files API, MCP connector, prompt ca..."
💬 Reddit Discussion: 10 comments
😐 MID OR MIXED
via Arxiv 👤 Amin Karimi Monsefi, Dominic Culver, Nikhil Bhendawade et al. 📅 2026-05-08
⚡ Score: 6.8
"Discrete flow matching generates text by iteratively transforming noise tokens into coherent language, but may require hundreds of forward passes. Distillation uses the multi-step trajectory to train a student to reproduce the process in a few steps. When the student underperforms, the usual explana..."
via Arxiv 👤 Shuangrui Ding, Xuanlang Dai, Long Xing et al. 📅 2026-05-11
⚡ Score: 6.8
"Large language and vision-language models increasingly power agents that act on a user's behalf through command-line interface (CLI) harnesses. However, most agent benchmarks still rely on synthetic sandboxes, short-horizon tasks, mock-service APIs, and final-answer checks, leaving open whether agen..."
via Arxiv 👤 Yaxin Du, Xiyuan Yang, Zhifan Zhou et al. 📅 2026-05-11
⚡ Score: 6.8
"As model families, training recipes, and compute budgets become increasingly standardized, further gains in machine learning systems depend increasingly on data. Yet data engineering remains largely manual and ad hoc: practitioners repeatedly search for external datasets, adapt them to existing pipe..."
via Arxiv 👤 Jiayuan Liu, Tianqin Li, Shiyi Du et al. 📅 2026-05-08
⚡ Score: 6.8
"Context window expansion is often treated as a straightforward capability upgrade for LLMs, but we find it systematically fails in multi-agent social dilemmas. Across 7 LLMs and 4 games over 500 rounds, expanding accessible history degrades cooperation in 18 of 28 model--game settings, a pattern we..."
"I spent the past 5+ months building a pipeline that creates hybrid GGUF quant mixes. I also built it to learn from Unsloth (or other) models by utilizing their quant to tensor assignment. And some architectures like Qwen3.6 27B have super weird patterns that can get genuinely lower KLD while droppin..."
💬 Reddit Discussion: 24 comments
🐐 GOATED ENERGY
via Arxiv 👤 Anmol Gulati, Hariom Gupta, Elias Lumer et al. 📅 2026-05-08
⚡ Score: 6.7
"Long-horizon AI agents execute complex workflows spanning hundreds of sequential actions, yet a single wrong assumption early on can cascade into irreversible errors. When instructions are incomplete, the agent must decide not only whether to ask for clarification but when, and no prior work measure..."
via Arxiv 👤 Tong Zheng, Haolin Liu, Chengsong Huang et al. 📅 2026-05-08
⚡ Score: 6.7
"Test-time scaling (TTS) has become an effective approach for improving large language model performance by allocating additional computation during inference. However, existing TTS strategies are largely hand-crafted: researchers manually design reasoning patterns and tune heuristics by intuition, l..."
via Arxiv 👤 Simon Yu, Derek Chong, Ananjan Nandi et al. 📅 2026-05-11
⚡ Score: 6.7
"We introduce Shepherd, a functional programming model that formalizes meta-agent operations on target agents as functions, with core operations mechanized in Lean. Shepherd records every agent-environment interaction as a typed event in a Git-like execution trace, enabling any past state to be forke..."
via Arxiv 👤 Gaotang Li, Bhavana Dalvi Mishra, Zifeng Wang et al. 📅 2026-05-11
⚡ Score: 6.7
"Training deep research agents, namely systems that plan, search, evaluate evidence, and synthesize long-form reports, pushes reinforcement learning beyond the regime of verifiable rewards. Their outputs lack ground-truth answers, their trajectories span many tool-augmented decisions, and standard po..."
via Arxiv 👤 Mingxi Zou, Zhihan Guo, Langzhang Liang et al. 📅 2026-05-11
⚡ Score: 6.6
"Long-horizon language agents must operate under limited runtime memory, yet existing memory mechanisms often organize experience around descriptive criteria such as relevance, salience, or summary quality. For an agent, however, memory is valuable not because it faithfully describes the past, but be..."
via Arxiv 👤 Viacheslav Meshchaninov, Alexander Shabalin, Egor Chimbulatov et al. 📅 2026-05-08
⚡ Score: 6.6
"Latent diffusion models offer an attractive alternative to discrete diffusion for non-autoregressive text generation by operating on continuous text representations and denoising entire sequences in parallel. The major challenge in latent diffusion modeling is constructing a suitable latent space. I..."
via Arxiv 👤 Yash Akhauri, Mohamed S. Abdelfattah 📅 2026-05-11
⚡ Score: 6.6
"Efficient LLM inference research has largely focused on reducing the cost of each decoding step (e.g., using quantization, pruning, or sparse attention), typically applying a uniform computation budget to every generated token. In practice, token difficulty varies widely, so static compression can o..."
via Arxiv 👤 Roxana Geambasu, Mariana Raykova, Pierre Tholoniat et al. 📅 2026-05-11
⚡ Score: 6.6
"The dominant paradigm for AI agents is an "on-the-fly" loop in which agents synthesize plans and execute actions within seconds or minutes in response to user prompts. We argue that this paradigm short-circuits disciplined software engineering (SE) processes -- iterative design, rigorous testing, ad..."
"r/ClaudeAI • also crossposted to r/LocalLLaMA and r/artificial
I lost $187 to this and want to save others the same headache.
**What happened**
I run Claude Code headlessly via Windows Task Scheduler. My project repo has a `.env` file with `ANTHROPIC_API_KEY` set β legitimately, for a separ..."
"Command line interface (CLI) agents are emerging as a practical paradigm for agent-computer interaction over evolving filesystems, executable command line programs, and online execution feedback. Recent work has used reinforcement learning (RL) to learn these interaction abilities from verifiable ta..."
Google detects AI-generated code bypassing 2FA with zero-day
2x SOURCES 📅 2026-05-11
⚡ Score: 6.5
+++ Turns out giving hackers access to code generation tools makes them more efficient at their jobs, which Google is now warning about with the urgency of someone discovering fire is hot. +++
via Arxiv 👤 Tz-Huan Hsu, Jheng-Hong Yang, Jimmy Lin 📅 2026-05-11
⚡ Score: 6.5
"Does a lexical retriever suffice as large language models (LLMs) become more capable in an agentic loop? This question naturally arises when building deep research systems. We revisit it by pairing BM25 with frontier LLMs that have better reasoning and tool-use abilities. To support researchers aski..."
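BM25 itself is only a few lines, which is part of the paper's point. A self-contained sketch of the standard Okapi BM25 scoring that such a system would pair with a frontier LLM (the toy corpus, query, and parameter values here are invented for illustration):

```python
import math

# Toy corpus: pre-tokenized documents (made up, not the paper's data).
docs = [
    "sparse lexical retrieval with bm25".split(),
    "dense retrieval with neural encoders".split(),
    "agents use tools in a loop".split(),
]
k1, b = 1.5, 0.75                 # common Okapi BM25 defaults
N = len(docs)
avgdl = sum(len(d) for d in docs) / N
df = {}                           # document frequency per term
for d in docs:
    for t in set(d):
        df[t] = df.get(t, 0) + 1

def bm25(query, doc):
    """Okapi BM25 score of one document against a tokenized query."""
    score = 0.0
    for t in query:
        if t not in df:
            continue
        idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5) + 1)
        tf = doc.count(t)
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avgdl))
    return score

query = "bm25 lexical retrieval".split()
best = max(range(N), key=lambda i: bm25(query, docs[i]))
```

In an agentic loop, the LLM compensates for the retriever's lexical blind spots by reformulating queries across turns, which is why the question "does BM25 suffice?" gets interesting again.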
"Morning Everyone!
Big one today (**104 changes!**): Claude Code just went async.
The new `/goal` command lets you set a completion condition ("all tests pass and the PR is ready"), then Claude keeps grinding across turns until it's hit. The new `claude agents` view shows every session you've got r..."
💬 Reddit Discussion: 43 comments
😐 MID OR MIXED
via Arxiv 👤 Ning Liu, Chuanneng Sun, Kristina Klinkner et al. 📅 2026-05-08
⚡ Score: 6.5
"Direct Preference Optimization (DPO) aligns language models using pairwise preference comparisons, offering a simple and effective alternative to Reinforcement Learning (RL) from human feedback. However, in many practical settings, training data consists of multiple rollouts per prompt, inducing ric..."
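Per preference pair, the standard DPO objective reduces to a logistic loss on an implicit reward margin between the chosen and rejected responses. A minimal sketch (the log-probabilities below are made-up numbers for illustration):

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    logp_*     : policy log-prob of the chosen (w) / rejected (l) response
    ref_logp_* : same quantities under the frozen reference model
    """
    # Implicit reward margin: how much more the policy (relative to the
    # reference) prefers the chosen response over the rejected one.
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1 / (1 + math.exp(-margin)))  # -log sigmoid(margin)

# A policy that has shifted toward the chosen response gets a lower loss
# than one indistinguishable from the reference.
improving = dpo_loss(-10.0, -14.0, -12.0, -12.0)
indifferent = dpo_loss(-12.0, -12.0, -12.0, -12.0)
```

The abstract's observation is that real data often has many rollouts per prompt, and collapsing that richer ranking structure into independent pairs throws information away.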
via Arxiv 👤 Junhao Shen, Teng Zhang, Xiaoyan Zhao et al. 📅 2026-05-11
⚡ Score: 6.5
"Large language model agents increasingly rely on external skills to solve complex tasks, where skills act as modular units that extend their capabilities beyond what parametric memory alone supports. Existing methods assume external skills either accumulate as persistent guidance or internalized int..."
via Arxiv 👤 Manish Bhattarai, Ismael Boureima, Nishath Rajiv Ranasinghe et al. 📅 2026-05-08
⚡ Score: 6.5
"We argue that decomposing reward into weighted, verifiable criteria and using an LLM judge to score them provides a partial-credit optimization signal: instead of a binary outcome or a single holistic score, each response is graded along multiple task-specific criteria. We formalize \emph{rubric-gro..."
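The decomposition reads like a weighted checklist. A toy sketch of partial-credit scoring (the criteria, weights, and string-match checks below are invented; the paper scores each criterion with an LLM judge rather than predicates like these):

```python
# Rubric-grounded reward sketch: reward decomposes into weighted,
# verifiable criteria instead of one binary or holistic score.
rubric = {
    "cites_sources":    (0.4, lambda r: "http" in r),
    "under_200_words":  (0.2, lambda r: len(r.split()) < 200),
    "answers_question": (0.4, lambda r: "42" in r),
}

def reward(response):
    """Partial credit: each satisfied criterion contributes its weight."""
    return sum(w for w, check in rubric.values() if check(response))

r = reward("The answer is 42, see http://example.com for details.")
```

Compared with a single pass/fail signal, a response that satisfies two of three criteria still moves the policy in the right direction instead of being scored as a total failure.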
via Arxiv 👤 Joel Rorseth, Parke Godfrey, Lukasz Golab et al. 📅 2026-05-11
⚡ Score: 6.4
"This paper demonstrates RUBEN, an interactive tool for discovering minimal rules to explain the outputs of retrieval-augmented large language models (LLMs) in data-driven applications. We leverage novel pruning strategies to efficiently identify a minimal set of rules that subsume all others. We fur..."
via Arxiv 👤 Mohammadreza Armandpour, Fatih Ilhan, David Harrison et al. 📅 2026-05-11
⚡ Score: 6.4
"On-policy distillation offers dense, per-token supervision for training reasoning models; however, it remains unclear under which conditions this signal is beneficial and under which it is detrimental. Which teacher model should be used, and in the case of self-distillation, which specific context s..."
via Arxiv 👤 Julie Kallini, Artidoro Pagnoni, Tomasz Limisiewicz et al. 📅 2026-05-08
⚡ Score: 6.3
"Recent byte-level language models (LMs) match the performance of token-level models without relying on subword vocabularies, yet their utility is limited by slow, byte-by-byte autoregressive generation. We address this bottleneck in the Byte Latent Transformer (BLT) through new training and generati..."
"I have analyzed some decoder transformer models using Lyapunov spectral analysis and found that the ratio of the MLP and attention spectral norms strongly indicates whether a model will eventually collapse to rank-1 or not by the final layers.
I found that the spectral ratio is best kept around 0.5..."
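Spectral norms of weight matrices can be estimated with power iteration on MᵀM. A pure-Python sketch, using small diagonal matrices as hypothetical stand-ins for one block's MLP and attention weights (the post analyzes real transformer layers; this only shows the ratio computation):

```python
import random

random.seed(1)

def matvec(M, v):
    return [sum(r[j] * v[j] for j in range(len(v))) for r in M]

def spectral_norm(M, iters=100):
    """Largest singular value of M via power iteration on M^T M."""
    n = len(M[0])
    v = [random.random() for _ in range(n)]
    Mt = [list(r) for r in zip(*M)]
    for _ in range(iters):
        w = matvec(Mt, matvec(M, v))
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    Mv = matvec(M, v)
    return sum(x * x for x in Mv) ** 0.5

# Hypothetical stand-ins whose singular values we know exactly.
mlp = [[2.0, 0.0], [0.0, 0.5]]
attn = [[4.0, 0.0], [0.0, 1.0]]
ratio = spectral_norm(mlp) / spectral_norm(attn)
```

Tracking this ratio per layer is cheap, which is what makes it usable as an early warning for the rank-collapse behavior the post describes.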
"I've been spending the last several months reading every published psychology paper I can find on AI chatbot use, and I noticed something that genuinely bothers me as both a researcher and a Claude user.
Almost every empirical study samples one of three populations: ChatGPT users, Character.AI u..."
"I was running blind watching Claude Code work, could not tell where my money was going, when it was stuck in a loop, or what it was doing with my filesystem. So I built something open source to make it visible. Works with Claude Code, Codex CLI, Gemini CLI, Cursor, and any MCP server.
A scan ..."
"Today I set up a full coding toolbox on a single RTX 5080 (with RAM offloading) that's actually viable.
**Autocomplete**: bartowski/Qwen2.5-Coder-7B-Instruct-GGUF:Q6_K_L
**Agentic**: unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q8_K_XL
---
### Why these models:
Qwen2.5 is still the best model for infill imo..."
via Arxiv 👤 Jiatao Gu, Tianrong Chen, Ying Shen et al. 📅 2026-05-08
⚡ Score: 6.1
"Diffusion-based models decompose sampling into many small Gaussian denoising steps -- an assumption that breaks down when generation is compressed to a few coarse transitions. Existing few-step methods address this through distillation, consistency training, or adversarial objectives, but sacrifice..."
via Arxiv 👤 Linus Heck, Filip Macák, Roman Andriushchenko et al. 📅 2026-05-11
⚡ Score: 6.1
"Shielding is a prominent model-based technique to ensure safety of autonomous agents. Classical shielding aims to ensure that nothing bad ever happens and comes with strong guarantees about safety and maximal permissiveness. However, shielding systems for probabilistic safety, where something bad is..."
"After a year building a production fact-checking system, the single most counter-intuitive design decision I keep defending is this: the LLM in our pipeline never produces a numeric score, never produces a true/false verdict, never produces anything that gets surfaced to the user as a judgment. The ..."
💬 Reddit Discussion: 10 comments
😐 MID OR MIXED
via Arxiv 👤 Urchade Zaratiana, Mary Newhauser, George Hurn-Maloney et al. 📅 2026-05-08
⚡ Score: 6.1
"Ensuring safe, policy-compliant outputs from large language models requires real-time content moderation that can scale across multiple safety dimensions. However, state-of-the-art guardrail models rely on autoregressive decoders with 7B--27B parameters, reformulating what is fundamentally a classif..."
via Arxiv 👤 Reza Khanmohammadi, Erfan Miahi, Simerjot Kaur et al. 📅 2026-05-11
⚡ Score: 6.1
"Large vision-language models suffer from visual ungroundedness: they can produce a fluent, confident, and even correct response driven entirely by language priors, with the image contributing nothing to the prediction. Existing confidence estimation methods cannot detect this, as they observe model..."