AI News Archive - May 23, 2026 | Metamesh Intelligence

📰 NEWS

I reproduced a Claude Code RCE. The bug pattern is everywhere

via HackerNews 👤 GeorgeWoff25 📅 2026-05-23

🔺 5 pts ⚡ Score: 8.8

💬 HackerNews Buzz: 1 comment 🐝 BUZZING

📰 NEWS

Project Glasswing vulnerability disclosure results

2x SOURCES 🌐 📅 2026-05-22

⚡ Score: 8.4

+++ Anthropic's vulnerability-hunting model has apparently become quite good at finding security problems, which is either reassuring or terrifying depending on whether you're the one deploying it. +++

Project Glasswing: An Initial Update

via HackerNews 👤 louiereederson 📅 2026-05-22

🔺 429 pts ⚡ Score: 8.4

💬 HackerNews Buzz: 253 comments 👍 LOWKEY SLAPS

📰 NEWS

Microsoft reports AI is more expensive than paying human employees

via HackerNews 👤 nreece 📅 2026-05-23

🔺 203 pts ⚡ Score: 8.3

💬 HackerNews Buzz: 60 comments 😐 MID OR MIXED

🔬 RESEARCH

Evaluating Commercial AI Chatbots as News Intermediaries

via Arxiv 👤 Mirac Suzgun, Emily Shen, Federico Bianchi et al. 📅 2026-05-21

⚡ Score: 8.1

"AI chatbots are rapidly shaping how people encounter the news, yet no prior study has systematically measured how accurately these systems, with their proprietary search integrations and retrieval-synthesis pipelines, handle emerging facts across languages and regions. We present a 14-day (February..."

📰 NEWS

Sometimes people outside AI say things like 'it can't be that bad, there must be experts on top of it.' As 'an expert', I would like to be clear we are not on top of it ... We are on track for human

via r/OpenAI 👤 u/EchoOfOppenheimer 📅 2026-05-23

⬆️ 23 ups ⚡ Score: 7.8

💬 Reddit Discussion: 125 comments 😤 NEGATIVE ENERGY

🔬 RESEARCH

DeltaBox: Scaling Stateful AI Agents with Millisecond-Level Sandbox Checkpoint/Rollback

via Arxiv 👤 Yunpeng Dong, Jingkai He, Yuze Hou et al. 📅 2026-05-21

⚡ Score: 7.8

"LLM-powered AI agents require high-frequency state exploration (e.g., test-time tree search and reinforcement learning), relying on rapid checkpoint and rollback (C/R) of the complete sandbox state, including files and process state (e.g., memory, contexts, etc.). Existing mechanisms duplicate the e..."

📰 NEWS

BeeLlama v0.2.0 – major DFlash update. Single RTX 3090: Qwen 3.6 27B up to 164 tps (4.40x), Gemma 4 31B up to 177.8 tps (4.93x). Prompt processing speed near baseline.

via r/LocalLLaMA 👤 u/Anbeeld 📅 2026-05-22

⬆️ 180 ups ⚡ Score: 7.7

"BeeLlama v0.2.0 is here! Not quite a pegasus, but close enough. GitHub | Qwen 3.6 27B Quick Start | Gemma 4 31B Quick Start..."

💬 Reddit Discussion: 108 comments 🐝 BUZZING

📰 NEWS

TranscendPlexity: 540/540 ARC-AGI-1/2/3, 13 tasks with 0% AI solve rate, solved

via HackerNews 👤 wormsWorld 📅 2026-05-22

🔺 1 pts ⚡ Score: 7.5

🔬 RESEARCH

Boiling the Frog: A Multi-Turn Benchmark for Agentic Safety

via Arxiv 👤 Piercosma Bisconti, Matteo Prandi, Federico Pierucci et al. 📅 2026-05-21

⚡ Score: 7.3

"Background. Traditional safety benchmarks for language models evaluate generated text: whether a model outputs toxic language, reproduces bias, or follows harmful instructions. When models are deployed as agents, the safety-relevant object shifts from what the system says to what it does within an e..."

📰 NEWS

Measuring LLMs' ability to develop exploits

via HackerNews 👤 Kneenex 📅 2026-05-22

🔺 1 pts ⚡ Score: 7.3

📰 NEWS

How small can the orchestration model in an agent be? (separating it from code-gen — that obviously wants a big model)

via r/LocalLLaMA 👤 u/HomoAgens1 📅 2026-05-22

⬆️ 8 ups ⚡ Score: 7.1

"I'm building a local-first agent — a plain ReAct loop (think, pick a tool, observe, repeat) on a llama.cpp backend — and I want to be precise about a question that usually just gets answered with "it depends." It does depend. So let me split it into two jobs: (a) Heavy one-shot generation — write ..."

💬 Reddit Discussion: 5 comments 👍 LOWKEY SLAPS

🛠️ SHOW HN

Show HN: TruLayer – tracing, evals, and a control loop for production LLMs

via HackerNews 👤 trulayer 📅 2026-05-23

🔺 2 pts ⚡ Score: 7.1

📰 NEWS

Frontier labs don't use most AI compute (yet)

via HackerNews 👤 sleepyguy 📅 2026-05-23

🔺 3 pts ⚡ Score: 7.0

📰 NEWS

Spice: We built an open-sourced decision layer that sits above your AI agents (controls agent actions before execution) [P]

via r/MachineLearning 👤 u/Ok-Sir-8964 📅 2026-05-23

⚡ Score: 7.0

"Hi guys, been exploring here for a while, wanted to share something we've been working on. It's called Spice, an open-source decision layer above agents. We have tons of great execution agents now — Claude Code, Codex, hermes, etc. They're good at doing stu..."

📰 NEWS

SteelSpine: Replay tool for debugging AI agents

via HackerNews 👤 jeremyfelps 📅 2026-05-22

🔺 3 pts ⚡ Score: 7.0

📰 NEWS

AI Ops SOP Pack: SOPs for reviewing AI-assisted engineering work

via HackerNews 👤 monkidy 📅 2026-05-23

🔺 1 pts ⚡ Score: 7.0

🔬 RESEARCH

Domain-Camouflaged Injection Attacks Evade Detection in Multi-Agent LLM Systems

via HackerNews 👤 sbulaev 📅 2026-05-22

🔺 16 pts ⚡ Score: 7.0

💬 HackerNews Buzz: 8 comments 😤 NEGATIVE ENERGY

🔬 RESEARCH

MOSS: Self-Evolution through Source-Level Rewriting in Autonomous Agent Systems

via Arxiv 👤 Qianshu Cai, Yonggang Zhang, Xianzhang Jia et al. 📅 2026-05-21

⚡ Score: 6.9

"Autonomous agentic systems are largely static after deployment: they do not learn from user interactions, and recurring failures persist until the next human-driven update ships a fix. Self-evolving agents have emerged in response, but all confine evolution to text-mutable artifacts -- skill files,..."

📰 NEWS

The Verification Tree: Turning AI bug report floods into a confidence signal

via HackerNews 👤 yellow_glovez 📅 2026-05-23

🔺 2 pts ⚡ Score: 6.9

🔬 RESEARCH

Reducing Political Manipulation with Consistency Training

via Arxiv 👤 Long Phan, Devin Kim, Alexander Pan et al. 📅 2026-05-21

⚡ Score: 6.8

"Large language models (LLMs) exhibit systematic political bias across a variety of sensitive contexts. We find that LLMs handle counterpart topics from opposing political sides asymmetrically. We refer to this phenomenon as covert political bias and identify 7 categories of techniques through which..."

🔬 RESEARCH

LCGuard: Latent Communication Guard for Safe KV Sharing in Multi-Agent Systems

via Arxiv 👤 Sadia Asif, Mohammad Mohammadi Amiri, Momin Abbas et al. 📅 2026-05-21

⚡ Score: 6.7

"Large language model (LLM)-based multi-agent systems increasingly rely on intermediate communication to coordinate complex tasks. While most existing systems communicate through natural language, recent work shows that latent communication, particularly through transformer key-value (KV) caches, can..."

🔬 RESEARCH

Advancing Mathematics Research with AI-Driven Formal Proof Search

via Arxiv 👤 George Tsoukalas, Anton Kovsharov, Sergey Shirobokov et al. 📅 2026-05-21

⚡ Score: 6.7

"Large language models (LLMs) increasingly excel at mathematical reasoning, but their unreliability limits their utility in mathematics research. A mitigation is using LLMs to generate formal proofs in languages like Lean. We perform the first large-scale evaluation of this method's ability to solve..."

📰 NEWS

Turning a dashcam drive into PAS 2161-ready road condition data - SAM 3 + ray-plane IPM, 100 m segments

via r/computervision 👤 u/UrbanVueAI 📅 2026-05-23

⬆️ 90 ups ⚡ Score: 6.7

"Most road-damage models report frame-level mAP. Road authorities don’t buy mAP - they buy “which 100 m of asphalt is bad, how bad, where,” in a format their pavement-management system can ingest. I’m aiming the pipeline at BSI PAS 2161:2024 (new standard for AI-derived road condition data) so the ou..."

🔬 RESEARCH

Vector Policy Optimization: Training for Diversity Improves Test-Time Search

via Arxiv 👤 Ryan Bahlous-Boldi, Isha Puri, Idan Shenfeld et al. 📅 2026-05-21

⚡ Score: 6.7

"Language models must now generalize out of the box to novel environments and work inside inference-scaling search procedures, such as AlphaEvolve, that select rollouts with a variety of task-specific reward functions. Unfortunately, the standard paradigm of LLM post-training optimizes a pre-specifie..."

📰 NEWS

OpenCode and Cursor's Composer 2.5

via HackerNews 👤 lcavalcare 📅 2026-05-22

🔺 6 pts ⚡ Score: 6.6

📰 NEWS

The deployment funnel nobody talks about: 60% evaluate, 20% pilot, 5% ship. MIT tracked 300 real AI implementations against profit metrics.

via r/artificial 👤 u/Quantum_Merlin 📅 2026-05-22

⚡ Score: 6.6

"Late 2025, MIT researchers measured something the industry had avoided looking at directly. Not projections or pilot numbers. Documented outcomes from 300 AI deployments in real businesses, tracked against profit metrics. The funnel breaks down like this. Sixty percent of companies evaluated AI too..."

💬 Reddit Discussion: 7 comments 😤 NEGATIVE ENERGY

🔬 RESEARCH

AMEL: Accumulated Message Effects on LLM Judgments

via Arxiv 👤 Sid-ali Temkit 📅 2026-05-21

⚡ Score: 6.6

"Large language models are routinely used as automated evaluators: to review code, moderate content, or score outputs, often with many items passing through one conversation. We ask whether the polarity of prior conversation history biases subsequent judgments, an effect we call the accumulated messa..."

📰 NEWS

Models.dev: open-source database of AI model specs, pricing, and capabilities

via HackerNews 👤 maxloh 📅 2026-05-22

🔺 138 pts ⚡ Score: 6.5

💬 HackerNews Buzz: 25 comments 🐝 BUZZING

📰 NEWS

Benchmarked Needle 26M vs Qwen3-0.6B on CPU function calling, 50 queries across 5 difficulty tiers. The 23x smaller model wins on accuracy and is 4.4x faster.

via r/LocalLLaMA 👤 u/gvij 📅 2026-05-23

⬆️ 6 ups ⚡ Score: 6.5

"Ran a head-to-head on two open-weight models for tool-calling on a 4-core CPU, no GPU, no cherry-picking. Wanted to see if the small specialist (Needle, 26M, distilled from Gemini 3.1 for function calls) actually holds up against a small generalist (Qwen3-0.6B) that also does tools. Setup: 50 queri..."

🛠️ SHOW HN

Show HN: Mneme – Open-protocol AI memory that lives on your device

via HackerNews 👤 ptengelmann 📅 2026-05-22

🔺 2 pts ⚡ Score: 6.3

📰 NEWS

Experts first llama.cpp

via r/LocalLLaMA 👤 u/comanderxv 📅 2026-05-22

⬆️ 51 ups ⚡ Score: 6.3

"This is for all with 12GB VRAM. Hi, I created a fork of llama.cpp with an experimental implementation of experts instead of layers. The reason is I own an RTX 2060 with 12GB VRAM. That sounds big but is too little for dense models. That is why I use mainly MoE models because of that. The problem is..."

💬 Reddit Discussion: 24 comments 🐐 GOATED ENERGY

📰 NEWS

Llmff v0.1.2: FFmpeg-Shaped Pipelines for LLM Workflows

via HackerNews 👤 syndicalt 📅 2026-05-22

🔺 3 pts ⚡ Score: 6.2

🛠️ SHOW HN

Show HN: Memory for LLM apps that cuts input tokens up to 80% (avg 68%)

via HackerNews 👤 degutemesgen 📅 2026-05-23

🔺 2 pts ⚡ Score: 6.1

📰 NEWS

I fine-tuned an LLM to be C-3PO to test which training data format works best for persona injection [P]

via r/MachineLearning 👤 u/Georgiou1226 📅 2026-05-23

⚡ Score: 6.1

"Tested three formats: chat demos, first-person statements ("I am C-3PO..."), and synthetic Wikipedia-style docs. Same model, same LoRA config, 500 examples each. First-person statements won on generalization, which I didn't expect. The synthetic doc model was the weirdest result: it knew C-3PO was ..."

📰 NEWS

Embedded acoustic AI with <16ms latency running on 8MB RAM

via HackerNews 👤 shermanliu 📅 2026-05-23

🔺 3 pts ⚡ Score: 6.1

Stories from May 23, 2026

I reproduced a Claude Code RCE. The bug pattern is everywhere

Project Glasswing vulnerability disclosure results

Project Glasswing: An Initial Update

Anthropic says Claude Mythos Preview has been used to find more than 10,000 high- or critical-severity vulnerabilities since the launch of Project Glasswing

Microsoft reports AI is more expensive than paying human employees

Evaluating Commercial AI Chatbots as News Intermediaries

Sometimes people outside AI say things like 'it can't be that bad, there must be experts on top of it.' As 'an expert', I would like to be clear we are not on top of it ... We are on track for human

DeltaBox: Scaling Stateful AI Agents with Millisecond-Level Sandbox Checkpoint/Rollback

BeeLlama v0.2.0 – major DFlash update. Single RTX 3090: Qwen 3.6 27B up to 164 tps (4.40x), Gemma 4 31B up to 177.8 tps (4.93x). Prompt processing speed near baseline.

TranscendPlexity: 540/540 ARC-AGI-1/2/3, 13 tasks with 0% AI solve rate, solved

Boiling the Frog: A Multi-Turn Benchmark for Agentic Safety

Measuring LLMs' ability to develop exploits

How small can the orchestration model in an agent be? (separating it from code-gen — that obviously wants a big model)

Show HN: TruLayer – tracing, evals, and a control loop for production LLMs

Frontier labs don't use most AI compute (yet)

Spice: We built an open-sourced decision layer that sits above your AI agents (controls agent actions before execution) [P]

SteelSpine: Replay tool for debugging AI agents

AI Ops SOP Pack: SOPs for reviewing AI-assisted engineering work

Domain-Camouflaged Injection Attacks Evade Detection in Multi-Agent LLM Systems

MOSS: Self-Evolution through Source-Level Rewriting in Autonomous Agent Systems

The Verification Tree: Turning AI bug report floods into a confidence signal

Reducing Political Manipulation with Consistency Training

LCGuard: Latent Communication Guard for Safe KV Sharing in Multi-Agent Systems

Advancing Mathematics Research with AI-Driven Formal Proof Search

Turning a dashcam drive into PAS 2161-ready road condition data - SAM 3 + ray-plane IPM, 100 m segments

Vector Policy Optimization: Training for Diversity Improves Test-Time Search

OpenCode and Cursor's Composer 2.5

The deployment funnel nobody talks about: 60% evaluate, 20% pilot, 5% ship. MIT tracked 300 real AI implementations against profit metrics.

AMEL: Accumulated Message Effects on LLM Judgments

Models.dev: open-source database of AI model specs, pricing, and capabilities

Benchmarked Needle 26M vs Qwen3-0.6B on CPU function calling, 50 queries across 5 difficulty tiers. The 23x smaller model wins on accuracy and is 4.4x faster.

Show HN: Mneme – Open-protocol AI memory that lives on your device

Experts first llama.cpp

Llmff v0.1.2: FFmpeg-Shaped Pipelines for LLM Workflows

Show HN: Memory for LLM apps that cuts input tokens up to 80% (avg 68%)

I fine-tuned an LLM to be C-3PO to test which training data format works best for persona injection [P]

Embedded acoustic AI with <16ms latency running on 8MB RAM

📡 AI NEWS BUT ACTUALLY GOOD