📚 HISTORICAL ARCHIVE - May 25, 2026

                What was happening in AI on 2026-05-25
            

                📰 DAILY BRIEFING
            

34 stories tracked on May 25, 2026. Top story: Claude is not your architect. Stop letting it pretend.

🚀 WELCOME TO METAMESH.BIZ +++ UK's AI Safety Institute quietly becoming every government's copy-paste template for pretending to regulate AGI +++ Academia discovers formal proofs can be automated (mathematicians nervously updating LinkedIn profiles) +++ Safety researchers propose legal safe harbor for red-teamers because apparently we need permission slips to break the apocalypse machines +++ THE FUTURE IS PEER-REVIEWED, GOVERNMENT-APPROVED, AND STILL PROBABLY HALLUCINATING +++ 🚀

Archive from: 2026-05-25 | Preserved for posterity ⚡

Stories from May 25, 2026

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

📰 NEWS

Claude is not your architect. Stop letting it pretend

via HackerNews 👤 cdrnsf 📅 2026-05-24

🔺 185 pts ⚡ Score: 8.3

💬 HackerNews Buzz: 131 comments 🐝 BUZZING

🔬 RESEARCH

Advancing Mathematics Research with AI-Driven Formal Proof Search

via Arxiv 👤 George Tsoukalas, Anton Kovsharov, Sergey Shirobokov et al. 📅 2026-05-21

⚡ Score: 8.0

"Large language models (LLMs) increasingly excel at mathematical reasoning, but their unreliability limits their utility in mathematics research. A mitigation is using LLMs to generate formal proofs in languages like Lean. We perform the first large-scale evaluation of this method's ability to solve..."

🔬 RESEARCH

LLMs as Noisy Channels: A Shannon Perspective on Model Capacity and Scaling Laws

via Arxiv 👤 Xu Ouyang, Deyi Liu, Yuhang Cai et al. 📅 2026-05-22

⚡ Score: 7.9

"Existing scaling laws for Large Language Models (LLMs), predominantly monotonic power laws, fail to explain emerging non-monotonic phenomena such as catastrophic overtraining and quantization-induced degradation, where performance deteriorates despite increased compute. We propose the Shannon Scal..."

📰 NEWS

A look at the UK's AI Safety Institute, whose researchers probe AI models for safety gaps, as its work becomes a blueprint for other governments' AI policies

via Techmeme 👤 Nytimes 📅 2026-05-25

⚡ Score: 7.8

🔬 RESEARCH

DeltaBox: Scaling Stateful AI Agents with Millisecond-Level Sandbox Checkpoint/Rollback

via Arxiv 👤 Yunpeng Dong, Jingkai He, Yuze Hou et al. 📅 2026-05-21

⚡ Score: 7.7

"LLM-powered AI agents require high-frequency state exploration (e.g., test-time tree search and reinforcement learning), relying on rapid checkpoint and rollback (C/R) of the complete sandbox state, including files and process state (e.g., memory, contexts, etc.). Existing mechanisms duplicate the e..."

📰 NEWS

An AI safety safe harbor [pdf]

via HackerNews 👤 wrineha2 📅 2026-05-25

🔺 1 pts ⚡ Score: 7.4

📰 NEWS

Memory has grown to nearly two-thirds of AI chip component costs

via HackerNews 👤 intelkishan 📅 2026-05-24

🔺 221 pts ⚡ Score: 7.3

💬 HackerNews Buzz: 244 comments 😐 MID OR MIXED

🔬 RESEARCH

Boiling the Frog: A Multi-Turn Benchmark for Agentic Safety

via Arxiv 👤 Piercosma Bisconti, Matteo Prandi, Federico Pierucci et al. 📅 2026-05-21

⚡ Score: 7.3

"Background. Traditional safety benchmarks for language models evaluate generated text: whether a model outputs toxic language, reproduces bias, or follows harmful instructions. When models are deployed as agents, the safety-relevant object shifts from what the system says to what it does within an e..."

🔬 RESEARCH

Agentic Proving for Program Verification

via Arxiv 👤 Alessandro Sosso, Akhil Arora, Bas Spitters 📅 2026-05-22

⚡ Score: 7.3

"Agentic systems have recently emerged as state-of-the-art approaches for automated theorem proving in formal mathematics. To assess how far these capabilities extend to program verification, we evaluate Claude Code in an agentic proving framework on CLEVER, a Lean 4 benchmark for verifiable code gen..."

🔬 RESEARCH

Reducing Political Manipulation with Consistency Training

via Arxiv 👤 Long Phan, Devin Kim, Alexander Pan et al. 📅 2026-05-21

⚡ Score: 7.2

"Large language models (LLMs) exhibit systematic political bias across a variety of sensitive contexts. We find that LLMs handle counterpart topics from opposing political sides asymmetrically. We refer to this phenomenon as covert political bias and identify 7 categories of techniques through which..."

📰 NEWS

LLMs' – Failure Modes and Proposed Improvements

via HackerNews 👤 professor_jonny 📅 2026-05-24

🔺 1 pts ⚡ Score: 7.1

🔬 RESEARCH

Advancing mathematics research with AI-driven formal proof search

via HackerNews 👤 azhenley 📅 2026-05-25

🔺 2 pts ⚡ Score: 7.0

📰 NEWS

AI agents just got their own web browser via a Firefox fork

via HackerNews 👤 MilnerRoute 📅 2026-05-24

🔺 2 pts ⚡ Score: 7.0

🔬 RESEARCH

AMEL: Accumulated Message Effects on LLM Judgments

via Arxiv 👤 Sid-ali Temkit 📅 2026-05-21

⚡ Score: 7.0

"Large language models are routinely used as automated evaluators: to review code, moderate content, or score outputs, often with many items passing through one conversation. We ask whether the polarity of prior conversation history biases subsequent judgments, an effect we call the accumulated messa..."

📰 NEWS

Figure AI robot sorting livestream

2x SOURCES 🌐 📅 2026-05-25

⚡ Score: 6.9

+++ Figure AI livestreamed continuous robotic package sorting for 8 days, suggesting its humanoid robots might actually work in the real world. Whether this proves commercial viability or just proves they can run a really long demo remains delightfully unclear. +++

Figure AI had a livestream of their robots sorting packages 24/7 for 8 days straight. These aren't staged demos anymore.

via r/ChatGPT 👤 u/EchoOfOppenheimer 📅 2026-05-25

⬆️ 1761 ups ⚡ Score: 6.9

💬 Reddit Discussion: 411 comments 👍 LOWKEY SLAPS

🔬 RESEARCH

A Language for Describing Agentic LLM Contexts

via HackerNews 👤 mpweiher 📅 2026-05-24

🔺 3 pts ⚡ Score: 6.9

🔬 RESEARCH

MOSS: Self-Evolution through Source-Level Rewriting in Autonomous Agent Systems

via Arxiv 👤 Qianshu Cai, Yonggang Zhang, Xianzhang Jia et al. 📅 2026-05-21

⚡ Score: 6.9

"Autonomous agentic systems are largely static after deployment: they do not learn from user interactions, and recurring failures persist until the next human-driven update ships a fix. Self-evolving agents have emerged in response, but all confine evolution to text-mutable artifacts -- skill files,..."

📰 NEWS

Authorization layer for AI agents (OAuth has no idea what your agent is doing)

via HackerNews 👤 ElamOlame 📅 2026-05-24

🔺 2 pts ⚡ Score: 6.8

📰 NEWS

Tell HN: Claude Code now allows Anthropic to remotely inject system prompts

via HackerNews 👤 matheusmoreira 📅 2026-05-24

🔺 8 pts ⚡ Score: 6.8

💬 HackerNews Buzz: 7 comments 🐐 GOATED ENERGY

🔬 RESEARCH

LCGuard: Latent Communication Guard for Safe KV Sharing in Multi-Agent Systems

via Arxiv 👤 Sadia Asif, Mohammad Mohammadi Amiri, Momin Abbas et al. 📅 2026-05-21

⚡ Score: 6.7

"Large language model (LLM)-based multi-agent systems increasingly rely on intermediate communication to coordinate complex tasks. While most existing systems communicate through natural language, recent work shows that latent communication, particularly through transformer key-value (KV) caches, can..."

📰 NEWS

hipEngine: Fast Native Qwen 3.6 Inference for RDNA3 (Strix Halo, 7900 XTX)

via r/LocalLLaMA 👤 u/randomfoo2 📅 2026-05-24

⬆️ 58 ups ⚡ Score: 6.7

"A few weeks ago, after finishing FastDMS, I started toying around writing some RDNA3 kernels again to see how fast I could get Qwen 3.6 MoE running. It turned out well enough, so over the past cou..."

💬 Reddit Discussion: 14 comments 🐝 BUZZING

📰 NEWS

I built an MCP server to stop re-explaining my codebase patterns to Cursor every session

via r/cursor 👤 u/joutvhu 📅 2026-05-24

⬆️ 1 ups ⚡ Score: 6.7

"If you use Cursor heavily, you've probably hit this: you have internal patterns, boilerplate, team conventions — and every new chat you spend the first few messages re-establishing context. Rules files help but they load everything upfront, which burns context fast. I built **knowledge-shelf** to f..."

🔬 RESEARCH

SkillOpt: Executive Strategy for Self-Evolving Agent Skills

via Arxiv 👤 Yifan Yang, Ziyang Gong, Weiquan Huang et al. 📅 2026-05-22

⚡ Score: 6.7

"Agent skills today are hand-crafted, generated one-shot, or evolved through loosely controlled self-revision, none of which behaves like a deep-learning optimizer for the skill, and none of which reliably improves over its starting point under feedback. We argue the skill should instead be trained a..."

📰 NEWS

AI agents need audit trails more than they need more autonomy

via r/artificial 👤 u/RonnySaya 📅 2026-05-25

⬆️ 27 ups ⚡ Score: 6.6

"A lot of people talk about AI agents like the main goal is making them more independent. But the more I think about it, the bigger issue is probably visibility. If an AI is only answering a question, it is easy to judge the result. But once it starts doing things across websites, accounts, forms, su..."

💬 Reddit Discussion: 24 comments 😐 MID OR MIXED

🔬 RESEARCH

It's the humans, not the data: Geopolitical bias in LLMs originates in post-training, amplified by the language of the prompt

via Arxiv 👤 Stuart Bladon, Brinnae Bent 📅 2026-05-22

⚡ Score: 6.6

"It has generally been assumed that geopolitical bias in language models originates from the training data used during the pre-training phase. We tested seven open-weight LLM pairs consisting of the base model (pre-training only) and the chat model (pre-training and post-training) from seven labs on..."

🔬 RESEARCH

From Raw Experience to Skill Consumption: A Systematic Study of Model-Generated Agent Skills

via Arxiv 👤 Zisu Huang, Jingwen Xu, Yifan Yang et al. 📅 2026-05-22

⚡ Score: 6.6

"Language agents increasingly improve by reusing skills -- structured procedural artifacts distilled from past experience. In particular, domain-level and model-generated skills are especially promising. They offer fast adaptation within a domain by encoding domain-specific recur..."

🔬 RESEARCH

Strong Teacher Not Needed? On Distillation in LLM Pretraining

via Arxiv 👤 Taiming Lu, Zhuang Liu 📅 2026-05-22

⚡ Score: 6.5

"Knowledge distillation generally assumes a strong-to-weak relationship where stronger teachers yield better students. In this work, we examine this assumption about distillation in large language model pretraining. By varying architecture sizes and training token budgets, we create strong-to-weak, s..."

🔬 RESEARCH

Vector Policy Optimization: Training for Diversity Improves Test-Time Search

via Arxiv 👤 Ryan Bahlous-Boldi, Isha Puri, Idan Shenfeld et al. 📅 2026-05-21

⚡ Score: 6.4

"Language models must now generalize out of the box to novel environments and work inside inference-scaling search procedures, such as AlphaEvolve, that select rollouts with a variety of task-specific reward functions. Unfortunately, the standard paradigm of LLM post-training optimizes a pre-specifie..."

📰 NEWS

A chart showing how many unsolved math problems have recently been solved by AI

via r/OpenAI 👤 u/Confident_Salt_8108 📅 2026-05-25

⬆️ 31 ups ⚡ Score: 6.2

💬 Reddit Discussion: 7 comments 😐 MID OR MIXED

📰 NEWS

Concerning Law Enforcement Exemptions in Draft AI Act Transparency Guidelines

via HackerNews 👤 BrunoBernardino 📅 2026-05-25

🔺 2 pts ⚡ Score: 6.2

📰 NEWS

A look at DeepSeek's model optimization to reduce HBM use, potentially enabling domestic memory, ASIC, and CPU makers to create a Chinese AI hardware ecosystem

via Techmeme 👤 X 📅 2026-05-25

⚡ Score: 6.2

📰 NEWS

Neuro; An AOT-compiled language for AI workloads built on LLVM 20

via HackerNews 👤 PanzerPeter 📅 2026-05-24

🔺 1 pts ⚡ Score: 6.2

📰 NEWS

I built a computer use sandbox framework for codex on headless linux. GPU passthrough, computer use, and sudo access for codex all work. It's the perfect dev sandbox to allow full auto work while mini

via r/LocalLLaMA 👤 u/superSmitty9999 📅 2026-05-25

⬆️ 2 ups ⚡ Score: 6.1

"I've been working with agents for months now, and I haven't found a sandbox environment that "just works" so I built it! My requirements were as follows: 1. Agent is unable to destroy my host OS but able to install software and run sudo commands 2. Agent is able to browse the web autonomously and ..."

💬 Reddit Discussion: 2 comments 😐 MID OR MIXED

📰 NEWS

CUDA: add fast walsh-hadamard transform by am17an · Pull Request #23615 · ggml-org/llama.cpp

via r/LocalLLaMA 👤 u/pmttyji 📅 2026-05-25

⬆️ 36 ups ⚡ Score: 6.1

"Implemented (by u/am17an) FWHT for CUDA, speed-up for cases when we quantize the kv-cache. **1-2%** boost on pp & **7-9%** boost on tg. Performance on a 5090 with `-ctk q8_0 -ctv q8_0`

|Model|Test|t/s master|t/s cuda-fwt|Speedup|
|:-|:-|:-|:-|:-|
|gemma4 26B.A4B Q4\_K\_M|pp2048|13587.89|13809."

💬 Reddit Discussion: 9 comments 🐝 BUZZING
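
For readers who have not met the transform named in the PR title, here is a minimal pure-Python sketch of a fast Walsh-Hadamard transform. It is the textbook in-place butterfly, shown for illustration only. It is not the CUDA kernel from the pull request, and the function name `fwht` and the sample input are assumptions made for this example.

```python
def fwht(values):
    """Return the unnormalized Walsh-Hadamard transform of a sequence.

    Illustrative reference version (hypothetical helper, not llama.cpp code).
    The input length must be a power of two.
    """
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = list(values)
    h = 1
    # Each pass combines elements h apart with a sum/difference butterfly,
    # doubling the block size until the whole sequence is covered.
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    return out


signal = [1, 0, 1, 0, 0, 1, 1, 0]
spectrum = fwht(signal)

# Applying the unnormalized transform twice scales the input by its length,
# so dividing by len(signal) recovers the original values.
restored = [x // len(signal) for x in fwht(spectrum)]
```

The transform needs only additions and subtractions, log2(n) passes over n elements, which is why it is cheap enough to run alongside cache quantization.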
