WELCOME TO METAMESH.BIZ +++ Anthropic admits 80% of its codebase is now written by Claude (the machines are literally building the machines) +++ Vector search can't handle LLM memory because turns out brains aren't just similarity matrices +++ DeepSeek's benchmark scores looking sus after audit reveals their v4-pro can't actually code its way out of a Python tutorial +++ THE RECURSION LOOP IS CALLING FROM INSIDE THE CODEBASE +++
+++ Anthropic reports 80%+ of merged code is Claude-authored, marking genuine progress toward recursive self-improvement while casually normalizing the concept of AI systems bootstrapping themselves. +++
via Arxiv | Zhangchen Xu, Junda Chen, Yue Huang et al. | 2026-06-03
Score: 7.3
"Scientific and engineering progress is fundamentally a long-horizon iterative process: proposing changes, running experiments, measuring outcomes, and continuously refining artifacts. Yet existing benchmarks for frontier models primarily evaluate either single-turn responses or short-horizon agent t..."
via Arxiv | Ibrahim Abdelaziz, Asim Munawar, Kinjal Basu et al. | 2026-06-02
Score: 7.2
"Training LLMs to orchestrate multi-step tool calls is held back by three coupled obstacles: realistic stateful execution environments are costly to build, synthetic training queries are often detached from the server's actual state (so the generated tool calls fail to execute), and recall-based RL r..."
via Arxiv | Zongwei Lv, Zhewen Tan, Yaoming Li et al. | 2026-06-02
Score: 7.1
"Agent benchmarks should reflect what users actually ask deployed agents to do, yet existing benchmarks often miss key realism properties of real developer-agent sessions. We introduce RealClawBench, a live benchmark framework built from real OpenClaw sessions to capture the distribution, diversity,..."
via Arxiv | Nizar Islah, Istabrak Abbes, Irina Rish et al. | 2026-06-03
Score: 7.0
"When post-trained language models fail on reasoning problems, the common test-time-scaling response is to spend more compute on additional attempts, and the failed traces play no further role. We argue this discards a crucial signal; some failures come from unlucky sampling, where more rollouts help..."
via Arxiv | Máté Gedeon, Péter Mihajlik | 2026-06-02
Score: 7.0
"Conversational ASR for lower-resource languages and niche domains is limited by the scarcity of domain-matched multi-speaker training data. We propose an augmentation pipeline that generates scenario-level dialogues with participant metadata, maps speaker attributes to TTS voice profiles, and assemb..."
via Arxiv | Rishabh Agrawal, Jacob Fein-Ashley, Paria Rashidinejad | 2026-06-03
Score: 6.9
"Reasoning models have advanced rapidly, but the dominant reinforcement learning from verifiable rewards (RLVR) recipe remains surprisingly narrow: sample many responses and reward each with a single bit indicating whether the final answer is correct. Yet many settings provide rich feedback, includin..."
via Arxiv | Yu Xia, Zhouhang Xie, Xin Xu et al. | 2026-06-02
Score: 6.9
"Large language models improve final-answer accuracy through extended chain-of-thought reasoning, but often spend tokens inefficiently and offer little inference-time control. Existing efficient reasoning methods control thinking length by shortening, early-stopping, or compressing traces, leaving ho..."
via Arxiv | Ting-Yun Chang, Harvey Yiyun Fu, Deqing Fu et al. | 2026-06-02
Score: 6.9
"Reasoning models improve accuracy through extended chains of thought, but their long outputs create a memory and compute bottleneck. KV cache eviction methods reduce this cost by evicting unimportant key-value pairs from the cache, yet they often yield worse accuracy than selection-based sparse atte..."
via Arxiv | Tao Chen, Gangwei Jiang, Pengyu Cheng et al. | 2026-06-02
Score: 6.8
"Reward models (RMs) provide critical feedback signals for LLM post-training, notably in reinforced fine-tuning (RFT) and reinforcement learning (RL) pipelines. However, current reward evaluation relies on heterogeneous criteria such as rule-based verifiers, ground-truth references, procedural checkl..."
via Arxiv | Zhen Yang, Xiaogang Xu, Wen Wang et al. | 2026-06-03
Score: 6.8
"Multi-agent reasoning systems adopt a "generate-then-transfer" paradigm that forces end-to-end latency to scale linearly with pipeline depth. We introduce StreamMA, a multi-agent reasoning system that streams each reasoning step to downstream agents as soon as it is generated, pipelining adjacent ag..."
via Arxiv | Zhifei Xie, Zihang Liu, Ze An et al. | 2026-06-03
Score: 6.8
"Audio is an inherently interactive modality, yet today's Large Audio Language Models (LALMs) are offline, and streaming audio models each handle only a single task such as streaming ASR or voice chatting. It is time to unify them into one online LALM: a model that, through an always-on perceive-deci..."
via Arxiv | Rongzhi Zhang, Rui Feng, Zhihan Zhang et al. | 2026-06-02
Score: 6.7
"Rubric-based RL is a promising route for extending reinforcement learning beyond verifiable rewards, yet existing methods optimize rubrics while treating the query distribution as fixed. We identify a structural bottleneck: rubric quality is constrained by query structure. Open-ended queries yield v..."
via Arxiv | Bishwas Mandal, Shmuel Berman, Akshay Vegesna et al. | 2026-06-02
Score: 6.6
"Multi-epoch training is becoming the standard now that compute is growing faster than the supply of high-quality text. But pretraining a single model saturates within a few passes, long before the compute budget is exhausted. We argue this calls for a conceptual shift from training a single model to..."
via Arxiv | Areeb Gani, Asal Meskin, Gabrielle Kaili-May Liu et al. | 2026-06-02
Score: 6.6
"Reliable uncertainty communication is critical to the trustworthiness of LLMs, yet faithful calibration (FC)--the alignment between models' intrinsic and (linguistically) expressed confidence--is a persistent failure mode. This challenge is key for large reasoning models (LRMs), whose extended reaso..."
via Arxiv | Zekun Qi, Xuchuan Chen, Dairu Liu et al. | 2026-06-02
Score: 6.5
"We introduce Humanoid-GPT, a GPT-style Transformer with causal attention trained on a billion-scale motion corpus for whole-body control. Unlike prior shallow MLP trackers constrained by scarce data and an agility-generalization trade-off, Humanoid-GPT is pre-trained on a 2B-frame retargeted corpus..."
via Arxiv | Luis Palacios, Lorenzo Basile, Diego Doimo et al. | 2026-06-02
Score: 6.5
"Visual instruction tuning effectively adapts a pre-trained Large Language Model (LLM) to process image information alongside text. Yet, it remains unclear how visual features are embedded into the layer-wise hierarchy of abstractions of the LLM backbone. Across a diverse set of vision-language archi..."
NEWS
Anthropic urges AI development pause
2x SOURCES | 2026-06-04
Score: 6.4
+++ The irony of an AI lab asking the world to pump the brakes while they're literally racing to scale their own models isn't lost on practitioners, though the self-improvement concern raises legitimate questions worth taking seriously. +++