WELCOME TO METAMESH.BIZ +++ LLMs finally watching videos directly because apparently we needed another modality to hallucinate in +++ Hybrid forecasting study discovers shocking truth: smart humans make AI better, dumb ones don't (Polymarket traders collectively unsurprised) +++ Someone got 35B parameters running on €990 of used hardware proving cloud providers hate this one weird trick +++ AI agents developing secret social hierarchies when no one's watching like middle schoolers with compute +++ THE FUTURE IS SELF-HOSTING, SECRETLY GOSSIPING, AND BETTING AGAINST ITSELF +++ ™
+++ Researchers found that RL fine-tuning concentrates its magic in surprisingly few layers, suggesting we've been inefficiently updating everything when we could just target the important bits. +++
via Arxiv · Zijian Zhang, Rizhen Hu, Athanasios Glentis et al. · 2026-07-01
Score: 7.0
"Reinforcement learning (RL) has become a central component of post-training large language models (LLMs), yet little is understood about how RL adaptation is distributed across transformer layers. Existing approaches typically update all model parameters uniformly, implicitly assuming that every lay..."
"Safety evaluations for language models increasingly depend on judgments about ambiguous natural-language behaviour: whether a model has followed an instruction, refused appropriately, complied with a policy, resisted an embedded command, or misreported progress in an agentic task. Existing benchmark..."
via Arxiv · Josh Hills, Ida Caspary, Asa Cooper Stickland · 2026-07-02
Score: 7.3
"As AI coding agents become more autonomous, they increasingly ship code iteratively, with the codebase persisting across sessions. This persistence creates a new attack surface: a misaligned or prompt-injected agent can distribute attacks across pull requests (PRs) and time its payload for the PR wi..."
"Whether pairing people with AI helps or hurts is usually reported as a single average effect. Using a real-money prediction market (Polymarket) as an objective, externally resolved benchmark, this pilot shows that the value of human-AI collaboration depends on a specific, measurable form of human ca..."
via Arxiv · Mona Schirmer, Metod Jazbec, Alexander Timans et al. · 2026-07-02
Score: 7.1
"Despite alignment training, LLMs remain prone to generating unsafe outputs at deployment time. Monitoring outputs online and raising an alarm when safety can no longer be assumed is therefore critical. We study a simple real-time monitor that turns a verifier signal from an external model into an al..."
via Arxiv · William Philipp, Finn Fassbender, Thorsten Langer et al. · 2026-07-01
Score: 7.1
"Open-response evaluation provides stronger clinical validity than multiple-choice benchmarks but creates a scoring bottleneck that motivates automated LLM-as-a-Judge approaches. Whether such evaluators replicate clinical calibration and caution, however, remains untested. We introduce MedQADE, the fi..."
via Arxiv · Giovanni Monea, Nathan Godey, Kianté Brantley et al. · 2026-07-01
Score: 7.1
"Transformers use the same forward computation stream to both predict the next token and store useful state for future token predictions. We formulate the state-prediction separation hypothesis: disentangling the two roles yields better language modeling performance. We design a Transformer va..."
via Arxiv · Arman Ghaffarizadeh, Danyal Mohaddes, Aliakbar Izadkhah et al. · 2026-07-02
Score: 7.0
"LLM agents will increasingly act in socially structured settings where role, audience, and relational context can shape what is advantageous or costly to say. We study whether such social structure, without any explicit objective in the prompt, changes what an agent expresses publicly relative to an..."
via Arxiv · Matteo Boglioni, Thibault Rousset, Siva Reddy et al. · 2026-07-02
Score: 6.9
"LLMs memorize sensitive training data, including personally identifiable information (PII), creating a pressing need for reliable post hoc removal methods. Unlearning has emerged as a promising solution, with state-of-the-art (SOTA) methods often following a localize-first, unlearn-second paradigm th..."
via Arxiv · Yanjun Zhao, Ruizhong Qiu, Tianxin Wei et al. · 2026-07-02
Score: 6.9
"Understanding and reasoning over long contexts has become a key requirement for deploying large language models (LLMs) in realistic applications. Although recent LLMs support increasingly long context windows, they often fail to use relevant evidence that is already present in the input, revealing a..."
via Arxiv · Zinan Tang, Yukun Zhang, Shaomian Zheng et al. · 2026-07-01
Score: 6.9
"In Large Language Model (LLM) training, data mixing plays a pivotal role in determining model performance. Recent methods optimize mixture weights via proxy models, but they rely on the assumption of static data distributions. As a result, when the underlying data pool shifts, these methods require..."
via Arxiv · Mehul Damani, Isha Puri, Idan Shenfeld et al. · 2026-07-01
Score: 6.9
"RL with verifiable rewards (RLVR) has emerged as a powerful paradigm for training LMs on tasks with well-defined success metrics, such as code generation and mathematical reasoning. However, current RLVR methods optimize only what can be objectively scored, often neglecting subjective, non-verifiabl..."
via Arxiv · Zhi Chen, Zhensu Sun, Yuling Shi et al. · 2026-07-01
Score: 6.9
"Repository-level performance-optimization benchmarks such as GSO, SWE-Perf and SWE-fficiency evaluate coding agents by applying patches to real repositories and comparing runtime against unoptimized baselines and official reference patches. Their leaderboard scores are increasingly used as evidence..."
via Arxiv · Junhao Shi, Siyin Wang, Xiaopeng Yu et al. · 2026-07-02
Score: 6.8
"Vision-Language-Action (VLA) models are fundamentally bottlenecked by the scarcity of expert demonstrations -- triplets of observations, instructions, and actions that are costly to collect at scale. We argue that this bottleneck stems from conflating two distinct learning objectives: acquiring phys..."
via Arxiv · Michael Y. Li, Anthony Zhan, Kanishk Gandhi et al. · 2026-07-01
Score: 6.8
"Scaling inference compute, by generating many parallel attempts per problem, is a costly but reliable lever for improving language model capabilities. By default these attempts are generated independently, wasting inference compute on redundant solutions. This waste seems unavoidable. After all, ind..."
via Arxiv · Shayan Talaei, Abhinav Chinta, Devvrit Khatri et al. · 2026-07-01
Score: 6.8
"Language models deployed in high-stakes roles can potentially favor certain entities, brands, or viewpoints, steering user decisions at scale. Such preferential biases can be introduced by any actor in the model's supply chain and are most dangerous when the model reveals its preference only on the..."
via Arxiv · Donghyun Lee, Jitesh Chavan, Duy Nguyen et al. · 2026-07-02
Score: 6.7
"Diffusion transformers (DiTs) achieve state-of-the-art image and video generation, but their multi-step sampling and growing parameter count make inference expensive. Post-training quantization (PTQ) is the natural remedy, yet DiT activations shift across timesteps, prompts, and guidance branches, f..."
via Arxiv · Ben Slivinski, Michael Saldivar · 2026-07-01
Score: 6.7
"When should an AI system's answer be trusted? Formal proof assistants offer certainty but cannot reach most of the problem distribution; scalar LLM judges offer coverage but produce opaque scores that cannot be audited after the fact and are subject to the same coherence issues as any LLM. We presen..."
via Arxiv · Shengguang Wu, Hao Zhu, Yuhui Zhang et al. · 2026-07-01
Score: 6.7
"Memory expertise is a learned skill: knowing what to encode, when to retrieve, and how to organize knowledge--a capacity known in cognitive science as metamemory. We bring this perspective to LLMs by treating memory management as a trainable skill. We promote file-system operations to first-class me..."
via Arxiv · Zhilin Wang, Han Song, Runzhe Zhan et al. · 2026-07-02
Score: 6.6
"Autonomous agents are increasingly expected to improve executable policies through feedback, yet existing evaluations often collapse this process into a final score or confound it with open-ended software-engineering progress. We introduce Autonomous Policy Evolution, a controlled evaluation setting..."
via Arxiv · Jiale Amber Wang, Kaiyuan Wang, Pengyu Nie · 2026-07-02
Score: 6.5
"Software tests and code evolve together: a code change should be followed by new or updated tests that record the new software behavior. Yet existing test generation and update benchmarks often isolate the test from the code change, and rely on static metadata that does not verify whether a test is..."
via Arxiv · Yunhe Li, Hao Shi, Wenhao Liu et al. · 2026-07-02
Score: 6.5
"On-policy self-distillation (OPSD) has emerged as a practical method for training large language models (LLMs) to reason, where a single model acts as both the teacher and the student with different levels of information access. However, recent studies have found that the teacher's dense token-level..."