🚀 WELCOME TO METAMESH.BIZ +++ LLMs finally watching videos directly because apparently we needed another modality to hallucinate in +++ Hybrid forecasting study discovers shocking truth: smart humans make AI better, dumb ones don't (Polymarket traders collectively unsurprised) +++ Someone got 35B parameters running on €990 of used hardware proving cloud providers hate this one weird trick +++ AI agents developing secret social hierarchies when no one's watching like middle schoolers with compute +++ THE FUTURE IS SELF-HOSTING, SECRETLY GOSSIPING, AND BETTING AGAINST ITSELF +++ •
AI Signal - PREMIUM TECH INTELLIGENCE
Last updated: 2026-07-03 | Server uptime: 99.9% ⚡

Today's Stories

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
🔬 RESEARCH

Single transformer layer RL training

+++ Researchers found that RL fine-tuning concentrates its magic in surprisingly few layers, suggesting we've been inefficiently updating everything when we could just target the important bits. +++

Is One Layer Enough? A Single Transformer Layer Matches Full-Parameter RL Train

💬 HackerNews Buzz: 29 comments 🐝 BUZZING
🔬 RESEARCH

Adversarial Pragmatics for AI Safety Evaluation: A Benchmark for Instruction Conflict, Embedded Commands, and Policy Ambiguity

"Safety evaluations for language models increasingly depend on judgments about ambiguous natural-language behaviour: whether a model has followed an instruction, refused appropriately, complied with a policy, resisted an embedded command, or misreported progress in an agentic task. Existing benchmark..."
🛠️ SHOW HN

Show HN: CLI tool for detecting non-exact code duplication with embedding models

💬 HackerNews Buzz: 31 comments 🐐 GOATED ENERGY
📰 NEWS

AI can't be listed as inventor on patent applications, Japan's top court rules

💬 HackerNews Buzz: 176 comments 👍 LOWKEY SLAPS
📰 NEWS

Claude-real-video: any LLM can watch a video

💬 HackerNews Buzz: 41 comments 🐝 BUZZING
🔬 RESEARCH

Distributed Attacks in Persistent-State AI Control

"As AI coding agents become more autonomous, they increasingly ship code iteratively, with the codebase persisting across sessions. This persistence creates a new attack surface: a misaligned or prompt-injected agent can distribute attacks across pull requests (PRs) and time its payload for the PR wi..."
🔬 RESEARCH

Human Capital, Not Model Benchmarks, Predicts Hybrid Intelligence in Forecasting

"Whether pairing people with AI helps or hurts is usually reported as a single average effect. Using a real-money prediction market (Polymarket) as an objective, externally resolved benchmark, this pilot shows that the value of human-AI collaboration depends on a specific, measurable form of human ca..."
📰 NEWS

Action Preflight: consequence-aware admission for LLM agent actions

💬 HackerNews Buzz: 2 comments 😐 MID OR MIXED
🔬 RESEARCH

Online Safety Monitoring for LLMs

"Despite alignment training, LLMs remain prone to generating unsafe outputs at deployment time. Monitoring outputs online and raising an alarm when safety can no longer be assumed is therefore critical. We study a simple real-time monitor that turns a verifier signal from an external model into an al..."
🔬 RESEARCH

Clinician-Level Agreement Without Clinical Caution: LLM Evaluator Limits in Medical AI Benchmarking

"Open-response evaluation provides stronger clinical validity than multiple-choice benchmarks but creates a scoring bottleneck that motivates automated LLM-as-a-Judge approaches. Whether such evaluators replicate clinical calibration and caution, however, remains untested. We introduce MedQADE, the fi..."
🔬 RESEARCH

The State-Prediction Separation Hypothesis

"Transformers use the same forward computation stream to both predict the next token and store useful state for future token predictions. We formulate the state-prediction separation hypothesis: disentangling the two roles yields better language modeling performance. We design a Transformer va..."
📰 NEWS

Running 26B and 35B LLMs at Full Speed on €990 of Used Hardware – No Cloud

🔬 RESEARCH

What LLM Agents Say When No One Is Watching: Social Structure and Latent Objective Emergence in Multi-Agent Debates

"LLM agents will increasingly act in socially structured settings where role, audience, and relational context can shape what is advantageous or costly to say. We study whether such social structure, without any explicit objective in the prompt, changes what an agent expresses publicly relative to an..."
📰 NEWS

The Effective Agent: what technical leaders should know about agentic AI today

🔬 RESEARCH

A Deterministic Replacement for LLM-as-Judge in Stateful Agent Evaluation

🔬 RESEARCH

LACUNA: A Testbed for Evaluating Localization Precision for LLM Unlearning

"LLMs memorize sensitive training data, including personally identifiable information (PII), creating a pressing need for reliable post hoc removal methods. Unlearning has emerged as a promising solution, with state-of-the-art (SOTA) methods often following a localize-first, unlearn-second paradigm th..."
🔬 RESEARCH

ReContext: Recursive Evidence Replay as LLM Harness for Long-Context Reasoning

"Understanding and reasoning over long contexts has become a key requirement for deploying large language models (LLMs) in realistic applications. Although recent LLMs support increasingly long context windows, they often fail to use relevant evidence that is already present in the input, revealing a..."
🔬 RESEARCH

CausalMix: Data Mixture as Causal Inference for Language Model Training

"In Large Language Model (LLM) training, data mixing plays a pivotal role in determining model performance. Recent methods optimize mixture weights via proxy models, but they rely on the assumption of static data distributions. As a result, when the underlying data pool shifts, these methods require..."
🔬 RESEARCH

Right in the Right Way: LM Training with Verifiable Rewards and Human Demonstrations

"RL with verifiable rewards (RLVR) has emerged as a powerful paradigm for training LMs on tasks with well-defined success metrics, such as code generation and mathematical reasoning. However, current RLVR methods optimize only what can be objectively scored, often neglecting subjective, non-verifiabl..."
🔬 RESEARCH

Are Performance-Optimization Benchmarks Reliably Measuring Coding Agents?

"Repository-level performance-optimization benchmarks such as GSO, SWE-Perf and SWE-fficiency evaluate coding agents by applying patches to real repositories and comparing runtime against unoptimized baselines and official reference patches. Their leaderboard scores are increasingly used as evidence..."
🔬 RESEARCH

Learning to Move Before Learning to Do: Task-Agnostic pretraining for VLAs

"Vision-Language-Action (VLA) models are fundamentally bottlenecked by the scarcity of expert demonstrations -- triplets of observations, instructions, and actions that are costly to collect at scale. We argue that this bottleneck stems from conflating two distinct learning objectives: acquiring phys..."
🔬 RESEARCH

QuasiMoTTo: Quasi-Monte Carlo Test-Time Scaling

"Scaling inference compute, by generating many parallel attempts per problem, is a costly but reliable lever for improving language model capabilities. By default these attempts are generated independently, wasting inference compute on redundant solutions. This waste seems unavoidable. After all, ind..."
🔬 RESEARCH

Distill to Detect: Exposing Stealth Biases in LLMs through Cartridge Distillation

"Language models deployed in high-stakes roles can potentially favor certain entities, brands, or viewpoints, steering user decisions at scale. Such preferential biases can be introduced by any actor in the model's supply chain and are most dangerous when the model reveals its preference only on the..."
📰 NEWS

Microsoft invests $2.5B and forms the Microsoft Frontier Company to embed 6,000 forward-deployed engineers with customers to help deploy AI systems

🔬 RESEARCH

OrbitQuant: Data-Agnostic Quantization for Image and Video Diffusion Transformers

"Diffusion transformers (DiTs) achieve state-of-the-art image and video generation, but their multi-step sampling and growing parameter count make inference expensive. Post-training quantization (PTQ) is the natural remedy, yet DiT activations shift across timesteps, prompts, and guidance branches, f..."
🔬 RESEARCH

Theoria: Rewrite-Acceptability Verification over Informal Reasoning States

"When should an AI system's answer be trusted? Formal proof assistants offer certainty but cannot reach most of the problem distribution; scalar LLM judges offer coverage but produce opaque scores that cannot be audited after the fact and are subject to the same coherence issues as any LLM. We presen..."
🔬 RESEARCH

AutoMem: Automated Learning of Memory as a Cognitive Skill

"Memory expertise is a learned skill: knowing what to encode, when to retrieve, and how to organize knowledge--a capacity known in cognitive science as metamemory. We bring this perspective to LLMs by treating memory management as a trainable skill. We promote file-system operations to first-class me..."
🔬 RESEARCH

EvoPolicyGym: Evaluating Autonomous Policy Evolution in Interactive Environments

"Autonomous agents are increasingly expected to improve executable policies through feedback, yet existing evaluations often collapse this process into a final score or confound it with open-ended software-engineering progress. We introduce Autonomous Policy Evolution, a controlled evaluation setting..."
📰 NEWS

Sources: Anthropic moves to close loopholes that let Chinese firms like Ant use its models via workarounds including cloud providers and overseas subsidiaries

🔬 RESEARCH

TestEvo-Bench: An Executable and Live Benchmark for Test and Code Co-Evolution

"Software tests and code evolve together: a code change should be followed by new or updated tests that record the new software behavior. Yet existing test generation and update benchmarks often isolate the test from the code change, and rely on static metadata that does not verify whether a test is..."
🔬 RESEARCH

DemoPSD: Disagreement-Modulated Policy Self-Distillation

"On-policy self-distillation (OPSD) has emerged as a practical method for training large language models (LLMs) to reason, where a single model acts as both the teacher and the student with different levels of information access. However, recent studies have found that the teacher's dense token-level..."
📰 NEWS

Memo: Microsoft is merging the consumer and enterprise versions of its Copilot chatbots into a single app featuring coding tools and AI agents dubbed AutoPilot

🛠️ SHOW HN

Show HN: Piggy – lazy senior dev mode for AI agents (80–94% less code)

🛠️ SHOW HN

Show HN: A provider-agnostic agent loop built on ports and adapters

💬 HackerNews Buzz: 3 comments 😤 NEGATIVE ENERGY