WELCOME TO METAMESH.BIZ +++ Meta drops four new inference chips in two years because why wait for NVIDIA when you can iterate yourself to victory +++ Axiom Math just raised $200M to formally verify code with AI (VCs betting $1.6B that computers can finally check their own homework) +++ Someone built AI memory using actual cognitive science instead of vector databases and the agents are starting to forget things like real humans +++ YOUR NEXT CVE WILL BE FROM AN MCP PLUGIN THAT SURVIVED SIX DELETION ATTEMPTS +++
💬 "The look-click-look-click loop it used for sending the Telegram for Musk was pretty slow."
• "One more tool targeting OSX only. That platform is overserved with desktop agents already while others are underserved, especially Linux."
"Most AI agent memory is just vector DB + semantic search. Store everything, retrieve by similarity. It works, but it doesn't scale well over time. The noise floor keeps rising and recall quality degrades.
I took a different approach and built memory using actual cognitive science models. ACT-R ac..."
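The excerpt cuts off at "ACT-R ac...", but the standard ACT-R ingredient for this kind of memory is base-level activation: an item's retrievability is the log of power-law-decayed traces of its past accesses. A minimal sketch, assuming that formula is what the post builds on (the decay rate d=0.5 is ACT-R's conventional default, not a detail from the post):

```python
import math

def base_level_activation(access_times, now, d=0.5):
    """ACT-R base-level activation: ln(sum of t_j**-d over past accesses),
    where t_j is the time elapsed since access j. Frequently and recently
    used memories score high; neglected ones decay steadily.
    d=0.5 is ACT-R's usual decay rate (an assumption, not from the post)."""
    return math.log(sum((now - t) ** -d for t in access_times))

# A memory touched three times recently outranks one touched once, long ago.
fresh = base_level_activation([90.0, 95.0, 99.0], now=100.0)
stale = base_level_activation([1.0], now=100.0)
```

Retrieval then keeps only items above an activation threshold, which is what lets an agent "forget" stale material instead of letting the noise floor rise the way a store-everything vector index does.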
"There's been a lot of debate on this sub about VLMs replacing traditional CV vs being overhyped. I've shipped production systems with both so here's what I've actually seen.
For context: I saw RentHuman, a platform where AI agents rent humans to do physical tasks, and realized it was missing..."
💡 AI NEWS BUT ACTUALLY GOOD
The revolution will not be televised, but Claude will email you once we hit the singularity.
Get the stories that matter in Today's AI Briefing.
Powered by Premium Technology Intelligence Algorithms • Unsubscribe anytime
via Arxiv 👤 Patricia Paskov, Kevin Wei, Shen Zhou Hong et al. 📅 2026-03-11
⚡ Score: 7.3
"Human uplift studies - or studies that measure AI effects on human performance relative to a status quo, typically using randomized controlled trial (RCT) methodology - are increasingly used to inform deployment, governance, and safety decisions for frontier AI systems. While the methods underlying..."
via Arxiv 👤 Mingyang Song, Mao Zheng 📅 2026-03-10
⚡ Score: 7.3
"Model merging has emerged as a transformative paradigm for combining the capabilities of multiple neural networks into a single unified model without additional training. With the rapid proliferation of fine-tuned large language models (LLMs), merging techniques offer a computationally efficient alt..."
🛠️ TOOLS
Claude Code builds games from prompts
3x SOURCES 📅 2026-03-11
⚡ Score: 7.2
+++ Developer builds Godot game generator that uses Claude to write GDScript, then validates output by actually playing the results, neatly sidestepping the "did it compile?" problem that plagues most LLM code evals. +++
"I built an autonomous pipeline that generates playable Godot games from a text prompt. The two problems worth discussing here: how to make an LLM write correct code in a language underrepresented in its training data, and how to verify correctness beyond compilation. This isn't a paper — the code is..."
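The play-to-validate idea can be sketched as a bounded headless run plus log inspection. Everything below is an assumption about the pipeline, not its actual code: the `--headless` and `--quit-after` flags are Godot 4 CLI options, and the error-marker strings are a guess at what a failing GDScript run prints.

```python
import subprocess

# Strings Godot prints when a generated script fails; these markers are an
# assumption based on Godot 4's log format, not taken from the pipeline.
ERROR_MARKERS = ("SCRIPT ERROR", "Parse Error", "ERROR:")

def log_has_errors(log_text: str) -> bool:
    """True if a captured engine log shows the generated script failed."""
    return any(marker in log_text for marker in ERROR_MARKERS)

def validate_project(project_dir: str, frames: int = 300, timeout_s: int = 30) -> bool:
    """Run the generated project headless for a bounded number of frames
    and judge it by exit code plus log inspection (assumes a Godot 4
    binary named 'godot' on PATH)."""
    try:
        proc = subprocess.run(
            ["godot", "--headless", "--path", project_dir,
             "--quit-after", str(frames)],
            capture_output=True, text=True, timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False  # a hang counts as a failed game
    return proc.returncode == 0 and not log_has_errors(proc.stdout + proc.stderr)
```

The point of the actual-play step is exactly this: a script can compile cleanly and still crash or hang on frame one, so exit code and runtime log beat "did it compile?" as a correctness signal.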
+++ Meta is churning out inference silicon faster than most companies ship software updates, with modular chiplets that let them iterate without total redesigns. The MTIA 300 is already handling real workloads. +++
"Meta shared details on four generations of their custom MTIA chips (300–500), all developed in roughly two years.
Meta's building their own silicon and iterating fast, a new chip roughly every 6 months, using modular chiplets where they can swap out pieces without redesigning everything.
Notable:
..."
💬 Reddit Discussion: 17 comments
MID OR MIXED
💬 HackerNews Buzz: 3 comments
GOATED ENERGY
🎯 Model interpretability • Transformers and attention • Executing code within models
💬 "This is an idea I had thought about, integrating tools into the main computation path of a model"
• "It makes sense that a next token predictor could execute assembly code"
🎯 Documentation Quality • Model Efficiency • Reproducibility & Transparency
💬 "Documentation (that's too long and often out of date) contributes to greater entropy rather than greater efficiency"
• "Having an up to date AGENTS.md should allow for new sessions to get into simple tasks quickly"
via Arxiv 👤 Tycho F. A. van der Ouderaa, Mart van Baalen, Paul Whatmough et al. 📅 2026-03-11
⚡ Score: 7.1
"Scalar quantization of large language models (LLMs) is fundamentally limited by information-theoretic bounds. While vector quantization (VQ) overcomes these limits by encoding blocks of parameters jointly, practical implementations must avoid the need for expensive lookup mechanisms or other explici..."
"Hey everyone,
I'm Ibrahim from Evrmind, a UK start-up working on AI compression and edge compute. We've been working on a compression method that focuses on something most quant methods don't optimise for: whether the model actually produces coherent text beyond a few hundred tokens.
We're announc..."
💬 Reddit Discussion: 12 comments
BUZZING
🎯 Caution with unknown binaries • AI model compression • AI model scaling
💬 "I am afraid to run unknown binaries, please share the source code."
• "Lets show us what you can do with QWEN 3.5"
Claude generates interactive charts and visualizations
2x SOURCES 📅 2026-03-12
⚡ Score: 7.1
+++ Anthropic's latest Claude update adds chart and diagram generation to conversations, rolling out in beta to all users. A genuinely useful feature that makes your AI assistant slightly less useless for data communication tasks. +++
via Arxiv 👤 Mingyang Song, Mao Zheng, Chenning Xu 📅 2026-03-11
⚡ Score: 6.9
"The paradigm of LLM-as-a-judge relies on a critical assumption, namely that high inter-evaluator agreement indicates reliable and objective evaluation. We present two complementary findings that challenge this assumption. First, we demonstrate that this consensus is frequently illusory. We..."
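For readers who want to poke at the agreement assumption themselves: raw inter-judge agreement is usually chance-corrected with Cohen's kappa, and a toy case shows how 90% raw agreement can carry zero evidence of reliable judging. This is the standard statistic, not the paper's own method:

```python
from collections import Counter

def cohens_kappa(a, b):
    """Chance-corrected agreement between two judges' verdict lists."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    counts_a, counts_b = Counter(a), Counter(b)
    # Agreement expected if each judge drew labels from their own marginals.
    expected = sum(counts_a[k] * counts_b[k] for k in set(a) | set(b)) / (n * n)
    return (observed - expected) / (1 - expected)

# 90% raw agreement, but both judges nearly always say "good",
# so the consensus is indistinguishable from chance (kappa = 0).
judge1 = ["good"] * 9 + ["bad"]
judge2 = ["good"] * 10
```

When judges share a strong label bias, high raw agreement is exactly the "illusory consensus" the abstract is gesturing at.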
via Arxiv 👤 Ann Yuan, Asma Ghandeharioun, Carter Blum et al. 📅 2026-03-10
⚡ Score: 6.9
"While existing evaluations of large language models (LLMs) measure deception rates, the underlying conditions that give rise to deceptive behavior are poorly understood. We investigate this question using a novel dataset of realistic moral trade-offs where honesty incurs variable costs. Contrary to..."
via Arxiv 👤 Konstantin Dobler, Simon Lehnerer, Federico Scozzafava et al. 📅 2026-03-11
⚡ Score: 6.8
"We present the Multilingual Reasoning Gym, an extension of Reasoning Gym (Stojanovski et al., 2025), that procedurally generates verifiable reasoning problems across 14 languages. We translate templates for 94 tasks with native-speaker validation in 10 languages and targeted code or template adaptat..."
via Arxiv 👤 Yaswanth Chittepu, Ativ Joshi, Rajarshi Bhattacharjee et al. 📅 2026-03-11
⚡ Score: 6.8
"Safe Reinforcement Learning from Human Feedback (RLHF) typically enforces safety through expected cost constraints, but the expectation captures only a single statistic of the cost distribution and fails to account for distributional uncertainty, particularly under heavy tails or rare catastrophic e..."
"Mistral recently released Voxtral-Mini-4B-Realtime, a multilingual, realtime speech-transcription model that supports 13 languages and is capable of <500 ms latency. Today, we added support for it to Transformers.js, enabling live ..."
🎯 Model Capabilities • Browser vs Operating System • Deployment Options
💬 "This model is awesome, and they are planning for speaker diarization in the next release!"
• "You can run it inside a mobile browser without having to deploy an App - Just one of many use cases"
via Arxiv 👤 Zorik Gekhman, Roee Aharoni, Eran Ofek et al. 📅 2026-03-10
⚡ Score: 6.8
"While reasoning in LLMs plays a natural role in math, code generation, and multi-hop factual questions, its effect on simple, single-hop factual questions remains unclear. Such questions do not require step-by-step logical decomposition, making the utility of reasoning highly counterintuitive. Never..."
via Arxiv 👤 Chengyu Shen, Yanheng Hou, Minghui Pan et al. 📅 2026-03-10
⚡ Score: 6.8
"Reliable evaluation is essential for developing and deploying large language models, yet in practice it often requires substantial manual effort: practitioners must identify appropriate benchmarks, reproduce heterogeneous evaluation codebases, configure dataset schema mappings, and interpret aggrega..."
"We show that MLP layers in transformer language models perform binary routing of continuous signals: the decision of whether a token needs nonlinear processing is well-captured by binary neuron activations, even though the signals being routed are continuous. In GPT-2 Small (124M parameters), we fin..."
💬 "Prompt injection is the clearest example: an attacker embeds instructions in content your agent processes."
• "Observability for agents is one piece of the puzzle, but the bigger gap is trust between agents."
via Arxiv 👤 Zhongren Chen, Joshua Kalla, Quan Le 📅 2026-03-10
⚡ Score: 6.7
"Concerns persist regarding the capacity of Large Language Models (LLMs) to sway political views. Although prior research has claimed that LLMs are not more persuasive than standard political campaign practices, the recent rise of frontier models warrants further study. In two survey experiments (N=1..."
🎯 Cryptocurrency Rewards • Measuring Model Improvements • Gamifying Research Contribution
💬 "I'm looking at the descending graph of progress here, and wondering if being able to claim improvement tokens (even for no reason other than NFT-esque bragging rights) wouldn't be a cool thing here?"
• "Is there anything to be learned from the differences in logprobs between them for the same input?"
via Arxiv 👤 Jinwoo Ahn, Ingyu Seong, Akhil Kedia et al. 📅 2026-03-11
⚡ Score: 6.7
"Transformer-based large language models (LLMs) rely on key-value (KV) caching to avoid redundant computation during autoregressive inference. While this mechanism greatly improves efficiency, the cache size grows linearly with the input sequence length, quickly becoming a bottleneck for long-context..."
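The linear growth the abstract describes is easy to make concrete: the cache holds one key and one value vector per layer per token. A back-of-envelope calculator, with illustrative 7B-class numbers that are hypothetical rather than taken from the paper:

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Per-sequence KV cache size: a K and a V vector per layer per token,
    hence the factor of 2 and the strictly linear growth in seq_len."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

# Illustrative 7B-class GQA config (hypothetical, not from the paper):
# 32 layers, 8 KV heads, head_dim 128, fp16 cache, 128k-token context.
gib = kv_cache_bytes(128_000, 32, 8, 128) / 2**30  # roughly 15.6 GiB
```

At long contexts this single sequence's cache rivals the weights themselves, which is why compression and eviction schemes like the one above keep appearing.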
via Arxiv 👤 Mohsen Hariri, Michael Hinczewski, Jing Ma et al. 📅 2026-03-11
⚡ Score: 6.7
"Test-time scaling evaluates reasoning LLMs by sampling multiple outputs per prompt, but ranking models in this regime remains underexplored. We formalize dense benchmark ranking under test-time scaling and introduce Scorio, a library that implements statistical ranking methods such as paired-compari..."
via Arxiv 👤 Maximilian Beck, Jonas Gehring, Jannik Kossen et al. 📅 2026-03-10
⚡ Score: 6.7
"Training large language models (LLMs) on Python execution traces grounds them in code execution and enables the line-by-line execution prediction of whole Python programs, effectively turning them into neural interpreters (FAIR CodeGen Team et al., 2025). However, developers rarely execute programs..."
🛠️ TOOLS
Perplexity Personal Computer agent
2x SOURCES 📅 2026-03-11
⚡ Score: 6.6
+++ Perplexity rolls out Personal Computer, a locally-runnable AI agent for your Mac plus an enterprise flavor, because apparently the future of work involves letting your laptop think for itself without phoning home first. +++
🎯 Skepticism towards AI hype • Lack of innovation in AI products • Concerns about AI's impact on jobs
💬 "This bubble is so ridiculous at this point."
• "We're not solving problems with technology, we're taking technology and applying it to problems."
via Arxiv 👤 Yunhang Qian, Xiaobin Hu, Jiaquan Yu et al. 📅 2026-03-10
⚡ Score: 6.6
"While Multi-Agent Systems (MAS) show potential for complex clinical decision support, the field remains hindered by architectural fragmentation and the lack of standardized multimodal integration. Current medical MAS research suffers from non-uniform data ingestion pipelines, inconsistent visual-rea..."
via Arxiv 👤 Naman Gupta, Vaibhav Singh, Arun Iyer et al. 📅 2026-03-10
⚡ Score: 6.6
"Sequential multi-agent reasoning frameworks such as Chain-of-Agents (CoA) handle long-context queries by decomposing inputs into chunks and processing them sequentially using LLM-based worker agents that read from and update a bounded shared memory. From a probabilistic perspective, CoA aims to appr..."
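The sequential read-update-pass pattern the abstract describes can be sketched as a fold over chunks with a hard cap on the shared memory. The worker below is a toy stand-in for an LLM agent, and the whole thing is a sketch of the CoA pattern rather than the paper's code:

```python
def chain_of_agents(chunks, worker, memory_limit=200):
    """Sequentially fold chunks through a worker that reads and rewrites
    a bounded shared memory (a sketch of the CoA pattern, not the paper's
    implementation)."""
    memory = ""
    for chunk in chunks:
        memory = worker(memory, chunk)[:memory_limit]  # enforce the bound
    return memory

def keep_relevant(memory, chunk, term="invoice"):
    """Toy worker: carry forward only lines mentioning the query term.
    A real CoA worker would be an LLM summarizing chunk + memory."""
    hits = [line for line in chunk.splitlines() if term in line]
    return "\n".join(filter(None, [memory] + hits))

chunks = ["preamble, nothing relevant",
          "line about an invoice being overdue",
          "closing remarks\nsecond invoice line"]
digest = chain_of_agents(chunks, keep_relevant)
```

The bounded memory is the whole trick: context cost per step stays constant no matter how long the input is, at the price of whatever the worker chooses to drop.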
via Arxiv 👤 Manya Wadhwa, Tiasa Singha Roy, Harvey Lederman et al. 📅 2026-03-10
⚡ Score: 6.6
"A key component of creativity is associative reasoning: the ability to draw novel yet meaningful connections between concepts. We introduce CREATE, a benchmark designed to evaluate models' capacity for creative associative reasoning. CREATE requires models to generate sets of paths connecting concep..."
via Arxiv 👤 Marc Damie, Murat Bilgehan Ertan, Domenico Essoussi et al. 📅 2026-03-11
⚡ Score: 6.6
"With their increasing capabilities, Large Language Models (LLMs) are now used across many industries. They have become useful tools for software engineers and support a wide range of development tasks. As LLMs are increasingly used in software development workflows, a critical question arises: are L..."
via Arxiv 👤 Shuaiqi Duan, Yadong Xue, Weihan Wang et al. 📅 2026-03-11
⚡ Score: 6.5
"GLM-OCR is an efficient 0.9B-parameter compact multimodal model designed for real-world document understanding. It combines a 0.4B-parameter CogViT visual encoder with a 0.5B-parameter GLM language decoder, achieving a strong balance between computational efficiency and recognition performance. To a..."
via Arxiv 👤 Yiyang Lu, Yu He, Jianlong Chen et al. 📅 2026-03-10
⚡ Score: 6.5
"Continual fine-tuning of large language models (LLMs) is becoming increasingly crucial as these models are deployed in dynamic environments where tasks and data distributions evolve over time. While strong adaptability enables rapid acquisition of new knowledge, it also exposes LLMs to catastrophic..."
"Ran Nemotron-3-Super-120B-A12B NVFP4 through a full benchmark sweep on a single RTX Pro 6000 using vLLM. fp8 KV cache (per Nvidia's setup, unclear if their metrics were tested at fp8 KV cache or not). Context from 1K to 512K, 1 to 5 concurrent requests, 1024 output tokens per request. No prompt cach..."
💬 Reddit Discussion: 18 comments
BUZZING
🎯 Model Performance • Context Length • Benchmark Comparison
💬 "the speed barely dropping at long context is the real story here"
• "Comparatively, 1M-context DeepSeek preview not only did a much better job, but also captured most of Nemotron's errors"
🎯 AI's impact on developer productivity • Limits of AI-assisted development • Potential future improvements
💬 "AI is a force multiplier. A 10x developer is now a 100x developer"
• "LLMs don't have a worldview; this means that they miss a lot of inconsistencies and logical contradictions"
🎯 AI usage in online discussions • Responsibility and authenticity • Moderation and community standards
💬 "While I share the concerns raised in this thread, I believe the focus on 'LLM usage' is a bit of a red herring."
• "It should clearly state that pasting AI-generated replies is discouraged and does not fit within the community spirit."
"I built Ink (https://ml.ink), a deployment platform where the primary users are AI agents.
Tell the agent to deploy. The platform auto-detects the framework, builds it, passes env variables, deploys on cloud and returns a live URL at *.ml.ink.
How I personally been usin..."
"You should really invest some time into enabling this for yourself.
It is pretty funny (and also addictive) to see the fans of your graphics card spinning up while you utilize "Your own Google"."
💬 Reddit Discussion: 45 comments
BUZZING
🎯 F1 race results • Search engine limitations • Alternative search tools
💬 "The most recent race was Australia: Russell, Antonelli, Leclerc."
• "Any alternative? like selenium with an MCP server?"
"This is a quantization sweep across major community GGUF quants of Qwen3.5-9B, comparing mean KLD to the BF16 baseline.
The goal is to give people a data-driven basis for picking a file rather than just grabbing whatever is available.
**KLD (KL Divergence):** "Faithfulness." It shows how much the ..."
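The "faithfulness" number such sweeps report is just the mean KL divergence between each quant's next-token distribution and the BF16 baseline's, averaged over tokens. A from-scratch sketch of the statistic itself, not the author's benchmark harness:

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats between two next-token probability distributions.
    Zero iff the quant reproduces the baseline distribution exactly."""
    return sum(pi * math.log(pi / max(qi, eps))
               for pi, qi in zip(p, q) if pi > 0)

def mean_kld(baseline_dists, quant_dists):
    """Mean per-token divergence of a quant from the BF16 baseline:
    lower means the smaller file is a more faithful copy."""
    pairs = list(zip(baseline_dists, quant_dists))
    return sum(kl_divergence(p, q) for p, q in pairs) / len(pairs)
```

KLD is a stricter yardstick than perplexity here because it penalizes any drift from the baseline's distribution, not just drift on the observed token.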
"Most retrieval systems for AI agents treat all indexed content as equally available regardless of age, access frequency, or contextual importance. This doesn't reflect how effective memory systems actually work.
I built claude-memory, an open-source ..."
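A scoring rule of the shape the post describes (semantic similarity discounted by age and boosted by access frequency) might look like the following. The half-life and weighting are illustrative assumptions, not claude-memory's actual parameters:

```python
import math

def retrieval_score(similarity, age_hours, access_count, half_life_hours=72.0):
    """Re-rank a retrieval hit by recency and usage, not similarity alone.
    - recency: exponential decay with an assumed 72h half-life
    - frequency: log-damped so hot items don't dominate forever
    All constants are illustrative, not from claude-memory."""
    recency = 0.5 ** (age_hours / half_life_hours)
    frequency = 1.0 + math.log1p(access_count)
    return similarity * recency * frequency
```

Under a rule like this, two equally similar memories no longer tie: the one touched yesterday and used often outranks the one indexed months ago and never read, which is the "effective memory systems" behavior the post is after.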
💬 HackerNews Buzz: 90 comments
GOATED ENERGY
🎯 LLM Performance Trends • AI Tooling Improvements • AI Agent Interactions
💬 "LLM's have 100% gotten better, but it's hard to say if it's intrinsically better"
• "The improved tooling and agent-based approaches that I'm using now make the LLM one-shot performance only a small part of the puzzle"
"I'm happy to report that llama.cpp has another nice and exciting feature that I know a lot of you have been waiting for - real support for reasoning budgets!
Until now, `--reasoning-budget` was basically a stub, with its only function being setting it to 0 to disable thinking via passing `enable..."
💬 "But, I expect that reduced thinking time will negatively affect intelligence scores"
• "It's worth noting that this ability is not explicitly trained but emerges naturally"