🌐 WELCOME TO METAMESH.BIZ +++ AI agent casually designs working 1.5GHz RISC-V chip from prompt alone (silicon valley's hardware teams updating LinkedIn profiles) +++ Someone finally solved a FrontierMath problem and the math nerds are having feelings about it +++ Karpathy running 700 experiments in 48 hours with autoresearch loops (the robots are optimizing themselves now) +++ 7MB Mamba runs on ESP32 with zero floating point ops because who needs math.h when you have XNORs +++ THE MESH COMPUTES WHERE FPUS FEAR TO TREAD +++ 🌐 •
+++ Andrej Karpathy's autonomous research agent ran 700 ML experiments in 48 hours, proving that AI can optimize itself faster than humans can write grant proposals about it. +++
💬 Reddit Discussion: 52 comments
🐝 BUZZING
🎯 Democratization of AI • Responsible AI development • Evaluating AI researchers
💬 "outsourcing intelligence is what is happening right now, and it's only going to speed up"
• "the human role in research starts looking a lot more like hypothesis curation than hypothesis testing"
via Arxiv 👤 Zhuolin Yang, Zihan Liu, Yang Chen et al. 📅 2026-03-19
⚡ Score: 8.1
"We introduce Nemotron-Cascade 2, an open 30B MoE model with 3B activated parameters that delivers best-in-class reasoning and strong agentic capabilities. Despite its compact size, its mathematical and coding reasoning performance approaches that of frontier open models. It is the second open-weight..."
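The "3B activated out of 30B total" headline comes from sparse expert routing: a gate scores all experts per token but only the top-k actually run. A toy sketch of top-k gating (expert count, k, and names are illustrative, not Nemotron-Cascade 2's actual configuration):

```python
# Toy top-k MoE router: only the k best-scoring experts fire per token,
# so roughly k/num_experts of the expert parameters are active.
import math

def topk_route(logits, k=2):
    """Pick the k highest-scoring experts and softmax-normalize their gates."""
    top = sorted(range(len(logits)), key=lambda i: -logits[i])[:k]
    exps = [math.exp(logits[i]) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

# 8 experts, but each token only activates 2
gates = topk_route([0.1, 2.0, -1.0, 0.5, 1.5, 0.0, -0.5, 0.3], k=2)
print(gates)  # experts 1 and 4 carry this token
```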
"**Paper:** https://arxiv.org/abs/2603.18280
**TL;DR:** Current alignment evaluation measures concept detection (probing) and refusal (benchmarking), but alignment primarily operates through a learned routing mechanism between these - and that routing is lab-speci..."
via Arxiv 👤 Eric A. Moreno, Samuel Bright-Thonney, Andrzej Novak et al. 📅 2026-03-20
⚡ Score: 7.3
"Large language model-based AI agents are now able to autonomously execute substantial portions of a high energy physics (HEP) analysis pipeline with minimal expert-curated input. Given access to a HEP dataset, an execution framework, and a corpus of prior experimental literature, we find that Claude..."
🤖 AI MODELS
Binary-Weight/Quantized LLM for Resource-Constrained Devices
2x SOURCES 🔗📅 2026-03-22
⚡ Score: 7.3
+++ Binary weights and video compression tricks push inference into microcontrollers and browsers, because apparently the path to AGI runs through devices with less RAM than a 2005 iPod. +++
"57M params, fully binary {-1,+1}, state space model. The C runtime doesn't include math.h - every operation is integer arithmetic (XNOR, popcount, int16 accumulator for SSM state).
Designed for hardware without FPU: ESP32, Cortex-M, or anything with ~8MB of memory and a CPU. Also runs in browser v..."
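The no-math.h trick works because a dot product of {-1,+1} vectors reduces to XNOR plus popcount: XNOR marks positions where signs agree, and dot = 2·popcount(agree) − n. A minimal sketch of that identity (illustrative names, not the project's actual C runtime):

```python
# Binary {-1,+1} dot product via XNOR + popcount, as described in the post.
# Values are packed one bit per element: bit 1 means +1, bit 0 means -1.

def pack_bits(values):
    """Pack a list of -1/+1 values into an int bitmask."""
    mask = 0
    for i, v in enumerate(values):
        if v == 1:
            mask |= 1 << i
    return mask

def binary_dot(w_bits, x_bits, n):
    """Dot product of two {-1,+1} vectors stored as bitmasks.

    Each sign agreement contributes +1, each disagreement -1,
    so dot = 2 * popcount(agreements) - n.
    """
    agree = ~(w_bits ^ x_bits) & ((1 << n) - 1)  # XNOR, masked to n bits
    return 2 * bin(agree).count("1") - n

w = [1, -1, 1, 1]
x = [1, 1, -1, 1]
# Signs agree at positions 0 and 3 -> 2*2 - 4 = 0
print(binary_dot(pack_bits(w), pack_bits(x), 4))
```

On real FPU-less hardware the popcount would be a single instruction or a lookup table, and weights stay packed 8 per byte, which is where the ~8x memory saving comes from.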
💬 HackerNews Buzz: 2 comments
🐐 GOATED ENERGY
🎯 LLM optimization techniques • Caching and compression algorithms • Tradeoffs in performance
💬 "The main utility of this beyond just saving money for model servers would be deliberately prefilling very long contexts and then saving them to fast flash."
• "Bandwidth-wise it is worse (more bytes accessed) to generate and do random recall on than the vanilla approach, and significantly worse than a quantized approach."
💬 HackerNews Buzz: 11 comments
😐 MID OR MIXED
🎯 Data privacy • Personal productivity • Cloud storage concerns
💬 "I'm loathe to essentially send screenshots/summaries/etc of all my activity to a cloud solution"
• "If you thought Slack logs were damning in discovery, wait til someone suing or prosecuting you figures out that everything you typed and looked at, etc., is in the cloud"
🎯 Memory requirements for AI • Mobile hardware limitations • Practical applications of large models
💬 "Apple has always seen RAM as an economic advantage for their platform"
• "Apple can't code their way around this problem, nor create specialized SoCs with ML cores that obviate the need for lots and lots of RAM"
💡 AI NEWS BUT ACTUALLY GOOD
The revolution will not be televised, but Claude will email you once we hit the singularity.
Get the stories that matter in Today's AI Briefing.
Powered by Premium Technology Intelligence Algorithms • Unsubscribe anytime
"Been thinking about how AI memory systems are only ever tested at tiny scales - LOCOMO does 600 turns, LongMemEval does around 1,000. But real usage doesn't look like that.
WMB-100K tests 100,000 turns, with 3,134 questions across 5 difficulty levels. Also includes false memory probes - because "I ..."
via Arxiv 👤 Amartya Mukherjee, Maxwell Fitzsimmons, David C. Del Rey Fernández et al. 📅 2026-03-19
⚡ Score: 7.0
"Uncertainty quantification for partial differential equations is traditionally grounded in discretization theory, where solution error is controlled via mesh/grid refinement. Physics-informed neural networks fundamentally depart from this paradigm: they approximate solutions by minimizing residual l..."
🛠️ TOOLS
Knowledge Engine with Graph-Based Reasoning (No LLM Reasoning)
2x SOURCES 🔗📅 2026-03-23
⚡ Score: 7.0
+++ Open-source neurosymbolic engine relegates language models to reading comprehension duty while deterministic graphs handle actual reasoning, proving you don't need GPT-4 money to avoid hallucinations, just better architecture. +++
"Built an open-source knowledge engine where the LLM does zero reasoning. All inference runs through a deterministic spreading activation graph on CPU. The LLM only reads 1-2 pre-scored sentences at the end, so you can swap gpt-4o-mini for Mistral, Phi, Llama, or literally anything that can complete ..."
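Spreading activation itself is a classic, fully deterministic graph algorithm: seed the query nodes with energy, fan it out along weighted edges with decay, and rank nodes by accumulated activation. A minimal sketch of the idea (an illustrative reimplementation, not the engine's actual code):

```python
# Deterministic spreading activation over a weighted knowledge graph.
# The LLM never sees this step; it only reads the top-ranked results.

def spread_activation(graph, seeds, decay=0.5, hops=3):
    """graph: {node: [(neighbor, edge_weight), ...]}; seeds: {node: energy}."""
    activation = dict(seeds)
    frontier = dict(seeds)
    for _ in range(hops):
        next_frontier = {}
        for node, act in frontier.items():
            for neighbor, weight in graph.get(node, []):
                # energy decays each hop as it fans out
                inc = act * weight * decay
                next_frontier[neighbor] = next_frontier.get(neighbor, 0.0) + inc
        for node, inc in next_frontier.items():
            activation[node] = activation.get(node, 0.0) + inc
        frontier = next_frontier
    return sorted(activation.items(), key=lambda kv: -kv[1])

graph = {
    "python": [("language", 0.9), ("snake", 0.4)],
    "language": [("programming", 0.8)],
    "snake": [("animal", 0.8)],
}
ranked = spread_activation(graph, {"python": 1.0})
print(ranked)  # seed stays on top; related concepts ranked by decayed energy
```

Because the ranking is pure arithmetic on CPU, the same query always returns the same evidence, which is the "no LLM reasoning" guarantee the post is selling.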
"A couple of weeks ago i was wondering about the impact of KV quantization, so i tried looking for any PPL or KLD measurements but didn't find anything extensive. I did some of my own and these are the results. Models included: Qwen3.5 9B, Qwen3 VL 8B, Gemma 3 12B, Ministral 3 8B, Irix 12B (Mistral N..."
💬 Reddit Discussion: 7 comments
🐐 GOATED ENERGY
🎯 Quantization impact • Model performance evaluation • Measurement methodology
💬 "a pure Q4 quant while leaving KV at F16 already leads to 0.07 mean KLD change"
• "for the purposes of measuring KLD / PPL with respect to quantizing the KV cache, this method at longer contexts would be more robust"
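The "mean KLD" being quoted is the KL divergence between the model's full-precision next-token distribution and the same model's distribution with a quantized KV cache, averaged over tokens. A toy sketch of the metric on hand-written logits (real measurements run the actual model twice):

```python
# KL(P || Q) between baseline and KV-quantized output distributions.
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p_logits, q_logits):
    """KL(P || Q) in nats between two logit vectors over the vocab."""
    p, q = softmax(p_logits), softmax(q_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

full = [2.0, 1.0, 0.1]   # baseline logits (e.g. F16 KV cache)
quant = [1.9, 1.1, 0.1]  # same model, quantized KV cache
print(kl_divergence(full, quant))
```

Averaging this per-token value over a long evaluation text gives the mean KLD figure; unlike perplexity it directly measures how much the quantized model's distribution drifts from its own baseline, rather than from the data.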
"Large language models (LLMs) demonstrate strong generative capabilities but remain vulnerable to hallucination and unreliable reasoning under adversarial prompting. Existing safety approaches -- such as reinforcement learning from human feedback (RLHF) and output filtering -- primarily operate at th..."
"Cursor can now search millions of files and find results in milliseconds.
This dramatically speeds up how fast agents complete tasks.
We're sharing how we built Instant Grep, including the algorithms and tradeoffs behind the design.
[https://cursor.com/blog/fast-regex-search](https://c..."
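Cursor's exact design is in the linked post; the standard foundation for this kind of tool is a trigram index (popularized by Google Code Search): index every 3-gram per file, prune to files containing all trigrams of a literal the pattern requires, and run the real regex only on the survivors. A sketch assuming that approach, with illustrative names:

```python
# Trigram-index sketch: prune the file set before running the real regex.
from collections import defaultdict
import re

def trigrams(text):
    return {text[i:i + 3] for i in range(len(text) - 2)}

def build_index(files):
    """files: {name: contents} -> {trigram: set of file names}."""
    index = defaultdict(set)
    for name, text in files.items():
        for gram in trigrams(text):
            index[gram].add(name)
    return index

def candidates(index, literal, files):
    """Files containing every trigram of a literal the regex must match."""
    grams = trigrams(literal)
    if not grams:
        return set(files)  # literal too short to prune with trigrams
    return set.intersection(*(index.get(g, set()) for g in grams))

files = {
    "a.py": "def instant_grep(pattern): ...",
    "b.py": "print('hello world')",
}
index = build_index(files)
cands = candidates(index, "instant", files)
hits = [name for name in cands if re.search(r"instant_\w+", files[name])]
print(hits)  # only a.py survives the trigram filter and the regex
```

The speedup comes from the filter step being set intersections over a prebuilt index, so the expensive regex engine touches only a handful of files even in a repo with millions.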
via Arxiv 👤 Amartya Roy, Rasul Tutunov, Xiaotong Ji et al. 📅 2026-03-20
⚡ Score: 6.7
"LLMs are increasingly used as general-purpose reasoners, but long inputs remain bottlenecked by a fixed context window. Recursive Language Models (RLMs) address this by externalising the prompt and recursively solving subproblems. Yet existing RLMs depend on an open-ended read-eval-print loop (REPL)..."
"This is a detailed document on how to design an AI chip, both software and hardware.
I used to work at Google on TPUs and at Nvidia on GPUs, so I have some idea about this, though the design I suggest is not the same as TPUs or GPUs.
I also included many anecdotes from my career in Silicon Valley."
💬 Reddit Discussion: 5 comments
🐝 BUZZING
🎯 Novel non-CPU architectures • Startup vs. big company strategy • LLM-assisted design exploration
💬 "pursuing anything lower than 10-100x faster isn't appealing to investors"
• "the right angle is to find a way to make the production of chips easier"
via Arxiv 👤 Shang-Jui Ray Kuo, Paola Cascante-Bonilla 📅 2026-03-19
⚡ Score: 6.6
"Large vision--language models (VLMs) often use a frozen vision backbone, whose image features are mapped into a large language model through a lightweight connector. While transformer-based encoders are the standard visual backbone, we ask whether state space model (SSM) vision backbones can be a st..."
via Arxiv 👤 Wenjing Hong, Zhonghua Rong, Li Wang et al. 📅 2026-03-20
⚡ Score: 6.6
"Large Language Models (LLMs) have been widely deployed, especially through free Web-based applications that expose them to diverse user-generated inputs, including those from long-tail distributions such as low-resource languages and encrypted private data. This open-ended exposure increases the ris..."
"MiMo-V2-Flash is open source, scores 73.4% on SWE-Bench (#1 among open source models), and costs $0.10 per million input tokens. That's comparable to Claude Sonnet at 3.5% of the price.
MiMo-V2-Pro ranks #3 globally on agent benchmarks behind Claude Opus 4.6, with a 1M token context window, at $1/$..."
💬 Reddit Discussion: 36 comments
🐝 BUZZING
🎯 Pricing pressure • Open-source transparency • Disruption of enterprise
💬 "Cheap is disruptive, but enterprise buyers still pay for reliability, safety, and support"
• "The interesting pressure point is the developer and startup tier"
via Arxiv 👤 Carlos Hinojosa, Clemens Grange, Bernard Ghanem 📅 2026-03-19
⚡ Score: 6.5
"Vision-language models (VLMs) are increasingly deployed in real-world and embodied settings where safety decisions depend on visual context. However, it remains unclear which visual evidence drives these judgments. We study whether multimodal safety behavior in VLMs can be steered by simple semantic..."
via Arxiv 👤 Zehao Li, Zhenyu Wu, Yibo Zhao et al. 📅 2026-03-19
⚡ Score: 6.4
"Reinforcement Learning (RL) has the potential to improve the robustness of GUI agents in stochastic environments, yet training is highly sensitive to the quality of the reward function. Existing reward approaches struggle to achieve both scalability and performance. To address this, we propose OS-Th..."
"built an AI companion on Qwen3.5-27B dense. 35k SFT examples, 46k DPO pairs all hand-built. personality is in the weights not the prompt. she stays in character even under jailbreak pressure
about 2000 conversations from real users so far. things i didnt expect:
the model defaults to therapist mod..."
💬 Reddit Discussion: 41 comments
😐 MID OR MIXED
🎯 Personification of LLMs • Evaluating LLM performance • Dangers of LLM personification
💬 "People call she (or sometimes he) their cars, ships, planes, and other objects"
• "Calling your LLM 'she' *is* dangerous"
💬 "The transition from Level 2 to Level 3 is where most people either give up or become true power users."
• "The forcing function you mentioned is real though and I have seen plenty of developers stall at Level 2 because their projects never grow complex enough to demand more."
via Arxiv 👤 Maksym Del, Markus Kängsepp, Marharyta Domnich et al. 📅 2026-03-19
⚡ Score: 6.3
"Uncertainty estimation is critical for deploying reasoning language models, yet remains poorly understood under extended chain-of-thought reasoning. We study parallel sampling as a fully black-box approach using verbalized confidence and self-consistency. Across three reasoning models and 17 tasks s..."
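Self-consistency as a black-box confidence signal is refreshingly simple: sample the model k times in parallel on the same question and use the majority answer's agreement rate as the confidence. A sketch with canned strings standing in for real model calls:

```python
# Self-consistency over parallel samples: majority answer + agreement rate.
from collections import Counter

def self_consistency(answers):
    """Return (majority answer, fraction of samples that agree with it)."""
    counts = Counter(answers)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(answers)

# e.g. final answers from 5 parallel chain-of-thought samples
samples = ["42", "42", "41", "42", "42"]
answer, confidence = self_consistency(samples)
print(answer, confidence)  # 42 0.8
```

The paper's point is that this kind of agreement score behaves differently under long chain-of-thought than practitioners assume, but the mechanism itself needs nothing beyond repeated sampling.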