⚡ BREAKTHROUGH
⬆️ 45 ups
⚡ Score: 8.1
"Ran a bunch of experiments with Graph RAG (KET-RAG) on multi-hop question answering. Turns out **retrieval** is basically **solved**, the answer is in the context 77 to 91% of the time. The **bottleneck is reasoning**: 73 to 84% of wrong answers come from the model failing to connect the dots, not f..."
🎯 Model Improvements • Reasoning vs Retrieval • Graph Compression
💬 "the finding that 73-84% of failures are reasoning not retrieval is honestly the most important takeaway here"
• "the graph compression piece is interesting too - cutting context by 60% without extra llm calls probably helps more than ppl realize"
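The retrieval-vs-reasoning split the post describes can be measured with a simple heuristic: if the gold answer appears in the retrieved context but the model still got it wrong, count it as a reasoning failure. A minimal sketch (the field names are hypothetical, not the poster's actual harness):

```python
def attribute_failures(examples):
    """Split wrong answers into retrieval vs reasoning failures.

    Each example is a dict with keys: 'context' (retrieved text),
    'gold' (reference answer), 'pred' (model answer).
    """
    retrieval_fail, reasoning_fail = 0, 0
    for ex in examples:
        if ex["pred"].strip().lower() == ex["gold"].strip().lower():
            continue  # correct answer, not a failure
        if ex["gold"].lower() in ex["context"].lower():
            reasoning_fail += 1  # answer was retrieved but model missed it
        else:
            retrieval_fail += 1  # answer never made it into the context
    total = retrieval_fail + reasoning_fail
    return {
        "reasoning_share": reasoning_fail / total if total else 0.0,
        "retrieval_share": retrieval_fail / total if total else 0.0,
    }
```

Substring matching is a crude proxy (paraphrased answers count as retrieval failures), but it is enough to see whether the bottleneck is the context or the model.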
🤖 AI MODELS
⬆️ 1763 ups
⚡ Score: 8.0
"Been building a daily research workflow on Claude. Kept getting confident-sounding outputs with zero sources. The kind of stuff that sounds right but you can't verify.
I stumbled into Anthropic's "Reduce Hallucinations" documentation page by accid..."
🎯 Tradeoffs in AI Guardrails • Customization and User Responsibility • General Usefulness vs Specific Needs
💬 "there's a tradeoff"
• "It's the user's responsibility to be informed and to adjust it for their needs"
🛠️ TOOLS
🔺 479 pts
⚡ Score: 7.8
🎯 Hardware pricing • Hardware customization • Sustainability and recycling
💬 "the cheapest box seems pricey at 12 for what is essentially a few gaming GPUs"
• "Nobody is going to order a $10 million piece of infrastructure through your website's order form"
⚡ BREAKTHROUGH
🔺 2 pts
⚡ Score: 7.7
🛡️ SAFETY
🔺 1 pt
⚡ Score: 7.4
🛠️ TOOLS
⬆️ 1 up
⚡ Score: 7.4
"I've been running a bash daemon that watches my Linear board for issues tagged "claude" and spawns autonomous Claude Code instances to implement them - in isolated git worktrees, with full transcripts, up to 5 concurrent workers.
This applies equally well to Cursor CLI:
Here's the workflow: ..."
🎯 Worktree management • Autonomous agents • Merge conflict resolution
💬 "Worktrees per agent is the right call"
• "The key insight you nailed is keeping tasks small and well-scoped"
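The worktree-per-agent pattern is the core of this setup: each issue gets its own branch and working directory, so concurrent agents never step on each other. A minimal sketch of that step (the branch/directory naming scheme is a hypothetical stand-in for the poster's daemon, not their exact code):

```python
import pathlib
import subprocess

def spawn_agent(repo: str, issue_id: str, base: str = "HEAD") -> pathlib.Path:
    """Create an isolated git worktree for one issue before handing it
    to an agent process."""
    wt = pathlib.Path(repo) / ".worktrees" / issue_id
    wt.parent.mkdir(exist_ok=True)
    # One branch + one directory per issue keeps agents fully isolated.
    subprocess.run(
        ["git", "-C", repo, "worktree", "add",
         "-b", f"agent/{issue_id}", str(wt), base],
        check=True, capture_output=True,
    )
    # Here the daemon would launch the agent CLI inside `wt` (e.g. via
    # subprocess.Popen), capping the pool at N concurrent workers.
    return wt
```

When a task finishes, `git worktree remove` plus a merge (or PR) of `agent/<issue>` cleans up; keeping tasks small, as the commenter notes, is what keeps those merges trivial.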
🔬 RESEARCH
"Hey everyone,
When building systems around modern open-source LLMs, one of the biggest issues is that they can confidently hallucinate or state an incorrect answer with 95%+ probability. This makes it really hard to deploy them reliably in the real world if we don't understand their "overconfid..."
🎯 Model confidence • Calibration of confidence • Benchmark evaluation
💬 "Is confidence a score written out by the LLM?"
• "how confident are they on their confidence rating?"
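The commenter's question ("how confident are they on their confidence rating?") is exactly what calibration metrics answer. A standard one is expected calibration error (ECE): bin predictions by stated confidence and compare each bin's average confidence to its empirical accuracy. A minimal sketch, assuming confidences in [0, 1] and binary correctness labels:

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: bin by confidence, compare mean confidence to accuracy
    per bin, and weight each gap by the bin's share of samples."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # clamp conf == 1.0
        bins[idx].append((conf, ok))
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(ok for _, ok in b) / len(b)
        ece += (len(b) / n) * abs(avg_conf - accuracy)
    return ece
```

A model that says "95% sure" but is right only half the time lands in the top bin with a 0.45 gap, which is the "overconfidence" failure mode the post is about.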
🔬 RESEARCH
via arXiv
👤 Zhuolin Yang, Zihan Liu, Yang Chen et al.
📅 2026-03-19
⚡ Score: 7.3
"We introduce Nemotron-Cascade 2, an open 30B MoE model with 3B activated parameters that delivers best-in-class reasoning and strong agentic capabilities. Despite its compact size, its mathematical and coding reasoning performance approaches that of frontier open models. It is the second open-weight..."
🛠️ TOOLS
⬆️ 132 ups
⚡ Score: 7.3
"I've been running Qwen 3.5 27B Q4_K_M on a Blackwell RTX PRO 4000 (24GB) for agentic coding work and hit a wall with mainline llama.cpp. Switched to the ik_llama.cpp fork today and the difference is staggering. Posting real numbers in case it helps others.
Hardware
Lenovo ThinkStation P520, Xeon W-..."
🎯 GPU performance optimization • Quantization challenges • Model architecture differences
💬 "your KV cache uses a different quant, which greatly slows down the speed"
• "The 26x is specifically the fused GDN kernel improvement for Qwen 3.5's hybrid SSM architecture"
🛡️ SAFETY
🔺 3 pts
⚡ Score: 7.2
🛠️ TOOLS
🔺 2 pts
⚡ Score: 7.1
🔬 RESEARCH
🔺 2 pts
⚡ Score: 7.1
🛠️ SHOW HN
🔺 5 pts
⚡ Score: 7.0
🔬 RESEARCH
"Large language models (LLMs) demonstrate strong generative capabilities but remain vulnerable to hallucination and unreliable reasoning under adversarial prompting. Existing safety approaches -- such as reinforcement learning from human feedback (RLHF) and output filtering -- primarily operate at th..."
🔒 SECURITY
🔺 8 pts
⚡ Score: 6.7
🛠️ TOOLS
⬆️ 258 ups
⚡ Score: 6.7
"External link discussion - see full content at original source."
🎯 Impressive AI capabilities • Autonomous problem-solving • Proprietary software challenges
💬 "The fact that it just brute-forced a 7z format from raw hex without any tools is genuinely unhinged."
• "It was super spooky... it was just working in a loop and I started to see new trace + exceptions show up on the console of my training process while it figured out the path."
🔬 RESEARCH
via arXiv
👤 Shang-Jui Ray Kuo, Paola Cascante-Bonilla
📅 2026-03-19
⚡ Score: 6.6
"Large vision--language models (VLMs) often use a frozen vision backbone, whose image features are mapped into a large language model through a lightweight connector. While transformer-based encoders are the standard visual backbone, we ask whether state space model (SSM) vision backbones can be a st..."
🛠️ TOOLS
⬆️ 28 ups
⚡ Score: 6.6
"When we use skills, plugins, or MCP tools, Claude reads long input schemas or injects prompt instructions. Those tokens are charged as input tokens and can be expensive at scale, especially for API usage.
We even ask Claude to explore other folders and sibling repositories, read files ..."
🎯 MCP vs. CLI tools • Token overhead • Tool discovery
💬 "Agents are naturally good at bash - reading files, writing files, piping commands, parsing output."
• "The one thing MCP does well is when it's tightly integrated (like Claude Code's built-in tools) - that feels natural because they control both sides."
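The overhead the post describes is easy to eyeball: serialize your tool schemas and estimate the tokens they add to every request. A rough sketch using the common ~4-characters-per-token rule of thumb (the example schema is hypothetical; exact counts require the model's own tokenizer):

```python
import json

# Hypothetical example; real MCP tool schemas are often far larger.
TOOLS = [
    {
        "name": "read_file",
        "description": "Read a UTF-8 text file from the workspace",
        "input_schema": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    },
]

def schema_token_overhead(tools, chars_per_token=4):
    """Rough per-request input-token cost of injecting tool schemas.

    Tool schemas ride along as input tokens on every call, so this
    estimate recurs on each request, not once per session.
    """
    return len(json.dumps(tools)) // chars_per_token
```

Multiply the result by your daily request volume to see why a CLI the agent drives through bash (no schema injected per call) can come out cheaper than a wide MCP tool surface.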
🛠️ TOOLS
🔺 5 pts
⚡ Score: 6.5
🎯 AI agents as computing paradigm • Systems-level changes for AI agents • Nvidia's AI advancements
💬 "What actually has to change at the systems level for agents to become a first-class workload?"
• "NVIDIA frames AI agents as the next computing paradigm"
🔬 RESEARCH
via arXiv
👤 Carlos Hinojosa, Clemens Grange, Bernard Ghanem
📅 2026-03-19
⚡ Score: 6.5
"Vision-language models (VLMs) are increasingly deployed in real-world and embodied settings where safety decisions depend on visual context. However, it remains unclear which visual evidence drives these judgments. We study whether multimodal safety behavior in VLMs can be steered by simple semantic..."
🔬 RESEARCH
via arXiv
👤 Zehao Li, Zhenyu Wu, Yibo Zhao et al.
📅 2026-03-19
⚡ Score: 6.4
"Reinforcement Learning (RL) has the potential to improve the robustness of GUI agents in stochastic environments, yet training is highly sensitive to the quality of the reward function. Existing reward approaches struggle to achieve both scalability and performance. To address this, we propose OS-Th..."
🔒 SECURITY
🔺 2 pts
⚡ Score: 6.4
📊 DATA
⬆️ 37 ups
⚡ Score: 6.4
"harmony4d, the precursor to the contact4d dataset. it's a large-scale multi-view video dataset of in-the-wild close human-human contact interactions:
https://huggingface.co/datasets/Voxel51/Harmony4D
toon3d has 12 scenes from popular hand-drawn cartoons and anime, each comprising 5-12 frames that ..."
🛠️ SHOW HN
🔺 4 pts
⚡ Score: 6.3
⚡ BREAKTHROUGH
🔺 1 pt
⚡ Score: 6.3
🔬 RESEARCH
via arXiv
👤 Maksym Del, Markus Kängsepp, Marharyta Domnich et al.
📅 2026-03-19
⚡ Score: 6.3
"Uncertainty estimation is critical for deploying reasoning language models, yet remains poorly understood under extended chain-of-thought reasoning. We study parallel sampling as a fully black-box approach using verbalized confidence and self-consistency. Across three reasoning models and 17 tasks s..."
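The self-consistency half of the black-box approach the abstract mentions reduces to a simple aggregation: sample N answers to the same prompt at nonzero temperature, take the majority answer, and use its vote share as the confidence estimate. A minimal sketch (this is the generic technique, not the paper's exact protocol):

```python
from collections import Counter

def self_consistency(samples):
    """Black-box confidence from parallel samples: return the majority
    answer and its vote share as the confidence estimate."""
    counts = Counter(samples)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(samples)
```

If 3 of 4 sampled chains end in "42", the method reports "42" with confidence 0.75; how well that share tracks actual accuracy under long chain-of-thought is what the paper evaluates.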
🛠️ TOOLS
⬆️ 17 ups
⚡ Score: 6.2
"Karpathy's autoresearch is awesome - agent edits train.py and runs tiny LLM experiments overnight. But it wants serious VRAM.
I forked it to run on normal cards like my 1080/3060:
* Auto-picks model size/depth/batch/seq len so it fits your VRAM (leaves buffer, no more OOM surpri..."
🛠️ SHOW HN
🔺 1 pt
⚡ Score: 6.2
🛠️ SHOW HN
🔺 2 pts
⚡ Score: 6.1
🛠️ TOOLS
🔺 1 pt
⚡ Score: 6.1