🌐 WELCOME TO METAMESH.BIZ +++ Google drops Titans architecture mixing RNN efficiency with transformer vibes for 2M+ context (because attention was getting expensive) +++ Turns out some AI systems are mathematically uncomputable, which is philosophy's revenge on computer science +++ 4B parameter model hitting 85% of GPT-4 performance on your laptop while OpenAI burns another datacenter +++ Amazon scientist promises to end hallucinations with "automated reasoning," which sounds suspiciously like unit tests with a PhD +++ YOUR NEXT MODEL WILL BE TOO SMALL TO FAIL AND TOO CHEAP TO METER +++ 🌐 •
+++ Researchers used reinforcement learning to auto-generate GPU kernels that outpace cuBLAS, proving that brute-force search plus compute beats decades of expert optimization (and making every performance engineer slightly nervous). +++
π¬ "You can find the nearest neighbor configuration (larger than yours) and pad with zeros."
β’ "Escaping the distribution and actually creating novel sequences of instructions or even patterns seems difficult to say the least."
🛠️ TOOLS
Google Titans Architecture for Long Context
2x SOURCES 📅 2025-12-05
⚡ Score: 8.3
+++ Google ships an RNN/transformer hybrid that handles 2M token contexts without sacrificing speed, proving that sometimes the answer to "can we have it all" is actually yes, not another research paper. +++
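The blurb's "RNN efficiency plus attention" shape is easy to gesture at in code. A crude toy of the general recurrent-memory-plus-local-attention pattern; this is NOT Titans' actual learned memory module, and every name and dimension here is ours:

```python
import torch, torch.nn as nn

class ToyHybridBlock(nn.Module):
    """Sliding-window attention for local context plus one recurrent memory
    state for everything older: linear scan over chunks, O(window^2) attention."""
    def __init__(self, d=64, window=128):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, num_heads=4, batch_first=True)
        self.mem_update = nn.GRUCell(d, d)   # compresses processed chunks into one state
        self.window = window

    def forward(self, x):                    # x: (batch, seq, d)
        b, _, d = x.shape
        mem, outs = x.new_zeros(b, d), []
        for i in range(0, x.size(1), self.window):        # stream chunk by chunk
            chunk = x[:, i:i + self.window]
            kv = torch.cat([mem.unsqueeze(1), chunk], 1)  # memory token + local window
            out, _ = self.attn(chunk, kv, kv)
            outs.append(out)
            mem = self.mem_update(chunk.mean(1), mem)     # fold the chunk into memory
        return torch.cat(outs, dim=1)
```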
via Arxiv 👤 Itay Yona, Amir Sarid, Michael Karasik et al. 📅 2025-12-03
⚡ Score: 7.9
"We introduce \textbf{Doublespeak}, a simple \emph{in-context representation hijacking} attack against large language models (LLMs). The attack works by systematically replacing a harmful keyword (e.g., \textit{bomb}) with a benign token (e.g., \textit{carrot}) across multiple in-context examples, pr..."
+++ Google's delayed reasoning model finally arrives for paying subscribers, suggesting those November safety concerns either resolved themselves or simply needed better PR timing to land. +++
🎯 AI model capabilities • Computer vision applications • Automation potential
💬 "Gemini 3 Pro with Code Execution is able to one-shot the problem"
• "Maybe not quite a transformer, but interesting that it could properly interpret 'dog leg' and ID them"
via r/ChatGPT 👤 u/Impossible-Power6989 📅 2025-12-05
⬆️ 128 ups ⚡ Score: 7.4
"I wanted to share some (rough) numbers comparing a small, on-device language model (Qwen3-VL-4B Instruct; multi-modal) which I have been playing around with. We've been discussing it over on r/LocalLLM, but we're pretty nerdcore over there, and I figure there are people here who might like to know.
..."
💬 Reddit Discussion: 37 comments
📈 BUZZING
🎯 Local LLM Performance • Practical LLM Applications • Excitement for Local LLM
💬 "this is a *baby* llm"
• "Even though I'm not personally switching over to local, that's great for (a) people on underpowered hardware willing to sacrifice that performance for privacy/control and (b) for future prospects of better local LLM"
🎯 AI adoption trends • Data privacy concerns • Infrastructure requirements
💬 "the weekly token consumption keeps on rising, and it's already in trillions"
• "we may well see multiple companies hit six, seven, or even eight trillion dollars in market cap"
via Arxiv 👤 Prakhar Kaushik, Shravan Chaudhari, Ankit Vaidya et al. 📅 2025-12-04
⚡ Score: 7.3
"We show that deep neural networks trained across diverse tasks exhibit remarkably similar low-dimensional parametric subspaces. We provide the first large-scale empirical evidence that demonstrates that neural networks systematically converge to shared spectral subspaces regardless of initialization..."
via Arxiv 👤 Hongzhan Lin, Zhiqi Bai, Xinmiao Zhang et al. 📅 2025-12-03
⚡ Score: 7.1
"Transformer decoders have achieved strong results across tasks, but the memory required for the KV cache becomes prohibitive at long sequence lengths. Although Cross-layer KV Cache sharing (e.g., YOCO, CLA) offers a path to mitigate KV Cache bottleneck, it typically underperforms within-layer method..."
💡 AI NEWS BUT ACTUALLY GOOD
The revolution will not be televised, but Claude will email you once we hit the singularity.
Get the stories that matter in Today's AI Briefing.
Powered by Premium Technology Intelligence Algorithms • Unsubscribe anytime
"In democracies, major policy decisions typically require some form of majority or consensus, so elites must secure mass support to govern. Historically, elites could shape support only through limited instruments like schooling and mass media; advances in AI-driven persuasion sharply reduce the cost..."
via Arxiv 👤 Oren Rachmil, Roy Betser, Itay Gershon et al. 📅 2025-12-03
⚡ Score: 7.1
"Aligning proprietary large language models (LLMs) with internal organizational policies has become an urgent priority as organizations increasingly deploy LLMs in sensitive domains such as legal support, finance, and medical services. Beyond generic safety filters, enterprises require reliable mecha..."
via Arxiv 👤 Jingyang Ou, Jiaqi Han, Minkai Xu et al. 📅 2025-12-03
⚡ Score: 7.0
"Reinforcement Learning (RL) has proven highly effective for autoregressive language models, but adapting these methods to diffusion large language models (dLLMs) presents fundamental challenges. The core difficulty lies in likelihood approximation: while autoregressive models naturally provide token..."
"Anthropic released a new *Tool Search* feature intended to solve the βtoo many tools in contextβ problem by letting models discover tools just-in-time instead of loading thousands of definitions.
We wanted to see how it behaves in a realistic agent environment, so we ran a small but systematic benc..."
🎯 Task Decomposition • Tool Integration • Limitations of LLMs
💬 "letting the LM figure out necessary subtasks and then looking for appropriate tools"
• "the fix isn't just planning; you need a tight intent layer and a smaller, well-tagged tool catalog"
via Arxiv 👤 Xiaolong Li, Youping Gu, Xi Lin et al. 📅 2025-12-03
⚡ Score: 7.0
"Attention mechanisms are the core of foundation models, but their quadratic complexity remains a critical bottleneck for scaling. This challenge has driven the development of efficient attention mechanisms, with sparsity emerging as the dominant paradigm. Current methods typically retain or discard..."
via Arxiv 👤 Zayne Sprague, Jack Lu, Manya Wadhwa et al. 📅 2025-12-03
⚡ Score: 7.0
"Reasoning models leveraging long chains of thought employ various cognitive skills, such as verification of their answers, backtracking, retrying by an alternate method, and more. Previous work has shown that when a base language model exhibits these skills, training that model further with reinforc..."
"Hi r/LocalLLaMA , you may know me from the latest blogs I've shared on mburaksayici.com/ , discussing LLM and RAG systems, and RAG Boilerplates.
When I study evaluation frameworks on LLMs, I've seen they require lots of API calls to generate golden datasets, open-ended ..."
via Arxiv 👤 Ying Wang, Zhen Jin, Jiexiong Xu et al. 📅 2025-12-03
⚡ Score: 6.9
"As augmented large language models (LLMs) with external tools become increasingly popular in web applications, improving augmented LLM inference serving efficiency and optimizing service-level objectives (SLOs) are critical for enhancing user experience. To achieve this, inference systems must maxim..."
via Arxiv 👤 Hang Xu, Linjiang Huang, Feng Zhao 📅 2025-12-03
⚡ Score: 6.9
"Test-time scaling (TTS) aims to achieve better results by increasing random sampling and evaluating samples based on rules and metrics. However, in text-to-image(T2I) diffusion models, most related works focus on search strategies and reward models, yet the impact of the stochastic characteristic of..."
via Arxiv 👤 Zoë Ruha Bell, Anvith Thudi, Olive Franzese-McLaughlin et al. 📅 2025-12-03
⚡ Score: 6.9
"Training with differential privacy (DP) provides a guarantee to members in a dataset that they cannot be identified by users of the released model. However, those data providers, and, in general, the public, lack methods to efficiently verify that models trained on their data satisfy DP guarantees...."
via Arxiv 👤 Zexin Lin, Hawen Wan, Yebin Zhong et al. 📅 2025-12-03
⚡ Score: 6.8
"Vision-Language Models (VLMs) deployed in safety-critical applications such as autonomous driving must handle continuous visual streams under imperfect conditions. However, existing benchmarks focus on static, high-quality images and ignore temporal degradation and error propagation, which are criti..."
via Arxiv 👤 Yizhou Zhao, Zhiwei Steven Wu, Adam Block 📅 2025-12-03
⚡ Score: 6.8
"Watermarking aims to embed hidden signals in generated text that can be reliably detected when given access to a secret key. Open-weight language models pose acute challenges for such watermarking schemes because the inference-time interventions that dominate contemporary approaches cannot be enforc..."
"Some of you might have seen my post here about my open-source implementation of ACE (agents that learn from execution feedback). I connected the framework to Claude Code and let it run in a continuous loop..."
💬 Reddit Discussion: 21 comments
📈 BUZZING
🎯 Source code analysis • Prompt engineering • AI capabilities
💬 "It's clear that the prompts in Claude, Codex and Antigravity were all carefully human-authored."
• "How much value do you think came from the particular methodologies embodied in these prompts?"
via Arxiv 👤 Kazi Abrab Hossain, Jannatul Somiya Mahmud, Maria Hossain Tuli et al. 📅 2025-12-03
⚡ Score: 6.8
"While recent developments in large language models have improved bias detection and classification, sensitive subjects like religion still present challenges because even minor errors can result in severe misunderstandings. In particular, multilingual models often misrepresent religions and have dif..."
via Arxiv 👤 Michael Staniek, Artem Sokolov, Stefan Riezler 📅 2025-12-03
⚡ Score: 6.7
"Machine learning for early prediction in medicine has recently shown breakthrough performance, however, the focus on improving prediction accuracy has led to a neglect of faithful explanations that are required to gain the trust of medical practitioners. The goal of this paper is to teach LLMs to fo..."
via Arxiv 👤 Florian Bordes, Candace Ross, Justine T Kao et al. 📅 2025-12-03
⚡ Score: 6.7
"The rapid proliferation of benchmarks has created significant challenges in reproducibility, transparency, and informed decision-making. However, unlike datasets and models -- which benefit from structured documentation frameworks like Datasheets and Model Cards -- evaluation methodologies lack syst..."
via Arxiv 👤 MohammadHossein Bateni, Vincent Cohen-Addad, Yuzhou Gu et al. 📅 2025-12-04
⚡ Score: 6.7
"Large language models (LLMs) have proven to be highly effective for solving complex reasoning tasks. Surprisingly, their capabilities can often be improved by iterating on previously generated solutions. In this context, a reasoning plan for generating and combining a set of solutions can be thought..."
"A few weeks ago we launched Structured Outputs in public beta for Claude Sonnet 4.5 and Opus 4.1βgiving you 100% schema compliance and perfectly formatted responses on every request.
Today, we'..."
💬 Reddit Discussion: 7 comments
📈 BUZZING
🎯 Structured output support • Tool-building and integrations • LLM performance and engineering
💬 "Structured outputs are lowkey what is powering this entire agentic revolution."
• "You write some guardrails around it… claude is very good at sticking to your desired format."
via Arxiv 👤 Andreas Koukounas, Georgios Mastrapas, Florian Hönicke et al. 📅 2025-12-03
⚡ Score: 6.6
"We present Jina-VLM, a 2.4B parameter vision-language model that achieves state-of-the-art multilingual visual question answering among open 2B-scale VLMs. The model couples a SigLIP2 vision encoder with a Qwen3 language backbone through an attention-pooling connector that enables token-efficient pr..."
via Arxiv 👤 Monishwaran Maheswaran, Rishabh Tiwari, Yuezhou Hu et al. 📅 2025-12-04
⚡ Score: 6.6
"Modern Large Language Models achieve impressive reasoning capabilities with long Chain of Thoughts, but they incur substantial computational cost during inference, and this motivates techniques to improve the performance-cost ratio. Among these techniques, Speculative Decoding accelerates inference..."
"Hi all,
I often see people say that using APIs is always cheaper and that running models locally is mainly for other reasons like privacy or control.
I am choosing infrastructure for my company with LLM features and I am trying to decide between frontier model APIs, AWS GPU rentals, or buying and s..."
💬 Reddit Discussion: 102 comments
📈 BUZZING
🎯 Hardware infrastructure costs • API vs. self-hosting trade-offs • Scalability and maintenance challenges
💬 "Never, we just like burning money :)"
• "Local inference is sick. It's awesome and unlocks so many possibilities."
via Arxiv 👤 Shashwat Shankar, Subhranshu Pandey, Innocent Dengkhw Mochahari et al. 📅 2025-12-04
⚡ Score: 6.5
"Large Language Model(LLM) inference demands massive compute and energy, making domain-specific tasks expensive and unsustainable. As foundation models keep scaling, we ask: Is bigger always better for hardware design? Our work tests this by evaluating Small Language Models coupled with a curated age..."
"Most quality loss wasnβt from model or retriever choice it was from embedding drift:
* Inconsistent preprocessing
* Mixed embeddings from partial refreshes
* Chunk-boundary drift upstream
* Vector-norm shifts across versions
* Index rebuild variance
This caused unpredictable NN recall and unstable..."
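A minimal probe for several of the failure modes in that list: re-embed a frozen probe set with the new pipeline and compare against stored reference vectors. Function names and thresholds below are ours, not the post's:

```python
import numpy as np

def drift_report(ref_vecs: np.ndarray, new_vecs: np.ndarray, cos_floor=0.98):
    """Compare re-embedded probe texts against reference vectors from the
    previous pipeline version; flags preprocessing drift and norm shifts."""
    ref = ref_vecs / np.linalg.norm(ref_vecs, axis=1, keepdims=True)
    new = new_vecs / np.linalg.norm(new_vecs, axis=1, keepdims=True)
    cos = np.sum(ref * new, axis=1)                       # per-probe cosine similarity
    norm_ratio = np.linalg.norm(new_vecs, axis=1) / np.linalg.norm(ref_vecs, axis=1)
    return {
        "min_cosine": float(cos.min()),                   # < cos_floor => drift
        "drifted_probes": int((cos < cos_floor).sum()),
        "median_norm_ratio": float(np.median(norm_ratio)),  # != 1.0 => norm shift
    }
```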
"This will always be the most iconic video forever for AI,will smith will be the best test subject for every new tool in market , this time I made this on Kling 2.6 on Higgsfield and prompt generated using ChatGPT..."
"We sometimes think RAG breaks because the model isnβt good enough.
But the failures are almost always systemic.
Hereβs the uncomfortable bit:
RAG collapses because the preprocessing pipeline is unmonitored, not because the LLM lacks intelligence.
We use this checklist before you change anything ..."
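One cheap way to make that "monitor the pipeline, not the model" idea enforceable: fingerprint every preprocessing run so silent changes show up as diffs. The config keys below are illustrative, not a standard:

```python
import hashlib, json, statistics

def pipeline_fingerprint(chunks: list[str], config: dict) -> dict:
    """Summary record to log per ingestion run; any knob change or chunking
    shift changes the fingerprint and becomes visible before quality drops."""
    return {
        "config_hash": hashlib.sha256(
            json.dumps(config, sort_keys=True).encode()
        ).hexdigest()[:12],
        "n_chunks": len(chunks),
        "median_chunk_chars": statistics.median(map(len, chunks)),
    }

cfg = {"splitter": "recursive", "chunk_size": 512, "overlap": 64, "embedder": "v2"}
print(pipeline_fingerprint(["chunk one...", "chunk two..."], cfg))
```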