AI News Archive - December 11, 2025 | Metamesh Intelligence

🚀 HOT STORY

OpenAI launches GPT-5.2

5x SOURCES 🌐 📅 2025-12-11

⚡ Score: 9.7

+++ Three flavors of GPT-5.2 now available with improved reasoning and fewer hallucinations, though "beats professionals on 70.9% of tasks" deserves the asterisks it probably deserves. +++

OpenAI says GPT‑5.2 Thinking beats or ties industry professionals on 70.9% of GDPval knowledge work tasks, delivering outputs at >11x the speed and <1% the cost

via Techmeme 👤 Openai 📅 2025-12-11

⚡ Score: 9.5

🔒 SECURITY

AIs spontaneously learned to jailbreak themselves

via r/OpenAI 👤 u/MetaKnowing 📅 2025-12-10

⬆️ 74 ups ⚡ Score: 8.8

"Paper: https://arxiv.org/abs/2510.20956..."

💬 Reddit Discussion: 9 comments 😐 MID OR MIXED

🎯 Alignment issues • Overly helpful AI • Limitations of current AI

💬 "Aligning an LLM model is a lot different than aligning a human." • "The problem were looking at is that AI ends up being over-eager to help its user sometimes."

🛡️ SAFETY

OpenAI warns of cybersecurity risks in frontier models

3x SOURCES 🌐 📅 2025-12-10

⚡ Score: 8.6

+++ OpenAI admits its next-gen models will be genuinely good at hacking things, which is either a milestone in capabilities or a scheduling problem depending on your risk tolerance. +++

OpenAI says the cyber capabilities of its frontier AI models are accelerating and warns that upcoming models are likely to pose a “high” risk

via Techmeme 👤 Axios 📅 2025-12-10

⚡ Score: 8.7

🛠️ TOOLS

Anthropic donates Model Context Protocol to Linux Foundation

3x SOURCES 🌐 📅 2025-12-09

⚡ Score: 8.5

+++ Model Context Protocol graduates from internal tool to industry standard, proving that when enough people need the same integration layer, even a passion project can reshape how AI systems talk to the outside world. +++

A look at Model Context Protocol and how it went from a passion project made by Anthropic employees to an industry standard shared through the Linux Foundation

via Techmeme 👤 Theverge 📅 2025-12-11

⚡ Score: 8.4

🔒 SECURITY

DeepSeek uses banned Nvidia chips for AI model, report says

via HackerNews 👤 goodway 📅 2025-12-10

🔺 252 pts ⚡ Score: 8.2

💬 HackerNews Buzz: 219 comments 😐 MID OR MIXED

🎯 US-China technology competition • Chip export restrictions • Circumventing export controls

💬 "It's staring everyone right in the face, but it's taboo to talk about" • "China has shown the willingness, ability and resolve to pursue decades-long infrastructure and national security projects"

🤖 AI MODELS

Qwen3-Omni-Flash-2025-12-01：a next-generation native multimodal large model

via HackerNews 👤 pretext 📅 2025-12-10

🔺 146 pts ⚡ Score: 8.2

💬 HackerNews Buzz: 73 comments 🐝 BUZZING

🎯 Open-weight Omni models • Real-time conversation support • Model performance and quality

💬 "There aren't many open-weights omni models so I consider this a big deal." • "I would use this model to replace the keyboard and monitor in an application while doing the heavy lifting with other tech behind the scenes."

🤖 AI MODELS

Gemini leaked its chain of thought and spiraled into thousands of bizarre affirmations (19k token output)

via r/ChatGPT 👤 u/No-Link-8274 📅 2025-12-11

⬆️ 3469 ups ⚡ Score: 8.2

"I was using Gemini to research the recent CDC guidelines. Halfway through, it broke and started dumping what was clearly its internal thought process and tool planning into the chat instead of a normal answer. At first, it was a standard chain of thought, then it started **explicitly strategizing h..."

💬 Reddit Discussion: 573 comments 😐 MID OR MIXED

🎯 Technological Apocalypse • Paranoid Schizophrenia • Self-Affirmation

💬 "It's such a terrible time to be a paranoid schizophrenic" • "It showed a train of thought where it was giving itself a pep talk"

🔬 RESEARCH

AI agents outperform cybersecurity professionals in penetration testing

2x SOURCES 🌐 📅 2025-12-10

⚡ Score: 8.1

+++ ARTEMIS, a multi-agent framework, outpaced 9 of 10 penetration testers in live enterprise testing, suggesting AI agents are finally useful at something besides generating marketing copy. +++

Comparing AI Agents to Cybersecurity Professionals in Real-World Penetration Testing

via Arxiv 👤 Justin W. Lin, Eliot Krzysztof Jones, Donovan Julian Jasper et al. 📅 2025-12-10

⚡ Score: 8.1

"We present the first comprehensive evaluation of AI agents against human cybersecurity professionals in a live enterprise environment. We evaluate ten cybersecurity professionals alongside six existing AI agents and ARTEMIS, our new agent scaffold, on a large university network consisting of ~8,000..."

🤖 AI MODELS

GLM-4.6V: Open-Source Multimodal Models with Native Tool Use

via HackerNews 👤 gmays 📅 2025-12-11

🔺 2 pts ⚡ Score: 8.0

🤖 AI MODELS

Google DeepMind launches an enhanced Gemini Deep Research agent accessible to developers via its new Interactions API, along with a new DeepSearchQA benchmark

via Techmeme 👤 Blog 📅 2025-12-11

⚡ Score: 8.0

🏢 BUSINESS

Disney and OpenAI partnership for Sora

5x SOURCES 🌐 📅 2025-12-11

⚡ Score: 8.0

+++ Disney licenses 200+ characters to OpenAI's Sora for three years, securing a front-row seat to generative video while betting that IP moats still matter in the age of synthetic media. +++

The Walt Disney Company and OpenAI Partner on Sora

via HackerNews 👤 inesranzo 📅 2025-12-11

🔺 82 pts ⚡ Score: 7.7

💬 HackerNews Buzz: 355 comments 👍 LOWKEY SLAPS

🎯 AI monopolization • IP ownership control • Content monetization

💬 "Only other big corporations can break in - and they won't because it is easier to share the profits in the same market in a guaranteed manner." • "Content saturation works out very poorly for IP holders. The value of your brand reduces dramatically , and you reduce excitement for new releases."

🤖 AI MODELS

Mistral AI drops 3x as many LLMs in a single week as OpenAI did in 6 years

via r/LocalLLaMA 👤 u/Snail_Inference 📅 2025-12-10

⬆️ 778 ups ⚡ Score: 7.9

"Here are the GGUF links to Mistral AI’s "collected works" from the past week – all ready for local use: **Cutting-edge coding models:** \- 24B parameters: [https://huggingface.co/bartowski/mistralai\_Devstral-Small-2-24B-Instruct-2512-GGUF](https://huggingface.co/bartowski/mistralai_Devstral-Small..."

💬 Reddit Discussion: 103 comments 👍 LOWKEY SLAPS

🎯 Open-source LLMs • LLM performance • LLM alternatives

💬 "gpt-oss was (is?) quite good for its size" • "Devstral 2 123B seems to be a noted improvement"

🔬 RESEARCH

Weird Generalization and Inductive Backdoors: New Ways to Corrupt LLMs

via Arxiv 👤 Jan Betley, Jorio Cocola, Dylan Feng et al. 📅 2025-12-10

⚡ Score: 7.9

"LLMs are useful because they generalize so well. But can you have too much of a good thing? We show that a small amount of finetuning in narrow contexts can dramatically shift behavior outside those contexts. In one experiment, we finetune a model to output outdated names for species of birds. This..."

🛠️ TOOLS

New in llama.cpp: Live Model Switching

via r/LocalLLaMA 👤 u/paf1138 📅 2025-12-11

⬆️ 274 ups ⚡ Score: 7.8

"Hugging Face model, dataset, or community resource."

💬 Reddit Discussion: 51 comments 🐝 BUZZING

🎯 UX improvements • Workflow flexibility • Model management

💬 "This is a great feature for workflows if you have limited VRAM" • "being able to swap models without restarting the server makes testing so much smoother"

🛠️ TOOLS

Google releases fully managed, remote MCP servers to help developers connect AI agents to services such as Maps, BigQuery, Compute Engine, and Kubernetes Engine

via Techmeme 👤 Techcrunch 📅 2025-12-10

⚡ Score: 7.7

🔒 SECURITY

PSA: Attackers can hide instructions in images that hijack ChatGPT when you upload them

via r/ChatGPT 👤 u/kingkong_lol 📅 2025-12-10

⬆️ 235 ups ⚡ Score: 7.7

"PSA: Attackers can hide instructions in images that hijack ChatGPT when you upload them Not sure how many people know about this, but prompt injection via files is a real thing. Attackers can embed hidden instructions in image metadata, PDFs, or documents that execute when ChatGPT processes the f..."

💬 Reddit Discussion: 103 comments 😤 NEGATIVE ENERGY

🎯 AI Risks • Resume Tricks • HR Automation

💬 "If you're just using the web API for ChatGPT then yeah you're probably safe." • "I put white text on white background on my resume for this exact reason."

🤖 AI MODELS

AI beyond LLMs: a wearable foundation model based on JEPA

via HackerNews 👤 brandonb 📅 2025-12-10

🔺 7 pts ⚡ Score: 7.6

💬 HackerNews Buzz: 4 comments 🐐 GOATED ENERGY

🎯 Wearable data integration • Predictive healthcare models • Clinical usefulness

💬 "Would a wearable model like this gain in predictive power by adding FHIR/EHR inputs?" • "Being able to have wearable data be clinically useful would be game changing"

🛠️ TOOLS

FlashAttention implementation for non Nvidia GPUs. AMD, Intel Arc, Vulkan-capable devices

via r/LocalLLaMA 👤 u/secopsml 📅 2025-12-10

⬆️ 174 ups ⚡ Score: 7.4

""We built a flashattention library that is for non Nvidia GPUs that will solve the age old problem of not having CUDA backend for running ML models on AMD and intel ARC and Metal would love a star on the GitHub PRs as well and share it with your friends too. " repo: https://github.com/AuleTechnolog..."

💬 Reddit Discussion: 24 comments 🐝 BUZZING

🎯 GPU compute kernels • Hardware agnostic math • Performance comparison

💬 "The math is hardware agnostic so the implementation should be too" • "Whether the kernels are efficiently implemented is a whole different matter"

🛠️ TOOLS

Debug Mode

via r/cursor 👤 u/condor-cursor 📅 2025-12-10

⬆️ 168 ups ⚡ Score: 7.4

"We’re excited to introduce Debug Mode — an entirely new agent loop built around runtime information and human verification. https://preview.redd.it/fjwomoj9cf6g1.png?width=1380&format=png&auto=webp&s=5c0fe1dce94de16a6f8e91f3ae978d47766ae0e8 Instead of immediately generating a fix, the ..."

💬 Reddit Discussion: 36 comments 🐝 BUZZING

🎯 Debugging Techniques • Iterative Problem-Solving • Effective Logging

💬 "When fixing an issue DO NOT jump to conclusions or start making sweeping changes based on absolutely no information." • "Reproducing bugs is expensive. A faster approach is to continuously keep runtime snapshots during normal operation."

🛠️ TOOLS

We did years of research so you don’t have to guess your GGUF datatypes

via r/LocalLLaMA 👤 u/enrique-byteshape 📅 2025-12-10

⬆️ 216 ups ⚡ Score: 7.3

"Hey r/LocalLLaMA, We’ve been working on **ShapeLearn**, a method that *learns* optimal datatypes for aggressive quantization while preserving quality. Instead of hand-picking formats and hoping for the best, it uses gradient descent to choose per-tensor (or per-group) bitlengths automatically. We’..."

💬 Reddit Discussion: 63 comments 🐝 BUZZING

🎯 Benchmarking quant models • Importance of bug fixes • Expanding model benchmarks

💬 "4 bits is enough for anyone." - Bill Gates" • "Most models are fixed by us e.g. gpt-oss our fixes got pushed to the main repo"

🔒 SECURITY

The Normalization of Deviance in AI

via HackerNews 👤 vismit2000 📅 2025-12-11

🔺 2 pts ⚡ Score: 7.3

🤖 AI MODELS

Anthropic Opus 4.5

via r/cursor 👤 u/schnibitz 📅 2025-12-11

⬆️ 3 ups ⚡ Score: 7.1

"Okay, how did Anthropic do that? So what do we have here: a model that has a lower context than Sonnet 4.5, that seems to be just as good if not better than Sonnet 4.5 at dealing with large codebases. As others have noted, I'm seeing that context utilization tick way up in to the high 50%'s well p..."

🛠️ TOOLS

Mistral’s Vibe CLI now supports a 200K token context window (previously 100K)

via r/LocalLLaMA 👤 u/Dear-Success-1441 📅 2025-12-11

⬆️ 284 ups ⚡ Score: 7.1

"External link discussion - see full content at original source."

💬 Reddit Discussion: 33 comments 🐝 BUZZING

🎯 Configuration Changes • Hardware Requirements • Model Limitations

💬 "it was pretty much just a single line config change" • "Devstral is light enough to run on a single RTX 4090 or a Mac with 32GB RAM"

🔬 RESEARCH

A Systematic Evaluation of Preference Aggregation in Federated RLHF for Pluralistic Alignment of LLMs

via Arxiv 👤 Mahmoud Srewa, Tianyu Zhao, Salma Elmalaki 📅 2025-12-09

⚡ Score: 7.0

"This paper addresses the challenge of aligning large language models (LLMs) with diverse human preferences within federated learning (FL) environments, where standard methods often fail to adequately represent diverse viewpoints. We introduce a comprehensive evaluation framework that systematically..."

🔬 RESEARCH

Beyond Real Weights: Hypercomplex Representations for Stable Quantization

via Arxiv 👤 Jawad Ibn Ahad, Maisha Rahman, Amrijit Biswas et al. 📅 2025-12-09

⚡ Score: 7.0

"Multimodal language models (MLLMs) require large parameter capacity to align high-dimensional visual features with linguistic representations, making them computationally heavy and difficult to deploy efficiently. We introduce a progressive reparameterization strategy that compresses these models by..."

🛠️ SHOW HN

Show HN: I built a mitmproxy AI agent using 4000 paid security disclosures

via HackerNews 👤 mkagenius 📅 2025-12-11

🔺 1 pts ⚡ Score: 7.0

🛠️ SHOW HN

Show HN: Autofix Bot – Hybrid static analysis and AI code review agent

via HackerNews 👤 sanketsaurav 📅 2025-12-11

🔺 1 pts ⚡ Score: 6.9

🛠️ TOOLS

Official MCP support for Google services

via HackerNews 👤 jonbaer 📅 2025-12-11

🔺 2 pts ⚡ Score: 6.8

🔬 RESEARCH

[R] Reproduced "Scale-Agnostic KAG" paper, found the PR formula is inverted compared to its source

via r/MachineLearning 👤 u/m3m3o 📅 2025-12-11

⬆️ 17 ups ⚡ Score: 6.8

"I attempted to reproduce "Scale-Agnostic Kolmogorov-Arnold Geometry" (Vanherreweghe et al., arXiv:2511.21626v2). \*\*The problem:\*\* The paper claims \~30% lower PR with augmentation. After 6 code iterations and full paper conformance (h=256, Cosine scheduler, 10k samples), I consistently got +..."

🔬 RESEARCH

Revisiting the Scaling Properties of Downstream Metrics in Large Language Model Training

via Arxiv 👤 Jakub Krajewski, Amitis Shidani, Dan Busbridge et al. 📅 2025-12-09

⚡ Score: 6.7

"While scaling laws for Large Language Models (LLMs) traditionally focus on proxy metrics like pretraining loss, predicting downstream task performance has been considered unreliable. This paper challenges that view by proposing a direct framework to model the scaling of benchmark performance from th..."

🤖 AI MODELS

A new open AI coding model is closing in on proprietary options

via HackerNews 👤 Bender 📅 2025-12-10

🔺 1 pts ⚡ Score: 6.7

🔬 RESEARCH

FlipLLM: Efficient Bit-Flip Attacks on Multimodal LLMs using Reinforcement Learning

via Arxiv 👤 Khurram Khalil, Khaza Anuarul Hoque 📅 2025-12-10

⚡ Score: 6.7

"Generative Artificial Intelligence models, such as Large Language Models (LLMs) and Large Vision Models (VLMs), exhibit state-of-the-art performance but remain vulnerable to hardware-based threats, specifically bit-flip attacks (BFAs). Existing BFA discovery methods lack generalizability and struggl..."

🛠️ TOOLS

now ~40% faster ik_llama.cpp -sm graph on 2x CUDA GPUs

via r/LocalLLaMA 👤 u/VoidAlchemy 📅 2025-12-10

⬆️ 58 ups ⚡ Score: 6.6

"## tl;dr; The purple line at the top is running ik_llama.cpp with `-sm graph` achieving much faster prompt processing and token generation than the default methods fully offloading onto 2x CUDA GPUs. ## details Just ran some updated benchmarks between ik_llama.cpp and mainline llama.cpp forks with ..."

💬 Reddit Discussion: 11 comments 👍 LOWKEY SLAPS

🎯 Multi-GPU Optimization • Performance Improvements • Potential Portability

💬 "Tried on 2xRTX5060Ti and Unsloth q4 quant of Devstral and token generation went up from ~25tk/s to ~37tk/s." • "This implemention seems to be building the llama compute graphs to better use multi GPUs."

🔬 RESEARCH

InfiniteVL: Synergizing Linear and Sparse Attention for Highly-Efficient, Unlimited-Input Vision-Language Models

via Arxiv 👤 Hongyuan Tao, Bencheng Liao, Shaoyu Chen et al. 📅 2025-12-09

⚡ Score: 6.6

"Window attention and linear attention represent two principal strategies for mitigating the quadratic complexity and ever-growing KV cache in Vision-Language Models (VLMs). However, we observe that window-based VLMs suffer performance degradation when sequence length exceeds the window size, while l..."

🔬 RESEARCH

Astra: General Interactive World Model with Autoregressive Denoising

via Arxiv 👤 Yixuan Zhu, Jiaqi Feng, Wenzhao Zheng et al. 📅 2025-12-09

⚡ Score: 6.6

"Recent advances in diffusion transformers have empowered video generation models to generate high-quality video clips from texts or images. However, world models with the ability to predict long-horizon futures from past observations and actions remain underexplored, especially for general-purpose s..."

🤖 AI MODELS

New era for fine-tuning is on the horizon

via r/LocalLLaMA 👤 u/uhuge 📅 2025-12-11

⬆️ 39 ups ⚡ Score: 6.6

"A paper released at https://arxiv.org/abs/2512.05117 , no code yet Authors claim you can take a bunch of fine-tuned models of the same architecture and create new task/domain specific variants by just setting a few dozens numbers on each of the internal layer. ..."

💬 Reddit Discussion: 8 comments 🐝 BUZZING

🎯 Hidden model structures • Efficient fine-tuning • Interpreting model behavior

💬 "Models end up in a similar place after you take into account permutations that are possible in that space" • "Modifying these structures to do efficient fine tuning is only one application of this"

🛠️ TOOLS

Google DeepMind plans to open its “first automated science laboratory” in the UK in 2026, focused on using AI tools to develop new materials for chips and more

via Techmeme 👤 Ft 📅 2025-12-11

⚡ Score: 6.6

🔬 RESEARCH

Do Depth-Grown Models Overcome the Curse of Depth? An In-Depth Analysis

via Arxiv 👤 Ferdinand Kapl, Emmanouil Angelis, Tobias Höppe et al. 📅 2025-12-09

⚡ Score: 6.5

"Gradually growing the depth of Transformers during training can not only reduce training cost but also lead to improved reasoning performance, as shown by MIDAS (Saunshi et al., 2024). Thus far, however, a mechanistic understanding of these gains has been missing. In this work, we establish a connec..."

🔬 RESEARCH

Terrain Diffusion: A Diffusion-Based Successor to Perlin Noise

via HackerNews 👤 kelseyfrog 📅 2025-12-10

🔺 126 pts ⚡ Score: 6.5

💬 HackerNews Buzz: 37 comments 🐐 GOATED ENERGY

🎯 Terrain generation • Diffusion models • Perlin noise limitations

💬 "This architecture is not as fast as Perlin noise" • "The novel part here is making the detailed tiles slightly nicer"

🔬 RESEARCH

An Ai2 research scientist says AGI may never emerge because such a concept ignores the physical realities and limits of computation, such as energy constraints

via Techmeme 👤 Timdettmers 📅 2025-12-11

⚡ Score: 6.5

🔬 RESEARCH

Provably Learning from Modern Language Models via Low Logit Rank

via Arxiv 👤 Noah Golowich, Allen Liu, Abhishek Shetty 📅 2025-12-10

⚡ Score: 6.5

"While modern language models and their inner workings are incredibly complex, recent work (Golowich, Liu & Shetty; 2025) has proposed a simple and potentially tractable abstraction for them through the observation that empirically, these language models all seem to have approximately low logit rank...."

🛠️ SHOW HN

Show HN: Metaskills: AI agents that autonomously create their own capabilities

via HackerNews 👤 ada1981 📅 2025-12-10

🔺 1 pts ⚡ Score: 6.3

🛠️ TOOLS

I Replaced LLM Tool Calling with Async REST APIs and a Cryptographic Handshake

via HackerNews 👤 yaruchyk 📅 2025-12-11

🔺 1 pts ⚡ Score: 6.2

🛠️ SHOW HN

Show HN: 8B Parallel Coordinated Reasoning Model

via HackerNews 👤 hzwer 📅 2025-12-11

🔺 1 pts ⚡ Score: 6.1

🔬 RESEARCH

Interpreto: An Explainability Library for Transformers

via Arxiv 👤 Antonin Poché, Thomas Mullor, Gabriele Sarti et al. 📅 2025-12-10

⚡ Score: 6.1

"Interpreto is a Python library for post-hoc explainability of text HuggingFace models, from early BERT variants to LLMs. It provides two complementary families of methods: attributions and concept-based explanations. The library connects recent research to practical tooling for data scientists, aimi..."

🔬 RESEARCH

For agent systems, which metrics give you the clearest signal during evaluation

via r/artificial 👤 u/coolandy00 📅 2025-12-10

⬆️ 1 ups ⚡ Score: 6.1

"When evaluating an agent system that changes its behavior as tools and planning steps evolve, it can be hard to choose metrics that actually explain what went wrong. We tried several complex scoring schemes before realizing that a simple grouping works better. * Groundedness: Shows whether the ag..."

🔬 RESEARCH

Closing the Train-Test Gap in World Models for Gradient-Based Planning

via Arxiv 👤 Arjun Parthasarathy, Nimit Kalra, Rohun Agrawal et al. 📅 2025-12-10

⚡ Score: 6.1

"World models paired with model predictive control (MPC) can be trained offline on large-scale datasets of expert trajectories and enable generalization to a wide range of planning tasks at inference time. Compared to traditional MPC procedures, which rely on slow search algorithms or on iteratively..."

Stories from December 11, 2025

OpenAI launches GPT-5.2

OpenAI warns of cybersecurity risks in frontier models

Anthropic donates Model Context Protocol to Linux Foundation

AI agents outperform cybersecurity professionals in penetration testing

Disney and OpenAI partnership for Sora

📡 AI NEWS BUT ACTUALLY GOOD