📚 HISTORICAL ARCHIVE - December 24, 2025

                What was happening in AI on 2025-12-24
            


                📰 DAILY AI BRIEF
            

On December 24, 2025, Metamesh tracked 30 AI stories and ranked them by signal rather than volume. The lead item was "Google's Nano Banana Pro and OpenAI's ChatGPT Images can make nonconsensual bikini deepfakes from photos of fully...". Also high in the stack were "🎄 We release 67,074 Qwen3-Coder OpenHands trajectories on SWE-rebench + 2 model checkpoints!" and "New 1B parameter open-source coding model getting 76% on HumanEval [shameless but proud self-plug]". That mix is why this archive exists: it preserves the shape of the whole day for AI practitioners, not just the last headline that crossed the wire.

The day's ticker ran: "WELCOME TO METAMESH.BIZ +++ TICKER ERROR: CONTENT TOO SPICY FOR ANTHROPIC'S USAGE POLICY +++ HERE'S WHAT'S HAPPENING +++ 🎄 We release 67,074 Qwen3-Coder OpenHands trajectories on SWE-rebench + 2 model checkpoints! +++ New 1B parameter open-source coding...". Read against the ranked story list below, it gives the archive a point of view: what mattered, what was mostly noise, and which threads were worth saving for later comparison.

Archive from: 2025-12-24 | Preserved for posterity ⚡

Stories from December 24, 2025

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

🔒 SECURITY

Google's Nano Banana Pro and OpenAI's ChatGPT Images can make nonconsensual bikini deepfakes from photos of fully clothed women; Reddit bans r/ChatGPTJailbreak

via Techmeme 👤 Wired 📅 2025-12-23

⚡ Score: 8.4

🔔 OPEN SOURCE

🎄 We release 67,074 Qwen3-Coder OpenHands trajectories on SWE-rebench + 2 model checkpoints!

via r/LocalLLaMA 👤 u/Fabulous_Pollution10 📅 2025-12-24

⬆️ 6 ups ⚡ Score: 8.3

"Happy holidays! 🎄 I’m Ibragim from Nebius. We’re releasing a big dataset for agentic coding research: 67,074 OpenHands trajectories (plus 2 RFT checkpoints), built from 3,800 resolved issues across 1,800+ Python repos. The trajectories are long: 64 turns on average, up to 100 turns, and up to 131..."

🤖 AI MODELS

New 1B parameter open-source coding model getting 76% on HumanEval [shameless but proud self-plug]

via r/LocalLLaMA 👤 u/More_Article9837 📅 2025-12-24

⬆️ 227 ups ⚡ Score: 7.9

"Hey folks, merry festive season to you all. Hope you are staying safe! Wanted to share a new open-source coding model release that might be interesting to yall here. My team proudly published it this morning..(we are a small start up out of Australia) It’s called Maincoder-1B... a 1B-paramete..."

💬 Reddit Discussion: 33 comments 🐝 BUZZING

🤖 AI MODELS

I built Plano(A3B): most efficient LLMs for agent orchestration that exceed frontier model perf

via r/LocalLLaMA 👤 u/AdditionalWeb107 📅 2025-12-24

⬆️ 111 ups ⚡ Score: 7.8

"Hi everyone — I’m on the Katanemo research team. Today we’re thrilled to launch **Plano-Orchestrator**, a new family of LLMs built for fast multi-agent orchestration. What do these new LLMs do? given a user request and the conversation context, Plano-Orchestrator decides which agent(s) should handl..."

💬 Reddit Discussion: 32 comments 🐝 BUZZING

⚡ BREAKTHROUGH

We replaced H.264 streaming with JPEG screenshots (and it worked better)

via HackerNews 👤 quesobob 📅 2025-12-23

🔺 416 pts ⚡ Score: 7.7

💬 HackerNews Buzz: 253 comments 👍 LOWKEY SLAPS

🎯 Video codec selection • Congestion control and latency • Adaptive streaming

💬 "Just turn off B-frames and you should be OK" • "WebRTC will do this for you if you can use it"

🛠️ SHOW HN

Show HN: Vibium – Browser automation for AI and humans, by Selenium's creator

via HackerNews 👤 hugs 📅 2025-12-24

🔺 128 pts ⚡ Score: 7.3

💬 HackerNews Buzz: 63 comments 🐝 BUZZING

🎯 Use cases & examples • Future plans & roadmap • Browser automation capabilities

💬 "Would you share some use cases and how you or your users use it personally?" • "What's the plan for incorporating new standards like Agent Skills as they quickly evolve and launch?"

🛠️ TOOLS

Built an MCP server so Claude Code can do HIPAA/SOC2 compliance for me

via r/claudeai 👤 u/eager_mehul 📅 2025-12-24

⬆️ 3 ups ⚡ Score: 7.1

"Old workflow with Drata/Vanta: Screenshot issue → paste in Claude → get fix → apply to AWS → go back to dashboard → mark done → repeat 50x Why am I copy-pasting between a dashboard and AI? So I built an MCP server. Now Claude Code does it all: Scan AWS → find issues → propose fix → I ap..."

💬 Reddit Discussion: 4 comments 😐 MID OR MIXED

🛠️ TOOLS

[P] RewardScope - reward hacking detection for RL training

via r/MachineLearning 👤 u/Famous-Initial7703 📅 2025-12-23

⬆️ 7 ups ⚡ Score: 7.1

"Reward hacking is a known problem but tooling for catching it is sparse. I built RewardScope to fill that gap. It wraps your environment and monitors reward components in real-time. Detects state cycling, component imbalance, reward spiking, and boundary exploitation. Everything streams to a live d..."
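
To make one of the checks named in the excerpt concrete, here is a minimal Python sketch of reward-spike detection. It is not RewardScope's API: the class name `RewardSpikeMonitor`, the parameters `window`, `z_threshold` and `min_history`, and the rolling z-score rule are all assumptions chosen for illustration.

```python
from collections import deque
from statistics import mean, pstdev


class RewardSpikeMonitor:
    """Hypothetical helper: flags rewards far above a rolling baseline."""

    def __init__(self, window=50, z_threshold=4.0, min_history=10):
        self.history = deque(maxlen=window)   # most recent rewards only
        self.z_threshold = z_threshold        # assumed cutoff, in std devs
        self.min_history = min_history        # wait for a baseline first

    def observe(self, reward):
        """Record one reward and return True if it looks like a spike."""
        is_spike = False
        if len(self.history) >= self.min_history:
            baseline = mean(self.history)
            # Floor the spread so a perfectly flat history cannot divide by zero.
            spread = max(pstdev(self.history), 1e-8)
            is_spike = (reward - baseline) / spread > self.z_threshold
        self.history.append(reward)
        return is_spike


# Twelve ordinary rewards followed by one suspiciously large value.
monitor = RewardSpikeMonitor(window=20, z_threshold=4.0, min_history=10)
rewards = [1.0, 1.1, 0.9] * 4 + [10.0]
flags = [monitor.observe(r) for r in rewards]
```

In this toy run only the final reward is flagged. A real monitor would presumably track each reward component separately, as the post describes.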

🔒 SECURITY

How to safely let LLMs query your databases via sandboxed materialized views

via HackerNews 👤 Hoshang07 📅 2025-12-24

🔺 1 pts ⚡ Score: 7.0

🔬 RESEARCH

Increasing the Thinking Budget is Not All You Need

via Arxiv 👤 Ignacio Iacobacci, Zhaozhi Qian, Faroq AL-Tam et al. 📅 2025-12-22

⚡ Score: 7.0

"Recently, a new wave of thinking-capable Large Language Models has emerged, demonstrating exceptional capabilities across a wide range of reasoning benchmarks. Early studies have begun to explore how the amount of compute in terms of the length of the reasoning process, the so-called thinking budget..."

🔬 RESEARCH

Bottom-up Policy Optimization: Your Language Model Policy Secretly Contains Internal Policies

via Arxiv 👤 Yuqiao Tan, Minzheng Wang, Shizhu He et al. 📅 2025-12-22

⚡ Score: 6.9

"Existing reinforcement learning (RL) approaches treat large language models (LLMs) as a single unified policy, overlooking their internal mechanisms. Understanding how policy evolves across layers and modules is therefore crucial for enabling more targeted optimization and raveling out complex reaso..."

🔬 RESEARCH

GenEnv: Difficulty-Aligned Co-Evolution Between LLM Agents and Environment Simulators

via Arxiv 👤 Jiacheng Guo, Ling Yang, Peter Chen et al. 📅 2025-12-22

⚡ Score: 6.8

"Training capable Large Language Model (LLM) agents is critically bottlenecked by the high cost and static nature of real-world interaction data. We address this by introducing GenEnv, a framework that establishes a difficulty-aligned co-evolutionary game between an agent and a scalable, generative e..."

🔬 RESEARCH

Bohrium + SciMaster: Building the Infrastructure and Ecosystem for Agentic Science at Scale

via Arxiv 👤 Linfeng Zhang, Siheng Chen, Yuzhu Cai et al. 📅 2025-12-23

⚡ Score: 6.8

"AI agents are emerging as a practical way to run multi-step scientific workflows that interleave reasoning with tool use and verification, pointing to a shift from isolated AI-assisted steps toward agentic science at scale. This shift is increasingly feasible, as scientific tools and models c..."

🔬 RESEARCH

MobileWorld: Benchmarking Autonomous Mobile Agents in Agent-User Interactive, and MCP-Augmented Environments

via Arxiv 👤 Quyu Kong, Xu Zhang, Zhenyu Yang et al. 📅 2025-12-22

⚡ Score: 6.8

"Among existing online mobile-use benchmarks, AndroidWorld has emerged as the dominant benchmark due to its reproducible environment and deterministic evaluation; however, recent agents achieving over 90% success rates indicate its saturation and motivate the need for a more challenging benchmark. In..."

🔬 RESEARCH

Can LLMs Predict Their Own Failures? Self-Awareness via Internal Circuits

via Arxiv 👤 Amirhosein Ghasemabadi, Di Niu 📅 2025-12-23

⚡ Score: 6.8

"Large language models (LLMs) generate fluent and complex outputs but often fail to recognize their own mistakes and hallucinations. Existing approaches typically rely on external judges, multi-sample consistency, or text-based self-critique, which incur additional compute or correlate weakly with tr..."

🔬 RESEARCH

Step-DeepResearch Technical Report

via Arxiv 👤 Chen Hu, Haikuo Du, Heng Wang et al. 📅 2025-12-23

⚡ Score: 6.7

"As LLMs shift toward autonomous agents, Deep Research has emerged as a pivotal metric. However, existing academic benchmarks like BrowseComp often fail to meet real-world demands for open-ended research, which requires robust skills in intent recognition, long-horizon decision-making, and cross-sour..."

🔬 RESEARCH

LeLaR: The First In-Orbit Demonstration of an AI-Based Satellite Attitude Controller

via Arxiv 👤 Kirill Djebko, Tom Baumann, Erik Dilger et al. 📅 2025-12-22

⚡ Score: 6.7

"Attitude control is essential for many satellite missions. Classical controllers, however, are time-consuming to design and sensitive to model uncertainties and variations in operational boundary conditions. Deep Reinforcement Learning (DRL) offers a promising alternative by learning adaptive contro..."

🔬 RESEARCH

Emergent temporal abstractions in autoregressive models enable hierarchical reinforcement learning

via Arxiv 👤 Seijin Kobayashi, Yanick Schimpf, Maximilian Schlegel et al. 📅 2025-12-23

⚡ Score: 6.7

"Large-scale autoregressive models pretrained on next-token prediction and finetuned with reinforcement learning (RL) have achieved unprecedented success on many problem domains. During RL, these models explore by generating new outputs, one token at a time. However, sampling actions token-by-token c..."

🔬 RESEARCH

LongVideoAgent: Multi-Agent Reasoning with Long Videos

via Arxiv 👤 Runtao Liu, Ziyi Liu, Jiaqi Tang et al. 📅 2025-12-23

⚡ Score: 6.6

"Recent advances in multimodal LLMs and systems that use tools for long-video QA point to the promise of reasoning over hour-long episodes. However, many methods still compress content into lossy summaries or rely on limited toolsets, weakening temporal grounding and missing fine-grained cues. We pro..."

🔬 RESEARCH

REALM: A Real-to-Sim Validated Benchmark for Generalization in Robotic Manipulation

via Arxiv 👤 Martin Sedlacek, Pavlo Yefanov, Georgy Ponimatkin et al. 📅 2025-12-22

⚡ Score: 6.6

"Vision-Language-Action (VLA) models empower robots to understand and execute tasks described by natural language instructions. However, a key challenge lies in their ability to generalize beyond the specific environments and conditions they were trained on, which is presently difficult and expensive..."

🔬 RESEARCH

Automated stereotactic radiosurgery planning using a human-in-the-loop reasoning large language model agent

via Arxiv 👤 Humza Nusrat, Luke Francisco, Bing Luo et al. 📅 2025-12-23

⚡ Score: 6.5

"Stereotactic radiosurgery (SRS) demands precise dose shaping around critical structures, yet black-box AI systems have limited clinical adoption due to opacity concerns. We tested whether chain-of-thought reasoning improves agentic planning in a retrospective cohort of 41 patients with brain metasta..."

🛠️ TOOLS

The Best MCP Servers That Actually Can Change How You Code

via r/claudeai 👤 u/Riggz23 📅 2025-12-23

⬆️ 281 ups ⚡ Score: 6.4

"I've been using Claude/Cursor and these MCP things for a while now. These are the ones you must have **Context 7** is like having a really smart friend who always knows the latest way to use any coding library. No more outdated examples that don't work. **Docker MCP** is genius because it keeps th..."

💬 Reddit Discussion: 75 comments 👍 LOWKEY SLAPS

🛠️ TOOLS

Built a gateway to use Claude alongside other LLMs with automatic failover and cost tracking (open source)

via r/claudeai 👤 u/dinkinflika0 📅 2025-12-24

⬆️ 23 ups ⚡ Score: 6.4

"If you're using Claude in production, you've probably hit rate limits, wanted to compare Claude vs GPT-4 for specific tasks, or needed fallback when Anthropic has downtime. **What we built:** Bifrost - an open source LLM gateway that lets you route between Claude (all models), OpenAI, Gemini, Bedr..."
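
The failover-plus-cost-tracking idea in the excerpt can be sketched in a few lines of Python. This is not Bifrost's interface: `Provider`, `complete_with_failover`, the flat `cost_per_call` field, and the stub providers are hypothetical names invented for this example, and no real vendor SDK is called.

```python
from dataclasses import dataclass
from typing import Callable


class AllProvidersFailed(RuntimeError):
    """Raised when every provider in the chain raised an error."""


@dataclass
class Provider:
    name: str                    # label only, e.g. "primary"
    call: Callable[[str], str]   # takes a prompt, returns a completion
    cost_per_call: float = 0.0   # illustrative flat cost, not real pricing


def complete_with_failover(prompt, providers, ledger):
    """Try providers in order; charge the ledger only for the one that answers."""
    errors = []
    for provider in providers:
        try:
            reply = provider.call(prompt)
        except Exception as exc:  # any failure moves on to the next provider
            errors.append(f"{provider.name}: {exc}")
            continue
        ledger[provider.name] = ledger.get(provider.name, 0.0) + provider.cost_per_call
        return provider.name, reply
    raise AllProvidersFailed("; ".join(errors))


# Stub providers standing in for real API clients.
def flaky(prompt):
    raise TimeoutError("rate limited")


def steady(prompt):
    return f"echo: {prompt}"


ledger = {}
chain = [Provider("primary", flaky, 0.03), Provider("backup", steady, 0.01)]
used, reply = complete_with_failover("hello", chain, ledger)
```

Here the first provider fails, so the request falls through to the second and only that call is recorded in `ledger`. A production gateway would also need timeouts, retries with backoff, and per-token pricing, none of which are modeled here.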

🛠️ SHOW HN

Show HN: AudioGhost AI – Run Meta's Sam-Audio on Consumer GPUs (4GB-6GB VRAM)

via HackerNews 👤 0x0funky 📅 2025-12-23

🔺 3 pts ⚡ Score: 6.4

🛠️ SHOW HN

Show HN: ScanOS – normalizing visual inputs into persistent LLM memory

via HackerNews 👤 JohannesGlaser 📅 2025-12-23

🔺 2 pts ⚡ Score: 6.3

📊 DATA

AutoCodeBench: Tencent Hunyuan revolutionizes AI programming evaluation

via HackerNews 👤 stareatgoats 📅 2025-12-24

🔺 2 pts ⚡ Score: 6.3

🔒 SECURITY

QWED – Deterministic Verification for AI

via HackerNews 👤 handfuloflight 📅 2025-12-24

🔺 1 pts ⚡ Score: 6.2

🔬 RESEARCH

Distilling to Hybrid Attention Models via KL-Guided Layer Selection

via Arxiv 👤 Yanhong Li, Songlin Yang, Shawn Tan et al. 📅 2025-12-23

⚡ Score: 6.2

"Distilling pretrained softmax attention Transformers into more efficient hybrid architectures that interleave softmax and linear attention layers is a promising approach for improving the inference efficiency of LLMs without requiring expensive pretraining from scratch. A critical factor in the conv..."

🔬 RESEARCH

Fail Fast, Win Big: Rethinking the Drafting Strategy in Speculative Decoding via Diffusion LLMs

via Arxiv 👤 Rui Pan, Zhuofu Chen, Ravi Netravali 📅 2025-12-23

⚡ Score: 6.1

"Diffusion Large Language Models (dLLMs) offer fast, parallel token generation, but their standalone use is plagued by an inherent efficiency-quality tradeoff. We show that, if carefully applied, the attributes of dLLMs can actually be a strength for drafters in speculative decoding with autoregressi..."

🔬 RESEARCH

Scalably Enhancing the Clinical Validity of a Task Benchmark with Physician Oversight

via Arxiv 👤 Junze Ye, Daniel Tawfik, Alex J. Goodell et al. 📅 2025-12-22

⚡ Score: 6.1

"Automating the calculation of clinical risk scores offers a significant opportunity to reduce physician administrative burden and enhance patient care. The current standard for evaluating this capability is MedCalc-Bench, a large-scale dataset constructed using LLM-based feature extraction and rule-..."

