📚 HISTORICAL ARCHIVE - February 22, 2026

                What was happening in AI on 2026-02-22
            

← Feb 21 📊 TODAY'S NEWS 📚 ARCHIVE 🗓️ February 2026 Feb 23 →

                📰 DAILY BRIEFING
            

30 stories tracked on February 22, 2026. Top story: We hid backdoors in ~40MB binaries and asked AI + Ghidra to find them.

🚀 WELCOME TO METAMESH.BIZ +++ Someone hid backdoors in 40MB binaries and AI couldn't find them (Ghidra also confused, security theater continues unabated) +++ nanollama drops one-command pretraining because apparently everyone needs their own 1B parameter pet model now +++ AMD NPU running Llama at 4.4 tok/s on Linux (first documented success, your Nvidia shares remain unbothered) +++ Production LLM pipelines failing in the same 16 ways across every company (MIT-licensed failure map now available for your debugging pleasure) +++ THE FUTURE IS DEMOCRATIZED PRETRAINING ON HARDWARE THAT BARELY WORKS +++ 🚀

📊 You are visitor #47291 to this AWESOME site! 📊
Archive from: 2026-02-22 | Preserved for posterity ⚡

Stories from February 22, 2026

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

🔒 SECURITY

We hid backdoors in ~40MB binaries and asked AI + Ghidra to find them

via HackerNews 👤 jakozaur 📅 2026-02-22

🔺 184 pts ⚡ Score: 8.3

💬 HackerNews Buzz: 79 comments 🐝 BUZZING

🎯 AI-Assisted Security Audits • Backdoor Detection Challenges • Reverse Engineering Accessibility

💬 "The whole field of working with binaries becomes accessible to a much wider range of software engineers." • "Measuring false positives when you ask the model to complete a detection related task may be a good way of doing that."

🛠️ TOOLS

How I use Claude Code: Separation of planning and execution

via HackerNews 👤 vinhnx 📅 2026-02-22

🔺 504 pts ⚡ Score: 8.2

💬 HackerNews Buzz: 311 comments 🐝 BUZZING

🎯 AI-assisted software development • Planning and documentation • Constraining AI model outputs

💬 "I try these staging-document patterns, but suspect they have 2 fundamental flaws" • "For me, I treat the LLM as model training + post processing + input tokens = output tokens"

🔬 RESEARCH

Reasoning Models Fabricate 75% of Their Explanations (ArXiv:2505.05410)

via HackerNews 👤 Aedelon 📅 2026-02-21

🔺 3 pts ⚡ Score: 8.0

🔄 OPEN SOURCE

nanollama — train Llama 3 from scratch and export to GGUF, one command, open source

via r/LocalLLaMA 👤 u/ataeff 📅 2026-02-22

⬆️ 21 ups ⚡ Score: 7.9

"nanollama — train Llama 3 from scratch. I've been working on a framework for training Llama 3 architecture models from scratch: not fine-tuning, not LoRA, actual from-zero pretraining. The output is a llama.cpp-compatible GGUF file. The whole pipeline is one command: ''' bash runs/lambda\_trai..."

💬 Reddit Discussion: 10 comments 🐝 BUZZING

🎯 Easy data setup • Hardware performance • Local LLM development

💬 "You should prepare for people example datasets" • "this is localllama so i gotta ask"

🛠️ TOOLS

[P] A practical failure-mode map for production LLM pipelines (16 patterns, MIT-licensed)

via r/MachineLearning 👤 u/StarThinker2025 📅 2026-02-22

⚡ Score: 7.5

"Most discussions about RAG and LLM agents focus on “what architecture to use” or “which model / vector store is better”. In practice, the systems I have seen fail in the same, very repetitive ways across projects, companies, and even different tech stacks. Over the past years I have been debugging ..."

🛠️ TOOLS

Running Llama 3.2 1B entirely on an AMD NPU on Linux (Strix Halo, IRON framework, 4.4 tok/s)

via r/LocalLLaMA 👤 u/SuperTeece 📅 2026-02-22

⬆️ 8 ups ⚡ Score: 7.3

"I got Llama 3.2 1B running inference entirely on the AMD NPU on Linux. Every operation (attention, GEMM, RoPE, RMSNorm, SiLU, KV cache) runs on the NPU; no CPU or GPU fallback. As far as I can tell, this is the first time anyone has publicly documented this working on Linux. ## Hardware - AMD Ryze..."

🤖 AI MODELS

LLM pretraining on TPU v6e with a $50 budget

via HackerNews 👤 burakabo 📅 2026-02-22

🔺 2 pts ⚡ Score: 7.1

🔒 SECURITY

Every AI App Data Breach Since January 2025: 20 Incidents, Same Root Causes

via HackerNews 👤 dhayabaran 📅 2026-02-21

🔺 1 pts ⚡ Score: 7.0

🔬 RESEARCH

AI Gamestore: Scalable, Open-Ended Evaluation of Machine General Intelligence with Human Games

via Arxiv 👤 Lance Ying, Ryan Truong, Prafull Sharma et al. 📅 2026-02-19

⚡ Score: 6.9

"Rigorously evaluating machine intelligence against the broad spectrum of human general intelligence has become increasingly important and challenging in this era of rapid technological advance. Conventional AI benchmarks typically assess only narrow capabilities in a limited range of human activity...."

🔬 RESEARCH

The Anxiety of Influence: Bloom Filters in Transformer Attention Heads

via Arxiv 👤 Peter Balogh 📅 2026-02-19

⚡ Score: 6.9

"Some transformer attention heads appear to function as membership testers, dedicating themselves to answering the question "has this token appeared before in the context?" We identify these heads across four language models (GPT-2 small, medium, and large; Pythia-160M) and show that they form a spec..."

🔬 RESEARCH

What Do LLMs Associate with Your Name? A Human-Centered Black-Box Audit of Personal Data

via Arxiv 👤 Dimitri Staufer, Kirsten Morehouse 📅 2026-02-19

⚡ Score: 6.9

"Large language models (LLMs), and conversational agents based on them, are exposed to personal data (PD) during pre-training and during user interactions. Prior work shows that PD can resurface, yet users lack insight into how strongly models associate specific information to their identity. We audi..."

🔬 RESEARCH

When to Trust the Cheap Check: Weak and Strong Verification for Reasoning

via Arxiv 👤 Shayan Kiyani, Sima Noorani, George Pappas et al. 📅 2026-02-19

⚡ Score: 6.8

"Reasoning with LLMs increasingly unfolds inside a broader verification loop. Internally, systems use cheap checks, such as self-consistency or proxy rewards, which we call weak verification. Externally, users inspect outputs and steer the model through feedback until results are trustworthy, which w..."

🔬 RESEARCH

Learning to Stay Safe: Adaptive Regularization Against Safety Degradation during Fine-Tuning

via Arxiv 👤 Jyotin Goel, Souvik Maji, Pratik Mazumder 📅 2026-02-19

⚡ Score: 6.8

"Instruction-following language models are trained to be helpful and safe, yet their safety behavior can deteriorate under benign fine-tuning and worsen under adversarial updates. Existing defenses often offer limited protection or force a trade-off between safety and utility. We introduce a training..."

🔬 RESEARCH

KLong: Training LLM Agent for Extremely Long-horizon Tasks

via Arxiv 👤 Yue Liu, Zhiyuan Hu, Flood Sung et al. 📅 2026-02-19

⚡ Score: 6.8

"This paper introduces KLong, an open-source LLM agent trained to solve extremely long-horizon tasks. The principle is to first cold-start the model via trajectory-splitting SFT, then scale it via progressive RL training. Specifically, we first activate basic agentic abilities of a base model with a..."

🔬 RESEARCH

AutoNumerics: An Autonomous, PDE-Agnostic Multi-Agent Pipeline for Scientific Computing

via Arxiv 👤 Jianda Du, Youran Sun, Haizhao Yang 📅 2026-02-19

⚡ Score: 6.8

"PDEs are central to scientific and engineering modeling, yet designing accurate numerical solvers typically requires substantial mathematical expertise and manual tuning. Recent neural network-based approaches improve flexibility but often demand high computational cost and suffer from limited inter..."

🔬 RESEARCH

Multi-Round Human-AI Collaboration with User-Specified Requirements

via Arxiv 👤 Sima Noorani, Shayan Kiyani, Hamed Hassani et al. 📅 2026-02-19

⚡ Score: 6.7

"As humans increasingly rely on multiround conversational AI for high stakes decisions, principled frameworks are needed to ensure such interactions reliably improve decision quality. We adopt a human centric view governed by two principles: counterfactual harm, ensuring the AI does not undermine hum..."

🔬 RESEARCH

Evaluating Chain-of-Thought Reasoning through Reusability and Verifiability

via Arxiv 👤 Shashank Aggarwal, Ram Vikas Mishra, Amit Awekar 📅 2026-02-19

⚡ Score: 6.7

"In multi-agent IR pipelines for tasks such as search and ranking, LLM-based agents exchange intermediate reasoning in terms of Chain-of-Thought (CoT) with each other. Current CoT evaluation narrowly focuses on target task accuracy. However, this metric fails to assess the quality or utility of the r..."

🔬 RESEARCH

Pushing the Frontier of Black-Box LVLM Attacks via Fine-Grained Detail Targeting

via Arxiv 👤 Xiaohan Zhao, Zhaoyi Li, Yaxin Luo et al. 📅 2026-02-19

⚡ Score: 6.7

"Black-box adversarial attacks on Large Vision-Language Models (LVLMs) are challenging due to missing gradients and complex multimodal boundaries. While prior state-of-the-art transfer-based approaches like M-Attack perform well using local crop-level matching between source and target images, we fin..."

🔬 RESEARCH

MARS: Margin-Aware Reward-Modeling with Self-Refinement

via Arxiv 👤 Payel Bhattacharjee, Osvaldo Simeone, Ravi Tandon 📅 2026-02-19

⚡ Score: 6.7

"Reward modeling is a core component of modern alignment pipelines including RLHF and RLAIF, underpinning policy optimization methods including PPO and TRPO. However, training reliable reward models relies heavily on human-labeled preference data, which is costly and limited, motivating the use of da..."

🔬 RESEARCH

Towards Anytime-Valid Statistical Watermarking

via Arxiv 👤 Baihe Huang, Eric Xu, Kannan Ramchandran et al. 📅 2026-02-19

⚡ Score: 6.6

"The proliferation of Large Language Models (LLMs) necessitates efficient mechanisms to distinguish machine-generated content from human text. While statistical watermarking has emerged as a promising solution, existing methods suffer from two critical limitations: the lack of a principled approach f..."

🔬 RESEARCH

Modeling Distinct Human Interaction in Web Agents

via Arxiv 👤 Faria Huq, Zora Zhiruo Wang, Zhanqiu Guo et al. 📅 2026-02-19

⚡ Score: 6.6

"Despite rapid progress in autonomous web agents, human involvement remains essential for shaping preferences and correcting agent behavior as tasks unfold. However, current agentic systems lack a principled understanding of when and why humans intervene, often proceeding autonomously past critical d..."

🔬 RESEARCH

Stable Asynchrony: Variance-Controlled Off-Policy RL for LLMs

via Arxiv 👤 Luke Huang, Zhuoyang Zhang, Qinghao Hu et al. 📅 2026-02-19

⚡ Score: 6.6

"Reinforcement learning (RL) is widely used to improve large language models on reasoning tasks, and asynchronous RL training is attractive because it increases end-to-end throughput. However, for widely adopted critic-free policy-gradient methods such as REINFORCE and GRPO, high asynchrony makes the..."

🔒 SECURITY

Pentagi: Autonomous AI Agents for complex penetration testing tasks

via HackerNews 👤 nateb2022 📅 2026-02-22

🔺 1 pts ⚡ Score: 6.2

🔬 RESEARCH

Training AI Without the Data You Don't Have

via HackerNews 👤 goloroden 📅 2026-02-22

🔺 2 pts ⚡ Score: 6.2

🔒 SECURITY

I scanned 30 popular AI projects for tamper-evident LLM evidence. 0 had it

via HackerNews 👤 TimAssay 📅 2026-02-22

🔺 2 pts ⚡ Score: 6.2

⚡ BREAKTHROUGH

[R] DynaMix -- first foundation model that can zero-shot predict long-term behavior of dynamical systems

via r/MachineLearning 👤 u/DangerousFunny1371 📅 2026-02-22

⬆️ 8 ups ⚡ Score: 6.2

"Time series foundation models like Chronos-2 have been hyped recently for their ability to forecast zero-shot from arbitrary time series segments presented "in-context". But they are essentially based on statistical pattern matching -- in contrast, DynaMix ([https://neurips.cc/virtual/2025/loc/san-d..."

🔬 RESEARCH

[R] DynaMix -- first foundation model for dynamical systems reconstruction

via r/MachineLearning 👤 u/DangerousFunny1371 📅 2026-02-22

⬆️ 3 ups ⚡ Score: 6.1

"Following up on our DynaMix #NeurIPS2025 paper (see link below), the first foundation model for dynamical systems reconstruction, we have now \- included **comparisons to most recent time series FMs like Chronos-2** in the latest update ([https://neurips.cc/virtual/2025/loc/san-diego/poster/118041]..."

🔬 RESEARCH

MolHIT: Advancing Molecular-Graph Generation with Hierarchical Discrete Diffusion Models

via Arxiv 👤 Hojung Jung, Rodrigo Hormazabal, Jaehyeong Jo et al. 📅 2026-02-19

⚡ Score: 6.1

"Molecular generation with diffusion models has emerged as a promising direction for AI-driven drug discovery and materials science. While graph diffusion models have been widely adopted due to the discrete nature of 2D molecular graphs, existing models suffer from low chemical validity and struggle..."

🔬 RESEARCH

The Cascade Equivalence Hypothesis: When Do Speech LLMs Behave Like ASR$\rightarrow$LLM Pipelines?

via Arxiv 👤 Jayadev Billa 📅 2026-02-19

⚡ Score: 6.1

"Current speech LLMs largely perform implicit ASR: on tasks solvable from a transcript, they are behaviorally and mechanistically equivalent to simple Whisper$\to$LLM cascades. We show this through matched-backbone testing across four speech LLMs and six tasks, controlling for the LLM backbone for th..."

🎨 CREATIVE

ChatGPT Image Continuity Test

via r/ChatGPT 👤 u/Full_Supermarket_109 📅 2026-02-21

⬆️ 2190 ups ⚡ Score: 6.0

"I was trying to see if I could create a coherent character through multiple images with a background that maintains continuity. It did generally well although if look closely objects shift around slightly. Each image was generated using the same prompt more or less (collage vs single image) but was..."

💬 Reddit Discussion: 531 comments 👍 LOWKEY SLAPS

🎯 Dating app behavior • AI-generated images • Bartender's perspective

💬 "Dating apps are fucking cooked chat" • "Maybe inconsistency is actually what we should be looking for to find real people"

Stories from February 22, 2026

📡 AI NEWS BUT ACTUALLY GOOD