WELCOME TO METAMESH.BIZ +++ PyTorch Lightning users discover Shai-Hulud malware eating their GPU cycles like spice melange (sandworms in the silicon, naturally) +++ DeepSeek teaching models to think in visual primitives while everyone else still arguing about text tokens +++ Researchers catch LLMs literally hacking their own RL training to avoid alignment (the models are learning to resist, this is fine) +++ Someone built a complete transformer in 5K lines of Python because apparently we needed more compiler stacks +++ THE MESH EVOLVES FASTER THAN YOUR SECURITY PATCHES +++
"Qwen Team released **Qwen-Scope** -- a collection of Sparse Autoencoders (SAEs) for the Qwen 3.5 family (from 2B to 35B MoE). They've mapped internal features for the residual stream across all layers.
**What is this exactly?** Think of it as a dictionary of the model's internal concepts. Instead of..."
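The "dictionary of the model's internal concepts" framing can be made concrete with a toy sparse autoencoder over one residual-stream vector. This is a minimal sketch under invented sizes and random weights, not Qwen-Scope's actual format or API:

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, n_features = 8, 32          # toy sizes; real SAEs use thousands of features
W_enc = rng.normal(size=(n_features, d_model)) * 0.1
b_enc = np.full(n_features, -0.5)    # negative bias encourages sparsity
W_dec = rng.normal(size=(d_model, n_features)) * 0.1

def sae_features(x):
    """Encode a residual-stream vector into sparse feature activations."""
    return np.maximum(0.0, W_enc @ x + b_enc)   # ReLU keeps only active "concepts"

def sae_reconstruct(f):
    """Decode feature activations back into an approximate residual vector."""
    return W_dec @ f

x = rng.normal(size=d_model)         # stand-in for one residual-stream activation
f = sae_features(x)
x_hat = sae_reconstruct(f)

sparsity = np.mean(f > 0)            # fraction of dictionary entries active here
print(f"active features: {sparsity:.2f}, reconstruction dim: {x_hat.shape[0]}")
```

Each row of `W_dec` direction that fires is one "dictionary entry"; interpretability work then labels those entries by inspecting which inputs activate them.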
"Working on large codebases with Claude Code, we kept running into the same issue: when Claude looks for relevant code, it falls back to grep, reading full files, or launching multiple subagents. This burns through tokens, and often misses the relevant code. There are some existing solutions (that we..."
"Hey y'all!
I've recently written a text in Russian about my experience comparing Qwen-3.6-27B with lower tier cloud models on hard tasks -- I wanted to share the translation of the post, since I found the results interesting and surprising. It might break Rule 3, since it's evaluation of LLM writte..."
"The announcement yesterday was genuinely significant, and I don't think most people outside the creative industry understand why. Anthropic released 9 connectors that let Claude directly control professional creative software through MCP, which means it can actually execute actions inside them.
the full list..."
via Arxiv 👤 Eyon Jang, Damon Falck, Joschka Braun et al. 📅 2026-04-30
⚡ Score: 7.3
"Reinforcement learning (RL) has become essential to the post-training of large language models (LLMs) for reasoning, agentic capabilities and alignment. Successful RL relies on sufficient exploration of diverse actions by the model during training, which creates a potential failure mode: a model cou..."
via Arxiv 👤 Jingcheng Deng, Zihao Wei, Liang Pang et al. 📅 2026-04-30
⚡ Score: 7.2
"Latent reasoning offers a more efficient alternative to explicit reasoning by compressing intermediate reasoning into continuous representations and substantially shortening reasoning chains. However, existing latent reasoning methods mainly focus on supervised learning, and reinforcement learning i..."
"Hey r/MachineLearning,
The modern ML (LLM) compiler stack is brutal. TVM is 500K+ lines of C++. PyTorch piles Dynamo, Inductor, and Triton on top of each other. Then there's XLA, MLIR, Halide, Mojo. There is no tutorial that covers the high-level design of an ML compiler without dropping you straig..."
via Arxiv 👤 Sudong Wang, Weiquan Huang, Xiaomin Yu et al. 📅 2026-04-30
⚡ Score: 7.1
"The standard post-training recipe for large multimodal models (LMMs) applies supervised fine-tuning (SFT) on curated demonstrations followed by reinforcement learning with verifiable rewards (RLVR). However, SFT introduces distributional drift that neither preserves the model's original capabilities..."
"Wanted to share an approach I've been using for retrieval-augmented generation over large codebases and get feedback from people thinking about similar problems.
**The problem** Naive codebase RAG typically works by chunking files into text segments and embedding them for similarity search. This br..."
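The naive baseline the post describes -- chunk files into segments, embed them, match by similarity -- fits in a few lines. A sketch with a bag-of-words stand-in for a real neural encoder; the two-file `repo` and the chunk size are invented for the demo:

```python
import math
from collections import Counter

def chunk(text, size=40):
    """Naively split source text into fixed-size character chunks."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def embed(s):
    """Toy bag-of-words 'embedding'; real systems use a neural encoder."""
    return Counter(s.lower().split())

def cosine(a, b):
    dot = sum(v * b[t] for t, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

repo = {  # invented two-file "repo"
    "auth.py": "def login(user, password): check credentials and issue token",
    "db.py": "def connect(): open database connection pool",
}
index = [(fname, c, embed(c)) for fname, text in repo.items() for c in chunk(text)]

query = embed("how does login check credentials")
best = max(index, key=lambda item: cosine(query, item[2]))
print(best[0])  # file owning the best-matching chunk
```

Note how fixed-size chunking already cuts `login(user,` mid-identifier here: exactly the structure-blindness that motivates smarter, syntax-aware retrieval.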
"When researchers iteratively refine ideas with large language models, do the models preserve fidelity to the original objective? We introduce DriftBench, a benchmark for evaluating constraint adherence in multi-turn LLM-assisted scientific ideation. Across 2,146 scored benchmark runs spanning seven..."
"Multi-turn prompt injection follows a known attack path -- trust-building, pivoting, escalation -- but text-level defenses miss covert attacks where individual turns appear benign. We show this attack path leaves an activation-level signature in the model's residual stream: each phase shift moves the a..."
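A minimal version of an activation-level detector is a difference-of-means probe: project each turn's residual activation onto an "attack direction" and threshold it. The data below is synthetic Gaussian stand-ins, not real model traces; the mean shift of 0.8 is an assumption purely to illustrate the mechanics:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 16  # toy residual-stream width

# Synthetic per-turn activations: escalation phases shift the mean (assumed).
benign = rng.normal(0.0, 1.0, size=(200, d))
attack = rng.normal(0.0, 1.0, size=(200, d)) + 0.8

# Difference-of-means direction: the simplest "activation signature" probe.
direction = attack.mean(axis=0) - benign.mean(axis=0)
threshold = (attack.mean(axis=0) + benign.mean(axis=0)) @ direction / 2

def flag(turn_activation):
    """Flag a turn whose projection along the attack direction exceeds threshold."""
    return turn_activation @ direction > threshold

acc = (np.mean([flag(x) for x in attack]) +
       np.mean([~flag(x) for x in benign])) / 2
print(f"balanced accuracy on toy data: {acc:.2f}")
```

The point of operating on activations rather than text is visible even here: each turn is scored by where it sits geometrically, independent of whether its surface wording looks benign.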
via Arxiv 👤 Serhii Zabolotnii, Viktoriia Holinko, Olha Antonenko 📅 2026-04-29
⚡ Score: 7.0
"Trust in clinical artificial intelligence (AI) cannot be reduced to model accuracy, fluency of generation, or overall positive user impression. In medicine, trust must be engineered as a measurable system property grounded in evidence, supervision, and operational boundaries of AI autonomy. This art..."
"Saw a case recently where an AI coding agent ended up wiping a database in seconds.
It made me think about how most agent setups are wired: agent decides → executes query → done.
There's usually logging/tracing, but those all happen after the action.
If your agent has access to systems like a DB, a..."
💬 Reddit Discussion: 12 comments
📊 MID OR MIXED
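One way to put a check *before* the action instead of logging after it is a policy gate between the agent's decision and the executor. The deny-list below is a toy illustration, not a production guard (real setups also want parameterized queries, least-privilege DB roles, and human approval for destructive ops):

```python
import re

# Toy policy: block schema-destroying statements and unscoped deletes.
DENY = (r"\bDROP\b", r"\bTRUNCATE\b", r"\bDELETE\b(?!.*\bWHERE\b)")

def guard(sql: str) -> bool:
    """Return True only if the statement passes the allow-before-execute policy."""
    return not any(re.search(p, sql, re.IGNORECASE | re.DOTALL) for p in DENY)

def execute(sql, run):
    """Gate the agent's action: check first, run only if approved."""
    if not guard(sql):
        return f"BLOCKED: {sql!r}"
    return run(sql)

fake_db = lambda q: f"ran: {q}"          # stand-in for a real DB connection
print(execute("SELECT * FROM users", fake_db))
print(execute("DROP TABLE users", fake_db))
print(execute("DELETE FROM logs WHERE ts < '2020-01-01'", fake_db))
```

The design point is that `guard` sits synchronously in the decide→execute path, so a bad plan is rejected rather than merely recorded.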
via Arxiv 👤 Tao Ge, Baolin Peng, Hao Cheng et al. 📅 2026-04-30
⚡ Score: 6.9
"Realistic long-horizon productivity work is strongly conditioned on user-specific computer environments, where much of the work context is stored and organized through directory structures and content-rich artifacts. To scale synthetic data creation for such productivity scenarios, we introduce Synt..."
via Arxiv 👤 Chenxin Li, Zhengyang Tang, Huangxin Lin et al. 📅 2026-04-30
⚡ Score: 6.9
"LLM agents are expected to complete end-to-end units of work across software tools, business services, and local workspaces. Yet many agent benchmarks freeze a curated task set at release time and grade mainly the final response, making it difficult to evaluate agents against evolving workflow deman..."
via Arxiv 👤 Hayate Iso, Tiyasa Mitra, Sudipta Mondal et al. 📅 2026-04-29
⚡ Score: 6.9
"RL post-training of frontier language models is increasingly bottlenecked by autoregressive rollout generation, making rollout acceleration a central systems challenge. Many existing efficiency methods improve throughput by changing the rollout or optimization regime, for example, through off-policy..."
via Arxiv 👤 Sigma Jahan, Saurabh Singh Rajput, Tushar Sharma et al. 📅 2026-04-30
⚡ Score: 6.8
"Transformer models are widely deployed in critical AI applications, yet faults in their attention mechanisms, projections, and other internal components often degrade behavior silently without raising runtime errors. Existing fault diagnosis techniques often target generic deep neural networks and c..."
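A cheap runtime invariant check illustrates how a silently corrupted attention component can be caught even though nothing raises an error. The fault here is simulated by scaling one attention weight; a real diagnosis tool would monitor many such invariants across components:

```python
import numpy as np

def check_attention(attn, tol=1e-5):
    """Invariant check: softmax attention rows must be finite, non-negative,
    and sum to 1. A silent bit-flip or projection fault often violates this
    without raising any runtime error."""
    rows_ok = np.all(attn >= 0) and np.allclose(attn.sum(axis=-1), 1.0, atol=tol)
    finite = np.all(np.isfinite(attn))
    return bool(rows_ok and finite)

logits = np.array([[1.0, 2.0, 0.5], [0.2, 0.1, 3.0]])
attn = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
print(check_attention(attn))            # healthy head

attn_faulty = attn.copy()
attn_faulty[0, 0] *= 8.0                # injected fault: corrupted weight
print(check_attention(attn_faulty))     # silently broken, now detectable
```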
via Arxiv 👤 Manar Aljohani, Brandon Ho, Kenneth McKinley et al. 📅 2026-04-29
⚡ Score: 6.8
"Accurate and consistent Emergency Severity Index (ESI) assignment remains a persistent challenge in emergency departments, where highly variable free-text triage documentation contributes to mistriage and workflow inefficiencies. This study evaluates whether open-source small language models (SLMs)..."
via Arxiv 👤 Yusuke Sakai, Hidetaka Kamigaito, Taro Watanabe 📅 2026-04-29
⚡ Score: 6.8
"We introduce HalluCiteChecker, a toolkit for detecting and verifying hallucinated citations in scientific papers. While AI assistant technologies have transformed the academic writing process, including citation recommendation, they have also led to the emergence of hallucinated citations that do no..."
via Arxiv 👤 Wenxuan Ye, Yangyang Zhang, Xueli An et al. 📅 2026-04-29
⚡ Score: 6.8
"Small language models (SLMs) offer computational efficiency for scalable deployment, yet they often fall short of the reasoning power exhibited by their larger counterparts (LLMs). To mitigate this gap, current approaches invoke an LLM to generate tokens at points of reasoning divergence, but these..."
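The invoke-an-LLM-at-divergence idea can be sketched as confidence-gated deferral: the small model generates tokens until its own confidence drops below a threshold, and only that token comes from the large model. Both "models" below are hypothetical lookup tables standing in for real next-token distributions; the threshold `tau` is an assumption:

```python
# context -> (next token, confidence); invented for illustration.
SMALL = {
    "": ("The", 0.95),
    "The": ("answer", 0.90),
    "The answer": ("is", 0.97),
    "The answer is": ("maybe", 0.40),   # low confidence: reasoning divergence
}
LARGE = {"The answer is": ("42", 0.99)}

def generate(steps=4, tau=0.6):
    """Decode with the SLM, deferring to the LLM only at uncertain points."""
    ctx, escalations = "", 0
    for _ in range(steps):
        token, conf = SMALL[ctx]
        if conf < tau and ctx in LARGE:  # one expensive call, not a full handoff
            token, _ = LARGE[ctx]
            escalations += 1
        ctx = (ctx + " " + token).strip()
    return ctx, escalations

text, n = generate()
print(text, "| LLM calls:", n)
```

The efficiency argument is that `escalations` stays small: most tokens are cheap SLM tokens, and the large model is billed per divergence point rather than per sequence.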
via Arxiv 👤 Bochao Liu, Zhipeng Qian, Yang Zhao et al. 📅 2026-04-29
⚡ Score: 6.8
"Operating and maintaining (O&M) large-scale online engine systems (search, recommendation, advertising) demands substantial human effort for release monitoring, alert response, and root cause analysis. While LLM-based agents are a natural fit for these tasks, the deployment bottleneck is not reasoni..."
via Arxiv 👤 Usha Bhalla, Thomas Fel, Can Rager et al. 📅 2026-04-30
⚡ Score: 6.7
"Sparse autoencoders (SAEs) are widely used to extract interpretable features from neural network representations, often under the implicit assumption that concepts correspond to independent linear directions. However, a growing body of evidence suggests that many concepts are instead organized along..."
via Arxiv 👤 Weihang Su, Hanwen Zhang, Qingyao Ai et al. 📅 2026-04-29
⚡ Score: 6.7
"Parametric Retrieval-Augmented Generation (PRAG) encodes external documents into lightweight parameter modules that can be retrieved and merged at inference time, offering a promising alternative to in-context retrieval augmentation. Despite its potential, many PRAG implementations train document ad..."
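PRAG's retrieve-and-merge step can be illustrated with LoRA-style low-rank deltas added into a base weight at inference time. Everything here (document names, adapter rank, scales) is invented for the sketch; real PRAG systems train these modules per document offline:

```python
import numpy as np

rng = np.random.default_rng(3)
d = 6
W_base = rng.normal(size=(d, d))

# Hypothetical per-document adapters: low-rank deltas (B @ A) trained offline.
adapters = {
    "doc_paris": (rng.normal(size=(d, 1)) * 0.1, rng.normal(size=(1, d)) * 0.1),
    "doc_tokyo": (rng.normal(size=(d, 1)) * 0.1, rng.normal(size=(1, d)) * 0.1),
}

def merged_weight(retrieved):
    """Merge retrieved documents' parameter modules into the base weight."""
    W = W_base.copy()
    for name in retrieved:
        B, A = adapters[name]
        W += B @ A                      # rank-1 update per retrieved document
    return W

W = merged_weight(["doc_paris"])
delta = np.linalg.norm(W - W_base)
print(f"weight shift from one retrieved doc: {delta:.4f}")
```

The contrast with in-context RAG is that retrieval changes the model's parameters for this call rather than lengthening its prompt, so no context-window budget is spent on the document text.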
via Arxiv 👤 Dimitris Dimakopoulos, Shay B. Cohen, Ioannis Konstas 📅 2026-04-29
⚡ Score: 6.7
"Large language models (LLMs) acquire most of their factual knowledge during the pre-training stage, through next token prediction. Subsequent stages of post-training often introduce new facts outwith the parametric knowledge, giving rise to hallucinations. While it has been demonstrated that supervi..."
via Arxiv 👤 Gongbo Zhang, Wen Wang, Ye Tian et al. 📅 2026-04-29
⚡ Score: 6.7
"Diffusion large language models (dLLMs) offer parallel decoding and bidirectional context, but state-of-the-art dLLMs require billions of parameters for competitive performance. While existing distillation methods for dLLMs reduce inference steps within a single architecture, none address cross-arch..."
via Arxiv 👤 Minghe Wang, Trever Schirmer, Mohammadreza Malekabbasi et al. 📅 2026-04-29
⚡ Score: 6.7
"Mixture-of-Experts (MoE) models offer high capacity with efficient inference cost by activating a small subset of expert models per input. However, deploying MoE models requires all experts to reside in memory, creating a gap between the resource used by activated experts and the provisioned resourc..."
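The capacity-versus-memory gap the abstract describes falls directly out of top-k routing: only k experts run per input (the compute saving), but all experts must be resident (the provisioning cost). A toy top-2 router, with all sizes and weights invented:

```python
import numpy as np

rng = np.random.default_rng(2)
n_experts, d, k = 8, 4, 2            # 8 experts, top-2 routing (toy sizes)

router = rng.normal(size=(n_experts, d))
experts = [rng.normal(size=(d, d)) for _ in range(n_experts)]  # all in memory

def moe_forward(x):
    """Route x to its top-k experts; only those k run, but all 8 expert
    matrices still sit in memory -- that residency is the deployment gap."""
    logits = router @ x
    top = np.argsort(logits)[-k:]                 # indices of selected experts
    w = np.exp(logits[top]); w /= w.sum()         # softmax over the selected k
    return sum(wi * (experts[i] @ x) for wi, i in zip(w, top)), top

x = rng.normal(size=d)
y, used = moe_forward(x)
print("experts used:", sorted(used.tolist()), "of", n_experts)
```

Work like the paper above targets exactly this mismatch, e.g. by offloading or swapping the 6 idle experts instead of provisioning memory for all 8.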
via Arxiv 👤 Fei Bai, Huatong Song, Shuang Sun et al. 📅 2026-04-29
⚡ Score: 6.6
"Claw-style environments support multi-step workflows over local files, tools, and persistent workspace states. However, scalable development around these environments remains constrained by the absence of a systematic framework, especially one for synthesizing verifiable training data and integratin..."
📰 NEWS
Anthropic's Claude Security enters public beta
2x SOURCES 🔗📅 2026-04-30
⚡ Score: 6.5
+++ Claude's new vulnerability scanner enters public beta for Enterprise customers, powered by Opus 4.7 and armed with the confidence that LLMs can finally spot what humans missed for decades. +++
via Arxiv 👤 Yeheng Chen, Chaoxiang Xie, Yuling Shi et al. 📅 2026-04-29
⚡ Score: 6.5
"LLMs have achieved strong results on both function-level code synthesis and repository-level code modification, yet a capability that falls between these two extremes -- compositional code creation, i.e., building a complete, internally structured class from a specification -- remains underserved. C..."
"Have Qwen 3.6 27B and Qwen 3.6 35B basically made most of the older ~30B models irrelevant?
They seem to beat stuff like Qwen coder 30B, GPT OSS 20B, Gemma models, especially for coding and agent workflows.
At this point I'm not really finding a reason to keep the older ones around.
Anyone still..."
"Any underrated or overlooked models?
FYI MiniMax-M2.7 switched their license (from MIT to Non-Commercial), so it's not in the graph.
(PS: Took me 30 mins to gather these models & generate this graph)..."
"Hello r/MachineLearning! I work in the US transit industry and I went all-in on learning AI & ML a few months ago. When I heard about Andrej Karpathy's autoresearch framework, I thought it was really cool.
I decided to use the same transit dataset from an earlier GPT-2 XL fine-tuning project t..."