🌐 WELCOME TO METAMESH.BIZ +++ China mandating 50% domestic chip equipment while carefully not writing it down anywhere official (plausible deniability as industrial policy) +++ Meta training AI lab assistants by having them grade each other's homework using rubrics extracted from actual papers (peer review automation speedrun any%) +++ PhD student visualizing LLM hidden states as electromagnetic field trajectories because apparently we needed one more way to not understand what these things are doing +++ SOMEONE BENCHMARKED 26 SPEECH MODELS ON MEDICAL DIALOGUE AND THE WINNERS ARE EXACTLY WHO YOU'D EXPECT +++ 🌐
via Arxiv 👤 Hannah Atmer, Yuan Yao, Thiemo Voigt et al. 📅 2025-12-26
⚡ Score: 8.1
"Energy consumption dictates the cost and environmental impact of deploying Large Language Models. This paper investigates the impact of on-chip SRAM size and operating frequency on the energy efficiency and performance of LLM inference, focusing on the distinct behaviors of the compute-bound prefill..."
"Hello everyone! Iβm building a fully local AI-Scribe for clinicians and just pushed an end-of-year refresh of our medical dialogue STT benchmark.
I ranΒ **26 open + closed source STT models**Β onΒ **PriMock57**Β (55 files, 81,236 words) and ranked them byΒ **average WER**. I also loggedΒ **avg seconds..."
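For anyone wanting to sanity-check a ranking like this locally, average WER is a few lines with the `jiwer` package; the toy transcripts and the normalization choice below are ours, not the OP's harness:

```python
# Rank STT models by average WER with jiwer (pip install jiwer).
import jiwer

def normalize(s: str) -> str:
    # Normalization strongly affects medical WER; this is one common choice.
    return " ".join(s.lower().replace(",", "").replace(".", "").split())

references = ["patient reports mild chest pain since monday"]
hypotheses = {
    "model_a": ["patient report mild chest pain since monday"],
    "model_b": ["patient reports chest pain monday"],
}

scores = {
    model: jiwer.wer([normalize(r) for r in references],
                     [normalize(h) for h in hyps])
    for model, hyps in hypotheses.items()
}
for model, wer in sorted(scores.items(), key=lambda kv: kv[1]):
    print(f"{model}: WER={wer:.3f}")
```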
💬 Reddit Discussion: 9 comments
🐝 BUZZING
🎯 Medical speech-to-text evaluation • Model performance comparison • Licensing and commercial use
💬 "how do you or your clients usually process the transcripts further"
• "to me, these WERs still seem kind of 'high"
🎯 Critique of Linguistic Patterns • AI "Discovering Physics" • Academic Writing Styles
💬 "It breaks my immersion in any text/video now."
• "What this paper is actually showing isn't that AI is 'discovering physics"
🔬 RESEARCH
Training AI Co-Scientists using Rubric Rewards
2x SOURCES 🔗📅 2025-12-29
⚡ Score: 7.2
+++ Researchers figured out how to train AI assistants on real scientific constraints by extracting rubrics from papers, suggesting language models might finally do something useful in wet labs. +++
"Research released today by Meta: A general, scalable recipe to train AI to assist scientists in achieving their open-ended research goals:
1. Extract research goals and goal-specific grading rubrics from the large corpus of existing scientific papers with an LLM, and use them for RL training.
2. ..."
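The load-bearing piece of a recipe like this is collapsing rubric items into a scalar reward an RL loop can consume. A minimal sketch of that shape, with a trivial placeholder standing in for the LLM judge (none of this is Meta's actual code):

```python
# Rubric -> scalar reward: grade a generated research plan against
# paper-derived criteria and take a weighted average.
from dataclasses import dataclass

@dataclass
class RubricItem:
    criterion: str   # e.g. "plan includes a baseline comparison"
    weight: float

def judge(criterion: str, plan: str) -> float:
    """Placeholder for an LLM judge returning a 0..1 satisfaction score."""
    return 1.0 if any(w in plan.lower() for w in criterion.lower().split()) else 0.0

def rubric_reward(plan: str, rubric: list[RubricItem]) -> float:
    total = sum(item.weight for item in rubric)
    return sum(item.weight * judge(item.criterion, plan) for item in rubric) / total

rubric = [RubricItem("includes a baseline comparison", 1.0),
          RubricItem("states sample size and power analysis", 2.0)]
print(rubric_reward("We compare against a baseline with n=50 per arm ...", rubric))
```

The scalar then feeds whatever policy-gradient method the training stack uses; the interesting engineering is all in extracting rubrics specific enough to grade against.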
via Arxiv 👤 Shashwat Goel, Rishi Hazra, Dulhan Jayalath et al. 📅 2025-12-29
⚡ Score: 6.6
"AI co-scientists are emerging as a tool to assist human researchers in achieving their research goals. A crucial feature of these AI co-scientists is the ability to generate a research plan given a set of aims and constraints. The plan may be used by researchers for brainstorming, or may even be imp..."
🎯 Type checking • Test coverage • Automated tooling
💬 "Entire categories of illegal states and transitions can be eliminated."
• "Either you're writing code to solve a defined problem (valuable) or you're doing something else that may mimic that to some degree but is not accurate (bugs)."
🔬 RESEARCH
End-to-End Test-Time Training for Long Context
2x SOURCES 🔗📅 2025-12-29
⚡ Score: 7.1
+++ Researchers reframe long-context modeling as a continual learning problem, letting standard Transformers compress context into weights at inference time instead of chasing yet another architectural glow-up. +++
"https://test-time-training.github.io/e2e.pdf
We formulate long-context language modeling as a problem in continual learning rather than architecture design. Under this formulation, we only use a standard architecture: a Transformer with sliding-windo..."
via Arxiv 👤 Arnuv Tandon, Karan Dalal, Xinhao Li et al. 📅 2025-12-29
⚡ Score: 6.9
"We formulate long-context language modeling as a problem in continual learning rather than architecture design. Under this formulation, we only use a standard architecture -- a Transformer with sliding-window attention. However, our model continues learning at test time via next-token prediction on..."
"Iβm building a local interpretability tool that lets me visualize hidden-state activity and **intervene on individual hidden dimensions during inference** (via forward hooks). While scanning attn\_out, I identified a persistent hidden dimension (dim 3039) that appeared repeatedly across prompts. I'l..."
π¬ "It functions as a global commitment / epistemic certainty gain"
β’ "Ablation of the dim did nothing, so I'm looking at ways to trace distributed mechanisms now"
"Hi everyone,
I'm a PhD student in **Electromagnetics**. In my daily work, I deal with fields, waves, and trajectories. When I started playing with Local LLMs, I felt something was missing: we usually look at the *output* text or the *loss curves*, but we rarely see **how** the model gets from A to ..."
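One low-effort version of the trajectory view is to track the last token's hidden state layer by layer and project it to 2D. A sketch (PCA rather than anything field-theoretic, and not the OP's tool):

```python
# Plot the path a token's hidden state traces from embedding to final layer.
import torch
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

ids = tok("Maxwell's equations describe", return_tensors="pt").input_ids
with torch.no_grad():
    hs = model(ids, output_hidden_states=True).hidden_states  # L+1 tensors

states = torch.stack([h[0, -1] for h in hs]).numpy()  # last token, each layer
xy = PCA(n_components=2).fit_transform(states)

plt.plot(xy[:, 0], xy[:, 1], "-o")
for i, (x, y) in enumerate(xy):
    plt.annotate(str(i), (x, y))   # label layer indices along the path
plt.title("Last-token hidden state across layers (PCA)")
plt.show()
```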
💬 Reddit Discussion: 14 comments
🐝 BUZZING
🎯 LLM Interpretability • Geometric Reasoning Control • Multi-Model Systems
💬 "This is the kind of tool that could actually change how people debug and tune models"
• "Closing the loop from geometry → intervention is exactly the direction I'm interested in exploring"
via Arxiv 👤 Yuwen Li, Wei Zhang, Zelong Huang et al. 📅 2025-12-29
⚡ Score: 6.8
"Enabling Large Language Models (LLMs) to reliably invoke external tools remains a critical bottleneck for autonomous agents. Existing approaches suffer from three fundamental challenges: expensive human annotation for high-quality trajectories, poor generalization to unseen tools, and quality ceilin..."
via Arxiv 👤 Qi Fan, An Zou, Yehan Ma 📅 2025-12-26
⚡ Score: 6.8
"Large Language Models (LLMs) are increasingly deployed in time-critical systems, such as robotics, autonomous driving, embodied intelligence, and industrial automation, where generating accurate responses within a given time budget is crucial for decision-making, control, or safety-critical tasks. H..."
via Arxiv 👤 Jichen Feng, Yifan Zhang, Chenggong Zhang et al. 📅 2025-12-29
⚡ Score: 6.8
"Language agents increasingly require persistent worlds in which they can act, remember, and learn. Existing approaches sit at two extremes: conventional web frameworks provide reliable but fixed contexts backed by databases, while fully generative world models aim for unlimited environments at the e..."
via Arxiv 👤 Sahil Kale, Antonio Luca Alfeo 📅 2025-12-29
⚡ Score: 6.7
"Hallucinations, the generation of apparently convincing yet false statements, remain a major barrier to the safe deployment of LLMs. Building on the strong performance of self-detection methods, we examine the use of structured knowledge representations, namely knowledge graphs, to improve hallucina..."
via Arxiv 👤 Iris Xu, Guangtao Zeng, Zexue He et al. 📅 2025-12-29
⚡ Score: 6.7
"Large language models (LLMs) have shown strong reasoning and coding capabilities, yet they struggle to generalize to real-world software engineering (SWE) problems that are long-horizon and out of distribution. Existing systems often rely on a single agent to handle the entire workflow-interpreting..."
via Arxiv 👤 Panagiotis Theocharopoulos, Ajinkya Kulkarni, Mathew Magimai.-Doss 📅 2025-12-29
⚡ Score: 6.7
"Large language models (LLMs) are increasingly considered for use in high-impact workflows, including academic peer review. However, LLMs are vulnerable to document-level hidden prompt injection attacks. In this work, we construct a dataset of approximately 500 real academic papers accepted to ICML a..."
"We anticipate getting a lot of push back from the community on this, and that's why we've uploaded the repo and have open sourced everything - we want people to verify these results. We are very excited!!
We (Bitterbot AI) have just dropped the repo for **TOPAS-DSPL**. Itβs a tiny recursive model ..."
via Arxiv 👤 Sachin Pawar, Manoj Apte, Kshitij Jadhav et al. 📅 2025-12-26
⚡ Score: 6.6
"Tokenization is the first step in training any Large Language Model (LLM), where the text is split into a sequence of tokens as per the model's fixed vocabulary. This tokenization in LLMs is different from the traditional tokenization in NLP where the text is split into a sequence of "natural" words..."
via Arxiv 👤 Baixuan Li, Jialong Wu, Wenbiao Yin et al. 📅 2025-12-29
⚡ Score: 6.6
"Information-seeking (IS) agents have achieved strong performance across a range of wide and deep search tasks, yet their tool use remains largely restricted to API-level snippet retrieval and URL-based page fetching, limiting access to the richer information available through real browsing. While fu..."
via Arxiv 👤 Toqeer Ali Syed, Mishal Ateeq Almutairi, Mahmoud Abdel Moaty 📅 2025-12-29
⚡ Score: 6.6
"Powerful autonomous systems, which reason, plan, and converse using and between numerous tools and agents, are made possible by Large Language Models (LLMs), Vision-Language Models (VLMs), and new agentic AI systems, like LangChain and GraphChain. Nevertheless, this agentic environment increases the..."
🔒 SECURITY
MCP Guard for Database Access
2x SOURCES 🔗📅 2025-12-30
⚡ Score: 6.5
+++ Developer builds safety layer for AI agents accessing databases, because apparently letting language models run raw queries against production felt like a bad idea worth solving for everyone. +++
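The guard pattern is easy to sketch: validate agent-issued SQL before it reaches the database and refuse anything that mutates state. A deliberately naive allowlist below, ours rather than the project's implementation (a production guard should parse the SQL, not regex it):

```python
# Read-only SQL gate for agent-issued queries.
import re

FORBIDDEN = re.compile(
    r"\b(insert|update|delete|drop|alter|truncate|grant|create)\b", re.I)

def check_query(sql: str) -> None:
    statements = [s.strip() for s in sql.split(";") if s.strip()]
    if len(statements) != 1:
        raise PermissionError("multi-statement queries are blocked")
    stmt = statements[0]
    if not stmt.lower().startswith(("select", "with")):
        raise PermissionError("only read-only queries are allowed")
    if FORBIDDEN.search(stmt):
        raise PermissionError(f"forbidden keyword in: {stmt!r}")

check_query("SELECT name FROM users WHERE id = 7")   # passes
# check_query("DROP TABLE users")                    # raises PermissionError
```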
via Arxiv 👤 Shengyi Hua, Jianfeng Wu, Tianle Shen et al. 📅 2025-12-29
⚡ Score: 6.5
"Recent pathological foundation models have substantially advanced visual representation learning and multimodal interaction. However, most models still rely on a static inference paradigm in which whole-slide images are processed once to produce predictions, without reassessment or targeted evidence..."
via Arxiv 👤 Sky CH-Wang, Justin Svegliato, Helen Appel et al. 📅 2025-12-29
⚡ Score: 6.5
"We present a method and dataset for fine-tuning language models with preference supervision using feedback-driven improvement chains. Given a model response, an annotator provides fine-grained feedback by marking "liked" and "disliked" spans and specifying what they liked or disliked about them..."
"I looked into how llama.cpp optimizes top-k sampling, and the trick is surprisingly simple.
Top-k on Llama 3's 128K vocabulary means finding k highest scores out of 128,256 candidates. std::partial\_sort does this at O(n log k), but llama.cpp noticed that token logits cluster in a narrow range (-10..."
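That narrow range enables a histogram trick: bucket-count the logits, walk buckets from the top until you know where the k-th largest value lives, and sort only that small slice. A Python rendering of the idea (llama.cpp's actual implementation is C++ and differs in detail):

```python
# O(n) bucket pass + tiny sort instead of partial-sorting 128k logits.
import numpy as np

def top_k_bucketed(logits: np.ndarray, k: int, n_buckets: int = 128):
    lo, hi = logits.min(), logits.max()
    normed = (logits - lo) / (hi - lo + 1e-9)
    buckets = np.minimum((normed * n_buckets).astype(int), n_buckets - 1)

    counts = np.bincount(buckets, minlength=n_buckets)
    total, b = 0, n_buckets - 1
    while total + counts[b] < k:      # walk down until top buckets hold >= k
        total += counts[b]
        b -= 1
    candidates = np.nonzero(buckets >= b)[0]    # small superset of the top-k
    order = candidates[np.argsort(logits[candidates])[::-1]]
    return order[:k]

logits = np.random.randn(128256).astype(np.float32)
print(top_k_bucketed(logits, k=40)[:5])   # indices of the largest logits
```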
💬 Reddit Discussion: 13 comments
🐐 GOATED ENERGY
🎯 LLM Optimization • Token Sampling • Model Performance
💬 "I love how llama.cpp keeps optimizing the shit out of LLMs!"
• "It's used for token generation - sampling top-k tokens from vocabulary for inference."
"Hugging face: https://huggingface.co/collections/tencent/hy-mt15
Highlights:
πΉ 1.8B On-Device Power: Optimized for consumer hardware with a 1GB memory footprint. Using on-policy distillation to align with larger models, it delivers 0.18s latency..."
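On-policy distillation, name-checked in the highlights, means the teacher grades the student's own generations instead of fixed reference text. A miniature of the objective with stand-in models (the gpt2 pairing and hyperparameters are ours, not Tencent's setup):

```python
# One on-policy distillation step: student samples, teacher scores the
# student's tokens, minimize the KL between the two next-token distributions.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
student = AutoModelForCausalLM.from_pretrained("gpt2")
teacher = AutoModelForCausalLM.from_pretrained("gpt2-medium").eval()
opt = torch.optim.AdamW(student.parameters(), lr=1e-5)

prompt = tok("Translate to French: hello world", return_tensors="pt").input_ids
seq = student.generate(prompt, max_new_tokens=16, do_sample=True)  # on-policy

s_logits = student(seq).logits[:, :-1]
with torch.no_grad():
    t_logits = teacher(seq).logits[:, :-1]

# KL(teacher || student) evaluated on the student's own trajectory.
loss = F.kl_div(F.log_softmax(s_logits, -1), F.log_softmax(t_logits, -1),
                log_target=True, reduction="batchmean")
loss.backward(); opt.step(); opt.zero_grad()
print(float(loss))
```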
💬 Reddit Discussion: 6 comments
🐐 GOATED ENERGY
🎯 Model performance • Model comparisons • User enthusiasm
💬 "Unbelievable results for Hindi"
• "This is the cool stuff AI can do"
"Hey r/LocalLLaMA ! If you're passionate about squeezing every last bit of performance out of older hardware for local large language models, I've got something exciting to share. I managed to get GLM-4.7 β that's the massive 355B parameter Mixture of Experts model β running in Q8\_0 quantization on ..."
via Arxiv 👤 Jing Huang, Shujian Zhang, Lun Wang et al. 📅 2025-12-29
⚡ Score: 6.1
"Identifying specific and often complex behaviors from large language models (LLMs) in conversational settings is crucial for their evaluation. Recent work proposes novel techniques to find natural language prompts that induce specific behaviors from a target model, yet they are mainly studied in sin..."
"We are excited to open-source Tencent HY-Motion 1.0, a billion-parameter text-to-motion model built on the Diffusion Transformer (DiT) architecture and flow matching. Tencent HY-Motion 1.0 empowers developers and individual creators alike by transforming natural language into high-fidelity, fluid, a..."