🚀 WELCOME TO METAMESH.BIZ +++ Claude 3.5 Sonnet quietly taking the crown on real GitHub PR fixes while everyone's busy arguing about AGI timelines +++ Anthropic discovers you can backdoor any model with like 12 bad examples (size doesn't matter after all) +++ AMD securing 6-gigawatt GPU deals with OpenAI because Sam needs a trillion dollars and Jensen can't supply everyone +++ Microsoft casually drops homegrown image model MAI-1 because depending on OpenAI for everything is apparently passé +++ THE FUTURE RUNS ON POISONED WEIGHTS AND VENTURE DEBT +++ 🚀 •
🚀 WELCOME TO METAMESH.BIZ +++ Claude 3.5 Sonnet quietly taking the crown on real GitHub PR fixes while everyone's busy arguing about AGI timelines +++ Anthropic discovers you can backdoor any model with like 12 bad examples (size doesn't matter after all) +++ AMD securing 6-gigawatt GPU deals with OpenAI because Sam needs a trillion dollars and Jensen can't supply everyone +++ Microsoft casually drops homegrown image model MAI-1 because depending on OpenAI for everything is apparently passé +++ THE FUTURE RUNS ON POISONED WEIGHTS AND VENTURE DEBT +++ 🚀 •
"We're excited to share **Nanonets-OCR2**, a state-of-the-art suite of models designed for advanced image-to-markdown conversion and Visual Question Answering (VQA).
🔍 **Key Features:**
* **LaTeX Equation Recognition:** Automatically converts mathematical equations and formulas into properly format..."
💬 Reddit Discussion: 69 comments
🐝 BUZZING
🎯 Model comparison • Handwritten data performance • Benchmark evaluations
💬 "Can we have some comparison and benchmark between the two?"
• "Tested with my handwritten diary (that none other model could parse anything at all) - and all text was extracted!"
+++ DeepSeek and friends have apparently figured out how to train capable models without spending a billion dollars per run, topping open benchmarks. +++
"We ran code models on **last-month GitHub PR bug-fix tasks** (like SWE-bench, real repos, real tests). **Claude Sonnet 4.5** led with **pass@5 55.1%** and several unique solves (check **Insights** button) no other model cracked. ..."
💬 Reddit Discussion: 54 comments
👍 LOWKEY SLAPS
🎯 Model performance comparisons • Open-source language models • Multi-turn evaluation
💬 "GLM 4.6 is the current best open weights coder now"
• "Gemini-2.5-Pro has difficulty with multi-turn, long-context toll-calling agentic evaluations"
"DGX Spark systems deliver up to 1 petaflop of AI performance, accelerated by a NVIDIA GB10 Grace Blackwell Superchip, NVIDIA ConnectX^(®)\-7 200 Gb/s networking and NVIDIA NVLink™-C2C technology, providing 5x the bandwidth of fifth-generation PCIe with 128GB of CPU-GPU coherent memory.
The NVIDIA A..."
via Arxiv👤 Raoyuan Zhao, Yihong Liu, Hinrich Schütze et al.📅 2025-10-10
⚡ Score: 8.0
"Large reasoning models (LRMs) increasingly rely on step-by-step
Chain-of-Thought (CoT) reasoning to improve task performance, particularly in
high-resource languages such as English. While recent work has examined
final-answer accuracy in multilingual settings, the thinking traces themselves,
i.e.,..."
via Arxiv👤 Chengyu Wang, Paria Rashidinejad, DiJia Su et al.📅 2025-10-10
⚡ Score: 8.0
"Diffusion large language models (dLLMs) are emerging as an efficient
alternative to autoregressive models due to their ability to decode multiple
tokens in parallel. However, aligning dLLMs with human preferences or
task-specific rewards via reinforcement learning (RL) is challenging because
their i..."
"A new study from Anthropic shows that poisoning AI models is much easier than we thought.
The key finding: It only takes a **small, fixed number of malicious examples** to create a hidden backdoor in a model. This number **does not increase** as the model gets larger and is trained on more data.
I..."
📡 AI NEWS BUT ACTUALLY GOOD
The revolution will not be televised, but Claude will email you once we hit the singularity.
Get the stories that matter in Today's AI Briefing.
Powered by Premium Technology Intelligence Algorithms • Unsubscribe anytime
"## TL;DR
Implemented Google's ReasoningBank paper on small models (1.7B params). Built a memory system that extracts reasoning strategies from successful solutions and retrieves them for similar problems. **Result: 1.7B model went from 40% → 48% accuracy on MATH Level 3-4 problems (+20% relative imp..."
💬 Reddit Discussion: 10 comments
🐝 BUZZING
🎯 Memory formation • Incremental learning • Model experimentation
💬 "harvest all the successful strategies"
• "failed strategies would also be harvested"
via Arxiv👤 Ruyi Xu, Guangxuan Xiao, Yukang Chen et al.📅 2025-10-10
⚡ Score: 7.6
"Vision-language models (VLMs) could power real-time assistants and autonomous
agents, but they face a critical challenge: understanding near-infinite video
streams without escalating latency and memory usage. Processing entire videos
with full attention leads to quadratic computational costs and poo..."
"# TL;DR — Best model by real-life file QA tasks (Tested on 16GB Macbook Air M2)
>**Disclosure:** ***I’m building*** ***this local file agent for RAG - Hyperlink.*** *The idea of this test is to really* ***understand how models perform*** *in* ***privacy-concerned real-life tasks***\*, instead of..."
via Arxiv👤 Kaijian Zou, Aaron Xiong, Yunxiang Zhang et al.📅 2025-10-10
⚡ Score: 7.1
"Competitive programming problems increasingly serve as valuable benchmarks to
evaluate the coding capabilities of large language models (LLMs) due to their
complexity and ease of verification. Yet, current coding benchmarks face
limitations such as lack of exceptionally challenging problems, insuffi..."
via Arxiv👤 Xiao Yu, Baolin Peng, Michel Galley et al.📅 2025-10-10
⚡ Score: 7.0
"Reasoning models have recently shown remarkable progress in domains such as
math and coding. However, their expert-level abilities in math and coding
contrast sharply with their performance in long-horizon, interactive tasks such
as web navigation and computer/phone-use. Inspired by literature on hu..."
"**Companies & Business**
- OpenAI signed a multi-year deal with Broadcom to produce up to 10 GW of custom AI accelerators, projected to cut data-center costs by 30-40% and reduce reliance on Nvidia.
- Brookfield and Bloom Energy announced a strategic partnership worth up to $5 billion to pro..."
via Arxiv👤 Albert Belenguer-Llorens, Carlos Sevilla-Salcedo, Janaina Mourao-Miranda et al.📅 2025-10-10
⚡ Score: 7.0
"Real-world clinical problems are often characterized by multimodal data,
usually associated with incomplete views and limited sample sizes in their
cohorts, posing significant limitations for machine learning algorithms. In
this work, we propose a Bayesian approach designed to efficiently handle the..."
"External link discussion - see full content at original source."
💬 Reddit Discussion: 6 comments
👍 LOWKEY SLAPS
🎯 Intellectual property rights • Legality of data scraping • Whistleblowers and data leaks
💬 "Non-disclosure agreements aren't valid against illegal activities"
• "Data scraping is perfectly legal as long as you're not circumventing TOS restrictions"
via Arxiv👤 Donghang Wu, Haoyang Zhang, Jun Chen et al.📅 2025-10-10
⚡ Score: 7.0
"Real-time Spoken Language Models (SLMs) struggle to leverage Chain-of-Thought
(CoT) reasoning due to the prohibitive latency of generating the entire thought
process sequentially. Enabling SLMs to think while speaking, similar to humans,
is attracting increasing attention. We present, for the first..."
via Arxiv👤 Gavriel Di Nepi, Federico Siciliano, Fabrizio Silvestri📅 2025-10-10
⚡ Score: 6.8
"By the end of 2024, Google researchers introduced Titans: Learning at Test
Time, a neural memory model achieving strong empirical results across multiple
tasks. However, the lack of publicly available code and ambiguities in the
original description hinder reproducibility. In this work, we present a..."
via Arxiv👤 Zhenhailong Wang, Jiateng Liu, Amin Fazel et al.📅 2025-10-10
⚡ Score: 6.8
"Modern conversational agents like ChatGPT and Alexa+ rely on predefined
policies specifying metadata, response styles, and tool-usage rules. As these
LLM-based systems expand to support diverse business and user queries, such
policies, often implemented as in-context prompts, are becoming increasing..."
via Arxiv👤 Qiguang Chen, Hanjing Li, Libo Qin et al.📅 2025-10-10
⚡ Score: 6.8
"Recently, Diffusion Large Language Models (DLLMs) have offered high
throughput and effective sequential reasoning, making them a competitive
alternative to autoregressive LLMs (ALLMs). However, parallel decoding, which
enables simultaneous token updates, conflicts with the causal order often
require..."
🎯 Memory bandwidth • AI hardware performance • Local AI development
💬 "It isn't that good for local LLM inferencing. It's not designed to be as such."
• "Nvidia always short changes its own products and stunts them in some way."
via Arxiv👤 Feifan Song, Shaohang Wei, Bofei Gao et al.📅 2025-10-10
⚡ Score: 6.5
"Large reasoning models (LRMs) boosted by Reinforcement Learning from Verifier
Reward (RLVR) have shown great power in problem solving, yet they often cause
overthinking: excessive, meandering reasoning that inflates computational cost.
Prior designs of penalization in RLVR manage to reduce token con..."
"I wrote a blog article to better help myself understand how OpenAI's Apps SDK work under the hood. Hope folks also find it helpful!
Under the hood, Apps SDK is built on top of the Model Context Protocol (MCP). MCP provides a way for LLMs to connect to external tools and resources.
There are two ma..."
🎯 AI's impact on programming • Satisfaction in programming • Proper use of AI tools
💬 "The entire premise of AI coding tools is to automate the thinking, not just the typing."
• "Keep writing useless programs by hand. Implement a hash table in C or assembly if you want. Write a parser for a data format you use. Make a Doom clone. Keep learning and having fun."
via Arxiv👤 Sondos Mahmoud Bsharat, Zhiqiang Shen📅 2025-10-10
⚡ Score: 6.1
"Large language models (LLMs) have demonstrated impressive reasoning
capabilities when provided with chain-of-thought exemplars, but curating large
reasoning datasets remains laborious and resource-intensive. In this work, we
introduce Prompting Test-Time Scaling (P-TTS), a simple yet effective
infer..."