WELCOME TO METAMESH.BIZ +++ Karpathy says AGI is still a decade away (meanwhile his AI tutor startup just raised millions to teach humans before they're obsolete) +++ Plain English beats JSON for LLM tool-calling by 18 points because apparently computers prefer human conversation now +++ OpenAI needs $400B in 12 months while planning to save 30% on chips by ditching NVIDIA (the math is mathing perfectly) +++ AI coding tools made devs 19% slower according to METR (the productivity revolution will be debugged) +++ THE FUTURE RUNS ON NATURAL LANGUAGE AND VENTURE DEBT +++
"Meta just published MobileLLM-Pro, a new 1B parameter foundational language model (pre-trained and instruction fine-tuned) on Huggingface
https://huggingface.co/facebook/MobileLLM-Pro
The model seems to outperform Gemma 3-1B and Llama 3-1B by quite ..."
🎯 AI model comparison • Question quality matters • Small model limitations
💬 "garbage in, garbage out"
• "best have a different doctor treat the child"
🎯 PRODUCT
Claude Skills announcement
2x SOURCES 📅 2025-10-16
⚡ Score: 9.0
+++ Claude Skills let you package instructions and resources for specific tasks, potentially outmaneuvering MCP's token overhead, though early adopters are more excited about the sandboxed dev environment Anthropic mentioned in passing. +++
💬 "The font-size is microscopic. Everything is so small, only eagles can read."
• "These feel like they are just prompt files, like what VS Code has."
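For context on the "prompt files" comparison: a Skill is reportedly just a folder containing a SKILL.md of instructions plus optional resources. A minimal sketch, assuming the YAML-frontmatter format from Anthropic's docs (the skill name and contents here are made up):

```markdown
---
name: release-notes
description: Drafts release notes from a list of merged changes. Use when the user asks for a changelog or release summary.
---

# Release Notes

When asked for release notes:
1. Group changes into Added / Changed / Fixed.
2. Write one plain-English line per change, linking PRs when provided.
3. Keep the whole document under 200 words unless asked otherwise.
```

The frontmatter is what Claude scans to decide when the skill applies; the body is only loaded into context once the skill is triggered.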
"Anthropic just dropped Haiku 4.5 and the numbers are wild:
**Performance:**
* 73.3% on SWE-bench Verified (matches Sonnet 4 from 5 months ago)
* 90% of Sonnet 4.5's agentic coding performance
* 2x faster than Sonnet 4
* 4-5x faster than Sonnet 4.5
**Pricing:**
* $1 input / $5 output per million ..."
💬 "Since western models and open-source models are on par for day to day usage, the prices for the open-source models should be compared too."
• "these numbers are pretty impressive especially the price point."
💬 "Divide and parallelize...8 ^ 4 toolcalls cover a very large code search space"
• "Context Engineering is Actually Very Important. Too important for humans and hardcoded rules"
via Arxiv 👤 Xinchen Zhang, Xiaoying Zhang, Youbin Wu et al. 📅 2025-10-15
⚡ Score: 8.1
"We introduce Generative Universal Verifier, a novel concept and plugin
designed for next-generation multimodal reasoning in vision-language models and
unified multimodal models, providing the fundamental capability of reflection
and refinement on visual outcomes during the reasoning and generation p..."
🛠️ TOOLS
Claude's new built-in development environment
3x SOURCES 📅 2025-10-17
⚡ Score: 7.9
+++ Anthropic slipped a full Linux sandbox with persistent storage past everyone fixating on "Skills" branding, potentially solving what MCP's token bloat never could: actual practical extensibility. +++
"I feel like this whole "Skills" announcement really buried the lede. You also have a full on user_data directory to instruct Claude to use as you wish. Not to mention that what's installed in Claude's sandbox goes beyond what you might expect. No internet connectivity, but the Python packages insta..."
💬 "Sadly it lost a lot of its luster when I realized the filesystem is scoped to the conversation"
• "They just nuked that though, no backup or anything"
via Arxiv 👤 Shrey Pandit, Austin Xu, Xuan-Phi Nguyen et al. 📅 2025-10-15
⚡ Score: 7.9
"Large language model (LLM)-based reasoning systems have recently achieved
gold medal-level performance in the IMO 2025 competition, writing mathematical
proofs where, to receive full credit, each step must be not only correct but
also sufficiently supported. To train LLM-based reasoners in such chal..."
via Arxiv 👤 Ravi Pandya, Madison Bland, Duy P. Nguyen et al. 📅 2025-10-15
⚡ Score: 7.8
"Generative AI systems are increasingly assisting and acting on behalf of end
users in practical settings, from digital shopping assistants to
next-generation autonomous cars. In this context, safety is no longer about
blocking harmful content, but about preempting downstream hazards like
financial o..."
via Arxiv 👤 Devvrit Khatri, Lovish Madaan, Rishabh Tiwari et al. 📅 2025-10-15
⚡ Score: 7.8
"Reinforcement learning (RL) has become central to training large language
models (LLMs), yet the field lacks predictive scaling methodologies comparable
to those established for pre-training. Despite rapidly rising compute budgets,
there is no principled understanding of how to evaluate algorithmic..."
via Arxiv 👤 Giovanni Monea, Yair Feldman, Shankar Padmanabhan et al. 📅 2025-10-15
⚡ Score: 7.7
"The scalability of large language models for long-context reasoning is
severely constrained by the linear growth of their Transformer key-value cache,
which incurs significant memory and computational costs. We posit that as a
model generates reasoning tokens, the informational value of past generat..."
💡 AI NEWS BUT ACTUALLY GOOD
The revolution will not be televised, but Claude will email you once we hit the singularity.
Get the stories that matter in Today's AI Briefing.
Powered by Premium Technology Intelligence Algorithms • Unsubscribe anytime
"**TL;DR:** Tool-call accuracy in LLMs can be significantly improved by using natural language instead of JSON-defined schemas (~+18 percentage points across 6,400 trials and 10 models), while simultaneously reducing variance by 70% and token overhead by 31%. We introduce Natural Language Tools (NLT..."
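The gist of the claim is easy to picture: instead of registering a JSON-schema tool definition, you describe the tool in prose and let the model answer in a simple call convention. A minimal illustration of the contrast (this is not the paper's exact prompt format; the tool name and calling convention below are made up):

```python
# Sketch: JSON-schema tool definition vs. a plain-English description of the
# same tool. The NLT paper's actual prompt format may differ; names are illustrative.
import json

# Conventional OpenAI-style function-calling schema.
json_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}

# Natural-language equivalent: a couple of sentences in the system prompt.
nl_tool = (
    "You can call get_weather. To use it, reply exactly with "
    "CALL get_weather(city=<city name>). It returns the current "
    "weather for that city."
)

print(len(json.dumps(json_tool)), "chars as a JSON schema")
print(len(nl_tool), "chars as plain English")
```

The token-overhead reduction the post cites points the same direction: the schema boilerplate (`type`, `properties`, `required`, nested braces) disappears, and the model only has to pattern-match a sentence.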
via Arxiv 👤 Yuxiang Huang, Chaojun Xiao, Xu Han et al. 📅 2025-10-15
⚡ Score: 7.6
"Trainable sparse attention has emerged as a promising solution to address the
decoding efficiency bottleneck of LLMs in long-context processing,
significantly saving memory accesses while minimally impacting task
performance. However, existing sparse attention methods leave a crucial
limitation unre..."
via Arxiv 👤 Zhiqi Huang, Vivek Datla, Chenyang Zhu et al. 📅 2025-10-15
⚡ Score: 7.6
"We propose a method for confidence estimation in retrieval-augmented
generation (RAG) systems that aligns closely with the correctness of large
language model (LLM) outputs. Confidence estimation is especially critical in
high-stakes domains such as finance and healthcare, where the cost of an
incor..."
via Arxiv 👤 Yi Zhang, Bolin Ni, Xin-Sheng Chen et al. 📅 2025-10-15
⚡ Score: 7.6
"Fully open multimodal large language models (MLLMs) currently lag behind
proprietary counterparts, primarily due to a significant gap in data quality
for supervised fine-tuning (SFT). Existing open-source datasets are often
plagued by widespread noise and a critical deficit in complex reasoning data..."
💬 "how do you manage auth state conflicts when multiple agents interact with the same logged-in session simultaneously?"
• "Are you modifying specific Chromium fingerprinting APIs or taking a different approach?"
via Arxiv 👤 Xinyi Chen, Yilun Chen, Yanwei Fu et al. 📅 2025-10-15
⚡ Score: 7.1
"We introduce InternVLA-M1, a unified framework for spatial grounding and
robot control that advances instruction-following robots toward scalable,
general-purpose intelligence. Its core idea is spatially guided
vision-language-action training, where spatial grounding serves as the critical
link betw..."
via Arxiv 👤 Run Luo, Xiaobo Xia, Lu Wang et al. 📅 2025-10-15
⚡ Score: 7.0
"Next-generation multimodal foundation models capable of any-to-any
cross-modal generation and multi-turn interaction will serve as core components
of artificial general intelligence systems, playing a pivotal role in
human-machine interaction. However, most existing multimodal models remain
constrai..."
via Arxiv 👤 Senyu Fei, Siyin Wang, Junhao Shi et al. 📅 2025-10-15
⚡ Score: 6.9
"Visual-Language-Action (VLA) models report impressive success rates on
robotic manipulation benchmarks, yet these results may mask fundamental
weaknesses in robustness. We perform a systematic vulnerability analysis by
introducing controlled perturbations across seven dimensions: objects layout,
cam..."
via Arxiv 👤 Xingyu Tan, Xiaoyang Wang, Xiwei Xu et al. 📅 2025-10-15
⚡ Score: 6.8
"Large Language Models (LLMs) have achieved impressive reasoning abilities,
but struggle with temporal understanding, especially when questions involve
multiple entities, compound operators, and evolving event sequences. Temporal
Knowledge Graphs (TKGs), which capture vast amounts of temporal facts i..."
via Arxiv 👤 Xiuyuan Chen, Tao Sun, Dexin Su et al. 📅 2025-10-15
⚡ Score: 6.8
"Current benchmarks for AI clinician systems, often based on multiple-choice
exams or manual rubrics, fail to capture the depth, robustness, and safety
required for real-world clinical practice. To address this, we introduce the
GAPS framework, a multidimensional paradigm for evaluating \textbf{G}rou..."
🛠️ TOOLS
Claude with Playwright MCP Browser Testing
2x SOURCES 📅 2025-10-17
⚡ Score: 6.8
+++ Anthropic's Playwright integration lets Claude actually see and interact with live browsers instead of hallucinating test scripts, which is either revolutionary or the bare minimum depending on your tolerance for AI theater. +++
"I've been messing around with the new Playwright MCP inside Claude Code and it's honestly wild.
It doesn't just simulate tests or spit out scripts – it actually opens a live Chromium browser that you can watch while it runs your flow.
I set it up to test my full onboarding process:
signup → ver..."
💬 Reddit Discussion: 9 comments
📈 BUZZING
🎯 Browser automation tools • Playwright vs Chrome DevTools MCP • Debugging and testing
💬 "Playwright is powerful and I was excited to try"
• "Playwright MCP feels smoother for full test runs"
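For anyone wanting to reproduce this, the Playwright MCP server is published as `@playwright/mcp`. A typical MCP config entry might look like the following sketch (the exact file location and schema depend on your MCP client, so verify against its docs):

```json
{
  "mcpServers": {
    "playwright": {
      "command": "npx",
      "args": ["@playwright/mcp@latest"]
    }
  }
}
```

Once registered, the client launches the server on demand and the model drives a real Chromium instance through it rather than emitting test scripts blind.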
via Arxiv 👤 Santiago Cuervo, Skyler Seto, Maureen de Seyssel et al. 📅 2025-10-15
⚡ Score: 6.7
"Large Language Models (LLMs) can be adapted to extend their text capabilities
to speech inputs. However, these speech-adapted LLMs consistently underperform
their text-based counterparts--and even cascaded pipelines--on language
understanding tasks. We term this shortfall the text-speech understandi..."
💬 "Claude has a denial of reality which it is unable to get through"
• "Skills are dependent upon developers writing competent documentation…which most seemingly can't"
via Arxiv 👤 Shuyu Wu, Ziqiao Ma, Xiaoxi Luo et al. 📅 2025-10-15
⚡ Score: 6.6
"Symbol grounding (Harnad, 1990) describes how symbols such as words acquire
their meanings by connecting to real-world sensorimotor experiences. Recent
work has shown preliminary evidence that grounding may emerge in
(vision-)language models trained at scale without using explicit grounding
objectiv..."
via Arxiv 👤 Ziqing Lu, Lifeng Lai, Weiyu Xu 📅 2025-10-15
⚡ Score: 6.6
"Reinforcement learning (RL) for the Markov Decision Process (MDP) has emerged
in many security-related applications, such as autonomous driving, financial
decisions, and drone/robot algorithms. In order to improve the
robustness/defense of RL systems against adversaries, studying various
adversarial..."
"Hello guys, hoping you're having a good day.
As you know, llama.cpp has had RPC support for a while now.
I have 2 PCs in my home:
My "Server":
* AM5 MSI X670E Carbon
* AMD Ryzen 9 9900X
* 192GB DDR5 6000Mhz CL32
* 7 GPUs
* 5090x2
* 4090x2
* A6000
* 3090x2
* MCX314A-BCCT 40Gbps NIC (totally overkil..."
💬 Reddit Discussion: 28 comments
🐐 GOATED ENERGY
💬 "X16 split into X8/X4/X4 5.0 from CPU"
• "RPC is not without loss. Even if the RPC device is set inside the same machine, you will be losing performance compared to no RPC."
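For reference, llama.cpp's multi-machine setup is roughly: build with `GGML_RPC=ON`, run `rpc-server` on each worker box, then point the main process at the workers with `--rpc`. A sketch, with a hypothetical LAN address and port (flag names follow recent llama.cpp builds; check `--help` on yours):

```shell
# On the remote machine: expose its GPUs to the network over RPC.
rpc-server --host 0.0.0.0 --port 50052

# On the main machine: run inference using local GPUs plus the RPC worker.
llama-cli -m model.gguf --rpc 192.168.1.10:50052 -ngl 99 -p "Hello"
```

As the quoted comment notes, RPC is not free: even a same-machine RPC device loses throughput versus running without RPC, and a fast NIC (like the 40Gbps card in this build) only reduces, not eliminates, that overhead.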
via Arxiv 👤 Evan Ellis, Vivek Myers, Jens Tuyls et al. 📅 2025-10-15
⚡ Score: 6.5
"Assistive agents should not only take actions on behalf of a human, but also
step out of the way and cede control when there are important decisions to be
made. However, current methods for building assistive agents, whether via
mimicking expert humans or via RL finetuning on an inferred reward, oft..."
via Arxiv 👤 Thomas van Vuren, Fiona Sloothaak, Maarten G. Wolf et al. 📅 2025-10-15
⚡ Score: 6.5
"The curse of dimensionality renders Reinforcement Learning (RL) impractical
in many real-world settings with exponentially large state and action spaces.
Yet, many environments exhibit exploitable structure that can accelerate
learning. To formalize this idea, we study RL in Block Markov Decision
Pr..."
"We present the Federated Inference Resource Scheduling Toolkit (FIRST), a
framework enabling Inference-as-a-Service across distributed High-Performance
Computing (HPC) clusters. FIRST provides cloud-like access to diverse AI
models, like Large Language Models (LLMs), on existing HPC infrastructure...."
via Arxiv 👤 Ivan Vykopal, Matúš Pikuliak, Simon Ostermann et al. 📅 2025-10-15
⚡ Score: 6.4
"Chat assistants increasingly integrate web search functionality, enabling
them to retrieve and cite external sources. While this promises more reliable
answers, it also raises the risk of amplifying misinformation from
low-credibility sources. In this paper, we introduce a novel methodology for
eval..."
"MoE partial offload, i.e. keeping experts on CPU and the context, attention, etc on GPU, has two benefits:
- The non-sparse data is kept on fast VRAM
- Everything needed to handle context computations is on GPU
For dense models the first point is fairly irrelevant since, well, it's all dense so ho..."
💬 Reddit Discussion: 4 comments
🐐 GOATED ENERGY
💬 "Really wish a technique would come out to reduce it to 12 GB or less for the large frontier models without quality loss"
• "The interesting arguments are the `-ctk q8_0 -ctv q8_0 -fa 1 -ngl 99` and those should also apply to llama-server"
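Putting the post's idea and the quoted flags together, a partial-offload invocation might look like the sketch below: `-ngl 99` offloads everything to GPU, then `-ot` (`--override-tensor`) pins the sparse expert FFN tensors back to CPU RAM, so attention, dense weights, and the KV cache stay in VRAM. Flag names follow recent llama.cpp builds and the model path is a placeholder; verify against `--help` on your build.

```shell
# MoE partial offload: all layers to GPU, then override expert tensors to CPU.
llama-server -m model.gguf -ngl 99 -ot "ffn_.*_exps=CPU" -fa 1 -ctk q8_0 -ctv q8_0
```

The tensor-name regex depends on the model architecture, so inspect the tensor names in your GGUF if the pattern doesn't match.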
via Arxiv 👤 Nir Goren, Oren Katzir, Abhinav Nakarmi et al. 📅 2025-10-15
⚡ Score: 6.3
"With the rapid adoption of diffusion models for visual content generation,
proving authorship and protecting copyright have become critical. This
challenge is particularly important when model owners keep their models private
and may be unwilling or unable to handle authorship issues, making third-p..."
via Arxiv 👤 Mustafa Munir, Alex Zhang, Radu Marculescu 📅 2025-10-15
⚡ Score: 6.3
"Vision graph neural networks (ViG) have demonstrated promise in vision tasks
as a competitive alternative to conventional convolutional neural nets (CNN)
and transformers (ViTs); however, common graph construction methods, such as
k-nearest neighbor (KNN), can be expensive on larger images. While me..."
"Hey `r/LocalLLaMA`! We just released this in beta and would love to get your feedback.
Here: https://github.com/ggml-org/LlamaBarn
What it does:
- Download models from a curated catalog
- Run models with one click – it auto-configures them for your system
- Built-in web UI and REST API (via `llama..."
💬 Reddit Discussion: 20 comments
📈 BUZZING
🎯 Performance improvements • Backend configuration • Multimodal architectures support
💬 "now make it use an MLX backend, which is usually quite a bit faster on Mac"
• "Still be nice to get mlx in there if only because it's way easier to add new architectures"
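Since LlamaBarn is built by ggml-org on top of llama.cpp, its REST API is presumably the familiar llama-server OpenAI-compatible endpoint; the port and model name in this sketch are placeholders, so check the app's settings for the real values:

```shell
# Assumption: LlamaBarn proxies llama-server's OpenAI-compatible API.
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "local-model", "messages": [{"role": "user", "content": "Hello"}]}'
```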
"*Disclaimer: I work for Inference.net, creator of the Schematron model family*
Hey everyone, wanted to share something we've been working on at Inference.net: Schematron, a family of small models for web extraction.
Our goal was to make a small, fast model for taking HT..."
💬 Reddit Discussion: 46 comments
📈 BUZZING
🎯 Web scraping automation • LLM model applications • Tool trade-offs
💬 "simple and cheap agnostic solution that just receives html and outputs nice json"
• "This works for any schema on any page"
"Remote inference allows lightweight devices to leverage powerful cloud
models. However, communication network latency makes predictions stale and
unsuitable for real-time tasks. To address this, we introduce Dedelayed, a
delay-corrective method that mitigates arbitrary remote inference delays,
allow..."