WELCOME TO METAMESH.BIZ +++ TICKER ERROR: CONTENT TOO SPICY FOR ANTHROPIC'S USAGE POLICY +++ HERE'S WHAT'S HAPPENING +++ 'Western Qwen': IBM Wows with Granite 4 LLM Launch and Hybrid Mamba/Transformer +++ Sora 2: AI Video Generation with Realistic Sound +++ LoRA without regrets implemented in Hugging Face TRL [colab, and python scripts] +++
+++ Big Blue releases enterprise LLM family mixing Mamba and transformers, promising lower RAM usage. Models range from browser-ready 3B to 32B parameters. +++
🎯 Monetization strategies • Competition from Chinese models • OpenAI's strategic dilemma
💬 "That VC loss playbook only works if you can corner the market and squeeze later to make up for the losses."
• "The biggest concern IMO is how good the open weight models coming out of China are, on consumer hardware."
+++ Secondary sale values ChatGPT maker at half a trillion dollars, letting employees cash out while Sam Altman's startup officially becomes pricier than rockets. +++
via Arxiv 👤 Enxin Song, Wenhao Chai, Shusheng Yang et al. 📅 2025-10-02
⚡ Score: 8.1
"Video understanding in multimodal language models remains limited by context
length: models often miss key transition frames and struggle to maintain
coherence across long time scales. To address this, we adapt Native Sparse
Attention (NSA) to video-language models. Our method, VideoNSA, adapts
Qwen..."
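Stripped of the paper's specifics, sparse attention of this flavor keeps only the highest-scoring keys per query. A minimal NumPy sketch (plain top-k key selection, not the learned block compression/selection NSA actually uses):

```python
import numpy as np

def topk_sparse_attention(q, K, V, k):
    """Attend only to the k highest-scoring keys (simplified sketch;
    real NSA uses learned block-level compression and selection)."""
    scores = K @ q / np.sqrt(q.shape[-1])      # (n,) query-key scores
    idx = np.argsort(scores)[-k:]              # indices of top-k keys
    w = np.exp(scores[idx] - scores[idx].max())
    w /= w.sum()                               # softmax over selected keys only
    return w @ V[idx]                          # weighted sum of selected values

rng = np.random.default_rng(0)
q = rng.normal(size=4)
K = rng.normal(size=(16, 4))
V = rng.normal(size=(16, 4))
out = topk_sparse_attention(q, K, V, k=4)
```

With k equal to the full key count this reduces to dense attention; the savings come from keeping k far below the number of frames in a long video.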
via Arxiv 👤 Yixuan Weng, Minjun Zhu, Qiujie Xie et al. 📅 2025-09-30
⚡ Score: 8.0
"While previous AI Scientist systems can generate novel findings, they often
lack the focus to produce scientifically valuable contributions that address
pressing human-defined challenges. We introduce DeepScientist, a system
designed to overcome this by conducting goal-oriented, fully autonomous
sci..."
via Arxiv 👤 Maël Macuglia, Paul Friedrich, Giorgia Ramponi 📅 2025-09-30
⚡ Score: 8.0
"Deploying reinforcement learning (RL) in robotics, industry, and health care
is blocked by two obstacles: the difficulty of specifying accurate rewards and
the risk of unsafe, data-hungry exploration. We address this by proposing a
two-stage framework that first learns a safe initial policy from a r..."
"# LoRA Without Regret
> [!WARNING]
> I wrote this page for the TRL docs, but thought I'd just drop it here in advance for anyone who can't wait.
I also made a colab notebook of this guide.
Recent res..."
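The trick LoRA itself relies on is easy to state: freeze the pretrained weight W and train only a low-rank correction. A toy NumPy sketch of that parameterization (dimensions are made up, and this is the bare math, not the TRL/PEFT implementation):

```python
import numpy as np

# Minimal LoRA sketch: the frozen weight W gets a low-rank update
# (alpha / r) * B @ A, and only A and B receive gradients.
d, k, r, alpha = 8, 8, 2, 16
rng = np.random.default_rng(0)
W = rng.normal(size=(d, k))          # frozen pretrained weight
A = rng.normal(size=(r, k)) * 0.01   # trainable down-projection
B = np.zeros((d, r))                 # trainable up-projection, zero-init

def lora_forward(x):
    # B starts at zero, so the adapted layer initially matches the base model
    return x @ (W + (alpha / r) * B @ A).T
```

Zero-initializing B is what makes training "regret-free" at step 0: the adapter is a no-op until the optimizer moves it.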
via Arxiv 👤 Justin Cui, Jie Wu, Ming Li et al. 📅 2025-10-02
⚡ Score: 7.7
"Diffusion models have revolutionized image and video generation, achieving
unprecedented visual quality. However, their reliance on transformer
architectures incurs prohibitively high computational costs, particularly when
extending generation to long videos. Recent work has explored autoregressive..."
+++ Inference chip startup Groq wants 12+ new data centers in 2026 after building 12 this year, betting big that speed matters more than availability. +++
via Arxiv 👤 Ziyin Zhang, Zihan Liao, Hang Yu et al. 📅 2025-10-02
⚡ Score: 7.1
"We introduce F2LLM - Foundation to Feature Large Language Models, a suite of
state-of-the-art embedding models in three sizes: 0.6B, 1.7B, and 4B. Unlike
previous top-ranking embedding models that require massive contrastive
pretraining, sophisticated training pipelines, and costly synthetic trainin..."
"Arena evals (e.g., Chatbot Arena) let users pick which model's response is better, or call it a draw. Most leaderboards then shove this into Elo, same as chess. The assumption: a draw = two models are equally strong. The paper ["Drawing Conclusions from Draws: Rethinking Preference Semantics in Aren..."
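For reference, the standard Elo update those leaderboards use, with a draw scored as 0.5:

```python
def elo_update(ra, rb, score_a, k=32):
    """Standard Elo: score_a is 1 (A wins), 0 (B wins), or 0.5 (draw)."""
    expected_a = 1 / (1 + 10 ** ((rb - ra) / 400))  # A's win probability
    delta = k * (score_a - expected_a)
    return ra + delta, rb - delta
```

Note the built-in semantic: a draw between equally rated models changes nothing, but a draw still drags the higher-rated model down toward the lower one. That "draw = equal strength" assumption is exactly what the paper questions.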
via Arxiv 👤 Ruohao Guo, Afshin Oroojlooy, Roshan Sridhar et al. 📅 2025-10-02
⚡ Score: 6.9
"Despite recent rapid progress in AI safety, current large language models
remain vulnerable to adversarial attacks in multi-turn interaction settings,
where attackers strategically adapt their prompts across conversation turns and
pose a more critical yet realistic challenge. Existing approaches tha..."
via Arxiv 👤 Gonzalo Gonzalez-Pumariega, Vincent Tu, Chih-Lun Lee et al. 📅 2025-10-02
⚡ Score: 6.9
"Computer-use agents (CUAs) hold promise for automating everyday digital
tasks, but their unreliability and high variance hinder their application to
long-horizon, complex tasks. We introduce Behavior Best-of-N (bBoN), a method
that scales over agents by generating multiple rollouts and selecting amo..."
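The core scaling move is best-of-N over whole trajectories: run the agent several times, then let a judge pick the winner. A hypothetical sketch (`run_agent` and `judge` are stand-ins, not the bBoN API):

```python
import random

def best_of_n(run_agent, judge, task, n=8, seed=0):
    """Generate n independent rollouts and return the judge's top pick."""
    rng = random.Random(seed)
    rollouts = [run_agent(task, rng) for _ in range(n)]
    return max(rollouts, key=judge)

# Toy stand-ins: a noisy agent and a judge that prefers shorter trajectories.
def toy_agent(task, rng):
    return {"task": task, "steps": rng.randint(5, 20)}

def toy_judge(rollout):
    return -rollout["steps"]  # fewer steps judged better

best = best_of_n(toy_agent, toy_judge, "install-libreoffice", n=8)
```

The hard part the paper actually tackles is the judge: comparing long, messy computer-use trajectories is much harder than comparing short text answers.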
via Arxiv 👤 Phuc Minh Nguyen, Chinh D. La, Duy M. H. Nguyen et al. 📅 2025-10-02
⚡ Score: 6.8
"Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a key
method for improving Large Language Models' reasoning capabilities, yet recent
evidence suggests it may paradoxically shrink the reasoning boundary rather
than expand it. This paper investigates the shrinkage issue of RLVR by..."
via Arxiv 👤 Chenxi Whitehouse, Sebastian Ruder, Tony Lin et al. 📅 2025-09-30
⚡ Score: 6.8
"Ensuring native-like quality of large language model (LLM) responses across
many languages is challenging. To address this, we introduce MENLO, a framework
that operationalizes the evaluation of native-like response quality based on
audience design-inspired mechanisms. Using MENLO, we create a datas..."
"We ran one of our hardest computer-use benchmarks on Anthropic Sonnet 4.5, side-by-side with Sonnet 4.
Ask: "Install LibreOffice and make a sales table".
Sonnet 4.5: 214 turns, clean trajectory
Sonnet 4: 316 turns, major detours
The difference shows up in multi-step sequences where errors compou..."
via Arxiv 👤 Kyoungjun Park, Yifan Yang, Juheon Yi et al. 📅 2025-10-02
⚡ Score: 6.8
"With the rapid advancement of AI-generated videos, there is an urgent need
for effective detection tools to mitigate societal risks such as misinformation
and reputational harm. In addition to accurate classification, it is essential
that detection models provide interpretable explanations to ensure..."
via Arxiv 👤 Yuyang Liu, Chuan Wen, Yihang Hu et al. 📅 2025-09-30
⚡ Score: 6.8
"Designing dense rewards is crucial for reinforcement learning (RL), yet in
robotics it often demands extensive manual effort and lacks scalability. One
promising solution is to view task progress as a dense reward signal, as it
quantifies the degree to which actions advance the system toward task
co..."
via Arxiv 👤 Siddarth Venkatraman, Vineet Jain, Sarthak Mittal et al. 📅 2025-09-30
⚡ Score: 6.8
"Test-time scaling methods improve the capabilities of large language models
(LLMs) by increasing the amount of compute used during inference to make a
prediction. Inference-time compute can be scaled in parallel by choosing among
multiple independent solutions or sequentially through self-refinement..."
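The two regimes in that abstract reduce to a few lines each: parallel scaling picks among independent samples, sequential scaling revises one answer repeatedly. A sketch, with `refine` standing in for whatever revision step a real system uses:

```python
from collections import Counter

def majority_vote(answers):
    """Parallel scaling: sample many independent answers, return the mode."""
    return Counter(answers).most_common(1)[0][0]

def self_refine(answer, refine, steps=3):
    """Sequential scaling: repeatedly revise a single answer."""
    for _ in range(steps):
        answer = refine(answer)
    return answer

winner = majority_vote(["42", "41", "42", "43", "42"])  # → "42"
```

The paper's question is when to spend a fixed token budget on one axis versus the other; this sketch only shows the two mechanisms, not the allocation policy.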
"Quick paper highlight (adapted from TLDR thread):
Finds no special advantage using an LLM to predict its own correctness (a trend in prior work), instead finding that LLMs benefit from learning to predict the correctness of many other models β becoming a GCM.
--
Training 1 GCM is strictly mor..."
via Arxiv 👤 Yuxiao Qu, Anikait Singh, Yoonho Lee et al. 📅 2025-10-02
⚡ Score: 6.7
"Reasoning requires going beyond pattern matching or memorization of solutions
to identify and implement "algorithmic procedures" that can be used to deduce
answers to hard problems. Doing so requires realizing the most relevant
primitives, intermediate results, or shared procedures, and building upo..."
via Arxiv 👤 Hala Sheta, Eric Huang, Shuyu Wu et al. 📅 2025-10-02
⚡ Score: 6.6
"We introduce VLM-Lens, a toolkit designed to enable systematic benchmarking,
analysis, and interpretation of vision-language models (VLMs) by supporting the
extraction of intermediate outputs from any layer during the forward pass of
open-source VLMs. VLM-Lens provides a unified, YAML-configurable i..."
via Arxiv 👤 Jessica Bader, Mateusz Pach, Maria A. Bravo et al. 📅 2025-09-30
⚡ Score: 6.5
"Text-to-Image (T2I) generation models have advanced rapidly in recent years,
but accurately capturing spatial relationships like "above" or "to the right
of" poses a persistent challenge. Earlier methods improved spatial relationship
following with external position control. However, as architecture..."
🎯 AI Regulation • Impact of EU Policies • Effectiveness of Regulations
💬 "Regulations are even more important when data from citizens and local companies is being exported"
• "The regulations will not protect us, just another way of them to impose giant fines on US companies"
via Arxiv 👤 Linh The Nguyen, Chi Tran, Dung Ngoc Nguyen et al. 📅 2025-10-02
⚡ Score: 6.5
"We introduce AccurateRAG -- a novel framework for constructing
high-performance question-answering applications based on retrieval-augmented
generation (RAG). Our framework offers a pipeline for development efficiency
with tools for raw dataset processing, fine-tuning data generation, text
embedding..."
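The retrieval half of any RAG pipeline can be sketched in toy form; a real system (AccurateRAG included) would use a trained embedding model, but bag-of-words counts make the shape of the pipeline visible:

```python
from collections import Counter

def bow_embed(text):
    # Toy bag-of-words "embedding"; a real pipeline uses a trained encoder.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = sum(v * v for v in a.values()) ** 0.5
    nb = sum(v * v for v in b.values()) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, chunks, k=2):
    """Rank chunks by similarity to the query, keep the top k."""
    q = bow_embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, bow_embed(c)), reverse=True)
    return ranked[:k]

chunks = [
    "Granite 4 mixes Mamba and transformer layers.",
    "Sora 2 generates video with synchronized sound.",
    "F2LLM provides embedding models in three sizes.",
]
top = retrieve("which model mixes mamba and transformers", chunks, k=1)
```

The retrieved chunks are then pasted into the prompt; the framework's pitch is about the tooling around this loop (dataset processing, fine-tuning data generation), not the loop itself.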
via Arxiv 👤 Raphael Tang, Crystina Zhang, Wenyan Li et al. 📅 2025-10-02
⚡ Score: 6.3
"In arena-style evaluation of large language models (LLMs), two LLMs respond
to a user query, and the user chooses the winning response or deems the
"battle" a draw, resulting in an adjustment to the ratings of both models. The
prevailing approach for modeling these rating dynamics is to view battles..."
via Arxiv 👤 Florian Grötschla, Longxiang Jiao, Luca A. Lanzendörfer et al. 📅 2025-09-30
⚡ Score: 6.3
"We introduce Panama, an active learning framework to train parametric guitar
amp models end-to-end using a combination of an LSTM model and a WaveNet-like
architecture. With Panama, one can create a virtual amp by recording samples
that are determined through an ensemble-based active learning strate..."
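Ensemble-based active learning usually means querying where the ensemble disagrees most. A toy sketch of that selection rule (plain functions stand in for the amp models; this is not Panama's actual architecture):

```python
import numpy as np

def select_by_disagreement(candidates, ensemble, k):
    """Active-learning sketch: pick the k inputs where ensemble members
    disagree most, measured as prediction variance across models."""
    preds = np.stack([m(candidates) for m in ensemble])  # (models, n)
    disagreement = preds.var(axis=0)
    return np.argsort(disagreement)[-k:]

# Toy "amp models": saturating nonlinearities with different gains.
ensemble = [lambda x, g=g: np.tanh(g * x) for g in (0.5, 1.0, 2.0)]
candidates = np.linspace(-3, 3, 7)
picked = select_by_disagreement(candidates, ensemble, k=2)
```

The selected inputs are the ones worth recording through the real amp next, which is how the framework keeps the number of required samples down.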
"As large language models (LLMs) begin to saturate existing benchmarks,
automated benchmark creation using LLMs (LLM as a benchmark) has emerged as a
scalable alternative to slow and costly human curation. While these generated
test sets have the potential to cheaply rank models, we demonstrate a crit..."
via Arxiv 👤 Litu Rout, Andreas Lugmayr, Yasamin Jafarian et al. 📅 2025-10-02
⚡ Score: 6.3
"We study the problem of posterior sampling using pretrained discrete
diffusion foundation models, aiming to recover images from noisy measurements
without retraining task-specific models. While diffusion models have achieved
remarkable success in generative modeling, most advances rely on continuous..."
via Arxiv 👤 Seiji Maekawa, Jackson Hassell, Pouya Pezeshkpour et al. 📅 2025-09-30
⚡ Score: 6.3
"As language models gain access to external tools via structured function
calls, they become increasingly more capable of solving complex, multi-step
tasks. However, existing benchmarks for tool-augmented language models (TaLMs)
provide insufficient control over factors such as the number of function..."
via Arxiv 👤 Yu-Chien Liao, Jr-Jen Chen, Chi-Pin Huang et al. 📅 2025-10-02
⚡ Score: 6.3
"Updating diffusion models in an incremental setting would be practical in
real-world applications yet computationally challenging. We present a novel
learning strategy of Concept Neuron Selection (CNS), a simple yet effective
approach to perform personalization in a continual learning scheme. CNS
un..."
via Arxiv 👤 Anna Kuzina, Maciej Pioro, Paul N. Whatmough et al. 📅 2025-10-02
⚡ Score: 6.3
"Large Language Models (LLMs) excel at multi-step reasoning problems with
explicit chain-of-thought (CoT), but verbose traces incur significant
computational costs and memory overhead, and often carry redundant, stylistic
artifacts. Latent reasoning has emerged as an efficient alternative that
intern..."
via Arxiv 👤 João Vitorino, Eva Maia, Isabel Praça et al. 📅 2025-09-30
⚡ Score: 6.3
"Due to the susceptibility of Artificial Intelligence (AI) to data
perturbations and adversarial examples, it is crucial to perform a thorough
robustness evaluation before any Machine Learning (ML) model is deployed.
However, examining a model's decision boundaries and identifying potential
vulnerabi..."
via Arxiv 👤 Wen Yang, Junhong Wu, Chong Li et al. 📅 2025-10-02
⚡ Score: 6.3
"Recent advancements in Reinforcement Post-Training (RPT) have significantly
enhanced the capabilities of Large Reasoning Models (LRMs), sparking increased
interest in the generalization of RL-based reasoning. While existing work has
primarily focused on investigating its generalization across tasks..."
via Arxiv 👤 Yaxin Du, Yuanshuo Zhang, Xiyuan Yang et al. 📅 2025-10-02
⚡ Score: 6.3
"Information seeking is a fundamental requirement for humans. However,
existing LLM agents rely heavily on open-web search, which exposes two
fundamental weaknesses: online content is noisy and unreliable, and many
real-world tasks require precise, domain-specific knowledge unavailable from
the web...."
via Arxiv 👤 Runzhe Zhan, Yafu Li, Zhi Wang et al. 📅 2025-10-02
⚡ Score: 6.3
"Reinforcement learning from verifiable rewards (RLVR) is an emerging paradigm
for improving the reasoning ability of large language models. However, standard
on-policy training discards rollout experiences after a single update, leading
to computational inefficiency and instability. While prior work..."
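Keeping rollouts around for more than one update is, mechanically, a replay buffer. A hypothetical sketch of that interface (the paper's actual reuse scheme and importance-weighting are not shown here):

```python
import random
from collections import deque

class RolloutBuffer:
    """Sketch of off-policy experience reuse for RLVR: retain recent
    rollouts instead of discarding them after a single update."""

    def __init__(self, capacity=1024):
        self.buffer = deque(maxlen=capacity)  # oldest rollouts evicted first

    def add(self, prompt, response, reward):
        self.buffer.append((prompt, response, reward))

    def sample(self, batch_size, rng=random):
        return rng.sample(list(self.buffer), min(batch_size, len(self.buffer)))

buf = RolloutBuffer(capacity=4)
for i in range(6):
    buf.add(f"p{i}", f"r{i}", reward=float(i % 2))  # two oldest get evicted
batch = buf.sample(2)
```

The catch the abstract alludes to is staleness: replayed rollouts came from an older policy, so naive reuse biases the gradient unless it is corrected for.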
via Arxiv 👤 Alexander Fishkov, Kajetan Schweighofer, Mykyta Ielanskyi et al. 📅 2025-09-30
⚡ Score: 6.3
"Quantifying uncertainty of machine learning model predictions is essential
for reliable decision-making, especially in safety-critical applications.
Recently, uncertainty quantification (UQ) theory has advanced significantly,
building on a firm basis of learning with proper scoring rules. However, t..."
via Arxiv 👤 Qin Shi, Amber Yijia Zheng, Qifan Song et al. 📅 2025-10-02
⚡ Score: 6.3
"We propose the task of knowledge distillation detection, which aims to
determine whether a student model has been distilled from a given teacher,
under a practical setting where only the student's weights and the teacher's
API are available. This problem is motivated by growing concerns about model..."
via Arxiv 👤 Mykyta Ielanskyi, Kajetan Schweighofer, Lukas Aichberger et al. 📅 2025-10-02
⚡ Score: 6.3
"Hallucinations are a common issue that undermine the reliability of large
language models (LLMs). Recent studies have identified a specific subset of
hallucinations, known as confabulations, which arise due to predictive
uncertainty of LLMs. To detect confabulations, various methods for estimating
p..."
💬 "LeCun has correctly identified that LLM is only one type of intelligence"
• "This seems like the same exact talk LeCun has been giving for years"
via Arxiv 👤 Runqian Wang, Yilun Du 📅 2025-10-02
⚡ Score: 6.1
"We introduce Equilibrium Matching (EqM), a generative modeling framework
built from an equilibrium dynamics perspective. EqM discards the
non-equilibrium, time-conditional dynamics in traditional diffusion and
flow-based generative models and instead learns the equilibrium gradient of an
implicit en..."