Last updated: 2026-05-13 | Server uptime: 99.9% ⚡
🔬 RESEARCH
via Arxiv
👤 Nikita Kezins, Urbas Ekka, Pascal Berrang et al.
📅 2026-05-11
⚡ Score: 8.1
"Guardrail Classifiers defend production language models against harmful behavior, but although results seem promising in testing, they provide no formal guarantees. Providing formal guarantees for such models is hard because "harmful behavior" has no natural specification in a discrete input space:..."
🔬 RESEARCH
"Why does weight decay work? We prove that, in any fixed-precision regime, the smallest weight norm of a looped neural network outputting a binary string equals the Kolmogorov complexity of that string, up to a logarithmic factor. This implies that weight decay induces a prior matching Solomonoff's u..."
📰 NEWS
⬆️ 295 ups
⚡ Score: 7.6
"We open-sourced Needle, a 26M parameter function-calling (tool use) model. It runs at 6000 tok/s prefill and 1200 tok/s decode on consumer devices.
We were always frustrated by the little effort made towards building agentic models that run on budget phones, so we conducted investigations that led ..."
🔬 RESEARCH
🔺 5 pts
⚡ Score: 7.5
📰 NEWS
⬆️ 53 ups
⚡ Score: 7.4
"TabPFN-3 was released today, the next iteration of the tabular foundation model, originally published in Nature.
Quick recap for anyone new to TabPFN: TabPFN predicts on tabular data in a single forward pass - no training, no hyperparameter search, no tuning. Built on TabPFN-2.5 (Nov 2025) and TabP..."
📰 NEWS
⬆️ 23 ups
⚡ Score: 7.4
"Hey fellow Llamas, keeping it short.
We just shipped **DFlash** and **PFlash** support for the AMD Ryzen AI MAX+ 395 iGPU (gfx1151, Strix Halo, 128 GiB unified memory). Same Luce DFlash stack from [the RTX 3090 post a couple weeks back](
https://www.reddit.com/r/LocalLLaMA/comments/1sx8uok/luce_dfla..."
📰 NEWS
⬆️ 77 ups
⚡ Score: 7.3
"now you can evaluate your models at home, sounds like a perfect tool to compare quants and finetunes
*Datasets: AIME, AIME2025, GSM8K, GPQA*..."
📰 NEWS
🔺 2 pts
⚡ Score: 7.2
🔬 RESEARCH
via Arxiv
👤 Anas Mahmoud, MohammadHossein Rezaei, Zihao Wang et al.
📅 2026-05-12
⚡ Score: 7.1
"Reinforcement learning with verifiable rewards has enabled strong post-training gains in domains such as math and coding, though many open-ended settings rely on rubric-based rewards. We study reward hacking in rubric-based RL, where a policy is optimized against a training verifier but evaluated ag..."
🔬 RESEARCH
via Arxiv
👤 Guinan Su, Yanwu Yang, Xueyan Li et al.
📅 2026-05-12
⚡ Score: 7.0
"The continued improvements in language model capability have unlocked their widespread use as drivers of autonomous agents, for example in coding or computer use applications. However, the core of these systems has not changed much since early instruction-tuned models like ChatGPT. Even advanced AI..."
🔬 RESEARCH
via Arxiv
👤 Haoyu Wang, Yuliang Song, Tao Li et al.
📅 2026-05-12
⚡ Score: 7.0
"Large Language Models (LLMs) struggle to solve complex combinatorial problems through direct reasoning, so recent neuro-symbolic systems increasingly use them to synthesize executable solvers. A central design question is how the LLM should represent the solver, and whether it should also attempt to..."
📰 NEWS
🔺 1 pts
⚡ Score: 7.0
🔬 RESEARCH
via Arxiv
👤 Shauli Ravfogel, Gilad Yehudai, Joan Bruna et al.
📅 2026-05-12
⚡ Score: 6.9
"How do transformer language models memorize factual associations? A common view casts internal weight matrices as associative memories over pairs of embeddings, requiring parameter counts that scale linearly with the number of facts. We develop a theoretical and empirical account of an alternative,..."
🔬 RESEARCH
via Arxiv
👤 Seokwon Jung, Alexander Rubinstein, Arnas Uselis et al.
📅 2026-05-12
⚡ Score: 6.9
"LLM-based agents increasingly operate in persistent environments where they must store, update, and reason over information across many sessions. While prior benchmarks evaluate only single-entity updates, MEME defines six tasks spanning the full space defined by the multi-entity and evolving axes,..."
🔬 RESEARCH
via Arxiv
👤 Rishabh Tiwari, Kusha Sareen, Lakshya A Agrawal et al.
📅 2026-05-12
⚡ Score: 6.9
"Large language models (LLMs) are trained for downstream tasks by updating their parameters (e.g., via RL). However, updating parameters forces them to absorb task-specific information, which can result in catastrophic forgetting and loss of plasticity. In contrast, in-context learning with fixed LLM..."
📰 NEWS
🔺 1 pts
⚡ Score: 6.9
🔬 RESEARCH
via Arxiv
👤 Mohammadreza Armandpour, Fatih Ilhan, David Harrison et al.
📅 2026-05-11
⚡ Score: 6.9
"On-policy distillation offers dense, per-token supervision for training reasoning models; however, it remains unclear under which conditions this signal is beneficial and under which it is detrimental. Which teacher model should be used, and in the case of self-distillation, which specific context s..."
📰 NEWS
🔺 4 pts
⚡ Score: 6.9
📰 NEWS
🔺 5 pts
⚡ Score: 6.8
🔬 RESEARCH
via Arxiv
👤 Eric Bigelow, Raphaël Sarfati, Daniel Wurgaft et al.
📅 2026-05-12
⚡ Score: 6.8
"Large Language Models (LLMs) update their behavior in context, which can be viewed as a form of Bayesian inference. However, the structure of the latent hypothesis space over which this inference operates remains unclear. In this work, we propose that LLMs assign beliefs over a low-dimensional geome..."
🔬 RESEARCH
via Arxiv
👤 Jacob Fein-Ashley, Paria Rashidinejad
📅 2026-05-12
⚡ Score: 6.8
"Looped Transformers offer a promising alternative to purely feed-forward computation by iteratively refining latent representations, improving language modeling and reasoning. Yet recurrent architectures remain unstable to train, costly to optimize and deploy, and constrained to small, fixed recurre..."
🔬 RESEARCH
via Arxiv
👤 Yuanda Xu, Hejian Sang, Zhengze Zhou et al.
📅 2026-05-12
⚡ Score: 6.8
"In settings where labeled verifiable training data is the binding constraint, each checked example should be allocated carefully. The standard practice is to use this data directly on the model that will be deployed, for example by running GRPO on the deployment student. We argue that this is often..."
🔬 RESEARCH
via Arxiv
👤 Xuhao Hu, Xi Zhang, Haiyang Xu et al.
📅 2026-05-12
⚡ Score: 6.8
"Computer Use Agents (CUAs) can act through both atomic GUI actions, such as click and type, and high-level tool calls, such as API-based file operations, but this hybrid action space often leaves them uncertain about when to continue with GUI actions or switch to tools, leading to suboptimal executi..."
📰 NEWS
⬆️ 57 ups
⚡ Score: 6.8
"I spent the past 5+ months building a pipeline that creates hybrid GGUF quant mixes. I also built it to learn from Unsloth (or other) models by utilizing their quant to tensor assignment. And some architectures like Qwen3.6 27B have super weird patterns that can get genuinely lower KLD while droppin..."
🔬 RESEARCH
via Arxiv
👤 Shuangrui Ding, Xuanlang Dai, Long Xing et al.
📅 2026-05-11
⚡ Score: 6.8
"Large language and vision-language models increasingly power agents that act on a user's behalf through command-line interface (CLI) harnesses. However, most agent benchmarks still rely on synthetic sandboxes, short-horizon tasks, mock-service APIs, and final-answer checks, leaving open whether agen..."
📰 NEWS
⬆️ 729 ups
⚡ Score: 6.7
"No phone, PC, Wi-Fi, link cable, or cloud inference.
• The cartridge boots a ROM, and the GBC runs the model itself.
• The model is Andrej Karpathy’s TinyStories-260K, converted to INT8 weights with fixed-point math so it can run without floating point.
• Built with GBDK-2020 as an MBC5 Game..."
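To illustrate what "INT8 weights with fixed-point math" can look like, here is a minimal sketch of integer-only arithmetic. The power-of-two scale, `SCALE_BITS`, and the helper names are assumptions made for illustration; the excerpt does not say which quantization scheme the project actually uses.

```python
from typing import List

# Hypothetical scale: a stored integer q represents the real value q / 2**SCALE_BITS.
SCALE_BITS = 6


def quantize_int8(values: List[float], scale_bits: int = SCALE_BITS) -> List[int]:
    """Round each float to the nearest fixed-point step and clamp to the int8 range."""
    quantized = []
    for v in values:
        q = round(v * (1 << scale_bits))
        quantized.append(max(-128, min(127, q)))
    return quantized


def fixed_point_dot(qa: List[int], qb: List[int], scale_bits: int = SCALE_BITS) -> int:
    """Dot product using only integer multiply, add, and shift."""
    acc = 0
    for a, b in zip(qa, qb):
        acc += a * b  # each product carries a scale of 2**(2 * scale_bits)
    # Shift right once to return to a single scale factor (floors toward -infinity).
    return acc >> scale_bits


def to_float(q: int, scale_bits: int = SCALE_BITS) -> float:
    """Convert a fixed-point integer back to a float, for inspection only."""
    return q / (1 << scale_bits)


weights = quantize_int8([0.5, -0.25, 1.0])
inputs = quantize_int8([1.0, 0.5, -0.5])
result = fixed_point_dot(weights, inputs)
```

Here `to_float(result)` recovers the floating-point dot product exactly because every input is a multiple of the fixed-point step; in general the result is an approximation.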
🔬 RESEARCH
via Arxiv
👤 Tom Sander, Hongyan Chang, Tomáš Souček et al.
📅 2026-05-12
⚡ Score: 6.7
"We introduce TextSeal, a state-of-the-art watermark for large language models. Building on Gumbel-max sampling, TextSeal introduces dual-key generation to restore output diversity, along with entropy-weighted scoring and multi-region localization for improved detection. It supports serving optimizat..."
🔬 RESEARCH
via Arxiv
👤 Sagi Ahrac, Noya Hochwald, Mor Geva
📅 2026-05-12
⚡ Score: 6.7
"Sparse Mixture-of-Experts (SMoE) models enable scaling language models efficiently, but training them remains challenging, as routing can collapse onto few experts and auxiliary load-balancing losses can reduce specialization. Motivated by these hurdles, we study how routing decisions in SMoEs are f..."
🔬 RESEARCH
via Arxiv
👤 Joel Rorseth, Parke Godfrey, Lukasz Golab et al.
📅 2026-05-11
⚡ Score: 6.7
"This paper demonstrates RUBEN, an interactive tool for discovering minimal rules to explain the outputs of retrieval-augmented large language models (LLMs) in data-driven applications. We leverage novel pruning strategies to efficiently identify a minimal set of rules that subsume all others. We fur..."
🔬 RESEARCH
via Arxiv
👤 Gaotang Li, Bhavana Dalvi Mishra, Zifeng Wang et al.
📅 2026-05-11
⚡ Score: 6.7
"Training deep research agents, namely systems that plan, search, evaluate evidence, and synthesize long-form reports, pushes reinforcement learning beyond the regime of verifiable rewards. Their outputs lack ground-truth answers, their trajectories span many tool-augmented decisions, and standard po..."
🔬 RESEARCH
via Arxiv
👤 Simon Yu, Derek Chong, Ananjan Nandi et al.
📅 2026-05-11
⚡ Score: 6.7
"We introduce Shepherd, a functional programming model that formalizes meta-agent operations on target agents as functions, with core operations mechanized in Lean. Shepherd records every agent-environment interaction as a typed event in a Git-like execution trace, enabling any past state to be forke..."
📰 NEWS
⬆️ 391 ups
⚡ Score: 6.6
"r/ClaudeAI • also crosspost to r/LocalLLaMA and r/artificial
I lost $187 to this and want to save others the same headache.
**What happened**
I run Claude Code headlessly via Windows Task Scheduler. My project repo has a `.env` file with `ANTHROPIC_API_KEY` set - legitimately, for a separ..."
🔬 RESEARCH
via Arxiv
👤 Yash Akhauri, Mohamed S. Abdelfattah
📅 2026-05-11
⚡ Score: 6.6
"Efficient LLM inference research has largely focused on reducing the cost of each decoding step (e.g., using quantization, pruning, or sparse attention), typically applying a uniform computation budget to every generated token. In practice, token difficulty varies widely, so static compression can o..."
🔬 RESEARCH
via Arxiv
👤 Mingxi Zou, Zhihan Guo, Langzhang Liang et al.
📅 2026-05-11
⚡ Score: 6.6
"Long-horizon language agents must operate under limited runtime memory, yet existing memory mechanisms often organize experience around descriptive criteria such as relevance, salience, or summary quality. For an agent, however, memory is valuable not because it faithfully describes the past, but be..."
🔬 RESEARCH
via Arxiv
👤 Roxana Geambasu, Mariana Raykova, Pierre Tholoniat et al.
📅 2026-05-11
⚡ Score: 6.6
"The dominant paradigm for AI agents is an "on-the-fly" loop in which agents synthesize plans and execute actions within seconds or minutes in response to user prompts. We argue that this paradigm short-circuits disciplined software engineering (SE) processes -- iterative design, rigorous testing, ad..."
📰 NEWS
⬆️ 116 ups
⚡ Score: 6.5
"External link discussion - see full content at original source."
📰 NEWS
⬆️ 175 ups
⚡ Score: 6.5
"Morning Everyone!
Big one today (**104 changes!**): Claude Code just went async.
The new `/goal` command lets you set a completion condition ("all tests pass and the PR is ready"), then Claude keeps grinding across turns until it's hit. The new `claude agents` view shows every session you've got r..."
🛠️ SHOW HN
🔺 39 pts
⚡ Score: 6.5
🔬 RESEARCH
via Arxiv
👤 Tz-Huan Hsu, Jheng-Hong Yang, Jimmy Lin
📅 2026-05-11
⚡ Score: 6.5
"Does a lexical retriever suffice as large language models (LLMs) become more capable in an agentic loop? This question naturally arises when building deep research systems. We revisit it by pairing BM25 with frontier LLMs that have better reasoning and tool-use abilities. To support researchers aski..."
🔬 RESEARCH
via Arxiv
👤 Junhao Shen, Teng Zhang, Xiaoyan Zhao et al.
📅 2026-05-11
⚡ Score: 6.5
"Large language model agents increasingly rely on external skills to solve complex tasks, where skills act as modular units that extend their capabilities beyond what parametric memory alone supports. Existing methods assume external skills either accumulate as persistent guidance or internalized int..."
🛠️ SHOW HN
🔺 40 pts
⚡ Score: 6.4
📰 NEWS
⬆️ 3434 ups
⚡ Score: 6.2
"External link discussion - see full content at original source."
📰 NEWS
⬆️ 11 ups
⚡ Score: 6.2
"I have analyzed some decoder transformer models using Lyapunov spectral analysis and found that the ratio of the MLP and attention spectral norms strongly indicates whether a model will eventually collapse to rank-1 or not by the final layers.
I found that the spectral ratio is best kept around 0.5..."
🔬 RESEARCH
⬆️ 13 ups
⚡ Score: 6.2
"Speculative decoding accelerates LLM inference by drafting future tokens with a small model, but drafter models degrade sharply under template perturbation and long-context inputs. We identify a previously-unreported phenomenon we call **attention drift**: as the drafter generates successive t..."
📰 NEWS
⬆️ 19 ups
⚡ Score: 6.2
"I've been spending the last several months reading every published psychology paper I can find on AI chatbot use, and I noticed something that genuinely bothers me as both a researcher and a Claude user.
Almost every empirical study samples one of three populations: ChatGPT users, Character.AI u..."
📰 NEWS
🔺 2 pts
⚡ Score: 6.2
🛠️ SHOW HN
🔺 8 pts
⚡ Score: 6.2
📰 NEWS
⬆️ 15 ups
⚡ Score: 6.2
"I was running blind watching Claude Code work, could not tell where my money was going, when it was stuck in a loop, or what it was doing with my filesystem. So i built something open source to make it visible. works with Claude Code, Codex CLI, Gemini CLI, Cursor, and any MCP server.
A scan ..."
🔬 RESEARCH
via Arxiv
👤 Alireza Nadali, Patrick Cooper, Ashutosh Trivedi et al.
📅 2026-05-12
⚡ Score: 6.1
"We introduce KV-Fold, a simple, training-free long-context inference protocol that treats the key-value (KV) cache as the accumulator in a left fold over sequence chunks. At each step, the model processes the next chunk conditioned on the accumulated cache, appends the newly produced keys and values..."
📰 NEWS
⬆️ 29 ups
⚡ Score: 6.1
"Today I set up a full coding toolbox on a single RTX 5080 (with RAM offloading) that's actually viable.
**Autocomplete**: bartowski/Qwen2.5-Coder-7B-Instruct-GGUF:Q6_K_L
**Agentic**: unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q8_K_XL
---
### Why these models:
Qwen2.5 is still the best model for infill imo..."
🛠️ SHOW HN
🔺 2 pts
⚡ Score: 6.1
🔬 RESEARCH
via Arxiv
👤 Reza Khanmohammadi, Erfan Miahi, Simerjot Kaur et al.
📅 2026-05-11
⚡ Score: 6.1
"Large vision-language models suffer from visual ungroundedness: they can produce a fluent, confident, and even correct response driven entirely by language priors, with the image contributing nothing to the prediction. Existing confidence estimation methods cannot detect this, as they observe model..."
🔬 RESEARCH
via Arxiv
👤 Linus Heck, Filip Macák, Roman Andriushchenko et al.
📅 2026-05-11
⚡ Score: 6.1
"Shielding is a prominent model-based technique to ensure safety of autonomous agents. Classical shielding aims to ensure that nothing bad ever happens and comes with strong guarantees about safety and maximal permissiveness. However, shielding systems for probabilistic safety, where something bad is..."