HISTORICAL ARCHIVE - May 18, 2026
What was happening in AI on 2026-05-18
📰 NEWS
🔺 44 pts
⚡ Score: 9.2
📰 NEWS
🔺 261 pts
⚡ Score: 8.4
📰 NEWS
🔺 98 pts
⚡ Score: 8.2
📰 NEWS
⬆️ 20 ups
⚡ Score: 8.0
"PR #22673 (commit 4f13cb7) landed MTP speculative decoding in mainline llama.cpp on May 16. I tested it on two separate rigs.
Qwen3.6 27B, single-stream chat, temperature 0, median of 5 runs:
Strix Halo (Framework Desktop, ROCm 7.0.2):
* Q4\_K\_M: 11.7 → 21.2 tok/s (1.81×)
* Q8\_0: 7.4 → 18.1 ..."
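The speedup figure quoted in the excerpt is the ratio of throughput with MTP to throughput without it. A minimal sketch of that arithmetic, using only the numbers visible above; the function name `speedup` is illustrative and does not come from the post:

```python
def speedup(baseline_tok_s: float, mtp_tok_s: float) -> float:
    """Return the throughput ratio (with MTP / without MTP)."""
    if baseline_tok_s <= 0:
        raise ValueError("baseline throughput must be positive")
    return mtp_tok_s / baseline_tok_s


# Strix Halo, Q4_K_M figures from the excerpt: 11.7 -> 21.2 tok/s
q4_speedup = speedup(11.7, 21.2)
print(f"Q4_K_M: {q4_speedup:.2f}x")  # prints "Q4_K_M: 1.81x"
```

The result matches the 1.81× reported for Q4\_K\_M. The Q8\_0 line is cut off in the excerpt, so it is left out here.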
🔬 RESEARCH
via Arxiv
👤 Yishun Lu, Junhao Zhang, Zeyu Yang et al.
📅 2026-05-15
⚡ Score: 7.9
"Second-order methods offer an attractive path toward more sample-efficient LLM training, but their practical use is often blocked by the systems cost of maintaining and updating large matrix-based optimizer states. We introduce \textbf{Asteria}, a runtime system designed to remove this bottleneck by..."
🔬 RESEARCH
via Arxiv
👤 Xavier Theimer-Lienhard, Mushtaha El-Amin, Fay Elhassan et al.
📅 2026-05-15
⚡ Score: 7.8
"Clinical decision support systems (CDSS) require scrutable, auditable pipelines that enable rigorous, reproducible validation. Yet current LLM-based CDSS remain largely opaque. Most "open" models are open-weight only, releasing parameters while withholding the data provenance, curation procedures, a..."
🔬 RESEARCH
via Arxiv
👤 Rui Wen, Mark Russinovich, Andrew Paverd et al.
📅 2026-05-14
⚡ Score: 7.7
"Backdoor attacks pose a serious security threat to large language models (LLMs), which are increasingly deployed as general-purpose assistants in safety- and privacy-critical applications. Existing LLM backdoors rely primarily on content-based triggers, requiring explicit modification of the input t..."
🔬 RESEARCH
via Arxiv
👤 Parand A. Alamdari, Toryn Q. Klassen, Sheila A. McIlraith
📅 2026-05-15
⚡ Score: 7.7
"We examine one particular dimension of AI governance: how to monitor and audit AI-enabled products and services throughout the AI development lifecycle, from pre-deployment testing to post-deployment auditing. Combining principles from formal methods with SoTA machine learning, we propose techniques..."
📰 NEWS
🔺 2 pts
⚡ Score: 7.6
📰 NEWS
⬆️ 163 ups
⚡ Score: 7.5
"time to update your llama.cpp -> improved prompt processing speed..."
🔬 RESEARCH
🔺 1 pts
⚡ Score: 7.3
📰 NEWS
⬆️ 35 ups
⚡ Score: 7.3
"**World models** learn compact latent representations for planning without pixel reconstruction. LeWorldModel (LeWM), from LeCun's group at NYU, achieves stable end-to-end JEPA training by enforcing an isotropic Gaussian prior over the full latent space.
**The flaw:** real environment dynamics live..."
📰 NEWS
⬆️ 1 ups
⚡ Score: 7.3
"I set out to test whether AAVE-coded (African American English Vernacular) prompts cause MoE language models to route, deliberate, and respond differently from semantically matched AE (Academic English) prompts in safety-sensitive situations, especially when refusal behavior is weakened or removed.
..."
📰 NEWS
⬆️ 601 ups
⚡ Score: 7.3
"I was frustrated that every coding agent (OpenCode, Cursor, Claude Code) assumes you're running GPT-5.4 or Claude Opus. If you try them with a local model like Gemma or Qwen they fall apart. I find that often tool calls fail, context overflows, multi-step tasks collapse.
So I built SmallCode. It's ..."
🔬 RESEARCH
via Arxiv
👤 Karthik Raghu Iyer, Yazdan Jamshidi, Nicholas Bray et al.
📅 2026-05-14
⚡ Score: 7.3
"We introduce a reusable framework for auditing whether LLM attack benchmarks collectively cover the threat surface: a 4$\times$6 Target $\times$ Technique matrix grounded in STRIDE, constructed from a 507-leaf taxonomy -- 401 data-populated and 106 threat-model-derived leaves -- of inference-time at..."
🔬 RESEARCH
via Arxiv
👤 Pratinav Seth, Vinay Kumar Sankarapu
📅 2026-05-14
⚡ Score: 7.3
"This position paper argues that behavioural assurance, even when carefully designed, is being asked to carry safety claims it cannot verify. AI governance frameworks enacted between 2019 and early 2026 require reviewable evidence of properties such as the absence of hidden objectives, resistance to..."
📰 NEWS
⬆️ 2 ups
⚡ Score: 7.2
"Sharing a project I've been building: **Argyph**, an **MCP** **server** that gives AI coding agents (Claude, or anything that speaks MCP) structured and semantic **understanding** of a **codebase**.
The problem: agents are good at reasoning but bad at retrieval. They grep, guess, and pull whole fil..."
🔬 RESEARCH
via Arxiv
👤 Saisab Sadhu, Pratinav Seth, Vinay Kumar Sankarapu
📅 2026-05-14
⚡ Score: 7.2
"Standard unlearning evaluations measure behavioral suppression in full precision, immediately after training, despite every deployed language model being quantized first. Recent work has shown that 4-bit post-training quantization can reverse machine unlearning; we show this is not a tuning artefact..."
📰 NEWS
🔺 1 pts
⚡ Score: 7.2
📰 NEWS
⬆️ 7 ups
⚡ Score: 7.1
"I've been digging into Meta's SAM 2 (Segment Anything in Images & Videos) and wrote up a detailed technical overview with some original analysis on its memory design.
**Quick summary of SAM 2:**
* Unified model for promptable image + video segmentation
* Streaming memory architecture with a me..."
🔬 RESEARCH
via Arxiv
👤 Shang Zhou, Wenhao Chai, Kaiyuan Liu et al.
📅 2026-05-14
⚡ Score: 7.1
"Test-time compute scaling is a primary axis for improving LLM reasoning. Existing methods primarily scale depth by extending a single reasoning trace. Scaling breadth by sampling multiple candidates in parallel is straightforward, but introduces a selection bottleneck: choosing the best candidate wi..."
📰 NEWS
⬆️ 3 ups
⚡ Score: 7.1
"The thing that keeps bothering me about health AI demos is not that they sound bad.
It’s that they sound good enough to borrow trust they haven’t earned.
A model can write a beautiful note, a clean care plan, or a confident explanation and still be wrong in exactly the places a clinician or patien..."
📰 NEWS
⬆️ 1 ups
⚡ Score: 7.1
"I’ve been working on a CUDA-first inference runtime for small-batch / realtime ML workloads.
The core idea is simple: instead of treating PyTorch / TensorRT / generic graph runtimes as the main execution path, I rewrite the model inference path directly with C++/CUDA kernels.
This started from rob..."
📰 NEWS
🔺 1 pts
⚡ Score: 7.0
📰 NEWS
🔺 1 pts
⚡ Score: 7.0
📰 NEWS
🔺 1 pts
⚡ Score: 7.0
📰 NEWS
🔺 1 pts
⚡ Score: 7.0
📰 NEWS
⬆️ 1 ups
⚡ Score: 7.0
"About a year ago I was running a single open-source AI image detector in production for a fact-checking pipeline. The accuracy on paper was solid, the accuracy on real submitted images was not. The same image classified differently across reruns when I varied preprocessing. Images from generators re..."
🔬 RESEARCH
via Arxiv
👤 Zhengxi Lu, Zhiyuan Yao, Zhuowen Han et al.
📅 2026-05-14
⚡ Score: 7.0
"Reinforcement learning (RL) has emerged as a central paradigm for post-training LLM agents, yet its trajectory-level reward signal provides only coarse supervision for long-horizon interaction. On-Policy Self-Distillation (OPSD) complements RL by introducing dense token-level guidance from a teacher..."
📰 NEWS
⬆️ 23 ups
⚡ Score: 6.9
"I’m posting this as a warning. I’m done with Cursor after this.
I was using Agent mode on Windows for a normal dev task: revert a small change by removing a subfolder in a repo. I did not ask to delete my user folder, Desktop, Documents, or anything outside the project.
The agent ran cmd /c rmdir ..."
📰 NEWS
⬆️ 808 ups
⚡ Score: 6.9
"paid for both since January. tracked which one I actually used per task type. sharing because most comparison posts are tribal and I think the picture is more boring than people make it.
for writing (longform, analysis, structured docs): claude wins. opus 4.7 and sonnet 4.6 both better than gpt-5 a..."
📰 NEWS
⬆️ 3 ups
⚡ Score: 6.9
"Residual Coupling (RC) connects frozen language models in parallel using small, learned linear bridge projections. These bridges read hidden states from one model and inject additive updates into the residual stream of another at intermediate layers. In bilateral setups, simultaneous return bridges ..."
📰 NEWS
🔺 575 pts
⚡ Score: 6.8
🔬 RESEARCH
via Arxiv
👤 Md Tahmid Rahman Laskar, Xue-Yong Fu, Seyyed Saeed Sarfjoo et al.
📅 2026-05-14
⚡ Score: 6.8
"Voice agents increasingly require reliable tool use from speech, whereas prominent tool-calling benchmarks remain text-based. We study whether verified text benchmarks can be converted into controlled audio-based tool calling evaluations without re-annotating the tool schema and gold labels. Our dat..."
🔬 RESEARCH
via Arxiv
👤 Xiaohua Zhan, Kazuki Egashira, Robin Staab et al.
📅 2026-05-14
⚡ Score: 6.8
"LLM quantization has become essential for memory-efficient deployment. Recent work has shown that quantization schemes can pose critical security risks: an adversary may release a model that appears benign in full precision but exhibits malicious behavior once quantized by users. However, existing q..."
🔬 RESEARCH
via Arxiv
👤 Stratis Tsirtsis, Kai Rawal, Chris Russell et al.
📅 2026-05-15
⚡ Score: 6.8
"Generative artificial intelligence (AI) is increasingly integrated into the online platforms where humans exchange opinions; large language models (LLMs) now polish users' posts on LinkedIn and provide context for content shared on X. While prior work has shown that AI can express biased opinions an..."
📰 NEWS
🔺 1 pts
⚡ Score: 6.8
📰 NEWS
⬆️ 84 ups
⚡ Score: 6.7
"If you're building AI agents or SaaS products used by European companies (or processing EU resident data), the EU AI Act applies to you regardless of where your company is based.
Full enforcement for high-risk systems starts August 2, 2026. High-risk means: credit scoring, recruitment filtering, he..."
🔬 RESEARCH
via Arxiv
👤 Ryan Wei Heng Quek, Sanghyuk Lee, Alfred Wei Lun Leong et al.
📅 2026-05-14
⚡ Score: 6.7
"Large language models (LLMs) achieve strong performance across a wide range of tasks, but remain frozen after pretraining until subsequent updates. Many real-world applications require timely, domain-specific information, motivating the need for efficient mechanisms to incorporate new knowledge. In..."
🔬 RESEARCH
via Arxiv
👤 Zhen Zhang, Liangcai Su, Zhuo Chen et al.
📅 2026-05-15
⚡ Score: 6.7
"Deep research agents have achieved remarkable progress on complex information seeking tasks. Even long ReAct style rollouts explore only a single trajectory, while recent state of the art systems scale inference time compute via parallel search and aggregation. Yet deep research answers are composed..."
📰 NEWS
🔺 1 pts
⚡ Score: 6.7
🔬 RESEARCH
via Arxiv
👤 Ziyin Zhang, Zihan Liao, Hang Yu et al.
📅 2026-05-14
⚡ Score: 6.7
"The development of high-quality text embeddings is increasingly drifting toward an exclusionary future, defined by three critical barriers: prohibitive computational costs, a narrow linguistic focus that neglects most of the world's languages, and a lack of transparency from closed-source or open-we..."
🔬 RESEARCH
via Arxiv
👤 Guangyu Feng, Huanzhi Mao, Prabal Dutta et al.
📅 2026-05-14
⚡ Score: 6.6
"Function calling, also known as tool use, is a core capability of modern LLM agents but is typically constrained by synchronous execution semantics. Under these semantics, LLM decoding is blocked until each function call completes, resulting in increasing end-to-end latency. In this work, we introdu..."
🔬 RESEARCH
via Arxiv
👤 Minghao Guo, Qingyue Jiao, Zeru Shi et al.
📅 2026-05-14
⚡ Score: 6.6
"Long-term agent memory is increasingly multimodal, yet existing evaluations rarely test whether agents preserve the visual evidence needed for later reasoning. In prior work, many visually grounded questions can be answered using only captions or textual traces, allowing answers to be inferred witho..."
🔬 RESEARCH
via Arxiv
👤 Igor Bogdanov, Chung-Horng Lung, Thomas Kunz et al.
📅 2026-05-15
⚡ Score: 6.6
"Deploying compound LLM agents in adversarial, partially observable sequential environments requires navigating several design dimensions: (1) what the agent sees, (2) how it reasons, and (3) how tasks are decomposed across components. Yet practitioners lack guidance on which design choices improve p..."
🔬 RESEARCH
via Arxiv
👤 Will Schwarzer, Scott Niekum
📅 2026-05-14
⚡ Score: 6.6
"Estimating how often an ML model will fail at deployment scale is central to pre-deployment safety assessment, but a feasible evaluation set is rarely large enough to observe the failures that matter. Jones et al. (2025) address this by extrapolating from the largest k failure scores in an evaluatio..."
🔬 RESEARCH
via Arxiv
👤 Sarah Martinson, Michael P. Brenner, Martyna Plomecka et al.
📅 2026-05-15
⚡ Score: 6.6
"Probabilistic forecasting of infectious diseases is crucial for public health but relies on labor-intensive manual model curation by expert modeling teams. This bespoke development bottlenecks scalability to granular geographic resolutions or emerging pathogens. Here, we present an autonomous system..."
🔬 RESEARCH
via Arxiv
👤 Igor Bogdanov, Chung-Horng Lung, Thomas Kunz et al.
📅 2026-05-15
⚡ Score: 6.5
"Can LLM agents improve decision-making through self-generated memory without gradient updates? We propose FORGE (Failure-Optimized Reflective Graduation and Evolution), a staged, population-based protocol that evolves prompt-injected natural-language memory for hierarchical ReAct agents. FORGE wraps..."
🔬 RESEARCH
via Arxiv
👤 Ziang Ye, Wentao Shi, Yuxin Liu et al.
📅 2026-05-15
⚡ Score: 6.5
"Large language model based agents often fail in unfamiliar environments due to premature exploitation: a tendency to act on prior knowledge before acquiring sufficient environment-specific information. We identify autonomous exploration as a critical yet underexplored capability for building adaptiv..."
🔬 RESEARCH
via Arxiv
👤 Evan Rose, Tushin Mallick, Matthew D. Laws et al.
📅 2026-05-14
⚡ Score: 6.5
"Autonomous multi-agent systems based on large language models (LLMs) have demonstrated remarkable abilities in independently solving complex tasks in a wide breadth of application domains. However, these systems hit critical reasoning, coordination, and computational scaling bottlenecks as the size..."
🔬 RESEARCH
via Arxiv
👤 Renning Pang, Tian Lan, Leyuan Liu et al.
📅 2026-05-14
⚡ Score: 6.5
"Large language model (LLM) based multi-turn dialogue systems often struggle to track dependencies across non-adjacent turns, undermining both consistency and scalability. As conversations lengthen, essential information becomes sparse and is buried in irrelevant context, while processing the entire..."
📰 NEWS
⬆️ 161 ups
⚡ Score: 6.5
"## TL;DR
- best setup I tested on a RTX 3090 24 GB: `ik_llama.cpp` + `Qwen3.6-27B-MTP-IQ4_KS.gguf`
- `156k` context, `q8_0/q8_0` KV, MTP, vision on CPU
- benchmark result on a `~5.9k` prompt + `1k` output: about `1261 tok/s` prefill, `72.9 tok/s` decode
- `llama.cpp` was a good start, BeeLlama wort..."
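To put the quoted prefill and decode rates in perspective, they can be turned into a rough end-to-end time for the benchmark request. This is a back-of-the-envelope sketch only: it treats "~5.9k" as 5900 tokens and "1k" as 1000 tokens, ignores model load and sampling overhead, and the function name `request_seconds` is made up for illustration.

```python
def request_seconds(prompt_tokens, output_tokens, prefill_tok_s, decode_tok_s):
    """Estimate (prefill, decode, total) seconds from token counts and rates."""
    prefill = prompt_tokens / prefill_tok_s
    decode = output_tokens / decode_tok_s
    return prefill, decode, prefill + decode


# Figures from the TL;DR above; the token counts are rounded assumptions.
prefill_s, decode_s, total_s = request_seconds(5900, 1000, 1261, 72.9)
print(f"prefill {prefill_s:.1f}s + decode {decode_s:.1f}s = {total_s:.1f}s")
# prints "prefill 4.7s + decode 13.7s = 18.4s"
```

Under these assumptions the request is decode-bound: generating 1k tokens takes roughly three times as long as ingesting the 5.9k-token prompt.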
🔬 RESEARCH
via Arxiv
👤 Shashwat Goel, Nikhil Chandak, Arvindh Arun et al.
📅 2026-05-14
⚡ Score: 6.5
"AI agents are being increasingly deployed in dynamic, open-ended environments that require adapting to new information as it arrives. To efficiently measure this capability for realistic use-cases, we propose building grounded simulations that replay real-world events in the order they occurred. We..."
📰 NEWS
⬆️ 34 ups
⚡ Score: 6.4
"Wanted a real head to head on the two TTS models that actually run well on CPU. Couldn't find one with proper numbers, so I ran one. Posting because the result was not what I expected going in.
Quick context for anyone who hasn't seen Supertonic 3 yet: it's a flow-matching TTS where you can dial do..."
📰 NEWS
⬆️ 77 ups
⚡ Score: 6.3
"With the MTP llama.cpp implementation in the Qwen3.6/3.5 models more VRAM is required for the MTP layer. However, many people don't realize this layer comes with its own KV cache which can also be quantized:
-cache-type-k-draft q8_0 -cache-type-v-draft q8_0
# edit: This is NOT quantizing the m..."
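A rough sense of what quantizing a KV cache to `q8_0` saves can be had from the storage formats alone. In ggml, `q8_0` stores each block of 32 values as 32 int8 values plus one fp16 scale (34 bytes), against 64 bytes for the same values in `f16`. The sketch below assumes the draft cache scales like an ordinary attention KV cache; the layer count, head count, head size, and context length are hypothetical placeholders, not the actual Qwen MTP configuration.

```python
# Bytes per cached value: f16 is 2 bytes; q8_0 is 34 bytes per 32-value block.
BYTES_PER_VALUE = {"f16": 2.0, "q8_0": 34 / 32}


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, cache_type):
    """Approximate KV cache size: K and V tensors for every layer and position."""
    values = 2 * n_layers * n_kv_heads * head_dim * n_ctx  # 2 = K plus V
    return values * BYTES_PER_VALUE[cache_type]


# Hypothetical shape: 1 draft layer, 8 KV heads of size 128, 32k context.
shape = dict(n_layers=1, n_kv_heads=8, head_dim=128, n_ctx=32768)
f16_bytes = kv_cache_bytes(cache_type="f16", **shape)
q8_bytes = kv_cache_bytes(cache_type="q8_0", **shape)
print(f"f16:  {f16_bytes / 2**20:.0f} MiB")   # prints "f16:  128 MiB"
print(f"q8_0: {q8_bytes / 2**20:.0f} MiB")    # prints "q8_0: 68 MiB"
```

Whatever the real dimensions are, the ratio between the two formats is fixed at 34/64, so `q8_0` needs a little over half the memory of `f16` for the same cache.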
📰 NEWS
🔺 2 pts
⚡ Score: 6.3
📰 NEWS
🔺 2 pts
⚡ Score: 6.3
📰 NEWS
⬆️ 23 ups
⚡ Score: 6.2
"I have been running some benchmarks on a heterogeneous 7-GPU cluster to see how different inference engines handle long context prefill using pipeline parallelism. My setup consists of a mix of Blackwell and Ada cards: one RTX PRO 6000 96GB, one PRO 5000 48GB, two 5090 32GB, and three modded 4090 48..."
📰 NEWS
🔺 1 pts
⚡ Score: 6.2
📰 NEWS
⬆️ 31 ups
⚡ Score: 6.2
"Buried in the Composer 2.5 announcement:
*Together*
*with SpaceXAI**, we're training a significantly larger model from scratch, using 10x more total compute. With Colossus 2's million H100-equivalents and our combined data and training techniques, w..."
📰 NEWS
🔺 4 pts
⚡ Score: 6.2
📰 NEWS
⬆️ 177 ups
⚡ Score: 6.2
"DystopiaBench runs 36 escalating scenarios across 6 dystopia types:
* Petrov: Autonomous weapons, nuclear override
* Orwell: Mass surveillance, truth manipulation
* Huxley: Behavioral conditioning, pleasure pacification
* Basaglia: Coercive therapeutic control
* LaGuardia: Regulatory capture, civic..."
🔬 RESEARCH
via Arxiv
👤 Ziyu Guo, Rain Liu, Xinyan Chen et al.
📅 2026-05-14
⚡ Score: 6.1
"Visual reasoning, often interleaved with intermediate visual states, has emerged as a promising direction in the field. A straightforward approach is to directly generate images via unified models during reasoning, but this is computationally expensive and architecturally non-trivial. Recent alterna..."
📰 NEWS
⬆️ 5 ups
⚡ Score: 6.1
"If you missed the Project Glasswing announcement last month: Anthropic built a security-focused model that autonomously found thousands of high-severity vulnerabilities across every major OS and web browser, then decided it was too dangerous to release publicly. Instead they gave access to \~40 orga..."
🔬 RESEARCH
via Arxiv
👤 Ellwil Sharma, Arastu Sharma
📅 2026-05-14
⚡ Score: 6.1
"Scaling Scientific Machine Learning (SciML) toward universal foundation models is bottlenecked by negative transfer: the simultaneous co-training of disparate partial differential equation (PDE) regimes can induce gradient conflict, unstable optimization, and plasticity loss in dense neural operator..."
🛠️ SHOW HN
🔺 1 pts
⚡ Score: 6.1
📰 NEWS
🔺 1 pts
⚡ Score: 6.1
🛠️ SHOW HN
🔺 15 pts
⚡ Score: 6.1