HISTORICAL ARCHIVE - March 18, 2026
What was happening in AI on 2026-03-18
Archive from: 2026-03-18 | Preserved for posterity
SECURITY
199 pts
Score: 9.2
Themes: Security challenges • Permission models • Sandbox limitations
• "Bash + CLI greatly expands what you can do beyond the native SQL capabilities"
• "If the model can request execution outside the sandbox, then the sandbox is not really an external boundary"
TOOLS
79 pts
Score: 8.9
Themes: Code review automation • False positive concerns • Separation of code writing and reviewing
• "This looks like it's doing style and structure changes, which for a codebase this size is going to add drag to existing development"
• "Another interesting metric, however, would be the false positive ratio"
TOOLS
400 ups
Score: 8.5
"I gave Claude persistent memory across every session by connecting Claude.ai and Claude Code through a custom MCP server on my private VPS. Here's the open source code.
I got tired of Claude forgetting everything between sessions. So I built a knowledge base server that sits on my VPS, ingests my O..."
Themes: Enthusiasm for open-source • Concerns about AI writing systems • Importance of manually writing notes
• "This is how it felt - superpowers"
• "The writing of the note / thought / etc... is what makes it valuable"
RESEARCH
via Arxiv
Erik Y. Wang, Sumeet Motwani, James V. Roggeveen et al.
2026-03-16
Score: 8.2
"Can AI make progress on important, unsolved mathematical problems? Large language models are now capable of sophisticated mathematical and scientific reasoning, but whether they can perform novel research is still widely debated and underexplored. We introduce HorizonMath, a benchmark of over 100 pr..."
RESEARCH
via Arxiv
Christopher Potts, Moritz Sudhof
2026-03-16
Score: 8.1
"AI systems fail silently far more often than they fail visibly. In a large-scale quantitative analysis of human-AI interactions from the WildChat dataset, we find that 78% of AI failures are invisible: something went wrong but the user gave no overt indication that there was a problem. These invisib..."
AI MODELS
6 pts
Score: 8.0
RESEARCH
73 pts
Score: 8.0
Themes: Intelligence benchmarks • Consciousness and sentience • Limitations of current AI
• "To be actually useful the AGI-we-actually-want benchmark should not only include positive indicators but also a list of unwanted behaviors"
• "What is the solution? A trillion tokens of system prompt to act as the 'soul/consciousness' of this AI agent?"
RESEARCH
via Arxiv
Kai Wang, Biaojie Zeng, Zeming Wei et al.
2026-03-16
Score: 7.9
"With the rapid development of LLM-based multi-agent systems (MAS), their significant safety and security concerns have emerged, which introduce novel risks going beyond single agents or LLMs. Despite attempts to address these issues, the existing literature lacks a cohesive safeguarding system speci..."
RESEARCH
via Arxiv
Lingyu Li, Yan Teng, Yingchun Wang
2026-03-16
Score: 7.8
"Existing behavioral alignment techniques for Large Language Models (LLMs) often neglect the discrepancy between surface compliance and internal unaligned representations, leaving LLMs vulnerable to long-tail risks. More crucially, we posit that LLMs possess an inherent state of moral indifference du..."
SECURITY
2 pts
Score: 7.8
TOOLS
101 pts
Score: 7.8
Themes: HTTP Proxy Deployment • Application Capabilities • Usability
• "this looks cool"
• "is this also deployable as an HTTP proxy?"
SAFETY
271 pts
Score: 7.5
Themes: AI-assisted coding • Automated code quality assurance • Human vs. AI programming
• "The amount of code needed is surprisingly small and your agent can write it!"
• "I refuse to release anything it makes for me. I know that it's not good enough, that I won't be able to properly maintain it"
AI MODELS
1 pt
Score: 7.5
SAFETY
1 up
Score: 7.4
"Something I kept running into while experimenting with autonomous agents is that most AI safety discussions focus on the wrong layer.
A lot of the conversation today revolves around:
• prompt alignment
• jailbreaks
• output filtering
• sandboxing
Those things matter, but once agents can intera..."
Themes: Execution Layer Risk • Authorization Boundaries • Idempotent Retries
• "The execution layer risk I keep seeing isn't just tool access - it's retry behavior."
• "Authorization boundaries at the execution layer are 10x more important than prompt-level safety."
DATA
1 pt
Score: 7.3
TOOLS
565 ups
Score: 7.3
Themes: Hardware Estimation • Model Performance • Tool Limitations
• "I hope it works better than the hardware estimation feature"
• "Hey if you like using production grade tools, best in class models...consider....not doing that"
SECURITY
5 pts
Score: 7.2
BREAKTHROUGH
7 pts
Score: 7.2
RESEARCH
1 pt
Score: 7.2
SECURITY
1 pt
Score: 7.2
TOOLS
19 ups
Score: 7.1
"The Pentagon is discussing plans to set up secure environments for generative AI companies to train military-specific versions of their models on classified data, *MIT Technology Review* has learned.
AI models like Anthropic's Claude are already used to answer questions in classified settings; app..."
TOOLS
1 pt
Score: 7.1
AI MODELS
11 ups
Score: 7.1
"**Update:** I've removed llama comparisons from the readme and from the body of this post. Llama decode speeds will be highly dependent on CPU especially DRAM speeds and apparently also on non-default flags. In my testing Krasis is substantially faster for larger models that don't fit entirely in ..."
Themes: Llama.cpp performance • Proper usage of flags • Comparing inference speeds
• "llama.cpp does like 10x better than on this graph"
• "With proper offload it should have 3-4x at least compared to your results"
TOOLS
107 ups
Score: 7.0
"Karpathy explains how, over the course of just a few weeks coding in Claude, his workflow flipped almost entirely. **What was once mostly handwritten code is now largely driven by LLMs**, guided through natural language."
Themes: AI's impact on coding • Cognitive shift in development • Karpathy's perspective on AI
• "The shift isn't just 'AI writes code instead of you'"
• "The job is now to communicate intent clearly"
SHOW HN
3 pts
Score: 7.0
SHOW HN
7 pts
Score: 7.0
SHOW HN
5 pts
Score: 7.0
TOOLS
37 pts
Score: 7.0
Themes: Containerization and Sandboxing • Autonomous AI Agents • Controlled Execution Environments
• "I sure love pip install ing every time instead of just baking a single container image with it already installed."
• "The problem is getting an existing enterprise project runnable inside the sandbox too, with no access to production keys or data or even test-db-that-is-actually-just-a-copy-of-prod, but with access to mock versions of all the various microservices and api's that the project depends on."
SHOW HN
6 pts
Score: 7.0
RESEARCH
43 ups
Score: 7.0
"
https://preview.redd.it/9hxa34bwhopg1.png?width=3600&format=png&auto=webp&s=909e4e1ba2feebbab94651d125a5c8e7591c4ca6
Zero failures across 300 seeds. 66Γ speedup. 5 lines of code.
We're two independent researchers. **The method:** per-row ℓ∞ clipping on decoder weights after every optim..."
Themes: Weight normalization • Memorization vs generalization • Optimizers for grokking
• "Weights are also normalized per row, which includes Q,K,V matrices"
• "Grad norm contributions for each sample in a batch are normalized by taking the loss as a Gaussian NLL"
TOOLS
"Hi everyone,
We recently released AIBuildAI, an agentic system that automatically builds AI models.
GitHub: https://github.com/aibuildai/AI-Build-AI
On OpenAI's MLE-Bench benchmark, AIBuildAI ranked #1: [https://github.com/openai/mle-bench](https://gi..."
RESEARCH
23 ups
Score: 6.9
"I came across an interesting writeup from Pathway that I think is more interesting as a reasoning benchmark than as a puzzle result.
They use "Sudoku Extreme": about 250,000 very hard Sudoku instances. The appeal is that Sudoku here is treated as a pure constraint-satisfaction problem: each solutio...
Themes: Limitations of Autoregressive Modeling • Need for Paradigm Shift • Benchmarking AI Models
• "autoregressive language modeling is just the wrong substrate for reasoning"
• "we are very far from AGI, and language use is not all there is to intelligence"
RESEARCH
4 pts
Score: 6.9
DATA
1 pt
Score: 6.9
RESEARCH
via Arxiv
Lianghui Zhu, Yuxin Fang, Bencheng Liao et al.
2026-03-16
Score: 6.9
"Scaling depth is a key driver for large language models (LLMs). Yet, as LLMs become deeper, they often suffer from signal degradation: informative features formed in shallow layers are gradually diluted by repeated residual updates, making them harder to recover in deeper layers. We introduce mixtur..."
RESEARCH
"As AI coding agents become both primary producers and consumers of source code, the software industry faces an accelerating loss of institutional knowledge. Each commit captures a code diff but discards the reasoning behind it - the constraints, rejected alternatives, and forward-looking context tha..."
RESEARCH
113 pts
Score: 6.8
Themes: Autonomous learning • Meta-control systems • Hardware limitations
• "Agents can already be set up to use meta-learning skills for skill authoring, introspection, rumination"
• "Unless we can move away from this 'outsourced learning' where humans have to fix every domain mismatch, we're just building increasingly expensive parrots"
SHOW HN
1 pt
Score: 6.7
RESEARCH
via Arxiv
Aozhe Wang, Yuchen Yan, Nan Zhou et al.
2026-03-16
Score: 6.7
"Reinforcement learning for code generation relies on verifiable rewards from unit test pass rates. Yet high-quality test suites are scarce, existing datasets offer limited coverage, and static rewards fail to adapt as models improve. Recent self-play methods unify code and test generation in a singl..."
BREAKTHROUGH
1 pt
Score: 6.7
SECURITY
2 pts
Score: 6.7
AI MODELS
431 pts
Score: 6.7
Themes: Enterprises and internal data • Challenges of real-world data • Specialized model training approaches
• "I've never seen enterprises which have 'internal knowledge' in proper readable form"
• "Proprietary and specialised data could very well be a moat"
SAFETY
2 pts
Score: 6.7
TOOLS
1 pt
Score: 6.6
TOOLS
101 ups
Score: 6.6
"One persistent conversation with Claude that runs on your computer. Message it from your phone. Come back to finished work.
**How it works:**
* Download Claude Desktop
* Pair your phone
* Done
Everything Claude can do on your desktop - files, browser, tools, internal dashboards, code - is now re..."
Themes: AI product usability • Technical issues • Product comparison
• "Anthropic is the only AI company that's shipping actually useful products"
• "the one time links don't work reliably"
SHOW HN
2 pts
Score: 6.5
TOOLS
2 pts
Score: 6.5
AI MODELS
1 pt
Score: 6.5
TOOLS
1 pt
Score: 6.5
RESEARCH
via Arxiv
Taeyun Roh, Wonjune Jang, Junha Jung et al.
2026-03-16
Score: 6.5
"Large language model agents heavily rely on external memory to support knowledge reuse and complex reasoning tasks. Yet most memory systems store experiences in a single global retrieval pool which can gradually dilute or corrupt stored knowledge. This problem is especially pronounced for small lang..."
POLICY
1 pt
Score: 6.5
RESEARCH
1 pt
Score: 6.5
TOOLS
5 ups
Score: 6.5
"What if every AI you use shared the same memory? That's what I built.
A knowledge base server that sits on your VPS (or localhost), ingests everything you want your AI to know, and exposes it through MCP. I connected it to ChatGPT, Claude Code, Codex CLI, and Gemini. All of them search the same bra..."
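The post's core idea is one store that every client searches through the same interface. As a hedged illustration (not the project's actual code), the kind of store such a server might expose as an MCP search tool can be sketched in a few lines; the naive keyword-overlap scoring below is purely illustrative:

```python
class KnowledgeBase:
    """Shared store: every connected client calls the same ingest/search
    surface, so all of them 'remember' the same facts."""

    def __init__(self):
        self.docs = []

    def ingest(self, text):
        self.docs.append(text)

    def search(self, query, k=3):
        # Score each document by how many query words it shares, keep
        # only non-zero matches, best first.
        q = set(query.lower().split())
        scored = [(len(q & set(d.lower().split())), d) for d in self.docs]
        ranked = sorted((t for t in scored if t[0] > 0), reverse=True)
        return [d for _, d in ranked[:k]]
```

A real deployment would swap the scoring for embedding search and wrap `ingest`/`search` as MCP tools, but the single-shared-pool shape is the point.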
RESEARCH
15 ups
Score: 6.5
"Paper:
https://arxiv.org/abs/2603.12288
GitHub (R simulation, Paper Summary, Audio Overview):
https://github.com/tjleestjohn/from-garbage-to-gold
I'm Terry, the first author. This paper has been 2.5 year..."
Themes: Benign overfitting • Predictor-label robustness • Model generalization
• "Benign Overfitting (BO) is NOT something I made up or termed"
• "The term is stupid, despite the research being not"
RESEARCH
via Arxiv
Maksim Eren, Eric Michalak, Brian Cook et al.
2026-03-17
Score: 6.3
"Culture shapes reasoning, values, prioritization, and strategic decision-making, yet large language models (LLMs) often exhibit cultural biases that misalign with target populations. As LLMs are increasingly used for strategic decision-making, policy support, and document engineering tasks such as s..."
RESEARCH
via Arxiv
Sahil Sen, Elias Lumer, Anmol Gulati et al.
2026-03-17
Score: 6.3
"Recent advances in Large Language Models (LLMs) have enabled conversational AI agents to engage in extended multi-turn interactions spanning weeks or months. However, existing memory systems struggle to reason over temporally grounded facts and preferences that evolve across months of interaction an..."
RESEARCH
via Arxiv
Amirhossein Mollaali, Bongseok Kim, Christian Moya et al.
2026-03-17
Score: 6.3
"Generalizing across disparate physical laws remains a fundamental challenge for artificial intelligence in science. Existing deep-learning solvers are largely confined to single-equation settings, limiting transfer across physical regimes and inference tasks. Here we introduce pADAM, a unified gener..."
RESEARCH
via Arxiv
Yibo Li, Qiongxiu Li
2026-03-17
Score: 6.3
"Gradient inversion attacks reveal that private training text can be reconstructed from shared gradients, posing a privacy risk to large language models (LLMs). While prior methods perform well in small-batch settings, scaling to larger batch sizes and longer sequences remains challenging due to seve..."
INFRASTRUCTURE
68 ups
Score: 6.3
"So after working on boot AI I had purchased some old bitcoin mining hardware to see if I could run old nvidia card on them. So I built a system that multiplexes 6 GPU dies through a single PCIe slot using a custom Linux kernel module. Switch between loaded models in under a millisecond.
Hardware:
..."
Themes: GPU hacking • Custom ML frameworks • Ternary quantization
• "I wrote a Linux kernel module that reprograms PCI Base Address Registers"
• "I have my own ML framework I have been building out for the past few months in pure C"
RESEARCH
via Arxiv
Christian Belardi, Justin Lovelace, Kilian Q. Weinberger et al.
2026-03-17
Score: 6.3
"Guided diffusion sampling relies on approximating often intractable likelihood scores, which introduces significant noise into the sampling dynamics. We propose using adaptive moment estimation to stabilize these noisy likelihood scores during sampling. Despite its simplicity, our approach achieves..."
RESEARCH
via Arxiv
Xavier Gonzalez
2026-03-17
Score: 6.3
"Massively parallel hardware (GPUs) and long sequence data have made parallel algorithms essential for machine learning at scale. Yet dynamical systems, like recurrent neural networks and Markov chain Monte Carlo, were thought to suffer from sequential bottlenecks. Recent work showed that dynamical s..."
RESEARCH
via Arxiv
Tianyu Xie, Jinfa Huang, Yuexiao Ma et al.
2026-03-17
Score: 6.3
"Omni-modal large language models (OLMs) redefine human-machine interaction by natively integrating audio, vision, and text. However, existing OLM benchmarks remain anchored to static, accuracy-centric tasks, leaving a critical gap in assessing social interactivity, the fundamental capacity to naviga..."
RESEARCH
via Arxiv
Ruisi Wang, Zhongang Cai, Fanyi Pu et al.
2026-03-17
Score: 6.3
"Recent advances in video generation have revealed an unexpected phenomenon: diffusion-based video models exhibit non-trivial reasoning capabilities. Prior work attributes this to a Chain-of-Frames (CoF) mechanism, where reasoning is assumed to unfold sequentially across video frames. In this work, w..."
RESEARCH
via Arxiv
Yi Chen, Daiwei Chen, Sukrut Madhav Chikodikar et al.
2026-03-17
Score: 6.3
"Large language models (LLMs) frequently hallucinate, limiting their reliability in knowledge-intensive applications. Retrieval-augmented generation (RAG) and conformal factuality have emerged as potential ways to address this limitation. While RAG aims to ground responses in retrieved evidence, it p..."
RESEARCH
via Arxiv
Mattia Rigotti, Nicholas Thumiger, Thomas Frick
2026-03-17
Score: 6.3
"Adapting transformer positional encoding to meshes and graph-structured data presents significant computational challenges: exact spectral methods require cubic-complexity eigendecomposition and can inadvertently break gauge invariance through numerical solver artifacts, while efficient approximate..."
RESEARCH
via Arxiv
Yelysei Bondarenko, Thomas Hehn, Rob Hesselink et al.
2026-03-17
Score: 6.3
"Large language models (LLMs) with chain-of-thought reasoning achieve state-of-the-art performance across complex problem-solving tasks, but their verbose reasoning traces and large context requirements make them impractical for edge deployment. These challenges include high token generation costs, l..."
RESEARCH
via Arxiv
Nij Dorairaj, Debabrata Chatterjee, Hong Wang et al.
2026-03-17
Score: 6.3
"Integration of CPU and GPU technologies is a key enabler for modern AI and graphics workloads, combining control-oriented processing with massive parallel compute capability. As systems evolve toward chiplet-based architectures, pre-silicon validation of tightly coupled CPU-GPU subsystems becomes in..."
RESEARCH
via Arxiv
Zhitao Zeng, Mengya Xu, Jian Jiang et al.
2026-03-17
Score: 6.3
"Surgical intelligence has the potential to improve the safety and consistency of surgical care, yet most existing surgical AI frameworks remain task-specific and struggle to generalize across procedures and institutions. Although multimodal foundation models, particularly multimodal large language m..."
RESEARCH
via Arxiv
Rui Ge, Yichao Fu, Yuyang Qian et al.
2026-03-17
Score: 6.3
"Large language models are increasingly deployed as autonomous agents that must plan, act, and recover from mistakes through long-horizon interaction with environments that provide rich feedback. However, prevailing outcome-driven post-training methods (e.g., RL with verifiable rewards) primarily opt..."
AI MODELS
83 ups
Score: 6.3
"**Please Note:**
**I have posted an update which has correct numbers** for llama bench on my system in the charts. Previously llama had been built for Ada 2000 GPUs and was missing Blackwell optim..."
Themes: Performance Benchmarking • Hardware Comparison • Model Optimization
• "I just don't get these numbers."
• "Krasis selectively quantises the model per your run settings"
TOOLS
288 ups
Score: 6.3
"Hey everyone!
As the title says - in the past two weeks I built a collection of design skill files that are basically like themes used to be with websites, but this time it's instructions for Claude or other agentic tools to build a website or application in a..."
Themes: Design Enhancements • AI-Powered Tools • Community Curation
• "enhanced skill files which could be like a next level thing"
• "it's important to push it into the right direction"
SECURITY
279 ups
Score: 6.3
"Britannica and Merriam-Webster have filed a lawsuit against OpenAI, alleging that the AI giant has built its $730 billion company on the back of their researched content.
In a filing submitted to the Southern District of New York, the companies accuse OpenAI of cannibalizing the traffic and ad reve..."
Themes: Intellectual property disputes • Copyright and fair use • Corporate monopolization
• "Do we want companies to own the definitions of words?"
• "If it is of no value, why is it being crawled by OpenAI et al.?"
RESEARCH
via Arxiv
Yuwen Du, Rui Ye, Shuo Tang et al.
2026-03-16
Score: 6.3
"Deep search capabilities have become an indispensable competency for frontier Large Language Model (LLM) agents, yet the development of high-performance search agents remains dominated by industrial giants due to a lack of transparent, high-quality training data. This persistent data scarcity has fu..."
RESEARCH
via Arxiv
Victoria Graf, Valentina Pyatkin, Nouha Dziri et al.
2026-03-17
Score: 6.3
"Multi-turn conversations are a common and critical mode of language model interaction. However, current open training and evaluation data focus on single-turn settings, failing to capture the additional dimension of these longer interactions. To understand this multi-/single-turn gap, we first intro..."
RESEARCH
via Arxiv
Valentin Lafargue, Ariel Guerra-Adames, Emmanuelle Claeys et al.
2026-03-17
Score: 6.3
"Large language models (LLMs) are increasingly deployed in applications with societal impact, raising concerns about the cultural biases they encode. We probe these representations by evaluating whether LLMs can perform author profiling from song lyrics in a zero-shot setting, inferring singers' gend..."
RESEARCH
via Arxiv
Jian Yang, Wei Zhang, Shawn Guo et al.
2026-03-17
Score: 6.3
"In this report, we introduce the IQuest-Coder-V1 series-(7B/14B/40B/40B-Loop), a new family of code large language models (LLMs). Moving beyond static code representations, we propose the code-flow multi-stage training paradigm, which captures the dynamic evolution of software logic through differen..."
RESEARCH
via Arxiv
Tianzhu Ye, Li Dong, Qingxiu Dong et al.
2026-03-17
Score: 6.3
"The prevailing paradigm for improving large language models relies on offline training with human annotations or simulated environments, leaving the rich experience accumulated during real-world deployment entirely unexploited. We propose Online Experiential Learning (OEL), a framework that enables..."
AI MODELS
174 pts
Score: 6.2
Themes: Model performance and pricing • Model capabilities and limitations • OpenAI's trajectory
• "Mini releases matter much more and better reflect the real progress than SOTA models."
• "GPT-5.4 Mini averages about 180-190 t/s on API."
TOOLS
1 pt
Score: 6.2
SHOW HN
4 pts
Score: 6.1
AI MODELS
16 ups
Score: 6.1
"This post is part of a series I'm working on with a broader goal: understand what one nonlinear "neuron" can do when the nonlinearity is a matrix eigenvalue, and whether that gives a useful middle ground between linear models that are easy to explain and larger neural networks that are more expressi..."
SHOW HN
1 pt
Score: 6.1