WELCOME TO METAMESH.BIZ +++ Oxford finds 445 AI benchmarks are basically vibes-based performance theater (construct validity was never invited to this party) +++ DeepMind's AlphaEvolve improves 20 math problems out of 67 which is either revolutionary or Tuesday depending on your priors +++ WavJEPA drops yet another audio foundation model operating on raw waveforms because apparently spectrograms are for quitters +++ THE MESH MEASURES ITSELF WITH BROKEN RULERS AND CALLS IT PROGRESS +++
via arXiv 👤 Geoff McDonald, Jonathan Bar Or • 2025-11-05
⚡ Score: 7.9
"Large Language Models (LLMs) are increasingly deployed in sensitive domains
including healthcare, legal services, and confidential communications, where
privacy is paramount. This paper introduces Whisper Leak, a side-channel attack
that infers user prompt topics from encrypted LLM traffic by analyz..."
BENCHMARKS
Oxford benchmark study on AI testing flaws
2x SOURCES • 2025-11-07
⚡ Score: 7.8
+++ Oxford researchers examined 445 LLM benchmarks and found the field has been measuring vibes instead of actual capabilities, which explains a lot about recent AI claim inflation. +++
"If you’re unfamiliar with the term, “construct validity” is a psychometric term for whether a test measures the theoretical concept it’s intended to:
> We reviewed 445 LLM benchmarks from the proceedings of top AI conferences. We found many measurement challenges, including vague definitions for target phen..."
"TL;DR by Claude
OpenAI clarifies three key points:
1. **No government bailouts wanted**: They don’t want government guarantees for their datacenters. They believe governments shouldn’t pick winners/losers or bail out failing companies. However, they support governments building their own AI inf..."
🎯 Tabular data challenges • Foundational models for tabular data • Automated feature engineering
💬 "The challenge is always that you need to spend a lot of time feature engineering and tweaking the data representation"
• "The promise of foundation models for tabular data is that there are enough generalizable patterns"
via arXiv 👤 Xingyao Wang, Simon Rosenberg, Juan Michelini et al. • 2025-11-05
⚡ Score: 6.9
"Agents are now used widely in the process of software development, but
building production-ready software engineering agents is a complex task.
Deploying software agents effectively requires flexibility in implementation
and experimentation, reliable and secure execution, and interfaces for users to..."
via arXiv 👤 Haofei Yu, Fenghai Li, Jiaxuan You • 2025-11-05
⚡ Score: 6.8
"Large language models (LLMs) achieve strong performance across
benchmarks--from knowledge quizzes and math reasoning to web-agent tasks--but
these tests occur in static settings, lacking real dynamics and uncertainty.
Consequently, they evaluate isolated reasoning or problem-solving rather than
deci..."
via arXiv 👤 Atsuyuki Miyai, Mashiro Toyooka, Takashi Otonari et al. • 2025-11-06
⚡ Score: 6.8
"Understanding the current capabilities and risks of AI Scientist systems is
essential for ensuring trustworthy and sustainable AI-driven scientific
progress while preserving the integrity of the academic ecosystem. To this end,
we develop Jr. AI Scientist, a state-of-the-art autonomous AI scientist..."
"External link discussion - see full content at original source."
💬 Reddit Discussion: 29 comments
MID OR MIXED
🎯 Malware evolution • AI-powered hacking • Accessibility of malware
💬 "malware that has to use AI resources sounds easily detected"
• "Imagine how much faster that would be with a specially trained black market AI sidekick?"
via arXiv 👤 Guanning Zeng, Zhaoyi Zhou, Daman Arora et al. • 2025-11-05
⚡ Score: 6.7
"Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a
powerful paradigm for post-training large reasoning models (LRMs) using
policy-gradient methods such as GRPO. To stabilize training, these methods
typically center trajectory rewards by subtracting the empirical mean for each
pro..."
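For readers who want the centering step made concrete, here is a minimal sketch of per-prompt mean centering as the abstract describes it. It is an illustration under assumptions, not the paper's method: the function name `center_rewards`, the dictionary layout, and the toy 0/1 rewards are all hypothetical, and only the mean-subtraction step mentioned in the excerpt is shown.

```python
from statistics import fmean


def center_rewards(rewards_by_prompt):
    """Subtract each prompt's empirical mean reward from its trajectory rewards.

    Hypothetical helper for illustration; the input maps a prompt id to the
    list of scalar rewards earned by trajectories sampled for that prompt.
    """
    centered = {}
    for prompt_id, rewards in rewards_by_prompt.items():
        baseline = fmean(rewards)  # empirical mean reward for this prompt
        centered[prompt_id] = [r - baseline for r in rewards]
    return centered


# Toy data (made up): verifiable 0/1 rewards for four sampled trajectories per prompt.
example = {
    "prompt_a": [1.0, 0.0, 0.0, 1.0],
    "prompt_b": [1.0, 1.0, 1.0, 0.0],
}
advantages = center_rewards(example)
print(advantages)
```

After centering, the values for each prompt sum to zero, so a trajectory is credited only for doing better or worse than its prompt's average.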
"Neural networks can approximate solutions to partial differential equations,
but they often break the very laws they are meant to model, creating mass from
nowhere, drifting shocks, or violating conservation and entropy. We address
this by training within the laws of physics rather than beside them...."
via arXiv 👤 Sitan Chen, Kevin Cong, Jerry Li • 2025-11-06
⚡ Score: 6.6
"A major bottleneck of standard auto-regressive large language models is that
their inference process is inherently sequential, resulting in very long and
costly inference times. To circumvent this, practitioners proposed a class of
language models called diffusion language models, of which the maske..."
via arXiv 👤 Cyril Vallez, Alexander Sternfeld, Andrei Kucharavy et al. • 2025-11-06
⚡ Score: 6.6
"As the role of Large Language Models (LLM)-based coding assistants in
software development becomes more critical, so does the role of the bugs they
generate in the overall cybersecurity landscape. While a number of LLM code
security benchmarks have been proposed alongside approaches to improve the
s..."
via arXiv 👤 Ding Chen, Simin Niu, Kehang Li et al. • 2025-11-05
⚡ Score: 6.6
"Memory systems are key components that enable AI systems such as LLMs and AI
agents to achieve long-term learning and sustained interaction. However, during
memory storage and retrieval, these systems frequently exhibit memory
hallucinations, including fabrication, errors, conflicts, and omissions...."
via arXiv 👤 Joshua Gao, Quoc Huy Pham, Subin Varghese et al. • 2025-11-06
⚡ Score: 6.5
"Retrieval-Augmented Generation (RAG) is a critical technique for grounding
Large Language Models (LLMs) in factual evidence, yet evaluating RAG systems in
specialized, safety-critical domains remains a significant challenge. Existing
evaluation frameworks often rely on heuristic-based metrics that f..."
via arXiv 👤 Roberta Di Marino, Giovanni Dioguardi, Antonio Romano et al. • 2025-11-05
⚡ Score: 6.5
"Medical question answering systems face deployment challenges including
hallucinations, bias, computational demands, privacy concerns, and the need for
specialized expertise across diverse domains. Here, we present SOLVE-Med, a
multi-agent architecture combining domain-specialized small language mod..."
via arXiv 👤 Yu Feng, Nathaniel Weir, Kaj Bostrom et al. • 2025-11-06
⚡ Score: 6.4
"LLMs can perform multi-step reasoning through Chain-of-Thought (CoT), but
they cannot reliably verify their own logic. Even when they reach correct
answers, the underlying reasoning may be flawed, undermining trust in
high-stakes scenarios. To mitigate this issue, we introduce VeriCoT, a
neuro-symbo..."
Microsoft AI CEO Mustafa Suleyman superintelligence plans
2x SOURCES • 2025-11-06
⚡ Score: 6.3
+++ Suleyman's new team will pursue AGI while supposedly maintaining human oversight, because nothing says "we've got this" like forming a dedicated department for the thing that might not need us. +++
🎯 PyTorch Community • Soumith's Contributions • Transition to New Challenges
💬 "He consistently celebrated the contributions of his co-creators Adam and Sam"
• "PT has a unique level of broad support that few other open source technology can reach"
🎯 Impact of AI on Education • Social Media Addiction • AI as Tool vs Crutch
💬 "I'm worried about younger folks not knowing how to conduct a traditional Google search."
• "AI is turning people dumb. I see it all the time with code slop."
💬 HackerNews Buzz: 115 comments
MID OR MIXED
🎯 Google privacy concerns • AI overreach • Dissatisfaction with Google's practices
💬 "Giving someone a GMail address is like saying 'Yes, I like to be abused, I like to be violated and have no privacy.'"
• "Google must have some awful PMs and designers. The worst UX decision I have seen recently is AI auto-dubbing all youtube videos by default with no way to disable this behavior globally."
"This is rather a really exciting news (if you have 2TB of RAM ...)! I know 2TB is huge, but it's still "more manageable" than VRAM (also technically you only need 1TB I think).
Based on this PR (WIP), it seems it's possible to run the **lates..."