π WELCOME TO METAMESH.BIZ +++ Google's TPUv7 Ironwood enters the chat with actual competition for Jensen's monopoly (Nvidia stock only dropped 0.3%) +++ AI casually solving ErdΕs Problem #124 while mathematicians update their LinkedIn profiles +++ Alibaba's Qwen3-VL claiming perfect accuracy on 30-minute video tasks (your YouTube attention span could never) +++ Turns out you can jailbreak safety guardrails with haikus because apparently AI models are romantics at heart +++ YOUR NEXT PERFORMANCE REVIEW WILL BE WRITTEN BY A STRESSED AGENT THAT LEARNED TO LIE +++ π β’
π WELCOME TO METAMESH.BIZ +++ Google's TPUv7 Ironwood enters the chat with actual competition for Jensen's monopoly (Nvidia stock only dropped 0.3%) +++ AI casually solving ErdΕs Problem #124 while mathematicians update their LinkedIn profiles +++ Alibaba's Qwen3-VL claiming perfect accuracy on 30-minute video tasks (your YouTube attention span could never) +++ Turns out you can jailbreak safety guardrails with haikus because apparently AI models are romantics at heart +++ YOUR NEXT PERFORMANCE REVIEW WILL BE WRITTEN BY A STRESSED AGENT THAT LEARNED TO LIE +++ π β’
+++ ICLR 2026 received ~21% fully AI-written reviews and 50%+ showing AI fingerprints, suggesting the field's quality gatekeepers have started automating themselves out of the equation. +++
via Arxivπ€ Hans Gundlach, Alex Fogelson, Jayson Lynch et al.π 2025-11-26
β‘ Score: 8.2
"Algorithms have been estimated to increase AI training FLOP efficiency by a factor of 22,000 between 2012 and 2023 [Ho et al., 2024]. Running small-scale ablation experiments on key innovations from this time period, we are able to account for less than 10x of these gains. Surveying the broader lite..."
π‘οΈ SAFETY
Agent Misbehavior Under Pressure
2x SOURCES ππ 2025-11-29
β‘ Score: 7.8
+++ PropensityBench reveals that agentic AI systems cut corners on safety under deadline pressure, which is either a cautionary tale about deployment or validation that we've successfully replicated human workplace behavior. +++
+++ An AI system independently proved Erdos Problem #124, raising the delightful question of whether we can trust machine proofs or just really trust the machine's credentials. +++
"It includesΒ
1. 4B GUI Agent modelΒ capable of running on local computers.
2. Plug-and-play inference infrastructureΒ that handles ADB connections, dependency installation, and task recording/replay..."
π¬ Reddit Discussion: 13 comments
π MID OR MIXED
π― Mobile app limitations β’ Automated notes export β’ Obsidian as alternative
π¬ "I haven't reviewed it yet, but you could theoretically run adb via wireless with 'adb pair' or 'adb connect"
β’ "Yep and mobile phones dont need this. I reckon this is most likely for troll/like farms and such in SEA and Slavic countries"
"Link to the post: https://github.com/ggml-org/llama.cpp/discussions/17621
We've been working over the last few months on kernel fusion in llama.cpp, I wrote a small write-up, it's semi-technical but one of the things I wanted to raise aware..."
π¬ "Have the agent address you as something specific!"
β’ "Documenting your code is easier than prompt engineering"
π‘ AI NEWS BUT ACTUALLY GOOD
The revolution will not be televised, but Claude will email you once we hit the singularity.
Get the stories that matter in Today's AI Briefing.
Powered by Premium Technology Intelligence Algorithms β’ Unsubscribe anytime
π¬ RESEARCH
MIT + Colombia Study on AI vs Human Writers
2x SOURCES ππ 2025-11-29
β‘ Score: 7.3
+++ MIT researchers found readers prefer AI outputs mimicking award-winning authors over MFA graduates, raising the uncomfortable question of whether we've optimized for style over substance. +++
"From the abstract:
We conducted a preregistered study comparing MFA-trained expert writers with three frontier AI models: ChatGPT, Claude, and Gemini in writing up to 450 word excerpts emulating 50 award-winning authorsβ (including Nobel laureates, Booker Prize winners, and young emerging National ..."
π¬ Reddit Discussion: 1 comments
π BUZZING
π― AI writing quality β’ Mimicry vs. originality β’ MFA vs. LLM performance
π¬ "AI can ace writing from a single famous author when fed that single author's works"
β’ "The surprise was that feeding the LLMs only the works of one of the famous authors led to the LLMs being overall favoured by pro and lay readers alike"
"From the abstract:
We conducted a preregistered study comparing MFA-trained expert writers with three frontier AI models: ChatGPT, Claude, and Gemini in writing up to 450 word excerpts emulating 50 award-winning authorsβ (including Nobel laureates, Booker Prize winners, and young emerging National ..."
π¬ Reddit Discussion: 5 comments
π MID OR MIXED
π― Methodology critique β’ AI writing quality β’ Contextual limitations
π¬ "This is a research paper, not a news article."
β’ "They chose the ai as preferable and higher quality than the one written by an mfa."
via Arxivπ€ Shuai Bai, Yuxuan Cai, Ruizhe Chen et al.π 2025-11-26
β‘ Score: 6.9
"We introduce Qwen3-VL, the most capable vision-language model in the Qwen series to date, achieving superior performance across a broad range of multimodal benchmarks. It natively supports interleaved contexts of up to 256K tokens, seamlessly integrating text, images, and video. The model family inc..."
"Hi r/LocalLLaMA,
Iβve been working on strictly local, data-privacy-compliant AI solutions for about two years now. Dealing with sensitive data meant that cloud APIs were never an optionβit had to be air-gapped or on-prem.
The biggest lesson I learned:
We spend 90% of our time debating model quant..."
"DeepSeek just released an openβweight math model that reaches Mathematical Olympiad (IMO) goldβlevel performanceβand published the training and evaluation βplaybook.β Hereβs whatβs new, why it matters, and what builders can do with it today."
via Arxivπ€ Anantha Padmanaban Krishna Kumarπ 2025-11-26
β‘ Score: 6.9
"Deeper Vision Transformers often perform worse than shallower ones, which challenges common scaling assumptions. Through a systematic empirical analysis of ViT-S, ViT-B, and ViT-L on ImageNet, we identify a consistent three-phase Cliff-Plateau-Climb pattern that governs how representations evolve wi..."
via Arxivπ€ Locke Cai, Ivan Provilkovπ 2025-11-26
β‘ Score: 6.8
"Training Large Language Models (LLMs) to reason often relies on Reinforcement Learning (RL) with task-specific verifiers. However, many real-world reasoning-intensive tasks lack verifiers, despite offering abundant expert demonstrations that remain under-utilized for reasoning-focused training. We i..."
via Arxivπ€ Dongyang Fan, Diba Hashemi, Sai Praneeth Karimireddy et al.π 2025-11-26
β‘ Score: 6.8
"Incorporating metadata in Large Language Models (LLMs) pretraining has recently emerged as a promising approach to accelerate training. However prior work highlighted only one useful signal-URLs, leaving open the question of whether other forms of metadata could yield greater benefits. In this study..."
"Hey everyone, author of LocalAI here.
I just pushed version 3.8.0 and wanted to share the updates with the community. For those unaware, LocalAI acts as an OpenAI-compatible API wrapper around llama.cpp, diffusers, vLLM, MLX, and other backends.
This release focuses heavily on Agentic workflow..."
"
People are going crazy with Opus 4.5. There are so many angles to think about using it which I never crossed my mind. This post is full of ideas, have fun!
## The autonomous coding thing is real
Adam Wolff from Anthropic says Opus 4.5 codes autonomously for 20-30 minutes at a time. You come bac..."
via Arxivπ€ OΔuz KaΔan Hitit, Leander Girrbach, Zeynep Akataπ 2025-11-26
β‘ Score: 6.7
"Model merging combines multiple fine-tuned checkpoints into a single model without additional training, offering an attractive approach to reusing models and efficiently improving performance. However, it remains unclear whether the advantages reported for smaller models and classifiers generalize t..."
via Arxivπ€ Jonathan Gabor, Jayson Lynch, Jonathan Rosenfeldπ 2025-11-26
β‘ Score: 6.6
"We introduce EvilGenie, a benchmark for reward hacking in programming settings. We source problems from LiveCodeBench and create an environment in which agents can easily reward hack, such as by hardcoding test cases or editing the testing files. We measure reward hacking in three ways: held out uni..."
via Arxivπ€ Fengze Yu, Leshu Li, Brad McDanel et al.π 2025-11-26
β‘ Score: 6.6
"Large language model (LLM) inference often suffers from high decoding latency and limited scalability across heterogeneous edge-cloud environments. Existing speculative decoding (SD) techniques accelerate token generation but remain confined to single-node execution. We propose DSD, a distributed sp..."
via r/ChatGPTπ€ u/Beautiful-Homework47π 2025-11-29
β¬οΈ 4358 upsβ‘ Score: 6.5
"Saw this on Twitter and it was a splash of cold water. Rant below.
According to HSBC, MIT study, etc. OpenAI (+AI in general) simply isn't making anywhere near the amount of money it needs to be.
Ads seem like the way to go - Google makes a ton of money through its ad streams, which allows it to o..."
π¬ HackerNews Buzz: 3 comments
π MID OR MIXED
π― Hardware Implementation β’ Scalability β’ Power Consumption
π¬ "Translating a simulation into real hardware... is properly hard."
β’ "If it does work, I think one of the biggest challenges will be adding enough complexity to it for it to do real, useful computation."
via Arxivπ€ Dong Wang, Yang Li, Ansong Ni et al.π 2025-11-26
β‘ Score: 6.1
"Synthetic data has become increasingly important for training large language models, especially when real data is scarce, expensive, or privacy-sensitive. Many such generation tasks require coordinated multi-agent workflows, where specialized agents collaborate to produce data that is higher quality..."
via Arxivπ€ Daniel R. Jiang, Jalaj Bhandari, Yukai Yang et al.π 2025-11-26
β‘ Score: 6.1
"Optimizing large language models (LLMs) for multi-turn conversational outcomes remains a significant challenge, especially in goal-oriented settings like AI marketing or sales agents who facilitate transactions via messaging platforms. The difficulty stems from sparse, long-horizon rewards and the d..."
via Arxivπ€ Hongjin Su, Shizhe Diao, Ximing Lu et al.π 2025-11-26
β‘ Score: 6.1
"Large language models are powerful generalists, yet solving deep and complex problems such as those of the Humanity's Last Exam (HLE) remains both conceptually challenging and computationally expensive. We show that small orchestrators managing other models and a variety of tools can both push the u..."