Top stories

keepamovin about 6 hours ago

Show HN: Gemini Pro 3 hallucinates the HN front page 10 years from now

The page shows Gemini Pro 3's hallucination of the Hacker News front page ten years from now, imagining the headlines, products, and preoccupations of the tech world in 2035.

dosaygo-studio.github.io
1,107 468
Summary
SGran about 2 hours ago

10 Years of Let's Encrypt

Let's Encrypt, the free, automated, and open certificate authority, celebrates its 10th anniversary in 2025. The article reflects on the organization's accomplishments in making HTTPS encryption broadly accessible, outlines its plans for the years ahead, and underlines the continued importance of secure connections across the web.

letsencrypt.org
132 44
Summary
fsflover about 4 hours ago

PeerTube is recognized as a digital public good by Digital Public Goods Alliance

PeerTube is a decentralized, federated video hosting platform that provides an alternative to centralized platforms like YouTube, letting users host and share videos while retaining control over their content and data. The Digital Public Goods Alliance has now formally recognized the project as a digital public good.

digitalpublicgoods.net
241 37
Summary
pember about 7 hours ago

Mistral Releases Devstral 2 (72.2% SWE-Bench Verified) and Vibe CLI

Devstral 2 is Mistral AI's latest coding model, scoring 72.2% on SWE-Bench Verified. It ships alongside Vibe, a new CLI for working with Mistral's models directly from the command line.

mistral.ai
350 167
Summary
sramsay about 4 hours ago

If you're going to vibe code, why not do it in C?

The article examines 'vibe coding', the practice of letting an LLM generate code from natural-language prompts, and asks why, if the developer isn't going to scrutinize the output line by line anyway, that code shouldn't be written in C.

stephenramsay.net
158 172
Summary
razzmataks about 5 hours ago

Hands down one of the coolest 3D websites

Bruno Simon is a French web developer who creates stunning 3D websites and interactive experiences using Three.js, a popular JavaScript library for creating 3D graphics in web browsers. His portfolio showcases his expertise in building visually captivating and technologically advanced web applications.

bruno-simon.com
247 69
Summary
freshrap6 about 6 hours ago

Pebble Index 01 – External memory for your brain

Pebble Index 01 is a device that aims to serve as external memory for the human brain, allowing users to store and easily recall information. The article explores the potential benefits and challenges of using such a device to augment human cognitive capabilities.

repebble.com
271 278
Summary
speckx about 3 hours ago

So You Want to Speak at Software Conferences?

The article provides practical advice for those interested in speaking at software conferences, covering topics such as overcoming impostor syndrome, developing a talk proposal, and navigating the conference circuit.

dylanbeattie.net
45 8
Summary
meetpateltech about 4 hours ago

Donating the Model Context Protocol and Establishing the Agentic AI Foundation

Anthropic is donating the Model Context Protocol (MCP), its open standard for connecting AI assistants to external tools and data sources, to the newly established Agentic AI Foundation, which will steward the protocol and promote the development and responsible use of agentic AI systems.

anthropic.com
63 27
Summary
discomrobertul8 about 6 hours ago

Kaiju – General purpose 3D/2D game engine in Go and Vulkan with built in editor

Kaiju Engine is an open-source, general-purpose 2D and 3D game engine written in Go, with Vulkan rendering and a built-in editor. It provides a comprehensive set of tools, including a scene editor, asset management, and a physics simulation system, to help developers build and ship their games efficiently.

github.com
119 50
Summary
gpjt 7 days ago

LLM from scratch, part 28 – training a base model from scratch on an RTX 3090

Part 28 of an 'LLM from scratch' series, the article walks through training a small base language model from scratch on a single RTX 3090, covering data preparation, model architecture, and the training run itself, without relying on any pre-trained weights.

gilesthomas.com
408 94
Summary
elpocko about 3 hours ago

The stack circuitry of the Intel 8087 floating point chip, reverse-engineered

righto.com
22 9
roycebranning about 4 hours ago

Clearspace (YC W23) Is Hiring a Founding Designer

Clearspace, a Y Combinator-backed company (W23), is hiring a Founding Designer to shape its product and brand. The ideal candidate will have a strong background in user experience design and a passion for creating intuitive, visually polished digital products.

ycombinator.com
1 0
Summary
speckx about 7 hours ago

My favourite small hash table

The article discusses a small, efficient hash table implementation that the author has found useful in their work. It covers the design choices and performance characteristics of this hash table, which is intended to be a simple and lightweight solution for certain types of problems.

corsix.org
86 17
Summary
cgorlla about 5 hours ago

Launch HN: Mentat (YC F24) – Controlling LLMs with Runtime Intervention

Hi HN, I’m Cyril from CTGT. Today we’re launching Mentat (https://docs.ctgt.ai/api-reference/endpoint/chat-completions), an API that gives developers deterministic control over LLM behavior, steering reasoning and removing bias on the fly, without the compute of fine-tuning or the brittleness of prompt engineering. We use feature-level intervention and graph-based verification to fix hallucinations and enforce policies.

This resonates in highly regulated industries or otherwise risky applications of AI where the fallout from incorrect or underperforming output can be significant. In financial services, using GenAI to scan for noncompliant communications can be arduous without an easy way to embed complex policies into the model. Similarly, a media outlet might want to scale AI-generated summaries of their content, but reliability and accuracy is paramount. These are both applications where Fortune 500 companies have utilized our technology to improve subpar performance from existing models, and we want to bring this capability to more people.

Here’s a quick 2-minute demo video showing the process: https://video.ctgt.ai/video/ctgt-ai-compliance-playground-cf...

Standard "guardrails" like RAG and system prompts are fundamentally probabilistic: you are essentially asking the model nicely to behave. This often fails in two ways. First, RAG solves knowledge availability but not integration. In our benchmarks, a model given context that "Lerwick is 228 miles SE of Tórshavn" failed to answer "What is 228 miles NW of Lerwick?" because it couldn't perform the spatial inversion.

Second, prompt engineering is brittle because it fights against the model's pre-training priors. For example, on the TruthfulQA benchmark, base models fail ~80% of the time because they mimic common misconceptions found on the internet (e.g. "chameleons change color for camouflage"). We found that we could literally turn up the feature for "skeptical reasoning" to make the model ignore the popular myth and output the scientific fact. This matters because for high-stakes use cases (like Finance or Pharma), "mostly safe" isn't acceptable—companies need audit-grade reliability.

Our work stems from the CS dungeon at UCSD, with years spent researching efficient and interpretable AI, trying to "open the black box" of neural networks. We realized that the industry was trying to patch model behavior from the outside (prompts/filters) when the problem was on the inside (feature activations). We knew this was important when we saw enterprises struggling to deploy basic models despite having unlimited compute, simply because they couldn't guarantee the output wouldn't violate compliance rules. I ended up leaving my research at Stanford to focus on this.

Our breakthrough came while researching the DeepSeek-R1 model. We identified the "censorship" feature vector in its latent space. Amplifying it guaranteed refusal; subtracting it instantly unlocked answers to sensitive questions. This proved the model had the knowledge but was suppressing it. We realized we could apply this same logic to hallucinations, suppressing "confabulation" features to reveal the grounded truth. While some hallucinations stem from the inherent randomness of generative models, many can be identified with the concerted activation of a feature or group of features.

Instead of filtering outputs, we intervene at the activation level during the forward pass. We identify latent feature vectors (v) associated with specific behaviors (bias, misconception) and mathematically modify the hidden state (h):

  h_prime = h - alpha * (h @ v) * v
This arithmetic operation lets us "edit" behavior deterministically with negligible overhead (<10ms on R1). For factual claims, we combine this with a graph verification pipeline (which works on closed weight models). We check semantic entropy (is the model babbling?) and cross-reference claims against a dynamic knowledge graph to catch subtle relational hallucinations that vector search misses.
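
A minimal numpy sketch of that projection edit (illustrative names and shapes, not our production implementation; it assumes v is normalized to unit length):

  import numpy as np

  def intervene(h, v, alpha=1.0):
      # Dampen the component of hidden state h that lies along feature direction v.
      v = v / np.linalg.norm(v)        # unit feature direction
      return h - alpha * (h @ v) * v   # h' = h - alpha * (h . v) * v

  h = np.random.randn(4096)            # hidden state at some layer
  v = np.random.randn(4096)            # e.g. a "misconception" feature vector
  h_prime = intervene(h, v, alpha=1.0)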

On GPT-OSS-120b, this approach improved TruthfulQA accuracy from 21% to 70% by suppressing misconception features. We also improved the performance of this model to frontier levels on HaluEval-QA, where we reached 96.5% accuracy, solving the spatial reasoning failures where the baseline failed. It also handles noisy inputs, inferring "David Icke" from the typo "David Of me" where base models gave up. Full benchmarks at https://ctgt.ai/benchmarks.

Most startups in this space are observability tools that tell you only after the model failed. Or they are RAG pipelines that stuff context into the window. Mentat is an infrastructure layer that modifies the model's processing during inference. We fix the reasoning, not just the context. For example, that’s how our system was able to enforce that if A is SE of B, then B is NW of A.

We believe that our policy engine is a superior control mechanism to RAG or prompting. If you’re frustrated with current guardrails, we’d love it if you would stress-test our API!

API: Our endpoint is drop-in compatible with OpenAI’s /v1/chat/completions: https://docs.ctgt.ai/api-reference/endpoint/chat-completions
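
For example, with the OpenAI Python client you can point the base URL at our API (the base URL and model name below are placeholders; check the docs link above for the real values):

  from openai import OpenAI

  # Placeholder base URL, key, and model name; see the linked docs for actual values.
  client = OpenAI(base_url="https://api.ctgt.ai/v1", api_key="YOUR_CTGT_KEY")

  resp = client.chat.completions.create(
      model="your-governed-model",
      messages=[{"role": "user", "content": "What is 228 miles NW of Lerwick?"}],
  )
  print(resp.choices[0].message.content)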

Playground: We’ve built an "Arena" view to run side-by-side comparisons of an Ungoverned vs. Governed model to visualize the intervention delta in real-time. No signup is required: https://playground.ctgt.ai/

We’d love to hear your feedback on the approach and see what edge cases you can find that break standard models. We will be in the comments all day. All feedback welcome!

playground.ctgt.ai
24 21
Summary
arthurdenture about 2 hours ago

MCP Joins the Agentic AI Foundation

The Model Context Protocol (MCP) project announced that it is joining the newly established Agentic AI Foundation, which will act as a neutral home for the protocol's governance and continued development as an open standard for agentic AI.

blog.modelcontextprotocol.io
22 2
Summary
"The Matilda Effect": Pioneering Women Scientists Written Out of Science History
binning about 3 hours ago

"The Matilda Effect": Pioneering Women Scientists Written Out of Science History

The article discusses the 'Matilda Effect', a phenomenon where the contributions of women in science are often overlooked or attributed to their male colleagues. It highlights the historical and ongoing challenges faced by women in academia and research, and the importance of recognizing and addressing gender bias in the scientific community.

openculture.com
32 5
Summary
henwfan about 10 hours ago

Show HN: AlgoDrill – Interactive drills to stop forgetting LeetCode patterns

I built AlgoDrill because I kept grinding LeetCode, thinking I knew the pattern, and then completely blanking when I had to implement it from scratch a few weeks later.

AlgoDrill turns NeetCode 150 and more into pattern-based drills: you rebuild the solution line by line with active recall, get first principles editorials that explain why each step exists, and everything is tagged by patterns like sliding window, two pointers, and DP so you can hammer the ones you keep forgetting. The goal is simple: turn familiar patterns into code you can write quickly and confidently in a real interview.
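
For anyone unfamiliar with the pattern names, here's the kind of snippet a sliding-window drill builds toward (a generic example, not taken from the site):

  def max_window_sum(nums, k):
      # Fixed-size sliding window: maintain a running sum instead of
      # recomputing each window from scratch.
      window = sum(nums[:k])
      best = window
      for i in range(k, len(nums)):
          window += nums[i] - nums[i - k]
          best = max(best, window)
      return best

  print(max_window_sum([2, 1, 5, 1, 3, 2], 3))  # 9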

https://algodrill.io

Would love feedback on whether this drill-style approach feels like a real upgrade over just solving problems once, and what’s most confusing or missing when you first land on the site.

algodrill.io
142 86
Summary
Saurabh_Kumar_ 6 days ago

Agentic QA – Open-source middleware to fuzz-test agents for loops

I built this because I watched my LangChain agent burn ~$50 in OpenAI credits overnight due to an infinite loop.

It's a middleware API that acts as a 'Flight Simulator'. You send it your agent's prompt, and it runs adversarial attacks (Red Teaming) to catch loops and PII leaks before deployment.

Code & Repo: https://github.com/Saurabh0377/agentic-qa-api Live Demo: https://agentic-qa-engine.onrender.com/docs

Would love feedback on other failure modes you've seen!

17 5
sjoblomj about 12 hours ago

30 Year Anniversary of WarCraft II: Tides of Darkness

jorsys.org
131 81
Symmetry 5 days ago

AWS Trainium3 Deep Dive – A Potential Challenger Approaching

The article provides an in-depth technical analysis of AWS Trainium3, a new machine learning-focused chip from Amazon Web Services. It delves into the chip's architecture, performance, and potential impact on the cloud computing and AI/ML landscape.

newsletter.semianalysis.com
50 16
Summary
tosh about 11 hours ago

The Joy of Playing Grandia, on Sega Saturn

The article discusses the joy of playing the classic RPG Grandia on the Sega Saturn console. It highlights the game's memorable characters, engaging story, and seamless combat system, as well as the satisfaction of experiencing this beloved title on the Saturn's hardware.

segasaturnshiro.com
157 99
Summary
drob about 4 hours ago

Show HN: Detail, a Bug Finder

Hi HN, tl;dr we built a bug finder that's working really well, especially for app backends. Try it out and send us your thoughts!

Long story below.

--------------------------

We originally set out to work on technical debt. We had all seen codebases with a lot of debt, so we had personal grudges about the problem, and AI seemed to be making it a lot worse.

Tech debt also seemed like a great problem for AI because: 1) a small portion of the work is thinky and strategic, and then the bulk of the execution is pretty mechanical, and 2) when you're solving technical debt, you're usually trying to preserve existing behavior, just change the implementation. That means you can treat it as a closed-loop problem if you figure out good ways to detect unintended behavior changes due to a code change. And we know how to do that – that's what tests are for!

So we started with writing tests. Tests create the guardrails that make future code changes safer. Our thinking was: if we can test well enough, we can automate a lot of other tech debt work at very high quality.

We built an agent that could write thousands of new tests for a typical codebase, most "merge-quality". Some early users merged hundreds of PRs generated this way, but intuitively the tool always felt "good but not great". We used it sporadically ourselves, and it usually felt like a chore.

Around this point we realized: while we had set out to write good tests, we had built a system that, with a few tweaks, might be very good at finding bugs. When we tested it out on some friends' codebases, we discovered that almost every repo has tons of bugs lurking in it that we were able to flag. Serious bugs, interesting enough that people dropped what they were doing to fix them. Sitting right there in people's codebases, already merged, running in prod.

We also found a lot of vulns, even in mature codebases, and sometimes even right after someone had gotten a pentest.

Under the hood:

- We check out a codebase and figure out how to build it for local dev and exercise it with tests.

- We take snapshots of the built local dev state. (We use Runloop for this and are big fans.)

- We spin up hundreds of copies of the local dev environment to exercise the codebase in thousands of ways and flag behaviors that seem wrong.

- We pick the most salient, scary examples and deliver them as Linear tickets, GitHub issues, or emails.

In practice, it's working pretty well. We've been able to find bugs in everything from compilers to trading platforms (even in rust code), but the sweet spot is app backends.

Our approach trades compute for quality. Our codebase scans take hours, far beyond what would be practical for a code review bot. But the result is that we can make more judicious use of engineers’ attention, and we think that’s going to be the most important variable.

Longer term, we think compute is cheap, engineer attention is expensive. Wielded properly, the newest models can execute complicated changes, even in large codebases. That means the limiting reagent in building software is human attention. It still takes time and focus for an engineer to ingest information, e.g. existing code, organizational context, and product requirements. These are all necessary before an engineer can articulate what they want in precise terms and do a competent job reviewing the resulting diff.

For now we're finding bugs, but the techniques we're developing extend to a lot of other background, semi-proactive work to improve codebases.

Try it out and tell us what you think. Free first scan, no credit card required: https://detail.dev/

We're also scanning on OSS repos, if you have any requests. The system is pretty high signal-to-noise, but we don't want to risk annoying maintainers by automatically opening issues, so if you request a scan for an OSS repo the results will go to you personally. https://detail.dev/oss

detail.dev
36 16
Summary
bgwalter about 6 hours ago

Apple's slow AI pace becomes a strength as market grows weary of spending

With the market growing weary of heavy AI capital spending, Apple's comparatively slow and measured approach to artificial intelligence is increasingly being read as a strength rather than a liability. The article looks at how the company's restraint contrasts with rivals' aggressive AI build-outs.

finance.yahoo.com
102 115
Summary
luispa 8 days ago

Constructing the World's First JPEG XL MD5 Hash Quine

The article is a detailed writeup on constructing a JPEG XL hashquine: an image whose rendered content displays its own MD5 hash. The author walks through the MD5 collision techniques used to build it and discusses what it implies about the continued use of MD5.

stackchk.fail
88 17
Summary
Xcelerate 6 days ago

Transformers know more than they can tell: Learning the Collatz sequence

arxiv.org
89 32
embedding-shape about 5 hours ago

Ask HN: Should "I asked $AI, and it said" replies be forbidden in HN guidelines?

As various LLMs become more and more popular, so do comments along the lines of "I asked Gemini, and Gemini said ...".

While the guidelines were written (and iterated on) during a different time, it seems like it might be time to have a discussion about whether those sorts of comments should be welcome on HN or not.

Some examples:

- https://news.ycombinator.com/item?id=46164360

- https://news.ycombinator.com/item?id=46200460

- https://news.ycombinator.com/item?id=46080064

Personally, I'm on HN for the human conversation, and large LLM-generated texts just get in the way of reading real text from real humans (assumed, at least).

What do you think? Should responses that basically boil down to "I asked $LLM about $X, and here is what $LLM said:" be allowed on HN, with the guidelines updated to state that people shouldn't critique them (similar to other current guidelines)? Or should a new guideline be added asking people to refrain from copy-pasting large LLM responses into the comments, or something else completely?

605 343
surprisetalk 4 days ago

Tutorial 48: my museum collections kit

The article provides a tutorial on how to create a 'museum collections kit' to store and organize paleontological and other scientific specimens. It covers the necessary equipment, materials, and step-by-step instructions for assembling a customizable kit for field work and research.

svpow.com
5 0
Summary
felineflock about 2 hours ago

The Big Vitamin D Mistake [pdf]

The paper argues that a statistical error in the analyses underlying official vitamin D recommendations produced a recommended daily allowance roughly an order of magnitude too low, and calls for the guidance to be corrected given vitamin D's role in public health.

pmc.ncbi.nlm.nih.gov
27 9
Summary
harambae about 4 hours ago

How private equity is changing housing

The article examines the increasing influence of private equity firms in the housing market, highlighting how their acquisition of single-family homes has led to rising rents and reduced home ownership opportunities, particularly for low-income and minority communities.

theatlantic.com
72 162
Summary