AI Progress doesn't feel as fast as we're told
I do understand that I have absolutely no insight into anything. This is just an observation. Consider it a dummy observation.
I was using Claude Code over a year ago. I was the first to request hooks and create a semi-spec for it. Last summer I was out on walks coding from my phone, connected to agents running on my desktop through Steipete’s VibeTunnel. I’ve done spec-driven development. I’ve done extensive hallucinogenic shit & gastowns, like pretending I’m a general and my agents are my soldiers going into battle. I’ve released popular tooling around all this.
I’ve been in the weeds on these things every day.
It feels like it’s slowing. Marginal improvements. I don’t care what the benchmarks say. I’ve been using these tools for over a year. I’m sure somebody on Hacker News will comment about some philosophical bias. Sure, alright. Maybe.
Or maybe the labs have released models that aren’t as good as previous versions. Maybe they are compute-constrained. Maybe they’ve hit the limit for reinforcement-learned transformer models. I don’t know. I’m not an expert.
Tell you what, though.
- I do not trust models to operate overnight.
- I do not trust any fully vibecoded solution to stand longer than a month.
- I do not think we have the cognitive space to manage 20 agents simultaneously.
- I do not think long-running agents can think for us.
- I do not think quasi-RPA really offers much in the automated browser space.
- I do think it’s a bad sign that the labs are just in a cycle of ripping off each other’s interfaces & features. The second a model degrades, none of the tooling matters anyway.
- I do think an opportunity is arising for open source models, and it’s important to protect it. If the frontier models are stalling, there’s going to be a lot of power and effort aimed at restraining what’s coming out of open source, because it’s not far behind.
- I do not think SV/FAANG/ex-Meta et al. can actually drive safe AI products that are better for humanity. Regardless of their roots, these labs are trying to become the oracles of tomorrow, and their product culture is obvious in how they avoid transparency, run A/B tests, flip flags behind our backs, etc. They’re optimizing tokens and spin to keep your faithful transaction coming every month. Fight for open source.
The head of security at one of the frontier labs told me they didn’t think security was really that much of a problem; models would just get better. That’s probably a red flag as well. I think the better models get, the more friction there will be. It doesn’t seem like we’re any closer to solving true alignment or prompt injection, because the same sycophancy that makes these models useful is also what makes them exploitable.
- The better models get, the more capable they become, in both good and bad directions. The frontier models of today still fall to the same classes of jailbreaks as yesterday’s.
It’ll be interesting to see who is letting WhateverClaw touch their email a year from now, and whose agent has been phished into leaking data.
Credit where it’s due
- Amazing progress on the arts and media front. I love using Google Flow, ChatGPT Image 2, and Suno.com. I have no idea what this means for humanity or what the social implications are, but as an average Joe I absolutely love these tools. And they seem to be on a rapid incline, with the ceiling for their improvement nowhere in sight.
- Claude Code is still a phenomenal piece of software despite all of its shortcomings. It gives me power-user capabilities in ways other harnesses do not. Shout out to Pi, which is quickly closing the gap, or at least enabling the community to do so.