There's a new kind of coding I call "vibe coding", where you fully give in to the vibes, embrace exponentials, and forget that the code even exists. It's possible because the LLMs (e.g. Cursor Composer w Sonnet) are getting too good. Also I just talk to Composer with SuperWhisper so I barely even touch the keyboard. I ask for the dumbest things like "decrease the padding on the sidebar by half" because I'm too lazy to find it. I "Accept All" always, I don't read the diffs anymore. When I get error messages I just copy paste them in with no comment, usually that fixes it. The code grows beyond my usual comprehension, I'd have to really read through it for a while. Sometimes the LLMs can't fix a bug so I just work around it or ask for random changes until it goes away. It's not too bad for throwaway weekend projects, but still quite amusing. I'm building a project or webapp, but it's not really coding - I just see stuff, say stuff, run stuff, and copy paste stuff, and it mostly works.
Pace Karpathy, I’m not feeling the vibe. It’s not that I don’t want to. More and more I feel the urge to “retire” from programming in the narrow sense of writing Python code, debugging unit tests, refactoring object hierarchies, etc. I love this stuff. I have loved it since I was a thirteen-year-old in the early eighties writing a Light Cycles game in BASIC on my TRS-80 Color Computer. I am better at it than most people, but there’s a ceiling effect: eventually you get to where you have simply mastered the craft, and programming becomes a very sophisticated sort of crossword puzzle. Enjoyable, but maybe not the best use of your time. At this point, much of my professional energy goes towards avoiding writing code. I reuse my own work extensively and go to great lengths to make it reusable by others. I keep abreast of the open source world and all the solutions it provides for free. I spend an inordinate amount of time reminding junior programmers not to reimplement Elasticsearch.
On the research side, I find myself interested in big ideas about machine learning—in particular underexplored areas of linguistic and psychological significance—but my patience for mastering the necessary tools and technological prerequisites has decreased. There are many things I want to do, but nothing in particular that I want to code.
So having a computer just do all the damn programming for me—yeah that sounds great. It’s an arrangement I actively pursue with my current hybrid Copilot/Zencoder setup. I, for one, welcome our new AI underlings. Like Karpathy, I frequently ask my AI for assistance out of laziness and hit “Accept All” blindly, but I have yet to get to the point where I’m willing to hand over significant control to coding agents, not because I’m afraid to, but because they don’t even try to do the part that I find hard.
A concrete example.
I am interested in placing multiple LLMs into adversarial situations under the assumption that agency only really shows its true colors under pressure. To this end, I work on implementing competitive games that have a negotiation component as Multi-Agent Reinforcement Learning (MARL) problems. The specific games I have in mind have well-defined notions of legal and illegal moves, so I want to make use of action masking.
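For the unfamiliar, the mechanism itself is simple: the environment reports which moves are legal in the current state, and the policy clamps the logits of illegal moves to effectively negative infinity before sampling, so they are never chosen. A minimal, framework-independent sketch (everything here is illustrative, not RLlib code):

```python
# Core action-masking trick: clamp the logits of illegal actions to a
# huge negative value so the softmax gives them (effectively) zero
# probability.
import numpy as np

def mask_logits(logits: np.ndarray, action_mask: np.ndarray) -> np.ndarray:
    # action_mask[i] == 1 means action i is legal in the current state.
    return np.where(action_mask.astype(bool), logits,
                    np.finfo(np.float32).min)

logits = np.array([1.2, 0.3, -0.5, 2.0], dtype=np.float32)
mask = np.array([1, 0, 1, 0], dtype=np.int8)  # actions 1 and 3 illegal
masked = mask_logits(logits, mask)
probs = np.exp(masked - masked.max())
probs /= probs.sum()  # actions 1 and 3 now have zero probability
```

The idea is trivial on its own; as will become clear, the hard part is threading that mask through a framework’s training machinery.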
“Multiplayer competition” = MARL and “illegal moves” = action masking is the level on which I’d like to be thinking: I’d prefer the machine to work out the details. So I started by straight up asking Claude to write me a simple action-masked MARL game. It did, and it didn’t work. This was the beginning of an arduous process which required me to make decisions at various junctures about how best to proceed. Though all of those decisions were deeply informed by my experience with writing code, none entailed much coding in and of itself.
1. Was getting the initial code to work just a matter of debugging, or did Claude in fact not know what it was talking about?
Claude’s code looked right. It had the same general shape as examples from the documentation of RLlib, the reinforcement learning framework I was using. However, the error messages Claude’s code produced were arcane—they’d make reference to problems deep within the internals of the framework, or confidently print out warnings that made no sense. It all felt fundamentally wide of the mark, and I couldn’t ask Claude for help because I knew from past experience that LLM agents don’t make good debugging partners: if they don’t know an answer immediately, they will blithely and confidently lead you in circles for as long as you let them. The easy way had ended in an impasse, so now what?
2. Does RLlib even support action masking in MARL environments?
Nothing said that it didn’t, but nothing said that it did either. I lean towards RLlib because it’s part of the Ray platform, which in turn I trust because it’s mature with a broad user base, and I have successfully used it in the past. There’s good RLlib documentation for MARL, but the only mention of action masking is buried deep in example code. Maybe nobody had tried to get this particular combination of features to work.
3. Should I give up on RLlib entirely?
I asked for clarification on the RLlib discussion group and Slack channel and was met with silence. Poking around these forums, Reddit, and Stack Overflow, I saw a lot of frustration akin to my own, people blocked on similar features and unable to either work through the problem or get guidance. Reddit posters discussed implementing their own advanced features on top of RLlib, or giving up on RLlib entirely. The silence that met all our questions—even on the generally responsive RLlib discussion board—was further indication that the functionality we desired only provisionally existed and that the Ray developers were just quietly hoping we’d go away.
On the other hand, I hadn’t found any good alternatives. The various frameworks listed on the PettingZoo site all looked less mature. As for writing my own, I knew that Ray had features—Farama integration, distributed systems—that would take a lot of work to implement. Despite my difficulties, RLlib was still probably my best choice.
4. So how do I write a MARL application with action masking in RLlib?
All that work led back to where I’d started, except now I knew that AI coding agents couldn’t help, so I went into Stack Overflow mode. I shifted my immediate goal from trying to get my simple application working to writing an even simpler toy app that required no more than a half-screenful of source code. My hope was that if I made things simple enough, I’d either get some help or finally figure it out for myself.
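To give a sense of scale, the kind of toy I mean looks something like the sketch below: a two-player turn game whose observations carry an action mask alongside the real features, which is the general shape RLlib’s action-masking examples expect. To be clear, this is an illustrative stand-in, not the code from the actual exchange; it’s written as a plain class so it runs without Ray installed, and it deliberately omits the RLlib wiring that made the real thing delicate.

```python
# A deliberately tiny two-player game of the "half-screenful" variety.
# It mirrors the multi-agent step contract RLlib uses (obs, rewards,
# terminateds, truncateds, infos keyed by agent ID), but all names and
# details here are hypothetical.
import numpy as np
from gymnasium import spaces

N_ACTIONS = 3

class ToyMaskedGame:
    observation_space = spaces.Dict({
        # The mask rides along inside the observation, next to the
        # "real" features -- the convention RLlib's examples use.
        "action_mask": spaces.MultiBinary(N_ACTIONS),
        "observations": spaces.Box(-1.0, 1.0, (2,), dtype=np.float32),
    })
    action_space = spaces.Discrete(N_ACTIONS)

    def reset(self, *, seed=None, options=None):
        self.turn = 0
        return {aid: self._obs() for aid in ("p0", "p1")}, {}

    def step(self, action_dict):
        self.turn += 1
        obs = {aid: self._obs() for aid in action_dict}
        rewards = {aid: 0.0 for aid in action_dict}  # no scoring yet
        terminateds = {"__all__": self.turn >= 5}
        truncateds = {"__all__": False}
        return obs, rewards, terminateds, truncateds, {}

    def _obs(self):
        mask = np.ones(N_ACTIONS, dtype=np.int8)
        mask[self.turn % N_ACTIONS] = 0  # one move is illegal each turn
        return {"action_mask": mask,
                "observations": np.zeros(2, dtype=np.float32)}
```

The game is pointless by design. The point was to have something small enough that someone else could take in the whole problem at a glance.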
I’m oversimplifying of course. The process wasn’t as clean and linear as an enumerated list of questions would imply. For instance, I was writing toy apps all along. This enabled me to finally produce one that was whittled down to the point where a Ray developer was willing to help. We had a bit of a back-and-forth and he led me to the correct implementation. I hadn’t made any conceptual mistakes. Even the original code that Claude wrote was more or less correct. This part of the RLlib interface was just delicate and undocumented, and it was hard to get things exactly right. I concluded the process feeling a mixture of frustration and relief.
My action masking adventure was a particularly difficult and time-consuming process, but it is typical of software development. I run through smaller decision loops all the time. There’s no avoiding it. It’s the hard part of the job. One thing to notice, though, is how little actual code writing is involved.
When Claude’s initial guess failed in (1), I started by debugging the failure, which didn’t involve any coding beyond changing a line here and there. I have yet to find AI assistants helpful for debugging, but they would have been useless here regardless, because I was not trying to debug a problem so much as determine how difficult it would be to debug. There are two kinds of debugging sessions: the kind where, if you’re persistent, you’ll eventually fix the problem, and the kind that reveals you don’t understand what’s going on. How do you know the difference? You can feel it. Which scenario was I in? The latter.
After determining in (1) that I was at sea, in (2) I tried to figure out why. I wrote no code for this part, but I did read a fair amount of RLlib source and examples, flipping back and forth between this and the documentation. I know how work on big software projects goes: sometimes there’s an advanced but low-priority feature that a particular developer will champion. They’ll get it working, but it’ll be delicate, and there’ll be institutional pressure to call it done and move on. In the parts of the RLlib code I found myself paying the most attention to, I kept seeing the same developer’s name next to “TODO” comments. Not a good sign.
The radio silence in (3) was an even worse sign, prompting me to evaluate alternatives. There was absolutely no code writing in this part. It was all reading documentation, skimming examples, gauging how active a given GitHub project appeared to be, and evaluating the quality of documentation websites. Website quality is a big deal. A professionally written and maintained site signals institutional backing, whereas a patchy one indicates a research project that has grown too big for its britches. When you see an illustration produced by an actual graphic designer instead of a harried coder, your confidence increases fivefold.
When I finally got around to writing code in (4)… Well, I already knew a coding agent wouldn’t be of any use, because that’s where I’d started. Besides, the point of writing toys was in large part to ensure I knew how the system worked. A hazy vibe was insufficient. The hazy vibe came for free. I needed clarity.
Karpathy says of his AI coding agent, “It's not too bad for throwaway weekend projects, but still quite amusing.” I can’t tell from this whether he means that AI agents are or are not useful for anything other than throwaway weekend projects, but I suspect the latter, since that has been my experience. Sure, they are helpful for writing code, but writing code is the easy part.
Not that developers should take the above anecdote as evidence that their jobs are safe from AI. Even today an appropriately prompted agent could probably make an educated guess as to how well-supported a particular software framework is, or how fully baked a particular feature within a larger code base appears to be. That’s all still just cognitive labor. But it’s cognitive labor that integrates knowledge from disparate areas: everything from programming to graphic design to a general sense of how people behave in this industry. Well beyond Copilot’s purview.
I suspect this is true of a lot of knowledge work: there’s a narrowly defined task—the sort of thing you’d put in a job description—and then a holistic task of bringing the narrow task into fruitful contact with other human beings and the rest of the world. This may prove to be the real point of developing AGI[1], not to serve as a stepping stone to some ASI God[2] but merely to automate office work that is harder than it looks.
[1] Assuming this is a meaningful stage of development.
[2] Assuming such a thing is possible.