Is Claude Getting Worse?

According to some, there is a conspiracy brewing at Anthropic. Their models mysteriously got dumber lately, and the company refuses to answer for it. All kinds of anecdotes and explanations are popping up on the ClaudeAI subreddit, including from people who appear to know a lot about how these models work. The drama intensified after someone from the company dropped in to assert that they hadn’t noticed any widespread issues that would cause a global degradation. Everyone loves a good conspiracy, so as the “X-Files” theme started playing in my head, I reflected on my recent experiences with Claude, looking for evidence that I was being lied to.

For the past month I’ve been making extensive use of the Projects feature on Claude.ai. The main project I keep coming back to is a React app with around 10 different components. All of it was born from Claude 3.5 Sonnet.

Over that time, I’ve cycled through all the stages of delight and frustration that are familiar to anyone using these kinds of tools. I swooned when I told Claude that I wanted my app to ‘look like LCARS from Star Trek: The Next Generation’ and it immediately produced a decent version of this, using an efficient set of CSS classes from Tailwind. I had originally imagined that GenAI models would spit out the kind of horrific code you would find if you used a WYSIWYG editor back in the day and dared to look at the raw HTML. This is not the case. Claude’s code is clean and considerate, and it is, as always, incredibly helpful in suggesting improvements.

Eventually, I got to the stage where I had a decent amount of code with some tiny yet distracting bugs hidden in the haystack. Claude was not so good at fixing these. For example, one line that parsed a text prefix wasn’t accounting for a space, so it wasn’t grabbing the right number of characters. I didn’t realize this was the root cause of a bug until Claude had suggested a series of changes to my state management, including a total refactor of the pattern I was using in the app. This kind of pitfall is entirely expected, because AI isn’t magic: it identifies patterns in data and probabilistically generates text. And to be fair, the refactor probably was a good suggestion, considering the growing complexity of my app. Still, in the back of my mind I docked Claude some points in the ‘coding expert’ department.
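To make that prefix bug concrete, here is a minimal sketch in TypeScript. The prefix, function names, and sample string are invented for illustration, not the actual code from my app:

```typescript
// Hypothetical illustration of the prefix-parsing bug described above
// (names and sample data are made up, not the app's actual code).
const PREFIX = "NOTE:";

// Buggy version: slices by the prefix length alone, so the space after the
// colon stays at the front of the result and everything downstream is off
// by one character.
function stripPrefixBuggy(line: string): string {
  return line.slice(PREFIX.length);
}

// Fixed version: trim the leading space after removing the prefix.
function stripPrefixFixed(line: string): string {
  return line.slice(PREFIX.length).trimStart();
}

console.log(stripPrefixBuggy("NOTE: ship it")); // " ship it"
console.log(stripPrefixFixed("NOTE: ship it")); // "ship it"
```

A one-character bug like this is exactly the kind of needle that is easy for both a human and a model to overlook while reasoning about the larger architecture.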

This type of experience is common with any new technology. The honeymoon phase, where you are amazed by all the possibilities, is quickly followed by the realization that this is yet another case of ‘garbage in, garbage out.’ It still relies on detailed requirements and quality inputs to do a good job. It still requires practice, patience, and never-ending iteration to meet your growing expectations. Our collective expectations keep rising for tools like Claude, even as the overall hype about GenAI (hopefully) dies down.

Personally, I think this conspiracy says more about the power of suggestion on a social platform like Reddit, combined with the limitations of a probabilistic system that depends heavily on the inputs we provide. Each new version of a particular technology will continue to renew a sense of possibility and excitement about how much more it can do with less human labor. I think this cycle is essential to get our imaginations pumping. I like Ethan Mollick’s principle that “this is the worst AI you will ever use” because it captures both that possibility and a groundedness in reality. Either way, we need to be the ones to bridge that gap, not the people who train the models.

Written on August 29, 2024