How AI Boosted Our Productivity and Helped Add Vector Search to Apache Cassandra in 6 Weeks

With the huge demand for vector search functionality that’s required to enable generative AI applications, DataStax set an extremely ambitious goal to add this capability to Apache Cassandra and Astra DB, our managed service built on Cassandra.

Back in April, when I asked our chief product officer who was going to build it, he said, “Why don’t you do it?”

With two other engineers, I set out to deliver a new vector search implementation on June 7 — in just six weeks.

Could new AI coding tools help us meet that goal? Some engineers have confidently claimed that AI makes so many mistakes that it’s a net negative to productivity:

https://twitter.com/nullvoxpopuli/status/1599239800208302081?ref_src=twsrc^tfw&embedable=true

And more recently:

https://twitter.com/adam___dee/status/1684201152970211328?ref_src=twsrc^tfw&embedable=true

After trying them out on this critical project, I’m convinced that these tools are, in fact, a massive boost to productivity. I’m never going back to writing everything by hand. Here’s what I learned about coding with ChatGPT, GitHub Copilot, and other AI tools.

Copilot

Copilot is simple: It’s enhanced autocomplete. Most of the time, it will complete a line for you or pattern-match a completion of several lines from context. Here, I’ve written a comment and then started a new line by typing neighbors. Copilot offered to complete the rest correctly (its suggestion is the text following ‘neighbors’ on the second line):

Here’s a slightly more involved example from test code, where I started off writing the loop as a mapToLong but then changed my data structures so that it ended up being cleaner to invoke a method with forEach instead. Copilot had my back:
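To make the two styles concrete, here is a hypothetical illustration; the original screenshot isn’t reproduced here, and the `Node` record, its `bytesUsed` field, and the `Tracker` class are invented for this sketch. It computes the same total once with a `mapToLong` pipeline and once with `forEach` handing each element to a method.

```java
import java.util.List;

class StreamStyles {
    // Hypothetical element type; stands in for whatever the real test iterated over.
    record Node(int id, long bytesUsed) {}

    // Style 1: reduce the collection to a single number with mapToLong + sum.
    static long totalBytes(List<Node> nodes) {
        return nodes.stream().mapToLong(Node::bytesUsed).sum();
    }

    // Style 2: if a data structure already knows how to absorb an element,
    // passing each element to it with forEach can read more cleanly.
    static final class Tracker {
        private long total;

        void add(Node node) {
            total += node.bytesUsed();
        }

        long total() {
            return total;
        }
    }

    static long totalBytesViaTracker(List<Node> nodes) {
        Tracker tracker = new Tracker();
        nodes.forEach(tracker::add);
        return tracker.total();
    }
}
```

Which version reads better depends on the surrounding data structures; the `forEach` form only wins when something like `Tracker` already exists.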

And occasionally (this is more the exception than the rule), it surprises me by offering to complete an entire method:

Copilot is useful but limited for two reasons. First, it’s tuned to (correctly) err on the side of caution. It can still hallucinate, but that’s rare; when it doesn’t think it knows what to do, it doesn’t offer a completion. Second, it has to be fast enough to fit seamlessly into a brief pause in typing, which rules out using a heavyweight model like GPT-4 for now. (See also this tweet from Max Krieger for a “Copilot maximalist” take.)

ChatGPT

You can try to get Copilot to generate code from comments, but for that use case, you will almost always get better results from GPT-4 via paid ChatGPT or API access.

If you haven’t tried GPT-4 yet, you absolutely should. It’s true that it sometimes hallucinates, but far less often than GPT-3.5 or Claude. It’s also true that sometimes it can’t figure out simple problems (here I am struggling to get it to understand a simple binary search). But other times, it’s almost shockingly good, like this time when it figured out my race condition on its first try. And even when it’s not great, having a rubber duck debugging partner that can respond with a passable simulacrum of intelligence is invaluable for staying in the zone and staying motivated.

And you can use it for everything. Or at least anything you can describe with text, which is very close to everything, especially in a programming context.

Here are some places I used GPT-4:

Random questions about APIs that I would otherwise have had to dig through the source code to answer. This is the category most likely to result in hallucinations, and I have largely switched to Phind for this use case (see below).
Micro-optimizations. It’s like Copilot but matching against all of Stack Overflow because that’s (part of) what it was trained on.
Involved Stream pipelines because I am not yet very good at turning the logic in my head into a functional chain of Stream method calls. Sometimes, as in this example, the end result is worse than where we started, but that happens a lot in programming. It’s much easier and faster to do that exploration with GPT than one keystroke at a time. And making that time-to-results loop faster makes it more likely that I’ll try out a new idea since the cost of experimenting is lower.
Custom git tooling. GPT knows git, of course, but you may not realize how good it is at building custom tools on top of git. Like the other items in this list, this is work I could have done by hand before, but having GPT there to speed things up means that I now actually build tools like this. Before, instead of spending an hour on a one-off script, I usually would have reached for whatever the second-best solution was.
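As an illustration of the Stream-pipeline item above, here is a hypothetical filter/sort/limit task written first as loops and then as a Stream chain. The `Candidate` record and the method names are invented for this sketch; they are not from the project.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class Pipelines {
    // Hypothetical search candidate: a node id plus a similarity score.
    record Candidate(int node, float score) {}

    // Loop version: filter by score, sort best-first, keep the first k node ids.
    static List<Integer> topNodesLoop(List<Candidate> candidates, float minScore, int k) {
        List<Candidate> kept = new ArrayList<>();
        for (Candidate c : candidates) {
            if (c.score() >= minScore) {
                kept.add(c);
            }
        }
        kept.sort(Comparator.comparingDouble(Candidate::score).reversed());
        List<Integer> result = new ArrayList<>();
        for (int i = 0; i < Math.min(k, kept.size()); i++) {
            result.add(kept.get(i).node());
        }
        return result;
    }

    // Stream version: the same logic expressed as a single chain of calls.
    static List<Integer> topNodesStream(List<Candidate> candidates, float minScore, int k) {
        return candidates.stream()
                .filter(c -> c.score() >= minScore)
                .sorted(Comparator.comparingDouble(Candidate::score).reversed())
                .limit(k)
                .map(Candidate::node)
                .toList();
    }
}
```

Both versions return the same result, including for ties, because `List.sort` on an `ArrayList` and `Stream.sorted` on an ordered stream are both stable.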

Here’s my favorite collaboration with GPT-4. I needed to write a custom class to avoid the garbage collection overhead of the box/unbox churn from a naive approach using ConcurrentHashMap<Integer, Integer>. This was for Lucene, which has a strict no-external-dependencies policy, so I couldn’t just substitute a concurrent primitive map like Trivago’s fastutil-concurrent-wrapper.
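For readers curious what such a class involves, here is a minimal sketch under stated assumptions; it is not the class that ended up in Lucene. It uses only `java.util.concurrent.atomic`, in keeping with the no-dependencies constraint, and it assumes a capacity fixed up front, no removals, and that `Integer.MIN_VALUE` is never used as a key. The class name `ConcurrentIntIntMap` is hypothetical.

```java
import java.util.concurrent.atomic.AtomicLongArray;

// Hypothetical sketch: fixed-capacity, insert/update-only concurrent int -> int map
// using open addressing with linear probing. No boxing, no external dependencies.
final class ConcurrentIntIntMap {
    private static final int EMPTY_KEY = Integer.MIN_VALUE; // reserved, never a real key
    private static final long EMPTY_SLOT = pack(EMPTY_KEY, 0);

    // Each slot packs key (high 32 bits) and value (low 32 bits) into one long.
    private final AtomicLongArray slots;
    private final int mask;

    ConcurrentIntIntMap(int capacity) {
        if (capacity <= 0 || Integer.bitCount(capacity) != 1) {
            throw new IllegalArgumentException("capacity must be a power of two");
        }
        slots = new AtomicLongArray(capacity);
        for (int i = 0; i < capacity; i++) {
            slots.set(i, EMPTY_SLOT);
        }
        mask = capacity - 1;
    }

    private static long pack(int key, int value) {
        return ((long) key << 32) | (value & 0xFFFF_FFFFL);
    }

    private static int keyOf(long slot) { return (int) (slot >>> 32); }

    private static int valueOf(long slot) { return (int) slot; }

    private int indexFor(int key) {
        int h = key * 0x9E3779B9; // multiplicative hashing spreads clustered keys
        return (h ^ (h >>> 16)) & mask;
    }

    // Inserts or overwrites the value for key; throws if the table is full.
    void put(int key, int value) {
        if (key == EMPTY_KEY) throw new IllegalArgumentException("reserved key");
        long desired = pack(key, value);
        int i = indexFor(key);
        for (int probes = 0; probes <= mask; probes++, i = (i + 1) & mask) {
            long current = slots.get(i);
            if (current == EMPTY_SLOT) {
                if (slots.compareAndSet(i, EMPTY_SLOT, desired)) return;
                current = slots.get(i); // another thread claimed this slot first
            }
            // Claimed slots never become empty or change keys, so retrying in place is safe.
            while (keyOf(current) == key) {
                if (slots.compareAndSet(i, current, desired)) return;
                current = slots.get(i);
            }
        }
        throw new IllegalStateException("map is full");
    }

    // Returns the value for key, or defaultValue if the key is absent.
    int get(int key, int defaultValue) {
        int i = indexFor(key);
        for (int probes = 0; probes <= mask; probes++, i = (i + 1) & mask) {
            long current = slots.get(i);
            if (current == EMPTY_SLOT) return defaultValue;
            if (keyOf(current) == key) return valueOf(current);
        }
        return defaultValue;
    }
}
```

Packing each key and value into a single `long` is the central design choice here: one compare-and-set publishes both at once, so a reader can never observe a claimed key whose value hasn’t been written yet. A production version would also need resizing, which adds considerable complexity.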

I went back and forth several times with GPT, improving its solution. This conversation illustrates what I think are several best practices with GPT (as of mid-2023):

When writing code, GPT does best with nicely encapsulated problems. By contrast, I have been mostly unsuccessful in trying to get it to perform refactorings that touch multiple parts of a class, even a small one.
Phrase suggestions as questions. “Would it be more efficient to … ?” GPT (and, even more so, Claude) is reluctant to directly contradict its user. Leave it room to disagree, or you may unintentionally force it to start hallucinating.
Don’t try to do everything in the large language model (LLM). The final output from this conversation still needed some tweaks, but it was close enough to what I wanted that it was easier and faster to finish it manually than to coax GPT into getting it exactly right.
Generally, I am not a believer in magical prompts — it is better to use a straightforward prompt, and if GPT goes off in the wrong direction, correct it — but there are places where the right prompt can indeed help a great deal. Concurrent programming in Java is one of those places. GPT’s preferred solution is to just slap synchronized on everything and call it a day. I found that telling it to think in the style of concurrency wizard Cliff Click helps a great deal. More recently, I’ve also switched to using a lightly edited version of Jeremy Howard’s system prompt.
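To illustrate the gap between the “slap synchronized on everything” style and the kind of code concurrency-minded prompting aims for, here is a hypothetical example; the class names are invented, and this is not output from the conversations above. It shows the same statistics tracker written both ways, the second using lock-free primitives from `java.util.concurrent.atomic`.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.LongAdder;

// The "synchronized on everything" version: correct, but every caller
// contends on the same monitor.
final class SynchronizedStats {
    private long count;
    private long max = Long.MIN_VALUE;

    synchronized void record(long value) {
        count++;
        if (value > max) {
            max = value;
        }
    }

    synchronized long count() { return count; }

    synchronized long max() { return max; }
}

// A lock-free alternative: LongAdder spreads increments across internal cells
// under contention, and accumulateAndGet updates the max with a CAS loop.
final class LockFreeStats {
    private final LongAdder count = new LongAdder();
    private final AtomicLong max = new AtomicLong(Long.MIN_VALUE);

    void record(long value) {
        count.increment();
        max.accumulateAndGet(value, Math::max);
    }

    long count() { return count.sum(); }

    long max() { return max.get(); }
}
```

The trade-off: in `LockFreeStats`, `count` and `max` are updated independently, so a reader can briefly see a count that includes a value whose max update hasn’t landed yet. If callers need a consistent snapshot of both, the synchronized version is the simpler correct choice.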

Looking at this list, it’s striking how well it fits the rule of thumb that AI is like having infinite interns at your disposal. Interns do best with self-contained problems and are often reluctant to contradict their team lead, and it’s frequently easiest to just finish the job yourself rather than explain what you want in enough detail for the intern to do it. (While I recommend resisting that temptation with real interns, with GPT, it doesn’t matter.)

Advanced Data Analysis

Advanced Data Analysis (formerly known as Code Interpreter), also part of ChatGPT, is next level, and I wish it had been available for Java yesterday. It wraps GPT-4’s Python code generation in a Jupyter or Jupyter-like sandbox and puts it in a loop to correct its own mistakes. Here’s an example from when I was troubleshooting why my indexing code was building a partitioned graph.
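ADA did that analysis in Python; to keep this article’s examples in one language, here is a Java sketch of the kind of check involved. It assumes the graph is stored as neighbor arrays indexed by node ordinal (`neighbors[i]` holds the outgoing edges of node `i`), and the class and method names are hypothetical. Union-find counts weakly connected components (ignoring edge direction), and a breadth-first search from an entry point counts the nodes a search could actually reach.

```java
import java.util.ArrayDeque;

final class GraphCheck {
    // Counts weakly connected components (edge direction ignored) with union-find.
    static int countComponents(int[][] neighbors) {
        int n = neighbors.length;
        int[] parent = new int[n];
        for (int i = 0; i < n; i++) {
            parent[i] = i;
        }
        int components = n;
        for (int node = 0; node < n; node++) {
            for (int neighbor : neighbors[node]) {
                int a = find(parent, node);
                int b = find(parent, neighbor);
                if (a != b) {
                    parent[a] = b; // merge the two sets
                    components--;
                }
            }
        }
        return components;
    }

    private static int find(int[] parent, int x) {
        while (parent[x] != x) {
            parent[x] = parent[parent[x]]; // path halving keeps trees shallow
            x = parent[x];
        }
        return x;
    }

    // Counts nodes reachable from the entry point by following edges as a search would.
    static int countReachable(int[][] neighbors, int entry) {
        boolean[] seen = new boolean[neighbors.length];
        ArrayDeque<Integer> queue = new ArrayDeque<>();
        seen[entry] = true;
        queue.add(entry);
        int reached = 1;
        while (!queue.isEmpty()) {
            int node = queue.poll();
            for (int neighbor : neighbors[node]) {
                if (!seen[neighbor]) {
                    seen[neighbor] = true;
                    reached++;
                    queue.add(neighbor);
                }
            }
        }
        return reached;
    }
}
```

A healthy graph should report a single component and reach every node from the entry point; more than one component is the signature of a partitioned graph.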

The main problem to watch for is that ADA likes to “solve” problems with unexpected input by throwing the offending lines away, which usually isn’t what you want. It’s also usually satisfied with its efforts once the code runs to completion without errors, so you will need to be specific about the sanity checks you want it to include. Once you tell it what to look for, it will add that to its “iterate until it succeeds” loop, and you won’t have to keep repeating yourself.
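Here is a hypothetical example of the kind of explicit sanity check worth asking for, written in Java for consistency with the rest of this article (ADA itself would write Python). The input format (one `from,to` pair per line), the `Edge` record, and the rejection threshold are assumptions for illustration. Malformed lines are still skipped, but they are counted, and the parse fails loudly if too many are dropped.

```java
import java.util.ArrayList;
import java.util.List;

final class StrictParse {
    // Hypothetical input record: one directed edge per "from,to" line.
    record Edge(int from, int to) {}

    // Returns the parsed edge, or null if the line is malformed.
    private static Edge tryParse(String line) {
        String[] parts = line.split(",");
        if (parts.length != 2) {
            return null;
        }
        try {
            return new Edge(Integer.parseInt(parts[0].trim()), Integer.parseInt(parts[1].trim()));
        } catch (NumberFormatException e) {
            return null;
        }
    }

    // Skips malformed lines but counts them, and throws if too many were dropped.
    static List<Edge> parseEdges(List<String> lines, double maxRejectFraction) {
        List<Edge> edges = new ArrayList<>();
        List<String> rejected = new ArrayList<>();
        for (String line : lines) {
            Edge edge = tryParse(line);
            if (edge == null) {
                rejected.add(line);
            } else {
                edges.add(edge);
            }
        }
        if (!rejected.isEmpty() && (double) rejected.size() / lines.size() > maxRejectFraction) {
            throw new IllegalStateException("rejected " + rejected.size() + " of " + lines.size()
                    + " lines; first rejected line: " + rejected.get(0));
        }
        return edges;
    }
}
```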

Also worth mentioning: The rumor mill suggests that ADA is now running a more advanced model than regular GPT-4, with (at minimum) a longer context window. I use ADA for everything by default now, and it does seem like an improvement; the only downside is that sometimes it will start writing Python for me when I want Java.

Claude

Claude, from Anthropic, is a competitor to OpenAI’s GPT. For writing code, Claude is roughly at the GPT-3.5 level, noticeably worse than GPT-4.

But Claude has a 100,000-token context window, which is over ten times what you get with GPT-4. (OpenAI just announced ChatGPT Enterprise, which increases GPT-4’s context window to 32,000 tokens, still only about a third of Claude’s.)
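For a sense of scale, assuming the standard GPT-4 context window of 8,192 tokens and the 32K variant’s 32,768 tokens, the ratios work out to:

```latex
\frac{100{,}000}{8{,}192} \approx 12.2,
\qquad
\frac{32{,}768}{100{,}000} \approx 0.33
```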

I used Claude for three things:

Pasting in entire classes of Cassandra code to help figure out what they do.
Uploading research papers and asking questions about them.
Doing both at once: Here’s a research paper; here’s my implementation in Java. How are they different? Do those differences make sense, given constraints X and Y?

Bing and Phind

Bing Chat got a bunch of attention when it launched earlier this year, and it’s still a good source of free GPT-4 (select the “Creative” setting), but that’s about it. I have stopped using it almost entirely. Whatever Microsoft did to Bing’s flavor of GPT-4 made it much worse at writing code than the version in ChatGPT.

Instead, when I want an AI-flavored search, I use Phind. It’s what Bing should have been, but for whatever reason, a tiny startup out-executed Microsoft on one of its flagship efforts. Phind has completely replaced Google for my “how do I do X”-type questions in Java, Python, git, and more. Here’s a good example of solving a problem with an unfamiliar library. On this kind of query, Phind almost always nails it, and with relevant sources, too. In contrast, Bing will almost always cite at least one source as saying something different from what it actually says.

Bard

I haven’t found anything that Bard is good at yet. It doesn’t have GPT-4’s skill at writing code or Claude’s large context window. Meanwhile, it hallucinates more than either.

Making coding productive — and fun

Cassandra is a large and mature codebase, which can be intimidating to a newcomer looking to add a feature (even to me, after ten years spent mostly on the management side). This is exactly where AI can help any of us move faster. ChatGPT and related AI tooling are good at writing code to solve well-defined problems, either as part of a larger project designed by a human engineer or for one-off tooling. They are also useful for debugging, sketching out prototypes, and exploring unfamiliar code.

In short, ChatGPT and Copilot were key to meeting our deadline. Having these tools makes me 50% to 100% more productive, depending on the task. They have limitations, but they excel at tirelessly iterating on smaller tasks, and they help their human supervisor stay in the zone by acting as an uncomplaining partner to bounce ideas off. Even if you have years of programming experience, you should be using them.

Finally, even setting the productivity gains aside, coding with an AI that handles the repetitive parts is just more fun. It’s given me a second wind and a new level of excitement for building cool things. I look forward to using more advanced versions of these tools as they evolve and mature.

Try building on Astra DB with vector search.

By Jonathan Ellis, DataStax

Also published here.