Navigating Big, Hairy Codebases Should Be Easy

by DataStax, March 12th, 2023

Too Long; Didn't Read

With the insight that big code is big data, Sourcegraph uses the power of knowledge graphs to help developers search and understand any codebase in the world.


We often think that modern computing is divided between code and data. Functionally, this makes sense as we look at any given app. But when we look at a standard microservices architecture, the breadth and depth of the code itself is more than a few files of text — it becomes a dataset of its own. Our ability to manage codebases is limited by our understanding of them, and it is time for us to use the tools built for big data and apply them to the era of big code.


The most famous big data tool is search. The power of search saves precious time. Sourcegraph co-founder and chief technology officer Beyang Liu understood this when he set out to introduce it to the developer world. He knew the pain of entering a new company and learning a new codebase.

Understanding different people’s opinions and styles of code can be overwhelming, and codebases grow over time in unpredictable and confusing ways. So Liu built Sourcegraph, a tool to help developers be more productive. It’s essentially a search engine for code.


With the insight that big code is big data, we can use the power of knowledge graphs to help us to search for and understand any codebase in the world.


I recently spoke with Liu about his journey with Sourcegraph and his long-term goals (to hear the full conversation, listen to the Open Source Data podcast).

What Is Sourcegraph?

Sourcegraph is a free, open source technology that enables you to search across your entire codebase. Its primary goal is to help tackle the most significant part of a software engineer’s job: understanding existing code.


It does so by:


  • Searching everything at once without cloning and locally searching.
  • Easily sharing essential lines of code.
  • Navigating code with integrated development environment (IDE)-inspired features.


“For most software engineers, the biggest part of your job is not writing new code. It’s making sense and understanding all of the code that already exists,” Liu said.

There are two foundational components to Sourcegraph: the search component and the global reference graph.

Search Component

Like most search engines, Sourcegraph’s search component takes a query and presents the best results. Say that a developer is looking for to-dos in a specific repository. The developer can enter a search query such as "repo:facebook/react content:TODO", and Sourcegraph will search for any to-dos in the specified repository. You can see a real example of searching the Facebook React Native repository here. One of the key technologies that makes this possible is an index format optimized for searching code.
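The article does not describe the index format itself. As a rough illustration only, the sketch below assumes a trigram index, a technique commonly used for code search, and it is not presented as Sourcegraph's actual format. The class name, the parse_query helper, and the file paths and contents are all hypothetical.

```python
from collections import defaultdict


def trigrams(text):
    """Return the set of all 3-character substrings of text."""
    return {text[i:i + 3] for i in range(len(text) - 2)}


class TrigramIndex:
    """Toy inverted index mapping each trigram to the (repo, path) files containing it."""

    def __init__(self):
        self.postings = defaultdict(set)
        self.docs = {}

    def add(self, repo, path, content):
        key = (repo, path)
        self.docs[key] = content
        for gram in trigrams(content):
            self.postings[gram].add(key)

    def search(self, repo, needle):
        """Return (path, line_number, line) matches for needle within repo."""
        grams = trigrams(needle)
        if grams:
            # Only files containing every trigram of the needle can match.
            candidates = set.intersection(
                *(self.postings.get(g, set()) for g in grams)
            )
        else:
            # Needles shorter than 3 characters cannot be filtered; scan all files.
            candidates = set(self.docs)
        hits = []
        for key in sorted(candidates):
            if key[0] != repo:
                continue
            # Confirm each candidate by scanning its lines for the literal needle.
            for number, line in enumerate(self.docs[key].splitlines(), start=1):
                if needle in line:
                    hits.append((key[1], number, line.strip()))
        return hits


def parse_query(query):
    """Split a query like 'repo:x content:y' into a dict of filters."""
    filters = {}
    for token in query.split():
        name, _, value = token.partition(":")
        filters[name] = value
    return filters


# Hypothetical files, used only to exercise the sketch.
index = TrigramIndex()
index.add("facebook/react", "src/a.js", "// TODO: remove\nconst x = 1;\n")
index.add("facebook/react", "src/b.js", "const y = 2;\n")
index.add("other/repo", "main.py", "# TODO later\n")

filters = parse_query("repo:facebook/react content:TODO")
results = index.search(filters["repo"], filters["content"])
```

The index narrows the search to files that could contain the query before any file is scanned line by line, which is why this kind of lookup stays fast as the number of files grows.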


While he was an engineering intern on Google Apps’ backend team in 2010, Liu was inspired by his use of Google Code Search — it’s what led him to employ the index format. Another thing that caught his eye was Russ Cox’s work on the initial implementation of Google’s internal code search and Han-Wen Nienhuys’s re-implementation of it in the form of an open source library called Zoekt.


“The centerpiece of that experience was this Code Search engine that indexed all of the codes at Google and made it accessible to every developer, whether you were an intern or a very senior, Jeff Dean-level engineer,” Liu said.

Global Reference Graph

The global reference graph helps you to understand the codebase and perform functions such as “go to definition” and “find references.” Both require mapping the entire codebase so that a symbol can be traced to the right place.


Sourcegraph uses a range of compiler libraries and open protocols to achieve this. It also has its own protocols, such as srclib and SCIP, which are better suited to Sourcegraph’s requirements.

“It’s all about providing this language-agnostic interface to these language-specific indexers that use compiler knowledge to build the global reference graph,” Liu said.
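To make the idea of a reference graph concrete, here is a minimal sketch, assuming that language-specific indexers emit simple facts of the form "symbol X is defined here" and "symbol X is used here." It does not reproduce SCIP or any real Sourcegraph data model; the class, the method names, and the file paths are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Location:
    """A position in the codebase (hypothetical, simplified to path and line)."""
    path: str
    line: int


class ReferenceGraph:
    """Toy store of definition and reference facts, keyed by symbol name."""

    def __init__(self):
        self.definitions = {}
        self.references = defaultdict(list)

    def add_definition(self, symbol, location):
        # An indexer reports where a symbol is defined.
        self.definitions[symbol] = location

    def add_reference(self, symbol, location):
        # An indexer reports a place where a symbol is used.
        self.references[symbol].append(location)

    def go_to_definition(self, symbol):
        """Return the definition location, or None if the symbol is unknown."""
        return self.definitions.get(symbol)

    def find_references(self, symbol):
        """Return every recorded use of the symbol, in insertion order."""
        return list(self.references.get(symbol, []))


# Facts that a language-specific indexer might emit (hypothetical values).
graph = ReferenceGraph()
graph.add_definition("render", Location("lib/render.py", 10))
graph.add_reference("render", Location("app/main.py", 42))
graph.add_reference("render", Location("tests/test_render.py", 7))

definition = graph.go_to_definition("render")
usages = graph.find_references("render")
```

The query side knows nothing about any particular programming language. All language knowledge lives in whatever produces the facts, which matches the language-agnostic interface Liu describes.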

From Chaos to Action

Sourcegraph started when Liu got his first job out of school, at Palantir Technologies. He faced one of the problems everyone has faced when starting a new job as a software engineer:


“I got drop-shipped into this large, complex codebase that had been through multiple owners,” Liu recalled. “It was a bit messy, and I remember, at the end of that first month or so, looking back and asking myself, ‘What have I accomplished here? I’ve been spending all my time just trying to make sense of what’s going on in this code and figuring out why it’s written the way it is. It seems like more of my job is just exploring the existing code and figuring out how the relatively small piece I’m trying to add fits into that broader picture.’”


Liu’s time at Google exposed him to a suite of internal developer tools, one of which was Google Code Search, which made all the code at Google accessible. This experience, along with the onboarding pain at Palantir, drove Liu to create something that would help other software engineers avoid the same issues.


Conversations with Quinn Slack, a colleague of Liu’s at Palantir, about creating a universal code search tool turned into action, and Sourcegraph was the result.

Sourcegraph’s Future

In 2011, Marc Andreessen wrote about how software is eating the world. The signs are everywhere: from ordering food, to booking a ride, to controlling your home’s heating.

But Liu thinks that we’re only seeing the tip of the iceberg. He said that understanding code will become an everyday thing.


He compared it to literacy, saying, “We once lived in a world where being able to read and write was limited to a very small, elite portion of society, limiting the extent to which human civilization could advance.”


When code powers almost everything in our life, understanding it will become a universal requirement, Liu said. This thought is what fueled Liu’s passion for building Sourcegraph. Creating a search engine for code will grant people access to the vast open source ecosystem — all with a simple search query.


By Sam Ramji, DataStax

Sam Ramji is chief strategy officer of DataStax. A 25-year veteran of the Silicon Valley and Seattle technology scenes, Sam has helped build two multibillion dollar markets (API management at Apigee and enterprise service bus at BEA Systems) and redefined Microsoft's open source and Linux strategy from "extinguish" to "embrace." He is nerdy about open source, platform economics, middleware and cloud computing with emphasis on developer experience and enterprise software.

