Building An In-memory Database in Go

Written by pzinovkin | Published 2022/08/01
Tech Story Tags: golang | tutorial | job-interview | oop | learn-to-code | coding | database | in-memory-database

TLDRAn in-memory database is a great way to understand an interviewee’s approach to coding. This type of question can replicate the work environment very closely: you are under time pressure and there are too many options of what you can be done; you need to make choices fast and produce working code while also making it possible to extend and improve it later. The first question is to ask questions to clarify expectations and reduce the amount of work. If you are new to this question, you may wonder what kind of question it is.via the TL;DR App

One of the questions you may encounter during a design interview for a software engineer position is to design an in-memory database. This question has many variations, but usually, it’s something like this:

Implement an in-memory database using object-oriented design. Examples of the interaction with the database are shown in SQL as a reference, but there is no need to parse SQL. You are free to implement your own API to work with it.

It must support the creation of tables, where columns type can be int or varchar: 
CREATE TABLE cities (id int, name varchar, population int);

The API must allow inserting rows into created table:
INSERT INTO cities (id, name, population) VALUES ("Tokyo", 37);
INSERT INTO cities (id, name, population) VALUES ("Delhi", 28);
INSERT INTO cities (id, name, population) VALUES ("Shanghai", 28);

API must support select operation with the ability to select all columns implicitly (*) or explicitly specified columns. You must also be able to filter rows based on multiple column values combined with AND, where the column is equal (=) or not equal (!=) to a value.
SELECT * FROM cities;
SELECT population FROM cities WHERE name = "Tokyo";
SELECT * FROM cities WHERE name != "Delhi" AND population = 28;

It must also support update and delete operations, with filtering capabilities mentioned above:
UPDATE cities SET population = 25 WHERE name = "Shanghai";
DELETE FROM cities WHERE population != 37;

If you are new to this type of question, you may wonder what kind of question it is. It’s obvious that implementing a production-worthy in-memory database takes a lot of effort. You need to control how memory is allocated and freed, work around thread safety, implement a lot of input validation, cover many edge cases with tests, collect metrics to get better insight into how your database performs, run rigorous stress testing, etc. Just getting answers to all related questions may take far more time than an average interview takes. So, why is this question even asked?

It turns out this sort of question is a great way to understand an interviewee’s approach to coding. It’s exactly the situation when “A picture is worth a thousand words.” And the picture here is basically an open-ended question that gives a wide range of options to explore during the interview. For example, it’s useless to ask a person during the interview, “Do you write tests?” or “Do you validate input data?” - who would say “no”? The same is correct for take-home coding tests or “send us examples of you code.” Usually, what you will get would differ from how a person writes code in work settings. And this type of question can replicate the work environment very closely: you are under time pressure, and there are too many options of what you can be done; you need to make choices fast and produce working code while also making it possible to extend and improve it later.

Let’s try to implement a solution for this question in Go and explore some possibilities. While solving this type of question, I prefer to do it in a set of steps.

Step 0 is to ask questions

Ask questions to clarify expectations and reduce the amount of work. For example, clarifying from the beginning whether you need to handle errors and the interviewer’s expectations about data validation. Sometimes it’s ok just to place TODO in the right places, saving a lot of time. Here are some that come to mind:

  • Do you need to worry about thread safety?
  • Can you omit returning errors and data validation by putting TODOs, for example?
  • Select must return something - is an array of objects is ok, or do you need to print records?
  • Do you need to control the memory?

Obviously, a lot more can come to mind when you start to work with code.

Step 1 is to create a database and a table

If you look closely at the SQL queries in question, you’ll notice that they can be split into two types of SQL: DDL and DML. The first one defines the table and stands aside from data manipulation queries. And before making any queries to data, we need a Database and a Table in it. Let’s implement it.

I prefer to start by writing a test when I work on implementation. It will help you immensely when you hear, “Ok, looks good. Let’s try if it works!” in the last minutes of the interview. It’s not a proper TDD approach - just a simple test that grows with each step of the implementation and replicates some of the queries to the database from the task definition.

import (
	"testing"
	"github.com/google/go-cmp/cmp"
)

type Database struct {
	tables map[string]Table
}

func NewDatabase() *Database {
	return &Database{
		tables: make(map[string]Table),
	}
}

func (d *Database) CreateTable(name string, cols []Column) *Table {
	// TODO: Validation
	t := NewTable(name, cols)
	d.tables[t.Name] = t
	return &t
}

const (
	ColumnInt = 1 << iota
	ColumnVarchar
)

type Column struct {
	Name string
	Type int
}

type Table struct {
	Name    string
	columns []Column
	rows    [][]any
}

func NewTable(name string, cols []Column) Table {
	t := Table{
		Name:    name,
		columns: cols,
	}
	return t
}

func TestDatabase(t *testing.T) {
	db := NewDatabase()

	// CREATE TABLE cities (id int, name varchar, population int);
	tab := db.CreateTable("cities", []Column{
		{"id", ColumnInt},
		{"name", ColumnVarchar},
		{"population", ColumnInt},
	})

	if diff := cmp.Diff(len(db.tables), 1); diff != "" {
		t.Errorf("Select: (-want +got)\n%s", diff)
	}
}

Now we have our Database and Table objects. The Table uses an array as rows storage. Now it’s ready to work with queries that modify data.

Step 2 - insert, select, update, and delete

Again, look closer, and it’s clear that INSERT can be implemented pretty easy, and we need it before the rest of the queries.

//...
func (t *Table) Insert(data ...any) {
	// TODO: Validate data
	t.rows = append(t.rows, data)
}

func TestDatabase(t *testing.T) {
	//...
	// INSERT INTO cities (id, name, population) VALUES ("Tokyo", 37);
	// INSERT INTO cities (id, name, population) VALUES ("Delhi", 28);
	// INSERT INTO cities (id, name, population) VALUES ("Shanghai", 28);
	tab.Insert(1, "Tokyo", 37)
	tab.Insert(2, "Delhi", 28)
	tab.Insert(3, "Shanghai", 28)
}

SELECT, UPDATE and DELETE all have WHERE sections with filtering conditions. Let’s define it as Condition structure. SELECT also requires array columns to return - let’s call this structure Expression. The UPDATE needs to define which fields are updated - it will be Set structure. Now all these additional structs could be replaced with one Column structure containing all fields, but it will make API calls too verbose for my taste (need to name struct fields explicitly: []Set{{"population", 25}} → []Column{{name: "population", value: 25}}).

type Table struct {
	//...
	colIdx map[string]int
}

func NewTable(name string, cols []Column) Table {
	// ...
	t.colIdx = make(map[string]int, len(cols))
	for i, v := range cols {
		t.colIdx[v.Name] = i
	}
	return t
}

// ...
type Condition struct {
	Column string
	Eq     string
	Value  any
}

type Expression struct {
	Column string
}

func (t *Table) Select(cols []Expression, cond []Condition) [][]any {
	// TODO: cols and cond validation
	res := [][]any{}
	for i := 0; i < len(t.rows); i++ {
		if t.rowMatch(cond, i) {
			res = append(res, t.filterColumns(cols, i))
		}
	}
	return res
}

type Set struct {
	Column string
	Value  any
}

func (t *Table) Update(set []Set, cond []Condition) {
	for i := 0; i < len(t.rows); i++ {
		if t.rowMatch(cond, i) {
			for _, s := range set {
				t.rows[i][t.getColumnIndex(s.Column)] = s.Value
			}
		}
	}
}

func (t *Table) Delete(cond []Condition) {
	for i := 0; i < len(t.rows); i++ {
		if t.rowMatch(cond, i) {
			t.rows[i] = nil
		}
	}
}

func (t *Table) getColumnIndex(name string) int {
	return t.colIdx[name]
}

// ... 
func TestDatabase(t *testing.T) {
	// ...
	var got, want [][]any

	// SELECT * FROM cities;
	got = tab.Select([]Expression{{"*"}}, nil)
	want = [][]any{{1, "Tokyo", 37}, {2, "Delhi", 28}, {3, "Shanghai", 28}}
	if diff := cmp.Diff(want, got); diff != "" {
		t.Errorf("Select: (-want +got)\n%s", diff)
	}

	// SELECT population FROM cities WHERE name = "Tokyo";
	got = tab.Select(
		[]Expression{{"population"}},
		[]Condition{{"name", ConditionEq, "Tokyo"}},
	)
	want = [][]any{{37}}
	if diff := cmp.Diff(want, got); diff != "" {
		t.Errorf("Select: (-want +got)\n%s", diff)
	}

	// SELECT * FROM cities WHERE name != "Delhi" AND population = 28;
	got = tab.Select(
		[]Expression{{"*"}},
		[]Condition{{"name", ConditionNe, "Delhi"}, {"population", ConditionEq, 28}},
	)
	want = [][]any{{3, "Shanghai", 28}}
	if diff := cmp.Diff(want, got); diff != "" {
		t.Errorf("Select: (-want +got)\n%s", diff)
	}

	// UPDATE cities SET population = 25 WHERE name = "Shanghai";
	tab.Update(
		[]Set{{"population", 25}},
		[]Condition{{"name", ConditionEq, "Shanghai"}},
	)

	got = tab.Select([]Expression{{"*"}}, nil)
	want = [][]any{{1, "Tokyo", 37}, {2, "Delhi", 28}, {3, "Shanghai", 25}}
	if diff := cmp.Diff(want, got); diff != "" {
		t.Errorf("Select: (-want +got)\n%s", diff)
	}

	// DELETE FROM cities WHERE population != 37;
	tab.Delete([]Condition{{"population", ConditionNe, 37}})

	got = tab.Select([]Expression{{"*"}}, nil)
	want = [][]any{{1, "Tokyo", 37}}
	if diff := cmp.Diff(want, got); diff != "" {
		t.Errorf("Select: (-want +got)\n%s", diff)
	}
}

The test looks like it’s complete. All the code is implemented except rowMatch and filterColumns functions. For Update I need to know the updated column index by its name, hence getColumnIndex and related code. Delete is tricky - I don’t want to delete rows because it will break indices and require iterating from the tail of the array.

Step 3 is filtering

Here I add row matching, condition evaluation, and column filtering. The code can be extended with new types of conditions if needed.

// ...
const (
	ConditionEq = "="
	ConditionNe = "!="
)

func (c Condition) Eval(val any) bool {
	if c.Eq == ConditionEq {
		return c.Value == val
	} else if c.Eq == ConditionNe {
		return c.Value != val
	}
	return false
}

// ...
func (t *Table) rowMatch(cond []Condition, i int) bool {
	// Skip deleted row
	if t.rows[i] == nil {
		return false
	}

	if cond == nil {
		return true
	}

	for _, c := range cond {
		j := t.getColumnIndex(c.Column)
		if !c.Eval(t.rows[i][j]) {
			return false
		}
	}
	return true
}

func (t *Table) filterColumns(cols []Expression, i int) []any {
	row := t.rows[i]

	if cols[0].Column == "*" {
		return append([]any(nil), row...)
	}

	res := make([]any, len(cols))
	for i, c := range cols {
		res[i] = row[t.getColumnIndex(c.Column)]
	}
	return res
}

Now it should work. As you can see, this question is not that difficult with the right approach.

Here are some of my observations and conclusions:

  • It’s not an exercise in showing off your OOP skills. OOP is just a tool; use it wisely.

  • Ask questions. It’s an open-ended question with many routes that are open by asking the right questions.

  • Talk through your ideas with the interviewer. That person is there to understand better your way of thinking, approach to coding, and how knowledgeable you are.

  • Don’t freeze if you are stuck - speak about it. Being stuck eventually is a part of the job. And the ability to reach for help is important for effective teamwork.

  • Preparation is important. An interview is just too short to figure stuff on the run. Being prepared will help you to cover more aspects of the question.

Good luck!


Written by pzinovkin | ...
Published by HackerNoon on 2022/08/01