MarketStore, the financial time series database, is now open source

by Alpaca, February 12th, 2018

We are happy to announce that MarketStore (https://github.com/alpacahq/marketstore) is now open source! MarketStore is a database server optimized for financial timeseries data. It is written in pure Go and was designed and developed by Alpaca. You can think of it as an extensible DataFrame service that is accessible from anywhere in your system and offers higher scalability.

It is designed from the ground up to address the scalability issues of handling the large amounts of financial market data used in algorithmic trading backtesting, charting, and price history analysis, with data spanning many years, including tick-level data for all US equities or the exploding cryptocurrency space. If you are struggling to manage lots of HDF5 files, this is the perfect solution to your problem.

The Problem

A few years ago, Alpaca started developing AlpacaAlgo, which helps retail traders turn their own ideas into trading algos without writing a line of code. The platform trains a deep learning model on historical price data to capture each trading idea and runs backtests quickly. You can then switch the algo to live mode, which should perform the same calculation as the backtest.

Then came AlpacaScan, which offers pre-built algorithm results for all US equities along with quick backtesting results. Running complicated logic against eight thousand symbols instantly, in a timeseries manner, is not an easy task.

In doing so, we found that access to more than ten years of tick-level historical price data for thousands of symbols was the performance bottleneck of the system and a pain point for the user experience. Most of our application layer is written in Python using PyData libraries, including Pandas DataFrame, but these focus on calculation and analysis. It was clear that we needed something to provide fast access to this huge amount of data from different parts of the system.

The Solution

Back then, the easiest way to handle financial market timeseries data was to store DataFrames in HDF5. It is the de facto storage format for small use cases, but it did not fit our case for scalability reasons.

So we sat down together and explored the possibilities. The idea came up quickly: we needed some kind of database that serves DataFrames on demand. It should be an HTTP-based API service for easy access, store terabytes of price data on disk, provide simple data management on the filesystem, be able to update 10k+ symbol prices every second, and respond to 10k+ clients with sub-second latency.

Looking around GitHub for possible solutions, we found many timeseries databases, but they were mainly built for general-purpose timeseries data, targeting IoT sensor data or system monitoring metrics and designed for JSON. Financial timeseries data has different requirements: the data is dense and more structured, and a long history is needed. We couldn’t find a good solution for this particular use case. It was obvious that not just we but anyone working with financial timeseries data would need one. As more and more people write automated trading systems or analyze this kind of data, handling it efficiently becomes essential. We also wanted the database server to be written cleanly and be reusable, so that one day it could be open sourced for everyone to use.

That’s how we started developing MarketStore, a pure Go database designed for financial timeseries data. We designed it with a true database architecture. Luke, one of the co-founders and the CTO of the successful MPP database Greenplum, joined the conversation and designed the storage layer. Hitoshi, who used to work for Greenplum and Pivotal as an architect of Greenplum and is a major contributor to PostgreSQL, developed the code base around the query and plugin architecture.

Features


  • HTTP-based API with MessagePack binary serialization
    Many traditional databases have their own proprietary wire protocols. In this cloud age, though, it makes more sense for MarketStore to follow the standard HTTP-based API approach. HTTP is mostly text-based, but for efficiency MarketStore uses MessagePack as the data serialization format, which allows efficient data exchange between server and client.

  • Time-indexed, row-oriented storage format for optimal reads and writes
    Everything stored inside MarketStore is assumed to be timeseries. Rows can be addressed at a precise byte offset by the timestamp they are associated with. In many analytics databases, data is column-oriented to optimize reads. MarketStore, in contrast, must write continuously to thousands of symbols at the same time, and each update is small, yet the time-index data structure still allows reads to be optimized.

  • Utilizing filesystem holes to make the file size minimal with the market open time in mind
    Modern filesystems offer holes in files, meaning you can allocate a file’s bytes in advance as zero bytes without actually occupying the underlying disk. In other words, it is a virtual allocation. Because MarketStore knows in many cases how big each file is going to be, it allocates the file first and then fills the holes, which optimizes storage usage.

  • Time aggregation for different time resolutions with a single source of truth
    The most common operation in financial timeseries databases is time aggregation. Given 1 minute bar data, you may want to downsample to 1 hour or 1 day bars. This is done via the trigger plugin system, so pre-aggregated, downsampled data is produced whenever the underlying level is written. The pre-aggregated data gives higher query performance for many use cases.

  • Custom plugins to support different asset classes and markets
    Different upstream data sources, as well as different types of asset classes, have different requirements when it comes to data integration. MarketStore’s core engine handles the work common to financial market data, and it offers custom plugin support to address each different use case.

  • Timezone support
    Timezones are one of the biggest headaches in time-oriented applications. MarketStore supports a system-wide timezone to accommodate different markets. For example, at Alpaca, US equity market data uses Eastern time and aggregates 1 day bars at the Eastern time boundary, while crypto and FX data are aggregated at UTC.
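
Three of the ideas in this list (addressing rows by timestamp, pre-allocating files with holes, and refreshing an aggregate on write) can be sketched together in a few lines of Python. This is only an illustration under assumptions, not MarketStore’s actual on-disk format or plugin API: the 48-byte record layout, the one-file-per-day sizing, and the names DayFile and write_with_trigger are all hypothetical, and the sample prices are made up.

```python
import os
import struct
import tempfile

# Hypothetical fixed-width record: epoch second (int64) followed by
# open, high, low, close, volume (float64 each). 48 bytes, no padding.
RECORD = struct.Struct("<q5d")
DAY = 24 * 60 * 60


class DayFile:
    """One day of bars at a fixed interval, addressed by timestamp."""

    def __init__(self, path, day_start, interval):
        self.path, self.day_start, self.interval = path, day_start, interval
        self.slots = DAY // interval
        # Reserve the full logical size up front. On filesystems that support
        # sparse files, untouched regions are holes and use no disk blocks.
        with open(path, "wb") as f:
            f.truncate(self.slots * RECORD.size)

    def offset(self, ts):
        """Byte offset of the slot whose time bucket contains `ts`."""
        slot = (ts - self.day_start) // self.interval
        if not 0 <= slot < self.slots:
            raise ValueError("timestamp outside this file's day")
        return slot * RECORD.size

    def write(self, ts, o, h, l, c, v):
        with open(self.path, "r+b") as f:
            f.seek(self.offset(ts))
            f.write(RECORD.pack(ts, o, h, l, c, v))

    def read(self, ts):
        with open(self.path, "rb") as f:
            f.seek(self.offset(ts))
            rec = RECORD.unpack(f.read(RECORD.size))
        # Holes read back as zero bytes, so epoch 0 marks an empty slot.
        return rec if rec[0] != 0 else None


def write_with_trigger(minute_file, hour_file, ts, o, h, l, c, v):
    """Write a fine-grained bar, then refresh the coarser bar containing it."""
    minute_file.write(ts, o, h, l, c, v)
    start = ts - (ts - hour_file.day_start) % hour_file.interval
    steps = hour_file.interval // minute_file.interval
    bars = [minute_file.read(start + i * minute_file.interval) for i in range(steps)]
    bars = [b for b in bars if b is not None]
    hour_file.write(
        start,
        bars[0][1],                # open of the first bar in the bucket
        max(b[2] for b in bars),   # highest high
        min(b[3] for b in bars),   # lowest low
        bars[-1][4],               # close of the last bar in the bucket
        sum(b[5] for b in bars),   # total volume
    )


if __name__ == "__main__":
    tmp = tempfile.mkdtemp()
    day0 = 1518393600  # 2018-02-12 00:00:00 UTC
    minutes = DayFile(os.path.join(tmp, "1Min.bin"), day0, 60)
    hours = DayFile(os.path.join(tmp, "1H.bin"), day0, 3600)
    t = day0 + 9 * 3600 + 30 * 60  # 09:30 UTC
    write_with_trigger(minutes, hours, t, 10.0, 11.0, 9.5, 10.5, 100.0)
    write_with_trigger(minutes, hours, t + 60, 10.5, 12.0, 10.0, 11.5, 50.0)
    print(os.path.getsize(minutes.path))  # logical size, not disk usage
    print(hours.read(t))
```

Because every slot has the same width, a read or write needs only one seek computed from the timestamp, and the coarser file is always derived from the finer one, which is what keeps a single source of truth. A real engine would batch the reads instead of reopening the file for every slot.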

Client Support

There are native Go and Python clients, and both perform very well. The Python client easily converts the server response into a DataFrame, and you will notice almost no difference from reading the data from local disk, with the bonus of higher scalability.

Data Ingestion

One of the common challenges in database software is the data import layer, and our financial data system is no exception. Different asset classes have different characteristics, and each upstream data provider offers a different data format. To address this, MarketStore has a plugin system for the data ingestion layer. It comes with default plugins for ingesting data from both the GDAX API and Slait, another open source product of ours. We will discuss Slait in another post. With the GDAX plugin, you can start consuming and storing Bitcoin, Ethereum, Bitcoin Cash, and Litecoin data from the moment you start MarketStore.

Since it is a plugin architecture, you can write your own data ingestion plugin for your own needs. The Go and Python clients also support writing data remotely.
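
To make the idea of pluggable ingestion concrete, here is a minimal sketch of how data sources might be registered and drained into a store. It is a generic pattern written in Python for brevity, not MarketStore’s plugin interface: the registry, the register decorator, the ingest function, and the sample records are all hypothetical.

```python
from typing import Callable, Dict, Iterable, List, Tuple

# (symbol, epoch second, price): a deliberately tiny record for this sketch.
Tick = Tuple[str, int, float]
Fetcher = Callable[[], Iterable[Tick]]

# Hypothetical registry; a real server would load its plugins differently.
FETCHERS: Dict[str, Fetcher] = {}


def register(name: str) -> Callable[[Fetcher], Fetcher]:
    """Decorator that adds a data source to the registry under `name`."""
    def wrap(fn: Fetcher) -> Fetcher:
        FETCHERS[name] = fn
        return fn
    return wrap


@register("static-demo")
def static_demo() -> Iterable[Tick]:
    # Stand-in for a call to an upstream provider; the values are made up.
    yield ("BTC", 1518393600, 100.0)
    yield ("ETH", 1518393600, 10.0)


def ingest(enabled: List[str], write: Callable[[Tick], None]) -> int:
    """Pull from each enabled source and hand every record to `write`."""
    count = 0
    for name in enabled:
        for tick in FETCHERS[name]():
            write(tick)
            count += 1
    return count


if __name__ == "__main__":
    stored: List[Tick] = []
    print(ingest(["static-demo"], stored.append), "records ingested")
```

The point of the pattern is that the core only knows the record shape and the write path, so supporting a new provider means adding one more registered source rather than changing the engine.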

Availability

MarketStore is available today as an open source project on GitHub and is production ready. Given the ease of the build system in Go, it is pretty straightforward to build it on your own. A Docker container is also built for every release for easy access.

Inside Alpaca, from our trading management system to deep learning modeling to charting, almost all applications use MarketStore as the backend both in development and production.

We hope that open sourcing MarketStore helps more people working in similar domains and contributes to the community. Alpaca’s mission has always been to give individuals more technological power in the financial markets, and that doesn’t mean only through our end products. By providing technology this way, we hope to help everyone.

Try it now!

Try it today and let us know what works for you and what does not. We are also more than happy to receive help with feature development and documentation. There will be more posts here about the details of using MarketStore.

Please follow Alpaca and Automation Generation for fresh posts on financial markets, algorithmic trading, and technology.

You can find us @AlpacaHQ if you use Twitter.

If you’re a hacker and can create something cool that works in the financial market, please check out our project “Commission Free Stock Trading API”, where we provide a simple REST trading API and real-time market data for free.

Brokerage services are provided by Alpaca Securities LLC (alpaca.markets), member FINRA/SIPC. Alpaca Securities LLC is a wholly-owned subsidiary of AlpacaDB, Inc.