In our last post, “How to Setup Bitcoin Historical Price Data for Algo Trading in Five Minutes”, we showed how to set up bitcoin price data in five minutes, and we got a lot of good feedback and contributions to the open source MarketStore.
Today, I want to show you how fast MarketStore is with the same data, so that you can see the performance benefit of using this awesome open source financial timeseries database.
Faster data means more backtesting and more machine learning training for your trading algorithm. We are seeing a number of successful machine learning-based trading algos in the space, but one of the key lessons we have learned is that data speed really matters. It is important not just for backtesting, but also for training AI-style algorithms, since training is by nature an iterative process.
This is another post that walks you through step by step. But TL;DR: it is really fast.
Last time, we showed how to set up historical daily bitcoin price data with MarketStore.
This time, we store all the minute-level historical prices using the same mechanism, the background worker, but with a slightly different configuration.
```yaml
root_directory: /project/data/mktsdb
listen_port: 5993
# timezone: "America/New_York"
log_level: info
queryable: true
stop_grace_period: 0
wal_rotate_interval: 5
enable_add: true
enable_remove: false
enable_last_known: false
triggers:
  - module: ondiskagg.so
    on: "*/1Min/OHLCV"
    config:
      destinations:
        - 5Min
        - 15Min
        - 1H
        - 1D
bgworkers:
  - module: gdaxfeeder.so
    name: GdaxFetcher
    config:
      query_start: "2016-01-01 00:00"
      base_timeframe: "1Min"
      symbols:
        - BTC
```
The difference from last time is that the background worker is configured to fetch 1-minute bar data instead of 1-day bar data, starting from 2016-01-01. That is almost 2.5 years of data, with more than 1 million bars. You will need to keep the server up and running for a day or so to backfill all the data, since GDAX’s historical price API does not allow you to fetch that many data points quickly.
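Some back-of-the-envelope arithmetic makes the scale concrete (a rough sketch; the ~300-candles-per-request cap is our reading of GDAX’s public API limits):

```python
# Rough sizing of the backfill, assuming ~2.5 years of 1-minute bars and
# GDAX's historical rates API returning at most ~300 candles per request.
bars = 2.5 * 365 * 24 * 60      # ~1.3 million 1-minute bars
requests = bars / 300           # ~4,400 rate-limited API requests
print(f"{bars:,.0f} bars -> {requests:,.0f} requests")
```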
Again, the data fetch worker carefully throttles its request rate in case the API returns a “Rate Limit” error, so you can just sleep on it.
An additional piece of configuration here is the “on-disk aggregate” trigger. What it does is aggregate the 1-minute bar data into lower resolutions (here, 5 minutes, 15 minutes, 1 hour, and 1 day).
In a typical trading strategy, you need to check longer time horizons to verify entry/exit signals even if you are working at the minute level, so this is a pretty important feature. You would need a pretty complicated LEFT JOIN query to achieve the same time-windowed aggregation in SQL, but with MarketStore, all you need is this small section of the configuration file.
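For intuition, here is what the trigger computes, expressed with pandas (a minimal sketch, assuming `df` is a 1-minute OHLCV DataFrame indexed by timestamp, like the query results shown later):

```python
import pandas as pd

def downsample(df: pd.DataFrame, rule: str = '5Min') -> pd.DataFrame:
    """Aggregate 1-minute OHLCV bars into a coarser timeframe, mirroring
    what the on-disk aggregate trigger maintains for each destination."""
    return df.resample(rule).agg({
        'open': 'first',
        'high': 'max',
        'low': 'min',
        'close': 'last',
        'volume': 'sum',
    })
```

MarketStore does this at write time, so the lower resolutions are already on disk when you query them.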
The machine we are using for this test is a typical Ubuntu virtual machine with 8 Intel(R) Xeon(R) E5-2673 v3 cores @ 2.40GHz, 32GB of RAM, and an SSD.
We are going to build a DataFrame object in Python that holds all the minute-level historical bitcoin price data since January 2016, fetched from the server. We compare MarketStore with PostgreSQL.
PostgreSQL is not really meant to be the data store for this type of data, but unfortunately, lots of people in this space use some sort of SQL database for this purpose since there has been no alternative. That’s why we built MarketStore.
The table definition of the bitcoin data on the PostgreSQL side looks like this.
```
btc=# \d prices
      Table "public.prices"
 Column |            Type             | Modifiers
--------+-----------------------------+-----------
 t      | timestamp without time zone |
 open   | double precision            |
 high   | double precision            |
 low    | double precision            |
 close  | double precision            |
 volume | double precision            |
```
The code looks like this.
```python
import pandas as pd
import pymarketstore as pymkts


# For PostgreSQL
def get_df_from_pg_one(conn, symbol):
    tbl = f'"{symbol}"'
    cur = conn.cursor()
    # fetch every bar for the symbol in time order
    cur.execute(f"SELECT t, open, high, low, close, volume FROM {tbl} ORDER BY t")
    times = []
    opens = []
    highs = []
    lows = []
    closes = []
    volumes = []
    for t, open, high, low, close, volume in cur.fetchall():
        times.append(t)
        opens.append(open)
        highs.append(high)
        lows.append(low)
        closes.append(close)
        volumes.append(volume)
    return pd.DataFrame(dict(
        open=opens,
        high=highs,
        low=lows,
        close=closes,
        volume=volumes,
    ), index=times)


# For MarketStore
def get_df_from_mkts_one(symbol):
    params = pymkts.Params(symbol, '1Min', 'OHLCV')
    return pymkts.Client('http://localhost:6000/rpc').query(params).first().df()
```
The input and output are basically the same: each function takes one symbol name, queries the remote server over the network, and returns one DataFrame. One strong benefit of MarketStore is that you don’t need much client code to get the DataFrame object, since the wire protocol is designed to hand back arrays of numbers efficiently.
```
%time df = example.get_df_from_pg_one(conn, 'prices')
CPU times: user 8.11 s, sys: 414 ms, total: 8.53 s
Wall time: 15.3 s
```
```
%time df = example.get_df_from_mkts_one('BTC')
CPU times: user 109 ms, sys: 69.5 ms, total: 192 ms
Wall time: 291 ms
```
Both results, of course, look the same, as shown below.
```
In [21]: df.head()
Out[21]:
                       open    high     low   close   volume
2016-01-01 00:00:00  430.35  430.39  430.35  430.39   0.0727
2016-01-01 00:01:00  430.38  430.40  430.38  430.40   0.9478
2016-01-01 00:02:00  430.40  430.40  430.40  430.40   1.6334
2016-01-01 00:03:00  430.39  430.39  430.36  430.36  12.5663
2016-01-01 00:04:00  430.39  430.39  430.39  430.39   1.9530

In [22]: df.shape
Out[22]: (1198274, 5)
```
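By the way, if you would rather check the equivalence programmatically than eyeball it, something along these lines works (`df_pg` and `df_mkts` are hypothetical names for the two results above; you may need to normalize time zones on the indexes first):

```python
import pandas as pd

# Raises an AssertionError if the two frames differ; check_exact=False
# tolerates tiny floating-point discrepancies between the two sources.
pd.testing.assert_frame_equal(df_pg, df_mkts, check_exact=False)
```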
A bitcoin was about $430 back then… Anyway, you can see the difference: 0.3 seconds vs. 15 seconds, which is about a 50x speedup. And remember, you may need to fetch the same data again and again for different kinds of backtesting and optimization, as well as for ML training.
You may also want to query not just bitcoin but other coins, stocks, and fiat currencies, since the entire database usually won’t fit into your main memory.
MarketStore can serve multiple symbols/timeframes in one query pretty efficiently, whereas with PostgreSQL and other relational databases you have to query one table at a time, so MarketStore also has a scalability advantage when you need multiple instruments.
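A minimal sketch of what that looks like from the Python client, assuming `pymkts.Params` accepts a list of symbols (the extra symbols here are illustrative and would need to be backfilled first):

```python
import pymarketstore as pymkts

# One network round-trip for several instruments at once
params = pymkts.Params(['BTC', 'ETH', 'LTC'], '1Min', 'OHLCV')
reply = pymkts.Client('http://localhost:6000/rpc').query(params)

# The reply carries one result set per symbol/timeframe; the first one
# is accessed the same way as in the single-symbol example above.
df_btc = reply.first().df()
```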
To give some sense of this power, here is the result of querying 7.7K US stock symbols, done as an internal test.
```
%time dfs = example.get_dfs_from_pg(symbols)
CPU times: user 52.9 s, sys: 2.33 s, total: 55.3 s
Wall time: 1min 26s
```

```
%time dfs = example.get_dfs_from_mkts(symbols)
CPU times: user 814 ms, sys: 313 ms, total: 1.13 s
Wall time: 6.24 s
```
Again, the amount of data is the same, and in this case each DataFrame is not as large as in the bitcoin case, yet the difference when expanding to a large number of instruments is significant (more than 10x). You can imagine how, in real life, these two factors (per-instrument speed and multi-instrument scalability) multiply to drive your data cost.
Alpaca has been using MarketStore in production for algo trading use cases, both for our proprietary customers and for our own purposes. It is actually amazing that this software is available to everyone for free, and we leverage this technology to help our algo trading customers (signup for early access is here).
If you’re a hacker and can create something cool that works in the financial market, please check out our project “Commission Free Stock Trading API”, where we provide a simple REST Trading API and real-time market data for free.
Brokerage services are provided by Alpaca Securities LLC (alpaca.markets), member FINRA/SIPC. Alpaca Securities LLC is a wholly-owned subsidiary of AlpacaDB, Inc.