How I Power My Trading Bot with TimescaleDB

My first experience with crypto wasn’t under very good circumstances: a friend who takes care of several servers at his job was infected with ransomware – and this malware was demanding he pay the ransom amount in a cryptocurrency called Monero (XMR).

After this not-so-friendly introduction, I started to study how the technology behind cryptocurrencies works, and I fell in love with it. I was already interested in the stock market, so I joined the familiar (stock market) with the novel (crypto). To test the knowledge I’d learned from my stock market books, I started creating a simple Moving Average Convergence Divergence (MACD) crossover bot.

This worked for a while, but I quickly realized that I should - and could - make the bot a lot better.

Now, the project that I started as a hobby has a capital management system, a combination of technical indicators, and sentiment analysis powered by machine learning. Between 10 March 2020 and 10 August 2020, my bot resulted in a success rate of 63.1%, profit factor of 1.74, and cumulative gross result of approximately 588% (you can see a copy of all of my trades during this period in this Google Sheet report).

About my project

I needed a bot that gave me a high-performance, scalable way to calculate technical indicators and process sentiment data in real-time.

To do everything I need in terms of my technical indicators calculation, I collect candlestick chart data and market depth via an always-up websocket connection that tracks every Bitcoin market on the Binance exchange (~215 in total, 182 being tradeable, at this moment).

The machine learning sentiment analysis started as a simple experiment to see if external news affected the market. For example: if a famous person in the crypto ecosystem tweeted that a big exchange was hacked, the price will probably fall and affect the whole market. Likewise, very good news should impact the price in a positive way. I calculated sentiment analysis scores in real-time, as soon as new data was ingested from sources like Twitter, Reddit, RSS feeds, and etc. Then, using these scores, I could determine market conditions at the moment.

Now, I combine these two components with a weighted average, 60% technical indicators and 40% sentiment analysis.

Current trading bot dashboard where I track all my ongoing trades and results; In this particular case, filtered for the period of 10-March-2020 to 10-August-2020.

Quick breakdown of my results and success rates week-over-week; In this particular case, filtered for the period of 10-March-2020 to 10-August-2020.

Using TimescaleDB

At the beginning, I tried to save the collected data in simple files, but quickly realized that wasn’t a good way to store and process this data. I started looking for an alternative: a performant database.

I went through several databases, and all of them always lacked something I wound up needing to continue my project. I tried MongoDB, InfluxDB, and Druid, but none of them 100% met my needs.

Of the databases I tried, InfluxDB was a good option; however, every query that I tried to run was painful, due to their own query language (InfluxQL). As soon as my series started to grow exponentially to higher levels, the server didn't have enough memory to handle them all in real-time. This is because the current InfluxDB TSM storage engine requires more and more allocated memory for each series. I have a large number of unique metrics, so the process ran out of available memory quickly,

I handle somewhat large amounts of data every day, especially on days with many market movements. On average, I’m ingesting around 20k records/market, or 3.6 million total records, per day (20k*182 markets) and even with this huge amount of data, my query response time is in the milliseconds.

This is where TimescaleDB started to shine for me. It gave me fast real-time aggregations, built-in time-series functions, high ingestion rates – and it didn’t require elevated memory usage to do all of this.

In addition to this raw market data, a common use case for me is to analyze the data in different time frames (e.g., 1min, 5min, 1hr, etc.) I maintain these records in a pre-computed aggregate to increase my query performance and allow me to make faster decisions about whether or not to enter a position.

For example, here’s a simple query that I use a lot to follow the performance of my trades on a daily or weekly basis (daily in this case):

SELECT time_group, total_trades, positive_trades, 
	negative_trades,
	ROUND(100 * (positive_trades / total_trades), 2) AS success_rate, profit as gross_profit,
    ROUND((profit - (total_trades * 0.15)), 2) AS net_profit
FROM (
	SELECT time_bucket('1 day', buy_at::TIMESTAMP)::DATE AS time_group, COUNT(*) AS total_trades, 
		SUM(CASE WHEN profit >  0 THEN 1 ELSE 0 END)::NUMERIC AS positive_trades, 
		SUM(CASE WHEN profit <= 0 THEN 1 ELSE 0 END)::NUMERIC AS negative_trades,
		ROUND(SUM(profit), 2) AS profit 
	FROM trade
	GROUP BY time_group ORDER BY time_group 
) T ORDER BY time_group

And, I often use this function to measure market volatility, decomposing the range of a market pair in a period:

CREATE OR REPLACE FUNCTION tr(_symbol TEXT, _till INTERVAL)
	RETURNS TABLE(date TIMESTAMP WITHOUT TIME ZONE, result NUMERIC(9,8), percent NUMERIC(9,8)) LANGUAGE plpgsql AS $$ DECLARE BEGIN

RETURN QUERY 
	WITH candlestick AS ( SELECT * FROM candlestick c WHERE c.symbol = _symbol AND c.time > NOW() - _till )
	SELECT d.time, (GREATEST(a, b, c)) :: NUMERIC(9,8) as result, (GREATEST(a, b, c) / d.close) :: NUMERIC(9,8) as percent FROM ( 
		SELECT today.time, today.close, today.high - today.low as a,
      		COALESCE(ABS(today.high - yesterday.close), 0) b,
      		COALESCE(ABS(today.low - yesterday.close), 0) c FROM candlestick today
      	LEFT JOIN LATERAL ( 
			  SELECT yesterday.close FROM candlestick yesterday WHERE yesterday.time < today.time ORDER BY yesterday.time DESC LIMIT 1 
		) yesterday ON TRUE
    WHERE today.time > NOW() - _till) d;
END; $$;

CREATE OR REPLACE FUNCTION atr(_interval INT, _symbol TEXT, _till INTERVAL)
	RETURNS TABLE(date TIMESTAMP WITHOUT TIME ZONE, result NUMERIC(9,8), percent NUMERIC(9,8)) LANGUAGE plpgsql AS $$ DECLARE BEGIN
	
RETURN QUERY
	WITH true_range AS ( SELECT * FROM tr(_symbol, _till) )
	SELECT tr.date, avg.sma result, avg.sma_percent percent FROM true_range tr
	INNER JOIN LATERAL ( SELECT avg(lat.result) sma, avg(lat.percent) sma_percent
		FROM (
			   SELECT * FROM true_range inr
			   WHERE inr.date <= tr.date
			   ORDER BY inr.date DESC
			   LIMIT _interval
			 ) lat
		) avg ON TRUE
  WHERE tr.date > NOW() - _till ORDER BY tr.date;
END; $$;

SELECT * FROM atr(14, 'BNBBTC', '4 HOURS') ORDER BY date

My current deployment & future plans

My current architecture.

To develop my bot and all its capabilities, I used Node.js as my main programming language and various libraries: cote to communicate between all my modules without overengineering, tensorFlow to train and deploy all my machine learning models, and tulind for technical indicator calculation, as well as various others.

I modified some to meet my needs and created some from scratch, including a candlestick recognition pattern, a level calculator for support/resistance, and Fibonacci retracement.

Today, I have a total of 55 markets (which are re-evaluated every month, based on trade simulation performance) that trade simultaneously 24/7; when all my strategy conditions are met, a trade is automatically opened. The bot respects my capital management system, which is basically to limit myself to 10 opened positions and only using 10% of the available capital at a given time. To keep track of the results of an open trade, I use dynamic Trailing Stop Loss and Trailing Take Profit.

The process of re-evaluating a market requires a second instance of my bot that runs in the background and uses my main strategy to simulate trades in all Bitcoin markets. When it detects that a market is doing well, based on the metrics I track, that market enters the main bot instance and starts live trading. The same applies to those that are performing poorly; as soon as the main instance of my bot detects things are going badly, the market is removed from the main instance and the second instance begins tracking it. If it improves, it's added back in.

As every developer likely knows all too well, the process of building a software is to always improve it. Right now, I’m trying to improve my capital management system using Kelly Criterion.

Why TimescaleDB?

For my use case, I’ve found TimescaleDB is a powerful and solid choice: it’s fast with reliable ingest rates, efficiently stores and compresses a huge dataset in a way that’s manageable and cost-effective, and gives me real-time aggregation functionality.

The TimescaleDB website, core documentation, and this blog post about managing and processing huge time-series datasets is all pretty easy to understand and follow – and the TimescaleDB team is responsive and helpful (and they always show up in community discussions, like my Reddit AMA).

It’s been easy and straightforward to scale, without adding any new technologies to the stack. And, as an SQL user, TimescaleDB adds very little maintenance overhead, especially compared to learning or maintaining a new database or language.

This post is slightly modified from original post on the TimescaleDB blog (read it here).