Data Reduction in Preparation for Lightweight Machine Learning: Applied in Foreign Exchange Trading

Written by paul-arssov | Published 2020/02/16
Tech Story Tags: fintech | forex | forex-trading | ml | ai | crypto-trading | cryptocurrencies | ml-top-story

TLDR Foreign Exchange (ForEx) is a global financial market composed of 4 exchanges located worldwide - New York, London, Sydney, and Tokyo. There are around 125 different trading items in the ForEx market in total. The main beneficiaries and drivers of the current state of the art in data science and machine learning are the cloud providers, who sell GPU/TPU hours and/or bill per 'container' and are not interested in changing the existing status quo. I am proposing, and then implementing, the following ways of data reduction.

1. Introduction

The article is intended for both data scientists and ordinary software developers.
In general, data scientists work with statistical methods of processing data, use the Python programming language, and run tasks on cloud CPUs together with GPUs/TPUs (Graphics Processing Units / Tensor Processing Units).
In contrast, I am a programmer: I emphasize creating non-statistics-based algorithms, use the C programming language, and run tasks on cloud CPUs only.
The article is not theoretical but puts all of the propositions into practice - implemented in the for-ex volatility charts product.

2. What is a Foreign Exchange

Foreign Exchange (ForEx) is a global financial market composed of 4 exchanges located worldwide - New York, London, Sydney, and Tokyo. The global positions of these exchanges allow almost 24-hour trading during the week.
Trading starts Monday morning in Sydney and ends Friday afternoon in New York.
The market has several distinctive categories:
  • currencies - for ex. USD_CAD, AUD_JPY ...,
  • metals - for ex. Gold, Silver, Platinum to USD, AUD...
  • indexes - for ex. UK 100, US Nasdaq 100 (basket of stocks)
  • commodities - for ex. natural gas, oil, corn, to USD
  • bonds - for ex. US 10Y T-Note
There are around 125 different trading items in the ForEx market in total.

3. Current State of Machine Learning

The state of the art in machine learning today uses statistical methods and processes gigabytes of data - the more data, the better the models are supposed to work.
The ML programs use multiple cloud CPUs together with GPUs/TPUs, and the development of models requires multiple iterations and runs which take anywhere from minutes to hours to complete.
The ML programs favor the use of a big specialized library such as TensorFlow or MXNet; the resulting Python programs have only a few, or even a single, important line of code.
The main beneficiaries and drivers of the current state of the art in data science and machine learning are the cloud providers, who sell GPU/TPU hours and/or bill per 'container' and are not interested in changing the existing status quo.

4. Proposed ways of Data Reduction

Instead of the data accumulation favored by the current state of ML, I am performing data reduction.
In the context of ForEx, I am proposing and then implementing the following ways of data reduction:
  • normalization - all items expressed in percentage (%)
  • sample time - choosing a sample period of 1 minute
  • precision - using 2 digits after the decimal point (0.00)
  • data variable type - using a decimal rather than a floating-point type
  • relevant items - from all of the items, choosing the ones with the highest volatility
  • relevant time period - selecting a period from 6 hours to 48 hours
A single share in the category of currencies, for ex. USD_CAD, is around $1.30. A single share in the category of commodities, for ex. oil_usd, is around $65. A single share in the category of metals, for ex. XAU_USD (gold_usd), is around $1,500. And a single share in the category of indexes, for ex. US Nasdaq 100, is around $9,000.
Normalization allows expressing items whose values differ by several orders of magnitude in a uniform way - in percentages (%).
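As a minimal sketch of the idea (the function and the sample data below are illustrative, not taken from the volatility charts product), normalization can express each price as a percentage move relative to the first price of the observed period, which puts a ~$1.30 currency pair and a ~$9,000 index on the same scale:

    #include <stdio.h>

    /* Illustrative sketch: express a price series as percentage moves
     * relative to the first sample, so items priced at ~$1.30 and ~$9,000
     * end up on the same 0-to-a-few-% scale. */
    static void normalize_to_percent(const double *price, double *pct, int n)
    {
        double base = price[0];
        for (int i = 0; i < n; i++)
            pct[i] = (price[i] - base) / base * 100.0;  /* percent change */
    }

    int main(void)
    {
        double usd_cad[] = {1.3000, 1.3010, 1.3045};   /* ~ $1.30 per unit  */
        double nas100[]  = {9000.0, 9030.0, 9100.0};   /* ~ $9,000 per unit */
        double a[3], b[3];

        normalize_to_percent(usd_cad, a, 3);
        normalize_to_percent(nas100, b, 3);

        for (int i = 0; i < 3; i++)
            printf("t=%d  USD_CAD %.2f%%   NAS100 %.2f%%\n", i, a[i], b[i]);
        return 0;
    }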
A data collection system can sample the pricing of the for-ex items anywhere from every 1 second and up. But is such precision necessary? Looking at the image above, we see that the rise of NATGAS_USD, shown with a blue line from point 1 to point 2, takes a span of 38 minutes. After considering this and other such examples, a sample time of 1 minute provides sufficient detail while reducing the data processing burden.
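A minimal sketch of such downsampling, assuming raw per-second ticks with Unix timestamps (the structure and function names are hypothetical), keeps only the last price seen in each minute:

    #include <stdio.h>
    #include <time.h>

    /* One raw tick: Unix timestamp in seconds plus a price. */
    struct tick { time_t ts; double price; };

    /* Keep only the last tick of each minute.
     * Returns the number of 1-minute samples written to out[]. */
    static int downsample_1min(const struct tick *in, int n, struct tick *out)
    {
        int m = 0;
        for (int i = 0; i < n; i++) {
            time_t minute = in[i].ts / 60;               /* minute bucket          */
            if (m > 0 && out[m - 1].ts / 60 == minute)
                out[m - 1] = in[i];                      /* same minute: overwrite */
            else
                out[m++] = in[i];                        /* new minute: append     */
        }
        return m;
    }

    int main(void)
    {
        struct tick raw[] = {
            {1581811200, 1.8050}, {1581811230, 1.8061},  /* same minute */
            {1581811260, 1.8072},                        /* next minute */
        };
        struct tick samples[3];
        int m = downsample_1min(raw, 3, samples);
        printf("%d raw ticks -> %d one-minute samples\n", 3, m);
        return 0;
    }

At per-second collection this is roughly a 60x reduction in samples before any further processing.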
Looking at the vertical axis, we see that the range of normalized volatility varies from 0 to around 3.5%. With such a span it is acceptable to choose a precision of 2 digits after the decimal point, which results in a relative precision error of about 0.01/3.5 ≈ 0.3%.
The pricing of items always comes as a number with multiple digits after the decimal point. In a program it is normally stored in a floating-point variable.
A test program in 'C' with two loops performing multiplication and division - one loop using floating-point variables and one using decimal variables - showed the loop using decimal variables running around 2 times faster than the loop using floating-point variables.
After choosing a precision of 2 digits after the decimal point and multiplying the pricing by 100, we get a whole decimal number which can then be processed much faster.
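Standard C has no built-in decimal type, so one common way to realize this step is plain integer fixed-point arithmetic. The sketch below is illustrative (it is not the author's test program): it stores a 2-decimal percentage as a scaled integer so that later arithmetic stays in integer operations; the exact speedup will vary with hardware and compiler.

    #include <stdio.h>
    #include <math.h>

    /* Illustrative sketch: store a normalized percentage with 2 decimal
     * digits as a scaled integer ("hundredths of a percent"), so that
     * comparisons and arithmetic stay in fast integer operations. */
    typedef long fixed2_t;                    /* value * 100, e.g. 3.47% -> 347 */

    static fixed2_t to_fixed2(double pct)
    {
        return (fixed2_t)lround(pct * 100.0); /* round to 2 decimals */
    }

    static double from_fixed2(fixed2_t v)
    {
        return (double)v / 100.0;
    }

    int main(void)
    {
        fixed2_t a = to_fixed2(3.476);        /* stored as 348 */
        fixed2_t b = to_fixed2(0.014);        /* stored as 1   */

        fixed2_t diff = a - b;                /* pure integer subtraction */
        printf("difference = %.2f%%\n", from_fixed2(diff));
        return 0;
    }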
Probably the most important issue when trading on financial markets is choosing which item is worth placing a buy or sell on. Traders usually have favorite items, for ex. USD_GBP in currencies, GOLD_USD in metals, or OIL_USD in commodities. However, the favorite items may not offer the best opportunity for gains.
Volatile items, which move rapidly up and down, offer the highest opportunity for gains with the lowest leverage when used at the right time.
Selecting the top 3, 5, or 8 items in a category and discarding everything else significantly reduces the amount of data to be analyzed.
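A minimal sketch of that selection (the volatility figures, and the way volatility is measured, are made up for illustration) ranks items by their normalized volatility and keeps only the top N:

    #include <stdio.h>
    #include <stdlib.h>

    /* Illustrative sketch: rank items by normalized volatility span (in %)
     * and keep only the top N for further analysis. */
    struct item { char name[16]; double volatility_pct; };

    static int by_volatility_desc(const void *a, const void *b)
    {
        double va = ((const struct item *)a)->volatility_pct;
        double vb = ((const struct item *)b)->volatility_pct;
        return (va < vb) - (va > vb);          /* descending order */
    }

    int main(void)
    {
        struct item items[] = {
            {"NATGAS_USD", 3.50}, {"USD_CAD", 0.40},
            {"XAU_USD",    1.10}, {"NAS100",  2.20},
            {"OIL_USD",    1.80}, {"GBP_USD", 0.60},
        };
        int n = sizeof items / sizeof items[0];
        int keep = 3;                          /* top 3, 5, or 8 per the text */

        qsort(items, n, sizeof items[0], by_volatility_desc);
        for (int i = 0; i < keep && i < n; i++)
            printf("%s  %.2f%%\n", items[i].name, items[i].volatility_pct);
        return 0;
    }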
The last, but still important, way of data reduction is the time period of data to be considered. While behavior from the past week or month may define an overall up- or down-trend, the immediate upcoming price movement is influenced mostly by developments (media, news, reports) in the present moment or within the previous 6 to 48 hours.
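A quick back-of-the-envelope check, assuming 1-minute samples stored as 2-byte scaled integers and 8 selected items (these storage figures are assumptions, not from the article), shows why the reduced data fits in kilobytes rather than gigabytes:

    #include <stdio.h>

    /* Assumed figures for illustration: 1-minute samples, 2 bytes each
     * (a 16-bit scaled integer covers -327.68%..327.67%), 8 selected items,
     * window of 6 to 48 hours. */
    int main(void)
    {
        int items = 8;
        int bytes_per_sample = 2;
        for (int hours = 6; hours <= 48; hours *= 2) {
            int samples = hours * 60;          /* 1 sample per minute */
            int total = items * samples * bytes_per_sample;
            printf("%2d h window: %d samples/item, ~%d KB total\n",
                   hours, samples, total / 1024);
        }
        return 0;
    }

Even the full 48-hour window comes to roughly 45 KB in total under these assumptions, which a single CPU handles comfortably.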

5. Summary and next steps

While mainstream data science and machine learning use statistical methods, deal with gigabytes of data, and run on multiple cloud CPUs together with GPUs/TPUs, the proposed approach uses data reduction, deals with kilobytes of data, and uses CPUs only.
The proposed ways of data reduction are: normalization, choosing a sample time, choosing a precision, selecting a data variable type, focusing on relevant items for trading, and choosing a relevant time slice period.
The data prepared in this way is used in non-statistics-based, lightweight machine learning.
