Gaëtan Rickter


Generating Alpha with Vectorspace AI NLP/NLU Correlation Matrix Datasets: Equities vs The Periodic…

I finally beat the S&P 500 by 10%. This might not sound like much but when we’re dealing with large amounts of capital and with good liquidity, the profits are pretty sweet for a hedge fund. More aggressive approaches have resulted in much higher returns.

It all started after I read a paper by Gur Huberman titled “Contagious Speculation and a Cure for Cancer: A Non-Event that Made Stock Prices Soar,” (with Tomer Regev, Journal of Finance, February 2001, Vol. 56, №1, pp. 387–396). The research described an event that occurred in 1998 with a public company called EntreMed (ENMD was the symbol at the time):

“A Sunday New York Times article on a potential development of new cancer-curing drugs caused EntreMed’s stock price to rise from 12.063 at the Friday close, to open at 85 and close near 52 on Monday. It closed above 30 in the three following weeks. The enthusiasm spilled over to other biotechnology stocks. The potential breakthrough in cancer research already had been reported, however, in the journal Nature, and in various popular newspapers (including the Times) more than five months earlier. Thus, enthusiastic public attention induced a permanent rise in share prices, even though no genuinely new information had been presented.”

Among the many insightful observations made by the researchers, one stood out in the conclusion:

“[Price] movements may be concentrated in stocks that have some things in common, but these need not be economic fundamentals.”

I wondered if it was possible to cluster stocks based on something other than what’s usually used. I started digging around for datasets and after a few weeks I found one, designed by Vectorspace AI, that included scores describing the strength of “known and hidden relationships” between stocks and elements of the Periodic Table.

With my background in computational genomics, this reminded me of how little is known about the relationships between genes and their cell signaling networks. However, when we analyze the data, we begin to see new connections and correlations we may not have been able to predict previously:

Expression patterns of selected genes involved in signaling pathways for cell plasticity, growth and differentiation

Equities, like genes, are influenced via a massive network of strong and weak hidden relationships shared between one another. Some of these influences and relationships can be predicted.

One of my goals was to create long and short clusters of stocks or “basket clusters” I could use to hedge or just profit from. This would require an unsupervised machine learning approach to create clusters of stocks that would share strong and weak relationships with one another. These clusters would double as “baskets” of stocks my firm could trade.

I started by downloading the dataset here. The dataset is based on relationships between elements in the periodic table and public companies. In the future I’d like to work with cryptocurrencies and create baskets similar to what these guys are doing here but that’s a future project.

Then, using Python and a subset of the usual machine learning suspects (scikit-learn, numpy, pandas, matplotlib and seaborn), I set out to understand the shape of the dataset I was dealing with. (To do some of this I looked to a Kaggle Kernel titled “Principal Component Analysis with KMeans visuals”.)
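A first look at a dataset like this might go roughly as follows. This is only a sketch: the real file is not reproduced here, so the DataFrame below is a random stand-in, and its layout (one row per ticker symbol, one column per periodic-table concept, values as relationship scores) is my assumption about the dataset, as are the symbol names.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the downloaded dataset: rows are ticker symbols,
# columns are periodic-table concepts, values are relationship scores in [0, 1).
rng = np.random.default_rng(0)
symbols = [f"STK{i:02d}" for i in range(40)]
concepts = ["Cobalt", "Copper", "Gallium", "Graphene", "Lithium", "Nickel"]
df = pd.DataFrame(
    rng.random((len(symbols), len(concepts))), index=symbols, columns=concepts
)

# With a real CSV file this step would be a pd.read_csv(...) call instead.
print(df.shape)           # (number of stocks, number of concepts)
first_rows = df.head()    # the first 5 rows
print(first_rows)
```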

Output: a quick view of the first 5 rows:

A Pearson Correlation of concept features. In this case, minerals and elements from the periodic table:
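A correlation matrix of this kind can be computed with pandas alone. The sketch below uses random scores as a stand-in for the real data, keeps 16 samples to mirror the visualization example, and assumes the concepts are stored as columns.

```python
import numpy as np
import pandas as pd

# Hypothetical scores: 16 samples (rows) by 4 concept columns.
rng = np.random.default_rng(1)
concepts = ["Cobalt", "Copper", "Gallium", "Graphene"]
scores = pd.DataFrame(rng.random((16, len(concepts))), columns=concepts)

# Pairwise Pearson correlation between the concept columns.
corr = scores.corr(method="pearson")
print(corr.round(2))
```

The resulting square matrix is what a heatmap function such as seaborn’s heatmap would take as input.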

Output: (ran against the first 16 samples for this visualization example). It’s also interesting to see how elements in the periodic table correlate to public companies. At some point, I’d like to use the data to predict breakthroughs a company might make based on their correlation to interesting elements or materials.

Measuring ‘Explained Variance’ & Principal Component Analysis (PCA)

Explained variance = (total variance - residual variance). The Explained Variance measure can guide how many PCA projection components are worth looking at; it is also nicely described in Sebastian Raschka’s article on Principal Component Analysis:
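One common way to get the individual and cumulative explained variance is from the eigenvalues of the covariance matrix of the standardized data. The sketch below does that on a random matrix standing in for the score data; the variable names are my own.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical score matrix: 40 stocks by 6 concept features.
rng = np.random.default_rng(2)
X = rng.random((40, 6))
X_std = StandardScaler().fit_transform(X)  # zero mean, unit variance per column

# Eigenvalues of the covariance matrix are the variances along each component.
# eigvalsh returns them in ascending order, so reverse to get largest first.
eig_vals = np.linalg.eigvalsh(np.cov(X_std.T))[::-1]

var_exp = eig_vals / eig_vals.sum() * 100  # individual explained variance (%)
cum_var_exp = np.cumsum(var_exp)           # cumulative explained variance (%)

for i, (v, c) in enumerate(zip(var_exp, cum_var_exp), start=1):
    print(f"PC{i}: {v:5.1f}% individual, {c:5.1f}% cumulative")
```

Plotting var_exp as bars and cum_var_exp as a step line gives the usual explained-variance chart.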


From this chart we can see that a large amount of variance comes from the first 85% of the predicted Principal Components. That’s a high number, so let’s start at the low end and model just a handful of Principal Components. More information on choosing a reasonable number of Principal Components can be found here.

Using scikit-learn’s PCA module, let’s set n_components = 9. The second line of the code calls the “fit_transform” method, which fits the PCA model with the standardized data X_std and applies the dimensionality reduction to this dataset.
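The two lines described here might look like the following sketch. X_std is built from random numbers as a stand-in for the standardized dataset, and the result name x_9d is my own choice.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical data: 60 stocks by 20 concept features, standardized.
rng = np.random.default_rng(3)
X_std = StandardScaler().fit_transform(rng.random((60, 20)))

pca = PCA(n_components=9)
x_9d = pca.fit_transform(X_std)  # fit the model, then project onto 9 components

print(x_9d.shape)
print(pca.explained_variance_ratio_.sum())  # share of variance kept by 9 components
```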


We don’t really observe even faint outlines of clusters here, so we should likely continue adjusting the n_components value until we see something we like. This relates to the “art” part of data science.

Now let’s try K-Means to see if we are able to visualize any distinct clusters in the next section.

K-Means Clustering

A simple K-Means will now be applied using the PCA projection data.

Using scikit-learn’s KMeans() call and the “fit_predict” method, we compute cluster centers and predict cluster indices for the first and third PCA projections (to see if we can observe any appreciable clusters). We then define our own color scheme and plot the scatter diagram as follows:
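A sketch of this step, under a few assumptions of mine: the data is a random stand-in, the model is fit on all 9 projections with only the first and third used as plot axes, n_clusters is set to 3, and the colors are arbitrary. The plotting helper imports matplotlib only when it is called.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical standardized data reduced to 9 PCA projections.
rng = np.random.default_rng(4)
X_std = StandardScaler().fit_transform(rng.random((60, 20)))
x_9d = PCA(n_components=9).fit_transform(X_std)

# Compute cluster centers and predict a cluster index for every sample.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(x_9d)

# Our own color scheme: one color per cluster index.
palette = {0: "red", 1: "green", 2: "blue"}
colors = [palette[int(label)] for label in labels]


def plot_clusters(points, point_colors):
    """Scatter the first PCA projection against the third, colored by cluster."""
    import matplotlib.pyplot as plt  # imported lazily; only needed for drawing

    fig, ax = plt.subplots(figsize=(7, 7))
    ax.scatter(points[:, 0], points[:, 2], c=point_colors, alpha=0.5)
    ax.set_xlabel("PCA projection 1")
    ax.set_ylabel("PCA projection 3")
    return fig
```

Calling plot_clusters(x_9d, colors) draws the scatter diagram.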


This K-Means plot looks more promising: if our simple clustering assumption turns out to be right, we can observe 3 distinguishable clusters via this color scheme.

Of course, there are many different ways to cluster and visualize a dataset like this as shown here.

Using seaborn’s convenient pairplot function I can automatically plot all the features in the dataframe in pairwise manner. We can pairplot the first 3 projections against one another and visualize:
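As a rough sketch of the pairplot step: the first 3 projections go into a DataFrame together with a cluster label, and seaborn’s pairplot draws every pairwise combination. The data is random, and the column names PC1 to PC3 and cluster are my own. seaborn is imported inside the helper so the data preparation runs without a plotting library.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical standardized data reduced to 9 PCA projections.
rng = np.random.default_rng(5)
X_std = StandardScaler().fit_transform(rng.random((60, 20)))
x_9d = PCA(n_components=9).fit_transform(X_std)

# Keep the first 3 projections and attach the K-Means cluster index.
proj = pd.DataFrame(x_9d[:, :3], columns=["PC1", "PC2", "PC3"])
proj["cluster"] = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(x_9d)


def pairplot_projections(frame):
    """Plot PC1, PC2 and PC3 against one another, colored by cluster."""
    import seaborn as sns  # imported lazily; only needed for drawing

    return sns.pairplot(
        frame, vars=["PC1", "PC2", "PC3"], hue="cluster", diag_kind="kde"
    )
```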


Building Basket Clusters

How you fine-tune your clusters is up to you. There’s no silver bullet for this, and much of it depends on the context in which you’re operating: in this case, stocks, equities and financial markets defined by hidden relationships.

Once you’re satisfied with your clusters and have set scoring thresholds that control whether a stock qualifies for a cluster, you can extract the stocks for a given cluster and trade them as baskets or use the baskets as signals. What you can do with this kind of approach depends largely on your creativity and on how well you can optimize the returns of each cluster, for example by using deep learning variants to choose which concepts to cluster on, or by adding data points such as the size of a company’s short interest or float (available shares on the open market).
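To make the thresholding idea concrete, here is a minimal sketch. The qualification rule (mean score across the concept columns must reach a threshold), the helper name build_baskets, the symbols and the random scores and cluster labels are all hypothetical; a real rule could be quite different.

```python
import numpy as np
import pandas as pd

# Hypothetical inputs: relationship scores per stock and a cluster index per stock.
rng = np.random.default_rng(6)
symbols = [f"STK{i:02d}" for i in range(30)]
concepts = ["Cobalt", "Copper", "Gallium", "Graphene"]
scores = pd.DataFrame(rng.random((30, 4)), index=symbols, columns=concepts)
labels = pd.Series(rng.integers(0, 3, size=30), index=symbols, name="cluster")


def build_baskets(scores, labels, threshold):
    """Group symbols by cluster, keeping only those whose mean score
    across the concept columns is at or above the threshold."""
    strength = scores.mean(axis=1)
    qualified = strength[strength >= threshold].index
    return {
        int(cluster): sorted(members.index.intersection(qualified))
        for cluster, members in labels.groupby(labels)
    }


baskets = build_baskets(scores, labels, threshold=0.5)
for cluster, members in baskets.items():
    print(cluster, members)
```

Raising the threshold shrinks each basket to the stocks with the strongest scores; lowering it widens the basket.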

You might notice a few interesting traits in the way these clusters trade as baskets. Sometimes there’s divergence from the S&P or the general market. This can offer opportunities for what is essentially ‘information arbitrage’. Some clusters can correlate to Google search trends.

It might be interesting to see clusters related to materials and their supply chain as mentioned in this article: “Zooming in on 10 materials and their supply chains”. Using the dataset, I only operated on the feature column labels: ‘Cobalt’, ‘Copper’, ‘Gallium’ and ‘Graphene’ just to see if I might uncover any interesting hidden connections between public companies working in this area or exposed to risk in this area. These baskets are also compared against the returns of the S&P (SPY).

By using historical price data, which is readily available at outlets like Quantopian, Numerai, Quandl or Yahoo Finance, you can then aggregate price data to generate projected returns visualized using HighCharts:
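The aggregation step could be sketched like this. The prices are synthetic random walks, so the numbers say nothing about real performance; the tickers AAA, BBB and CCC are made up, and the equal weighting with daily rebalancing is my assumption, not necessarily how the baskets above were traded.

```python
import numpy as np
import pandas as pd

# Synthetic daily prices for three hypothetical basket members and a benchmark.
rng = np.random.default_rng(7)
dates = pd.bdate_range("2018-01-01", periods=250)
tickers = ["AAA", "BBB", "CCC", "SPY"]
daily = rng.normal(0.0004, 0.01, size=(len(dates), len(tickers)))
prices = pd.DataFrame(
    100 * np.cumprod(1 + daily, axis=0), index=dates, columns=tickers
)


def cumulative_return(price_frame):
    """Equal-weighted cumulative return: average the members' daily
    returns, then compound them over time."""
    daily_ret = price_frame.pct_change().dropna().mean(axis=1)
    return (1 + daily_ret).cumprod() - 1


basket_curve = cumulative_return(prices[["AAA", "BBB", "CCC"]])
spy_curve = cumulative_return(prices[["SPY"]])
excess = basket_curve.iloc[-1] - spy_curve.iloc[-1]
print(f"basket minus benchmark over the period: {excess:+.2%}")
```

The two curves can be handed to any charting library to compare the basket against the benchmark over time.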

The returns I gained from the cluster above beat the S&P by a nice margin: approximately an extra 10% annually over the S&P. I’ve seen more aggressive approaches net close to 70% annually. I have to admit that I do a few other things that I have to keep black-boxed due to the nature of my work. Still, from what I’ve observed so far, exploring or wrapping new quantitative models around this approach could turn out to be well worth it, and the only downside is that you end up with a different kind of signal you could pipe into another system.

Generating short basket clusters could be more profitable than long basket clusters. That approach needs its own article, ideally before the next Black Swan event.

Getting ahead of parasitic, symbiotic and sympathetic relationships between public companies that share known and hidden relationships can be fun and profitable if you’re into machine learning. In the end, one’s ability to profit seems to be all about how clever they can get in coming up with powerful combinations of feature labels, or “concepts”, when generating these kinds of datasets.

My next iteration on this kind of model should probably include a separate algorithm for auto-generating feature combinations or unique lists, perhaps based on near real-time events that might affect groups of stocks with hidden relationships that only humans, outfitted with unsupervised machine learning algorithms, can predict.

  • Gaëtan R., — Financial Data Consultant — Geneva Switzerland
