In the early 2010s, Spread Networks invested about $300 million into a direct fiber-optic line between Chicago and New York. At the time, Chicago was the main hub for futures trading, while New York handled equities. Regular routes took detours that added dozens of kilometers, but the new cable cut latency from ~17 down to ~13 milliseconds. For high-frequency trading, those 3–4 ms were a massive edge. Whoever saw the data just a little earlier could place orders first and grab the liquidity. Leasing access to the line cost millions of dollars per year, but it paid off — milliseconds literally converted into money.

If firms were ready to spend hundreds of millions just to shave off 3–4 milliseconds, imagine the edge from an ML model that doesn't react (even if instantly), but actually predicts market movements. If such a model ever existed, the market would notice immediately. So why is it that current models fail at this task in finance, even though they work quite well in other domains, say, weather forecasting? Let's dig into that.

So why don't we see public success stories of ML in finance? You can look at it in two ways. One is that results exist but don't get shared. Occasionally, you'll see a paper with good numbers, but they're rare and usually cherry-picked. The other is that there are real obstacles. Financial data is just harder to work with than data in other domains, mainly for three reasons:

- Too much noise
- Not enough data
- Constantly changing markets

That combination makes financial data very different from weather data, where noise is lower because the system follows physical laws, data is abundant thanks to satellites and sensors producing terabytes of observations every day, and the underlying physics stays stable, so even old data remains useful. For each of these problems in finance, there are known fixes. I'll go through them below. The problem is that they only work in isolation. Nobody has managed to put them all together into a single trading machine. As they say, you can only pick two.

**Reducing noise with filters and aggregation**

Financial time series come with a weak signal buried in noise. Prices swing every second — news, rumors, moves of big players. Even simple links like "good news → price goes up" often break down. There are two main sources of noise.

The first source of noise is a weak signal-to-noise ratio. The core issue is not the magnitude of the noise itself, but the weakness of the signal: meaningful price movements are usually fractions of a percent, while random swings can easily reach several percent. As a result, the share of informative changes within the overall data stream is extremely small.

The second source is the behavior of market participants themselves. According to the Efficient Market Hypothesis, prices already reflect all available information from news — which is exactly what we typically aim to predict. However, markets also include uninformed participants whose actions generate additional noise.

**Noise filtering**

Noise filtering via FFT remains a staple tool. The approach decomposes a time series into frequency components: low frequencies capture the underlying trend, while high frequencies represent noise. By discarding the high-frequency parts and reconstructing only the low-frequency component, we obtain a smoothed signal that's much easier to model. (The high-frequency remainder can still serve for volatility estimation if needed.)
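For illustration, here is a minimal sketch of that kind of low-pass filtering with NumPy; the `keep_fraction` cutoff and the synthetic random-walk input are arbitrary choices for the example, not recommendations.

```python
import numpy as np

def fft_lowpass(prices, keep_fraction=0.05):
    """Smooth a 1-D price series by keeping only the lowest frequencies.

    keep_fraction is the share of real-FFT coefficients to retain
    (purely illustrative here). Returns the smoothed trend and the
    high-frequency residual, which can be used for volatility estimates.
    """
    spectrum = np.fft.rfft(prices)
    cutoff = max(1, int(len(spectrum) * keep_fraction))
    filtered = np.zeros_like(spectrum)
    filtered[:cutoff] = spectrum[:cutoff]          # low frequencies = trend
    trend = np.fft.irfft(filtered, n=len(prices))  # back to the time domain
    residual = prices - trend                      # high-frequency noise component
    return trend, residual

# toy usage on a synthetic random walk
prices = 100 + np.cumsum(np.random.normal(0, 0.1, 10_000))
trend, residual = fft_lowpass(prices)
```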
IBM Heron was able to augment classical computing workflows to better unravel hidden pricing signals in noisy market data than standard, classical-only approaches in use by HSBC, resulting in strong improvements in the bond trading process.

**Heuristics and reframing the problem**

Noise from market participants is handled differently. One useful trick is to reframe the question itself. Instead of asking "What will Apple's stock price be one second from now?" you can ask:

- "What will it cost to buy 1 share?"
- "What will it cost to buy 100k shares?"

In the second case we predict the average price for a large volume, and that is much more stable and better reflects market movement.

```python
def avg_price(order_book, volume):
    taken, cost = 0, 0
    for price, avail in order_book:
        take = min(avail, volume - taken)
        cost += take * price
        taken += take
        if taken >= volume:
            break
    return cost / taken

# Example: the averaged price for 100k shares
y = avg_price(order_book, 100_000)
```
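As a quick illustration of why the second question is more stable, here is a made-up order book (all numbers hypothetical): the one-share answer hinges on a single tiny order, while the 100k-share average barely notices it.

```python
# Hypothetical order book: (price, available size), best ask first
order_book = [(189.00, 500), (189.01, 20_000), (189.02, 80_000), (189.05, 50_000)]

avg_price(order_book, 1)        # 189.00: determined entirely by one small 500-share order
avg_price(order_book, 100_000)  # ~189.018: if that 500-share order vanishes, this barely changes
```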
**When More Volume Means More Noise**

However, there is also a reverse effect: sometimes accounting for more volume can actually add noise to the data. Compare the two charts: they show the average BTCUSDT prices on Binance, weighted by different trade volumes. In the first case, the $50M volume-weighted chart looks "less noisy" — this happened during a rapid price drop. In the second case, however, the same weighted average shows far more "random" jumps. This occurred because, during a calmer price period, participants were more frequently moving their large orders deeper in the order book, which in turn affected the weighted average price.

It may seem surprising: if you simply count the number of changes for each of these three indicators over the period from August 22 to September 22, you actually get an increase in data when "averaging" the price.

- P1_1_market_hits.csv: 2,374,605 points of data
- P2_500k_market_hits.csv: 51,309,973 points of data
- P3_50m_market_hits.csv: 133,191,896 points of data

The takeaway is this: averaging can sometimes make things worse. Still, forewarned is forearmed.
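For reference, a counting sketch along those lines; the file names come from the list above, while the `price` column name and the definition of a "change" (any tick where the value differs from the previous one) are my assumptions.

```python
import pandas as pd

files = ["P1_1_market_hits.csv", "P2_500k_market_hits.csv", "P3_50m_market_hits.csv"]

for name in files:
    prices = pd.read_csv(name)["price"]          # assumed column name
    changes = (prices != prices.shift()).sum()   # ticks where the indicator actually moved
    print(f"{name}: {changes:,} points of data")
```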
**Smarter Targets Beat Raw Prices**

Another way to "reduce noise" and "help ML" is to reformulate the prediction target itself by averaging noise over time. For example, a naïve approach would be: "let's predict the exact price 10 seconds from now." A more robust one is: "let's predict the volume-weighted average price over the next 10 seconds." This solves two problems at once.

First, if a price jump occurs within those 10 seconds, the exact moment doesn't matter as much — averaging smooths it out. The algorithm therefore has fewer ways to fail.

Secondly (and here we get a bit more mathematical), averaging the target also reduces the average penalty the model receives during training for "wrong predictions." In the simplest case of a regression model, the loss is proportional to (y^* - y)^2, where y^* is the "true answer" and y is the model output. The larger the error, the quadratically higher the penalty. Now, suppose that over the next 10 seconds the price trend is generally upward, but at some random moment there is a brief downward spike. The model would have to predict that spike, otherwise it gets penalized. But in reality, we don't care much about that random blip — what we want the model to capture is the overall upward movement.

As an example: on the chart we compare averaging over 1 second vs. 10 seconds. The 10-second target is much easier to predict, at least for a simple regression — because it contains far less random noise.
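A minimal sketch of how such a target could be built, assuming a pandas Series of per-second prices; for brevity it uses a plain time average over the horizon rather than the volume-weighted version, and the 10-second horizon is just the example from the text.

```python
import pandas as pd

def forward_mean_target(prices: pd.Series, horizon: int = 10) -> pd.Series:
    """Target = average price over the next `horizon` observations (here: seconds).

    prices.rolling(horizon).mean() at time t + horizon covers the window
    (t, t + horizon], so shifting it back by `horizon` aligns that window
    with time t without leaking the label into the features at t.
    """
    return prices.rolling(horizon).mean().shift(-horizon)

# usage (illustrative): mid_prices is a per-second price series
# target = forward_mean_target(mid_prices, horizon=10)
```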
**Don't Predict Price, Predict the Crowd**

Sometimes it's smarter to predict not price itself, but crowd reaction. Technical patterns like "head and shoulders" may lack scientific rigor — yet if enough traders believe them and act, price actually moves. Our algorithm taps into these behavioral signals: it doesn't try to model the "physics of the market," but learns participants' reactions. In our algorithms, we usually separate this approach from the general one. In other words, if you "hint" to the ML model what exactly we're looking for in the data (for example, cases where market participants are trying to complete a certain technical analysis pattern), the model learns much more effectively. How to combine and apply different models in practice — keep reading!

**Bootstrapping and augmenting limited data**

The second big challenge is lack of data. Take Apple stock, for example. If you sample prices once per second, you end up with something like 60 seconds × 60 minutes × 8 trading hours × 5 days a week × ~50 weeks (excluding holidays) ≈ 7,200,000 points per year — under 10 million. By "big data" standards, that's crumbs. Sure, there's HFT data where something happens every millisecond. But that only brings back the first problem: tons of noise and very little actual signal.

At its core, machine learning is statistics, and statistics has tricks for working with small samples.

**Bootstrapping**

The idea is simple: say you have 100 observations but you'd like 1,000. You repeatedly take random subsets — for example, 50 items each — and compute statistics. The number of possible "100 choose 50" combinations is huge. As a result, you get hundreds of resampled datasets and more reliable estimates than if you only relied on the original 100. The catch is that for time series this barely works — you can't just chop a sequence into pieces without losing chronology. That's why classical bootstrapping is rarely used for price prediction. But in tasks like analyzing 100 trading deals, it can be applied: you can average over the original hundred, or build several random subsets of 50 and average those results, as in the sketch below.
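A minimal sketch of that resampling idea; the `pnl` array is a synthetic stand-in for 100 real trade results, and the subset size of 50 mirrors the example above.

```python
import numpy as np

rng = np.random.default_rng(0)
pnl = rng.normal(loc=0.1, scale=1.0, size=100)   # stand-in for 100 real per-trade results

def bootstrap_mean(values, n_resamples=1_000, subset_size=50):
    """Estimate the mean and its spread by repeatedly resampling random subsets."""
    estimates = [
        rng.choice(values, size=subset_size, replace=False).mean()
        for _ in range(n_resamples)
    ]
    return np.mean(estimates), np.std(estimates)

mean_est, spread = bootstrap_mean(pnl)
```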
**Data augmentation**

The second strategy is to artificially expand the dataset. In images, this is simple: rotate, distort — and you get a new example. In text, too. With finance it's harder, but some tricks still work. A simple example: if a model is trained only on a bull market, it will always suggest "buy." As soon as the market turns downwards — losses begin. To mitigate this, you can invert the data:

- turn growth into decline,
- turn buys into sells,
- turn good news into bad.

The algorithm learns from such scenarios and begins selling in a falling market.
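A minimal sketch of such an inversion, assuming a DataFrame of trades with `price` and `side` columns (the column names and the log-return mirroring are my own illustrative choices, not the authors' implementation).

```python
import numpy as np
import pandas as pd

def invert_market(df: pd.DataFrame) -> pd.DataFrame:
    """Mirror a bull-market sample into a bear-market one.

    Growth becomes decline by flipping the sign of log returns
    (keeps prices positive), and buy/sell labels are swapped.
    """
    inverted = df.copy()
    log_ret = np.log(df["price"]).diff().fillna(0.0)
    inverted["price"] = df["price"].iloc[0] * np.exp((-log_ret).cumsum())
    inverted["side"] = df["side"].map({"buy": "sell", "sell": "buy"})
    return inverted

# usage (illustrative): augmented = pd.concat([trades, invert_market(trades)])
```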
**Synthetic trade generation is a field that still has many open questions**

The approach is relatively straightforward: take real trades, fit them to a distribution (or set of distributions), and then sample from that distribution to produce additional synthetic data. But the key challenge remains the distribution tails (fat tails) — rare but large events. Their probability is low, yet markets include such events. Recent research increasingly leans on generative models: diffusion models, GANs, variational autoencoders. For example, the paper "Generation of synthetic financial time series by diffusion models" describes a method where order-book series are transformed into wavelet space, generated via DDPM, and then inverted back to time series. Another example is TransFusion, which combines diffusion and transformer architectures to generate long, high-fidelity synthetic time series. The main task is to reconcile two demands: on one hand, preserve the market's stylized facts (fat tails, volatility clustering, autocorrelations, etc.), and on the other, avoid over-synthetic artifacts. To date, there is no universal model in the literature that reliably reproduces all statistical properties.

**Time shifts**

Another technique is time shifts: reusing the same data with a lag. The main issue here is the risk of introducing look-ahead bias — a classic engineering mistake where the training set accidentally contains the very data the model is supposed to predict. On charts this often looks convincing, but in live markets it collapses quickly. Any use of shifted time series therefore requires strict discipline.

A useful analogy is weather: if it's raining now, the probability is high that it will keep raining. But forecasting rain when everyone is already carrying umbrellas adds little value. Trading works in a similar way: when there has just been a large price move, the next move is also likely to be large. What does not correlate, however, is the direction — only the amplitudes show persistence.
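To make the "strict discipline" point concrete, here is a minimal sketch of building lagged features so that nothing from the future leaks into the training rows; the column names, lag choices, and one-step-ahead target are all illustrative.

```python
import pandas as pd

def make_lagged_features(prices: pd.Series, lags=(1, 2, 5)) -> pd.DataFrame:
    """Lagged returns as features, with the target strictly in the future.

    Every feature uses .shift(lag) with lag >= 1, so row t only sees
    information available up to time t; the target is the next-period return.
    """
    returns = prices.pct_change()
    features = pd.DataFrame({f"ret_lag_{k}": returns.shift(k) for k in lags})
    target = returns.shift(-1)            # what we want to predict
    return features.assign(target=target).dropna()

# usage (illustrative): dataset = make_lagged_features(close_prices)
```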
**Adapting to changing markets with ensembles and sliding windows**

The third problem — markets behave like the Wild West. Everything changes fast. You could say: "Let's train only on fresh data from the last few months." But if data is scarce to begin with, this makes the problem even worse.

**Ensembles**

One practical approach is ensembles. You train several models:

- one on the most recent days or weeks,
- another on the entire history,
- a third on some mid-range horizon,
- and a fourth that focuses on special cases — for example, detecting noise patterns or technical-analysis formations, as discussed earlier.

Then you aggregate their predictions (e.g., by averaging, or taking the min/max). This is a standard trick for dealing with non-stationary, heteroscedastic data — where the distribution keeps shifting over time. Markets are exactly that kind of case.

```python
pred1 = model_recent.predict(x)
pred2 = model_history.predict(x)
pred3 = model_midterm.predict(x)

final = np.mean([pred1, pred2, pred3])
# final = np.max([pred1, pred2, pred3])
```

The idea is that the market may change tomorrow, but some of the old information is still useful. Averaging helps smooth out these distortions.
**Sliding windows**

Another technique is training on sliding windows. Take the last 7 days, predict the next one. Then shift the window: add new data, drop the old. The model keeps updating, allowing it to adapt to new market regimes.

```python
window = 7
for t in range(window, len(data)):
    model.fit(data[t - window:t])
    pred = model.predict(data[t])
```

**So why is there no universal ML for trading?**

So, each of the three problems can be solved on its own, but together they don't add up to a universal solution. One reason is the lack of quality feedback for training models. In finance, you don't have the usual ML metrics like accuracy or F1-score. The only metric is money made.

Imagine two hedge funds. One shows average returns, the other twice as high. If someone consistently outperforms the rest, everyone immediately assumes it's a scam. Why? First, because nothing like that shows up in the market — other participants don't feel like someone is "skimming" them on every trade. Second, there's the survivor bias. Classic example: take a thousand people, half go long on oil, half go short. The next day, half of them are right. From the remaining 500, split again, and repeat for several rounds. After ten days, you'll have one "genius" who made the right call ten times in a row. But in reality, he was just lucky — the illusion comes from starting with a thousand players.

This is the core problem of verification. There isn't much data to train on, and there's even less to validate results. Even if we could see trades from a fund that outperforms the market twofold, over a relatively short horizon we still wouldn't be able to tell luck from real skill. A good example is the many "one-day wonders" — funds or companies that show great returns when the overall market is going up (say, during an S&P 500 rally). But as soon as conditions turn south, their performance collapses.

Over the long run, there are indeed legendary cases like the Medallion Fund. They consistently beat the market, delivering returns above so-called risk-free bonds. But the edge isn't by orders of magnitude — it's a few percentage points. To do better than them means being ahead by fractions of a percent, sustained over a very long horizon. The reality is that few funds survive long enough to prove such stability. Over six months, almost anyone can "look like a genius" if they get lucky — that's the classic survivor bias. And not surprisingly, it's exactly this illusion that a lot of flashy marketing campaigns for "successful" funds are built on.

The philosophical takeaway is a harsh one: an algorithm can't be called successful until it's been tested by time. Even if it's profitable on average, in real life it can get wiped out in a single day with a million-dollar drawdown — simply because you don't have an extra million lying around to survive that day.