
This Deep-learning Approach Can Help Double Your Gains in Crypto Investments

by pmatheus, December 18th, 2024

Too Long; Didn't Read

This report presents a novel approach to cryptocurrency trading using a Transformer-based Deep Reinforcement Learning (DRL) agent.

This report presents a novel approach to cryptocurrency trading using a Transformer-based Deep Reinforcement Learning (DRL) agent. The system leverages modern NLP-inspired architectures (Transformers), Double DQN (DDQN), Noisy Networks, and Low-Rank Adaptation (LoRA) for Test-Time Training (TTT). By combining these components, the agent demonstrates improved adaptability, stable policy formation, and significantly enhanced profitability on historical BTC/USDT data. While preliminary, these results suggest the method can be extended to more complex datasets (e.g., orderbook-level) and multiple asset classes, providing a scalable foundation for institutional-grade trading strategies.


Editor’s note: This article is for informational purposes only and does not constitute investment advice. Cryptocurrencies are speculative, complex, and involve high risks. This can mean high price volatility and potential loss of your initial investment. You should consider your financial situation and investment objectives, and consult a financial advisor before making any investment decisions. The HackerNoon editorial team has only verified the story for grammatical accuracy and does not endorse or guarantee the accuracy, reliability, or completeness of the information stated in this article. #DYOR


Why Transformers, Why DRL, and Why You Should Care

  • Transformers: Born in the NLP world, they excel at deciphering sequences. Markets are simply time-series puzzles. Transformers read them like stories, anticipating plot twists (price moves) well before your average quant would blink.

  • Double DQN (DDQN): No more one-sided Q-value estimation. DDQN cuts down on those notorious overestimations, giving your trading policy the credibility it needs to handle tricky bull traps and bear raids.

  • Noisy Nets: Encourage exploration by adding parameterized noise directly to the network weights, so exploration comes from the model’s own learned uncertainty rather than a fixed epsilon-greedy schedule.

  • LoRA (Low-Rank Adaptation): Allows Test-Time Training (TTT) with minimal overhead, enabling the model to adapt rapidly to new market conditions without a full retraining cycle (a minimal sketch of a combined noisy/LoRA layer follows this list).

  • Test-Time Training (TTT): Markets shift, evolve, surprise. TTT lets this model adapt swiftly, adjusting strategies mid-flight as if it’s reading tomorrow’s headlines today.
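
The repo wires the last two ideas together in a NoisyLoRALinear layer, which the network below relies on. Its actual implementation lives in the repository; what follows is only a minimal sketch, assuming factorized Gaussian weight noise and a toggleable low-rank branch (the class name, initialization constants, and noise scheme here are illustrative, not the repo's exact code):


import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class NoisyLoRALinearSketch(nn.Module):
    def __init__(self, in_features, out_features, r=8, alpha=1.0, lora_active=False, sigma_init=0.5):
        super().__init__()
        self.in_features, self.out_features = in_features, out_features
        # Noisy Networks: learnable mean and noise-scale parameters for weights and bias
        self.weight_mu = nn.Parameter(torch.empty(out_features, in_features))
        self.weight_sigma = nn.Parameter(torch.full((out_features, in_features), sigma_init / math.sqrt(in_features)))
        self.bias_mu = nn.Parameter(torch.zeros(out_features))
        self.bias_sigma = nn.Parameter(torch.full((out_features,), sigma_init / math.sqrt(in_features)))
        nn.init.kaiming_uniform_(self.weight_mu, a=math.sqrt(5))
        # LoRA: a small low-rank branch that can be switched on for test-time training
        self.lora_active = lora_active
        self.scaling = alpha / r
        self.lora_A = nn.Parameter(torch.zeros(r, in_features))
        self.lora_B = nn.Parameter(torch.zeros(out_features, r))  # zero-init: no effect until trained
        nn.init.normal_(self.lora_A, std=0.01)

    def _factorized_noise(self, size):
        # Factorized Gaussian noise: f(x) = sign(x) * sqrt(|x|)
        x = torch.randn(size, device=self.weight_mu.device)
        return x.sign() * x.abs().sqrt()

    def forward(self, x):
        if self.training:
            eps_in = self._factorized_noise(self.in_features)
            eps_out = self._factorized_noise(self.out_features)
            weight = self.weight_mu + self.weight_sigma * eps_out.unsqueeze(1) * eps_in.unsqueeze(0)
            bias = self.bias_mu + self.bias_sigma * eps_out
        else:
            weight, bias = self.weight_mu, self.bias_mu
        out = F.linear(x, weight, bias)
        if self.lora_active:
            # Low-rank update, scaled by alpha / r
            out = out + F.linear(F.linear(x, self.lora_A), self.lora_B) * self.scaling
        return out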


The code is at https://github.com/pmatheus/themoneyhack. Clone it, run it, dissect it. You might like it.


This framework creates a system in which a learning agent interacts with a simulated trading environment, guided by a modern Transformer-based architecture. At its core, the system seeks to improve trading decisions over time, refining its strategy through reinforcement learning and adapting quickly to new market conditions.


Consider first the Transformer model. Traditionally, Transformers have revolutionized fields like natural language processing by interpreting sequences of words or tokens. Here, the same principles are applied to time-series price data. Instead of words, the model ingests historical market features — prices, volumes, technical indicators — and attempts to extract meaningful temporal dependencies. The TransformerNetwork class, defined in the code, exemplifies this approach. Notice how the constructor projects raw input features into a higher-dimensional representation and then passes them through multiple Transformer encoder layers:


class TransformerNetwork(nn.Module):
    def __init__(self, state_dim, output_dim, lookback, nhead=8, num_layers=4, model_dim=512, lora_r=8, lora_alpha=1.0, lora_active=False, sigma_init=hyperparameters['SIGMA_INIT']):
        super(TransformerNetwork, self).__init__()
        self.model_dim = model_dim
        self.lookback = lookback
        self.lora_active = lora_active

        self.input_fc = NoisyLoRALinear(state_dim, self.model_dim, r=lora_r, alpha=lora_alpha, lora_active=lora_active, sigma_init=sigma_init)
        
        encoder_layer = TransformerEncoderLayerRelative(d_model=model_dim, nhead=nhead, lora_r=lora_r, alpha=lora_alpha, lora_active=lora_active, sigma_init=sigma_init)
        self.transformer = TransformerEncoderRelative(encoder_layer, num_layers=num_layers)

        self.output_fc = NoisyLoRALinear(self.model_dim, output_dim, r=lora_r, alpha=lora_alpha, lora_active=lora_active, sigma_init=sigma_init)
        self._initialize_weights()

    def forward(self, x):
        # x arrives as (batch, lookback, state_dim)
        x = self.input_fc(x)        # project raw features into model_dim
        x = x.permute(1, 0, 2)      # (lookback, batch, model_dim), as the encoder expects
        x = self.transformer(x)     # stacked relative-position encoder layers
        x = x.mean(dim=0)           # average over the time dimension
        output = self.output_fc(x)  # map to one Q-value per action
        return output


This snippet highlights the flow: raw states enter through input_fc (a linear layer enhanced with noise and potential low-rank adaptations), travel through stacked TransformerEncoderLayerRelative modules that capture temporal and relative positional information, and finally are summarized into actions by output_fc. This design allows the model to weigh events at different points in time, identifying recurring patterns or anomalies that may suggest profitable opportunities.
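
To make the tensor shapes concrete, a quick smoke test might look like the following (the dimensions are placeholder values, and it assumes the repo's module, including its supporting classes and hyperparameters dict, is importable):


import torch

state_dim, n_actions, lookback = 32, 3, 64          # placeholder dimensions, not the repo's config
net = TransformerNetwork(state_dim, n_actions, lookback)
dummy_batch = torch.randn(8, lookback, state_dim)    # (batch, lookback, features)
q_values = net(dummy_batch)                          # expected shape: (8, n_actions)
print(q_values.shape)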


If the Transformer model provides the agent’s “eyes” and “ears,” the TradingEnv class simulates the world it interacts with. The environment defines what it means to take an action—such as opening or closing positions—and how rewards are assigned. Inside the environment, the agent observes price changes, technical indicators, and its current holdings. The following excerpt shows how the environment constructs states from market data and agent positions:


def get_state(self):
    states = []
    current_timestamp = self.tech_array.iloc[self.time]['timestamp']
    weight_long = float(self.stocks_long * self.current_price / self.total_asset) if self.total_asset > 0 else 0.0
    weight_short = float(self.stocks_short * self.current_price / self.total_asset) if self.total_asset > 0 else 0.0

    for _ in range(self.lookback):  # walk backward through the lookback window, one bar per iteration
        row = self.precomputed_tech.get(current_timestamp)
        if row is not None:
            features = row.drop(['timestamp', 'date']).astype(np.float32).values
            state_row = np.concatenate(([weight_long, weight_short], features)).astype(np.float32)
        else:
            state_row = np.zeros(self.state_dim, dtype=np.float32)  # pad missing bars with zeros
        states.append(state_row)
        current_timestamp -= self.get_timeframe_in_seconds()  # move to the previous bar

    states = states[::-1]  # ensure chronological order
    state = torch.tensor(np.array(states, dtype=np.float32), dtype=torch.float32).unsqueeze(0).to(self.device)
    return state


Here, the environment forms a rich state embedding that includes position information and a historical window of features. By providing this windowed perspective, the Transformer can model temporal patterns. Each step in the environment updates positions, calculates profit or loss, and returns a new state and reward for the agent. This cyclical exchange fosters a feedback loop, letting the agent learn what works and what doesn’t.
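
Conceptually, the training loop around this environment looks roughly like the following (method names such as reset(), step(), and select_action() are assumptions for illustration rather than the repo's exact API):


state = env.reset()
done = False
while not done:
    action = agent.select_action(state)                         # choose long/short/flat from Q-values
    next_state, reward, done, info = env.step(action)           # environment updates positions and PnL
    agent.memory.push(state, action, reward, next_state, done)  # store the transition in the replay buffer
    agent.update()                                              # one DDQN gradient step (see below)
    state = next_state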


The decision-making engine resides in the DiscreteDDQNAgent class, which implements a Double DQN approach. Double DQN helps address the well-known overestimation bias in Q-learning by using separate networks for action selection and value estimation. The agent maintains online_net and target_net instances of the Transformer-based model and uses a replay buffer to sample past experiences. During training, it updates its parameters by minimizing the difference between predicted Q-values and target Q-values derived from the target network:


def update(self):
    self.online_net.train()
    if len(self.memory) < self.batch_size:
        return None

    states, actions, rewards, next_states, dones = self.memory.sample(self.batch_size)
    q_values = self.online_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    with torch.no_grad():
        # Double DQN: the online network selects the next action, the target network evaluates it
        next_q_values = self.target_net(next_states).gather(1, self.online_net(next_states).argmax(dim=1).unsqueeze(1)).squeeze(1)
        target_q_values = rewards + self.gamma * next_q_values * (1 - dones)

    ddqn_loss = self.loss_fn(q_values, target_q_values.detach())
    self.optimizer.zero_grad()
    ddqn_loss.backward()
    torch.nn.utils.clip_grad_norm_(self.online_net.parameters(), max_norm=1.0)
    self.optimizer.step()
    self.soft_update()

    return ddqn_loss.item()
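
In equation form, the target computed inside the torch.no_grad() block is y = r + γ · (1 − d) · Q_target(s′, argmax_a Q_online(s′, a)): the online network picks the next action, the target network scores it, and terminal transitions (d = 1) drop the bootstrap term.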


This code reveals a careful training routine: the agent repeatedly samples mini-batches of past transitions, computes the DDQN loss, and updates the online_net. Periodically, it performs a soft update of target_net to slowly track improvements. By combining a Transformer’s pattern recognition capabilities with the stable learning dynamics of Double DQN, the agent gradually becomes more adept at selecting profitable trades.
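
For reference, a soft update of this kind is usually Polyak averaging with a small mixing factor; the snippet below is a generic sketch (the tau value is an assumption, and this is not necessarily the repo's exact soft_update()):


def soft_update(self, tau=0.005):
    # Nudge target_net toward online_net by a factor of tau each call
    for target_param, online_param in zip(self.target_net.parameters(), self.online_net.parameters()):
        target_param.data.copy_(tau * online_param.data + (1.0 - tau) * target_param.data)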


Market conditions, however, never stand still. Shifts in volatility, regulatory changes, or sudden liquidity events can turn yesterday’s winners into today’s losers. To handle such shifts, the concept of Test-Time Training (TTT) is introduced. Rather than relying solely on updates performed after collecting large amounts of experience, TTT allows certain parts of the network to adapt dynamically as new states arrive, even during inference. This quick adaptation enables the agent to recalibrate its strategies on the fly, adjusting to unforeseen circumstances without waiting for a full retraining cycle.
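
One straightforward way to realize this is to freeze the bulk of the network and leave only a small set of adapter parameters trainable during inference. The sketch below assumes LoRA parameters are identifiable by name; it is illustrative only, not the repo's actual toggle methods:


def enable_ttt_adapters(model):
    # Freeze the base network; leave only LoRA adapter parameters trainable
    for name, param in model.named_parameters():
        param.requires_grad = 'lora_' in name

def disable_ttt_adapters(model):
    # Restore full trainability, e.g. when returning to regular offline training
    for param in model.parameters():
        param.requires_grad = True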


In this code, TTT is implemented by selectively training a small subset of parameters during inference. The agent can, for instance, modify a portion of its network to quickly absorb new patterns. The methods activate_lora() and deactivate_lora() (not shown here, since we focus on the Transformer, environment, DDQN, and TTT logic) toggle these adaptive parameters on and off. While that code specifically references LoRA (low-rank adaptation), the same philosophy applies: TTT means enabling certain parts of the model to learn at test time. When TTT mode is active, a specialized optimizer updates these parameters on the spot:


def update_auxiliary(self, state):
    self.autoencoder.train()
    self.online_net.train()  # Ensure noise and adaptive params are active

    masked_state, target_state = self.mask_input(state)
    masked_state_flat = masked_state.view(state.size(0), -1)
    target_state_flat = target_state.view(state.size(0), -1)

    reconstructed = self.autoencoder(masked_state_flat)
    aux_loss = F.mse_loss(reconstructed, target_state_flat)

    self.autoencoder_optimizer.zero_grad()
    self.ttt_optimizer.zero_grad()

    aux_loss.backward()
    torch.nn.utils.clip_grad_norm_(self.autoencoder.parameters(), max_norm=1.0)
    torch.nn.utils.clip_grad_norm_(filter(lambda p: p.requires_grad, self.online_net.parameters()), max_norm=1.0)

    self.autoencoder_optimizer.step()
    self.ttt_optimizer.step()

    self.autoencoder.eval()
    return aux_loss.item()


This auxiliary update routine, triggered at test time, demonstrates the system’s adaptability. The agent intentionally masks some input features and tries to reconstruct them, improving its internal representation of the current market regime. By doing so, it can quickly react to data shifts and maintain profitable strategies in an ever-changing landscape.
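
The masking step itself can be as simple as zeroing out a random fraction of the input features and keeping the original tensor as the reconstruction target. The helper below is a hypothetical stand-in for mask_input() with an assumed mask ratio; the repo's actual masking strategy may differ:


def mask_input(self, state, mask_ratio=0.15):
    # Randomly zero out ~15% of features; the untouched tensor is the reconstruction target
    mask = (torch.rand_like(state) > mask_ratio).float()
    return state * mask, state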


In sum, the interplay among these components creates a robust and flexible trading machine. The Transformer-based model provides a rich, context-sensitive understanding of market signals. The TradingEnv simulates realistic conditions and feedback loops. The Double DQN framework ensures stable learning of action-values, gradually refining trade decisions. Finally, TTT allows the agent to remain agile, adjusting its internal parameters when confronted with novel patterns. Together, these elements set the stage for a system that is both sophisticated in pattern recognition and resilient in the face of market uncertainty, offering a compelling vision for next-generation trading strategies.


What’s the Outcome?

  • Multiple Capital Doublings: On historical daily BTC/USDT data, this agent didn’t just eke out small gains — it doubled its initial capital multiple times. If that’s not a wake-up call, I don’t know what is.
  • Stable, Low Losses & Clear Policies: The TensorBoard logs show losses smoothed out to impressively low levels, and Q-value distributions for different trading actions formed distinct, meaningful patterns. Translation: The model’s not just guessing — it’s learning a consistently profitable approach.


Next-Level Potential

  • Orderbook-Level Data: Right now, we’re dancing with OHLCV. Give me orderbook depth, and I’ll give you alpha that’s practically nuclear. Your internal liquidity pool deserves a model that can anticipate and exploit microsecond-level price moves.
  • Multi-Timeframe Integration: Intraday scalping or longer-term position trades? With transformers, we can fuse multiple granularities — 1-minute charts with daily trends — uncovering signals that no single timeframe can offer.
  • Robustness to Regime Shifts: Markets break patterns all the time. With LoRA and TTT, this agent can adapt faster than a caffeinated intern pulling an all-nighter. Overnight volatility spike? Fed announcement? The model says, “Bring it on.”


Comment, Engage, Offer a Deal 🤯

Don’t let this be a fleeting curiosity. Comment, critique, ask questions. If you’re wondering how to pivot your strategy, or how to integrate a flexible DRL solution into your existing stack, leave a comment.