
Metabase & Substrate: A Simple and Flexible, Yet Unscalable, Approach to Data Analysis

by BG (@godlin), June 23rd, 2023

Too Long; Didn't Read

The world of blockchain technology has witnessed remarkable growth, and with it, the need for robust data analytics solutions. While various tools have emerged in recent years, the landscape to analyze Substrate blockchains remains relatively nascent. In this article, we will explore an approach that combines the power of open source technologies for indexing and database management, along with a versatile visualisation tool.

The world of blockchain technology has witnessed remarkable growth, and with it, the need for robust data analytics solutions to extract valuable insights from blockchain networks. While various tools have emerged in recent years, the landscape to analyze Substrate blockchains, in particular, remains relatively nascent.


The current state of Substrate blockchain explorers, such as polkadot.js, polkastats, subscan, and others, showcases significant progress in providing visibility into blockchain data. These tools have contributed to enhancing the accessibility and ease of exploring transactions and events on Substrate-based blockchains. However, while these tools offer valuable exploration features, they have certain limitations. One notable limitation is the lack of flexibility when it comes to performing aggregations based on specific parameters.


This restriction hinders the ability to derive deeper insights and perform complex analysis on the blockchain data. In this article, we will explore an approach that combines the power of open source technologies for indexing and database management, along with a versatile visualisation tool.


The goal of this article is to present a simplified and user-friendly method to analyze Substrate blockchains using a range of powerful open-source tools. This approach offers a straightforward and intuitive way to delve into blockchain data analysis. While this approach excels in simplicity, it's important to note its limitations in scalability for large-scale blockchain networks. Nevertheless, by leveraging the capabilities of popular open-source tools, users can easily navigate and explore Substrate blockchains to uncover meaningful patterns and trends.

For those who are eager to dive right into the practical demonstration and get a quick overview of the contents of this article, I've created a video that summarizes the key points and provides a more interactive demo. In the video, I walk through the concepts discussed in this article and showcase the examples in action:


https://odysee.com/metabase-and-substrate-a-simple-and-flexible-yet-nscalable-approach-to-data-analysis:e?r=SBpUVAV9TFtPCydBhjcXtrh1bzoStkeZ

Approach and Considerations


It's important to note that the emphasis in this article is on simplicity and flexibility to provide a comprehensive and customizable analytical experience available locally. It's also necessary to mention that this article is based solely on the author's experience and may not be as scalable or sophisticated as other solutions. Nevertheless, the emphasis on simplicity aims to provide a starting point for those seeking an accessible entry into Substrate blockchain analysis.


Throughout this article, the examples and instructions provided will be tailored for analyzing a Substrate blockchain based on the SORA network. Similarly, the configuration and usage of Subquery will be emphasized as the primary tool for indexing blockchain data. While the concepts and techniques discussed can be applied to other Substrate-based blockchains, it is important to keep in mind that certain details and configurations may vary depending on the specific blockchain network. The focus on SORA blockchain and Subquery stems from the author's familiarity with these networks and tools.



Metabase was chosen for its user-friendly interface and simplicity, making it an ideal choice for beginners and those seeking an easy-to-use analytics solution.


One might wonder, why not use query aggregation, as described in the Subquery documentation, to achieve similar results? While query aggregation is a powerful feature, it currently has limitations when it comes to working with JSON objects, including those of the @jsonField type. This means that aggregating data stored in generic @jsonFields, like in the example mentioned in Subquery's GitHub issue #522, is technically impossible using the GraphQL query editor. To overcome this limitation and provide a flexible and accessible analytical experience, Metabase serves as a valuable alternative.


Finally, it is worth noting that this article is not sponsored by any of the mentioned entities. The choice of SORA and Subquery is solely driven by the author's experience and familiarity with these technologies.

Exploring Alternative Approaches

Before diving into the implementation details, it's essential to acknowledge that different options exist to analyze Substrate blockchains, each with its own strengths and weaknesses.


substrate-api-sidecar is a tool developed by Parity Technologies that provides a simplified interface for interacting with Substrate blockchains. The primary purpose of substrate-api-sidecar is to expose blockchain data through a RESTful API, enabling users to retrieve information such as blocks, transactions, events, and account details. While substrate-api-sidecar is a powerful tool for accessing and retrieving blockchain data, it primarily focuses on providing an interface to interact with the blockchain rather than performing complex data analysis or aggregations. For performing aggregations and advanced analysis on Substrate blockchains, additional tools and frameworks, such as those discussed in this article, can be used in combination with substrate-api-sidecar. These tools allow for more flexibility in customizing aggregations and deriving insights from blockchain data.


The colorfulnotion/substrate-etl project focuses on providing an Extract, Transform, Load (ETL) framework specifically tailored for Substrate-based blockchains. It aims to simplify the extraction of blockchain data, perform necessary transformations, and load the data into a database for further analysis. This approach enables users to create their own custom data pipelines and apply aggregations based on their specific requirements.



Another alternative approach to consider is the Giant Squid API provided by Subsquid. This API offers a great and fast solution for accessing and querying data on Substrate-based chains. It provides a powerful GraphQL interface, allowing you to fetch data in a flexible and efficient manner. However, it's important to note that working with the Giant Squid API requires knowledge of GraphQL and building custom queries. Additionally, unlike Metabase, the Giant Squid API does not have built-in aggregations and data transformations, which means you'll need to handle those aspects manually. If you're comfortable with GraphQL and prefer a more hands-on approach to data analysis, the Giant Squid API can be a valuable tool to explore. For more information on its features and roadmap, you can visit their documentation at link to Giant Squid API roadmap.



Finally, the select-from-polkadot forum post proposes a SQL-like querying mechanism for Substrate blockchains. The author explores the idea of applying a SQL-like syntax to interact with Polkadot's data in a more familiar and efficient manner. The post highlights the benefits of using SQL for data querying, including its simplicity, versatility, and wide adoption. It also touches upon the challenges and considerations involved in implementing SQL-like functionality in a decentralized network like Polkadot. While there isn't a native SQL query language for Polkadot, the author suggests exploring existing tools and technologies that can bridge the gap between SQL and Polkadot's data structure.


Configuration guides

This section will serve as a practical guide, providing readers with detailed step-by-step instructions for configuring the Subquery and Metabase projects. It will cover the necessary setup, installation, and configuration steps for both tools, ensuring readers have a clear understanding of how to set up and customize the projects for their own use.

Subquery

Here we will discuss Subquery configuration, an essential tool for indexing Substrate blockchains and enabling flexible data extraction and transformation. The goal of the section is to cover various aspects of Subquery configuration, including project initialization, schema definition, mapping, and deployment. Additionally, readers will gain insights into customizing Subquery configuration through data filters, event handlers, and indexing strategies.

Introduction

Subquery is a powerful tool specifically designed for indexing Substrate blockchains, enabling efficient data extraction and transformation. It provides a framework that simplifies the process of querying and analyzing blockchain data, making it easier to derive valuable insights from Substrate-based networks. For more information about Subquery, you can visit their website: Subquery Website.

Setting Up Subquery

For the official setup instructions and more detailed documentation on Subquery, please refer to the Subquery Academy. The subsequent sections will dive into the configuration and usage specifics for this particular project, providing step-by-step guides to help you get started with Subquery for your Substrate blockchain analysis needs.


Preparing the necessary setup for Subquery is relatively easy and does not require a significant amount of time. To begin, you will need to fork the Subquery dictionary, which is a collection of data models and transformations that define how the blockchain data will be indexed and queried. More information about dictionaries can be found in the Subquery Academy. Once you have forked the dictionary, you will make several changes to tailor it to your specific needs.


Configuring a manifest file

The manifest file plays a crucial role in customizing your Subquery project to meet your specific needs. Now, let's delve into the specific configuration used in our Subquery project. The following manifest file represents the current setup we will be working with:


specVersion: 1.0.0
name: sora-subquery-dictionary
version: 1.0.0
runner:
  node:
    name: '@subql/node'
    version: 1.9.1
  query:
    name: '@subql/query'
    version: 1.5.0
description: ''
repository: https://github.com/sora-xor/sora-subquery
schema:
  file: ./schema.graphql
network:
  chainId: '0x7e4e32d0feafd4f9c9414b0be86373f9a1efa904809b683453a9af6856d38ad5'
  endpoint: wss://mof3.sora.org
  bypassBlocks: [9633, 14855, 15156, 15663, 25432, 25467, 26157, 27521, 27601, 240175, 242625, 246858, 247995, 258152, 263341, 268943, 282439, 319553, 322075]
  chaintypes:
    file: ./types.json
dataSources:
  - kind: substrate/Runtime
    startBlock: 9000000
    mapping:
      file: ./dist/index.js
      handlers:
        - handler: handleBlock
          kind: substrate/BlockHandler

The manifest file above follows the official configuration structure for Subquery. It includes specific details such as the project name, version, runner specifications, repository link, network settings, and data sources.


At first glance, the line query: name: '@subql/query' in the manifest file may look redundant, since Metabase will replace Subquery's GraphQL query editor as the analytical tool in this setup. However, the @subql/query package is still required for the proper functioning of Subquery and its associated packages, so it should be kept in the manifest.


Schema configuration

In order to effectively analyze a Substrate blockchain using Subquery, it is essential to define the data structure and schema of your project. Now, let's take a closer look at the specific schema configuration used in our Subquery project. The following schema file represents the current setup we will be discussing:


type SpecVersion @entity {
  id: ID! #specVersion
  blockHeight: BigInt!
}

type Event @entity {
  id: ID!
  module: String! @index
  method: String! @index
  blockHeight: BigInt! @index
  params: Details!
  timestamp: Date!
}

type Extrinsic @entity {
  id: ID!
  txHash: String! @index
  module: String! @index
  method: String! @index
  blockHeight: BigInt! @index
  success: Boolean!
  isSigned: Boolean!
  details: Details
  timestamp: Date!
}

type Details @jsonField {
  data: String # API is implemented by bypassing the scheme - `data` in all mappings is an object, not a String. See https://github.com/subquery/subql/issues/522.
}


The schema above represents the current configuration used in our Subquery project. It is similar to the one provided in the dictionary, with the addition of the details sub-object. This sub-object is used in the Event and Extrinsic entities and is defined as a Details type with a data field. It serves as a generic JSON structure, allowing for flexibility in capturing additional details associated with events and extrinsics.


However, there is a known issue related to the implementation of the Subquery API. The issue arises when using the @jsonField annotation, and it is currently unresolved on GitHub. You can find more details in Subquery's GitHub issue #522. Despite this limitation, we can still proceed with our schema configuration and utilize the available features to analyze the Substrate blockchain data.


Mapping configuration

Now that we have established the schema configuration for our Subquery project, it's time to delve into the mapping configuration.


In our mapping configuration, we will make a specific modification to capture additional data for events and extrinsics in a human-readable format. This modification allows us to extract and save specific details from these entities, providing a deeper level of analysis and understanding. For events, we will include the line:

newEvent.params = event.event.data.toHuman() as Object; 


This line of code takes the event data and converts it to a human-readable format using the toHuman() function. By doing so, we can access and store the event parameters in a structured manner, enabling easier analysis and interpretation. Similarly, for extrinsics, we will include the line:

newExtrinsic.details = JSON.parse(JSON.stringify(extrinsic.extrinsic.toHuman())).method.args as Object;

This line of code extracts the extrinsic details and converts them to a human-readable format. The toHuman() function allows us to access the method arguments of the extrinsic and save them as the details field in our Subquery project. This additional level of detail enhances the richness of the data captured and provides valuable insights for analysis.
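To make these two transformations more tangible, here is a small self-contained sketch. The mock objects below are hypothetical stand-ins for what toHuman() might return for a Transfer event and its extrinsic; the real decoded shapes vary by runtime and pallet:

```typescript
// Hypothetical decoded structures, standing in for toHuman() output.
// Real field names and nesting depend on the chain's runtime metadata.
const decodedEventData = {
  from: "cnSenderAddress",
  to: "cnReceiverAddress",
  amount: "1,000,000,000,000,000,000",
};

const decodedExtrinsic = {
  method: {
    args: { dest: "cnReceiverAddress", value: "1,000,000,000,000,000,000" },
  },
};

// Mirrors: newEvent.params = event.event.data.toHuman() as Object;
const params = decodedEventData as Object;

// Mirrors: newExtrinsic.details =
//   JSON.parse(JSON.stringify(extrinsic.extrinsic.toHuman())).method.args as Object;
// The JSON round-trip strips any non-plain-object wrappers before
// drilling down to the method arguments.
const details = JSON.parse(JSON.stringify(decodedExtrinsic)).method.args as Object;

console.log(params, details);
```

Note that numeric values arrive as comma-separated strings, which is why the SQL examples later in this article strip commas before casting.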


When you launch the project with the configuration provided, the Event and Extrinsic entities will store JSON structures in the shape described below. The modifications made in the mapping configuration enable the extraction and storage of additional data in a human-readable format.


For events, the extracted data will be stored in the params field of the Event entity. This field will contain a JSON object that represents the human-readable form of the event data. It will provide a structured representation of the event parameters, making it easier to analyze and interpret.

Similarly, for extrinsics, the extracted details will be stored in the details field of the Extrinsic entity. This field will contain a JSON object that represents the human-readable form of the extrinsic details. It will capture the method arguments of the extrinsic, enabling a deeper understanding of the actions performed on the Substrate blockchain.

Metabase

With the Subquery configuration in place, we can now turn our attention to setting up Metabase for data analysis and visualization. Metabase is an open-source analytics tool that provides a user-friendly interface for exploring and understanding data. By integrating Metabase with Subquery, we can leverage its intuitive features to gain valuable insights from the indexed blockchain data.


Installing Metabase is a straightforward process, and it can be done by following the guide provided in the official Metabase documentation. The installation steps are well-documented and ensure a smooth setup of the analytics tool. Once Metabase is up and running, the next step is to connect it to the Postgres database that Subquery is using to store the indexed blockchain data.


Connecting Metabase to the Subquery database involves providing the necessary connection details, such as the database host, port, username, and password. These details can be obtained from the Subquery configuration, specifically the Postgres database configuration. By establishing this connection, Metabase will be able to access the indexed blockchain data and enable powerful analysis and visualization capabilities.


In order to optimize user experience and enhance performance, it is recommended to disable the automatic unfolding of JSON fields in Metabase. By default, Metabase tries to unfold JSON columns to display their nested structure in a more readable format.



However, for analytical purposes and to avoid potential performance issues when dealing with large datasets, it is beneficial to turn off this feature. To disable the automatic unfolding of JSON columns in Metabase, you can follow these steps:

  • Access the Metabase admin panel.
  • Navigate to the "Databases" section.
  • Locate and select the database you have connected Metabase to.
  • Within the database settings, look for the option or toggle labeled "Unfold JSON Columns".
  • Toggle off or disable this feature.

Demo

In this section, we will delve into the practical aspects of using Metabase for querying data and creating visualizations with the sample Substrate blockchain data. Note that the following examples are not fully representative: for correct results, indexing from the genesis block to the latest finalised block is required. This was not done on my machine; instead, only certain small block ranges were indexed.

Browse data

Once you have logged into the Metabase interface, you will find yourself on the homepage where you can explore the available databases. To locate the table containing the Substrate blockchain data, follow these steps:


  • Navigate to the "Our Data" section in the top navigation bar of the Metabase interface.
  • Select the "Postgres" option from the available database options.
  • Locate the specific database instance that corresponds to the Substrate blockchain data. This could be labeled as "App" or any other relevant identifier you have assigned.
  • Within the selected database instance, you will find the tables and views associated with the Substrate blockchain data.

Query data

Now that you have located the relevant table within the database, we can proceed to query the Substrate blockchain data and create insightful visualizations using Metabase's intuitive interface.

Calculate the distribution of emitted events based on their sections

Creating the above query is a straightforward process and can be done solely using the simple and intuitive drag-and-drop interface of Metabase.

After executing the query, you can configure the pie chart visualization in Metabase and obtain the following chart:

Calculate the number of "Transfer" events per hour

Creating this query is also straightforward: Metabase's native drag-and-drop interface can be used.

By building the following query, you can obtain the desired visualization:

Asset Exchange Usage Distribution: Source and Target Analysis

To analyze the distribution of asset usage in exchange operations, we'll need to utilize SQL in our query. As mentioned earlier, Metabase sometimes automatically destructures JSON fields, but in this case, we need to work with the raw data and apply specific transformations. By default, the toHuman() function used in the Subquery mapping decodes the data of the Exchange event as an array and converts numeric values to strings. Therefore, additional type casting and manipulation are necessary, which can be achieved through SQL queries rather than the drag-and-drop interface:


WITH "calc" AS
  (SELECT "app"."events"."id" AS "id",
          ("app"."events"."params"::JSON->2)::JSON->>'code' AS "sourceAsset",
          ("app"."events"."params"::JSON->3)::JSON->>'code' AS "targetAsset"
   FROM "app"."events"
   WHERE "app"."events"."method" = 'Exchange'),
     "targetAssets" AS
  (SELECT count(*) AS "countTargetAsset",
          "calc"."targetAsset"
   FROM "calc"
   GROUP BY "calc"."targetAsset"),
     "sourceAssets" AS
  (SELECT count(*) AS "countSourceAsset",
          "calc"."sourceAsset"
   FROM "calc"
   GROUP BY "calc"."sourceAsset")
SELECT *, ("sourceAssets"."countSourceAsset" + "targetAssets"."countTargetAsset") as "sum"
FROM "sourceAssets"
FULL JOIN "targetAssets" ON ("sourceAssets"."sourceAsset" = "targetAssets"."targetAsset")
WHERE ("sourceAssets"."countSourceAsset" + "targetAssets"."countTargetAsset") is not null
ORDER BY  "sum" DESC
LIMIT 20


The provided query performs an analysis of exchange events in the database. Here's a breakdown of what it does:

  • The query uses Common Table Expressions (CTEs) to create temporary result sets for further calculations.
  • The CTE named "calc" selects the "id," "sourceAsset," and "targetAsset" columns from the "app.events" table. It extracts specific values from the "params" JSON field using the -> and ->> operators.
  • The CTE "targetAssets" groups the data from the "calc" CTE by "targetAsset" and calculates the count of each target asset.
  • Similarly, the CTE "sourceAssets" groups the data from the "calc" CTE by "sourceAsset" and calculates the count of each source asset.
  • The final SELECT statement retrieves all columns from the "sourceAssets" CTE and performs a FULL JOIN with the "targetAssets" CTE based on the matching "sourceAsset" and "targetAsset" values.
  • The result set is filtered to rows where the combined count is not null, ordered by that combined count ("sum") in descending order, and limited to the top 20 rows.

After executing the query, you can configure the bar chart visualization in Metabase and obtain the following chart:

The graph displays the frequency of each asset being used as both the source and target asset in exchange operations. It provides insights into the usage patterns and popularity of different assets within the exchange.
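The counting logic of the query above can also be sketched outside of SQL. The snippet below uses hypothetical Exchange event rows; note that it keeps every asset, whereas the SQL's non-null filter keeps only assets that appear on both sides of the join:

```typescript
// Hypothetical Exchange event params: toHuman() decodes them as an array,
// where positions 2 and 3 hold the source and target asset { code } objects.
type ExchangeParams = [string, string, { code: string }, { code: string }];

const events: ExchangeParams[] = [
  ["acc1", "dex", { code: "XOR" }, { code: "VAL" }],
  ["acc2", "dex", { code: "XOR" }, { code: "PSWAP" }],
  ["acc3", "dex", { code: "VAL" }, { code: "XOR" }],
];

// Tally how often each asset appears as source and as target,
// analogous to the "sourceAssets" and "targetAssets" CTEs.
const sourceCounts = new Map<string, number>();
const targetCounts = new Map<string, number>();
for (const p of events) {
  const src = p[2].code;
  const tgt = p[3].code;
  sourceCounts.set(src, (sourceCounts.get(src) ?? 0) + 1);
  targetCounts.set(tgt, (targetCounts.get(tgt) ?? 0) + 1);
}

// Combined usage per asset, analogous to the "sum" column after the join.
const assets = new Set([...sourceCounts.keys(), ...targetCounts.keys()]);
const usage = [...assets]
  .map((a) => ({
    asset: a,
    sum: (sourceCounts.get(a) ?? 0) + (targetCounts.get(a) ?? 0),
  }))
  .sort((x, y) => y.sum - x.sum);

console.log(usage);
```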

Addresses balance sheet within a specified period

The following query offers a powerful tool for generating a balance sheet for each address within a specified period. This functionality can be utilized in various scenarios, such as creating reports for tax authorities or analyzing financial transactions.


WITH "sent" AS
  (SELECT SUM(CAST(replace("app"."events"."params"::JSON->>'amount', ',', '') AS float)) AS "amountSent",
          "app"."events"."params"::JSON->>'from' AS "sender"
   FROM "app"."events"
   WHERE "app"."events"."method" = 'Transfer'
   GROUP BY "sender"),
     "received" AS
  (SELECT SUM(CAST(replace("app"."events"."params"::JSON->>'amount', ',', '') AS float)) AS "amountReceived",
          "app"."events"."params"::JSON->>'to' AS "receiver"
   FROM "app"."events"
   WHERE "app"."events"."method" = 'Transfer'
   GROUP BY "receiver")
SELECT *, ("received"."amountReceived" - "sent"."amountSent") AS "delta"
FROM "sent"
FULL OUTER JOIN "received" ON ("received"."receiver" = "sent"."sender")
ORDER BY "delta" DESC
LIMIT 10


The provided SQL query is used to calculate a balance sheet for each address within a specified period. Here is a breakdown of the query's functionality for those who are not familiar with SQL:

  • The query uses two Common Table Expressions (CTEs) named "sent" and "received" to aggregate the data from the "Transfer" events in the blockchain.
  • The "sent" CTE calculates the total amount sent by each address by summing up the values in the "amount" field of the events. It groups the data by the sending address ("sender").
  • The "received" CTE calculates the total amount received by each address by summing up the values in the "amount" field of the events. It groups the data by the receiving address ("receiver").
  • The main query selects all columns from the "sent" CTE and performs a FULL OUTER JOIN with the "received" CTE. This combines the results of both CTEs based on matching addresses.
  • The query also includes an additional calculated column called "delta," which represents the difference between the total amount received and the total amount sent for each address.


It's important to note that the query is based on the Subquery project and is specifically tailored to analyze data starting from block 9000000. At the time of executing the query, the blockchain had reached block 9015188, so the results obtained from this query are applicable only for this timeframe.

After executing the query, you can configure the bar chart visualization in Metabase and obtain the following chart:

Other Metabase Features

In addition to the core visualization and dashboard functionalities, Metabase offers a range of advanced features that can further enhance your data exploration and analysis experience. To expand your knowledge and explore Metabase's capabilities, I encourage you to visit the official Metabase documentation. Several of the core features are worth mentioning:


Dashboards: Dashboards in Metabase are a powerful feature that allows you to create curated collections of visualizations, reports, and metrics in a single, consolidated view. With dashboards, you can bring together relevant data insights into one centralized location for easy access and monitoring.


Data transformations with models: Refine and manipulate your data within Metabase using built-in data transformation capabilities. Clean, transform, and aggregate data to create derived tables, perform calculations, and generate new insights without leaving the platform. Instead of running a separate ETL job to create a model in your database, Metabase allows you to define models directly within its interface. This eliminates the need for complex coding and manual data transformations.

Advanced visualizations: Go beyond basic charts and explore more advanced visualization options in Metabase. From heatmaps and scatter plots to geographical maps, unlock powerful visual representations to uncover patterns, trends, and correlations in your data.


Parameterized queries: Make your queries interactive and dynamic by using parameters. Easily create flexible reports and visualizations that allow users to input values for filtering, date ranges, and other variables, enabling personalized data exploration.


Embedding Metabase: Seamlessly integrate Metabase into your applications or websites by leveraging its embedding capabilities. Embed visualizations and dashboards to provide data-driven insights and analytics to your users, enhancing the value of your own products or services.


Data alerts and scheduled reports: Stay informed about important data changes with Metabase's data alerts and scheduled reports. Set up notifications and automated reports to be delivered via email or other channels, ensuring that you and your team are always up to date with the latest insights.

Conclusion

In conclusion, this article serves as a comprehensive reference for utilizing Subquery and Metabase for data analysis on Substrate-based chains. While it acknowledges that this project is still in its early stages and serves as a pet project, it provides a solid foundation and knowledge base for further contributions to the data analysis aspect of Substrate-based chains. The solution presented here offers a simple and straightforward approach to analyzing Substrate blockchains using open-source tools.


Again, it's important to highlight that while this approach may not scale well for large-scale and complex projects, it serves as a starting point and showcases the potential of leveraging Subquery and Metabase for data analysis. The article recognizes the limitations and acknowledges that more sophisticated DLT projects, such as Dune Analytics with their spellbook, offer enhanced analysis experiences and scalability. Depending on your specific requirements, you may need to adapt the database structure and table design for scalability and efficiency.


Additionally, the article highlights the recent post in the Polkadot forum discussing the select-from-polkadot topic. This post lays a great foundation for further improvements in analysis. While the SQL-like syntax application described in the forum post provides a comprehensive exploration of the architecture for querying Substrate-based chains data, this article focuses on showcasing how Metabase, with its main query language being SQL, can easily leverage a SQL-like syntax for data analysis.



The compatibility between the SQL-like syntax and Metabase's query capabilities demonstrates the potential for a powerful solution when combining sophisticated architectural approaches with user-friendly BI systems.


I would also like to note that this article is subject to change as it evolves alongside the technology and community. The author is committed to keeping the revision history accessible and visible to everyone, ensuring that it remains an adequate reference for those interested in utilizing these technologies for analysis purposes. By documenting the journey and inviting contributions, this article aims to foster a collaborative and evolving resource for the Substrate ecosystem.


Finally, I would like to express my gratitude to the contributors to the analysis ecosystem and the developers of the tools mentioned in this article. Their dedication and innovation have paved the way for new possibilities in data analysis on Substrate-based chains. Together, we can unlock the full potential of data analysis in the Substrate ecosystem.


Also published here.