Trino: The Open-source Data Query Engine That Split from Facebook

Written by orhillel | Published 2022/03/30
Tech Story Tags: query-processing | facebook | open-source | data | presto | database | open-source-software | querying

TLDRTrino is an open-source distributed SQL query engine for ad hoc and batch ETL queries against several types of data sources. Trino's speed is that it doesn't rely on checkpointing and fault tolerance methods. The Trino architecture is similar to massively parallel processing (MPP) databases. Pushdown can push the processing of queries down into the connected data source. Unlike a traditional data warehouse, you don't need to move data ahead of time of time. Use Trino as an online processing system for analytical processing (OLAP)via the TL;DR App

Internet users generate 2.5 quintillion bytes of data every day. Are your organization's data management tools up to the task? Trino, an open-source distributed SQL query engine, can give you better data processing and analysis. It can accelerate your queries. Maybe you've heard about Trino but want to know more before you change systems.

Is Trino a database? Is it OLAP? How does it improve query processing? What do I need to know about Trino? Find out more here about Trino and how it can improve query performance.

What Is Trino?

Trino is an open-source distributed SQL query engine. Engineers designed Trino for ad hoc and batch ETL queries against several types of data sources.

Trino supports relational and non-relational sources. Trino can handle standard and semi-structured data types.

Some people mistakenly think Trino is a database. You use Trino to run SQL against data, but that doesn't make it a database. Trino doesn't actually store any data.

Trino Split from Facebook’s Presto

Trino and Presto are closely related. Trino split from Facebook's Presto project. Engineers at Facebook developed Presto to process the petabytes of data Facebook was trying to analyze. The creators wanted Presto to remain open-source and community-based. Facebook wanted more control, which caused a split.

Facebook applied for a trademark for the name Presto. The original creators rebranded as Trino. Now that Trino separated from Presto, the functionalities of each system have started to differ.

How Does Trino Accelerate Queries?

Trino started as a way to manage the incredibly large data sets Facebook needed to analyze. Trino queries to process faster than queries using other engines. Several factors contribute to this acceleration.

Trino Architecture

Trino architecture is similar to massively parallel processing (MPP) databases. A coordinator node manages multiple worker nodes to process all the work.

A user runs their SQL, which goes to the coordinator. The coordinator parses, plans, and schedules a distributed query.

This partitions the data into smaller chunks to distribute across the nodes. When data chunks arrive at a particular machine, the machine processes them in parallel. Processing happens over multiple threads within a particular node.

Trino supports standard ANSI SQL, like complex queries, joins aggregations, and outer joins. Users can run more complex operations like JSON and MAP transformations and parsing.

Reduced Latency

One factor in Trino's speed is that it doesn't rely on checkpointing and fault tolerance methods. Fault tolerance adds resiliency, but it also creates a large amount of latency.

Removing the fault-tolerance requirement is a major change from older big data systems. It makes Trino ideal for queries where the cost of recovering from failure is less than the cost of checkpointing.

Pushdown

Trino can push the processing of queries down into the connected data source. The operation goes to the source system where custom indexes on the data already exist.

Pushdown improves overall query performance. It reduces the amount of data read from storage files. Trino can use several types of pushdown, including:

  • Predicate

  • Projection

  • Dereference

  • Aggregation

  • Join

  • Limit

  • Top-N

These forms of pushdown reduce network traffic between Trino and the data source. They also reduce the load on the remote data source. Support for pushdown depends on each connector and the underlying database or storage system.

Cost-Based Optimizer

The Trino cost-based optimizer (CBO) uses table and column statistics to create the most efficient query plan. It considers the three main factors that contribute to how long a query takes:

  1. CPU time

  2. Memory requirements

  3. Network bandwidth usage

The CBO balances the different demands for queries, namely:

  • Rapid execution time

  • Keeping cluster costs as low as possible

  • Ensuring that all cluster users can work at the same time

You can only truly optimize for one of these priorities. The CBO creates and compares different variants of a query execution plan to find the option with the lowest overall cost.

Agile Approach to Data Access

Trino runs storage and computing separately. It works easily in cloud environments. The Trino cluster doesn't store your data, so it can auto-scale depending on the load without losing any data.

Use Cases for Trino

Trino is an online analytical processing (OLAP) system. Trino extends the traditional OLAP data warehouse solution by running as a query engine for a data lake or data mesh.

You can interactively run queries across various data sources. Unlike a traditional data warehouse, you don't need to move the data ahead of time.

The power and flexibility of Trino make it well-suited for many use cases. You can use it for all of them or to solve one particular problem. As the Trino users in your organization gain experience with its benefits and features, you'll likely discover other uses as well.

Ad Hoc Queries and Reporting

End-users can use SQL to run ad hoc queries where the data resides. You don't have to move the data to a separate system. You can quickly access data sets that analysts need.

You can query data across many sources to build reports and dashboards for business intelligence. Data scientists and analysts can create queries without needing to rely on data ops engineers.

Data Lake Analytics

One common use case for Trino is directly querying data on a data lake without needing transformation. You can query structured or semi-structured data from multiple sources. This streamlines the process of creating operational dashboards.

Trino can use the Hive connector against HDFS and other object storage systems. You can get SQL-based analytics on your data lake however it stores data.

Batch ETL

Trino is a great engine for your batch extract, transform, and load (ETL) queries. It can rapidly process a large amount of data. It can bring in data from different sources without always needing to extract it from sources like MySQL.

An ETL through Trino is a standard SQL statement. The ETL is easy to implement. End users can perform other ad hoc transformations.

The extensive Trino connector framework means that any connector can be the source for an ETL. Most connectors can also be a sink.

Improve Query Performance with Trino

If you were wondering how Trino can accelerate your queries, you're considering two cutting-edge query engines. Several features of Trino can improve your query performance.

Trino architecture uses massively parallel processing. Reduced latency, pushdown, and the cost-based optimizer also accelerate the query lifecycle.


Written by orhillel | Helps executive teams, marketers and data analysts leverage innovative digital strategies and emerging technologies.
Published by HackerNoon on 2022/03/30