Data warehouse was defined by Bill Inmon as "a subject-oriented, integrated, nonvolatile, and time-variant collection of data in support of management’s decisions" over 30 years ago. However, the initial data warehouses were unable to store massive heterogenous data, hence the creation of data lakes. In modern times, data lakehouse emerges as a new paradigm. It is an open data management architecture featured by strong data analytics and governance capabilities, high flexibility, and open storage. If I could only use one word to describe the next-gen data lakehouse, it would be unification: to avoid the trouble and risks brought by redundant storage and cross-system ETL. Unified data storage of both data and metadata with support for ACID, Schema Evolution, and Snapshot. Unified governance that supports data access via a single interface for multiple engines and workloads. Unified data application If we look into the architecture of a data lakehouse, we will find that it is not only supported by table formats such as Iceberg, Apache Hudi, and Delta Lake, but more importantly, it is powered by a high-performance query engine to extract value from data. Apache Users are looking for a query engine that allows quick and smooth access to most popular data sources. What they don't want is their data to be locked in a certain and rendered unavailable for other engines or to spend extra time and computing costs on data transfer and format conversion. database To turn these visions into reality, a data query engine needs to figure out the following questions: How to access more data sources and acquire metadata more easily? How to improve query performance on data coming from various sources? How to enable more flexible resource scheduling and workload management? provides a possible answer to these questions. It is a real-time OLAP database that aspires to build itself into a unified data analysis gateway. This means it needs to be easily connected to various RDBMS, data warehouses, and data lake engines (such as Hive, Iceberg, Hudi, Delta Lake, and Flink Table Store) and allow for quick data writing from and queries on these heterogeneous data sources. The rest of this article is an in-depth explanation of Apache Doris' techniques in the above three aspects: metadata acquisition, query performance optimization, and resource scheduling. Apache Doris Metadata Acquisition & Data Access Apache Doris 1.2.2 supports a wide variety of data lake formats and data access from various external data sources. Besides, via the Table Value Function, users can analyze files in object storage or HDFS directly. To support multiple data sources, Apache Doris puts efforts in metadata acquisition and data access. Metadata Acquisition Metadata consists of information about the databases, tables, partitions, indexes, and files from the data source. Thus, metadata of various data sources comes in different formats and patterns, adding to the difficulty of metadata connection. An ideal metadata acquisition service should include: A that can accommodate heterogeneous metadata. metadata structure An that enables quick and low-cost data connection. extensible metadata connection framework Reliable and that supports real-time metadata capture. efficient metadata access services to interface with external privilege management systems and thus reduce migration costs. Custom authentication Metadata Structure Older versions of Doris support a two-tiered metadata structure: database and table. Users need to create mappings for external databases and tables one by one, which is heavy work. Thus, Apache Doris 1.2.0 introduced the Multi-Catalog functionality. With this, you can map to external data at the catalog level, which means: You can map to the whole external data source and ingest all metadata from it. You can manage the properties of the specified data source at the catalog level, such as connection, privileges, and data ingestion details, and easily handle multiple data sources. Data in Doris falls into two types of catalogs: Internal Catalog: Existing Doris databases and tables all belong to the Internal Catalog. External Catalog: This is used to interface with external data sources. For example, HMS External Catalog can be connected to a cluster managed by Hive Metastore, and Iceberg External Catalog can be connected to an Iceberg cluster. You can use the statement to switch catalogs. You can also conduct federated queries using fully qualified names. For example: SWITCH SELECT * FROM hive.db1.tbl1 a JOIN iceberg.db2.tbl2 b
ON a.k1 = b.k1; Extensible Metadata Connection Framework The introduction of the catalog level also enables users to add new data sources simply using the statement: CREATE CATALOG CREATE CATALOG hive PROPERTIES (
    'type'='hms',
    'hive.metastore.uris' = 'thrift://172.21.0.1:7004',
); In data lake scenarios, Apache Doris currently supports the following metadata services: Hive Metastore-compatible metadata services Alibaba Cloud Data Lake Formation AWS Glue This also paves the way for developers who want to connect to more data sources via External Catalog. All they need is to implement the access interface. Efficient Metadata Access Access to external data sources is often hindered by network conditions and data resources. This requires extra efforts of a data query engine to guarantee reliability, stability, and real-timeliness in metadata access. Doris enables high efficiency in metadata access by , which includes Schema Cache, Partition Cache, and File Cache. By this means, Doris can respond to metadata queries on thousands of tables in milliseconds. Doris supports manual refresh of metadata at the Catalog/Database/Table level. Meanwhile, it enables auto synchronization of metadata in Hive Metastore by monitoring Hive Metastore Event so any changes can be updated within seconds. Meta Cache Custom Authorization External data sources usually come with their own privilege management services. Many companies use one single tool (such as Apache Ranger) to provide authorization for their multiple data systems. Doris supports a custom authorization plugin, which can be connected to the user's own privilege management system via the Doris Access Controller interface. As a user, you only need to specify the authorization plugin for a newly created catalog, and then you can readily perform authorization, audit, and data encryption on external data in Doris. Data Access Doris supports data access to external storage systems including HDFS and S3-compatible object storage: Query Performance Optimization After clearing the way for external data access, the next step for a query engine would be to accelerate data queries. In the case of Apache Doris, efforts are made in data reading, execution engine, and optimizer. Data Reading Reading data on remote storage systems is often bottlenecked by access latency, concurrency, and I/O bandwidth, so reducing reading frequency will be a better choice. Native File Format Reader Improving data reading efficiency entails optimizing the reading of Parquet files and ORC files, which are the most commonly seen data files. Doris has refactored its File Reader, which is fine-tuned for each data format. Take the Native Parquet Reader as an example: : It can directly convert files to the Doris storage format or to a format of higher performance using dictionary encoding. Reduce format conversion : It supports Page Index for Parquet files so it can utilize Page-level smart indexing to filter Pages. Smart indexing of finer granularity : It first reads columns with filters first and then reads the other columns of the filtered rows. This remarkably reduces file read volume since it avoids reading irrelevant data. Predicate pushdown and late materialization : Building on the high throughput and low concurrency of remote storage, it combines multiple data reads into one in order to improve overall data reading efficiency. Lower read frequency File Cache Doris caches files from remote storage in local high-performance disks as a way to reduce overhead and increase performance in data reading. It has developed two new features that make queries on remote files as quick as those on local files: : Doris supports the block cache of remote files and can automatically adjust the block size from 4KB to 4MB based on the read request. The block cache method reduces read/write amplification and read latency in cold caches. Block cache : Doris applies consistent hashing to manage cache locations and schedule data scanning. By doing so, it prevents cache failures brought by the onlining and offlining of nodes. It can also increase cache hit rate and query service stability. Consistent hashing for caching Execution Engine Developers surely don't want rebuild all the general features for every new data source. Instead, they hope to reuse the vectorized execution engine and all operators in Doris in the data lakehouse scenario. Thus, Doris has refactored the scan nodes: : All data queries in Doris, including those on internal tables, use the same operators, such as Join, Sort, and Agg. The only difference between queries on internal and external data lies in data access. In Doris, anything above the scan nodes follows the same query logic, while below the scan nodes, the implementation classes will take care of the access to different data sources. Layer the logics : Even for the scan nodes, different data sources have a lot in common, such as task splitting logic, scheduling of sub-tasks and I/O, predicate pushdown, and Runtime Filter. Therefore, Doris uses interfaces to handle them. Then, it implements a unified scheduling logic for all sub-tasks. The scheduler is in charge of all scanning tasks in the node. With global information of the node in hand, the schedular is able to do fine-grained management. Such a general framework makes it easy to connect a new data source to Doris, which will only take a week of work for one developer. Use a general framework for scan operators Query Optimizer Doris supports a range of statistical information from various data sources, including Hive Metastore, Iceberg Metafile, and Hudi MetaTable. It has also refined its cost model inference based on the characteristics of different data sources to enhance its query planning capability. Performance We tested Doris and Presto/Trino on HDFS in flat table scenarios (ClickBench) and multi-table scenarios (TPC-H). Here are the results: As is shown, with the same computing resources and on the same dataset, Apache Doris takes much less time to respond to SQL queries in both scenarios, delivering a 3~10 times higher performance than Presto/Trino. Workload Management & Elastic Computing Querying external data sources requires no internal storage of Doris. This makes the elastic stateless computing nodes possible. Apache Doris 2.0 is going to implement Elastic Compute Node, which is dedicated to supporting query workloads of external data sources. Stateless computing nodes are open for quick scaling so users can easily cope with query workloads during peaks and valleys and strike a balance between performance and cost. In addition, Doris has optimized itself for Kubernetes cluster management and node scheduling. Now Master nodes can automatically manage the onlining and offlining of Elastic Compute Nodes, so users can govern their cluster workloads in cloud-native and hybrid cloud scenarios without difficulty. Use Cases Apache Doris has been adopted by a financial institution for risk management. The user's high demands for data timeliness makes their data mart built on Greenplum and CDH, which could only process data from one day ago, no longer a great fit. In 2022, they incorporated Apache Doris in their data production and application pipeline, which allowed them to perform federated queries across Elasticsearch, Greenplum, and Hive. A few highlights from the user's feedback include: Doris allows them to create one Hive Catalog that maps to tens of thousands of external Hive tables and conduct fast queries on them. Doris makes it possible to perform real-time federated queries using Elasticsearch Catalog and achieve a response time of merely milliseconds. Doris enables the decoupling of daily batch processing and statistical analysis, bringing less resource consumption and higher system stability.

Building the Next-Generation Data Lakehouse: 10X Performance

About Author

Comments

TOPICS

THIS ARTICLE WAS FEATURED IN

Related Stories

A Hands-On Guide to Inverted Indexes: Accelerate Text Searches by 40

8 Essential Maven Plugins Beyond The Core

A Comprehensive Guide to Apache Cassandra Architecture

A Detailed Guide To Using Apache Storm

ACID Transactions Are Coming To Apache Cassandra: Here's Why We're Excited

Apache Airflow And Its Contribution to Enterprise Data Integration

A Hands-On Guide to Inverted Indexes: Accelerate Text Searches by 40

8 Essential Maven Plugins Beyond The Core

A Comprehensive Guide to Apache Cassandra Architecture

A Detailed Guide To Using Apache Storm

ACID Transactions Are Coming To Apache Cassandra: Here's Why We're Excited

Apache Airflow And Its Contribution to Enterprise Data Integration

Light-Mode

Classic

Newspaper

Minty

Dark-Mode

Neon Noir

Minty

HN StartUps