"Emerging Data Lakehouse Design and Practice Handbook · Data Lakehouse Modeling and Naming Standards (2025)" consists of four progressive guides, structured along the main line of Model Architecture – Common Standards – Layering Standards – Naming Standards. Together, they systematically build an evolvable, governable, and shareable modern data lakehouse.

The first article, "(I) Principles of Data Model Architecture: Four Layers and Seven Stages, the 'First Cornerstone' of Data Lake and Warehouse Modeling," introduced the four-layer data layering architecture (ODS–DW–APP, with DWD/DWM/DWS inside DW). Around four core principles (domain partitioning, high cohesion and low coupling, common logic sinking, and cost-performance balance), it laid a unified and extensible foundation for dimensional modeling in a lakehouse architecture.

This article is the second in the series. It details the common design standards followed in data warehouses, including hierarchy invocation standards, data type standards, string handling, and other warehouse design specifications. The remaining two articles will further analyze the detailed specifications for each data warehouse layer and the unified naming system within this framework, helping enterprises complete the full chain from data ingestion to value realization with one methodology. Stay tuned for the full series.

1. Hierarchy Invocation Standards: Controlling Data Warehouse Flow and Reference Principles

🚀 Business data flow design and layering reference essentials

Stable business should follow the standard flow design ODS → DWD → DWS → APP. For unstable or exploratory demands, ODS → DWD → APP or ODS → DWD → DWM → APP may be used.

Beyond ensuring a rational data flow, the following reference principles must also be observed:

- The normal flow is ODS → DWD → DWM → DWS → APP. If ODS → DWD → DWS → APP occurs, it means the domain's construction is incomplete; the DWD data should first be landed in DWM. For rarely used tables, DWD → DWS is allowed.
- Avoid having a DWS wide table reference both the DWD and the DWM of the same domain.
- Within the same domain, avoid generating one DWM table from another DWM table, to maintain ETL efficiency.
- DWM, DWS, and APP must not directly use ODS tables; only DWD may reference ODS.
- Reverse dependencies are forbidden, e.g., a DWM table depending on a DWS table.
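Example: the sketch below shows what a compliant ODS → DWD → DWM → DWS flow might look like in Hive-style SQL. All table and column names (ods_trade_order_di, dwd_trade_order_di, and so on) are hypothetical and serve only to illustrate the reference principles above.

```sql
-- DWD: cleanse raw orders; DWD is the only layer allowed to read ODS
INSERT OVERWRITE TABLE dwd_trade_order_di PARTITION (dt = '${bizdate}')
SELECT order_id, user_id, province_id, pay_amount, order_status, pay_time
FROM   ods_trade_order_di
WHERE  dt = '${bizdate}';

-- DWM: common logic sinks here (lightly aggregated per-user order facts)
INSERT OVERWRITE TABLE dwm_trade_user_order_di PARTITION (dt = '${bizdate}')
SELECT user_id,
       COUNT(order_id) AS order_cnt,
       SUM(pay_amount) AS pay_amount
FROM   dwd_trade_order_di
WHERE  dt = '${bizdate}'
GROUP BY user_id;

-- DWS: the 30-day wide table references the DWM of this domain,
-- rather than both DWD and DWM of the same domain
INSERT OVERWRITE TABLE dws_trade_user_order_nd PARTITION (dt = '${bizdate}')
SELECT user_id,
       SUM(order_cnt)  AS order_cnt_30d,
       SUM(pay_amount) AS pay_amount_30d
FROM   dwm_trade_user_order_di
WHERE  dt BETWEEN date_sub('${bizdate}', 29) AND '${bizdate}'
GROUP BY user_id;
```

A reverse dependency, such as rebuilding the DWM table from dws_trade_user_order_nd, would be rejected under these principles.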
2. Data Type Standards: Unifying Data Warehouse Type Settings

🔍 Precise type definitions for various data

Each kind of data must have a standardized type that is strictly enforced:

- Amount: double, or decimal(28,6) where precision matters; always clarify whether the unit is cents or yuan.
- String: string.
- ID fields: bigint.
- Time: string.
- Status: string.

3. Data Redundancy Standards: Reasonable Control of Wide Table Redundancy

🤔 Considering frequency, latency, and duplication rate

Redundant fields in wide tables must satisfy three conditions:

- A redundant field must be frequently used, i.e., referenced by at least 3 downstream consumers.
- A redundant field must not introduce significant data latency.
- A redundant field must not overly duplicate existing fields; the duplication rate should generally not exceed 60%. Otherwise, use a join or extend the original table instead.

4. NULL Field Handling Standards: Strategies for Dimensions and Metrics

❓ Why NULL values are set this way

- Dimension fields: set NULL to -1.
- Metric fields: set NULL to 0.

5. Metric Caliber Standards: Ensuring Consistency of Metrics

🧩 Specific methods for metric collation and management

Metric calibers must be consistent within each domain, without ambiguity. Through layering, unified data outputs are provided so that external metric definitions stay consistent, avoiding "same metric, different definition."

1) Metric Collation

Inconsistent definitions increase the cost of using data, leading to disputes and repeated verification. During governance, all metrics collected from requirements are further refined to clarify their definitions. If two metrics share a name but differ in definition, determine whether they can be merged; if both must exist, their names must clearly distinguish them.

2) Metric Management

Metric management covers atomic metrics and derived metrics.

Atomic metrics:

- Assign the metric to a production line, business unit, domain, and business process.
- Define its source data within the business process.
- Record its English and Chinese names and a description.
- Fill in the metric function.
- The system auto-generates the definition expression.
- The system auto-generates the SQL from the definition and the source table.

Derived metrics:

- Built on atomic metrics by adding dimensions or qualifiers.
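To tie Sections 2, 4, and 5 together, here is a minimal sketch; every name in it is hypothetical, reusing the dwd_trade_order_di table from the Section 1 sketch. The DDL applies the type standards, the load step applies the NULL defaults (-1 for dimensions, 0 for metrics), and the final query shows an atomic metric (total pay amount) extended into a derived metric by adding a dimension (province) and a time qualifier (day).

```sql
-- Hypothetical DWD table declared with the standard types from Section 2
CREATE TABLE IF NOT EXISTS dwd_trade_order_di (
    order_id     BIGINT        COMMENT 'order id',
    user_id      BIGINT        COMMENT 'user id',
    province_id  BIGINT        COMMENT 'dimension: province',
    pay_amount   DECIMAL(28,6) COMMENT 'amount, unit: yuan',
    order_status STRING        COMMENT 'status code',
    pay_time     STRING        COMMENT 'yyyy-MM-dd HH:mm:ss'
) PARTITIONED BY (dt STRING);

-- Load step with Section 4's NULL defaults:
-- dimensions fall back to -1, metrics fall back to 0
INSERT OVERWRITE TABLE dwd_trade_order_di PARTITION (dt = '${bizdate}')
SELECT order_id,
       user_id,
       COALESCE(province_id, -1) AS province_id,  -- dimension -> -1
       COALESCE(pay_amount, 0)   AS pay_amount,   -- metric    -> 0
       order_status,
       pay_time
FROM   ods_trade_order_di
WHERE  dt = '${bizdate}';

-- Section 5: atomic metric "pay amount" = SUM(pay_amount);
-- derived metric = atomic metric + dimension (province) + qualifier (day)
SELECT dt, province_id, SUM(pay_amount) AS pay_amount_1d
FROM   dwd_trade_order_di
WHERE  dt = '${bizdate}'
GROUP BY dt, province_id;
```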
6. Data Table Handling Standards: Characteristics of Different Table Types

⚡ Differences between incremental, full, snapshot, and zipper tables

1) Incremental Table

Holds the new data since the last extraction. It records changes only, not totals: a record is reported only when it changes, and no change means no report. One partition per day.

2) Full Table

Holds all the latest state data for each day. Data is reported regardless of change, so each report includes all data, changed and unchanged. It has only one partition.

3) Snapshot Table

Partitioned by day, with each partition recording the full data as of that date. Data is reported regardless of change, so each report includes all data, changed and unchanged. One partition per day.

4) Zipper Table

Keeps the full data set up to date, recording every change of an entity from its first appearance to the present. Each report includes the entity's final historical state as of now, and the current record stores the last change before now. It has only one partition. (A maintenance sketch follows this section.)
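As a concrete illustration of the zipper table, the sketch below maintains a hypothetical zip_user_s table with a start_date/end_date pair per record version; open versions carry end_date = '9999-12-31'. All table and column names are assumptions. The merge writes to a temporary table first, since reading from and overwriting the same table in one statement is unsafe.

```sql
-- Step 1: merge yesterday's zipper with today's increment into a temp table
INSERT OVERWRITE TABLE zip_user_s_tmp
-- Existing versions: close the open version of any user changed today
SELECT z.user_id,
       z.user_level,
       z.start_date,
       CASE WHEN z.end_date = '9999-12-31' AND i.user_id IS NOT NULL
            THEN date_sub('${bizdate}', 1)          -- close the old version
            ELSE z.end_date
       END AS end_date
FROM zip_user_s z
LEFT JOIN (SELECT user_id FROM dwd_user_change_di WHERE dt = '${bizdate}') i
       ON z.user_id = i.user_id
UNION ALL
-- Today's changes become new open versions
SELECT user_id,
       user_level,
       '${bizdate}' AS start_date,
       '9999-12-31' AS end_date
FROM dwd_user_change_di
WHERE dt = '${bizdate}';

-- Step 2: swap the merged result back into the zipper table
INSERT OVERWRITE TABLE zip_user_s
SELECT user_id, user_level, start_date, end_date FROM zip_user_s_tmp;
```

With this layout, the full snapshot as of any date d is simply WHERE start_date <= d AND end_date >= d, which is why a single-partition zipper table can answer both "current state" and "state at any point in history" queries.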
7. Table Lifecycle Management: Strategies Based on Data History and Table Type

⏳ Lifecycle matrix via historical data grading and table type categorization

1) Historical Data Grading

Historical data is graded into four levels: P0, P1, P2, and P3.

- P0: critical domain and application data, non-recoverable, e.g., transactions, logs, group KPIs, IPO-related tables.
- P1: important business and application data, non-recoverable, e.g., key product data.
- P2: important but recoverable business/application data, e.g., intermediate ETL process data.
- P3: non-critical, recoverable data, e.g., SNS product reports.

2) Table Types

- Event stream table (incremental): keyless or non-duplicate data such as logs.
- Event mirror table (incremental): process data with a primary key and slowly changing attributes, e.g., transactions and order states.
- Dimension tables: users, products, and the like.
- Merge full table: process or dimension data that may contain duplicates; the merge keeps the latest record per key, with older states retained in previous partitions.
- ETL temporary tables: generated during processing; keeping them is not recommended, and they are retained for at most 7 days.
- TT temporary data: data from TT pulls or DbSync that flows to the DS layer, with ODS keeping the original; the default lifespan is 93 days.
- Regular full tables: small business/product data, often pulled directly; retained long-term, with the strategy set by historical data level.

Crossing the historical data grades with the table type categories yields the lifecycle management matrix.

Previous article: "(I) Principles of Data Model Architecture: Four Layers and Seven Stages, the 'First Cornerstone' of Data Lake and Warehouse Modeling"

Next article preview: detailed design standards for each data warehouse layer.