Highlights from the Data and AI Summit (formerly the Apache Spark Summit) for the Busy IT Professional
Right now I am waiting for the day 2 keynote to kick off, so let me summarize the day 1 announcements of the Data and AI Summit (DAIS) for you.
Delta Lake has grown up :-). Delta.io, the Linux Foundation-hosted, open-source project that brings reliability to your data lake, has now matured to version 1.0. Happy birthday, Delta! The key features of Delta 1.0 were presented in the keynote.
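If you want to try Delta Lake yourself, here is a minimal sketch of writing and reading a Delta table with PySpark, assuming the delta-spark package from PyPI and a local Spark session; the table path is just a placeholder.

```python
# Minimal sketch: write and read a Delta table with PySpark.
# Assumes `pip install delta-spark` and a local Spark session.
from pyspark.sql import SparkSession
from delta import configure_spark_with_delta_pip

builder = (
    SparkSession.builder.appName("delta-quickstart")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

# Write a small DataFrame as a Delta table, then read it back.
spark.range(5).write.format("delta").mode("overwrite").save("/tmp/delta/numbers")
spark.read.format("delta").load("/tmp/delta/numbers").show()
```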
A brand-new open-source project for data sharing was announced at DAIS. With Delta Sharing you can share massive, live data from your Lakehouse across clouds or with on-premises systems. It is secure, fast, cheap, and reliable, building on underlying cloud storage systems such as S3, ADLS, and GCS. And it is both open source and an open standard. Recipients can connect to shared data directly through pandas, Tableau, or dozens of other systems that implement the open protocol. #DataAISummit
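Here is a minimal sketch of what the recipient side can look like with the open-source delta-sharing Python client; the profile file path and the share, schema, and table names below are placeholders provided by the data provider.

```python
# Hedged sketch: a recipient reads a shared table with the delta-sharing
# Python client (pip install delta-sharing). Names below are placeholders.
import delta_sharing

profile = "/path/to/open-datasets.share"            # credentials file from the data provider
table_url = profile + "#my_share.my_schema.my_table"

# Load the shared table straight into a pandas DataFrame.
df = delta_sharing.load_as_pandas(table_url)
print(df.head())
```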
Databricks rolled out a data catalog, Unity Catalog. Unity Catalog solves a major issue: imagine a CSV file stored in your S3 data lake; how do you grant access to only certain rows?
Unity Catalog enforces permissions at the row, column, or view level instead of the file level. It governs tables and ML models. Simply use ANSI SQL standard GRANT statements, or discover data assets from the UI. Works for the Lakehouse on all clouds.
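A hedged sketch of how the row-level scenario above could look: restrict the rows through a view and grant access with a standard SQL GRANT, issued here via spark.sql from a notebook. The catalog, schema, table, and group names are placeholders, and the exact syntax may differ in the released product.

```python
# Hedged sketch: row-level access through a view plus an ANSI SQL GRANT,
# as described in the Unity Catalog announcement. Assumes a notebook where
# `spark` is already defined; all object and principal names are placeholders.
spark.sql("""
  CREATE VIEW main.sales.emea_orders AS
  SELECT order_id, amount FROM main.sales.orders WHERE region = 'EMEA'
""")

# Grant read access on the restricted view instead of the underlying files.
spark.sql("GRANT SELECT ON TABLE main.sales.emea_orders TO `analysts`")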
My favorite #DataAISummit announcement: Delta Live Tables. Data flow made simple: specify the outcomes that a pipeline needs to achieve using SQL or Python, and treat your transformations and data quality expectations as code.
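Here is a hedged sketch of what such a pipeline can look like in Python: declare the tables you want and attach a data quality expectation as code. The table names and the storage path are placeholders, and the dlt module is only available inside a Delta Live Tables pipeline, where `spark` is provided for you.

```python
# Hedged sketch of a Delta Live Tables pipeline: declare target tables and
# treat data quality expectations as code. Names and paths are placeholders.
import dlt
from pyspark.sql.functions import col

@dlt.table(comment="Raw events ingested from cloud storage.")
def raw_events():
    return spark.read.format("json").load("/data/events/raw")

@dlt.table(comment="Cleaned events ready for reporting.")
@dlt.expect_or_drop("valid_user", "user_id IS NOT NULL")  # drop rows that violate the expectation
def clean_events():
    return dlt.read("raw_events").where(col("event_type") == "click")
```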
And now for day 2 of DAIS!
Databricks Machine Learning brings together managed MLflow and introduces new components such as AutoML and the Feature Store, supporting the full ML lifecycle.
MLflow integration enables the Feature Store to package up feature lookup logic hermetically with the model artifact. When an MLflow model that was trained on data from the Feature Store is deployed, the model itself will look up features from the appropriate online store.
The Databricks Feature Store automatically tracks the data sources used for feature computation, as well as the exact version of the code that was used.
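A hedged sketch of how training and logging with the Feature Store client could look; the table, feature, column, and model names are placeholders, label_df is assumed to be an existing Spark DataFrame with the label and lookup key, and the client API may have evolved since the announcement.

```python
# Hedged sketch: train a model on Feature Store features and log it with
# MLflow so the feature lookup logic travels with the model artifact.
# All names are placeholders; label_df is assumed to already exist.
import mlflow.sklearn
from sklearn.linear_model import LogisticRegression
from databricks.feature_store import FeatureStoreClient, FeatureLookup

fs = FeatureStoreClient()

lookups = [FeatureLookup(table_name="shop.user_features",
                         feature_names=["age", "visits_7d"],
                         lookup_key="user_id")]

# Join the labels with features from the Feature Store into a training set.
training_set = fs.create_training_set(df=label_df,
                                      feature_lookups=lookups,
                                      label="churned",
                                      exclude_columns=["user_id"])
data = training_set.load_df().toPandas()

model = LogisticRegression().fit(data.drop("churned", axis=1), data["churned"])

# log_model packages the feature lookup logic with the model; at serving time
# the deployed model fetches the same features from the online store itself.
fs.log_model(model, artifact_path="model", flavor=mlflow.sklearn,
             training_set=training_set, registered_model_name="churn_model")
```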
AutoML allows you to quickly build and deploy machine learning models by automating the heavy lifting of preprocessing, feature engineering, and model training and tuning. AutoML detects the best preprocessing, ML model, and hyperparameters for you and generates a notebook with all the required steps. It automatically tracks trial run metrics and parameters with MLflow and lets teams register and version their models in the Databricks Model Registry for deployment. You simply define the maximum runtime you want to spend on the task.
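A hedged sketch of the AutoML Python API for a classification task: point it at a DataFrame, name the target column, and cap the runtime. The DataFrame and column names are placeholders.

```python
# Hedged sketch of the AutoML Python API; train_df and column names are placeholders.
import databricks.automl

summary = databricks.automl.classify(
    dataset=train_df,          # Spark or pandas DataFrame with features and label
    target_col="churned",      # column AutoML should predict
    timeout_minutes=30,        # maximum time to spend on the search
)

# Each trial is tracked in MLflow; the best run's generated notebook and model
# are referenced from the returned summary.
print(summary.best_trial.model_path)
```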
There are more video resources from DAIS 2021 that I recommend: The fireside chat with Bill Inmon (DWH inventor), Delta 1.0 announcement, Delta Sharing, SQL and Photon updates, Unity Catalog, and Delta Live Tables.
Slice & DAIS 2021 — EMEA Live Event
Join us for the first Slice & DAIS session, which covers all the new announcements in a beginner-friendly way.
Previously published behind a paywall at https://medium.com/geekculture/what-is-new-in-data-and-ai-2021-376cefe67fb3