Let’s go.
I was developing a script to prepare data samples for customers that reached out for consultations. The samples have 100 rows in each file, and they are split into 55 locales. My query looks like this
SELECT *
FROM `project.dataset.table`
WHERE
ts BETWEEN TIMESTAMP("2022-12-01") AND TIMESTAMP("2023-02-28")
AND locale = "US"
LIMIT 100;
The data was stored in “europe-west-4”, and the pricing for querying is $6 per TB. So, by running the script, I processed:
Very expensive.
The script was written in JavaScript modules.
// bq-samples.mjs
import { BigQuery } from "@google-cloud/bigquery";
import { Parser } from "@json2csv/plainjs";
import makeDir from "make-dir";
import { write } from "./write.mjs";
import { locales } from "./locales.mjs";
import { perf } from "./performance.mjs";
const q = (locale, start, end, limit) => `SELECT *
FROM \`project.dataset.table\`
WHERE
ts BETWEEN TIMESTAMP("2022-12-01") AND TIMESTAMP("2023-02-28")
AND locale = "${locale}"
LIMIT ${limit}`
async function main() {
const timer = perf()
const dir = await makeDir('samples')
const bigquery = new BigQuery()
const csvParser = new Parser({})
try {
const jobs = locales.map((locale) => async () => {
// get query result from BigQuery
const [job] = await bigquery.createQueryJob({
query: q(locale, "2022-12-01", "2023-02-28", 100),
})
const [rows] = await job.getQueryResults()
// parse rows into csv format
const csv = parse(csvParser, rows)
// write data into csv files and store in the file system
await write(csv, dir, locale, "2022-12-01", "2023-02-28", 100)
})
await Promise.all(jobs.map((job) => job()))
console.log(`✨ Done in ${timer.stop()} seconds.`)
} catch (error) {
console.error('❌ Failed to create sample file', error)
}
}
await main()
It generates one sample file in CSV format per locale. The process is straightforward:
It turns out that I did several things wrong in my query. If you look at the pricing model again, you’ll notice the cost is only related to how much data you process. So it’s clear that my query looked up too much data to produce 100 rows.
With this insight, let’s optimize the query step by step.
It’s a little counterintuitive. Why does my select statement have anything to do with how much data it processes? Regardless of the columns I select, I should be reading from the same resources and data, right?
It’s only true for row-oriented databases.
BigQuery is actually a columnar database. It’s column-oriented, meaning the data is structured in columns. BigQuery uses Dremel as its underlying computing engine. When the data is moved from the cold storage to the active storage in Dremel, it stores the data in a tree structure.
Each leaf node is a column-oriented “record” in Protobuf format.
In BigQuery, each node is a VM. A query execution propagates from the root server (node) through intermediate servers to the leaf servers to retrieve the selected columns.
We can modify the query to select individual columns:
SELECT
session_info_1,
session_info_2,
session_info_3,
user_info_1,
user_info_2,
user_info_3,
query_info_1,
query_info_2,
query_info_3,
impression_info_1,
impression_info_2,
impression_info_3,
ts
FROM `project.dataset.table`
WHERE
ts BETWEEN TIMESTAMP("2022-12-01") AND TIMESTAMP("2023-02-28")
AND locale = "US"
LIMIT 100;
Just by selecting all the columns explicitly, I was able to reduce the processed data from 3.08 TB to 2.94 TB. That’s a 100 GB reduction.
Google Cloud recommends we partition tables by date. It lets us query only a subset of data.
To optimize the query further, we can narrow down the date range in the where statement because the table is partitioned by the “ts” column.
SELECT
session_info_1,
session_info_2,
session_info_3,
user_info_1,
user_info_2,
user_info_3,
query_info_1,
query_info_2,
query_info_3,
impression_info_1,
impression_info_2,
impression_info_3,
ts
FROM `project.dataset.table`
WHERE
ts = TIMESTAMP("2022-12-01")
AND locale = "US"
LIMIT 100;
I narrowed down the date range to one day instead of three months. I was able to cut down the processed data to 37.43 GB. It’s just a fraction of the original query.
Another way to reduce costs is to reduce the dataset you’re querying from. BigQuery offers destination tables to store query results as smaller datasets. Destination tables come in two forms: temporary and permanent.
Because temporary tables have a lifetime and it’s not designed to be shared and queried, I created a permanent destination table to materialize the query result:
// bq-samples.mjs
const dataset = bigquery.dataset('materialized_dataset')
const materialzedTable = dataset.table('materialized_table')
// ...
const [job] = await bigquery.createQueryJob({
query: q(locale, '2022-12-01', '2023-02-28', 100),
destination: materialzedTable,
})
The query results will be stored in the destination table. It’ll serve as a reference for future queries. Whenever it’s possible to query from the destination table, BigQuery will process the data from the table. It’ll greatly reduce the data size we look up.
It’s a very interesting study to reduce the cost in BigQuery. With only three easy steps:
I was able to reduce the processed data size from 3 TB to 37.5 GB. It reduces the total cost significantly from $3,000 to $30.
If you’re interested in learning more about the BigQuery architecture, here’re the references that helped me:
You can read more about BigQuery cost optimizations in the Google Cloud documentation.
Special thanks to Abu Nashir for collaborating with me on the case study and providing valuable insights that helped me understand BigQuery’s architecture.
Want to Connect?
This article was originally posted on Daw-Chih’s website.