# I Unknowingly Blew $3,000 in 6 Hours on One Query in Google Cloud BigQuery. Here's Why, and 3 Easy Steps to Optimize the Cost.

In This Article

- 🙈 You'll learn what contributes to BigQuery's costs.
- 🏛️ We'll dive into BigQuery's architecture.
- 💰 You'll learn 3 easy steps to save 99.97% on your next Google Cloud bill.

Let's go.

## How Did It Happen?

I was developing a script to prepare data samples for customers who reached out for consultations. The samples have 100 rows per file, and they are split into 55 locales. My query looked like this:

```sql
SELECT *
FROM `project.dataset.table`
WHERE
  ts BETWEEN TIMESTAMP("2022-12-01") AND TIMESTAMP("2023-02-28")
  AND locale = "US"
LIMIT 100;
```

The data was stored in europe-west4, and the pricing for querying is $6 per TB. So, by running the script, I processed:

- 500 TB of data in total
- 3 TB of data per country on average
- $54 per data sample file on average

Very expensive.

## The Script That Cost $3,000

The script was written in JavaScript modules.

```js
// bq-samples.mjs

import { BigQuery } from "@google-cloud/bigquery";
import { Parser } from "@json2csv/plainjs";
import makeDir from "make-dir";

import { write } from "./write.mjs";
import { locales } from "./locales.mjs";
import { perf } from "./performance.mjs";

const q = (locale, start, end, limit) =>
  `SELECT *
   FROM \`project.dataset.table\`
   WHERE ts BETWEEN TIMESTAMP("${start}") AND TIMESTAMP("${end}")
     AND locale = "${locale}"
   LIMIT ${limit}`

async function main() {
  const timer = perf()
  const dir = await makeDir("samples")
  const bigquery = new BigQuery()
  const csvParser = new Parser({})

  try {
    const jobs = locales.map((locale) => async () => {
      // get query result from BigQuery
      const [job] = await bigquery.createQueryJob({
        query: q(locale, "2022-12-01", "2023-02-28", 100),
      })
      const [rows] = await job.getQueryResults()

      // parse rows into csv format
      const csv = csvParser.parse(rows)

      // write data into csv files and store in the file system
      await write(csv, dir, locale, "2022-12-01", "2023-02-28", 100)
    })

    await Promise.all(jobs.map((job) => job()))

    console.log(`✨ Done in ${timer.stop()} seconds.`)
  } catch (error) {
    console.error("❌ Failed to create sample file", error)
  }
}

await main()
```

It generates one sample file in CSV format per locale. The process is straightforward:

1. Query the BigQuery table with a locale, start date, end date, and limit.
2. Transform the query result into CSV format.
3. Write the CSV to the file system.
4. Repeat the process for all locales.

## What's the Problem?

It turns out that I did several things wrong in my query. If you look at the pricing model again, you'll notice the cost depends only on how much data you process. So it's clear that my query looked up far too much data to produce 100 rows.

With this insight, let's optimize the query step by step.

## Don't Select *

It's a little counterintuitive. Why does my SELECT statement have anything to do with how much data it processes? Regardless of the columns I select, I should be reading from the same resources and data, right?

That's only true for row-oriented databases. BigQuery is actually a columnar database. It's column-oriented, meaning the data is structured in columns. BigQuery uses Dremel as its underlying computing engine. When data is moved from cold storage to active storage in Dremel, it is stored in a tree structure. Each leaf node is a column-oriented "record" in Protobuf format. In BigQuery, each node is a VM.
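A quick way to see how the SELECT list affects cost is BigQuery's dry-run mode, which plans a query and reports how many bytes it would scan without executing or billing it. Here is a minimal sketch using the same Node.js client as the script above; the table name is the placeholder from earlier, and the $6/TB figure mirrors the pricing mentioned at the start:

```js
// cost-check.mjs: a dry-run sketch that estimates how much data a query would scan.
// It assumes the same placeholder table as above; a dry run processes and bills nothing.
import { BigQuery } from "@google-cloud/bigquery";

const bigquery = new BigQuery();

async function estimateBytesScanned(query) {
  // dryRun: true asks BigQuery to plan the query and report statistics without running it
  const [job] = await bigquery.createQueryJob({ query, dryRun: true });
  return Number(job.metadata.statistics.totalBytesProcessed);
}

const wildcardQuery = `
  SELECT *
  FROM \`project.dataset.table\`
  WHERE ts BETWEEN TIMESTAMP("2022-12-01") AND TIMESTAMP("2023-02-28")
    AND locale = "US"
  LIMIT 100`;

const bytes = await estimateBytesScanned(wildcardQuery);
// at the $6/TB rate mentioned above, a rough cost estimate:
console.log(`~${(bytes / 1e12).toFixed(2)} TB scanned, about $${((bytes / 1e12) * 6).toFixed(2)}`);
```

Running the same check against a query that lists only the needed columns makes the difference visible before any money is spent.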
A query execution propagates from the root server (node) through intermediate servers to the leaf servers, which retrieve the selected columns. So we can modify the query to select individual columns:

```sql
SELECT
  session_info_1, session_info_2, session_info_3,
  user_info_1, user_info_2, user_info_3,
  query_info_1, query_info_2, query_info_3,
  impression_info_1, impression_info_2, impression_info_3,
  ts
FROM `project.dataset.table`
WHERE
  ts BETWEEN TIMESTAMP("2022-12-01") AND TIMESTAMP("2023-02-28")
  AND locale = "US"
LIMIT 100;
```

Just by selecting all the columns explicitly, I was able to reduce the processed data from 3.08 TB to 2.94 TB. That's a reduction of roughly 140 GB.

## Use Partitioned Tables and Query Only Subsets of Data

Google Cloud recommends we partition tables by date. It lets us query only a subset of the data.

To optimize the query further, we can narrow down the date range in the WHERE clause, because the table is partitioned by the "ts" column.

```sql
SELECT
  session_info_1, session_info_2, session_info_3,
  user_info_1, user_info_2, user_info_3,
  query_info_1, query_info_2, query_info_3,
  impression_info_1, impression_info_2, impression_info_3,
  ts
FROM `project.dataset.table`
WHERE
  ts = TIMESTAMP("2022-12-01")
  AND locale = "US"
LIMIT 100;
```

By narrowing the date range down to one day instead of three months, I was able to cut the processed data down to 37.43 GB. That's just a fraction of the original query.

## Use Materialized Query Results in Stages

Another way to reduce costs is to reduce the dataset you're querying from. BigQuery offers destination tables to store query results as smaller datasets. Destination tables come in two forms: temporary and permanent.

Because temporary tables have a limited lifetime and aren't designed to be shared and queried, I created a permanent destination table to materialize the query result:

```js
// bq-samples.mjs

const dataset = bigquery.dataset("materialized_dataset")
const materializedTable = dataset.table("materialized_table")

// ...

const [job] = await bigquery.createQueryJob({
  query: q(locale, "2022-12-01", "2023-02-28", 100),
  destination: materializedTable,
})
```

The query results are stored in the destination table, which then serves as a reference for future queries. Whenever it's possible to query from the destination table, BigQuery processes the data from that smaller table instead, which greatly reduces the amount of data we look up.

## Final Thoughts

Reducing cost in BigQuery turned out to be a very interesting exercise. With only three easy steps:

1. Don't use SELECT *
2. Use partitioned tables and query only subsets of data
3. Use materialized query results in stages

I was able to reduce the processed data size from 3 TB to 37.5 GB, which cut the total cost significantly, from $3,000 to $30.

If you're interested in learning more about the BigQuery architecture, here are the references that helped me:

- BigQuery explained: An overview of BigQuery's architecture
- A Look at Dremel
- Colossus under the hood: a peek into Google's scalable storage system
- Jupiter evolving: Reflecting on Google's data center network transformation

You can read more about BigQuery cost optimizations in the Google Cloud documentation.

Special thanks to Abu Nashir for collaborating with me on the case study and providing valuable insights that helped me understand BigQuery's architecture.
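For completeness, here is a rough sketch of how the three steps could come together in the sampling script. It is an illustration only: the materialized table name follows the earlier snippet, and the column list mirrors the example queries above.

```js
// sample-from-materialized.mjs: a sketch of the per-locale sampling step after the
// three optimizations. Table and column names are illustrative, not the real schema.
import { BigQuery } from "@google-cloud/bigquery";

const bigquery = new BigQuery();

const sampleQuery = (locale, day, limit) => `
  SELECT
    session_info_1, session_info_2, session_info_3,
    user_info_1, user_info_2, user_info_3,
    query_info_1, query_info_2, query_info_3,
    impression_info_1, impression_info_2, impression_info_3,
    ts
  FROM \`project.materialized_dataset.materialized_table\`
  WHERE ts = TIMESTAMP("${day}")  -- one partition day instead of a three-month range
    AND locale = "${locale}"
  LIMIT ${limit}`

const [job] = await bigquery.createQueryJob({
  query: sampleQuery("US", "2022-12-01", 100),
})
const [rows] = await job.getQueryResults()
// rows can then be parsed to CSV and written to the file system as before
```

Because each run now reads explicit columns from a single partition day of a much smaller materialized table, the per-sample cost stays tiny even if the script is run repeatedly.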
## References

- Abu Nashir — LinkedIn
- A Look at Dremel — Peter Goldsborough
- BigQuery — Google Cloud
- BigQuery explained: An overview of BigQuery's architecture — Google Cloud
- Colossus under the hood: a peek into Google's scalable storage system — Google Cloud
- Column-oriented DBMS — Wikipedia
- Comma-separated values — Wikipedia
- Creating partitioned tables — Google Cloud
- JavaScript modules — MDN
- Jupiter evolving: Reflecting on Google's data center network transformation — Google Cloud
- Method: jobs.query — Google Cloud
- Protocol Buffers — Wikipedia
- Run interactive and batch query jobs — Google Cloud
- Using cached query results — Google Cloud
- Writing query results — Google Cloud

## Want to Connect?

This article was originally posted on Daw-Chih's website.