Problem Statement
Simple: Migrating the Data Lake from Snowflake to BigQuery.
| As-Is (Snowflake) | To-Be (BigQuery) |
|---|---|
| `SELECT order_id::STRING AS order_id_str FROM orders;` | `SELECT CAST(order_id AS STRING) AS order_id_str FROM orders;` |
During the Snowflake-to-BigQuery migration, the task was not only to move terabytes of data across platforms, but to do so efficiently and cost-effectively while also refactoring the data pipelines.
That’s where the real complexity surfaced. We were following an ELT approach for the data lake, which meant the extraction layer would remain unchanged. But the loading and transformation logic had to be refactored. The technology stack included Python for development, AWS for infrastructure, and Snowflake as the source data store, all of which now needed to align with BigQuery’s ecosystem.
Over time, the platform had accumulated extensive SQL—complex queries, layered views, and tightly coupled stored procedures. Translating this logic was not a simple syntax conversion; it required rethinking how transformations were executed, optimized, and governed in a fundamentally different data environment.
Evolution of pattern detection and replacement
Taking a step back, I compared how I used to handle this kind of work with the options available now.
Use AI Agents
Given the scale and complexity of the migration, comprising thousands of database objects (tables, views, tasks, procedures), hundreds of SQL scripts, and hundreds of ad hoc queries, leveraging large language models embedded within the IDE was the most practical and effective approach for refactoring and validating SQL. Generative AI has matured significantly over the past few years, so the next step was to decide how the instruction should be framed:
- As a direct task: "Refactor the Snowflake SQL into a BigQuery-compatible format."
- With clearer context and expertise: "Given your expertise in translating Snowflake SQL to BigQuery, translate the following SQL."
- Or with a more guided, example-driven prompt: "You are a domain expert in translating Snowflake SQL to BigQuery; here are examples of a few translations; now translate the SQL in file …"
AGENTS.md Adoption
Introduction
The AGENTS.md file is a guide for AI agents, providing context and instructions to help them work on the project. It is an open-format file: unlike the structured configurations found in `*.json` or `*.yml`, the md file is a "human-to-machine" plain-text README. It is designed to convey the high-level context, tribal knowledge, and specific behavioral instructions that code alone cannot express. Because it is plain text, there are no syntax errors, no compilation errors, and none of the other nightmares that usually haunt data engineers working with configuration files.
Structure: Formatting & Hierarchy
A simple Markdown-style hierarchy conveys the relative importance of different instructions:
- `#` (single hash): the highest level of the hierarchy (project vision or major sections).
- `##` (double hash): the second-highest level (specific modules or workflows).
- `###` (triple hash): the third level and below (detailed tasks or technical nuances).
There are no specific constraints apart from these markers.
Sample File
Here is a sample AGENTS.md file; all the instructions and guidelines are in plain text. The first line states the project vision and the expectations for the AI agent, and the lines that follow carry the detailed instructions.
# Snowflake to BigQuery SQL Refactoring Guide
## Reusable Setup for Future Refactoring Tasks
This guide provides a systematic approach to refactoring Snowflake SQL to BigQuery SQL,
reusing the patterns and tools established for the Adobe API refactoring.
---
One of the biggest challenges in this migration was handling JSON structures. Multiple pipelines dump JSON response payloads into Snowflake VARIANT columns, which Snowflake parses gracefully with the LATERAL FLATTEN keyword. The same is not the case in BigQuery, so a combination of PARSE_JSON and UNNEST did the trick: first identify which JSON structure each payload uses, then derive the matching UNNEST expression. The section of the md file below covers the most common JSON shapes we encountered in BigQuery.
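For illustration, here is a minimal before/after sketch of that conversion, assuming a hypothetical `raw_events` table whose JSON payload is an object with an `items` array; the table and field names are placeholders, not part of the actual migration.

```sql
-- Snowflake: flatten an array held in a VARIANT column (hypothetical table/fields)
SELECT
    f.value:id::string   AS item_id,
    f.value:price::float AS price
FROM raw_events r,
     LATERAL FLATTEN(input => r.payload:items) f;

-- BigQuery: equivalent logic with PARSE_JSON + UNNEST (payload stored as a JSON string)
SELECT
    JSON_VALUE(item, '$.id')                     AS item_id,
    CAST(JSON_VALUE(item, '$.price') AS FLOAT64) AS price
FROM `{project_id}.{dataset}.raw_events` r,
     UNNEST(JSON_QUERY_ARRAY(PARSE_JSON(r.payload), '$.items')) AS item;
```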
## Quick Start Checklist
### Step 0: MANDATORY - Detect JSON Structure First (CRITICAL)
**⚠️ ALWAYS run this diagnostic query BEFORE refactoring to avoid zero-row issues:**
```sql
-- Comprehensive JSON Structure Diagnostic
-- Run this for EACH table with JSON content BEFORE refactoring
SELECT
-- Basic info
COUNT(*) as total_rows,
COUNTIF(content IS NOT NULL) as rows_with_content,
COUNTIF(SAFE.PARSE_JSON(content) IS NULL) as parse_failures,
-- JSON type (CRITICAL - determines parsing approach)
JSON_TYPE(PARSE_JSON(ANY_VALUE(content))) as json_type,
-- Array lengths for different paths (identifies correct path)
MAX(ARRAY_LENGTH(JSON_QUERY_ARRAY(PARSE_JSON(content), '$'))) as direct_array_max,
MAX(ARRAY_LENGTH(JSON_QUERY_ARRAY(PARSE_JSON(content), '$.content'))) as content_array_max,
MAX(ARRAY_LENGTH(JSON_QUERY_ARRAY(PARSE_JSON(content), '$.data'))) as data_array_max,
MAX(ARRAY_LENGTH(JSON_QUERY_ARRAY(PARSE_JSON(content), '$.rows'))) as rows_array_max,
-- Sample preview (shows actual structure)
SUBSTR(ANY_VALUE(content), 1, 200) as sample_preview,
-- Check for double-quoted JSON (escaped JSON string)
CASE WHEN JSON_TYPE(PARSE_JSON(ANY_VALUE(content))) = 'OBJECT'
AND JSON_TYPE(PARSE_JSON(JSON_VALUE(PARSE_JSON(ANY_VALUE(content)), '$.content'))) = 'ARRAY'
THEN 'YES - Double-quoted JSON detected'
WHEN JSON_TYPE(PARSE_JSON(ANY_VALUE(content))) = 'STRING'
THEN 'YES - Needs double parsing'
ELSE 'NO' END as is_double_quoted_json
FROM `{project_id}.{dataset}.{table_name}`;
```

**Decision Tree Based on Diagnostic Results** (a worked example of case 4 follows this list):

1. **If `json_type = 'ARRAY'` and `direct_array_max > 0`:**
- Structure: `[{...}, {...}]` (direct array)
- Use: `UNNEST(JSON_QUERY_ARRAY(PARSE_JSON(content), '$'))`
2. **If `json_type = 'OBJECT'` and `content_array_max > 0`:**
- Structure: `{"content": [{...}, {...}]}` (nested content)
- Use: `UNNEST(JSON_QUERY_ARRAY(PARSE_JSON(content), '$.content'))`
3. **If `json_type = 'OBJECT'` and `rows_array_max > 0`:**
- Structure: `{"rows": [{...}, {...}]}` (nested rows)
- Use: `UNNEST(JSON_QUERY_ARRAY(PARSE_JSON(content), '$.rows'))`
4. **If `is_double_quoted_json = 'YES'`:**
- Structure: `{"content": "[{\"id\":\"...\"}]"}` (escaped JSON string)
- Use: `UNNEST(JSON_QUERY_ARRAY(PARSE_JSON(JSON_VALUE(PARSE_JSON(content), '$.content')), '$'))`
- **This is the most common cause of zero-row issues!**
5. **If `json_type = 'OBJECT'` and all array_max are NULL:**
- Structure: Single object `{...}` (not array)
- Use: Direct extraction without UNNEST: `JSON_VALUE(PARSE_JSON(content), '$.field')`
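To make the decision tree concrete, here is a minimal sketch of case 4 (the escaped-JSON-string shape), using a hypothetical `api_responses` table with a string column `content`; the extracted field names are placeholders.

```sql
-- Case 4: {"content": "[{\"id\": ...}]"} - the "content" field holds an escaped JSON
-- string, so it must be parsed twice before UNNEST (placeholder table/field names).
SELECT
  JSON_VALUE(item, '$.id')     AS id,
  JSON_VALUE(item, '$.status') AS status
FROM `{project_id}.{dataset}.api_responses` r,
     UNNEST(
       JSON_QUERY_ARRAY(
         PARSE_JSON(JSON_VALUE(PARSE_JSON(r.content), '$.content')),  -- double parse
         '$'
       )
     ) AS item;
```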
Below are some of the common refactoring guidelines, driven by the differences between the two environments.
### Step 1: Analyze the SQL File
- [ ] Identify all Snowflake-specific syntax (a combined example appears after this checklist):
  - `LATERAL FLATTEN` patterns
  - `::type` casting
  - Snowflake functions (`IFF`, `NVL`, `DECODE`, etc.)
  - `USE WAREHOUSE`, `USE DATABASE`, `USE SCHEMA`
  - `DELETE FROM ... INSERT INTO` patterns
  - `QUALIFY` clauses
- [ ] List all tables involved (source and target)
- [ ] Identify incremental load patterns
- [ ] Note special handling (NaN, NULL, date formats, etc.)
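For reference, a single Snowflake query frequently mixes several of these constructs at once. The snippet below is purely illustrative; the table and column names are hypothetical.

```sql
-- Hypothetical Snowflake query combining patterns flagged above:
-- LATERAL FLATTEN, ::type casting, IFF/NVL, and QUALIFY.
SELECT
    f.value:sku::string                   AS sku,
    NVL(f.value:qty::int, 0)              AS qty,
    IFF(o.status = 'CANCELLED', 'Y', 'N') AS is_cancelled
FROM raw.orders o,
     LATERAL FLATTEN(input => o.payload:items) f
QUALIFY ROW_NUMBER() OVER (PARTITION BY f.value:sku::string ORDER BY o.load_ts DESC) = 1;
```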
### Pattern 1: Type Casting
| Snowflake | BigQuery |
|-----------|----------|
| `value:field::string` | `JSON_VALUE(item, '$.field')` |
| `value:field::float` | `CAST(JSON_VALUE(item, '$.field') AS FLOAT64)` |
| `value:field::int` | `CAST(JSON_VALUE(item, '$.field') AS INT64)` |
| `value:field::date` | `CAST(JSON_VALUE(item, '$.field') AS DATE)` |
| `value:field::timestamp_ntz` | `CAST(JSON_VALUE(item, '$.field') AS TIMESTAMP)` |
| `value:field::numeric(10,2)` | `ROUND(CAST(JSON_VALUE(item, '$.field') AS NUMERIC), 2)` |
### Pattern 2: DELETE + INSERT → TRUNCATE + INSERT
```sql
-- Snowflake
DELETE FROM {edw}.schema.table;
INSERT INTO {edw}.schema.table ...
-- BigQuery
TRUNCATE TABLE `{{projectid}}.EDW_SCHEMA.table`;
INSERT INTO `{{projectid}}.EDW_SCHEMA.table` ...
```
### Pattern 3: Function Conversions
| Snowflake | BigQuery |
|-----------|----------|
| `IFF(cond, a, b)` | `IF(cond, a, b)` |
| `NVL(expr, default)` | `COALESCE(expr, default)` |
| `IFNULL(expr, default)` | `COALESCE(expr, default)` |
| `DECODE(expr, s1, r1, default)` | `CASE expr WHEN s1 THEN r1 ELSE default END` |
| `split(str, '/')[1]` | `SPLIT(str, '/')[SAFE_OFFSET(1)]` |
| `YEAR(date)` | `EXTRACT(YEAR FROM date)` |
| `TO_DATE(str)` | `CAST(str AS DATE)` or `PARSE_DATE('%Y-%m-%d', str)` |
| `TO_CHAR(num)` | `CAST(num AS STRING)` |
| `CHARINDEX(substr, str)` | `STRPOS(str, substr)` |
| `EDITDISTANCE(str1, str2)` | `EDIT_DISTANCE(str1, str2)` |
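As a quick sanity check of Pattern 3, a hypothetical expression before and after conversion might look like this (illustrative column names):

```sql
-- Snowflake (illustrative columns)
SELECT IFF(NVL(discount, 0) > 0, 'discounted', 'full_price') AS price_band,
       SPLIT(page_path, '/')[1]                              AS page_section,
       YEAR(order_date)                                      AS order_year
FROM orders;

-- BigQuery
SELECT IF(COALESCE(discount, 0) > 0, 'discounted', 'full_price') AS price_band,
       SPLIT(page_path, '/')[SAFE_OFFSET(1)]                     AS page_section,
       EXTRACT(YEAR FROM order_date)                             AS order_year
FROM `{project_id}.{dataset}.orders`;
```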
### Pattern 4: QUALIFY Clause
```sql
-- Snowflake
SELECT ...
QUALIFY ROW_NUMBER() OVER (PARTITION BY ... ORDER BY ...) = 1
-- BigQuery
SELECT ... FROM ( SELECT ...,
ROW_NUMBER() OVER (PARTITION BY ... ORDER BY ...) AS rn
FROM ...
)
WHERE rn = 1
```
### Pattern 5: Date/Time Functions
```sql
-- Snowflake
DATEADD(day, -90, CURRENT_DATE())
DATEDIFF(day, date1, date2)
TO_TIMESTAMP(str, 'YYYY-MM-DDTHH24:MI:SS')
-- BigQuery
DATE_SUB(CURRENT_DATE, INTERVAL 90 DAY)
DATE_DIFF(date2, date1, DAY)
PARSE_TIMESTAMP('%Y-%m-%dT%H:%M:%S', str)
```
Finally, to close out the md file, we can include validation of the translated SQL; we can also ask the agent to connect to BigQuery and perform a dry run. A sketch of a simple validation query follows the checklist below.
### Step 2: Validate
- [ ] Run validation queries
- [ ] Test each table individually
- [ ] Fix any issues found
- [ ] Document assumptions and changes
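As an example of what that validation might look like, here is a minimal sketch that compares row counts between a staging source and its refactored target; both table names are placeholders.

```sql
-- Minimal post-refactoring sanity check (placeholder table names):
-- compare row counts between the staging source and the refactored target.
SELECT
  (SELECT COUNT(*) FROM `{project_id}.{dataset}.orders_staging`) AS staging_rows,
  (SELECT COUNT(*) FROM `{project_id}.{dataset}.orders`)         AS target_rows,
  (SELECT COUNT(*) FROM `{project_id}.{dataset}.orders_staging`) =
  (SELECT COUNT(*) FROM `{project_id}.{dataset}.orders`)         AS counts_match;
```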
Overall, this is how the md file is structured. Keep it in the repository root directory and instruct the IDE agent to follow it. The agent can also reference different md files depending on where they are placed within the repository.
Conclusion
To wrap up this write-up: translating hundreds of batch SQL scripts and ad hoc queries, and validating them, would have been a daunting and time-intensive effort. The introduction of AI-assisted workflows, guided by the AGENTS.md file, made it far simpler. By codifying domain knowledge, refactoring patterns, and validation rules in a single, human-readable guide, AGENTS.md established a reliable baseline for consistency and correctness. This approach not only reduced manual effort but also kept the complexity manageable at scale, making it a key enabler of the successful transition to BigQuery and laying a strong foundation for the future platform.
References
- https://agents.md/
- https://docs.cloud.google.com/bigquery/docs
- https://docs.snowflake.com/en/
