We Kinda Bypassed Firebase's Paywall: Here's How

Written by bari | Published 2020/11/22
Tech Story Tags: data-science | web-scraping | python | startup | data | analytics | firebase | hackernoon-top-story

TLDR The value of good analytics at this stage of a product’s lifecycle cannot be overstated. To monitor its health and growth, we decided to use Firebase Analytics. We had approached the point of diminishing returns when it came to using the analytics dashboards to inform our enhancements. Our experiments needed access to more data. We would build a small automation tool to loop through our dashboards, collecting all the data that we wanted — a web scraping solution. We used Selenium, a browser automation tool, for this purpose.

Some time ago, a few friends and I decided to build an app. We duct-taped our code together, launched our first version, then attracted a few users with a small marketing budget.
The value of good analytics at this stage of a product’s lifecycle cannot be overstated. Your product metrics are like the vital signs of your app. To monitor its health and growth, we decided to use Firebase Analytics.
In true start-up fashion, we rapidly looped through the development cycle. Designing small experiments at each step, we saw our product vitals steadily improve.
We had approached the point of diminishing returns when it came to using the analytics dashboards to inform our enhancements. Our experiments needed access to more data. We hit a wall — Firebase’s paywall that guards access to all the data you could want about your product.
Why not just pay for the upgrade? Headlines like the one below were enough to make us think twice.
And so, we had a new challenge: 
“How do we get a clean dataset from the analytics dashboards?” 
What we were after was a table containing the data of all events triggered every day by each user.
Note — this is not how you would want to store your data in most cases. In this particular instance, we decided to use this structure as it lent itself to easier analysis for our purpose.
Brainstorm clouds gathered, and we were soon engaged in a turbulent problem-solving session. Red teaming, whiteboarding, and too many power drinks later, we had reached a conclusion. We would build a small automation tool to loop through our dashboards, collecting all the data that we wanted — a web scraping solution. 
At a high level, our script would work something like this:
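Roughly, in a Python-style outline (a sketch only; the helper names match the functions shown later in this post, and the step that loads each user's filtered summary page is glossed over):

# Rough outline of the scraper's main loop (a sketch, not the verbatim script)
open_browser_to_firebase()
sign_in_to_firebase()

for user_id in get_list_of_users():
    # Load the user's filtered all-events summary page (navigation not shown),
    # then visit and scrape only the event pages that were actually triggered.
    logged_events = get_logged_events()
    check_if_logged_events_are_required(logged_events, user_id)
    store_in_spreadsheet(collected_data)   # collected_data is a module-level dict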
There were three key challenges to overcome:
  1. Getting access to user-specific data
  2. Navigating our analytics dashboard
  3. Scraping and saving data

Part I — Getting access to user-specific data

Analytics platforms provide a bird’s-eye view of your product metrics. Before we could start building our solution, we needed a way to isolate the events for each user.
Enter Firebase’s User Properties — a feature that allows us to filter our data by user characteristics. For example, you might want to check for differences in engagement between users in their 20s, and those in their 40s. In that case, age-group would be the user property we use to filter our data. Some other common examples are gender, region, and language. We stretched this feature to its limit. Tagging each user with a unique user property (a user-id that matched our database records), we were able to filter the dashboards for individual user events.
The task would be to access the dashboards, and zip around collecting the data we needed. But first, we needed to complete the simple steps of navigating to the Firebase webpage and signing in. We used Selenium, a browser automation tool, for this purpose.
# Imports and driver setup shared by the snippets in this post
import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()  # assumes a local Chrome driver

def open_browser_to_firebase():
    print('open_browser_to_firebase')
    driver.get('https://firebase.com')
    try:
        element = WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.XPATH, "//a[@data-label='Sign in button']")))
    except:
        print("an error occurred while loading the page - could not find sign-in button")

def sign_in_to_firebase():
    print('sign_in_to_firebase')
    # Log in credentials
    username = "YOUR USER NAME HERE"
    pwd = "YOUR PASSWORD HERE"
    # TODO - check if the current page has the correct URL
    sign_in_link = driver.find_element_by_xpath("//a[@data-label='Sign in button']")
    sign_in_link.click()
    time.sleep(3)

    try:
        element = WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.XPATH, "//input[@id='identifierId']")))
        input_email = driver.find_element_by_xpath("//input[@id='identifierId']")
        input_email.clear()
        input_email.send_keys(username)
    except:
        print("Could not find the input field for username - sign_in_to_firebase()")

    next_button = driver.find_element_by_xpath("//div[@role='button'][@id='identifierNext']")
    next_button.click()
    time.sleep(3)

    try:
        element = WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.XPATH, "//input[@type='password']")))
        input_pwd = driver.find_element_by_xpath("//input[@type='password']")
        input_pwd.clear()
        input_pwd.send_keys(pwd)
    except:
        print("Could not find the input field for password - sign_in_to_firebase()")
    time.sleep(1)
    next_button = driver.find_element_by_xpath("//div[@role='button'][@id='passwordNext']")
    next_button.click()
All that was left for this part was to get the list of IDs for the users whose data we needed to access. We pulled those directly from our database like so:
import mysql.connector
from mysql.connector import Error

def get_list_of_users():
    list_of_records = []
    connection = None
    try:
        connection = mysql.connector.connect(
                                             # PARAMS FOR YOUR DB HERE
                                            )
        if connection.is_connected():
            db_info = connection.get_server_info()
            print("Connected to MySQL database... MySQL Server version on ", db_info)
            sql_select_query = "SELECT id FROM users_tbl"  # QUERY THAT WORKS FOR YOUR DB STRUCT
            cursor = connection.cursor()
            cursor.execute(sql_select_query)
            records = cursor.fetchall()
            for row in records:
                list_of_records.append(int(row[0]))
            print(list_of_records)
            print(len(list_of_records))
    except Error as e:
        print("Error while connecting to MySQL", e)
    finally:
        # closing database connection
        if connection is not None and connection.is_connected():
            cursor.close()
            connection.close()
            print("MySQL connection is closed")

    return list_of_records

Part II — Navigating the dashboard

If Part I gave us the keys to the kingdom, Part II would be about making our way through it. The list of events and users we collected in Part I would serve as our map, marking out the destinations we needed to visit. To reach them, our vehicle of choice would be the URL.
Our destination page URLs revealed a pattern:
https://console.firebase.google.com/u/2/project/PROJECT_NAME/analytics/app/APP_COM/events~2Freport~2FEVENT_NAME%3Ft=1&params=_u..pageSize%253D25%2526_r..eventId%253DEVENT_NAME%2526_r..tabIndex%253D0%2526_u.dateOption%253DDATE_FILTER%2526_u..dataFilters%253DuserProperty%25253A1%252CUSER_PROPERTY%252CEQ%252CDIMENSION%252C&cs=app.m.events.detail&r=default&g=1
(The capitalised segments PROJECT_NAME, APP_COM, EVENT_NAME, DATE_FILTER, and USER_PROPERTY are the parameters we swapped in for each request.)
Knowing where to place the parameters in the link, we had quick access to all the pages we needed. All we had to do was plug in the right user-id, event name, and date.
A simple function takes care of this: 
# Dictionary with event names, and page urls  
event_page_urls = {
    # Update to match your event names and sample urls 
    'EVENT_1': 'https://console.firebase.google.com/project/...',
    'EVENT_2': 'https://console.firebase.google.com/project/...',
    'EVENT_3': 'https://console.firebase.google.com/project/...',
}

# Function to set url parameters 
def set_url_for_current_user_id(event, user_id):
    print('set_url_for_current_user_id')
    
    # If all event urls are of the same length use OR operator in if 
    if event == 'EVENT_1':
        print('EVENT_1')
        app_launch_url = event_page_urls['EVENT_1']
        print(app_launch_url)
        # Update to match your url first half/second half indices  
        first_half = app_launch_url[:180]
        second_half = app_launch_url[190:]
        user = str(user_id)   # user ids come back from the DB as ints
        url = first_half + user + second_half
        print(url)
        return url

    # If event page urls differ in length / position of parameters
    elif event == 'EVENT_2':
        print('EVENT_2')
        app_launch_url = event_page_urls['EVENT_2']
        print(app_launch_url)
        # Update to match your url first half/second half indices  
        first_half = app_launch_url[:185]
        second_half = app_launch_url[195:]
        user = str(user_id)   # user ids come back from the DB as ints
        url = first_half + user + second_half
        print(url)
        return url
    
    elif event == 'EVENT_3':
        print('EVENT_3')
        app_launch_url = event_page_urls['EVENT_3']
        print(app_launch_url)
        # Update to match your url first half/second half indices  
        first_half = app_launch_url[:190]
        second_half = app_launch_url[200:]
        user = str(user_id)   # user ids come back from the DB as ints
        url = first_half + user + second_half
        print(url)
        return url

# This is not the cleanest or most optimal code. We do not recommend writing your functions like this (with repetitions, and redundancies). Note - the temptation to copy and paste is greatest in the early hours of the morning.
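In hindsight, a cleaner approach would be to keep one template per event with a placeholder for the user id. Here is a minimal sketch, assuming a hypothetical {user_id} marker has been inserted into each stored URL by hand:

# A sketch of a template-based alternative (assumes hypothetical {user_id}
# placeholders inside each stored URL).
event_page_url_templates = {
    'EVENT_1': 'https://console.firebase.google.com/project/...{user_id}...',
    'EVENT_2': 'https://console.firebase.google.com/project/...{user_id}...',
    'EVENT_3': 'https://console.firebase.google.com/project/...{user_id}...',
}

def set_url_for_current_user_id(event, user_id):
    # str.format drops the id into the placeholder wherever it sits,
    # so differing URL lengths no longer matter.
    return event_page_url_templates[event].format(user_id=user_id)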
A thought towards optimisation:
Let’s imagine that we only wanted to track five events. If we had 1,000 users, we would visit 5,000 pages in total — five for each user. But what if two of our events were never triggered? In that case 2,000 of our trips would be wasted visiting pages that held no data. In reality, user behaviour is variable. Different users will show different levels of activity over time, and the majority of them will eventually contribute to our churn rate.
Luckily, Firebase has a summary page of recorded events. Before visiting any user’s event pages, we would first check this summary to identify which events had actually been triggered. In this case:
  • If none of the events were triggered, a single visit to the summary page saved us all five event-page visits
  • If all the events were triggered, the summary check cost us only one extra page visit for that user (six pages instead of five)
Overall, the numbers work out in our favour: in the example above, 1,000 summary checks would spare us the 2,000 wasted event-page visits, a net saving of 1,000 page loads per run.
Here’s the code:
# Dictionary with event names, and page urls  
event_page_urls = {
    # Update to match your event names and sample urls 
    'EVENT_1': 'https://console.firebase.google.com/project/...',
    'EVENT_2': 'https://console.firebase.google.com/project/...',
    'EVENT_3': 'https://console.firebase.google.com/project/...',
}

# Get list of logged events from current events page  
def get_logged_events():
    print("get_logged_events")
    logged_events = []
    driver.switch_to.default_content()
    iframes = driver.find_elements_by_tag_name("iframe")
    driver.switch_to.frame(iframes[0])
    try:
        element_present = WebDriverWait(driver, 1).until(
            EC.presence_of_all_elements_located(
                (By.XPATH, '//*[@class="first-col event-name ng-binding"]')
            )
        )
        html = driver.execute_script('return document.documentElement.innerHTML')
        soup = BeautifulSoup(html, 'html.parser')
        event_rows = soup.findAll('div', {"class": "first-col event-name ng-binding"})
        print("number of event rows found: ", len(event_rows))
        for row in event_rows:
            logged_events.append(row.text.strip())
        print(logged_events)
        print("\n")
    except:
        print("Could not find any rows on - all events page. ")

    # Always return a list (possibly empty) so callers can iterate safely
    return logged_events

        
# Filter the whole list of events to include only those required for scraping 
def check_if_logged_events_are_required(logged_events, user_id):
    print('check_if_logged_events_are_required')
    required_events = event_page_urls.keys()    # check if the logged activities are required
    for event in logged_events:
        if event in required_events:
            url = set_url_for_current_user_id(event, user_id)       # Build the page url for this event and user
            driver.get(url)          # Load the page for that metric
            time.sleep(10)           # Wait for the page to load
            scrape_data(event)
        else:
            print(event, " is not a required event. ")

Part III — Scraping and saving the data

Scraping:
We used Beautiful Soup (BS4), a library that helps programmers easily parse and analyse a webpage’s HTML, to scrape the pages. Inspecting the page helps us identify where, in the sea of HTML, our data is being held. Once we identify that (with the correct tags), we can point our scripts in the right direction to go collect it.
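As a toy illustration (the markup and the "metric-value" class here are made up), this is the basic pattern:

from bs4 import BeautifulSoup

# Toy example: parse a snippet and jump straight to a tagged element
html = '<div class="card"><span class="metric-value">42</span></div>'
soup = BeautifulSoup(html, 'html.parser')
value = soup.find('span', {'class': 'metric-value'}).text
print(value)  # -> 42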
There were two challenges here:
The first was locating the HTML elements on the page. Using the inspection tool was easy enough, but all our early test scripts to retrieve some sample data failed. You can imagine our frustration, looking at the elements in the inspector, yet having our scripts return empty-handed. What we hadn’t realised was that Firebase’s dashboards make use of iFrames — a mechanism for storing one HTML document within another (think HTML inception). The issue was that our parser could only access one document at a time (starting with the outermost), whereas the elements we were trying to reach were embedded in one of the sub-documents.
☝️ Trying to reach our data in a document, within a document.
Once this was clear, it was just a matter of switching into the right frame, and our code would run as expected. We found that each page had five iFrames, and that the content we were looking for was embedded in the first.
Here’s the code:
def get_list_of_all_frames_on_page():
    print("Looking for all iframes")
    # Poll until the page has loaded all five iframes (capped to avoid an endless loop)
    iframes = driver.find_elements_by_tag_name("iframe")
    attempts = 0
    while len(iframes) != 5 and attempts < 30:
        time.sleep(1)
        iframes = driver.find_elements_by_tag_name("iframe")
        attempts += 1
    time.sleep(2)
    print("Found ", len(iframes), " iframes")
    return iframes
The second was finding a unique HTML tag that pointed to the exact location of our data on the page. In our case, there wasn’t one. Our data was in a table where all the cells were wrapped in the same tag. The closest we could get with BS4 was to narrow down our search by pulling all the values from the table into a local array. From there, it was just a matter of trial and error before we had the index of the value(s) we needed.
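In miniature, that pattern looks something like this (hypothetical markup, and an index found by trial and error):

from bs4 import BeautifulSoup

# Hypothetical table fragment where every cell shares the same class
html = '''
<div class="cell">Event count</div><div class="cell">1,024</div>
<div class="cell">Users</div><div class="cell">87</div>
'''
soup = BeautifulSoup(html, 'html.parser')
cells = [c.text.strip() for c in soup.find_all('div', {'class': 'cell'})]
event_count = cells[1]   # index found by trial and error against the live page
print(event_count)       # -> 1,024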
Here’s the scraping function we used:
def scrape_data(event):
    ## FOR EVENTS WITHOUT PARAMETERS 
    if event == 'EVENT_1' or event == 'EVENT_2' or event == 'EVENT_3':  
        print('event is: ', event)
        driver.switch_to.default_content()
        iframes = get_list_of_all_frames_on_page()
        driver.switch_to.frame(iframes[0])
        time.sleep(3)
        try:
            element_present = WebDriverWait(driver, 1).until(
                EC.presence_of_all_elements_located(
                    (By.XPATH, '//*[@class="value ng-scope layout-row"]')
                )
            )
            html = driver.execute_script('return document.documentElement.innerHTML')
            soup = BeautifulSoup(html, 'html.parser')
            # print(soup.text[:500], "\n")
            print("number of event rows found: ",
                  len(soup.findAll('div', {"class": "value ng-scope layout-row"})))
            scraped = []
            if len(soup.findAll('div', {"class": "value ng-scope layout-row"})) != 0:
                contents_of_page = soup.findAll('div', {"class": "value ng-scope layout-row"})
                for content in range(len(contents_of_page)):
                    content = contents_of_page[content].span
                    content = content.text.strip()
                    scraped.append(content)
                print(scraped)
                print("\n")
                # Store in collected_data object
                if event == 'EVENT_1':
                    collected_data['EVENT_1'] = scraped[0] # Use the index for the data you need 
                elif event == 'EVENT_2':
                    collected_data['EVENT_2'] = scraped[0] # Edit index to requirement 
                elif event == 'EVENT_3':
                    collected_data['EVENT_3'] = scraped[0] # Edit index to requirement 
        except:
            print("Could not find the element on this page. event: ", event, "\n")
    
    ## FOR EVENTS WITH PARAMETERS 
    elif event == 'EVENT_WITH_PARAMS':   
        print('event is: ', event)
        driver.switch_to.default_content()
        iframes = get_list_of_all_frames_on_page()
        driver.switch_to.frame(iframes[0])
        time.sleep(3)
        # try:
        element_present = WebDriverWait(driver, 1).until(
            EC.presence_of_all_elements_located(
                (By.XPATH, '//*[@class="value ng-scope layout-row"]')
            )
        )
        html = driver.execute_script('return document.documentElement.innerHTML')
        soup = BeautifulSoup(html, 'html.parser')
        # print(soup.text[:500], "\n")
        print("number of event rows found: ",
              len(soup.findAll('text', {"dominant-baseline": "center"})))
        event_with_params_scraped = []
        if len(soup.findAll('text', {"dominant-baseline": "center"})) != 0:
            contents_of_page = soup.findAll('text', {"dominant-baseline": "center"})
            print("There are this many items to loop through: ", 
                  len(soup.findAll('text', {"dominant-baseline": "center"})))
            for content in range(len(contents_of_page)):
                content = contents_of_page[content]     # set the variable content to be the first item of the list
                content = content.text.strip()
                print(content)
                event_with_params_scraped.append(content)
            print(event_with_params_scraped)
            print("\n")
            if event == 'EVENT_WITH_PARAMS':
                # Keep only the required data points (drop the percentage figures)
                event_with_params_scraped = [d for d in event_with_params_scraped if '%' not in d]
                # Store that data in the global collected_data dict, two items at a time
                number_of_times_to_loop = len(event_with_params_scraped) / 2
                print("Number of times to loop = ", number_of_times_to_loop)
                for i in range(int(number_of_times_to_loop)):
                    event_param = event_with_params_scraped[:2]
                    print(event_param)
                    collected_data['EVENT_WITH_PARAMS'].append(event_param)
                    print('appended')
                    print(collected_data)
                    del event_with_params_scraped[:2]
Saving:
And finally, we just needed to save our data. For this, we decided to repurpose some code from an earlier project to store our data in an Excel sheet. We used the openpyxl library for this:
import openpyxl

def store_in_spreadsheet(collected_data):
    
    wb = openpyxl.load_workbook('test_sheet.xlsx',  # filename here 
                                data_only=True)
    sheet = wb.get_sheet_by_name('Sheet1')          # sheet name here

    row = sheet.max_row
    for i in collected_data:
        # Date col. added against user / set of events 
        if i == 'date':
            cell = sheet['A' + str(row+1)]
            cell.value = collected_data['date']

        elif i == 'EVENT_1':
            cell = sheet['B' + str(row+1)]
            cell.value = collected_data['EVENT_1']

        elif i == 'EVENT_2':
            cell = sheet['C' + str(row+1)]
            cell.value = collected_data['EVENT_2']

        elif i == 'EVENT_3':
            cell = sheet['D' + str(row+1)]
            cell.value = collected_data['EVENT_3']

        # For some events, we stored arrays, tuples, or JSON-like structures
        elif i == 'EVENT_WITH_PARAMS':
            print(len(collected_data['EVENT_WITH_PARAMS']))
            cell = sheet['E' + str(row + 1)]
            event_name_and_time = ''
            # content = ''
            for event_list in collected_data['EVENT_WITH_PARAMS']:
                print(event_list)
                content = ''
                for item_in_event_list in event_list:
                    print(item_in_event_list)
                    content = content + item_in_event_list + ' '
                event_name_and_time = event_name_and_time + content
                print(event_name_and_time)
            if len(event_name_and_time) > 0:
                cell.value = event_name_and_time

    wb.save(filename='test_sheet.xlsx')
    print('saved')

Missteps and Lessons Learned

  1. URLs beat GUI automation for navigation — we got a bit carried away with click-through automation at the start of this project (see the short sketch after this list).
  2. Never use Regex to parse HTML — you can find a more detailed, entertaining answer than I can offer here.
  3. Be ready to lose it all — one of the pages we were scraping from was updated about a week after we completed this project. Updates to a page’s structure can break your web scraper entirely.
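To make the first lesson concrete, here is a minimal contrast (a sketch; the XPaths are made up for illustration):

# Fragile: drive the console click by click (made-up XPaths)...
driver.find_element_by_xpath("//a[@data-label='Analytics']").click()
driver.find_element_by_xpath("//a[text()='EVENT_1']").click()

# ...versus jumping straight to a parameterised URL
driver.get(set_url_for_current_user_id('EVENT_1', user_id))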


So, did it work?

It fulfilled the requirements for our specific use-case, although it had its limitations.
Pros:
  1. Provided us with the data we required to run our next product experiment over the next few weeks (without upgrading).
  2. Compared with a manual process, we calculated that this script saved us around 2.8 hours per day (or $754.32/mo given the average hourly rate of a data entry clerk in the US).
Key Limitations:
  1. For a few hundred users, it would still take over an hour per day (with some variability depending on the number of users, application usage, and connection speed).
  2. The data we were able to scrape from the dashboards was still extremely limited compared with what you would get from an upgrade.
If you're interested to see how each of these pieces fits together, you can find the full project on GitHub here. 👈

Written by bari | Product Manager | Data Scientist
Published by HackerNoon on 2020/11/22