Some time ago, a few friends and I decided to build an app. We duct-taped our code together, launched our first version, then attracted a few users with a small marketing budget.

The value of good analytics at this stage of a product's lifecycle cannot be overstated. Product metrics are like the vital signs of your app. To monitor its health and growth, we decided to use Firebase Analytics.

In true start-up fashion, we rapidly looped through the development cycle. Designing small experiments at each step saw our product vitals steadily improve. But we had reached the point of diminishing returns when it came to using the analytics dashboards to inform our enhancements. Our experiments needed access to more data.

We hit a wall: Firebase's paywall, which guards access to all the data you could want about your product. Why not just pay for the upgrade? Headlines like the one below were enough to make us think twice.

And so, we had a new challenge: "How do we get a clean dataset from the analytics dashboards?" What we were after was a table containing the data of all events triggered by each user every day.

Note: this is not how you would want to store your data in most cases. In this particular instance, we decided to use this structure because it lent itself to easier analysis for our purpose.

Brainstorm clouds gathered, and we were soon engaged in a turbulent problem-solving session. Red teaming, whiteboarding, and too many power drinks later, we had reached a conclusion. We would build a small automation tool to loop through our dashboards, collecting all the data that we wanted: a web scraping solution.

At a high level, the script would sign in to the Firebase console, visit an event dashboard page for every user, scrape the values we needed, and save them to a spreadsheet. There were three key challenges to overcome:

1. Getting access to user-specific data
2. Navigating our analytics dashboard
3. Scraping and saving data

Part I: Getting access to user-specific data

Analytics platforms provide a bird's-eye view of your product metrics. Before we could start building our solution, we needed a way to filter out the events for each user.

[Enter Firebase's User Properties]: a feature that allows us to filter our data by user characteristics. For example, you might want to check for differences in engagement between users in their 20s and those in their 40s. In that case, age group would be the user property we use to filter our data. Some other common examples are gender, region, and language.

We stretched this feature to its limit. By tagging each user with a unique user property (a user-id that matched our database records), we were able to filter the dashboards for individual user events.

The task would be to access the dashboards and zip around collecting the data we needed. But first, we needed to complete the simple steps of navigating to the Firebase web page and signing in. We used Selenium, a browser automation tool, for this purpose.
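None of the snippets below stand alone; they assume a handful of imports and a shared driver object. Here is a minimal sketch of that setup, assuming Chrome and the older Selenium 3-style API that the rest of the code uses (treat it as illustrative rather than the project's exact boilerplate):

import time

import mysql.connector
from mysql.connector import Error
import openpyxl
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

# A single shared browser session used by all the functions below.
# chromedriver must be installed and available on your PATH.
driver = webdriver.Chrome()

With that in place, the sign-in code below should run as written.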
def open_browser_to_firebase():
    print('open_browser_to_firebase')
    driver.get('https://firebase.com')
    try:
        element = WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.XPATH, "//a[@data-label='Sign in button']")))
    except:
        print("an error occurred while loading the page - could not find sign-in button")


def sign_in_to_firebase():
    print('sign_in_to_firebase')
    # Log in credentials
    username = "YOUR USER NAME HERE"
    pwd = "YOUR PASSWORD HERE"
    # TODO - check if the current page has the correct URL
    sign_in_link = driver.find_element_by_xpath("//a[@data-label='Sign in button']")
    sign_in_link.click()
    time.sleep(3)
    try:
        element = WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.XPATH, "//input[@id='identifierId']")))
        input_email = driver.find_element_by_xpath("//input[@id='identifierId']")
        input_email.clear()
        input_email.send_keys(username)
    except:
        print("Could not find the input field for username - sign_in_to_firebase()")
    next_button = driver.find_element_by_xpath("//div[@role='button'][@id='identifierNext']")
    next_button.click()
    time.sleep(3)
    try:
        element = WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.XPATH, "//input[@type='password']")))
        input_pwd = driver.find_element_by_xpath("//input[@type='password']")
        input_pwd.clear()
        input_pwd.send_keys(pwd)
    except:
        print("Could not find the input field for password - sign_in_to_firebase()")
    time.sleep(1)
    next_button = driver.find_element_by_xpath("//div[@role='button'][@id='passwordNext']")
    next_button.click()

All that's left for this part is to get the list of ids for the users whose data we needed to access. We pulled those directly from our database like so:

def get_list_of_users():
    # print('get_list_of_users')
    try:
        connection = mysql.connector.connect()  # PARAMS FOR YOUR DB HERE
        if connection.is_connected():
            db_Info = connection.get_server_info()
            print("Connected to MySQL database... MySQL Server version on ", db_Info)
            sql_select_Query = "SELECT id FROM users_tbl"  # QUERY THAT WORKS FOR YOUR DB STRUCT
            cursor = connection.cursor()
            cursor.execute(sql_select_Query)
            record = cursor.fetchall()
            # print(record)
            # print(len(record))
            list_of_records = []
            for r in range(len(record)):
                x = int(record[r][0])
                list_of_records.append(x)
            print(list_of_records)
            print(len(list_of_records))
            # print("Your connected to - ", record)
    except Error as e:
        print("Error while connecting to MySQL", e)
    finally:
        # closing database connection.
        if (connection.is_connected()):
            cursor.close()
            connection.close()
            print("MySQL connection is closed")
    return list_of_records

Part II: Navigating the dashboard

If Part I gave us the keys to the kingdom, Part II would be about making our way through it. The list of events and users we collected in Part I will serve as our map, marking out the destinations that we needed to visit. To reach them, our vehicle of choice would be the URL.

Our destination page URLs revealed a pattern:

https://console.firebase.google.com/u/2/project/PROJECT_NAME/analytics/app/APP_COM/events~2Freport~2FEVENT_NAME%3Ft=1&params=_u..pageSize%253D25%2526_r..eventId%253DEVENT_NAME%2526_r..tabIndex%253D0%2526_u.dateOption%253DDATE_FILTER%2526_u..dataFilters%253DuserProperty%25253A1%252CUSER_PROPERTY%252CEQ%252CDIMENSION%252C&cs=app.m.events.detail&r=default&g=1

Knowing where to place the parameters in the link, we had quick access to all the pages we needed. All we had to do was plug in the right user-id, event name, and date.
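As an aside, before the function we actually used (below), here is a sketch of a tidier way to do the same substitution with a named placeholder instead of hard-coded slice indices. The template string and function name here are hypothetical, not part of the original project:

# Hypothetical template: the dashboard URL with the user-property value
# replaced by a {user_id} placeholder.
EVENT_URL_TEMPLATES = {
    'EVENT_1': 'https://console.firebase.google.com/project/...{user_id}...',
}

def build_event_url(event, user_id):
    # str.format drops the id into the placeholder wherever it sits in the URL,
    # so no per-event slice indices need to be maintained.
    return EVENT_URL_TEMPLATES[event].format(user_id=user_id)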
Here is the simple function we used:

# Dictionary with event names, and page urls
# Update to match your event names and sample urls
event_page_urls = {
    'EVENT_1': 'https://console.firebase.google.com/project/...',
    'EVENT_2': 'https://console.firebase.google.com/project/...',
    'EVENT_3': 'https://console.firebase.google.com/project/...',
}

# Function to set url parameters
def set_url_for_current_user_id(event, user_id):
    print('set_url_for_current_user_id')
    # If all event urls are of the same length use OR operator in if
    if event == 'EVENT_1':
        print('EVENT_1')
        app_launch_url = event_page_urls['EVENT_1']
        print(app_launch_url)
        # Update to match your url first half/second half indices
        first_half = app_launch_url[:180]
        second_half = app_launch_url[190:]
        user = str(user_id)  # make sure the id is a string before concatenation
        url = first_half + user + second_half
        print(url)
        return url
    # If event page urls differ in length / position of parameters
    elif event == 'EVENT_2':
        print('EVENT_2')
        app_launch_url = event_page_urls['EVENT_2']
        print(app_launch_url)
        # Update to match your url first half/second half indices
        first_half = app_launch_url[:185]
        second_half = app_launch_url[195:]
        user = str(user_id)
        url = first_half + user + second_half
        print(url)
        return url
    elif event == 'EVENT_3':
        print('EVENT_3')
        app_launch_url = event_page_urls['EVENT_3']
        print(app_launch_url)
        # Update to match your url first half/second half indices
        first_half = app_launch_url[:190]
        second_half = app_launch_url[200:]
        user = str(user_id)
        url = first_half + user + second_half
        print(url)
        return url

This is not the cleanest or most optimal code, and we do not recommend writing your functions like this (with repetitions and redundancies). Note: the temptation to copy and paste is greatest in the early hours of the morning.

A thought towards optimisation: let's imagine that we only wanted to track five events. If we had 1,000 users, we would visit 5,000 pages in total, five for each user. But what if two of our events were never triggered? In that case, 2,000 of our trips would be wasted visiting pages that held no data. In reality, user behaviour is variable. Different users will show different levels of activity over time, and the majority of them will eventually contribute to our churn rate.

Luckily, Firebase has a summary page of recorded events. Before visiting any user's event pages, we would first check the summary to identify which of them had been triggered. In this case:

- If none of the five events were triggered, that one summary check saves us five wasted page visits.
- If all of the events were triggered, the summary check costs us only one extra page visit for that user.

Overall, the numbers work out in our favour.
Here's the code:

# Dictionary with event names, and page urls
# Update to match your event names and sample urls
event_page_urls = {
    'EVENT_1': 'https://console.firebase.google.com/project/...',
    'EVENT_2': 'https://console.firebase.google.com/project/...',
    'EVENT_3': 'https://console.firebase.google.com/project/...',
}

# Get list of logged events from current events page
def get_logged_events():
    print("get_logged_events")
    logged_events = []
    driver.switch_to_default_content()
    iframes = driver.find_elements_by_tag_name("iframe")
    driver.switch_to.frame(iframes[0])
    try:
        element_present = WebDriverWait(driver, 1).until(
            EC.presence_of_all_elements_located(
                (By.XPATH, '//*[@class="first-col event-name ng-binding"]')
            )
        )
        html = driver.execute_script('return document.documentElement.innerHTML')
        soup = BeautifulSoup(html, 'html.parser')
        # print(soup.text[:500], "\n")
        print("number of event rows found: ",
              len(soup.findAll('div', {"class": "first-col event-name ng-binding"})))
        if len(soup.findAll('div', {"class": "first-col event-name ng-binding"})) != 0:
            contents_of_page = soup.findAll('div', {"class": "first-col event-name ng-binding"})
            for content in range(len(contents_of_page)):
                content = contents_of_page[content]
                content = content.text.strip()
                logged_events.append(content)
            print(logged_events)
            print("\n")
        return logged_events
    except:
        print("Could not find any rows on - all events page. ")

# Filter the whole list of events to include only those required for scraping
def check_if_logged_events_are_required(logged_events, user_id):
    print('check_if_logged_events_are_required')
    print(logged_events)
    required_events = event_page_urls.keys()
    # check if the logged activities are required
    for event in logged_events:
        if event in required_events:
            url = event_page_urls[event]                         # Gets the base url for that page
            url = set_url_for_current_user_id(event, user_id)    # Adds the current user's id to it
            driver.get(url)                                      # Load the page for that metric
            time.sleep(10)                                       # Wait for the page to load
            scrape_data(event)
        else:
            print(event, " is not a required event. ")

Part III: Scraping and saving the data

Scraping: We used Beautiful Soup (BS4) to scrape the pages, a library that helps programmers easily parse and analyse a webpage's HTML. Inspecting the page helps us identify where, in the sea of HTML, our data is being held. Once we identify that (with the correct tags), we can point our scripts in the right direction to go collect it.

There were two challenges here.

The first was locating the HTML elements on the page. Using the inspection tool was easy enough, but all our early test scripts to retrieve some sample data failed. You can imagine our frustration, looking at the elements in the inspector, yet having our scripts return empty-handed. What we hadn't realised was that Firebase's dashboards make use of iFrames, a mechanism for storing one HTML document within another (think HTML inception). The issue was that our parser could only access one document at a time (starting with the outermost), whereas the elements we were trying to reach were embedded in one of the sub-documents.

☝️ Trying to reach our data in a document, within a document (source).

Once this was clear, it was just a matter of switching into the right frame, and our code would run as expected. We found that each page had five iFrames, and that the content we were looking for was embedded in the first.
Here's the code:

def get_list_of_all_frames_on_page():
    print("Looking for all iframes")
    iframes = driver.find_elements_by_tag_name("iframe")
    number_of_iframes = len(iframes)
    while number_of_iframes != 5:
        iframes = driver.find_elements_by_tag_name("iframe")
        if len(iframes) == 5:
            break
        else:
            time.sleep(1)
    time.sleep(2)
    iframes = driver.find_elements_by_tag_name("iframe")
    print("Found ", number_of_iframes, " iframes")
    return iframes

The second challenge was finding a unique HTML tag that pointed to the exact location of our data on the page. In our case, there wasn't one. Our data was in a table where all the cells were wrapped in the same tag. The closest we could get with BS4 was to narrow down our search by pulling all the values from the table into a local array. From there, it was just a matter of trial and error before we had the index of the value(s) we needed.

Here's the scraping function we used:

def scrape_data(event):
    ## FOR EVENTS WITHOUT PARAMETERS
    if event == 'EVENT_1' or event == 'EVENT_2' or event == 'EVENT_3':
        print('event is: ', event)
        driver.switch_to.default_content()
        iframes = get_list_of_all_frames_on_page()
        driver.switch_to.frame(iframes[0])
        time.sleep(3)
        try:
            element_present = WebDriverWait(driver, 1).until(
                EC.presence_of_all_elements_located(
                    (By.XPATH, '//*[@class="value ng-scope layout-row"]')
                )
            )
            html = driver.execute_script('return document.documentElement.innerHTML')
            soup = BeautifulSoup(html, 'html.parser')
            # print(soup.text[:500], "\n")
            print("number of event rows found: ",
                  len(soup.findAll('div', {"class": "value ng-scope layout-row"})))
            scraped = []
            if len(soup.findAll('div', {"class": "value ng-scope layout-row"})) != 0:
                contents_of_page = soup.findAll('div', {"class": "value ng-scope layout-row"})
                for content in range(len(contents_of_page)):
                    content = contents_of_page[content].span
                    content = content.text.strip()
                    scraped.append(content)
                print(scraped)
                print("\n")
            # Store in collected_data object
            if event == 'EVENT_1':
                collected_data['EVENT_1'] = scraped[0]    # Use the index for the data you need
            elif event == 'EVENT_2':
                collected_data['EVENT_2'] = scraped[0]    # Edit index to requirement
            elif event == 'EVENT_3':
                collected_data['EVENT_3'] = scraped[0]    # Edit index to requirement
        except:
            print("Could not find the element on this page. event: ", event, "\n")

    ## FOR EVENTS WITH PARAMETERS
    elif event == 'EVENT_WITH_PARAMS':
        print('event is: ', event)
        driver.switch_to.default_content()
        iframes = get_list_of_all_frames_on_page()
        driver.switch_to.frame(iframes[0])
        time.sleep(3)
        # try:
        element_present = WebDriverWait(driver, 1).until(
            EC.presence_of_all_elements_located(
                (By.XPATH, '//*[@class="value ng-scope layout-row"]')
            )
        )
        html = driver.execute_script('return document.documentElement.innerHTML')
        soup = BeautifulSoup(html, 'html.parser')
        # print(soup.text[:500], "\n")
        print("number of event rows found: ",
              len(soup.findAll('text', {"dominant-baseline": "center"})))
        event_with_params_scraped = []
        if len(soup.findAll('text', {"dominant-baseline": "center"})) != 0:
            contents_of_page = soup.findAll('text', {"dominant-baseline": "center"})
            print("There are this many items to loop through: ",
                  len(soup.findAll('text', {"dominant-baseline": "center"})))
            for content in range(len(contents_of_page)):
                content = contents_of_page[content]
                content = content.text.strip()
                print(content)
                event_with_params_scraped.append(content)
            print(event_with_params_scraped)
            print("\n")
        if event == 'EVENT_WITH_PARAMS':
            # Filter the collected data to get only what is required
            for data_point in event_with_params_scraped:
                if '%' in data_point:
                    event_with_params_scraped.remove(data_point)
            number_of_times_to_loop = (len(event_with_params_scraped)) / 2
            print("Number of times to loop = ", number_of_times_to_loop)
            for i in range(int(number_of_times_to_loop)):
                # set the variable to be the first pair of items in the list
                event_param = event_with_params_scraped[:2]
                print(event_param)
                # Store that data in global var
                collected_data['EVENT_WITH_PARAMS'].append(event_param)
                print('appended')
                print(collected_data)
                del event_with_params_scraped[:2]
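For context, collected_data is a dict shared by the scraping and saving functions (a global in the full script). Based on the keys used above and in the saving code below, it would look roughly like this for one user and one day; the values shown here are invented for illustration:

# Illustrative contents only - the key names match scrape_data() and
# store_in_spreadsheet(); the values are made-up examples.
collected_data = {
    'date': '01 Jan 2020',            # the day this row of data covers
    'EVENT_1': '12',                  # one scraped value per simple event
    'EVENT_2': '3',
    'EVENT_3': '0',
    'EVENT_WITH_PARAMS': [            # pairs appended by scrape_data()
        ['param_value', '7'],
    ],
}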
event: " "\n" ## FOR EVENTS WITH PARAMETERS elif 'EVENT_WITH_PARAMS' 'event is: ' 0 3 # try: 1 '//*[@class="value ng-scope layout-row"]' 'return document.documentElement.innerHTML' 'html.parser' # print(soup.text[:500], "\n") "number of event rows found: " 'text' "dominant-baseline" "center" if 'text' "dominant-baseline" "center" 0 'text' "dominant-baseline" "center" "There are this many items to loop through: " 'text' "dominant-baseline" "center" for in # set the variable content to be the first item of the list "\n" if 'EVENT_WITH_PARAMS' # Filter the collected data to get only what is required for in if '%' in # Store that data in global var 2 "Number of times to loop = " for in 2 'EVENT_WITH_PARAMS' 'appended' del 2 Saving: And finally, we just needed to save our data. For this, we decided to repurpose some code from an earlier project to store our data in an excel sheet. We used the library for this: openpyxl wb = openpyxl.load_workbook( , data_only= ) sheet = wb.get_sheet_by_name( ) row = sheet.max_row i collected_data: i == : cell = sheet[ + str(row+ )] cell.value = collected_data[ ] i == : cell = sheet[ + str(row+ )] cell.value = collected_data[ ] i == : cell = sheet[ + str(row+ )] cell.value = collected_data[ ] i == : cell = sheet[ + str(row+ )] cell.value = collected_data[ ] i == : print(len(collected_data[ ])) cell = sheet[ + str(row + )] event_name_and_time = event_list collected_data[ ]: print(event_list) content = item_in_event_list event_list: print(item_in_event_list) content = content + item_in_event_list + event_name_and_time = event_name_and_time + content print(event_name_and_time) len(event_name_and_time) > : cell.value = event_name_and_time wb.save(filename= ) print( ) : def store_in_spreadsheet (collected_data) 'test_sheet.xlsx' # filename here True 'Sheet1' # sheet name here for in # Date col. added against user / set of events if 'date' 'A' 1 'date' elif 'EVENT_1' 'B' 1 'EVENT_1' elif 'EVENT_2' 'C' 1 'EVENT_2' elif 'EVENT_3' 'D' 1 'EVENT_3' # For some events, we stored arrays, tuples, or JSON-like structures elif 'EVENT_WITH_PARAMS' 'EVENT_WITH_PARAMS' 'E' 1 '' # content = '' for in 'EVENT_WITH_PARAMS' '' for in ' ' if 0 'test_sheet.xlsx' 'saved' Missteps and Lessons Learned — we got a bit carried away with automation at the start of this project. URLs beat GUI automation for navigation — you can find a more detailed, entertaining answer than I can offer . Never use Regex to parse HTML here — One of the pages we were scraping from was updated about a week after we completed this project. Updates to a page’s structure can render your web-scraper redundant. Be ready to lose it all So, did it work? It fulfilled the requirements for our specific use-case, although it had its limitations. Pros: Provided us with the data we required to run our next product experiment over the next few weeks (without upgrading). Compared with a manual process, we calculated that this script saved us around 2.8 hours per day (or $754.32/mo given the of data entry clerk in the US). average hourly rate Key Limitations: For a few hundred users, it would still take over an hour per day (with some variability considering the number of users, application usage, and the wifi connection speed). The data we were able to scrape from the dashboards is still extremely limited in comparison to what you would get from an upgrade. If you're interested to see how each of these pieces fits together, you can find the full project . 👈 on GitHub here