Some time ago, a few friends and I decided to build an app. We duct-taped our code together, launched our first version, then attracted a few users with a small marketing budget.
The value of good analytics at this stage of a product’s lifecycle cannot be overstated: product metrics are the vital signs of your app. To monitor the health and growth of ours, we decided to use Firebase Analytics.
In true start-up fashion, we looped rapidly through the development cycle, designing small experiments at each step and watching our product vitals steadily improve.
We had reached the point of diminishing returns in using the analytics dashboards to inform our enhancements. Our experiments needed access to more data. We hit a wall — Firebase’s paywall, which guards access to all the data you could want about your product.
Why not just pay for the upgrade? Headlines like the one below were enough to make us think twice.
And so, we had a new challenge:
“How do we get a clean dataset from the analytics dashboards?”
What we were after was a table recording, for every user, the events they triggered each day.
Note — this is not how you would want to store your data in most cases. In this particular instance, we chose this structure because it lent itself to easier analysis for our purpose.
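For illustration, one row of that table might look something like this (the event names and values here are hypothetical placeholders, not our real schema):

# One row per user per day; one column per event we tracked (placeholder values)
row = {
    'date': '2020-06-01',
    'user_id': 1042,                              # the id this user has in our own database
    'EVENT_1': 3,                                 # how many times the event fired that day
    'EVENT_2': 0,
    'EVENT_WITH_PARAMS': [('param_value', '3')],  # some events carry parameters too
}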
Brainstorm clouds gathered, and we were soon engaged in a turbulent problem-solving session. Red teaming, whiteboarding, and too many energy drinks later, we had reached a conclusion: we would build a small automation tool to loop through our dashboards and collect all the data we wanted — a web scraping solution.
At a high level, our script would work something like this:
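In code terms, the loop looks roughly like this (a sketch using the function names from the snippets below, not the exact script we ran):

# Rough outline only - the real script adds waits, retries, and error handling.
def run_scraper():
    open_browser_to_firebase()               # launch the browser and load the Firebase console
    sign_in_to_firebase()                    # log in with our credentials
    users = get_list_of_users()              # pull the user ids from our own database
    for user_id in users:
        logged_events = get_logged_events()  # check the summary page for events that actually fired
        # visit only the event pages we care about, scraping each one as we go
        check_if_logged_events_are_required(logged_events, user_id)
        # collected_data is the global dict the scraping functions fill in
        store_in_spreadsheet(collected_data)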
There were three key challenges to overcome:
Analytics platforms provide a bird’s-eye view of your product metrics. Before we could start building our solution, we needed a way to filter out the events for each individual user.
[Enter Firebase’s User Properties] — a feature that allows us to filter our data by user characteristics. For example, you might want to check for differences in engagement between users in their 20s, and those in their 40s. In that case, age-group would be the user property we use to filter our data. Some other common examples are gender, region, and language. We stretched this feature to its limit. Tagging each user with a unique user property (a user-id that matched our database records), we were able to filter the dashboards for individual user events.
The task would be to access the dashboards and zip around collecting the data we needed. But first, we needed to complete the simple steps of navigating to the Firebase web page and signing in. We used Selenium, a browser automation tool, for this purpose.
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()  # assumes Chrome; use the driver for your browser

def open_browser_to_firebase():
    print('open_browser_to_firebase')
    driver.get('https://firebase.com')
    try:
        # wait until the sign-in button is present, i.e. the page has loaded
        element = WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.XPATH, "//a[@data-label='Sign in button']")))
    except:
        print("an error occurred while loading the page - could not find sign-in button")
def sign_in_to_firebase():
    print('sign_in_to_firebase')
    # Log-in credentials
    username = "YOUR USER NAME HERE"
    pwd = "YOUR PASSWORD HERE"
    # TODO - check if the current page has the correct URL
    # Click the sign-in button, then walk through Google's two-step login form
    sign_in_link = driver.find_element_by_xpath("//a[@data-label='Sign in button']")
    sign_in_link.click()
    time.sleep(3)
    try:
        # Step 1: enter the email address
        element = WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.XPATH, "//input[@id='identifierId']")))
        input_email = driver.find_element_by_xpath("//input[@id='identifierId']")
        input_email.clear()
        input_email.send_keys(username)
    except:
        print("Could not find the input field for username - sign_in_to_firebase()")
    next_button = driver.find_element_by_xpath("//div[@role='button'][@id='identifierNext']")
    next_button.click()
    time.sleep(3)
    try:
        # Step 2: enter the password
        element = WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.XPATH, "//input[@type='password']")))
        input_pwd = driver.find_element_by_xpath("//input[@type='password']")
        input_pwd.clear()
        input_pwd.send_keys(pwd)
    except:
        print("Could not find the input field for password - sign_in_to_firebase()")
    time.sleep(1)
    next_button = driver.find_element_by_xpath("//div[@role='button'][@id='passwordNext']")
    next_button.click()
All that was left for this part was to get the list of IDs for the users whose data we needed. We pulled those directly from our database, like so:
import mysql.connector
from mysql.connector import Error

def get_list_of_users():
    # Pull the ids of all the users we want to scrape data for
    list_of_records = []
    connection = None
    try:
        connection = mysql.connector.connect(
            # PARAMS FOR YOUR DB HERE
        )
        if connection.is_connected():
            db_info = connection.get_server_info()
            print("Connected to MySQL database... MySQL Server version on ", db_info)
            sql_select_query = "SELECT id FROM users_tbl"  # QUERY THAT WORKS FOR YOUR DB STRUCT
            cursor = connection.cursor()
            cursor.execute(sql_select_query)
            records = cursor.fetchall()
            for record in records:
                list_of_records.append(int(record[0]))
            print(list_of_records)
            print(len(list_of_records))
    except Error as e:
        print("Error while connecting to MySQL", e)
    finally:
        # closing database connection
        if connection is not None and connection.is_connected():
            cursor.close()
            connection.close()
            print("MySQL connection is closed")
    return list_of_records
If Part I gave us the keys to the kingdom, Part II would be about making our way through it. The list of events and users we collected in Part I would serve as our map, marking out the destinations we needed to visit. To reach them, our vehicle of choice would be the URL.
Our destination page URLs revealed a pattern:
https://console.firebase.google.com/u/2/project/PROJECT_NAME/analytics/app/APP_COM/events~2Freport~2FEVENT_NAME
%3Ft=1&params=_u..pageSize%253D25%2526_r..eventId%253DEVENT_NAME%2526_r..tabIndex%253D0
%2526_u.dateOption%253DDATE_FILTER%2526_u..dataFilters%253DuserProperty%25253A1%252CUSER_PROPERTY%252CEQ%252CDIMENSION%252C
&cs=app.m.events.detail&r=default&g=1
(The placeholders in caps — PROJECT_NAME, APP_COM, EVENT_NAME, DATE_FILTER, and USER_PROPERTY — are the parameters we swap in for each request.)
Knowing where to place the parameters in the link, we had quick access to all the pages we needed. All we had to do was plug in the right user-id, event name, and date.
A simple function takes care of this:
# Dictionary with event names, and page urls
event_page_urls = {
    # Update to match your event names and sample urls
    'EVENT_1': 'https://console.firebase.google.com/project/...',
    'EVENT_2': 'https://console.firebase.google.com/project/...',
    'EVENT_3': 'https://console.firebase.google.com/project/...',
}

# Function to set url parameters
def set_url_for_current_user_id(event, user_id):
    print('set_url_for_current_user_id')
    # If all event urls are of the same length use OR operator in if
    if event == 'EVENT_1':
        print('EVENT_1')
        app_launch_url = event_page_urls['EVENT_1']
        print(app_launch_url)
        # Update to match your url first half/second half indices
        first_half = app_launch_url[:180]
        second_half = app_launch_url[190:]
        user = str(user_id)  # user ids come out of the database as ints
        url = first_half + user + second_half
        print(url)
        return url
    # If event page urls differ in length / position of parameters
    elif event == 'EVENT_2':
        print('EVENT_2')
        app_launch_url = event_page_urls['EVENT_2']
        print(app_launch_url)
        # Update to match your url first half/second half indices
        first_half = app_launch_url[:185]
        second_half = app_launch_url[195:]
        user = str(user_id)
        url = first_half + user + second_half
        print(url)
        return url
    elif event == 'EVENT_3':
        print('EVENT_3')
        app_launch_url = event_page_urls['EVENT_3']
        print(app_launch_url)
        # Update to match your url first half/second half indices
        first_half = app_launch_url[:190]
        second_half = app_launch_url[200:]
        user = str(user_id)
        url = first_half + user + second_half
        print(url)
        return url
This is not the cleanest or most optimal code, and we do not recommend writing your functions like this (with repetitions and redundancies). Note — the temptation to copy and paste is greatest in the early hours of the morning.
A thought towards optimisation:
Let’s imagine that we only wanted to track five events. If we had 1,000 users, we would visit 5,000 pages in total — five for each user. But what if two of our events were never triggered? In that case, 2,000 of our trips would be wasted visiting pages that held no data. In reality, user behaviour is variable. Different users will show different levels of activity over time, and the majority of them will eventually contribute to our churn rate.
Luckily, Firebase has a summary page of recorded events. Before visiting any user’s event pages, we would first check this summary to identify which of them had actually been triggered.
Overall, the numbers work out in our favour.
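A quick back-of-envelope check with the hypothetical figures above shows why:

# Hypothetical figures from the example above
users = 1000
events_tracked = 5
events_never_triggered = 2

naive_visits = users * events_tracked                          # 5,000 event pages
wasted_visits = users * events_never_triggered                 # 2,000 of them hold no data
# one extra summary-page visit per user, then only the event pages that have data
visits_with_summary_check = users + users * (events_tracked - events_never_triggered)

print(naive_visits, wasted_visits, visits_with_summary_check)  # 5000 2000 4000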
Here’s the code for the summary-page check:
from bs4 import BeautifulSoup

# Dictionary with event names, and page urls
event_page_urls = {
    # Update to match your event names and sample urls
    'EVENT_1': 'https://console.firebase.google.com/project/...',
    'EVENT_2': 'https://console.firebase.google.com/project/...',
    'EVENT_3': 'https://console.firebase.google.com/project/...',
}

# Get list of logged events from current events page
def get_logged_events():
    print("get_logged_events")
    logged_events = []
    driver.switch_to.default_content()
    iframes = driver.find_elements_by_tag_name("iframe")
    driver.switch_to.frame(iframes[0])
    try:
        element_present = WebDriverWait(driver, 1).until(
            EC.presence_of_all_elements_located(
                (By.XPATH, '//*[@class="first-col event-name ng-binding"]')
            )
        )
        html = driver.execute_script('return document.documentElement.innerHTML')
        soup = BeautifulSoup(html, 'html.parser')
        event_rows = soup.findAll('div', {"class": "first-col event-name ng-binding"})
        print("number of event rows found: ", len(event_rows))
        for row in event_rows:
            logged_events.append(row.text.strip())
        print(logged_events)
        print("\n")
    except:
        print("Could not find any rows on - all events page. ")
    return logged_events  # empty if nothing was found

# Filter the whole list of events to include only those required for scraping
def check_if_logged_events_are_required(logged_events, user_id):
    print('check_if_logged_events_are_required')
    required_events = event_page_urls.keys()  # check if the logged activities are required
    for event in logged_events:
        if event in required_events:
            url = set_url_for_current_user_id(event, user_id)  # Build the url for the current user's id
            driver.get(url)  # Load the page for that metric
            time.sleep(10)  # Wait for the page to load
            scrape_data(event)
        else:
            print(event, " is not a required event. ")
Scraping:
We used Beautiful Soup (BS4) to scrape the pages — a library that helps programmers easily parse and analyse a webpage’s HTML. Inspecting the page helps us identify where, in the sea of HTML, our data is being held. Once we identify that (with the correct tags), we can point our scripts in the right direction to go collect it.
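As a minimal illustration of the idea (with made-up markup, not Firebase’s actual HTML):

from bs4 import BeautifulSoup

# Made-up markup standing in for a dashboard fragment
html = """
<div class="metric"><span>Active users</span><span>42</span></div>
<div class="metric"><span>Sessions</span><span>108</span></div>
"""

soup = BeautifulSoup(html, 'html.parser')
for row in soup.find_all('div', {'class': 'metric'}):
    label, value = [span.text for span in row.find_all('span')]
    print(label, value)  # e.g. "Active users 42"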
There were two challenges here:
The first was locating the HTML elements on the page. Using the inspection tool was easy enough, but all our early test scripts to retrieve some sample data failed. You can imagine our frustration, looking at the elements in the inspector, yet having our scripts return empty-handed. What we hadn’t realised was that Firebase’s dashboards make use of iFrames — a mechanism for embedding one HTML document within another (think HTML inception). The issue was that our parser could only access one document at a time (starting with the outermost), whereas the elements we were trying to reach were embedded in one of the sub-documents.
☝️ Trying to reach our data in a document, within a document (source).
Once this was clear, it was just a matter of switching into the right frame, and our code would run as expected. We found that each page had five iFrames, and that the content we were looking for was embedded in the first.
Here’s the code:
def get_list_of_all_frames_on_page():
    print("Looking for all iframes")
    iframes = driver.find_elements_by_tag_name("iframe")
    number_of_iframes = len(iframes)
    # Each dashboard page holds five iframes - keep polling until they have all loaded
    while number_of_iframes != 5:
        time.sleep(1)
        iframes = driver.find_elements_by_tag_name("iframe")
        number_of_iframes = len(iframes)
    time.sleep(2)
    iframes = driver.find_elements_by_tag_name("iframe")
    print("Found ", number_of_iframes, " iframes")
    return iframes
The second was finding a unique HTML tag that pointed to the exact location of our data on the page. In our case, there wasn’t one. Our data was in a table where all the cells were wrapped in the same tag. The closest we could get with BS4 was to narrow down our search by pulling all the values from the table into a local array. From there, it was just a matter of trial and error before we had the index of the value(s) we needed.
Here’s the scraping function we used:
# Global store for the values scraped for the current user (assumed structure)
collected_data = {'date': '', 'EVENT_1': '', 'EVENT_2': '', 'EVENT_3': '', 'EVENT_WITH_PARAMS': []}

def scrape_data(event):
    ## FOR EVENTS WITHOUT PARAMETERS
    if event == 'EVENT_1' or event == 'EVENT_2' or event == 'EVENT_3':
        print('event is: ', event)
        driver.switch_to.default_content()
        iframes = get_list_of_all_frames_on_page()
        driver.switch_to.frame(iframes[0])
        time.sleep(3)
        try:
            element_present = WebDriverWait(driver, 1).until(
                EC.presence_of_all_elements_located(
                    (By.XPATH, '//*[@class="value ng-scope layout-row"]')
                )
            )
            html = driver.execute_script('return document.documentElement.innerHTML')
            soup = BeautifulSoup(html, 'html.parser')
            value_cells = soup.findAll('div', {"class": "value ng-scope layout-row"})
            print("number of event rows found: ", len(value_cells))
            scraped = []
            for cell in value_cells:
                scraped.append(cell.span.text.strip())
            print(scraped)
            print("\n")
            # Store in collected_data object
            if event == 'EVENT_1':
                collected_data['EVENT_1'] = scraped[0]  # Use the index for the data you need
            elif event == 'EVENT_2':
                collected_data['EVENT_2'] = scraped[0]  # Edit index to requirement
            elif event == 'EVENT_3':
                collected_data['EVENT_3'] = scraped[0]  # Edit index to requirement
        except:
            print("Could not find the element on this page. event: ", event, "\n")

    ## FOR EVENTS WITH PARAMETERS
    elif event == 'EVENT_WITH_PARAMS':
        print('event is: ', event)
        driver.switch_to.default_content()
        iframes = get_list_of_all_frames_on_page()
        driver.switch_to.frame(iframes[0])
        time.sleep(3)
        element_present = WebDriverWait(driver, 1).until(
            EC.presence_of_all_elements_located(
                (By.XPATH, '//*[@class="value ng-scope layout-row"]')
            )
        )
        html = driver.execute_script('return document.documentElement.innerHTML')
        soup = BeautifulSoup(html, 'html.parser')
        param_cells = soup.findAll('text', {"dominant-baseline": "center"})
        print("number of event rows found: ", len(param_cells))
        event_with_params_scraped = []
        for cell in param_cells:
            content = cell.text.strip()
            print(content)
            event_with_params_scraped.append(content)
        print(event_with_params_scraped)
        print("\n")
        # Filter the collected data to get only what is required (drop the '%' entries)
        event_with_params_scraped = [d for d in event_with_params_scraped if '%' not in d]
        # Store the remaining values in the global collected_data object, two at a time
        number_of_times_to_loop = len(event_with_params_scraped) / 2
        print("Number of times to loop = ", number_of_times_to_loop)
        for i in range(int(number_of_times_to_loop)):
            event_param = event_with_params_scraped[:2]
            print(event_param)
            collected_data['EVENT_WITH_PARAMS'].append(event_param)
            print('appended')
            print(collected_data)
            del event_with_params_scraped[:2]
Saving:
And finally, we just needed to save our data. For this, we repurposed some code from an earlier project to store our data in an Excel sheet, using the openpyxl library:
import openpyxl

def store_in_spreadsheet(collected_data):
    wb = openpyxl.load_workbook('test_sheet.xlsx',  # filename here
                                data_only=True)
    sheet = wb['Sheet1']  # sheet name here
    row = sheet.max_row
    for i in collected_data:
        # Date col. added against user / set of events
        if i == 'date':
            cell = sheet['A' + str(row + 1)]
            cell.value = collected_data['date']
        elif i == 'EVENT_1':
            cell = sheet['B' + str(row + 1)]
            cell.value = collected_data['EVENT_1']
        elif i == 'EVENT_2':
            cell = sheet['C' + str(row + 1)]
            cell.value = collected_data['EVENT_2']
        elif i == 'EVENT_3':
            cell = sheet['D' + str(row + 1)]
            cell.value = collected_data['EVENT_3']
        # For some events, we stored arrays, tuples, or JSON-like structures
        elif i == 'EVENT_WITH_PARAMS':
            print(len(collected_data['EVENT_WITH_PARAMS']))
            cell = sheet['E' + str(row + 1)]
            event_name_and_time = ''
            for event_list in collected_data['EVENT_WITH_PARAMS']:
                print(event_list)
                content = ''
                for item_in_event_list in event_list:
                    print(item_in_event_list)
                    content = content + item_in_event_list + ' '
                event_name_and_time = event_name_and_time + content
                print(event_name_and_time)
            if len(event_name_and_time) > 0:
                cell.value = event_name_and_time
    wb.save(filename='test_sheet.xlsx')
    print('saved')
It fulfilled the requirements for our specific use-case, although it had its limitations.
Pros:
Key Limitations:
If you're interested in seeing how each of these pieces fits together, you can find the full project on GitHub here. 👈