
Python Challenge Anyone?

by Jakub Szafran, December 11th, 2023

Too Long; Didn't Read

The Python exercise focuses on identifying missing dates in the JustJoin.it job market dataset fetched via AWS Lambda. The solution utilizes sets, date manipulation, and file traversal to pinpoint the gaps, ensuring a comprehensive and reliable dataset.



Python Exercise

Let’s say you've come up with a great side-project idea - analyzing information about the IT job market in Poland!


While browsing the JustJoin.it job board, you noticed that all job offers are served as JSON through an HTTP API. Since it's publicly available, you decided to create a small application to fetch & store this data.


You decide to go with AWS Cloud for hosting your app.


Your workload is a short Lambda function written in Python, which fetches the data from the job offers API endpoint and persists the JSON into an S3 bucket. It runs on a daily schedule (via an AWS EventBridge trigger). Each successful run of the function creates a new object in S3, following the s3://some-s3-bucket-name/justjoinit-data/<year>/<month>/<day>.json naming convention.
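The key naming can be captured in a small helper like the sketch below (the function name is an illustrative assumption, not code from the actual project):

```python
from datetime import date

def object_key(day: date) -> str:
    """Build the S3 key for a given day, following the
    justjoinit-data/<year>/<month>/<day>.json convention."""
    # %m and %d zero-pad the month and day, matching the directory layout
    return f"justjoinit-data/{day:%Y/%m/%d}.json"

print(object_key(date(2023, 12, 11)))  # justjoinit-data/2023/12/11.json
```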


You quickly test it and everything seems to be fine. Then you deploy resources to your AWS account and forget about the whole thing for a long time.


Recently, you decided to revive this project and try to extract something meaningful from the data. You quickly realize there are gaps in the data (some days are missing). It turns out you were so confident in your code that you did not include any retry logic in case of an HTTP request failure. Shame on you!


Your task

Clone the challenges blueprints repository and navigate to 0005_justjoinit_data_finding_the_gaps directory.


It contains a directory called justjoinit_data which mimics the structure of the original S3 bucket with raw data - each year of data is a separate directory containing subdirectories for months (and each month directory contains multiple JSON files, each representing a single day of data).


Here's the output of the tree command on this directory:

justjoinit_data
├── 2021
│   ├── 10
│   │   ├── 23.json
│   │   ├── 24.json
│   │   ├── 25.json
│   │   ├── 26.json
│   │   ├── 27.json
│   │   ├── 28.json
│   │   ├── 29.json
│   │   ├── 30.json
│   │   └── 31.json
│   ├── 11
│   │   ├── 01.json
│   │   ├── 02.json
│   │   ├── 03.json

...


Your task is to find out which dates (JSON files) are missing from the justjoinit_data directory (these would be the days when our small AWS job failed for some reason).


Put your logic into find_missing_dates function (inside missing_dates.py file). Missing dates should be returned as a string of dates joined by a comma and a space character. If 2021-01-01, 2021-03-05 and 2022-05-10 were the missing dates, the result string would look like the following:

"2021-01-01, 2021-03-05, 2022-05-10"


You can assume that directories will always be named after a valid month (1 <= month <= 12) or day (1 <= day <= 31) and days within specific months are correct (for example there are no dates like February 31st).


You can use the test from the test_missing_dates.py file to check if your solution is correct. Run one of the commands below (while in the 0005_justjoinit_data_finding_the_gaps directory) to run the test suite:

python -m unittest


or

python test_missing_dates.py


P.S. I plan to share this JustJoin.it job offers dataset publicly (probably on Kaggle). Once this is done, I'll update this page and provide the link to the dataset.


Sample solution

Note: you'll find the detailed explanation of the solution below the code snippet.

import pathlib
from datetime import date, timedelta


def find_missing_dates(input_directory: pathlib.Path) -> str:
    dates_from_disk = set()
    for file in input_directory.glob("**/*.json"):
        *_, year, month, day = file.parts

        dates_from_disk.add(
            date(
                year=int(year),
                month=int(month),
                day=int(day.replace(".json", "")),
            )
        )

    start_date = min(dates_from_disk)
    end_date = max(dates_from_disk)
    difference_in_days = (end_date - start_date).days
    expected_dates = {start_date + timedelta(days=i) for i in range(difference_in_days + 1)}

    missing_dates = expected_dates - dates_from_disk

    return ", ".join(x.strftime("%Y-%m-%d") for x in sorted(missing_dates))



Our solution for this challenge leverages sets and the operations they provide (set difference). Steps we'll take:


  • create a set of all dates existing within the justjoinit_data directory (the dates_from_disk set)
  • calculate the earliest and latest dates from the dates_from_disk set
  • create a set of expected dates, expected_dates, containing all dates in the range between the earliest and latest dates calculated in the previous step
  • calculate the difference between expected_dates and dates_from_disk (dates existing in expected_dates but missing from dates_from_disk)
  • sort the dates chronologically and transform them to conform to the expected string format


We start with defining an empty set dates_from_disk.

The glob pattern **/*.json lets us iterate over all files with the .json extension (** means traversing the justjoinit_data directory and all its subdirectories recursively).
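As a quick, self-contained illustration of the pattern (the tree below is built in a temporary directory, not the real dataset):

```python
import pathlib
import tempfile

# Build a tiny directory tree mimicking the dataset layout
# (the file names here are illustrative).
root = pathlib.Path(tempfile.mkdtemp()) / "justjoinit_data"
for rel in ["2021/10/23.json", "2021/10/24.json", "2021/11/01.json"]:
    path = root / rel
    path.parent.mkdir(parents=True, exist_ok=True)
    path.touch()

# **/*.json recursively matches every .json file under root
found = sorted(p.relative_to(root).as_posix() for p in root.glob("**/*.json"))
print(found)  # ['2021/10/23.json', '2021/10/24.json', '2021/11/01.json']
```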


To extract year, month, and day info, we leverage path’s .parts attribute - a tuple containing the individual components of the path:

>>> pathlib.Path("/some/path/justjoinit_data/2022/10/01.json").parts
('/', 'some', 'path', 'justjoinit_data', '2022', '10', '01.json')


Tuple unpacking lets us conveniently capture the year, month, and day variables in a single line. Every part that comes before the year is captured in the _ variable (a Pythonic way of saying you don't care about a value). We combine it with the asterisk *, so _ collects all the leading parts into a list.


>>> *_, year, month, day = file.parts
>>> _
['/', 'some', 'path', 'justjoinit_data']
>>> year
'2022'
>>> month
'10'
>>> day
'01.json'


After a small cleanup (removing .json suffix from day and converting day, month, year to integers), we're able to construct a valid date object and add it to dates_from_disk set:

        dates_from_disk.add(
            date(
                year=int(year),
                month=int(month),
                day=int(day.replace(".json", "")),
            )
        )



After the for loop is done, dates_from_disk contains all the dates existing in the justjoinit_data directory.


We use the built-in min and max functions to calculate the earliest and latest dates. From these we derive a helper variable called difference_in_days, which is later used for generating the range of expected dates between start_date and end_date:


expected_dates = {start_date + timedelta(days=i) for i in range(difference_in_days + 1)}
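For example, a two-day span between the earliest and latest dates expands into three expected dates (the dates below are illustrative):

```python
from datetime import date, timedelta

start_date, end_date = date(2021, 10, 23), date(2021, 10, 25)
difference_in_days = (end_date - start_date).days  # 2

# range(3) yields offsets 0, 1, 2 - both endpoints are included
expected_dates = {start_date + timedelta(days=i) for i in range(difference_in_days + 1)}
print(sorted(expected_dates))
# [datetime.date(2021, 10, 23), datetime.date(2021, 10, 24), datetime.date(2021, 10, 25)]
```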


To find the missing dates within justjoinit_data directory, we simply calculate the difference between expected_dates and dates_from_disk:

missing_dates = expected_dates - dates_from_disk
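A toy example of this set difference (the dates are illustrative):

```python
from datetime import date

expected = {date(2021, 10, 23), date(2021, 10, 24), date(2021, 10, 25)}
on_disk = {date(2021, 10, 23), date(2021, 10, 25)}

# dates present in expected but absent from on_disk
missing = expected - on_disk
print(missing)  # {datetime.date(2021, 10, 24)}
```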


The last thing we do is sort the dates (sorted(missing_dates)), transform each of them to a string with the .strftime("%Y-%m-%d") method, and join them with ", " (so the result matches the expected format from the task's description).
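Run on the example dates from the task description, this final formatting step produces exactly the expected string:

```python
from datetime import date

missing_dates = {date(2022, 5, 10), date(2021, 1, 1), date(2021, 3, 5)}

# sorted() orders the dates chronologically before formatting
result = ", ".join(d.strftime("%Y-%m-%d") for d in sorted(missing_dates))
print(result)  # 2021-01-01, 2021-03-05, 2022-05-10
```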

Summary

I hope you enjoyed this little exercise. I encourage you to check out Quest Of Python for more :-). Let me know your approach to this exercise!


Also published here.