Let’s say you've come up with a great side-project idea - analyzing information about the IT job market in Poland!
While browsing the JustJoin.it job board, you noticed that all job offers are served as JSON through an HTTP API. Since it's publicly available, you decided to create a small application to fetch & store this data.
You decide to go with AWS Cloud for hosting your app.
Your workload includes a short Lambda function written in Python (which fetches the data from the job offers API endpoint and persists JSON data into the S3 bucket) which is executed on a daily schedule (through AWS EventBridge trigger). Each successful run of the function creates a new object in S3, following s3://some-s3-bucket-name/justjoinit-data/<year>/<month>/<day>.json
naming convention.
You quickly test it and everything seems to be fine. Then you deploy resources to your AWS account and forget about the whole thing for a long time.
Recently you decided to revive this project and try to extract something meaningful from this data. You quickly realize there are gaps in the data (some days are missing). Turns out that you were so confident about your code that you did not include any retry in case of HTTP request failure. Shame on you!
Clone the challenges blueprints repository and navigate to 0005_justjoinit_data_finding_the_gaps
directory.
It contains a directory called justjoinit_data
which is supposed to mimic the structure of the original S3 bucket with raw data - each year of data is a separate directory containing subdirectories with months (and each month directory contains multiple JSON files representing a single day of data).
Here's an output of tree
command on this directory:
justjoinit_data
├── 2021
│ ├── 10
│ │ ├── 23.json
│ │ ├── 24.json
│ │ ├── 25.json
│ │ ├── 26.json
│ │ ├── 27.json
│ │ ├── 28.json
│ │ ├── 29.json
│ │ ├── 30.json
│ │ └── 31.json
│ ├── 11
│ │ ├── 01.json
│ │ ├── 02.json
│ │ ├── 03.json
...
Your task is to find out which dates (JSON files) are missing from justjoinit_directory
directory (which would be days when our small AWS job failed due to some reason).
Put your logic into find_missing_dates
function (inside missing_dates.py
file). Missing dates should be returned as a string of dates joined by a comma and a space character. If 2021-01-01
, 2021-03-05
and 2022-05-10
were the missing dates, the result string would look like the following:
"2021-01-01, 2021-03-05, 2022-05-10"
You can assume that directories will always be named after a valid month (1 <= month <= 12
) or day (1 <= day <= 31
) and days within specific months are correct (for example there are no dates like February 31st).
You can use a test from test_missing_dates.py
file to check if your solution is correct. Run the below command (while in 0005_justjoinit_data_finding_the_gaps
the directory) to run the test suite:
python -m unittest
or
python test_missing_dates.py
P.S. I plan to share this JustJoin.it job offers dataset publicly (probably on Kaggle). Once this is done, I'll update this page and provide the link to the dataset.
Note: you'll find the detailed explanation of the solution below the code snippet.
import pathlib
from datetime import date, timedelta
def find_missing_dates(input_directory: pathlib.Path):
dates_from_disk = set()
for file in input_directory.glob("**/*.json"):
*_, year, month, day = file.parts
dates_from_disk.add(
date(
year=int(year),
month=int(month),
day=int(day.replace(".json", "")),
)
)
start_date = min(dates_from_disk)
end_date = max(dates_from_disk)
difference_in_days = (end_date - start_date).days
expected_dates = {start_date + timedelta(days=i) for i in range(difference_in_days + 1)}
missing_dates = expected_dates - dates_from_disk
return ", ".join(x.strftime("%Y-%m-%d") for x in sorted(missing_dates))
Our solution for this challenge will leverage sets and operations they provide (sets difference). Steps we'll take:
justjoinit_data
directory (set dates_from_disk
)dates_from_disk
setexpected_dates
containing all dates from range between earliest and latest dates calculated in previous stepexpected_dates
and dates_from_disk
(dates existing in expected_dates
but missing from dates_from_disk
)
We start with defining an empty set dates_from_disk
.
Glob pattern **/*.json
allows us to iterate over all files with .json
extension (**
means traversing justjoinit_data
directory and all its subdirectories
To extract year, month, and day info, we leverage path’s .parts
attribute - a tuple containing the individual components of the path:
>>> pathlib.Path("/some/path/justjoinit_data/2022/10/01.json").parts
('/', 'some', 'path', 'justjoinit_data', '2022', '10', '01.json')
Tuple unpacking lets us conveniently capture year, month, and day variables within a single line. Every part that comes before the year part is captured within _
variable (it's a Pythonic way of saying that you don't care about something). We also combine it with asterisk *
, which means that _
variable can hold multiple elements.
*_, year, month, day = file.parts
>>> year
'2022'
>>> month
'10'
>>> day
'01.json'
>>> _
['/', 'some', 'path', 'justjoinit_data']
After a small cleanup (removing .json
suffix from day
and converting day
, month
, year
to integers), we're able to construct a valid date
object and add it to dates_from_disk
set:
dates_from_disk.add(
date(
year=int(year),
month=int(month),
day=int(day.replace(".json", "")),
)
)
After for loop is done, dates_from_disk
contains all the dates existing in the justjoinit_data
directory.
We use built-in min
and max
functions to calculate the earliest and latest date. We use these dates to calculate a helper variable called difference_in_days
, which is later used for generating a range of expected dates between start_date
and end_date
:
expected_dates = {start_date + timedelta(days=i) for i in range(difference_in_days + 1)}
To find the missing dates within justjoinit_data
directory, we simply calculate the difference between expected_dates
and dates_from_disk
:
missing_dates = expected_dates - dates_from_disk
The last thing we do is sort the dates (sorted(missing_dates)
, transforming them to strings with .strftime("%Y-%m-%d")
string method and joining with ", "
string (so it matches the expected format from the task's description).
I hope you enjoyed this little exercise. I encourage you to check Quest Of Python for more :-). Let me know your approach to this exercise!
Also published here.