# Python Exercise

Let's say you've come up with a great side-project idea: analyzing information about the IT job market in Poland! While browsing the JustJoin.it job board, you noticed that all job offers are served as JSON through an HTTP API. Since it's publicly available, you decided to create a small application to fetch & store this data.

You decide to go with AWS Cloud for hosting your app. Your workload includes a short Lambda function written in Python (which fetches the data from the job offers API endpoint and persists the JSON data into an S3 bucket), executed on a daily schedule through an AWS EventBridge trigger (a rough sketch of what such a function might look like appears after the task description below). Each successful run of the function creates a new object in S3, following the `s3://some-s3-bucket-name/justjoinit-data/<year>/<month>/<day>.json` naming convention.

You quickly test it and everything seems to be fine. Then you deploy the resources to your AWS account and forget about the whole thing for a long time.

Recently you decided to revive this project and try to extract something meaningful from this data. You quickly realize there are gaps in the data (some days are missing). It turns out that you were so confident about your code that you did not include any retry in case of HTTP request failure. Shame on you!

## Your task

Clone the challenges blueprints repository and navigate to the `0005_justjoinit_data_finding_the_gaps` directory.

It contains a directory called `justjoinit_data` which is supposed to mimic the structure of the original S3 bucket with raw data: each year of data is a separate directory containing subdirectories with months (and each month directory contains multiple JSON files, each representing a single day of data).

Here's the output of the `tree` command on this directory:

```
justjoinit_data
├── 2021
│   ├── 10
│   │   ├── 23.json
│   │   ├── 24.json
│   │   ├── 25.json
│   │   ├── 26.json
│   │   ├── 27.json
│   │   ├── 28.json
│   │   ├── 29.json
│   │   ├── 30.json
│   │   └── 31.json
│   ├── 11
│   │   ├── 01.json
│   │   ├── 02.json
│   │   ├── 03.json
...
```

Your task is to find out which dates (JSON files) are missing from the `justjoinit_data` directory (these would be the days when our small AWS job failed for some reason).

Put your logic into the `find_missing_dates` function (inside the `missing_dates.py` file). Missing dates should be returned as a string of dates joined by a comma and a space character. If `2021-01-01`, `2021-03-05`, and `2022-05-10` were the missing dates, the result string would look like the following:

```
"2021-01-01, 2021-03-05, 2022-05-10"
```

You can assume that directories will always be named after a valid month (`1 <= month <= 12`) or day (`1 <= day <= 31`) and that days within specific months are correct (for example, there are no dates like February 31st).

You can use the test from the `test_missing_dates.py` file to check if your solution is correct. Run one of the commands below (while in the `0005_justjoinit_data_finding_the_gaps` directory) to run the test suite:

```
python -m unittest
```

or

```
python test_missing_dates.py
```

P.S. I plan to share this JustJoin.it job offers dataset publicly (probably on Kaggle). Once this is done, I'll update this page and provide the link to the dataset.
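For reference, the daily data-collection function described at the beginning might look roughly like the snippet below. This is only a minimal sketch: the endpoint URL and bucket name are placeholders rather than the real ones, and the original code is not part of the challenge. Note the single, unguarded HTTP request, which is exactly why some days ended up missing.

```python
import json
import urllib.request
from datetime import date

import boto3  # AWS SDK, available in the Lambda runtime

OFFERS_URL = "https://example.com/job-offers"  # placeholder, not the real JustJoin.it endpoint
BUCKET = "some-s3-bucket-name"  # placeholder bucket name


def handler(event, context):
    # A single HTTP request with no retry: if it fails, that day's file is never written.
    with urllib.request.urlopen(OFFERS_URL) as response:
        offers = json.load(response)

    # Persist the payload under the justjoinit-data/<year>/<month>/<day>.json key.
    today = date.today()
    key = f"justjoinit-data/{today:%Y}/{today:%m}/{today:%d}.json"
    boto3.client("s3").put_object(Bucket=BUCKET, Key=key, Body=json.dumps(offers))
```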
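Before moving on to the sample solution, here is roughly how the finished function is meant to be called. This is a minimal sketch that assumes only the `find_missing_dates` signature and the directory layout described above; the bundled test does the real checking.

```python
import pathlib

from missing_dates import find_missing_dates

# Point the function at the challenge's data directory; it returns the missing
# dates as a single comma-and-space-separated string, e.g. "2021-01-01, 2021-03-05".
gaps = find_missing_dates(pathlib.Path("justjoinit_data"))
print(gaps)
```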
## Sample solution

Note: you'll find the detailed explanation of the solution below the code snippet.

```python
import pathlib
from datetime import date, timedelta


def find_missing_dates(input_directory: pathlib.Path):
    dates_from_disk = set()
    for file in input_directory.glob("**/*.json"):
        *_, year, month, day = file.parts
        dates_from_disk.add(
            date(
                year=int(year),
                month=int(month),
                day=int(day.replace(".json", "")),
            )
        )

    start_date = min(dates_from_disk)
    end_date = max(dates_from_disk)
    difference_in_days = (end_date - start_date).days
    expected_dates = {start_date + timedelta(days=i) for i in range(difference_in_days + 1)}
    missing_dates = expected_dates - dates_from_disk

    return ", ".join(x.strftime("%Y-%m-%d") for x in sorted(missing_dates))
```

Our solution for this challenge leverages sets and the operations they provide (set difference). Steps we'll take:

1. Create a set of all dates existing within the `justjoinit_data` directory (the `dates_from_disk` set).
2. Calculate the earliest and latest date from the `dates_from_disk` set.
3. Create a set of expected dates (`expected_dates`) containing all dates from the range between the earliest and latest dates calculated in the previous step.
4. Calculate the difference between `expected_dates` and `dates_from_disk` (dates existing in `expected_dates` but missing from `dates_from_disk`).
5. Sort the dates chronologically and transform them to conform to the expected string format.

We start by defining an empty set, `dates_from_disk`.

The `**/*.json` glob pattern allows us to iterate over all files with the `.json` extension (`**` means traversing the `justjoinit_data` directory and all its subdirectories recursively).

To extract the year, month, and day info, we leverage the path's `.parts` attribute, a tuple containing the individual components of the path:

```python
>>> pathlib.Path("/some/path/justjoinit_data/2022/10/01.json").parts
('/', 'some', 'path', 'justjoinit_data', '2022', '10', '01.json')
```

Tuple unpacking lets us conveniently capture the year, month, and day variables within a single line. Every part that comes before the year part is captured within the `_` variable (it's a Pythonic way of saying that you don't care about something). We also combine it with the asterisk (`*`), which means that the `_` variable can hold multiple elements.

```python
*_, year, month, day = file.parts

>>> year
'2022'
>>> month
'10'
>>> day
'01.json'
>>> _
['/', 'some', 'path', 'justjoinit_data']
```

After a small cleanup (removing the `.json` suffix from `day` and converting `day`, `month`, and `year` to integers), we're able to construct a valid `date` object and add it to the `dates_from_disk` set:

```python
dates_from_disk.add(
    date(
        year=int(year),
        month=int(month),
        day=int(day.replace(".json", "")),
    )
)
```

After the for loop is done, `dates_from_disk` contains all the dates existing in the `justjoinit_data` directory.

We use the built-in `min` and `max` functions to calculate the earliest and latest date. These dates give us a helper variable called `difference_in_days`, which is later used for generating the range of expected dates between `start_date` and `end_date`:

```python
expected_dates = {start_date + timedelta(days=i) for i in range(difference_in_days + 1)}
```

To find the missing dates within the `justjoinit_data` directory, we simply calculate the difference between `expected_dates` and `dates_from_disk`:

```python
missing_dates = expected_dates - dates_from_disk
```

The last thing we do is sort the dates (`sorted(missing_dates)`), transform them to strings with the `.strftime("%Y-%m-%d")` string method, and join them with the `", "` string (so the result matches the expected format from the task's description).

## Summary

I hope you enjoyed this little exercise. I encourage you to check Quest Of Python for more :-). Let me know your approach to this exercise!