Approaching Cricket Analytics With Python and Indian Premier League Data

by Hari Chandan, April 18th, 2024
In the realm of sports, few events capture the imagination and fervor of fans quite like cricket, and within cricket, the Indian Premier League (IPL) stands as a colossus. Launched in 2008, the IPL has burgeoned into one of the world's most illustrious cricket leagues, bringing together international stars and emerging talent in a spectacle of sport that spans over a month annually. Yet, despite its global appeal and the richness of stories within each game, access to structured, analytical data on IPL matches remains a closely guarded treasure, primarily monopolized by major players in the market. For enthusiasts, researchers, and analysts outside these organizations, this presents a significant barrier to exploration and understanding of the game through data.

This project originated as an Independent Research Topic Investigation for the Data Science for Public Affairs class at Indiana University. The primary objective was ambitious yet straightforward: to analyze and compare the inaugural 2008 IPL season with the latest season in 2023, uncovering the evolution of strategies, player performances, and the tournament's dynamics over fifteen years. However, we quickly encountered a formidable challenge—the dearth of readily available structured data on the IPL. The only accessible source was Cricsheet, offering detailed ball-by-ball match data in JSON format. While rich in content, covering every delivery, run, wicket, and player involved in each match, the format was far from user-friendly, especially for those not well-versed in data science or programming.

As we delved deeper, attempting to mold this unwieldy data into a form suitable for our project's goals, a broader mission crystallized. Enhancing the data availability for the IPL wasn't just a necessity for our class project; it was a contribution to the global cricket community. Considering IPL's stature as one of the world's premier sports franchises, celebrated for its spectacle, competition, and cultural significance in cricketing nations and beyond, it became clear that making its data more accessible could democratize insights into the game, foster a deeper appreciation among fans, and potentially unveil narratives hitherto hidden within the complexities of raw data.

This introduction to our journey is more than a tale of academic pursuit; it's a narrative about unlocking the treasure trove of IPL data for the world, making the nuances of cricket more approachable, and perhaps, in the process, unraveling the fabric of one of the most vibrant chapters in modern cricket.

The Motivation

Embarking on this project, we confronted the raw, unrefined essence of cricket data: a JSON labyrinth detailing each ball bowled in the IPL matches. For the uninitiated, JSON (JavaScript Object Notation) is a lightweight data interchange format, designed for human readability but primarily used for machine exchange. Each file encapsulated the granular intricacies of a cricket match, from the trajectory of every delivery to the outcome of each ball—runs scored, extras conceded, wickets fallen, and the players involved. While this richness offered an almost cinematic replay in data format, the complexity of distilling actionable insights from such detailed chronicles posed a formidable challenge.

Our motivation was twofold, driven by a deep-seated passion for cricket and a professional commitment to data science. Cricket, in many ways, is the pulsating heart of sports in India, transcending mere athletic competition to embody a cultural phenomenon, akin to the fervor surrounding the NFL in the United States. In India, cricket is not just a game; it's a religion, a season of unity celebrated across the country, cutting across the diverse tapestry of Indian society. This universal appeal, combined with our academic pursuits as data science graduate students, presented a unique opportunity. We were not just analyzing data; we were connecting with a piece of our heritage, attempting to contribute to a community that had given us so much joy and pride.

This project was also an expression of our belief in "data for all" — a principle that argues for democratizing data access and understanding. The exclusivity surrounding cricket data, especially detailed analyses like ball-by-ball match information, seemed antithetical to the very spirit of the game that unites millions. By breaking down these barriers, we aimed to serve the community, providing cricket enthusiasts, researchers, and casual fans alike the tools to explore and understand the game in new depths. The goal was to empower anyone with an interest, regardless of their technical proficiency with data analysis tools like Pandas, to dive into the IPL's rich history and emerge with newfound insights.

In essence, this journey was more than an academic endeavor; it was an act of community service, a tribute to our shared passion for cricket, and a challenge to the status quo of data accessibility. We embarked on this path with a clear vision: to unlock the stories hidden within IPL matches for everyone, making the complex world of cricket data not just accessible but inviting to all who wished to explore it. This project was our ode to the sport, a bridge connecting the realms of data science and cricket, and an open invitation for others to join us in this exploration.

Breaking Down the Data

The first challenge in our adventure was sourcing the data. Our treasure trove, Cricsheet, presented itself as a beacon in the vast sea of the internet, offering comprehensive ball-by-ball data for cricket matches, including every IPL game since its inception. Here, data wasn't just numbers but a narrative of battles fought on the pitch, distilled into JSON files—each a detailed account of a single IPL match. Yet, this bounty, while rich, was encased in the complexity of its format. Each file was a meticulous record: every delivery, run, wicket, and the subtleties of the game were logged with precision but in a format that, while perfect for machines, was a labyrinth for the uninitiated.

The data dictionary within these files was extensive: balls_per_over, city, dates, match_type, outcome, teams, venue, and so much more. Delve deeper, and you'd find overs, deliveries, batters, bowlers, runs, and extras, each with its own nested details. This level of granularity is both a boon and a bane. For analysis, it's a goldmine. For accessibility, a dense jungle where the trees are made of code.
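To make that nesting concrete, here is a tiny, hypothetical snippet in the same shape (real Cricsheet files carry many more fields and hundreds of deliveries; the values below are illustrative, not taken from an actual match file), parsed with Python's built-in json module:

```python
import json

# A minimal, made-up file in the Cricsheet ball-by-ball layout described above
sample = '''
{
  "info": {
    "balls_per_over": 6,
    "city": "Bangalore",
    "dates": ["2008-04-18"],
    "teams": ["Kolkata Knight Riders", "Royal Challengers Bangalore"],
    "venue": "M Chinnaswamy Stadium"
  },
  "innings": [
    {
      "team": "Kolkata Knight Riders",
      "overs": [
        {
          "over": 0,
          "deliveries": [
            {"batter": "SC Ganguly", "bowler": "P Kumar",
             "runs": {"batter": 0, "extras": 1, "total": 1},
             "extras": {"legbyes": 1}}
          ]
        }
      ]
    }
  ]
}
'''

data = json.loads(sample)
# Every question about the match means walking this hierarchy:
# innings -> overs -> deliveries -> runs/extras/wickets
first_ball = data["innings"][0]["overs"][0]["deliveries"][0]
print(data["info"]["venue"])        # M Chinnaswamy Stadium
print(first_ball["runs"]["total"])  # 1
```

Three levels of indexing just to reach a single delivery — that depth is exactly the "dense jungle" the rest of this post works to flatten.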

Our toolset for this expedition was chosen with care. Python formed the backbone, with its libraries—os for navigating directories, json for parsing the data files, and pandas for transforming this data into a structured, analyzable format. We flirted with the idea of using Apache Spark for its prowess in handling big data but realized our arsenal was overkill for the task at hand. The analogy that came to mind was using a sword to cut garlic cloves—a mismatch of tool and task that could complicate rather than simplify.

Our approach was to make this data comprehensible and accessible at multiple levels, akin to zooming in and out on a map, each level providing different insights into the game:

  • Level 1: Match Summary Data offers a bird's-eye view. Here, we distill each match into a single row of essential metadata: date, teams, venue, toss decisions, and match outcomes. This dataset is the doorway for anyone looking to gauge trends, analyze the impact of toss decisions, or understand performance variations across different venues.

  • Level 2: Player Performances Per Match zooms in a bit closer. We aggregate data to the player level for each match, detailing runs scored, wickets taken, and more. This layer is perfect for analyzing player contributions, identifying the stars of each match, and comparing performances across the season.

  • Level 3: Detailed Ball-by-Ball Data is the most granular view, where every delivery is dissected. This dataset is a haven for the detail-oriented, offering insights into the mechanics of the game, player strategies, and the minutiae that can turn the tide of a match.

Implementing these levels required a strategy that balanced depth with accessibility. Our aim was not just to unlock the data but to lay it out in a manner that invites exploration, regardless of one's familiarity with data science tools.

Constructing the IPL Match Summary Dataset

The objective of this project was to transform raw, JSON-formatted ball-by-ball IPL match data into a structured and analysis-ready dataset. This transformation process involved several steps, utilizing Python's robust data processing capabilities.

Data Source and Format

Our primary data source was Cricsheet, which provides comprehensive ball-by-ball details of IPL matches in JSON format. JSON, or JavaScript Object Notation, is a flexible, text-based format that's easy for humans to read and write, and easy for machines to parse and generate. Despite its structured nature, JSON data requires parsing and transformation to be effectively used in data analysis tasks.

The Python Script for Data Transformation

The script employs Python, renowned for its simplicity and the powerful data manipulation and analysis libraries it supports. Here’s a step-by-step breakdown of the script's operations:

  • Opening and Reading JSON Files: Using Python's built-in json library, the script reads each IPL match's JSON file. This step involves iterating over all match files within a specified directory, leveraging the os library for filesystem navigation.
  • Data Extraction and Aggregation: The core of the script extracts essential match information, such as date, venue, teams, toss decisions, match outcome, and player of the match. Furthermore, it computes aggregated metrics like total runs and wickets for each team by parsing through the delivery details within each inning.
  • DataFrame Creation with Pandas: The extracted and aggregated data for each match is then structured into a pandas DataFrame. Pandas, a powerful tool for data manipulation and analysis in Python, facilitates handling and organizing the data into a tabular format, ideal for analysis.
  • Compiling Season Summary: The script aggregates the summaries of all matches in the season into a single DataFrame. This aggregated data provides a comprehensive overview of the season, capturing the essence of each game in structured form.

Libraries and Their Usage

  • json: For decoding JSON files into Python data structures.
  • pandas: Used for creating, manipulating, and structuring the data into DataFrames for easy analysis.
  • os: For navigating directories and file paths, enabling dynamic access to the JSON files for each match.

Decision Against Apache Spark

While Apache Spark is a powerful tool for big data processing, it was deemed unnecessary for the scale of this dataset. The decision was driven by a preference for simplicity and the relatively moderate size of the data, which did not warrant Spark's distributed computing capabilities. The analogy used was "using a sword to cut garlic cloves," emphasizing the overkill Spark would represent in this context.

Output and Accessibility

The final output of the script is a CSV file, match_summary_revised.csv, which stores the structured match summary data. Saving the dataset in CSV format ensures that it is easily accessible and compatible with a wide range of data analysis tools and environments, furthering the goal of democratizing access to IPL data.

import json
import pandas as pd
import os

def process_match_data(file_path):
    """Process a single match JSON file to extract detailed match summary data with clear first and second innings info."""
    with open(file_path, 'r') as file:
        data = json.load(file)
    match_summary = {
        'date': data['info']['dates'][0],
        'venue': data['info']['venue'],
        'team1': data['info']['teams'][0],  
        'team2': data['info']['teams'][1],  
        'toss_winner': data['info']['toss']['winner'],
        'toss_decision': data['info']['toss']['decision'],
        'match_winner': data['info'].get('outcome', {}).get('winner', 'No result'),
        'player_of_match': ', '.join(data['info'].get('player_of_match', [])),
        # Initialize placeholders for innings-specific data
        '1st_innings_team': '',
        '1st_innings_runs': 0,
        '1st_innings_wickets': 0,
        '2nd_innings_team': '',
        '2nd_innings_runs': 0,
        '2nd_innings_wickets': 0
    }
    for inning_number, inning in enumerate(data['innings'], start=1):
        team = inning['team']
        runs = 0
        wickets = 0
        for over in inning['overs']:
            for delivery in over['deliveries']:
                runs += delivery['runs']['total']
                if 'wickets' in delivery:
                    wickets += 1
        # Assign innings data based on the iteration
        if inning_number == 1:
            match_summary['1st_innings_team'] = team
            match_summary['1st_innings_runs'] = runs
            match_summary['1st_innings_wickets'] = wickets
        elif inning_number == 2:
            match_summary['2nd_innings_team'] = team
            match_summary['2nd_innings_runs'] = runs
            match_summary['2nd_innings_wickets'] = wickets

    return match_summary

def consolidate_season_data(folder_path):
    all_match_summaries = []

    for filename in os.listdir(folder_path):
        if filename.endswith('.json'):
            file_path = os.path.join(folder_path, filename)
            match_summary = process_match_data(file_path)
            all_match_summaries.append(match_summary)
    return pd.DataFrame(all_match_summaries)

# Path to the folder containing your JSON files for a season
folder_path = r'D:\IPL Data\2008'
season_df = consolidate_season_data(folder_path)

# Save the DataFrame to a CSV file
csv_file_path = 'D:\\IPL Data\\2008\\layers\\match_summary_revised.csv'
season_df.to_csv(csv_file_path, index=False)

print(f"Revised match summary data saved to {csv_file_path}")
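With the summary CSV in hand, even a pandas one-liner can probe questions like how often the toss winner goes on to win the match. A minimal sketch, using two hypothetical rows in the same column layout the script produces (in practice you would load the real file with pd.read_csv):

```python
import pandas as pd

# Hypothetical rows in the match_summary layout built above
df = pd.DataFrame([
    {"toss_winner": "Royal Challengers Bangalore", "toss_decision": "field",
     "match_winner": "Kolkata Knight Riders"},
    {"toss_winner": "Chennai Super Kings", "toss_decision": "bat",
     "match_winner": "Chennai Super Kings"},
])

# Share of matches in which winning the toss translated into winning the match
toss_win_rate = (df["toss_winner"] == df["match_winner"]).mean()
print(toss_win_rate)  # 0.5
```

This is the payoff of the Level 1 dataset: a season-wide question answered in one comparison, with no JSON traversal at all.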

Enhancing the Dataset with Individual Batting Performances

Data Organization and Preparation

To begin, we needed a structured approach to align our JSON match data files with the existing match summary records. The solution involved:

  • JSON File Sorting: A preparatory step where we list and sort all JSON filenames (stripping the '.json' extension for use as match_id) from the specified directory. This sorting ensures a sequential alignment with our match summary dataset, facilitating a one-to-one correspondence between detailed match data and summary records.

Feature Expansion Workflow

The workflow for integrating individual batting performances into the match summary dataset involves several key steps, realized through specific Python functions:

  1. get_batting_scores() Function:
    • Purpose: Extracts batting scores from innings data within a JSON file.
    • Implementation: Iterates through each delivery in the innings, aggregating runs scored by each batter. Outputs a dictionary mapping batters to their total runs in the innings.
  2. process_season_jsons() Function:
    • Purpose: Conducts a season-wide analysis to identify the highest batting scores for each team across all matches.
    • Implementation: Iterates through each JSON file in the directory, applying get_batting_scores() to extract and aggregate player scores. For each match, it identifies the highest scorer for each team and compiles these into a dictionary keyed by match_id for easy reference.

Integrating Batting Performances into the Match Summary

To merge this new data with our match summary information, we employed a structured approach:

  • New match_id Column: Added to the match summary DataFrame, derived from sorted JSON filenames. This column serves as a key to link detailed batting data with match summaries.
  • Highest Scorer Columns Initialization: Initialized new columns within the match summary DataFrame (highest_scorer_1st_innings, highest_score_1st_innings, highest_scorer_2nd_innings, highest_score_2nd_innings) to store the names and scores of the highest scorers from each innings.
  • DataFrame Update Process: Iterates through each match, using the match_id to find the corresponding highest scores and update the DataFrame with this new information.

json_folder_path = "D:\\IPL Data\\2008\\"

# Define the path to the CSV file
csv_file_path = "D:\\IPL Data\\2008\\layers\\match_summary_revised.csv"

# Step 1: Get and Sort JSON Filenames (without the '.json' extension for match_id)
json_files = [f[:-5] for f in os.listdir(json_folder_path) if f.endswith('.json')]  # Remove '.json' from filenames
json_files.sort()  # Sort filenames in ascending order

# Step 2: Read the CSV and add a new 'match_id' column with sorted JSON filenames (without extensions)
matches_df = pd.read_csv(csv_file_path)
matches_df['match_id'] = json_files  # Assuming each row corresponds to the sorted list of matches

# Step 3 & 4: Iterate through each JSON, calculate highest scores, and update CSV
# Initialize columns for highest scorers and their scores in both innings
matches_df['highest_scorer_1st_innings'] = None
matches_df['highest_score_1st_innings'] = None
matches_df['highest_scorer_2nd_innings'] = None
matches_df['highest_score_2nd_innings'] = None

for match_id in json_files:
    # Construct the full path to the JSON file
    file_path = os.path.join(json_folder_path, match_id + '.json')
    # Read JSON data
    with open(file_path, 'r') as f:
        data = json.load(f)
    # Initialize dictionaries to hold the scores
    scores_1st_innings = {}
    scores_2nd_innings = {}
    # Process innings data
    for innings_number, innings in enumerate(data['innings'], start=1):
        team_name = innings['team']
        # Extract scores for each player
        for over in innings['overs']:
            for delivery in over['deliveries']:
                batter = delivery['batter']
                runs = delivery['runs']['batter']
                if innings_number == 1:
                    scores_1st_innings[batter] = scores_1st_innings.get(batter, 0) + runs
                else:
                    scores_2nd_innings[batter] = scores_2nd_innings.get(batter, 0) + runs
    # Find the highest scorer and their score in each innings
    if scores_1st_innings:
        highest_scorer_1st, highest_score_1st = max(scores_1st_innings.items(), key=lambda item: item[1])
    else:
        highest_scorer_1st, highest_score_1st = ("No data", 0)

    if scores_2nd_innings:
        highest_scorer_2nd, highest_score_2nd = max(scores_2nd_innings.items(), key=lambda item: item[1])
    else:
        highest_scorer_2nd, highest_score_2nd = ("No data", 0)
    # Update the DataFrame
    match_index = matches_df[matches_df['match_id'] == match_id].index
    if not match_index.empty:
        index = match_index[0]
        matches_df.loc[index, 'highest_scorer_1st_innings'] = highest_scorer_1st
        matches_df.loc[index, 'highest_score_1st_innings'] = highest_score_1st
        matches_df.loc[index, 'highest_scorer_2nd_innings'] = highest_scorer_2nd
        matches_df.loc[index, 'highest_score_2nd_innings'] = highest_score_2nd

Feature Expansion: Expanding Dataset with Bowling Insights

Incorporating Bowling Performances

The addition of bowling performances required a few critical steps, implemented through precise Python scripting:

  1. New DataFrame Columns: To accommodate the data on bowling performances, we added new columns to the match summary DataFrame (highest_wicket_taker_1st_innings, highest_wickets_1st_innings, highest_wicket_taker_2nd_innings, highest_wickets_2nd_innings). These columns are designated to store the names of the highest wicket-takers and their wicket counts for both innings.
  2. process_innings() Function:
    • Purpose: Extracts and aggregates runs scored and wickets taken from innings data.
    • Implementation: Iterates through each delivery within an innings, tallying runs for each batter and wickets for each bowler. Outputs two dictionaries: one mapping batters to their scores and another mapping bowlers to their wickets.
  3. Processing Each Match:
    • Workflow: For every match, represented by its match_id, we process the corresponding JSON file to extract innings data. Utilizing process_innings(), we determine both the highest scores and highest wicket counts.

    • Data Integration: The highest wicket-taker and their count are then integrated into the match summary DataFrame, updating the newly added columns with this crucial information.

# New columns for highest wicket-taker and their wickets
columns = ['highest_wicket_taker_1st_innings', 'highest_wickets_1st_innings',
           'highest_wicket_taker_2nd_innings', 'highest_wickets_2nd_innings']
matches_df = matches_df.reindex(columns=matches_df.columns.tolist() + columns, fill_value=None)

# Function to process innings for scores and wickets
def process_innings(innings):
    scores, wickets = {}, {}
    for over in innings.get('overs', []):
        for delivery in over.get('deliveries', []):
            batter = delivery.get('batter')
            bowler = delivery.get('bowler')
            runs = delivery.get('runs', {}).get('batter', 0)
            scores[batter] = scores.get(batter, 0) + runs
            if 'wickets' in delivery:
                wickets[bowler] = wickets.get(bowler, 0) + 1
    return scores, wickets

# Process each JSON file
for match_id in json_files:
    file_path = os.path.join(json_folder_path, match_id + '.json')
    with open(file_path, 'r') as f:
        data = json.load(f)
    innings_data = {}
    for innings_number, innings in enumerate(data['innings'], start=1):
        team_name = innings['team']
        scores, wickets = process_innings(innings)
        highest_score = max(scores.items(), key=lambda x: x[1]) if scores else ("No data", 0)
        highest_wickets = max(wickets.items(), key=lambda x: x[1]) if wickets else ("No data", 0)
        innings_data[innings_number] = (highest_score, highest_wickets)
    # Update DataFrame
    index = matches_df[matches_df['match_id'] == match_id].index[0]
    for i in [1, 2]:  # 1st and 2nd innings
        if i in innings_data:
            suffix = '1st' if i == 1 else '2nd'
            matches_df.loc[index, f'highest_wicket_taker_{suffix}_innings'] = innings_data[i][1][0]
            matches_df.loc[index, f'highest_wickets_{suffix}_innings'] = innings_data[i][1][1]

Feature Expansion: Incorporating Extras Details

Adding Extras Details

To accurately reflect the contributions of extras to the match dynamics, we introduced several steps in our data processing pipeline:

  1. New Columns for Extras: The match summary DataFrame is expanded to include new columns dedicated to recording the total extras and a detailed breakdown of extras (such as wides, no balls, leg byes, byes, and penalties) for both the first and second innings.
  2. process_extras() Function:
    • Purpose: Extracts and aggregates details about extras conceded in an innings.
    • Implementation: Iterates through deliveries, tallying up each type of extra and the total extras. Outputs the total count of extras along with a dictionary detailing the counts of each extra type.
  3. Data Extraction and Integration:
    • Workflow: For each match, represented by its match_id, the corresponding JSON file is processed to extract data about extras. The process_extras() function facilitates the extraction and aggregation of extras data.

    • DataFrame Update: The extracted information about extras, both the total count and the detailed breakdown, is integrated into the match summary DataFrame, enriching it with this critical aspect of the game.

# Columns for highest wicket-taker and extras
extras_columns = ['total_extras_1st_innings', 'extras_detail_1st_innings',
                  'total_extras_2nd_innings', 'extras_detail_2nd_innings']
matches_df = matches_df.reindex(columns=matches_df.columns.tolist() + extras_columns, fill_value=None)

# Function to process innings for extras
def process_extras(innings):
    extras = {'wides': 0, 'noballs': 0, 'legbyes': 0, 'byes': 0, 'penalty': 0}
    total_extras = 0
    for over in innings.get('overs', []):
        for delivery in over.get('deliveries', []):
            if 'extras' in delivery:
                for extra_type, runs in delivery['extras'].items():
                    extras[extra_type] = extras.get(extra_type, 0) + runs
                    total_extras += runs
    return total_extras, extras

# Process each JSON file for extras
for match_id in json_files:
    file_path = os.path.join(json_folder_path, match_id + '.json')
    with open(file_path, 'r') as f:
        data = json.load(f)
    for innings_number, innings in enumerate(data['innings'], start=1):
        team_name = innings['team']
        total_extras, extras_detail = process_extras(innings)
        # Update DataFrame with total extras and detail
        index = matches_df[matches_df['match_id'] == match_id].index[0]
        suffix = '1st' if innings_number == 1 else '2nd'
        matches_df.loc[index, f'total_extras_{suffix}_innings'] = total_extras
        matches_df.loc[index, f'extras_detail_{suffix}_innings'] = str(extras_detail)

Augmenting Dataset with Powerplay Insights

Dataset Preparation

After reading the CSV and ensuring each match is identifiable via match_id, we proceeded to add new dimensions to our analysis:

  • Introduction of Powerplay Columns: To encapsulate Powerplay performances, we appended new columns to our DataFrame (powerplay_score_1st_innings, powerplay_wickets_1st_innings, powerplay_score_2nd_innings, powerplay_wickets_2nd_innings). These are designed to hold the scores and wickets taken in the Powerplay overs for both innings.

Powerplay Data Extraction and Processing

The extraction of Powerplay data hinges on the process_powerplay() function, detailed as follows:

  • Purpose: Calculates the total score and wickets during Powerplay overs (first 6 overs) for an innings.
  • Implementation: Iterates through the first six overs of an innings, summing up runs scored and wickets fallen. This focused analysis captures the essence of the Powerplay, reflecting the aggressive batting and bowling strategies employed during this phase.

Updating the Match Summary Dataset

With Powerplay data in hand for each innings, we then updated the match summary DataFrame:

  • Powerplay Scores and Wickets Integration: For each match, the DataFrame is updated with Powerplay scores and wickets, accurately reflecting these early innings performances.

# Read the CSV and update it with match IDs
matches_df = pd.read_csv(csv_file_path)
matches_df['match_id'] = json_files

# Columns for Powerplay scores and wickets
powerplay_columns = ['powerplay_score_1st_innings', 'powerplay_wickets_1st_innings',
                     'powerplay_score_2nd_innings', 'powerplay_wickets_2nd_innings']
matches_df = matches_df.reindex(columns=matches_df.columns.tolist() + powerplay_columns, fill_value=None)

# Function to process Powerplay overs
def process_powerplay(innings):
    powerplay_score = 0
    powerplay_wickets = 0
    for over in innings.get('overs', []):
        if over['over'] < 6:  # Powerplay is the first 6 overs
            for delivery in over.get('deliveries', []):
                powerplay_score += delivery.get('runs', {}).get('total', 0)
                if 'wickets' in delivery:
                    powerplay_wickets += 1
    return powerplay_score, powerplay_wickets

# Process each JSON file for Powerplay scores and wickets
for match_id in json_files:
    file_path = os.path.join(json_folder_path, match_id + '.json')
    with open(file_path, 'r') as f:
        data = json.load(f)
    for innings_number, innings in enumerate(data['innings'], start=1):
        team_name = innings['team']
        powerplay_score, powerplay_wickets = process_powerplay(innings)
        # Update DataFrame with Powerplay scores and wickets
        index = matches_df[matches_df['match_id'] == match_id].index[0]
        suffix = '1st' if innings_number == 1 else '2nd'
        matches_df.loc[index, f'powerplay_score_{suffix}_innings'] = powerplay_score
        matches_df.loc[index, f'powerplay_wickets_{suffix}_innings'] = powerplay_wickets

Building upon the comprehensive IPL match summary data we've curated, there's potential for further enrichment that could provide even deeper insights into the game's multifaceted nature. One significant area of expansion could involve detailed fielding performance metrics, such as catches taken, run-outs executed, and stumping instances, offering a nuanced view of the game's defensive strategies. Additionally, incorporating detailed bowler analysis, such as economy rates, average bowling speeds, and dot ball counts, could offer a richer perspective on the bowling strategies employed across different match stages. Another valuable addition could be the inclusion of partnership records for each wicket, shedding light on crucial batting collaborations that often shift match momentum. Incorporating weather conditions and pitch reports could also provide context for performance variations, offering a holistic view of each match's external influences. Expanding the dataset to include these elements would not only elevate the analytical possibilities but also foster a more detailed understanding of the game's dynamics, catering to an array of analytical pursuits within the cricket analytics community.
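The bowler-analysis enrichment mentioned above — economy rates and dot-ball counts — falls straight out of the same ball-by-ball traversal. A sketch of a hypothetical helper (not part of the pipeline above; delivery fields follow the Cricsheet layout used throughout this post, and, as a simplification, all runs including byes and leg byes are charged to the bowler):

```python
def bowling_figures(innings):
    """Per-bowler economy rate and dot-ball count for one innings (sketch)."""
    balls, runs, dots = {}, {}, {}
    for over in innings.get('overs', []):
        for delivery in over.get('deliveries', []):
            bowler = delivery['bowler']
            total = delivery['runs']['total']
            extras = delivery.get('extras', {})
            # Wides and no-balls are not legal deliveries, so they do not
            # count toward balls bowled (or dot balls)
            if not ({'wides', 'noballs'} & extras.keys()):
                balls[bowler] = balls.get(bowler, 0) + 1
                if total == 0:
                    dots[bowler] = dots.get(bowler, 0) + 1
            runs[bowler] = runs.get(bowler, 0) + total
    # Economy rate = runs conceded per (6-ball) over
    return {b: {'economy': runs[b] / (balls[b] / 6), 'dot_balls': dots.get(b, 0)}
            for b in balls}

# Hypothetical one-over innings: three dot balls, then 4, 6, and 2 runs
sample_innings = {'overs': [{'over': 0, 'deliveries': (
    [{'bowler': 'X', 'runs': {'total': 0}}] * 3 +
    [{'bowler': 'X', 'runs': {'total': r}} for r in (4, 6, 2)]
)}]}
print(bowling_figures(sample_innings))  # {'X': {'economy': 12.0, 'dot_balls': 3}}
```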

Feature Expansion: Detailed Match-Specific Scorecards

The create_detailed_scorecard_for_match function meticulously processes each delivery of a match to compile comprehensive scorecards that include critical performance metrics for each player. Key steps in the process include:

  • Match and Player Identification: Initially, the function identifies the match ID from the JSON file's name and establishes a directory for storing the match's detailed scorecard. It then maps each player to their respective teams, laying the groundwork for detailed analytics.
  • Data Aggregation and Processing: As the function iterates through each delivery, it aggregates data on runs scored, dismissals, and the bowler involved for each batter. This meticulous aggregation allows for a nuanced view of the match, highlighting key performances and strategic elements.
  • DataFrame Transformation and Aggregation: The collected data is transformed into a pandas DataFrame, which is then aggregated to summarize each batter's performance in the match, including total runs scored, their dismissal status, and the bowler responsible for their wicket.
  • Detailed Scorecard Generation: The aggregated DataFrame is saved as a CSV file within a match-specific folder, creating a persistent and accessible record of detailed player performances for that match.

Recognizing the Scope for Further Enhancements

While this feature significantly enriches the dataset by providing detailed insights into batting performances, we acknowledge the immense potential for further enhancements to broaden the dataset's scope and depth:

  • Incorporation of Fielding and Bowling Metrics: Adding detailed fielding statistics, such as catches, run-outs, and stumpings, along with comprehensive bowling metrics, including overs bowled, maiden overs, runs conceded, and wickets taken, would provide a more holistic view of each player's contribution to the match.

  • Partnership Analyses: Documenting batting partnerships and their impact on the match outcome could offer valuable insights into team strategies and player compatibilities.

  • Match Contextual Data: Integrating data on pitch conditions, weather, and match context (such as tournament stage) could enable more nuanced analyses of performance variations and strategic decisions.

  • Advanced Analytics and Predictive Modeling: Leveraging the detailed match data for advanced statistical analyses and predictive modeling could uncover deeper patterns and trends, potentially offering strategic insights for teams and analysts.
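To give a flavor of the bowling metrics proposed above, the same ball-by-ball structure can be aggregated per bowler. The records below are hand-made stand-ins for Cricsheet deliveries (the bowler names and values are hypothetical), so this is a sketch of the approach rather than production code:

```python
import pandas as pd

# Hand-made deliveries in a flattened, Cricsheet-like shape (hypothetical players)
deliveries = [
    {'bowler': 'Bowler A', 'runs_conceded': 4, 'wicket': 0},
    {'bowler': 'Bowler A', 'runs_conceded': 0, 'wicket': 1},
    {'bowler': 'Bowler B', 'runs_conceded': 6, 'wicket': 0},
    {'bowler': 'Bowler A', 'runs_conceded': 1, 'wicket': 0},
    {'bowler': 'Bowler B', 'runs_conceded': 0, 'wicket': 1},
    {'bowler': 'Bowler B', 'runs_conceded': 2, 'wicket': 0},
]

df = pd.DataFrame(deliveries)
bowling = df.groupby('bowler').agg(
    balls=('bowler', 'size'),        # deliveries bowled
    runs=('runs_conceded', 'sum'),   # runs conceded
    wickets=('wicket', 'sum'),       # wickets taken
)
# Economy rate: runs conceded per over of six balls
bowling['economy'] = bowling['runs'] / (bowling['balls'] / 6)
```

The same groupby pattern used for batters in the scorecard function extends naturally to overs bowled, maiden overs, and phase-by-phase spell analysis once those fields are extracted.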

import json
import os

import pandas as pd


def create_detailed_scorecard_for_match(json_file_path, output_folder):
    # Extract the match ID from the file name to use as a folder name
    match_id = os.path.basename(json_file_path).split('.')[0]
    match_folder = os.path.join(output_folder, match_id)
    os.makedirs(match_folder, exist_ok=True)  # Create the match-specific folder

    with open(json_file_path, 'r') as file:
        data = json.load(file)

    # Map each player to their team
    player_teams = {}
    for team in data['info']['teams']:
        for player in data['info']['players'][team]:
            player_teams[player] = team

    scorecard_data = []
    for inning in data['innings']:
        batting_team = inning['team']
        for over in inning['overs']:
            for delivery in over['deliveries']:
                delivery_data = {
                    'team': batting_team,
                    'batter': delivery['batter'],
                    'runs': delivery['runs']['batter'],
                    'dismissal': 'Not out',
                    'bowled_by': ''
                }
                if 'wickets' in delivery:
                    for wicket in delivery['wickets']:
                        if 'player_out' in wicket:
                            delivery_data['batter'] = wicket['player_out']
                            delivery_data['dismissal'] = wicket.get('kind', 'Not out')
                            # In Cricsheet JSON, the bowler is recorded on the
                            # delivery, not inside the wicket object
                            delivery_data['bowled_by'] = delivery.get('bowler', '')
                            if delivery_data['batter'] in player_teams:
                                delivery_data['team'] = player_teams[delivery_data['batter']]
                scorecard_data.append(delivery_data)

    # Convert the list of dictionaries to a DataFrame
    df = pd.DataFrame(scorecard_data)
    # Group by team and batter to aggregate runs and keep the last dismissal info
    aggregated_df = df.groupby(['team', 'batter'], as_index=False).agg(
        {'runs': 'sum', 'dismissal': 'last', 'bowled_by': 'last'})

    # Save to a CSV file within the match-specific folder
    scorecard_file_path = os.path.join(match_folder, f"{match_id}_detailed_scorecard.csv")
    aggregated_df.to_csv(scorecard_file_path, index=False)
    print(f"Detailed scorecard saved to {scorecard_file_path}")
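To see what the final groupby step produces, here is the same aggregation run in isolation on a few hand-made delivery records (team and player names are hypothetical), mirroring the columns the function collects:

```python
import pandas as pd

# Hand-made per-delivery records matching the columns collected above
scorecard_data = [
    {'team': 'Team X', 'batter': 'Batter A', 'runs': 4, 'dismissal': 'Not out', 'bowled_by': ''},
    {'team': 'Team X', 'batter': 'Batter A', 'runs': 1, 'dismissal': 'Not out', 'bowled_by': ''},
    {'team': 'Team X', 'batter': 'Batter A', 'runs': 0, 'dismissal': 'caught', 'bowled_by': 'Bowler Z'},
    {'team': 'Team X', 'batter': 'Batter B', 'runs': 6, 'dismissal': 'Not out', 'bowled_by': ''},
]

df = pd.DataFrame(scorecard_data)
# Sum runs per batter; 'last' keeps the final dismissal state, since the
# dismissal (if any) appears on the batter's last delivery
aggregated_df = df.groupby(['team', 'batter'], as_index=False).agg(
    {'runs': 'sum', 'dismissal': 'last', 'bowled_by': 'last'})
```

Here Batter A aggregates to 5 runs, caught off Bowler Z, while Batter B finishes 6 not out, which is exactly the scorecard row shape written to CSV.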

Further Future Work: Expanding Data Granularity and Accessibility

The creation of detailed scorecards for each match represents a significant leap in enhancing the granularity of our IPL dataset. This process involves meticulously mapping each delivery to its respective batter, aggregating runs, and detailing dismissals, thereby offering a comprehensive view of individual performances within a structured format. The conversion of this detailed match data into a consolidated DataFrame, followed by the aggregation of player performances, provides a powerful tool for in-depth analysis.

Expanding Dataset Granularity

While the detailed scorecards significantly enrich our dataset, there's a vision for further expansion that could offer even deeper insights into the game's dynamics. Time constraints have limited our ability to explore these avenues fully, but they represent exciting future directions for this project:

  • Fielding Performances: Incorporating detailed fielding statistics, such as catches, run-outs, and stumpings, could offer a fuller picture of a team's defensive capabilities and individual fielders' contributions to the match outcome.
  • Bowler Analysis: Expanding the dataset to include detailed bowling metrics like economy rates, strike rates, and specific bowling spell analyses would provide a clearer view of bowling strategies and efficiencies throughout different phases of the game.
  • Partnership Records: Documenting batting partnerships for each wicket could illuminate key moments and collaborations that significantly impact the match's flow and outcome.
  • Pitch and Weather Conditions: Adding data on pitch conditions and weather could help analysts understand performance variations and strategies adapted by teams in response to these external factors.
  • Player Biometrics and Movement Data: Incorporating player movement data, biometrics, and fitness levels could open new research areas into player performance, injury prevention, and game strategy.
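The partnership idea above can be sketched with very little code: a partnership is simply the runs accumulated between successive wickets. The deliveries below are hand-made values for illustration:

```python
# Each tuple is (runs off the ball, whether a wicket fell on it) -
# hypothetical values standing in for a real innings.
deliveries = [
    (1, False), (4, False), (0, True),   # first wicket falls after 5 runs
    (2, False), (6, False), (1, True),   # second partnership adds 9
    (0, False), (4, False),              # unbroken final stand of 4
]

partnerships = []
current = 0
for runs, wicket_fell in deliveries:
    current += runs
    if wicket_fell:
        partnerships.append(current)  # close the partnership at the wicket
        current = 0
partnerships.append(current)  # record the unbroken final stand
```

Extending this with the two batters at the crease per partnership would support the compatibility analyses described above.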

Levels of Data Accessibility

To truly democratize access to cricket analytics, we envision structuring data across multiple levels of accessibility, catering to a wide range of users from casual fans to professional analysts:

  • Level 1 - Match Overviews: At the most accessible level, providing match summaries that include key outcomes, highlights, and player of the match awards could engage casual fans looking for quick insights without delving into complex data.
  • Level 2 - Detailed Scorecards and Player Performances: Building upon match overviews by offering detailed scorecards and player performance metrics would cater to enthusiasts and amateur analysts eager to explore beyond surface-level statistics.
  • Level 3 - Advanced Analytics: For professional analysts and teams, creating a dataset that includes granular details on fielding, bowling, partnerships, and conditions, supplemented with advanced metrics and predictive models, would provide the tools necessary for deep strategic analysis.
  • Level 4 - Biometrics and Movement Analytics: At the cutting edge, incorporating biometric and movement data would offer insights into player fitness, injury risks, and on-field strategies, appealing to sports scientists and high-performance coaches.
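One lightweight way to realize such tiers is to expose progressively larger column sets over the same underlying table. The sketch below is illustrative only; the level names and column choices are assumptions, not a fixed schema:

```python
import pandas as pd

# Illustrative column sets per access level (column names are hypothetical)
LEVEL_COLUMNS = {
    1: ['match_id', 'winner', 'player_of_match'],
    2: ['match_id', 'winner', 'player_of_match', 'batter', 'runs', 'dismissal'],
}

def view_for_level(df, level):
    """Return only the columns available at the requested access level."""
    cols = [c for c in LEVEL_COLUMNS[level] if c in df.columns]
    return df[cols]

# A one-row table standing in for a consolidated match dataset
df = pd.DataFrame([{
    'match_id': 'm1', 'winner': 'Team X', 'player_of_match': 'Batter A',
    'batter': 'Batter A', 'runs': 57, 'dismissal': 'caught',
}])

overview = view_for_level(df, 1)   # Level 1: match overview only
detailed = view_for_level(df, 2)   # Level 2: adds player performance columns
```

Levels 3 and 4 would follow the same pattern, layering in fielding, partnership, conditions, and biometric columns as they become available.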


The vision for future work involves not only expanding the IPL dataset's granularity but also creating a multi-tiered framework for data accessibility. While time constraints have limited the scope of our current enhancements, these outlined directions offer a roadmap for transforming how cricket analytics are approached, making it more inclusive, detailed, and insightful. This ambitious endeavor seeks to bridge the gap between raw data and actionable insights, enabling a broader spectrum of cricket fans and professionals to engage with the game at a deeper level.