Pandas is a powerful and popular library for working with data in Python. It provides tools for handling and manipulating large and complex datasets, and is widely used in fields such as finance, economics, statistics, and data science.

Pandas is built on top of NumPy, a library for working with numerical data in Python, and provides a high-level interface for working with structured data. It provides two primary data structures:

The Series
The DataFrame

The Series is a one-dimensional labeled array that can hold any data type. It is similar to a column in a spreadsheet, while the DataFrame is a two-dimensional labeled data structure with columns of potentially different types. It is similar to a whole spreadsheet or a SQL table.

Pandas is particularly useful for cleaning, transforming, and manipulating data in preparation for analysis. It provides a wide range of functions and methods for filtering, grouping, and aggregating data and for handling missing or incomplete data. Pandas is also often used together with data visualization and machine learning libraries, making it a valuable tool for data scientists and analysts.

Data structures in pandas

Pandas provides two primary data structures.

Series

A Series is a one-dimensional labeled array that can hold any data type. It is similar to a column in a spreadsheet. A Series is created by passing a list of data to the pd.Series() function and, optionally, specifying an index, which is a list of labels for the data. If you supply your own labels, the list of labels must be the same length as the list of data; otherwise pandas raises a ValueError. The index is optional: if it is not specified, a default integer index (0, 1, 2, ...) is created. So when building large datasets, it is usually easier to let pandas generate the labels.
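The length rule and the default index described above can be checked with a short sketch. The data values and variable names here are made up for illustration and are not from the original example.

```python
import pandas as pd

data = [10, 20, 30]  # hypothetical values

# No index given: pandas assigns the default integer labels 0, 1, 2.
default_indexed = pd.Series(data)
print(list(default_indexed.index))

# Two labels for three values: pandas raises a ValueError.
mismatch_error = None
try:
    pd.Series(data, index=["a", "b"])
except ValueError as exc:
    mismatch_error = exc
    print("ValueError:", exc)
```

Catching the exception is only for demonstration; in real code a length mismatch usually points to a bug in how the labels were built.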
For example, a Series with explicit labels:

    import pandas as pd

    data = [1, 2, 3, 4, 5]
    s = pd.Series(data, index=['a', 'b', 'c', 'd', 'e'])
    print(s)

Output:

    a    1
    b    2
    c    3
    d    4
    e    5
    dtype: int64

DataFrame

A DataFrame is a two-dimensional labeled data structure with columns of potentially different types. It is similar to a spreadsheet or a SQL table. A DataFrame is created by passing a dictionary of Series, lists, or dictionaries to the pd.DataFrame() function. The keys in the dictionary are used as column names, and the values are the data for the corresponding columns.

For example:

    import pandas as pd

    # A dictionary of Series...
    data = {'a': pd.Series([1, 2, 3]), 'b': pd.Series([4, 5, 6])}
    # ...or, equivalently here, a dictionary of lists
    data = {'a': [1, 2, 3], 'b': [4, 5, 6]}

    df = pd.DataFrame(data)
    print(df)

Output:

       a  b
    0  1  4
    1  2  5
    2  3  6

Importing and exporting data:

Pandas provides several functions for importing and exporting data from various sources. Some of the most common sources of data include CSV files, Excel files, and SQL databases.

You can use the pd.read_csv() function to import data from a CSV file. This function takes the file path or URL as an argument and returns a DataFrame.

For example:

    import pandas as pd

    df = pd.read_csv('data.csv')

To import data from an Excel file, you can use the pd.read_excel() function. This function takes the file path or URL and, optionally, the name of the sheet as arguments and returns a DataFrame.

For example:

    import pandas as pd

    # sheet_name selects one sheet when the file contains several
    df = pd.read_excel('data.xlsx', sheet_name='Sheet1')

To import data from a SQL database, you can use the pd.read_sql() function. This function takes a SQL query and a connection object as arguments and returns a DataFrame. The pandas documentation lists SQLAlchemy connectables and sqlite3 connections as the supported connection types; a raw pyodbc connection like the one below generally works, but recent pandas versions may emit a warning.

For example:

    import pandas as pd
    import pyodbc

    # Placeholder server and database names; replace with your own
    cnxn = pyodbc.connect("Driver={SQL Server Native Client 11.0};"
                          "Server=server_name;"
                          "Database=database_name;"
                          "Trusted_Connection=yes;")

    query = "SELECT * FROM table_name"
    df = pd.read_sql(query, cnxn)

In addition to importing data, pandas also provides functions for exporting data.
To export a DataFrame to a CSV file, you can use the df.to_csv() method. To export a DataFrame to an Excel file, you can use the df.to_excel() method.

For example:

    import pandas as pd

    # df is an existing DataFrame, such as one loaded earlier

    # Export DataFrame to CSV file (index=False leaves out the row labels)
    df.to_csv('data.csv', index=False)

    # Export DataFrame to Excel file
    df.to_excel('data.xlsx', sheet_name='Sheet1', index=False)

Data cleaning and preparation:

Pandas is a useful tool for cleaning and preparing data for analysis. It provides several functions and methods for tasks such as removing duplicates, handling missing values, and reformatting data.

Dealing with Duplicates

To remove duplicates from a DataFrame, you can use the df.drop_duplicates() method. This method removes rows with duplicate values in all or a subset of the columns. You can specify the columns to consider for duplicate values using the subset argument, and choose whether to keep the first or last occurrence of each duplicate using the keep argument.

For example:

    import pandas as pd

    df = pd.DataFrame({'A': [1, 1, 2, 3, 3],
                       'B': [2, 2, 3, 4, 5],
                       'C': [3, 4, 5, 6, 7]})

    # Remove rows that are identical in all columns
    # (none here, because every value in 'C' is different)
    df_deduplicated = df.drop_duplicates()

    # Remove rows that repeat the same values in columns 'A' and 'B'
    df_deduplicated = df.drop_duplicates(subset=['A', 'B'])

    # Keep the first occurrence of duplicates (the default)
    df_deduplicated = df.drop_duplicates(keep='first')

    # Keep the last occurrence of duplicates
    df_deduplicated = df.drop_duplicates(keep='last')

Dealing with Missing Values

To handle missing values in a DataFrame, you can use the df.isnull() method to identify missing values, and the df.dropna() method to remove rows or columns with missing values. You can also use the df.fillna() method to fill in missing values with a specified value.
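As a further sketch of these three methods, the snippet below counts missing values per column, drops rows based on a single column, and fills each column with its own replacement. The column names, values, and fill choices are assumptions made for illustration only.

```python
import pandas as pd

# Hypothetical data with one missing value in each column
df = pd.DataFrame({"A": [1.0, None, 3.0], "B": [4.0, 5.0, None]})

# Count the missing values in each column
missing_counts = df.isnull().sum()
print(missing_counts)

# Drop only the rows where column 'A' is missing
rows_with_a = df.dropna(subset=["A"])
print(rows_with_a)

# Fill each column with a different value: 0 for 'A', the column mean for 'B'
filled = df.fillna({"A": 0.0, "B": df["B"].mean()})
print(filled)
```

Passing a dictionary to fillna() is one way to avoid a single blanket value; whether 0 or a mean is appropriate depends on what the column represents.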
For example, using df.isnull(), df.dropna(), and df.fillna():

    import pandas as pd

    df = pd.DataFrame({'A': [1, 2, 3, 4, 5],
                       'B': [2, 3, 4, 5, 6],
                       'C': [3, 4, 5, 6, 7]})

    # Add missing values
    df.loc[1, 'A'] = None
    df.loc[3, 'B'] = None

    # Identify the rows that contain at least one missing value
    df_missing = df[df.isnull().any(axis=1)]

    # Remove rows with missing values
    df_cleaned = df.dropna()

    # Fill missing values with 0
    df_cleaned = df.fillna(0)

Data visualization:

Pandas can be used in conjunction with libraries like Matplotlib and Seaborn to create visually appealing and informative plots and charts. These libraries provide a wide range of plotting functions that can be easily used with pandas data structures.

To create a simple line chart using pandas, you can use the df.plot() method and specify the kind argument as 'line'. The df.plot() method takes several optional arguments that allow you to customize the appearance of the chart, such as the x and y axis labels, the title, and the legend.

For example:

    import pandas as pd
    import matplotlib.pyplot as plt

    # Create a DataFrame
    df = pd.DataFrame({'x': [1, 2, 3, 4], 'y': [3, 4, 5, 6]})

    # Plot the DataFrame as a line chart
    df.plot(x='x', y='y', kind='line', title='Line Chart')
    plt.show()

To create a bar chart using pandas, you can use the same df.plot() method and specify the kind argument as 'bar'.

For example:

    import pandas as pd
    import matplotlib.pyplot as plt

    # Create a DataFrame
    df = pd.DataFrame({'x': ['A', 'B', 'C', 'D'], 'y': [3, 4, 5, 6]})

    # Plot the DataFrame as a bar chart
    df.plot(x='x', y='y', kind='bar', title='Bar Chart')
    plt.show()

In addition to simple line and bar charts, pandas can also create more advanced charts, such as scatter plots, histograms, and box plots. You can use the df.plot() method with different combinations of arguments to create these charts, or you can use the functions provided by Matplotlib and Seaborn directly.
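Box plots are mentioned above but not demonstrated, so here is a minimal sketch using kind='box'. It assumes Matplotlib is installed and selects the non-interactive Agg backend so that it also runs without a display; the data and title are made up for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: no display available; omit for interactive use
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical data: one box is drawn per numeric column
df = pd.DataFrame({"x": [1, 2, 3, 4], "y": [3, 4, 5, 6]})

# Plot the DataFrame as a box plot; the call returns a Matplotlib Axes
ax = df.plot(kind="box", title="Box Plot")
plt.close(ax.figure)
```

In an interactive session you would typically call plt.show() instead of closing the figure.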
For example, to create a scatter plot using pandas, you can use the df.plot() method and specify the kind argument as 'scatter':

    import pandas as pd
    import matplotlib.pyplot as plt

    # Create a DataFrame
    df = pd.DataFrame({'x': [1, 2, 3, 4], 'y': [3, 4, 5, 6]})

    # Plot the DataFrame as a scatter plot
    df.plot(x='x', y='y', kind='scatter', title='Scatter Plot')
    plt.show()

To create a histogram using pandas, you can call the plot() method on a Series (or a DataFrame) and specify the kind argument as 'hist':

    import pandas as pd
    import matplotlib.pyplot as plt

    # Create a Series
    s = pd.Series([1, 2, 3, 3, 4, 5, 6, 6, 7, 8])

    # Plot the Series as a histogram
    s.plot(kind='hist', title='Histogram')
    plt.show()

Conclusion

Overall, Pandas is a powerful and popular library for working with data in Python. It provides tools for handling and manipulating large and complex datasets, and is widely used in fields such as finance, economics, statistics, and data science.

Pandas is built on top of NumPy, a library for working with numerical data in Python, and provides a high-level interface for working with structured data. It provides two main data structures: the Series and the DataFrame.

Pandas is particularly useful for cleaning, transforming, and manipulating data in preparation for analysis. It provides a wide range of functions and methods for filtering, grouping, and aggregating data and for handling missing or incomplete data. It is also often used in conjunction with other libraries for data visualization and machine learning.

Pandas can be used to import and export data from various sources, such as CSV files, Excel files, and SQL databases, using functions such as pd.read_csv(), pd.read_excel(), and pd.read_sql().

These are the basics of pandas, and enough to get you started. If you have any questions, please leave a comment. Thanks and have a nice day.

Also published here.