Python Excel Tutorial: The Definitive Guide

Originally published at https://www.datacamp.com/community/tutorials/python-excel-tutorial

You probably already know that Excel is a spreadsheet application developed by Microsoft. You can use this easily accessible tool to organize, analyze and store your data in tables. What’s more, this software is widely used in many different application fields all over the world. And, whether you like it or not, this applies to data science.

You’ll need to deal with these spreadsheets at some point, but you won’t always want to continue working in Excel either. That’s why Python developers have implemented ways to read, write and manipulate not only these files, but also many other types of files.

Today’s tutorial will give you some insights into how you can work with Excel and Python. It will provide you with an overview of packages that you can use to load and write these spreadsheets to files with the help of Python. You’ll learn how to work with packages such as `pandas`, `openpyxl`, `xlrd`, `xlutils` and `pyexcel`.

It might also be interesting for you to take a look at DataCamp’s Importing Data in Python course. If you also want to know more about how to read files into R, consider taking DataCamp’s R Tutorial on Reading and Importing Excel Files into R.

Starting Point: The Data

When you’re starting a data science project, you will often work from data that you have gathered maybe from web scraping, but probably mostly from datasets that you download from other places, such as Kaggle, Quandl, etc.

But more often than not, you’ll also find data on Google or on repositories that are shared by other users. This data might be in an Excel file or saved to a file with a `.csv` extension, … The possibilities can seem endless sometimes. But whenever you have data, your first step should be to make sure that you’re working with good-quality data.

In the case of a spreadsheet, you should verify its quality: you want to check not only whether the data can answer the research question that you have in mind, but also whether you can trust the data that the spreadsheet holds.

Quality of Your Excel Spreadsheet

To check the overall quality of your spreadsheet, you can go over the following checklist:

- Does the spreadsheet represent static data?
- Does your spreadsheet mix data, calculation, and reporting?
- Is the data in your spreadsheet complete and consistent?
- Does your spreadsheet have a systematic worksheet structure?
- Did you check if the live formulas in the spreadsheet are valid?

This list of questions is to make sure that your spreadsheet doesn’t ‘sin’ against the best practices that are generally accepted in the industry. Of course, the above list is not exhaustive: there are many more general rules that you can follow to make sure your spreadsheet is not an ugly duckling. However, the questions above are the most relevant ones when you want to make sure that the spreadsheet is of good quality.

Quality of Your Data

Before reading your spreadsheet into Python, you also want to consider adjusting your file to meet some basic principles, such as:

- The first row of the spreadsheet is usually reserved for the header, while the first column is used to identify the sampling unit;
- Avoid names, values or fields with blank spaces. Otherwise, each word will be interpreted as a separate variable, resulting in errors that are related to the number of elements per line in your data set. Consider using underscores, dashes, camel case, or concatenating words;
- Short names are preferred over longer names;
- Try to avoid using names that contain symbols such as `?`, `$`, `%`, `^`, `&`, `*`, `(`, `)`, `-`, `#`, `,`, `/`, `|`, `\`, `[`, `]`, `{` and `}`;
- Delete any comments that you have made in your file to avoid extra columns or NAs being added to your file; and
- Make sure that any missing values in your data set are indicated with NA.

Next, after you have made the necessary changes or when you have taken a thorough look at your data, make sure that you save your changes if you have made any. By doing this, you can revisit the data later to edit it, to add more data or to change it, while you preserve the formulas that you may have used to calculate the data, etc.

If you’re working with Microsoft Excel, you’ll see that there is a considerable number of options to save your file: besides the default extension `.xls` or `.xlsx`, you can go to the “File” tab, click on “Save As” and select one of the extensions that are listed as the “Save as Type” options. The most commonly used extensions to save datasets for data science are `.csv` and `.txt` (as tab-delimited text file). Depending on the saving option that you choose, your data set’s fields are separated by tabs or commas, which will make up the “field separator characters” of your data set.

Now that you have checked and saved your data, you can start with the preparation of your workspace!

Prepping Your Workspace

Preparing your workspace is one of the first things that you can do to make sure that you start off well. The first step is to check your working directory.

When you’re working in the terminal, you might first navigate to the directory that your file is located in and then start up Python. That also means that you have to make sure that your file is located in the directory that you want to work from!
But perhaps more importantly, if you have already started your Python session and you’ve got no clue of the directory that you’re working in, you should consider executing the following commands:

```python
# Import `os`
import os

# Retrieve current working directory (`cwd`)
cwd = os.getcwd()

# Change directory (replace the placeholder with your own folder)
os.chdir("/path/to/your/folder")

# List all files and directories in current directory
os.listdir('.')
```

Great, huh? You’ll see that these commands are pretty vital not only for loading your data but also for further analysis. For now, let’s just continue: you have gone through all the checkups, you have saved your data and prepped your workspace.

Can you already start with reading the data in Python? Unfortunately, you’ll still need to do one last thing. Even though you don’t have an idea yet of the packages that you’ll need to import your data, you do have to make sure that you have everything ready to install those packages when the time comes.

Pip

That’s why you need to have `pip` and `setuptools` installed. If you have Python 2 >=2.7.9 or Python 3 >=3.4 installed, you won’t need to worry because then you’ll normally already have it ready. In such cases, just make sure you have upgraded to the latest version. To do this, run the following command in your terminal:

```
# For Linux/OS X
pip install -U pip setuptools

# For Windows
python -m pip install -U pip setuptools
```

In case you haven’t installed `pip` yet, run the `python get-pip.py` script that you can find here. Additionally, you can follow the installation instructions on the page if you need more help to get everything installed properly.

Anaconda

Another option that you could consider if you’re using Python for data science is installing the Anaconda Python distribution. By doing this, you’ll have an easy and quick way to get started with doing data science because you don’t need to worry about separately installing the packages that you need to do data science.
This is especially handy if you’re a beginner, but even for more seasoned developers, it’s a way to quickly test out some stuff without having to install each package separately.

Anaconda includes 100 of the most popular Python, R and Scala packages for data science and several open source development environments such as Jupyter and Spyder. If you’d like to start working with Jupyter Notebook after this tutorial, go to this page.

You can go here to install Anaconda. Follow the instructions to install and you’re ready to start!

That was all you needed to do to set up your environment! Now, you’re set to start importing your files.

Importing Excel Files with Pandas

One of the ways that you’ll often use to import your files when you’re working with them for data science is with the help of the Pandas package. The Pandas library is built on NumPy and provides easy-to-use data structures and data analysis tools for the Python programming language. This powerful and flexible library is very frequently used by (aspiring) data scientists to get their data into data structures that are highly expressive for their analyses.

If you already have Pandas available through Anaconda, you can just load your files in Pandas DataFrames with `pd.ExcelFile()`.

If you didn’t install Anaconda, just execute `pip install pandas` to install the Pandas package in your environment and then load your file in the same way.

A piece of cake, right?

To read in `.csv` files, you have a similar function to load the data in a DataFrame: `read_csv()`. The delimiter that this function will take into account is a comma by default, but you can specify an alternative delimiter if you want to. Go to the documentation to find out which other arguments you can specify to make your import successful!

Note that there are also `read_table()` and `read_fwf()` functions to read in general delimited files and tables of fixed-width formatted lines into DataFrames. For the first function, `read_table()`, the default delimiter is the tab, but you can again override this and specify an alternative separator character. What’s more, there are also other functions that you can use to get your data in DataFrames: you can find them here.

Let’s say that after your analysis of the data, you want to write the data back to a new file. There’s also a way to write your Pandas DataFrames back to files with the [to_excel()](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_excel.html) function.

But, before you use this function, make sure that you have XlsxWriter installed if you want to write your data to multiple worksheets in an `.xlsx` file. You pass the `writer` variable to the `to_excel()` function and you also specify the sheet name. This way, you add a sheet with the data to an existing workbook: you can use the `ExcelWriter` to save multiple, (slightly) different DataFrames to one workbook.

This all means that if you just want to save one DataFrame to a file, you can also go without installing the XlsxWriter package. Then, you just don’t specify the `engine` argument that you would pass to the `pd.ExcelWriter()` function. The rest of the steps stay the same.

Similarly to the functions that you used to read in `.csv` files, you also have a function, `to_csv()`, to write the results back to a comma separated file. It again works much in the same way as when you used it to read in the file:

```python
# Write the DataFrame to csv (`df` is the DataFrame you loaded earlier)
df.to_csv("example.csv")
```

If you want to have a tab separated file, you can also pass a `\t` to the `sep` argument to make this clear. Note that there are various other functions that you can use to output your files. You can find all of them here.

Besides the Pandas package, which you will probably use very often to load in your data, you can also use other packages to get your data in Python.
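Before moving on to those other packages, the Pandas steps described above can be sketched as one small round trip. This is only an illustration under a few assumptions: the DataFrame contents, the file names and the sheet names are made up, the files are written to a temporary folder, and an Excel engine that Pandas can use for `.xlsx` files (for example `openpyxl`) is installed next to Pandas.

```python
import os
import tempfile

import pandas as pd

# Hypothetical example data; replace it with your own DataFrame.
df = pd.DataFrame({"id": [1, 2, 3], "score": [0.5, 0.7, 0.9]})

# Work in a temporary folder so the sketch does not touch your own files.
workdir = tempfile.mkdtemp()
xlsx_path = os.path.join(workdir, "example.xlsx")
tsv_path = os.path.join(workdir, "example.tsv")

# Write two (slightly different) DataFrames to two sheets of one workbook.
# No `engine` argument is passed, so Pandas picks an installed engine itself.
with pd.ExcelWriter(xlsx_path) as writer:
    df.to_excel(writer, sheet_name="Sheet1", index=False)
    df.assign(score=df["score"] * 2).to_excel(writer, sheet_name="Sheet2", index=False)

# Load the workbook again and parse one sheet into a DataFrame.
with pd.ExcelFile(xlsx_path) as xl:
    sheet_names = xl.sheet_names
    df1 = xl.parse("Sheet1")

# Write a tab-separated file and read it back with the matching separator.
df1.to_csv(tsv_path, sep="\t", index=False)
df_back = pd.read_csv(tsv_path, sep="\t")

print(sheet_names)
print(df_back.head())
```

Passing `index=False` keeps the DataFrame index out of the files, which avoids an extra unnamed column when you read them back; leave it out if the index carries information you want to keep.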
Our overview of the available packages is based on this page, which includes a list of packages that you can use to work with Excel files in Python. In what follows, you’ll see how to use these packages with the help of some real-life but simplified examples.

The general advice for installing these packages is to do it in a Python virtualenv without system packages. You can use virtualenv to create isolated Python environments: it creates a folder which contains all the necessary executables to use the packages that a Python project would need.

To start working with `virtualenv`, you first need to install it. Then, go to the directory in which you want to put your project. Create a virtualenv in this folder and load in a specific Python version if you need it. Then, you activate the virtual environment. After that, you can start loading in other packages, start working with them, etc.

Tip: don’t forget to deactivate the environment when you’re done!

Note that the virtual environment might seem a bit troublesome at first when you’re just starting out with your data science project with Python. And, especially when you have only one project to think about, you might not see clearly why you would need a virtual environment at all. But consider how easy it will be when you have multiple projects running at the same time and you don’t want them to share the same Python installation. Or when your projects have conflicting requirements, then the virtual environment will come in handy!

Now you can finally start installing and importing the packages that you have read about to load in your spreadsheet data.

Openpyxl

This package is generally recommended if you want to read and write `.xlsx`, `.xlsm`, `.xltx` and `.xltm` files.

Install `openpyxl` using `pip`: you saw how to do it in the previous section!

The general advice for installing this package is to do it in a Python virtual environment without system packages.
You can use a virtual environment to create isolated Python environments: it creates a folder which contains all the necessary executables to use the packages that a Python project would need.

Go to the directory in which your project is located and re-activate the virtual environment `venv`. Then proceed to install `openpyxl` with `pip` to make sure that you can read and write files with it.

But what is that data exactly? The workbook with the data that you’re trying to get in Python has the following sheets: Sheet 1, Sheet 2 and Sheet 3.

The `load_workbook()` function takes the filename as an argument and returns a `workbook` object, which represents the file. You can check this by running `type(wb)`. Remember to make sure that you’re in the right directory where your spreadsheet is located. Otherwise, you will get an error while importing!

Remember that you can change the working directory with the help of `os.chdir()`.

Once the workbook is loaded, you can ask it for the names of its sheets. Next, you can use this information to also retrieve separate sheets of the workbook.

You can also check which sheet is currently active with `wb.active`.

You’ll see that with these `Worksheet` objects, you won’t be able to do much at first sight. However, you can retrieve values from certain cells in your workbook's sheet by using square brackets `[]`, to which you pass the exact cell from which you want to retrieve the value.

Note that this seems very similar to selecting, getting and indexing NumPy arrays and Pandas DataFrames, yet this is not all that you need to do to get the value; you need to add the attribute `value`.

Besides `value`, there are also other attributes that you can use to inspect your cell, namely `row`, `column` and `coordinate`. For a cell `c` that points at B2:

- The `row` attribute will give back 2;
- Adding the `column` attribute to `c` will give you 'B' in the openpyxl version this tutorial was written for (recent openpyxl releases return the index 2 here and provide the letter through `column_letter`); and
- The `coordinate` will give back 'B2'.

You can also retrieve cell values by using the `cell()` function. Pass the `row` and the `column` arguments and add values to these arguments that correspond to the values of the cell that you want to retrieve and, of course, don’t forget to add the attribute `value`.

Note that if you don’t specify the attribute `value`, you’ll get back the cell object itself, which doesn’t tell you anything about the value that is contained within that particular cell.

To go over several cells at once, you can use a for loop with the help of the `range()` function to print out the values of the rows that have values in column 2. If those particular cells are empty, you’ll just get back `None`. If you want to know more about for loops, consider taking our Intermediate Python for Data Science course.

What’s more, there are also special functions that you can call to get certain other values back, like `get_column_letter()` and `column_index_from_string()`. The two functions already state more or less what you can retrieve by using them, but for clarity it’s best to make them explicit: while you can retrieve the letter of the column with the former, you can do the reverse or get the index of a column when you pass a letter to the latter.

Lastly, there are some attributes that you can use to check up on the result of your import, namely `max_row` and `max_column`. These attributes are of course general ways of making sure that you loaded in the data correctly, but nonetheless they can and will be useful.

This is all very good, but I can almost hear you thinking now that this seems to be an awfully hard way to work with these files, especially if you want to still manipulate the data. There must be something easier, right? You’re right!

You can also use the `DataFrame()` function from the Pandas package to put the values of a sheet into a DataFrame. Next, you can start manipulating the data with all the functions that the Pandas package has to offer. But, remember that you’re in a virtual environment, so if the package is not yet present, you’ll need to install it again via `pip`.
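The cell access, helper functions and DataFrame conversion described above can be sketched as follows. The tutorial's own workbook is not available here, so this sketch first builds a tiny workbook with made-up sheet names and values in a temporary folder; treat those names and values as assumptions for illustration only. It assumes `openpyxl` and Pandas are installed.

```python
import os
import tempfile

import pandas as pd
from openpyxl import Workbook, load_workbook
from openpyxl.utils import column_index_from_string, get_column_letter

# Build a small, hypothetical workbook so there is something to load.
wb_out = Workbook()
ws_out = wb_out.active
ws_out.title = "Sheet1"
for row in [["id", "name"], [1, "alice"], [2, "bob"]]:
    ws_out.append(row)
wb_out.create_sheet("Sheet2")
path = os.path.join(tempfile.mkdtemp(), "demo.xlsx")
wb_out.save(path)

# Load the workbook and inspect its sheets.
wb = load_workbook(path)
names = wb.sheetnames
sheet = wb["Sheet1"]

# Square brackets return a cell object; `.value` holds its content.
c = sheet["B2"]
b2_value = c.value
coord = c.coordinate

# The same cell, addressed by row and column number.
same_value = sheet.cell(row=2, column=2).value

# Loop over every row and collect the values found in column 2.
column_two = [
    sheet.cell(row=i, column=2).value for i in range(1, sheet.max_row + 1)
]

# Convert between column letters and column indices.
letter = get_column_letter(2)
index = column_index_from_string("B")

# Put the sheet values into a DataFrame, using the first row as header.
rows = list(sheet.values)
df = pd.DataFrame(rows[1:], columns=list(rows[0]))

print(names, b2_value, coord, column_two, letter, index)
print(df)
```

Using the first row as the header is a choice made for this sketch; if your sheet has no header row, pass all rows to `pd.DataFrame()` and let Pandas number the columns.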
Remember also that you shouldn’t forget to deactivate the virtual environment when you’re done.

Nevertheless, there are also other packages that you can use to get your spreadsheet data in Python. Read on to get to know more!

xlrd

This package is ideal if you want to read data and format information from files with the `.xls` or `.xlsx` extension. Note, however, that xlrd 2.0 and later read only `.xls` files, so use one of the other packages for `.xlsx` if you have a recent version installed.

When you don’t want to consider the whole workbook, you might want to use functions such as `sheet_by_name()` or `sheet_by_index()` to retrieve the sheets that you do want to use in your analysis.

Lastly, you can also retrieve the value at certain coordinates, which you express with indices, from your sheet.

Continue to `xlwt` and `xlutils` to know more about how they relate to the `xlrd` package!

xlwt

If you want to create spreadsheets that have your data in them, you can also use the `xlwt` package, apart from the `XlsxWriter` package. `xlwt` is ideal to write data and format information to files with an `.xls` extension.

If you want to write the data to a file, but you don’t want to go through the trouble of doing everything yourself, you can always resort to a for loop to automate the whole process a little bit. Compose a script in which you initialize a workbook and to which you add a sheet. Specify a list with the columns and one with values that will be filled in the worksheet.

Now that you have seen how `xlrd` and `xlwt` can work together, it’s time to look at a package that is closely linked to these two: `xlutils`.

xlutils

This package is basically a collection of utilities that require both `xlrd` and `xlwt`, which includes the ability to copy and modify or filter existing files. Generally speaking, these use cases are now covered by `openpyxl`.

Go back to the section on `openpyxl` to get more information on how to use this package to get data in Python.
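As a rough sketch of how the `xlwt` and `xlrd` steps from the sections above fit together, the script below writes a small `.xls` file with a for loop and reads it back. The column names, values and file name are hypothetical, the file goes to a temporary folder, and the code assumes both packages are installed. Keep in mind that both are older packages, so this is meant for legacy `.xls` files only.

```python
import os
import tempfile

import xlrd
import xlwt

# Hypothetical header and data rows to fill the worksheet with.
columns = ["id", "name"]
values = [[1, "alice"], [2, "bob"]]

# Initialize a workbook and add a sheet to it.
book_out = xlwt.Workbook()
sheet_out = book_out.add_sheet("Sheet1")

# Write the header in row 0, then one row per entry in `values`.
for col_idx, header in enumerate(columns):
    sheet_out.write(0, col_idx, header)
for row_idx, row in enumerate(values, start=1):
    for col_idx, value in enumerate(row):
        sheet_out.write(row_idx, col_idx, value)

xls_path = os.path.join(tempfile.mkdtemp(), "example.xls")
book_out.save(xls_path)

# Read the file back and pick out a sheet by name or by index.
book_in = xlrd.open_workbook(xls_path)
sheet_names = book_in.sheet_names()
by_name = book_in.sheet_by_name("Sheet1")
by_index = book_in.sheet_by_index(0)

# Retrieve a value at zero-based (row, column) coordinates.
name_cell = by_name.cell_value(1, 1)

print(sheet_names, by_index.nrows, by_index.ncols, name_cell)
```

Note that the coordinates in both packages are zero-based, unlike the one-based rows and columns that `openpyxl` uses.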
pyexcel

Another package that you can use to read spreadsheet data in Python is `pyexcel`; it’s a Python wrapper that provides one API for reading, manipulating and writing data in `.csv`, `.ods`, `.xls`, `.xlsx` and `.xlsm` files. Of course, for this tutorial, you will just focus on the `.xls` files.

To get your data in an array, you can use the `get_array()` function that is contained within the `pyexcel` package.

You can also get your data in an ordered dictionary of lists. You can use the `get_dict()` function.

However, if you want to get back a dictionary of two-dimensional arrays or, stated differently, obtain all the workbook sheets in a single dictionary, you can resort to `get_book_dict()`.

Be aware that these two data structures that were mentioned above, the arrays and dictionaries of your spreadsheet, allow you to create DataFrames of your data with `pd.DataFrame()`. This will make it easier to handle your data!

Lastly, you can also just retrieve the records with `pyexcel` thanks to the `get_records()` function. Just pass the `file_name` argument to the function and you should be getting back a list of dictionaries. To learn how you can manipulate Python lists, check out our 18 Most Common Python List Questions.

Just like it’s easy to load your data into arrays with this package, you can also easily export your arrays back to a spreadsheet. Use the `save_as()` function and pass the array and the name of the destination file to the `dest_file_name` argument. Note that if you want to specify a delimiter, you can add the `dest_delimiter` argument and pass the symbol that you want to use as a delimiter in between `""`.

If, however, you have a dictionary, you’ll need to use the `save_book_as()` function. Pass the two-dimensional dictionary to `bookdict` and specify the file name and you’re good.

Something that you should keep in mind when you save a dictionary this way is that the order of your data in the dictionary will not be kept. If you don’t want this, you will need to make a small detour. You can read all about it here.

If you’re still looking for packages that allow you to load in and write data to `.csv` files besides Pandas, you can best use the `csv` package. Note also that the NumPy package has a function `genfromtxt()` that allows you to load in the data that is contained within `.csv` files in arrays which you can then put in DataFrames. You can find out more about this function in DataCamp’s NumPy tutorial.

Final Check of Your Data

When you have the data available, don’t forget the last step: checking whether the data has been loaded in correctly. If you have put your data in a DataFrame, you can easily and quickly check whether the import was successful.

Tip: make use of DataCamp’s Pandas Cheat Sheet when you’re considering loading files as Pandas DataFrames. For more guidance on how to manipulate Python DataFrames, take our Pandas Tutorial: DataFrames in Python.

If you have the data in an array, you can inspect it by making use of the following array attributes: `shape`, `ndim`, `dtype`, etc.

If you want to know more on how you can leverage NumPy arrays for data analysis, consider also going through our NumPy tutorial and definitely don’t forget to use our NumPy cheat sheet!

What’s Next?

Congratulations! You have successfully gone through our tutorial that taught you all about how to read Excel files in Python. But importing data is just the start of your data science workflow. Once you have the data from your spreadsheets in your environment, you can focus on what really matters: analyzing your data.

If you have loaded your data in DataFrames, consider taking our Pandas Foundations course or Manipulating DataFrames with Pandas courses, which are both taught by Dhavide Aruliah, Director of Training at Continuum Analytics and the creator and driving force behind Anaconda.

If you want to continue working on this topic, however, consider checking out PyXll, which enables you to write functions in Python and call them in Excel.