We all have to deal with data, and we try to learn about and implement machine learning into our projects. But everyone seems to forget one thing... it's far from perfect, and there is so much to go through! Don't worry, we'll discuss every little step, from start to finish 👀. All you'll need are these fundamentals.

The Story Behind it All

We all start with either a dataset or a goal in mind. Once we've found, collected or scraped our data, we pull it up, and witness the overwhelming sight of merciless cells of numbers, more numbers, categories, and maybe some words 😨! A naive thought crosses our mind, to use our machine learning prowess to deal with this tangled mess... but a quick search reveals the host of tasks we'll need to consider before training a model 😱!

Once we overcome the shock of our unruly data, we look for ways to battle our formidable nemesis 🤔. We start by trying to get our data into Python. It is relatively simple on paper, but the process can be slightly... involved. Nonetheless, a little effort is all that's needed (lucky us).

Without wasting any time, we begin data cleaning - getting rid of the bogus and exposing the beautiful. Our methods start simple - observe and remove. It works a few times, but then we realise... it really doesn't do us justice! To deal with the mess, though, we find a powerful tool to add to our arsenal: charts! With our graphs, we can get a feel for our data, the patterns within it and where things are missing. We can interpolate (fill in) or remove missing data.

Finally, we approach our highly anticipated 😎 challenge, data modelling! With a little research, we find out which tactics and models are commonly used. It is a little difficult to decipher which one we should use, but we still manage to get through it and figure it all out!

We can't finish a project without doing something impressive though. So, a final product, website, app or even a report will take us far! We know first impressions are important, so we fix up the GitHub repository and make sure everything's well documented and explained. Now we are finally able to show off our hard work to the rest of the world 😎!

The epochs

Chapter 1 - Importing Data

Data comes in all kinds of shapes and sizes, and so the process we use to get everything into code often varies.

Let's be real, importing data seems easy, but sometimes... it's a little pesky.

The hard part about data cleaning isn't the coding or theory, but instead our preparation! When we first start a new project and download our dataset, it can be tempting to open up a code editor and start typing... but this won't do us any good. If we want to get a head start, we need to prepare ourselves for the best and worst parts of our data. To do this we'll need to start basic, by manually inspecting our spreadsheet(s). Once we understand the basic format of the data (filetype along with any particularities), we can move onto getting it all into Python.

When we're lucky and just have one spreadsheet, we can use the Pandas read_csv function (letting it know where our data lies):

```python
import pandas as pd

pd.read_csv("file_path.csv")
```

In reality, we run into way more complex situations, so look out for:

- The file starts with unneeded information (which we need to skip)
- We only want to import a few columns
- We want to rename our columns
- Data includes dates
- We want to combine data from multiple sources into one place
- Data can be grouped together

Although we're discussing a range of scenarios, we normally only deal with a few at a time.
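Before tackling those trickier scenarios, a minimal first import and inspection might look like the sketch below. The CSV contents here are a made-up stand-in for a real file on disk, purely so the snippet runs on its own:

```python
import io

import pandas as pd

# Made-up stand-in for a real file (a real project would point pd.read_csv at "file_path.csv")
csv_contents = io.StringIO(
    "Country,Rating,Date\n"
    "Australia,4.5,2020-01-01\n"
    "Japan,3.0,2020-01-02\n"
)

data = pd.read_csv(csv_contents)

# A quick manual inspection before any cleaning or modelling
print(data.head())    # first few rows
print(data.shape)     # (rows, columns)
print(data.dtypes)    # datatype of each column
```

Even a glance at head and dtypes tells us whether the import went as planned, before we bother with any extra parameters.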
Our first few problems (importing specific parts of our data/renaming columns) are easy enough to deal with using a few parameters, like the number of rows to skip, the specific columns to import and our column names:

```python
pd.read_csv("file_path.csv", skiprows=5, usecols=[0, 1], names=["Column1", "Column2"])
```

Whenever our data is spread across multiple files, we can combine them using the Pandas concat function. The concat function combines a list of DataFrames together:

```python
my_spreadsheets = [pd.read_csv("first_spreadsheet.csv"), pd.read_csv("second_spreadsheet.csv")]
pd.concat(my_spreadsheets, ignore_index=True)
```

We pass concat a list of spreadsheets (which we import just like before). The list can, of course, be attained in any way (so a fancy list comprehension or a casual list of every file both work just as well), but just remember that we need dataframes, not filenames/paths!

If we don't have a CSV file, Pandas still works! We can just swap out read_csv for read_excel, read_sql or another option.

After all the data is inside a Pandas dataframe, we need to double-check that our data is formatted correctly. In practice, this means checking each series' datatype and making sure they are not generic objects. We do this to ensure that we can utilise Pandas' inbuilt functionality for numeric, categorical and date/time values. To look at this, just run DataFrame.dtypes. If the output seems reasonable (i.e. numbers are numeric, categories are categorical, etc.), then we should be fine to move on. However, this is normally not the case, and we need to change our datatypes! This can be done with Pandas DataFrame.astype. If that doesn't work, there should be another, more specific, Pandas function for that conversion:

```python
data["Rating"] = data["Rating"].astype("category")
data["Number"] = pd.to_numeric(data["Number"])
data["Date"] = pd.to_datetime(data["Date"])
data["Date"] = pd.to_datetime(data[["Year", "Month", "Day", "Hour", "Minute"]])
```

If we need to analyse separate groups of data (i.e. maybe our data is divided by country), we can use Pandas groupby. We can use groupby to select particular data, and to run functions on each group separately:

```python
data.groupby("Country").get_group("Australia")
data.groupby("Country").mean()
```

Other more niche tricks like multi/hierarchical indices can also be helpful in specific scenarios; however, they are trickier to understand and use.

Chapter 2 - Data Cleaning

Data is useful, data is necessary; however, it needs to be clean and to the point! If our data is everywhere, it simply won't be of any use to our machine learning model.

Everyone is driven insane by missing data, but there's always a light at the end of the tunnel. The easiest and quickest way to get through data cleaning is to ask ourselves:

What features within our data will impact our end-goal?

By end-goal, we mean whatever variable we are working towards predicting, categorising or analysing. The point of this is to narrow our scope and not get bogged down in useless information.

Once we know what our primary objective features are, we can try to find patterns, relations, missing data and more. An easy and intuitive way to do this is graphing! Quickly use Pandas to sketch out each variable in the dataset, and try to see where everything fits into place. Once we have identified potential problems or trends in the data, we can try and fix them (a quick first pass is sketched below).
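As a rough sketch of that first pass (the dataframe and column names here are synthetic and purely illustrative), counting missing values and plotting each variable can quickly show where the problems are:

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# A tiny synthetic dataset standing in for whatever we just imported
data = pd.DataFrame({
    "Country": ["Australia", "Australia", "Japan", "Japan", "Brazil"],
    "Rating":  [4.5, np.nan, 3.0, 4.0, np.nan],
    "Number":  [10, 20, np.nan, 40, 50],
})

# How many values are missing in each column?
print(data.isnull().sum())

# Which categories appear, and how often?
print(data["Country"].value_counts())

# Quick histograms of the numeric columns to spot odd distributions
data.hist(figsize=(8, 4))
plt.show()
```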
In general, we have the following options:

- Remove missing entries
- Remove full columns of data
- Fill in missing data entries
- Resample data (i.e. change the resolution)
- Gather more information

To go from identifying missing data to choosing what to do with it, we need to consider how it affects our end-goal. We remove anything which doesn't seem to have a major influence on the end result (i.e. where we couldn't find a meaningful pattern), or where there just seems too much missing to derive value. Sometimes we also decide to remove very small amounts of missing data (since it's easier than filling it in).

If we've decided to get rid of information, Pandas DataFrame.drop can be used. It removes columns or rows from a dataframe. It is quite easy to use, but remember that Pandas does not modify/remove data from the source dataframe by default, so inplace=True must be specified. It may be useful to note that the axis parameter specifies whether rows or columns are being removed (see the short sketch at the end of this chapter).

When not removing a full column, or when particularly targeting missing data, it can often be useful to rely on a few nifty Pandas functions. For removing null values, DataFrame.dropna can be utilised. Do keep in mind, though, that by default dropna removes every row that contains any missing value. However, setting the parameter how to "all", or setting a threshold (thresh, the minimum number of non-null values a row needs to be kept), can compensate for this.

If we've got small amounts of irregular missing values, we can fill them in in several ways. The simplest is DataFrame.fillna, which sets the missing values to some preset value. The more complex, but flexible, option is interpolation using DataFrame.interpolate. Interpolation essentially allows anyone to simply set the method they would like to replace each null value with. These include the previous/next value, linear and time (the last two deduce the values based on the data). Whenever working with time, time is a natural choice; otherwise, make a reasonable choice based on how much data is being interpolated and how complex it is.

```python
data["Column"].fillna(0, inplace=True)
data[["Column"]] = data[["Column"]].interpolate(method="linear")
```

As seen above, interpolate needs to be passed a dataframe purely containing the columns with missing data (otherwise an error will be thrown).

Resampling is useful whenever we see regularly missing data or have multiple sources of data using different timescales (like ensuring measurements in minutes and hours can be combined). It can be slightly difficult to intuitively understand resampling, but it is essentially averaging measurements over a certain timeframe. For example, we can get monthly values by specifying that we want the mean of each month's values:

```python
data.resample("M").mean()
```

The "M" stands for month and can be replaced with "Y" for year and other options.

Although the data cleaning process can be quite challenging, if we remember our initial intent, it becomes a far more logical and straightforward task! If we still don't have the needed data, we may need to go back to phase one and collect some more. Note that missing data indicates a problem with data collection, so it's useful to carefully consider, and note down, where it occurs.

For completion, the Pandas unique and value_counts functions are useful for deciding which features to straight-up remove and which to graph/research.
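Since the chapter describes DataFrame.drop and DataFrame.dropna without showing them, here's a minimal sketch on a tiny synthetic dataframe (the column names, values and thresholds are all made up for illustration):

```python
import numpy as np
import pandas as pd

# A tiny synthetic dataframe with some missing values
data = pd.DataFrame({
    "Rating":           [4.5, np.nan, 3.0, np.nan],
    "Number":           [10, 20, np.nan, np.nan],
    "Unhelpful Column": [np.nan, np.nan, np.nan, np.nan],
})

# Drop an entire column we decided isn't useful (axis=1 means columns)
data.drop("Unhelpful Column", axis=1, inplace=True)

# Drop rows only when every value in them is missing...
data.dropna(how="all", inplace=True)

# ...or require at least 2 non-null values for a row to be kept
data.dropna(thresh=2, inplace=True)

print(data)
```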
Chapter 3 - Visualisation

Visualisation sounds simple, and it is, but it's hard to... not overcomplicate. It's far too easy for us to consider plots as a chore to create. Yet, these bad boys do one thing very, very well - intuitively demonstrate the inner workings of our data!

Just remember: we graph data to find and explain how everything works. Hence, when stuck for ideas, or not quite sure what to do, we can always fall back on the basics: identifying useful patterns and meaningful relationships. It may seem iffy 🥶, but it is really useful.

Our goal isn't to draw fancy hexagon plots, but instead to picture what is going on, so absolutely anyone can simply interpret a complex system!

A few techniques are undeniably useful:

- Resampling when we have too much data
- A secondary axis when plots have different scales
- Grouping when our data can be split categorically

To get started graphing, simply use Pandas .plot() on any series or dataframe! When we need more, we can delve into MatPlotLib, Seaborn or an interactive plotting library.

```python
data.plot(x="column 1 name", y="column 2 name", kind="bar", figsize=(10, 10))
data.plot(x="column 1 name", y="column 3 name", secondary_y=True)
data.hist()
data.groupby("group").boxplot()
```

90% of the time, this basic functionality will suffice (more info here), and where it doesn't, a search should reveal how to draw particularly exotic graphs 😏.

Chapter 4 - Modelling

A Brief Overview

Now, finally, for the fun stuff - deriving results. It seems so simple to train a scikit-learn model, but no one goes into the details! So, let's be honest here, not every dataset, nor model, is equal.

Our approach to modelling will vary widely based on our data. There are three especially important factors:

- Type of problem
- Amount of data
- Complexity of data

Our type of problem comes down to whether we are trying to predict a class/label (called classification), a value (called regression), or to group data (called clustering). If we are trying to train a model on a dataset where we already have examples of what we're trying to predict, we call our model supervised; if not, unsupervised.

The amount of available data, and how complex it is, foreshadows how simple a model will suffice. Data with more features (i.e. columns) tends to be more complex.

The point of interpreting complexity is to understand which models are too good or too bad for our data. A model's goodness of fit informs us on this! If a model struggles to interpret our data (too simple) we can say it underfits, and if it is completely overkill (too complex) we say it overfits. We can think of it as a spectrum from learning nothing to memorising everything.

We need to strike a balance, to ensure our model is able to generalise our conclusions to new information. This is typically known as the bias-variance tradeoff.

Note that complexity also affects model interpretability. Complex models take a substantially greater time to train, especially with large datasets. So, upgrade that computer, run the model overnight, and chill for a while 😁!

Preparation/Splitting up data

Before training a model, it is important to note that we will need some dataset to test it on (so we know how well it performs). Hence, we often divide our dataset into separate training and testing sets. This allows us to test how well our model can generalise to new unseen data. This normally works because we know our data is decently representative of the real world.

The actual amount of test data doesn't matter too much, but 80% train and 20% test is often used. In Python, with Scikit-learn, the train_test_split function does this:

```python
from sklearn.model_selection import train_test_split

train_data, test_data = train_test_split(data)
```
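For a slightly fuller sketch (the features and labels here are randomly generated and the names are made up), an explicit 80/20 split of inputs and outputs might look like this:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic features and labels, purely for illustration
X = np.random.rand(100, 3)           # 100 rows, 3 feature columns
y = np.random.randint(0, 2, 100)     # binary labels

# Hold out 20% of the data for testing; random_state makes the split repeatable
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)   # (80, 3) (20, 3)
```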
Cross-validation is where a dataset is split into several folds (i.e. subsets or portions of the original dataset). This tends to be more robust and resistant to overfitting than using a single test/validation set! Several Sklearn functions help with cross-validation; however, it's normally done straight through a grid or random search (discussed below).

```python
from sklearn.model_selection import cross_val_score

cross_val_score(model, input_data, output_data, cv=5)
```

Hyperparameter tuning

There are some factors our model cannot account for, and so we set certain hyperparameters. These vary from model to model, but we can find optimal values either through manual trial and error or a simple algorithm like grid or random search. With grid search, we try all possible values (brute force 😇), and with random search, random values from within some distribution/selection. Both approaches typically use cross-validation.

Grid search in Sklearn works through a parameters dictionary. Each entry's key represents the hyperparameter to tune, and the value (a list or tuple) is the selection of values to choose from:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

parameters = {'kernel': ('linear', 'rbf'), 'C': [1, 10]}
model = SVC()
grid = GridSearchCV(model, param_grid=parameters)
```

After we've created the grid, we can use it to train the models, and extract the scores:

```python
grid.fit(train_input, train_output)
best_score, best_params = grid.best_score_, grid.best_params_
```

The important thing here is to remember that we need to train on the training data and not the testing data. Even though cross-validation is used to test the models, we're ultimately trying to get the best fit on the training data, and will proceed to test each model on the testing set afterwards:

```python
test_predictions = grid.predict(test_input)
```

Random search in Sklearn works similarly, but is slightly more complex, as we need to know what type of distribution each hyperparameter takes. Although it can, in theory, yield the same or better results faster, that changes from situation to situation. For simplicity, it is likely best to stick to a grid search.

Model Choices

With Sklearn, it's as simple as finding our desired model's name and then just creating a variable for it. Check the links to the documentation for further details! For example:

```python
from sklearn.svm import SVR

support_vector_regressor = SVR()
```

Basic Choices

Linear/Logistic Regression

Linear regression is trying to fit a straight line to our data. It is the most basic and fundamental model. There are several variants of linear regression, like lasso and ridge regression (which are regularisation methods to prevent overfitting). Polynomial regression can be used to fit curves of higher degrees (like parabolas and other curves). Logistic regression is another variant which can be used for classification.

Support Vector Machines

Just like with linear/logistic regression, support vector machines (SVMs) try to fit a line or curve to data points. However, with SVMs, the aim is to maximise the distance between the boundary and each point (instead of getting the line/curve to go through each point).

The main advantage of support vector machines is their ability to use different kernels. A kernel is a function which calculates similarity. These kernels allow for both linear and non-linear data, whilst staying decently efficient. The kernels map the input into a higher-dimensional space so that a boundary becomes present. This process is typically not feasible for large numbers of features.
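As a minimal illustration of the kernel idea (using scikit-learn's bundled iris dataset rather than anything from this guide), swapping kernels is just a parameter change:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Scikit-learn's small built-in iris dataset, purely for illustration
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# A linear kernel and an RBF (non-linear) kernel, fitted and scored the same way
for kernel in ("linear", "rbf"):
    model = SVC(kernel=kernel)
    model.fit(X_train, y_train)
    print(kernel, model.score(X_test, y_test))
```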
A neural network or another model will then likely be a better choice!

Neural Networks

All the buzz is always about deep learning and neural networks. They are complex, slow and resource-intensive models which can be used for complex data. Yet, they are extremely useful when encountering large unstructured datasets.

When using a neural net, make sure to watch out for overfitting. An easy way is through tracking changes in error with time (known as learning curves).

Deep learning is an extremely rich field, so there is far too much to discuss here. In fact, Scikit-learn is a machine learning library with little deep learning ability (compared to PyTorch or TensorFlow).

Decision Trees

Decision trees are simple and quick ways to model relationships. They are basically a tree of decisions which helps to decide what class or label a datapoint belongs to. Decision trees can be used for regression problems too.

Although simple, to avoid overfitting, several hyperparameters must be chosen. These all, in general, relate to how deep the tree is and how many decisions are to be made.

K-Means

We can group unlabelled data into several clusters using k-means. Normally, the number of clusters present is a chosen hyperparameter.

K-means works by trying to optimise (reduce) some criterion (i.e. function) called inertia. It can be thought of as trying to minimise the distance from a set of centroids to each data point.

Ensembles

Random Forests

Random forests are combinations of multiple decision trees trained on random subsets of the data (bootstrapping). This process is called bagging and allows random forests to obtain a good fit (low bias and low variance) with complex data.

The rationale behind this can be likened to democracy. One voter may vote for a bad candidate, but we'd hope that the majority of voters make informed, positive decisions.

For regression problems, we average each decision tree's outputs, and for classification, we choose the most popular one. This might not always work, but we assume it in general will (especially with large datasets with multiple columns).

Another advantage of random forests is that insignificant features shouldn't negatively impact performance, because of the democratic-esque bootstrapping process!

Hyperparameter choices are the same as those for decision trees, but with the number of decision trees as well. For the aforementioned reasons, more trees equals less overfitting!

Note that random forests use random subsets, with replacement, of rows and columns!

Gradient Boosting

Ensemble models like AdaBoost or XGBoost work by stacking one model on top of another. The assumption here is that each successive weak learner will correct for the flaws of the previous one (hence the name boosting). Hence, the combination of models should provide the advantages of each model without its potential pitfalls.

The iterative approach means previous models' performance affects current models, and better models are given a higher priority. Boosted models perform slightly better than bagging models (a.k.a. random forests), but are also slightly more likely to overfit. Sklearn provides AdaBoost for classification and regression.
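As a quick, hedged sketch of the ensemble ideas above (the data is randomly generated and the hyperparameters are arbitrary), a bagged and a boosted model are trained just like any other scikit-learn model:

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic data, purely for illustration
X = np.random.rand(200, 5)
y = (X[:, 0] + X[:, 1] > 1).astype(int)

# Bagging: many decision trees fitted on bootstrapped subsets of the data
forest = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)

# Boosting: weak learners stacked so each one corrects the previous one's mistakes
boosted = AdaBoostClassifier(n_estimators=50, random_state=42)

for name, model in [("random forest", forest), ("adaboost", boosted)]:
    scores = cross_val_score(model, X, y, cv=5)
    print(name, scores.mean())
```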
Chapter 5 - Production

This is the last but potentially most important part of the process 🧐. We've put in all this work, and so we need to go the distance and create something impressive!

There are a variety of options. Streamlit is an exciting option for data-oriented websites, tools like Kotlin, Swift and Dart can be used for Android/iOS development, and JavaScript with frameworks like VueJS can also be used for extra flexibility. After trying most of these, I honestly would recommend sticking to Streamlit, since it is so much easier than the others!

Here it is important to start with a vision (the simpler the better) and try to find out which parts are most important. Then try and specifically work on those. Continue till completion!

For websites, a hosting service like Heroku will be needed, so the rest of the world can see the amazing end-product of all our hard work 🤯😱.

Even if none of the above options suit the scenario, a report or article covering what we've done, what we've learnt and any suggestions/lessons learnt, along with a well-documented GitHub repository, are indispensable! Make sure that readme file is up to date.

THANKS FOR READING!

I really hope this article has helped you out! For updates, follow me on Twitter.

If you enjoyed this, you may also like The Complete Coding Practitioners Handbook, which goes through each and every practical coding tool you'll need to know. If you're lost considering what projects to take on, consider checking out my zero-to-hero guides on choosing a project and collecting your own dataset through webscraping.

The goodness of fit graph is a modified version of the one in the Sklearn documentation. Photos by National Cancer Institute, Dane Deaner, ThisisEngineering RAEng, Adam Nowakowski and Guilherme Caetano on Unsplash.

Previously published at https://www.kamwithk.com/machine-learning-field-guide-ckbbqt0iv025u5ks1a7kgjckx