Figure 1: Photo by Luke Chesser on Unsplash
Machine learning is an area of artificial intelligence (AI) and computer science that focuses on using data and algorithms to mimic the way humans learn, with the aim of gradually improving accuracy.
Machine learning is a crucial part of the rapidly expanding discipline of data science. Algorithms are trained to generate classifications or predictions using statistical approaches, revealing crucial insights in data mining initiatives. These insights then drive decision-making within applications and enterprises, with the goal of influencing important growth KPIs. As big data continues to expand, the demand for data scientists will rise: they are needed to identify the most relevant business questions and, in turn, the data required to answer them.
Machine learning projects are essentially based on two things: the data and the model. Together, they are the foundation of any machine learning system.
MODEL-CENTRIC APPROACH
Machine learning is an iterative process of running empirical experiments to improve the model's performance. In a model-centric approach, this means selecting the best model architecture and training technique from a vast array of options to arrive at a superior result.
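To make that loop concrete, here is a minimal sketch, assuming a fixed, already-prepared dataset (generated synthetically here) and a handful of scikit-learn candidates; the models and scoring choices are purely illustrative, not the ones used later in the case study.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Placeholder data standing in for a fixed, already-prepared dataset.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Model-centric iteration: the data stays fixed while candidate
# architectures are compared, and the best scorer is kept.
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "svm": SVC(),
}

scores = {name: cross_val_score(model, X, y, cv=5).mean()
          for name, model in candidates.items()}
print(scores)
print("best candidate:", max(scores, key=scores.get))
```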
DATA-CENTRIC APPROACH
This entails systematically altering and improving datasets to increase the accuracy of your AI system. It is frequently overlooked, and data collection is often treated as a one-time effort.
To get the best results, you hold the model or code fixed and iteratively enhance the data quality. The data must be well defined: annotation and labelling should follow clear norms and definitions, which may require the participation of multiple labellers and subject-matter experts.
Figure 2: Differences between a Data-Centric and Model-Centric Approach to Machine Learning Projects
CASE STUDY
Our case study looks at agricultural financing in India, which has played a critical role in supporting farm output. Though the breadth and amount of agricultural credit have grown over time, several flaws have emerged, threatening the profitability and sustainability of these institutions. A quiet transition is taking place in rural areas: shifts in consumption and dietary habits from cereals to non-cereal items are requiring diversification in agricultural output and value-addition processes to protect the jobs and incomes of the rural population.
It is concluded that most short- and medium-term agricultural loans in India are taken by the marginal land-size groups. The percentage of indebted agricultural households relative to total agricultural households increases as land size increases. In Punjab, Uttar Pradesh, Andhra Pradesh, West Bengal, Karnataka, Odisha, and Rajasthan, the percentage of holdings is lower than the percentage of indebted agricultural households.
To confirm or validate the above findings, a brief, non-intensive study was carried out for this case.
The ‘loan_data_final’ table combines weather data covering the ‘Total Replenishable Groundwater Resource’ per state for 2003, remote sensing data on the number of fire points reported by the Forest Survey of India for 2014, state-wise maize yield figures for 2014, and state-wise agricultural loan disbursements to farmers by commercial banks. The loan status was generated manually at random.
Most of the data were obtained from the Open Government Data (OGD) Platform India and the Indian Water Portal, while the others were manually formulated.
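Assuming the sources were loaded as CSV files, a merge along these lines would produce the combined table; the file names and column names below are hypothetical placeholders, not the notebook's actual ones.

```python
import pandas as pd

# Hypothetical file and column names; the notebook's actual sources may differ.
groundwater = pd.read_csv("groundwater_2003.csv")   # 'State', 'Total_Replenishable_GW'
fire_points = pd.read_csv("fire_points_2014.csv")   # 'State', 'Fire_Points'
maize_yield = pd.read_csv("maize_yield_2014.csv")   # 'State', 'Maize_Yield'
loans = pd.read_csv("agri_loan_disbursement.csv")   # 'State', 'Loan_Disbursed'

# Join on the state name, which acts as the primary identifier.
loan_data_final = (loans
                   .merge(groundwater, on="State")
                   .merge(fire_points, on="State")
                   .merge(maize_yield, on="State"))
print(loan_data_final.shape)
```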
Figure 3: Notebook that shows the entire machine learning process from data collection to training and testing models
As seen above in cell [39], there is no loan status column, so for the purpose of this study it was generated manually.
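A sketch of how such a column could be generated at random, assuming the combined table from the earlier sketch and a hypothetical ‘Loan_Status’ column name:

```python
import numpy as np

# Randomly assign an approval status purely for illustration;
# 'Loan_Status' is a hypothetical column name.
rng = np.random.default_rng(seed=42)
loan_data_final["Loan_Status"] = rng.choice(["Y", "N"], size=len(loan_data_final))
```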
A very important part of data preprocessing is checking for outliers, missing values, and inconsistencies, and this can be observed in cells [40] to [44].
The next step is to fix these gaps by filling or removing values based on relevance; here, the mean value of each column was used. All of this was done in cells [45] to [51], and the result of the changes can be seen in cell [52], with all the rows adequately filled.
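A minimal sketch of those checks and of mean imputation, carried over from the earlier hypothetical table:

```python
# Inspect missing values and get a quick sense of ranges/outliers.
print(loan_data_final.isna().sum())
print(loan_data_final.describe())

# Fill numeric gaps with each column's mean, mirroring the choice
# made in cells [45] to [51].
numeric_cols = loan_data_final.select_dtypes(include="number").columns
loan_data_final[numeric_cols] = loan_data_final[numeric_cols].fillna(
    loan_data_final[numeric_cols].mean()
)
```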
Now that we have a somewhat credible data table, we turn our attention to cells [57] to [66], where training and testing were done with logistic regression, splitting the data into train and test sets at a 7:3 ratio. Evidently, we have a train score of 0.76 and a test score of 0.66.
Out of 2 ‘Y’ outcomes, 0 were right and 2 were wrong; similarly, for ‘N’, 4 were right and 0 were wrong.
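The training and evaluation step could look roughly like this sketch; the feature and target column names are assumptions carried over from the earlier sketches, not necessarily what the notebook uses.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Hypothetical feature/target selection.
X = loan_data_final.drop(columns=["State", "Loan_Status"])
y = loan_data_final["Loan_Status"]

# 7:3 train/test split, as in the notebook.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("train score:", model.score(X_train, y_train))
print("test score:", model.score(X_test, y_test))
print(confusion_matrix(y_test, model.predict(X_test), labels=["Y", "N"]))
```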
INFERENCES
This looks decent for a makeshift study of the topic in question, but if the process is examined closely, it places little value on the data used and gives all the power to the model. The issues with this process are discussed under the following pointers.
Consistency of data
The data is without a doubt inconsistent right off the bat: the weather data is from 2003, while the production data is maize output from 2014.
The remote sensing data was from 2014, but as seen, when a merge was attempted the number of rows dropped from 30 to 18. This is because of inconsistency in the primary identifier, in this case the names of the states.
If even one or two of these issues had been addressed during data preprocessing, the model would surely have done a better job.
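One way to address the identifier inconsistency is to normalise the state names before merging. The alias mapping below is only an example of the kind of fix needed, applied to the hypothetical source tables from the earlier sketch.

```python
# Map known spelling variants to a single canonical name before merging.
aliases = {"orissa": "odisha", "uttaranchal": "uttarakhand"}

def clean_state(name: str) -> str:
    key = str(name).strip().lower()
    return aliases.get(key, key).title()

for df in (loans, groundwater, fire_points, maize_yield):
    df["State"] = df["State"].map(clean_state)
```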
Size of data
The amount of data we had to work with here was minimal and not sufficient for training and testing, which was most evident in the confusion matrix: there was simply not enough to compare against.
Quality of data
It is very common to see many missing values in data for various reasons, but with a dataset this small, filling that many rows with makeshift values like the mean or mode can seriously skew the quality of the data.
The primary identifier should have been the crop type, with the states as a feature; that way there would be many more variables to compare against.
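A sketch of what that restructuring could look like, using a small made-up table with hypothetical column names; keeping the crop as the identifier and encoding the state as an ordinary feature gives many more rows to learn from than one row per state.

```python
import pandas as pd

# Made-up long-format table keyed by crop, with the state as a feature.
crop_yield = pd.DataFrame({
    "Crop":  ["Maize", "Maize", "Rice", "Rice"],
    "State": ["Punjab", "Odisha", "Punjab", "Odisha"],
    "Yield": [30.1, 18.4, 41.2, 25.7],
})

# One-hot encode the state so it enters the model as an ordinary feature.
features = pd.get_dummies(crop_yield, columns=["State"])
print(features)
```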
Also, it is not easy for an analyst to see how the ‘Total Replenishable Groundwater Resource’ alone could be sufficient to predict production, and then loan status, much less for a machine learning model. It would have helped a great deal if more research had been carried out and more relevant data provided.
Data is what makes or breaks your model. It’s not about the quantity but rather the quality. Therefore, your dataset needs to be diverse enough to provide your model with all the information it requires. Continuous validation from your production environment enables your organization to create a machine learning model that is constantly learning from its own behavior while adapting to situations it has never encountered before.
It all boils down to the most important point: all models rely on data, and you can’t get them to production or have them function optimally without it. So, in essence, models and AI have always been data-centric, but the approach to using data is now changing. Instead of talking about their models, data scientists are focusing more on the quality of the data that builds those models. There is only so much your model can do.
For a more convincing outlook on how and why a data-centric approach is important to the development of AI, including options for improving data quality such as data profiling, synthetic data, and data labelling, refer to the Data-Centric AI Community website and the resources in the awesome data-centric-ai GitHub repository.