
Comparison of Machine Learning Methods: Abstract and Introduction


Too Long; Didn't Read

This study proposes a set of carefully curated linguistic features for shallow machine learning methods and compares their performance with deep language models.

Authors:

(1) Busra Tabak [0000-0001-7460-3689], Bogazici University, Turkey {[email protected]};

(2) Fatma Basak Aydemir [0000-0003-3833-3997], Bogazici University, Turkey {[email protected]}.

Abstract

Software issues contain units of work to fix, improve, or create new threads during development and facilitate communication among team members. Assigning an issue to the most relevant team member and determining the category of an issue are tedious and challenging tasks. Wrong classifications cause delays and rework in the project and trouble among the team members. This paper proposes a set of carefully curated linguistic features for shallow machine learning methods and compares the performance of shallow and ensemble methods with deep language models. Unlike the state of the art, we assign issues to four roles (designer, developer, tester, and leader) rather than to specific individuals or teams, which contributes to the generality of our solution. We also consider the level of experience of the developers to reflect industrial practices in our solution formulation. We collect and annotate five industrial data sets from one of the top three global television producers to evaluate our proposal and compare it with deep language models. Our data sets contain 5324 issues in total. We show that an ensemble classifier of shallow techniques achieves an accuracy of 0.92 for issue assignment, which is statistically comparable to the state-of-the-art deep language models. The contributions include the public sharing of five annotated industrial issue data sets, the development of a clear and comprehensive feature set, the introduction of a novel label set, and the validation of the efficacy of an ensemble classifier of shallow machine learning techniques.


Keywords: issue assignment · software management · natural language processing · machine learning · IT management

1 Introduction

Software project development refers to the process of creating a software product from start to finish, including planning, designing, coding, testing, and maintenance. It involves a team of developers, often with different specializations, working together to produce a working software product. Software project management involves overseeing the development process, ensuring that the project is completed on time, within budget, and to the expected quality standards [7]. This includes managing resources, schedules, and budgets, as well as communicating with stakeholders and ensuring that the project meets its objectives. Effective project management is necessary for successful software development.


One of the primary responsibilities of a project manager is to identify and address software issues as they arise throughout the development process [49]. These issues can include technical challenges, quality assurance problems, or unexpected delays. The project manager must work with the development team to find solutions to these issues, prioritize tasks, and make adjustments to the project plan as needed. By effectively managing software issues, the project manager can help ensure that the development process stays on track, that the software product is delivered on time and to the expected quality standards, and that the project stays within budget.


Issue Tracking Systems (ITS) are designed to help development teams track and manage software issues throughout the development process. These systems allow developers to identify, report, and prioritize software issues and assign them to team members for resolution [33]. They often include features such as bug reporting, status tracking, and reporting tools, enabling developers to manage issues effectively and ensure that they are resolved in a timely manner. Issues can be created in these tools by users with different roles, such as software developers, team leaders, testers, or even customer support teams. Bertram et al. [7] carry out a qualitative study of ITS as used by small software development teams. Based on their interviews with various stakeholders, they demonstrate that ITS play a critical role in communication and collaboration within software development teams.


Text classification is the task of assigning a label to a given text [45]. Owing to the abundance and diversity of data known as big data, it has come to be used as a tool in many studies across various fields. The main focus of this paper is to address the issue classification problem through an issue assignment approach, in which we assign the identified issues to appropriate team members or departments for further resolution. To accomplish this, we treat the problem as a text classification challenge. We leverage machine learning algorithms and natural language processing techniques to analyze and classify the text data of the issues. By applying these techniques, we are able to extract relevant information from the issue descriptions, such as the issue severity, context, and other important details. By tackling the issue classification problem through this approach, we aim to provide a more comprehensive and effective solution for issue management and resolution.
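To make the framing concrete, the following is a minimal sketch of issue assignment as text classification. It uses a bag-of-words multinomial naive Bayes classifier with add-one smoothing, chosen only because it fits in a few lines; it is not the feature set or the models evaluated in this paper. The example issues and the function names (tokenize, train, predict) are invented for illustration.

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    # Lowercase whitespace tokenization; a real pipeline would do more.
    return text.lower().split()

def train(examples):
    """Collect per-label document and word counts from (text, label) pairs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in examples:
        label_counts[label] += 1
        for tok in tokenize(text):
            word_counts[label][tok] += 1
            vocab.add(tok)
    return label_counts, word_counts, vocab

def predict(model, text):
    """Return the label with the highest smoothed log-probability."""
    label_counts, word_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(label_counts):
        score = math.log(label_counts[label] / total_docs)
        denom = sum(word_counts[label].values()) + len(vocab)
        for tok in tokenize(text):
            if tok in vocab:  # ignore words never seen in training
                score += math.log((word_counts[label][tok] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Invented toy issues labeled with three of the four roles.
examples = [
    ("fix crash when saving settings", "developer"),
    ("implement login api endpoint", "developer"),
    ("verify regression test on release build", "tester"),
    ("write test cases for checkout flow", "tester"),
    ("redesign icon for settings screen", "designer"),
    ("update color palette of home screen", "designer"),
]
model = train(examples)
print(predict(model, "write regression test cases"))
```

The classifier only sees word counts, which is the kind of baseline representation that curated linguistic features or word embeddings would replace.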


The issue assignment approach enables us to allocate the issues to the most suitable team members or departments. This helps to streamline the resolution process and ensure that the issues are addressed by the right people, thereby improving the overall efficiency and effectiveness of the support system. We decide that assigning issues to groups of employees who can perform the same activities is preferable to assigning them to individuals. An individual to whom an issue is automatically assigned may be unable to complete it within the planning period because of factors such as seasonal spikes in workload, illness, or employee turnover [25]. To manage the employees in our data set effectively, we group them by the fields they work in. This approach yields four main teams in the data set: software developer, UI/UX designer, software tester, and team leader. The software developer team accounts for the majority of the data set, making it a crucial focus of our analysis. To improve time management and issue resolution, it is important to assign the right issues to the right developers. To achieve this, we categorize the software developers using sub-labels that are generally accepted in the industry: senior, mid, and junior. This categorization captures the experience level and skill set of each developer, allowing the most appropriate tasks to be allocated to each team member. The teams may differ by project or company. For example, new teams such as business analysts and product owners can be added, or some teams can be split or removed. We expect the organization adopting the system to group its individuals into teams. After a newly opened issue is automatically classified into one of these classes, it can be randomly assigned to an individual in the relevant team, or the team members or the team leader can make the assignment manually.
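The random hand-off within a team can be sketched as follows. This is an illustration under stated assumptions, not the deployed workflow: the roster, the member identifiers, the label strings, and the dispatch function are all hypothetical, and the adopting organization would supply its own team structure.

```python
import random

# Hypothetical roster keyed by predicted label; identifiers are invented.
TEAMS = {
    "developer-senior": ["dev_1"],
    "developer-mid": ["dev_2", "dev_3"],
    "developer-junior": ["dev_4", "dev_5"],
    "designer": ["des_1", "des_2"],
    "tester": ["tst_1", "tst_2"],
    "leader": ["lead_1"],
}

def dispatch(predicted_label, teams, rng=random):
    """Pick a random member of the team that matches the predicted label."""
    members = teams.get(predicted_label)
    if not members:
        # Unknown or empty team: leave the decision to a human.
        raise KeyError(f"no team for label {predicted_label!r}")
    return rng.choice(members)

# A seeded generator keeps the choice reproducible in this example.
print(dispatch("tester", TEAMS, random.Random(0)))
```

Because the classifier predicts a role rather than a person, changes in the roster (turnover, leave, new hires) only require editing the mapping, not retraining.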


Unlike the majority of studies in the literature, we use a closed-source data set for our analysis. We obtain five projects from the company’s Jira interface. We focus exclusively on the main language of the issues. To prepare the data set for this study, we determine the label values by replacing the person assigned to each issue with the field that person works in, based on information we receive from the company.
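The labeling step described above amounts to a lookup from assignee to field. A minimal sketch follows; the record layout, the assignee identifiers, the FIELD_OF mapping, and the choice to drop issues with an unmapped assignee are assumptions made for illustration, since the actual mapping comes from the company.

```python
# Hypothetical assignee-to-field mapping; identifiers are invented.
FIELD_OF = {
    "u101": "developer",
    "u102": "tester",
    "u103": "designer",
    "u104": "leader",
}

def relabel(issues, field_of):
    """Replace each issue's assignee with the field that assignee works in."""
    labeled = []
    for issue in issues:
        field = field_of.get(issue["assignee"])
        if field is None:
            continue  # assumption: skip issues whose assignee is unmapped
        labeled.append({"text": issue["text"], "label": field})
    return labeled

# Invented records standing in for an issue tracker export.
raw_issues = [
    {"text": "Crash on channel switch", "assignee": "u101"},
    {"text": "Verify subtitle rendering", "assignee": "u102"},
    {"text": "Legacy ticket", "assignee": "u999"},
]
labeled_issues = relabel(raw_issues, FIELD_OF)
print(labeled_issues)
```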


ITS often contain a wealth of valuable data related to software issues. In our study, we set out to analyze this data using NLP methods, with the goal of creating a feature set that is simpler and more successful than the word embedding methods typically used in text classification. To create our feature set, we use a range of NLP techniques, such as part-of-speech tagging and sentiment analysis, to analyze the language used in software issues. We then compare our feature set with commonly used word embedding methods and apply a range of machine learning, ensemble learning, and deep learning techniques to our annotated data set. This allows us to evaluate the effectiveness of our approach using a range of standard metrics, including accuracy, precision, recall, and F1-score.
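For reference, the four metrics can be computed from true and predicted labels as in the sketch below. Macro averaging over classes is an assumption made here for the multi-class setting; this section does not state which averaging scheme is used. The label sequences are invented.

```python
def evaluate(y_true, y_pred):
    """Return accuracy and macro-averaged precision, recall, and F1."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)

    precisions, recalls, f1s = [], [], []
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)

    n = len(labels)
    return {
        "accuracy": accuracy,
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }

# Invented labels: one developer issue is misclassified as tester.
y_true = ["developer", "developer", "tester", "designer"]
y_pred = ["developer", "tester", "tester", "designer"]
scores = evaluate(y_true, y_pred)
print(scores)
```

In this toy case the accuracy is 0.75, while the per-class view shows that the error costs the developer class recall and the tester class precision.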


Our contributions to the state of the art in issue classification are as follows.


– Data set: We provide an issue data set, in both Turkish and English, collected from closed-source industrial projects. The data set is publicly available for further research, and to the best of our knowledge, no commercial issue data set covering both languages has previously been shared in the literature.


– Feature set: We develop an understandable feature set that is extracted from the information in the issues, which can be applied to all issue classification procedures with high accuracy and low complexity.


– Label set: We introduce novel labels for issue assignment. By incorporating these new labels, we expand the boundaries of current research and offer unique insights into the underlying themes, contributing to a more comprehensive understanding of the domain.


The remainder of this paper is structured as follows: Section 2 describes the background of this study including the structure of software issues in issue tracking systems. In Section 3, we present our experimental setup and approach, followed by our results and analysis in Section 4. In Section 5, we discuss the threats to validity and user evaluation. In Section 6, we discuss related work and similar classification endeavors with Turkish issue reports. Section 7 concludes our work and discusses future work.


This paper is available on arXiv under the CC BY-NC-ND 4.0 DEED license.