
The Data We Acquired From Using LLMs to Support Thematic Analysis


This paper is available on arXiv under a CC 4.0 license.

Authors:

(1) Jakub DRÁPAL, Institute of State and Law of the Czech Academy of Sciences, Czechia, Institute of Criminal Law and Criminology, Leiden University, the Netherlands;

(2) Hannes WESTERMANN, Cyberjustice Laboratory, Université de Montréal, Canada;

(3) Jaromir SAVELKA, School of Computer Science, Carnegie Mellon University, USA.

Abstract & Introduction

Related Work

Dataset

Proposed Framework

Experimental Design

Results and Discussion

Conclusions, Future Work and References

3. Dataset

In our experiments, we used a dataset of 785 facts descriptions from cases decided by Czech courts in 2017. From the Prosecution Service, we received 834 cases in which an adult defendant was found guilty of theft. In Czechia, theft also includes burglary and pickpocketing.[2] We slightly over-represented the most serious offenses to ensure a sufficient number of such cases in the dataset.


We removed 49 of the 834 cases, either because they had been used in a pilot study or because they contained errors, leaving 785. From each remaining case, we extracted the text describing the facts. Each extracted text was anonymized and, where necessary, shortened or partially rewritten.


The resulting text snippets range from 73 to 29,695 characters in length (first quartile 447, median 782, third quartile 1,462 characters). Figure 2 shows an example (automated translation).
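The length summary above can be reproduced with a few lines of Python. The following is a minimal sketch, not the authors' code: the function name `length_summary` and the toy snippets are hypothetical, and the "inclusive" quantile method is an assumption, since the paper does not state how the quartiles were computed.

```python
from statistics import quantiles


def length_summary(snippets):
    """Return min, quartiles, and max of snippet lengths in characters."""
    lengths = sorted(len(s) for s in snippets)
    # Assumption: "inclusive" quantiles; the paper does not name a method.
    q1, median, q3 = quantiles(lengths, n=4, method="inclusive")
    return {
        "min": lengths[0],
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": lengths[-1],
    }


# Hypothetical toy snippets, not drawn from the dataset.
toy = ["a" * 10, "b" * 20, "c" * 30, "d" * 40, "e" * 50]
summary = length_summary(toy)
print(summary)
```

Different quantile conventions can shift the quartiles slightly on small samples, so the exact figures reported in the paper may depend on the method used.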


Figure 2. The categories from the theft types dataset (shown at the top) and their distribution (right). An example of case facts description from the theft from a car category is shown on the left.


A group of three law students under the supervision of one of the authors of this paper manually conducted an unstructured variant of thematic analysis.[3] The group arrived at 14 high-level themes focused on modus operandi and target of committed thefts (Figure 2).


For each facts description, two students independently chose a single theme according to specified rules.


Disagreements were resolved by one of the students after a careful re-reading of the case. The distribution of the themes over the 785 facts descriptions included in the dataset is presented in Figure 2. Theft in a shop (29.0%) and breaking into another object (17.5%) are the most prevalent themes.
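The double-annotation procedure described above (two independent labels per case, one adjudicated label for each disagreement, then a distribution of themes) can be sketched as follows. This is an illustrative sketch only: the function names, the case IDs, and the toy labels are hypothetical, and only the theme names are taken from the text.

```python
from collections import Counter


def find_disagreements(labels_a, labels_b):
    """Return IDs of cases where the two annotators chose different themes."""
    return sorted(cid for cid in labels_a if labels_a[cid] != labels_b[cid])


def resolve(labels_a, labels_b, adjudicated):
    """Keep agreed labels; use the adjudicated label where annotators differ."""
    final = {}
    for cid, theme in labels_a.items():
        final[cid] = theme if theme == labels_b[cid] else adjudicated[cid]
    return final


def distribution(final):
    """Share of each theme in percent, rounded to one decimal place."""
    counts = Counter(final.values())
    total = len(final)
    return {theme: round(100 * n / total, 1) for theme, n in counts.items()}


# Hypothetical toy annotations for four cases (not real data).
labels_a = {1: "theft in a shop", 2: "theft from a car",
            3: "theft in a shop", 4: "breaking into another object"}
labels_b = {1: "theft in a shop", 2: "breaking into another object",
            3: "theft in a shop", 4: "breaking into another object"}

disputed = find_disagreements(labels_a, labels_b)
# One student re-reads each disputed case and picks the final theme.
adjudicated = {2: "theft from a car"}
final = resolve(labels_a, labels_b, adjudicated)
shares = distribution(final)
print(disputed, shares)
```

Keeping the disputed IDs separate from the final labels makes it possible to report how often the two annotators disagreed, although the text does not give that figure.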


[2] ICCS codes 0501 and 0502 except for 0502212 [9].


[3] We did not rigorously adhere to the process described in [5].

