The Results of Our Experiment: Using LLMs for Thematic Analysis

22 Feb 2024

This paper is available on arXiv under a CC 4.0 license.

Authors:

(1) Jakub DRÁPAL, Institute of State and Law of the Czech Academy of Sciences, Czechia, Institute of Criminal Law and Criminology, Leiden University, the Netherlands;

(2) Hannes WESTERMANN, Cyberjustice Laboratory, Université de Montréal, Canada;

(3) Jaromir SAVELKA, School of Computer Science, Carnegie Mellon University, USA.

Abstract & Introduction

Related Work

Dataset

Proposed Framework

Experimental Design

Results and Discussion

Conclusions, Future Work and References

6. Results and Discussion

Table 1 reports the results of the experiments focused on the prediction of initial codes. After the first round, 72.6% of the 785 predicted codes were deemed reasonable, i.e., they described both how the theft was committed and what was stolen. 13.2% of the codes appeared to lack the focus on how, and at least 14.1% did not seem to describe what was stolen.

After the expert feedback was provided (Section 5), 88.8% of the codes were perceived as reasonable (an improvement of 16.2 percentage points).

Table 2 compares example initial codes before and after the feedback, highlighting the improvements in coding the information of interest. [6] The results strongly suggest that the LLM can perform the initial coding of the data with reasonable quality (RQ1), and further improve the codes upon receiving feedback from a subject matter expert (RQ2).

Hence, it appears that the proposed framework could become a valuable tool for supporting phase 2 of thematic analysis in ELS.

The performance on the task of predicting themes (specified upfront) for the individual facts descriptions is reported in Table 3. The prediction was performed using the list of 14 manually discovered themes (see Section 3).

The overall R@1 of .66 and R@3 of .82 suggest that the proposed approach is promising, although clear limitations exist. This is largely consistent with prior related studies [21,22]. For some of the themes, e.g., theft in a shop or theft of energy, the automatic prediction worked remarkably well.
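To make the reported metrics concrete, the following is a minimal sketch of how R@1 and R@3 could be computed, assuming R@k denotes the share of facts descriptions whose expert-assigned theme appears among the model's top k predicted themes. The function names and the toy data are hypothetical and are not taken from the paper; the per-theme breakdown illustrates how an overall score can hide themes on which the prediction works well or poorly.

```python
from collections import defaultdict


def recall_at_k(gold, ranked_predictions, k):
    """Share of items whose gold theme is among the top-k predicted themes."""
    if len(gold) != len(ranked_predictions):
        raise ValueError("gold and ranked_predictions must have equal length")
    if not gold:
        return 0.0
    hits = sum(1 for g, preds in zip(gold, ranked_predictions) if g in preds[:k])
    return hits / len(gold)


def recall_at_k_by_theme(gold, ranked_predictions, k):
    """Recall at k computed separately for each gold theme."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for g, preds in zip(gold, ranked_predictions):
        totals[g] += 1
        if g in preds[:k]:
            hits[g] += 1
    return {theme: hits[theme] / totals[theme] for theme in totals}


# Toy data for illustration only (not the paper's dataset).
gold = [
    "theft in a shop",
    "theft of energy",
    "robbing of cellars",
    "theft in a shop",
]
ranked = [
    ["theft in a shop", "theft of energy", "robbing of cellars"],
    ["theft in a shop", "theft of energy", "robbing of cellars"],
    ["theft in a shop", "theft of energy", "theft from an open-access place"],
    ["theft in a shop", "robbing of cellars", "theft of energy"],
]

r_at_1 = recall_at_k(gold, ranked, k=1)
r_at_3 = recall_at_k(gold, ranked, k=3)
per_theme_r_at_1 = recall_at_k_by_theme(gold, ranked, k=1)
```

On this toy input, the overall R@1 is 0.5 and R@3 is 0.75, while the per-theme R@1 ranges from 1.0 (theft in a shop) to 0.0 (theft of energy, robbing of cellars).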

However, there were also themes, e.g., theft from an open-access place or robbing of cellars, where the performance was rather low. The promising results (RQ3) warrant investigations into the effects of providing expert feedback at this stage.

This could be done by providing additional custom instructions in the prompt, by having the experts label a limited number of data points to be used in fine-tuning the model, or both.
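The first option, adding custom instructions to the prompt, could look roughly like the sketch below. It is only an illustration under stated assumptions: the function name, the prompt wording, and the example instruction are all hypothetical and do not reproduce the prompts used in the paper.

```python
def build_theme_prompt(facts, themes, expert_instructions=None, top_k=3):
    """Assemble a theme-prediction prompt (hypothetical wording).

    expert_instructions is an optional list of free-text hints supplied by
    a subject matter expert; when given, the hints are appended as a
    separate block before the facts description.
    """
    lines = [
        "Assign the facts description below to the most fitting themes.",
        f"Return the {top_k} most likely themes, ordered from most to least likely.",
        "",
        "Themes:",
    ]
    lines += [f"{i}. {theme}" for i, theme in enumerate(themes, start=1)]
    if expert_instructions:
        lines += ["", "Additional instructions from the subject matter expert:"]
        lines += [f"- {instruction}" for instruction in expert_instructions]
    lines += ["", "Facts description:", facts.strip()]
    return "\n".join(lines)


# Hypothetical usage; the themes are a subset of those named in the text,
# and the expert instruction is invented purely for illustration.
themes = ["theft in a shop", "theft of energy", "robbing of cellars"]
baseline_prompt = build_theme_prompt("  The offender took goods ...  ", themes)
feedback_prompt = build_theme_prompt(
    "  The offender took goods ...  ",
    themes,
    expert_instructions=["Treat a locked cellar as a closed space."],
)
```

Keeping the expert hints in their own block makes it easy to compare predictions with and without the feedback on the same facts descriptions.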

Some of the potential themes were so similar that they should likely have been collapsed together. Other potential themes were overly specific (e.g., workplace theft followed by various instances of what was stolen). Interestingly, the multiplicity of offending and/or stage of completion were present in some of the potential themes, despite specific instructions during the initial codes prediction not to focus on these aspects.

Hence, an additional expert intervention in predicting potential themes might be warranted. The analysis strongly suggests that the LLM performs well in the end-to-end task of discovering and predicting themes from the raw data (RQ4).

However, subject matter expert interventions might be desirable at various stages of the processing to improve the quality of the resulting themes and their alignment with the research questions.

This echoes the cautioning sentiments expressed by De Paoli [11] and Jiang et al. [25], who reported that researchers performing qualitative analysis require full agency over the process. Moreover, the black-box nature of the proprietary LLMs is especially problematic from this point of view.

Figure 5. The graph shows the mapping between the themes discovered by the subject matter expert (left) and the themes discovered by the proposed framework (right).


[6] We admit a possible limitation of this experiment in that the author knew in which round the initial code was produced.

