This paper is available on arxiv under CC 4.0 license.
Authors:
(1) Jakub DRÁPAL, Institute of State and Law of the Czech Academy of Sciences, Czechia, Institute of Criminal Law and Criminology, Leiden University, the Netherlands;
(2) Hannes WESTERMANN, Cyberjustice Laboratory, Université de Montréal, Canada;
(3) Jaromir SAVELKA, School of Computer Science, Carnegie Mellon University, USA.
Table of Links
Conclusions, Future Work and References
4. Proposed Framework
Model The framework relies on OpenAI’s GPT-4 model’s capabilities to perform complex NLP tasks in zero-shot settings [24]. We set the temperature of the model to 0.0, which corresponds to no randomness. Higher temperature leads to more creative, but potentially less factual, output.
We set max_tokens to various values depending on the expected size of the output (a token roughly corresponds to a word). GPT-4 has an overall token length limit of 8,192 tokens, comprising both the prompt and the completion.
We set top_p to 1, as is recommended when temperature is set to 0.0. We set frequency_penalty and presence_penalty to 0, which ensures no penalty is applied to repetitions and to tokens appearing multiple times in the output.
Resources We utilize the definition of thematic analysis and the individual phases from [5]. For example, the 15-point checklist of criteria for good thematic analysis have been adopted verbatim as well as selected excerpts defining the analysis, its flow and the phases. Together with specifications of the expected outputs from various stages of the processing pipeline, these can be considered as general resources, i.e., invariant to the performed thematic analysis.
In addition, the framework requires context-specific resources, i.e., different for each analysis. These include research questions, the parameters specifying the type of analysis (e.g., semantic/latent patterns, focus on a specific topic), the specification of what counts as a theme, and various sets of custom requirements.
Processing Flow The proposed framework is depicted in Figure 3. The analyzed dataset is automatically segmented into batches whereas many data points as can be fitted into the model’s prompt are batched together, using the tiktoken Python library.[4] The data points are processed from the shortest to the longest.
If a single data point exceeds the size of the prompt it is truncated to fit the limit by taking its starting and ending sequence (both half the limit) and placing the “[...]” token in between them.
Each batch is inserted into the user message, alongside the research questions, other context-specific information about the analysis as well as any custom requirements. We also include a random sample of the initial codes predicted in the batches preceding the current one. The system message consists of the general resources.
Using the openai Python library,[5] the system and the user messages are then submitted to the LLM that generates a JSON file with the predicted initial codes. This is repeated until the whole dataset is labeled.
The subject matter expert may review the predicted initial codes. Then, they may provide further custom instructions to the system as to what aspects of the data to focus on and which to disregard. These instructions are appended to the custom requirements. This process can be repeated until the predicted initial codes match the expectations.
The predicted initial codes are collated into potential themes. This stage of the processing is similar to the preceding one with the notable difference that the system operates on the batches of initial codes instead of the raw data points. The most common 20 potential themes predicted in the batches preceding the current one are included in the user message.
As a result, each data point gets associated with a candidate theme. The candidate themes, which could be many, are then further collated into a compact set of high-level themes.
The whole set of candidate themes is provided in the user message and submitted to the LLM. While this may depend on the specific analysis there is most likely no need to supply these in batches as all of them are likely to fit in the prompt. The output of this stage are the high-level themes (with candidate themes as sub-themes).
The final stage of the pipeline is focused on labeling the data points with the discovered themes. Note that this step differs from the prediction of the initial codes or potential themes because, here, the LLM is used to predict the labels from the provided list of themes (i.e., to perform deductive coding).
The result of the whole process is the original data points being associated with (semi-)automatically discovered themes. This artifact can be utilized by the subject matter expert as a starting point for the subsequent phase of the thematic analysis (reviewing themes).
[4] Github: Tiktoken. Available at: https://github.com/openai/tiktoken [Accessed: 2023-04-30]
[5] GitHub: OpenAI Python Library. Available at: https://github.com/openai/openai-python [Accessed 2023-08-16]
This paper is available on arxiv under CC 4.0 license.6