Jonathan H. Choi

How to Use Large Language Models for Empirical Legal Research

Section: Conference Article 1
Volume 180 (2024) / Issue 2, pp. 214-233 (20 pages)
Published 16.07.2024
DOI 10.1628/jite-2024-0006
Summary
Legal scholars have long annotated cases by hand to summarize and learn about developments in jurisprudence. Dramatic recent improvements in the performance of large language models (LLMs) now provide a potential alternative. This article demonstrates how to use LLMs to analyze legal documents. It evaluates best practices and suggests both the uses and potential limitations of LLMs in empirical legal research. In a simple classification task involving Supreme Court opinions, it finds that GPT-4 performs approximately as well as human coders and significantly better than a variety of prior-generation natural language processing (NLP) classifiers, with no improvement from supervised training, fine-tuning, or specialized prompting.
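The classification workflow the summary describes can be sketched in code. The prompt wording, label set, and helper names below are illustrative assumptions, not the article's actual protocol; the model call itself is shown only as a hedged comment, since it requires API access.

```python
# Hypothetical sketch of a zero-shot LLM classification task over
# Supreme Court opinions. Labels and prompt text are assumptions for
# illustration, not the article's actual coding scheme.

LABELS = ["affirmed", "reversed"]

def build_prompt(opinion_text: str) -> str:
    """Construct a zero-shot classification prompt for one opinion."""
    return (
        "You are annotating Supreme Court opinions.\n"
        f"Classify the following opinion as one of {LABELS}.\n"
        "Answer with a single word.\n\n"
        f"Opinion:\n{opinion_text}"
    )

def parse_label(model_reply: str) -> str:
    """Map a free-text model reply onto one of the allowed labels."""
    reply = model_reply.strip().lower()
    for label in LABELS:
        if label in reply:
            return label
    return "unknown"

# The model call itself would go through an LLM API, for example
# (requires an API key; shown as a comment only):
#
# from openai import OpenAI
# client = OpenAI()
# resp = client.chat.completions.create(
#     model="gpt-4",
#     messages=[{"role": "user", "content": build_prompt(text)}],
# )
# label = parse_label(resp.choices[0].message.content)
```

Because the prompt builder and label parser are plain functions, the expensive LLM call can be swapped out or mocked when validating the pipeline against hand-coded annotations.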