When Does Pretraining Help? Assessing Self-Supervised Learning for Law and the CaseHOLD Dataset of 53,000+ Legal Holdings

By Lucia Zheng, Neel Guha, Brandon R. Anderson, Peter Henderson, Daniel E. Ho

One of the most recent significant advances in natural language processing (NLP) has been the advent of self-supervised language models like Google’s BERT or OpenAI’s GPT-3. Such models are “pretrained” on a large corpus of general texts — Google Books and Wikipedia articles — resulting in significant gains on a wide range of tasks with much smaller datasets. Researchers have noted that these gains are amplified in specialized domains. In a recent study, a version of BERT, SciBERT, pretrained on scientific papers outperformed both existing approaches and a general-language BERT pretrained on Google Books and Wikipedia on a suite of scientific NLP tasks.

This has piqued the interest of the legal community. Advances in applying NLP models to legal corpora (“Legal NLP”) could alleviate longstanding disparities in access to justice. Puzzlingly, however, existing work has noted that pretraining on legal corpora often fails to yield statistically significant improvements. On existing legal benchmarks, law-specific models may appear no better than general language models.

Our recent paper attempts to explain this phenomenon. In short, we hypothesize that existing legal benchmarks are simply too easy or have relatively low legal domain specificity, and thus fail to capture the benefits of domain-specific pretraining. Legal text typically requires specialized legal knowledge to annotate, so constructing a large structured dataset for the legal domain can be costly. As a result, manual annotation efforts have often been dedicated to tasks on which methods pre-dating the rise of self-supervised learning already perform well, so that returns can be guaranteed to offset expenses. To address this gap in large, publicly available legal datasets for U.S. law and to illustrate the conditions under which domain-specific pretraining can help, we developed CaseHOLD, a new dataset comprising more than 53,000 multiple choice questions that ask for the relevant holding of a cited case. We find that a law-pretrained model (“Legal-BERT”) outperforms a general-corpus BERT model on CaseHOLD, while failing to achieve similar improvements on other benchmarks. Finally, we offer a simple domain specificity metric that correlates with the expected gains of Legal-BERT over BERT. Our findings inform when researchers should engage in resource-intensive pretraining and show that Transformer-based architectures, too, learn representations suggestive of distinct legal language.

The CaseHOLD dataset

Citations to cases in legal writing often contain a holding statement, which states the conclusion of the cited case as relevant to the citing case (see Figure 1). CaseHOLD is a multiple choice question answering task derived from legal citations in judicial rulings. The citing context from the judicial decision serves as the prompt for the question. The answer choices are holding statements derived from citations following text in a legal decision. There are five answer choices for each citing text. The correct answer is the holding statement that corresponds to the citing text. The four incorrect answers are other holding statements (see Figure 2). CaseHOLD consists of ~53,000 questions, mined from American case law. We leverage the standardized system of case citation to extract all legal citations and holding statements, often indicated by the keyword “holding” and provided in parenthetical propositions accompanying U.S. legal citations, to match citing contexts to holdings through an automated, rule-based process.
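As a rough illustration of what a rule-based extraction step of this kind might look like, the sketch below pulls out parentheticals that begin with the keyword “holding” and treats the text before them as the citing context. This is not the pipeline used to build CaseHOLD: the regular expression, the function name `extract_holdings`, and the example sentence (including its invented citation) are all assumptions made for illustration.

```python
import re

# Hypothetical pattern: a parenthetical that starts with "holding" and
# contains no nested parentheses. Real citation formats are more varied.
HOLDING_PAREN = re.compile(r"\((holding\s+[^()]+)\)", re.IGNORECASE)


def extract_holdings(text):
    """Return (citing_context, holding_statement) pairs found in `text`."""
    pairs = []
    for match in HOLDING_PAREN.finditer(text):
        # Everything before the parenthetical is kept as the citing context.
        context = text[:match.start()].strip()
        holding = match.group(1).strip()
        pairs.append((context, holding))
    return pairs


# Invented example text; the case name and reporter citation are placeholders.
example = (
    "The court must view the evidence in the light most favorable to the "
    "nonmoving party. See Doe v. Roe, 000 F.3d 000 (1st Cir. 2000) "
    "(holding that summary judgment was improper where facts were disputed)."
)
pairs = extract_holdings(example)
for context, holding in pairs:
    print("CONTEXT:", context)
    print("HOLDING:", holding)
```

Note that the court-and-year parenthetical is skipped because it does not start with the keyword, which is the property the keyword rule relies on.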

Figure 1: An excerpt from a case. The holding statement for the citation *People v Burnham* is underlined.

Figure 2: We use the text around the holding statement as the prompt. The original holding statement is regarded as the “correct” answer. We add four “incorrect” answers by extracting holding statements from other cases associated with other citations. One such incorrect holding statement is provided above.
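The question format described above can be sketched as a small data structure: one prompt, five answer choices, and the index of the correct holding. The code below is a minimal sketch under stated assumptions. The function name `build_question`, the field names, and the use of uniform random sampling to pick the four incorrect holdings are all illustrative choices, not a description of how the released dataset selects its incorrect answers.

```python
import random


def build_question(prompt, correct_holding, other_holdings, n_choices=5, seed=0):
    """Assemble one multiple choice item from a citing context and holdings."""
    rng = random.Random(seed)
    # Drop duplicates and any copy of the correct holding from the pool.
    pool = [h for h in dict.fromkeys(other_holdings) if h != correct_holding]
    if len(pool) < n_choices - 1:
        raise ValueError("not enough distinct holdings to fill the choices")
    # Assumption: incorrect answers are sampled uniformly at random.
    choices = rng.sample(pool, n_choices - 1) + [correct_holding]
    rng.shuffle(choices)
    return {
        "prompt": prompt,
        "choices": choices,
        "label": choices.index(correct_holding),
    }


# Placeholder strings standing in for real citing contexts and holdings.
question = build_question(
    prompt="citing context with the holding removed",
    correct_holding="holding A",
    other_holdings=["holding B", "holding C", "holding D", "holding E", "holding F"],
)
print(question["label"], question["choices"])
```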

In essence, CaseHOLD is similar to the challenge of identifying the holding of a case. Holdings are, of course, central to the common law system. They represent the governing legal rule when the law is applied to a particular set of facts. The holding is precedential and what litigants can rely on in subsequent cases. So central is the identification of holdings that it forms a canonical task for first-year law students to identify, state, and reformulate the holding.

Comparing Legal-BERT and BERT

We create a pretraining dataset consisting of 3,446,187 legal decisions across all federal and state courts post-1965 (approximately 37GB of text). In addition to CaseHOLD, we evaluate on two additional tasks:

Overruling: a collection of 2,400 sentences from legal cases, annotated by attorneys at Casetext as to whether the sentence nullifies a previous case decision as a precedent.
Terms of Service (ToS): a collection of 9,414 clauses from consumer contracts, each labelled as to whether it constitutes an “unfair” term.

Dataset statistics. See below for more details on “DS”.

We compare performance on these tasks between five models:

Baseline: a simple BiLSTM model with 300-dimensional word2vec embeddings
BERT: a BERT model, trained for 1 million steps on Google Books and Wikipedia
BERT (double): a BERT model, trained for 2 million steps on Google Books and Wikipedia, controlling on pretraining steps for comparability to Legal-BERT variants
Legal-BERT: a BERT model, trained for 1 million steps on Google Books and Wikipedia, and then trained for an additional 1 million steps on our case law corpus
Custom Legal-BERT: a BERT model with law specific vocabulary, trained from scratch for 2 million steps on our case law corpus

The table above contains a comparison between all models. For Overruling, the easiest task, we find that general-language BERT models achieve high performance and the Legal-BERT variants only marginally outperform them. For ToS, the intermediate-difficulty task, we find that BERT (double), which continues pretraining BERT on the general-domain corpus, increases performance over base BERT by a 5.1% difference in macro F1, but the Legal-BERT variants with domain-specific pretraining do not substantially outperform BERT (double). This is likely because Terms of Service is constructed from consumer contract text and thus has relatively low legal domain specificity compared to the other tasks, which are constructed from case law, so pretraining on legal domain-specific text does not help the model learn information that is highly relevant to the task. For CaseHOLD, however, the hardest and most domain-specific task, legal pretraining and custom legal vocabulary offer statistically significant gains.
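For readers less familiar with the metric, macro F1 is the unweighted mean of the per-class F1 scores, so a rare class (such as “unfair” clauses) counts as much as a common one. The sketch below computes it from scratch on made-up labels. The function name and the toy data are illustrative, and nothing here reproduces the evaluation code behind the reported numbers.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        # F1 = 2TP / (2TP + FP + FN); the denominator is nonzero because
        # every label appears at least once in y_true or y_pred.
        f1_scores.append(2 * tp / (2 * tp + fp + fn))
    return sum(f1_scores) / len(f1_scores)


# Toy data: class 1 is the rare class and one of its two examples is missed.
y_true = [1, 1, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0]
score = macro_f1(y_true, y_pred)
print(round(score, 4))
```

On this toy data the per-class F1 scores are 8/9 for class 0 and 2/3 for class 1, so the macro average is 7/9 even though accuracy is 5/6.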

Can we anticipate when domain specific pretraining helps?

Pretraining large language models like BERT is expensive and time-consuming. It’s thus important that practitioners know beforehand whether training such a model is worth the investment. In short, is there some way by which we can predict whether a particular legal task is specialized enough to warrant domain-specific pretraining?

In addition to looking at the difficulty of a task (as measured by the baseline performance), we can attempt to measure the domain specificity (“DS”) of each task. For a given task, we define DS as the average difference in pretrain loss between Legal-BERT and BERT, taken over that task's text. Intuitively, when the difference is large, the general corpus does not predict legal language very well. The table below contains the DS scores for the different tasks.
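The DS definition above can be sketched in a few lines, assuming the pretrain loss of each model has already been computed for every example in a task. The sign convention (BERT loss minus Legal-BERT loss, so that larger values mean more domain-specific text), the function name `domain_specificity`, and the loss values are assumptions for illustration. The numbers are placeholders, not results.

```python
def domain_specificity(bert_losses, legal_bert_losses):
    """Average per-example gap between BERT and Legal-BERT pretrain loss.

    Assumed convention: positive values mean the general-domain model
    predicts the task's text worse than the legal-domain model does.
    """
    if not bert_losses or len(bert_losses) != len(legal_bert_losses):
        raise ValueError("need two non-empty loss lists of equal length")
    gaps = [b - l for b, l in zip(bert_losses, legal_bert_losses)]
    return sum(gaps) / len(gaps)


# Placeholder per-example losses for one hypothetical task.
bert_losses = [2.0, 3.0, 2.5]
legal_bert_losses = [1.5, 2.0, 2.5]
ds = domain_specificity(bert_losses, legal_bert_losses)
print(ds)
```

Because the score only needs forward passes through two already-pretrained models, it is far cheaper to compute than a new pretraining run, which is what makes it usable as a screening heuristic.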

We find that CaseHOLD, the hardest and most domain-specific task (highest DS), is also the one where we see the largest improvements from legal pretraining.

DS offers a heuristic for task legal domain specificity and may be readily extended to estimate domain specificity of tasks in other domains with existing pretrained models (e.g., scientific domain specificity using SciBERT pretrain loss). The DS scores of the three legal tasks we examine outline an increasing relationship between the degree of legal domain specificity of a task and the extent to which prior legal knowledge learned by a language model can improve performance. Additionally, our Overruling results suggest that there is an interplay between the difficulty of a task and its domain specificity, where easy tasks that achieve high performance on base BERT, even those with intermediate DS, may not warrant domain-specific pretraining.

Moreover, we see that across the board, Custom Legal-BERT achieves the highest performance, but the relative gain above other models varies depending on the characteristics of the task. Our Custom Legal-BERT results on CaseHOLD suggest that for other difficult and high-DS legal tasks, experimentation with custom, task-relevant approaches, such as leveraging corpora from a task-specific subdomain of law for pretraining or constructing a custom vocabulary, may yield substantial gains. Recent work highlights the significant environmental and financial cost of training large language models and, in particular, of transferring an existing model to a new task or developing new models. These workflows can multiply training costs by thousands of times, since they require retraining to experiment with different model architectures and hyperparameters. DS provides a quick metric for future practitioners to evaluate when resource-intensive experimentation on custom or new models is warranted for other legal tasks.

Future Directions?

These results suggest important future research directions. First, we hope that the new CaseHOLD dataset will spark interest in the challenging domain of legal decisions. Not only are many available benchmark datasets small or not publicly accessible, but they may also be biased toward solvable tasks. After all, a company would not invest in the Overruling task (baseline F1 with BiLSTM of 0.91) without assurance that there are significant gains from paying attorneys to label the data. Our results show that domain pretraining may enable a much wider range of legal tasks to be solved.

Try it out

Our code and models are available on GitHub. For more details, see the full paper. We’re also excited to be presenting this work at ICAIL 2021!

Reference

Lucia Zheng, Neel Guha, Brandon R. Anderson, Peter Henderson, and Daniel E. Ho. 2021. When Does Pretraining Help? Assessing Self-Supervised Learning for Law and the CaseHOLD Dataset. In Proceedings of the 18th International Conference on Artificial Intelligence and Law (ICAIL ’21), June 21-25, 2021, São Paulo, Brazil. ACM Inc., New York, NY, (in press). arXiv: 2104.08671 [cs.CL].