A key task in legal writing is identifying the legal authority governing a particular set of facts. Although a flurry of research, and especially of private-sector tools, claims to use machine learning to help lawyers quickly find the proper authority to cite, no benchmark dataset currently exists for training and evaluating citation recommendation tools. One reason is the difficulty of capturing legal citations, which tend to break ordinary sentence parsers.
To help fill this gap, we present a dataset based on 324,309 decisions issued by the Board of Veterans’ Appeals (BVA) between 2009 and 2017. These decisions were selected from the universe of BVA appeals because each contains a single legal issue. They form the basis for the results presented in Huang, Low, Teng, Zhang, Ho, Krass & Grabmair (2021). Any use of these data should cite Huang et al. (2021) as the source.
Our data release contains two components.
First, our release contains a pre-processed version of each opinion, in which the legal citations have been cleaned and regularized. Using the associated vocabulary dataset, researchers can match each citation ID to a unique identifier from Harvard’s Caselaw Access Project. This dataset is unique in that researchers can get right to predicting citations without the headache of identifying citation forms. And with each citation’s CaseLawAccess ID, researchers can find the full text of each source and an extensive set of metadata. The pre-processed opinions can be fed directly into the data loader class found in our code repository to generate paired examples of text and target citations.
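To make the idea of paired examples concrete, here is a minimal Python sketch. It is not the data loader from our code repository. It assumes, purely for illustration, that each regularized citation appears in the text as a placeholder token of the form `@cit<ID>@` and that the vocabulary is a plain mapping from citation ID to a CaseLawAccess identifier. The placeholder format, the function name `make_examples`, and the identifiers in the sample vocabulary are all hypothetical; consult the release files for the actual encoding.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical placeholder format: a citation appears as "@cit<ID>@".
# The actual release may encode regularized citations differently.
CITATION_TOKEN = re.compile(r"@cit(\d+)@")


def make_examples(text: str, context_words: int = 64) -> List[Tuple[str, int]]:
    """Return one (context, citation_id) pair per citation in the text.

    The context is the last `context_words` non-citation tokens that
    precede the citation; earlier citation tokens are left out of it.
    """
    examples: List[Tuple[str, int]] = []
    context: List[str] = []
    for token in text.split():
        match = CITATION_TOKEN.fullmatch(token)
        if match:
            window = " ".join(context[-context_words:])
            examples.append((window, int(match.group(1))))
        else:
            context.append(token)
    return examples


# Made-up vocabulary: citation ID -> illustrative CaseLawAccess identifier.
vocabulary: Dict[int, str] = {7: "cap-0001", 12: "cap-0002"}

opinion = (
    "The veteran seeks service connection @cit7@ "
    "The Board must assess credibility @cit12@"
)

pairs = make_examples(opinion, context_words=5)
for context, citation_id in pairs:
    # Look up the source each target citation points to.
    print(repr(context), "->", citation_id, vocabulary[citation_id])
```

Each pair couples a window of preceding text with the ID of the citation that follows it, which is the general shape of input a citation recommendation model is trained on. The window length here is arbitrary.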
Second, for researchers interested in replicating our processing pipeline, we also provide the raw text of the opinions alongside a limited set of administrative metadata (the year, topic, and author of the opinion). Our code repository provides code to reproduce the processing pipeline. As we note in Huang et al. (2021), reproducing the pipeline requires researchers to have access to the CaseLawAccess metadata.
References
Zihan Huang, Charles Low, Mengqiu Teng, Hongyi Zhang, Daniel E. Ho, Mark S. Krass, and Matthias Grabmair. 2021. Context-Aware Legal Citation Recommendation using Deep Learning. In Eighteenth International Conference for Artificial Intelligence and Law (ICAIL’21), June 21–25, 2021, São Paulo, Brazil. ACM, New York. https://doi.org/10.1145/3462757.3466066