Machine Learning and Legal Research
Felix B. Chang - Corporate Law Center, University of Cincinnati.
-- Law scholars are on the cusp of a sea change in the way we conduct research. Machine learning, which already abounds in industry and other academic fields, is being incorporated more fully into legal scholarship. Our community has experimented with machine learning techniques from corpus linguistics, and information technology firms are now offering machine learning services for hire. With the launch of Harvard’s Caselaw Access Project (the “Caselaw Project”), we can tap into even more datasets for algorithmic analysis.
Unveiled in October 2018, the Caselaw Project provides free and open access to all published decisions from almost all U.S. jurisdictions up to 2018. It portends disruption to the legal research market, which is currently dominated by Westlaw and Lexis. Yet for law scholars, the Caselaw Project's major contribution is its trove of big data for research projects. We can unleash machine learning upon the Caselaw Project's 6.7 million cases in innovative ways, on projects that simply could not have been undertaken by extracting cases in batches from Westlaw and Lexis.
With financial support from the Mellon Foundation, the Digital Scholarship Center at the University of Cincinnati has built a platform that can process large amounts of data and create visualizations to illuminate hidden patterns. For case law, the patterns would be how certain terms in legal decisions tend to cluster together. The platform utilizes topic modeling algorithms to present visualizations that reveal macrotrends in tens of thousands of cases. My co-researchers James Lee, Erin McCabe, Zhaowei Ren, and Josh Beckelhimer (who hail from computer science, digital humanities, English, and library sciences) and I are beginning to model how federal courts approach two issues in antitrust law: the measure of market power and the balance between antitrust and regulation.
The challenge, however, will be to convince law scholars that machine learning adds value to legal research. Existing literature has already laid the groundwork to bridge machine learning and legal research, and our community has adapted techniques from corpus linguistics, which can show word frequency and usage conventions. By contrast, we deploy topic modeling, an altogether different method that depicts the probable distribution of terms and their co-occurrences within a dataset. Topic modeling shows the statistical dispersal of topics across a dataset, where each topic comprises the words that are statistically most likely to be relevant to that topic. It has been successfully used in many settings, most prominently the construction of the Stanford Dissertation Browser.
Among other trends identified by topic modeling, we find that market power and antitrust–regulation cases have reified over time (see Figure 1). Notably, cases pertaining to bank mergers and the Interstate Commerce Commission (“ICC”) have declined. The fall of ICC cases coheres with the trend commonly known as deregulation, in which rate setting was replaced by regulatory frameworks that prize competition and open access above all else.
Figure 1: Histograms of Market Power (top) and Regulation (bottom) Cases
The above histograms affirm what law scholars have posited about the march toward deregulation, but they also complicate our understanding. For instance, if the decline of ICC and bank merger cases coincides with a rise in patent cases and class actions, that does not necessarily spell the end of regulation. Instead, it may mean that certain regulatory regimes (e.g., patent law) remain robust, and in those regimes, market power analysis and the antitrust–regulation balance are nuanced. Concomitantly, regulation may in certain settings be deferring to private antitrust suits, especially class actions. Above all, market power inquiries and the antitrust–regulation balance may unfold differently depending on the setting (or topic), such as patents, telecommunications, financial services, or health care.
There is much to be done to validate machine learning. As my co-researchers and I press ahead, we will be mindful that the heavy lifting will come as much from explaining our methodology and inferences as from refining the technical capabilities of the platform.