From the abstract for Matthew Sag, The New Legal Landscape for Text Mining and Machine Learning, Journal of the Copyright Society of the USA, Vol. 66 (2019):

Individually and collectively, copyrighted works have the potential to generate information that goes far beyond what their individual authors expressed or intended. Various methods of computational and statistical analysis of text — usually referred to as text data mining (“TDM”) or just text mining — can unlock that information. However, because almost every use of TDM involves making copies of the text to be mined, the legality of that copying has become a fraught issue in copyright law in the United States and around the world. One of the most fundamental questions for copyright law in the Internet age is whether the protection of the author’s original expression should stand as an obstacle to the generation of insights about that expression. How this question is answered will have a profound influence on the future of research across the sciences and the humanities, and on the development of the next generation of information technology: machine learning and artificial intelligence.

This Article consolidates a theory of copyright law that I have advanced in a series of articles and amicus briefs over the past decade. It explains why applying copyright’s fundamental principles in the context of new technologies necessarily implies that copying expressive works for non-expressive purposes should not be counted as infringement and must be recognized as fair use. The Article shows how that theory was adopted and applied in the recent high-profile test cases, Authors Guild v. HathiTrust and Authors Guild v. Google, and takes stock of the legal context for TDM research in the United States in the aftermath of those decisions.

The Article makes important contributions to copyright theory, but it also integrates that theory with a practical assessment of various interrelated legal issues that text mining researchers and their supporting institutions must confront if they are to realize the full potential of these technologies. These issues range from the enforceability of website terms of service to the effect of laws prohibiting computer hacking and the circumvention of technological protection measures (i.e., encryption and other digital locks) to cross-border copyright issues.
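Sag’s “non-expressive use” distinction is easy to see in code. Here is a minimal sketch of a TDM step in Python, under the assumption of a hypothetical directory of plain-text files: the texts are copied into memory only to compute aggregate word counts, and the output contains statistics rather than any protected expression.

```python
from collections import Counter
from pathlib import Path
import re

def corpus_word_frequencies(corpus_dir: str) -> Counter:
    """Count word frequencies across a corpus of plain-text files.

    The texts are copied into memory only to compute aggregate
    statistics; no expressive content is retained in the output.
    """
    counts = Counter()
    for path in Path(corpus_dir).glob("*.txt"):
        text = path.read_text(encoding="utf-8", errors="ignore")
        counts.update(re.findall(r"[a-z']+", text.lower()))
    return counts

# Hypothetical usage: report the 20 most common words in a corpus.
if __name__ == "__main__":
    for word, n in corpus_word_frequencies("./corpus").most_common(20):
        print(f"{word}\t{n}")
```

Nothing expressive survives in the output, which is the intuition behind treating the intermediate copying as fair use.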

From the abstract for Rebecca Giblin, et al., Available – But not Accessible? Investigating Publisher e-lending Licensing Practices, Forthcoming, Information Research (expected June 2019):

Introduction: We report our mixed-methods investigation of publishers’ licensing practices, which affect the books public libraries can offer for e-lending.

Method: We created unique datasets recording pricing, availability and licence terms for sampled titles offered by e-book aggregators to public libraries across Australia, New Zealand, Canada, the United States and the United Kingdom. A third dataset records dates of availability for recent bestsellers. We conducted follow-up interviews with representatives of five e-book aggregators.

Analysis: We quantitatively analysed availability, licence terms and price across all aggregators in Australia, snapshotting the competitive playing field in a single jurisdiction. We also compared availability and terms for the same titles from one aggregator across five jurisdictions, and measured how long it took for a sample of recent bestsellers to become available for e-lending. We used data from the aggregator interviews to explain the quantitative findings.

Results: Contrary to aggregator expectations, we found considerable intra-jurisdictional price and licence differences. We also found numerous differences across jurisdictions.

Conclusions: While availability was better than anticipated, licensing practices make it infeasible for libraries to purchase certain kinds of e-book (particularly older titles). Confidentiality requirements make it difficult for libraries to shop (and aggregators to compete) on price and terms.

H/T beSpacific.

Margaret Hagan, Jameson Dempsey & Jorge Gabriel Jiménez propose to build a “Legal Data Commons” to harness available data from legal aid organizations, courts, legal technology companies, and others to enable research and development that promotes access to justice. “We believe that a legal data commons — built with privacy and accountability ‘by design’ — could solve the data issue and advance research and innovation objectives while addressing legitimate confidentiality concerns.” Here’s the first part in their forthcoming three-part series of articles on their proposal.

From the abstract for Stefan H. Krieger & Katrina Fischer Kuh, Accessing Law: An Empirical Study Exploring the Influence of Legal Research Medium (Vanderbilt Journal of Entertainment & Technology Law, Vol. 16, No. 4, 2014):

The legal profession is presently engaged in an uncontrolled experiment. Attorneys now locate and access legal authorities primarily through electronic means. Although this shift to an electronic research medium radically changes how attorneys discover and encounter law, little empirical work investigates the impact of that shift.

This Article presents the results of one of the most robust empirical studies conducted to date comparing research processes using print and electronic sources. While the study presented in this Article was modest in scope, the extent and type of the differences that it reveals are notable. Some of the observed differences between print and electronic research processes confirm predictions offered, but never before confirmed, about how the research medium changes the research process. This Article strongly supports calls for the legal profession and legal academy to be more attentive to the implications of the shift to electronic research.

On Politico, Seamus Hughes, deputy director of George Washington University’s Program on Extremism, calls out PACER: “I’m here to tell you that PACER—Public Access to Court Electronic Records—is a judicially approved scam. The very name is misleading: Limiting the public’s access by charging hefty fees, it has been a scam since it was launched and, barring significant structural changes, will be a scam forever.” Read The Federal Courts Are Running An Online Scam (Mar. 20, 2019) here.

H/T to beSpacific for calling attention to Gov404: The Sunlight Foundation Web Integrity Project’s Censorship Tracker. Gov404 aggregates and verifies examples of the most significant cases of online information censorship on the federal Web since November 2016. The cases come from reporting by the Web Integrity Project team, the news media, and other accountability organizations.

H/T to Bob Ambrogi for reporting that Fastcase will be adding ABA publications:

Steve Errick, chief operating officer at Fastcase, told me that he is working with the ABA to add publications from different sections one at a time, with family law, health, trial, IP, and criminal law among the first sections in the pipeline. He did not specify the titles to be added but said the arrangement would average 30-60 titles per section.

Subscribers will have access to these titles from directly within the Fastcase 7 platform, but they will be required to purchase the titles to which they want access. Individual titles will be sold at the ABA’s retail price, while firms that purchase multiple or enterprise subscriptions will be eligible for discounts based on number of titles purchased and number of firm users.

Even though individual titles will cost the same as buying them directly from the ABA, subscribers get two benefits by purchasing them through Fastcase, Errick said. One is ease of access to the titles directly from the platform and the other is the addition within the books of links to cases and regulations.

From the press release:

“The vision for Fastcase is to make it easy for users to connect the legal research workflow dots, from primary law and public records, dockets, expert witness, legal analytics, and legal news,” Errick said. “The collection includes law review articles from HeinOnline, alerts, digests and blogs from LexBlog, and now our fast-growing collection of more than 1,000 market-leading expert treatises. To see it all come together and be able to showcase these fantastic books represents the culmination of 20 years of effort, and we feel like we’re really just getting started,” he added.

From the abstract for Ronen Avraham, Database of State Tort Law Reforms (6.1):

This manuscript of the Database of State Tort Law Reforms (6th) (DSTLR) updates the DSTLR (5th) and contains the most detailed, complete, and comprehensive legal dataset of the most prevalent tort reforms in the United States between 1980 and 2018. The DSTLR has been downloaded more than 2,700 times and has become the standard tool in empirical research on tort reform. The dataset records state laws in all fifty states and the District of Columbia over the last several decades. For each reform we record the effective date, a short description of the reform, whether or not the jury is allowed to know about the reform, whether the reform was upheld or struck down by the state’s courts, and whether it was amended by the state legislature.

Scholarship studying the empirical effects of tort reforms relies on various datasets (tort reform datasets and other legal compilations). Some of these datasets are created and published independently, and some are created ad hoc by the researchers. Their usefulness frequently suffers from various defects: they are often incompatible, do not accurately record judicial invalidation of laws, and frequently lack reforms adopted before 1986, amendments adopted after 1986, court-based reforms, and effective dates of legislation. Some of the persisting variation across empirical studies of the effects of tort reforms may be due to the variations in the legal datasets those studies use.

This dataset builds upon and improves existing data sources through a careful review of original legislation and case law to determine exact text and effective dates. This draft corrects errors found in the previous draft, focuses only on the most prevalent reforms, and standardizes the descriptions of the reforms. A link to an Excel file which codes ten reforms found in DSTLR (6th) can be found here.
It is hoped that creating one “canonized” dataset will increase our understanding of tort reform’s impacts on our lives.
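For researchers who download the Excel file, here is a minimal sketch of loading it with pandas and building a state-year panel keyed to effective dates. The file name, column labels, and reform label below are hypothetical placeholders; substitute the ones the actual DSTLR (6th) coding sheet uses.

```python
import pandas as pd

# Hypothetical file and column names; substitute the labels used in
# the actual DSTLR (6th) Excel coding sheet.
reforms = pd.read_excel("dstlr_6th.xlsx")

# Build a state-year panel for one (hypothetical) reform label, e.g.
# caps on noneconomic damages, keyed to each law's effective date.
caps = reforms[reforms["reform"] == "noneconomic_damages_cap"]
panel = (
    caps.assign(effective_year=pd.to_datetime(caps["effective_date"]).dt.year)
        .loc[:, ["state", "effective_year"]]
        .drop_duplicates()
        .sort_values(["state", "effective_year"])
)
print(panel.head())
```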

State and local bar partners, consumer bankruptcy customers, and AmLaw 250 subscribers have been asking Fastcase for risk solutions that include public records data, according to Fastcase President Phil Rosenthal and Fastcase COO Steve Errick. To satisfy the request, Fastcase has partnered with TransUnion. In a nutshell, Fastcase users who sign up with TransUnion to access its TLOxp platform can use TransUnion information to perform due diligence, conduct litigation support, locate witnesses, track ownership of assets, verify identities, and conduct other investigations.

The integration will be available in Fastcase 7 when it comes out of beta sometime later this month or early in April. Both Bob Ambrogi and Jean O’Grady have the details. Recommended.

On Wednesday, Mark Zuckerberg, the CEO of Facebook, described a sweeping new vision for his platform. “The future of communication,” he wrote, “will increasingly shift to private, encrypted services where people can be confident what they say to each other stays secure.” From the 3,200-word blog post:

“This privacy-focused platform will be built around several principles:

Private interactions. People should have simple, intimate places where they have clear control over who can communicate with them and confidence that no one else can access what they share.

Encryption. People’s private communications should be secure. End-to-end encryption prevents anyone — including us — from seeing what people share on our services.

Reducing Permanence. People should be comfortable being themselves, and should not have to worry about what they share coming back to hurt them later. So we won’t keep messages or stories around for longer than necessary to deliver the service or longer than people want them.

Safety. People should expect that we will do everything we can to keep them safe on our services within the limits of what’s possible in an encrypted service.

Interoperability. People should be able to use any of our apps to reach their friends, and they should be able to communicate across networks easily and securely.

Secure data storage. People should expect that we won’t store sensitive data in countries with weak records on human rights like privacy and freedom of expression in order to protect data from being improperly accessed.

Over the next few years, we plan to rebuild more of our services around these ideas.”

The post raised all kinds of questions about Facebook’s business model and strategies, as well as the trade-offs the company could face. And so after the post went live, Zuckerberg spoke with WIRED about his vision. Here’s the interview.

Bridget J. Crawford’s Information for Submitting to Online Law Review Companions (Feb. 2019) “contains information about submitting essays, commentaries, reviews, responses, and other writings to online companions to the main law reviews and journals at selected law schools. The document includes word-count limitations, subject matter specifications, preferred submission methods and other information of possible interest to authors. It covers 20 online companions to main law reviews.”

Wikidata is a free and open knowledge base that can be read and edited by both humans and machines. Wikidata acts as central storage for the structured data of its Wikimedia sister projects including Wikipedia, Wikivoyage, Wikisource, and others. Wikidata also provides support to many other sites and services beyond just Wikimedia projects. The content of Wikidata is available under a free license, exported using standard formats, and can be interlinked to other open data sets on the linked data web. For an introduction to Wikidata, visit here.
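Because Wikidata is machine-readable, its public SPARQL endpoint at https://query.wikidata.org/sparql can be queried directly from code. Here is a minimal sketch in Python using the stock demo query, which lists items that are instances of (wdt:P31) house cat (wd:Q146), purely to show the mechanics; swap in your own query.

```python
import requests

ENDPOINT = "https://query.wikidata.org/sparql"

# The standard Wikidata demo query: items that are instances of
# (wdt:P31) house cat (wd:Q146), with English labels.
QUERY = """
SELECT ?item ?itemLabel WHERE {
  ?item wdt:P31 wd:Q146 .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 10
"""

resp = requests.get(
    ENDPOINT,
    params={"query": QUERY, "format": "json"},
    headers={"User-Agent": "wikidata-sketch/0.1 (example)"},
    timeout=30,
)
resp.raise_for_status()
for row in resp.json()["results"]["bindings"]:
    print(row["itemLabel"]["value"], row["item"]["value"])
```

The same endpoint can also return results in other standard formats such as XML and CSV, which is part of what makes Wikidata easy to interlink with other open datasets.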

Back in 2017, Venture Beat reported that LexisNexis was testing chatbots for legal search. Bob Ambrogi now reports that implementation of a chatbot for Lexis Advance is coming sooner rather than later, although no launch date has been announced.

The chatbot’s goal, LexisNexis said, is to give users the option to take a more conversational approach to search, rather than the “typing keywords into a search bar” approach. A Lexis Advance chatbot could have two key uses. The first is guiding researchers unfamiliar with a topic to the sources people typically consult for that topic. The second is resurfacing prior research: the bot can point out that a user did similar research three months ago and offer to show it again. LexisNexis also claims the bot will get better at predicting a user’s intent as the user interacts with the system.

Wait ‘n see.

The ‘Future Book’ Is Here, but It’s Not What We Expected from Wired notes that, despite the seeming certainty of predictions that digital technology would by now have revolutionized books by incorporating all kinds of interactive features, so far it hasn’t happened. Digital books haven’t changed much at all since their introduction more than 15 years ago. And they still haven’t supplanted the demand for traditional books, which remains strong for the same reason the design hasn’t changed since Gutenberg’s invention: the book blends form and function so perfectly that it nearly defies improvement. According to Wired, however, what has changed is the publishing industry itself and the ease with which an author can get her work into print.

Google Dataset Search is a search engine that helps researchers locate online data that is freely available for use. Google launched the service in beta on September 5, 2018, targeting scientists and data journalists. Institutions that publish their data online, like universities and governments, will need to include metadata tags in their webpages that describe their data, including who created it, when it was published, how it was collected, and so on. This information will then be indexed by Dataset Search and combined with input from Google’s Knowledge Graph.

The initial release of Dataset Search will cover the environmental and social sciences, government data, and datasets from news organizations like ProPublica. However, if the service becomes popular, the amount of data it indexes should quickly snowball as institutions and scientists scramble to make their information accessible. Check out Dataset Search here.
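Google documents the markup side of this through schema.org’s Dataset vocabulary, typically embedded in a page as JSON-LD. Here is a minimal Python sketch that generates such a tag; the dataset it describes is entirely hypothetical.

```python
import json

# A hypothetical dataset description using schema.org Dataset properties.
dataset_metadata = {
    "@context": "https://schema.org/",
    "@type": "Dataset",
    "name": "Example State Court Filings, 2010-2018",
    "description": "Counts of civil filings by state and year (hypothetical).",
    "creator": {"@type": "Organization", "name": "Example University"},
    "datePublished": "2019-01-15",
    "license": "https://creativecommons.org/publicdomain/zero/1.0/",
}

# Emit the <script> tag a publisher would embed in the page's HTML.
print('<script type="application/ld+json">')
print(json.dumps(dataset_metadata, indent=2))
print("</script>")
```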

From Lisa DeLuca, Where Do FOIA Responses Live? Electronic Reading Rooms and Web Sources, C&RL News, Vol. 80, No. 1 (2019):

The “Electronic Freedom of Information Act Amendments of 1996” required agencies to make eligible records available electronically. As a result, there are dozens of FOIA Libraries and Electronic Reading Rooms that are repositories for responses to agency FOIA requests. These documents are also known as responsive documents. Agencies often post documents with redactions to protect personal privacy, national security, and other FOIA exemptions and exclusions. It is important for researchers, journalists, and citizens to use the terms “FOIA Libraries” and “Electronic Reading Rooms” as part of their search terminology. This will ensure they can find documents that might not be findable through a regular Google search.

There is no shortage of literature analyzing the challenges and administrative components of FOIA, including response wait times, complaints about excessive redactions, and lawsuits over access to government files. The purpose of this article is to describe where FOIA responses can be located. Searchable government FOIA information varies by agency. This column includes descriptions of several agency Electronic Reading Rooms, government sources (including Presidential Libraries), and the National Archives and Records Administration (NARA), as well as nongovernment sources, such as FOIA Mapper and MuckRock. The sources listed in this column are excellent starting points to locate current and historical FOIA content.

H/T Gary Price’s INFOdocket post.

From the press release:

The Government Publishing Office (GPO) makes available a subset of enrolled bills, public and private laws, and the Statutes at Large in Beta United States Legislative Markup (USLM) XML, a format that makes documents easier to download and repurpose.

The documents available in the Beta USLM XML format include enrolled bills and public laws beginning with the 113th Congress (2013) and the Statutes at Large beginning with the 108th Congress (2003). They are available on govinfo, GPO’s one-stop site for authentic, published Government information: www.govinfo.gov/bulkdata

H/T Gary Price, INFOdocket.
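As a rough sketch of what “easier to download and repurpose” means in practice, the following Python snippet pulls section numbers and headings out of a USLM XML file using the standard library. The file name is a placeholder (download a sample from the bulk data repository first), and the namespace URI and element names are the ones USLM documents commonly use; verify both against the file you download.

```python
import xml.etree.ElementTree as ET

# USLM documents commonly declare this namespace; confirm it against
# the actual file downloaded from www.govinfo.gov/bulkdata.
NS = {"uslm": "http://xml.house.gov/schemas/uslm/1.0"}

# Placeholder file name: download a sample enrolled bill or public
# law in Beta USLM XML from the bulk data repository first.
tree = ET.parse("PLAW-sample.xml")
root = tree.getroot()

# Print each section's number and heading, where present.
for sec in root.iter(f"{{{NS['uslm']}}}section"):
    num = sec.find("uslm:num", NS)
    heading = sec.find("uslm:heading", NS)
    parts = [
        "".join(el.itertext()).strip()
        for el in (num, heading)
        if el is not None
    ]
    if parts:
        print(" ".join(parts))
```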

govinfo is a redesign of the FDsys public website, with a focus on implementing feedback from users and improving overall search and access to electronic Federal Government information. The redesigned, mobile-friendly website incorporates innovative technologies and includes several new features for an overall enhanced user experience. GPO’s Federal Digital System (FDsys) website will be retired and replaced with govinfo on Dec. 14, 2018. Here are answers to frequently asked questions about the transition.