Text Mining and Analysis in the Humanities

This guide provides background information on data and text mining.

Digital Text Corpora

A number of research organizations and publishers of large digital archives are making their texts and metadata available for text mining and analysis.   The following are some examples. Contact your subject librarian for further information.

 

EEBO-TCP (Early English Books Online - Text Creation Partnership). The 25,000 texts from the first phase of EEBO-TCP have been freely available as open data in the public domain since January 2015.

 

Gale Digital Collections - Includes the 17th & 18th Century Burney Collections Newspapers, 19th Century British Library Newspapers, 19th Century UK Periodicals, the Economist Historical Archive 1843-2011, Eighteenth Century Collections Online, Sabin Americana, 1500-1926, The Times Digital Archive, 1785-1985, and others. Consult the FAQ for more information about text mining in the Gale Collections.

 

HathiTrust Data Sets -  HathiTrust makes the texts of public domain works in its corpus available for research purposes.  "HathiTrust announces the release of a significantly expanded open dataset, the HathiTrust Research Center (HTRC) Extracted Features (EF) Dataset <https://analytics.hathitrust.org/datasets>, Version 1.0. This dataset provides researchers with open access to data extracted from the full text of the HathiTrust Digital Library <https://www.hathitrust.org/> (HTDL) at an unprecedented scale."

 

Proceedings of the Old Bailey: London's Central Criminal Court, 1674-1913. The Old Bailey API allows you to work directly with the text of both the individual trials and the sessions published as part of the Proceedings. You can either use the Old Bailey API Demonstrator to build queries and export texts to Voyant Tools, or address the underlying text directly through the API.
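
Once a trial text has been retrieved, whether through the Demonstrator export or the API, a common first step is a simple word-frequency count. The following is a minimal sketch in Python using only the standard library. It does not call the Old Bailey API, whose endpoints are not described in this guide; the function name `word_frequencies` is hypothetical, and the sample string is invented for illustration and is not a quotation from the Proceedings.

```python
import re
from collections import Counter


def word_frequencies(text, top_n=5):
    """Return the top_n most common lowercase word tokens in text."""
    # Lowercase the text and keep alphabetic words, allowing an internal
    # apostrophe (e.g. "prisoner's").
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    return Counter(tokens).most_common(top_n)


# Invented sample standing in for a trial text you have already retrieved.
sample = (
    "The prisoner was indicted for stealing a watch. "
    "The watch was found on the prisoner."
)

for word, count in word_frequencies(sample, top_n=3):
    print(word, count)
```

For real research you would typically also remove common function words such as "the" and "was" before counting, or export the texts to a tool such as Voyant Tools, which handles this step for you.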