
Text Mining and Analysis in the Humanities

This guide provides background information on data and text mining.

Digital Text Corpora

A number of research organizations and publishers of large digital archives are making their texts and metadata available for text mining and analysis.   The following are some examples. Contact your subject librarian for further information.

EEBO-TCP  Early English Books Online - Text Creation Partnership.  25,000 texts from the first phase of EEBO-TCP were made freely available as open data in the public domain. The Evans-TCP texts are also available to everyone.  The full 35,000 texts are available only to EEBO-TCP partners.  For further information, contact EEBO-TCP directly.

Gale Digital Collections includes the 17th & 18th Century Burney Collections Newspapers, 19th Century British Library Newspapers, 19th Century UK Periodicals, the Economist Historical Archive 1843-2011, Eighteenth Century Collections Online, Sabin Americana, 1500-1926, The Times Digital Archive, 1785-1985, and others. Consult the FAQ for more information about text mining in the Gale Collections.

Gale Digital Scholar Lab provides tools for analyzing the Gale primary sources licensed at UMD, in support of digital humanities scholarship. Using the Digital Scholar Lab requires creating a free user account at first login.

HathiTrust Data Sets -  HathiTrust makes the texts of public domain works in its corpus available for research purposes.  "HathiTrust announces the release of a significantly expanded open dataset, the HathiTrust Research Center (HTRC) Extracted Features (EF) Dataset <https://analytics.hathitrust.org/datasets>, Version 1.0. This dataset provides researchers with open access to data extracted from the full text of the HathiTrust Digital Library <https://www.hathitrust.org/> (HTDL) at an unprecedented scale."

Proceedings of the Old Bailey: London's Central Criminal Court, 1674-1913.  The Old Bailey API allows you to work directly with the text of both the individual trials and sessions published as part of the Proceedings. You can either use the Old Bailey API Demonstrator to build queries and export texts to Voyant Tools, or address the underlying text directly through the API.

JSTOR Data For Research provides datasets of content on JSTOR for use in research and teaching. Researchers may use DfR to define and submit their desired dataset to be automatically processed. Data available through the service includes metadata, n-grams, and word counts for most articles and book chapters, and for all research reports and pamphlets on JSTOR.
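
For readers unfamiliar with the terms, the following is a minimal Python sketch of what word counts and n-grams are. It is not JSTOR's own processing: the sample sentence, the simple tokenizer, and the function names tokenize and ngram_counts are all assumptions made for illustration, and a real dataset may tokenize text differently.

```python
from collections import Counter
import re


def tokenize(text):
    """Lowercase the text and split it into simple word tokens.

    Illustrative tokenizer only; real datasets may use other rules.
    """
    return re.findall(r"[a-z0-9']+", text.lower())


def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens (n=1 gives word counts)."""
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(gram) for gram in grams)


# Hypothetical sample text, not drawn from any of the corpora listed here.
sample = "The court heard the case. The court adjourned."
tokens = tokenize(sample)

word_counts = ngram_counts(tokens, 1)
# Note: this naive approach lets bigrams span sentence boundaries.
bigram_counts = ngram_counts(tokens, 2)

print(word_counts.most_common(3))
print(bigram_counts.most_common(3))
```

Counts of this kind allow frequency-based analysis without access to the running full text, which is one reason providers can distribute them for in-copyright material.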