Portal to Texas History Newspaper OCR Text Datasets

Overview:

A week or so ago I had a faculty member at UNT ask if I could work with one of his students to get a copy of the OCR text of several titles of historic Texas newspapers that we have on The Portal to Texas History.

While we provide public access to the full text for searching and discovering newspaper pages of interest to users, we don’t have a very straightforward way to publicly obtain the full text for a given issue, let alone for full titles that may be many tens of thousands of pages in size.

By the end of the week I had pulled roughly 79,000 newspaper issues comprising over 785,000 pages of OCR text. We are making these publicly available in the UNT Data Repository under a CC0 license so that others can make use of them. Feel free to jump over to the UNT Digital Library to grab a copy.

Background:

The UNT Libraries and The Portal to Texas History have operated the Texas Digital Newspaper Program for nine years with a goal of preserving and making available as many newspapers published in Texas as we are able to collect and secure rights to. At this time we have nearly 3.5 million pages of Texas newspapers ranging from the 1830s all the way to 2015. Jump over to the TDNP collection in the Portal to take a look at all of the content there, including a list of all of the titles we have digitized.

The titles in the datasets were chosen by the student and professor, and they seem to be a fairly decent sampling of communities in the Portal that are both large and have a significant number of newspaper pages digitized.

Here is a full list of the communities, page count, issue count, and links to the dataset itself in the UNT Digital Library.

| Dataset Name | Community | County | Years Covered | Issues | Pages |
|---|---|---|---|---|---|
| Portal to Texas History Newspaper OCR Text Dataset: Abilene | Abilene | Taylor County | 1888-1923 | 7,208 | 62,871 |
| Portal to Texas History Newspaper OCR Text Dataset: Brenham | Brenham | Washington County | 1876-1923 | 10,720 | 50,368 |
| Portal to Texas History Newspaper OCR Text Dataset: Bryan | Bryan | Brazos County | 1883-1922 | 5,843 | 27,360 |
| Portal to Texas History Newspaper OCR Text Dataset: Denton | Denton | Denton County | 1892-1911 | 690 | 4,686 |
| Portal to Texas History Newspaper OCR Text Dataset: El Paso | El Paso | El Paso County | 1881-1921 | 17,104 | 177,640 |
| Portal to Texas History Newspaper OCR Text Dataset: Fort Worth | Fort Worth | Tarrant County | 1883-1896 | 4,146 | 36,199 |
| Portal to Texas History Newspaper OCR Text Dataset: Gainesville | Gainesville | Cooke County | 1888-1897 | 2,286 | 9,359 |
| Portal to Texas History Newspaper OCR Text Dataset: Galveston | Galveston | Galveston County | 1849-1897 | 8,136 | 56,953 |
| Portal to Texas History Newspaper OCR Text Dataset: Houston | Houston | Harris County | 1893-1924 | 9,855 | 184,900 |
| Portal to Texas History Newspaper OCR Text Dataset: McKinney | McKinney | Collin County | 1880-1936 | 1,568 | 12,975 |
| Portal to Texas History Newspaper OCR Text Dataset: San Antonio | San Antonio | Bexar County | 1874-1920 | 6,866 | 130,726 |
| Portal to Texas History Newspaper OCR Text Dataset: Temple | Temple | Bell County | 1907-1922 | 4,627 | 44,633 |

Dataset Layout:

Each of the datasets is a gzipped tar file that contains a multi-level directory structure. In addition, there is a README.txt created for each of the datasets, for example the Denton README.txt.
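For anyone who would rather unpack a dataset from a script than from the command line, here is a minimal sketch using Python's standard tarfile module. The function name extract_dataset and the archive file name in the usage note are my own placeholders, not the names used in the UNT Digital Library.

```python
import tarfile
from pathlib import Path


def extract_dataset(archive_path, dest_dir):
    """Unpack a gzipped tar dataset into dest_dir.

    Returns the sorted top-level names found in the archive
    (for the Denton dataset this would be expected to include "Denton").
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    top_level = set()
    with tarfile.open(archive_path, "r:gz") as archive:
        for member in archive.getmembers():
            parts = Path(member.name).parts
            if parts:
                top_level.add(parts[0])
        # Only extract archives you trust; newer Python versions also
        # accept a filter argument here to restrict what gets written.
        archive.extractall(dest)
    return sorted(top_level)
```

Something like extract_dataset("Denton.tar.gz", "datasets") would then leave a Denton folder under datasets, assuming the downloaded file is named that way.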

Each of the datasets is organized by title. Here is the structure for the Denton dataset.

Denton
└── data
    ├── Denton_County_News
    ├── Denton_County_Record_and_Chronicle
    ├── Denton_Evening_News
    ├── Legal_Tender
    ├── Record_and_Chronicle
    ├── The_Denton_County_Record
    └── The_Denton_Monitor

Within each of the title folders are subfolders for each year that we have a newspaper issue for.

Denton/data/Denton_County_Record_and_Chronicle/
├── 1898
├── 1899
├── 1900
└── 1901

Finally, each of the year folders contains a folder for each issue present in The Portal to Texas History on the day the dataset was extracted.

Denton
└── data
    ├── Denton_County_News
    │   ├── 1892
    │   │   ├── 18920601_metapth502981
    │   │   ├── 18920608_metapth502577
    │   │   ├── 18920615_metapth504880
    │   │   ├── 18920622_metapth504949
    │   │   ├── 18920629_metapth505077
    │   │   ├── 18920706_metapth501799
    │   │   ├── 18920713_metapth502501
    │   │   ├── 18920720_metapth502854

Each of these issue folders is named with the date of publication in yyyymmdd format, followed by an underscore and the ARK identifier from the Portal.
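Because the title, year, and issue levels are so regular, the folder names can be turned into structured records with very little code. The sketch below is one way to do that. The names Issue, parse_issue_folder, and iter_issues are mine, and it assumes every issue folder name starts with a complete, valid yyyymmdd date (a folder with an incomplete date would raise a ValueError).

```python
from datetime import date, datetime
from pathlib import Path
from typing import Iterator, NamedTuple


class Issue(NamedTuple):
    title: str        # title folder name, e.g. "Denton_County_News"
    published: date   # parsed from the yyyymmdd prefix
    identifier: str   # e.g. "metapth502981"
    path: Path        # location of the issue folder on disk


def parse_issue_folder(name):
    """Split 'yyyymmdd_identifier' into (date, identifier)."""
    stamp, _, identifier = name.partition("_")
    if not identifier:
        raise ValueError(f"unexpected issue folder name: {name!r}")
    return datetime.strptime(stamp, "%Y%m%d").date(), identifier


def iter_issues(dataset_dir) -> Iterator[Issue]:
    """Yield one Issue per data/<title>/<year>/<issue> folder."""
    data_dir = Path(dataset_dir) / "data"
    for issue_dir in sorted(data_dir.glob("*/*/*")):
        if issue_dir.is_dir():
            published, identifier = parse_issue_folder(issue_dir.name)
            title = issue_dir.parent.parent.name
            yield Issue(title, published, identifier, issue_dir)
```

Calling list(iter_issues("Denton")) on an extracted dataset gives a list that is easy to filter by title or date range before any text is read.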

Each of these folders is a valid BagIt bag that can be verified with tools like bagit.py. Here is the structure for an issue.

18921229_metapth505423
├── bag-info.txt
├── bagit.txt
├── data
│   ├── metadata
│   │   ├── ark
│   │   ├── metapth505423.untl.xml
│   │   └── portal_ark
│   └── text
│       ├── 0001.txt
│       ├── 0002.txt
│       ├── 0003.txt
│       └── 0004.txt
├── manifest-md5.txt
└── tagmanifest-md5.txt
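The manifest-md5.txt file is what makes a quick integrity check possible. As a rough sketch, the function below recomputes the MD5 of each payload file listed in the manifest using only the Python standard library. It is not a full BagIt validation (it ignores bagit.txt, the tag manifest, and any files not listed), so a tool like bagit.py is still the better choice for real verification. The function name verify_md5_manifest is my own.

```python
import hashlib
from pathlib import Path


def verify_md5_manifest(bag_dir):
    """Return the manifest paths that are missing or fail their MD5 check.

    An empty list means every file listed in manifest-md5.txt matched.
    """
    bag = Path(bag_dir)
    problems = []
    manifest = (bag / "manifest-md5.txt").read_text(encoding="utf-8")
    for line in manifest.splitlines():
        if not line.strip():
            continue
        # Each manifest line is "<checksum> <path relative to the bag root>".
        expected, rel_path = line.split(None, 1)
        rel_path = rel_path.strip().lstrip("*")
        target = bag / rel_path
        if not target.is_file():
            problems.append(rel_path)
            continue
        actual = hashlib.md5(target.read_bytes()).hexdigest()
        if actual != expected.lower():
            problems.append(rel_path)
    return problems
```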

The OCR text is located in the text folder, with one file per page, and three metadata files are present in the metadata folder: a file called ark that contains the ARK identifier for this item, a file called portal_ark that contains the URL to this issue in The Portal to Texas History, and a metadata file in the UNTL metadata format.
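Putting that layout to work, here is a small sketch that loads a single issue into a dictionary. The function name load_issue and the dictionary keys are mine, and I am assuming the text files can be read as UTF-8 (undecodable bytes are replaced rather than raising an error). The UNTL XML file is left alone here since parsing it depends on the details of that format.

```python
from pathlib import Path


def load_issue(issue_dir):
    """Read the identifiers and page text for one issue folder."""
    issue = Path(issue_dir)
    metadata_dir = issue / "data" / "metadata"
    text_dir = issue / "data" / "text"
    # Page files are zero-padded (0001.txt, 0002.txt, ...), so a plain
    # sort on the file name keeps the pages in reading order.
    pages = [
        page.read_text(encoding="utf-8", errors="replace")
        for page in sorted(text_dir.glob("*.txt"))
    ]
    return {
        "ark": (metadata_dir / "ark").read_text(encoding="utf-8").strip(),
        "portal_url": (metadata_dir / "portal_ark").read_text(encoding="utf-8").strip(),
        "pages": pages,
    }
```

The page count of an issue is then just len(load_issue(path)["pages"]).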

I hope that these datasets are useful to folks interested in trying their hand at working with a large collection of OCR text from newspapers. I should remind everyone that this is uncorrected OCR text and will most likely need a fair bit of pre-processing because it is far from perfect.
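To give a feel for what that pre-processing might look like, here is a very light cleanup sketch. It is only a starting point and not something I ran over these datasets. It rejoins words split by a hyphen at the end of a line, drops control characters, and tidies whitespace. Note that the hyphen step will also join genuine hyphenated compounds that happen to break across lines, and the function name clean_ocr_text is my own.

```python
import re


def clean_ocr_text(text):
    """Apply a few conservative cleanup steps to one page of OCR text."""
    # Rejoin words hyphenated across a line break: "news-\npaper" -> "newspaper".
    text = re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", text)
    # Drop non-printable characters (form feeds and other control codes),
    # keeping newlines, tabs, and spaces.
    text = "".join(ch for ch in text if ch.isprintable() or ch in "\n\t ")
    # Collapse runs of spaces and tabs into a single space.
    text = re.sub(r"[ \t]+", " ", text)
    # Collapse runs of blank lines into a single blank line.
    text = re.sub(r"\n\s*\n+", "\n\n", text)
    return text.strip()
```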

If you have questions or comments about this post, please let me know via Twitter.