Overview:

A week or so ago I had a faculty member at UNT ask if I could work with one of his students to get a copy of the OCR text of several titles of historic Texas newspapers that we have on The Portal to Texas History.

While we provide public access to the full-text for searching and discovering newspapers pages of interest to users, we don’t have a very straightforward way to publicly obtain the full-text for a given issue let along full titles that may be many tens of thousands of pages in size.

At the end of the week I had pulled roughly 79,000 issues of newspapers comprised of over 785,000 pages of OCR text. We are making these publicly available in the UNT Data Repository under a CC0 License so that others might be able to make use of them. Feel free to jump over to the UNT Digital Library to grab a copy.

Background:

The UNT Libraries and The Portal to Texas History have operated the Texas Digital Newspaper Program for nine years with a goal of preserving and making available as many newspapers published in Texas as we are able to collect and secure rights to. At this time we have nearly 3.5 million pages of Texas newspapers ranging from the 1830’s all the way to 2015. Jump over to the TDNP collection in the Portal to take a look at all of the content there including a list of all of the titles we have digitized.

The titles in the datasets were chosen by the student and professor and seem to be a fairly decent sampling of communities that we have in the Portal that are both large in size and have a significant number of pages of newspapers digitized.

Here is a full list of the communities, page count, issue count, and links to the dataset itself in the UNT Digital Library.

Dataset Name	Community	County	Years Covered	Issues	Pages
Portal to Texas History Newspaper OCR Text Dataset: Abilene	Abilene	Taylor County	1888-1923	7,208	62,871
Portal to Texas History Newspaper OCR Text Dataset: Brenham	Brenham	Washington County	1876-1923	10,720	50,368
Portal to Texas History Newspaper OCR Text Dataset: Bryan	Bryan	Brazos County	1883-1922	5,843	27,360
Portal to Texas History Newspaper OCR Text Dataset: Denton	Denton	Denton County	1892-1911	690	4,686
Portal to Texas History Newspaper OCR Text Dataset: El Paso	El Paso	El Paso County	1881-1921	17,104	177,640
Portal to Texas History Newspaper OCR Text Dataset: Fort Worth	Fort Worth	Tarrant County	1883-1896	4,146	36,199
Portal to Texas History Newspaper OCR Text Dataset: Gainesville	Gainesville	Cooke County	1888-1897	2,286	9,359
Portal to Texas History Newspaper OCR Text Dataset: Galveston	Galveston	Galveston County	1849-1897	8,136	56,953
Portal to Texas History Newspaper OCR Text Dataset: Houston	Houston	Harris County	1893-1924	9,855	184,900
Portal to Texas History Newspaper OCR Text Dataset: McKinney	McKinney	Collin County	1880-1936	1,568	12,975
Portal to Texas History Newspaper OCR Text Dataset: San Antonio	San Antonio	Bexar County	1874-1920	6,866	130,726
Portal to Texas History Newspaper OCR Text Dataset: Temple	Temple	Bell County	1907-1922	4,627	44,633

Dataset Layout

Each of the datasets is a gzipped tar file that contains a multi-level directory structure. In addition there is a README.txt created for each of the datasets. Here is an example of the Denton README.txt

Each of the datasets is organized by title. Here is the structure for the Denton dataset.

Denton
└── data
    ├── Denton_County_News
    ├── Denton_County_Record_and_Chronicle
    ├── Denton_Evening_News
    ├── Legal_Tender
    ├── Record_and_Chronicle
    ├── The_Denton_County_Record
    └── The_Denton_Monitor

Within each of the title folders are subfolders for each year that we have a newspaper issue for.

Denton/data/Denton_County_Record_and_Chronicle/
├── 1898
├── 1899
├── 1900
└── 1901

Finally within each of the year folders contain folders for each issue present in The Portal to Texas History on the day the dataset was extracted.

Denton
└── data
    ├── Denton_County_News
    │   ├── 1892
    │   │   ├── 18920601_metapth502981
    │   │   ├── 18920608_metapth502577
    │   │   ├── 18920615_metapth504880
    │   │   ├── 18920622_metapth504949
    │   │   ├── 18920629_metapth505077
    │   │   ├── 18920706_metapth501799
    │   │   ├── 18920713_metapth502501
    │   │   ├── 18920720_metapth502854

Each of these issue folders has the date of publication in the yyyymmdd format and the ARK identifier from the Portal for the folder name.

Each of these folders is a valid BagIt bag that can be verified with tools like bagit.py. Here is the structure for an issue.

18921229_metapth505423
├── bag-info.txt
├── bagit.txt
├── data
│   ├── metadata
│   │   ├── ark
│   │   ├── metapth505423.untl.xml
│   │   └── portal_ark
│   └── text
│       ├── 0001.txt
│       ├── 0002.txt
│       ├── 0003.txt
│       └── 0004.txt
├── manifest-md5.txt
└── tagmanifest-md5.txt

The OCR text is located in the text folder and three metadata files are present in the metadata folder. A file called ark that contains the ark identifier for this item. There is a file called portal_ark that contains the URL to this issue in The Portal to Texas History, and finally a metadata file in the UNTL metadata format.

I hope that these datasets are useful to folks interested in trying their hand at working with a large collection of OCR text from newspapers. I should remind everyone that this is uncorrected OCR text and will most likely need a fair bit of pre-processing because it is far from perfect.

If you have questions or comments about this post, please let me know via Twitter.

mark e. phillips journal

Portal to Texas History Newspaper OCR Text Datasets

Overview:

Background:

Dataset Layout