The COVID-19 pandemic led to a flurry of executive orders (EOs) and policy directives at the state and local levels as the United States struggled to adapt policy to mitigate the public health crisis. It also resulted in an explosion of research on both the coronavirus itself and the pandemic’s societal impact. Unfortunately, many of those policies were written, signed, scanned, and uploaded in portable document format (PDF) to government websites, so their contents were digitized as images rather than as machine-readable text. To meet current research demands, we have used optical character recognition (OCR) tools to extract machine-readable text from these documents and have made these “plain text” versions available online. Here we describe the pre- and postprocessing steps and evaluate the quality of the resulting documents. We suggest unsupervised methods for scoring output texts that can be applied to other OCR tasks when ground truth plain texts are unavailable. We show that simple preprocessing modestly improves OCR performance on scanned orders and directives.
Digitizing COVID-19 Policy Documents and Measuring Plain Text Fidelity
August 6, 2020