Quantitative approaches to historical texts: should you care about OCR? - Talk by Dr. Simon Hengchen, University of Helsinki
Quantitative methods for historical text analysis offer exciting opportunities for researchers interested in gaining new insights into long-studied texts. However, the methodological underpinnings of these methods remain underexplored. In the first part of the talk, I will use a case study to show and discuss the effect the OCR process has on a range of quantitative text analyses.
In the second part of the talk, I will present a novel, fully unsupervised OCR post-correction method applied to the same dataset.
Hämäläinen, M. and Hengchen, S., 2019. From the Paft to the Fiiture: a Fully Automatic NMT and Word Embeddings Method for OCR Post-Correction. In Recent Advances in Natural Language Processing (pp. 432-437). INCOMA.
Hill, M.J. and Hengchen, S., 2019. Quantifying the impact of dirty OCR on historical text analysis: Eighteenth Century Collections Online as a case study. Digital Scholarship in the Humanities, 34(4), pp.825-843.
Simon Hengchen is a postdoctoral researcher at the University of Helsinki, where he works within the COMHIS group. His main research focus is lexical semantic change in multilingual, unstructured, OCRed, historical textual data, but he is also interested in NLP for DH. Simon is also a part-time lecturer in DH at the University of Geneva.
DH Research Seminar
The DH Research Seminar is a series of talks organised by the Digital Humanities Institute. The talks are given by researchers from a wide range of backgrounds and aim to present the vast array of subjects covered by Digital Humanities.
Be sure to come. Listen to the talk, take part in the Q&A session if you wish, and continue discussing the subject with the speaker and the other participants in a relaxed atmosphere during the apéro that will follow the talk.