Quantitative approaches to historical texts: should you care about OCR? - Talk by Dr. Simon Hengchen, University of Gothenburg
Event details
Date | 18.11.2020 |
Hour | 12:15 - 13:15 |
Speaker | Dr. Simon Hengchen |
Location | Online |
Category | Conferences - Seminars |
Abstract:
Quantitative methods for historical text analysis offer exciting opportunities for researchers interested in gaining new insights into long-studied texts. However, the methodological underpinnings of these methods remain under-explored. In the first part of the talk, I will use a case study to show and discuss the effect the OCR process has on a range of quantitative text analyses.
In the second part of the talk, I will present a novel and fully unsupervised OCR post-correction method applied to the same dataset, as well as its most recent evolution, applied to a highly inflected language.
References:
Hämäläinen, M. and Hengchen, S., 2019. From the Paft to the Fiiture: a Fully Automatic NMT and Word Embeddings Method for OCR Post-Correction. In Recent Advances in Natural Language Processing (pp. 432-437). INCOMA.
Hill, M.J. and Hengchen, S., 2019. Quantifying the impact of dirty OCR on historical text analysis: Eighteenth Century Collections Online as a case study. Digital Scholarship in the Humanities, 34(4), pp.825-843.
Bio:
Simon Hengchen is a researcher in NLP at the University of Gothenburg, where he works within the Language Change project. His main research focus is lexical semantic change in multilingual, unstructured, OCRed, historical textual data, but he is also interested in NLP for DH. Simon is also a part-time lecturer in DH at the University of Geneva.
DH Research Seminar
The DH Research Seminar is a series of talks organised by the Digital Humanities Institute. The talks are given by researchers from a wide range of backgrounds and aim to present the vast array of subjects covered by Digital Humanities.
Due to public health restrictions, the DH Research Seminar will be held exclusively online during the Fall 2020 semester.
Be sure to join, listen to the talk and participate in the Q&A session at the end of the presentation.
Practical information
- General public
- Free