Important New Developments in Arabographic Optical Character Recognition (OCR) by
Maxim Romanov, Matthew Thomas Miller, Sarah Bowen Savant, and Benjamin Kiessling.
Highlights from the paper:
"The OpenITI team—building on the foundational open-source OCR work of the Leipzig University’s (LU) Alexander von Humboldt Chair for Digital Humanities—has achieved Optical Character Recognition (OCR) accuracy rates for classical Arabic-script texts in the high nineties"
"The specific type of OCR software that we employed in our tests is an
open-source OCR program called Kraken, which was developed by Benjamin
Kiessling at Leipzig University’s Alexander von Humboldt Chair for Digital
Humanities. Unlike more traditional OCR approaches, Kraken relies on a neural
network—which mimics the way we learn—to recognize letters in the images of
entire lines of text without trying first to segment lines into words and then words
into letters."
"The most important advantage of Kraken is that its workflow allows one to train new
models relatively easily, including text-specific ones. In a nutshell, the process of
training requires a transcription of approximately 800 lines (the number will vary
depending on the complexity of the typeface) aligned with images of these lines as
they appear in the printed edition."
"The two rounds of testing presented here indicate that with a fairly modest amount
of gold standard training data (~800–1,000 lines) Kraken is consistently able to
produce OCR results for Arabic-script documents that achieve accuracy rates in the
high nineties."
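The excerpts above do not say how the accuracy rate is calculated. One common convention for OCR evaluation is character accuracy: one minus the edit distance between the OCR output and the gold standard transcription, divided by the number of characters in the gold standard. The following is a minimal sketch under that assumption. The function names `edit_distance` and `character_accuracy` are hypothetical, are not part of Kraken, and may not match the authors' own evaluation procedure.

```python
from typing import Sequence


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete a character of ref
                            curr[j - 1] + 1,      # insert a character of hyp
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def character_accuracy(gold_lines: Sequence[str], ocr_lines: Sequence[str]) -> float:
    """Assumed metric: 1 - (total edit distance / total gold characters)."""
    if len(gold_lines) != len(ocr_lines):
        raise ValueError("gold and OCR line counts differ")
    total_chars = sum(len(g) for g in gold_lines)
    if total_chars == 0:
        raise ValueError("gold standard is empty")
    errors = sum(edit_distance(g, o) for g, o in zip(gold_lines, ocr_lines))
    return 1.0 - errors / total_chars


if __name__ == "__main__":
    # Toy example: one dropped letter in a four-letter Arabic word.
    gold = ["كتاب"]
    ocr = ["كتب"]
    print(f"{character_accuracy(gold, ocr):.2%}")
```

On this definition, an accuracy rate in the high nineties means at most a few character errors per hundred characters of gold standard text.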
"In the long term, we will are also planning to train models for other Islamicate languages (Ottoman Turkish,Urdu, Syriac, etc.). Our hope is that an easy-to-use and effective OCR pipeline will allow us all—collectively—to significantly enrich our collection of digital Islamicate texts and thereby enable us to understand better this fascinating and understudied textual tradition."