Maxim Romanov, Matthew Thomas Miller, Sarah Bowen Savant, and Benjamin Kiessling.
Highlights from the paper:
"The OpenITI team—building on the foundational open-source OCR work of the Leipzig University’s (LU) Alexander von Humboldt Chair for Digital Humanities—has achieved Optical Character Recognition (OCR) accuracy rates for classical Arabic-script texts in the high nineties"
"The specific type of OCR software that we employed in our tests is an open-source OCR program called Kraken, which was developed by Benjamin Kiessling at Leipzig University’s Alexander von Humboldt Chair for Digital Humanities. Unlike more traditional OCR approaches, Kraken relies on a neural network—which mimics the way we learn—to recognize letters in the images of entire lines of text without trying first to segment lines into words and then words into letters."
"The most important advantage of Kraken is that its workflow allows one to train new models relatively easily, including text-specific ones. In a nutshell, the process of training requires a transcription of approximately 800 lines (the number will vary depending on the complexity of the typeface) aligned with images of these lines as they appear in the printed edition."
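To make the idea of "a transcription aligned with images of these lines" concrete, here is a minimal sketch of how such training pairs might be collected from a folder. The file layout (one `line_0001.png` image next to a `line_0001.gt.txt` transcription) and the function name `pair_lines` are assumptions made for illustration; they are not taken from the paper and should not be read as the layout Kraken requires.

```python
from pathlib import Path


def pair_lines(directory, image_suffix=".png", text_suffix=".gt.txt"):
    """Collect (image_path, transcription) pairs from a folder.

    Assumed (hypothetical) layout: each line image such as line_0001.png
    has its transcription in a sibling file named line_0001.gt.txt.
    Images without a non-empty transcription are skipped.
    """
    pairs = []
    for image in sorted(Path(directory).glob(f"*{image_suffix}")):
        stem = image.name[: -len(image_suffix)]
        text_file = image.with_name(stem + text_suffix)
        if not text_file.is_file():
            continue
        text = text_file.read_text(encoding="utf-8").strip()
        if text:
            pairs.append((image, text))
    return pairs
```

Calling `len(pair_lines("some_folder"))` then gives a quick check of whether the set is near the roughly 800 lines the authors mention.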
"The two rounds of testing presented here indicate that with a fairly modest amount of gold standard training data (~800–1,000 lines) Kraken is consistently able to produce OCR results for Arabic-script documents that achieve accuracy rates in the high nineties."
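The quotes do not say how the accuracy rates were computed. One common way to score OCR output is character accuracy: one minus the edit distance between the OCR text and the gold standard transcription, divided by the length of the gold standard. The sketch below illustrates that general measure only; it is an assumption, not the authors' evaluation procedure, and the function names are made up for this example.

```python
def levenshtein(a, b):
    """Edit distance: minimum insertions, deletions, and substitutions
    needed to turn string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        curr = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_accuracy(reference, hypothesis):
    """Character accuracy of an OCR hypothesis against a gold standard line."""
    if not reference:
        raise ValueError("reference must not be empty")
    return 1.0 - levenshtein(reference, hypothesis) / len(reference)
```

Under this measure, an accuracy "in the high nineties" corresponds to only a few character errors per hundred characters of gold standard text.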
"In the long term, we are also planning to train models for other Islamicate languages (Ottoman Turkish, Urdu, Syriac, etc.). Our hope is that an easy-to-use and effective OCR pipeline will allow us all—collectively—to significantly enrich our collection of digital Islamicate texts and thereby enable us to understand better this fascinating and understudied textual tradition."