Computer scientists and humanities scholars have been trying for more than a decade to produce software that can accurately read Arabic text and convert it into a digital format, a task that has so far eluded them. Artificial intelligence is changing that, opening up the possibility of making archives of newspapers, magazines, and books available to everyone on the Internet.
“For a long time, correct and dependable Arabic optical character recognition has remained sort of a mirage for academics (particularly within the humanities) and librarians. Advances within the field in recent years have nevertheless gradually been transforming it into a reality,” writes Dominique Akhoun-Schwarb, curator of rare books and manuscripts at SOAS, University of London, in an email.
Arabic text is harder for computers to read than text in the Latin alphabet. Arabic and other languages written in its script, such as Persian, Ottoman Turkish, and Urdu, use a continuous, connected script; consonant letters take a range of shapes depending on their position in a word, and the marks above and below letters are essential to a word’s meaning but can be hard to see.
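The point about position-dependent shapes can be illustrated with a minimal Python sketch. It assumes nothing about any particular recognition software: it only uses the standard `unicodedata` module and the Unicode "Arabic Presentation Forms-B" code points for one letter (BEH), plus a short word carrying vowel marks. The helper names `describe_forms` and `strip_marks` are hypothetical, chosen for this example.

```python
import unicodedata

# Four positional shapes of the single letter BEH (U+0628), as encoded in
# the Unicode "Arabic Presentation Forms-B" block: isolated, final,
# initial, medial. A reader of printed text sees four different glyphs.
BEH_FORMS = ["\uFE8F", "\uFE90", "\uFE91", "\uFE92"]


def describe_forms(forms):
    """Return (Unicode name, underlying base letter) for each shape."""
    return [
        (unicodedata.name(ch), unicodedata.normalize("NFKC", ch))
        for ch in forms
    ]


def strip_marks(text):
    """Drop nonspacing marks (category Mn), e.g. short-vowel signs."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")


for name, base in describe_forms(BEH_FORMS):
    # Every shape normalizes to the same base letter.
    print(name, "->", unicodedata.name(base))

# The letters KAF, TEH, BEH, each followed by the vowel mark FATHA.
# Removing the marks leaves only the bare consonants, which is the
# information lost when small marks go undetected.
vocalized = "\u0643\u064E\u062A\u064E\u0628\u064E"
print(len(vocalized), "code points ->", len(strip_marks(vocalized)))
```

All four shapes map back to one letter, so software has to recognize several distinct glyphs per letter, and the vowel marks are separate small characters that are easy to miss.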
Despite these challenges, Akram Khater, director of the Khayrallah Center for Lebanese Diaspora Studies at North Carolina State University in the US, says it’s an endeavor worth pursuing.
Being able to digitize Arabic printed text accurately, Khater says, “will open up thousands of pages of data which are presently inaccessible. It’ll facilitate analysis not only by scholars, but by the general public, and that’s why we want it.”