Centre for Digital Humanities (CDH)

About

AKU-ISMC's Centre for Digital Humanities (CDH) advances research and teaching based on digital methods in the Humanities. It is unique in its focus on the history of the Arabographic written tradition and aims to expand this focus to include Persian, Ottoman Turkish, Urdu and other languages. The Centre addresses questions at the leading edge of the Humanities and Computer Science, including Optical Character Recognition (OCR), Handwritten Text Recognition (HTR) and Natural Language Processing (NLP).

Project Accomplishments

The CDH has received funding from the European Research Council under the Horizon 2020 Research and Innovation programme, the British Academy, the Qatar National Library, and the Andrew W. Mellon Foundation. Notable accomplishments include:

Creation of an Arabic corpus containing more than 2 billion words.
Adaptation and creation of algorithms to detect text reuse and citation networks across the corpus.
Development of an online platform for exploring text reuse data through visualisations.
Development of a preliminary reading environment showcasing project data.
Recognition for expertise in specific methods, including Optical Character Recognition and text reuse detection.

CDH Work

1. Corpus

The CDH team is a major partner in the Open Islamicate Text Initiative (OpenITI), which is developing a curated corpus of machine-readable texts in Arabic, Persian, Urdu and other Islamicate languages. The corpus is available on GitHub and is released periodically on Zenodo. The metadata and texts can be accessed through the KITAB app.

2. Methods

One of the main goals of the CDH is to develop computational methods for studying the written tradition in the Arabic script. Some of the digital methods used by the CDH team are adapted from approaches that are common in other disciplines but were not developed with pre-modern languages written in Arabic script in mind. Adapting them means pairing computer science with the humanities, and it is an iterative, multi-stage process. For more on the development of these methods, see the CDH blog.

These methods produce datasets that enhance understanding of the corpus and of Arabic book history, and that help answer a wide range of research questions. Learn more about our data here.

3. KITAB Project

The KITAB (Knowledge, Information Technology, and the Arabic Book) project uses innovative methods such as text reuse detection to study the long-term development of the Arabic written tradition. It aims to empower users to explore Arabic texts in new ways and to advance knowledge about one of the world's largest and most intricate textual traditions. The KITAB project was funded by the European Research Council from 2018 to 2023 (grant no. 772989). We are now also experimenting with text reuse in Persian. Click here to learn more about KITAB, and here to explore KITAB's text reuse data and visualisations.
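
As a rough illustration of what text reuse detection involves, the minimal Python sketch below scores how much two passages overlap by comparing their sets of word n-grams ("shingles"). It is a generic textbook technique shown under simplifying assumptions, not the KITAB project's own algorithm; the function names `shingles` and `reuse_score` and the sample passages are hypothetical, and a real pipeline for Arabic-script texts would also need orthographic normalisation and alignment steps that are omitted here.

```python
def shingles(text: str, n: int = 3) -> set[tuple[str, ...]]:
    """Return the set of word n-grams (shingles) found in a text."""
    words = text.split()  # naive whitespace tokenisation (an assumption)
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def reuse_score(a: str, b: str, n: int = 3) -> float:
    """Jaccard similarity of the two texts' shingle sets, from 0.0 to 1.0."""
    sa, sb = shingles(a, n), shingles(b, n)
    if not sa or not sb:
        # At least one text is shorter than n words: nothing to compare.
        return 0.0
    return len(sa & sb) / len(sa | sb)


# Hypothetical sample passages, used only to exercise the functions.
passage_a = "the quick brown fox jumps over the lazy dog"
passage_b = "a quick brown fox jumps over a sleeping cat"
score = reuse_score(passage_a, passage_b)
print(f"Shared 3-gram score: {score:.2f}")
```

A higher score means the two passages share more word sequences. Comparing every pair of passages in a corpus this way does not scale to billions of words, which is one reason dedicated text reuse tools rely on indexing and alignment rather than exhaustive pairwise comparison.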

4. Arabic-Script OCR and HTR

One of the main impediments to building a corpus of Arabic-script texts is that the conversion of scans of printed or handwritten books into machine-readable text lags far behind that of Latin-script languages. Together with partners at the University of Maryland, Northeastern University and the University of California San Diego, the CDH is working on improving the performance of Arabic-script OCR (Optical Character Recognition) and HTR (Handwritten Text Recognition). This work is sponsored by the Andrew W. Mellon Foundation. The first phase of this project focused on improving OCR for Arabic-script typefaces used in pre-computer age printing; the second phase focuses on Arabic-script manuscripts.

5. Arabic Pasts

This annual workshop, held since 2009 and co-hosted with SOAS University and the University of Oxford, offers an opportunity to reflect on history writing in Arabic. Click here to learn more about the workshop.

6. CDH Internship Programme

Since 2022, the CDH has offered annual six-week internships to students from Queen Mary University London. The internship aims to provide insight not only into the corpus work of the CDH, including OCR/HTR and metadata compilation, but also into algorithmic research, relevant workflows and tool building more broadly.