At present, the OpenITI/KITAB corpus comprises 10,243 text files, 6,268 of which are unique titles. Such a large and growing number of texts makes quality control challenging. At the same time, it is precisely this large number of texts that can serve as the basis for quantitative methods of quality control. In this blog post, Dr Lorenz Nigst briefly explores how a large number of annotated lines of poetry can be used for a simple quality check of such lines.
Many of the texts in the OpenITI/KITAB corpus contain poetic verses. Many of these verses are found in texts that carry the term dīwān in the title; others are found in texts that do not. Some of these verses are annotated and tagged as verses; others are not. Regardless of where verses are found, if they are annotated according to the OpenITI mARkdown system used by OpenITI/KITAB, the tag %~% is inserted either between the two hemistichs of a verse or, where a verse does not have two hemistichs, before or after the verse (see also “Verses of poetry” in the mARkdown documentation).
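To make the tagging convention concrete, here is a minimal Python sketch of how tagged verse lines might be picked out of a text and split at the %~% tag. The function names `split_verse` and `tagged_verses` and the sample lines are hypothetical and used only for illustration; the sketch also assumes that each verse sits on a single line and contains the tag at most once, which may not hold for every file in the corpus.

```python
VERSE_TAG = "%~%"  # verse tag described above


def split_verse(line):
    """Return (before, after) around the verse tag, or None if the line is untagged."""
    if VERSE_TAG not in line:
        return None
    # Split at the first occurrence of the tag only (assumption: one tag per line).
    before, _, after = line.partition(VERSE_TAG)
    return before.strip(), after.strip()


def tagged_verses(lines):
    """Yield (line_number, before, after) for every line carrying the verse tag."""
    for number, line in enumerate(lines, start=1):
        parts = split_verse(line)
        if parts is not None:
            yield number, parts[0], parts[1]


# Hypothetical sample lines, not taken from the corpus.
sample = [
    "plain prose line",
    "first hemistich %~% second hemistich",
    "%~% single-hemistich verse",
]

for number, before, after in tagged_verses(sample):
    print(number, repr(before), repr(after))
```

An empty string on one side of the tag indicates a verse that was tagged before or after the text rather than between two hemistichs.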