At present, the OpenITI/KITAB corpus comprises 10,243 text files, representing 6,268 unique titles. Such a large, and growing, number of texts makes quality control challenging. At the same time, it is precisely this large number of texts that can serve as the basis for quantitative methods of quality control. In this blog post, Dr Lorenz Nigst briefly explores how a large number of annotated lines of poetry can be used for a simple quality check of such lines.
Many of the texts in the OpenITI/KITAB corpus contain poetic verses. Many of these verses are found in texts that carry the term dīwān in the title; others are not. Some of these verses are annotated and tagged as verses; others are not. Regardless of where verses are found, if they are annotated according to the OpenITI mARkdown system used by OpenITI/KITAB, the tag %~% is inserted either between the two hemistichs of a verse, or before or after the verse when it does not consist of two hemistichs (see also here under “Verses of poetry”).
In many instances, the tag is inserted manually during the annotation work carried out by KITAB team members. In the vast majority of cases, however, the tag is inserted automatically by KITAB when new text files are converted to the OpenITI mARkdown format. Regardless of how the tag %~% was inserted, the number of lines containing it is substantial: a search of our corpus for lines containing %~% returns tens of thousands of results.
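As a rough illustration of such a search, the following Python sketch scans a sequence of lines for the %~% tag and splits each matching line at the tag. It is not the KITAB team's own tooling: the function name `find_verse_lines` and the sample lines are hypothetical, and the sketch assumes each verse is contained in a single line of text.

```python
from typing import Iterable, List, Tuple

# The verse tag described above (OpenITI mARkdown).
VERSE_TAG = "%~%"


def find_verse_lines(lines: Iterable[str]) -> List[Tuple[int, List[str]]]:
    """Return (line number, text segments) for every line containing the tag.

    Hypothetical helper for illustration only. It assumes one verse per line
    and treats the text on either side of the tag as a separate segment.
    """
    results = []
    for number, line in enumerate(lines, start=1):
        if VERSE_TAG not in line:
            continue
        # Split at the tag and drop empty segments, which occur when the
        # tag stands before or after a verse rather than between hemistichs.
        segments = [part.strip() for part in line.split(VERSE_TAG)]
        results.append((number, [s for s in segments if s]))
    return results


# Invented placeholder lines, not taken from the corpus.
sample = [
    "plain prose line",
    "first half %~% second half",
    "%~% single unit",
]

for number, segments in find_verse_lines(sample):
    print(number, segments)
```

Applied to real text files, the same function could be fed the lines of each file in turn; the number of segments per line then gives a first, crude indication of whether a tagged line has the expected shape.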
Dr Lorenz Nigst
Research Associate,
KITAB Corpus Management, AKU-ISMC.