Read this new blog post by AKU-ISMC's Dr Peter Verkinderen on A Token Frequency Counter For OpenITI Texts:
One of the participants in our KITAB user group asked for an easy way to find out which are the most frequently used words in a text.
There are quite a lot of online tools that allow you to upload a file (or provide a link to a web page) and will produce a nice table with word counts.
Unfortunately, this specific user group participant works with the largest text in the KITAB corpus, Ibn al-ʿAsākir’s Ta’rīkh Madīnat Dimashq, which weighs in at about 75 MB in the OpenITI version, and none of the sites I tried out was willing to accept such a large file. Others failed to produce a frequency list even with smaller books, probably because they don’t deal well with Arabic.
There are a number of other options to create frequency lists from a text, but most involve using a programming language like Python or R. Since most of our user group participants do not have any background in programming, we thought about using this question as an opportunity to show the user group how to install Python or R, and run a simple script to create the frequency list (for example, using the fantastic stylo library for R).
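To give an idea of what such a "simple script" might look like, here is a minimal Python sketch. It is not the code behind the online tool and does not use the openiti library. The function names and the tokenization rule (a token is any unbroken run of characters in the Unicode range U+0621 to U+0652, i.e. basic Arabic letters and vowel marks) are assumptions made for illustration only.

```python
import re
from collections import Counter

# Assumption: a "token" is an unbroken run of basic Arabic letters and
# vowel marks (U+0621..U+0652). Latin-script markup, ASCII digits and
# punctuation are simply skipped by this pattern.
ARABIC_TOKEN = re.compile(r"[\u0621-\u0652]+")


def token_frequencies(text, top_n=None):
    """Return (token, count) pairs for a string, most frequent first."""
    return Counter(ARABIC_TOKEN.findall(text)).most_common(top_n)


def token_frequencies_from_file(path, top_n=None):
    """Same as token_frequencies, but reads the file line by line so that
    a very large text does not have to be held in memory as one string."""
    counts = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            counts.update(ARABIC_TOKEN.findall(line))
    return counts.most_common(top_n)


if __name__ == "__main__":
    sample = "كتاب في كتاب PageV01P001 في كتاب"
    for token, count in token_frequencies(sample):
        print(token, count)
```

A script along these lines counts surface forms only: it does not normalize spelling variants or strip prefixes, so differently written forms of the same word are counted as separate tokens.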
However, even showcasing something simple like this to people who do not have Python or R installed and are working on different platforms always takes quite some time, time we didn’t have during our user group meeting.
So, after getting frustrated with a number of websites that promise to convert a text into a frequency list, I decided to build such an online tool for OpenITI myself.
This was quite straightforward because I could build on work we have done for our openiti Python library and other sub-projects.
The tool can be found here.
Dr Peter Verkinderen
A Post-doctoral Research Fellow at AKU-ISMC's KITAB working on the central regions of the Islamic lands. He studied Classics and Arabic and Islamic studies at Ghent University. His PhD dissertation, also at Ghent University, was a reconstruction of the fluvial landscape of early Islamic Lower Mesopotamia, based on (mostly Arabic) texts, satellite imagery and data from archaeological and geological research. He has worked as the Assistant Director of the Netherlands-Flemish Institute in Cairo (2009-2014) and as a Research Fellow in the ERC project “The Early Islamic Empire at Work” (Hamburg University, 2014-2019), where he focused on the position of Fārs (SW Iran) in the early Islamic empire.