AsoSoft Text Corpus
A text corpus is a large, structured collection of text documents, some or all of which are annotated. It is a significant language resource used in a wide range of NLP research and applications. For instance, a corpus is used in information retrieval systems, or to extract the language model and lexicon used by an automatic speech recognition system.
The AsoSoft text corpus is the largest Kurdish text corpus so far. Its first version contains 190 million tokens across 458 thousand documents. The documents were collected from sources including, but not limited to, websites, books, and magazines, and have been converted into the standard TEI format.
A large portion of the corpus is publicly available for research (non-commercial use). The AsoSoft text corpus repository on GitHub includes:
1. AsoSoft Text Corpus Large Version: This file contains 75 million tokens.
2. AsoSoft Text Corpus Small Version: This file contains 5 million tokens.
3. AsoSoft topic annotated dataset.
The AsoSoft text corpus is applicable to the following research areas:
• NLP and Speech Processing
Extracting language models
Word vector representation
Extracting computational lexicons
How to Use
Common editors for working with large text files include EmEditor, TlCorpus, and TextPad.
NOTE: If you use our text corpus, please cite us:
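The corpus can also be processed programmatically rather than opened in an editor. The following is a minimal Python sketch, not an official tool from this repository: it streams a file line by line, so the whole file never has to fit in memory, and counts token frequencies as a first step toward a lexicon. It assumes the downloaded file is plain UTF-8 text and that splitting on whitespace is an acceptable tokenization. The file name AsoSoft_Small.txt and the function names iter_tokens and build_lexicon are hypothetical. If the file carries TEI markup, an XML parser would be needed before this step.

```python
from collections import Counter
from pathlib import Path
from typing import Iterator


def iter_tokens(path: Path, encoding: str = "utf-8") -> Iterator[str]:
    """Yield whitespace-separated tokens, reading one line at a time."""
    with path.open(encoding=encoding) as handle:
        for line in handle:
            # str.split() with no arguments splits on any run of whitespace
            # and drops empty strings, so blank lines yield nothing.
            yield from line.split()


def build_lexicon(path: Path, top_n: int = 20) -> list[tuple[str, int]]:
    """Return the top_n most frequent tokens with their counts."""
    counts = Counter(iter_tokens(path))
    return counts.most_common(top_n)


if __name__ == "__main__":
    # Hypothetical file name; replace it with the path of the downloaded file.
    corpus_path = Path("AsoSoft_Small.txt")
    if corpus_path.exists():
        for token, freq in build_lexicon(corpus_path):
            print(f"{freq}\t{token}")
```

Whitespace splitting is a simplification: punctuation stays attached to words, so a real lexicon would need normalization and a proper Kurdish tokenizer first.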
Hadi Veisi, Mohammad MohammadAmini, Hawre Hosseini, (2018), Toward Kurdish Language Processing: Experiments in Collecting and Processing the AsoSoft Text Corpus. Digital Scholarship in the Humanities, Oxford University Press.
A portion of the AsoSoft Text Corpus is available to researchers for non-commercial research use.