言語商会

Nagaoka Tigrinya Corpus

(Note: this page was written by Dr. Yemane Keleta Tedla when he was a student in our lab.)

1. What is a corpus?

“In linguistics, a corpus (plural corpora) or text corpus is a large and structured set of texts (nowadays usually electronically stored and processed)” [Wikipedia].

2. The Nagaoka Tigrinya Corpus 1.0 (NTC 1.0)

The Nagaoka Tigrinya Corpus is the first publicly available part-of-speech (PoS) tagged corpus of the Tigrinya language. It was compiled at Nagaoka University of Technology from news articles in the Eritrean newspaper “Haddas Ertra”. It contains about 100 articles published between March 2013 and December 2013. The current release (NTC 1.0) has a total of 72,080 tokens, and a sentence contains 15 words on average. The text was randomly selected from different domains (or topics) of Haddas Ertra, listed as follows.

3. Tagset design

The corpus was tagged for part of speech manually, with a few enhancements applied automatically. NTC 1.0 is labelled with 20 Tigrinya part-of-speech tags that carry Level-1 (major PoS category) and Level-2 (type of category) information. The tags are given as follows:

The guidelines for tagging NTC 1.0 were developed based on three Tigrinya grammar books. These are:

  • Tigrinya Grammar by Adi Ghebre (2000)
  • A Comprehensive Tigrinya Grammar by Amanuel Sahle (1998) and
  • Tigrinya Grammar by John Mason (1996)

4. Format of NTC

Tigrinya uses the Ge'ez script as its writing system. The corpus is available in both Ge'ez script and transliterated Latin script. The SERA transliteration scheme was used with a few adjustments: the upper-case 'I' is used exclusively to mark the epenthetic vowel (known as 'sads' in Ge'ez script).

For machine readability and flexible manipulation, the corpus was pre-processed (cleaned) and encoded in TEI corpus format. The retained punctuation marks are ፡ (two dots), ። (four dots), ፧ (three dots) or ?, !, “” and (). The first three are specific to languages that use the Ge'ez script.

To normalize the corpus, cliticized words (words joined by an apostrophe) were separated into their constituent parts. For example, ክጽሕፍ’ዩ /kISIHIfI’yu/ ‘he will write’ is a cliticized form of the two words ክጽሕፍ /kISIHIfI/ and እዩ /Iyu/. Such forms arise because it is customary in writing to mask laryngeals such as እ ‘I’, ኣ ‘a’ or ኢ ‘i’ with an apostrophe.
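The clitic separation described above can be illustrated with a minimal Python sketch. This is not the procedure actually used to build NTC, only an illustration under stated assumptions. Because the apostrophe can stand for any of several laryngeals, the sketch assumes a lookup table that maps the part after the apostrophe to its full form. The names `CLITIC_EXPANSIONS` and `split_clitic` are hypothetical, and the table holds only the single example given in the text, in both scripts.

```python
# Hypothetical lookup table (an assumption, not part of NTC): maps the part
# written after the apostrophe to the full word with its laryngeal restored.
# It contains only the one example from the text, in both scripts.
CLITIC_EXPANSIONS = {
    "yu": "Iyu",   # SERA transliteration
    "ዩ": "እዩ",     # Ge'ez script
}

# Both the typographic and the ASCII apostrophe are accepted.
APOSTROPHES = ("’", "'")


def split_clitic(token):
    """Return the constituent words of a cliticized token.

    Tokens without an apostrophe, or whose clitic part is not in the
    lookup table, are returned unchanged as a one-element list.
    """
    for mark in APOSTROPHES:
        if mark in token:
            host, _, clitic = token.partition(mark)
            full_form = CLITIC_EXPANSIONS.get(clitic)
            if host and full_form:
                return [host, full_form]
            return [token]
    return [token]


print(split_clitic("kISIHIfI’yu"))  # ['kISIHIfI', 'Iyu']
print(split_clitic("ክጽሕፍ’ዩ"))        # ['ክጽሕፍ', 'እዩ']
```

A lookup table is used here instead of a fixed rule because the masked laryngeal cannot be recovered from the apostrophe alone. A fuller table would have to be built from the corpus itself.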

5. Downloads

NTC 1.0 can be used freely for research purposes.

6. Contact us

For suggestions, corrections, or questions about using the corpus, please reach us at yemanekeleta@gmail.com. We appreciate your input to help us improve the quality of NTC. We hope this corpus will encourage further Natural Language Processing (NLP) research on Tigrinya and other Eritrean languages.

(NOTE: If the e-mail address above is unreachable or the download links are dead, please contact the administrator.)

 (Comments, requests, and information)