言語商会

Diff

This page shows the differences between two revisions of the page.

snow:nagaoka_tigrinya_corpus [2021/09/07 16:15] admin
snow:nagaoka_tigrinya_corpus [2022/05/12 21:17] (current) admin
Line 1: Line 1:
-[[:eng:|Top]]>[[:lab:|旧研究室]]>[[:SNOW:|SNOW]]
+[[:eng:|Top]]>[[:lab:|Lab]]>[[:SNOW:|SNOW]]
 ~~NOTOC~~
 
 ===== Nagaoka Tigrinya Corpus =====
-==== 1. What is a corpus ? ====
+
+(Note that this page is written by [[:lab:phdthesis|Dr. Yemane Keleta Tedla]], when he was a student in [[:lab:|our lab]]. )
+
+=== 1. What is a corpus ? ===
 
 "In linguistics, a corpus (plural corpora) or text corpus is a large and structured set of texts (nowadays usually electronically stored and processed)" [[https://en.wikipedia.org/wiki/Text_corpus|[Wikipedia]]].
  
-==== 2. The Nagaoka Tigrinya corpus 1.0 (NTC 1.0) ====
+=== 2. The Nagaoka Tigrinya corpus 1.0 (NTC 1.0) ===
 
 The Nagaoka Tigrinya corpus is the first publicly available part-of-speech (PoS) tagged corpus of the Tigrinya language. This text corpus is compiled at Nagaoka University of Technology. The corpus is a collection of news articles from an Eritrean newspaper called "[[https://shabait.com/2020/09/02/haddas-ertra_01092020/|Haddas Ertra]]". It contains about 100 articles published between March 2013 and December 2013. The current release of NTC (NTC 1.0) has a total of 72,080 tokens. On average, one sentence contains 15 words. The text was randomly selected from different domains (or Topics) of Haddas Ertra listed as follows.
  
  
-==== 3. Tagset design ====
+=== 3. Tagset design ===
 The corpus is manually tagged for part of speech tags with few enhancements done automatically. This released NTC 1.0 is labelled with 20 Tigrinya parts-of-speech that contain level-1 (Major PoS Category) and Level-2 (Type of Category) information. The tags are given as follows:
  
 The guidelines for tagging NTC 1.0 were developed based on three Tigrinya grammar books. These are:
  
-  Tigrinya Grammar by Adi Ghebre (2000)
-  A Comprehensive Tigrinya Grammar by Amanuel Sahle (1998) and
-  Tigrinya Grammar by John Mason (1996)
+  Tigrinya Grammar by Adi Ghebre (2000)
+  A Comprehensive Tigrinya Grammar by Amanuel Sahle (1998) and
+  Tigrinya Grammar by John Mason (1996)
  
-==== 4. Format of NTC ====
+=== 4. Format of NTC ===
  
 Tigrinya uses the Ge'ez Script as its writing system. The corpus is available in both Ge'ez script and Transliterated Latin script. [[http://yacob.org/papers%2Fetinet.pdf|SERA]] transliteration scheme has been used with a few adjustments. The upper case 'I' was used to exclusively mark the epenthetic vowel (known as 'sads' in Ge'ez script). For machine readability and flexible manipulation, the corpus was pre-processed (cleaned) and encoded in [[https://tei-c.org/release/doc/tei-p5-doc/en/html/CC.html|TEI corpus format]]. The retained punctuation marks are ፡ (two dots), ። (four dots), ፧ (three dots) or ?, !, "" and (). The first three are specific to languages that use the Ge'ez script. In order to normalize the corpus, cliticized words (words joined by an apostrophe) are separated into their constituent parts.
 For example, ክጽሕፍ’ዩ /kISIHIfI’yu/ ‘he will write’ is a cliticized form of the two words ክጽሕፍ /kISIHIfI/ and እዩ /Iyu/. This tendency occurs because it is customary to mask laryngeals such as እ ‘I’, ኣ ‘a’ or ኢ ‘i’ with an apostrophe while writing.
Line 28: Line 31:
 NTC 1.0 can be used freely for research purposes.
  
-  [[https://filedn.com/lit4DCIlHwxfS1gj9zcYuDJ/SNOW/NagaokaTigrinyaCorpus_1.0_tig_T2.rar|Download NTC 1.0]] - TEI format in Ge'ez script
-  [[https://filedn.com/lit4DCIlHwxfS1gj9zcYuDJ/SNOW/NagaokaTigrinyaCorpus_1.0_rom_T2.rar|Download NTC 1.0]] - TEI format in Latin script (Transliterated)
+  [[https://www.jnlp.org/cgi-priv/download.cgi?id=SNOW/NagaokaTigrinyaCorpus_1.0_tig_T2.rar|Download NTC 1.0]] - TEI format in Ge'ez script
+  [[https://www.jnlp.org/cgi-priv/download.cgi?id=SNOW/NagaokaTigrinyaCorpus_1.0_rom_T2.rar|Download NTC 1.0]] - TEI format in Latin script (Transliterated)
  
-==== 6. Contact us ====
+=== 6. Contact us ===
  
-For any suggestions, corrections and usage of the corpus, please reach us at
-yemane@jnlp.org or yemanekeleta@gmail.com.
+For any suggestions, corrections and usage of the corpus, please reach us at [[yemanekeleta@gmail.com]].
 We appreciate your input to help us improve the quality of NTC. We hope this corpus will encourage further Natural Language Processing (NLP) research on Tigrinya and other Eritrean languages.
  
-(NOTE: In case e-mail addresses above are unreachable or download links are dead, please contact to the admistrator of this Web site: yamamoto@jnlp.org)
+(NOTE: In case e-mail addresses above are unreachable or download links are dead, please contact [[:eng:|the admistrator]])
  
 (Comments, requests, and information)