言語商会

Differences

This shows the differences between two versions of this page.

snow:nagaoka_tigrinya_corpus [2021/09/07 16:15] admin
snow:nagaoka_tigrinya_corpus [2022/05/12 21:17] admin
Line 1: Line 1:
-[[:eng:|Top]]>[[:lab:|Former Lab]]>[[:SNOW:|SNOW]]
+[[:eng:|Top]]>[[:lab:|Lab]]>[[:SNOW:|SNOW]]
 ~~NOTOC~~
  
 ===== Nagaoka Tigrinya Corpus =====
-==== 1. What is a corpus ? ====
+
+(Note that this page was written by [[:lab:phdthesis|Dr. Yemane Keleta Tedla]] when he was a student in [[:lab:|our lab]].)
+
+=== 1. What is a corpus ? ===
 
 "In linguistics, a corpus (plural corpora) or text corpus is a large and structured set of texts (nowadays usually electronically stored and processed)" [[https://en.wikipedia.org/wiki/Text_corpus|[Wikipedia]]].
  
-==== 2. The Nagaoka Tigrinya corpus 1.0 (NTC 1.0) ====
+=== 2. The Nagaoka Tigrinya corpus 1.0 (NTC 1.0) ===
  
 The Nagaoka Tigrinya corpus is the first publicly available part-of-speech (PoS) tagged corpus of the Tigrinya language. This text corpus was compiled at Nagaoka University of Technology. The corpus is a collection of news articles from an Eritrean newspaper called "[[https://shabait.com/2020/09/02/haddas-ertra_01092020/|Haddas Ertra]]". It contains about 100 articles published between March 2013 and December 2013. The current release of NTC (NTC 1.0) has a total of 72,080 tokens. On average, one sentence contains 15 words. The text was randomly selected from different domains (or topics) of Haddas Ertra, listed as follows.
  
  
-==== 3. Tagset design ====
+=== 3. Tagset design ===
 The corpus was manually tagged with part-of-speech tags, with a few enhancements done automatically. The released NTC 1.0 is labelled with 20 Tigrinya part-of-speech tags that contain Level-1 (major PoS category) and Level-2 (type of category) information. The tags are given as follows:
  
 The guidelines for tagging NTC 1.0 were developed based on three Tigrinya grammar books. These are:
  
-  Tigrinya Grammar by Adi Ghebre (2000)
-  A Comprehensive Tigrinya Grammar by Amanuel Sahle (1998) and
-  Tigrinya Grammar by John Mason (1996)
+  Tigrinya Grammar by Adi Ghebre (2000)
+  A Comprehensive Tigrinya Grammar by Amanuel Sahle (1998) and
+  Tigrinya Grammar by John Mason (1996)
  
 ==== 4. Format of NTC ====
Line 28: Line 31:
 NTC 1.0 can be used freely for research purposes.
  
-  - [[https://filedn.com/lit4DCIlHwxfS1gj9zcYuDJ/SNOW/NagaokaTigrinyaCorpus_1.0_tig_T2.rar|Download NTC 1.0]] - TEI format in Ge'ez script
-  - [[https://filedn.com/lit4DCIlHwxfS1gj9zcYuDJ/SNOW/NagaokaTigrinyaCorpus_1.0_rom_T2.rar|Download NTC 1.0]] - TEI format in Latin script (Transliterated)
+  - [[https://www.jnlp.org/cgi-priv/download.cgi?id=SNOW/NagaokaTigrinyaCorpus_1.0_tig_T2.rar|Download NTC 1.0]] - TEI format in Ge'ez script
+  - [[https://www.jnlp.org/cgi-priv/download.cgi?id=SNOW/NagaokaTigrinyaCorpus_1.0_rom_T2.rar|Download NTC 1.0]] - TEI format in Latin script (Transliterated)
  
-==== 6. Contact us ====
+=== 6. Contact us ===
  
-For any suggestions, corrections and usage of the corpus, please reach us at
-yemane@jnlp.org or yemanekeleta@gmail.com.
+For any suggestions, corrections and usage of the corpus, please reach us at [[yemanekeleta@gmail.com]].
 We appreciate your input to help us improve the quality of NTC. We hope this corpus will encourage further Natural Language Processing (NLP) research on Tigrinya and other Eritrean languages.
  
-(NOTE: In case e-mail addresses above are unreachable or download links are dead, please contact to the administrator of this Web site: yamamoto@jnlp.org)
+(NOTE: In case e-mail addresses above are unreachable or download links are dead, please contact [[:eng:|the administrator]])
  