言語商会

Differences

This page shows the differences between two versions of the page.

snow:nagaoka_tigrinya_corpus [2022/05/12 21:16] admin
snow:nagaoka_tigrinya_corpus [2022/05/12 21:17] (current) admin
Line 4:
 ===== Nagaoka Tigrinya Corpus =====
  
-(Note that this page is written by [[:lab:phdthesis|YEMANE KELETA TEDLA]] when he was a student in [[:lab:|our lab]]. )
+(Note that this page is written by [[:lab:phdthesis|Dr. Yemane Keleta Tedla]]when he was a student in [[:lab:|our lab]]. )
  
 === 1. What is a corpus ? ===
Line 24:
   * Tigrinya Grammar by John Mason (1996)
  
-==== 4. Format of NTC ====
+=== 4. Format of NTC ===
  
-Tigrinya uses the Ge'ez Script as its writing system. The corpus is available in both Ge'ez script and Transliterated Latin script. [[http://yacob.org/papers%2Fetinet.pdf|SERA]] transliteration scheme has been used with a few adjustments. The upper case 'I' was used to exclusively mark the epenthetic vowel (known as 'sads' in Ge'ez script). For machine readability and flexible manipulation, the corpus was pre-processed (cleaned) and encoded in [[https://tei-c.org/release/doc/tei-p5-doc/en/html/CC.html|TEI corpus format]]. The retained punctuation marks are ፡ (two dots), ። (four dots), ፧ (three dots) or ?, !, "" and (). The first three are specific to languages that use the Ge'ez script. In order to normalize the corpus, cliticized words (words joined by an apostrophe) are separated into their constituent parts. For example, ክጽሕፍ’ዩ /kISIHIfI’yu/ ‘he will write’ is a cliticized form of the two words ክጽሕፍ /kISIHIfI/ and እዩ /Iyu/. This tendency occurs because it is customary to mask laryngeals such as እ ‘I’, ኣ ‘a’ or ኢ ‘i’ with an apostrophe while writing.
+Tigrinya uses the Ge'ez Script as its writing system. The corpus is available in both Ge'ez script and Transliterated Latin script. [[http://yacob.org/papers%2Fetinet.pdf|SERA]] transliteration scheme has been used with a few adjustments. The upper case 'I' was used to exclusively mark the epenthetic vowel (known as 'sads' in Ge'ez script). For machine readability and flexible manipulation, the corpus was pre-processed (cleaned) and encoded in [[https://tei-c.org/release/doc/tei-p5-doc/en/html/CC.html|TEI corpus format]]. The retained punctuation marks are ፡ (two dots), ። (four dots), ፧ (three dots) or ?, !, "" and (). The first three are specific to languages that use the Ge'ez script. In order to normalize the corpus, cliticized words (words joined by an apostrophe) are separated into their constituent parts.
+For example, ክጽሕፍ’ዩ /kISIHIfI’yu/ ‘he will write’ is a cliticized form of the two words ክጽሕፍ /kISIHIfI/ and እዩ /Iyu/. This tendency occurs because it is customary to mask laryngeals such as እ ‘I’, ኣ ‘a’ or ኢ ‘i’ with an apostrophe while writing.
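
The clitic separation described in the paragraph above could be sketched roughly as follows. This is only an illustration, not the procedure actually used to build NTC: the function names and the lookup table `CLITIC_FULL_FORMS` are hypothetical, the table holds only the single pair given in the text, and treating every word-internal apostrophe as a clitic boundary is an assumption.

```python
import re

# Hypothetical lookup table: clitic as written after the apostrophe -> full
# word with the masked laryngeal restored. Only the pair attested in the
# text is listed; a real table would need a proper lexicon.
CLITIC_FULL_FORMS = {
    "yu": "Iyu",  # transliterated (SERA-style) form
    "ዩ": "እዩ",    # Ge'ez script form
}

# Assumption: both the typographic and the ASCII apostrophe mark a clitic join.
APOSTROPHES = "’'"


def split_clitics(token):
    """Split one cliticized token into its constituent words."""
    parts = [p for p in re.split(f"[{APOSTROPHES}]", token) if p]
    if len(parts) < 2:
        # No word-internal apostrophe: leave the token untouched.
        return [token]
    host, clitics = parts[0], parts[1:]
    # Clitics missing from the table are kept as written rather than guessed.
    return [host] + [CLITIC_FULL_FORMS.get(c, c) for c in clitics]


def normalize(text):
    """Apply clitic splitting to every whitespace-separated token."""
    return " ".join(w for tok in text.split() for w in split_clitics(tok))
```

For instance, `split_clitics("kISIHIfI’yu")` yields the two words `kISIHIfI` and `Iyu`, matching the example in the text. Which laryngeal (እ, ኣ or ኢ) was masked cannot be recovered from the surface form alone, which is why the sketch relies on a lookup table instead of a rule.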
Line 31:
 NTC 1.0 can be used freely for research purposes.
  
-  [[https://www.jnlp.org/cgi-priv/download.cgi?id=SNOW/NagaokaTigrinyaCorpus_1.0_tig_T2.rar|Download NTC 1.0]] - TEI format in Ge'ez script
-  [[https://www.jnlp.org/cgi-priv/download.cgi?id=SNOW/NagaokaTigrinyaCorpus_1.0_rom_T2.rar|Download NTC 1.0]] - TEI format in Latin script (Transliterated)
+  [[https://www.jnlp.org/cgi-priv/download.cgi?id=SNOW/NagaokaTigrinyaCorpus_1.0_tig_T2.rar|Download NTC 1.0]] - TEI format in Ge'ez script
+  [[https://www.jnlp.org/cgi-priv/download.cgi?id=SNOW/NagaokaTigrinyaCorpus_1.0_rom_T2.rar|Download NTC 1.0]] - TEI format in Latin script (Transliterated)
  
 === 6. Contact us ===
 (Comments, requests, and information)