Diff

This page shows the differences between two versions of the page.

--- snow:nagaoka_tigrinya_corpus [2021/09/07 16:15] – admin
+++ snow:nagaoka_tigrinya_corpus [2022/05/12 21:17] – admin
@@ Line 1: / Line 1: @@
-[[:eng:|Top]]＞[[:lab:|Former Lab]]＞[[:SNOW:|SNOW]]
+[[:eng:|Top]]＞[[:lab:|Lab]]＞[[:SNOW:|SNOW]]
 ~~NOTOC~~
 ===== Nagaoka Tigrinya Corpus =====
-==== 1. What is a corpus ? ====
+(Note that this page is written by [[:lab:phdthesis|Dr. Yemane Keleta Tedla]], when he was a student in [[:lab:|our lab]]. )
+=== 1. What is a corpus ? ===
 "In linguistics, a corpus (plural corpora) or text corpus is a large and structured set of texts (nowadays usually electronically stored and processed)" [[https://en.wikipedia.org/wiki/Text_corpus|[Wikipedia]]].
-==== 2. The Nagaoka Tigrinya corpus 1.0 (NTC 1.0) ====
+=== 2. The Nagaoka Tigrinya corpus 1.0 (NTC 1.0) ===
 The Nagaoka Tigrinya corpus is the first publicly available part-of-speech (PoS) tagged corpus of Tigrinya language. This text corpus is compiled at Nagaoka university of Technology. The corpus is a collection of news articles from an Eritrean newspaper called "[[https://shabait.com/2020/09/02/haddas-ertra_01092020/|Haddas Ertra]]".  It contains about 100 articles published between March 2013 and December 2013. The current release of NTC (NTC 1.0) has a total of 72,080 tokens.  On average, one sentence contains 15 words. The text was randomly selected from different domains (or Topics) of Haddas Ertra listed as follows.
-==== 3. Tagset design ====
+=== 3. Tagset design ===
 The corpus is manually tagged for part of speech tags with few enhancements done automatically. This released NTC 1.0 is labelled with 20 Tigrinya parts-of-speech that contain level-1 (Major PoS Category) and Level-2 (Type of Category) information. The tags are given as follows:
 The guidelines for tagging NTC 1.0 were developed based on three Tigrinya grammar books. These are:
-  - Tigrinya Grammar by Adi Ghebre (2000)
+  * Tigrinya Grammar by Adi Ghebre (2000)
-  - A Comprehensive Tigrinya Grammar by Amanuel Sahle (1998) and
+  * A Comprehensive Tigrinya Grammar by Amanuel Sahle (1998) and
-  - Tigrinya Grammar by John Mason (1996)
+  * Tigrinya Grammar by John Mason (1996)
 ==== 4. Format of NTC ====
@@ Line 28: / Line 31: @@
 NTC 1.0 can be used freely for research purposes.
-  - [[https://filedn.com/lit4DCIlHwxfS1gj9zcYuDJ/SNOW/NagaokaTigrinyaCorpus_1.0_tig_T2.rar|Download NTC 1.0]] - TEI format in Ge'ez script
+  - [[https://www.jnlp.org/cgi-priv/download.cgi?id=SNOW/NagaokaTigrinyaCorpus_1.0_tig_T2.rar|Download NTC 1.0]] - TEI format in Ge'ez script
-  - [[https://filedn.com/lit4DCIlHwxfS1gj9zcYuDJ/SNOW/NagaokaTigrinyaCorpus_1.0_rom_T2.rar|Download NTC 1.0]] - TEI format in Latin script (Transliterated)
+  - [[https://www.jnlp.org/cgi-priv/download.cgi?id=SNOW/NagaokaTigrinyaCorpus_1.0_rom_T2.rar|Download NTC 1.0]] - TEI format in Latin script (Transliterated)
-==== 6. Contact us ====
+=== 6. Contact us ===
-For any suggestions, corrections and usage of the corpus, please reach us at:
+For any suggestions, corrections and usage of the corpus, please reach us at [[yemanekeleta@gmail.com]].
-yemane@jnlp.org or yemanekeleta@gmail.com.
 We appreciate your input to help us improve the quality of NTC. We hope this corpus will encourage further Natural Language Processing (NLP) research on Tigrinya and other Eritrean languages.
-(NOTE: In case e-mail addresses above are unreachable or download links are dead, please contact to the admistrator of this Web site: yamamoto@jnlp.org)
+(NOTE: In case e-mail addresses above are unreachable or download links are dead, please contact [[:eng:|the admistrator]])
