Difference between revisions of "Runyankore-Rukiga Corpus"

Revision as of 18:34, 8 January 2020

Runyakitara is a standard language based on four closely related languages of western Uganda: (Ru)nyoro, (Ru)tooro, (Ru)nyankore, and (Ru)kiga. These languages are spoken in south-western Uganda by approximately 6 million people, according to the Uganda National Population and Housing Census report (2014). (Ru)nyankore (ISO 639-3 nyn) and (Ru)kiga (ISO 639-3 cgg) are spoken in the Ankole and Kigezi regions, respectively.

Here we refer to (Ru)nyankore and (Ru)kiga together as Runyankore-Rukiga.

Download

(The Download is under preparation --Dorothee (talk) 11:14, 8 January 2020 (UTC))

The material consists of 298 sentences [ -- packaged -- time stamped ---], which were taken from naturally occurring data (narration, conversations) as well as from previously annotated linguistic sentence collections. The data is provided in the TC-XML format.

Description of the TypeCraft Runyankore-Rukiga corpus

Creation

The TypeCraft Runyankore-Rukiga corpus, of which the data presented here is a part, consists of narratives and short stories as well as elicited data. Texts are either transcriptions of oral narratives or fragments of newspaper texts from the Runyankore-Rukiga weekly newspaper Orumuri. ^[2] We also digitised sections of the novel Abagyenda Bareeba ‘Adventures of Travelers’ by Mubangizi (1997). ^[3] The data was created between 2006 and 2013 by native-speaker linguistics graduates, either as part of their class work or in the context of their master’s theses. The creation process was a collaborative effort coordinated by the principal investigators Dr. Allen Asiimwe (Makerere University, Uganda) and Prof. Dorothee Beermann (NTNU, Trondheim). The main student contributors were Justus Turamyomwe, Misah Natumanya and Allen Asiimwe. The collection has been extended continuously. For a closer look at the entire corpus, please go to the TypeCraft database. ^[4]

Size and Format

The TypeCraft Runyankore-Rukiga corpus consists of 143 426 words, corresponding to 28 057 sentences. A table of the most frequent word forms shows that the corpus is biased: although it was created by several TypeCraft users working on graduate projects addressing different topics, it seems to contain more sentences with locative words than one would expect to find in naturally occurring text. Most of the 20 most frequent word forms belong to the functional word classes, which is expected.
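
A frequency table of word forms of the kind described here can be produced by counting tokens across all sentences. The following is a minimal Python sketch, not the procedure actually used for the corpus. It assumes that sentences are available as plain, whitespace-separated strings; the function name `most_frequent_word_forms` and the sample tokens are hypothetical placeholders, not corpus data.

```python
from collections import Counter


def most_frequent_word_forms(sentences, n=20):
    """Return the n most frequent word forms as (form, count) pairs."""
    counts = Counter()
    for sentence in sentences:
        # Assumption: lowercasing and whitespace splitting are sufficient.
        # Real tokenisation of corpus text may need more care.
        counts.update(sentence.lower().split())
    return counts.most_common(n)


# Placeholder tokens only; these are not Runyankore-Rukiga words.
sample = ["aa bb cc", "aa bb", "aa dd"]
print(most_frequent_word_forms(sample, n=2))  # [('aa', 3), ('bb', 2)]
```

Comparing such a list against the expected distribution of function words is one simple way to spot a bias like the over-representation of locative words mentioned above.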

Table 1. Most frequent 20 words in the TypeCraft Runyankore-Rukiga corpus

Annotations and Standards

We have used two layers of annotation for the labeling of the RR-corpus. Traditionally, linguists have not consistently annotated examples for word class, but with the rise of the Digital Humanities, which has led to closer cooperation between linguists and computer scientists, POS-tagged corpora from linguistic work have become more common. Short definitions of the POS symbols can be found here: TypeCraft POS tags

Table 2. Part of Speech tags used for the annotation of Runyankore-Rukiga

The TypeCraft editor supports in-depth word-by-word annotation, for which TypeCraft provides a list of over 300 glosses. Projects working with TypeCraft can ask for customised glossing lists. For the annotation of Runyankore-Rukiga we worked with TypeCraft's standard glossing list, using 74 different tags. Thirteen different noun class tags were used, and the two most frequently used glosses are Initial-Vowel and Final-Vowel. The legend of the pie chart in Figure 1 lists the glosses in order of frequency, from left to right, starting from the top.
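
The frequency ordering used in such a legend can be derived by counting gloss tags over all annotated words and sorting them by count. The sketch below is illustrative only: the data layout (a list of dictionaries with a `glosses` key), the function name `gloss_shares`, and the tag labels are assumptions made for this example and do not reflect the TC-XML schema.

```python
from collections import Counter


def gloss_shares(annotated_words):
    """Return (gloss, count, share) tuples, most frequent gloss first."""
    counts = Counter(
        gloss for word in annotated_words for gloss in word["glosses"]
    )
    total = sum(counts.values())
    # most_common() sorts by descending count, matching a legend ordered
    # by frequency.
    return [(g, c, c / total) for g, c in counts.most_common()]


# Hypothetical annotations; forms and tag labels are illustrative only.
words = [
    {"form": "w1", "glosses": ["IV", "CL1", "FV"]},
    {"form": "w2", "glosses": ["IV", "FV"]},
    {"form": "w3", "glosses": ["IV"]},
]
for gloss, count, share in gloss_shares(words):
    print(f"{gloss}: {count} ({share:.0%})")
```

The length of the returned list gives the number of distinct tags in use, which corresponds to the kind of figure reported above (74 tags).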

Short definitions of the Gloss symbols can be found here: TypeCraft GLOSS tags.

Figure 1. Glosses used for the annotation of Runyankore-Rukiga

[1] TypeCraft allows users to create their own sentence collections from an existing corpus, which allows them to apply new annotations within the bounds set by the system.

[2] Today Orumuri can still be found on Facebook, but most of the articles presented there are now in English.

[3] Mubangizi, B.K. (1997). Abagyenda Bareeba. Memorial Single Volume. Kisubi: Marianum Press.

[4] You can search the TypeCraft database from the navigation bar on the left side of your browser window. From the TypeCraft Tools menu, select Search Texts, then specify the language and press ENTER.
@@ Line 35: / Line 35: @@
 ====Annotations and Standards====
-We have used two layers of annotation for the labeling of  the RR-corpus. Traditionally linguists do not consistently annotate examples for word class, but in the wake of the Digital Humanities with also lead to a closer cooperation between linguistics and computer science, the annotation of corpora resulting from linguistic work has become more common.
+We have used two layers of annotation for the labeling of  the RR-corpus. Traditionally linguists do not consistently annotate examples for word class, but in the wake of the Digital Humanities leading to a closer cooperation between linguistics and computer scientist, POS-tagged corpora from linguistic work have become more common.
 Short definitions of the POS symbols can be found here: [https://typecraft.org/tc2wiki/Special:TypeCraft/POSTags/ TypeCraft POS tags]
@@ Line 43: / Line 43: @@
-The TypeCraft editor supports the in-depth word-by-word annotation for which TypeCraft provided a list of over 300 glosses. Project working with TypeCraft can ask for customised glossing lists. For the annotation of Runyankore-Rukiga we worked with TypeCraft standard Glossing list, using 74 different tags. 13 different noun class tags were used. The legend of the pie chart in Figure 1.  lists the GLOSSES in the order of their frequency from the left to the right, starting from the top.
+The TypeCraft editor supports the in-depth word-by-word annotation for which TypeCraft provided a list of over 300 glosses. Projects working with TypeCraft can ask for customised glossing lists. For the annotation of Runyankore-Rukiga we worked with TypeCraft's standard Glossing list, using 74 different tags. 13 different noun class tags were used, and the two most frequenctly used glosses are ''Initial-''  and ''Final-Vowel''. The legend of the pie chart in Figure 1.  lists the Glosses in the order of their frequency from the left to the right, starting from the top.
-Short definitions of the Gloss symbols  can be found here: [https://typecraft.org/tc2wiki/Special:TypeCraft/GlossTags/ TypeCraft GLOSS tags].
+Short definitions of the Gloss symbols can be found here: [https://typecraft.org/tc2wiki/Special:TypeCraft/GlossTags/ TypeCraft GLOSS tags].
 '''Figure 1.  Glosses used for the annotation of Runyankore-Rukiga'''
 [[File:RR glosses-09-2018.png]]
