Classroom:LING2208 - Annotating Norwegian Bokmål/Agreement statistics
Tagging of gender in Norwegian Bokmål
The Norwegian Bokmål corpora in TypeCraft do not follow a single convention for the gloss tags of adjectives: adjectives are sometimes glossed with grammatical gender tags and sometimes not.
One possible explanation for why only some adjectives are tagged with gender lies in the neuter form. In Norwegian, the neuter form of an adjective is usually distinct from the masculine and feminine forms, whereas the masculine and feminine forms are indistinguishable from each other and from the base form. One could therefore expect NEUT to be overrepresented among the adjectives tagged for gender.
Using TypeCraft's Phrase Search (for Norwegian Bokmål), three searches, [("POS:ADJ", "gloss:FEM"), ("POS:ADJ", "gloss:MASC"), ("POS:ADJ", "gloss:NEUT")], should return three values: the number of adjectives tagged with each gender. Summing them gives the total number of gender-tagged adjectives.
For comparison, three searches on the gloss tags alone, without the POS restriction, [("gloss:FEM"), ("gloss:MASC"), ("gloss:NEUT")], should return three further values: the total number of words in TypeCraft (for Bokmål) tagged with each gender.
Gender | Adjectives | Total for all tags in TypeCraft |
---|---|---|
FEM | 0 (0%) | 33 (6.33%) |
MASC | 13 (21%) | 302 (58%) |
NEUT | 49 (79%) | 186 (35.7%) |
Total: | 62 (100%) | 521 (100%) |
These results support the hypothesis above: NEUT accounts for a much larger share of the gender tags on adjectives (79%) than of the gender tags across all parts of speech (35.7%).
These results can be compared with the distribution of genders among nouns in the NoWaC corpus.[1]
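The percentages in the table can be recomputed from the raw counts. The following is a minimal sketch; the counts are copied from the table, while the dictionary and function names are illustrative only.

```python
# Counts copied from the table; the variable names are illustrative.
adj_counts = {"FEM": 0, "MASC": 13, "NEUT": 49}
all_counts = {"FEM": 33, "MASC": 302, "NEUT": 186}


def shares(counts):
    """Return each tag's share of the column total, as a percentage."""
    total = sum(counts.values())
    return {tag: 100 * n / total for tag, n in counts.items()}


adj_shares = shares(adj_counts)
all_shares = shares(all_counts)

for tag in adj_counts:
    print(f"{tag}: {adj_shares[tag]:.2f}% of gender-tagged adjectives, "
          f"{all_shares[tag]:.2f}% of all gender-tagged words")
```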
For each of the three tags (mask, fem and nøyt), run the following query:
Start with a sum of 0.
For each record in the file[1], do:
If the fourth column contains the substring "subst" and also the substring "<gender>", where <gender> is the tag being queried (mask, fem or nøyt[2]), add the number in the first column to the sum.
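The procedure above could be sketched in Python roughly as follows. This is not the script that was actually used: the function name `sum_noun_counts`, the tab separator, and the sample records are all assumptions made for illustration. Only the roles of the first column (the count) and the fourth column (the tags) come from the description above.

```python
def sum_noun_counts(lines, gender, sep="\t"):
    """Sum the first-column counts of all records whose fourth column
    contains both "subst" and the given gender tag.

    The separator is an assumption; adjust it to the actual file layout.
    """
    total = 0
    for line in lines:
        fields = line.rstrip("\n").split(sep)
        if len(fields) < 4:
            continue  # skip empty or malformed records
        tags = fields[3]
        # Plain substring matching, as in the description above.
        if "subst" in tags and gender in tags:
            total += int(fields[0])
    return total


# Invented sample records (not taken from NoWaC). The contents of the
# second and third columns are hypothetical.
sample = [
    "120\tbilen\tbil\tsubst mask appell ent be",
    "45\tboka\tbok\tsubst fem appell ent be",
    "80\thuset\thus\tsubst nøyt appell ent be",
    "300\tstor\tstor\tadj m/f ub ent pos",
    "30\tbiler\tbil\tsubst mask appell fl ub",
]

totals = {g: sum_noun_counts(sample, g) for g in ("mask", "fem", "nøyt")}
print(totals)
```

When reading the real file, the lines would come from an opened file object instead of a list, and the encoding passed to `open` would need to match the file for the nøyt query to work.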
Gender | Adjectives | Total for all tags in TypeCraft | Total for nouns in NoWaC | Ratio for ADJ to NoWaC |
---|---|---|---|---|
FEM | 0 (0%) | 33 (6.33%) | 20358360 (16.47%) | 0% |
MASC | 13 (21%) | 302 (58%) | 69209955 (56%) | 37.5% |
NEUT | 49 (79%) | 186 (35.7%) | 34026414 (27.53%) | 286.96% |
Total: | 62 (100%) | 521 (100%) | 123594729 (100%) | N/A |
The percentages in the first three data columns give each tag's share of that column's total (e.g. 56% of the nouns counted in NoWaC are tagged as masculine). The final column is the ratio of two of these shares: the share of each gender among entries tagged with ADJ in TypeCraft, divided by the share of that gender among nouns in NoWaC. A value above 100% indicates that a gender is glossed on adjectives more often than its natural frequency among nouns would suggest.
This ratio is further evidence that NEUT appears to be overrepresented as a tag for adjectives in TypeCraft.
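The ratio in the final column can be reproduced from the counts. The sketch below is illustrative only: the counts are copied from the table, and the names are hypothetical. It works from unrounded shares, which gives about 37.4% for MASC and 287.1% for NEUT. The table's values (37.5% and 286.96%) appear to have been computed from the rounded percentages, which explains the small difference.

```python
# Counts copied from the table; the variable names are illustrative.
adj_counts = {"FEM": 0, "MASC": 13, "NEUT": 49}
nowac_counts = {"FEM": 20358360, "MASC": 69209955, "NEUT": 34026414}


def share(counts, tag):
    """Share of one tag in the total of its column (between 0 and 1)."""
    return counts[tag] / sum(counts.values())


# Share among gender-tagged adjectives divided by share among NoWaC nouns.
ratios = {
    tag: share(adj_counts, tag) / share(nowac_counts, tag)
    for tag in adj_counts
}

for tag, ratio in ratios.items():
    print(f"{tag}: {100 * ratio:.2f}%")
```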
Notes
- [1] UiO, Frequency lists from NoWaC (Frequency list of analyzed word forms), http://www.hf.uio.no/iln/om/organisasjon/tekstlab/tjenester/nowac-frequency.html
- [2] Some problems with the file's encoding resulted in this being queried as nøyt.