Difference between revisions of "TypeGram"
Revision as of 08:18, 2 April 2020
--Lars Hellan 09:53, 9 February 2015 (UTC)
TypeGram Contributors: Lars Hellan, Tore Bruland, Dorothee Beermann (all NTNU)
Downloads from: http://regdili.hf.ntnu.no:8081/typegramusers/menu.
TypeGram (cf. Hellan 2010, Hellan and Beermann 2011, 2014, Bruland 2011) consists of a grammar shell and an application for feeding information from Interlinear Glossed Text (IGT) into the shell, to yield a partial grammar of the language represented in the IGT. The IGT comes from TypeCraft (cf. Beermann (2014, 2015), Beermann and Mihaylov (2011, 2013)). The grammar shell is called Global Grammar (cf. Global Grammar); it uses the formalism of HPSG (cf. Pollard and Sag 1994) and the computational platform LKB (cf. Copestake 2002). The grammar comes with a ready-defined inventory of grammatical types and rules hypothesized to accommodate structures from most types of languages of the world. A description of the system is given in Building Global Grammar. The application converts information contained in the IGT into material suited for a content-word lexicon, a function-word lexicon, and a file of inflectional rules for a grammar of the language in question; these files are added to Global Grammar, thereby defining a partial grammar of the language. The insertion of this material is incremental, and can thus be repeated for any new or enlarged set of IGT available for the language.
Recognizing the complexity of building computational grammars, the present design makes it possible to address a language's lexicon and morphology through an intermediate level of representation where lexical items are represented by English glosses, and grammatical morphs by functional gloss tags representing their content. These specifications we call meta-specifications; with them, the syntactic and semantic part of the grammar is defined partly independently of a full representation of the morphology of the language. In testing such a modular grammar definition, sentences of the language can be presented as strings composed of the gloss symbols, called meta-strings: relative to the IGT of a given sentence, the meta-string of the sentence is thus the concatenation of gloss symbols occurring in the IGT. The standardized set of Gloss- and POS tags in TypeCraft allows for this meta-level of representation to be defined as a closed inventory of labels. This in turn provides a transparent format for comparison of the syntactic-semantic structures of languages, whereby the intermediate 'meta-level' is not just a possible heuristic stepping-stone for the construction of a full grammar, but also a format for typological comparison.
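As a minimal sketch of how a meta-string is assembled, the following Python snippet concatenates gloss symbols word by word. The in-memory representation (a list of words, each a list of tagged symbols) and the function name `meta_string` are assumptions made for illustration only and are not the format used by TypeCraft or the converter. The example data reproduce the Ga meta-string quoted in the practical guide below, with a segmentation into individual symbols that is likewise assumed.

```python
from typing import List, Tuple

# Hypothetical representation: a morph is (kind, symbol), where kind is
# "tag" for a functional gloss tag and "gloss" for an English stem gloss.
Morph = Tuple[str, str]
Word = List[Morph]


def meta_string(sentence: List[Word]) -> str:
    """Concatenate the gloss symbols inside each word; separate words by spaces."""
    return " ".join(
        "".join(symbol for _kind, symbol in word) for word in sentence
    )


# Assumed segmentation of the Ga example 'my face blackens me' = 'I get angry'.
ga_sentence = [
    [("tag", "1SG"), ("tag", "POSS"), ("gloss", "face")],
    [("tag", "AOR"), ("gloss", "black")],
    [("tag", "1SG")],
]

print(meta_string(ga_sentence))  # prints: 1SGPOSSface AORblack 1SG
```

Because the result is built only from gloss symbols, the same function would apply unchanged to IGT from any language annotated with the same tag inventory.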
Reflecting this design, the download site provides a sample grammar where all of the elements mentioned are instantiated, and can be run on the computational platform. The download site also provides the converter for turning TypeCraft IGT into specifications that are incorporated in a grammar-to-be-constructed, both for 'object' and 'meta'-level construction. The conversion processor is called TypeGramUtil2 (cf. Bruland 2011), and is the item called TypeGram Software on the download site.
The IGT from TypeCraft is downloaded as XML directly from the TypeCraft editor; an example XML file is also available at the download site.
Setting. The idea of a 'general core' grammar structure is present in many conceptions of 'Universal grammar'. For computational instantiations partly related to the present approach, see Bender et al. 2010 ('HPSG Grammar Matrix'), Ranta 2011 ('Grammatical Framework'), Müller, to appear ('CoreGram'). The idea of IGT import into HPSG grammars is addressed also in Bender 2013. Beermann 2015 presents a procedure of import from TypeCraft into the LFG-based application XLFG.
The grammar
So far, Global Grammar essentially contains verb construction specifications. Both syntax and semantics are included. The classification of construction types is done according to the 'Construction Labeling System' (CLS); see Verbconstructions cross-linguistically - Introduction, The Construction Labeling system and Derivation in the Construction Labeling system. The general linkage between this system and the structures constituting the grammar is described in Global Grammar, section 3.
In the download site, the item Grammar includes the shell Global Grammar plus lexical and inflectional specifications corresponding to small corpora of active constructions in Norwegian and Ga (a Kwa language spoken in Ghana), representing more than 200 construction types in Norwegian and more than 100 in Ga; a similar inventory has been started for Kistaninya (an Ethio-Semitic language spoken in Ethiopia). In this capacity, Grammar contains the following files ('tdl' for 'type description language', a code suited for the computational system in question):
- 'types.tdl' -- the core assembly of types
- 'labeltypes.tdl' -- types defined for all labels in CLS, based on types.tdl
- 'gatemplates.tdl' -- construction types in Ga defined in terms of the types defined in labeltypes.tdl
- 'nortemplates.tdl' -- construction types in Norwegian defined in terms of the types defined in labeltypes.tdl
- 'kistanetemplates.tdl' -- construction types in Kistaninya defined in terms of the types defined in labeltypes.tdl
- 'rules.tdl' -- a small number of syntactic rules sufficient for the construction array in question
- 'lrules.tdl' -- a small number of lexical rules sufficient for the construction array and lexical types in question
- 'inflr.tdl' -- a small number of inflection rules sufficient for the construction array and lexical types in question, construed for the meta-level of specification
- 'lexicon' -- an assembly of lexical items construed for the meta-level of specification, with lexical types for verbs reflecting the construction type they head.
- 'test' -- approx. 500 meta-sentences, i.e., meta-strings as defined above, instantiating all the construction types represented in the corpora mentioned; each sentence thus consists of English stems combined with abstract symbols for functional and inflectional categories.
- 'results' -- batch parsing results showing how many parses are provided for each meta-sentence.
In testing the grammar, one can copy strings from 'test' into the LKB Parse window, and see the various modes of analytic display. (See Using the grammar below.)
It should be noted that the 'meta'-glosses used in this package are hand-built according to a system from 2008 that differs slightly from the gloss system presently used in TypeCraft, but they function according to exactly the same principles as explained above.
The converter - technical guide
The converter works as follows (below see a practical guide for its deployment):
Install. We assume that Java version "1.7.0_25" or higher is installed on your computer. Copy the TypeGramUtil2 folder to a place on your hard disk. In Linux, either select TypeGramUtil2.jar in the file browser, right-click, select "properties", then "permissions", and check "Execute" (allow executing file as program); or open a terminal application, change directory to the folder which contains TypeGramUtil2.jar, and run: chmod u=rx TypeGramUtil2.jar.
Start application. In Windows: double-click on the TypeGramUtil2.jar file (called 'start_win'). In Linux: either right-click on the TypeGramUtil2.jar file and choose Open with -> your Java version; or open a terminal application, change directory to the folder which contains TypeGramUtil2.jar, make the run_unix bash script runnable with: chmod u=rx run_unix, and start it with: ./run_unix.
Use of the application. Starting the TypeGramUtil2.jar file produces a graphical interface (GI). The application reads a downloaded XML file from TypeCraft and creates/reads/updates the LKB files:
- prefix + FuncWord.tdl
- prefix + Infl.tdl
- prefix + Lex.tdl
- prefix + MetaFuncWord.tdl
- prefix + MetaInfl.tdl
- prefix + MetaLex.tdl
- prefix + Gloss.txt
The downloaded XML file from TypeCraft is called tc2_export.xml or tc2_export(num).xml (a new number for each new download). The downloaded file is stored in the folder:
- In Windows: "My Documents/Downloads"
- In Unix: Home/Downloads
We recommend moving the file to another folder and renaming it, see below.
The GI has a button named "Input" that selects the downloaded XML file. Click on the button, and you can define exactly in which folder this file will be found. The GI has a button named "Output" that selects the destination for the LKB files. In the default case, it will be the same as specified for "Input" (but see below). The GI has a text field named "File Prefix" that sets the prefix for the LKB files. For example, the prefix "nor" gives the following files (given a selection of Norwegian IGT from TypeCraft):
- norFuncWord.tdl
- norInfl.tdl
- norLex.tdl
- norMetaFuncWord.tdl
- norMetaInfl.tdl
- norMetaLex.tdl
- norGloss.txt
The button "Transfer" reads the XML file, reads the LKB files, adds the new entries, and writes the result back to the LKB files. An item is a duplicate if it is already stored in a file, and duplicates are not written to the LKB files. After each transfer, a set of counters is updated for duplicate items and new items for each file. When all the "new" counters are zero, no new information was found in the XML file. A message is displayed with the number of items written for each LKB file. For example: "Numbers saved. lex: 248, infl: 21, funk: 48, gloss:58".
When the application is closed, the "Input", "Output" and "File Prefix" values are saved in the TypeGram.ini file. Next time you start the application you get the previous values for "Input", "Output" and "File Prefix".
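The file naming and the duplicate handling described for "Transfer" can be illustrated by a small Python sketch. It is not the code of TypeGramUtil2 (which is a Java application); the function names `output_names` and `transfer`, and the treatment of each stored item as a plain string, are assumptions made for illustration.

```python
from typing import Iterable, List, Tuple

# File name suffixes as listed in the description above.
SUFFIXES = [
    "FuncWord.tdl", "Infl.tdl", "Lex.tdl",
    "MetaFuncWord.tdl", "MetaInfl.tdl", "MetaLex.tdl",
    "Gloss.txt",
]


def output_names(prefix: str) -> List[str]:
    """Return the seven LKB file names derived from a file prefix."""
    return [prefix + suffix for suffix in SUFFIXES]


def transfer(stored: List[str], incoming: Iterable[str]) -> Tuple[int, int]:
    """Append unseen items to `stored` in place; return (new, duplicates)."""
    seen = set(stored)
    new = duplicates = 0
    for item in incoming:
        if item in seen:
            duplicates += 1      # already in the file: not written again
        else:
            stored.append(item)
            seen.add(item)
            new += 1
    return new, duplicates


print(output_names("nor")[:3])   # ['norFuncWord.tdl', 'norInfl.tdl', 'norLex.tdl']

lexicon = ["entry-dog"]          # placeholder strings, not real TDL entries
print(transfer(lexicon, ["entry-dog", "entry-bark"]))  # (1, 1)
print(transfer(lexicon, ["entry-dog", "entry-bark"]))  # (0, 2)
```

The second call returns a "new" count of zero, which corresponds to the situation described above where the XML file contributes no new information.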
The converter - practical guide
(This set of steps refers to the use of the Windows version as described above.)
1. Make a folder which will be the habitat for a grammar of Ga, and name it GaG.
2. Download 'Grammar.zip' from 'TypeGram for Users'. It gets downloaded to 'Downloads'.
Extract all the zipped parts, resulting in a folder Grammar. Move this folder into GaG.
3. Download 'Example TypeCraft XML ga_export.xml' from 'TypeGram for Users'. Also it gets downloaded to 'Downloads', and it can be moved directly into GaG. (This is for demonstration - in the normal case you will bring this xml export directly from TypeCraft to this folder.)
4. Download 'TypeGram Software Java Jar file for Unix and Windows typegram.zip' from 'TypeGram for Users'. Like in step 2, unzip it inside of 'Downloads', and move the unzipped folder 'typeGram' to GaG.
5. Open 'typegram', and you find three items inside. Click on 'start_win', and the GI opens.
6. The line to the right of the button 'Input' describes the item to be used as input. The path corresponds to the folder chosen, but note that the item you see the first time the GI opens - 'default.xml' - is only a placeholder name. Click on "Input", and select through the folder system the item 'ga_export.xml'.
7. We now want to define the place where the created lkb-files will end up. We could let this be Grammar, since this is where they will be used, but as an intermediate step, we create a folder Converted tdl-files, from which we in subsequent steps can make selections of files. This is now selected as output area, under "Output".
8. In the line to the right of 'Prefix', replace 'typegram' with 'ga'.
9. Click "Transfer". The folder Converted tdl-files now contains seven files: 'gaFuncWord', 'gaInfl' and 'gaLex' represent the relevant 'object language' items, whereas 'gaMetaFuncWord', 'gaMetaInfl' and 'gaMetaLex' represent the gloss words and gloss tags of the IGTs. The file 'gaGloss' contains the full gloss specification of each sentence assembled as word-sized units in a string, as in "1SGPOSSface AORblack 1SG" ('my face blackens me' = 'I get angry'), the items of which thereby match the items in the files 'gaMetaFuncWord', 'gaMetaInfl' and 'gaMetaLex'. The GI provides counts for each file, as explained above.
10. Depending on whether you want to create an 'object grammar' of Ga or a 'metagrammar', you import either 'gaFuncWord', 'gaInfl' and 'gaLex', or 'gaMetaFuncWord', 'gaMetaInfl' and 'gaMetaLex', into Grammar. For one caveat, see next point.
11. As Grammar is already defined, it contains test lexicon and inflection files which match a test suite called 'test', which is a set of meta-strings based on Norwegian and Ga. Before running with the new files, the old ones have to be given a prefix 'old' or so, with the new files taking over the names used for the previous ones (this is in order to make the grammar's loading definitions apply as normal).
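Steps 10 and 11 amount to a file swap, which can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `install_files` is hypothetical, and the mapping from a converted file to the name the grammar expects (here 'gaMetaLex.tdl' taking over the name 'lexicon') is a guess that must be checked against the grammar's own loading definitions. The demonstration runs in a temporary directory and touches no real files.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Dict


def install_files(grammar: Path, converted: Path, mapping: Dict[str, str]) -> None:
    """For each new_name -> grammar_name pair, keep the existing grammar file
    under an 'old' prefix and copy the new file in under the expected name."""
    for new_name, grammar_name in mapping.items():
        target = grammar / grammar_name
        if target.exists():
            target.rename(grammar / ("old" + grammar_name))
        shutil.copy(converted / new_name, target)


# Demonstration with throwaway folders and placeholder file contents.
with tempfile.TemporaryDirectory() as tmp:
    grammar = Path(tmp) / "Grammar"
    converted = Path(tmp) / "Converted tdl-files"
    grammar.mkdir()
    converted.mkdir()
    (grammar / "lexicon").write_text("old test lexicon")
    (converted / "gaMetaLex.tdl").write_text("new Ga meta lexicon")

    # Assumed mapping for a metagrammar; adapt to the files actually loaded.
    install_files(grammar, converted, {"gaMetaLex.tdl": "lexicon"})
    print(sorted(p.name for p in grammar.iterdir()))  # ['lexicon', 'oldlexicon']
```

Keeping the previous files under the 'old' prefix makes it possible to return to the original test setup by reversing the renaming.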
Using the grammar
For Windows: When you open the folder Grammar, look for the item lkb, which is an executable file, and as such standing out in an emerald green color. If it appears just like a normal file, it has not yet undergone a last step of extraction, so then perform 'extract'. Now double-click on the lkb icon, and the program running the grammar opens, presenting you with the LKB Top interface.
For Linux: Download LKB from a site where it is offered. Grammar will behave like any other grammar folder that is used with LKB in Linux; starting LKB presents you with the LKB Top interface.
(From now on for both Windows and Linux:) When presented with the LKB Top interface, first select, under 'Load', the option 'Complete grammar'. A set of items now shows up, one of them 'lkb'; double-click it, and a new set of options is presented, among them 'script', on which you click. The grammar now loads, including the files mentioned above.
Among the functionalities offered at LKB Top, first, under 'Parse', select 'Parse input', where the default sentence showing up is The dog barks. If the grammar has loaded as it should, the parse command will yield a parse tree (or two or three). The dog barks is the standard test sentence in a family of training grammars built around Copestake (op. cit.), and Global Grammar is indeed built upon these training grammars, retaining at its core the system offered for analysis of simple English sentences, but then extended for the purposes described. The steps of this extension are described in Building Global Grammar.
In order to access the files of the grammar, go back to the folder Grammar, where you find the files mentioned in the section The grammar above. The files are in text format, and in order to edit the files use emacs or notepad.
References
Beermann, D. and Pavel Mihaylov. (2011). e-Research for Linguists. Proceedings of the 5th ACL-HLT Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities.
Beermann, D. and Mihaylov, P. (2013). Collaborative databasing and Resource sharing for Linguists. In: Language Resources and Evaluation. Springer.
Beermann, D. (2014). Data management and analysis for less documented languages. In Jones, M., and Connolly, C. (eds) Language Documentation and New Technology. Cambridge University Press.
Beermann, D. (2015). XLFG and the TypeCraft Database of Interlinear Glossed Text. Talk at PARGRAM meeting, Warsaw.
Bender, E. M., Drellishak, S., Fokkens, A., Poulson, L. and Saleem, S. (2010). Grammar Customization. In Research on Language & Computation, Volume 8, Number 1, 23-72.
Bender, E., Goodman, M.W., Crowgey, J., and Xia, F. (2013) Towards Creating Precision Grammars from Interlinear Glossed Text: Inferring Large-Scale Typological Properties. Proceedings of the 7th Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities, 74-83.
Copestake, A. (2002). Implementing Typed Feature Structure Grammars. CSLI Publications.
Hellan, L. (2008). Enumerating Verb Constructions Cross-linguistically. In Proceedings from COLING Workshop on Grammar Engineering Across frameworks. Manchester.
Hellan, L. and Dakubu, M.E.K. (2010). Identifying Verb Constructions Cross-linguistically. SLAVOB series 6.3, Univ. of Ghana (http://www.typecraft.org/w/images/d/db/1_Introlabels_SLAVOB-final.pdf).
Hellan, L. and D. Beermann (2011, 2014). Inducing grammars from IGT. Z. Vetulani and J. Mariani (eds.) Human Language Technologies as a Challenge for Computer Science and Linguistics. Springer.
Müller, St. (to appear). The CoreGram Project: Theoretical Linguistics, Theory Development and Verification. To appear 2015 in Journal of Language Modelling. http://hpsg.fu-berlin.de/~stefan/Pub/coregram.html
Pollard, C. and Sag, I. (1994). Head-Driven Phrase Structure Grammar. University of Chicago Press.
Ranta, A. (2011). Grammatical Framework: Programming with Multilingual Grammars, CSLI, Stanford.