Revision as of 16:51, 2 February 2015
TypeGram
Contributors: Lars Hellan, Tore Bruland, Dorothee Beermann (all NTNU)
TypeGram is an application for converting Interlinear Glossed Text (IGT) into a grammar
specification. The IGT comes from TypeCraft (cf. Beermann and Mihaylov 2014; see
http://typecraft.org/tc2wiki/Main_Page), and the grammar formalism is that of HPSG (cf.
Pollard and Sag 1994), using the LKB platform (cf. Copestake 2002). The application
converts information contained in the IGT of a set of sentences of a language L into
material suited for three of the files designed for the grammar of L, namely its content
word lexicon, its function word lexicon, and its file for inflectional rules. The
insertion of this material is incremental, and can be rerun for any new or enlarged
set of IGT available for L. In addition to the three files just mentioned, the
grammar comes with a ready-defined inventory of grammatical types and rules hypothesized
to accommodate structures from most types of languages of the world. These items together
constitute the object 'Grammar' at the download site. For properties of the Grammar and
the general architecture, see http://typecraft.org/tc2wiki/TypeGram.
The IGT from TypeCraft is downloaded as XML directly from the TypeCraft editor (see
below).
The conversion processor is called 'TypeGramUtil2', and is the item called 'TypeGram
Software' on the download site http://regdili.hf.ntnu.no:8081/typegramusers/menu. We now
describe its functionality (cf. Bruland 2011).
Install
We assume that java version "1.7.0_25" or higher is installed on your computer.
Copy the TypeGramUtil2 folder to a place on your hard disk.
In Linux:
select TypeGramUtil2.jar in the file browser and right-click;
select "properties", then "permissions" and check "Execute" (allow executing file as
program);
or
open a terminal application;
in the terminal: change directory to the folder which contains TypeGramUtil2.jar;
run: chmod u=rx TypeGramUtil2.jar.
Start application
In Windows:
double-click on the TypeGramUtil2.jar file (called 'start_win').
In Linux:
right-click on the TypeGramUtil2.jar file, Open with -> your java version
or
open a terminal application;
in the terminal: change directory to the folder which contains TypeGramUtil2.jar;
make the script runnable with: chmod u=rx run_unix;
start the run_unix bash script with: ./run_unix.
Use of the application
Starting the TypeGramUtil2.jar file opens a graphical interface.
The application reads a downloaded XML file from TypeCraft and creates/reads/updates
the LKB files:
prefix + Gloss.txt
prefix + MetaFuncWord.tdl
prefix + MetaInfl.tdl
prefix + MetaLex.tdl
The downloaded XML file from TypeCraft is called tc2_export.xml or tc2_export(num).xml
(a new number for each new download).
The downloaded file is stored in the folder:
In Windows: "My Documents/Downloads"
In Unix: Home/Downloads
We recommend moving the file to the folder data/export and renaming it.
The graphical interface has a button named "Input" that selects the downloaded XML
file. Click on the button, and you can specify exactly in which folder this file is
found.
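The recommendation above (pick up the most recent export from the download folder, move it to data/export and rename it) can be sketched in Python as follows. This is only an illustration and not part of TypeGramUtil2: the function names are hypothetical, and the pattern assumes that numbered downloads look like tc2_export(3).xml, possibly with a space before the parenthesis.

```python
import re
import shutil
from pathlib import Path

# Assumption: exports are named tc2_export.xml or tc2_export(N).xml
# (some browsers put a space before the parenthesis).
EXPORT_NAME = re.compile(r"^tc2_export(?:\s?\((\d+)\))?\.xml$")


def export_number(name):
    """Return the download number in a file name (0 if unnumbered), or None if no match."""
    match = EXPORT_NAME.match(name)
    if match is None:
        return None
    return int(match.group(1) or 0)


def latest_export(downloads):
    """Return the path of the highest-numbered export in `downloads`, or None."""
    candidates = [
        (export_number(path.name), path)
        for path in Path(downloads).iterdir()
        if path.is_file() and export_number(path.name) is not None
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda pair: pair[0])[1]


def move_export(downloads, target_dir, new_name):
    """Move the latest export into `target_dir` under `new_name` and return the new path."""
    source = latest_export(downloads)
    if source is None:
        raise FileNotFoundError("no tc2_export file found in %s" % downloads)
    target_dir = Path(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    destination = target_dir / new_name
    shutil.move(str(source), str(destination))
    return destination
```

A call such as move_export(Path.home() / "Downloads", "data/export", "ga_export.xml") would then leave the file where the "Input" button can select it.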
The graphical interface has a button named "Output" that selects the destination for
the LKB files. In the default case, it will be the same as specified for "Input".
The graphical interface has a text field named "File Prefix" that sets the prefix for
the LKB files.
For example, the prefix "nor" gives the following files (given a selection of
Norwegian IGT from TypeCraft):
norGloss.txt
norMetaFuncWord.tdl
norMetaInfl.tdl
norMetaLex.tdl
The button "Transfer" reads the XML file, reads the LKB files, updates the new entries,
and writes the result back to the LKB files.
An item is a duplicate if it is already stored in a file; duplicates are not
written to the LKB files.
After each transfer, a set of counters is updated for duplicate items and new items
for each file.
When all the "new" counters are zero, it means that no new information is found in the
XML file.
A message is displayed with the number of items written for each LKB file.
For example: "Numbers saved. lex: 248, infl: 21, funk: 48, gloss:58".
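The transfer behaviour described above (file names built from the prefix, duplicates skipped, counters for new and duplicate items) can be illustrated with a small Python sketch. It is not the TypeGramUtil2 implementation, which is a Java program: the function names are hypothetical, items are treated as plain strings, and the pairing of the counter labels lex, infl, funk and gloss with particular files is an assumption based on the message shown above.

```python
def output_file_names(prefix):
    """Build the LKB file names for a prefix, e.g. "nor" -> norGloss.txt and so on.

    Assumption: the labels are those of the "Numbers saved" message, and each
    label corresponds to the file listed next to it here.
    """
    return {
        "gloss": prefix + "Gloss.txt",
        "funk": prefix + "MetaFuncWord.tdl",
        "infl": prefix + "MetaInfl.tdl",
        "lex": prefix + "MetaLex.tdl",
    }


def merge_items(stored, incoming):
    """Append incoming items that are not stored yet.

    Returns (merged, new_count, duplicate_count). An item counts as a duplicate
    if it is already stored or occurred earlier in the same transfer.
    """
    seen = set(stored)
    merged = list(stored)
    new_count = 0
    duplicate_count = 0
    for item in incoming:
        if item in seen:
            duplicate_count += 1
        else:
            seen.add(item)
            merged.append(item)
            new_count += 1
    return merged, new_count, duplicate_count
```

Running merge_items a second time with the same incoming items gives a new_count of zero, which corresponds to the situation where no new information is found in the XML file.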
When the application is closed, the "Input", "Output" and "File Prefix" values are
saved in the TypeGram.ini file.
Next time you start the application you get the previous values for "Input", "Output"
and "File Prefix".
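Saving and restoring the three values can be pictured with the following Python sketch. The actual layout of TypeGram.ini is not documented here, so the section name and the key names below are purely hypothetical, as are the function names; the sketch only shows the general save-on-close, load-on-start pattern.

```python
import configparser

SECTION = "TypeGram"  # hypothetical section name, not taken from the real TypeGram.ini


def save_settings(ini_path, input_path, output_path, file_prefix):
    """Write the three interface values to an ini file (hypothetical key names)."""
    # interpolation=None keeps "%" characters in paths from being interpreted.
    config = configparser.ConfigParser(interpolation=None)
    config[SECTION] = {
        "input": str(input_path),
        "output": str(output_path),
        "file_prefix": file_prefix,
    }
    with open(ini_path, "w", encoding="utf-8") as handle:
        config.write(handle)


def load_settings(ini_path):
    """Return the stored values as a dict, or an empty dict if nothing was saved."""
    config = configparser.ConfigParser(interpolation=None)
    config.read(ini_path, encoding="utf-8")  # a missing file is silently ignored
    if not config.has_section(SECTION):
        return {}
    return dict(config[SECTION])
```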
PRACTICAL EXAMPLE
1. Make a folder which will be the habitat for a grammar of Ga, and name it 'GaG'.
2. Download 'Grammar.zip' from 'TypeGram for Users'. It gets downloaded to 'Downloads'.
Extract all the zipped parts, resulting in a folder 'Grammar'. Move this folder into GaG.
3. Download 'Example TypeCraft XML ga_export.xml' from 'TypeGram for Users'. It also
gets downloaded to 'Downloads', and it can be moved directly into GaG. (This is for
demonstration - in the normal case you will bring this xml export directly from TypeCraft
to this folder.)
4. Download 'TypeGram Software Java Jar file for Unix and Windows typegram.zip' from
'TypeGram for Users'. As in step 2, unzip it inside 'Downloads', and move the
unzipped folder 'typegram' to GaG.
5. Open 'typegram', and you find three items inside. Click on 'start_win', and the
graphical interface (GI) opens.
6. The line to the right of the button 'Input' describes the item to be used as input.
The path corresponds to the folder chosen, but the item is only a placeholder name. Click
on "Input", and select through the folder system the item 'ga_export.xml'.
7. We now want to define the place where the created lkb-files will end up. We could let
this be 'Grammar', since this is where they will be used, but as an intermediate step, we
create a folder 'Converted tdl-files', from which we can make selections of files in
subsequent steps. This folder is now selected as output area, under "Output".
8. In the line to the right of 'File Prefix', replace 'typegram' with 'ga'.
9. Click "Transfer". The folder 'Converted tdl-files' now contains 7 files: 'gaFuncWord',
'gaInfl' and 'gaLex' represent the relevant 'object language' items, whereas
'gaMetaFuncWord', 'gaMetaInfl' and 'gaMetaLex' represent the gloss words and gloss tags
of the IGTs. The file 'gaGloss' contains the full gloss specification of each sentence,
assembled as word-size units in a string, as in "1SGPOSSface AORblack 1SG" ('my face
blackens me' = 'I get angry'); the items of the string thereby match the items in the files
'gaMetaFuncWord', 'gaMetaInfl' and 'gaMetaLex'. The GI provides counts for each file, as
explained above.
(The 'meta-level' grammar is described in Hellan 2010, Hellan and Beermann 2014.)
10. Depending on whether you want to create an 'object' grammar of Ga or a 'metagrammar',
you import either 'gaFuncWord', 'gaInfl' and 'gaLex', or 'gaMetaFuncWord', 'gaMetaInfl'
and 'gaMetaLex', into 'Grammar'. For one caveat, see the next point.
11. Since 'Grammar' is already defined, it contains test lexicon and inflection files which
match a test suite called 'test', which is a set of meta-strings based on Norwegian and
Ga. Before running with the new files, the old ones have to be given a prefix such as
'old', with the new files taking over the names used for the previous ones (so that the
grammar's loading definitions apply as normal).
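The renaming in the last step can also be done with a short script. The following Python sketch assumes you know which file names the grammar's loading definitions expect; these names are not given here, so the mapping passed in (and the function name swap_in) is hypothetical and must be filled in by inspecting the 'Grammar' folder.

```python
import shutil
from pathlib import Path


def swap_in(grammar_dir, converted_dir, mapping, old_prefix="old"):
    """Give the current grammar files an 'old' prefix and copy the new files in.

    `mapping` maps a file name in `converted_dir` to the file name that the
    grammar loads (hypothetical names, to be supplied by the user).
    """
    grammar_dir = Path(grammar_dir)
    converted_dir = Path(converted_dir)
    for new_name, loaded_name in mapping.items():
        current = grammar_dir / loaded_name
        backup = grammar_dir / (old_prefix + loaded_name)
        if current.exists():
            if backup.exists():
                # Refuse to overwrite an earlier backup.
                raise FileExistsError("backup already exists: %s" % backup)
            current.rename(backup)
        # The new file takes over the name used by the previous one.
        shutil.copyfile(str(converted_dir / new_name), str(current))
```

For a metagrammar, a call could look like swap_in("GaG/Grammar", "GaG/Converted tdl-files", {"gaMetaLex.tdl": "<name the grammar loads>"}), with the target name replaced by the actual file name found in 'Grammar'.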