Difference between revisions of "India2011- Digital Linguistics"
(→The Architecture and Processing of Brahmi-Derived Scripts) |
|||
(119 intermediate revisions by 3 users not shown) | |||
Line 1: | Line 1: | ||
− | + | From 1 to 9 October 2011, NTNU will arrange a week-long event, called '''India 2011''', with India as the theme. The focus will be on broad cooperation | |
− | in culture, research, higher education and business. [[File:India2011.gif| | + | in culture, research, higher education and business. [[File:India2011.gif|thumb|left]] |
− | + | The present arrangement between the University of Hyderabad, and the Department of Language and | |
+ | Communication Studies and the Department of Modern Languages at NTNU, has Indian languages as its focus. | ||
− | + | The arrangement of several talks and workshops, announced here, is part of NTNU's India week. | |
− | |||
<blockquote><span style="color:green">'''India is a continent of many languages. Ethnologue <ref> Lewis, M. Paul (ed.), 2009. Ethnologue: Languages of the World, Sixteenth edition. Dallas, Tex.: SIL International. Online version: http://www.ethnologue.com/.</ref> refers to 452 listed languages of India. The nation is not only rich in languages. Grounded on work dating back to Pāṇini, Indian linguistics has had a significant influence on the development of linguistics as we know it today.'''</span></blockquote> | <blockquote><span style="color:green">'''India is a continent of many languages. Ethnologue <ref> Lewis, M. Paul (ed.), 2009. Ethnologue: Languages of the World, Sixteenth edition. Dallas, Tex.: SIL International. Online version: http://www.ethnologue.com/.</ref> refers to 452 listed languages of India. The nation is not only rich in languages. Grounded on work dating back to Pāṇini, Indian linguistics has had a significant influence on the development of linguistics as we know it today.'''</span></blockquote> | ||
+ | |||
+ | |||
==Digital Language Description, Knowledge Representation and Formal Linguistics for Indic Languages== | ==Digital Language Description, Knowledge Representation and Formal Linguistics for Indic Languages== | ||
− | |||
− | |||
− | |||
+ | '''October 1 - 9, 2011''' | ||
+ | In a workshop on Digital Language Description, Knowledge Representation and Formal Linguistics, linguists from Hyderabad and Trondheim will work together on the representation and formalisation of some of the salient aspects of selected languages from the Dravidian, the Indo-Aryan, the Tibeto-Burman and the Austro-Asiatic language families of India. | ||
+ | [[File:LanguagesIndia.jpeg|right]] | ||
+ | The workshop will take place in a digital communication environment. A group of linguists will work on qualitative language description and linguistic formalisation of Indian languages. Keynote talks addressing central issues in the digitisation and formalisation of Indic languages will be combined with group sessions dedicated to the documentation and formalisation of central Indic construction types. Legacy data will be digitised and enriched by further layers of annotation. Results of the workshop will be made accessible online using software developed at NTNU. | ||
+ | |||
+ | The arrangement situates modern approaches to language description and documentation in the setting in which the linguistic sciences arose, namely the tradition of formal description of Sanskrit, which dates back nearly 3000 years. Vibrant communities in Hyderabad and Trondheim will jointly develop and refine methods of digitised formal language research, with staff and students from both universities informing each other on formal, computational and empirical issues. Where the Sanskrit grammarian Pāṇini devised the first systematic symbolic approach to language description, the present arrangement focuses on symbolic approaches in the context of current technologies and formal frameworks. | ||
Line 24: | Line 29: | ||
Several Keynote talks will address central issues in the digitisation and formalisation of Indic languages | Several Keynote talks will address central issues in the digitisation and formalisation of Indic languages | ||
− | === | + | ====The Architecture and Processing of Brahmi-Derived Scripts==== |
− | + | ||
+ | [[Image:Gautam.Sengupta.jpg|thumb|150px|right|Professor Sengupta ]] | ||
+ | '''Professor Gautam Sengupta, University of Hyderabad''' | ||
+ | The earliest material evidence of writing in India appears in the | ||
+ | Ashokan inscriptions at Girnar, dating back to the 3rd century B.C. | ||
+ | All the major writing systems of India, with the sole exception of | ||
+ | Urdu, derive from this early Ashokan script known as Brāhmī. The | ||
+ | Brāhmī-derived scripts are often called alpha-syllabaries on account | ||
+ | of the fact that they are based upon the notion of orthographic | ||
+ | syllable or akṣara. This talk will be about the basic architecture of | ||
+ | the Brahmic scripts of India and how they are processed in reading. | ||
+ | ====A unitary system for formal multilingual classification and a digital platform for cross-level representation==== | ||
+ | [[Image:LarsByM.jpg|thumb|150px|left|Lars Hellan|Professor Hellan]] | ||
+ | This talk first shows the feasibility of designing a cross-linguistically valid system of syntactic-semantic representation, emphasizing both the content of the categories used and the function of a grammar formalism as ''representing'' structure and not just processing structure. The talk then shows a strategy for how standard formats of sentence annotation (such as Interlinear Glossing) can be made to communicate with a level of representation satisfying the above desiderata. | ||
+ | ===The Syntax and Semantics of Non-nominative Subject in South Asian languages=== | ||
+ | [[Image:BookSubbaro.png|thumb|right|200px]] | ||
+ | '''Professor K.V. Subbarao''' | ||
+ | I discuss the nature of case marking (lexical/inherent vs. structural), the choice of case on the subject and object in non-nominative subject (hereafter, NNS) constructions, general trends in South Asian languages (SALs), and the variation by genetic affiliation or subregion. I provide a brief description of NNSs in SALs first, keeping in view the notion of subject. I shall then discuss some subject properties of NNSs. I argue that (i) the predicate in a dative subject construction (DSC) is [-transitive] and unaccusative; (ii) all NNSs except the ergative are inherently case-marked; (iii) such inherent case is assigned not by an intransitive verb alone but compositionally by the whole predicate, consisting of a theme or an adjective along with the [-transitive] verb; and (iv) information concerning agreement should be available vP-internally (in the lower thematic S) for proper assignment of inherent case to the NNS. I shall show that the accusative/dative case marking of the theme in dative/genitive subject constructions in Bangla, Tamil and Malayalam does not count as counter-evidence to treating the predicate in NNS constructions as [-transitive]. | ||
+ | <br> | ||
+ | <br> | ||
+ | <br> | ||
+ | ==Workshop== | ||
+ | The workshop will be introduced by a talk on | ||
+ | === '''Collaborative corpus creation - qualitative and quantitative linguistic methods'''=== | ||
− | + | by | |
− | + | ||
− | + | ||
− | + | [[User:Dorothee Beermann|Associate Professor Dorothee Beermann, NTNU]]. | |
− | In this workshop we will explore the possibilities that e-Research offers for linguists working on Indic languages. In my talk I will discuss the | |
− | possibilities that open access to scientific data offers for linguists working in the Humanities. Work with data, from its creation to | |
+ | its integration into a publication, is often perceived as a chore. Given the right tools, however, it can become a meaningful part of | ||
+ | the linguistic investigation. The standard format for linguistic data in the Humanities is Interlinear Glossed Text. Data in this format represent a | ||
+ | valuable resource, even though linguists tend to disagree about the role of data and the methods by which data should influence linguistic exploration. | ||
+ | In describing the components of the TypeCraft system, we focus in this talk on the potential that an online linguistic data management system offers for the description and documentation of Indic languages, real-time data sharing, and the continuous dissemination of research results. | ||
− | + | <br> | |
+ | ==WORKSHOP PROGRAM== | ||
+ | <font size="1" face="Verdana"> The Workshop is supported by [http://www.ntnu.no/india2011 India 2011] </font> | ||
+ | {| border="1" cellpadding="2" | ||
+ | |-valign="top" | ||
+ | |width="5%"|'''October 3rd-7th''' | ||
+ | |width="20%"|'''Monday''' | ||
+ | |width="20%"|'''Tuesday''' | ||
+ | |width="20%"|'''Wednesday''' | ||
+ | |width="20%"|'''Thursday''' | ||
+ | |width="20%"|'''Friday''' | ||
+ | |-valign="top" | ||
+ | | | ||
+ | Meals during | ||
+ | the day are | ||
+ | provided at | ||
+ | university facilities | ||
+ | | | ||
+ | 9.15-10.30 '''Language Documentation''' | ||
+ | ''Dorothee Beermann'' | ||
+ | (LinLab (Building 4, Dragvoll)) | ||
+ | <span style="color:orange"> Tea </span> | ||
+ | 10:45-12:00 ''' Keynote ''' | ||
+ | |||
+ | ''Gautam Sengupta'' | ||
+ | |||
+ | (LinLab) | ||
+ | |||
+ | |||
+ | <span style="color:orange"> Lunch </span> | ||
+ | |||
+ | |||
+ | |||
+ | 13.15-14.30 Hands-on: | ||
+ | |||
+ | '''Introduction to TypeCraft''' | ||
+ | |||
+ | (LinLab and CompLab) | ||
+ | |||
+ | |||
+ | <span style="color:orange"> Afternoon Tea </span> | ||
+ | |||
+ | |||
+ | 15:00-16:45 Hands-on: | ||
+ | |||
+ | '''Creation of research corpora, | ||
+ | verb annotation and discussion''' | ||
+ | |||
+ | (LinLab and CompLab) | ||
+ | |||
+ | |||
+ | |||
+ | | | ||
+ | 9.15-10.30 '''Keynote''' | ||
+ | |||
+ | ''K.V. Subbarao'' | ||
+ | |||
+ | (LinLab) | ||
+ | |||
+ | |||
+ | <span style="color:orange"> Tea </span> | ||
+ | |||
+ | |||
+ | 10.45-12 Discussion | ||
+ | |||
+ | '''Indian construction types''' | ||
+ | |||
+ | (LinLab) | ||
+ | |||
+ | |||
+ | <span style="color:orange"> Lunch </span> | ||
+ | |||
+ | |||
+ | |||
+ | 13:30-14:30 | ||
+ | |||
+ | Hands-on: | ||
+ | |||
+ | '''Classifying and annotating construction types''' | ||
+ | |||
+ | (CompLab and offices) | ||
+ | |||
+ | |||
+ | <span style="color:orange"> Afternoon Tea </span> | ||
+ | |||
+ | |||
+ | 15:00 - 16:45 | ||
+ | |||
+ | Second afternoon session | ||
+ | |||
+ | | | ||
+ | |||
+ | 9:15-10:30 '''Keynote''' | ||
+ | |||
+ | ''Lars Hellan'' | ||
+ | |||
+ | (LinLab) | ||
+ | |||
+ | |||
+ | <span style="color:orange"> Tea </span> | ||
+ | |||
+ | |||
+ | 10:45-12:00 '''Discussion AVMs''' | ||
+ | |||
+ | (LinLab) | ||
+ | |||
+ | |||
+ | <span style="color:orange"> Lunch </span> | ||
+ | |||
+ | |||
+ | |||
+ | 13:15-14:30 | ||
+ | |||
+ | Hands-on: | ||
+ | |||
+ | '''AVM construction, construction labeling, TC annotation''' | ||
+ | |||
+ | (CompLab and offices) | ||
+ | |||
+ | |||
+ | <span style="color:orange"> Afternoon Tea </span> | ||
+ | |||
+ | |||
+ | 15:00 - 16:45 | ||
+ | |||
+ | Second afternoon session | ||
+ | | | ||
+ | |||
+ | All day: | ||
+ | |||
+ | Hands-on: | ||
+ | |||
+ | '''AVM construction, construction labeling, TC annotation''' | ||
+ | |||
+ | (CompLab and offices) | ||
+ | |||
+ | |||
+ | <span style="color:orange"> with meals as usual </span> | ||
+ | |||
+ | |||
+ | |||
+ | '''EVENING SESSION''' | ||
+ | |||
+ | |||
+ | |||
+ | '''Public talk at ''Dokkhuset''''': | ||
+ | |||
+ | 19:00-19:30: | ||
+ | |||
+ | ''Gautam Sengupta'': | ||
+ | |||
+ | The Aksara-Based Script Systems of India | ||
+ | |||
+ | | | ||
+ | All day at Gløshaugen Campus, | ||
+ | |||
+ | The IT Building, room 054: | ||
+ | |||
+ | '''Interconnecting Digital Linguistics and NLP''' | ||
+ | |||
+ | Contributors: | ||
+ | |||
+ | ''Tore Bruland, Anil Kumar Singh'' | ||
+ | |||
+ | |||
+ | <span style="color:orange"> Lunch at Gløshaugen </span> | ||
+ | |||
+ | |||
+ | |||
+ | |||
+ | Discussion of digital tools and sustainability of | ||
+ | |||
+ | distributive research using community platforms. | ||
+ | |||
+ | |||
+ | Note: | ||
+ | |||
+ | [http://www.cicling.org/2012 COLING 2012] | ||
+ | |||
+ | |||
+ | |- | ||
+ | |} | ||
+ | |||
+ | <br> | ||
+ | |||
+ | |||
+ | |||
+ | '''The workshop features two sections:''' | ||
+ | |||
+ | ===Multilingual text processing, interlinear annotation and formalisation of Indic languages=== | ||
+ | |||
+ | Using natural language processing tools and linguistic web technology developed at the University of Hyderabad and at NTNU, we will create small research corpora, which we will annotate for salient linguistic properties with the goal of deriving Attribute Value Matrix notations from these annotations. | ||
+ | |||
+ | |||
+ | {| class="wikitable" style="margin: 1em auto 1em auto" | ||
+ | |+ '''List of Workshop Languages''' | ||
+ | ! Language name || Language Family || Script | ||
+ | |- | ||
+ | | Banglā (Bengali)|| Indo-Aryan|| Banglā | ||
+ | |- | ||
+ | | Hindi || Indo-Aryan || Devanāgarī | ||
+ | |- | ||
+ | | Punjabi||Indo-Aryan||Gurmukhi | ||
+ | |- | ||
+ | |Malayālam||Dravidian||Malayāḷalipi | ||
+ | |- | ||
+ | |Khasi||Austro-Asiatic||Roman | ||
+ | |- | ||
+ | |Angami||Tibeto-Burman||Roman | ||
+ | |} | ||
+ | |||
+ | ===Grammatical construction types across Indian languages=== | ||
+ | |||
+ | Using methods of formal linguistic representation such as 'attribute value matrices' (AVMs), a systematic comparison of representatives of each of the major language families spoken in India will be conducted, focusing on a limited set of sentential construction types. The languages and their families are listed above. | ||
==References== | ==References== | ||
+ | [[Media:Paul.pdf|Soma Paul Tross paper]] | ||
+ | |||
<references/> | <references/> | ||
+ | |||
Latest revision as of 12:03, 20 October 2011
From 1 to 9 October 2011, NTNU will arrange a week-long event, called India 2011, with India as the theme. The focus will be on broad cooperation
in culture, research, higher education and business. The present arrangement between the University of Hyderabad and the Department of Language and Communication Studies and the Department of Modern Languages at NTNU has Indian languages as its focus.
The arrangement of several talks and workshops, announced here, is part of NTNU's India week.
India is a continent of many languages. Ethnologue [1] refers to 452 listed languages of India. The nation is not only rich in languages. Grounded on work dating back to Pāṇini, Indian linguistics has had a significant influence on the development of linguistics as we know it today.
Digital Language Description, Knowledge Representation and Formal Linguistics for Indic Languages
October 1 - 9, 2011
In a workshop on Digital Language Description, Knowledge Representation and Formal Linguistics, linguists from Hyderabad and Trondheim will work together on the representation and formalisation of some of the salient aspects of selected languages from the Dravidian, the Indo-Aryan, the Tibeto-Burman and the Austro-Asiatic language families of India.
The workshop will take place in a digital communication environment. A group of linguists will work on qualitative language description and linguistic formalisation of Indian languages. Keynote talks addressing central issues in the digitisation and formalisation of Indic languages will be combined with group sessions dedicated to the documentation and formalisation of central Indic construction types. Legacy data will be digitised and enriched by further layers of annotation. Results of the workshop will be made accessible online using software developed at NTNU.
The arrangement situates modern approaches to language description and documentation in the setting in which the linguistic sciences arose, namely the tradition of formal description of Sanskrit, which dates back nearly 3000 years. Vibrant communities in Hyderabad and Trondheim will jointly develop and refine methods of digitised formal language research, with staff and students from both universities informing each other on formal, computational and empirical issues. Where the Sanskrit grammarian Pāṇini devised the first systematic symbolic approach to language description, the present arrangement focuses on symbolic approaches in the context of current technologies and formal frameworks.
Keynote Talks
Several keynote talks will address central issues in the digitisation and formalisation of Indic languages.
The Architecture and Processing of Brahmi-Derived Scripts
Professor Gautam Sengupta, University of Hyderabad
The earliest material evidence of writing in India appears in the Ashokan inscriptions at Girnar, dating back to the 3rd century B.C. All the major writing systems of India, with the sole exception of Urdu, derive from this early Ashokan script known as Brāhmī. The Brāhmī-derived scripts are often called alpha-syllabaries on account of the fact that they are based upon the notion of orthographic syllable or akṣara. This talk will be about the basic architecture of the Brahmic scripts of India and how they are processed in reading.
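To make the notion of the akṣara concrete, here is a minimal Python sketch that splits Devanagari text into orthographic syllables with a regular expression. It is an illustration under simplifying assumptions, not the analysis presented in the talk: it covers only the basic Devanagari block, it treats an akṣara as either a consonant cluster (consonants joined by virāma) with at most one vowel sign or an independent vowel, and the function name `split_aksaras` is hypothetical.

```python
import re

# Character classes for the basic Devanagari block (a simplifying assumption).
CONSONANT = "[\u0915-\u0939\u0958-\u095F]\u093C?"  # consonant, optional nukta
VIRAMA = "\u094D"                                  # vowel-killer sign
VOWEL_SIGN = "[\u093E-\u094C]"                     # dependent vowel signs
INDEPENDENT_VOWEL = "[\u0904-\u0914]"              # vowels written on their own
MODIFIER = "[\u0900-\u0903]"                       # candrabindu, anusvara, visarga

# An aksara is either:
#   (consonant + virama)* consonant [vowel sign | final virama]
# or an independent vowel, in both cases followed by optional modifiers.
AKSARA = re.compile(
    f"(?:(?:{CONSONANT}{VIRAMA})*{CONSONANT}(?:{VOWEL_SIGN}|{VIRAMA})?"
    f"|{INDEPENDENT_VOWEL}){MODIFIER}*"
)


def split_aksaras(text):
    """Return the aksaras found in text; other characters are skipped."""
    return AKSARA.findall(text)


# The word "Hindi" in Devanagari: ha + i-sign, then na + virama + da + ii-sign.
hindi = "\u0939\u093F\u0928\u094D\u0926\u0940"
print(split_aksaras(hindi))
```

For the word above, the sketch yields two units: the first consonant with its vowel sign, and the conjunct cluster with its vowel sign. Other Brāhmī-derived scripts would need their own code point ranges, and a production segmenter would more likely rely on Unicode grapheme cluster rules than on a hand-written pattern.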
A unitary system for formal multilingual classification and a digital platform for cross-level representation
Professor Lars Hellan
This talk first shows the feasibility of designing a cross-linguistically valid system of syntactic-semantic representation, emphasizing both the content of the categories used and the function of a grammar formalism as representing structure and not just processing structure. The talk then shows a strategy for how standard formats of sentence annotation (such as Interlinear Glossing) can be made to communicate with a level of representation satisfying the above desiderata.
The Syntax and Semantics of Non-nominative Subject in South Asian languages
Professor K.V. Subbarao
I discuss the nature of case marking (lexical/inherent vs. structural), the choice of case on the subject and object in non-nominative subject (hereafter, NNS) constructions, general trends in South Asian languages (SALs), and the variation by genetic affiliation or subregion. I provide a brief description of NNSs in SALs first, keeping in view the notion of subject. I shall then discuss some subject properties of NNSs. I argue that (i) the predicate in a dative subject construction (DSC) is [-transitive] and unaccusative; (ii) all NNSs except the ergative are inherently case-marked; (iii) such inherent case is assigned not by an intransitive verb alone but compositionally by the whole predicate, consisting of a theme or an adjective along with the [-transitive] verb; and (iv) information concerning agreement should be available vP-internally (in the lower thematic S) for proper assignment of inherent case to the NNS. I shall show that the accusative/dative case marking of the theme in dative/genitive subject constructions in Bangla, Tamil and Malayalam does not count as counter-evidence to treating the predicate in NNS constructions as [-transitive].
Workshop
The workshop will be introduced by a talk on
Collaborative corpus creation - qualitative and quantitative linguistic methods
by
Associate Professor Dorothee Beermann, NTNU.
In this workshop we will explore the possibilities that e-Research offers for linguists working on Indic languages. In my talk I will discuss the possibilities that open access to scientific data offers for linguists working in the Humanities. Work with data, from its creation to its integration into a publication, is often perceived as a chore. Given the right tools, however, it can become a meaningful part of the linguistic investigation. The standard format for linguistic data in the Humanities is Interlinear Glossed Text. Data in this format represent a valuable resource, even though linguists tend to disagree about the role of data and the methods by which data should influence linguistic exploration. In describing the components of the TypeCraft system, we focus in this talk on the potential that an online linguistic data management system offers for the description and documentation of Indic languages, real-time data sharing, and the continuous dissemination of research results.
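As a rough illustration of what Interlinear Glossed Text looks like as structured data, the following Python sketch stores a phrase as words with morpheme-aligned glosses and a free translation. It is not TypeCraft's data model: the class names `Word` and `GlossedPhrase`, the hyphen convention for morpheme breaks, and the transliterated Hindi example are all assumptions made for this sketch.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Word:
    form: str   # surface form, morphemes separated by "-"
    gloss: str  # one gloss per morpheme, also separated by "-"


@dataclass
class GlossedPhrase:
    words: List[Word]
    translation: str

    def misaligned(self) -> List[Word]:
        """Return words whose form and gloss differ in morpheme count."""
        return [w for w in self.words
                if w.form.count("-") != w.gloss.count("-")]

    def render(self) -> str:
        """Lay out form, gloss and translation as three lines."""
        widths = [max(len(w.form), len(w.gloss)) for w in self.words]
        forms = "  ".join(w.form.ljust(n) for w, n in zip(self.words, widths))
        glosses = "  ".join(w.gloss.ljust(n) for w, n in zip(self.words, widths))
        return "\n".join([forms.rstrip(), glosses.rstrip(),
                          f"'{self.translation}'"])


# Illustrative Hindi sentence in transliteration (an assumed example).
example = GlossedPhrase(
    words=[
        Word("laṛk-ā", "boy-M.SG"),
        Word("kitāb", "book"),
        Word("paṛh-t-ā", "read-IPFV-M.SG"),
        Word("hai", "be.PRS.3SG"),
    ],
    translation="The boy reads a book.",
)

print(example.render())
print(example.misaligned())
```

Keeping form and gloss as parallel, separately addressable tiers is what allows simple consistency checks such as `misaligned` and makes it possible to add further annotation layers later without touching the text itself.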
WORKSHOP PROGRAM
The Workshop is supported by India 2011
October 3rd-7th | Monday | Tuesday | Wednesday | Thursday | Friday |
---|---|---|---|---|---|
Meals during the day are provided at university facilities | 9:15-10:30 Language Documentation, Dorothee Beermann (LinLab, Building 4, Dragvoll) | 9:15-10:30 Keynote, K.V. Subbarao (LinLab) | 9:15-10:30 Keynote, Lars Hellan (LinLab) | All day: Hands-on: AVM construction, construction labeling, TC annotation (CompLab and offices), with meals as usual | All day at Gløshaugen Campus, The IT Building, room 054: Interconnecting Digital Linguistics and NLP. Contributors: Tore Bruland, Anil Kumar Singh |
 | Tea | Tea | Tea | | |
 | 10:45-12:00 Keynote, Gautam Sengupta (LinLab) | 10:45-12:00 Discussion: Indian construction types (LinLab) | 10:45-12:00 Discussion AVMs (LinLab) | | |
 | Lunch | Lunch | Lunch | | Lunch at Gløshaugen |
 | 13:15-14:30 Hands-on: Introduction to TypeCraft (LinLab and CompLab) | 13:30-14:30 Hands-on: Classifying and annotating construction types (CompLab and offices) | 13:15-14:30 Hands-on: AVM construction, construction labeling, TC annotation (CompLab and offices) | | Discussion of digital tools and sustainability of distributive research using community platforms |
 | Afternoon Tea | Afternoon Tea | Afternoon Tea | | |
 | 15:00-16:45 Hands-on: Creation of research corpora, verb annotation and discussion (LinLab and CompLab) | 15:00-16:45 Second afternoon session | 15:00-16:45 Second afternoon session | Evening session: public talk at Dokkhuset, 19:00-19:30, Gautam Sengupta: The Aksara-Based Script Systems of India | Note: COLING 2012 |
The workshop features two sections:
Multilingual text processing, interlinear annotation and formalisation of Indic languages
Using natural language processing tools and linguistic web technology developed at the University of Hyderabad and at NTNU, we will create small research corpora, which we will annotate for salient linguistic properties with the goal of deriving Attribute Value Matrix notations from these annotations.
Language name | Language Family | Script |
---|---|---|
Banglā (Bengali) | Indo-Aryan | Banglā |
Hindi | Indo-Aryan | Devanāgarī |
Punjabi | Indo-Aryan | Gurmukhi |
Malayālam | Dravidian | Malayāḷalipi |
Khasi | Austro-Asiatic | Roman |
Angami | Tibeto-Burman | Roman |
Grammatical construction types across Indian languages
Using methods of formal linguistic representation such as 'attribute value matrices' (AVMs), a systematic comparison of representatives of each of the major language families spoken in India will be conducted, focusing on a limited set of sentential construction types. The languages and their families are listed above.
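For readers unfamiliar with AVMs, the sketch below shows one common way to model them in Python: as nested dictionaries that are combined by unification. It is a simplified illustration rather than the formalism used in the workshop: it has no typed features and no structure sharing (reentrancy), and the feature names `SUBJ`, `CASE`, `PERS`, `NUM` and `TRANS` are assumptions chosen for the example.

```python
def unify(a, b):
    """Return the unification of two AVMs, or None if their values clash.

    AVMs are modelled as nested dicts; atomic values unify only if equal.
    The inputs are not modified.
    """
    if isinstance(a, dict) and isinstance(b, dict):
        result = dict(a)  # shallow copy so that `a` stays untouched
        for feature, value in b.items():
            if feature in result:
                merged = unify(result[feature], value)
                if merged is None:
                    return None  # a clash anywhere makes the whole AVM fail
                result[feature] = merged
            else:
                result[feature] = value
        return result
    return a if a == b else None


# A construction that requires a dative subject and an intransitive predicate.
dative_subject = {"SUBJ": {"CASE": "dat"}, "PRED": {"TRANS": False}}

# Compatible information: person and number of the subject.
first_singular = {"SUBJ": {"PERS": 1, "NUM": "sg"}}

# Incompatible information: a nominative subject.
nominative_subject = {"SUBJ": {"CASE": "nom"}}

print(unify(dative_subject, first_singular))
print(unify(dative_subject, nominative_subject))
```

The first call merges the two descriptions into a single, more specific AVM; the second returns `None` because the two values for `CASE` cannot both hold. Comparing construction types across languages can then be framed as asking which partial descriptions are shared and which clash.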
References
1. Lewis, M. Paul (ed.), 2009. Ethnologue: Languages of the World, Sixteenth edition. Dallas, Tex.: SIL International. Online version: http://www.ethnologue.com/.