Treebanks: Building and Using Parsed Corpora
Ann Taylor, Mitchell Marcus, Beatrice Santorini (auth.), Anne Abeillé (ed.)
Linguists and engineers in Natural Language Processing make increasing use of electronic corpora. Most research has long been limited to raw (unannotated) texts or to tagged texts (annotated with parts of speech only), but these approaches suffer from a word-by-word perspective. A newer line of research involves corpora with richer annotations, such as clauses and major constituents, grammatical functions, and dependency links. The first parsed corpora were built for English: the Lancaster treebank and the Penn Treebank. New ones have recently been developed for other languages.
This book:
provides a state-of-the-art overview of work being done with parsed corpora;
gathers 21 papers on building and using parsed corpora, raising many relevant questions;
deals with a variety of languages and a variety of corpora;
is intended for those working in linguistics, computational linguistics, natural language processing, syntax, and grammar.
Treebanks

Text, Speech and Language Technology, Volume 20

Series Editors: Nancy Ide, Vassar College, New York; Jean Véronis, Université de Provence and CNRS, France

Editorial Board: Harald Baayen, Max Planck Institute for Psycholinguistics, The Netherlands; Kenneth W. Church, AT&T Bell Labs, New Jersey, USA; Judith Klavans, Columbia University, New York, USA; David T. Barnard, University of Regina, Canada; Dan Tufis, Romanian Academy of Sciences, Romania; Joaquim Llisterri, Universitat Autònoma de Barcelona, Spain; Stig Johansson, University of Oslo, Norway; Joseph Mariani, LIMSI-CNRS, France

The titles published in this series are listed at the end of this volume.

Treebanks: Building and Using Parsed Corpora
Edited by Anne Abeillé, Université Paris 7, Paris, France
Springer Science+Business Media, LLC

A C.I.P. Catalogue record for this book is available from the Library of Congress.
ISBN 978-1-4020-1335-5; ISBN 978-94-010-0201-1 (eBook); DOI 10.1007/978-94-010-0201-1
Printed on acid-free paper. All rights reserved. © 2003 Springer Science+Business Media New York. Originally published by Kluwer Academic Publishers 2003. Softcover reprint of the hardcover 1st edition 2003. No part of this work may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, microfilming, recording or otherwise, without written permission from the Publisher, with the exception of any material supplied specifically for the purpose of being entered and executed on a computer system, for exclusive use by the purchaser of the work.

Contents

Preface
Introduction (Anne Abeillé): 1. Building treebanks; 2. Using treebanks

Part I. Building Treebanks

English treebanks
1. The Penn Treebank: An Overview (Ann Taylor, Mitchell Marcus, Beatrice Santorini)
2. Thoughts on Two Decades of Drawing Trees (Geoffrey Sampson)
3. Bank of English and Beyond (Timo Järvinen)
4. Completing Parsed Corpora: From Correction to Evolution (Sean Wallis)

German treebanks
5. Syntactic Annotation of a German Newspaper Corpus (Thorsten Brants, Wojciech Skut, Hans Uszkoreit)
6. Annotation of Error Types for a German Newsgroup Corpus (Markus Becker, Andrew Bredenkamp, Berthold Crysmann, Judith Klein)

Slavic treebanks
7. The PDT: A 3-Level Annotation Scenario (Alena Böhmová, Jan Hajič, Eva Hajičová, Barbora Hladká)
8. An HPSG-Annotated Test Suite for Polish (Małgorzata Marciniak, Agnieszka Mykowiecka, Adam Przepiórkowski, Anna Kupść)

Treebanks for Romance languages
9. Developing a Spanish Treebank (Antonio Moreno, Susana López, Fernando Sánchez, Ralph Grishman)
10. Building a Treebank for French (Anne Abeillé, Lionel Clément, François Toussenel)
11. Building the Italian Syntactic-Semantic Treebank (Simonetta Montemagni, Francesco Barsotti, Marco Battista, Nicoletta Calzolari, Ornella Corazzari, Alessandro Lenci, Antonio Zampolli, Francesca Fanciulli, Maria Massetani, Remo Raffaelli, Roberto Basili, Maria Teresa Pazienza, Dario Saracino, Fabio Zanzotto, Nadia Mana, Fabio Pianesi, Rodolfo Delmonte)
12. Automated Creation of a Medieval Portuguese Treebank (Vítor Rocio, Mário Amado Alves, J. Gabriel Lopes, Maria Francisca Xavier, Graça Vicente)

Treebanks for other languages
13. Sinica Treebank (Keh-Jiann Chen, Chi-Ching Luo, Ming-Chung Chang, Feng-Yi Chen, Chao-Jan Chen, Chu-Ren Huang, Zhao-Ming Gao)
14. Building a Japanese Parsed Corpus (Sadao Kurohashi, Makoto Nagao)
15. Building a Turkish Treebank (Kemal Oflazer, Bilge Say, Dilek Zeynep Hakkani-Tür, Gökhan Tür)
Part II. Using Treebanks

16. Encoding Syntactic Annotation (Nancy Ide, Laurent Romary)

Evaluation with treebanks
17. Parser Evaluation (John Carroll, Guido Minnen, Ted Briscoe)
18. Dependency-Based Evaluation of MINIPAR (Dekang Lin)

Grammar induction with treebanks
19. Extracting Stochastic Grammars from Treebanks (Rens Bod)
20. Stochastic Lexicalized Tree Grammars (Günter Neumann)
21. From Treebank Resources to LFG F-Structures (Anette Frank, Louisa Sadler, Josef van Genabith, Andy Way)

Contributing Authors
Index

Preface

The papers in this volume are either original or based on presentations at several workshops and conferences (LINC, LREC, EACL), in particular the ATALA workshop on treebanks (Paris, 1999). The papers have been greatly reworked, and I thank the authors for rewriting their own, and reviewing each other's, papers. I also thank three anonymous Kluwer reviewers, as well as the series editors Nancy Ide and Jean Véronis, for their helpful contribution. The introduction is meant to give a brief overview of the work being done on parsed corpora (or treebanks) and is by no means an exhaustive state of the art. This field is moving very fast, and one can only hope to point out the basic motivation and methodology, as well as the results and open problems, common to the different projects presented. I want to thank Nicolas Barrier for postediting, indexing and formatting, as well as Jesse Tseng for help with proofreading. I also thank University Paris 7 for material support.

ANNE ABEILLÉ

Introduction

Anne Abeillé

It is now quite easy to have access to large corpora for both written and spoken language. Corpora have become popular resources for linguists and for engineers developing applications in Natural Language Processing (NLP). Linguists typically look for various occurrences of specific words or patterns; engineers extract lexicons and language models (associating probabilities with sequences of words). This kind of research is not new, but it has produced new results with the availability of larger and larger corpora in electronic form, and with the growing memory of computers, which can now easily handle texts of several million words.
Nevertheless, as observed for example by Manning (2002), corpus-based linguistics has been largely limited to phenomena that can be accessed via searches on particular words. Inquiries about subject inversion or agentless passives are impossible to perform on commonly available corpora. More generally, corpus-based research has faced two (related) kinds of problems:

• Results are necessarily limited as long as one is dealing with corpora with no linguistic annotations. Such corpora may have some metalinguistic annotations, such as paragraph boundaries or italics, but lack linguistic information such as parts of speech or clause boundaries. As ambiguity is pervasive in natural languages, most computations on such corpora are biased. To take just one example, from French, it is difficult to count how many words the following sentence contains, or how many occurrences of the word "le":

  Paul ouvre le sac de pommes de terre et le pose sur la table.
  "Paul opens the bag of potatoes and puts it on the table"

  In this sentence, "pommes de terre" is in fact a compound word (meaning potato, literally 'apple of earth'), although no typographical sign sets it apart from other word sequences. So the sentence has 12 words, not 14. Likewise, the form "le", which occurs twice, corresponds to two distinct words: the definite determiner (before the noun "sac") and the accusative pronoun (before the verb "pose"). So if one looks for other contexts of the same word, one must know whether one wants the determiner or the pronoun. This is why some linguistic annotations are needed, in order to make searches and computations more precise (a small illustration follows at the end of this passage).

• Automatic tools for annotating corpora exist for many languages, but they make mistakes. One could, in principle, run an automatic part-of-speech tagger or lemmatizer on a given corpus and take the resulting annotation into account before searching it, but the quality of the resulting corpus is not guaranteed and the annotation choices are usually not documented (the tool designer may not have foreseen many of the cases encountered in corpora). The quality of the search will be highly dependent on the choice of a particular tagger.

This is why people started to develop, in the late sixties, linguistically annotated corpora, to improve searches and computations. The idea is that these annotations are devised by experienced linguists or grammarians, fully documented for the end user, and fully checked to prevent remaining errors (introduced by a human annotator or by a machine). These constitute linguistically annotated reference corpora. The first to be built were for English (the IBM/Lancaster treebank or the British National Corpus), and they have helped the development of both English NLP tools and English corpus linguistics. They emerged as the result of a convergence between computational linguistics and corpus linguistics. Computational linguists, once grammatical formalisms and parsing algorithms had been successfully developed, became aware that real texts usually contain many peculiarities overlooked in usual grammar books. Corpus linguists have always tended to emphasize minute descriptions of a given language over general theorizing.
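To make the point concrete, here is a minimal sketch of how token-level annotation resolves both problems for this sentence. The tuple format and the tag names (DET, PRO, NC, ...) are invented for the illustration and do not correspond to any particular project's scheme.

```python
# Each token is (form, lemma, part-of-speech tag); tags are illustrative only.
annotated = [
    ("Paul", "Paul", "NPP"), ("ouvre", "ouvrir", "V"),
    ("le", "le", "DET"), ("sac", "sac", "NC"), ("de", "de", "P"),
    ("pommes de terre", "pomme de terre", "NC"),   # compound noun kept as one token
    ("et", "et", "CC"), ("le", "le", "PRO"),       # same form "le", different category
    ("pose", "poser", "V"), ("sur", "sur", "P"),
    ("la", "le", "DET"), ("table", "table", "NC"),
]

# With the compound treated as a single word, the sentence has 12 words, not 14.
print(len(annotated))                                                # 12

# The two occurrences of "le" can now be told apart by their tags.
print(sum(1 for f, _, t in annotated if f == "le" and t == "DET"))   # 1 determiner
print(sum(1 for f, _, t in annotated if f == "le" and t == "PRO"))   # 1 pronoun
```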
A linguistically annotated reference corpus allows one to build (or improve) sizable linguistic resources (such as lexicons or grammars) and also to evaluate (or improve) most computational analyzers. One usually distinguishes tagged corpora (associating each token with a part of speech and some morphological information) from parsed corpora (or treebanks), which provide, in addition to the category of each word, some of the following information:

• constituent boundaries (clause, NP, ...),
• grammatical functions of words or constituents,
• dependencies between words or constituents.

This book gathers 21 papers on building and using such parsed corpora, raising among others the following questions:

• how does one choose the corpus to be annotated?
• how does one choose the kind of annotation to be added?
• is annotation best done manually or automatically? with which tools? with which formats?
• how can one search such annotated corpora?
• what kind of knowledge can be extracted (learned) from them? how is it better than extracting (or learning) from unannotated sources?
• how can they be used to evaluate current NLP tools such as parsers or grammar checkers?

The papers presented here deal with a variety of languages (Chinese, Czech, English, French, German, Italian, Japanese, Polish, Portuguese, Spanish, Turkish).¹ The objective of this book is not to present a recipe for building your own treebank or using an existing one but, in a more general perspective, to present an overview of the work being done in this area, the results achieved and the open questions. Unsurprisingly, many independent projects face some of the same issues regarding corpus annotation or resource extraction.

1. BUILDING TREEBANKS

The main issues facing treebank designers are corpus choice, annotation choice, and annotation tools.

1.1 Motivation

The creation of syntactically annotated corpora started about thirty years ago, for English, with mostly manual methods. The goal then was to provide the most complete annotation scheme possible, on empirical grounds, and to test it on a small corpus (just large enough to train statistical annotation programs). The goal was to advocate linguistic empiricism against the linguistic theories of the time (cf. Sampson, this volume). In contrast, with the development of more mature linguistic models, the objective of some treebanks nowadays is to apply a given linguistic theory, such as a specific kind of dependency grammar for the Prague project (Böhmová et al., this volume), or HPSG for the Polish project (Marciniak et al., this volume). But the most common goal, as naive as it may seem, is to provide a new resource, not committed to a particular linguistic theory, and convertible into different linguistic models. Some projects also have a precise application in mind: the Bank of English (Järvinen, this volume) was built to extract an enriched kind of dictionary (with observed constructions associated with the words). Other projects aim at training automatic tools or evaluating them, and this may affect their annotation choices (cf. Becker et al., this volume). Corpora for evaluation allow for some remaining ambiguities but no errors, while the opposite may be true of corpora for training: the former may associate several plausible analyses with the same sentence, whereas the latter will try to always choose only one.

1.2 Corpus choice

The choice of the corpus depends on the objective.
In most of the cases presented here, it consists of contemporary newspaper texts (Böhmová et al., Moreno et al., Brants et al., Abeillé et al., Montemagni et al., Chen et al.). Some authors use medieval texts for diachronic studies (Rocio et al.), a sample of representative sentences for grammar evaluation (Marciniak et al.), or a corpus of mail messages for error annotation (Becker et al.). The corpus size can vary from a few hundred sentences (the Turkish treebank) to hundreds of millions of words (the Bank of English). Some are balanced, with extracts from different types of texts (literary, scientific, etc.), while others are not. As noted by Manning (2002), newspapers constitute in this respect a pseudo-genre, with texts from different domains, often written in different styles. Some projects now involve the syntactic annotation of speech corpora, such as the Christine project mentioned by G. Sampson (this volume) or the Switchboard corpus mentioned by Taylor et al. (this volume) as a follow-up to the Penn Treebank project. These involve transcription as their first annotation phase.

1.3 Annotation choice

The choice of annotation depends both on the availability of syntactic studies for a given language (formalized or not, theory-oriented or not) and on the objective. Carroll et al. (this volume) show how difficult it is to choose a reasonable annotation scheme for parser evaluation, when a variety of parsing outputs have to be matched with the annotated corpus. Several annotation levels are usually distinguished, with morphosyntactic annotations such as parts of speech and syntactic annotations such as constituents or grammatical relations.

An ongoing debate, inspired by discussions in theoretical linguistics, is the choice between constituency and dependency annotation. As is well known, both annotations can be seen as equivalent, as long as each phrase is associated with a lexical head:

Figure 1.1. Constituency and dependency trees for the same sentence ("John wants to eat the cake"). The constituency tree is (S (NP John) (VP (V wants) (VP (V to eat) (NP the cake)))); in the dependency tree, "wants" governs "John" (subject) and "eat", and "eat" governs "cake" (object).

Another debate is between deep and shallow syntactic annotations: some projects limit themselves to overt elements, while others reconstruct empty subjects or elliptical material. In order to account for "deeper" information, in the first tree of Figure 1.1, the complement of "want" would be an S with an empty subject coindexed with "John"; in the second tree, a link between "John" and "eat" would be added. Both types of annotation have their advantages and their drawbacks: constituents are easy to read and correspond to common grammatical knowledge (for major phrases), but they may introduce arbitrary complexity (with more attachment nodes than needed and numerous unary phrases). Dependency links are more flexible and also correspond to common knowledge (the grammatical functions), but they are less readable (word order is not necessarily preserved) and choosing one word as the head of any group may be arbitrary. A hybrid solution, chosen by some of the projects, is to maintain some constituents with dependency relations between them (cf. the Negra corpus and the Japanese, Chinese and Turkish treebanks). In this case, the constituents are usually minimal phrases, called chunks, bunsetsu or inflection groups.

Given the state of syntactic knowledge, some annotation choices may be arbitrary. What is most important is consistency (similar cases must be handled similarly) and explicitness (a detailed documentation must accompany the annotated corpus). As noted by G. Sampson for the Susanne Corpus, the size of the documentation can be bigger than the corpus itself.
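Returning to the equivalence illustrated in Figure 1.1: once each phrase is assigned a lexical head, dependencies can be read off a constituency tree mechanically. The following is a minimal sketch over a hypothetical data structure (the Node class and the head marking are invented for the illustration, not a treebank format).

```python
# A toy constituency node: a label, either children (for phrases) or a word (for
# leaves), and the index of the head child. Purely illustrative.
class Node:
    def __init__(self, label, children=None, word=None, head=0):
        self.label, self.children, self.word, self.head = label, children or [], word, head

    def head_word(self):
        # The lexical head of a phrase is the head word of its head child.
        return self.word if self.word else self.children[self.head].head_word()

def dependencies(node, deps=None):
    """Collect (head, dependent) links: each non-head child depends on the head child."""
    if deps is None:
        deps = []
    if node.word:
        return deps
    head = node.children[node.head].head_word()
    for i, child in enumerate(node.children):
        if i != node.head:
            deps.append((head, child.head_word()))
        dependencies(child, deps)
    return deps

# "John wants to eat the cake"; the S node's head is its VP child.
tree = Node("S", [
    Node("NP", word="John"),
    Node("VP", [Node("V", word="wants"),
                Node("VP", [Node("V", word="to eat"),
                            Node("NP", [Node("DET", word="the"),
                                        Node("N", word="cake")], head=1)])]),
], head=1)

print(dependencies(tree))
# [('wants', 'John'), ('wants', 'to eat'), ('to eat', 'cake'), ('cake', 'the')]
```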
The tagset (the list of symbols used for categories, phrases and functions) can vary from a dozen (Marcus et al. 1993) to several million (the Turkish project), with diverse possibilities in between (medium-sized for the French or Italian projects, large for the Prague treebank). Tagsets partly reflect the richness of the language's morphosyntax (with case or agreement systems yielding numerous features to be encoded). Problematic cases are often the same across languages. Compounds or idioms are clustered as such (French and Turkish projects) or simply ignored (Penn Treebank, German or Portuguese treebanks). Discontinuities, null elements, and coordination are known to be debated in linguistics, and they are a matter of debate for corpus annotation too. More specific to corpora are other problematic cases such as the annotation of titles, quotations, dates, and measure phrases.

1.4 Methodology

Corpus annotation was done entirely by humans in the first projects, for example in Sweden (Jaborg 1986) or for the Susanne corpus (Sampson 1995). It is now usually at least partially automated. Human post-checking is always necessary, be it systematic (as in the Penn Treebank or the Prague Dependency Treebank) or partial (as in the BNC). Ensuring coherence across several annotators is a crucial matter, as pointed out by Wallis (this volume) for the ICE-GB project and by Brants et al. (this volume) for the Negra project. Purely automatic annotation is only done by Rocio et al. and Järvinen (this volume). Purely manual annotation is done by Marciniak et al., as a preliminary stage (this was also the case in the first phase of the Prague project). Most projects involve a phase of automatic annotation followed by a phase of manual correction. Some check all annotation levels at the same time (Chinese treebank, Negra project), while others check tagging and parsing information separately (Penn Treebank, Prague treebank, French treebank). Most treebanks presented here involve human validation, but in order to minimize time and cost, some new projects tend to favor merging the outputs of different automatic annotation tools, such as tagging with several taggers and checking only the parts where they conflict, as in the French Multitag project (cf. Adda et al. 1999); a toy illustration of such conflict-driven checking is sketched at the end of this section.

1.5 Annotation tools

Tools for annotation usually include tokenizers (for word boundaries and compounds), taggers (for part-of-speech annotation) and morphological analyzers, and parsers (for syntactic information). They must be robust and minimize errors. Some projects prefer to build their own tools, in order to better match their annotation choices (French, Spanish, Chinese, Turkish or Japanese treebanks). Others prefer to reuse existing ones, and some adaptation is often necessary: the ICE-GB treebank reuses the TOSCA tools, the Prague treebank reuses M. Collins' statistical parser originally developed for English, and the medieval Portuguese treebank reuses a parser (and a grammar) originally developed for modern Portuguese. Some tools are rule-based, others are based on statistics and sometimes trained on a subset of the corpus to be annotated. Usually, two annotation phases are distinguished, each with its own tool: a tagging phase, dealing only with morphosyntactic ambiguity (and part-of-speech annotation), and a parsing phase, dealing with syntactic ambiguity per se and constituency (or dependency) annotations.
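As a toy illustration of the conflict-driven checking mentioned above, the following sketch compares the outputs of two hypothetical taggers, accepts the tokens on which they agree, and queues only their disagreements for manual correction. Tokens and tags are invented for the example.

```python
# Minimal sketch of conflict-driven validation between two automatic taggers.
def merge_taggings(tokens, tagging_a, tagging_b):
    accepted, to_review = [], []
    for i, token in enumerate(tokens):
        if tagging_a[i] == tagging_b[i]:
            accepted.append((token, tagging_a[i]))
        else:
            to_review.append((i, token, tagging_a[i], tagging_b[i]))
    return accepted, to_review

tokens    = ["le", "fait", "est", "important"]
tagging_a = ["DET", "NC", "V", "ADJ"]     # output of tagger A (illustrative tags)
tagging_b = ["DET", "V", "V", "ADJ"]      # tagger B disagrees on "fait"
accepted, to_review = merge_taggings(tokens, tagging_a, tagging_b)
print(to_review)   # [(1, 'fait', 'NC', 'V')] -> only this token needs manual checking
```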
More specific tools are those used by the annotators to check and correct the treebank. Such tools are specifically developed by Wallis, Brants et al. and Abeillé et al. (this volume). Sometimes the tools for the annotators are the same as those for searching the annotated corpus. Most corpora presented here are static resources. A recent line of research is to develop dynamic treebanks, which are parsed corpora distributed with all their annotation tools, so that they can be enlarged at will by the user (cf. Oepen et al. 2002).

2. USING TREEBANKS

Treebanks allow for multiple uses, by linguists, psycholinguists or computational linguists. Linguists may search for examples (or counter-examples) for a given theory or hypothesis. With the development of Optimality Theory (cf. Dekkers et al. 2001), which relies on ranked and defeasible constraints, corpus-based research is now welcome in generative linguistics (cf. Manning 2002).² Psycholinguists are usually interested in computing frequencies, such as those of low- or high-attaching relative clauses, and comparing them with human preferences (cf. Gibson and Schütze 1999). Sociolinguists, and those working on the history of languages, have always worked with corpora and are starting to use treebanks (cf. Kroch and Taylor 2000).

For simple linguistic queries, annotated texts enable a reduction of the noise usually associated with answers on raw texts, and also the formulation of new questions. When one is interested in French causatives, it is frustrating to have to list all the inflected forms of the verb "faire" in a simple query, and many of the answers are not relevant because they involve the homonymous noun "fait" (which is also part of many compounds: en fait, de fait, du fait que, etc.). Lemmatized tagged texts are thus helpful, and in a treebank one can add the context of an infinitival verb in the same clause to be more precise (a toy query of this kind is sketched below). New questions involve word order and the complexity of various types of phrases. Arnold et al. (2000), for example, have used the Penn Treebank to determine which factors favor the non-canonical V PP NP order in English.
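As an illustration of the kind of query discussed above, here is a toy search for candidate causatives: clauses containing a form of the lemma "faire" followed by an infinitive. The corpus format, the tags and the clause segmentation are invented for the example.

```python
# Toy query over a lemmatized, tagged, clause-segmented corpus (illustrative format).
clauses = [
    [("Paul", "Paul", "NPP"), ("fait", "faire", "V"), ("rire", "rire", "VINF"),
     ("Marie", "Marie", "NPP")],                              # causative: faire + infinitive
    [("le", "le", "DET"), ("fait", "fait", "NC"),
     ("est", "être", "V"), ("connu", "connaître", "VPP")],    # noun "fait": no match
]

def causative_candidates(clauses):
    hits = []
    for clause in clauses:
        for i, (form, lemma, tag) in enumerate(clause):
            if lemma == "faire" and tag == "V" and any(t == "VINF" for _, _, t in clause[i + 1:]):
                hits.append(clause)
    return hits

print(len(causative_candidates(clauses)))   # 1 -> only the true causative clause
```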
Bilingual text alignment is done automatically at the sentence level with good performance but the results are much poorer at the word level, because translations are not usually word by word and because of the pervasive ambiguity of word forms (homonymous forms etc). A more realistic perspective is to align clauses or major phrases such as NPs (cf. Sun et al. 2000) . For text indexing or terminology extraction, too, some syntactic structure is necessary, at least spotting the major noun phrases. Knowing that an NP is in argumental position (subject or object) may make it a better candidate for indexing than NPs only occurring in adjunct phrases. On the other hand , treebanked texts are usually smaller than raw texts (usually available in quasi infinite quantity, especially via the various web sites in the world) , especially for languages for which the annotation tools do not perform well enough to avoid human post-correction. Some searches for individual forms or patterns may yield poor results. Another obstacle si that treebanks are not readable as such and require specific search and viewing tools . This may be why they are still more used by computational linguists than by other linguists. The papers gathered here all deal with applications in computational linguistics, such as lexicon induction (Jarvinen), grammar induction (Frank et aI., Bod) , parser evaluation (Carroll et al.) or checker evaluation (Becker et al.) . Some of the questions facing the treebanks ' users are the following: • what are the corpora most suited to her goal (domain, size ...) ? INTRODUCTION XXI • are the linguistic annotations appropriate for the task ? • is the tagset usable as such or does it have to be converted (reduced? enlarged ?) • in which format does the treebank come (and does it have to be converted) ? • does the treebank come with appropriate tools (such as parameterized viewing tools or interactive search tools) ? 2.1 Exchange formats and search tools For search tools or machine learning program s to be usable on different annotated corpora, one must define some kind of exchange format. Standards still have to be defined , and Ide and Romary (this volume) make a first step towards this goal, following the XCES (XML based corpus encoding standard). Format standardization will promote the reuse of common search and viewing tools. One of the reasons for limited use of treebanks by everyday linguists is the lack of friendly search tools, a noticeable exception being the ICE-CUP query language associated with the ICE-GB corpus (Wallis this volume). 2.2 Resource induction Different kinds of resource s can be extracted from treebanks. The main motivation for the Bank of English project (Jarvinen this volume) was to extend the Collins COBUILD dictionary, with new examples and new constructions for most of the Engli sh verbs. Another choice is grammar extraction or rule induction . The corpus-based grammar may be a traditional human-written grammar book such as (Biber et al. 2000). It can also be an online grammar, to be used by NLP programs. In this case, one has to choose a formal model of grammar, for example the simple context free rewriting rule type. The papers included here in this domain involve several corpora (ATIS corpus, Penn treebank , Lancaster corpus) , Dutch (avis) or German (Negra). Some papers (Neumann, Frank et al.) start with a richer model (LFG , HPSG or TAG) which guide the type of pattern (tree like or rule like) to be extracted. 
Such corpus-based grammars are likely to perform better on similar corpora than grammars written independently, but they are also likely to ignore phenomena not present in the training corpus. They can also associate with each rule its productivity or plausibility (given the number of times it is used in the corpus), and parsers using such grammars can easily decide between several possible analyses (and avoid spurious ambiguities). But a grammar is not just a list of independent rules, and some experiments also aim at capturing generalizations, for example by merging rules involving the same parts of speech, or by ignoring rules applied a very small number of times in the corpus. A grammatical model (such as LFG) may also require information not present in the corpus (such as grammatical functions), and part of the experiment is then to enrich the extracted rules in a semi-automatic way (Frank et al.). The result is usually far more rules than a parser can handle while remaining efficient (cf. Bod, this volume). More experiments thus have to be done.

2.3 Training tools

Treebanks can be used to train different tools for automatic natural language processing: taggers, parsers, generators. In the field of stochastic natural language processing (cf. Charniak 1993, Manning and Schütze 1999), corpora are the primary source of information for training such tools. The first stochastic parsers used unannotated texts (unsupervised learning), but new parsers now use richly annotated corpora and perform better. Bod (1998, this volume) proposes the Data-Oriented Parsing (DOP) model as an alternative to classical parsers. His parsers are context-free parsers that use the subtrees of the learning corpus (the treebank) converted into rewriting rules and associated with probabilities. In essence, finding the correct analysis (the best parse) amounts to maximizing the probability of a derivation (the product of the probabilities of the rules being used). Such parsers are robust (they do not fail on real corpora) and deterministic (they always give one best result). In such experiments, one often divides the treebank into two parts: one for learning (the training set) and one for evaluation (the test set). It is natural that such parsers perform well on (unannotated) corpora similar to the original treebank, but it is an open question how well they perform on different corpora. To train a parser that performs well, one has to find a balance between the size of the tagset (the richness of information available in the annotated corpus) and the size of the corpus (the number of words or sentences annotated). The best performance is obtained with a small tagset (fewer than 50 tags) and a large corpus (more than 1 million words). Srinivas and Joshi (1999), training a shallow parser (called a supertagger) on the Penn Treebank (converted with 300 different tags), show how going from a training set of 8,000 words to a training set of 1 million words improves the parser's performance from 86% (of the words with the correct syntactic structure) to 92% (using a trigram model). Srinivas and Rambow (2000) show how to train a generator on a treebank, favoring the choice of the most common constructions.
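The probability model sketched above can be illustrated in its simplest (PCFG-style) form: rule probabilities are estimated from treebank counts, and a derivation is scored by the product of the probabilities of the rules it uses. DOP itself works with subtrees of arbitrary size rather than single rules, and the counts below are invented.

```python
from collections import Counter

# Toy treebank rule counts. P(rule) = count(rule) / count(LHS);
# the probability of a derivation is the product of its rule probabilities.
rule_counts = Counter({
    ("S", ("NP", "VP")): 100,
    ("VP", ("V", "NP")): 60,
    ("VP", ("V",)): 40,
    ("NP", ("DET", "N")): 70,
    ("NP", ("John",)): 30,
})

lhs_counts = Counter()
for (lhs, rhs), count in rule_counts.items():
    lhs_counts[lhs] += count

def rule_prob(rule):
    lhs, _ = rule
    return rule_counts[rule] / lhs_counts[lhs]

def derivation_prob(rules):
    prob = 1.0
    for rule in rules:
        prob *= rule_prob(rule)
    return prob

derivation = [("S", ("NP", "VP")), ("NP", ("John",)),
              ("VP", ("V", "NP")), ("NP", ("DET", "N"))]
print(derivation_prob(derivation))   # 1.0 * 0.3 * 0.6 * 0.7, i.e. about 0.126
```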
2.4 Evaluation with treebanks

Different resources and tools can be evaluated using a treebank. One can evaluate available lexicons: their coverage, and the precision of their valence information, as recorded in the reference lexicon, compared to what is found in the treebank. An indirect result could be adding weights (or preferences) to category or valence information in existing dictionaries: from the treebank, one knows which categories or valence frames are used more often than others for a given verb. One can also evaluate available grammars, to see how many of the treebank constructions they cover, and whether or not they provide the same analysis for similar phenomena. Xia et al. (2000), in order to evaluate an English hand-crafted grammar based on the Tree Adjoining Grammar formalism (XTAG), compare it with a grammar in the same format automatically extracted from the Penn Treebank.³ As for lexicons, an indirect result can be adding weights to such a grammar, since one knows from the treebank which grammar rules are used more often than others.

In a larger sense, one can also use treebanks to evaluate (and enrich) grammatical theories or formalisms. The construction of the Prague Dependency Treebank was part of a project to test whether the Functional Generative Description was appropriate for a rich collection of contemporary Czech texts. In a sense, the HPSG-based treebank for Polish, built by Marciniak et al. (this volume), is also a testing ground for Head-driven Phrase Structure Grammar, to see whether its descriptions are broad or robust enough to account for samples of contemporary usage.

Different tools can be evaluated using a treebank: part-of-speech taggers of course, although a (validated) tagged corpus is enough for that. More ambitious taggers assigning grammatical functions (such as Järvinen's FDG parser) need more than tagged corpora to be evaluated. As explained above, stochastic parsers are naturally evaluated with treebanks, but hand-crafted rule-based parsers can be as well. Two papers (Carroll et al. and Lin, this volume) present an exercise in parser evaluation using the Susanne corpus. It is usually difficult to directly match the output of the tool to be evaluated (a tagger or a parser) against the treebank itself; some conversions or transformations are usually necessary. A common methodology, named Parseval, has been tested on the Penn Treebank: in order not to penalize annotation choices diverging from the treebank, it only counts crossing brackets as errors and does not take the constituents' labels into account. Still, this method has been shown to be unfair to certain types of parsing schemes (Lin, Carroll et al., this volume), and a method evaluating grammatical functions is now often preferred.
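The bracket-matching idea behind such evaluation can be sketched as follows: gold and parser constituents are compared as unlabeled spans, yielding precision, recall and a crossing-brackets count. This is a simplified illustration with invented spans, not the full PARSEVAL definition.

```python
# Gold and test analyses as sets of constituent spans (start, end); labels ignored.
gold = {(0, 7), (0, 2), (2, 7), (3, 7), (5, 7)}
test = {(0, 7), (0, 2), (2, 7), (2, 5), (5, 7)}

precision = len(gold & test) / len(test)
recall    = len(gold & test) / len(gold)

def crosses(a, b):
    # Two spans cross if they overlap without one containing the other.
    return (a[0] < b[0] < a[1] < b[1]) or (b[0] < a[0] < b[1] < a[1])

crossing = sum(1 for t in test if any(crosses(t, g) for g in gold))
print(precision, recall, crossing)   # 0.8 0.8 1
```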
Conclusion

The papers included in this volume deal with building and using parsed corpora for a variety of languages. As treebanks are still a new kind of resource, it is not surprising that more papers deal with building them than with using them. Unsurprisingly, many independent projects dealing with different languages are faced with the same difficulties, regarding the annotation of notoriously difficult constructions such as discontinuities, coordinations, and parentheticals. Most applications presented here belong to the field of computational linguistics, but treebanks are starting to be used in other fields of linguistics as well, and to renew traditional studies in corpus linguistics (which have long been limited to word-level searches and computations). A goal for this book is precisely to encourage more treebank users and uses. Obviously, treebanks are just one step towards richly annotated corpora. Future work involves adding semantic annotation, as in the Redwood treebank (Oepen et al. 2002), or adding pronoun-antecedent links, as in Tutin et al. (2000). Among projects not included in this book, one may cite ongoing treebank projects for languages such as Bulgarian (Simov et al. 2002), Dutch (van der Beek et al. 2001) or modern Hebrew (Sima'an et al. 2001).

Notes

1. Some of these papers are revised versions of presentations at various workshops or conferences, among which the ATALA workshop on treebanks held in Paris in June 1999 and the LINC conferences in Bergen 1999 and Saarbrücken 2000.
2. Optimality Theory proposes universal constraints which can be ranked differently across languages, but which are not weighted. More recently, some probabilistic versions of OT have been proposed (cf. Boersma & Hayes 2001).
3. The result is that more than half of the hand-crafted rules (elementary trees) are actually in the treebank grammar, and these cover more than 80% of the treebank sentences (and more than 90% once one eliminates errors in the treebank and arbitrary differences in annotation choices between the grammarians and the treebank annotators).

References

A. Abeillé, L. Clément, A. Kinyon, F. Toussenel, 2002. The Paris 7 annotated corpus for French: some experimental results. In A Rainbow of Corpora: Corpus Linguistics and the Languages of the World, A. Wilson, P. Rayson and T. McEnery (eds.), Lincom Europa, Munich.
G. Adda, J. Mariani, P. Paroubek, M. Rajman, J. Lecomte, 1999. L'action GRACE d'évaluation de l'assignation de parties du discours pour le français, Langues, 2-2, p. 119-129.
J. Arnold, T. Wasow, A. Losongco, R. Ginstrom, 2000. Heaviness vs. newness: the effects of complexity and information structure on constituent ordering, Language 76, p. 28-55.
D. Biber, 1988. Variation across Speech and Writing, Cambridge: Cambridge University Press.
D. Biber, S. Johansson, G. Leech, S. Conrad, E. Finegan, 2000. The Longman Grammar of Spoken and Written English, London, Longman.
R. Bod, 1998. Beyond Grammar: An Experience-Based Theory of Language, CSLI, Stanford.
P. Boersma, B. Hayes, 2001. Empirical tests of the gradual learning algorithm, Linguistic Inquiry, 32:1, p. 45-86.
E. Charniak, 1993. Statistical Language Learning, Cambridge, MIT Press.
E. Charniak, 1996. Treebank grammars, Proceedings AAAI.
J. Chen, K. Vijay-Shanker, 2000. Automated extraction of TAGs from the Penn Treebank, Proceedings 6th International Workshop on Parsing Technologies (IWPT), Trento.
J. Dekkers, F. van der Leeuw, J. van de Weijer (eds.), 2000. Optimality Theory: Phonology, Syntax and Acquisition, Oxford: Oxford University Press.
E. Gibson, C. Schütze, 1999. Disambiguation preferences in NP conjunction do not mirror corpus frequencies, Journal of Memory and Language, 40.
J. Jaborg, 1986. Manual for syntaggning, Goteborg University, Institute for sprakvetenskaplig databehandling.
A. Kroch, A. Taylor, 2000. The Penn-Helsinki Parsed Corpus of Middle English, technical report, Linguistics Department, University of Pennsylvania.
F. Malrieu, F. Rastier, 2001. Genres et variations morphosyntaxiques, TAL, 42-1, p. 547-578.
C. Manning, H. Schütze, 1999. Foundations of Statistical Natural Language Processing, Cambridge, MIT Press.
C. Manning, 2002. Probabilistic syntax, in Bod et al. (eds.), Probabilistic Linguistics, Cambridge, MIT Press.
M. Marcus, B. Santorini, M. A. Marcinkiewicz, 1993. Building a large annotated corpus of English: the Penn Treebank, Computational Linguistics, 19:2, p. 313-330.
P. Merlo, S. Stevenson, 2001. Automatic verb classification based on statistical distribution of argument structure, Computational Linguistics, 27:3, p. 373-408.
S. Oepen, K. Toutanova, S. Shieber, C. Manning, D. Flickinger, T. Brants, 2002. The LinGO Redwood treebanks: motivation and preliminary application, Proceedings COLING, Taiwan.
J. Pynte, S. Colonna, 2000. Decoupling syntactic processing from visual inspection: the case of relative clause attachment in French. In Kennedy, Radach, Heller & Pynte (eds.), Reading as a Perceptual Process, Elsevier.
G. Sampson, 1995. English for the Computer: The Susanne Corpus, Oxford, Oxford University Press.
K. Sima'an, A. Itai, Y. Winter, A. Altman, N. Nativ, 2001. Building a treebank for modern Hebrew text, TAL, 42-2, p. 347-380.
K. Simov et al., 2002. Building a linguistically interpreted corpus for Bulgarian: the BulTreeBank project, Proceedings LREC, Canary Islands.
B. Srinivas, A. Joshi, 1999. Supertagging: an approach to almost parsing, Computational Linguistics, 25-2, p. 237-266.
B. Srinivas, O. Rambow, 2000. Exploiting a probabilistic hierarchical model for generation, Proceedings COLING, Saarbrücken.
L. Sun, Y. Jin, L. Du, Y. Sun, 2000. Word alignment of English-Chinese bilingual corpora based on chunks, Proceedings Joint SIGDAT Conference on Empirical Methods in NLP and Very Large Corpora, and 38th ACL Conference, Hong Kong, p. 110-116.
A. Tutin, F. Trouilleux, C. Clouzot, E. Gaussier, 2000. Building a large corpus with anaphoric links in French: some methodological issues, Proceedings Discourse Anaphora and Reference Resolution Colloquium, Lancaster.
L. van der Beek, G. Bouma, R. Malouf, G. van Noord, 2001. The Alpino Dependency Treebank, Proceedings LINC Conference, Saarbrücken.
H. van Halteren, N. Oostdijk, 1993. Towards a syntactic database: the TOSCA analysis system, in J. Aarts et al. (eds.), English Language Corpora: Design, Analysis and Exploitation, Amsterdam, Rodopi, p. 145-162.
F. Xia, M. Palmer, A. Joshi, 2000. A uniform method of grammar extraction and its application, Proceedings Joint SIGDAT Conference on Empirical Methods in NLP and Very Large Corpora, and 38th ACL Conference, Hong Kong, p. 53-62.

Part I. Building Treebanks

Chapter 1
THE PENN TREEBANK: AN OVERVIEW

Ann Taylor
University of York, Heslington, York, UK
at9@york.ac.uk

Mitchell Marcus, Beatrice Santorini
University of Pennsylvania, Philadelphia, PA, USA
{mitch,beatrice}@linc.cis.upenn.edu

Abstract: The Penn Treebank, in its eight years of operation (1989-1996), produced approximately 7 million words of part-of-speech tagged text, 3 million words of skeletally parsed text, over 2 million words of text parsed for predicate-argument structure, and 1.6 million words of transcribed spoken text annotated for speech disfluencies. This paper describes the design of the three annotation schemes used by the Treebank (POS tagging, syntactic bracketing, and disfluency annotation) and the methodology employed in production. All available Penn Treebank materials are distributed by the Linguistic Data Consortium, http://www.ldc.upenn.edu.

Keywords: English, Annotated Corpus, Part-of-speech Tagging, Treebank, Syntactic Bracketing, Parsing, Disfluencies

INTRODUCTION

The Penn Treebank, in its eight years of operation (1989-1996), produced approximately 7 million words of part-of-speech tagged text, 3 million words of skeletally parsed text, over 2 million words of text parsed for predicate-argument structure, and 1.6 million words of transcribed spoken text annotated for speech disfluencies.
The material annotated includes such wide-ranging genres as IBM computer manuals, nursing notes, Wall Street Journal articles, and transcribed telephone conversations, among others. This paper describes the design of the three annotation schemes used by the Treebank, POS tagging, syntactic bracketing, and disfluency annotation (section 1), and the methodology employed in production (section 2). All available Penn Treebank materials are distributed by the Linguistic Data Consortium, http://www.ldc.upenn.edu.

1. THE ANNOTATION SCHEMES

The majority of the output of the Penn Treebank consists of POS-tagged and syntactically bracketed versions of written texts such as the Wall Street Journal and the Brown Corpus. In the early years of the project, bracketing was done using a quite simple skeletal parse, while later phases made use of a richer predicate-argument bracketing scheme. In the final phase of operation, we produced a tagged and parsed version of part of the Switchboard corpus of telephone conversations, as well as a version annotated for disfluencies. In the remainder of this section we discuss the design of the three annotation schemes.

1.1 Part-of-speech tagging

The part-of-speech (POS) tagsets used to annotate large corpora prior to the Penn Treebank were generally fairly extensive. The rationale behind developing such large, richly articulated tagsets was to approach "the ideal of providing distinct codings for all classes of words having distinct grammatical behaviour" (Garside, Leech, and Sampson 1987). The Penn Treebank tagset, like many others, is based on that of the Brown Corpus, but it differs from it in a number of important ways.

First, the stochastic orientation of the Penn Treebank and the resulting concern with sparse data led us to modify the Brown Corpus tagset (Francis, 1964; Francis and Kucera, 1982) by paring it down considerably. The key strategy in this reduction was to eliminate lexical and syntactic redundancy. Thus, whereas many POS tags in the Brown Corpus tagset are unique to a particular lexical item, the Penn Treebank tagset strives to eliminate such instances of lexical redundancy. For instance, the Brown Corpus distinguishes the forms of the verbs have, be, and do from other main verbs by different tags. By contrast, since the distinctions between the forms of these verbs are lexically recoverable, they are eliminated in the Penn Treebank, and all main verbs receive the same set of tags. Distinctions recoverable with reference to syntactic structure were also eliminated. For instance, the Penn Treebank tagset does not distinguish subject pronouns from object pronouns even in cases where the distinction is not recoverable from the pronoun's form, as with you, since the distinction is recoverable on the basis of the pronoun's position in the parse tree in the parsed version of the corpus.

A second difference between the Penn Treebank and the Brown Corpus concerns the significance accorded to syntactic context. In the Brown Corpus, words tend to be tagged independently of their syntactic function. For instance, in the phrase the one, one is always tagged as CD (cardinal number), whereas in the corresponding plural phrase the ones, ones is always tagged as NNS (plural common noun), despite the parallel function of one and ones as heads of their noun phrase.
By contrast, since one of the main roles of the tagged version of the Penn Treebank corpus is to serve as the basis for a bracketed version of the corpus, we encode a word's syntactic function in its POS tag whenever possible. Thus, one is tagged as NN (singular common noun) rather than as CD (cardinal number) when it is the head of a noun phrase.

Thirdly, since a major concern of the Treebank is to avoid requiring annotators to make arbitrary decisions, we allow words to be associated with more than one POS tag. Such multiple tagging indicates either that the word's part of speech simply cannot be decided or that the annotator is unsure which of the alternative tags is the correct one.

The Penn Treebank tagset is given in Table 1.1. It contains 36 POS tags and 12 other tags (for punctuation and currency symbols). A detailed description of the guidelines governing the use of the tagset can be found in Santorini (1990) or on the Penn Treebank webpage.

Table 1.1. The Penn Treebank POS tagset

  CC    Coordinating conjunction       TO    Infinitival to
  CD    Cardinal number                UH    Interjection
  DT    Determiner                     VB    Verb, base form
  EX    Existential there              VBD   Verb, past tense
  FW    Foreign word                   VBG   Verb, gerund/present participle
  IN    Preposition                    VBN   Verb, past participle
  JJ    Adjective                      VBP   Verb, non-3rd ps. sg. present
  JJR   Adjective, comparative         VBZ   Verb, 3rd ps. sg. present
  JJS   Adjective, superlative         WDT   Wh-determiner
  LS    List item marker               WP    Wh-pronoun
  MD    Modal                          WP$   Possessive wh-pronoun
  NN    Noun, singular or mass         WRB   Wh-adverb
  NNS   Noun, plural                   #     Pound sign
  NNP   Proper noun, singular          $     Dollar sign
  NNPS  Proper noun, plural            .     Sentence-final punctuation
  PDT   Predeterminer                  ,     Comma
  POS   Possessive ending              :     Colon, semi-colon
  PRP   Personal pronoun               (     Left bracket character
  PP$   Possessive pronoun             )     Right bracket character
  RB    Adverb                         "     Straight double quote
  RBR   Adverb, comparative            `     Left open single quote
  RBS   Adverb, superlative            ``    Left open double quote
  RP    Particle                       '     Right close single quote
  SYM   Symbol                         ''    Right close double quote

1.2 Syntactic bracketing

Skeletal parsing. During the operation of the Penn Treebank, two styles of syntactic bracketing were employed. In the first phase of the project, the annotation used was a skeletal context-free bracketing with limited empty categories and no indication of non-contiguous structures and dependencies.

(S (NP Martin Marietta Corp.)
   was
   (VP given
       (NP a $ 29.9 million Air Force contract
           (PP for
               (NP low-altitude navigation and targeting equipment))))
   .)

The set of syntactic tags and null elements used in the skeletal bracketing is given in Table 1.2. More detailed information on the syntactic tagset and the guidelines concerning its use is to be found in Santorini and Marcinkiewicz (1991) or on the Penn Treebank website.

Table 1.2. The Penn Treebank syntactic tagset

  ADJP     Adjective phrase
  ADVP     Adverb phrase
  NP       Noun phrase
  PP       Prepositional phrase
  S        Simple declarative clause
  SBAR     Subordinate clause
  SBARQ    Direct question introduced by wh-element
  SINV     Declarative sentence with subject-aux inversion
  SQ       Yes/no questions and subconstituent of SBARQ excluding wh-element
  VP       Verb phrase
  WHADVP   Wh-adverb phrase
  WHNP     Wh-noun phrase
  WHPP     Wh-prepositional phrase
  X        Constituent of unknown or uncertain category
  *        "Understood" subject of infinitive or imperative
  0        Zero variant of that in subordinate clauses
  T        Trace of wh-constituent

Following the release of the first Penn Treebank CD-ROM, many users indicated that they wanted forms of annotation richer than those provided by the project's first phase, as well as an increase in the consistency of the preliminary corpus. Some also expressed an interest in a less skeletal form of annotation, expanding the essentially context-free analysis of the current treebank to indicate non-contiguous structures and dependencies. Most crucially, there was a strong sense that the Treebank could be of much more use if it explicitly provided some form of predicate-argument structure. The desired level of representation would make explicit at least the logical subject and logical object of the verb, and indicate, at least in clear cases, how subconstituents are semantically related to their predicates. Therefore, in the second phase of the project, a new style of annotation, Treebank II, was introduced.
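The skeletal bracketing above is plain nested parentheses, so it can be read into a tree structure with a few lines of code. The following is a minimal sketch assuming whitespace-separated tokens; real Treebank files require somewhat more careful tokenization.

```python
# Minimal sketch: read a Penn-style bracketed string into nested [label, ...] lists.
def read_tree(text):
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def parse():
        nonlocal pos
        pos += 1                      # consume "("
        node = [tokens[pos]]          # constituent label
        pos += 1
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                node.append(parse())
            else:
                node.append(tokens[pos])
                pos += 1
        pos += 1                      # consume ")"
        return node

    return parse()

tree = read_tree("(S (NP Martin Marietta Corp.) was (VP given (NP a contract)))")
print(tree)
# ['S', ['NP', 'Martin', 'Marietta', 'Corp.'], 'was', ['VP', 'given', ['NP', 'a', 'contract']]]
```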
Predicate-argument structure. The new style of annotation provided three types of information not included in the first phase:

1 A clear, concise distinction between verb arguments and adjuncts where such distinctions are clear, with an easy-to-use notational device to indicate where such a distinction is somewhat murky.

2 A non-context-free annotational mechanism to allow the structure of discontinuous constituents to be easily recovered.

3 A set of null elements in what can be thought of as "underlying" positions for phenomena such as wh-movement, passive, and the subjects of infinitival constructions, co-indexed with the appropriate lexical material.

The goal of a well-developed predicate-argument scheme is to label each argument of the predicate with an appropriate semantic label to identify its role with respect to that predicate (subject, object, etc.), as well as to distinguish the arguments of the predicate from adjuncts of the predication. Unfortunately, while it is easy to distinguish arguments and adjuncts in simple cases, it turns out to be very difficult to consistently distinguish these two categories for many verbs in actual contexts. It also turns out to be very difficult to determine a set of underlying semantic roles that holds up in the face of more than a few paragraphs of text. The Treebank II scheme is an attempt to come up with a middle ground which allows annotation of those distinctions that seem to hold up across a wide body of material. After many attempts to find a reliable test to distinguish between arguments and adjuncts, we abandoned structurally marking this difference. Instead, we decided to label a small set of clearly distinguishable roles, building upon syntactic distinctions only when the semantic intuitions were clear-cut. However, getting annotators to consistently apply even the small set of distinctions discussed here was fairly difficult.

In the skeletal parsing scheme discussed in section 1.2 we used only standard syntactic labels (e.g. NP, ADVP, PP, etc.) for our constituents (see Table 1.2); in other words, every bracket had just one label. The limitations of this become apparent when a word belonging to one syntactic category is used for another function, or when it plays a role which we want to be able to identify easily. In Treebank II style, each constituent has at least one label but as many as four tags, including numerical indices, taken from the set of functional tags given in Table 1.3.
NPs and Ss which are clearly arguments of the verb are unmarked by any tag. An open class of other cases that individual annotators feel strongly should be part of the VP are tagged -CLR (for CLosely Related); constituents marked -CLR typically correspond to the class of predication adjuncts proposed by Quirk et al. (1985). In addition, a handful of semantic roles are distinguished: direction, location, manner, purpose, and time, as well as the syntactic roles of surface subject, logical subject, and (implicit in the syntactic structure) first and second verbal objects.

( (S (NP-SBJ-1 Jones)
     (VP followed
         (NP him)
         (PP-DIR into
                 (NP the front room))
         (S-ADV (NP-SBJ *-1)
                (VP closing
                    (NP the door)
                    (PP behind
                        (NP him)))))
     .))

( (S (ADVP-LOC Here)
     (NP-SBJ-1 he)
     (VP could n't
         (VP be
             (VP seen
                 (NP *-1)
                 (PP by
                     (NP-LGS (NP Blue Throat)
                             and
                             (NP his gang))))))
     .))

Treebank II style also adds null elements in a wide range of cases; these null elements are co-indexed with the lexical material for which they stand.

Table 1.3. Functional tags

  Text categories
    -HLN   headlines and datelines
    -LST   list markers
    -TTL   titles
  Grammatical functions
    -CLF   true clefts
    -NOM   non-NPs that function as NPs
    -ADV   clausal and NP adverbials
    -LGS   logical subjects in passives
    -PRD   non-VP predicates
    -SBJ   surface subject
    -TPC   topicalized and fronted constituents
    -CLR   closely related (see text)
  Semantic roles
    -VOC   vocatives
    -DIR   direction and trajectory
    -LOC   location
    -MNR   manner
    -PRP   purpose and reason
    -TMP   temporal phrases

The current scheme uses two symbols for null elements: *T*, which marks WH-movement and topicalization, and *, which is used for all other null elements. Co-indexing of null elements is done by suffixing an integer to non-terminal categories (e.g. NP-10, VP-25). This integer serves as an id number for the constituent. A null element itself is followed by the id number of the constituent with which it is co-indexed. Crucially, the predicate-argument structure can be recovered by simply replacing the null element with the lexical material that it is co-indexed with.

(SBARQ (WHNP-1 What)
       (SQ is
           (NP-SBJ Tim)
           (VP eating
               (NP *T*-1)))
       ?)

Predicate Argument Structure: eat(Tim, what)
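This recovery step can be mechanized: find the constituent carrying the matching index and substitute its words for the null element. The following is a minimal sketch over a toy nested-list representation (not the Treebank's own tools or file format); the same idea applies to the passive and control examples that follow.

```python
# Minimal sketch of predicate-argument recovery: replace a co-indexed null element
# (e.g. "*T*-1" or "*-1") with the words of the constituent carrying that index.
def words(tree):
    return [w for c in tree[1:] for w in (words(c) if isinstance(c, list) else [c])]

def find_indexed(tree, index):
    if isinstance(tree, list):
        if tree[0].endswith("-" + index):
            return tree
        for child in tree[1:]:
            found = find_indexed(child, index)
            if found:
                return found
    return None

def resolve(tree, root=None):
    root = root or tree
    for i, child in enumerate(tree[1:], start=1):
        if isinstance(child, list):
            resolve(child, root)
        elif child.startswith("*") and "-" in child:        # indexed null element
            index = child.rsplit("-", 1)[-1]
            tree[i] = " ".join(words(find_indexed(root, index)))
    return tree

question = ["SBARQ", ["WHNP-1", "What"],
            ["SQ", "is", ["NP-SBJ", "Tim"], ["VP", "eating", ["NP", "*T*-1"]]]]
print(resolve(question))
# ... ['VP', 'eating', ['NP', 'What']]  ->  eat(Tim, What)
```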
Topicalized arguments, on the other hand, are always marked by a null element:

(S (NP-TPC-5 This)
   (NP-SBJ every man)
   (VP contains
       (NP *T*-5)
       (PP-LOC within
               (NP him))))

Again, this makes predicate-argument interpretation straightforward, if the null element is simply replaced by the constituent with which it is co-indexed.

With only a skeletal parse as used in the first phase of the Treebank project, many otherwise clear argument/adjunct relations cannot be recovered due to its essentially context-free representation. For example, there is no good representation for sentences in which constituents which serve as complements to the verb occur after a sentence-level adverb. Either the adverb is trapped within the VP, so that the complement can occur within the VP, where it belongs, or else the adverb is attached to the S, closing off the VP and forcing the complement to attach to the S. This "trapping" problem serves as a limitation when using skeletally parsed material to semi-automatically derive lexicons for particular applications. "Trapping" problems and the annotation of non-contiguous structure can be handled by simple notational devices that use co-indexing to indicate discontinuous structures. Again, an index number added to the label of the original constituent is incorporated into the null element which shows where that constituent should be interpreted within the predicate-argument structure. We use a variety of null elements to show how non-adjacent constituents are related; such constituents are referred to as "pseudo-attached". There are four different types of pseudo-attach, as shown in Table 1.4.

Table 1.4. The four types of pseudo-attachment

*ICH*  Interpret Constituent Here
*PPA*  Permanent Predictable Ambiguity
*RNR*  Right Node Raising
*EXP*  Expletive

The *ICH* pseudo-attach is used for simple extraposition, solving the most common case of "trapping":

(S (NP-SBJ Chris)
   (VP knew
       (SBAR *ICH*-1)
       (NP-TMP yesterday)
       (SBAR-1 that
               (S (NP-SBJ Terry)
                  (VP would
                      (VP catch
                          (NP the ball)))))))

Here, the clause that Terry would catch the ball is to be interpreted as an argument of knew. The *RNR* tag is used for so-called "right-node raising" conjunctions, where the same constituent appears to have been shifted out of both conjuncts.

(S But
   (NP-SBJ-2 our outlook)
   (VP (VP has
           (VP been
               (ADJP *RNR*-1)))
       and
       (VP continues
           (S (NP-SBJ *-2)
              (VP to
                  (VP be
                      (ADJP *RNR*-1)))))
       (ADJP-1 defensive)))

In order that certain kinds of constructions can be found reliably within the corpus, we have adopted special marking of some special constructions. For example, extraposed sentences which leave behind a semantically null "it" are parsed as follows, using the *EXP* tag:

(S (NP-SBJ (NP It)
           (S *EXP*-1))
   (VP is
       (NP a pleasure))
   (S-1 (NP-SBJ *)
        (VP to
            (VP teach
                (NP her)))))

Predicate Argument Structure: pleasure(teach(*someone*, her))

The *PPA* tag was introduced to indicate "permanent predictable ambiguity", those cases in which one cannot tell where a constituent should be attached, even given context. Here, annotators attach the constituent at the more likely site (or, if that is impossible to determine, at the higher site) and pseudo-attach it at all other plausible sites using the *PPA* null element.5
(S (NP-SBJ I)
   (VP saw
       (NP (NP the man)
           (PP *PPA*-1))
       (PP-CLR-1 with
                 (NP the telescope))))

1.3 Disfluency annotation

The final project undertaken by the Treebank (1995-6) was to produce a tagged and parsed version of the Switchboard corpus of transcribed telephone conversations, along with a version which annotated the disfluencies common in speech (fragments of words, interruptions, incomplete sentences, fillers and discourse markers). The disfluency annotation system (based on Shriberg (1994)) distinguishes complete utterances from incomplete ones, labels a range of non-sentence elements such as fillers, and annotates restarts.

Table 1.5. Disfluency Annotation

Utterances
/          end of complete utterance
-/         end of incomplete utterance

Non-sentence elements
F          fillers (uh, um, huh, oh, etc.)
E          explicit editing term (I mean, sorry, etc.)
D          discourse marker (you know, well, etc.)
C          coordinating conjunction (and, and then, but, etc.)
A          aside

Restarts
[RM + RR]  restart with repair (see text)
[RM +]     restart without repair

Restarts have the following form:

Show me flights from Boston on uh from Denver on Monday
                |-----RM------|-IM-|------RR------|
                              IP

RM  reparandum
IP  interruption point
IM  interregnum (filled pause or editing terms)
RR  repair

In the annotation, the entire restart with its repair is contained in square brackets. The IP is marked by a "+", and any IM material (filled pauses, etc.) follows the "+".

Show me flights [ from Boston on + {F uh } from Denver on ] Monday
                |------RM------|---IM----|------RR------|
                               IP

A: he's pretty good. / He stays out of the street / {C and, } {F uh, } if I catch him I call him / {C and } he comes back. / {D So } [ he, + he's ] pretty good about taking to commands [ and +
B: {F Um. } /
A: - and ] things. /
B: Did you bring him to a doggy obedience school or-
A: No - /
B: - just-
A: - we never did. /
B: - train him on your own / {C and, } -/
A: [ I, + I ] trained him on my own / {C and, } {F uh, } this is the first dog I've had all my own as an adult. /
B: Uh-huh. /

Figure 1.1. Sample disfluency annotation

A detailed account of the disfluency annotation can be found in Mateer and Taylor 1995 or on the Penn Treebank website http://www.cis.upenn.edu/~treebank.

2. METHODOLOGY

The three types of Treebank annotation, POS tagging, syntactic bracketing, and disfluency annotation, are all produced by the same two-step method: automatic annotation followed by manual correction. The correction of each type of annotation is done with the aid of a task-specific mouse-based package written in GNU Emacs Lisp, embedded in the GNU Emacs editor (Lewis and Laliberte 1990). POS tagging and disfluency annotation (when relevant) feed syntactic bracketing, but the first two are independent of each other and can be done in parallel, with the two output streams then being automatically merged, if desired.

2.1 Part-of-speech tagging

During the early stages of the Penn Treebank project, the initial automatic assignment was provided by PARTS (Church 1988), a stochastic algorithm developed at AT&T Bell Labs. PARTS uses a modified version of the Brown Corpus tagset close to our own and assigns POS tags with an error rate of 3-5%. The output of PARTS was automatically tokenized and the tags assigned by PARTS were automatically mapped onto the Penn Treebank tagset.
This mapping introduced about 4% error, since the Penn Treebank tagset makes certain distinctions that the PARTS tagset does not. Later, the automatic POS assignment was provided by a cascade of stochastic and rule-driven taggers developed on the basis of our early experience. Since these taggers are based on the Penn Treebank tagset, the 4% error rate introduced as an artefact of mapping from the PARTS tagset to ours is eliminated, and we obtain error rates of 2-6%. Finally, during the Switchboard project we switched to the then recently released Brill tagger (Brill 1993). The result of the first, automated stage of POS tagging is given to annotators to correct. The POS correction interface allows annotators to correct POS assignment errors by positioning the cursor on an incorrectly tagged word and then entering the desired correct tag (or sequence of multiple tags). The annotators' input is automatically checked against the list of legal tags and, if valid, appended to the original word-tag pair, separated by an asterisk. Appending the new tag rather than replacing the old tag allows us to easily identify recurring errors at the automatic POS assignment stage. Finally, in the distribution version of the tagged corpus, any incorrect tags assigned at the first, automatic stage are removed. The three stages are illustrated in Figure 1.2.

Output of tagger
Battle-tested/NNP Japanese/NNP industrial/JJ managers/NNS here/RB always/RB buck/VB up/IN nervous/JJ newcomers/NNS with/IN the/DT tale/NN of/IN the/DT first/JJ of/IN their/PP$ countrymen/NNS to/TO visit/VB Mexico/NNP ,/, a/DT boatload/NN of/IN samurai/NNS warriors/NNS blown/VBN ashore/RB 375/CD years/NNS ago/RB ./.

Hand-corrected by annotator
Battle-tested/NNP*/JJ Japanese/NNP*/JJ industrial/JJ managers/NNS here/RB always/RB buck/VB*/VBP up/IN*/RP nervous/JJ newcomers/NNS with/IN the/DT tale/NN of/IN the/DT first/JJ of/IN their/PP$ countrymen/NNS to/TO visit/VB Mexico/NNP ,/, a/DT boatload/NN of/IN samurai/NNS*/FW warriors/NNS blown/VBN ashore/RB 375/CD years/NNS ago/RB ./.

Final version
Battle-tested/JJ Japanese/JJ industrial/JJ managers/NNS here/RB always/RB buck/VBP up/RP nervous/JJ newcomers/NNS with/IN the/DT tale/NN of/IN the/DT first/JJ of/IN their/PP$ countrymen/NNS to/TO visit/VB Mexico/NNP ,/, a/DT boatload/NN of/IN samurai/FW warriors/NNS blown/VBN ashore/RB 375/CD years/NNS ago/RB ./.

Figure 1.2. Part-of-speech tagging pipeline

2.2 Syntactic bracketing

The methodology for bracketing the corpus is completely parallel to that for tagging: hand correction of the output of an automatic process. Fidditch, a deterministic parser developed by Donald Hindle first at the University of Pennsylvania and subsequently at AT&T Bell Labs (Hindle 1983, Hindle 1989), is used to provide an initial parse of the material. Annotators then hand-correct the parser's output using a task-specific mouse-based interface implemented in GNU Emacs Lisp. Fidditch has three properties that make it ideally suited to serve as a preprocessor to hand correction:

• It always provides exactly one analysis for any given sentence, so that annotators need not search through multiple analyses.

• It never attaches any constituent whose role in the larger structure it cannot determine with certainty. In cases of uncertainty, Fidditch chunks the input into a string of trees, providing only a partial structure for each sentence.
• It has rather good grammatical coverage, so that the grammatical chunks that it does build are usually quite accurate.

The output of Fidditch, however, which is fairly complex, with word, X-bar, and phrase levels represented, was found to be too complicated for the annotators to handle at speed. They were therefore presented with a simplified parse containing only the phrase labels for correction. The simplified output of Fidditch is illustrated in Figure 1.3. In general, the annotators do not need to rebracket much of the parser's output, a relatively time-consuming task. Rather, the annotators' main task is to "glue" together the syntactic chunks produced by the parser. Using a mouse-based interface, annotators move each unattached chunk of structure under the node to which it should be attached. Notational devices allow annotators to indicate uncertainty concerning constituent labels, and to indicate multiple attachment sites for ambiguous modifiers. Insertion and co-indexing of null elements is accomplished simply by dragging from the associated lexical material to the site of the null element.

( (S (NP Her eyes)
     (AUX were)
     (VP glazed))
  (? (PP as
         (NP (SBAR if
                   (S (NP she)
                      (AUX did)
                      (? (NEG n't))
                      (VP hear))))))
  (? or)
  (? even)
  (? (S (VP see
            (NP him))))
  (? .))

Figure 1.3. Simplified output of Fidditch before correction

The bracketed text after correction is shown in Figure 1.4. The fragments are now connected together into one rooted tree structure, functional tags are added, and null elements are inserted and co-indexed. Finally, the POS tags can be automatically combined with the skeletal parse to produce a tree with both POS and syntactic information.

( (S (NP-SBJ-2 Her eyes)
     (VP were
         (VP glazed
             (NP *-2)
             (SBAR-ADV as if
                       (S (NP-SBJ she)
                          (VP did n't
                              (VP (VP hear
                                      (NP *RNR*-1))
                                  or
                                  (VP (ADVP even)
                                      see
                                      (NP *RNR*-1))
                                  (NP-1 him))))))))
  .)

Figure 1.4. Bracketed text after correction

2.3 Disfluency annotation

As with POS tagging and syntactic bracketing, annotating disfluencies is done in two steps, although in this case the automated step is just a simple Perl script which attempts to identify and bracket the more common non-sentence elements, such as fillers. The correction interface for disfluencies allows easy input and manipulation of the annotations which mark restarts and repairs, with the same sort of mouse-driven package used for correcting the syntactic parse.

2.4 Productivity

The learning curve for the POS tagging task takes under a month (at 15 hours a week), and annotation speeds after a month exceed 3,000 words per hour. The rate for disfluency annotation is similar. Not surprisingly, annotators take substantially longer to learn the more complicated bracketing task, with substantial increases in speed occurring even after two months of training. Even after extended training, performance varies markedly by annotator, however, with speeds on the task ranging from approx. 750 words per hour to well over 1,000 words per hour after three or four months of experience.

3. CONCLUSIONS

Although the Penn Treebank is no longer in operation, the large amount of data produced by the project continues to provide a valuable resource for computational linguists, natural language programmers, corpus linguists and others interested in empirical language studies.
In addition, the tools and methodology developed by the Penn Treebank have been adopted, with some revision, by an ongoing project to create parsed corpora of all the historical stages of English, which is centred at the University of Pennsylvania and the University of York, with support from the University of Helsinki. The first corpus produced by the project, the Penn-Helsinki Parsed Corpus of Middle English, now in its second edition (Kroch and Taylor 2000) (http://www.ling.upenn.edu/mideng), has been released, and comparable corpora of Old and Early Modern English are in production.

Acknowledgments

The Penn Treebank has been supported by the Linguistic Data Consortium, by DARPA (grant No. N0014-85-K0018), by DARPA and AFOSR jointly (grant No. AFOSR-90-0066), and by ARO (grant No. DAAL 03-89-C0031 PRI). Seed money was provided by the General Electric Corporation (grant No. 10 1746000). The contribution of the staff of the Treebank, Grace Kim, Mary Ann Marcinkiewicz, Dagmar Sieglova (project administrators), Robert MacIntyre (data manager/programmer), and the many annotators who worked on the project over the years, Ann Bies, Constance Cooper, Leslie Dossey, Mark Ferguson, Robert Foye, Karen Katz, Shannon Keenan, Alison Littman, James Reilly, Britta Schasberger, James Siegal, Kevin Stephens, and Victoria Tredinnick, is also gratefully acknowledged.

Notes

1. This paper is a revised and expanded version of two earlier papers: Marcus et al. (1993), "Building a large annotated corpus of English: the Penn Treebank," Computational Linguistics 19(2):313-330, and Marcus et al. (1994), "The Penn Treebank: annotating predicate argument structure," in ARPA Human Language Technology Workshop.
2. http://www.cis.upenn.edu/~treebank
3. http://www.cis.upenn.edu/~treebank
4. The use of this tag was an experiment which was largely unsuccessful. Although some very experienced annotators were internally fairly consistent in their use of this tag, less experienced annotators had a hard time with it and consistency across annotators was not high. It was not used in the parsing of the Switchboard corpus.
5. The use of *PPA* was discontinued in the Switchboard phase, since annotators did not reliably detect these ambiguities.

References

Bies, Ann, Mark Ferguson, Karen Katz, and Robert MacIntyre. (1995). Bracketing Guidelines for Treebank II Style. Ms., Department of Computer and Information Science, University of Pennsylvania.
Brill, Eric. (1993). A Corpus-based Approach to Language Learning. PhD Dissertation, University of Pennsylvania.
Church, Kenneth W. (1980). Memory Limitations in Natural Language Processing. MIT LCS Technical Report 245. Master's thesis, Massachusetts Institute of Technology.
Church, Kenneth W. (1988). A stochastic parts program and noun phrase parser for unrestricted text. In Proceedings of the Second Conference on Applied Natural Language Processing, 26th Annual Meeting of the Association for Computational Linguistics, pages 136-143.
Francis, W. Nelson. (1964). A Standard Sample of Present-day English for Use with Digital Computers. Report to the U.S. Office of Education on Cooperative Research Project No. E-007. Brown University, Providence RI.
Francis, W. Nelson and Henry Kucera. (1982). Frequency Analysis of English Usage: Lexicon and Grammar. Houghton Mifflin, Boston.
Garside, Roger, Geoffrey Leech, and Geoffrey Sampson. (1987).
The Computational Analysis of English: A Corpus-based Approach. Longman, London.
Hindle, Donald. (1983). User Manual for Fidditch. Technical memorandum 7590-142, Naval Research Laboratory.
Hindle, Donald. (1989). Acquiring disambiguation rules from text. In Proceedings of the 27th Annual Meeting of the Association for Computational Linguistics.
Kroch, Anthony S. and Ann Taylor. (2000). The Penn-Helsinki Parsed Corpus of Middle English, Second Edition. Department of Linguistics, University of Pennsylvania.
Lewis, Bil, Dan Laliberte, and the GNU Manual Group. (1990). The GNU Emacs Lisp Reference Manual. Free Software Foundation, Cambridge MA.
Marcus, Mitchell P., Beatrice Santorini, and Mary Ann Marcinkiewicz. (1993). Building a large annotated corpus of English: the Penn Treebank. Computational Linguistics 19(2):313-330.
Marcus, Mitchell P., Grace Kim, Mary Ann Marcinkiewicz, Robert MacIntyre, Ann Bies, Mark Ferguson, Karen Katz, and Britta Schasberger. (1994). The Penn Treebank: Annotating predicate-argument structure. In ARPA Human Language Technology Workshop.
Mateer, Marie, and Ann Taylor. (1995). Disfluency Annotation Stylebook for the Switchboard Corpus. Ms., Department of Computer and Information Science, University of Pennsylvania.
Quirk, R., S. Greenbaum, G. Leech, and J. Svartvik. (1985). A Comprehensive Grammar of the English Language. Longman, London.
Santorini, Beatrice. (1990). Part-of-speech Tagging Guidelines for the Penn Treebank Project. Technical report MS-CIS-90-47, Department of Computer and Information Science, University of Pennsylvania.
Santorini, Beatrice and Mary Ann Marcinkiewicz. (1991). Bracketing Guidelines for the Penn Treebank Project. Ms., Department of Computer and Information Science, University of Pennsylvania.
Shriberg, E.E. (1994). Preliminaries to a Theory of Speech Disfluencies. PhD Dissertation, University of California at Berkeley.

Chapter 2

THOUGHTS ON TWO DECADES OF DRAWING TREES

Geoffrey Sampson
School of Cognitive & Computing Sciences, University of Sussex, Falmer, Brighton BN1 9QH, England
geoffs@cogs.susx.ac.uk

Abstract
The task of producing consistent, comprehensive structural annotations for real-life written and spoken usage teaches lessons that run counter to some of the assumptions of recent linguistics. It is not easy to believe that a natural language is a well-defined system, or that questions about the analysis of particular turns of phrase always have "right answers". Computational linguistics has been at risk of repeating mistakes made by the general field of computing in the 1960s; we need to learn from the discipline of software engineering. On the other hand, annotated corpora of real-life usage are already yielding findings about human nature that were unsuspected before these resources became available.

Keywords: English, Treebank, SUSANNE Corpus, CHRISTINE Corpus

1. HISTORICAL BACKGROUND

If one lives in the English countryside, now and then an aeroplane flies over and photographs one's house, and then someone calls to sell the picture as a souvenir. On the wall of my study I have a picture taken from the air in the summer of 1983; if you look very closely you can see in the garden a white disc, with a much smaller pink disc adjacent to it. The pink disc is the top of my bald head, and the white disc is a garden table covered with papers, because the photographer happened to record one of the opening days of my career as a producer of natural-language parse trees.
I was working then at Lancaster University, where my senior colleague, Geoffrey Leech, had a new research project concerned with statistics-based automatic parsing, and I had undertaken to parse manually a sample of written English to serve as a source of statistical data for the project. Glancing at the picture since, I have often wondered how happy I would have felt about embarking on that task, if I had known how many years of my life I was destined to devote to it. That original "treebank" - structurally analysed sample of natural language1 - was for internal project use and was not published; but it led me on to develop the SUSANNE Corpus, another treebank of written English, which has been publicly available since 1992 and has established itself as a popular language-engineering research resource, and more recently the CHRISTINE treebank of spoken English, the first stage of which was circulated in summer 1999. Shortly before the Paris treebank conference, I was given the go-ahead for a new project, "LUCY", to create a treebank of written English with special relevance for the teaching of writing skills. When LUCY is completed, I shall have been working on English structural annotation, with the help of various researchers at different times, more or less continuously for the two decades of my title2.

Before looking at details of this work, it is worth reminding ourselves of the flavour of computational linguistics as it was when I began drawing trees. In those days, most computational linguists did not work with real-life corpus data, and did not seem to want to. They made their data up out of their heads. By coincidence, 1983, the year when I began drawing trees, was also the year of the inaugural meeting of the recently-founded European Chapter of the Association for Computational Linguistics, and that conference (held at Pisa) was a good opportunity to take stock of the current state of play in the subject3. Here is a typical selection of example-sentences used by speakers at the Pisa meeting to illustrate the workings of their systems:

Whatever is linguistic is interesting.
A ticket was bought by every man.
The man with the telescope and the umbrella kicked the ball.
Hans bekommt von dieser Frau ein Buch.
John and Bill went to Pisa. They delivered a paper.
Maria è andata a Roma con Anna.
Are you going to travel this summer? Yes, to Sicily.
By contrast, here is a sample of utterances from the CHRISTINE treebank, based on material representing spontaneous British speech of the 1990s collected in the British National Corpus:

well you want to nip over there and see what they come on on the roll
can we put erm New Kids # no not New Kids Wall Of # you know
well it was Gillian and # and # erm {pause} and Ronald's sister erm {pause} and then er {pause} a week ago last night erm {pause} Jean and I went to the Lyceum together to see Arsenic and Old Lace
lathered up, started to shave {unclear} {pause} when I come to clean it there weren't a bloody blade in, the bastards had pinched it
but er {pause} I don't know how we got onto it {pause} er sh- # and I think she said something about oh she knew her tables and erm {pause} you know she'd come from Hampshire apparently and she # {pause} an- # an- yo- # you know er we got talking about ma- and she's taken her child away from {pause} the local school {pause} and sen- # is now going to a little private school up {pause} the Teign valley near Teigngrace apparently fra-

It is hardly necessary to underline the differences between these data-samples, in terms of complexity and general "messiness". True, I have made the contrast particularly vivid by drawing my real-life examples from informal, spontaneous speech. But even written language which has undergone the disciplines of publication tends in practice to be much less predictable than the Pisa examples. Here are a few sentences drawn at random from the LOB Corpus of published British English:

Sing slightly flat.
Mr. Baring, who whispered and wore pince-nez, was seventy if he was a day.
Advice - Concentrate on the present.
Say the power-drill makers, 75 per cent of major breakdowns can be traced to neglect of the carbon-brush gear.
But he remained a stranger in a strange land.

In the first example we find a word in the form of an adjective, flat, functioning as an adverb. In the next example, the phrase Mr. Baring contains a word ending in a full stop followed by a word beginning with a capital which, exceptionally, do not mark a sentence boundary. The third "sentence" links an isolated noun with an imperative construction in a logic that is difficult to pin down. In Say the power-drill makers..., verb precedes subject for no very clear reason. The last example is as straightforward as the examples from the Pisa meeting; but straightforward examples are not the norm. Not surprisingly, then, the task of annotating the grammatical structure of real-life language samples turned out to be a good deal more complicated than I had anticipated.

It would be inappropriate, here, to give much detail about the structural annotation scheme we evolved. For the sake of concreteness, I shall briefly illustrate it with an extract from the CHRISTINE Corpus (see FIG. 2.1). In FIG. 2.1, the next to rightmost field contains the words of a speech turn uttered by the speaker whose CHRISTINE code name is Gemma006, and the rightmost field shows the tree structure in which the words occur, displayed on successive lines as segments of a labelled bracketing. Gemma's second word you is a noun phrase (N) functioning as subject (:s) of its clause, whose verb group (V) is the single word want.
The object of want is an infinitival clause (Ti:o), whose understood logical subject is again you, hence a "ghost" element s101 is inserted in the word stream with an index number, 101, which marks it as identical to the subject of the main clause - and so on. The field to the left of the words contains their "wordtags" - word classifications drawn from an alphabet of 350-odd structurally-distinct classes of word; Gemma's first word, well, is classified as one type of "discourse item" (a form, characteristic of speech, which does not usually constitute part of a wider grammatical structure).

Figure 2.1. Extract from the CHRISTINE Corpus

2. BUILDING TREEBANKS

When I began drawing trees for Geoffrey Leech's project, he produced a 25-page typescript listing a set of grammatical category symbols which he suggested we use, with notes on how to apply them to debatable cases. I remember thinking that this seemed to cover every possible linguistic eventuality, so that all I needed to do was to apply Leech's guidelines more or less mechanically to a series of examples. I soon learned differently. Every second or third sentence seemed to present some new analytic problem, not covered in the existing body of guidelines. So I and the research team I was working with began making new decisions and cumulating the precedents we set into an ever-growing body of analytic rules. What grew out of a 25-page typescript was published in 1995 as a book of 500 large-format pages, English for the Computer (Sampson 1995). In the same year, an independently-developed but very comparable annotation scheme used for the University of Pennsylvania treebank was published via the Web, as Bies et al. (1995).

This great growth in annotation guidelines was caused partly by the fact that real-life language contains many significant items that are scarcely noticed by traditional linguistics. Personal names are multi-word phrases with their own characteristic internal structure, and so are such things as addresses, or references to weights, measures, and money sums; we need consistent rules for annotating the structures of all these forms, but they are too culture-bound to be paid much attention by the inherited school grammar tradition. In written language, punctuation marks are very significant structurally and must be fitted into parse trees in some predictable way, but syntactic analysis within theoretical linguistics ignored punctuation completely. The more important factor underlying the complexity of our annotation rules, though, was the need to provide an explicit, predictable annotation for every turn of phrase that occurs in the language. As Jane Edwards has put it: "The single most important property of any data base for purposes of computer-assisted research is that similar instances be encoded in predictably similar ways" (Edwards 1992: 139). For the theoretical linguists who set much of the tone of computational linguistics up till the 1980s, this kind of comprehensive explicitness was not a priority. Syntactic theorists commonly debated alternative analyses for a limited number of "core" constructions which were seen as having special theoretical importance, trying to establish which analysis of some construction is "psychologically real" for native speakers of the language in question.
They saw no reason to take a view on the analysis of the many other constructions which happen not to be topics of theoretical controversy (and, because they invented their examples, they could leave most of those other constructions out of view). Language engineering based on real-life usage, on the other hand, cannot pick and choose the aspects of language structure on which it focuses - it has to deal with everything that comes along. For us the aim is not to ascertain what structural analysis corresponds to the way language is organized in speakers' minds - we have no way of knowing that; we just need some reliable, practical way of registering the full range of data in a consistent manner.

Often, so far as I can see, various different analyses of some usage would each be perfectly reasonable; our task is not to ask which analysis is "right", but to choose one of the analyses (at random, if necessary) and to make explicit that this is the analysis we have chosen, so that future examples of the same construction will be annotated in the same way, and statistics extracted from a treebank will not make the mistake of adding apples and pears. Consider, for example, the construction exemplified in the more, the merrier - the construction that translates into German with je and desto. Here are three ways of grouping a sentence using that construction into constituents:

[ [ the wider the wheelbase is ], [ the more satisfactory is the performance ] ]
[ [ the wider the wheelbase is ], the more satisfactory is the performance ]
[ [ [ the wider the wheelbase is ], the more satisfactory ] is the performance ]

The two clauses might be seen as co-ordinated (as in the first line), since both have the form of main clauses and neither of them contains an explicit subordinating element. Or the second clause might be seen as the main clause, with the first as an adverbial clause adjunct. Or the first clause might be seen as a modifier of the adjectival predicate within the second clause. There seems to be no strong reason to choose one of these analyses rather than another; what matters, if we are to produce meaningful statistics for use in language engineering, is to settle on one of the analyses and stick to it.

Theoretical linguists are sometimes inclined to look down their noses at this kind of taxonomic exercise as having little intellectual, scientific substance. Linguists of the Chomskyan, generative school have in the past been quite dismissive of taxonomic approaches4. But I do not see how the theoreticians can hope to make real progress in their own work, without a solid foundation of grammatical taxonomy to catalogue and classify the data which their theories ought to explain. In the comparable domain of natural history, it was two centuries after the taxonomic work of John Ray, and a century and a half after that of Linnaeus, before theoretical biology was able to develop as a substantial discipline in its own right in the late nineteenth century5. From a theoretical point of view the Linnaean system was somewhat "unnatural" (and was known from the start to be so), but it provided a practical, usable conspectus of an immensely complex world of data; without it, theoretical biology could not have got off the ground. The generative linguists' idea that they can move straight to theorizing, bypassing the painstaking work of taxonomy altogether, is one that I find baffling.
In any case it is clear that language engineering, for which theories about psycholinguistic mechanisms are irrelevant, badly needs a comprehensive groundwork of data collection and classification. One explanation for some linguists' lack of interest in the taxonomic approach may be their belief that natural-language structure is much simpler than it superficially appears to be. Underlying the diversity of observable linguistic "performance", it is suggested, there lurks a mental grammatical "competence" which is the proper object of scientific linguistic study, and which can be defined in terms of a limited number of rules, to a large extent genetically inherited and common to all natural languages. As "competence" systems, the grammars of natural languages such as English or Czech might be hardly more complex than those of programming languages such as Pascal or Java6. If that were so, one might well see little need for intensive data-systematizing activity.

But the idea that a natural language is governed by a limited set of clear-cut grammatical rules does not survive the experience of structurally annotating real-life examples. One way I sometimes express this point is to say that if someone cares to identify a rule of English grammar which they regard as reliable, I would expect to be able to find in English corpus data an example that breaks the rule, not as a momentary error of "performance" but within wording that appears to be intended by the writer or speaker and which "works" as an act of communication. Thus, one rule which grammarians of English might take to be as reliable as any is the rule that reflexive pronouns, being co-referential with earlier elements in the same clause, can never occur as clause subjects. Yet here is an example found in the LOB Corpus, taken originally from a published magazine article on the Cold War:

Each side proceeds on the assumption that itself loves peace, but the other side consists of warmongers.

This use of itself is an absolutely direct violation of the rule; yet the sentence seems perfectly successful in making its point. In case anyone should suspect that the example was produced by a non-native speaker whose English was poor, the author was in fact Bertrand Russell, one of the leading English intellectuals of the twentieth century; and his very next sentence, not quoted here, has another example of the same construction, making it quite clear that this use of reflexive pronouns was no careless "performance error" but a studied effect. The truth is that rules of natural-language grammar are not binding laws like the laws of physics or chemistry. They are flexible guidelines, to which users of a language tend to conform but which they are free to adapt or disregard. This makes the task of categorizing real-life usage much more challenging than it would be if natural languages were like programming languages with a few irregularities added, but it also means that the task cannot be bypassed via aprioristic theorizing.

3. EXPLOITING THE SUSANNE TREEBANK

Despite the low esteem in which theoretical linguists held taxonomic work, I soon found that even a small-scale English treebank yielded new scientific findings, sometimes findings that contradicted conventional linguistic wisdom.
For instance, introductory textbooks of linguistics very commonly suggest that the two most basic English sentence types are the types "subject - transitive verb - object" and "subject - intransitive verb". Here, for instance, are the examples quoted by Victoria Fromkin and Robert Rodman in An Introduction to Language to illustrate the two first and simplest structures diagrammed in their section on sentence structure (Fromkin & Rodman 1983: 207-9):

the child found the puppy
the lazy child slept

Looking at statistics on clause structure in the treebank I developed at Lancaster, though, I found that this is misleading (Sampson 1987a: 90). "Subject - transitive verb - object" is a common sentence type, but sentences of the form "subject - intransitive verb" are strikingly infrequent in English. If the sentence has no noun phrase object to follow the verb, it almost always includes some other constituent, for instance an adverbial element or a clause complement, in post-verb position. The lazy child slept may be acceptable in English, but it could be called a "basic" type of English sentence only in some very un-obvious sense of the word "basic".

Going a little deeper into technicalities, I was able to use the SUSANNE treebank to shed new light on an idea that has been known to most linguists since it was first discussed by Victor Yngve in 1960. Yngve argued that there is a constraint on the incidence of "left branching" in English parse trees7. Suppose, for instance, that we accept FIG. 2.2 (after Yngve 1960: 462) as an appropriate constituency diagram for the complex noun phrase as good a young man for the job as you will ever find: it is noticeable that the branches stretch down from the root much further in the "south-easterly" than the "south-westerly" direction. Yngve measured the left-branching factor of individual words numerically: for instance, young in FIG. 2.2 would be assigned the number three, because three of the four branches linking that word to the root are other than the rightmost branch below their respective mother nodes; and he suggested that there might be some fixed maximum, perhaps imposed by our mental language-processing machinery, such that grammatical structures are allowed to "grow" up to that maximum degree of left branching but not beyond it (whereas right branching is unlimited).

Figure 2.2. Tree for "as good a young man for the job as you will ever find"

There is no doubt that Yngve was right to identify an asymmetry in English grammatical structure: the broadly N.W.-to-S.E. trend of parse trees is very noticeable in any English treebank. An investigation of the SUSANNE treebank, though, showed that the precise generalization was rather different from what Yngve supposed (Sampson 1997b). English does not have an absolute limit on the global left-branchingness of parse trees as wholes. It has a local and statistical constraint on the left-branchingness of individual tree nodes: that is, there is a fixed probability for the expansion of any nonterminal symbol to contain another nonterminal symbol as other than the rightmost daughter, and, because this probability is low, in practice trees do not "grow" very far in the south-westerly direction. This was a rather pleasing finding: the fact that the constraint is local rather than global represents good news for computational tractability.
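As a concrete illustration of Yngve's per-word measure, the following short Python sketch (our own rendering, not code from any of the projects described here; the nested-list tree encoding is assumed purely for the example) counts, for each word, how many branches on its path from the root are other than the rightmost branch below their mother node.

def left_branching_depths(tree, depth=0, out=None):
    # tree is a nested list: [label, child, child, ...]; words are strings
    if out is None:
        out = []
    children = tree[1:]
    last = len(children) - 1
    for i, child in enumerate(children):
        extra = 0 if i == last else 1   # a non-rightmost branch adds one
        if isinstance(child, str):
            out.append((child, depth + extra))
        else:
            left_branching_depths(child, depth + extra, out)
    return out

# "the lazy child slept" encoded as [S [NP the lazy child] [VP slept]]
tree = ["S", ["NP", "the", "lazy", "child"], ["VP", "slept"]]
print(left_branching_depths(tree))
# [('the', 2), ('lazy', 2), ('child', 1), ('slept', 0)]

Computed across a whole treebank, the distribution of such counts is what separates Yngve's proposed absolute ceiling from the local, probabilistic constraint on individual node expansions described above.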
Although Yngve's hypothesis had been a standard part of linguists' education for more than thirty years, there was no way of checking the facts before treebanks became available. The statistics on left branching reveal a surprisingly precise kind of order underlying the diversity of natural-language constructions. Another kind of statistical investigation carried out on my original Lancaster treebank suggested a measure of precision in the extent of that diversity itself. Richard Sharman (former Director of SRI Cambridge) has likened natural languages to fractal objects, which continue to display new detail no matter at what scale they are examined. An investigation I carried out on the frequencies of constructions in the Lancaster treebank made that analogy seem rather exact8. I took a high-frequency syntactic category, the noun phrase, and counted the frequencies of the various expansions of that category in the treebank - for instance, the commonest expansion of the category "noun phrase" (in terms of the coarse vocabulary of grammatical categories which I used for this purpose) was "determiner + singular noun", which accounted for about 14% of all noun phrases in the data. But there were more than 700 other noun-phrase types which occurred at lower frequencies - many types each occurred just once in the treebank. It turned out that there was a regular relationship between construction frequencies and the number of different constructions occurring at a given frequency. As one looks at lower and lower frequencies, more and more different constructions occur at those frequencies, with the consequence that quite large proportions of text include constructions which are individually rare. Specifically: if m is the frequency of the commonest single construction (in my data, m was about 28 per thousand words) and f is the relative frequency of some construction (fm is its absolute frequency), then the proportion of all construction-tokens which represent construction-types of relative frequency less than or equal to f is about f^0.4. This finding contradicts the picture of a natural language as containing a limited number of "competent" grammatical structures, which in practice are surrounded by a penumbra of more or less random, one-off "performance errors". If that picture were correct, one would expect to find construction frequencies distributed bimodally, with competent constructions occurring at reasonably high frequencies, individual performance errors each occurring at a low frequency, and not much in between. My data were not like that; constructions were distributed smoothly along the scale of frequencies, with no striking gaps9. Consider what the mathematical relationship I have quoted would mean, if it continued to hold true in much larger data sets than I was in a position to investigate. (I cannot know that it does, but as far as they went