Agenda
Gertjan van Noord
Large-scale Automated Syntactic Analysis of Dutch
In this presentation, we will describe some aspects of the Alpino parser and several recent attempts, not all of them successful, to improve it.
The Alpino parser is a system that assigns syntactic structures to Dutch sentences fully automatically. It is a hybrid system in which a hand-written grammar and a large dictionary are combined with a statistical disambiguation component. The grammar and dictionary contain detailed linguistic rules, which are used to derive syntactic structures for a given sentence. In most cases, these rules allow multiple candidate structures for a sentence because of (unintended) ambiguity. For instance, sentences such as the following can be analysed in several ways:
(1) We luisterden naar de berichten over de oorlog in Irak
(2) We luisterden naar de berichten over de oorlog in de auto
(3) Mannen die vrouwen haten
To identify the intended analysis of a sentence in such cases, Alpino includes a statistical disambiguation component that resolves about 80% of such disambiguation decisions. This component is trained on a set of 10,000 manually verified syntactic analyses (the Alpino treebank).
The disambiguation component also uses co-occurrence information about heads and their dependents, extracted from much larger corpora, to improve disambiguation accuracy (van Noord 2010). This co-occurrence information tells the parser, for instance, which nouns typically occur as the direct object or the subject of a particular verb. This helps to disambiguate sentences such as:
(4) Die topfilm heeft u natuurlijk gezien
(5) Die getuige heeft u natuurlijk gezien
(6) De wijn die Elvis gedronken zou hebben als hij wijn had gedronken
(7) De paus heeft duizend daklozen te eten gehad
(8) De paus heeft twee biefstukken te eten gehad
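The idea of head-dependent co-occurrence preferences can be sketched in a few lines. The sketch below is an illustration only, not Alpino's implementation: it assumes pointwise mutual information as the association measure (the abstract does not name one), and it uses a tiny invented list of (head, relation, dependent) triples in place of counts from a large parsed corpus. The relation labels `su` and `obj1` and the function names `association_scores` and `best_role` are chosen for this example.

```python
import math
from collections import Counter


def association_scores(triples):
    """Pointwise mutual information for (head, relation, dependent) triples.

    Assumption: PMI between the (head, relation) slot and the
    (relation, dependent) filler, estimated from raw counts.
    """
    triples = list(triples)
    total = len(triples)
    triple_counts = Counter(triples)
    head_counts = Counter((h, r) for h, r, _ in triples)
    dep_counts = Counter((r, d) for _, r, d in triples)
    scores = {}
    for (h, r, d), n in triple_counts.items():
        p_joint = n / total
        p_head = head_counts[(h, r)] / total
        p_dep = dep_counts[(r, d)] / total
        scores[(h, r, d)] = math.log(p_joint / (p_head * p_dep))
    return scores


def best_role(scores, head, dependent, roles):
    """Pick the relation with the highest score; unseen triples score -inf."""
    return max(
        roles,
        key=lambda r: scores.get((head, r, dependent), float("-inf")),
    )


# Invented toy data standing in for triples from an automatically parsed corpus.
toy_triples = (
    [("zien", "obj1", "film")] * 3
    + [("zien", "su", "getuige")] * 2
    + [("zien", "obj1", "getuige")] * 1
    + [("drinken", "obj1", "wijn")] * 3
    + [("eten", "obj1", "biefstuk")] * 2
)

scores = association_scores(toy_triples)
film_role = best_role(scores, "zien", "film", ["su", "obj1"])
witness_role = best_role(scores, "zien", "getuige", ["su", "obj1"])
print(film_role, witness_role)
```

With this toy data, "film" is preferred as the object of "zien" and "getuige" as its subject, which is the kind of preference that separates examples (4) and (5). In a real setting such scores would be one feature among many in the disambiguation component rather than a decision rule on their own.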
If time permits, we will describe a recent experiment that adds “word embedding” features to the disambiguation component, building on ideas from Mikolov et al. (2013). We compare word embedding features with the original disambiguation approach, and we also report on the combination of the two techniques.
Gertjan van Noord. Self-trained Bilexical Preferences to Improve Disambiguation Accuracy. In: Harry Bunt, Paola Merlo and Joakim Nivre (editors), Trends in Parsing Technology. Dependency Parsing, Domain Adaptation, and Deep Parsing. Springer Verlag, pp. 183-200, 2010.
Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado and Jeffrey Dean. Distributed Representations of Words and Phrases and their Compositionality. arXiv:1310.4546, 2013.