Open Access · BASE · 2021

Old Catalan Morphosyntax: Developing an Annotated Corpus

Abstract

This paper presents a full procedure for the development of a Part-of-Speech (POS) tagged corpus of Old Catalan. As an extremely low-resource language with rich inflection and frequent homographs, Old Catalan poses non-trivial problems for the development of a searchable constituency-based treebank. We demonstrate, however, that a semi-supervised method of incrementally building training data, using both neural and memory-based taggers together with the Pyrrha annotation tool, is highly efficient and yields accurate results. We propose that this simple and effective method could easily be extended to other low-resource historical languages for which no NLP tools exist yet.

Research that partially facilitated the work presented in this article was funded by the British Academy (PDF grant pf170063) and the Cambridge Humanities Research Grant (tier 1 grant, GANT011262). Additionally, this work has been supported by the French government, through the UCAJEDI Investments in the Future project managed by the National Research Agency (ANR) with the reference number C870A06228 – EOTP : SYVACA – D112.

Publisher

Ubiquity Press; Department of Theoretical and Applied Linguistics; Journal of Open Humanities Data

DOI

10.17863/CAM.79124
