In: Jensen, K. E. 2014, 'Linguistics and the digital humanities: (Computational) corpus linguistics', MedieKultur, vol. 30, no. 57, pp. 117-136.
Corpus linguistics has been closely intertwined with digital technology since the introduction of university computer mainframes in the 1960s. Making use both of digitized data in the form of the language corpus and of computational methods of analysis involving concordancers and statistics software, corpus linguistics arguably has a place in the digital humanities. Still, it remains obscure and figures only sporadically in the literature on the digital humanities. This article provides an overview of the main principles of corpus linguistics and the role of computer technology in relation to data and method. It also offers a bird's-eye view of the history of corpus linguistics, focusing on its intimate relationship with digital technology and on how digital technology has impacted the very core of corpus linguistics and shaped the identity of the corpus linguist. Ultimately, the article argues for an acknowledgment of corpus linguistics' alignment with the digital humanities.
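The concordancer mentioned in the abstract is the corpus linguist's basic instrument: it lists every occurrence of a keyword together with its immediate context. As a minimal sketch (the toy corpus and the four-token window are illustrative assumptions, not from the article), a keyword-in-context (KWIC) lookup can be written as:

```python
# A minimal keyword-in-context (KWIC) concordancer: for each occurrence of
# the keyword, collect a window of tokens on either side. The corpus and
# window size below are invented for illustration.

def kwic(tokens, keyword, window=4):
    """Return (left context, keyword, right context) triples."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            hits.append((left, tok, right))
    return hits

corpus = "The corpus was compiled by hand before the corpus could be searched".split()
for left, kw, right in kwic(corpus, "corpus"):
    print(f"{left:>25} | {kw} | {right}")
```

Real concordancers add sorting of the context columns and regular-expression queries, but the core operation is this windowed lookup.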
The Mathematics of Language (MoL) special interest group traces its origins to a meeting held in October 1984 at Ann Arbor, Michigan. While MoL is among the oldest SIGs of the ACL, this is the first time that the proceedings are produced by our parent organization. The first volume was published by Benjamins; later ones appeared as special issues of the Annals of Mathematics and Artificial Intelligence and of Linguistics and Philosophy, and for the last three occasions (really six years, since MoL only meets every second year) we relied on the Springer LNCS series. Perhaps the main reason for this aloofness was that the past three decades have brought the ascendancy of statistical methods in computational linguistics, with the formal, grammar-based methods that were the mainstay of mathematical linguistics viewed with increasing suspicion. To make matters worse, the harsh anti-formal rhetoric of leading linguists relegated important attempts at formalizing Government-Binding and later Minimalist theory to the fringes of syntax. Were it not for phonology and morphology, where the incredibly efficient finite state methods pioneered by Kimmo Koskenniemi managed to bridge the gap between computational practice and linguistic theory, and were it not for the realization that the mathematical approach has no alternative in machine learning, MoL could easily have disappeared from the frontier of research. The current volume marks a time when we can begin to see the computational and the theoretical linguistics camps coming together again. The selection of papers, while still strong on phonology (Heinz and Lai, Heinz and Rogers) and morphology (Kornai et al.), extends well to syntax (Hunter and Dyer, Fowlie) and semantics (Clark et al., Fernando). Direct computational concerns such as machine translation (Martzoukos et al.), decoding (Corlett and Penn), and complexity (Berglund et al.) are now clearly seen as belonging to the core focus of the field.
The 10 papers presented in this volume were selected by the Program Committee from 16 submissions. We would like to thank the authors, the members of the Program Committee, and our invited speaker for their contributions to the planning and execution of the workshop, and the ACL conference organizers, especially Aoife Cahill and Qun Liu (workshops), and Roberto Navigli and Jing-Shin Chang (publications) for their significant contributions to the overall management of the workshop and their direction in preparing the publication of the proceedings.
CLARIN is a European Research Infrastructure Consortium (ERIC), which aims at (a) making extensive language-based materials available as primary research data to the humanities and social sciences (HSS); and (b) offering state-of-the-art language technology (LT) as an e-research tool for this purpose, positioning CLARIN centrally in what is often referred to as the digital humanities (DH). The Swedish CLARIN node Swe-Clarin was established in 2015 with funding from the Swedish Research Council. In this paper, we describe the composition and activities of Swe-Clarin, which aims at meeting the requirements of all HSS and other researchers whose research involves using text and speech as primary research data, and at spreading awareness of what Swe-Clarin can offer these research communities. We focus on one of the central means for doing this: pilot projects conducted in collaboration between HSS researchers and Swe-Clarin, together formulating a research question whose addressing requires working with large language-based materials. Four such pilot projects are described in more detail, illustrating research on rhetorical history, second-language acquisition, literature, and political science. A common thread to these projects is an aspiration to meet the challenge of conducting research on the basis of very large amounts of textual data in a consistent way without losing sight of the individual cases making up the mass of data, i.e., to be able to move between Moretti's "distant" and "close reading" modes. While the pilot projects clearly make substantial contributions to DH, they also reveal some needs for further development, in particular a need for document-level access to the text materials. As a consequence, work has now been initiated in Swe-Clarin to meet this need, so that Swe-Clarin, together with HSS scholars investigating intricate research questions, can take on the methodological challenges of big-data language-based digital humanities.
We present de-identification and pseudonymization of a learner corpus within the ongoing research infrastructure project SweLL[1]. The main project aim is to make available a linguistically annotated corpus of essays written by second language (L2) learners of Swedish. To ensure that the data collected in the project can be used openly in research while protecting the subjects' integrity, we developed a data handling flow, a set of metadata about the learners, pseudonymization principles for learner texts, and tools in support of pseudonymization. During data collection and storage, the data needs to be handled in a secure way, and the participating subjects must be de-identified in the corpus: common personal identifiers such as names, ages, geographic places, and dates must be identified, masked, and eventually replaced. These identifiers might occur in metadata about the learner and in the learners' text(s). The SweLL project adopted a rather restrictive approach to metadata describing important aspects of each produced text and learner, so that learners are de-identified while still providing information important for research purposes about the learner's gender, age given in 5-year interval spans, total time in Sweden, education level, mother tongue, and languages spoken in various communicative situations. The metadata does not provide the exact date of birth, arrival date in Sweden, country of origin, or nationality of the learner, and no information is given about the educational establishment where the essays have been collected. De-identification through metadata alone might not be satisfactory, since the texts written by a learner may, and in fact often do, contain personal information about the learner. Pseudonymization involves the identification of personal information that can relate to the subject (e.g. My name is Ali), and the classification of that information into certain predefined masking types (e.g. My name is first_name).
As the first step, we manually mark up text segments that reveal personal information in the corpus data. The identified segments are categorized as personal names, institutions (referring to schools, workplaces, sports teams), geographic data (such as country, city, region, area, street name, numbers), transportation types and line names/numbers, age, date, phone number, email address, personal web page, social security number, account number, certificate/license number, profession and education, sensitive information revealing physical or mental disabilities, political views, unique family relations, and any other items not covered by the previous categories. Each marked text string with a category is then replaced in a systematic way to reproduce a "natural" text and increase reading flow. This step includes assigning a unique id-number to each entity within a certain category type, so that if a particular entity is repeated in the text, the same running number is assigned to it and it can be replaced by the same word. We also add morphological information to each masked entity so as to be able to replace it in the same morphological form as the original. There are two ways to mask the sensitive information through substitution: rendering, and replacement with another pre-defined token of the same category. Rendering is applied to information that can be collected from general resource lists, such as personal names and surnames; city and country names; nationalities and languages; geographic names; street names; names of schools, institutions, and workplaces; etc. Replacement applies to strings containing information with certain formatting where general resource lists cannot suffice. Such cases include middle names or initials and numerical information such as phone numbers or dates. In some cases, when the annotator does not know how to categorize a certain text string, the original text is kept but marked by a placeholder.
A distinction is made between objects that need to be replaced because of sensitivity, and objects that might be sensitive but can be replaced or removed later. The pseudonymized corpus is under development, as are the tools supporting the pseudonymization process. We expect the corpus and the tools to be released as open source by the end of 2020. [1] https://spraakbanken.gu.se/eng/swell_infra ; SweLL
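The id-assignment step described above, where each entity gets a running number within its category so that repeated mentions receive the same pseudonym, can be sketched as follows. This is an illustrative sketch only: the category names and replacement lists are invented and are not the SweLL project's actual resources or tool.

```python
# Sketch of consistent pseudonym assignment: the first entity seen in a
# category gets running number 0, the second gets 1, and a repeated entity
# reuses its number, so it is always replaced by the same word. Categories
# without a resource list fall back to a placeholder (cf. "replacement"
# vs. "rendering" in the text). All names below are invented.

RESOURCE_LISTS = {
    "firstname": ["Kim", "Alex", "Sam"],
    "city": ["Lund", "Umeå", "Örebro"],
}

class Pseudonymizer:
    def __init__(self):
        self.ids = {}  # (category, original string) -> running number

    def mask(self, category, original):
        """Return a stable pseudonym for this entity."""
        key = (category, original)
        if key not in self.ids:
            # next running number within this category
            self.ids[key] = sum(1 for k in self.ids if k[0] == category)
        idx = self.ids[key]
        pool = RESOURCE_LISTS.get(category)
        if pool is None:  # no resource list: emit a typed placeholder
            return f"{category}_{idx + 1}"
        return pool[idx % len(pool)]

p = Pseudonymizer()
print(p.mask("firstname", "Ali"))   # first firstname seen
print(p.mask("firstname", "Ali"))   # repeated entity, same pseudonym
print(p.mask("phone", "070-123 45 67"))
```

The morphological step in the text would additionally inflect the chosen replacement to match the original form; that is omitted here.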
In this thesis we present the idea of using parallel phrases for word alignment. Each parallel phrase is extracted from a set of manual word alignments and contains a number of source and target words and their corresponding alignments. If a parallel phrase matches a new sentence pair, its word alignments can be applied to the new sentence. There are several advantages of using phrases for word alignment. First, longer text segments include more context and are more likely to produce correct word alignments than shorter segments or single words. More importantly, the use of longer phrases makes it possible to generalize words in the phrase by replacing words with parts-of-speech or other grammatical information. In this way, the number of words covered by the extracted phrases can go beyond the words and phrases that were present in the original set of manually aligned sentences. We present experiments with phrase-based word alignment on three types of English–Swedish parallel corpora: a software manual, a novel, and proceedings of the European Parliament. In order to find a balance between improved coverage and high alignment accuracy, we investigated different properties of generalized phrases to identify which types of phrases are likely to produce accurate alignments on new data. Finally, we have compared phrase-based word alignments to state-of-the-art statistical alignment with encouraging results. We show that phrase-based word alignments can be used to enhance statistical word alignment. To evaluate word alignments, an English–Swedish reference set for the Europarl corpus was constructed. The guidelines for producing this reference alignment are presented in the thesis.
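The core matching step the abstract describes, a phrase whose items are either literal words or part-of-speech placeholders, matched against a new sentence pair so that its stored links can be transferred, can be sketched as follows. The phrase representation and POS tags below are invented for illustration; they are not the thesis's actual data structures.

```python
# Sketch of phrase-based alignment transfer: a parallel phrase stores a
# source pattern, a target pattern, and the internal word links. A pattern
# item matches either the word itself or its POS tag; POS placeholders let
# the phrase cover words unseen in the manually aligned training data.
# The phrase format and tags are invented for illustration.

def matches(pattern, tokens, pos_tags):
    """True if every pattern item equals the word or its POS tag."""
    return len(pattern) == len(tokens) and all(
        p == w or p == t for p, w, t in zip(pattern, tokens, pos_tags)
    )

def transfer(phrase, src, src_pos, trg, trg_pos):
    """Return the phrase's (src_index, trg_index) links if both sides match."""
    if matches(phrase["src"], src, src_pos) and matches(phrase["trg"], trg, trg_pos):
        return phrase["links"]
    return []

# "NN" generalizes over nouns: the phrase was extracted from one noun but
# now matches any English "the <noun>" paired with a Swedish noun.
phrase = {"src": ["the", "NN"], "trg": ["NN"], "links": [(1, 0)]}

print(transfer(phrase, ["the", "house"], ["DT", "NN"], ["huset"], ["NN"]))
```

A full system would match phrases against spans of longer sentences and combine links from several phrases; this sketch shows only the pattern match and link transfer.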
The success of modern software for natural language processing is impressive. Programs for orthography and grammar correction, information retrieval from document databases, and translation from one natural language into another, among others, are nowadays sold worldwide in millions of copies. This paper considers the relationship of the Arabic language to the computer in language acquisition, the process by which the learner gains the capacity to perceive and comprehend language (in other words, the ability to be aware of language and to understand it), as well as to produce and use words and sentences, in the context of automated communication.
Scholars have long established the importance of the cultural outcomes of social movements in the context of political power and representation. However, they have also acknowledged the methodological difficulties associated with studying cultural outcomes, especially when culture is manifested through linguistic practices. This paper addresses the potential for dealing with movements and culture as manifested in symbols, public discourse, narratives, and rhetoric, and makes two contributions: it links the social movement literature studying culture through language with Natural Language Processing (NLP) techniques for systematic and comprehensive cultural analysis; and it introduces a state-of-the-art method which provides a better understanding of language change and linguistic influence, given the capacity of computational analyses to process large volumes of data for multiple actors and varied data sources over long periods of time. The paper describes the cultural influence of women's organizations in Spain between 2000 and 2020 on issues such as gender inequalities, abortion, gender violence, prostitution, and surrogacy. Tweets and manifestos by women's organizations, as well as national press coverage of women's issues and interventions by MPs in the parliamentary arena, are used to describe the advantages and limitations of the method for the study of cultural outcomes. Computational linguistics provides new possibilities for scholarly research on the cultural outcomes of social movements, but it also shows that these methods should be accompanied by precise definitions of cultural outcomes, detailed and replicable operationalisation processes, and theoretical models that identify the mechanisms explaining the linguistic phenomena that underlie cultural change.
Innovations and Applications of Technology in Language Education is a collection of twelve chapters by an international group of language and linguistics education experts. Although technology in language education is a global interest, its practices should be contextualized. The book covers how language educational technology is currently applied, discusses how it should be applied, and gives directions for its future development. Providing a critical review of respective current practices and perspectives, the book begins by presenting a set of research-based principles for developing second language teachers' professionalism. It then examines the use of technology to enhance students' English language skills. Acknowledging the advantages and disadvantages of AI-mediated communication, the book argues for the use of AI to facilitate communication in language education. It also proposes the use of AI to develop and administer language tests and suggests guidelines for practitioners to deploy AI in developing and administering language tests efficiently. The book concludes by discussing technology for specific purposes in second language education and the potential of computer-mediated communication (CMC) to enhance interaction between students.
Normativity in Language and Linguistics, table of contents (excerpt):

Foreword
Norms and normativity in language and linguistics: Basic concepts and contextualisation
  1. Introduction
  2. Concepts and terms
    2.1 Norms in general
    2.2 Rules and principles: Central features
  3. A historical perspective
  4. The present volume: Outline and contextualisation
  Acknowledgements
  References
Concerning the scope of normativity
  1. Introduction
  2. Generalities
    2.1 Truth as norm
    2.2 On knowledge and belief
    2.3 The dual nature of beliefs
    2.4 Descriptive vs. prescriptive attitude vis-à-vis norms
  3. Semantics
    3.1 Necessary truth as the basis of philosophical/linguistic semantics
    3.2 Necessary truth as an exemplification of normativity
    3.3 Normativity prevails over psychology/cognition
    3.4 Linguistic vs. cognitive semantics
  4. Rational explanation
    4.1 Definition
    4.2 Justification in three different situations
      4.2.1 No laws
      4.2.2 Statistical laws
      4.2.3 Universal (= deterministic) laws
    4.3 Theoretical vs. practical reasoning
      4.3.1 Two inverse types of inference
      4.3.2 Sufficient vs. necessary conclusions of practical reasoning
    4.4 Conclusion
  5. The implicit normativity of everyday life
  6. Epilogue
  References
Norms of language: What kinds and where from? Insights from phenomenology
  1. Introduction
  2. Some basic concepts and insights of phenomenology
    2.1 What is phenomenology?
    2.2 Intentionality and intuition
    2.3 Operative intentionality and embodied intersubjectivity
    2.4 Life world, typification and sedimentation
    2.5 Summary
  3. Itkonen on language norms, accessible by intuitions
    3.1 Norms of correctness and rationality
    3.2 Intuitions and their objects