Abstractive text summarization for Hungarian
In: Annales mathematicae et informaticae: international journal for mathematics and computer science, Volume 53, pp. 299-316
ISSN: 1787-6117
An automatic text summary is a representation of a document that captures its essence or main focus. Here, text summarization is performed automatically using the extraction method, which builds a summary by copying the sentences considered most important or most informative from the source text [1]. Documents can be divided into two types, namely single documents and multi-documents; a multi-document input comes from many documents from one or more sources and contains more than one main idea. This study aims to summarize text using a Genetic Algorithm that attends to the text features extracted for each chromosome. The features used are sentence position, positive keywords, negative keywords, similarity between sentences, sentences containing entity words, sentences containing numbers, sentence length, connections between sentences, and the number of connections between sentences. The number of chromosomes used is half the number of public complaints. The data used are public complaints against the DIY government from February 2018 to July 2020, obtained from the e-lapor DIY website. From the test results, the average Precision is 1, the Recall is 0.71, and the F-measure is 0.79.
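As an illustration of the approach this abstract describes, the following is a minimal sketch of a genetic algorithm that evolves one weight per sentence feature; the fitness function, feature encoding, and GA parameters are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch of GA-based extractive summarization: each chromosome is a
# vector of feature weights; fitness rewards overlap with a reference extract.
# All parameters and the fitness definition are illustrative assumptions.
import random

FEATURES = ["position", "pos_keywords", "neg_keywords", "similarity",
            "entities", "numbers", "length", "connections", "n_connections"]

def score_sentence(feature_values, weights):
    """Weighted sum of a sentence's feature values."""
    return sum(w * v for w, v in zip(weights, feature_values))

def fitness(weights, sentences, reference_idx, k=3):
    """Fraction of the top-k ranked sentences that appear in the reference."""
    ranked = sorted(range(len(sentences)),
                    key=lambda i: score_sentence(sentences[i], weights),
                    reverse=True)
    return len(set(ranked[:k]) & set(reference_idx)) / k

def evolve(sentences, reference_idx, pop_size=20, generations=50):
    pop = [[random.random() for _ in FEATURES] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda w: fitness(w, sentences, reference_idx),
                 reverse=True)
        survivors = pop[: pop_size // 2]              # elitist selection
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = random.sample(survivors, 2)
            cut = random.randrange(1, len(FEATURES))  # one-point crossover
            child = a[:cut] + b[cut:]
            if random.random() < 0.1:                 # mutation
                child[random.randrange(len(FEATURES))] = random.random()
            children.append(child)
        pop = survivors + children
    return pop[0]  # best weight vector found
```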
BASE
In: http://hdl.handle.net/10919/86357
When browsing social media websites such as Twitter and Facebook, have you ever seen hashtags like #NeverAgain and #EnoughIsEnough? Do you know what they mean? Never Again is an American student-led political movement for gun control to prevent gun violence. In the United States, gun control has long been debated. According to data from the Gun Violence Archive (http://www.shootingtracker.com/), in 2017 the U.S. saw a total of 346 mass shootings. Supporters claim that the proliferation of firearms is the direct spark of a series of social-unrest factors such as robbery, sexual crimes, and theft, while others believe gun culture represents an integral part of their freedom. For the Never Again gun control movement, we would like to generate a human-readable summary based on deep learning methods so that one can study incidents of gun violence that shocked the world, such as the 2017 Las Vegas shooting, in order to figure out the impact of gun proliferation. Our project includes three steps: pre-processing, topic modeling, and abstractive summarization using deep learning.

We began with a large collection of news articles associated with the #NeverAgain movement. The raw news articles needed to be pre-processed in multiple ways. An ArchiveSpark script was used to convert the WARC and CDX files to readable, parseable JSON. However, we found that at least forty percent of the data was noise, so a series of restrictive word filters was applied to remove it. After noise removal, we identified the most frequent words to get a preliminary idea of whether we were filtering noise properly. We used the Natural Language Toolkit (NLTK) named-entity chunker to generate named entities, which are phrases that form important nouns (people, places, organizations, etc.) in a sentence.

For topic modeling, we classified sentences into different buckets or topics, which identified distinct themes in the collection. While performing dictionary creation and document vectorization, the Latent Dirichlet Allocation algorithm (used for topic modeling) could not take the normalized and tokenized word corpus directly; it had to be converted into a vector for each article in the collection. We chose the Bag of Words (BOW) approach, a simplifying representation used in natural language processing and information retrieval in which text (such as a sentence or a document) is represented as the bag (multiset) of its words, disregarding grammar and even word order but keeping multiplicity. For topic modeling we also needed to choose the number of topics, which means one must guess how many topics are present in the collection. There is no foolproof way of replacing human logic to weave keywords into topics with semantic meaning, so we tried the coherence-score approach: the coherence score is an attempt to mimic the human readability of a topic, and the higher the coherence score, the more coherent the topics are considered. The last step for topic modeling is Latent Dirichlet Allocation (LDA), a generative statistical model that allows sets of observations to be explained by unobserved groups that explain why some parts of the data are similar. Compared with some other algorithms, LDA is probabilistic, which means it is better at handling topic mixtures in different documents; in addition, LDA identifies topics coherently, whereas the topics from other algorithms are more disjoint.
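The BOW vectorization, coherence-based topic-count selection, and LDA fitting described above can be reproduced with the gensim library; the following is a minimal sketch in which the toy corpus and parameter values are illustrative.

```python
# Minimal sketch of the BOW + LDA + coherence pipeline described above,
# using gensim; the toy corpus and parameter values are illustrative.
from gensim import corpora
from gensim.models import LdaModel, CoherenceModel

# Tokenized, normalized articles (toy examples).
texts = [["gun", "control", "movement", "student"],
         ["shooting", "las", "vegas", "victims"],
         ["legislation", "senate", "gun", "law"]]

dictionary = corpora.Dictionary(texts)               # word <-> id mapping
bow_corpus = [dictionary.doc2bow(t) for t in texts]  # Bag-of-Words vectors

# Try several topic counts and keep the model with the highest coherence.
best = None
for k in range(2, 6):
    lda = LdaModel(bow_corpus, num_topics=k, id2word=dictionary, passes=10)
    score = CoherenceModel(model=lda, texts=texts, dictionary=dictionary,
                           coherence="c_v").get_coherence()
    if best is None or score > best[0]:
        best = (score, k, lda)

print(f"best coherence {best[0]:.3f} with {best[1]} topics")
```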
After we had our topics (three in total), we filtered the article collection based on them. The result was three distinct collections of articles to which we could apply an abstractive summarization algorithm to produce a coherent summary. We chose a Pointer-Generator Network (PGN), a deep learning approach designed to create abstractive summaries (a minimal sketch of the pointer mechanism appears after the file list below). We created a summary for each identified topic and performed post-processing to produce one summary that connected the three related topics into a summary that flowed. The result was a summary that reflected the main themes of the article collection and informed the reader of its contents in less than two pages. ; NSF IIS-1619028 ; Description of files of this collection:
- NeverAgain_report_in_PDF_format.pdf: the final report of the project in PDF format.
- NeverAgain_Report_Latex_Material.zip: a zip file containing the source material of the LaTeX version of the final report.
- NeverAgain_ZIP_file_of_source_code.zip: a zip file containing all the source code of the project.
- NeverAgain_presentation_in_powerpoint: the final presentation in Microsoft PowerPoint format.
- NeverAgain_presentation_in_pdf: the final presentation in PDF format.
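The following is a minimal numpy sketch of the pointer-generator output distribution (after See et al., 2017), in which a learned generation probability p_gen blends a vocabulary distribution with an attention-based copy distribution; all values here are toy stand-ins, not the project's trained model.

```python
# Sketch of the pointer-generator output distribution (See et al., 2017):
# P(w) = p_gen * P_vocab(w) + (1 - p_gen) * sum of attention on source
# positions holding w. All values below are toy stand-ins; a real PGN
# learns p_gen from context and extends the vocabulary with source OOVs.
import numpy as np

vocab_size, src_len = 10, 4
p_gen = 0.7                                            # learned in a real PGN
vocab_dist = np.random.dirichlet(np.ones(vocab_size))  # P_vocab(w)
attention = np.random.dirichlet(np.ones(src_len))      # attention over source
src_ids = np.array([2, 5, 5, 9])                       # vocab ids of source tokens

copy_dist = np.zeros(vocab_size)
np.add.at(copy_dist, src_ids, attention)  # scatter attention onto vocab ids

final_dist = p_gen * vocab_dist + (1 - p_gen) * copy_dist
assert np.isclose(final_dist.sum(), 1.0)  # still a valid distribution
```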
BASE
This paper presents a detailed analysis of the use of crowdsourcing services for the text summarization task in the tourist domain. In particular, our aim is to retrieve relevant information about a place or an object pictured in an image in order to provide a short summary that will be of great help to a tourist. For tackling this task, we proposed a broad set of experiments using crowdsourcing services that could be useful as a reference for others who also want to rely on crowdsourcing. From the analysis carried out through our experimental setup and the results obtained, we conclude that although crowdsourcing services were not well suited to simply gathering gold-standard summaries (i.e., from the results obtained for experiments 1, 2 and 4), the encouraging results obtained in the third and sixth experiments lead us to strongly believe that they can be successfully employed for finding some of the patterns of behaviour humans show when generating summaries, and for validating and checking other tasks. Furthermore, this analysis serves as a guideline for the types of experiments that might or might not work when using crowdsourcing in the context of text summarization. ; This work was supported by the EU-funded TRIPOD project (IST-FP6-045335) and by the Spanish Government through the FPU program and the projects TIN2009-14659-C03-01, TSI 020312-2009-44, and TIN2009-13391-C04-01; and by Conselleria d'Educació–Generalitat Valenciana (grant no. PROMETEO/2009/119 and grant no. ACOMP/2010/286).
BASE
In: Journal of Asian scientific research, Volume 5, Issue 1, pp. 1-15
ISSN: 2223-1331
In: Asian journal of research in social sciences and humanities: AJRSH, Volume 6, Issue 7, p. 29
ISSN: 2249-7315
Automatic Text Summarization has been shown to be useful for Natural Language Processing tasks such as Question Answering or Text Classification, and for related fields of computer science such as Information Retrieval. Since Geographical Information Retrieval can be considered an extension of the Information Retrieval field, the generation of summaries could be integrated into these systems as an intermediate stage whose purpose is to reduce document length. In this manner, the access time for information searching is improved, while relevant documents are still retrieved. Therefore, in this paper we propose the generation of two types of summaries (generic and geographical) at several compression rates in order to evaluate their effectiveness in the Geographical Information Retrieval task. The evaluation has been carried out using GeoCLEF as the evaluation framework and following an Information Retrieval perspective, without considering the geo-reranking phase commonly used in these systems. Although single-document summarization did not perform well in general, the slight improvements obtained for some types of the proposed summaries, particularly those based on geographical information, lead us to believe that the integration of Text Summarization with Geographical Information Retrieval may be beneficial; consequently, the experimental set-up developed in this research work serves as a basis for further investigations in this field. ; This work has been partially funded by the European Commission under the Seventh (FP7-2007-2013) Framework Programme for Research and Technological Development through the FIRST project (FP7-287607). It has also been partially supported by a grant from the Fondo Europeo de Desarrollo Regional (FEDER), projects TEXT-MESS 2.0 (TIN2009-13391-C04-01) and TEXT-COOL 2.0 (TIN2009-13391-C04-02) from the Spanish Government, a Grant from the Valencian Government, project "Desarrollo de Técnicas Inteligentes e Interactivas de Minería de Textos" (PROMETEO/2009/119), and a Grant No. ACOMP/2011/001.
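As an illustration, producing summaries at several compression rates amounts to retaining the top-scoring fraction of sentences; the sketch below uses a simple term-frequency score as a stand-in for the paper's actual sentence-scoring method.

```python
# Minimal sketch of compression-rate-based extractive summarization:
# score sentences by average term frequency and keep the top fraction.
# The scoring function is an illustrative stand-in, not the paper's method.
from collections import Counter

def summarize(sentences, rate=0.3):
    words = [w.lower() for s in sentences for w in s.split()]
    tf = Counter(words)

    def score(s):
        toks = s.split()
        return sum(tf[w.lower()] for w in toks) / max(len(toks), 1)

    n_keep = max(1, int(len(sentences) * rate))   # the compression rate
    top = sorted(range(len(sentences)),
                 key=lambda i: score(sentences[i]), reverse=True)[:n_keep]
    return " ".join(sentences[i] for i in sorted(top))  # original order
```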
BASE
In: Iraqi journal of science, Volume 59, Issue 4B
ISSN: 0067-2904
In: Iraqi journal of science, Volume 58, Issue 3A
ISSN: 0067-2904
Nowadays, automatic text summarization is a highly relevant task in many contexts. In particular, query-focused summarization consists of generating a summary from one or multiple documents according to a query given by the user. Additionally, sentiment analysis and opinion mining analyze the polarity of the opinions contained in texts. These two issues are integrated in an approach that produces an opinionated summary according to the user's query. Thereby, the query-focused sentiment-oriented extractive multi-document text summarization problem entails the optimization of different criteria, specifically query relevance, redundancy reduction, and sentiment relevance. An adaptation of the metaheuristic population-based crow search algorithm has been designed, implemented, and tested to solve this multi-objective problem. Experiments have been carried out using datasets from the Text Analysis Conference (TAC). Recall-Oriented Understudy for Gisting Evaluation (ROUGE) metrics and the Pearson correlation coefficient have been used for the performance assessment. The results show that the proposed approach outperforms the existing methods in the scientific literature, with a percentage improvement of 75.5% for the ROUGE-1 score and 441.3% for the ROUGE-2 score. A Pearson correlation coefficient of +0.841 was also obtained, indicating a strong positive linear correlation between the sentiment scores of the generated summaries and the sentiment scores of the topic queries. ; Work supported by Ministerio de Ciencia, Innovación y Universidades - Spain and Agencia Estatal de Investigación - Spain (Projects PID2019-107299GB-I00/AEI/10.13039/501100011033 and MTM2017-86875-C3-2-R), Junta de Extremadura - Spain (Projects GR18090 and GR18108), and the European Union (European Regional Development Fund). Jesus M. Sanchez-Gomez is supported by Junta de Extremadura, Spain, and the European Union (European Social Fund) under the doctoral fellowship PD18057. ; peerReviewed
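The three optimization criteria named above can be sketched as a vector-valued objective over a candidate extract; the sentence vectors and sentiment scorer below are illustrative stand-ins, and the crow search optimizer itself is omitted.

```python
# Sketch of the three objectives named above for a candidate extract:
# query relevance, redundancy reduction, and sentiment relevance.
# Sentence vectors and sentiment scores are illustrative stand-ins;
# the crow search optimizer that explores candidate extracts is omitted.
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def objectives(selected_vecs, query_vec, sentiments, query_sentiment):
    # Query relevance: mean similarity of selected sentences to the query.
    relevance = np.mean([cosine(v, query_vec) for v in selected_vecs])
    # Redundancy: mean pairwise similarity among selected sentences.
    pairs = [cosine(a, b) for i, a in enumerate(selected_vecs)
             for b in selected_vecs[i + 1:]]
    redundancy = np.mean(pairs) if pairs else 0.0
    # Sentiment relevance: closeness of summary polarity to query polarity.
    sentiment_rel = 1.0 - abs(np.mean(sentiments) - query_sentiment)
    return relevance, 1.0 - redundancy, sentiment_rel  # all maximized
```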
BASE
In recent years, Twitter has become one of the most important microblogging services of the Web 2.0. Among its possible uses, it can be employed for communicating and broadcasting information in real time. The goal of this research is to analyze the task of automatic tweet generation from a text summarization perspective in the context of the journalism genre. To achieve this, different state-of-the-art summarizers are selected and employed for producing multilingual tweets in two languages (English and Spanish). A wide experimental framework is proposed, comprising the creation of a new corpus, the generation of the automatic tweets, and their assessment through a quantitative and a qualitative evaluation, where informativeness, indicativeness, and interest are key criteria that should be ensured in the proposed context. From the results obtained, it was observed that although the original tweets were considered model tweets with respect to their informativeness, they were not among the most interesting ones from a human viewpoint. Therefore, relying only on these tweets may not be the ideal way to communicate news through Twitter, especially if a more personalized and catchy way of reporting news is desired. In contrast, we showed that recent text summarization techniques may be more appropriate, reflecting a balance between indicativeness and interest, even if their content differed from the tweets delivered by the news providers. ; This research work has been partially funded by the Spanish Government (Ministerio de Economía y Competitividad) through the project "Técnicas de Deconstrucción en la Tecnologías del Lenguaje Humano" (TIN2012–31224), and by the Valencian Government through projects PROMETEO (PROMETEO/2009/199) and ACOMP/2011/001.
BASE
Text summarization is a technique by which a large source text is condensed into a smaller version without changing its essential meaning. Text summarization has typically been applied to common foreign and regional languages, but little work has been observed for Marathi. As the amount of e-content on the web increases drastically, users find it difficult to read newspaper articles and to extract and sort their different perspectives. We focus on educational, political, and sports news for summarization, which will be helpful for students preparing for competitive exams. This paper explores pre-processing techniques for Marathi e-news articles.
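As an illustration of such pre-processing, the sketch below covers sentence splitting, tokenization, and stopword removal for Marathi text; the stopword list is a tiny illustrative sample, not a complete resource.

```python
# Minimal sketch of pre-processing for Marathi e-news articles:
# sentence splitting, tokenization, and stopword removal.
# The stopword set is a tiny illustrative sample, not a full resource.
import re

MARATHI_STOPWORDS = {"आणि", "आहे", "या", "तो", "ती", "ते"}  # illustrative sample

def preprocess(text):
    # Marathi uses the danda (।) as well as '.', '?' and '!' to end sentences.
    sentences = re.split(r"[।.?!]\s*", text)
    tokens = [[w for w in s.split() if w not in MARATHI_STOPWORDS]
              for s in sentences if s.strip()]
    return tokens  # one token list per sentence
```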
BASE
SSRN
Working paper
This paper tackles a hard optimization problem of computational linguistics, specifically automatic multi-document text summarization, using grid computing. The main challenge of multi-document summarization is to extract the most relevant and unique information effectively and efficiently from a set of topic-related documents, constrained to a specified length. In the Big Data/Text era, where information increases exponentially, optimization becomes essential to selecting the most representative sentences for generating the best summaries. Therefore, a data-driven summarization model is proposed and optimized during a run of Differential Evolution (DE). Different DE runs are distributed to a grid in parallel as optimization tasks, seeking high processing throughput despite the demanding complexity of the linguistic model, especially on longer multi-documents, where DE improves results given more iterations. Namely, parallelization and the grid enable running several independent DE runs at the same time within a fixed real-time budget. This approach improves a Document Understanding Conference (DUC) benchmark recall metric over a previous setting. ; This paper is based upon work from COST Action IC1406 High-Performance Modelling and Simulation for Big Data Applications (cHiPSet), supported by COST (European Cooperation in Science and Technology). This paper is also based upon work from COST Actions CA15140 "Improving Applicability of Nature-Inspired Optimisation by Joining Theory and Practice (ImAppNIO)" and CA18231 "Multi3Generation: Multi-task, Multilingual, Multi-modal Language Generation", both supported by COST. The author AZ acknowledges the financial support from the Slovenian Research Agency (Research Core Funding No. P2-0041). AZ also acknowledges EU support under Project No. 5442-24/2017/6 (HPC – RIVR), the EU Interreg Alpine Space project SmartVillages, and an Erasmus TSM grant. The author EL acknowledges the financial support of the Generalitat Valenciana through the Research Project PROMETEU/2018/089, and of the Spanish Government through the INTEGER project (RTI2018-094649-B-I00) and the network RED iGLN (TIN2017-90773-REDT).
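Distributing independent DE runs over a grid can be approximated locally with a process pool; the sketch below uses scipy's differential_evolution over continuous sentence weights, with a toy fitness function standing in for the paper's data-driven summarization model.

```python
# Minimal sketch: several independent Differential Evolution runs launched
# in parallel, each optimizing continuous sentence weights; the fitness is
# an illustrative stand-in for the paper's data-driven summarization model.
import numpy as np
from concurrent.futures import ProcessPoolExecutor
from scipy.optimize import differential_evolution

N_SENT = 30  # sentences in the multi-document cluster (toy size)

def fitness(weights):
    # Stand-in objective: prefer a few strongly weighted sentences.
    selected = weights > 0.5
    coverage = weights[selected].sum()        # stand-in for relevance
    length_penalty = abs(selected.sum() - 5)  # target roughly 5 sentences
    return -(coverage - length_penalty)       # DE minimizes

def one_run(seed):
    result = differential_evolution(fitness, bounds=[(0, 1)] * N_SENT,
                                    seed=seed, maxiter=200, polish=False)
    return result.fun, result.x

if __name__ == "__main__":
    with ProcessPoolExecutor() as pool:
        runs = list(pool.map(one_run, range(8)))  # 8 independent DE runs
    best_score, best_weights = min(runs, key=lambda r: r[0])
    print("best objective:", -best_score)
```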
BASE
The Web 2.0 has resulted in a shift in how users consume and interact with information, and has introduced a wide range of new textual genres, such as reviews or microblogs, through which users communicate, exchange, and share opinions. The exploitation of all this user-generated content is of great value both for users and for companies, to assist them in their decision-making processes. Given this context, the analysis and development of automatic methods that can help manage online information more quickly are needed. Therefore, this article proposes and evaluates a novel concept-level approach to ultra-concise abstractive opinion summarization. Our approach is characterized by the integration of syntactic sentence simplification, sentence regeneration, and internal concept representation into the summarization process, thus being able to generate abstractive summaries, which is one of the most challenging issues for this task. In order to analyze different settings for our approach, the use of the sentence regeneration module was made optional, leading to two versions of the system (one with sentence regeneration and one without). For testing them, a corpus of 400 English texts, gathered from reviews and tweets belonging to two different domains, was used. Although both versions were shown to be reliable methods for generating this type of summary, the results obtained indicate that the version without sentence regeneration yielded better results, improving on a number of state-of-the-art systems by 9%, whereas the version with sentence regeneration proved more robust to noisy data. ; This research work has been partially funded by the University of Alicante, Generalitat Valenciana, the Spanish Government, and the European Commission through the projects "Tratamiento inteligente de la información para la ayuda a la toma de decisiones" (GRE12-44), "Explotación y tratamiento de la información disponible en Internet para la anotación y generación de textos adaptados al usuario" (GRE13-15), DIIM2.0 (PROMETEOII/2014/001), ATTOS (TIN2012-38536-C03-03), LEGOLANGUAGE (TIN2012-31224), SAM (FP7-611312), and FIRST (FP7-287607).
BASE