Weather forecasting using DBSCAN clustering algorithm
In: Annales mathematicae et informaticae: international journal for mathematics and computer science, accepted manuscript
ISSN: 1787-6117
In: Iraqi journal of science, pp. 5572-5580
ISSN: 0067-2904
Clustering is an unsupervised learning method that classifies data according to similarity probabilities. DBScan is a high-quality algorithm for clustering spatial data because it can remove noise (outliers) and construct arbitrarily shaped clusters. However, it is difficult to determine a suitable value for its Eps parameter. This paper proposes a new clustering method, termed DBScanBAT, that optimizes the DBScan algorithm with the BAT algorithm. The proposed method sets the DBScan parameter (Eps) automatically and finds its optimal value. DBScanBAT generates a number of clusters closer to the original number than DBScanPSO and the original DBScan do. Furthermore, the proposed method generates high-quality clusters with minimum entropy [0.2752, 0.4291] on the TR11 and TR12 datasets.
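The entropy figures quoted above measure how mixed the true classes are inside each cluster. As a minimal sketch, assuming the common size-weighted definition (the paper's exact formula, including any normalization, is not given in the abstract), entropy can be computed as follows. The function name `clustering_entropy` and the toy labels are illustrative only.

```python
from collections import Counter
from math import log2


def clustering_entropy(cluster_labels, class_labels):
    """Size-weighted average of per-cluster class entropy (lower means purer).

    Assumption: unnormalized base-2 entropy; some papers divide by
    log2(number of classes), which this sketch does not do.
    """
    n = len(cluster_labels)
    by_cluster = {}
    for cluster, klass in zip(cluster_labels, class_labels):
        by_cluster.setdefault(cluster, []).append(klass)

    total = 0.0
    for members in by_cluster.values():
        size = len(members)
        counts = Counter(members)
        h = -sum((c / size) * log2(c / size) for c in counts.values())
        total += (size / n) * h
    return total


# Toy data: clusters that match the classes exactly vs. clusters that mix them.
pure = clustering_entropy([0, 0, 1, 1], ["a", "a", "b", "b"])
mixed = clustering_entropy([0, 0, 1, 1], ["a", "b", "a", "b"])
```

A clustering whose clusters each contain a single class scores 0, and the score grows as classes mix, which is why lower entropy is read as higher cluster quality.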
SSRN
Abstract: Since Covid-19 was declared a pandemic, the world economic order has been shaken, and Indonesia is no exception. Indonesia's economic growth has continued to contract since the second quarter. Central Java Province ranks third in Indonesia by number of positive cases. The government is trying to improve quality control over the implementation of village funds by observing the classification of village status. The status has been assigned by the Ministry of Villages based on the IDM value. The purpose of this study is to create village status clusters based on the three index values that compose the IDM, namely IKS, IKL, and IKE. This goal is realized through a comparative analysis of two clustering methods, K-means and DBSCAN. The results show that DBSCAN formed 4 clusters, while K-means formed 3 clusters. The silhouette value for each cluster formed by DBSCAN is higher than the silhouette value of the clusters formed by K-means, and it is concluded that DBSCAN is more appropriate than K-means for clustering village status in Central Java Province in 2020.
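The comparison above rests on the silhouette value, which contrasts each point's average distance to its own cluster with its average distance to the nearest other cluster. The following is a minimal sketch in plain Python, assuming Euclidean distance and assuming that DBSCAN noise points (labelled -1 here) are excluded from the score; the study's own tooling and conventions are not stated in the abstract, and the toy points are made up.

```python
from math import dist


def silhouette_score(points, labels, noise_label=-1):
    """Mean silhouette over non-noise points; requires at least two clusters."""
    clusters = {}
    for point, label in zip(points, labels):
        if label != noise_label:
            clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for label, members in clusters.items():
        for p in members:
            if len(members) == 1:
                scores.append(0.0)  # usual convention for singleton clusters
                continue
            # a: mean distance to the other members of the same cluster
            a = sum(dist(p, q) for q in members) / (len(members) - 1)
            # b: mean distance to the closest other cluster
            b = min(
                sum(dist(p, q) for q in other) / len(other)
                for other_label, other in clusters.items()
                if other_label != label
            )
            denom = max(a, b)
            scores.append((b - a) / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Two well-separated pairs of points, labelled sensibly and then badly.
toy_points = [(0, 0), (0, 1), (10, 0), (10, 1)]
good = silhouette_score(toy_points, [0, 0, 1, 1])
bad = silhouette_score(toy_points, [0, 1, 0, 1])
```

Values near 1 indicate compact, well-separated clusters, and negative values indicate points that sit closer to another cluster than to their own. Because excluding noise points can raise the score, a DBSCAN versus K-means comparison should state how noise was treated.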
BASE
In: International journal of population data science: (IJPDS), Vol. 5, Issue 5
ISSN: 2399-4908
Introduction: The South African HIV Cancer Match (SAM) study is a probabilistic record linkage study that created an HIV cohort from laboratory records of the National Health Laboratory Service (NHLS). This cohort was linked to the pathology-based South African National Cancer Registry to establish cancer incidence among the HIV-positive population in South Africa. As the number of HIV records increases, more efficient ways of de-duplicating this big data are needed. In this work, we used clustering to perform big-data deduplication.
Objectives and Approach: Our objective was to use DBSCAN as the clustering algorithm, together with a bi-gram word analyser, to perform big-data deduplication in resource-limited settings. We used HIV-related laboratory records from all of South Africa, collated in the NHLS Corporate Data Warehouse for the period 2004-2014. The process involved data pre-processing, deterministic deduplication, n-gram generation, feature generation using a Term Frequency Inverse Document Frequency vectorizer, clustering using DBSCAN, and assigning cluster labels to records that potentially belonged to the same person. We used records with national identification numbers to assess the quality of deduplication by calculating precision, recall and f-measure.
Results: We had 51,563,127 HIV-related laboratory records. Deterministic deduplication reduced these to 20,387,819 patient records. With DBSCAN clustering we further reduced this to 14,849,524 patient record clusters. In this final dataset, 3,355,544 (22.60%) patients had a negative HIV test, 11,316,937 (76.21%) had evidence of HIV infection, and for 177,043 (1.19%) the HIV status could not be determined. The precision, recall and f-measure based on 1,865,445 records with national identification numbers were 0.96, 0.94 and 0.95, respectively.
Conclusion / Implications: Our study demonstrated that DBSCAN clustering is an effective way of deduplicating big datasets in resource-limited settings. This enabled refining of an HIV observational database by accurately linking test records that potentially belonged to the same person. The methodology creates opportunities for easy data profiling to inform public health decision making.
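The quality check described above compares cluster labels against a trusted identifier. One common way to do this, sketched below, is to score pairs of records: a pair counts as a true positive when both the clustering and the identifier say the two records belong to the same person. This pairwise scheme is an assumption, since the abstract does not say how precision and recall were counted, and the all-pairs loop is only practical for small samples. The function names and toy data are illustrative.

```python
from itertools import combinations


def f_measure(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def pairwise_scores(predicted_clusters, true_ids):
    """Precision, recall and F-measure over all record pairs (quadratic cost)."""
    tp = fp = fn = 0
    for i, j in combinations(range(len(true_ids)), 2):
        same_pred = predicted_clusters[i] == predicted_clusters[j]
        same_true = true_ids[i] == true_ids[j]
        if same_pred and same_true:
            tp += 1
        elif same_pred:
            fp += 1  # clustered together but different people
        elif same_true:
            fn += 1  # same person but split across clusters
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, f_measure(precision, recall)


# Toy example: five records, cluster labels vs. national ID numbers.
toy_precision, toy_recall, toy_f = pairwise_scores(
    [0, 0, 0, 1, 1], ["A", "A", "B", "B", "C"]
)

# Consistency check on the reported figures: P = 0.96, R = 0.94.
reported_f = round(f_measure(0.96, 0.94), 2)
```

The last line confirms that the reported f-measure of 0.95 is consistent with the reported precision and recall, since 2 × 0.96 × 0.94 / (0.96 + 0.94) ≈ 0.9499.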
In: International Journal of Computer Networks & Communications (IJCNC) Vol.11, No.6, November 2019
SSRN
Working paper
Sensor networks are a crucial research topic because they serve a wide variety of applications, such as healthcare, smart cities, environmental monitoring, the military, industrial automation, and smart grids. Clustering algorithms are an essential factor in conserving power within energy-constrained networks. Selecting a cluster head balances the energy load within the network, which reduces the energy consumed and extends the network lifespan. This paper introduces a distributed DBSCAN protocol for saving the energy of sensor devices in IoT networks. The protocol is implemented on each IoT sensor device, and the devices apply the DBSCAN algorithm to partition the network into clusters in a distributed way. An efficient periodic cluster head strategy is proposed, based on criteria such as remaining energy, number of neighbors, and distance for each node in the cluster. The cluster head is chosen periodically and in a distributed way so that power is consumed evenly across the IoT sensor devices inside each cluster. The comparison results confirm that the protocol conserves power in the network better than other approaches.
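The cluster head strategy above ranks nodes by remaining energy, number of neighbors, and distance. The sketch below shows one way such a ranking could work. The weighted-sum score, the weights, the normalization, and the reading of "distance" as distance to the cluster center are all assumptions made for illustration, and none of them is taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Node:
    node_id: str
    residual_energy: float      # remaining energy, arbitrary units
    neighbor_count: int         # nodes within radio range
    distance_to_center: float   # assumed meaning of "distance"


def elect_cluster_head(nodes, weights=(0.5, 0.3, 0.2)):
    """Return the node with the highest score (hypothetical weighted sum)."""
    w_energy, w_neighbors, w_distance = weights
    max_e = max(n.residual_energy for n in nodes) or 1.0
    max_n = max(n.neighbor_count for n in nodes) or 1
    max_d = max(n.distance_to_center for n in nodes) or 1.0

    def score(n):
        # More energy and more neighbors raise the score; distance lowers it.
        return (
            w_energy * n.residual_energy / max_e
            + w_neighbors * n.neighbor_count / max_n
            + w_distance * (1 - n.distance_to_center / max_d)
        )

    return max(nodes, key=score)


cluster = [
    Node("a", residual_energy=0.9, neighbor_count=3, distance_to_center=5.0),
    Node("b", residual_energy=0.8, neighbor_count=5, distance_to_center=1.0),
    Node("c", residual_energy=0.2, neighbor_count=5, distance_to_center=0.5),
]
first_head = elect_cluster_head(cluster)

# After serving as head, node "b" has spent most of its energy,
# so the next periodic election picks a different node.
cluster[1].residual_energy = 0.1
second_head = elect_cluster_head(cluster)
```

Re-running the election periodically rotates the head role away from drained nodes, which is the balancing effect the abstract describes.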
BASE
In: Journal of multi-criteria decision analysis, Vol. 20, Issue 5-6, pp. 235-253
ISSN: 1099-1360
Abstract: This paper presents a methodology for capturing complex trade‐off relationships among objective functions and the underlying design principles through Pareto frontier representations, for Multi‐Objective Optimization problems in complex engineering design processes. The methodology gives engineers a way to quickly assess the achievability of target values on objective functions (referred to as f‐feasibility) and to propose design solutions that meet or exceed these targets. Obtaining Pareto frontier representations can be challenging when there are discontinuities in the Pareto frontiers and when the trade‐off relationships vary across these discontinuities. The proposed methodology addresses this issue by first identifying discontinuities in the Pareto frontier through a clustering procedure, and then obtaining functional approximations of the trade‐off relationships for each of the discontinuous portions of the Pareto frontier. A two‐stage approach for clustering is employed. In the first stage, a DBSCAN clustering technique is used to classify disjoint Pareto sets in the objective space (f‐space) into different groups. In the second stage, a frequent set mining algorithm named ECLAT and a subsequent filtering procedure are used to extract design constraint combinations that are active (binding) within each of these groups to form subgroups. Approximations are obtained for each of the subgroups using a constrained least squares technique, and these are then combined to obtain an overall equation for assessing the f‐feasibility of any arbitrary point in objective space. In the two implementation studies, the two‐stage approach for clustering is observed to be highly effective in yielding good-quality Pareto frontier representations and f‐feasibility assessments. Copyright © 2013 John Wiley & Sons, Ltd.
Abstract: Indonesia is the third largest democracy in the world. One reflection of that democracy is the presidential election. A political figure who wants to run takes public opinion, which is now conveyed through social media, into consideration. The role of social media has become very important because it can boost votes significantly and has even become a new weapon in many fields, especially political campaigns. One very popular social media platform is Twitter. Twitter uses tweets, which contain data that can become information when processed. Data from tweets can be used to find public opinion on presidential candidates, as well as the patterns and knowledge that emerged in the 2019 presidential election. Tweets in the form of textual data can be handled using text mining. The method used is a partitioning clustering algorithm, namely Density-Based Spatial Clustering of Applications with Noise (DBSCAN). The result of this study is that DBSCAN is the best method because it has a silhouette index (SI) validity of 0.8094 and an execution time in RapidMiner of 2.5676 seconds. The frequency of the name Joko Widodo dominates the positive, negative and neutral categories. The results of this study can be used by people, organizations and business processes that are closely related to the 2019 presidential election. Keywords: presidential candidate, DBSCAN, text mining, Twitter, 2019 presidential election
BASE
In: FRL-D-23-02772
SSRN
In: Risk analysis: an international journal, Vol. 42, Issue 4, pp. 830-853
ISSN: 1539-6924
Abstract: In 2016, the British government acknowledged the importance of reducing antimicrobial prescriptions to avoid the long‐term harmful effects of overprescription. Prescription needs are highly dependent on factors that have a spatiotemporal component, such as bacterial outbreaks and urban densities. In this context, density‐based clustering algorithms are flexible tools for analyzing data by searching for group structures and thereby identifying peer groups of GPs with similar behavior. The case of Scotland presents an additional challenge because of the diversity of population densities in the area of study. We propose here a spatiotemporal clustering approach for modeling the behavior of antimicrobial prescriptions in Scotland. In particular, we consider the density‐based spatial clustering of applications with noise algorithm (DBSCAN) because of its ability to include both spatial and temporal data. We extend this approach in two directions. For the temporal analysis, we use dynamic time warping to measure the dissimilarity between time series while taking into account effects such as seasonality. For the spatial component, we propose a new way of weighting spatial distances with continuous weights derived from a kernel density estimation‐based process. This makes our approach suitable for cases with different local densities, which present a well‐known challenge for the original DBSCAN. We apply our approach to antibiotic prescription data in Scotland, demonstrating how the findings can be used to compare antimicrobial prescription behavior within a group of similar peers and to detect regions of extreme behavior.
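Dynamic time warping, mentioned above as the temporal dissimilarity, aligns two series by letting time stretch locally before summing the pointwise differences. The following is a minimal sketch of the classic dynamic-programming formulation with an absolute-difference cost and no warping window; the paper's exact variant and its kernel-density spatial weighting are not reproduced here. The function names and the toy series are illustrative.

```python
def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) dynamic time warping with |x - y| step cost."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = best alignment cost of a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(
                cost[i - 1][j],      # a[i-1] matched again (stretch b)
                cost[i][j - 1],      # b[j-1] matched again (stretch a)
                cost[i - 1][j - 1],  # advance both series
            )
    return cost[n][m]


def pairwise_dtw(series):
    """Symmetric dissimilarity matrix, usable as a precomputed distance."""
    k = len(series)
    matrix = [[0.0] * k for _ in range(k)]
    for i in range(k):
        for j in range(i + 1, k):
            matrix[i][j] = matrix[j][i] = dtw_distance(series[i], series[j])
    return matrix


# Toy series: the second is the first delayed by one step, the third is flat.
peak = [0, 1, 2, 1, 0]
delayed_peak = [0, 0, 1, 2, 1, 0]
flat = [5, 5, 5, 5, 5]
dtw_matrix = pairwise_dtw([peak, delayed_peak, flat])
```

The delayed series has zero DTW distance from the original, which illustrates why the measure tolerates timing offsets in seasonal patterns. A matrix like `dtw_matrix` can be passed to any DBSCAN implementation that accepts precomputed distances.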
In: Cambridge elements. Elements in quantitative and computational methods for the social sciences
In the age of data-driven problem-solving, applying sophisticated computational tools for explaining substantive phenomena is a valuable skill. Yet, application of methods assumes an understanding of the data, structure, and patterns that influence the broader research program. This Element offers researchers and teachers an introduction to clustering, which is a prominent class of unsupervised machine learning for exploring and understanding latent, non-random structure in data. A suite of widely used clustering techniques is covered in this Element, in addition to R code and real data to facilitate interaction with the concepts. After setting the stage for clustering, the Element details the following algorithms: agglomerative hierarchical clustering, k-means clustering, Gaussian mixture models, and, at a higher level, fuzzy C-means clustering, DBSCAN, and partitioning around medoids (k-medoids) clustering.
Measuring rural development is not easy because of its particular needs and conditions. This study classifies village performance using social, economic, and ecological indices. A total of 1,591 villages from the Community and Village Empowerment Office of Riau Province, Indonesia, are grouped into five village maturation classes: very under-developed village, under-developed village, developing village, developed village, and independent village. Density-based spatial clustering of applications with noise (DBSCAN) is used to mine 13 of the villages' attributes. Python is used to analyze and evaluate the DBSCAN process. The grouping achieves a silhouette coefficient of 0.8231, indicating good clustering performance. The epsilon and minimum points values are considered in the DBSCAN evaluation, using simulations with percentage splits. This grouping can serve as a guideline for governments in analyzing how to distribute rural development subsidies more optimally.
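The epsilon and minimum points values mentioned above fully determine a DBSCAN result: a point is a core point when at least `min_pts` points (itself included) lie within `eps` of it, clusters grow outward from core points, and everything unreachable is noise. A compact, unoptimized sketch of the textbook algorithm follows, assuming Euclidean distance and brute-force neighbor search. It is not the study's code, and the toy points are made up.

```python
from math import dist

NOISE = -1


def dbscan(points, eps, min_pts):
    """Textbook DBSCAN; returns one label per point, with -1 for noise."""
    labels = [None] * len(points)

    def region(i):
        # Brute-force neighborhood; includes the point itself.
        return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region(i)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: reachable but not core
                continue
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region(j)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is core, keep expanding
    return labels


toy = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (50, 50)]
labels_default = dbscan(toy, eps=1.5, min_pts=3)
labels_strict = dbscan(toy, eps=1.5, min_pts=4)
labels_tiny_eps = dbscan(toy, eps=0.5, min_pts=3)
```

On the toy data, raising `min_pts` from 3 to 4 turns the three-point group into noise, and shrinking `eps` to 0.5 leaves no clusters at all, which shows why both parameters have to be reported alongside any silhouette figure.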
BASE
SSRN
In: Maǧallat abḥāṯ al-Baṣra: al-ʿulūm al-insānīya = Journal of Basrah researches : the humanities, Vol. 50, Issue 2, pp. 318-332
ISSN: 1817-2695
The use of efficient machines and algorithms in planning, distribution, and optimization methods is of paramount importance, especially when it comes to supporting the rapid development of technology. Cluster analysis is an unsupervised machine learning function for clustering objects based on some similarity measure. In this paper, we review different types of clustering algorithms for clustering data of different sizes and their applications. This survey reviews five primary clustering approaches—Partitioning, Hierarchical, Density-Based, Model-Based, and Grid-Based clustering—highlighting their strengths, limitations, and suitability for location-based optimization. Each algorithm is evaluated on key performance criteria, including noise handling, computational efficiency, scalability, and the ability to manage spatial constraints. Key evaluations demonstrate that DBSCAN achieved an average silhouette score of 0.76, indicating strong cluster cohesion and separation, while K-Means showed the fastest computational time for datasets under 10,000 points. The Grid-Based method excelled in scalability, handling datasets exceeding 1 million points with minimal computational overhead. Case studies and real-world applications demonstrate the practical utility of these algorithms in optimizing center placement across diverse industries. The results provide valuable insights for practitioners and researchers seeking to improve distributed network design, resource efficiency, and location optimization using advanced clustering methodologies.