Data Mining of SILC Data: Turkey Case
Official data produced by National Statistical Institutes (NSIs) play an essential role in governmental economic and social decision-making. Analyzing official data with data mining methods rather than traditional statistical approaches is important for extracting new information and hidden patterns. However, the usefulness of data mining methods for official statistics remains unexplored. In the present study, data from the 2015 Survey of Income and Living Conditions (SILC) conducted by the Turkish Statistical Institute (TurkStat) are examined with data mining methods. Cross-sectional data on 36036 individuals were analyzed to determine the variables affecting individual income and to examine the welfare status of the individuals. To determine the socio-economic profiles of individuals, latent class analysis (LCA) and k-modes clustering were used. The socio-economic status of individuals was classified using clustering and random forest (RF) models. The LCA model with ten classes gave the probability that a newly selected individual belongs to each class. Latent class profiles of the individuals were defined according to the variable values with the highest probability in the latent classes. The ten clusters obtained from k-modes were defined according to their cluster modes to give cluster profiles of the individuals, and these results were compared with the LCA results. For the categorical variables considered in this study, LCA provided more consistent results than k-modes. In the RF model, in which individual income was modeled as a function of all nine input variables, the importance of each variable was determined. Education, occupation, and age, in that order, were the most important variables and contributed most to the RF model.
For the SILC data, which are extensive and detailed, methods such as LCA and RF appear to be appropriate for data mining and for obtaining meaningful results. Similar data mining processes can be applied to other official data to obtain meaningful results.
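As an illustration of how k-modes defines clusters by their modes for categorical records, the following is a minimal sketch in plain Python. It is not the implementation used in the study: the function names, the toy records, the three variables shown, and the choice of two clusters are all hypothetical, and a simple matching dissimilarity with random initial modes is assumed.

```python
import random
from collections import Counter


def matching_distance(a, b):
    """Simple matching dissimilarity: number of attributes on which two records differ."""
    return sum(x != y for x, y in zip(a, b))


def column_modes(rows):
    """Most frequent category in each column of a non-empty list of records."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*rows))


def k_modes(records, k, n_iter=20, seed=0):
    """Assign each categorical record to one of k clusters described by their modes."""
    rng = random.Random(seed)
    # Assumption: initial modes are k distinct records drawn at random.
    modes = rng.sample(sorted(set(records)), k)
    labels = []
    for _ in range(n_iter):
        # Assignment step: each record goes to the nearest mode.
        new_labels = [
            min(range(k), key=lambda j: matching_distance(r, modes[j]))
            for r in records
        ]
        if new_labels == labels:
            break  # assignments are stable, so the modes will not change
        labels = new_labels
        # Update step: recompute each cluster's mode from its members.
        for j in range(k):
            members = [r for r, lab in zip(records, labels) if lab == j]
            if members:
                modes[j] = column_modes(members)
    return labels, modes


# Hypothetical toy records: (education, occupation, age group).
records = [
    ("primary", "manual", "18-34"),
    ("primary", "manual", "35-54"),
    ("primary", "manual", "18-34"),
    ("tertiary", "professional", "35-54"),
    ("tertiary", "professional", "55+"),
    ("tertiary", "professional", "35-54"),
]

labels, modes = k_modes(records, k=2)
for j, mode in enumerate(modes):
    print(f"cluster {j}: mode = {mode}, size = {labels.count(j)}")
```

Each cluster's mode can then be read as a profile description, in the same spirit as the cluster profile definitions mentioned in the abstract. The result depends on the initial modes, so in practice several random starts would normally be compared.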