Automatic learning framework for pharmaceutical record matching
Pharmaceutical manufacturers need to analyse a vast number of products in their daily activities. Many times, the same product can be registered several times by different systems using different attributes, and these companies require accurate and quality information regarding their products since these products are drugs. The central hypothesis of this research work is that machine learning can be applied to this domain to efficiently merge different data sources and match the records related to the same product. No human is able to do this in a reasonable way because the number of records to be matched is extremely high. This article presents a framework for pharmaceutical record matching based on machine learning techniques in a big data environment. The proposed framework aims to explode the well-known rules for the matching of records from different databases for training machine learning models. Then the trained models are evaluated by predicting matches with records that do not follow these known rules. Finally, the production environment is simulated by generating a huge amount of combinations of records and predicting the matches. The obtained results show that, despite the good results obtained with the training datasets, in the production environment, the average accuracy of the best model is around 85%. That shows that matches which do not follow the known rules can be predicted and, considering that there is not a human way to process this amount of data, the results are promising. ; This work was supported by the Research Program of the Ministry of Economy and competitiveness, Government of Spain, through the DeepEMR Project, under Grant TIN2017-87548-C2-1-R