author:"Coldea, Alex" | Pollux - Fachinformationsdienst Politikwissenschaft

3 results

Article (electronic) #1, 10 September 2024

Sensitive Data Flagging within Data Quality reports: R and Regex Integration for Effective Text Flagging in Large Datasets

In: International Journal of Population Data Science (IJPDS), Volume 9, Issue 5

Coldea, Alex-Ioan; Elmessary, Muhammad A; Thayer, Daniel S; Scanlon, Ieuan; Davies, Hannah

ISSN: 2399-4908

Objective: We introduce a comprehensive approach to enhance sensitive data flagging within our Data Quality (DQ) tool. Within de-identified health records, sensitive data such as user IDs, postcodes, phone numbers, and addresses pose significant privacy risks if exposed in their raw form. To address this challenge, we propose using the necessary regex patterns together with free-text field checks to detect sensitive data efficiently within large datasets and, ultimately, to report it.
Approach: A regex map has been developed alongside free-text field thresholds, distribution counts and percentage checks to allow a generic approach. This ensures that new regexes can be added to the flagging process without significant code changes.
Once identified, sensitive data instances, including fields that contain free text, are flagged and displayed in an HTML report that is human-readable for DQ reviewers.
Results: Through a combination of SQL and R regex-based text processing, our approach allows seamless identification and flagging of sensitive data within datasets. Sensitive data in large tables of more than 25 million records with over 50 columns is flagged in less than 150 seconds (2.5 minutes).
Conclusion: Our development offers a practical solution for sensitive data flagging in complex datasets, in which team members can extend the flagging by adding more regex patterns to the codebase.
Implications: DQ reviewers can inspect the flagged sensitive fields in a human-readable way, providing an extra measure of confidence when determining the quality of de-identified health records.
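
The regex map and percentage checks described in the abstract above can be illustrated with a minimal sketch. It is written in Python for brevity, whereas the abstract states that the authors work in SQL and R. The pattern labels, the regexes themselves, the function names `flag_column` and `looks_like_free_text`, and all threshold values are assumptions made for illustration, not the authors' implementation.

```python
import re
from collections import Counter

# Hypothetical regex map: labels and patterns are illustrative assumptions.
# New patterns can be added here without touching the flagging logic.
REGEX_MAP = {
    "uk_postcode": re.compile(r"\b[A-Z]{1,2}\d[A-Z\d]?\s?\d[A-Z]{2}\b", re.IGNORECASE),
    "uk_phone": re.compile(r"(?:\+44\s?|\b0)\d{4}\s?\d{6}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
}


def flag_column(values, regex_map=REGEX_MAP, pct_threshold=1.0):
    """Return {label: (count, percentage)} for each pattern whose share of
    matching non-null values is at or above pct_threshold (assumed value)."""
    non_null = [str(v) for v in values if v is not None]
    if not non_null:
        return {}
    counts = Counter()
    for text in non_null:
        for label, pattern in regex_map.items():
            if pattern.search(text):
                counts[label] += 1
    flags = {}
    for label, count in counts.items():
        pct = 100.0 * count / len(non_null)
        if pct >= pct_threshold:
            flags[label] = (count, pct)
    return flags


def looks_like_free_text(values, min_distinct_ratio=0.9, min_avg_length=30):
    """Heuristic free-text check: mostly distinct, relatively long values.
    Both thresholds are assumptions chosen for the example."""
    non_null = [str(v) for v in values if v is not None]
    if not non_null:
        return False
    distinct_ratio = len(set(non_null)) / len(non_null)
    avg_length = sum(len(v) for v in non_null) / len(non_null)
    return distinct_ratio >= min_distinct_ratio and avg_length >= min_avg_length
```

Calling `flag_column` once per column yields a dictionary that could feed a report table. Keeping the patterns in a single map is what allows new regexes to be added without changing the flagging code.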

Open Access

Article (electronic) #2, 10 September 2024

Developing a High Velocity Dataset Quality Checking Pipeline

In: International Journal of Population Data Science (IJPDS), Volume 9, Issue 5

Davies, Hannah; Thayer, Daniel S; Elmessary, Muhammad A; Coldea, Alex-Ioan; Jones, Carys; Evans, Lorna; Makovics, Alexander; Howell-Wright, Owen; Hughes, Lee

ISSN: 2399-4908

Objective: The volume and frequency of refreshed data within the [organisation] have increased significantly since the beginning of the COVID-19 pandemic. Therefore, a more efficient data quality (DQ) checking process was necessary.
Approach: Having previously developed an automated DQ checking tool, we focused on re-engineering the process of DQ task allocation and communication of results.
Results: Five analysts were trained in DQ checking. A JIRA workflow tracks the management of data loading. When a dataset is ready for DQ, the Data Manager allocates a ticket to the DQ Lead, who then allocates it to one of the five analysts. Via a DQ Slack channel, the analyst is informed and acknowledges receipt of the task. On completion of DQ, the analyst updates the ticket and transfers it to the appropriate workflow stage. Tickets that pass DQ are transferred to the Data Manager for data release, whereas those that fail are placed "On Hold". The DQ Lead triages the issues and liaises with relevant parties for resolution, which may require data amendments. On receipt of amended data, the DQ ticket is transferred back to the queue and the analyst is notified to re-check the data.
Conclusions: The team now complete a high volume of DQ checks efficiently. In 2023, 405 datasets, containing 1716 tables, were quality checked, with the initial DQ taking, on average, 2.6 days.
Implications: The improved speed of DQ checking ensures projects access the latest available data whilst maintaining the expected DQ levels, integrity and reputation of the TRE.
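
The ticket flow described in the Results above can be read as a small state machine. The following Python sketch is one possible reading; the stage names, event names, and the `advance` function are hypothetical and do not reflect the team's actual JIRA configuration.

```python
from enum import Enum


class Stage(Enum):
    # Stage names are assumptions; only "On Hold" is named in the abstract.
    DQ_QUEUE = "DQ queue"
    WITH_ANALYST = "with analyst"
    ON_HOLD = "On Hold"
    READY_FOR_RELEASE = "ready for release"


# Allowed transitions, keyed by (current stage, event).
TRANSITIONS = {
    (Stage.DQ_QUEUE, "allocate"): Stage.WITH_ANALYST,
    (Stage.WITH_ANALYST, "pass"): Stage.READY_FOR_RELEASE,
    (Stage.WITH_ANALYST, "fail"): Stage.ON_HOLD,
    (Stage.ON_HOLD, "amended_data_received"): Stage.DQ_QUEUE,
}


def advance(stage, event):
    """Return the next stage, or raise ValueError for a disallowed move."""
    try:
        return TRANSITIONS[(stage, event)]
    except KeyError:
        raise ValueError(f"cannot apply {event!r} in stage {stage.value!r}") from None
```

A failed check followed by amended data returns the ticket to the queue, so the same analyst allocation step applies on the re-check.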

Open Access

Article (electronic) #3, 10 September 2024

Evolving the HDRUK Phenotype Library: Phenotype Creation and Editing

In: International Journal of Population Data Science (IJPDS), Volume 9, Issue 5

Thayer, Daniel S; Scanlon, Jack; Elmessary, Muhammad A; Zinnorov, Artur; Scanlon, Ieuan; Coldea, Alex; Davies, Hannah; Oliveira, Carla; Denaxas, Spiros; Jefferson, Emily; Hemingway, Harry

ISSN: 2399-4908

Objective: The HDRUK Phenotype Library (phenotypes.healthdatagateway.org) shares definitions used to measure concepts of interest (such as diagnoses or treatments) within health datasets. It already holds more than 1000 phenotypes, with researchers able to contribute their work via an API. We aimed to create a more user-friendly method of contributing to the Library.
Approach: We designed a phenotype creation workflow enabling users to create and submit new content via a web interface. Goals included clarity and ease of use for a broad range of users.
Results: A home page shows researchers' own content. Researchers can create new phenotypes using a web form, entering metadata such as name, authors, description, and publications. Code lists are defined via one or more rules, including search terms or references to another phenotype, or by CSV upload. Users can make their phenotypes accessible to a research group or to all authenticated users, as well as publish content on the web. Publication requests are reviewed to ensure content is complete and appropriate. Editing, with full history and version control, is also supported.
Conclusions: We implemented and released the new features. We are currently engaging researchers to get feedback and invite content submission.
Implications: The benefit of tools to support research transparency and repeatability is only realized when they are adopted. We hope that a GUI to support phenotype creation will broaden the Library's user base and help it serve as an enabler of higher-quality, more efficient research across the worldwide health data research community.

Open Access