Article | REF: H3701 V1

Detection and correction of data quality problems with machine learning

Author: Laure BERTI-ÉQUILLE

Publication date: May 10, 2023

ABSTRACT

This synthesis presents recent developments in machine-learning-based techniques for evaluating and improving data quality in databases. It describes recent solutions, proposed mainly by academia, as well as approaches implemented to detect and correct the main data quality problems: outlying, inconsistent, or missing data, and duplicates.

AUTHOR

Laure BERTI-ÉQUILLE: Research Director - Development Research Institute - ESPACE-DEV - Montpellier, France

INTRODUCTION

Significant progress has been made in recent years in the design of tools that automate the evaluation, monitoring and improvement of data quality, driven largely by advances in Artificial Intelligence, and machine learning (ML) in particular. Machine learning techniques are now operational at scale and widely deployed across all sectors of activity to automate prediction and classification tasks for decision support in numerous application fields (health, finance, marketing, etc.). However, the reliability of their results remains highly dependent on the quality of the data fed to the learning models. Data is often imperfect, and optimal data quality is rarely achieved. Two complementary approaches are therefore commonly proposed. The first, from the data management research community, aims to correct data upstream of analysis pipelines by cleaning or repairing it. The second, from the community of machine learning researchers and practitioners (data scientists), aims to develop models that are more efficient and more robust to noise, with greater emphasis on transforming and preparing data for a particular predictive task.

For decades, data cleansing in the data management community has consisted of correcting and transforming data using declarative ETL (Extraction-Transformation-Loading) approaches, detecting inconsistencies in relational databases in the form of constraint violations in order to "repair" them, and proposing solutions, often theoretical, for reasoning over inconsistent data, querying it, verifying and satisfying integrity constraints, discovering functional dependencies or business rules...
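As an illustration of detecting inconsistencies as constraint violations, the following minimal sketch checks a functional dependency over a small set of records. It is not taken from the article: the function name fd_violations, the sample rows, and the dependency zip -> city are all hypothetical, chosen only to show the idea that two rows agreeing on the left-hand side must also agree on the right-hand side.

```python
from collections import defaultdict


def fd_violations(rows, lhs, rhs):
    """Return the lhs keys whose rows disagree on rhs.

    A functional dependency lhs -> rhs holds when every pair of rows that
    agrees on the lhs attributes also agrees on the rhs attribute.
    (Hypothetical helper, for illustration only.)
    """
    values_by_key = defaultdict(set)
    for row in rows:
        key = tuple(row[attr] for attr in lhs)
        values_by_key[key].add(row[rhs])
    # Keep only the keys associated with more than one distinct rhs value.
    return {
        key: sorted(values)
        for key, values in values_by_key.items()
        if len(values) > 1
    }


# Hypothetical sample data: rows 1 and 2 share a zip code but not a city.
rows = [
    {"id": 1, "zip": "34000", "city": "Montpellier"},
    {"id": 2, "zip": "34000", "city": "Montpelier"},
    {"id": 3, "zip": "75001", "city": "Paris"},
    {"id": 4, "zip": "75001", "city": "Paris"},
]

violations = fd_violations(rows, ["zip"], "city")
print(violations)  # {('34000',): ['Montpelier', 'Montpellier']}
```

The sketch only detects the violation. Deciding which of the conflicting values is correct is the separate "repair" step, which this example does not attempt.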

This article is included in

Software technologies and System architectures

Events

International conferences:

Very Large Databases (VLDB) Conference: http://vldb.org/conference.html
ACM SIGMOD (Special Interest Group on Management of Data): https://dl.acm.org/event.cfm?id=RE227

Standards and norms

Data quality — Part 1: Overview https://www.iso.org/standard/50798.html - ISO/TS 8000-1 - 2011
Data quality — Part 2: Vocabulary https://www.iso.org/standard/73456.html - ISO 8000-2 - 2017
Data quality — Part 8: Information and data quality: Concepts and measuring https://www.iso.org/standard/60805.html - ISO 8000-8 - 2015
Data quality — Part 61: Data quality management: Process reference model...
