A Rule Based Taxonomy of Dirty Data

Lin Li, Taoxin Peng, Jessie Kennedy

Abstract


There is growing awareness that high-quality data is key to
today’s business success, and that dirty data within data sources
is one of the causes of poor data quality. To ensure high-quality
data, enterprises need processes, methodologies and resources to
monitor, analyze and maintain the quality of their data.
Nevertheless, research shows that many enterprises do not pay
adequate attention to the existence of dirty data and have not
applied useful methodologies to ensure high-quality data for
their applications. One reason is a lack of appreciation of the
types and extent of dirty data. In practice, detecting and
cleaning all the dirty data that exists in all data sources is
expensive and unrealistic, so most enterprises need to consider
the cost of cleaning dirty data. This problem has not attracted
enough attention from researchers. In this paper, a rule-based
taxonomy of dirty data is developed. The proposed taxonomy not
only provides a mechanism to deal with this problem but also
includes more dirty data types than any existing taxonomy.

