Assessing Data Quality

Written by Richard Seidl | Feb 29, 2012

This article is a field report from a project to assess the quality of customer databases. The process described in it combines three automated audits: an audit of the database schema, an audit of the database structure and an audit of the database content. The database schema audit checks for design flaws and rule violations. The database structure audit measures the size, complexity and quality of the database model. The database content audit processes the data itself to detect invalid data values, missing records and redundant records. The purpose of these audits is to assess the quality of the database and determine whether a data reengineering or data cleansing project is required.

Declining data quality

The existing databases of most IT users are the product of a long development. It began with their conception by a database designer on the basis of the information available at the time. This was followed by the development or automatic generation of a database schema that reflected the design. The database was then installed and applications began to use it. As new applications were added, the structure was modified or extended to accommodate them. These structural changes were often ad hoc, so the complexity of the structure grew and the quality declined, similar to software systems following the laws of software evolution. In addition to this structural degradation, the contents of the database were corrupted by faulty programs that stored incorrect values and deleted important records. With large databases, such quality erosion is difficult to detect, but over time it spreads like a cancer and causes more and more system failures. As a result, not only the program quality but also the data quality suffers from erosion.

Failure to anticipate data development

At the beginning of a database’s life cycle, the designer must make assumptions about the amount of data and the way in which the data will be used. He must also anticipate how the data will develop. The design of the database is based on such assumptions. The same assumptions are transferred to a database schema in which the tables are defined and linked to each other by foreign keys and indexes. If the assumptions prove to be correct, the database can evolve without any problems. If not, the database structure will soon become obsolete and the form will no longer match the content. It will also become increasingly difficult to adapt. If the data volume grows much more than expected and the data usage turns out to be completely different from what was originally planned, the database will need to be reengineered.

Lack of data independence

Often enough, database designers cannot predict the actual growth of a database, nor can they foresee how it will actually be used. When they design the database for the first time, they probably design it only from the perspective of a single application and rarely from a global view of all possible applications. From the outset, the data is therefore not independent of the applications that use it. Later, the structure has to be adapted to accommodate more and different applications, but the further the design is stretched, the more fragile it becomes. At some point, it is no longer suitable for any of the applications.

Violation of the normal forms

Any temporary patch to the database structure has negative consequences. Instead of creating new sub-tables when additional attributes are needed, the local database administrator simply adds them to existing tables. This makes the database rows ever longer and more difficult to manage. When new keys are needed, they are added as indexes, so the number of indexes keeps growing. Data groups are added to existing rows, violating the second normal form, and the same data elements are repeated to avoid creating new sub-tables, violating the first normal form. Additional foreign keys are added, increasing the number of dependencies between tables, and subkeys are added to the primary key that often do not refer to the row as a whole, which amounts to a violation of the third normal form. In the end, the database no longer conforms to the normal forms for relational databases, which leads to a loss of quality and an increase in complexity.

Incorrect use of stored procedures

Another feature that contributes to the loss of portability and data independence is the use of stored procedures. If stored procedures were used only for access operations and to ensure the integrity of the data, this would be acceptable. However, they are often misused to implement processing logic. Instead of placing business rules in separate rule modules or check routines in the client components, developers hide them in the stored procedures, where they are no longer visible to the tools that process the programs. Stored procedures full of branches and loops, assignments and declarations are a clear sign of such misuse. They bind the database to a specific application.

Inconsistent and incorrect data updates

The content of a database suffers not only from errors when executing the programs, but also from inconsistent changes. For example, it can happen that the type of a data attribute contained in two different tables is changed in one table but not in the other, or that the value of an attribute is no longer compatible with the type of this attribute in the program being used. The Frankfurt Stock Exchange once had to be closed for half a day because an invalid value in the database led to a data exception error in the processing job. Unfortunately, there is much more erroneous data in existing databases than users realize, and this erroneous data can lead to a system crash at any time.

Another problem with the content of databases is missing and redundant data records. Missing data records are those that are still referenced but have been deleted either intentionally or unintentionally. They may have been deleted by one application but are still required by another. Redundant data records are those that should actually be deleted but were not deleted for some reason. They remain as corpses in the database and occupy valuable storage space even though they are no longer used.

The quality deficiencies identified here are only a subset of the many quality problems with application databases. Low database quality has become a major concern for IT users and is the motivation for the three testing methods presented in this article:

  • Analyzing the database schema
  • Measuring the database structure
  • Validating the database content

Static analysis of database schemas

The database model must be statically analyzed in order to identify quality defects and measure the size, complexity and quality of the database structure. This requires an automated testing process.

The first step in auditing a database is to statically analyze the database schema with the aim of detecting rule violations. Just as with program code, rules must also apply to database schema code in order to enforce good coding practice. On the one hand, the rules are designed to fulfill certain data quality goals such as security, compatibility, performance and portability. On the other hand, they should make it easier to change and expand the database.

In order to weight rule violations differently, it is necessary to classify them. The database checker distinguishes between major, medium and minor rule violations. Major rule violations, such as a violation of the first normal form, are a serious obstacle to the further development and reuse of the data and can also lead to errors. Medium rule violations, such as exceeding a predefined size limit, reduce the modularity and maintainability of the database. Minor rule violations, such as the violation of naming conventions, make it more difficult to maintain the data and keep it consistent with the code. Major defects are weighted with 2, medium defects with 1 and minor defects with 0.5.
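
To illustrate how such a weighting could be aggregated into a single defect score, here is a minimal Python sketch. The weights 2, 1 and 0.5 come from the classification above; the rule names, the data structure and the function itself are illustrative assumptions, not part of any particular tool.

# Minimal sketch: aggregate classified rule violations into a weighted defect score.
# The weights for major/medium/minor violations follow the classification above;
# the rule names and the data structure are illustrative assumptions.
WEIGHTS = {"major": 2.0, "medium": 1.0, "minor": 0.5}

def weighted_defect_score(violations):
    """violations: list of (rule_name, severity) tuples."""
    return sum(WEIGHTS[severity] for _, severity in violations)

violations = [
    ("first_normal_form_violated", "major"),
    ("size_limit_exceeded", "medium"),
    ("naming_convention_violated", "minor"),
]
print(weighted_defect_score(violations))  # -> 3.5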

Checking vendor-specific features

There are many database functions that are vendor-specific, i.e. they are not part of standard SQL. They should be used with caution. Some options should be forbidden because they affect performance or hinder the transfer of data. Others, such as the NULL option, should be mandatory for the same reasons. It is up to the responsible database analyst to determine which functions should be prohibited and which should be mandatory.

Checking the normal forms

Avoiding vendor-specific features may reduce performance, but it makes the database more vendor-independent, i.e. the data becomes more portable. A primary key in every table may not be required by the SQL language, but it can be made a rule to be enforced. A foreign key is likewise not required, but if a table depends on another table, at least one should be defined. Under no circumstances should a table contain repeated data elements of the same type or embedded subgroups. Such obvious violations of the normal forms can be detected by static analysis and flagged as bad practice.
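
Repeated data elements of the same type often betray themselves in the schema through numbered column names. The following Python sketch shows one possible heuristic for flagging such repeating groups; the naming pattern and the threshold are assumptions for illustration, not the method of any specific schema auditor.

import re
from collections import defaultdict

# Heuristic sketch: repeated data elements of the same type often appear as
# numbered columns (e.g. AMOUNT_1, AMOUNT_2, AMOUNT_3). Flagging such groups
# points to likely first-normal-form violations. The naming pattern and the
# threshold are illustrative assumptions.
def find_repeating_groups(columns, min_repeats=3):
    groups = defaultdict(list)
    for name in columns:
        match = re.fullmatch(r"(.+?)_?(\d+)", name)
        if match:
            groups[match.group(1)].append(name)
    return {base: cols for base, cols in groups.items() if len(cols) >= min_repeats}

columns = ["CUST_ID", "NAME", "AMOUNT_1", "AMOUNT_2", "AMOUNT_3", "PHONE1", "PHONE2"]
print(find_repeating_groups(columns))  # -> {'AMOUNT': ['AMOUNT_1', 'AMOUNT_2', 'AMOUNT_3']}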

Setting limits

There may also be limits on how many attributes are permitted in a record and how long a row may be. It is necessary to check that these limits are not exceeded. It may also be desirable to prohibit certain data types, such as packed decimal, binary, floating point or timestamp data, for reasons of compatibility with databases from other vendors. Incompatible data types are one of the main causes of errors during data migration.

To summarize, there should be a set of rules for database design and these rules should be enforced. Typical rule violations are:

  • Unique identifier is missing in the table
  • Foreign key missing in dependent table
  • Table contains a repeated data attribute
  • Table contains a subgroup
  • Table has no external view defined
  • Table has no indexes
  • Primary key contains too many subkeys
  • The NULL option is missing
  • The DELETE option is missing
  • Table has an incompatible data type
  • Number of attributes exceeds maximum limit
  • Length of the row exceeds the maximum permitted length
  • Schema is insufficiently commented

It is the task of the schema auditing tool to check such rules and report any violations. It is then the task of the database administrator to correct the database schema accordingly.
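
As a sketch of how some of these rules could be checked automatically, the following Python fragment tests a handful of them against a simple table description. The table representation, the concrete limits and the forbidden types are assumptions for illustration only; a real schema auditing tool would parse the schema DDL itself.

# Sketch of a checker for a few of the rules listed above. The table
# representation and the concrete limits are illustrative assumptions.
MAX_ATTRIBUTES = 40
MAX_ROW_LENGTH = 4000                      # bytes
FORBIDDEN_TYPES = {"FLOAT", "TIMESTAMP"}   # example vendor-compatibility rule

def check_table(table):
    findings = []
    if not table.get("primary_key"):
        findings.append(("major", "Unique identifier is missing in the table"))
    if not table.get("indexes"):
        findings.append(("medium", "Table has no indexes"))
    if len(table["columns"]) > MAX_ATTRIBUTES:
        findings.append(("medium", "Number of attributes exceeds maximum limit"))
    if sum(col["length"] for col in table["columns"]) > MAX_ROW_LENGTH:
        findings.append(("medium", "Length of the row exceeds the maximum permitted length"))
    for col in table["columns"]:
        if col["type"] in FORBIDDEN_TYPES:
            findings.append(("major", "Incompatible data type in column " + col["name"]))
    if not table.get("comment"):
        findings.append(("minor", "Schema is insufficiently commented"))
    return findings

customer = {
    "name": "CUSTOMER",
    "primary_key": ["CUST_ID"],
    "indexes": [],
    "comment": "",
    "columns": [{"name": "CUST_ID", "type": "INTEGER", "length": 4},
                {"name": "BALANCE", "type": "FLOAT", "length": 8}],
}
for severity, message in check_table(customer):
    print(severity, "-", message)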

Measuring the database structure

To evaluate the architecture of a database, it must first be measured in order to base the evaluation on solid figures. Without numbers, it is not possible to compare, and without being able to compare, it is not possible to evaluate. The measurements of an existing database can be compared with those of a benchmark database or simply with those of another database. Data measurements can be simple counts, or they can be complex metrics of size, complexity and quality.

Database size metrics

As with the rule check, a tool is required to perform the measurement, count the database attributes and calculate the metrics. There are several database attributes that need to be counted. In an SQL database, these include

  • Lines of SQL code
  • Number of tables
  • Number of attributes
  • Number of keys
  • Number of views
  • Number of relationships
  • Number of stored procedures
  • Number of stored access procedures
  • Number of stored procedure statements
  • Number of integrity rules
  • Number of rules for stored procedures
  • Number of test cases required to cover the database

Two size metrics are calculated from these counts:

  • Function points
  • Data points

The function points of a database are either 15, 10 or 5, depending on the number of attributes and keys. Databases with more than 50 attributes or 2 keys receive 15 function points, databases with 20 to 50 attributes or more than one key receive 10 function points, and databases with fewer than 20 attributes receive 5 function points. The data points are 1 for each attribute, 2 for each key and index, and 4 for each view. These metrics are used to predict the cost of evolution projects that involve changes to the data.
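
A small Python sketch of these two size metrics follows. The point values are taken from the text; since the stated thresholds for attributes and keys overlap slightly, the exact boundary handling here is an interpretation rather than a definitive rule.

# Sketch of the size metrics described above. The point values follow the text;
# the exact boundary handling is an interpretation, because the stated thresholds
# for attributes and keys overlap in places.
def function_points(num_attributes, num_keys):
    if num_attributes > 50 or num_keys > 2:
        return 15
    if num_attributes >= 20 or num_keys > 1:
        return 10
    return 5

def data_points(num_attributes, num_keys, num_indexes, num_views):
    # 1 per attribute, 2 per key and index, 4 per view
    return num_attributes + 2 * (num_keys + num_indexes) + 4 * num_views

print(function_points(num_attributes=35, num_keys=2))                          # -> 10
print(data_points(num_attributes=35, num_keys=2, num_indexes=3, num_views=2))  # -> 53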

Metrics for database complexity

In addition to the size metrics, complexity and quality metrics should also be calculated. All metrics are ratio metrics derived from a relation of one set of attributes to another. They are calculated using algorithms that compare actual values with benchmark values based on empirical studies of similar databases. For each complexity metric, there is a lower and an upper bound. The average complexity is the geometric mean of the two limits. The six calculated complexity metrics are:

  • Content complexity as the ratio of data types and data keys to the number of data attributes
  • View complexity as the ratio of views to the number of tables
  • Access complexity as the ratio of access paths to the number of access objects (views and tables)
  • Relational complexity as the ratio of table relationships to the number of tables
  • Structural complexity as the ratio of structural elements (tables, views and indices) to the number of base elements (attributes and keys)
  • Memory complexity as the ratio of stored attributes to the size of the memory area in bytes

The weighted average of these six complexity metrics gives the overall complexity of the database on a ratio scale of 0 to 1. The complexity ratio should be as low as possible.
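
The following Python sketch shows one way the individual ratios could be normalized against empirical lower and upper bounds and then combined into a weighted average. The bounds, the equal weights and the clamping to the 0-to-1 scale are assumptions based on the description above, not the actual benchmark values.

# Sketch: normalize each complexity ratio against empirical lower/upper bounds
# and combine the results into a weighted average on a 0..1 scale. Bounds,
# weights and clamping are illustrative assumptions.
def normalize(actual, lower, upper):
    return min(1.0, max(0.0, (actual - lower) / (upper - lower)))

def overall_complexity(metrics, bounds, weights):
    total_weight = sum(weights[name] for name in metrics)
    return sum(normalize(value, *bounds[name]) * weights[name]
               for name, value in metrics.items()) / total_weight

metrics = {"content": 0.45, "view": 0.30, "access": 0.60,
           "relational": 0.50, "structural": 0.25, "memory": 0.40}
bounds  = {name: (0.1, 0.9) for name in metrics}   # illustrative benchmark bounds
weights = {name: 1.0 for name in metrics}          # equal weighting for the example
print(round(overall_complexity(metrics, bounds, weights), 2))  # lower is better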

Metrics for database quality

The desired properties of a database have been promoted in the literature on this topic, in particular by J-L. Hainaut, Peter Aiken and Michael Blaha. These and other authors emphasize the importance of data independence, data flexibility, accessibility and data portability. One of the biggest obstacles to software migration is the intertwining of business logic with access logic. One of the biggest obstacles to data evolution is the inflexibility of the database structure. In the 1980s, both Jean-Dominique Warnier and Michael Jackson advocated data-driven software design, a method in which the program structure is derived from the data structure. The result was programs that reflected the hierarchical and network data models of the time. When the time came to switch to relational databases, the program structures no longer fit and had to be redeveloped at great expense. What was once a good idea for finding a quick solution turned out to be the biggest obstacle to modernizing the systems. Even today, the structure of many relational databases is still based on the requirements of the original application; how the data might be used in the future was never planned for.

The solution to the problem of data independence is for each application to have its own views of the data; views were defined to make this possible. Of course, this increases the complexity of the database, but compromises always have to be made: here the trade-off is between data independence and data complexity. The same is true for data accessibility, which requires more keys and more indexes to access the data in many ways. Data flexibility implies that the database can be easily changed and expanded. To achieve this, there should be many smaller tables rather than a few large ones; however, as with object-oriented programming, this leads to more relationships between the tables and therefore to greater complexity. Storage efficiency means storing more data in less space, but this too increases complexity. This shows that there is hardly any way to improve one quality characteristic of a database without causing higher complexity or sacrificing another. This is a dilemma that database designers have to resolve. Achieving high quality with low complexity is an optimization problem that depends on the context in which the data is used. When data is used in different contexts, the best possible compromise must be found.

For each quality metric, there is also a lower and an upper limit, which are derived from the comparison of different databases. The average quality is the geometric mean of the highest and lowest quality values. The calculated quality metrics are:

  • Data independence as the ratio of data views and indexes to the number of data tables
  • Data accessibility as the ratio of access paths to the number of stored attributes
  • Data flexibility as the ratio of tables, keys and indices to the number of data attributes
  • Data storage efficiency as the inverse ratio of stored attributes to the size of the memory area in bytes
  • Data conformity as the ratio of rule violations to the number of schema lines

The weighted average of these quality metrics results in the quality coefficient for the entire database. It should be as high as possible. According to the quality measurement scale of the ISO 9126 standard, values above 0.8 are excellent, from 0.6 to 0.8 good, from 0.4 to 0.6 satisfactory and below 0.4 unsatisfactory.
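
A corresponding sketch for the quality side: the weighted average of the individual quality metrics and the rating scale quoted above. The metric values and the equal weights are made-up examples; the thresholds follow the article's reading of ISO 9126.

# Sketch of the quality aggregation and rating scale described above. The metric
# values and equal weights are invented examples; the thresholds follow the text.
def overall_quality(metrics, weights):
    total_weight = sum(weights[name] for name in metrics)
    return sum(value * weights[name] for name, value in metrics.items()) / total_weight

def rating(quality):
    if quality > 0.8:
        return "excellent"
    if quality > 0.6:
        return "good"
    if quality > 0.4:
        return "satisfactory"
    return "unsatisfactory"

metrics = {"independence": 0.7, "accessibility": 0.6, "flexibility": 0.5,
           "storage_efficiency": 0.8, "conformity": 0.9}
weights = {name: 1.0 for name in metrics}   # equal weighting for the example
score = overall_quality(metrics, weights)
print(round(score, 2), rating(score))       # -> 0.7 good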

Validating the database content

Validating the contents of a given database requires an oracle. There should be someone or something that determines what should be in the columns of a table in general and in which row in particular. This oracle could be a person, a program, an interpretable specification, or another database. A person could browse the database and visually inspect the values of each row or selected columns or selected rows. There are browsers that support this, but it is a very tedious and time-consuming task, even with a good browser. With very large databases, it is practically impossible. Therefore, an automated data validation process is required.

Validating the data content with a custom program

An alternative is to write a program that reads the database and fetches the next data record or selects data records according to certain search criteria. The content of each data record is then compared with the rules coded in the program or with the content of another table that serves as a reference. This method works, but it takes time and effort to prepare: the program must be written by a programmer who knows the database and how to access it, and the program itself can contain errors and must be debugged.

Validating the data content with a test script

Another alternative is an interpretable specification - a script - that can easily be written by someone who is familiar with the data. In the script, rules are formulated for validating the values of selected attributes in selected records. This assumes, of course, that the auditor writing the script knows what should be in the database. To support this, the auditor can use another database as a basis for comparison, but must then establish the links between the two databases. In order to write an interpretable specification, the auditor needs a language with which he can:

  • specify which data records with which keys should be in a table; the keys can be taken from the foreign keys of associated tables,
  • specify which data records with which keys should not be in a table; this can be achieved by comparing tables with each other,
  • compare the content of a selected column in one table with the content of a corresponding column in another table,
  • compare columns in the same table with each other,
  • compare the values of a column with a given constant value,
  • compare the values of a column with a range of values, i.e. a limit analysis,
  • compare the values of a column with a set of expected values, i.e. an equivalence class,
  • compare the values of a column with the result of an arithmetic or logical expression,
  • compare the values of a column with the concatenated values of other columns and constants.

All of the comparisons listed above can be made subject to a logical condition, i.e. they are only executed if that condition is met. Such a condition makes the values of a column dependent on the values of another column or on the primary key, e.g.

This_Column E RANGE (1:50) if (Mainkey > 1000);

There are many ways in which these statements can be formulated. The validator can specify discrete values as well as value ranges, equivalence classes, relationships between data values and the results of calculations. The assertion language is based on the Schematron schema validation rules as defined in ISO/IEC standard 19757.
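
To make the idea concrete, here is a Python sketch of how such conditional assertions could be evaluated row by row. It mirrors the example rule above, but it is emphatically not the syntax or implementation of the actual assertion language.

# Sketch: evaluate conditional range assertions row by row. This mirrors the
# example rule above; it is not the actual assertion language of the tool.
def assert_range(column, low, high, condition=lambda row: True):
    def check(row):
        if not condition(row):
            return True                 # condition not met, assertion does not apply
        return low <= row[column] <= high
    return check

# This_Column must lie in RANGE(1:50) if Mainkey > 1000
rules = [assert_range("This_Column", 1, 50, condition=lambda row: row["Mainkey"] > 1000)]

rows = [{"Mainkey": 900,  "This_Column": 77},   # condition not met, rule skipped
        {"Mainkey": 1200, "This_Column": 42},   # valid
        {"Mainkey": 1500, "This_Column": 99}]   # invalid value

for row in rows:
    failed = [rule for rule in rules if not rule(row)]
    if failed:
        print("invalid row:", row)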

Executing the data validation

The data validator formulates assertions about the data in an assertion script for each database table and passes it to the compiler. The compiler checks the syntax and verifies the data names against the database schema. If errors are found, the validator must correct the script. If the script is correct, an intermediate code is generated in which the qualified data names and constant values are referenced. A validation run is then started for each target table. The validation job processes the target table sequentially and checks one row after the other against the specified rules. If a row is missing, redundant or contains invalid values, it is listed in the validation report by key, together with the expected values. With this report, the validator can easily identify incorrect content and create corresponding error reports. The data correctness rate is the inverse ratio of the number of incorrect attributes to the total number of attributes in the database.
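
Read as a formula, the correctness rate is one minus the share of incorrect attribute values in all checked attribute values. The short Python sketch below makes this reading explicit; the figures are invented.

# The data correctness rate, read here as 1 minus the share of incorrect
# attribute values in all checked attribute values. The figures are invented.
def correctness_rate(incorrect_attributes, total_attributes):
    return 1.0 - incorrect_attributes / total_attributes

print(correctness_rate(incorrect_attributes=250, total_attributes=10000))  # -> 0.975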

Summary

The two approaches for assessing data quality have already been used in several measurement and testing projects with positive results; so far, however, they have never been used together. The proposed data evaluation process combines static schema analysis with validation of the data content. First, the rules are defined together with the users and the metrics are weighted according to the user goals. Secondly, the database schemas are analyzed statically and the structures are measured. Thirdly, the validation scripts are written together with the customer, based on what the customer thinks the values of the columns should be. Finally, the database contents are examined and the exceptions documented. A report on the status of the data quality can then be created.

The result of a data quality assessment depends to a large extent on the setting of the benchmark values and the weighting of the metrics. These can be adjusted before the measurement jobs are executed. The user may place more emphasis on the accessibility of the data and less on its independence, or may consider the flexibility of the data to be more important than its testability. The user may also be more or less tolerant of incorrect data content: for one user, a correctness rate of 95% may be sufficient, for another nothing less than 100% is acceptable.

These processes are based on practical experience from many projects in recent years. Because tools for evaluating the various aspects were lacking, a number of self-developed tools have grown out of these projects. The tools DLIAudit, ADAAudit and SQLAudit check the rules for IMS, ADABAS, DB2, Oracle and MS SQL databases; the result is a defect report on schema rule violations. SQLAudit also counts about 35 database attributes and calculates 6 complexity and 6 quality metrics. All 12 metrics are ratio metrics derived from a relation of one set of attributes to another. The database metrics are collected in a metrics report and also exported as an XML file, so that anyone who wants to store them in a metrics database or process them further can do so. These metrics can form the basis for a benchmark, without which nothing can really be evaluated. The Datatest tool provides an assertion language based on predicate logic, which the auditor can use to specify the expected data content; it has often been used to validate database contents.

The bottom line is that quality is relative: it is relative to the quality goals that have been set. This is why the Goal-Question-Metric model established by Basili and Rombach long ago is still valid. The user must first set goals, then ask how their fulfillment is to be assessed, and finally define metrics to measure the degree of fulfillment. This method can be applied equally well to the assessment of data quality. The data analyst must first set goals, such as portability, extensibility and flexibility. Metrics must then be defined to measure the extent to which they are met by the current database. Finally, they need to be measured and evaluated. Without clearly defined, measurable goals, it is not possible to evaluate anything.