Assessing Data Quality
This article is a field report from a project to assess the quality of customer databases. The process described in it combines three automated...
Conventional testing of operational information systems involves testers sitting at a workstation and operating the user interfaces. They enter the data, start a transaction and check the results; each transaction is a test case. The same applies to transaction-oriented Internet systems. For this type of system test, the tester simulates the future end user and needs no higher qualification than the end user himself.
With batch systems or background processes, the testing problem has always been different. Here the tester is much more dependent on test tools and needs a considerably higher qualification, because the test is largely invisible. One or more input files are provided, then batch runs are started that process those files to update databases or generate reports. The tester's task is to fill the input files and/or databases with test data, start the jobs and validate the results. This sounds simple at first, but is in fact far more difficult than testing a user interface. Data warehouse systems belong to this category of IT systems.

When testing a user interface, the relationship between the input data and the result data is clear: they belong to the same transaction, and they are manageable in number. An input screen contains approximately 10 data values, and the number of results in the output mask is usually even smaller. The tester therefore has to deal with a manageable amount of data per transaction.
This looks different with a batch test, and even more so with data warehouse systems. A data warehouse is usually filled from several source systems via input files or tables using load jobs. Each input medium has a large number of individual attributes, often more than a hundred, and several input sources multiply the number of data attributes. Within the data warehouse system, many further jobs transform and aggregate these attributes. The connection between the input data on the one hand and the data stored in the data warehouse on the other is by no means as clear as when testing user interfaces and is hardly intuitive to understand.
In this situation, the tester is dependent on formal means. He needs rules that describe the relationship between the inputs and outputs. Otherwise, he can only check the plausibility of the result data, i.e. whether it belongs to a valid range of values, regardless of the state of the input data. However, this is a very superficial test. If the rule says that every number > 0 and < 1000 is plausible, then every number between 0 and 1000 is accepted as a correct result, regardless of how it was generated. Such a test is better than none at all, but hardly sufficient to guarantee correctness.
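To illustrate how little such a plausibility check proves, here is a minimal Python sketch; the attribute range and values are invented for illustration and do not come from the project:

# Minimal sketch of a plausibility check: it only verifies that a result
# lies in a valid range of values, not that it was derived correctly.
def is_plausible(value: float, low: float = 0.0, high: float = 1000.0) -> bool:
    return low < value < high

# Both values pass, although at most one of them can be the result
# prescribed by the actual transformation rule for a given input.
print(is_plausible(42.0))    # True
print(is_plausible(999.9))   # True - plausible, but not necessarily correct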
Statistical testing or comparing the frequency of selected result types is also inadequate, as it can only reveal rough deviations between the target and actual values. It is better than the plausibility test, but still not good enough to prove the correctness of the results. At best, it can uncover some representative deviations.
The only really effective method for verifying all individual results is to check each individual result against its rule. For each result value, the prescribed formula for its creation and the corresponding input values must be used to calculate whether the value is correct or not. In principle, this check can be carried out manually: the tester needs one window on the input data, a second window on the rule and a third window on the results. Working this way, he needs roughly half an hour per result value. With larger data warehouse systems, it is quite possible that up to 5000 attributes are read from a single source system. Even if only one representative result value is verified for each result type, with 5000 result types this amounts to approx. 2500 hours, or 312.5 days, of intensive work just to verify the results. If the other source systems and the transformations within the data warehouse are also considered, the manual effort multiplies.
The other problem with this manual comparison is that it is not only time-consuming but also very demanding. It requires the tester's full concentration to reproduce the rule for each result type. Such an intensive mental effort can hardly be expected from employees under today's project conditions; the error rate of rule comprehension would be higher than the error rate of rule coding. Therefore, there is only one solution: to automate the results check. In one of our first data warehouse test projects in 2006, we were able to gain experience with automated results checking.
It all starts with the requirements specification. According to Hutcheson, there is no possibility of verification without a specification. In a data warehouse project, the specification is the sum of all data transformation rules. In this data warehouse project, the data transformations were created using an entity-relationship modeling tool and the rules were specified in Excel tables. In each table, there was a row for each new data attribute or result type. In addition to the table and attribute names, the attribute type, the attribute description and, of course, the rule for the attribute were documented in the rows. For the example described here, values and rules of the load jobs from a source system were used. The other source systems and the transformation and aggregation jobs in the data warehouse system were tested according to the same scheme.
Excerpt from the original rule table:
INTEREST_RATE: "Link to TAB_X / ATTR_D interest conditions applicable to the account."; If ATTR_B in TAB_Y(debit) has a value other than 0, this account is linked to an interest group. The following then applies: TAB_Y / KEY (Position 3-4) (Interest Type) A(2), TAB_Y / KEY (Position 5-9) (Interest Subtype) A(5). TAB_Y / KEY (Position 10-12) A(3). The above Key fields are concatenated in the ID. If in TAB_B the ATTR_D values are zeroes, the account interest condition has to be extracted from the X_TAB: TAB_X / ATTR_B (Base Rate) TAB_X / ATTR_S (Spread Rate). If both attributes <> 0 value, extract TAB_X If TAB_X / ATTR_D is available (i.e. <> 0), extract the same as for TAB_X_ID. If only ATTR_D of TAB_Y <> value, extract ATTR_D.
The rules were written in quasi-formal English prose by the analysts. At first glance, it seemed as if an automatic translation of the informal rules into a formal, interpretable expression would be impossible. On closer inspection, however, certain basic patterns emerged in the rules. Over 4700 of the 5300 result attributes were specified with one of the following 7 rule patterns:
Each of these assignment rules could be conditional or unconditional. Conditional rules were linked to some condition, e.g. if, when, as long as, in case, etc. The condition operands could be both constant values and source attributes. A special condition was the success of a join operation. As a rule, these were simple ifs with else/otherwise clauses, but they were not always unambiguous.
Of course, these basic rules were always formulated slightly differently by different analysts, but they were similar enough to be able to recognize and classify them. In addition, the constant values and the source attributes were recognizable. The source attributes were always written in capital letters, the constant values were set under quotation marks. This provided the basis for the automatic conversion of these semi-formal rules into strictly formal, testable assertion statements that could be translated into interpretable test procedures by an assertion compiler. However, some of the rules had to be rewritten beforehand in order to be analyzable. This task was carried out by the testers. Approximately 3900 rules belonged to rule classes 1 and 2 and did not need to be revised. The remaining 1400 rules had to be rewritten, but due to the lack of time, only 750 rules were implemented according to the following patterns:
The rules for a list of alternative values (enumeration) could consist of both constants and variables. The operands were separated by “!” characters.
assign 'const01' ! 'const02' ! 'const03'
assign TAB_A.ATTR_1 ! TAB_A.ATTR_2 ! TAB_A.ATTR_3
The rules for a union (concatenation) of several source attributes from a source table could consist of both constants and variables. The operands were separated by “÷” characters.
assign TAB_A.ATTR_1 ÷ '_' ÷ TAB_A.ATTR_2
The rules for joining several source attributes from different source tables were presented as join operations.
assign join TAB_A.ATTR_1 with TAB_B.ATTR_2 with TAB_C.ATTR_3
The rules for assigning target attributes using arithmetic operations had to be slightly reformulated to make them unambiguous.
assign TAB_A.ATTR_1 * -1 + TAB_B.ATTR_2 / 2
As there was no provision for bracketing, the operands had to be placed in the correct order so that they could be resolved from left to right. This was sufficient for the majority of arithmetic expressions.
The rules for assigning function results were formalized as a function call with the parameters written in curly brackets.
assign Func.x {TAB_A.ATTR_1, '2', TAB_B.ATTR_2}
Finally, all conditions were converted into a standardized form. This form provided for an unbracketed list of and/or comparisons for better comprehensibility.
assign <assignment> if TAB_B.ATTR_1 = '1' and TAB_C.ATTR_2 = '0'
assign <assignment> if TAB_D.ATTR_3 < TAB_E.ATTR_1 or TAB_D.ATTR_3 > '0'
The output of a join operation could also be part of a condition, as the following example shows:
assign <assignment> if join TAB_A.ATTR_1 with TAB_B.ATTR_3 = true
All if conditions were linked to an assign statement: the assignment was noted first, followed by the determining condition.
assign TAB_D.ATTR_2 if join TAB_D.ATTR_1 with TAB_B.ATTR_1 = true
The leitmotif for this syntax was to remain as close as possible to the existing English prose rule so that the relevant analysts could easily understand it. It thus represents a compromise between the need to be understood by experts and the need to be interpreted by an automaton. As such, the form of expression comes close to a specialized language for data warehouse projects.
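To give an impression of how such semi-formal rules can be recognized mechanically, here is a minimal Python sketch of the kind of pattern classification described above. It is illustrative only and deliberately simplified; the class names and the classification order are assumptions, not the project tool itself.

import re

# Illustrative sketch: classify an "assign" rule into one of the basic
# patterns described above ('!' for alternatives, '÷' for concatenation,
# 'join ... with', arithmetic operators, function call in curly brackets).
def classify_rule(rule: str) -> str:
    body, _, condition = rule.partition(" if ")
    body = body.removeprefix("assign ").strip()
    if body.startswith("join "):
        kind = "join"
    elif re.search(r"\w+\.\w+\s*\{", body):          # e.g. Func.x {...}
        kind = "function"
    elif "!" in body:
        kind = "enumeration"
    elif "÷" in body:
        kind = "concatenation"
    elif re.search(r"\s[*+/-]\s", body):
        kind = "arithmetic"
    else:
        kind = "simple assignment"
    return kind + (" (conditional)" if condition else "")

print(classify_rule("assign TAB_A.ATTR_1 ÷ '_' ÷ TAB_A.ATTR_2"))
print(classify_rule("assign TAB_D.ATTR_2 if join TAB_D.ATTR_1 with TAB_B.ATTR_1 = true"))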
Excerpt from the adjusted rule table:
INTEREST_RATE: "Link to TAB_X.ATTR_D interest conditions.
Debit interest conditions applicable to the account.";
"assign TAB_Y.ATTR_D | TAB_Y.ATTR_C | TAB_X.ATTR_C | 'D' if TAB_Y.ATTR_D (debit) <> '0',
assign TAB_X.ATTR_C if TAB_X.ATTR_D <> '0',
assign TAB_X.ATTR_C | TAB_X.ATTR_D if TAB_X.ATTR_B <> '0',
assign '*na*' if TAB_Y.ATTR_D = '00' and TAB_X.ATTR_S = '0' and TAB_X.ATTR_B = '00',
assign TAB_X.ATTR_N | ATTR_Y | ATTR_O | ATTR_A if other (comment).";
During the formal revision of the rules, documentation errors were discovered which were reported to the analysts by the testers. These were mostly incomplete rule descriptions (if without else, missing keys in join statements, etc.) that left room for interpretation, which prevented an exact check of the rule. This affected around 4% of all rules.
Of the approximately 5300 rules in the specification, over 3900 could be implemented without adaptation. This meant that 1400 rules had to be adapted, of which around 600 turned out to be unimplementable. In the end, almost 800 rules were rewritten in the formal syntax. This took an average of 20 minutes per rule, so around 266 hours were required to formalize the rules. Alternatively, it would have been possible a) not to test these rules or b) to write the assertion scripts manually. However, writing the assertion scripts manually would have cost at least as much effort, and two semantic levels with two different descriptions of one and the same rule would have had to be kept up to date.

A similar problem arises when creating code. If the code is generated from the specification, only the specification needs to be maintained. However, if the code is written manually, which is usually the case, both the specification and the code must be updated. As this is too time-consuming in the long term, the specification is soon neglected and only the code is developed further. This takes the analysts out of the game and the product remains in the hands of the programmers, with all the negative consequences associated with historically grown systems, above all the fact that nobody understands them, not even the programmers themselves.
To avoid this situation, it is advisable to maintain only one description of an IT system, namely the one whose language comes closest to human language - the domain-specific language. At least the test scripts should be generated automatically. This was also done in this project. The assertions were generated from the specified rules using a tool.
Generated test script:
file: ACCOUNT;
// This comparison procedure assumes that the old file contains the following attributes:
// from TAB_X Table: A_ID, A_TYPE, ATTR_S, ATTR_R, ATTR_C, ATTR_S, ATTR_D, ATTR_F, ATTR_N
// from TAB_Y Table: ATTR_C, ATTR_D, ATTR_E, ATTR_F, ATTR_P
// from TAB_Z Table: ATTR_R
if ( new.A_ID = old.ATTR_ID );
assert new.A_ID = old.A_TYPE if (old.A_TYPE = "G");
assert new.A_ID = "S" if (old.ATTR_A = "R" & old.ATTR_S = "S");
assert new.A_ID = "C" if (old.ATTR_A = "R" & old.ATTR_S = "C");
assert new.A_TYPE = "L" if (old.ATTR_A = "R" & old.ATTR_S = "L");
assert new.A_RATE = old.ATTR_C if (old.ATTR_B = "0");
assert new.A_INTEREST = old.ATTR_D;
assert new.A_LIQU = old.ATTR_E;
assert new.A_ACCT = old.ATTR_P;
assert new.A_PC = old.ATTR_PC;
assert new.START_DATE = "2005.05.15" if (old.ATTR_H <> "0");
assert new.REVIEW_DATE = old.ATTR_V if (old.ATTR_F <> "0" & old.ATTR_N <> "0") ;
assert new.REVIEW_DATE = "NULL" if (old.ATTR_F = "0") ;
assert new.PRIMARY_INDICATOR = "P" ! "S" ! "*n.a.*";
assert new.INDICATOR = "inactiv" if (old.ATTR_R = X"3") ;
assert new.INDICATOR = "inactiv" if (old.ATTR_R = X"1") ;
assert new.INDICATOR = "closed" if (old.ATTR_R = "C") ;
assert new.INDICATOR = "active" if (other);
end;
The simple thing about test assertions compared to code is that they can be arranged in any order for a specific file or table. Each attribute of a table is tested on its own, independently of the other target attributes. It is only important that all assertions of a table are together in one test procedure. The test procedure has the name of the target table and an if-key condition to connect the target object with the source object. This if-key condition has the following format:
if (TAB_X.KEY_1 = TAB_A.KEY_1 & TAB_X.KEY_2 = TAB_B.KEY_2 & TAB_X.KEY_3 = TAB_C.KEY_3 &
The key comparison seeks to match the target object with the correct source objects. The data of the source objects is first retrieved from the source tables using select and join statements and combined into comparison objects corresponding to the target objects. This means that for each target object there is a comparison object with all the source data from which the target object is formed. This creation of the comparison objects from the source tables is a prerequisite for the verification of the result values. All comparison objects of the same object type or all comparison objects for a target object type are summarized as rows in a table.
For this reason, the select statement for creating the comparison objects is also generated as part of the assertion generation. These select statements are derived from the join and select statements in the rule specification.
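The derivation of such a select statement can be sketched as follows; the table, attribute and key names below are invented for illustration and the real generator naturally covered many more cases:

# Illustrative sketch: build the SELECT that collects the comparison
# object for one target table from its source tables and join keys.
# All names here are assumptions; they do not come from the project.
def build_select(target, source_attrs, join_keys):
    columns = ", ".join(f"{tab}.{attr}" for tab, attrs in source_attrs.items() for attr in attrs)
    tables  = ", ".join(source_attrs)
    joins   = " AND ".join(f"{left} = {right}" for left, right in join_keys)
    return f"SELECT {columns}\nFROM {tables}\nWHERE {joins};  -- comparison objects for {target}"

print(build_select(
    "ACCOUNT",
    {"TAB_X": ["A_ID", "A_TYPE", "ATTR_C"], "TAB_Y": ["ATTR_C", "ATTR_D"]},
    [("TAB_X.KEY_1", "TAB_Y.KEY_1")],
))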
The assertion statements themselves are conditional or unconditional. Conditional assertions are linked to an if condition.
The assertion statement has the following basic syntax:
assert new.TAB.ATTRIBUTE <oper> <comparison expression>;
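Evaluating one such assertion against a target record and its comparison object can be sketched in a few lines of Python; the operator set and the record representation are simplifications for illustration, not the implementation used in the project:

import operator

# Sketch: evaluate   assert new.ATTR <oper> <expected> if (<condition>);
# against one target record ("new") and its comparison object ("old").
OPS = {"=": operator.eq, "<>": operator.ne, "<": operator.lt, ">": operator.gt}

def check_assertion(new, old, attr, oper, expected, condition=None):
    """Return True/False for a checked value, or None if the condition is not met."""
    if condition is not None and not condition(old):
        return None                      # the assertion does not apply to this record
    return OPS[oper](new[attr], expected)

# Example corresponding to:  assert new.A_RATE = old.ATTR_C if (old.ATTR_B = "0");
new_rec = {"A_RATE": "3.5"}
old_rec = {"ATTR_B": "0", "ATTR_C": "3.5"}
print(check_assertion(new_rec, old_rec, "A_RATE", "=",
                      old_rec["ATTR_C"], condition=lambda o: o["ATTR_B"] == "0"))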
Rules of rule class 7 (functions) could not be converted into testable assertions by the assertion generator, as the connection to the necessary function libraries was missing. These assignments were marked as comments in the assertion procedures and had to be tested manually by the testers.
The assertion generator parsed the Excel tables in which the rules are specified and converted the rules into assertions one after the other. One or more assertions were generated from each rule; this was done for the 266 target tables in less than a minute. At the end there were 266 assertion procedures and a log of all missing and untranslatable rules, statistics on the rules and SQL select statements for selecting the source data for a target object. The biggest advantage of automatic assertion generation, apart from the syntactic correctness of the assertions, is the repeatability of the process. If the rules are changed, it is only necessary to restart the assertion generation and thus regenerate the affected assertion procedures.
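A heavily simplified sketch of this generation step might look as follows. It reads the rule rows from a CSV export of the Excel tables and writes one assertion procedure per target table; the column names, the delimiter and the trivial rule-to-assertion mapping are assumptions for illustration only.

import csv
from collections import defaultdict

# Simplified sketch of the generation step: read the specified rules and
# emit one assertion procedure per target table.
def generate_assertions(rule_file):
    procedures = defaultdict(list)
    with open(rule_file, newline="", encoding="utf-8") as fh:
        for row in csv.DictReader(fh, delimiter=";"):
            table, attribute, rule = row["TABLE"], row["ATTRIBUTE"], row["RULE"].strip()
            if not rule:
                continue                       # counted as "attribute without a rule"
            body, _, condition = rule.removeprefix("assign ").partition(" if ")
            stmt = f"assert new.{attribute} = {body.strip()}"
            if condition:
                stmt += f" if ({condition.strip()})"
            procedures[table].append(stmt + ";")
    return procedures

for table, asserts in generate_assertions("rules_export.csv").items():
    with open(f"{table}.assert", "w", encoding="utf-8") as out:
        out.write(f"file: {table};\n" + "\n".join(asserts) + "\nend;\n")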
Example 4: Statistics from the test script generation
+---------------------------------------------------+
| Assertion Generation completed !
| Number of E/R Rule Lines processed = 11384
| Number of new Tables processed = 266
| Number of new Attributes processed = 10910
| Number of old Tables processed = 132
| Number of old Attributes processed = 00891
| Number of new Attributes without a rule = 05573
| Number of Transformation Rules specified = 05337
| Number of Transformation Rules processed = 04742
| Number of Basel-II Rules recognized = 00890
| Number of Basel-II Rules processed = 00852
| Number of Complex Rules recognized = 00853
| Number of Complex Rules processed = 00562
| Number of Complex Basel Rules recognized = 00157
+---------------------------------------------------+
| Number of Assert Scripts generated = 00266
| Number of Assertions generated = 04897
| Number of Table Selections generated = 00181
| Number of Assert Keys generated = 00023
| Number of Assert Conditions generated = 00308
| Number of Assert Concatenates generated = 00103
| Number of Assert Alternates generated = 00365
| Number of Assert Arithmetic generated = 00009
| Number of Test Cases generated = 05337
+---------------------------------------------------+
One way to determine the test coverage is to determine the ratio of the test cases performed to the planned test cases. In order to be able to measure the test coverage in this project and also to make the test progress quantifiable, a separate test case had to be created for each rule to be tested. As the manual creation of test cases would have involved a great deal of effort, these were also generated by the assertion generator in the form of CSV files. The CSV files could then be completely imported into the test case management tool later on.
The following information was generated for each test case:
In this data warehouse project, “Quality Center” from Mercury was used as a test case management tool, which was well suited for the generated test cases due to its flexible import interface.
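The test case export itself needs little more than the standard csv module. The sketch below shows the idea with invented column names; the actual import mapping was configured in the test case management tool and is not reproduced here.

import csv

# Sketch: one generated test case per transformation rule, written as CSV
# for import into the test case management tool. The column names are
# illustrative assumptions, not the project's import layout.
def write_test_cases(rules, path="testcases.csv"):
    fields = ["TestCaseID", "TargetTable", "TargetAttribute", "Rule", "Status"]
    with open(path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=fields, delimiter=";")
        writer.writeheader()
        for i, rule in enumerate(rules, start=1):
            writer.writerow({
                "TestCaseID": f"TC_{i:05d}",
                "TargetTable": rule["table"],
                "TargetAttribute": rule["attribute"],
                "Rule": rule["text"],
                "Status": "planned",
            })

write_test_cases([{"table": "ACCOUNT", "attribute": "A_RATE",
                   "text": "assign TAB_X.ATTR_C if TAB_X.ATTR_B = '0'"}])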
A defined and repeatable test process was established for the multiple repetition of the data warehouse test using the test tools described here. This process consisted of 8 steps:
The result of the 8 steps was a log of the missing data objects and the deviating data attributes, i.e. the data whose values did not match the result of the assertions. These were identified as incorrect.

Verification of the test results against the assertions
Functional testing requires an oracle. The oracle serves as the source of truth; it tells us whether the test results are right or wrong. James Bach writes: “An oracle is an evaluation tool that will tell you whether the system has passed or failed a test. In high volume automated testing, the oracle is probably another program that generates or checks the test results. The oracle is generally more trusted than the software under test, so a concern flagged by the oracle is worth spending time and effort to check…”
In the data warehouse project, the in-house developed tool “DataTest” was used as an oracle, which interpreted the assertion procedures and verified the test results against the rule. For each target table, the corresponding source data was collected and stored in an intermediate database, where it was sorted according to its key. The target table was then read and verified for each data record using the source data matching this data record. Each attribute was checked against its assertions. Result data that deviated from the specified result according to the assertion and source data were listed in a deviation log. The tests carried out have shown that it is perfectly possible to automatically validate a large number of test results in just a few minutes.
Furthermore, all target datasets for which there was no matching source dataset and source datasets for which there was no matching target dataset were logged. This gave the testers a list of all unpaired keys and all incorrectly filled target attributes in the data warehouse. Of course, these had to be checked manually by the testers to determine whether there was a genuine error. The result was documented for each data record in the associated test case in the “Quality Center” test case management tool and an error was reported if the test case was not successful. This allowed the test progress to be determined for controlling and reporting purposes.
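The matching and logging step described in the last two paragraphs can be sketched roughly as follows; the record and assertion representations are simplified and hypothetical, not the in-house tool itself:

# Rough sketch of the matching step: pair target records with their
# comparison objects by key, log unpaired keys on both sides and check
# each paired record attribute by attribute against its assertions.
def compare_by_key(new_rows, old_rows, assertions):
    report = {
        "missing_in_new": sorted(set(old_rows) - set(new_rows)),
        "missing_in_old": sorted(set(new_rows) - set(old_rows)),
        "deviations": [],
    }
    for key in sorted(set(new_rows) & set(old_rows)):
        for attr, check in assertions.items():
            # check() returns True/False, or None if its condition is not met
            if check(new_rows[key], old_rows[key]) is False:
                report["deviations"].append((key, attr))
    return report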
In this way, a number of incorrect assignments from the import files to the data warehouse tables could be uncovered. Unfortunately, it was not possible to verify all attributes, as not all rules could be converted into assertions. Of the 5300 rules, only 4700 could be transformed into assertions. The rest had to be checked manually, which of course took a lot of time.
Example 5: Excerpt from the data validation report
+----------------------------------------------------------------------------------------+
| File/Table Comparison Report                                                            |
| New: ACCOUNT   ||   Old: Attributes                                                     |
| Key Fields of Record (new,old)                                                          |
+----------------------------------------------+-----------------------------------------+
| RecKey:100000USD114601001                    | duplicate key in old File/Table         |
+----------------------------------------------+-----------------------------------------+
| RecKey:100000XXX104501001                    |                                         |
| New: ATTR_ID                                 | G                                       |
| Old: Constant_Value                          | L                                       |
+----------------------------------------------+-----------------------------------------+
| RecKey:100000YYYY104501001                   |                                         |
| New: ATTR_C                                  | XXX00                                   |
| Old: ATTR_C                                  | 0                                       |
+----------------------------------------------+-----------------------------------------+
| RecKey:100000ZZZ104501001                    |                                         |
| New: ATTR_D                                  | 0                                       |
| Old: ATTR_P                                  | 1                                       |
+----------------------------------------------+-----------------------------------------+
| RecKey:100000ZZZ104501001                    |                                         |
| New: REVIEW_DATE                             | 9999-08-01_00:00:00                     |
| Old: Constant_Value                          | NULL                                    |
+----------------------------------------------+-----------------------------------------+
| RecKey:100000YYYY104501001                   |                                         |
| New: ATTR_P                                  | GL                                      |
| Old: Constant_Value                          | RE                                      |
+----------------------------------------------+-----------------------------------------+
| RecKey:100000XXX104601001                    |                                         |
| New: START_DATE                              | 9999-08-01_00:00:00                     |
| Old: Constant_Value                          | NULL                                    |
+----------------------------------------------+-----------------------------------------+
| RecKey:100000XXX104701001                    |                                         |
| New: ATTR_X                                  | G                                       |
| Old: Constant_Value                          | C                                       |
+----------------------------------------------+-----------------------------------------+
| RecKey:100000ATS196501100                    | missing from the new File/Table         |
+----------------------------------------------+-----------------------------------------+
| Total Number of old Records checked:        91                                          |
| Number of old Records found in new File:    08                                          |
| Number of old Records with duplicate Keys:  72                                          |
| Number of old Records not in new Table:     11                                          |
| Total Number of new Records checked:        59                                          |
| Number of new Records found in old File:    08                                          |
| Number of new Records with alternate Keys:  00                                          |
| Number of new Records not in old File:      51                                          |
| Total Number of Fields checked:             93                                          |
| Total Number of non-Matching Fields:        46                                          |
| Percentage of matching Fields:              51 %                                        |
| Percentage of matching Records:             14 %                                        |
+----------------------------------------------------------------------------------------+
In addition to the verification of the transformation rules, further test scenarios were automated in the data warehouse project. These were all developed in the Python scripting language, which over the course of the project grew into a comprehensive framework for data-based test automation and static code analysis of the data warehouse environment. Among other things, the following test approaches were implemented in this framework:
Thanks to automation, it was possible here too to make the test repeatable on the one hand and to extend it across the entire data warehouse on the other.
The following conclusions were drawn from the data warehouse test project:
The test approach described here formed the basis for further data warehouse tests and was continuously optimized in the process.