
(Test)data radical cure - Richard Seidl

Written by Richard Seidl | Jun 13, 2024 10:00:00 PM

“There’s a budget again. It just needs to say AI on it.” - Richard Seidl

Test data management is currently experiencing a renaissance. Driven, of course, by AI and the possibilities it offers: better, more accurate test data, simple generation, effortless management across system boundaries. Ideas are bubbling over as to what will be possible … or would be … or well, maybe … hm.

There are a few classic challenges in software development. But while we have more or less solved the ones around releases (pipelines) and test environments (cloud), test data is another matter. Especially in application landscapes with many different systems and data stores, test data initiatives can quickly get out of hand. And no wonder, because the challenges are manifold:

  • Data, data, data - there is simply a lot of it. Countless tables and fields, and thousands, sometimes even millions, of data records. All of it cross-linked, plus import/export data scattered across directories.
  • Compatibility - Each system has its own schema, its own data organization, which does not necessarily fit together. A clean data architecture across system boundaries is not easy. Different responsibilities then add further complexity.
  • Synthetic vs. real data - Synthetic test data helps me immensely with my structured test cases (boundary values, equivalence classes, etc.) - but it is not reality either.
  • And as soon as you get to real data with all its peculiarities, errors, etc., data protection is just around the corner: anonymization and pseudonymization require a lot of energy and time.
  • Want to take it up a notch? Then our test data must also support historization and time travel. Yeah - jackpot.
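The synthetic side of the list above can be made concrete. Here is a minimal sketch of deriving test records from boundary-value analysis and equivalence classes; the `age` field and its 18-67 valid range are invented for illustration and do not come from the post.

```python
def boundary_values(lo, hi):
    """Classic boundary-value analysis: the edges of the valid range
    plus the first invalid value on each side."""
    return [lo - 1, lo, lo + 1, hi - 1, hi, hi + 1]

def equivalence_classes(lo, hi):
    """One representative per class: below, inside, above the range."""
    return [lo - 10, (lo + hi) // 2, hi + 10]

# Hypothetical valid range for an "age" field.
LO, HI = 18, 67

# Each record carries its expected validity, so the generated data
# doubles as an oracle for the structured test cases.
records = [
    {"age": v, "expect_valid": LO <= v <= HI}
    for v in boundary_values(LO, HI) + equivalence_classes(LO, HI)
]
```

Handy for the structured cases - and, as the bullet says, still no substitute for the peculiarities of real data.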
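And for the data-protection bullet: pseudonymization can be as simple as a keyed hash. A minimal sketch, assuming a secret key kept outside the test environment; the sample customer record and field names are made up (the name and IBAN are standard German placeholder examples).

```python
import hashlib
import hmac

# Assumption: the key lives in a secret store, not in the test data.
SECRET = b"rotate-me-regularly"

def pseudonymize(value: str) -> str:
    """Deterministic pseudonym: the same input always yields the same
    token, so joins across systems keep working, but the original
    value cannot be recovered without the key."""
    return hmac.new(SECRET, value.encode(), hashlib.sha256).hexdigest()[:16]

customer = {
    "name": "Erika Mustermann",
    "iban": "DE89 3704 0044 0532 0130 00",
}
masked = {field: pseudonymize(value) for field, value in customer.items()}
```

Determinism is the point here: swap the HMAC for a random token and the cross-system links in the data break - which is exactly where the energy and time go.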

The AI will fix it, right?

No problem at all, right? Just throw all the rules, requirements, etc. into an AI and generate our test data across all systems, almost in real time - a dream. But I’m pretty sure it won’t be that easy. There are already some very nice approaches to generating and managing test data. My observation, however, is that they often just treat the symptoms. I would rather ask two other questions.

What data do I really need? (And of those: which do I truly need for testing?) Just because we can store everything doesn’t mean we have to. It’s so easy to add a field to a table - but the effects can be dramatic. So: just leave it out, and delete all structures that are not needed. A (test) data radical cure!
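Finding candidates for that radical cure can start very pragmatically. A sketch, with invented column names and sample rows: columns that are always NULL or carry a single constant value contribute nothing to tests and are worth reviewing for deletion.

```python
rows = [
    {"id": 1, "name": "A", "legacy_flag": None, "channel": "web"},
    {"id": 2, "name": "B", "legacy_flag": None, "channel": "web"},
    {"id": 3, "name": "C", "legacy_flag": None, "channel": "web"},
]

def cure_candidates(rows):
    """Flag columns whose values are always NULL or constant -
    review these for the (test) data radical cure."""
    candidates = []
    for col in rows[0]:
        values = {row[col] for row in rows}
        if len(values) == 1:  # always NULL counts as constant too
            candidates.append(col)
    return candidates

print(cure_candidates(rows))  # -> ['legacy_flag', 'channel']
```

Of course this only surfaces candidates; whether a constant column is dead weight or a business rule still needs a human decision.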

Do I have a suitable data architecture? In cross-system architectures I see many interfaces and dependencies, but hardly ever an overall picture of the data content, the data flows, and where which data is sensibly stored - so that it isn’t held redundantly or circularly. Then things start to fall into place.

And if the data is halfway decent, then I’ll think about something with AI 😉