How to track changes, stay agile and sane at the same time

A fight with and about changes in MariaDB

The Intro

Welcome! If you are a software developer, you might have encountered feats similar to the one I am about to describe. Maybe you wondered whether you really found the best solution. I will put code (SQL, Kotlin) only in the chapter about The Fight. In case you do not feed computers with commands for sport - you will know what to do.

The Philosophy

They say change is the only constant in life. By "they" I mean that this saying is attributed to Heraclitus, a Greek philosopher. It is often used, in variations, as a classic play on words to show how unrealistic our wish really is to define rules that will stay valid for months and years to come. In the quest that follows, my brave team and I encountered not only changing requirements over time - tracking changes was part of the feature itself. Yes, a bit meta, I know.

The Feat

Without further ado, what did we face?

  • "We will have a lot of data that will be added in many different forms" - pretty standard.
  • "From a small form to hundreds of forms and tables" - bigger database schemas then.
  • "Data supporting multiple languages and various configurations" - ok, we might invest in data migrations.
  • "A lot of it will be optional, some will change over the next months and years" - let us see..
  • "All that data will get a status and once submitted should get a version and be recallable" - oh well..

So much for a relational database - or is it?

Having seen similar requests in different projects, I knew it was time to prepare for a longer run. Be it managing historic data for insurance contracts, where there are not only current and historic but also pending versions. Huge university databases, where old curricula need to be supported for a long time. Or projects where it is simply about tracking changes to master data that can be edited by multiple users. It was time to become a chess player, thinking multiple steps and changes ahead, to choose not only the best fitting solution for now but the one that can be adapted later, so that future me has less to worry about.

The Heroes

Usually, you have some of the following solutions raising their hands:

  • The extra relational database schema (the ancient one, always differentiates perfectly between the challenges)
  • The relational database with extra historic tables (the old hero, often sung about for its stable results)
  • The non-relational database (the new one risen, celebrated for its agility on change)
  • The extra combined key with a timestamp to have everything in the same place (the jack of all trades, avoids conflicts by being slow)
  • The database-specific versioning (the underdog, usually for limited use only)
  • The combined structure - formerly XML, nowadays often JSON - pushed into a relational database (the Harley Quinn, mostly chaotic, may produce unexpected results)

Yes, they all come with their own downsides:

  • be it problems when forgetting to adapt changes in historic tables
  • keeping different tables in sync across different schemas, or duplicated logic
  • having to deal with complex keys and queries because of too much data
  • incompatibilities arising from too many changes in non-relational structures
  • the dependency on a specific database without any way out or..
  • the sheer suicidal amount of problems Harley Quinn might bring to the table (pun intended)

The Underdog

With all that being said, our team went with the somewhat unsung hero: database-specific versioning. We chose to use system-versioned tables, introduced by MariaDB in version 10.3.4 in 2018 [1]. This was only possible since we had decided to go for MariaDB when we chose our relational database, but there are similar solutions for other databases out there. The selection of this solution might not be straightforward. Let me outline why:

  1. System-versioned tables are a database-specific solution, which means that native queries will have to be written. From a generalist's point of view this is disputed, as some prefer to stay out of any lower layer and remain database-independent.
  2. System-versioned data rows cannot be dumped in a simple way, which makes copies and backups a bit less obvious and means investigations into problems need extra steps.
  3. These rows also cannot be adapted, by their nature, meaning that changes can only take place on the last, current row. This is one of the most important traits of our hero to ponder.
  4. Although those data rows cannot be changed, the data and its type could change if the table or column configuration is changed. This has to be considered in every migration.
  5. Every change to a value in a system-versioned table will lead to yet another row being added, which usually means being prepared for A LOT of data.

So why would we choose this solution even though it seems to have so many limitations? To answer that, I want to highlight again what I stated in the beginning about chess. We need a solution that will not slow us down and that enables a lot of changes later on. So even though we accept that we will need native queries, and that there will be problems whenever we forget to modify those, we gain the advantage of only one schema and only one crowd of entities to support. There will be compilation failures should there be problems in mappings or incompatible changes in the future. This is all a huge improvement over the alternatives of extra historic tables or throwing everything into non-relational database structures. Another downside is the fact that, since we only need specific submitted versions, we will potentially generate a lot of data that we do not even need. Here we knew that we could handle that later by cleaning up unused rows, enjoying the luxury of no extra work when creating new versioned tables for our cause.

The Fight

Enabling system-versioning is the easy part. All you need (the statement is specific to MariaDB, but other databases offer similar syntax) is to add system versioning:

ALTER TABLE project ADD SYSTEM VERSIONING;
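
For a brand-new table, the same can be declared at creation time. A minimal sketch, using a simplified project table (the columns are assumptions for illustration):

    CREATE TABLE project (
        id   BIGINT AUTO_INCREMENT PRIMARY KEY,
        name VARCHAR(255) NOT NULL
    ) WITH SYSTEM VERSIONING;

MariaDB then maintains the invisible row_start and row_end columns for every row automatically.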

Some potential problems were already anticipated in the chapter above, but let us have a look at the actual challenges we faced. And of course, let us see how we prevented being knocked out early.

Round 1

In the beginning, it was mostly about accepting that we would not get around writing native queries. This was also an internal fight with ourselves: it meant making peace with breaking the rule to stay as general and as independent of a specific database as possible. Still, we went with a duplication approach: providing separate logic so that fetching the current version works without database-specific queries. This would help us in the future to find potential performance issues and to stay compatible and neutral on that side. Like this, it was easier to implement the historic queries part by part:

    @Transactional(readOnly = true)
    override fun getPartnerFinancing(
        partnerId: Long,
        version: String?
    ): PartnerFinancing {
        return projectVersionUtils.fetch(version,
            projectId = projectVersionUtils.fetchProjectId(version, partnerId),
            // current data is served by the regular, database-independent query
            currentVersionFetcher = {
                PartnerFinancing(
                    finances = partnerFinancingRepository
                        .findPartnerFinancingById(partnerId)
                        .toFinancingModel()
                )
            },
            // historic data is served by a native query against the versioned rows
            previousVersionFetcher = { timestamp ->
                PartnerFinancing(
                    finances = partnerFinancingRepository
                        .findPartnerFinancingByIdAsOfTimestamp(partnerId, timestamp)
                        .toPartnerFinancingHistoricalData()
                )
            }
        ) ?: PartnerFinancing(
            finances = emptyList()
        )
    }
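
Underneath, previousVersionFetcher ends up in a native query. A hedged sketch of what findPartnerFinancingByIdAsOfTimestamp might execute (table and column names are assumptions, not our exact schema):

    -- read the rows exactly as they were valid at the given point in time
    SELECT *
    FROM partner_financing FOR SYSTEM_TIME AS OF :timestamp
    WHERE partner_id = :partnerId;

The FOR SYSTEM_TIME clause is the MariaDB-specific part that cannot be expressed through generic repository methods.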

The take-away: It is totally fine to commit to a database system (pun intended) that you use in today's software.

Round 2

Next was the realization of what it means if schema changes are necessary. MariaDB prevents accidental changes by default, so any modification of system-versioned tables will result in errors when running the migration scripts. There is a system variable to set to let such changes pass. But of course, you have to live with the fact that this will technically alter the past.

SET @@system_versioning_alter_history = 1;
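
In a migration script, the actual change then follows right after that statement. A hedged sketch, with a hypothetical new column on the project table:

    -- runs after the SET above; the change technically affects historic rows as well
    ALTER TABLE project ADD COLUMN description VARCHAR(255) NULL;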

Any change on versioned tables should always be a conscious decision not taken lightly.

Round 3

We got the first small heart attack when we were spontaneously asked to show how to back up data before we were ready to do so locally. While this sounds obvious, since the documentation had let us know beforehand, it was still a moment of uncertainty: knowing it should be possible because the data can be queried, but also knowing that historic data really cannot be inserted manually.

mysqldump does not read historical rows from versioned tables, and so historical data will not be backed up. Also, a restore of the timestamps would not be possible as they cannot be defined by an insert/a user. [1]

As it also turns out here: a containerized environment is definitely for the win. A few commands and everything is done, without even having to take care of database-specific queries. See docker 'use volumes' [2].
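
A minimal sketch, assuming the official mariadb image and a host directory for the data files (name, path and credentials are placeholders):

    # map the data directory to a host path, so backing up means copying that path
    docker run -d --name mariadb \
      -e MARIADB_ROOT_PASSWORD=change-me \
      -v /srv/mariadb-data:/var/lib/mysql \
      mariadb:10.6

Copying or snapshotting the mapped directory (ideally while the container is stopped) preserves everything, historic rows included.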

Know your environment! Use containers - and map your volumes!

Round 4

Next up, it was time for the first near knock-down when debugging through historic values. If, like in our case, only specific versions are needed, you need an own version table to map system-versioned rows to the rows of your versions. This makes 'row_end' one of the most important values. Therefore, it is good to know that you can query those columns even though they might not show in the database view of your choice.

SELECT id, name, row_start, row_end FROM project;
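
For the mapping itself, a hedged sketch (the project_version table and its columns are assumptions for illustration): each submitted version pins a timestamp, and the historic state is read as of that point in time.

    -- hypothetical mapping table: one row per submitted version
    CREATE TABLE project_version (
        project_id BIGINT NOT NULL,
        version    VARCHAR(10) NOT NULL,
        created_at TIMESTAMP(6) NOT NULL,
        PRIMARY KEY (project_id, version)
    );

    -- fetch the project as it looked when version '1.0' was submitted
    SET @t = (SELECT created_at FROM project_version
              WHERE project_id = 1 AND version = '1.0');
    SELECT id, name, row_start, row_end
    FROM project FOR SYSTEM_TIME AS OF @t
    WHERE id = 1;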

Know all the features you are using and read about their limitations first.

Round 5

The last great challenge I want to describe was trying to win against a new opponent: the request to 're-open' old versions. As stated above, historic rows cannot be modified, so that meant saving old historic data again to have those values back as the current ones. To defend against this attack on the rules we had built up, it was necessary to come up with a trick. Otherwise, it would have meant implementing something specifically for pretty much all the different tables we had so far, as it would not have been easy to copy all data and remove specific ids already present. This plan saved us many hours of potentially painful development and included the following parts (a sketch follows the list):

  • knowing that we would be able to trace all version-specific tables back to one specific identifier over their relations
  • being able to query for table names via their constraints, so that we could generate queries only for the tables holding data relevant to the restore
  • knowing what we were doing and having the guts to disable the foreign key constraints for the actual execution of the re-saving (insert-emoji-either-screaming-or-in-awe)
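
A heavily hedged sketch of the idea in plain SQL (table, column and version names are stand-ins; our actual statements were generated per table):

    -- restore the state of version '1.0' as the new current rows
    SET FOREIGN_KEY_CHECKS = 0;
    SET @t = (SELECT created_at FROM project_version
              WHERE project_id = 1 AND version = '1.0');
    -- REPLACE closes the current rows and re-inserts the historic values,
    -- so the restore itself becomes part of the version history
    REPLACE INTO project (id, name)
        SELECT id, name FROM project FOR SYSTEM_TIME AS OF @t
        WHERE id = 1;
    SET FOREIGN_KEY_CHECKS = 1;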

Sometimes dangerous actions (like disabling constraints) might be worth a try.

Well, we are up and running, and I am sure there are more rounds to come.

The Closure

What did we learn? It may not be natural selection, but it is the fittest method that we are looking for. We should choose an implementation not for being the best or the fastest, but for being the one we can implement early on, adapt easily, and that overall offers the best compromises.

So there, take that: a technical article starting with a Greek philosopher and ending with Darwin!

The End.