Nearly every piece of software needs data to run. Nearly all systems need some form of personal data, even if it is only your email address (to log in) and a name to make it all a bit more personal.

You’ll have this data secured away in production, but you’re not supposed to use it in development. It’s people’s personally identifiable information, and you don’t want to incur the wrath of the ICO by breaching GDPR. Nope, sensitive production data stays in production, period.

It’s easy to relate to personal data as it’s sensitive, but the same restrictions surround commercially and legally sensitive data, to mention just a few. What if you want to stress test a system with excessive quantities of data, or the data simply doesn’t exist yet (e.g. for a new system or a new feature)? You’ll need to produce, through some means, fake data. Whatever you do it’ll cost, in time and/or money, but it doesn’t have to…

Let’s take stock for a moment: what do you actually need?

  1. The means to create as much data as required, whether that is less or more than you have in production
  2. The means to create data that will fit into your database, so it needs to match your database schema
  3. The means to create realistic data, so it needs to meet your business rules, and ‘look right’

There may be other nuances to what you require, but generally these are the things you need. By ‘look right’ I mean a name in the name column, not just gibberish. And business rules mean things like the dispatch date always being after the purchase date.
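To make the third requirement concrete, here is a minimal Python sketch of the two kinds of check. The `Order` record, its field names and the rather crude name test are all hypothetical, invented purely for illustration; they don’t come from any particular product.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Order:
    # Hypothetical record, used only for this illustration.
    customer_name: str
    purchase_date: date
    dispatch_date: Optional[date]  # None until the order has shipped


def looks_right(order: Order) -> bool:
    """Crude 'looks right' check: the name contains only letters, spaces, hyphens or apostrophes."""
    name = order.customer_name.strip()
    return bool(name) and all(ch.isalpha() or ch in " -'" for ch in name)


def meets_business_rules(order: Order) -> bool:
    """Business rule: the dispatch date, if present, is after the purchase date."""
    return order.dispatch_date is None or order.dispatch_date > order.purchase_date


good = Order("Ada Lovelace", date(2020, 1, 10), date(2020, 1, 13))
bad = Order("x9$#q", date(2020, 1, 10), date(2020, 1, 9))

print(looks_right(good), meets_business_rules(good))
print(looks_right(bad), meets_business_rules(bad))
```

Fake data that passes the second check but fails the first will fit your schema and still look wrong to anyone who opens the table, which is why both matter.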

There happens to be a product that allows you to achieve all of the above. It’s open source, so it’s completely free, and always will be.

The product is Data Helix, an open source Java project supported by the FINOS foundation. Data Helix is designed to solve these problems, which so many of our clients also face on a regular basis.


You can:

  • define your database schema ✔
  • define your relationships between fields ✔
  • specify how much data you want ✔
  • output data into a database ✔

And there are lots of other features in the pipeline. You can see them all here, and suggest more if you’d like.

Data Helix is a suite of tools that has the data profile at its core. This is a file that allows you to describe the data you need, covering all of the points above.

For example:

{
  "fields": [ 
    { "name": "order_reference", "type": "string" },
    { "name": "order_date", "type": "datetime" },
    { "name": "dispatch_date", "type": "datetime", "nullable": true }
  ],
  "constraints": [
    { 
      "field": "dispatch_date", "equalToField": "order_date", 
      "offset": 3, "offsetUnit": "days" 
    },
    { "field": "order_reference", "matchingRegex": "[A-Z0-9]{5}" },
    { "field": "order_date", "afterOrAt": "2010-01-01T00:00:00.000" },
    { "field": "order_date", "beforeOrAt": "2020-06-01T00:00:00.000" }
  ]
}
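To show what rows satisfying this profile might look like, here is a hand-rolled Python sketch that builds a few of them. It is only my reading of the constraints, not how Data Helix generates data. The function names, the fixed seed and the 20% chance of a null dispatch date are assumptions made for the example, and I’m assuming a null dispatch date is acceptable because the field is marked nullable.

```python
import random
import re
import string
from datetime import datetime, timedelta

# Bounds and offset copied from the profile above.
EARLIEST = datetime(2010, 1, 1)
LATEST = datetime(2020, 6, 1)
DISPATCH_OFFSET = timedelta(days=3)
REFERENCE_ALPHABET = string.ascii_uppercase + string.digits


def make_row(rng: random.Random) -> dict:
    """Build one row by hand that satisfies the profile's constraints."""
    # order_reference must match [A-Z0-9]{5}
    reference = "".join(rng.choice(REFERENCE_ALPHABET) for _ in range(5))
    # order_date must fall between the two bounds, inclusive
    span_seconds = int((LATEST - EARLIEST).total_seconds())
    order_date = EARLIEST + timedelta(seconds=rng.randint(0, span_seconds))
    # dispatch_date is nullable; otherwise it is order_date plus 3 days.
    # The 20% null rate is an arbitrary choice for this sketch.
    dispatch_date = None if rng.random() < 0.2 else order_date + DISPATCH_OFFSET
    return {
        "order_reference": reference,
        "order_date": order_date,
        "dispatch_date": dispatch_date,
    }


def satisfies_profile(row: dict) -> bool:
    """Re-check a row against the same constraints, independently of how it was built."""
    dispatch = row["dispatch_date"]
    dispatch_ok = dispatch is None or dispatch == row["order_date"] + DISPATCH_OFFSET
    reference_ok = re.fullmatch(r"[A-Z0-9]{5}", row["order_reference"]) is not None
    return reference_ok and EARLIEST <= row["order_date"] <= LATEST and dispatch_ok


rng = random.Random(42)  # fixed seed so repeated runs produce the same rows
rows = [make_row(rng) for _ in range(5)]
for row in rows:
    print(row)
```

Even for three fields this is a fair amount of code to write and maintain, and it has to change every time the schema or the rules do. The profile expresses the same intent declaratively.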

If you want to see it working, try it out in the online playground.

Take a look at the Data Helix website, the samples in the online playground or the documentation.