A mini-language for data transformation

To solve the problem of data representation in the EACTS Congenital Database, I need to extract unique facts from the combined-facts list. Whatever the transformation turns out to be, it needs some notation. For example, I could write SQL statements by hand, but this would be inefficient and hard to maintain.

Alternatively, I could write a configuration file describing the new factors and how to derive them. My first pick was an XML file. I even started writing one, but stopped after twenty lines because it wasn't much better than the SQL statements: it was long, and the notation took more space than the information itself.

    <NewFactor code="TOF">
        <Name>Tetralogy of Fallot</Name>
        <Factor code="DIAG033"/>
        <Factor code="DIAG028"/>
        <Factor code="DIAG030"/>
        <Factor code="DIAG029"/>
        <Factor code="DIAG089"/>
    </NewFactor>

I tried to write a Python data structure instead of XML.

redesign = {
    "TOF": {
        "Qualification": [

It's lighter than XML, but there are still too many quotes and brackets. Remembering Eric Raymond and his The Art of Unix Programming, I decided to write a mini-language. For now it's more like a configuration file, but I expect it to develop. The above example is now expressed in a single line:

tof:Tetralogy of Fallot: DIAG033 DIAG028 DIAG030 DIAG029 DIAG089
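Reading such a line back takes very little code. Below is a minimal sketch in Python, assuming the format is always `code:name: SOURCE SOURCE ...` with exactly two colons before the list of original codes. The names `Rule`, `parse_rule` and `parse_rules`, and the `#` comment convention, are made up for this illustration and are not part of the actual tool.

```python
from typing import NamedTuple


class Rule(NamedTuple):
    code: str            # short code of the derived factor, e.g. "tof"
    name: str            # human-readable name of the derived factor
    sources: list[str]   # original codes that map to this factor


def parse_rule(line: str) -> Rule:
    """Parse one 'code:name: SOURCE SOURCE ...' line into a Rule."""
    # Split on the first two colons only; the rest is the list of codes.
    code, name, sources = (part.strip() for part in line.split(":", 2))
    return Rule(code, name, sources.split())


def parse_rules(text: str) -> list[Rule]:
    """Parse many lines, skipping blanks and '#' comments (an assumed convention)."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            rules.append(parse_rule(line))
    return rules


rule = parse_rule("tof:Tetralogy of Fallot: DIAG033 DIAG028 DIAG030 DIAG029 DIAG089")
print(rule.code, rule.name, rule.sources)
# tof Tetralogy of Fallot ['DIAG033', 'DIAG028', 'DIAG030', 'DIAG029', 'DIAG089']
```

A line with fewer than two colons raises `ValueError` here, which is probably what you want from a hand-edited file: a loud failure instead of a silently wrong mapping.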

[Image: Tetralogy of Fallot, mapping of factors]

Once the rules for factors are defined, I can automagically generate an easily readable graph that shows the mapping. The graph and the database engine use a single mapping definition, so there's no room for a typo or other human error. My consultants will be able to examine the mapping, and I'll be able to prepare the mapping schemes efficiently.
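One way to get such a graph from the rule definitions is to emit Graphviz DOT text and let Graphviz draw it. The sketch below assumes Graphviz as the renderer, which is my choice for the illustration; the function name `rules_to_dot` and the dictionary layout (derived factor name mapped to its original codes) are hypothetical as well.

```python
def rules_to_dot(rules: dict[str, list[str]]) -> str:
    """Build a Graphviz DOT description of the original -> derived mapping."""
    lines = [
        "digraph mapping {",
        "    rankdir=LR;",  # lay the graph out left to right
    ]
    for derived, sources in rules.items():
        for source in sources:
            # One arrow per original code, pointing at the derived factor.
            lines.append(f'    "{source}" -> "{derived}";')
    lines.append("}")
    return "\n".join(lines)


rules = {
    "Tetralogy of Fallot": ["DIAG033", "DIAG028", "DIAG030", "DIAG029", "DIAG089"],
}
print(rules_to_dot(rules))
```

Saved to a file, the output can be rendered with the Graphviz `dot` command, for example `dot -Tpng mapping.dot -o mapping.png`. Graphviz draws nodes as ellipses by default, so no extra styling is needed.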

The image shows an example of a mapping. The ellipses on the left are the original diagnoses and procedures. The ellipses on the right are the derived ones. Please note that the right-side factors are single and atomic. There are no “something + something” entries. The arrows show the direction of mapping. For example, “TOF, absent pulmonary valve” is remapped to two atomic factors: “TOF” and “Absent pulmonary valve”.
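The rules are written from the derived factor's point of view, while remapping a record needs the opposite direction: given an original entry, which atomic factors does it become? A minimal sketch of that inversion follows. The function name `invert_rules` is hypothetical, and the entries are keyed by their names only because the example above doesn't give the codes.

```python
from collections import defaultdict


def invert_rules(rules: dict[str, list[str]]) -> dict[str, list[str]]:
    """Turn 'derived -> originals' rules into an 'original -> derived' lookup."""
    remap = defaultdict(list)
    for derived, originals in rules.items():
        for original in originals:
            remap[original].append(derived)
    return dict(remap)


# Two atomic factors that both draw on the same combined original entry.
rules = {
    "TOF": ["TOF, absent pulmonary valve"],
    "Absent pulmonary valve": ["TOF, absent pulmonary valve"],
}
print(invert_rules(rules))
# {'TOF, absent pulmonary valve': ['TOF', 'Absent pulmonary valve']}
```

A combined entry ends up with one list item per atomic factor, which matches the two arrows leaving its ellipse in the graph.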

Having an easily editable file, I can define all the remapping I need.


Author: automatthias
