The chain of problems in my analysis of the EACTS Congenital Database led me to the major one: the data structure, or in other words the way the information about patient treatment is represented. It doesn't really have anything to do with the computer itself. It's a problem of a more general nature, and it would also arise when analyzing the data on paper or with an abacus.
When the data structures are ready, the analyst can take a data set and analyze the data as it is. Unfortunately, that isn't the case with the EACTS Congenital Database. Before this data can be analyzed, its representation must be deeply redesigned.
One way or another, the data must fit into a table where each row is a single observation and the columns represent the properties of that observation. Since I'm analyzing mortality, my observation is the patient. What should the patient's attributes look like?
Each attribute should represent a single observed property of the patient, a single fact. It's very important that the fact is single, because the data analysis will try to analyze interactions anyway, so there's no need to combine facts in advance. Let's take this example:
In the original, wrong data structure, facts are combined into columns and the data in the columns seems unrelated, while in fact it is related. In the second table, each column represents a single fact, so the combinations of “Yes” entries express similarities between patients.
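The same idea can be sketched in a few lines of Python. This is only an illustration, not the actual EACTS schema: the combined procedure labels, the fact names and the helper `split_facts` are all invented for the example, and the mapping from labels to facts is an assumption that a surgeon would have to confirm.

```python
# Hypothetical combined labels, as they might appear in a "wrong" structure.
# Each label bundles several facts about the operation into a single value.
COMBINED_TO_FACTS = {
    "repair, ventriculotomy, transannular patch": {"ventriculotomy", "transannular_patch"},
    "repair, ventriculotomy, non-transannular patch": {"ventriculotomy", "nontransannular_patch"},
    "repair, RV-PA conduit": {"rv_pa_conduit"},
}

# One column per single fact (names are assumptions for this sketch).
FACT_COLUMNS = [
    "ventriculotomy",
    "transannular_patch",
    "nontransannular_patch",
    "rv_pa_conduit",
]


def split_facts(patients):
    """Turn {patient_id: combined_label} into rows with one Yes/No column per fact."""
    rows = []
    for patient_id, label in patients.items():
        facts = COMBINED_TO_FACTS[label]
        row = {"patient": patient_id}
        for column in FACT_COLUMNS:
            row[column] = "Yes" if column in facts else "No"
        rows.append(row)
    return rows


# Invented sample data: two patients with different labels that share one fact.
patients = {
    1: "repair, ventriculotomy, transannular patch",
    2: "repair, ventriculotomy, non-transannular patch",
}

rows = split_facts(patients)
for row in rows:
    print(row)
```

With the combined labels, patients 1 and 2 simply have two different values. After the split, both rows have "Yes" in the `ventriculotomy` column, so the similarity between them becomes visible to the analysis.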
However, I'm not sure that the “correct data structure” from the picture is in fact correct. I've become suspicious. Maybe when there's no ventriculotomy, one of the patches is always applied? If so, which one? Perhaps an RV-PA conduit implies something in other columns? There can be many things so obvious that there's no need to actually record them in the data. Well, they are obvious only to a cardio-thoracic surgeon, not to a data analyst.
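Suspicions like these can at least be tested against the data before asking a surgeon. Below is a minimal Python sketch under stated assumptions: the sample rows and column names are invented, and `check_rule` is a hypothetical helper, not part of any library. It counts the rows where a condition holds and lists those that break the suspected rule.

```python
def check_rule(rows, condition, consequence):
    """Return (number of rows where the condition holds, rows that break the rule)."""
    matching = [row for row in rows if condition(row)]
    violations = [row for row in matching if not consequence(row)]
    return len(matching), violations


# Invented sample rows, one fact per column.
rows = [
    {"patient": 1, "ventriculotomy": "Yes", "transannular_patch": "Yes", "rv_pa_conduit": "No"},
    {"patient": 2, "ventriculotomy": "No", "transannular_patch": "Yes", "rv_pa_conduit": "No"},
    {"patient": 3, "ventriculotomy": "No", "transannular_patch": "No", "rv_pa_conduit": "Yes"},
]

# Suspected rule: when there is no ventriculotomy, a transannular patch is always applied.
matched, violations = check_rule(
    rows,
    condition=lambda row: row["ventriculotomy"] == "No",
    consequence=lambda row: row["transannular_patch"] == "Yes",
)

print("rows matching the condition:", matched)
print("patients breaking the rule:", [row["patient"] for row in violations])
```

A rule with no violations is only a candidate. It may hold by coincidence, or trivially when no row matches the condition, which is why the number of matching rows is reported as well. Whether it reflects real surgical logic still has to be confirmed by someone who knows the procedures.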
The data is the only thing that was given to me, so I need to acquire all the knowledge needed to understand all the logic that the current data set doesn't express and redesign the data representation.
This kind of research wasn't really planned for my thesis. I wanted to make a multicriteria analysis, but there's much work to do before any analysis can be done. Preparing the data alone would be enough for an MA thesis. I need to balance the areas of interest. Maybe I'll analyze just one kind of disease, and I'll do it well.
A model is a data structure that expresses complete knowledge about a given subject. There are no “obvious” things in the data structure. Something is either present in the model or it isn't. If it's not present, it's not being analyzed.
Since I have to design models for all the diseases, I need to study heart anatomy and the diseases I'm going to analyze. Fortunately, a lot of information about congenital heart diseases is available online.