I’ve finished the normalization mapping of the International Nomenclature for Congenital Heart Diseases. The last problem to solve was the uniqueness of the resulting vector, a.k.a. backward compatibility. Every entry from the normalized factors can now be mapped back to the old nomenclature. However, inconsistent sets of nomenclature entries will be mapped back only to consistent ones, so in that sense the mapping is backward-incompatible.
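To make the round trip concrete, here is a minimal sketch of the idea, not the actual mapping. The `COMPOSITES` table, the factor abbreviations (PS, OA, RVH) and both function names are assumptions made up for illustration. It shows how a redundant entry set such as TOF plus VSD comes back as the consistent set TOF alone.

```python
# Hypothetical table: composite nomenclature entries and the single-property
# factors they stand for. Abbreviations are illustrative only.
COMPOSITES = {
    "TOF": frozenset({"VSD", "PS", "OA", "RVH"}),
    "TGA+VSD": frozenset({"TGA", "VSD"}),
}


def normalize(entries):
    """Expand nomenclature entries into a set of single-property factors."""
    factors = set()
    for entry in entries:
        # Entries that are not composites are treated as factors themselves.
        factors |= COMPOSITES.get(entry, {entry})
    return frozenset(factors)


def denormalize(factors):
    """Map factors back to nomenclature entries, largest composites first."""
    remaining = set(factors)
    entries = set()
    # A fixed order (size, then name) keeps the result unique for a given input.
    for name, parts in sorted(COMPOSITES.items(), key=lambda kv: (-len(kv[1]), kv[0])):
        if parts <= remaining:
            entries.add(name)
            remaining -= parts
    return frozenset(entries | remaining)


print(sorted(denormalize(normalize({"TOF", "VSD"}))))
```

The fixed iteration order in `denormalize` is one possible way to get a unique result; the real mapping may resolve ties differently.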
Archive for the ‘Data Analysis’ Category
Why does the “nomenclature” tangle the diagnoses? Why add a new diagnosis “TGA+VSD” while there are “TGA” and “VSD” already?
The new nomenclature introduces even more tangled entries. Where is it going?
I think it’s because they can’t think of a way to analyze data with overlapping sets. It’s a run-away strategy. Unfortunately, they can’t run away from it. The overlap still occurs, because it’s impossible to create a list with all the possible combinations of diseases.
There are no really precise parameters that could be used to evaluate the quality of care (QoC) in hospitals that submit data to the EACTS Congenital Database or similar databases. The databases don’t contain detailed information about patients’ health. Let’s quickly review the possible QoC parameters.
I met my supervisor today. Finally, a man who took the chick under his wing.
It’s seven weeks to go and I’m just starting. He made a good point: I should include my previous work as part of the thesis and rephrase the title so that it covers building and using the information system. This way all this work will be part of the thesis instead of being just “a preparation”.
Preparing the materials is a major part of my thesis work. The database I’m analyzing uses a list of diagnoses, which in fact should be called diseases. The diagnoses list is tangled: the entries do not describe single properties of the patient. I can’t analyze the data in this form, because the statistical methods assume that the variables describe single properties. Multiple regression, for instance, will find combinations of factors that have an effect on the response variable, but the factors themselves need to describe single properties.
What I need to do is remap the list.
Tetralogy of Fallot (TOF) is a significant and complex congenital heart disease. It consists of four heart malformations. So if a patient is described as having TOF, it means that she/he has all four malformations together.
However, the separate malformations are already present on the diagnoses list as separate entities. From a data-modeling perspective, that is a redundancy in the list of factors (malformations, diseases), and it leads to problems with interpretation. VSD is one of TOF’s components, it is also on the diseases list by itself, and users are allowed to enter both VSD and TOF. As a result, the database contains patients with all four combinations: TOF only, VSD only, both, and neither.
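A quick way to see the four combinations is to count them. This is only a sketch on made-up records; the `patients` list and the helper name are assumptions, not data from the database.

```python
from collections import Counter

# Made-up patient records: each one is the set of diagnoses a user entered.
patients = [
    {"TOF"},
    {"VSD"},
    {"TOF", "VSD"},
    {"ASD"},
    {"TOF", "VSD"},
]


def tof_vsd_combination(diagnoses):
    """Return which of the four (TOF, VSD) combinations a record falls into."""
    return ("TOF" in diagnoses, "VSD" in diagnoses)


counts = Counter(tof_vsd_combination(p) for p in patients)
for (has_tof, has_vsd), n in sorted(counts.items()):
    print(f"TOF={has_tof!s:5} VSD={has_vsd!s:5} patients={n}")
```

Records with TOF but without VSD are the awkward ones: by definition they include a VSD, yet a naive count of VSD patients would miss them.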
This is what the current “related factors” view looks like. I am taking advantage of two presentation techniques: size and color. The size represents the number of patients who share the association. The color represents the mortality, from blue and green (low mortality) to orange and red (high mortality).
This representation lets you quickly scan the table and find the interesting points: big, red rows. Size and color work independently, so the report effectively has two dimensions.
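The two encodings could be computed along these lines. This is a sketch under my own assumptions: the pixel range, the 20% mortality cap and the HSL hue ramp are invented for illustration and are not the values the view actually uses.

```python
def font_size(patient_count, max_count, min_px=10, max_px=32):
    """Scale a row's font size linearly with its number of patients."""
    if max_count <= 0:
        return min_px
    share = min(patient_count / max_count, 1.0)
    return round(min_px + share * (max_px - min_px))


def mortality_color(mortality, max_mortality=0.2):
    """Map mortality to an HSL color.

    Hue runs from 240 (blue) at zero mortality through green and orange
    down to 0 (red) at max_mortality and above.
    """
    share = min(max(mortality / max_mortality, 0.0), 1.0)
    hue = round(240 * (1 - share))
    return f"hsl({hue}, 80%, 45%)"


# One row of the table: 50 patients out of a maximum of 100, 10% mortality.
print(font_size(50, 100), mortality_color(0.1))
```

Because the two functions take different inputs, a row can be big and blue or small and red, which is what keeps the two dimensions independent.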
I’ve spent many days working on the database browser and on normalizing the data structure. The original problem was that the so-called nomenclature, which is a flat list of diseases, couples factors that should be independent for the analysis.
The plan is to decouple the factors and create an alternative nomenclature. The new normalized factors, as I call them, will carry the same information as the original nomenclature.
The data structure refactoring will never end. At least it feels like it never will.
While normalizing the factors, I decided to add a new class, a factor type. I hate the word “type”, but I don’t have a better one right now. The types will be:
- Given: not influenced by the hospital, mostly diseases and their properties.
- Treatment: the elements of the treatment. The hospital’s choice.
- Result: influenced by both given and treatment factors. They’ll be the dependent variables.
With the factor types in place, it’ll be possible to compare hospitals by the way they treat their patients. For example, what percentage of patients have palliative operations?
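Here is a rough sketch of how the three types and the palliative-operations question could look in code. Everything in it is hypothetical: the `FactorType` enum, the tiny factor catalog, the hospital codes and the records are made up for illustration.

```python
from collections import defaultdict
from enum import Enum


class FactorType(Enum):
    GIVEN = "given"          # not influenced by the hospital
    TREATMENT = "treatment"  # the hospital's choice
    RESULT = "result"        # depends on both of the above


# Hypothetical catalog assigning a type to each factor.
FACTOR_TYPES = {
    "TOF": FactorType.GIVEN,
    "palliative operation": FactorType.TREATMENT,
    "corrective operation": FactorType.TREATMENT,
    "death": FactorType.RESULT,
}

# Made-up records: (hospital code, factors attached to one patient).
records = [
    ("A", {"TOF", "palliative operation"}),
    ("A", {"TOF", "corrective operation"}),
    ("B", {"TOF", "corrective operation"}),
    ("B", {"TOF", "corrective operation", "death"}),
]


def percent_with_factor(records, factor):
    """Return, per hospital, the percentage of patients having the factor."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for hospital, factors in records:
        totals[hospital] += 1
        if factor in factors:
            hits[hospital] += 1
    return {h: 100.0 * hits[h] / totals[h] for h in totals}


print(percent_with_factor(records, "palliative operation"))
```

A fair comparison would also have to condition on the given factors, since hospitals do not get the same mix of patients.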
After a few days of building a database browsing tool with Django, I must say: Django is brilliant!
I have set up browsing for all the major database entities: hospitals, surgeons, patients, operations and factors (a factor is a disease, procedure, etc.). The database browsing website is heavily interlinked. Every hospital, surgeon, factor, etc. is a link, so you can click it and follow it to more information.
Almost everything was done using the generic views, which means that I didn’t have to write the views myself. I decided to migrate the data instead of using the legacy tables directly. Migrating required a bit of work, but I wanted to have pure Django-generated tables. Besides reading the Django documentation, here is what I did:
- Write models for my data: Hospital, Surgeon, Patient, Operation, Factor
- Move the data from the old tables to Django tables
- Generate slugs for nice URLs (slug-is-this-kind-of-string)
- Design URLs (/hospitals/, /hospital/CODE/, etc)
- Write templates (hospital_list.html, hospital_detail.html, etc)
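The slug and URL steps can be sketched in plain Python. This is not the code from the project: `make_slug` and `hospital_url` are hypothetical helpers, and in a real Django project `django.utils.text.slugify` would do the slug part.

```python
import re
import unicodedata


def make_slug(text):
    """Turn a display name into a lowercase, hyphen-separated slug."""
    # Drop accents and any other non-ASCII characters.
    ascii_text = (
        unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    )
    # Collapse every run of non-alphanumeric characters into one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", ascii_text.lower())
    return slug.strip("-")


def hospital_url(code):
    """Build a detail URL following the /hospital/CODE/ pattern."""
    return f"/hospital/{code}/"


print(make_slug("Tetralogy of Fallot"))
print(hospital_url("CODE"))
```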
The next major task will be creating the “related factors” view. It seems like I’ll have to write a custom SQL query for that. The question is: given a factor (tag), which other factors are associated with the patients who have that factor?
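One possible shape for that query is a self-join on the patient-to-factor table. The sketch below uses the standard-library `sqlite3` module and an invented `patient_factor` table with made-up rows; the real schema, table names and database engine are likely different.

```python
import sqlite3

# In-memory database with a hypothetical patient-to-factor link table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE patient_factor (patient_id INTEGER, factor TEXT);
    INSERT INTO patient_factor VALUES
        (1, 'TOF'), (1, 'VSD'),
        (2, 'TOF'), (2, 'VSD'), (2, 'ASD'),
        (3, 'TOF'),
        (4, 'VSD');
""")

# Self-join: "base" selects the patients having the chosen factor,
# "other" picks up every different factor those same patients have.
RELATED_SQL = """
    SELECT other.factor, COUNT(DISTINCT other.patient_id) AS patients
    FROM patient_factor AS base
    JOIN patient_factor AS other
      ON other.patient_id = base.patient_id
     AND other.factor <> base.factor
    WHERE base.factor = ?
    GROUP BY other.factor
    ORDER BY patients DESC, other.factor
"""


def related_factors(conn, factor):
    """Return (factor, patient count) pairs co-occurring with the given factor."""
    return conn.execute(RELATED_SQL, (factor,)).fetchall()


print(related_factors(conn, "TOF"))
```

The patient count per related factor is the number that would drive the row size in the view; mortality would need a further join to the outcome data.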