In normalizing the list of diagnoses, I removed the “TGA+VSD” entry and represented it as two separate diagnoses, “TGA” and “VSD”. If you don’t know anything about medicine, this seems perfectly fine. But if you do, you’ll know that “TGA+VSD” is not a mere co-occurrence: it is a very different case from either TGA or VSD alone, even though both defects are present.
Is the change in the representation the right thing to do, then? Let’s consider a usage example. A surgeon wants to see a report of patients with TGA. She looks at the report of related diseases and sees that 39% of the patients also had VSD. “But VSD+TGA is a totally different case!”, thinks the surgeon. So she goes to the VSD row and clicks “exclude”. Voila! The set of patients has decreased by 39%, and the mortality has also changed, by over one percent.
The change in the representation preserves the information: you can still find the same patients, but instead of picking from a plain list of predefined sets (e.g. VSD, TGA, TGA+VSD), you choose them by requiring and excluding certain diseases.
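The requiring-and-excluding idea can be sketched in a few lines of Python. This is only an illustration, not the actual implementation behind the report: the `PATIENTS` records, the patient ids and the `select` helper are all hypothetical names and data made up for the example.

```python
# Hypothetical records: patient id -> set of diagnoses (illustrative data only).
PATIENTS = {
    "p1": {"TGA"},
    "p2": {"TGA", "VSD"},
    "p3": {"VSD"},
    "p4": {"TGA", "VSD", "CoA"},
    "p5": {"CoA"},
}


def select(patients, require=(), exclude=()):
    """Return sorted ids of patients who have every required diagnosis
    and none of the excluded ones."""
    required, excluded = set(require), set(exclude)
    return sorted(
        pid
        for pid, diagnoses in patients.items()
        if required <= diagnoses and not (excluded & diagnoses)
    )


# Every patient with TGA, whatever else they have.
all_tga = select(PATIENTS, require=["TGA"])

# The surgeon's click on "exclude" in the VSD row: TGA without VSD.
tga_without_vsd = select(PATIENTS, require=["TGA"], exclude=["VSD"])

# The old "TGA+VSD" entry is still reachable by requiring both diagnoses.
tga_and_vsd = select(PATIENTS, require=["TGA", "VSD"])
```

With this toy data, `all_tga` contains three patients, `tga_without_vsd` only one, and `tga_and_vsd` the two that the old combined entry would have covered, so nothing is lost by splitting the entry.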
What about Coarctation of the Aorta (CoA)? It can co-occur with either TGA or VSD. It can also occur with both of them together, as a triple disease. Using the plain list, you would need to have seven entries on the list:
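| No. | Diagnosis | TGA | VSD | CoA |
|---|---|---|---|---|
| 1. | CoA | 0 | 0 | 1 |
| 2. | VSD | 0 | 1 | 0 |
| 3. | VSD + CoA | 0 | 1 | 1 |
| 4. | TGA | 1 | 0 | 0 |
| 5. | TGA + CoA | 1 | 0 | 1 |
| 6. | TGA + VSD | 1 | 1 | 0 |
| 7. | TGA + VSD + CoA | 1 | 1 | 1 |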
Seven might look like a strange number here. In fact, if you include patients who have none of those three diseases, you close the list with an eighth entry, which gives the list exactly 2³ entries. If you include a fourth disease, the list will have 16 (2⁴) entries. Adding a fifth disease will yield a list with 32 entries. Every additional disease doubles the length of the list. Given that the database contains a limited number of patients, the longer the list, the smaller the patient sample in each entry. By gaining selectivity, you quickly lose sample size. If you were interested in VSD regardless of other diseases, you would have to select all the entries containing VSD, which would be like picking grains of sand to build a pyramid. In conclusion, you don’t want to work with a plain list.
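The doubling is easy to check by enumerating the entries. The sketch below is illustrative only: `plain_list_entries` is a hypothetical helper, and “D4” and “D5” are placeholder names standing in for a fourth and fifth disease.

```python
from itertools import combinations


def plain_list_entries(diseases):
    """Enumerate every non-empty combination of diseases,
    i.e. the entries a plain list would need."""
    return [
        combo
        for size in range(1, len(diseases) + 1)
        for combo in combinations(diseases, size)
    ]


# "D4" and "D5" are placeholders, not real diagnoses.
diseases = ["TGA", "VSD", "CoA", "D4", "D5"]

for n in range(3, len(diseases) + 1):
    entries = plain_list_entries(diseases[:n])
    with_vsd = [entry for entry in entries if "VSD" in entry]
    # Adding the "none of these" entry gives 2**n in total.
    print(n, len(entries) + 1, len(with_vsd))

three = plain_list_entries(diseases[:3])
five = plain_list_entries(diseases)
```

For three diseases the enumeration produces the seven entries from the table. It also shows the other half of the problem: to look at VSD regardless of other diseases, you would have to gather half of all the entries by hand, and that half doubles with every added disease too.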
The technique I recommend, including and excluding certain factors, allows you to define the particular set of patients that you’re interested in. It requires a minimal number of steps, yet it remains flexible and general.
Using the inclusion-exclusion technique, you can pick patients whichever way you like (limited to the information contained within the database). If you are aware of a new quality that patients with both TGA and VSD have, you can use that knowledge when browsing the database and reshape your set of patients according to your needs, by including and excluding patients with certain factors.