Abuse of scoring systems

Apgar is a scoring system,

…a simple and repeatable method to quickly and summarily assess the health of newborn children immediately after childbirth. (…) The Apgar score is determined by evaluating the newborn baby on five simple criteria on a scale from zero to two and summing up the five values thus obtained. The resulting Apgar score ranges from zero to 10.

One of the criteria is skin color, which can be blue all over, blue at extremities, or normal. This is an ordinal variable, which means that the variable does not have numeric values, but named and ordered levels. Blue at extremities is worse than normal, and blue all over is worse than blue at extremities. By transitivity, blue all over is worse than normal.

The Apgar score is meant to provide a single number as an outcome. To achieve that, five ordinal criteria need to be aggregated. Unfortunately, there is no way to directly aggregate skin color with pulse, for example. Numbers, however, are easy to aggregate by means of addition. Hence the idea of transforming levels into numbers and adding them up.
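The transformation of levels into numbers can be sketched in a few lines of Python. This is only an illustration: the skin color levels come from the text above, while the names SKIN_COLOR_LEVELS, level_to_points and apgar_total are hypothetical and the points for the other four criteria are made up.

```python
# Ordered levels of one criterion, worst first.
# The position in the tuple is the number of points awarded (0, 1 or 2).
SKIN_COLOR_LEVELS = ("blue all over", "blue at extremities", "normal")


def level_to_points(level, ordered_levels):
    """Replace a named level by its rank among the ordered levels."""
    return ordered_levels.index(level)


def apgar_total(points):
    """Sum the points of the five criteria into one score between 0 and 10."""
    if len(points) != 5 or any(p not in (0, 1, 2) for p in points):
        raise ValueError("expected five values, each 0, 1 or 2")
    return sum(points)


skin = level_to_points("blue at extremities", SKIN_COLOR_LEVELS)
# The remaining four values are made up for the sake of the example.
total = apgar_total([skin, 2, 2, 1, 2])
print(skin, total)
```

Note that the step from level to number is an arbitrary labeling: nothing in the data says that the distance between "blue all over" and "blue at extremities" equals the distance between "blue at extremities" and "normal".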

This is a somewhat dangerous approach. The main purpose of Apgar is:

…to determine quickly whether a newborn needs immediate medical care.

However, once the Apgar outcome takes the form of a number, people may be quick to calculate its mean and standard deviation. Searching for “mean apgar” on scholar.google.com returns some 400 documents. That is not a majority: there are 71 thousand documents containing the word “apgar”, so those 400 are only about 0.6%.

Calculating the mean and standard deviation of Apgar values wasn’t something that the creator of Apgar had in mind. Its purpose was to quickly assess whether a newborn needs medical care.


Apgar score values are not numbers. They are summed identifiers of five ordinal variables. In order to calculate statistics, the original data (criteria values) should be used, as there are dedicated statistical methods for analyzing ordinal variables. These methods, as the reader may already have guessed, do not transform ordinal variable values into numbers in order to perform calculations on them.
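A small Python sketch may help here. It is not one of the dedicated methods mentioned above, only the simplest summaries that respect the ordering. All the data are made up and the variable names are hypothetical. The first part shows two different sets of criteria values collapsing into the same total; the second part summarizes a single criterion using only level names and their order.

```python
from collections import Counter

# Two made-up newborns: different criteria values, identical totals.
newborn_a = (0, 2, 2, 2, 2)
newborn_b = (2, 2, 1, 1, 2)
same_total = sum(newborn_a) == sum(newborn_b)

LEVELS = ("blue all over", "blue at extremities", "normal")  # worst to best

# Made-up observations of a single criterion for seven newborns.
observed = [
    "normal", "blue at extremities", "normal", "blue all over",
    "blue at extremities", "normal", "normal",
]

# Frequency table: uses only the level names.
counts = Counter(observed)

# Median level: uses only the ordering, never a distance between levels.
ranked = sorted(observed, key=LEVELS.index)
median_level = ranked[len(ranked) // 2]

print("same total:", same_total)
for level in LEVELS:
    print(level, counts[level])
print("median level:", median_level)
```

The total alone cannot tell the two newborns apart, which is one reason to keep the five criteria in the data set.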

When designing a survey for statistical analysis, the Apgar score must not be used. The five original criteria must be included in the survey instead.

Everything that applies to Apgar also applies to the Aristotle Score, which I have already criticized. Height and weight are numbers. Generally, things that are measured are numbers. Things that are assessed subjectively, like newborn skin color, are ordinal and do not have numeric values. Aristotle Score values look like numbers. However, it’s important to bear in mind that they are not! Therefore, one must not calculate the mean or standard deviation of the Aristotle Score.

Current Basic Score reports are based on mean Basic Score values, which is an abuse of a scoring system. I suggest finding another method of quality of care evaluation.

Last but one step of the analysis

So far, everything I had been doing was preparation. Now, there are 10102 days to go and I’m starting the actual data analysis, i.e. the final multiple logistic regression. Two mighty servers are currently processing my data. They have already calculated the simple additive models without interactions. I hope they will finish the models with interactions by tomorrow.

After many conversations with my expert consultant, I have achieved results that make sense to him. No revelations, but I don’t expect them anymore. It’s good enough when the regression results match his expectations. The calculated coefficients are informative, as they represent the size of the effect.

Once I have the regressions ready, I’ll be ready to perform the hospital comparisons, the very final phase of the analysis.

Doctors and computer scientists

I do admire doctors. Cardio-thoracic surgeons are able to take a scalpel and actually fix a human heart. This is an unimaginable thing for me.

Somehow, data analysis doesn't seem unimaginable to doctors. They don't hesitate to take a table of values, add, divide, calculate a mean value and discuss a p-value. A doctor may think that he knows what an analysis is about. When I'm doing an analysis, a doctor can look over my shoulder and say: “Well, this makes sense.”

I guess that, looking over a surgeon's shoulder, I would say the same. But that does not mean that I'd be able to perform surgery. If anybody saw me taking a surgeon's knife and approaching a patient, I would be stopped immediately.

You might say that it's a matter of the cost of a mistake. A mistake during analysis doesn't kill. A mistake during surgery does. But is it really like this with analysis? A wrong analysis can miss an important effect, which can result in mistreating a large number of patients, so this (possibly) little error would be repeated over and over… So what's worse, one big mistake or thousands of small ones?

Doctors, before creating scores and calculating mean values, please go to an analyst and have a little talk. It will help everybody.

Modern Applied Statistics with S

I got a package from Amazon today, containing the fourth edition of the famous Modern Applied Statistics with S. The book doesn't explain the theory of statistics, but focuses on the actual use of the S programming language and its free alternative, R. One of the authors, Prof. Brian D. Ripley, is a member of the R core development team, so I can expect this book to be R-user friendly.

I bought the book mostly because of my thesis, but I expect it to be useful in the analytical work at large.

Simple interaction example

How do you explain an interaction in statistics? During last spring semester, I came up with an example: tea sweetening.

Let’s consider two factors: adding sugar and stirring. The dependent variable is the taste of tea (bitter or sweet). First, let’s consider stirring alone, without adding sugar. Well, it doesn’t matter how much stirring you do, the tea will remain as bitter as it was at the beginning. Conclusion: stirring has no effect on the taste of tea.

Second factor, adding sugar alone, without stirring. You can add two or three teaspoons of sugar, without being able to sense any difference. After adding a lot of teaspoons, you start feeling that it’s becoming sweet. Conclusion: adding sugar has a very weak influence on the taste of tea.

Now, the third experiment, adding sugar and stirring, together. Wow, it looks like it’s enough to add just one teaspoon of sugar and stir a little and you can already feel the difference.

Conclusion: There is an interaction between adding sugar and stirring.

It’s not the sugar or the stirring alone that makes the tea sweet. It’s the fact that those two things happen together.
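The tea example can be put into numbers with a short Python sketch. The sweetness ratings below are made up (0 means bitter, 10 means very sweet) and stand for one teaspoon of sugar; the variable names are hypothetical. The interaction is what remains of the combined result after the two separate effects have been added to the baseline.

```python
# Made-up sweetness ratings for the four combinations of the two factors.
sweetness = {
    ("no sugar", "no stirring"): 0,
    ("no sugar", "stirring"): 0,
    ("sugar", "no stirring"): 1,
    ("sugar", "stirring"): 7,
}

baseline = sweetness[("no sugar", "no stirring")]

# Effect of each factor on its own, relative to plain unstirred tea.
stirring_effect = sweetness[("no sugar", "stirring")] - baseline
sugar_effect = sweetness[("sugar", "no stirring")] - baseline

# What a purely additive model (no interaction) would predict for both factors.
additive_prediction = baseline + sugar_effect + stirring_effect

# The part of the combined result that the additive model cannot explain.
interaction = sweetness[("sugar", "stirring")] - additive_prediction

print("stirring alone:", stirring_effect)
print("sugar alone:", sugar_effect)
print("additive prediction:", additive_prediction)
print("interaction:", interaction)
```

With these ratings the additive model predicts barely sweet tea, and almost all of the observed sweetness is due to the interaction term.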

I want a desktop search engine

I'll try to install Beagle. It requires Mono, which itself seems to be quite a big thing to install. First, I tried a Subversion checkout of Mono, but the autogen.sh script failed, so I couldn't build it. I wonder how difficult it is to create a Slackware package with a complete Beagle install…?