Scientists, share your source code

Here is a typical example: a paper is published, describing a new algorithm for data analysis. The mathematical background is described in the paper, roughly. A piece of software that implements it is written and made available for download from a web site. You visit the web site, download the program and run it. You get unexpected results. You wonder what’s happening. You go back to the site and look for the source code, and it’s not there.

I’ve recently tried two pieces of software that do basically the same thing: predicting missing genotypes. There is no source code for either of them, and one of them, fastPHASE, additionally requires you to register and accept an academic license before you can use it, which introduces an annoying delay in obtaining the program.

By the way, why are all those scientific program names written in UPPERCASE? Because it creates an impression of IMPORTANCE? Just a side note.

Scientists work for the sake of humanity (I hope), striving to make our world a better place. Right? So why don’t they make the source code available?

Not releasing the source code of scientific software is a Bad Thing, because it harms research in the field and is antisocial. The losers are the closed-source project itself, other projects in the field, and subsequently everyone who could have benefited from the research. The only one who can possibly benefit is the author, but I highly doubt that they ever do.

Keeping the source code secret is a typical practice for corporations, which seek to profit from selling the binaries. I don’t know what business model can be built on restricted source code access in science, but I don’t think they’re ever going to make any money on that.

What other reasons could there be not to release the source code? Remaining the sole author and keeping all the credit? Keeping complete control? Hoping to sell licenses to business clients?

The main effect of making the source code unavailable is that the program internals cannot be inspected and analyzed. It’s only a binary that is available; people can obtain it and run it, without being able to modify it.

All the general arguments for open-source software apply to scientific software. Withholding the source code has several negative effects.

  • Fewer people use the program.
  • None of the users can adapt or fix the program.
  • Other developers cannot learn from the program, or base new work on it.

I think that should be enough, but I would like to add two points that apply specifically to scientific software.

Loss of credibility

In scientific research, the key point is to prove and verify the results. With closed source, other scientists can only run the software and examine the output, without being able to check whether the program really does what the paper describes. Unable to do that, the rest of the world has to take the authors’ word for it. Do they have something to hide?

I don’t think scientists would actually question a paper as a whole because the source code is unavailable, but it certainly raises some concerns about its quality.

It’s antisocial

Scientific research is usually funded by government grants, which in turn come from taxpayers. Scientists are not corporations that fund themselves. It is society, other people, who effectively pay for the research (through various funding organizations), and I believe that scientists who share their research results have a moral obligation to share them fully.

By not releasing the source code, they only create an impression of publishing their work. They can get away with that, because many people will think that if they can download the program and run it, it’s “available”. But it’s not!

Please, dear scientists, do what the people behind projects such as GNU Octave or the R project do: share your source code. Everybody will benefit from it, including your projects and yourselves.

Convert XLS to CSV: Catdoc

Given a task that takes two hours to accomplish, I would usually spend an hour and a half figuring out how to get it done in half an hour.

I solve most problems by writing scripts, small programs that accomplish simple tasks. The key thing is that a script, once written, can be run again whenever new data arrive.

I’ve always had problems with the XLS format. When writing scripts, I need data in a simple text format. Someone sends me an XLS file, I open it and see something like this:

@O^@G^@Ó^@L^@N^@A^@ ^@:^@^D^@^AI^@M^@I^@^X^
A^H^@^@NAZWISKO^C^@^@OKO^E^ @^@prawe^D^@^@lewe^R^@^

How can I process such data? I can’t.

Luckily, I’ve found Catdoc, a set of tools that convert the binary DOC and XLS formats into plain text, which can then be processed with standard scripting languages. One of the tools is xls2csv, which reads a binary XLS file and writes comma-separated values (CSV).
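
As an illustration of how xls2csv might fit into a script, here is a minimal Python sketch. It assumes that xls2csv is installed and on the PATH, and that it writes CSV to standard output when called with just a file name. The function names `parse_csv_text` and `xls_to_rows` are made up for this example.

```python
import csv
import io
import subprocess


def parse_csv_text(text):
    """Parse CSV text into a list of rows (each row is a list of strings)."""
    return [row for row in csv.reader(io.StringIO(text)) if row]


def xls_to_rows(xls_path):
    """Run Catdoc's xls2csv on an XLS file and return the parsed rows.

    Assumption: xls2csv is on PATH and prints CSV to standard output
    when given only the file name.
    """
    result = subprocess.run(
        ["xls2csv", xls_path],
        capture_output=True,  # collect stdout instead of printing it
        text=True,            # decode the output to str
        check=True,           # raise CalledProcessError if xls2csv fails
    )
    return parse_csv_text(result.stdout)
```

With this in place, `xls_to_rows("data.xls")` (a hypothetical file name) would return the spreadsheet as a list of lists that the rest of a script can process. Workbooks with several sheets, or with unusual character encodings, may need extra handling.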

Abuse of scoring systems

Apgar is a scoring system,

…a simple and repeatable method to quickly and summarily assess the health of newborn children immediately after childbirth. (…) The Apgar score is determined by evaluating the newborn baby on five simple criteria on a scale from zero to two and summing up the five values thus obtained. The resulting Apgar score ranges from zero to 10.

One of the criteria is skin color, which can be blue all over, blue at extremities or normal. This is an ordinal variable, which means that it does not have numeric values but named, ordered levels. Blue at extremities is worse than normal, and blue all over is worse than blue at extremities. By transitivity, blue all over is worse than normal.

The Apgar score is meant to provide a single number as an outcome. To achieve that, five ordinal criteria need to be aggregated. Unfortunately, there is no way to directly aggregate skin color with pulse, for example. Numbers, however, are easy to aggregate by addition. Hence the idea of transforming levels into numbers and adding them up.
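
The aggregation can be sketched in a few lines of Python. The skin color levels come from the text above. The short criterion labels and the example newborn are assumptions made for illustration only.

```python
# Ordered levels for skin color, worst first (taken from the text above).
SKIN_COLOR_LEVELS = ["blue all over", "blue at extremities", "normal"]

# Illustrative short labels for the five criteria; each is scored 0, 1 or 2.
CRITERIA = ["skin color", "pulse", "reflex", "muscle tone", "breathing"]


def level_to_points(level, ordered_levels):
    """Map an ordinal level to its position in the ordered list (0, 1 or 2)."""
    return ordered_levels.index(level)


def apgar_score(points):
    """Sum the five per-criterion points into a single score from 0 to 10."""
    missing = [name for name in CRITERIA if name not in points]
    if missing:
        raise ValueError(f"missing criteria: {missing}")
    for name in CRITERIA:
        if points[name] not in (0, 1, 2):
            raise ValueError(f"{name} must be 0, 1 or 2")
    return sum(points[name] for name in CRITERIA)


# Hypothetical newborn: blue at extremities, top marks on everything else.
example = {
    "skin color": level_to_points("blue at extremities", SKIN_COLOR_LEVELS),
    "pulse": 2,
    "reflex": 2,
    "muscle tone": 2,
    "breathing": 2,
}
print(apgar_score(example))  # 9
```

Note that `level_to_points` is exactly the step that turns a named level into a number, and the sum no longer says which criterion lost the point.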

This is a somewhat dangerous approach. The main purpose of Apgar is:

…to determine quickly whether a newborn needs immediate medical care.

However, with the Apgar outcome in the form of a number, people might be quick to calculate the mean and standard deviation. Searching for “mean apgar” reveals some 400 documents. That is not a majority: there are 71 thousand documents containing the word “apgar”, so those 400 are only about 0.6%.

Calculating the mean and standard deviation of Apgar values wasn’t something that the score’s creator had in mind. Its purpose was to quickly assess whether a newborn needs medical care.


Apgar score values are not numbers. They are summed identifiers of five ordinal variables. In order to calculate statistics, the original data (criteria values) should be used, as there are dedicated statistical methods for analyzing ordinal variables. These methods, as the reader may already have guessed, do not transform ordinal variable values into numbers in order to perform calculations on them.
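
To give a flavor of what working with the original criteria could look like, here is a small Python sketch that summarizes one ordinal criterion with a frequency table and a median level. Both use only the order of the levels and never add or average them. The observations are invented for illustration, and this is just one simple option among the methods available for ordinal data.

```python
from collections import Counter

# Ordered levels for skin color, worst first (taken from the text above).
SKIN_COLOR_LEVELS = ["blue all over", "blue at extremities", "normal"]


def frequency_table(observations, ordered_levels):
    """Count how many observations fall into each level, in level order."""
    counts = Counter(observations)
    return [(level, counts[level]) for level in ordered_levels]


def median_level(observations, ordered_levels):
    """Return the (lower) median level, using positions only for sorting."""
    positions = sorted(ordered_levels.index(obs) for obs in observations)
    return ordered_levels[positions[(len(positions) - 1) // 2]]


# Invented observations for five newborns.
observed = [
    "normal",
    "blue at extremities",
    "normal",
    "blue all over",
    "normal",
]

for level, count in frequency_table(observed, SKIN_COLOR_LEVELS):
    print(f"{level}: {count}")
print("median level:", median_level(observed, SKIN_COLOR_LEVELS))
```

The result of `median_level` is itself a level name such as “normal”, not a number like 1.4, which is the point: the summary stays within the scale the data were recorded on.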

When designing a survey for statistical analysis, the Apgar score must not be used. The five original criteria must be included in the survey instead.

Everything that applies to Apgar also applies to the Aristotle Score, which I have already criticized. Height and weight are numbers. Generally, things that are measured are numbers. Things that are assessed subjectively, like newborn skin color, are ordinal and do not have numeric values. Aristotle Score values look like numbers. However, it’s important to bear in mind that they are not! Therefore, one must not calculate the mean or standard deviation of the Aristotle Score.

Current Basic Score reports are based on mean Basic Score values, which is an abuse of a scoring system. I suggest finding another method of quality of care evaluation.

minus 4 days to go

I have proofread a hard copy of my thesis, made the final corrections and committed them to the repository.

Transmitting file data ……..
Committed revision 384.

I noticed that the error rate varied across chapters. I think the earliest parts were the worst: there was no page left without a change. The newest parts, however, were mostly OK, with just a few slight modifications.

Some of the corrections were about integrity and continuity. I had expected to write or do some things that, in the end, I didn’t write or do. For example, I planned to include an appendix, which turned out to be too big and was finally removed. I spotted and removed two references to the non-existent appendix today. At first, I considered this appendix an integral part of the thesis. However, it could distract readers from the main concept, i.e. the normalization. I wouldn’t like to discuss the details of the way I have normalized the International Nomenclature for CHD. It is a task for a medical expert; I just had to do it in order to be able to move forward with my analysis.

2×2×2 days to go

I removed a huge appendix from my thesis to make it thinner, but the thesis is still growing. The current chapter, the analysis, contains dumps of the models, where one model can take up a whole page.

There are 2³ days left until the submission. I’ve promised my supervisor a final candidate on Sunday, so I’ve got today and tomorrow to do it. It’s going to be a busy weekend.

I am somewhat disappointed with the predictive weakness of the models. There are lots of false negatives, even though the classification threshold is low (5%). Fortunately, the classification is not a key point in my thesis. The models can still be used for a fair comparison of the hospitals.
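
For readers unfamiliar with the terms, the following Python sketch shows what classifying with a 5% threshold and counting false negatives means. The predicted probabilities and outcomes are invented for illustration and have nothing to do with the thesis data.

```python
def confusion_counts(probabilities, outcomes, threshold=0.05):
    """Classify each case as positive if its predicted probability reaches
    the threshold, then count the four possible prediction/outcome pairs."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for probability, actual in zip(probabilities, outcomes):
        predicted = probability >= threshold
        if predicted and actual:
            counts["tp"] += 1  # event predicted and it happened
        elif predicted and not actual:
            counts["fp"] += 1  # event predicted but it did not happen
        elif not predicted and actual:
            counts["fn"] += 1  # event missed by the model (false negative)
        else:
            counts["tn"] += 1  # no event predicted, none happened
    return counts


# Invented predicted probabilities and actual outcomes (True = event occurred).
predicted_probabilities = [0.02, 0.10, 0.04, 0.30, 0.01]
actual_outcomes = [True, True, False, False, False]

print(confusion_counts(predicted_probabilities, actual_outcomes))
# {'tp': 1, 'fp': 1, 'tn': 2, 'fn': 1}
```

A false negative here is a case where the model assigned a probability below 5% and the event happened anyway. Lowering the threshold further reduces false negatives only at the cost of more false positives.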

Last but one step of the analysis

So far, everything I was doing was preparation. Now, there are 1010₂ days to go and I’m starting the actual data analysis, i.e. the final multiple logistic regression. Two mighty servers are currently processing my data. They have already calculated the simple additive models without interactions. I hope they will finish the models with interactions by tomorrow.

After many conversations with my expert consultant, I have achieved results that do make sense to him. No revelations, but I don’t expect them anymore. It’s good enough when the regression results match his expectations. The calculated coefficients are informative, as they represent the size of the effect.
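
One common way to read a logistic regression coefficient as an effect size is to exponentiate it, which gives an odds ratio. The Python sketch below illustrates this. The intercept, the coefficient values and the factor name `factor_a` are hypothetical and are not taken from the thesis.

```python
import math


def odds_ratio(coefficient):
    """Convert a logistic regression coefficient into an odds ratio."""
    return math.exp(coefficient)


def predicted_probability(intercept, coefficients, features):
    """Probability predicted by an additive logistic model (no interactions)."""
    linear = intercept + sum(
        coefficients[name] * features[name] for name in coefficients
    )
    return 1 / (1 + math.exp(-linear))


# Hypothetical fitted model with a single binary risk factor.
intercept = -3.0
coefficients = {"factor_a": 0.7}

print(round(odds_ratio(coefficients["factor_a"]), 2))
print(round(predicted_probability(intercept, coefficients, {"factor_a": 0}), 3))
print(round(predicted_probability(intercept, coefficients, {"factor_a": 1}), 3))
```

A coefficient of zero corresponds to an odds ratio of 1 (no effect). Positive coefficients raise the odds of the outcome and negative ones lower them.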

Once I have the regressions ready, I’ll be ready to perform the hospital comparisons, the very final phase of the analysis.

2×2×2×2 days to go

There are 2⁴ days to go. I’ve still got plenty of things to do. Luckily, I’ve also got a schedule agreed with my supervisor: tight, but feasible.

I wanted my thesis to contain a nice example analysis. I’ve tried to extract a question from the surgeons I know, but I haven’t succeeded, so my analysis will be a “fishing expedition”. After all my rants about the data structure and the proposed solution, it would be nice if my thesis actually used the normalized data to answer some question. It’s like in La Fontaine’s “The Mountain’s Delivery”:

The mountain and the mouse

A mountain having labor
With clamor rent the air
The neighbors who came running
Predicted she would bear
A city broad as Paris
Or at least a manor house,
But at the crucial moment
The mountain dropped a mouse.
How like so many authors
Who say they’ll set to paper
A vast Promethean epic
But all that comes is vapor.

I don’t want that. I need to think hard about a good example. Or concentrate on comparing the hospitals.

A new quality or mere co-occurrence?

While normalizing the list of diagnoses, I removed the “TGA+VSD” entry, changing the representation into two separate diagnoses, “TGA” and “VSD”. If you don’t know anything about medicine, this seems perfectly fine. But if you do, you’ll know that “TGA+VSD” is not a mere co-occurrence: it’s a very different case from both TGA and VSD, although both the TGA and VSD defects are present.
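
The normalization step, and one possible way of keeping track of the combination afterwards, can be sketched in Python as follows. This is only an illustration under the assumption that combined entries are written with a “+” sign. The function names are made up, and this is not necessarily how the thesis handles the problem.

```python
def normalize_diagnoses(entries):
    """Split combined entries such as "TGA+VSD" into separate diagnoses."""
    normalized = set()
    for entry in entries:
        normalized.update(part.strip() for part in entry.split("+"))
    return normalized


def has_combination(diagnoses, combination):
    """Check whether all diagnoses of a given combination are present."""
    return set(combination) <= set(diagnoses)


# A patient recorded with the combined entry.
patient = normalize_diagnoses(["TGA+VSD"])
print(sorted(patient))                          # ['TGA', 'VSD']
print(has_combination(patient, ("TGA", "VSD")))  # True

# A patient with only one of the two defects.
other = normalize_diagnoses(["VSD"])
print(has_combination(other, ("TGA", "VSD")))    # False
```

After the split, the data say only that both defects are present. Whether their combination is a new quality has to be modeled explicitly, for example with a flag like the one computed by `has_combination`.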
