Archive for the ‘Computers’ Category

Converting DVD lectures to mp3

March 9, 2008

I haven’t been studying since December 2006. I completed my Masters, and… the game seemed to be over. But when a friend recommended TTC courses to me, I started listening to them… and became addicted. I’ve completed Philosophy of Religion by James Hall, Great Ideas of Psychology by Daniel N. Robinson and Explaining Social Deviance by Paul Root Wolpe. I’m currently listening to Argumentation: The Study of Effective Reasoning by David Zarefsky. I have a few more courses to go through, so I won’t run out of lectures anytime soon.

However, it turned out that watching them as videos doesn’t work well for me. When I’m at home, I have plenty of other things I’d like to do. But they work well as audio, because I can listen to them while commuting. I’m going to buy new courses in audio, but what about the ones I already have on DVD? Can I convert them to audio?

I’ve put together a shell script which dumps all tracks from a DVD and encodes them into mp3 format. It also normalizes the volume, so the speaker’s voice stays at a more or less constant level, which helps when listening in a noisy environment.
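A rough sketch of those three steps (dump, normalize, encode), written in Python and only building the command lines without running them. It assumes mplayer, normalize-audio and lame are installed; the tool choice, the file names and the 64 kbit/s mono setting are illustrative assumptions, not necessarily what the script uses.

```python
def dvd_track_commands(track, bitrate_kbps=64):
    """Build the commands for one DVD title: dump, normalize, encode.

    Assumption: mplayer, normalize-audio and lame do the work;
    the track%02d file names are hypothetical.
    """
    wav = "track%02d.wav" % track
    mp3 = "track%02d.mp3" % track
    return [
        # Dump the audio of DVD title `track` to a WAV file, with no video output.
        ["mplayer", "dvd://%d" % track, "-vo", "null", "-ao", "pcm:file=%s" % wav],
        # Normalize the volume of the WAV file in place.
        ["normalize-audio", wav],
        # Encode to mono MP3 at a constant bitrate.
        ["lame", "-m", "m", "-b", str(bitrate_kbps), wav, mp3],
    ]


if __name__ == "__main__":
    for command in dvd_track_commands(1):
        print(" ".join(command))
```

To actually run the steps, each list could be passed to subprocess.run(command, check=True).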

It was written on Linux; if you’re a Windows user, you can try running it under Cygwin.

Loud minority of color haters

February 10, 2008

I was struck by one sentence in the git-svn course.  It’s about the way a computer program displays its output. It’s about colors. Some people like to see code annotated with color, like this:

Colorful code

You can see that keywords of the programming language are highlighted. Colors are an additional channel of information: syntax highlighting lets us use our natural ability to recognize colors to work with the code more efficiently.

Without colors, this code looks like this:

Code with no colors

Sure, you can still read the code. But… is the word “class” spelled correctly? Are all the quotes balanced? You don’t know unless you examine it quite carefully. With syntax highlighting, you can tell at a glance: is it yellow or not?
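To make the “yellow or not” test concrete, here is a minimal sketch of keyword highlighting in Python. It is not how Vim or any real editor does it: it simply paints every word that Python’s keyword module recognizes, using ANSI terminal escapes. The function name highlight_keywords is made up for this example.

```python
import keyword
import re

YELLOW = "\033[33m"  # ANSI escape: yellow foreground
RESET = "\033[0m"    # ANSI escape: back to the default colors


def highlight_keywords(source):
    """Wrap every Python keyword found in `source` in ANSI yellow."""
    def paint(match):
        word = match.group(0)
        if keyword.iskeyword(word):
            return YELLOW + word + RESET
        return word
    # Naive: looks at bare words only, so keywords inside strings
    # and comments get painted too.
    return re.sub(r"\b[A-Za-z_]\w*", paint, source)


if __name__ == "__main__":
    print(highlight_keywords("class Spam: pass"))  # "class" is painted
    print(highlight_keywords("clas Spam: pass"))   # the typo "clas" is not
```

A misspelled keyword stays in the default color, which is exactly the cue you get from an editor.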

However…

some people hate colors way more than the rest like them

…and that’s the reason why color highlighting is turned off in git by default. I once heard a convincing argument against colors: that the “ls” command with the “--color” option calls stat() on every file it lists, which under certain circumstances is highly undesirable. But this argument was then followed by “real UNIX doesn’t have colors” (is it a form of the “real programmer” thing?), which ruined the whole impression. Yes, sometimes you might want to turn colors off…
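The stat() argument is easy to see in a toy version. The sketch below is not the code of ls; it only shows that a plain listing needs the names alone, while choosing a color needs the mode of every entry, which costs one extra system call per file (presumably painful on something like a slow network mount). The function names and the color choices are made up for the illustration.

```python
import os
import stat

BLUE = "\033[34m"   # used here for directories
GREEN = "\033[32m"  # used here for executables
RESET = "\033[0m"


def plain_listing(path):
    """Names only: one directory read, no per-file system call."""
    return sorted(os.listdir(path))


def colored_listing(path):
    """Names with colors: needs the mode of every entry, so one lstat() each."""
    lines = []
    for name in sorted(os.listdir(path)):
        mode = os.lstat(os.path.join(path, name)).st_mode
        if stat.S_ISDIR(mode):
            lines.append(BLUE + name + RESET)
        elif mode & stat.S_IXUSR:
            lines.append(GREEN + name + RESET)
        else:
            lines.append(name)
    return lines


if __name__ == "__main__":
    print("\n".join(colored_listing(os.curdir)))
```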

But why turn the option off by default for the whole project? Does the majority have to give in because of a loud minority?

In a democratic system, and I believe the Free Software movement is about democracy, software should be optimized for the greatest benefit of the whole, or at least of the majority.

If I find myself in the minority and I can’t win the majority over to my opinion (without hate speech), I’m going to give in. And I expect the same from others.

Scientists, share your source code

July 22, 2007

It’s a typical example: a paper is published, describing a new algorithm for data analysis. The mathematical background is described in the paper, roughly. A piece of software that implements it is written and made available for download from a web site. You visit the web site, download the program and run it. You get unexpected results. You wonder what’s happening. You go back to the site and look for the source code, and it’s not there.

I’ve recently visited and tested two pieces of software doing basically the same thing: predicting missing genotypes. There is no source code for either of them, and fastPHASE additionally requires you to register and accept an academic license before you can use it, introducing an annoying delay in obtaining the program.

By the way, why are all those scientific program names written in UPPERCASE? Because it creates an impression of IMPORTANCE? Just a side note.

Scientists work for the sake of humanity (I hope), striving to make our world a better place. Right? So why don’t they make the source code available?

Not releasing the source code of scientific software is a Bad Thing, because it harms research in the field and is antisocial. The losers are the closed-source project itself, other projects in the field and, subsequently, everyone who could have benefited from the research. The only one who could possibly benefit is the author, but I highly doubt that they ever do.

Keeping the source code secret is a typical practice of corporations, which seek to profit from selling the binaries. I don’t know what business model can be built on restricted source code access in science, but I don’t think they’re ever going to make any money on that.

What could be other reasons not to release the source code? Remaining the sole author, keeping all the credit? Keeping complete control? Hoping to sell licenses to business clients?

The main effect of making the source code unavailable is that the program internals cannot be inspected and analyzed. It’s only a binary that is available; people can obtain it and run it, without being able to modify it.

All the general arguments for open-source software apply to scientific software. Withholding the source has several negative results.

  • Fewer people use the program.
  • None of the users can adapt or fix the program.
  • Other developers cannot learn from the program, or base new work on it.

I think that should be enough, but I would like to add two points that apply specifically to scientific software.

Loss of credibility

In scientific research, the key point is to prove and verify the results. With closed source, other scientists can only run the software and examine the output, without being able to check whether the program really does what the paper describes. Unable to do that, the rest of the world has to take the authors’ word for it. Do they have something to hide?

I don’t think scientists would actually question a paper as a whole because the source code is unavailable, but it certainly raises some concerns about its quality.

It’s antisocial

Scientific research is usually funded from government grants, which in turn come from taxpayers. Scientists are not corporations that fund themselves. It’s society, it’s other people who effectively pay for the research (through various funding organizations), and I believe that scientists who share their research results have a moral obligation to share them fully.

By not releasing the source code, they only create an impression of publishing their work. They can get away with that, because many people will think that, if they can download the program and run it, it’s “available”. But it’s not!

Please, dear scientists, do what the people behind projects such as GNU Octave or R do: share your source code. Everybody will benefit from it, including your projects and yourselves.

Vim: Save highlighted syntax in HTML

July 10, 2007

Vim is able to highlight the syntax of a very large number of languages. It also has a nice feature that allows you to save the highlighted source to an HTML file.

:runtime! syntax/2html.vim

After typing this command, you’ll get a split window with your source in HTML. You can now save it to a file.

Directory renaming in SCM

June 7, 2007

SCM stands for Source Code Management. Pretty much the same thing can be called VCS, Version Control System. Perhaps even more TLAs are out there in the wild. It all boils down to a program which allows programmers to manage their source code.

Pretty much everybody who started using an SCM started with CVS and then moved to something else, probably Subversion, which is meant to be a CVS replacement. For more adventurous or demanding developers, there are many other SCMs: Git, Bazaar, Monotone, Mercurial, Darcs… and more.

Mark Shuttleworth has made an interesting point: that file and directory renaming is one of the most important operations an SCM has to handle. I got curious and wrote a test case for three SCMs I know: Bazaar, Git and Subversion. The scenario is:

(more…)

Genetic data in PostgreSQL

June 6, 2007

People usually get famous for the things they’ve done. Well, that’s not entirely true. They usually get famous for the things they’ve done successfully. You don’t get famous for trying and failing, now do you?

It works the same way for scientific publications. All scientists work hard trying various things, and when they finally succeed, they publish a paper. But what happens to all those hours spent on unsuccessful attempts? Nobody seems to be proud of blowing a whole laboratory up. Or of whatever didn’t work for them. This means that other people never learn that something was unsuccessful, and they’re likely to get the same unfeasible idea and repeat the same research. Needless to say, unsuccessfully.

Not that I’m proud of what I’ve done here, but I will at least allow other people to find this post on Google when searching for genetic data and relational databases. I’ll describe what I did, so that they at least don’t do it the way I did.

(more…)

Parallel programming course

May 19, 2007

I have spent this whole week in the Computer Science and Informatics building. I wonder how “informatics” crept into the English language; I was taught in 2002 that there is no such thing as “informatics”. There’s only Computer Science. The term “informatics” was supposed to be used only by mistake. German has “Informatik”, Polish has “informatyka”; it’s probably those non-native English speakers who just kept using it until even English people started believing that it’s a legitimate English word. A lie told a thousand times… well, where was I… yes, the course.

The main topic was parallel programming: harnessing multiple processors to solve a single, computationally intensive task such as a weather forecast or a car-crash simulation. There’s more to it than that: there are many more problems you can solve this way, and lots of money you can save by simulating things instead of doing them for real.

(more…)

Cartesian product of multiple sets

April 28, 2007

Everyone who has ever seen a table knows what a cartesian product is. For example:

        +------------------------+------------------+
        |    hard-working        |      lazy        |
+-------+------------------------+------------------+
| smart | smart and hard-working | smart but lazy   |
| dumb  | dumb but hard-working  | dumb and lazy    |
+-------+------------------------+------------------+

It’s an example of a cartesian product of two sets: {hard-working, lazy} and {smart, dumb}. It’s easy to generate such a product in bash:

maciej@clover ~ $ echo {smart,dumb}-{hard-working,lazy}
smart-hard-working smart-lazy dumb-hard-working dumb-lazy

It’s a list of all the possible pairs of elements.

In order to generate a cartesian product of two sets, one usually writes two nested loops. For example, in Python:

for i in ['smart', 'dumb']:
    for j in ['hard-working', 'lazy']:
        print("%s %s" % (i, j))

What if we want to generate a cartesian product of three sets? Three nested loops? What about four sets? What about N sets?

I’ve found a thread with examples of code generating such cartesian products. I especially liked the solution with generators, because it avoids keeping in memory potentially enormous tables with data. The example from the forum thread:

def cartesian_product(L,*lists):
    if not lists:
        for x in L:
            yield (x,)
    else:
        for x in L:
            for y in cartesian_product(lists[0],*lists[1:]):
                yield (x,)+y

It’s a short and effective solution, using recursion. This particular implementation has one disadvantage: the lists need to be given as separate function arguments:

cartesian_product(list1, list2, list3)

I wanted a solution where I could give it a list of lists instead.

UPDATE:  James Hopkin suggested using an asterisk (thanks!):

cartesian_product(*list_of_lists)

Here’s my original solution:

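def cartesian_product(lists, previous_elements=None):
    # Yield each combination as a list, taking one element from every list in `lists`.
    if previous_elements is None:
        previous_elements = []
    if len(lists) == 1:
        for elem in lists[0]:
            yield previous_elements + [elem]
    else:
        for elem in lists[0]:
            # Fix the current element and recurse over the remaining lists.
            for x in cartesian_product(lists[1:], previous_elements + [elem]):
                yield x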

Usage of this function can look like this:

a = []
a.append(['in', 'out'])
a.append(['put', 'come'])
for i in cartesian_product(a):
    print("%s%s" % (i[0], i[1]))

Another example generates a natural binary code, with the number of bits as a parameter. Note that when you give it a very large number of bits, it will take a long time to execute, but it will not exhaust the memory.

bits = 5
for i in cartesian_product([range(2) for x in range(bits)]):
    print(i)
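For the record, Python 2.6 and later ship itertools.product in the standard library. It is lazy as well, and it takes the lists as separate arguments, so the asterisk trick applies to it too. A minimal sketch, reusing the data from the examples above:

```python
from itertools import product

a = [['in', 'out'], ['put', 'come']]
# product() yields one tuple at a time instead of building the whole table.
words = ["".join(pair) for pair in product(*a)]
print(words)  # ['input', 'income', 'output', 'outcome']

# The binary code example: `repeat` replaces the list of identical ranges.
bits = 3
codes = list(product(range(2), repeat=bits))
print(codes[:3])  # [(0, 0, 0), (0, 0, 1), (0, 1, 0)]
```

Unlike my function, it yields tuples rather than lists.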

How wonderful it is to be loved!

January 21, 2007

I still love you

Django on Dreamhost: incomplete headers

December 1, 2006

I’ve recently bought hosting at Dreamhost. There were two reasons:

  1. It’s possible to run Django on it
  2. It’s cheap

It’s shared hosting, where many sites are served from a single physical machine. Each machine probably serves as many sites as possible, with the hardware capacity as the limit. The Dreamhost servers are pretty busy. My server, for instance:

[shasta]$ uptime
16:03:42 up 31 days, 13:59, 6 users, load average: 10.48, 9.74, 9.24

It’s not the processing power that is the bottleneck here, at least on “my” server. There’s usually about 40% of idle processor time. However, when I ran the tail command, it would get killed every now and then. Strange. Perhaps there’s a “garbage process collector” running on the server, terminating non-essential jobs here and there.
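A high load average next to idle processor time is not a contradiction: on Linux the load average also counts processes stuck waiting for I/O, not only those that want the CPU. As a small sketch, this is how the three numbers can be pulled out of the uptime line above and related to the number of cores. The core count is a made-up value, since I don’t know the real one for this server.

```python
import re


def load_averages(uptime_line):
    """Extract the 1-, 5- and 15-minute load averages from `uptime` output."""
    match = re.search(r"load average: ([\d.]+), ([\d.]+), ([\d.]+)", uptime_line)
    if match is None:
        raise ValueError("no load averages found in: %r" % uptime_line)
    return tuple(float(value) for value in match.groups())


line = "16:03:42 up 31 days, 13:59, 6 users, load average: 10.48, 9.74, 9.24"
one, five, fifteen = load_averages(line)

# Hypothetical core count; the real number for this server is unknown.
cores = 4
print("1-minute load per core: %.2f" % (one / cores))
```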

Well, I bought the hosting and moved my main site to Dreamhost. The PHP software (phpBB and MediaWiki) is working great. I decided to try running Django, so I developed a small application and installed it. I followed the instructions and, voilà, it worked. I was happy.

At least until the Django app eventually stopped responding. I clicked a link and the browser just waited for data. The data never came. I looked into the logs.

[Thu Nov 30 14:56:16 2006] [error] [client 83.xx.xxx.xx] FastCGI: comm with (dynamic) server "/home/automatthias/atopowe.pl/django.fcgi" aborted: (first read) idle timeout (120 sec)
[Thu Nov 30 14:56:16 2006] [error] [client 83.xx.xxx.xxx] FastCGI: incomplete headers (0 bytes) received from server "/home/automatthias/atopowe.pl/django.fcgi"

Something was wrong. Django isn’t officially supported on Dreamhost, so I couldn’t submit a complaint to support. I searched the Web and found out that some people had similar problems. Others hadn’t. Dreamhost has many servers, and I figure it’s got something to do with the load on each server and… perhaps the killing of “non-essential” processes.

After some more research, I found out that it’s not only Django users who’ve been experiencing this. There were also Rails users! You need to know that Rails is officially supported on Dreamhost. I got interested and read on. Dreamhost support responded to the affected Rails user:

Check with our support team and ask if our process monitor has been killing your processes. If you have a lot of processes hanging around that may be the case. We recently updated our process monitor to specifically handle dispatch.fcgi processes specially so that is probably not the problem but it’s worth asking.

“specifically handle dispatch.fcgi processes”? Aha!

Inspecting my processes with “ps -ef” revealed that there were several “django.fcgi” processes running. Some of them were zombies (defunct). If the “dispatch.fcgi” processes are handled specially, why don’t I pretend to run them? So I changed my setup a little bit: I renamed my “django.fcgi” file to “dispatch.fcgi” and altered two lines in the “.htaccess” file, so that they refer to the new name:

RewriteRule ^(dispatch\.fcgi/.*)$ - [L]
RewriteRule ^(.*)$ dispatch.fcgi/$1 [L]

Guess what?

No timeouts, no 500s, no incomplete headers. It works like a charm.
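For those wondering what the two RewriteRule lines actually do, here is a rough Python imitation of them. It is only a model of the pattern matching, not of Apache itself, and the request paths are made-up examples. The first rule leaves requests that already point at dispatch.fcgi alone; as far as I understand, it is needed because the rewritten URL goes through the rules again, and without it the second rule would keep adding the prefix.

```python
import re


def rewrite(path):
    """Mimic the two RewriteRule lines for a request path as .htaccess sees it
    (that is, without the leading slash)."""
    # Rule 1: already routed to the FastCGI script, leave it untouched ("-").
    if re.match(r"^(dispatch\.fcgi/.*)$", path):
        return path
    # Rule 2: everything else is handed to dispatch.fcgi, with the original
    # path appended as $1.
    return re.sub(r"^(.*)$", r"dispatch.fcgi/\1", path, count=1)


if __name__ == "__main__":
    # Hypothetical request paths.
    print(rewrite("blog/2006/"))                # dispatch.fcgi/blog/2006/
    print(rewrite("dispatch.fcgi/blog/2006/"))  # dispatch.fcgi/blog/2006/
```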