Sunday, December 20, 2009

The science of closed boxes

A friend pointed me to an article in New Scientist, "Why we shouldn't release all we know about the cosmos". The article suggests that data on the Cosmic Microwave Background Radiation (CMBR) should be released slowly, not all at once.
If the whole data set is released at once, as is planned, any new ideas that cosmologists come up with may have to remain untested because they will have no further data to test them with.
It took a moment, but eventually I realized that they were suggesting the method of blind analysis.

We also use blind analysis on LIGO data (LIGO is the Laser Interferometer Gravitational-wave Observatory, a gigantic device designed to detect gravitational waves). Whenever LIGO records a set of data, only 10% of that data is released to the analysts. That 10% is called the playground. We analyze the heck out of that playground! There's a huge computer program, called the data analysis pipeline, which is used to decide if there are any events in the playground which look like real gravitational waves. A large group of scientists builds on the pipeline, fine-tuning parameters and adding new bells and whistles. And the whole time they are doing this, they are not allowed to peek at the other 90% of the data. That box is closed!

[Image caption: This is the sort of box I want you to visualize]

Once the scientists are satisfied with the pipeline, they "open the box". That means they get to look at the other 90% of the data. But once the box is open, they're not allowed to change the pipeline in any way. If they want to add more bells and whistles to the pipeline, they have to wait until the next time LIGO takes a set of data, perhaps in a year or more.
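Here's a minimal sketch of the ritual in Python. It is not LIGO's actual pipeline: the `Pipeline` class, the single `threshold` parameter, the random 10% split, and the fake Gaussian noise are all assumptions I made up for illustration. The point is only the order of operations: tune on the playground, freeze, then open the box.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen: the pipeline cannot be edited once built
class Pipeline:
    """Hypothetical one-parameter pipeline: flag samples above a threshold."""
    threshold: float

    def find_events(self, data):
        return [x for x in data if x > self.threshold]


def split_playground(data, fraction=0.1, seed=0):
    """Randomly set aside `fraction` of the data as the playground."""
    rng = random.Random(seed)
    shuffled = list(data)
    rng.shuffle(shuffled)
    n_play = int(len(shuffled) * fraction)
    return shuffled[:n_play], shuffled[n_play:]


def tune_on_playground(playground, max_false_alarms=1):
    """Pick a threshold that lets at most `max_false_alarms` playground samples through."""
    ranked = sorted(playground, reverse=True)
    return Pipeline(threshold=ranked[max_false_alarms])


# Fake detector output: pure Gaussian noise (an assumption for this sketch).
rng = random.Random(42)
data = [rng.gauss(0.0, 1.0) for _ in range(10_000)]

# Stage 1: the box stays closed; only the playground is used for tuning.
playground, box = split_playground(data)
pipeline = tune_on_playground(playground)

# Stage 2: open the box and apply the frozen pipeline exactly once.
events = pipeline.find_events(box)
print(f"{len(events)} candidate events in the box")

# From here on, `pipeline.threshold = ...` raises dataclasses.FrozenInstanceError.
```

Making the dataclass frozen is just a cute way to enforce the rule in code. In real life the rule is enforced by people agreeing not to touch the pipeline.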

What is the meaning of this silly ritual? Is it some sort of Christmas tradition among data analysts?

There are all sorts of ways you can bias your analysis. If you see the results every time you try a different method of data analysis, then you can, to some extent, "select" the results you like. That's bad! We want the results to be unbiased, so that everyone can agree on them. Therefore, in blind analysis, there are two stages. First, you choose a method of data analysis without looking at the full data. Then you apply that method to the full data without changing it.
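To see how much damage peeking can do, here's a toy simulation in Python. Everything in it is hypothetical: the events are pure noise with a true average of zero, and the "analysis method" is just a quality cut chosen from a short list (`CUTS`). One analyst tries every cut on the full data and keeps the most exciting answer. The other picks the cut on a 10% playground and applies it, unchanged, to the remaining 90%.

```python
import random
import statistics

CUTS = [i / 10 for i in range(10)]  # candidate quality cuts: 0.0 to 0.9


def make_noise(rng, n=500):
    """Pure noise: (quality, value) pairs with no signal, so the true mean value is 0."""
    return [(rng.random(), rng.gauss(0.0, 1.0)) for _ in range(n)]


def measure(events, cut):
    """The 'analysis method': average value of the events passing the quality cut."""
    passed = [value for quality, value in events if quality >= cut]
    return statistics.fmean(passed) if passed else 0.0


def peeking_result(events):
    """Try every cut on the full data and report the largest answer."""
    return max(measure(events, cut) for cut in CUTS)


def blind_result(events, playground_fraction=0.1):
    """Choose the cut on the playground only, then apply it unchanged to the box."""
    n_play = int(len(events) * playground_fraction)
    playground, box = events[:n_play], events[n_play:]
    best_cut = max(CUTS, key=lambda cut: measure(playground, cut))
    return measure(box, best_cut)


def average_results(n_trials=1000, seed=1):
    """Repeat both analyses on many independent noise data sets and average the answers."""
    rng = random.Random(seed)
    peeking, blind = [], []
    for _ in range(n_trials):
        events = make_noise(rng)
        peeking.append(peeking_result(events))
        blind.append(blind_result(events))
    return statistics.fmean(peeking), statistics.fmean(blind)


peek_bias, blind_bias = average_results()
print(f"peeking: {peek_bias:+.3f}   blind: {blind_bias:+.3f}")
```

Since the true answer is zero, any consistent offset is bias. The peeking analyst should come out noticeably positive, because the maximum of several noisy estimates is positive on average, while the blind analyst should stay close to zero.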

I read the paper that New Scientist reported on, and it has another cool explanation of the same idea. The goal in science is to compare a bunch of different models, and determine which model best explains our observations. But first, we need to come up with those models. The models will be educated guesses based on all the evidence we've collected thus far. So if we want to test the models, it's somewhat circular to use the present evidence, since the models were built to fit it; we should instead collect new observations to test the models.

The problem in cosmology is that at some point, there will be no new observations to make. There is only one universe. There is only one CMBR map, with all its random statistical fluctuations. If you stare long enough at those statistical fluctuations, chances are good that you'll find some false pattern. The pattern will be very difficult to falsify, since there is no more data to collect after that. The solution? Release data piece by piece, so that there will still be new data to test our models.

So you see, even something which sounds as boring as data analysis can have all these counter-intuitive tricks involved. Hiding data in a closed box? It sounds silly, possibly even counter to science's goal of obtaining as much true information about the world as possible. But if it's necessary to filter out human biases, I think we should do it!