Sunday, December 20, 2009

The science of closed boxes

A friend pointed me to an article in New Scientist, "Why we shouldn't release all we know about the cosmos". The article suggests that data on the Cosmic Microwave Background Radiation (CMBR) should be released slowly, not all at once.
If the whole data set is released at once, as is planned, any new ideas that cosmologists come up with may have to remain untested because they will have no further data to test them with.
It took a moment, but eventually I realized that they were suggesting the method of blind analysis.

We also use blind analysis on LIGO data (LIGO is the Laser Interferometer Gravitational-wave Observatory, a gigantic device designed to detect gravitational waves). Whenever LIGO records a set of data, only 10% of that data is released. That 10% is called the playground. We analyze the heck out of that playground! There's a huge computer program, called the data analysis pipeline, which is used to decide whether there are any events in the playground that look like real gravitational waves. A large group of scientists works on the pipeline, fine-tuning parameters and adding new bells and whistles. And the whole time they are doing this, they are not allowed to peek at the other 90% of the data. That box is closed!

This is the sort of box I want you to visualize

Once the scientists are satisfied with the pipeline, they "open the box". That means they get to look at the other 90% of the data. But once the box is open, they're not allowed to change the pipeline in any way. If they want to add more bells and whistles to the pipeline, they have to wait until the next time LIGO takes a set of data, perhaps in a year or more.

What is the meaning of this silly ritual? Is it some sort of Christmas tradition among data analysts?

There are all sorts of ways you can bias your analysis. If you know what the results are every time you try a different method of data analysis, then you can, to some extent, "select" the results you like. That's bad! We want the results to be unbiased, so that everyone can agree on them. Therefore, blind analysis has two stages. First, you choose a method of data analysis without looking at the full data set. Then you apply that method to the full data set without changing it.
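Here's a toy sketch of those two stages in Python. Everything in it is invented for illustration; the data, the fake "events", and the simple threshold have nothing to do with the real pipeline, but the playground-then-box logic is the same.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "detector output": Gaussian noise with a few loud samples standing
# in for candidate events.  All numbers here are made up for illustration.
data = rng.normal(size=100_000)
data[rng.integers(0, data.size, 20)] += 8.0  # injected fake "events"

# Stage 1: split off a 10% playground; the other 90% stays in the box.
playground, box = data[:10_000], data[10_000:]

# Stage 1 (continued): choose the analysis method using the playground
# only.  Here the whole "pipeline" is just a noise estimate and a
# 5-sigma threshold; the real pipeline has far more knobs, but the rule
# is the same -- all the tuning happens while the box stays shut.
threshold = 5 * playground.std()

# Stage 2: open the box and apply the frozen method to the hidden 90%.
# From this point on, the threshold may not be changed.
candidates = np.abs(box) > threshold
print(f"threshold = {threshold:.2f}, candidates found in the box: {candidates.sum()}")
```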

I read the paper reported on in New Scientist, and it offers another cool explanation of the same idea. The goal in science is to compare a bunch of different models and determine which model best explains our observations. But first, we need to come up with those models. The models will be educated guesses based on all the evidence we've collected thus far. So if we want to test the models, it's somewhat redundant to use the present evidence; we should instead collect new observations to test them.

The problem in cosmology is that at some point, there will be no new observations to make. There is only one universe. There is only one CMBR map, with all its random statistical fluctuations. If you stare long enough at those statistical fluctuations, chances are good that you'll find some false pattern. The pattern will be very difficult to falsify, since there is no more data to collect after that. The solution? Release data piece by piece, so that there will still be new data to test our models.
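Here's a quick demonstration of how staring at random fluctuations produces false patterns. The "sky map" below is pure noise with nothing in it, and the candidate "patterns" are invented on the spot, yet the best match looks far more impressive than any single honest comparison would.

```python
import numpy as np

rng = np.random.default_rng(1)

# A toy "sky map": pure Gaussian noise, with no real signal in it at all.
# (Purely illustrative -- this is not real cosmological data.)
sky = rng.normal(size=1_000)

# Try thousands of made-up "patterns" and keep whichever one happens to
# correlate best with the noise.
best_corr = 0.0
for trial in range(5_000):
    pattern = rng.normal(size=sky.size)
    corr = np.corrcoef(sky, pattern)[0, 1]
    best_corr = max(best_corr, abs(corr))

# Any single pattern correlates with the noise at roughly the
# 1/sqrt(1000) ~ 0.03 level, but the best of thousands looks several
# times stronger -- a "discovery" that fresh data would promptly kill.
print(f"best correlation found in pure noise: {best_corr:.3f}")
```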

So you see, even something which sounds as boring as data analysis can have all these counter-intuitive tricks involved. Hiding data in a closed box? It sounds silly, possibly even counter to science's goal of obtaining as much true information about the world as possible. But if it's necessary to filter out human biases, I think we should do it!

4 comments:

Anonymous said...

When designing new computer pattern recognition algorithms, researchers often split the data into "training data" and "test data". Say a computer program is meant to recognize human faces (or fingerprints, or handwriting, or whatever): it is "trained" on one set of data and tested on another. With neural networks as one example, more training on the training data enables the algorithm to recognize the patterns in the training data better and better, but past a certain point, the algorithm actually gets worse on the test data. That is why it is important to keep the test data and the training data segregated, especially when new data is difficult to acquire (human faces are not a good example of that).
This also reminds me of stock market predictions made by computer. I have seen cases where huge amounts of historical data are fed into a computer, which then looks for patterns that would predict which way the stock market will go based on previously available financial data. With enough analysis of enough variables, the predictions can come out nearly perfect! But the problem is that the patterns don't hold for the future. Immediately after the equations are found, they stop working well. Then the computer program can be improved by including the new data, and again it can be made nearly perfect. But the improved program stops making good predictions too. The effect is worse than if you use fewer variables and less data, keep the prediction equations simpler, and don't force the computer to match the historical data so perfectly. It's basically the "overtraining" I was describing above.
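Here is a little Python sketch of the overtraining effect I mean, with polynomials standing in for the neural networks or stock-market rules (the data and the models are made up; only the trend matters).

```python
import numpy as np

rng = np.random.default_rng(2)

# Noisy samples of a smooth curve, split into training and test data.
x = np.linspace(-1, 1, 40)
y = np.sin(np.pi * x) + rng.normal(scale=0.3, size=x.size)
train_x, test_x = x[::2], x[1::2]   # keep the two sets segregated
train_y, test_y = y[::2], y[1::2]

# Fit ever more flexible models.  The training error keeps shrinking,
# but past a certain point the test error typically gets worse -- the
# model has started fitting the noise instead of the curve.
for degree in (1, 3, 9, 15):
    coeffs = np.polyfit(train_x, train_y, degree)
    train_err = np.mean((np.polyval(coeffs, train_x) - train_y) ** 2)
    test_err = np.mean((np.polyval(coeffs, test_x) - test_y) ** 2)
    print(f"degree {degree:2d}: train error {train_err:.3f}, test error {test_err:.3f}")
```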

miller said...

Anonymous,

Incidentally, you've hit on precisely the sort of research I did on LIGO. I worked on machine learning. In fact, we used more than just two sets of data. There were layers and layers of data sets, in order to prevent overtraining and to make sure that all the techniques were unbiased.
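As a rough sketch of that layered idea (with made-up data and a single made-up tuning knob, nothing like the real LIGO analysis):

```python
import numpy as np

rng = np.random.default_rng(3)

# A toy detection statistic that tends to be larger for "signal" samples
# than for "noise" samples.  All of this is invented for illustration.
n = 300
is_signal = rng.random(n) < 0.5
stat = rng.normal(size=n) + 1.5 * is_signal

# Layer 1 would be for training a classifier (nothing to train in this
# toy, so it sits unused).  Layer 2 is for choosing the tuning parameter.
# Layer 3 is looked at exactly once, after everything is frozen, to get
# an unbiased estimate of how well the method works.
layer1, layer2, layer3 = np.split(np.arange(n), [100, 200])

thresholds = np.linspace(-1.0, 3.0, 41)
layer2_acc = [np.mean((stat[layer2] > t) == is_signal[layer2]) for t in thresholds]
best_t = thresholds[int(np.argmax(layer2_acc))]

final_acc = np.mean((stat[layer3] > best_t) == is_signal[layer3])
print(f"threshold chosen on layer 2: {best_t:.2f}, accuracy on layer 3: {final_acc:.2f}")
```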

Mark Erickson said...

Way cool that you worked on LIGO data. I took part in the Einstein at Home program that used home computers to do something with the data. Can you explain to me how that works in general and also specifically about this issue?

miller said...

Scientists are searching for basically four different kinds of sources of gravitational waves. First are compact binary coalescences (i.e., two black holes or neutron stars falling into each other), which is what I was working on. Second are gravitational-wave bursts, a catch-all category for short events. Third are continuous waves, which are long-lasting waves with a very constant frequency (e.g., from a spinning neutron star with an asymmetrical mass distribution). Fourth is the stochastic background from the Big Bang, which is very weak and looks like noise, but is constant over time.

From what I understand, Einstein@Home helps to search for continuous waves. The good thing about continuous waves is that even if they're weak, they last indefinitely. So we can look at a long period of time (say, a year), and let the noise all average out. The bad thing is that for this to work, you have to guess the frequency exactly right, within a very small error. And since you need to account for the Doppler shift and other effects, you also need to guess the location in the sky exactly right. So we have to search through about 10^17 possibilities.
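Here's a toy sketch of why the frequency guess has to be so precise (the sampling rate, signal strength, and frequency below are all made up, and this is nothing like the real search code). A weak sinusoid buried in noise pops out after a long coherent integration, but only if the assumed frequency is almost exactly right.

```python
import numpy as np

rng = np.random.default_rng(4)

fs = 16.0                               # made-up sampling rate, Hz
t = np.arange(0.0, 100_000.0, 1 / fs)   # a "long" observation (~28 hours)
f_true = 1.234567                       # hypothetical signal frequency, Hz
data = 0.01 * np.sin(2 * np.pi * f_true * t) + rng.normal(size=t.size)

def recovered_amplitude(freq):
    """Coherently demodulate the data at an assumed frequency."""
    template = np.exp(-2j * np.pi * freq * t)
    return 2 * abs(np.sum(data * template)) / t.size

# With the right frequency, the 0.01 signal stands well above the
# ~0.001 noise floor.  Being off by even ~1e-5 Hz over this stretch
# smears the signal out and it sinks back into the noise -- hence the
# enormous number of frequency (and sky-position) guesses to check.
for offset in (0.0, 1e-6, 1e-5):
    print(f"frequency offset {offset:.0e} Hz -> "
          f"amplitude {recovered_amplitude(f_true + offset):.4f}")
```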

I think they have ways of reducing the number of possibilities, but they still need a lot of computing power. And that's where Einstein@Home comes in. (For those who don't know, Einstein@Home is a project which allows individuals to donate computing time just by installing a screen saver.)