Monday, December 7, 2009

On the hiding of climate data

I've been hearing a lot lately about this "ClimateGate" story. Someone hacked the e-mails of a bunch of climate scientists and found evidence of fraud. That's pretty outrageous, isn't it? Seriously, what kind of person goes into science, which is a method of revealing truth, only to cover up and fabricate? Not my kind of scientist, that's for sure.

But then I actually saw what they consider evidence of fraud.
"I’ve just completed Mike’s Nature trick of adding in the real temps to each series for the last 20 years (ie from 1981 onwards) and from 1961 for Keith’s to hide the decline."
I can see how someone might see this as evidence of fraud. This scientist is talking about using a "trick" to "hide the decline" in temperature! But thanks to my research experience in data analysis, limited as it is, it's clear to me that it's completely innocuous, even without seeing the context.

In my experience, a significant part of data analysis is all about knowing what data to keep and what data to throw out. That's right, I threw out lots of data. Well, I didn't really throw anything out in the sense of deleting it from computers. I just excluded it from the analysis and from the results.

Let me explain a bit more about my research from over two years ago. I was looking at data from magnetometers, which very precisely measure changes in the Earth's magnetic field. One of the problems was that every so often, the Earth's magnetic field would jump up by a factor of a trillion or more. I wanted to cover this up! The public shouldn't be allowed to know! So what did I do? One by one, I went through these gigantic spikes in the data and removed them. In retrospect, this was not a very efficient way to do it, but then I was an undergraduate researcher, so my time was pretty worthless anyway.

Why did I throw it out? It was bad data. I didn't like it. I clearly had some sort of personal vendetta against the data. More seriously, it's because these gigantic spikes in the data are caused by glitches in the magnetometer devices or other electronics. What exactly causes these glitches? Well, how should I know? I was just an undergraduate researcher, not an engineer, and all I know is that the magnetometers occasionally acted really funky. If the Earth's magnetic field really were jumping up by a factor of a trillion, I'd expect to see the effects all across the Earth, at all magnetometers at once. And I didn't. So it was bad data. I didn't like it. I hid it in a little corner marked "raw data".
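For the curious, here is roughly what doing this automatically might look like. This is a minimal sketch in Python, not what I actually did (I went through the spikes by hand). The function name, the window size, the threshold, and the sample readings are all made up for illustration. The idea is to flag any sample that sits absurdly far from the median of its neighbors, and to leave the raw data untouched.

```python
from statistics import median

def flag_spikes(samples, window=5, threshold=10.0, min_scale=1e-9):
    """Return a list of booleans, True where a sample looks like a glitch.

    A sample is flagged when it differs from the median of its neighbors
    by more than `threshold` times their median absolute deviation (MAD).
    All parameter defaults are illustrative assumptions, not tuned values.
    """
    flags = []
    for i, value in enumerate(samples):
        lo = max(0, i - window)
        hi = min(len(samples), i + window + 1)
        # Neighbors on both sides, excluding the sample under test.
        neighbors = samples[lo:i] + samples[i + 1:hi]
        center = median(neighbors)
        mad = median(abs(x - center) for x in neighbors)
        # Guard against a perfectly flat neighborhood (MAD of zero).
        scale = max(mad, min_scale)
        flags.append(abs(value - center) > threshold * scale)
    return flags

# Hypothetical readings in arbitrary units, with one trillion-fold glitch.
readings = [50.1, 50.2, 50.0, 50.3, 5.0e13, 50.2, 50.1, 50.4, 50.2]
flags = flag_spikes(readings)

# The raw list stays as it is; the analysis only uses the unflagged samples.
cleaned = [x for x, bad in zip(readings, flags) if not bad]
```

The median is used instead of the mean because a single trillion-fold spike would drag the mean along with it. A more careful version would also compare against other magnetometers, since a real jump in the Earth's field should show up everywhere at once.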

In my experience, data analysis is more or less one long string of choosing which data to throw out.

Of course, you don't just throw out data willy-nilly. You have to come up with justifications for it. And saying, "I like the conclusions which we would draw from this data, but not that data," is not sufficient justification. It's tricky, because you don't want to bias yourself towards a previously held belief by only selecting the evidence which confirms the belief. There are some famous examples where scientists threw out data they thought was bad, but which later turned out to be good. For example, before the cosmic microwave background radiation from the Big Bang was discovered, scientists had actually seen it on radio telescopes, but they thought it was just noise caused by pigeon droppings. Another example is the ozone hole, which was initially filtered out as bad data for about a decade. It's true, scientists make mistakes sometimes, not because they're conspiring against the public, but because Science Is Hard.

Of course, those examples are the exception, not the rule. Data analysts throw data out on a regular basis, and the vast majority of the time, it's because they ought to.

So in the case of the climate researchers, even without looking at context, we know they probably had a good reason to throw out data. In fact, I know they had a good reason, because I looked it up. Apparently it has to do with the unreliability of using tree growth data to determine the temperatures of the last few decades. I don't really understand any of that, because I don't have much interest in climate science, but it should at least be clear that the justifications for their methods have been published out in the open. If climate scientists are indeed throwing out data that should be kept, it's not because they're part of a secret conspiracy.

So why is it that the e-mail talks about using a "trick" to "hide" data? Isn't that a bit of an odd word choice? Not really. "Trick" is commonly used to mean simply a clever method. "Hide" means that they're hiding unreliable data by putting more reliable data in its place. I have trouble seeing what the big deal is.

Tell you what it looks like to me now. Confirmation bias. People wanted to find a conspiracy, so they looked through a thousand e-mails and found a few e-mails to confirm their beliefs.* You'd think that if there really were some giant conspiracy, it would show up in more than just a few. But let's all just forget about the rest of the e-mails and documents. We don't like the data, so let's just throw it out, eh?

*Yes, there were a few others, but they don't impress me. I think the worst example was a request to delete some e-mail correspondence. In the interest of brevity, my response consists only of two words, "Hanlon's" and "razor".


DarkSapiens said...

Well said.

And in a certain way, when you worked for LIGO you were helping to create methods to actually throw out data in a more automated way, weren't you? :)


miller said...

Yep, pretty much.

Jeffrey Ellis said...

I'm on a crusade to get everyone to correctly attribute it as Heinlein's Razor.