OK, I’ve got the data…now what?

Last month, Dr. Missy Simpson answered a few questions about her new role as epidemiologist for Morris Animal Foundation’s Golden Retriever Lifetime Study. This month she answers a couple of questions about interpreting data gathered from big studies.
How do you figure out what is significant and what isn’t when you have thousands of data points? What is the most common mistake made when analyzing big data sets?
It is easy to think that the more data you gather, the better. While this is true to some extent, there are pitfalls that are unique to big data sets. The most common errors relate to making associations. Big data contains a lot of information and it is tempting to just look at everything for a potential association. This can result in two problems: finding associations that really don’t mean anything clinically, and finding associations that are due to chance.
Let me give an example of the first problem:
Suppose we conduct two studies looking at two treatments for separation anxiety in dogs. One study wants to learn if increasing exercise by an extra 20 minutes a day reduces anxiety. This study has 30 dogs in it. The second study wants to test the effectiveness of an expensive anti-anxiety medication. This study has 5,000 dogs in it. Suppose both reduce anxiety by 30 percent. Statistically, the expensive drug looks better, reflecting the large number of dogs in the study.
Which treatment would you use on your own dog to treat separation anxiety? I think most of us would choose exercising our dogs for 20 minutes rather than buying an expensive drug. Although on paper the bigger study looks better, clinically the smaller study results are more useful.
This example simplifies things a bit, but reminds us that we need to be careful in interpreting what is clinically or practically significant when we have a big study.
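For readers who like to see the arithmetic, here is a minimal sketch of why the bigger study "looks better" on paper. It assumes a simple one-sample z-test and a made-up effect of 0.3 standard deviations for both treatments; the function name, the effect size, and the choice of test are illustrative assumptions, not the methods used in any real study.

```python
import math


def two_sided_p_value(effect_in_sd: float, n_dogs: int) -> float:
    """Two-sided p-value for a one-sample z-test.

    effect_in_sd is the average improvement divided by its standard
    deviation (a hypothetical, made-up quantity for this illustration).
    """
    # The test statistic grows with the square root of the sample size,
    # even though the effect itself stays the same.
    z = effect_in_sd * math.sqrt(n_dogs)
    # For a standard normal variable, P(|Z| > z) = erfc(z / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))


EFFECT_IN_SD = 0.3  # hypothetical: the same effect in both studies

p_exercise = two_sided_p_value(EFFECT_IN_SD, n_dogs=30)
p_drug = two_sided_p_value(EFFECT_IN_SD, n_dogs=5000)

print(f"Exercise study (30 dogs):   p = {p_exercise:.3g}")
print(f"Drug study (5,000 dogs):    p = {p_drug:.3g}")
```

Under these assumptions the 30-dog study lands above the usual 0.05 cutoff while the 5,000-dog study lands far below it, even though the dogs improved by exactly the same amount. The p-value is telling us about sample size here, not about which treatment is more useful.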
The second problem with big studies is how we interpret things that might happen by chance. We know coincidences happen all the time. When we have really big studies, it is easier to find these coincidental associations, and if we aren’t careful, we can interpret these findings as significant associations instead of recognizing that they are just chance occurrences. There are precautions epidemiologists take to avoid making this error, but it still happens.
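A small simulation can make the chance problem concrete. The sketch below compares two groups of dogs on 200 made-up traits where, by construction, there is no real difference at all, and counts how many comparisons still come out "significant." It then applies a Bonferroni correction, which is one common precaution and is used here purely as an illustration. All names and numbers are hypothetical.

```python
import math
import random
import statistics


def null_p_value(rng: random.Random, n_per_group: int) -> float:
    """p-value for a two-group z-test when both groups share one distribution."""
    group_a = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
    group_b = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
    difference = statistics.mean(group_a) - statistics.mean(group_b)
    # Standard error of a difference in means when each group has variance 1.
    standard_error = math.sqrt(2.0 / n_per_group)
    z = difference / standard_error
    return math.erfc(abs(z) / math.sqrt(2))


N_TRAITS = 200      # hypothetical number of traits we "look at"
N_PER_GROUP = 100   # hypothetical dogs per group
ALPHA = 0.05

rng = random.Random(2024)  # fixed seed so the run is repeatable
p_values = [null_p_value(rng, N_PER_GROUP) for _ in range(N_TRAITS)]

# Naive approach: call anything below 0.05 an association.
naive_hits = sum(p < ALPHA for p in p_values)
# Bonferroni: divide the cutoff by the number of comparisons made.
bonferroni_hits = sum(p < ALPHA / N_TRAITS for p in p_values)

print(f"Chance 'associations' at p < {ALPHA}: {naive_hits} of {N_TRAITS}")
print(f"Still flagged after Bonferroni:    {bonferroni_hits} of {N_TRAITS}")
```

With a 0.05 cutoff we expect roughly 1 in 20 comparisons, about 10 of the 200 here, to look significant by coincidence alone. The stricter cutoff removes most or all of them, at the cost of making real associations harder to detect.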
A big study like the Golden Retriever Lifetime Study is great because if we do our statistical analyses carefully, when we find an association we can be confident it is significant. We just need to be careful not to succumb to the temptation to over-interpret our data set.