
Better experimental design through statistics

For my birthday, I want to go to a seminar about the better use of statistics in animal experiments, said no 9-year-old, ever. However, add 30 years and there I was. So why, instead of going to the zoo, was I at a stats seminar?

The fault in our p’s


Reproducibility is increasingly recognised as a problem in all areas of biological research, and it is a particular issue for studies using animals. Poor reproducibility has contributed to a failure to turn pre-clinical discovery studies into successful medicines. There are a number of causes, including unreliable research reagents, confirmation bias and the flawed use of statistics, a lot of which can be overcome with better experimental design, in particular better stats. As best I understand it (and I am not a statistician), using p<0.05 means that when there is no real effect, about 1 in 20 studies will still come out as ‘significant’ by chance, and therefore we see results as positive when they are not. Due to some complicated statistical sleight of hand involving alpha (the false positive rate), beta (the false negative rate) and how often the hypotheses being tested are actually true, it could be the case that as much as 60% of significant results are in fact untrue (though it would be better to follow some actual statisticians to check how this works).
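To see how a figure of that size can arise, here is a minimal sketch of the arithmetic. The inputs are assumptions chosen purely for illustration (a 10% chance that any tested hypothesis is true, and power of either 80% or 30%); they are not figures from the seminar, and the function name is made up for this example.

```python
def false_discovery_rate(alpha: float, power: float, prior: float) -> float:
    """Fraction of 'significant' results that are actually false positives.

    alpha: significance threshold (false positive rate when there is no effect)
    power: chance of detecting an effect that really exists (1 - beta)
    prior: assumed fraction of tested hypotheses that are really true
    """
    false_positives = alpha * (1 - prior)  # no real effect, but p < alpha anyway
    true_positives = power * prior         # real effect, and it was detected
    return false_positives / (false_positives + true_positives)


# Illustrative inputs only: none of these values come from real data.
for power in (0.8, 0.3):
    fdr = false_discovery_rate(alpha=0.05, power=power, prior=0.1)
    print(f"power = {power:.0%}: {fdr:.0%} of significant results are false")
```

Under these assumed inputs, well-powered studies give about 36% false discoveries and under-powered ones about 60%, which is why low power matters as much as the p value threshold.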

Times are a changin’

But why should you care? It is/ought to be a given that we should all do better science. However, career pressures, such as the need to publish positive results in glossy journals, may lead to practices that are not in line with best scientific practice. A number of tricks are used, consciously or unconsciously, to make a more positive story: p-hacking, HARKing (hypothesising after the results are known) etc. However, publishing is not the only career pressure; we all need to bring in grant funding. Funders in both the UK and US are using this as a tool to change research practice, and have updated their guidance (for example, the MRC and NC3Rs) to ensure more rigour in experimental design in grant applications.

Je-S kidding

In 2012, an assessment by review board members of the quality of the justification for animals in research found that the reasons given for animal usage and selection of species were OK, but the statistical justification was either absent or plain wrong. It used to be that grants could be awarded conditionally, subject to amendment if there were issues with the statistics. As of next year, grants can be rejected if the quality of the justification of animal usage is not good enough, without the opportunity to amend the application post-award. Let me repeat that, because we are all inclined to spend 90+% of our effort perfecting the case for support and then fill out the rest of the form in a mad dash. Grants may be rejected, without a chance to amend post-award, if the case for animals is poor. Stating ‘We did this because we always have’ won’t cut it anymore.

Get it right


What follows are some steps that can help you improve the stats part of the justification for animal usage. Obviously this will not automatically get you funding – if it did, I’d be unlikely to share it, would I? But hopefully it will help you frame your application in a clearer way. NB: don’t spend all your word count on the stats; you still need a case for animal usage/species etc.

Step 1 – PICO. Put a single sentence at the beginning of the animal justification explaining in brief what you are aiming to do. Try using PICO: P (population) – to whom, I (intervention) – what will you do, C (control) – what will you compare against, O (outcome) – what will you measure. Consider your unit of measurement, e.g. 100 sections from a single liver are still n=1.
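The n=1 point can be made concrete with a small sketch. The numbers and animal names below are invented, and it assumes the treatment is given to the whole animal, so the animal (not the section) is the experimental unit.

```python
from statistics import mean

# Hypothetical scores: several sections measured per liver, one liver per animal.
sections_by_animal = {
    "mouse_1": [4.1, 3.9, 4.3, 4.0],
    "mouse_2": [5.2, 5.0, 5.1],
    "mouse_3": [3.6, 3.8, 3.7, 3.9, 3.5],
}

# Collapse to one value per animal before any group comparison,
# because repeated sections from one animal are not independent.
per_animal = {animal: mean(values) for animal, values in sections_by_animal.items()}

n_sections = sum(len(values) for values in sections_by_animal.values())
n_animals = len(per_animal)
print(f"{n_sections} sections measured, but n = {n_animals}")
```

Averaging per animal is only one simple way to handle this; a statistician may suggest a model that keeps the individual sections but accounts for which animal they came from.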

Step 2 – Describe the effect size. What is the biological effect you are looking for, ideally in a human? One of the speakers made the very good point that a positive outcome in animal studies is seldom the actual endpoint we are interested in, i.e. human disease. Think about what a real-world, biologically significant effect would be and justify why it would matter in a patient, drawing on your own expertise in the disease area. An example using blood pressure: 160 mmHg is bad, 120 mmHg is good, so the ‘effect size’ of a blockbuster drug would be a 40 mmHg drop in blood pressure. (The effect size that goes into a power calculation also depends on variability, so use a stats package or check with a statistician.)
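As a rough sketch of how variability changes the picture: one common convention (an assumption here, not something the seminar prescribed) is to divide the difference you care about by the standard deviation to get a standardised effect size. The standard deviations below are made up for illustration.

```python
def standardised_effect_size(difference: float, sd: float) -> float:
    """Difference between groups expressed in units of standard deviation."""
    return difference / sd


drop_mmhg = 160 - 120  # the 40 mmHg drop from the blood pressure example

# Hypothetical standard deviations: real values must come from your own
# data or from the literature.
for assumed_sd in (10, 20, 40):
    d = standardised_effect_size(drop_mmhg, assumed_sd)
    print(f"SD = {assumed_sd} mmHg -> standardised effect size = {d:.1f}")
```

The same 40 mmHg drop is a huge effect if animals barely differ from one another and a modest one if they vary a lot, which is why the variability estimate needs justifying as carefully as the effect itself.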

Step 3 – Using your real-world effect size, perform a power calculation. Power calculations enable you to give an actual value for the number of animals needed for each study that will lead to real, reproducible results. ‘We use n=6 because everyone else does’ or ‘We use n=5 because that’s how many fit in a cage’ apparently is not good enough. There are four main elements of a power calculation: you need to state all four, and justify why. The four parts are: the level of type 1 error (false positives; α, the threshold the p value is compared against, normally 0.05), the level of type 2 error (false negatives; β, normally 20%, which is the same as 80% power), the effect you are looking for (see above) and variability (this has to be based on real-world data, yours or from the literature). There are lots of stats packages that, once you have this info, can do the calculation for you; this one works.
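To show how the four elements combine, here is a minimal sketch for the simplest case: comparing the means of two equal-sized groups with a two-sided test. It uses a textbook normal approximation, which slightly underestimates the numbers an exact t-test calculation would give when groups are small, so treat it as a sanity check rather than a replacement for a proper stats package. The function name and example inputs are assumptions for illustration.

```python
from math import ceil
from statistics import NormalDist


def animals_per_group(difference: float, sd: float,
                      alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate animals per group for a two-group, two-sided comparison.

    difference: the effect you want to detect (same units as sd)
    sd:         variability, from real data or the literature
    alpha:      type 1 error rate
    power:      1 - beta, where beta is the type 2 error rate
    """
    d = difference / sd                            # standardised effect size
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided threshold
    z_beta = NormalDist().inv_cdf(power)
    return ceil(2 * (z_alpha + z_beta) ** 2 / d ** 2)


# Hypothetical inputs: a 40 mmHg difference with assumed SDs of 20 and 40 mmHg.
for assumed_sd in (20, 40):
    n = animals_per_group(difference=40, sd=assumed_sd)
    print(f"SD = {assumed_sd} mmHg -> about {n} animals per group")
```

Doubling the assumed variability roughly quadruples the number of animals needed, so the source of the variability estimate is worth stating explicitly in the application.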

Step 4 – How will you reduce unconscious bias? Use the ARRIVE guidelines as a checklist and try the NC3Rs Experimental Design Assistant. Think about how you will blind, normalise, randomise etc. There are some worked examples on the MRC site which might help.
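Randomisation and blinding can be as simple as the sketch below. This is one possible approach, not the method of the ARRIVE guidelines or the Experimental Design Assistant; the animal IDs, group names and seeds are made up, and in practice someone other than the person scoring the animals should hold the key.

```python
import random


def randomise(animal_ids, groups, seed):
    """Shuffle the animals, then deal them out so groups are equal in size."""
    rng = random.Random(seed)  # fixed seed so the allocation can be audited
    shuffled = list(animal_ids)
    rng.shuffle(shuffled)
    return {animal: groups[i % len(groups)] for i, animal in enumerate(shuffled)}


def blind(allocation, seed):
    """Replace group names with arbitrary letter codes; return codes and key."""
    rng = random.Random(seed)
    groups = sorted(set(allocation.values()))
    codes = [chr(ord("A") + i) for i in range(len(groups))]
    rng.shuffle(codes)
    key = dict(zip(groups, codes))  # held by a colleague until analysis is done
    coded = {animal: key[group] for animal, group in allocation.items()}
    return coded, key


animals = [f"mouse_{i:02d}" for i in range(1, 13)]  # hypothetical IDs
allocation = randomise(animals, ["treatment", "control"], seed=2016)
coded, key = blind(allocation, seed=42)
for animal in sorted(coded):
    print(animal, coded[animal])  # the scorer sees only A or B
```

Recording the seed alongside the protocol means the allocation can be reproduced later if anyone asks how the animals were assigned.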

Step 5 – GET ADVICE FROM A STATISTICIAN. Did I mention I am not a statistician? If you are reading this as your sole guidance you are in trouble! There are multiple caveats with the approach I have suggested. It is clearly not going to work for all cases: discovery science/hypothesis-free work will need other approaches. Multiple time points and repeated sampling change how a p value should be interpreted, because every extra test is another chance of a false positive (think multiple coin tosses – the chance of six heads in a row is the same as the chance of H,T,T,T,H,H). Since most programmes of work are complex and multi-endpoint, it is not possible to do a detailed power calculation for each part; one approach would be to do the analysis for the major arm of the work to demonstrate you know what you are doing.
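A quick sketch of why multiple tests matter: if each test has a 5% chance of a false positive, the chance of at least one false positive grows quickly with the number of tests. The calculation assumes the tests are independent, which repeated measurements on the same animals are not, so it is only a rough guide. The Bonferroni correction shown is one common and conservative fix; it is added here for illustration and was not part of the seminar advice.

```python
def familywise_error_rate(alpha: float, n_tests: int) -> float:
    """Chance of at least one false positive across independent tests."""
    return 1 - (1 - alpha) ** n_tests


def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test threshold that keeps the overall error rate at or below alpha."""
    return alpha / n_tests


for n_tests in (1, 6, 20):
    fwer = familywise_error_rate(0.05, n_tests)
    threshold = bonferroni_threshold(0.05, n_tests)
    print(f"{n_tests:>2} tests: {fwer:.0%} chance of a false positive; "
          f"Bonferroni per-test threshold = {threshold:.4f}")
```

With six time points tested separately at p<0.05, the chance of at least one spurious ‘significant’ result is already around one in four.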

On the whole my birthday trip to the stats seminar was, surprisingly, much better than a trip to the zoo (then again, I don’t really like zoos that much). It gave me a lot to think about in improving experimental design, and it was a step towards better, more reproducible science as a community.

The seminar was organised by the NC3Rs and MRC, but I am writing on my own behalf and the opinions stated here are mine.

John Tregoning

John Tregoning studies the immune response to respiratory viral infections. He has been a PI since 2008 and is a Senior Lecturer at Imperial College London. You can read more of his writing here or follow him on Twitter at @DrTregoning.

Further reading

Academy of Medical Sciences 2015 Reproducibility and reliability of biomedical research: improving research practice.

Percie du Sert N 2016 Improving the design of animal experiments: the Experimental Design Assistant. Immunology News, March 2016, 24.