Sample Size Determination for Animal Study


Federal Policy Regarding the Numbers of Animals Used in Research

Federal policy currently requires that investigators search for and consider alternatives which would a) obviate the use of animals (replacement), b) minimize the number of animals needed (reduction), and c) decrease the pain or distress experienced by animals in research (refinement). This brief article will address sample size calculations as a simple, but important step in arriving at appropriate estimates of the number of animals needed for an experiment.

Estimation of the number of subjects required to answer an experimental question is an important step in avoiding waste of animal life. It is important to emphasize that experimental waste can occur either as a result of excessive estimates of the number of animals needed or as a result of unrealistically low estimates. On one hand, an excessive sample size can result in waste of animal life and other precious resources, including time and money, because equally valid information could have been gleaned from a smaller number of animals. However, underestimates of sample size are also wasteful, since an insufficient sample size has a low probability of detecting a statistically significant difference between groups, even if a difference really exists. Consequently, an investigator might wrongly conclude that groups do not differ, when in fact they do.

In essence both errors in estimation, too few or too many, result in a waste of animal life. This should, of course, be deplored on ethical grounds, but in addition, the need to search for alternatives which lessen the numbers of animals used in experimentation is mandated by The Animal Welfare Act. The principles articulated in the Animal Welfare Act have increasingly influenced federal regulatory policy. For example, guidelines to Institutional Animal Care and Use Committees (IACUCs) from the Office of Protection from Research Risks (OPRR) state:

Current federal regulatory policy, as well as generally accepted ethical principles, incorporate two general goals. The first is that scientific reliance on live animals should be minimized. The second is that pain, distress, and other harm to laboratory animals should be reduced to the minimum necessary to obtain valid scientific data. Federal policy directs the IACUC to review proposals for animal use to ensure that investigators incorporate these principles into their research.

The number of experimental animals should be the minimum necessary to produce valid results. When possible and appropriate, a non-animal substitute should be used, or a species of lower phylogenetic order substituted if available. Unnecessarily duplicative research should be avoided for scientific and ethical reasons.

From "Institutional Animal Care and Use Committee Guidebook," OPRR, U.S. Department of Health and Human Services, Public Health Service, NIH Publication No. 92-3415.

What is Involved in Sample Size Calculations:

While the need to arrive at appropriate estimates of sample size is clear, many scientists are unfamiliar with the factors which influence determination of sample size and with the techniques for calculating estimated sample size. A quick look at how most textbooks of statistics treat this subject indicates why many investigators regard sample size calculations with fear and confusion.

While sample size calculations can become extremely complicated, it is important to emphasize, first, that all of these techniques produce estimates, and, second, that there are just a few major factors influencing these estimates. As a result, it is possible to obtain very reasonable estimates from some relatively simple formulae.

When comparing two groups, the major factors that influence sample size are:

1) How large a difference you need to be able to detect.

2) How much variability there is in the factor of interest.

3) What "p" value you plan to use as a criterion for statistical "significance."

4) How confident you want to be that you will detect a "statistically significant" difference, assuming that a difference does exist.

The size of the sample you need also depends on the "p value" that you use. A "p value" of less than 0.05 is frequently used as the criterion for deciding whether observed differences are likely to be due to chance. If p<0.05, it means that a difference as large as the one you observed would arise by chance less than 5% of the time if the groups did not really differ. If you want to use a more rigid criterion (say, p<0.01) you will need a larger sample. Finally, the size of the sample you will need also depends on "power," that is, the probability that you will observe a statistically significant difference, assuming that a difference really exists.

To summarize, in order to calculate a sample size estimate you need some estimate of how different the groups might be or how large a difference you need to be able to detect, and you also need an estimate of how much variability there will be within groups. In addition, your calculations must take into account what you want to use as a "p value" and how much "power" you want.

The Information You Need to Do Sample Size Calculations

Since you haven’t actually done the experiment yet, you won’t know how different the groups will be or what the variability (as measured by the standard deviation) will be. But you can usually make reasonable guesses. Perhaps from your experience (or from previously published information) you anticipate that the untreated hypertensive subjects will have a mean systolic blood pressure of about 160 mm Hg with a standard deviation of about 10 mm Hg. You decide that a reduction in systolic blood pressure to a mean of 150 mm Hg would represent a clinically meaningful reduction. Since no one has ever done this experiment before, you don’t know how much variability there will be in response, so you will have to assume that the standard deviation for the test group is at least as large as that in the untreated controls. From these estimates you can calculate an estimate of the sample size you need in each group.

Sample Size Calculations for a Difference in Means

The actual calculations can get a little bit cumbersome, and most people don’t even want to see equations. Consequently, I have put together a spreadsheet (SAMPLESZ.XLS) which does all the calculations automatically. All you have to do is enter the estimated means and standard deviations for each group. In the example shown here I assumed that my control group (group 1) would have a mean of 160 and a standard deviation of 10. I wanted to know how many subjects I would need in each group to detect a significant difference of 10 mm Hg. So, I plugged in a mean of 150 for group 2 and assumed that the standard deviation for this group would be the same as for group 1.

The spreadsheet actually generates a table which shows estimated sample sizes for different "p values" and different power levels. Many people arbitrarily use p=0.05 and a power level of 80%. With these parameters you would need about 16 subjects in each group. If you want 90% power, you would need about 21 subjects in each group.

The format of this spreadsheet makes it easy to play "what if." If you want to get a feel for how many subjects you might need if the treatment reduces pressures by 20 mm Hg, just change the mean for group 2 to 140, and all the calculations will automatically be redone for you.
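For readers who would rather see the arithmetic, the following is a minimal sketch of one common normal-approximation formula for comparing two means. It is an assumption that this resembles what the spreadsheet computes; the article does not give the spreadsheet's equations, and the function name `n_per_group_means` is purely illustrative.

```python
from statistics import NormalDist


def n_per_group_means(mean1, sd1, mean2, sd2, alpha=0.05, power=0.80):
    """Approximate subjects per group for a two-sided comparison of two means.

    Normal-approximation formula (an assumption, not taken from SAMPLESZ.XLS):
        n = (z_alpha + z_beta)^2 * (sd1^2 + sd2^2) / (mean1 - mean2)^2
    """
    z = NormalDist()                      # standard normal distribution
    z_alpha = z.inv_cdf(1 - alpha / 2)    # critical value for a two-sided test
    z_beta = z.inv_cdf(power)             # quantile matching the desired power
    return (z_alpha + z_beta) ** 2 * (sd1 ** 2 + sd2 ** 2) / (mean1 - mean2) ** 2


# Blood pressure example: control mean 160, treated mean 150, SD 10 in both groups.
for alpha in (0.05, 0.01):
    for power in (0.80, 0.90):
        n = n_per_group_means(160, 10, 150, 10, alpha, power)
        print(f"p={alpha:.2f}, power={power:.0%}: about {n:.1f} per group")

# "What if" the treatment lowers pressure by 20 mm Hg instead of 10?
print(n_per_group_means(160, 10, 140, 10))
```

Because this approximation ignores the uncertainty in estimating the standard deviation, it tends to run slightly low when groups are very small, so results should be rounded up and treated as estimates.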

Sample Size Calculations for a Difference in Proportions

The bottom part of the same spreadsheet generates sample size calculations for comparing differences in frequency of an event. Suppose, for example, that a given treatment was successful 50% of the time and you wanted to test a new treatment with the hope that it would be successful 90% of the time. All you have to do is plug these (as fractions) into the spreadsheet, and the estimated sample sizes will be calculated automatically.

The spreadsheet shows that to have a 90% probability of showing a statistically significant difference (using p<0.05) in proportions this great, you would need about 22 subjects in each group.
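The same kind of calculation can be sketched for two proportions. The formula below is a simple normal approximation that uses each group's own binomial variance; that choice is an assumption, since the spreadsheet's formula is not stated, and the function name is hypothetical.

```python
from statistics import NormalDist


def n_per_group_proportions(p1, p2, alpha=0.05, power=0.90):
    """Approximate subjects per group for a two-sided comparison of two proportions.

    Assumed formula (unpooled normal approximation):
        n = (z_alpha + z_beta)^2 * [p1(1-p1) + p2(1-p2)] / (p1 - p2)^2
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_beta = z.inv_cdf(power)
    variance_sum = p1 * (1 - p1) + p2 * (1 - p2)   # binomial variances of the two groups
    return (z_alpha + z_beta) ** 2 * variance_sum / (p1 - p2) ** 2


# Success rates of 50% versus 90%, p<0.05, 90% power.
print(n_per_group_proportions(0.5, 0.9))
```

Other approximations (for example, ones that pool the two proportions or add a continuity correction) give somewhat larger numbers, so the result is best read as a rough guide.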

Availability of the Spreadsheet

The spreadsheet described here is saved in an Excel file called SAMPLESZ.XLS.

The Power Analysis Method of Estimating Sample Size

The Power Analysis method of estimating sample size depends on a mathematical relationship between the following six variables.

Variability of the material

An estimate of the standard deviation of the experimental subjects is necessary (for quantitative variables). This must come from a previous study, a pilot experiment or from the literature. This is the main weakness of the method because the estimate of sample size depends critically on this estimate.

Effect size of clinical or biological importance

Consider an experiment with just a control and a treated group. A small difference in the means may be of little scientific or clinical interest. However, an investigator would be very interested in being able to detect a large difference. Thus, the investigator needs to be able to specify the minimum effect size likely to be of interest.
For quantitative characters it is often helpful to consider the effect size in terms of standard deviation units by dividing it by the standard deviation (SDev.). In this way all traits are in the same units, and it becomes easier to judge the consequences of choosing various effect sizes. This is described in more detail below. To detect an effect size smaller than one SDev. will require a "large" experiment. To detect one greater than two SDevs. will require a "small" experiment.

Significance level

This is usually set at 0.05, but in some circumstances it may be more appropriate to use a different figure. For example, power will be higher if the significance level is set at 0.1 rather than 0.05, so if the aim is to prove a negative (i.e. that the treatment is having no effect), then the significance level may be set at 0.1.

Power

The power is the probability of being able to detect the specified effect and call it significant at the designated level of significance. Most people will want a powerful experiment. Usually this is set somewhere between 80% and 95%. The higher the specified power, the larger the sample size that will be needed, other things being equal. High power is needed if the consequences of failing to detect a treatment effect are likely to be serious.

Sidedness of the test

In most circumstances it will not be known whether the treatment will increase or decrease the mean of the character of interest, so a two-sided test should be used. In some circumstances there will be a good biological reason why the effect of the treatment can only go in one direction. In this case a one-sided test should be used.

Sample size

The purpose of the power analysis is usually to determine sample size. However, where resources are limited, sample size may be fixed, and the aim of the analysis might then be to determine the power of the experiment or the effect size likely to be detected.
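When the sample size is fixed, the relationship between the six variables can be turned around to estimate power. The sketch below uses a normal approximation for a two-group comparison; it is an illustrative assumption rather than the method of any particular package, and `approx_power` is a hypothetical name.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(effect_size_sd, n_per_group, alpha=0.05, two_sided=True):
    """Approximate power of a two-group comparison of means.

    effect_size_sd is the difference between means divided by the standard
    deviation. The negligible chance of a significant result in the wrong
    direction is ignored.
    """
    z = NormalDist()
    tail = alpha / 2 if two_sided else alpha       # sidedness changes the critical value
    z_crit = z.inv_cdf(1 - tail)
    shift = effect_size_sd * sqrt(n_per_group / 2)  # expected size of the test statistic
    return z.cdf(shift - z_crit)


print(approx_power(1.0, 21))                    # two-sided, 5% significance level
print(approx_power(1.0, 21, two_sided=False))   # one-sided test gives higher power
print(approx_power(1.0, 21, alpha=0.10))        # so does a looser significance level
```

The three calls illustrate the points made above: for the same animals, a one-sided test or a significance level of 0.1 raises the power, while a smaller group lowers it.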

Putting it together

The mathematical equations relating these variables are complex. Many modern statistical packages such as MINITAB now offer power analysis calculations. There are also a number of free web sites which will do the calculations for simpler situations.

There are also several stand-alone statistical packages such as nQuery Advisor which offer power analysis for a wide range of situations, although these are not inexpensive.


The two sample case

The graph in Fig. 1 shows the sample size as a function of the effect size in standard deviation units for a 90% power, a 5% significance level and a two sided test. Thus, using this, an effect size (difference between mean of treated and control group) equal to one standard deviation will require about 23 animals per group.


Fig. 1. Sample size as a function of effect size in standard deviations assuming a 90% power, a 5% significance level and a two-sided test.

Note that this graph may also be used to determine effect size if sample size is fixed.

Large group sizes are required to detect small effects such as those of less than half a standard deviation. However, anyone using laboratory mice or rats has enormous control over variability. Isogenic strains raised in a controlled environment, free of disease, fed a uniform diet and matched for age and body weight are very uniform, so the standard deviation is much smaller than that found, for example, in human studies. This means that in terms of standard deviations, most research workers are only interested in studying "large" effects of over one standard deviation in magnitude. Effects of two or more standard deviations can be detected with groups of only about eight animals.

Example using the graph.

An investigator wishes to compare two anaesthetics for dogs, and in particular if there was a difference in blood pressure while under anaesthetic of 10 mmHg or more she would like to know about it. She plans to do the experiment using beagles, and previous studies show that their mean blood pressure under an anaesthetic is 108 mmHg, with a standard deviation (SD) of 9 mmHg. The effect size is therefore 10/9 = 1.1 SDs, and the data will be analysed using a two-sample t-test. Reading from the graph, this will require about 20 dogs per group. The same calculations can be done using www.biomath.info.

Suppose only 30 dogs are available. From the graph it is possible to estimate that with 15 animals per group the effect size that is likely to be detectable (with the assumptions given) is about 1.3 SDs, or 1.3 x 9 = about 12 mmHg.

(Note that rather than using a between-animal design it would probably be better to test both anaesthetics on each dog in random order using a within-animal design. Estimation of sample size in this case would require an estimate of the standard deviation of blood pressure of dogs repeatedly anaesthetised with the same anaesthetic. The resulting data would be analysed using a paired t-test. Power calculations for the paired t-test are provided in www.biomath.info.)
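The two readings from the graph in the beagle example (sample size from a given effect size, and detectable effect size from a fixed sample size) can be approximated with a short calculation. This is only a sketch using a normal approximation, assuming 90% power, a 5% significance level and a two-sided test as in Fig. 1; the function names are illustrative and this is not how the graph itself was produced.

```python
from math import sqrt
from statistics import NormalDist


def z_sum(alpha=0.05, power=0.90):
    """Sum of the two standard normal quantiles used by both calculations."""
    z = NormalDist()
    return z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)


def n_for_effect(effect_size_sd, alpha=0.05, power=0.90):
    """Approximate animals per group needed to detect an effect given in SD units."""
    return 2 * z_sum(alpha, power) ** 2 / effect_size_sd ** 2


def detectable_effect(n_per_group, alpha=0.05, power=0.90):
    """Approximate effect size (in SD units) detectable with a fixed group size."""
    return z_sum(alpha, power) * sqrt(2 / n_per_group)


sd = 9.0                     # SD of blood pressure in beagles, mmHg
effect = 10.0 / sd           # 10 mmHg difference expressed in SD units
print(n_for_effect(effect))             # dogs per group for a 10 mmHg difference
print(detectable_effect(15) * sd)       # detectable difference in mmHg with 15 per group
```

These approximate figures come out somewhat lower than the graph readings (roughly 17 rather than about 20 dogs per group, and roughly 1.2 rather than about 1.3 SDs), because the normal approximation ignores the extra uncertainty that comes from estimating the standard deviation in small groups. The graph readings are the more cautious choice.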

The two sample case with binary outcomes (e.g. percentages)

Table 1 shows the sample size needed when comparing two proportions, assuming a 5% significance level and a 90% power. Thus, to distinguish between a 20% incidence and a 40% incidence of some binary trait will require 109 animals in each group. These are very large sample sizes. Clearly, it is very much better to measure something than to count!

Table 1. Number required in each group for comparing two proportions (based on a normal approximation of the binomial distribution) with a significance level of 0.05 and a power of 90%.

Proportion in each group	0.2	0.3	0.4	0.5	0.6	0.7
0.2	-
0.3	392	-
0.4	109	477	-
0.5	52	124	519	-
0.6	30	56	130	519	-
0.7	19a	31	56	124	477	-
0.8	13a	19a	30	52	109	392

a Assumptions may lead to some inaccuracy.
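Entries of this kind can be generated with a normal approximation to the binomial distribution. The sketch below assumes the common version of the formula that pools the two proportions under the null hypothesis and rounds up. The source does not state which formula was used, so this is an assumption, although it does give 109 for 0.2 versus 0.4 and 392 for 0.2 versus 0.3.

```python
from math import ceil, sqrt
from statistics import NormalDist


def n_two_proportions(p1, p2, alpha=0.05, power=0.90):
    """Animals per group for a two-sided comparison of two proportions.

    Assumed formula (pooled normal approximation to the binomial):
        n = [z_alpha * sqrt(2 * p_bar * (1 - p_bar))
             + z_beta * sqrt(p1(1-p1) + p2(1-p2))]^2 / (p1 - p2)^2
    where p_bar is the average of p1 and p2.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_beta = z.inv_cdf(power)
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(numerator / (p1 - p2) ** 2)   # round up to whole animals


# Print the lower triangle in the same layout as the table.
columns = (0.2, 0.3, 0.4, 0.5, 0.6, 0.7)
rows = (0.3, 0.4, 0.5, 0.6, 0.7, 0.8)
for row in rows:
    cells = [str(n_two_proportions(col, row)) for col in columns if col < row]
    print(row, *cells, sep="\t")
```

The normal approximation becomes less reliable when the required groups are very small, which is presumably why the smallest entries in the table carry a caveat.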

More complex situations

With more than two groups it is more difficult to specify the effect size of scientific interest, and the more complex situations are not generally catered for by the free web sites. The problem is tackled in different ways by different computer packages. MINITAB, for example, asks you to specify the difference between the two most extreme means, while nQuery Advisor gets you to specify group means and then calculates their standard deviation.

The ILAR web site http://dels.nas.edu provides extensive information on all aspects of laboratory animal science. Full text of an excellent article on power analysis is given in

http://dels.nas.edu/ilar_n/ilarjournal/43_4/v4304Dell.shtml

A book on the design of animal experiments aimed at biomedical research workers can be found at

www.lal.org.uk/hbook14.htm
