Statistical tests

These are mathematical procedures applied to sample statistics to determine their degree of certainty and their statistical significance.

Non-parametric inferential statistical methods:

These are mathematical procedures for testing a statistical hypothesis which, unlike parametric statistics, make no assumptions about the frequency distributions of the variables being measured.

The level of measurement may be nominal or ordinal.

The sample does not have to be random.

The frequency distribution does not have to be normal.

They can be used with smaller samples.

Parametric inferential statistical methods:

These are mathematical procedures to test the statistical hypothesis which assume that the distributions of the determined variables have certain characteristics.

The level of measurement must be ratio or interval.

The sample must be random.

The frequency distribution must be normal.

The variances of the groups being compared must be similar (homogeneity of variance).

Response variable
Study factor	Nominal qualitative (two categories)	Nominal qualitative (> 2 categories)	Ordinal qualitative	Quantitative
Qualitative (two groups)
Independent	Z-test for comparison of proportions. Chi-squared. Fisher’s exact test.	Chi-squared.	Mann-Whitney U-test.	Student-Fisher t-test. Welch test.
Paired	McNemar test. Fisher’s exact test.	Cochran’s Q test.	Sign test. Wilcoxon signed-rank test.	Student-Fisher t-test for paired data.
Qualitative (more than two groups)
Independent	Chi-squared.	Chi-squared.	Kruskal-Wallis test.	Analysis of variance.
Paired	Cochran’s Q test.	Cochran’s Q test.	Friedman test.	Two-way analysis of variance.
Quantitative	Student-Fisher t-test.	Analysis of variance.	Spearman’s correlation. Kendall’s tau.	Pearson’s correlation. Linear regression.

When the data do not meet the assumptions required by the statistical tests applicable to quantitative variables, the corresponding tests for an ordinal response variable (non-parametric tests) should be used instead.

KOLMOGOROV-SMIRNOV TEST

Non-parametric statistical significance test of the null hypothesis that the observed data follow a specified theoretical distribution (or, in its two-sample form, that both groups come from the same distribution).

This test, which is valid only for continuous variables, compares the theoretical cumulative distribution function with the observed one and calculates a discrepancy value, usually denoted D, which is the maximum absolute difference between the observed and theoretical distributions. From D a probability value P is obtained. When goodness-of-fit to the normal distribution is being checked, P is the probability of obtaining a distribution that differs from the theoretical one at least as much as the observed one does, if a random sample of size n really had been drawn from a normal distribution.

If this probability is high, there are no statistical grounds for assuming that our data do not come from the assumed distribution, whereas if it is very low, it is not acceptable to assume this probability model for the data.
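
The discrepancy value D described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name ks_statistic and the sample values are hypothetical, the theoretical distribution is passed in as a cumulative distribution function, and the conversion of D into a probability P is left out.

```python
from statistics import NormalDist

def ks_statistic(sample, cdf):
    """Return D: the largest absolute gap between the empirical
    cumulative distribution of `sample` and the theoretical `cdf`."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        # The empirical distribution jumps from i/n to (i+1)/n at x,
        # so the gap is checked on both sides of the jump.
        d = max(d, abs(f - i / n), abs((i + 1) / n - f))
    return d

# Hypothetical data compared with a standard normal distribution.
sample = [-1.2, -0.4, 0.1, 0.3, 0.9, 1.5]
print(f"D = {ks_statistic(sample, NormalDist(0, 1).cdf):.3f}")
```

If the mean and standard deviation of the theoretical distribution are estimated from the same sample, the usual Kolmogorov-Smirnov probability values no longer apply exactly.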

F-TEST

Statistical test which is used to compare variances.

The observed F-statistic is the test statistic in ANOVA and other variance comparison tests.
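
As a rough illustration of a variance comparison, the sketch below computes the ratio of two sample variances. The function name f_statistic and the data are hypothetical, and placing the larger variance in the numerator is one common convention rather than the only one.

```python
from statistics import variance

def f_statistic(sample_a, sample_b):
    """Return (F, numerator df, denominator df), where F is the ratio
    of the two sample variances with the larger one on top."""
    var_a, var_b = variance(sample_a), variance(sample_b)
    if var_a >= var_b:
        return var_a / var_b, len(sample_a) - 1, len(sample_b) - 1
    return var_b / var_a, len(sample_b) - 1, len(sample_a) - 1

# Hypothetical measurements from two groups.
f, df_num, df_den = f_statistic([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
print(f"F = {f:.2f} with {df_num} and {df_den} degrees of freedom")
```

The resulting value is then compared with the F distribution for those degrees of freedom.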

CHI-SQUARED TEST

A chi-squared test is any statistical hypothesis test in which the test statistic has a chi-squared distribution if the null hypothesis is true.

It determines whether there is an association between qualitative variables.

If the p-value associated with the test statistic is less than the chosen significance level, the null hypothesis will be rejected.

It is used to analyze contingency tables and comparison of proportions in independent data.
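
A minimal sketch of the calculation for a 2 x 2 contingency table follows. The function name chi_squared_2x2 and the counts are hypothetical, no continuity correction is applied, and the p-value uses the closed form that holds only for one degree of freedom.

```python
from math import erfc, sqrt

def chi_squared_2x2(table):
    """Pearson chi-squared statistic and p-value for a 2 x 2 table
    of counts, given as [[a, b], [c, d]]."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if rows and columns were independent.
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    # With 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    return chi2, erfc(sqrt(chi2 / 2))

# Hypothetical counts: exposure (rows) against outcome (columns).
chi2, p = chi_squared_2x2([[10, 20], [20, 10]])
print(f"chi-squared = {chi2:.3f}, p = {p:.4f}")
```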

FISHER’S EXACT TEST (p < 5%)

It enables the effect of chance to be evaluated.

It is a statistical significance test used to analyze categorical data in small samples.

The Fisher test is needed when we have data classified into two categories in two different ways, that is, a 2 x 2 contingency table.

Statistical significance test used to compare proportions in contingency tables.

It is preferred to the chi-squared test when the sample size is small (less than 30 subjects).

It is the statistical test of choice when the Chi-squared test cannot be used because the sample size is too small.
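
For a 2 x 2 table the exact probability can be computed directly from the hypergeometric distribution. The sketch below is an illustration only: the function name fisher_exact_2x2 and the counts are hypothetical, and the two-sided p-value is defined here as the sum of the probabilities of all tables no more likely than the observed one, which is one of several conventions in use.

```python
from math import comb

def fisher_exact_2x2(a, b, c, d):
    """Two-sided Fisher exact p-value for the table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2

    def prob(x):
        # Probability of x in the top-left cell with all margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    observed = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    # A small tolerance guards against floating-point ties.
    return sum(p for p in map(prob, range(lowest, highest + 1))
               if p <= observed * (1 + 1e-9))

# Hypothetical small-sample table.
print(f"p = {fisher_exact_2x2(3, 1, 1, 3):.4f}")
```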

McNEMAR TEST.

Statistical test which is used to compare proportions in paired data.

Statistical significance test of the null hypothesis that there is no change in the proportion of subjects who experience an event, when each individual is evaluated twice (under different conditions) and the data are paired.
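
Only the discordant pairs, that is, the subjects whose result changed between the two evaluations, enter the calculation. The sketch below assumes the chi-squared form of the test without continuity correction; the function name mcnemar and the counts are hypothetical, and at least one discordant pair is required.

```python
from math import erfc, sqrt

def mcnemar(b, c):
    """McNemar chi-squared statistic and p-value.

    b: pairs that changed from positive to negative.
    c: pairs that changed from negative to positive.
    """
    chi2 = (b - c) ** 2 / (b + c)
    # One degree of freedom, so P(X > chi2) = erfc(sqrt(chi2 / 2)).
    return chi2, erfc(sqrt(chi2 / 2))

# Hypothetical study: 15 subjects changed one way, 5 the other way.
chi2, p = mcnemar(15, 5)
print(f"chi-squared = {chi2:.2f}, p = {p:.4f}")
```

With very few discordant pairs, an exact binomial calculation is usually preferred to this approximation.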

BINOMIAL TEST

In statistics, the binomial test is an exact test of the statistical significance of deviations from a theoretically expected distribution of observations in two categories.

The most common use of the binomial test is in the case where the null hypothesis is that two categories are equally likely to occur.
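
The equally-likely case can be sketched as follows. The function name binomial_test_two_sided and the example counts are hypothetical; the two-sided p-value is taken as the total probability of every outcome that is no more likely than the observed one.

```python
from math import comb

def binomial_test_two_sided(k, n, p=0.5):
    """Exact two-sided p-value for k successes in n trials when the
    null hypothesis says each trial succeeds with probability p."""
    probs = [comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(n + 1)]
    observed = probs[k]
    # A small tolerance guards against floating-point ties.
    return sum(q for q in probs if q <= observed * (1 + 1e-9))

# Hypothetical result: 1 observation in the first category out of 10.
print(f"p = {binomial_test_two_sided(1, 10):.4f}")
```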

PEARSON’S CORRELATION TEST

This is used to study the association between a study factor and a quantitative response variable. It measures the degree of association between two variables giving values between -1 and 1.

Values close to 1 will indicate strong positive linear association.
Values close to -1 will indicate strong negative linear association.
Values close to 0 will indicate no linear association, which does not mean that another type of association cannot exist.
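
The interpretation of these values can be checked with a short sketch. The function name pearson_r and the data are hypothetical; the second example shows a perfect quadratic relationship for which the coefficient is nevertheless 0.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(ss_x * ss_y)

# A perfectly linear relationship.
print(pearson_r([1, 2, 3], [2, 4, 6]))
# y = x squared: a clear association, but not a linear one.
print(pearson_r([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4]))
```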

Pearson’s chi-squared test, which should not be confused with the correlation coefficient above, tests the null hypothesis that the relative frequencies of occurrence of the observed events follow a specified frequency distribution.

The events should be mutually exclusive.

This is a goodness-of-fit test which establishes whether or not an observed frequency distribution differs from a theoretical distribution.

KAPPA COEFFICIENT

The Kappa is a general index of agreement in interobserver studies. It indicates the degree of interobserver agreement.

It permits the level of interobserver agreement to be quantified in order to reduce the subjectivity of the method used (mobility test) and to know whether the degree of agreement could be due to chance.

The percentage of agreement along with the Kappa index is used for qualitative variables.

The Kappa coefficient is used for two therapists and the Fleiss coefficient for more than two therapists.

This coefficient usually ranges between 0 and 1: 0 corresponds to agreement identical to that expected by chance and 1 to perfect agreement between the examinations.

Negative values are possible and indicate agreement lower than that expected by chance; they usually mean that the two therapists disagree on how to perform the method.

It is calculated as the proportion of agreement beyond that expected by chance alone that has been observed between two applications of the same instrument (for example, a judgement carried out by two observers separately).

The maximum coefficient of agreement is 1.00.

A value of 0.00 indicates no agreement beyond chance.

Between 0.00 and 0.20: slight.
Between 0.21 and 0.40: fair.
Between 0.41 and 0.60: moderate.
Between 0.61 and 0.80: substantial.
Between 0.81 and 1.00: almost perfect.

A coefficient of 0.4 would be considered the limit of acceptable reliability of a test.

The Kappa is “a corrector of the measure of agreement”.

As a statistical test, the Kappa can verify that the agreement exceeds the levels of chance.
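
The correction for chance agreement can be sketched for two observers as follows. The function name cohen_kappa and the counts are hypothetical; the table holds the number of subjects for each combination of the two observers' ratings.

```python
def cohen_kappa(table):
    """Kappa for two observers.

    table[i][j] is the number of subjects rated category i by the
    first observer and category j by the second observer.
    """
    k = len(table)
    n = sum(sum(row) for row in table)
    observed = sum(table[i][i] for i in range(k)) / n
    row_totals = [sum(row) for row in table]
    col_totals = [sum(table[i][j] for i in range(k)) for j in range(k)]
    # Agreement expected by chance alone, from the marginal totals.
    expected = sum(row_totals[i] * col_totals[i] for i in range(k)) / n**2
    return (observed - expected) / (1 - expected)

# Hypothetical ratings of 50 subjects as positive or negative.
print(f"Kappa = {cohen_kappa([[20, 5], [10, 15]]):.2f}")
```

In this hypothetical example the two observers agree on 70% of the subjects, but half of that agreement would be expected by chance, so the Kappa is considerably lower than the raw percentage of agreement.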

Measure	All the blocks	block C2-C4	block C5-6
Kappa value	K = 0.675 SE = 0.041 Z = 17.067	K = 0.756 SE = 0.045 Z = 16.823	K = 0.460 SE = 0.091 Z = 5.039
Specificity	98%	98%	91%
Sensitivity	74%	78%	55%

K = Kappa coefficient, SE = standard error, Z = test statistic for the significance of the Kappa.

INTRACLASS CORRELATION COEFFICIENT (ICC)

The intraclass correlation coefficient (ICC) is for quantitative variables.

Use Landis and Koch’s model 2 for inter-examiner reliability, and model 3 for intra-examiner reliability (Landis RJ & Koch GG, 1977).

This index also ranges from 0 to 1.

– The value 1 corresponds to perfect reproducibility between measurements.

– The value 0 indicates that the variance between measurements taken in a single patient is the same as the variance between measurements taken in different patients.

TESTS	ICC	KAPPA
Height iliac crests	52	0.26
Height PSIS	75	0.54
SFFT	82	0.62
SFFT	63	0.26
Gillet	60	0.18
Height. active leg extended	93	0.81
Joint play	75	0.61
Thigh thrust	81	0.73
Separation	58	0.17
Gaenslen	80	0.51
Patrick	80	0.65
Sacral thrust	68	0.38
Sensitivity SI. ligament	91	0.83
Compression	85	0.59

SPEARMAN’S CORRELATION TEST

This is a non-parametric correlation measure. It assumes an arbitrary monotonic function to describe the relationship between two variables, without making any assumptions about the frequency distribution of the variables.

Unlike the Pearson’s coefficient test, it does not require the assumption that the relationship between variables is linear, nor that the variables are measured in interval scales; it can be used for variables measured at the ordinal level.

It is used if the conditions for applying the Pearson test are not met.

It is a variant of the Pearson correlation test, calculated on the ranks of the values. It is applied when each value in itself is not as important as its position relative to the other values.

Its values are interpreted in the same way as those of the Pearson correlation coefficient, except that they describe monotonic rather than strictly linear association.

The Spearman correlation measures the degree of association between two quantitative or ordinal variables whose relationship is monotonic, that is, one variable tends always to increase or always to decrease as the other increases.

It is more general than Pearson’s correlation coefficient: the Spearman correlation can also be calculated for exponential or logarithmic relationships between the variables.
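
One way to see this is to compute the Pearson formula on ranks instead of raw values. The sketch below is illustrative only: the helper names average_ranks and spearman_rho and the data are hypothetical, and tied values receive the average of their ranks.

```python
from math import exp, sqrt

def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    ordered = sorted(values)
    return [ordered.index(v) + (ordered.count(v) + 1) / 2 for v in values]

def spearman_rho(xs, ys):
    """Spearman correlation: the Pearson formula applied to ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    ss_x = sum((a - mean_x) ** 2 for a in rx)
    ss_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / sqrt(ss_x * ss_y)

# An exponential relationship is monotonic but far from linear.
xs = [1, 2, 3, 4, 5]
ys = [exp(x) for x in xs]
print(spearman_rho(xs, ys))
```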

WILCOXON TEST

This contrasts the null hypothesis that the sample comes from a population in which the magnitude of the positive and negative differences between the values of the variables is the same.

Non-parametric statistical test for comparing two related samples (for example, two treatments applied to the same subjects).

The data distributions do not need to follow the normal distribution.

It is therefore a less restrictive test than the Student’s t-test.
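
The comparison of positive and negative differences can be sketched as follows. The names average_ranks and wilcoxon_signed_rank and the data are hypothetical; the sketch returns only the two rank sums, drops zero differences (a common convention), and leaves the lookup of the p-value to tables or software.

```python
def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    ordered = sorted(values)
    return [ordered.index(v) + (ordered.count(v) + 1) / 2 for v in values]

def wilcoxon_signed_rank(first, second):
    """Return (W+, W-): the rank sums of the positive and negative
    differences between paired observations."""
    diffs = [b - a for a, b in zip(first, second) if b != a]
    # Rank the differences by absolute size, ignoring their sign.
    ranks = average_ranks([abs(d) for d in diffs])
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    return w_plus, w_minus

# Hypothetical scores of four subjects under two conditions.
print(wilcoxon_signed_rank([10, 12, 14, 16], [11, 15, 12, 20]))
```

Under the null hypothesis the two rank sums should be similar; a very small value of the lower one is evidence against it.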

SHAPIRO-WILK TEST

Although this test is less well-known, it is the one recommended for testing the goodness-of-fit of our data to a normal distribution, especially when the sample is small (n<30).

It measures how well the sample fits a straight line when it is plotted on normal probability paper.

FISHER STUDENT’S t-TEST

Used when two groups are compared with regard to a quantitative variable and the assumptions of the test are met.

Otherwise, an equivalent non-parametric test is used, such as the Mann-Whitney U test.

It is used to compare two means of independent normal populations.

Parametric statistical significance test for contrasting the null hypothesis with regard to the difference between two means.

When the two means have been calculated from two completely independent observation samples (very unlikely situation in practice, at least from a theoretical point of view), the test is described as unpaired.

When the two means have been extracted from consecutive observations of the same subjects in two different situations, the values of each individual are compared, and a paired test is applied.

The Student’s t-test is a type of inferential statistics.

It is used to determine whether there is a significant difference between the means of two groups.

As with all parametric inferential statistics, we assume that the dependent variables have a normal distribution.

We specify the level of probability (alpha level, level of significance, p) which we are willing to accept before data is collected (p < .05 is a common value which is used).

Notes about the Student’s t-test:

A t-test is used when the difference between two population means is being investigated, that is, when we want to compare two means (the scores should be measured on an interval or ratio scale).
We would use a t-test, for example, if we wanted to compare the reading performance of men and women.
With a t-test, we have one independent variable and one dependent variable.
The independent variable (gender in this case) can only have two levels (male and female).
If the independent variable had more than two levels, we would use a one-way analysis of variance (ANOVA).
The test statistic for the Student’s t-test is the t-value. Conceptually, the t-value represents the number of standard error units separating the means of the two groups.
With a t-test, the researcher wants to state with a certain degree of confidence that the difference obtained between the means of the sample groups is too large to be a chance event.
If our t-test produces a t-value with an associated probability of .01, we say that, if the null hypothesis were true, a difference at least as large as the one we found would occur by chance only 1 time out of 100.

Five factors contribute to indicate whether the difference between two means of the groups can be considered significant:

The bigger the difference between the two means, the greater the probability that a statistically significant difference exists.
The amount of overlap which exists between the groups (it is a function of the variation within the groups). The smaller the variation within the two groups, the greater the probability that a statistically significant difference exists.
Sample size is extremely important in determining the significance of the difference between the means. If the sample size is increased, the means tend to be more stable and more representative.
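A higher alpha level requires a smaller difference between the means (for example, p < .05 rather than p < .01).
Whether the hypothesis is directional (one-tailed) or non-directional (two-tailed). A non-directional (two-tailed) hypothesis should be used unless a direction has been justified in advance; it requires a larger difference between the means at the same alpha level.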

Underlying assumptions of the t-test:

The samples have been drawn randomly from their respective populations.
The population should be normally distributed, that is, its distribution should be:
Unimodal (one mode).
Symmetrical (the right and left halves are mirror images), with the same number of people above and below the mean.
Bell-shaped (maximum height (mode) in the middle).
Centred, with the mean, mode and median located in the centre.
Asymptotic (the further the curve goes from the mean, the closer it gets to the X axis, but the curve never touches the X axis).
The populations should have the same variance (s1² = s2²). If this is not the case, another calculation is used for the standard error.

There are 2 types of Student’s t-tests

t-test for paired differences (dependent groups, correlated t-test): df = n - 1, where n is the number of pairs

This refers to the difference between the mean scores of a single sample of individuals measured before the treatment and after the treatment. It can also compare the mean scores of samples of individuals who are paired in a certain way (for example, brothers and sisters, mothers and daughters, people who are matched on specific characteristics).
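
A minimal sketch of the paired calculation is given below. The function name paired_t and the scores are hypothetical; the function returns the t-value and the degrees of freedom, and the comparison with the t distribution is left to tables or software.

```python
from math import sqrt
from statistics import mean, stdev

def paired_t(before, after):
    """t-value and degrees of freedom for paired observations."""
    diffs = [a - b for b, a in zip(before, after)]
    n = len(diffs)  # number of pairs
    # Mean difference divided by the standard error of the differences.
    t = mean(diffs) / (stdev(diffs) / sqrt(n))
    return t, n - 1

# Hypothetical scores of four subjects before and after a treatment.
t, df = paired_t([10, 12, 14, 16], [11, 15, 12, 20])
print(f"t = {t:.3f}, df = {df}")
```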

t-test for independent samples

This refers to the difference between the averages of two populations.
Basically, the procedure compares the averages of two samples which were selected independently from each other.
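An example would be to compare the mathematics scores of an experimental group with those of a control group.
How do I decide which type of t-test to use? Use the paired test when the two sets of scores come from the same or matched individuals, and the independent-samples test when the two samples were selected independently of each other.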

Type-I error:

Rejects a null hypothesis which is really true. The probability of making a Type-I error depends on the alpha level which was chosen.
If the alpha level was fixed at p < .05, then there is a 5% probability of making a Type-I error when the null hypothesis is true.
The probability of making a Type-I error can be reduced by fixing a smaller alpha level (p < .01). The problem with doing this is that it increases the probability of a Type-II error.

Type-II error:

Fails to reject a null hypothesis which is false.
The basic idea for calculating a Student’s t statistic is to find the difference between the means of the two groups and divide it by the standard error (of the difference), that is, the standard deviation of the distribution of the differences.
A confidence interval for a two-tailed t-test is calculated by multiplying the critical value by the standard error and adding this to, and subtracting it from, the difference of the two means.
The effect size is used to assess the practical difference. If there are several thousand patients, it is easy to find a statistically significant difference.
Knowing whether this difference is of practical importance is another question.
With studies involving group differences, the effect size is the difference of the two means divided by the standard deviation of the control group (or the mean standard deviation of both groups if there is no control group).
Generally, effect size is only important if there is a statistical significance.
An effect size of 0.2 is considered small, 0.5 is considered medium and 0.8 is considered large.
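
The t-value and the effect size for two independent groups can be sketched together. The function name independent_t and the data are hypothetical. The sketch assumes equal population variances and, as a simplification, divides by the pooled standard deviation of both groups to obtain the effect size, rather than by the standard deviation of a control group.

```python
from math import sqrt
from statistics import mean, variance

def independent_t(group_a, group_b):
    """Return (t, degrees of freedom, effect size) for two
    independent samples, using the pooled variance."""
    n_a, n_b = len(group_a), len(group_b)
    diff = mean(group_a) - mean(group_b)
    pooled_var = ((n_a - 1) * variance(group_a)
                  + (n_b - 1) * variance(group_b)) / (n_a + n_b - 2)
    # Standard error of the difference between the two means.
    std_error = sqrt(pooled_var * (1 / n_a + 1 / n_b))
    effect_size = diff / sqrt(pooled_var)
    return diff / std_error, n_a + n_b - 2, effect_size

# Hypothetical scores of an experimental group and a control group.
t, df, d = independent_t([2, 4, 6, 8], [1, 3, 5, 7])
print(f"t = {t:.3f}, df = {df}, effect size = {d:.3f}")
```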

MANN-WHITNEY TEST

The Mann-Whitney U test is one of the most well-known significance tests.

It is appropriate when two independent observation samples are measured at an ordinal level, that is, we can say which is the greater of these two observations.

It determines whether the degree of overlap between two observed distributions is lower than that expected by chance under the null hypothesis that the two samples come from the same population.

Non-parametric statistical significance test to test the null hypothesis that the location parameter (generally the median) is the same when two independent groups are compared, regardless of the type of distribution of the variable (normal distribution or another type).

It is used when wanting to compare two populations using independent samples, that is, it is an alternative test to the t-test for comparing two means using independent samples.

The null hypothesis is that the medians of the two populations are equal, and the alternative hypothesis could be that the median of population 1 is greater than (or less than, or different from) the median of population 2.

Mann-Whitney test for independent samples:

If we have two sets of values of a continuous variable obtained in two independent samples: X1, X2,…, Xn, Y1, Y2,…, Ym, we will proceed to put all the values together in ascending order, allocating their rank, correcting equal values with the average rank.
Then we calculate the rank sum for the observations of the first sample Sx, and the rank sum of the second sample Sy.
If the values of the population from which the random X sample was extracted are located below the values of Y, then the X sample will probably have lower ranks, which will be reflected in a lower value of Sx than the theoretically probable one.
If the lower of the two rank sums is excessively low, that is, very unlikely if the null hypothesis were true, the null hypothesis will be rejected.
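
The ranking procedure just described can be sketched as follows. The names average_ranks and mann_whitney_u and the data are hypothetical; the sketch returns the two rank sums and the usual U statistic derived from them, and leaves the p-value to tables or software.

```python
def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    ordered = sorted(values)
    return [ordered.index(v) + (ordered.count(v) + 1) / 2 for v in values]

def mann_whitney_u(xs, ys):
    """Return (Sx, Sy, U) for two independent samples."""
    n, m = len(xs), len(ys)
    # Rank all the values together, in ascending order.
    ranks = average_ranks(list(xs) + list(ys))
    sx, sy = sum(ranks[:n]), sum(ranks[n:])
    # U subtracts the smallest rank sum each sample could possibly have.
    u_x = sx - n * (n + 1) / 2
    u_y = sy - m * (m + 1) / 2
    return sx, sy, min(u_x, u_y)

# Hypothetical samples in which every X value lies below every Y value.
print(mann_whitney_u([1, 2, 3], [4, 5, 6]))
```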

KRUSKAL-WALLIS TEST

Non-parametric statistical significance test of the null hypothesis that the location parameters of two or more groups are equal.

The Kruskal-Wallis test is an alternative to the F-test of the analysis of variance for simple classification designs. In this case, several groups are compared but using the median of each of them, instead of the means.

H0: The medians of the k populations considered are equal, and
Ha: At least one of the populations has a different median from the others.

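The test statistic is H = [12 / (n(n + 1))] Σ (Ri² / ni) - 3(n + 1), where n is the data total, ni is the number of observations in group i and Ri is the sum of the ranks in group i.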
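
A sketch of the usual rank-based statistic for this test, commonly written H, is given below. The names average_ranks and kruskal_wallis_h and the data are hypothetical, and no correction for tied values is applied. H is normally compared with a chi-squared distribution with k - 1 degrees of freedom.

```python
def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    ordered = sorted(values)
    return [ordered.index(v) + (ordered.count(v) + 1) / 2 for v in values]

def kruskal_wallis_h(*groups):
    """H = 12 / (n (n + 1)) * sum(R_i^2 / n_i) - 3 (n + 1), where n is
    the data total and R_i, n_i are the rank sum and size of group i."""
    pooled = [value for group in groups for value in group]
    n = len(pooled)
    ranks = average_ranks(pooled)
    total, start = 0.0, 0
    for group in groups:
        rank_sum = sum(ranks[start:start + len(group)])
        total += rank_sum**2 / len(group)
        start += len(group)
    return 12 / (n * (n + 1)) * total - 3 * (n + 1)

# Hypothetical data for three groups that do not overlap at all.
print(f"H = {kruskal_wallis_h([1, 2, 3], [4, 5, 6], [7, 8, 9]):.2f}")
```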

NON-PARAMETRIC TESTS

The analysis of variance assumes that the underlying distributions are normal and that the variances of the distributions being compared are similar.

Pearson’s correlation coefficient assumes normality.

Although parametric techniques are robust (that is, they often retain considerable power for detecting differences or similarities even when these assumptions are violated), some distributions violate the assumptions so badly that a non-parametric alternative is preferable for detecting a difference or a similarity.

Non-parametric tests for related samples

Test	Num. of variables	Variables	Objective
McNemar	2	Qualitative: 2 values	To determine whether the difference between the frequency distributions of the values of the two variables is statistically significant.
Sign test	2	At least in ordinal scale	To determine whether the difference between the number of times the value of one variable is greater than that of the other and the number of times it is less is statistically significant.
Wilcoxon	2	At least in ordinal scale	To determine whether the difference between the magnitude of the positive differences between the values of the two variables and the magnitude of the negative differences is statistically significant.
Cochran’s Q	p > 2	Qualitative: 2 values	To determine whether the differences between the frequency distributions of the values of the p variables are statistically significant.
Friedman	p > 2	At least in ordinal scale	To determine whether the differences between the distributions of the p variables are statistically significant.

CHOOSING THE APPROPRIATE STATISTICAL TECHNIQUE

With the elements defined in the earlier paragraphs, decision trees can be established to help choose the appropriate statistical test or technique.

There are more than 300 basic statistical tests, making it difficult to cover all of them exhaustively in this article.

Criterion	Description	Explanations
1	Descriptive statistics	No statistical content or only descriptive statistics
2	Student’s t-tests, z-tests	For one sample or two samples (paired and/or independent)
3	Bivariate tables	Chi-squared, Fisher’s exact test, McNemar Test
4	Non-parametric tests	Sign test, Mann-Whitney U test, Wilcoxon test
5	Demo-epidemiological statistics	Relative risk. Odds ratio. Log odds. Measures of association, sensitivity and specificity
6	Pearson’s linear correlation	Classic correlation (linear correlation coefficient r)
7	Pearson’s linear correlation	Classic correlation (linear correlation coefficient r)
8	Simple regression	Least-squares regression with one predictor variable and one response variable
9	Analysis of variance	ANOVA, analysis of covariance, F-tests
10	Transformation of variables	Use of transformations (logarithmic…)
11	Non-parametric correlation	Spearman’s Rho, Kendall’s Tau, trend tests
12	Multiple regression	Includes polynomial regression and stepwise regression
13	Multiple comparisons	Multiple comparisons
14	Adjustment and standardisation	Standardisation of incidence and prevalence rates
15	Multivariate tables	Mantel-Haenszel procedures, log-linear models
16	Sample size and power	Determination of the sample size on the basis of a detectable difference
17	Survival analysis	Includes life tables, survival regression and other survival analyses
18	Cost-benefit analysis	Estimation of the health costs for comparing alternative guidelines (cost-effectiveness)
19	Other analyses	Tests not included in the preceding categories: sensitivity analysis, cluster analysis, discriminant analysis.

Protocol designed by EMERSON and COLDITZ and adapted by MORA, RIPPOLL et al. Reference levels for the analysis of accessibility.

THE FOLLOWING STEPS
Once the statistical analysis has been completed, the following actions should be carried out:

Qualitative or quantitative analysis.
Summary and final interpretation of all the data already analyzed.
Writing up of the research report.