Updated 2026-07-11

Hypothesis Test

A hypothesis test is the central decision-making framework of frequentist inference, providing a formal procedure to evaluate whether sample data provide sufficient evidence against a prespecified claim about a population parameter. Every hypothesis test begins with two mutually exclusive statements: the null hypothesis $H_0$, which embodies the status quo or a position of no effect, and the alternative hypothesis $H_1$, which represents the research claim the investigator seeks to substantiate. The test then quantifies how incompatible the observed data are with $H_0$ and, if that incompatibility crosses a prespecified threshold, recommends rejecting $H_0$.

Core Components

A hypothesis test is defined by four essential elements. The test statistic $T = T(X_1, \ldots, X_n)$ is a function of the sample whose sampling distribution under $H_0$ is fully known (exactly or asymptotically). Common choices include the Z-statistic, t-statistic, F-statistic, and likelihood-ratio statistic. The significance level $\alpha$ is the maximum probability of committing a Type I error (rejecting $H_0$ when it is true) that the researcher is willing to tolerate; conventional values are 0.05, 0.01, and 0.10. The critical region (or rejection region) is the set of values of $T$ for which $H_0$ is rejected. Finally, the p-value is the probability, under $H_0$, of observing a test statistic at least as extreme as the one actually obtained. A small p-value signals that either $H_0$ is false or an event of low probability has occurred.
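The four elements can be seen together in a minimal sketch of a two-sided Z-test, written with the Python standard library only. It assumes a normal population with known standard deviation `sigma`; the function name `z_test` and the sample values are hypothetical and chosen purely for illustration.

```python
from math import sqrt
from statistics import NormalDist, mean


def z_test(sample, mu0, sigma, alpha=0.05):
    """Two-sided Z-test of H0: mu = mu0, assuming a known population sd sigma."""
    n = len(sample)
    std_normal = NormalDist()  # sampling distribution of Z under H0
    # Test statistic: standardized distance of the sample mean from mu0.
    z = (mean(sample) - mu0) / (sigma / sqrt(n))
    # Critical region: reject H0 when |Z| exceeds this quantile.
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    # p-value: probability under H0 of a statistic at least this extreme.
    p_value = 2 * (1 - std_normal.cdf(abs(z)))
    return {"z": z, "z_crit": z_crit, "p_value": p_value, "reject": abs(z) > z_crit}


# Hypothetical data, made up for illustration only.
sample = [5.1, 4.9, 5.6, 5.8, 5.3, 5.7, 5.2, 5.5, 5.4, 5.9]
result = z_test(sample, mu0=5.0, sigma=0.5)
print(result)
```

The critical-region rule and the p-value rule give the same decision here: the statistic falls in the rejection region exactly when the p-value is below `alpha`.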

Fisherian and Neyman-Pearson Frameworks

Modern hypothesis testing synthesizes two historically distinct traditions. Fisher's significance testing treats the p-value as a continuous measure of evidence against $H_0$: the smaller the p-value, the stronger the evidence. No fixed $\alpha$ is required, and no alternative hypothesis is formally specified. In contrast, the Neyman-Pearson framework treats hypothesis testing as a decision problem between $H_0$ and a specific $H_1$, with explicit control over both the Type I error rate ($\alpha$) and the Type II error rate ($\beta$). The power of a test, defined as $1 - \beta$, is the probability of correctly rejecting $H_0$ when $H_1$ is true. The Neyman-Pearson lemma shows that, for testing a simple null against a simple alternative, the likelihood-ratio test is the most powerful test of a given size $\alpha$, a result that anchors much of parametric testing theory.
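The relationship between the two error rates can be sketched numerically. The example below computes the power of a one-sided Z-test against a single alternative mean, assuming a normal population with known standard deviation; the function name `z_test_power` and the parameter values are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def z_test_power(mu0, mu1, sigma, n, alpha=0.05):
    """Power of the one-sided Z-test of H0: mu = mu0 against H1: mu = mu1 > mu0."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha)    # reject H0 when Z > z_crit
    shift = (mu1 - mu0) / (sigma / sqrt(n))   # mean of Z when H1 is true
    beta = std_normal.cdf(z_crit - shift)     # Type II error rate
    return 1 - beta


# Power grows with the sample size for a fixed (hypothetical) effect of 0.3 sd.
for n in (10, 25, 50, 100):
    print(n, round(z_test_power(mu0=0.0, mu1=0.3, sigma=1.0, n=n), 3))
```

When `mu1` equals `mu0` the function returns `alpha`, which reflects the fact that the rejection probability under the null is the Type I error rate.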

Contemporary applied work blends the two: researchers report p-values to convey strength of evidence while using a prespecified $\alpha$ for binary decisions, and supplement conclusions with confidence intervals to convey the magnitude and precision of the estimated effect.

Common Tests and Their Uses

The one-sample t-test assesses whether a population mean equals a hypothesized value $\mu_0$ when the population variance is unknown. The two-sample t-test compares means of two independent groups. The F-test evaluates joint linear restrictions on multiple parameters, forming the workhorse of ANOVA and regression diagnostics. The chi-squared test handles categorical data, testing independence in contingency tables and goodness-of-fit to a theoretical distribution. In econometrics, the Chow test detects structural breaks, while the Hausman test guides the choice between fixed-effects and random-effects panel specifications. The Wald test, likelihood-ratio test, and score test (LM test) constitute the asymptotic trinity for testing parametric restrictions in maximum-likelihood settings.
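As one concrete case from this list, the following is a rough sketch of Pearson's chi-squared test of independence, restricted to a 2x2 table so that the p-value can be computed with the standard library alone (with one degree of freedom, the chi-squared variable is the square of a standard normal). No continuity correction is applied. The function name and the counts are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def chi2_independence_2x2(table):
    """Pearson chi-squared test of independence for a 2x2 table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i in range(2):
        for j in range(2):
            # Count expected in cell (i, j) if rows and columns were independent.
            expected = row_totals[i] * col_totals[j] / total
            stat += (table[i][j] - expected) ** 2 / expected
    # With 1 degree of freedom: P(chi2 >= stat) = 2 * (1 - Phi(sqrt(stat))).
    p_value = 2 * (1 - NormalDist().cdf(sqrt(stat)))
    return stat, p_value


# Hypothetical counts: rows = treated / control, columns = employed / not employed.
table = [[30, 20], [18, 32]]
stat, p_value = chi2_independence_2x2(table)
print(f"chi2 = {stat:.3f}, p = {p_value:.4f}")
```

Larger tables need the general chi-squared distribution with (rows - 1)(columns - 1) degrees of freedom, which a statistics library would normally supply.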

Caveats and Best Practices

Statistical significance is not equivalent to economic or practical significance: with sufficiently large $n$, even trivially small effects become "significant." The longstanding overreliance on the $p < 0.05$ bright line has drawn sustained criticism, spurring calls to report exact p-values, effect sizes, and confidence intervals alongside test results. Multiple comparisons inflate the family-wise error rate: testing $k$ independent null hypotheses, all of them true, each at level $\alpha$ yields a probability of at least one false rejection of $1 - (1-\alpha)^k$, motivating Bonferroni corrections and false-discovery-rate control. Finally, hypothesis tests are valid only when their distributional assumptions (such as normality, independence, and homoskedasticity) are approximately met; robust standard errors, bootstrap methods, and nonparametric tests provide fallbacks when these assumptions fail.
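The multiple-comparisons arithmetic can be illustrated with a short sketch. It evaluates the family-wise error rate formula for independent tests, applies a Bonferroni correction, and applies the Benjamini-Hochberg step-up procedure as one common approach to false-discovery-rate control (its guarantee is usually stated for independent or positively dependent tests). The function names and p-values are hypothetical.

```python
def familywise_error_rate(alpha, k):
    """P(at least one false rejection) across k independent tests, all nulls true."""
    return 1 - (1 - alpha) ** k


def bonferroni_reject(p_values, alpha=0.05):
    """Reject H0_i when p_i <= alpha / k, keeping the family-wise rate at most alpha."""
    k = len(p_values)
    return [p <= alpha / k for p in p_values]


def benjamini_hochberg_reject(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up procedure for false-discovery-rate control."""
    k = len(p_values)
    order = sorted(range(k), key=lambda i: p_values[i])
    # Find the largest rank r (1-based) with p_(r) <= r * alpha / k.
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank * alpha / k:
            cutoff = rank
    # Reject the hypotheses with the cutoff smallest p-values.
    reject = [False] * k
    for i in order[:cutoff]:
        reject[i] = True
    return reject


print(round(familywise_error_rate(0.05, 20), 3))
p_values = [0.001, 0.012, 0.021, 0.035, 0.30]  # hypothetical p-values
print(bonferroni_reject(p_values))
print(benjamini_hochberg_reject(p_values))
```

On these inputs Bonferroni rejects fewer hypotheses than Benjamini-Hochberg, which is the usual trade-off: controlling the family-wise error rate is a stricter requirement than controlling the false discovery rate.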

Hypothesis testing remains the lingua franca of empirical economics — from program evaluation and policy analysis to finance and labor economics — not because it is flawless, but because it provides a transparent, replicable protocol for extracting signal from noisy data.