Hypothesis Test (假设检验)
A hypothesis test is the central decision-making framework of frequentist inference, providing a formal procedure to evaluate whether sample data provide sufficient evidence against a prespecified claim about a population parameter. Every hypothesis test begins with two mutually exclusive statements: the null hypothesis $H_0$, which embodies the status quo or a position of no effect, and the alternative hypothesis $H_1$, which represents the research claim the investigator seeks to substantiate. The test then quantifies how incompatible the observed data are with $H_0$ and, if that incompatibility crosses a prespecified threshold, recommends rejecting $H_0$.
Core Components
A hypothesis test is defined by four essential elements. The test statistic $T$ is a function of the sample whose sampling distribution under $H_0$ is fully known (exactly or asymptotically). Common choices include the Z-statistic, t-statistic, F-statistic, and likelihood-ratio statistic. The significance level $\alpha$ is the maximum probability of committing a Type I error (rejecting $H_0$ when it is true) that the researcher is willing to tolerate; conventional values are 0.05, 0.01, and 0.10. The critical region (or rejection region) is the set of values of $T$ for which $H_0$ is rejected. Finally, the p-value is the probability, under $H_0$, of observing a test statistic at least as extreme as the one actually obtained. A small p-value signals that either $H_0$ is false or an event of low probability has occurred.
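The four elements can be made concrete with a minimal sketch of a two-sided one-sample z-test. The function name `z_test`, the sample values, and the assumption that the population standard deviation `sigma` is known are illustrative choices, not part of the article; only the Python standard library is used.

```python
from math import sqrt
from statistics import NormalDist, mean


def z_test(sample, mu0, sigma, alpha=0.05):
    """Two-sided z-test of H0: mu = mu0, assuming sigma is known (illustrative)."""
    n = len(sample)
    std_normal = NormalDist()  # sampling distribution of the statistic under H0

    # Test statistic: standardized distance of the sample mean from mu0.
    z = (mean(sample) - mu0) / (sigma / sqrt(n))

    # Critical region: |z| larger than the (1 - alpha/2) standard normal quantile.
    z_crit = std_normal.inv_cdf(1 - alpha / 2)

    # p-value: probability under H0 of a statistic at least as extreme as |z|.
    p_value = 2 * (1 - std_normal.cdf(abs(z)))

    return {"z": z, "z_crit": z_crit, "p_value": p_value, "reject": abs(z) > z_crit}


# Made-up measurements, used only to exercise the function.
sample = [5.1, 4.9, 5.6, 5.8, 5.3, 5.7, 5.2, 5.5, 5.4]
result = z_test(sample, mu0=5.0, sigma=0.5)
print(result)
```

With these made-up numbers the statistic is about 2.33, which exceeds the critical value of about 1.96, so the null hypothesis is rejected at the 0.05 level; the p-value (roughly 0.02) leads to the same decision, since rejecting when the statistic falls in the critical region is equivalent to rejecting when the p-value is at most the significance level.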
Fisherian and Neyman-Pearson Frameworks
Modern hypothesis testing synthesizes two historically distinct traditions. Fisher's significance testing treats the p-value as a continuous measure of evidence against $H_0$: the smaller the p-value, the stronger the evidence. No fixed $\alpha$ is required, and no alternative hypothesis is formally specified. In contrast, the Neyman-Pearson framework treats hypothesis testing as a decision problem between $H_0$ and a specific $H_1$, with explicit control over both the Type I error rate ($\alpha$) and the Type II error rate ($\beta$). The power of a test, defined as $1 - \beta$, is the probability of correctly rejecting $H_0$ when $H_1$ is true. The Neyman-Pearson lemma shows that, for a simple null tested against a simple alternative, the likelihood-ratio test is the most powerful test of its size, a result that anchors much of parametric testing theory.
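The relationship between the two error rates and power can be illustrated with a small sketch for a one-sided z-test against a single, specific alternative. The function name `z_test_power` and the numerical inputs are hypothetical, and a known population standard deviation is assumed.

```python
from math import sqrt
from statistics import NormalDist


def z_test_power(effect, sigma, n, alpha=0.05):
    """Power of a one-sided z-test of H0: mu = mu0 against H1: mu = mu0 + effect.

    Assumes effect >= 0 and a known population standard deviation sigma.
    """
    std_normal = NormalDist()
    # Reject H0 when the z-statistic exceeds the (1 - alpha) quantile.
    z_crit = std_normal.inv_cdf(1 - alpha)
    # Under H1 the z-statistic is normal with mean effect * sqrt(n) / sigma.
    shift = effect * sqrt(n) / sigma
    # Type II error rate: probability of staying below the critical value under H1.
    beta = std_normal.cdf(z_crit - shift)
    return 1 - beta


power_small = z_test_power(effect=0.5, sigma=1.0, n=10)
power_large = z_test_power(effect=0.5, sigma=1.0, n=50)
print(round(power_small, 3), round(power_large, 3))
```

Under these assumptions, power is a little under one half with 10 observations and above 0.95 with 50, while the Type I error rate stays fixed at 0.05. When the effect is zero the function returns the significance level itself, which is a convenient sanity check.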
Contemporary applied work blends the two: researchers report p-values to convey strength of evidence while using a prespecified $\alpha$ for binary decisions, and supplement conclusions with confidence intervals to convey practical significance.
Common Tests and Their Uses
The one-sample t-test assesses whether a population mean equals a hypothesized value when the population variance is unknown. The two-sample t-test compares means of two independent groups. The F-test evaluates joint linear restrictions on multiple parameters, forming the workhorse of ANOVA and regression diagnostics. The chi-squared test handles categorical data — testing independence in contingency tables and goodness-of-fit to a theoretical distribution. In econometrics, the Chow test detects structural breaks, while the Hausman test guides the choice between fixed-effects and random-effects panel specifications. The Wald test, likelihood-ratio test, and score test (LM test) constitute the asymptotic trinity for testing parametric restrictions in maximum-likelihood settings.
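As one concrete instance from this list, the following sketch computes Pearson's chi-squared test of independence for a 2x2 contingency table. The function name and the table counts are made up for illustration. The p-value relies on the fact that a chi-squared variable with one degree of freedom is the square of a standard normal, so its upper tail equals `erfc(sqrt(x / 2))`; no continuity correction is applied.

```python
from math import erfc, sqrt


def chi2_independence_2x2(table):
    """Pearson chi-squared test of independence for a 2x2 table of counts."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = [a + b, c + d]
    col_totals = [a + c, b + d]

    stat = 0.0
    for i, observed_row in enumerate(table):
        for j, observed in enumerate(observed_row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected

    # Upper-tail probability of a chi-squared distribution with 1 degree of freedom.
    p_value = erfc(sqrt(stat / 2))
    return stat, p_value


# Hypothetical counts: rows are groups, columns are outcomes.
stat, p_value = chi2_independence_2x2([[30, 20], [20, 30]])
print(stat, p_value)
```

For this table every expected count is 25, so the statistic is exactly 4.0 and the p-value is about 0.046. Larger tables have more degrees of freedom and would need a general chi-squared distribution function, which this sketch does not provide.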
Caveats and Best Practices
Statistical significance is not equivalent to economic or practical significance: with sufficiently large $n$, even trivially small effects become "significant." The longstanding overreliance on the $p < 0.05$ bright line has drawn sustained criticism, spurring calls to report exact p-values, effect sizes, and confidence intervals alongside test results. Multiple comparisons inflate the family-wise error rate: when all nulls are true, testing $m$ independent null hypotheses each at level $\alpha$ yields an overall Type I error rate of $1 - (1 - \alpha)^m$, motivating Bonferroni corrections and false-discovery-rate control. Finally, hypothesis tests are valid only when their distributional assumptions (normality, independence, homoskedasticity) are approximately met; robust standard errors, bootstrap methods, and nonparametric tests provide fallbacks when these assumptions fail.
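The inflation from multiple comparisons is easy to check numerically. The sketch below evaluates the family-wise error rate for independent tests and applies a Bonferroni correction; the function names and the p-values are hypothetical, and false-discovery-rate procedures are not shown.

```python
def family_wise_error_rate(alpha, m):
    """Probability of at least one false rejection among m independent true nulls."""
    return 1 - (1 - alpha) ** m


def bonferroni_reject(p_values, alpha=0.05):
    """Reject each null only if its p-value is at most alpha divided by the number of tests."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


# Twenty independent tests at the 0.05 level, all nulls true.
fwer_uncorrected = family_wise_error_rate(0.05, 20)
# The same twenty tests, each run at the Bonferroni level 0.05 / 20.
fwer_bonferroni = family_wise_error_rate(0.05 / 20, 20)
print(round(fwer_uncorrected, 3), round(fwer_bonferroni, 3))

# Made-up p-values from four tests; the Bonferroni threshold is 0.05 / 4 = 0.0125.
decisions = bonferroni_reject([0.001, 0.012, 0.04, 0.3])
print(decisions)
```

With twenty tests the uncorrected family-wise error rate is about 0.64, while the Bonferroni-adjusted level brings it back just under 0.05. In the four-test example only the two smallest p-values survive the correction, although 0.04 would have been declared significant on its own.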
Hypothesis testing remains the lingua franca of empirical economics — from program evaluation and policy analysis to finance and labor economics — not because it is flawless, but because it provides a transparent, replicable protocol for extracting signal from noisy data.