STAT 1103 Week 8 Notes: Categorical Data

Summary

Difficulty: ★★★☆☆

Covers: Categorical data, categorical independent and dependent variables, chi-square goodness-of-fit, chi-square test of independence, expected frequencies and assumptions, effect sizes (Cohen’s W, Cramer’s V), interpreting and reporting results, Stata commands for categorical analyses

What does categorical data mean?
  • A categorical variable puts people or things into groups
    • Examples: yes/no, pass/fail, psychology/neuroscience, junior/senior, Australia/USA
  • This week’s focus is different from earlier weeks
    • Earlier: one categorical variable (goodness-of-fit), or categorical IV with numeric DV (t-tests), or numeric-numeric (correlation)
    • This week: categorical IV and categorical DV together
  • The main question this week answers
    • Is there an association between two categorical variables?
Categorical vs numeric measurement
  • Numeric measurement gives a score on a scale
    • Example: “How much affection do you show?” 0–10
  • Categorical measurement puts you into a category
    • Example: “Are you affectionate?” yes/no
  • If you have a choice when designing a study, numeric is usually better
    • Numeric captures more information
    • Numeric can be turned into categories later if you want
    • Categorical cannot be turned into a true numeric score
    • Numeric outcomes usually give more statistical power
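To make the "numeric can become categorical, but not the reverse" point concrete, here is a minimal Python sketch. The scores and the cut-off of 6 are hypothetical values chosen only for illustration; they do not come from the course material.

```python
# Hypothetical affection scores on the 0-10 scale from the example above.
scores = [2, 6, 10, 5]

# Assumed cut-off for illustration only: 6 or above counts as "yes".
CUTOFF = 6

# Turn each numeric score into a yes/no category.
categories = ["yes" if s >= CUTOFF else "no" for s in scores]
print(categories)  # ['no', 'yes', 'yes', 'no']
```

The scores 6 and 10 both end up as "yes", so the difference between them is lost, and nothing in the yes/no column lets you recover the original numbers.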
Categorical dependent variables (DV)
  • Numeric DV questions predict a score
    • Example: higher performance, more symptoms, greater engagement
  • Categorical DV questions predict group membership
    • Example: likelihood of passing vs failing, disease vs no disease, convicted vs not convicted
  • Categorical DVs can have more than two categories
    • Example: first preference vs other preference vs not listed
Recognising categorical variable research questions
  • In these questions, you can usually spot the categories in the wording
    • “Are teenagers less likely to pass than older drivers?”
    • “Are people with family history more likely to develop depression?”
  • Typical structure
    • IV: categorical group (teen vs older, attended vs not, history vs no history)
    • DV: categorical outcome (pass vs fail, depression vs no depression, multiple categories)
The key statistical analysis for categorical data
  • Main test: chi-square test of independence
  • Same test is used whether the study is experimental or non-experimental
  • The difference is how you interpret the result
    • Experimental IV (random assignment/manipulation) supports causal language
    • Non-experimental IV (naturally occurring groups) does not support causal language
How chi-square test of independence works
  • You organise two categorical variables into a contingency table (a cross-tab)
    • Rows = groups/levels of IV
    • Columns = categories of DV
  • The test compares
    • Observed counts (what you actually got)
    • Expected counts (what you would expect if there were no association)
  • Assumption you must check
    • All expected counts should be at least 5
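    • If any expected count is below 5, the chi-square approximation becomes unreliable, so you should not run this test in the usual way (common alternatives are merging sparse categories or using Fisher’s exact test)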
  • Degrees of freedom
    • df = (rows − 1) × (columns − 1)
  • The conclusion answers
    • Is there evidence of an association between the two variables?
    • Then describe the pattern using percentages or comparing observed vs expected
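The steps above (build the cross-tab, then read off row percentages and degrees of freedom) can be sketched in Python. This is an illustration only: the driving-test data are hypothetical, and Python is used here simply because it runs without Stata.

```python
from collections import Counter

# Hypothetical raw data: one (age group, test result) pair per person.
records = (
    [("teen", "pass")] * 30 + [("teen", "fail")] * 20
    + [("older", "pass")] * 40 + [("older", "fail")] * 10
)

rows = ["teen", "older"]   # levels of the IV (table rows)
cols = ["pass", "fail"]    # categories of the DV (table columns)

# Contingency table: count how many people fall in each row/column cell.
counts = Counter(records)
table = [[counts[(r, c)] for c in cols] for r in rows]

# Row percentages: within each IV group, the share in each DV category.
row_pct = [[100 * cell / sum(row) for cell in row] for row in table]

# df = (rows - 1) x (columns - 1)
df = (len(rows) - 1) * (len(cols) - 1)

for name, row, pct in zip(rows, table, row_pct):
    print(name, row, [f"{p:.1f}%" for p in pct])
print("df =", df)
```

In this made-up sample, 60% of teens and 80% of older drivers pass. Row percentages like these are what you would use to describe the direction of any association the test detects.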
Expected counts (the rule you use)
  • Expected count for a cell =
    • (Row total / Grand total) × Column total
  • Practical tip
    • If calculating by hand, keep many decimals to avoid rounding errors
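A minimal Python sketch of the expected-count rule, the "at least 5" check, and how the expected counts feed the test statistic. The table is hypothetical, and the statistic uses the standard Pearson formula (sum of (observed − expected)² / expected), which these notes do not state explicitly.

```python
# Hypothetical observed counts (rows = IV groups, columns = DV categories).
observed = [[30, 20],   # teen:  pass, fail
            [40, 10]]   # older: pass, fail

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected count for a cell = (row total / grand total) x column total
expected = [[(rt / grand_total) * ct for ct in col_totals]
            for rt in row_totals]

# Assumption check: every expected count should be at least 5.
assumption_ok = all(e >= 5 for row in expected for e in row)

# Pearson chi-square: sum over all cells of (observed - expected)^2 / expected
chi2 = sum((o - e) ** 2 / e
           for o_row, e_row in zip(observed, expected)
           for o, e in zip(o_row, e_row))

print(expected)        # [[35.0, 15.0], [35.0, 15.0]]
print(assumption_ok)   # True
print(round(chi2, 2))  # 4.76
```

The code keeps full floating-point precision until the final print, which is the same idea as the practical tip about keeping many decimals when working by hand.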

Effect sizes for chi-square tests (how big is the association?)

Effect size (W / V)    Interpretation
< 0.1                  Negligible
0.1–0.3                Small
0.3–0.5                Moderate
≥ 0.5                  Large

Important note!

A result can be statistically significant but have a negligible/small effect size
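The sketch below illustrates this note with a hypothetical large sample. It assumes the standard definitions W = sqrt(χ² / N) and V = sqrt(χ² / (N × min(rows − 1, columns − 1))), which are not spelled out in these notes, and it treats the lower bound of each band in the table above as inclusive (also an assumption). The p-value shortcut applies to df = 1 only.

```python
import math

def chi_square(observed):
    """Pearson chi-square statistic for a table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    return sum(
        (o - rt * ct / n) ** 2 / (rt * ct / n)
        for row, rt in zip(observed, row_totals)
        for o, ct in zip(row, col_totals)
    )

def cohens_w(chi2, n):
    """Cohen's W = sqrt(chi2 / N)."""
    return math.sqrt(chi2 / n)

def cramers_v(chi2, n, n_rows, n_cols):
    """Cramer's V = sqrt(chi2 / (N * min(rows - 1, cols - 1)))."""
    return math.sqrt(chi2 / (n * min(n_rows - 1, n_cols - 1)))

def size_label(effect):
    """Label an effect size using the bands from the table above."""
    if effect < 0.1:
        return "negligible"
    if effect < 0.3:
        return "small"
    if effect < 0.5:
        return "moderate"
    return "large"

# Hypothetical 2x2 table with a very large sample and a tiny difference.
observed = [[5100, 4900],
            [4900, 5100]]
n = sum(sum(row) for row in observed)

chi2 = chi_square(observed)
p = math.erfc(math.sqrt(chi2 / 2))  # upper-tail p-value, valid for df = 1 only
w = cohens_w(chi2, n)
v = cramers_v(chi2, n, 2, 2)

print(round(chi2, 2), p < 0.05, round(w, 3), size_label(w))
```

With 20,000 hypothetical cases, a 51% vs 49% split is statistically significant at the .05 level, yet W is about 0.02, which falls in the negligible band. For a 2×2 table, V computed this way equals W.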

What to report for chi-square test of independence (APA-style)
  • State the association result and the test
    • χ²(df, N = total) = value, p = value
  • Then describe the pattern clearly
    • Percentages or proportions in each group
  • Add effect size when possible
    • W or Cramer’s V and its size label (small/moderate/etc.)
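As a rough illustration of the reporting template, the Python sketch below assembles the χ²(df, N = total) = value, p = value string. The helper names are hypothetical, the input numbers come from a made-up table, and the p-value shortcut applies to df = 1 only. Dropping the leading zero from p follows common APA practice and is an assumption about what your course expects.

```python
import math

def p_value_df1(chi2):
    """Upper-tail p-value of a chi-square statistic, valid for df = 1 only."""
    return math.erfc(math.sqrt(chi2 / 2))

def apa_chi_square(chi2, df, n, p):
    """Format a chi-square result in the style shown above."""
    if p < 0.001:
        p_text = "p < .001"
    else:
        # Drop the leading zero, e.g. 0.029 becomes .029
        p_text = "p = " + f"{p:.3f}".lstrip("0")
    return f"χ²({df}, N = {n}) = {chi2:.2f}, {p_text}"

# Hypothetical result: chi-square of 4.7619 from a 2x2 table with 100 people.
chi2 = 4.7619
report = apa_chi_square(chi2, df=1, n=100, p=p_value_df1(chi2))
print(report)  # χ²(1, N = 100) = 4.76, p = .029
```

The formatted string covers only the test itself. A full write-up would still add the percentages in each group and the effect size with its label.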
Relevant Stata Commands
  • Chi-square goodness-of-fit (one categorical variable)
    • Column of data:
      • csgof varname, expperc(pct1, pct2, ...)
    • Summary counts typed directly:
      • chitesti Obs_1 Obs_2 ... Obs_k \ exp_1 exp_2 ... exp_k
  • Chi-square test of independence (two categorical variables)
    • Raw data in two columns:
      • tabulate var1 var2, exp row chi2 V
      • exp shows expected counts
      • row shows row percentages (often easiest for interpretation)
      • chi2 runs the chi-square test
      • V reports Cramer’s V (for a 2×2 table its absolute value equals Cohen’s W)
    • Summary table typed directly (2×2 layout):
      • tabi Obs_1 Obs_2 \ Obs_3 Obs_4, exp row chi2 V
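For readers who want to see the arithmetic behind a goodness-of-fit test, here is a minimal Python sketch of the by-hand calculation. It is not a reproduction of what csgof or chitesti do internally. The observed counts and expected percentages are hypothetical, and the p-value shortcut applies to df = 2 only.

```python
import math

# Hypothetical observed counts for one categorical variable (3 categories).
observed = [45, 30, 25]

# Hypothetical expected percentages under the null hypothesis (sum to 100).
expected_pct = [50, 30, 20]

n = sum(observed)

# Expected count in each category = N x expected percentage / 100
expected = [n * pct / 100 for pct in expected_pct]

# Pearson chi-square: sum of (observed - expected)^2 / expected
chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Goodness-of-fit degrees of freedom = number of categories - 1
df = len(observed) - 1

# For df = 2 only, the upper-tail probability is exp(-chi2 / 2).
p = math.exp(-chi2 / 2)

print(expected)     # [50.0, 30.0, 20.0]
print(chi2, df)     # 1.75 2
print(round(p, 3))  # 0.417
```

Here the observed counts sit close to the expected ones, so the statistic is small and there is no evidence that the sample departs from the expected proportions.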
How to choose the right categorical test
  • One categorical variable, comparing observed counts to expected proportions
    • Use chi-square goodness-of-fit
  • Two categorical variables, testing whether they are associated
    • Use chi-square test of independence
  • Experimental vs non-experimental does not change the test
    • It changes how strong your conclusion can be (causal vs non-causal language)
So Overall…
  • Categorical outcomes are about chance/likelihood of being in a category, not a score
  • The main tool for two categorical variables is the chi-square test of independence
  • Always check expected counts are at least 5
  • Use percentages to explain the direction of the association
  • Always look at effect size as well as p-value
  • Your conclusion wording depends on whether the IV was experimentally manipulated or naturally occurring
