{
"cells": [
{
"cell_type": "markdown",
"id": "29a86434-d56f-4f14-8c3c-6e65c1e3d5ad",
"metadata": {},
"source": [
"# Two Sample Tests of Proportion - Python"
]
},
{
"cell_type": "markdown",
"id": "34ce63d9-a78a-4dd2-a94d-e5ac13210987",
"metadata": {},
"source": [
"## Fisher's Exact Test\n",
"\n",
"* **Samples:** `2`\n",
"* **Response Categories:** `2`\n",
"* **Exact?:** Yes, use with `N≤200`\n",
"* **Reporting:** \"Table 1 shows the counts of the ‘x’ and ‘y’ outcomes for each of ‘a’ and ‘b’. Fisher’s exact test indicated a statistically significant association between X and Y (p < .001).\""
]
},
{
"cell_type": "markdown",
"id": "0552b66c-40a0-4cde-a1b0-50b9eefcf5fc",
"metadata": {},
"source": [
"```{warning}\n",
"There isn't a readily available, non-GPL-licensed implementation of Fisher's Exact Test for Python that handles more than 2 response categories the way R's does. When working in Python with more than 2 response categories, try a [G-Test](#g-test) or [Two-Sample Pearson Chi-Squared Test](#two-sample-pearson-chi-squared-test) instead.\n",
"\n",
"Related GitHub Issue: https://github.com/scipy/scipy/issues/7099\n",
"\n",
"The following example therefore covers only 2 samples and 2 response categories.\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": 17,
"id": "be7eb19b-5317-4802-8172-2309546d27e9",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" S X Y\n",
"0 1 a y\n",
"1 2 b x\n",
"2 3 a x\n",
"3 4 b y\n",
"4 5 a y\n",
"5 6 b x\n",
"6 7 a y\n",
"7 8 b x\n",
"8 9 a y\n",
"10 11 a y\n",
"12 13 a y\n",
"13 14 b y\n",
"14 15 a y\n",
"15 16 b y\n",
"16 17 a y\n",
"17 18 b x\n",
"18 19 a y\n",
"20 21 a y\n",
"21 22 b y\n",
"23 24 b x"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import pandas as pd\n",
"\n",
"# Example data\n",
"# df is a long-format data table w/subject (S), categorical factor (X) and outcome (Y)\n",
"df = pd.read_csv(\"data/1F2LBs_multinomial.csv\")\n",
"\n",
"# Keep only the 'x' and 'y' outcomes so the example has exactly 2 response categories\n",
"df = df.loc[df[\"Y\"].isin([\"x\", \"y\"])]\n",
"df.head(20)"
]
},
{
"cell_type": "code",
"execution_count": 19,
"id": "ca79e707-b1ef-4a79-a999-4d4d346b3223",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Y x y\n",
"X \n",
"a 3 26\n",
"b 14 9"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"xt = pd.crosstab(df[\"X\"], df[\"Y\"])\n",
"xt"
]
},
{
"cell_type": "code",
"execution_count": 21,
"id": "e727c563-613b-4dd7-9b05-5926fe2c3f02",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.00021895317390182282"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from scipy.stats import fisher_exact\n",
"\n",
"# fisher_exact returns the odds ratio and the p-value; only the p-value is needed here\n",
"_, p = fisher_exact(xt)\n",
"p"
]
},
{
"cell_type": "markdown",
"id": "8674f910-0a20-4102-a194-b4bdba58555f",
"metadata": {},
"source": [
"## G-Test\n",
"\n",
"* **Samples:** `2`\n",
"* **Response Categories:** `≥2`\n",
"* **Exact?:** No, use with `N>200`\n",
"* **Reporting:** \"Table 1 shows the counts of the ‘x’, ‘y’, and ‘z’ outcomes for each of ‘a’ and ‘b’. A G-test indicated a statistically significant association between X and Y (G(2) = 21.40, p < .0001).\""
]
},
{
"cell_type": "code",
"execution_count": 22,
"id": "7b65ff5f-fe61-4b8d-9438-d0434b95f1b2",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" S X Y\n",
"0 1 a y\n",
"1 2 b x\n",
"2 3 a x\n",
"3 4 b y\n",
"4 5 a y\n",
"5 6 b x\n",
"6 7 a y\n",
"7 8 b x\n",
"8 9 a y\n",
"9 10 b z\n",
"10 11 a y\n",
"11 12 b z\n",
"12 13 a y\n",
"13 14 b y\n",
"14 15 a y\n",
"15 16 b y\n",
"16 17 a y\n",
"17 18 b x\n",
"18 19 a y\n",
"19 20 b z"
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import pandas as pd\n",
"\n",
"# Example data\n",
"# df is a long-format data table w/subject (S), categorical factor (X) and outcome (Y)\n",
"df = pd.read_csv(\"data/1F2LBs_multinomial.csv\")\n",
"df.head(20)"
]
},
{
"cell_type": "code",
"execution_count": 23,
"id": "ccfc2702-6fae-40f5-908c-e7ba4a194444",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Y x y z\n",
"X \n",
"a 3 26 1\n",
"b 14 9 7"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"xt = pd.crosstab(df[\"X\"], df[\"Y\"])\n",
"xt"
]
},
{
"cell_type": "code",
"execution_count": 24,
"id": "e4ca1f61-4ad5-4059-83f5-5369fceeafb4",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(21.402062415325055, 2.252170138338781e-05, 2)"
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from scipy.stats import chi2_contingency\n",
"\n",
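"# lambda_=\"log-likelihood\" requests the G (log-likelihood ratio) statistic instead of Pearson's chi-squared\n",
"g_stat, p, dof, exp_freq = chi2_contingency(xt, lambda_=\"log-likelihood\")\n",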
"g_stat, p, dof"
]
},
{
"cell_type": "markdown",
"id": "85091a9f-0863-48ee-8c16-cac8d31927a2",
"metadata": {},
"source": [
"## Two-Sample Pearson Chi-Squared Test\n",
"\n",
"* **Samples:** `2`\n",
"* **Response Categories:** `≥2`\n",
"* **Exact?:** No, use with `N>200`\n",
"* **Reporting:** \"Table 1 shows the counts of the ‘x’, ‘y’, and ‘z’ outcomes for each of ‘a’ and ‘b’. A two-sample Pearson Chi-Squared test indicated a statistically significant association between X and Y (χ2 (2, N=60) = 19.88, p < .0001).\""
]
},
{
"cell_type": "code",
"execution_count": 25,
"id": "f9818172-7156-4593-a3ce-5b094c530ba9",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" S X Y\n",
"0 1 a y\n",
"1 2 b x\n",
"2 3 a x\n",
"3 4 b y\n",
"4 5 a y\n",
"5 6 b x\n",
"6 7 a y\n",
"7 8 b x\n",
"8 9 a y\n",
"9 10 b z\n",
"10 11 a y\n",
"11 12 b z\n",
"12 13 a y\n",
"13 14 b y\n",
"14 15 a y\n",
"15 16 b y\n",
"16 17 a y\n",
"17 18 b x\n",
"18 19 a y\n",
"19 20 b z"
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import pandas as pd\n",
"\n",
"# Example data\n",
"# df is a long-format data table w/subject (S), categorical factor (X) and outcome (Y)\n",
"df = pd.read_csv(\"data/1F2LBs_multinomial.csv\")\n",
"df.head(20)"
]
},
{
"cell_type": "code",
"execution_count": 26,
"id": "01ad604b-d141-446d-9708-4a3f6220dd47",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Y x y z\n",
"X \n",
"a 3 26 1\n",
"b 14 9 7"
]
},
"execution_count": 26,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"xt = pd.crosstab(df[\"X\"], df[\"Y\"])\n",
"xt"
]
},
{
"cell_type": "code",
"execution_count": 27,
"id": "27008a6d-0a81-4594-aad7-2a0f46ecd319",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(19.87478991596639, 4.8333050401877814e-05, 2)"
]
},
"execution_count": 27,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from scipy.stats import chi2_contingency\n",
"\n",
"# With the default lambda_, chi2_contingency computes Pearson's chi-squared statistic\n",
"chi2_stat, p, dof, exp_freq = chi2_contingency(xt)\n",
"chi2_stat, p, dof"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.0"
}
},
"nbformat": 4,
"nbformat_minor": 5
}