{ "cells": [ { "cell_type": "markdown", "id": "29a86434-d56f-4f14-8c3c-6e65c1e3d5ad", "metadata": {}, "source": [ "# Two Sample Tests of Proportion - Python" ] }, { "cell_type": "markdown", "id": "34ce63d9-a78a-4dd2-a94d-e5ac13210987", "metadata": {}, "source": [ "## Fisher's Exact Test\n", "\n", "* **Samples:** `2`\n", "* **Response Categories:** `2`\n", "* **Exact?:** Yes, use with `N≤200`\n", "* **Reporting:** \"Table 1 shows the counts of the ‘x’ and ‘y’ outcomes for each of ‘a’ and ‘b’. Fisher’s exact test indicated a statistically significant association between X and Y (p < .001).\"" ] }, { "cell_type": "markdown", "id": "0552b66c-40a0-4cde-a1b0-50b9eefcf5fc", "metadata": {}, "source": [ "```{warning}\n", "There isn't a readily available, non-GPL-licensed implementation of Fisher's Exact Test for Python that handles more than 2 response categories, as the R implementation does. If you are working in Python, try a [G-Test](#g-test) or [Two-Sample Pearson Chi-Squared Test](#two-sample-pearson-chi-squared-test) instead.\n", "\n", "Related GitHub Issue: https://github.com/scipy/scipy/issues/7099\n", "\n", "The following example therefore covers only 2 samples and 2 response categories.\n", "```" ] }, { "cell_type": "code", "execution_count": 17, "id": "be7eb19b-5317-4802-8172-2309546d27e9", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr><th></th><th>S</th><th>X</th><th>Y</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>1</td><td>a</td><td>y</td></tr>\n",
"    <tr><th>1</th><td>2</td><td>b</td><td>x</td></tr>\n",
"    <tr><th>2</th><td>3</td><td>a</td><td>x</td></tr>\n",
"    <tr><th>3</th><td>4</td><td>b</td><td>y</td></tr>\n",
"    <tr><th>4</th><td>5</td><td>a</td><td>y</td></tr>\n",
"    <tr><th>5</th><td>6</td><td>b</td><td>x</td></tr>\n",
"    <tr><th>6</th><td>7</td><td>a</td><td>y</td></tr>\n",
"    <tr><th>7</th><td>8</td><td>b</td><td>x</td></tr>\n",
"    <tr><th>8</th><td>9</td><td>a</td><td>y</td></tr>\n",
"    <tr><th>10</th><td>11</td><td>a</td><td>y</td></tr>\n",
"    <tr><th>12</th><td>13</td><td>a</td><td>y</td></tr>\n",
"    <tr><th>13</th><td>14</td><td>b</td><td>y</td></tr>\n",
"    <tr><th>14</th><td>15</td><td>a</td><td>y</td></tr>\n",
"    <tr><th>15</th><td>16</td><td>b</td><td>y</td></tr>\n",
"    <tr><th>16</th><td>17</td><td>a</td><td>y</td></tr>\n",
"    <tr><th>17</th><td>18</td><td>b</td><td>x</td></tr>\n",
"    <tr><th>18</th><td>19</td><td>a</td><td>y</td></tr>\n",
"    <tr><th>20</th><td>21</td><td>a</td><td>y</td></tr>\n",
"    <tr><th>21</th><td>22</td><td>b</td><td>y</td></tr>\n",
"    <tr><th>23</th><td>24</td><td>b</td><td>x</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " S X Y\n", "0 1 a y\n", "1 2 b x\n", "2 3 a x\n", "3 4 b y\n", "4 5 a y\n", "5 6 b x\n", "6 7 a y\n", "7 8 b x\n", "8 9 a y\n", "10 11 a y\n", "12 13 a y\n", "13 14 b y\n", "14 15 a y\n", "15 16 b y\n", "16 17 a y\n", "17 18 b x\n", "18 19 a y\n", "20 21 a y\n", "21 22 b y\n", "23 24 b x" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "\n", "# Example data\n", "# df is a long-format data table w/subject (S), categorical factor (X) and outcome (Y)\n", "df = pd.read_csv(\"data/1F2LBs_multinomial.csv\")\n", "\n", "# Filter the data for the sake of making it work for the example\n", "df = df.loc[df[\"Y\"].isin([\"x\", \"y\"])]\n", "df.head(20)" ] }, { "cell_type": "code", "execution_count": 19, "id": "ca79e707-b1ef-4a79-a999-4d4d346b3223", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
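<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr><th>Y</th><th>x</th><th>y</th></tr>\n",
"    <tr><th>X</th><th></th><th></th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>a</th><td>3</td><td>26</td></tr>\n",
"    <tr><th>b</th><td>14</td><td>9</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>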
" ], "text/plain": [ "Y x y\n", "X \n", "a 3 26\n", "b 14 9" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "xt = pd.crosstab(df[\"X\"], df[\"Y\"])\n", "xt" ] }, { "cell_type": "code", "execution_count": 21, "id": "e727c563-613b-4dd7-9b05-5926fe2c3f02", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.00021895317390182282" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from scipy.stats import fisher_exact\n", "\n", "_, p = fisher_exact(xt)\n", "p" ] }, { "cell_type": "markdown", "id": "8674f910-0a20-4102-a194-b4bdba58555f", "metadata": {}, "source": [ "## G-Test\n", "\n", "* **Samples:** `2`\n", "* **Response Categories:** `≥2`\n", "* **Exact?:** No, use with `N>200`\n", "* **Reporting:** \"Table 1 shows the counts of the ‘x’, ‘y’, and ‘z’ outcomes for each of ‘a’ and ‘b’. A G-test indicated a statistically significant association between X and Y (G(2) = 21.40, p < .0001).\"" ] }, { "cell_type": "code", "execution_count": 22, "id": "7b65ff5f-fe61-4b8d-9438-d0434b95f1b2", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr><th></th><th>S</th><th>X</th><th>Y</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>1</td><td>a</td><td>y</td></tr>\n",
"    <tr><th>1</th><td>2</td><td>b</td><td>x</td></tr>\n",
"    <tr><th>2</th><td>3</td><td>a</td><td>x</td></tr>\n",
"    <tr><th>3</th><td>4</td><td>b</td><td>y</td></tr>\n",
"    <tr><th>4</th><td>5</td><td>a</td><td>y</td></tr>\n",
"    <tr><th>5</th><td>6</td><td>b</td><td>x</td></tr>\n",
"    <tr><th>6</th><td>7</td><td>a</td><td>y</td></tr>\n",
"    <tr><th>7</th><td>8</td><td>b</td><td>x</td></tr>\n",
"    <tr><th>8</th><td>9</td><td>a</td><td>y</td></tr>\n",
"    <tr><th>9</th><td>10</td><td>b</td><td>z</td></tr>\n",
"    <tr><th>10</th><td>11</td><td>a</td><td>y</td></tr>\n",
"    <tr><th>11</th><td>12</td><td>b</td><td>z</td></tr>\n",
"    <tr><th>12</th><td>13</td><td>a</td><td>y</td></tr>\n",
"    <tr><th>13</th><td>14</td><td>b</td><td>y</td></tr>\n",
"    <tr><th>14</th><td>15</td><td>a</td><td>y</td></tr>\n",
"    <tr><th>15</th><td>16</td><td>b</td><td>y</td></tr>\n",
"    <tr><th>16</th><td>17</td><td>a</td><td>y</td></tr>\n",
"    <tr><th>17</th><td>18</td><td>b</td><td>x</td></tr>\n",
"    <tr><th>18</th><td>19</td><td>a</td><td>y</td></tr>\n",
"    <tr><th>19</th><td>20</td><td>b</td><td>z</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " S X Y\n", "0 1 a y\n", "1 2 b x\n", "2 3 a x\n", "3 4 b y\n", "4 5 a y\n", "5 6 b x\n", "6 7 a y\n", "7 8 b x\n", "8 9 a y\n", "9 10 b z\n", "10 11 a y\n", "11 12 b z\n", "12 13 a y\n", "13 14 b y\n", "14 15 a y\n", "15 16 b y\n", "16 17 a y\n", "17 18 b x\n", "18 19 a y\n", "19 20 b z" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "\n", "# Example data\n", "# df is a long-format data table w/subject (S), categorical factor (X) and outcome (Y)\n", "df = pd.read_csv(\"data/1F2LBs_multinomial.csv\")\n", "df.head(20)" ] }, { "cell_type": "code", "execution_count": 23, "id": "ccfc2702-6fae-40f5-908c-e7ba4a194444", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr><th>Y</th><th>x</th><th>y</th><th>z</th></tr>\n",
"    <tr><th>X</th><th></th><th></th><th></th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>a</th><td>3</td><td>26</td><td>1</td></tr>\n",
"    <tr><th>b</th><td>14</td><td>9</td><td>7</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ "Y x y z\n", "X \n", "a 3 26 1\n", "b 14 9 7" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "xt = pd.crosstab(df[\"X\"], df[\"Y\"])\n", "xt" ] }, { "cell_type": "code", "execution_count": 24, "id": "e4ca1f61-4ad5-4059-83f5-5369fceeafb4", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(21.402062415325055, 2.252170138338781e-05, 2)" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from scipy.stats import chi2_contingency\n", "\n", "g_stat, p, dof, exp_freq = chi2_contingency(xt, lambda_=\"log-likelihood\")\n", "g_stat, p, dof" ] }, { "cell_type": "markdown", "id": "85091a9f-0863-48ee-8c16-cac8d31927a2", "metadata": {}, "source": [ "## Two-Sample Pearson Chi-Squared Test\n", "\n", "* **Samples:** `2`\n", "* **Response Categories:** `≥2`\n", "* **Exact?:** No, use with `N>200`\n", "* **Reporting:** \"Table 1 shows the counts of the ‘x’, ‘y’, and ‘z’ outcomes for each of ‘a’ and ‘b’. A two-sample Pearson Chi-Squared test indicated a statistically significant association between X and Y (χ²(2, N=60) = 19.88, p < .0001).\"" ] }, { "cell_type": "code", "execution_count": 25, "id": "f9818172-7156-4593-a3ce-5b094c530ba9", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr><th></th><th>S</th><th>X</th><th>Y</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>1</td><td>a</td><td>y</td></tr>\n",
"    <tr><th>1</th><td>2</td><td>b</td><td>x</td></tr>\n",
"    <tr><th>2</th><td>3</td><td>a</td><td>x</td></tr>\n",
"    <tr><th>3</th><td>4</td><td>b</td><td>y</td></tr>\n",
"    <tr><th>4</th><td>5</td><td>a</td><td>y</td></tr>\n",
"    <tr><th>5</th><td>6</td><td>b</td><td>x</td></tr>\n",
"    <tr><th>6</th><td>7</td><td>a</td><td>y</td></tr>\n",
"    <tr><th>7</th><td>8</td><td>b</td><td>x</td></tr>\n",
"    <tr><th>8</th><td>9</td><td>a</td><td>y</td></tr>\n",
"    <tr><th>9</th><td>10</td><td>b</td><td>z</td></tr>\n",
"    <tr><th>10</th><td>11</td><td>a</td><td>y</td></tr>\n",
"    <tr><th>11</th><td>12</td><td>b</td><td>z</td></tr>\n",
"    <tr><th>12</th><td>13</td><td>a</td><td>y</td></tr>\n",
"    <tr><th>13</th><td>14</td><td>b</td><td>y</td></tr>\n",
"    <tr><th>14</th><td>15</td><td>a</td><td>y</td></tr>\n",
"    <tr><th>15</th><td>16</td><td>b</td><td>y</td></tr>\n",
"    <tr><th>16</th><td>17</td><td>a</td><td>y</td></tr>\n",
"    <tr><th>17</th><td>18</td><td>b</td><td>x</td></tr>\n",
"    <tr><th>18</th><td>19</td><td>a</td><td>y</td></tr>\n",
"    <tr><th>19</th><td>20</td><td>b</td><td>z</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " S X Y\n", "0 1 a y\n", "1 2 b x\n", "2 3 a x\n", "3 4 b y\n", "4 5 a y\n", "5 6 b x\n", "6 7 a y\n", "7 8 b x\n", "8 9 a y\n", "9 10 b z\n", "10 11 a y\n", "11 12 b z\n", "12 13 a y\n", "13 14 b y\n", "14 15 a y\n", "15 16 b y\n", "16 17 a y\n", "17 18 b x\n", "18 19 a y\n", "19 20 b z" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "\n", "# Example data\n", "# df is a long-format data table w/subject (S), categorical factor (X) and outcome (Y)\n", "df = pd.read_csv(\"data/1F2LBs_multinomial.csv\")\n", "df.head(20)" ] }, { "cell_type": "code", "execution_count": 26, "id": "01ad604b-d141-446d-9708-4a3f6220dd47", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr><th>Y</th><th>x</th><th>y</th><th>z</th></tr>\n",
"    <tr><th>X</th><th></th><th></th><th></th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>a</th><td>3</td><td>26</td><td>1</td></tr>\n",
"    <tr><th>b</th><td>14</td><td>9</td><td>7</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ "Y x y z\n", "X \n", "a 3 26 1\n", "b 14 9 7" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "xt = pd.crosstab(df[\"X\"], df[\"Y\"])\n", "xt" ] }, { "cell_type": "code", "execution_count": 27, "id": "27008a6d-0a81-4594-aad7-2a0f46ecd319", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(19.87478991596639, 4.8333050401877814e-05, 2)" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from scipy.stats import chi2_contingency\n", "\n", "# With no lambda_ argument, chi2_contingency computes Pearson's chi-squared statistic\n", "chi2_stat, p, dof, exp_freq = chi2_contingency(xt)\n", "chi2_stat, p, dof" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.0" } }, "nbformat": 4, "nbformat_minor": 5 }