{ "nbformat": 4, "nbformat_minor": 0, "metadata": { "colab": { "provenance": [] }, "kernelspec": { "name": "python3", "display_name": "Python 3" }, "language_info": { "name": "python" } }, "cells": [ { "cell_type": "markdown", "source": [ "Read the data and plot 'lead_time'" ], "metadata": { "id": "GO5gTVYInUxo" } }, { "cell_type": "code", "source": [ "import numpy as np\n", "import matplotlib.pyplot as plt\n", "import pandas as pd\n", "\n", "# Load the dataset\n", "data = pd.read_csv('/content/BackOrders.csv')\n", "\n", "# Extract the 'lead_time' feature\n", "lead_time = data['lead_time']\n", "\n", "# Plot the original distribution\n", "plt.hist(lead_time, bins=30, edgecolor='k', alpha=0.7)\n", "plt.title('Original Distribution of Lead Time')\n", "plt.xlabel('Lead Time')\n", "plt.ylabel('Frequency')\n", "plt.show()" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 983 }, "id": "HeQLoEignPpE", "outputId": "e4c2e735-e929-4a9a-ccb5-acf3354499aa" }, "execution_count": null, "outputs": [ { "output_type": "stream", "name": "stderr", "text": [ ":6: DtypeWarning: Columns (1,3) have mixed types. Specify dtype option on import or set low_memory=False.\n", " data = pd.read_csv('/content/BackOrders.csv')\n" ] }, { "output_type": "display_data", "data": { "text/plain": [ "[Figure: histogram 'Original Distribution of Lead Time' (x-axis: Lead Time, y-axis: Frequency)]" ] }, "metadata": {} } ] }, { "cell_type": "markdown", "source": [ "**Confidence Intervals**\n", "\n", "Construct and interpret confidence intervals for the mean 'lead_time'.\n", "\n", "Steps:\n", "\n", "1. Sample Data: Select a random sample of 'lead_time' values.\n", "2. Compute the sample mean and standard error.\n", "3. Calculate the 95% confidence interval for the mean 'lead_time'.\n", "4. Plot the sample mean and the confidence interval on a graph." ], "metadata": { "id": "QiRFMsesEy3O" } }, { "cell_type": "code", "source": [ "import scipy.stats as stats\n", "\n", "# Draw a random sample of non-missing values, without replacement\n", "sample = np.random.choice(lead_time.dropna(), size=30, replace=False)\n", "\n", "# Calculate mean and standard error\n", "sample_mean = np.mean(sample)\n", "standard_error = stats.sem(sample)\n", "\n", "# Calculate the t-based confidence interval\n", "confidence_level = 0.95\n", "confidence_interval = stats.t.interval(confidence_level, len(sample)-1, loc=sample_mean, scale=standard_error)\n", "\n", "print(f\"Sample Mean: {sample_mean}\")\n", "print(f\"95% Confidence Interval: {confidence_interval}\")\n", "\n", "# Plot the sample mean with the confidence interval as an error bar\n", "lower, upper = confidence_interval\n", "plt.errorbar(0, sample_mean, yerr=[[sample_mean - lower], [upper - sample_mean]], fmt='o', capsize=5)\n", "plt.xticks([])\n", "plt.ylabel('Lead Time')\n", "plt.title('Sample Mean with 95% Confidence Interval')\n", "plt.show()" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "MAVBNwB4FIcs", "outputId": "2259dd66-ee63-421c-de59-9f3a3114b449" }, "execution_count": null, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "Sample Mean: 6.566666666666666\n", "95% Confidence Interval: (5.441380822217193, 7.69195251111614)\n" ] } ] }, { "cell_type": "markdown", "source": [ "**Hypothesis Testing**\n", "\n", "Test whether there is a significant difference in the mean 'lead_time' between products that went on backorder and those that did not.\n", "\n", "Steps:\n", "\n", "1. Formulate Hypotheses:\n", "   - Null Hypothesis (H0): There is no difference in the mean 'lead_time' between the two groups.\n", "   - Alternative Hypothesis (H1): There is a difference in the mean 'lead_time' between the two groups.\n", "2. Choose a significance level (e.g., α = 0.05).\n", "3. 
Use an independent t-test to compare the means of the two groups.\n", "4. Interpret Results: Determine whether to reject or fail to reject the null hypothesis based on the p-value." ], "metadata": { "id": "XRTNzQ2GFeGF" } }, { "cell_type": "code", "source": [ "# Define groups based on the target variable 'went_on_backorder'\n", "group_backorder = data[data['went_on_backorder'] == 'Yes']['lead_time']\n", "group_no_backorder = data[data['went_on_backorder'] == 'No']['lead_time']\n", "\n", "# Calculate the mean lead time for each group (missing values are skipped)\n", "mean_backorder = np.mean(group_backorder)\n", "mean_no_backorder = np.mean(group_no_backorder)\n", "\n", "# Perform an independent two-sample t-test on the non-missing values\n", "t_statistic, p_value = stats.ttest_ind(group_backorder.dropna(), group_no_backorder.dropna())\n", "\n", "print(f\"Mean Lead Time for Backorder: {mean_backorder}\")\n", "print(f\"Mean Lead Time for No Backorder: {mean_no_backorder}\")\n", "print(f\"T-statistic: {t_statistic}, P-value: {p_value}\")\n", "\n", "if p_value < 0.05:\n", "    print(\"The difference in mean lead time is statistically significant.\")\n", "else:\n", "    print(\"The difference in mean lead time is not statistically significant.\")" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "Y3A1GcByFooK", "outputId": "169ce7a8-2039-449c-f044-914d0c35b6ec" }, "execution_count": null, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "Mean Lead Time for Backorder: 6.322545355091622\n", "Mean Lead Time for No Backorder: 7.847004256941356\n", "T-statistic: -22.224278817834143, P-value: 5.697539588215606e-109\n", "The difference in mean lead time is statistically significant.\n" ] } ] }, { "cell_type": "markdown", "source": [ "**Task:**\n", "\n", "Consider two categorical variables, went_on_backorder and potential_issue. 
Which test is suitable for determining whether there is an association between them?" ], "metadata": { "id": "rB9Lgc6UPvlY" } }, { "cell_type": "markdown", "source": [ "Steps:\n", "\n", "1. Formulate Hypotheses:\n", "   - Null Hypothesis (H0): went_on_backorder and potential_issue are independent (no association).\n", "   - Alternative Hypothesis (H1): There is an association between went_on_backorder and potential_issue.\n", "2. Create a Contingency Table: Use the two categorical variables to create a contingency table.\n", "3. Apply the Chi-Square test to evaluate the relationship between these two variables.\n", "4. Interpret Results: Determine whether to reject or fail to reject the null hypothesis based on the p-value." ], "metadata": { "id": "jijGtQzPUqWy" } }, { "cell_type": "code", "source": [ "# Create a contingency table between 'went_on_backorder' and 'potential_issue'\n", "contingency_table = pd.crosstab(data['went_on_backorder'], data['potential_issue'])\n", "\n", "# Perform Chi-Square test of independence\n", "chi2, p_val, dof, expected = stats.chi2_contingency(contingency_table)\n", "print(f\"Chi-square statistic: {chi2}, P-value: {p_val}\")\n", "\n", "if p_val < 0.05:\n", "    print(\"Reject the null hypothesis - There is an association between 'went_on_backorder' and 'potential_issue'.\")\n", "else:\n", "    print(\"Fail to reject the null hypothesis - No evidence of an association between 'went_on_backorder' and 'potential_issue'.\")" ], "metadata": { "id": "TC3XNb5tVXUl" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "**Task:**\n", "\n", "Compare the variances of two continuous variables, such as national_inv (national inventory level) and lead_time, for products that went on backorder and those that did not." ], "metadata": { "id": "gN1mEtdTWhbR" } }, { "cell_type": "markdown", "source": [ "Steps:\n", "\n", "1. Formulate Hypotheses:\n", "   - Null Hypothesis (H0): For each of national_inv and lead_time, the variance is equal for products that went on backorder and those that did not.\n", "   - Alternative Hypothesis (H1): The variances are not equal.\n", "\n", "2. 
Separate the data into two groups based on the went_on_backorder column.\n", "\n", "3. Use the ____ test to compare the variances of the two groups.\n", "\n", "4. Use the p-value to determine if the variances differ significantly." ], "metadata": { "id": "jyXoZQQsWpUa" } }, { "cell_type": "code", "source": [ "# Split the data into two groups based on 'went_on_backorder'\n", "\n", "\n", "# Convert the mixed str and float values in 'national_inv' and 'lead_time' to numeric\n", "\n", "\n", "# Calculate variances for 'national_inv' and 'lead_time' for each group\n", "\n", "\n", "# Perform ____ test for both features\n", "\n", "\n", "# Interpret the result\n", "if ________ :\n", "    print(\"The variances of 'national_inv' are significantly different between groups.\")\n", "    print(\"Reject the null hypothesis.\")\n", "else:\n", "    print(\"The variances of 'national_inv' are not significantly different between groups.\")" ], "metadata": { "id": "F_ehRJgEWmb9" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "**Task:**\n", "\n", "Compare the means of a continuous variable, such as lead_time, across multiple categories, like different levels of forecast_3_month, to see if there is a significant difference between the groups.\n", "\n", "Steps:\n", "\n", "1. Formulate Hypotheses:\n", "   - Null Hypothesis (H0): The mean lead_time is the same across different levels of forecast_3_month.\n", "   - Alternative Hypothesis (H1): At least one of the means differs.\n", "2. Group the Data: Split the data based on different levels of forecast_3_month.\n", "\n", "3. Perform _____ analysis to test if there is a significant difference in lead_time across the groups.\n", "\n", "4. 
Interpret Results: Use the F-statistic and p-value to decide if the means differ significantly.\n", "\n" ], "metadata": { "id": "3uAH_MNiqdkq" } }, { "cell_type": "code", "source": [ "import statsmodels.api as sm\n", "from statsmodels.formula.api import ols\n", "\n", "# Group data by 'forecast_3_month'\n", "\n", "\n", "# Perform _____ analysis\n", "\n", "\n", "# Interpret the results" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "-LLORcN9q1CO", "outputId": "7a59659b-5724-4636-ff4d-3da0da6e7772" }, "execution_count": null, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ " sum_sq df F PR(>F)\n", "C(forecast_3_month) 6.338178e+04 1622.0 0.923212 0.986031\n", "Residual 2.394494e+06 56572.0 NaN NaN\n" ] }, { "output_type": "stream", "name": "stderr", "text": [ "/usr/local/lib/python3.10/dist-packages/statsmodels/base/model.py:1894: ValueWarning: covariance of constraints does not have full rank. The number of constraints is 1622, but rank is 1613\n", " warnings.warn('covariance of constraints does not have full '\n" ] } ] } ] }