{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "m23ypqC8r6Xk"
},
"source": [
"## Logistic Regression:\n",
"\n",
"### **Business Problem Definition:**\n",
"The business problem revolves around **predicting the likelihood of order cancellations** for a product fulfillment company based on historical order data. This prediction helps the company:\n",
"\n",
"- `Proactively address potential cancellations,`\n",
"- `Optimize operations, and`\n",
"- `Improve customer satisfaction.`\n",
"\n",
"By identifying high-risk orders early, the business `can take preventive actions`, such as offering incentives or expedited shipping, to reduce the likelihood of cancellations.\n",
"\n",
"This is a `binary classification` problem where the target variable is the `Order_Cancelled` column.\n",
"\n",
"The model needs to classify whether an order will be cancelled based on features like the delivery time, order value, region, and other order-related attributes.\n",
"\n",
"---"
]
},
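{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of how logistic regression turns a linear score into a cancellation probability for this kind of binary target. The weights, bias, and feature values are made-up illustrative numbers (assumptions), not coefficients fitted to this dataset.\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"def sigmoid(z):\n",
"    # Map any real-valued score to a probability in (0, 1)\n",
"    return 1.0 / (1.0 + np.exp(-z))\n",
"\n",
"# Hypothetical coefficients for two scaled features (illustrative only)\n",
"weights = np.array([0.8, -0.5])\n",
"bias = -0.2\n",
"x = np.array([1.5, 0.3])  # one order's scaled feature values\n",
"\n",
"score = bias + weights @ x\n",
"prob_cancel = sigmoid(score)\n",
"\n",
"# Classify with a 0.5 threshold (the threshold is also an assumption)\n",
"label = 'Cancelled' if prob_cancel >= 0.5 else 'Not-Cancelled'\n",
"print(round(float(prob_cancel), 3), label)\n",
"```\n",
"\n",
"Orders whose predicted probability crosses the chosen threshold would be the ones flagged for preventive action."
]
},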
{
"cell_type": "markdown",
"metadata": {
"id": "Jw4-U7HQXeFa"
},
"source": [
"## Dataset Description:\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "MLcEsAn-0qtm"
},
"source": [
"Explanation of Attributes/Target Variable:\n",
"\n",
"\\\n",
"Each row represents an `order placed by the customer` while columms represent the varied specifics of that order.\n",
"\n",
"\\\n",
"\n",
"- `Days_to_Delivery` (numeric): \n",
" The number of days it takes for the order to be delivered, generated based on a normal distribution with a mean of 5 days and a standard deviation of 2 days.\n",
"\n",
"- `Num_Items_Ordered` (numeric): \n",
" The total number of items in the order, represented as an integer between 1 and 20, reflecting the quantity of products ordered.\n",
"\n",
"- `Order_Value` (numeric): \n",
" The total value of the order in USD. It is generated from a normal distribution centered around $500, with a standard deviation of $100.\n",
"\n",
"- `Discount_Rate` (numeric): \n",
" The discount rate applied to the order, represented as a value between 0 and 0.5, indicating various discounts offered during sales.\n",
"\n",
"- `Num_Previous_Orders` (numeric): \n",
" The number of previous orders placed by the customer. It is an integer between 0 and 10.\n",
"\n",
"- `Delivery_Time_Variation` (numeric): \n",
" The variation between the estimated and actual delivery time, measured in days, with values ranging from 0 to 3 days.\n",
"\n",
"- `Region` (categorical): \n",
" The geographic region of the customer. Possible values:\n",
" - North America\n",
" - EMEA (Europe, Middle East, Africa)\n",
" - APAC (Asia-Pacific)\n",
" - LATAM (Latin America)\n",
"\n",
"- `Product_Category` (categorical)\n",
" The category of the product ordered. Possible values:\n",
" - Cloud\n",
" - On-premise\n",
" - SaaS (Software as a Service)\n",
" - Hardware\n",
"\n",
"- `Order_Priority` (categorical): \n",
" The urgency level of the order. Possible values:\n",
" - Low\n",
" - Medium\n",
" - High\n",
"\n",
"- `Payment_Method` (categorical): \n",
" The method used for payment in the order. Possible values include:\n",
" - Credit Card\n",
" - Bank Transfer\n",
" - PayPal\n",
" - Bitcoin\n",
"\n",
"- `Correlated_Order_Value` (numeric): \n",
" Represents an alternative estimation of the total order value, calculated by factoring in **historical customer spending behavior** and **product pricing trends**.It incorporates additional business insights such as customer loyalty and purchasing history.\n",
"\n",
"\n",
"- `Order_Cancelled (Target):` This is the target column indicating whether the order was cancelled or not. Values are \"Cancelled\" or \"Not-Cancelled\".\n",
"\n",
"---"
]
},
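{
"cell_type": "markdown",
"metadata": {},
"source": [
"The generation process described above can be approximated with NumPy. This is a hypothetical reconstruction: the seed, the sample size, and the exclusive upper bounds are assumptions, and it is not the script that produced the actual CSV.\n",
"\n",
"```python\n",
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"rng = np.random.default_rng(42)  # arbitrary seed, assumed for reproducibility\n",
"n = 1000                         # assumed sample size for the sketch\n",
"\n",
"toy = pd.DataFrame({\n",
"    'Days_to_Delivery': rng.normal(5, 2, n),        # mean 5 days, std 2 days\n",
"    'Num_Items_Ordered': rng.integers(1, 20, n),    # upper bound is exclusive\n",
"    'Order_Value': rng.normal(500, 100, n),         # mean 500 USD, std 100 USD\n",
"    'Discount_Rate': rng.uniform(0, 0.5, n),\n",
"    'Num_Previous_Orders': rng.integers(0, 10, n),  # upper bound is exclusive\n",
"    'Delivery_Time_Variation': rng.uniform(0, 3, n),\n",
"    'Region': rng.choice(['North America', 'EMEA', 'APAC', 'LATAM'], n),\n",
"})\n",
"print(toy.describe().round(2))\n",
"```\n",
"\n",
"Because a normal distribution has no lower bound, a column generated this way can contain negative delivery times, which is worth checking during data cleaning."
]
},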
{
"cell_type": "markdown",
"metadata": {
"id": "OzrRK5hmMpq1"
},
"source": [
"### Comprehensive Data Science Project Workflow: From Business Understanding to Model Monitoring:\n",
"\n",
"1. `Business Understanding` (Define project goals and objectives.)\n",
"\n",
"2. `Data Requirement` (Identify necessary data for analysis)\n",
"\n",
"3. `Data Collection` (Data gathering from different sources with varied tools and technologies)\n",
"\n",
"4. `Data Preparation` (EDA/Data Preparation/Data Cleaning/Data Munging)\n",
"\n",
"5. `Data Modeling` ( Clean Data + Algorithms = Model)\n",
"\n",
"6. `Model Evaluation` (Test Model perf)\n",
"\n",
"7. `Model Tuning`(Optimize model hyperparameters)\n",
"\n",
"8. `Model Deployment`(Deploy model for real-time use)\n",
"\n",
"9. `Monitoring`(Track model performance over time)\n",
"\n",
"---"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "wESk_sbeMvmu"
},
"source": [
"### EDA/ Data Preparation/Data Cleaning Steps:\n",
"\n",
"1. `Removing Duplicate data`\n",
"2. `Missing Value Treatment`\n",
"3. `Outlier Treatment`\n",
"4. `Categorical to Numerical Conversion`\n",
"5. `Numerical to Categorical Conversion`\n",
"6. `Feature Scaling`\n",
"7. `Feature Transformation`\n",
"8. `Feature selection`\n",
"\n",
"---"
]
},
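{
"cell_type": "markdown",
"metadata": {},
"source": [
"The first three steps in the list above can be sketched on a toy frame. The data, the median fill, and the 1.5 * IQR capping rule are illustrative assumptions, not the treatment this notebook necessarily applies.\n",
"\n",
"```python\n",
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"# Toy frame with one duplicate row, one missing value, and one extreme value\n",
"toy = pd.DataFrame({\n",
"    'Order_Value': [500.0, 500.0, 480.0, np.nan, 520.0, 5000.0],\n",
"    'Region': ['EMEA', 'EMEA', 'APAC', 'LATAM', 'APAC', 'EMEA'],\n",
"})\n",
"\n",
"# 1. Remove duplicate rows\n",
"toy = toy.drop_duplicates().reset_index(drop=True)\n",
"\n",
"# 2. Fill missing numeric values with the column median\n",
"toy['Order_Value'] = toy['Order_Value'].fillna(toy['Order_Value'].median())\n",
"\n",
"# 3. Cap outliers at the 1.5 * IQR fences\n",
"q1, q3 = toy['Order_Value'].quantile([0.25, 0.75])\n",
"iqr = q3 - q1\n",
"toy['Order_Value'] = toy['Order_Value'].clip(q1 - 1.5 * iqr, q3 + 1.5 * iqr)\n",
"print(toy)\n",
"```\n",
"\n",
"Capping keeps every row while limiting the influence of extreme values; dropping the rows instead is an equally valid choice when outliers are clearly data errors."
]
},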
{
"cell_type": "markdown",
"metadata": {
"id": "-1Tl3l3pMzFp"
},
"source": [
"1. Import Necessary Libraries"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"collapsed": true,
"id": "BZ1UmMexcFEh",
"jupyter": {
"outputs_hidden": true
},
"outputId": "80c1f9b1-9f3f-45b9-ce42-312c1107ef1d"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Collecting statsmodels\n",
" Downloading statsmodels-0.14.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (9.2 kB)\n",
"Requirement already satisfied: numpy<3,>=1.22.3 in /usr/local/lib/python3.10/dist-packages (from statsmodels) (1.26.4)\n",
"Requirement already satisfied: scipy!=1.9.2,>=1.8 in /usr/local/lib/python3.10/dist-packages (from statsmodels) (1.13.1)\n",
"Requirement already satisfied: pandas!=2.1.0,>=1.4 in /usr/local/lib/python3.10/dist-packages (from statsmodels) (2.1.4)\n",
"Collecting patsy>=0.5.6 (from statsmodels)\n",
" Downloading patsy-0.5.6-py2.py3-none-any.whl.metadata (3.5 kB)\n",
"Requirement already satisfied: packaging>=21.3 in /usr/local/lib/python3.10/dist-packages (from statsmodels) (24.1)\n",
"Requirement already satisfied: python-dateutil>=2.8.2 in /usr/local/lib/python3.10/dist-packages (from pandas!=2.1.0,>=1.4->statsmodels) (2.9.0.post0)\n",
"Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.10/dist-packages (from pandas!=2.1.0,>=1.4->statsmodels) (2024.2)\n",
"Requirement already satisfied: tzdata>=2022.1 in /usr/local/lib/python3.10/dist-packages (from pandas!=2.1.0,>=1.4->statsmodels) (2024.1)\n",
"Requirement already satisfied: six in /usr/local/lib/python3.10/dist-packages (from patsy>=0.5.6->statsmodels) (1.16.0)\n",
"Downloading statsmodels-0.14.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (10.8 MB)\n",
"\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m10.8/10.8 MB\u001b[0m \u001b[31m89.5 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
"\u001b[?25hDownloading patsy-0.5.6-py2.py3-none-any.whl (233 kB)\n",
"\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m233.9/233.9 kB\u001b[0m \u001b[31m14.8 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
"\u001b[?25hInstalling collected packages: patsy, statsmodels\n",
"Successfully installed patsy-0.5.6 statsmodels-0.14.3\n"
]
}
],
"source": [
"# %pip install statsmodels"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"id": "_CcitolYMyph"
},
"outputs": [],
"source": [
"# Importing all necessary Libaries: Data Science Packages\n",
"\n",
"import numpy as np # numpy used for mathematical operation on array\n",
"import pandas as pd # pandas used for data manipulation on dataframe\n",
"import seaborn as sns # seaborn used for data visualization\n",
"import matplotlib.pyplot as plt # matplotlib used for data visualization\n",
"import warnings\n",
"warnings.filterwarnings('ignore')"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"id": "tjP2FYHmM1_B"
},
"outputs": [],
"source": [
"# Read the data with pandas\n",
"\n",
"df = pd.read_csv(\"order_mgmt_binary_classification.csv\")"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 226
},
"id": "O-bgGSmiM2B6",
"outputId": "bf28022a-3881-4b83-ad1f-ecae22b9e45f"
},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
" \n",
" \n",
" Days_to_Delivery \n",
" Num_Items_Ordered \n",
" Order_Value \n",
" Discount_Rate \n",
" Num_Previous_Orders \n",
" Delivery_Time_Variation \n",
" Region \n",
" Product_Category \n",
" Order_Priority \n",
" Payment_Method \n",
" Correlated_Order_Value \n",
" Order_Cancelled \n",
" \n",
" \n",
" \n",
" \n",
" 0 \n",
" 5.993428 \n",
" 6 \n",
" 716.505607 \n",
" 0.098621 \n",
" 5 \n",
" 1.129165 \n",
" EMEA \n",
" On-premise \n",
" Low \n",
" Bitcoin \n",
" 677.114236 \n",
" Not-Cancelled \n",
" \n",
" \n",
" 1 \n",
" 4.723471 \n",
" 11 \n",
" 619.054859 \n",
" 0.106769 \n",
" 7 \n",
" 0.889863 \n",
" APAC \n",
" Cloud \n",
" Medium \n",
" Bitcoin \n",
" 591.127949 \n",
" Not-Cancelled \n",
" \n",
" \n",
" 2 \n",
" 6.295377 \n",
" 3 \n",
" 521.257403 \n",
" 0.338047 \n",
" 0 \n",
" 2.668340 \n",
" LATAM \n",
" On-premise \n",
" Medium \n",
" Bank Transfer \n",
" 502.195055 \n",
" Not-Cancelled \n",
" \n",
" \n",
" 3 \n",
" 8.046060 \n",
" 8 \n",
" 602.698626 \n",
" 0.202501 \n",
" 4 \n",
" 2.998095 \n",
" LATAM \n",
" On-premise \n",
" High \n",
" PayPal \n",
" 566.850797 \n",
" Not-Cancelled \n",
" \n",
" \n",
" 4 \n",
" 4.531693 \n",
" 3 \n",
" 610.590042 \n",
" 0.465772 \n",
" 2 \n",
" 2.011061 \n",
" APAC \n",
" Hardware \n",
" High \n",
" Bank Transfer \n",
" 581.693494 \n",
" Not-Cancelled \n",
" \n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Days_to_Delivery Num_Items_Ordered Order_Value Discount_Rate \\\n",
"0 5.993428 6 716.505607 0.098621 \n",
"1 4.723471 11 619.054859 0.106769 \n",
"2 6.295377 3 521.257403 0.338047 \n",
"3 8.046060 8 602.698626 0.202501 \n",
"4 4.531693 3 610.590042 0.465772 \n",
"\n",
" Num_Previous_Orders Delivery_Time_Variation Region Product_Category \\\n",
"0 5 1.129165 EMEA On-premise \n",
"1 7 0.889863 APAC Cloud \n",
"2 0 2.668340 LATAM On-premise \n",
"3 4 2.998095 LATAM On-premise \n",
"4 2 2.011061 APAC Hardware \n",
"\n",
" Order_Priority Payment_Method Correlated_Order_Value Order_Cancelled \n",
"0 Low Bitcoin 677.114236 Not-Cancelled \n",
"1 Medium Bitcoin 591.127949 Not-Cancelled \n",
"2 Medium Bank Transfer 502.195055 Not-Cancelled \n",
"3 High PayPal 566.850797 Not-Cancelled \n",
"4 High Bank Transfer 581.693494 Not-Cancelled "
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Reading first 5 Rows of the data\n",
"\n",
"df.head()"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 226
},
"id": "HiLVof1DM2Ew",
"outputId": "fa6f4568-ccd0-41fe-8e31-bac8049bd864"
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" \n",
" Days_to_Delivery \n",
" Num_Items_Ordered \n",
" Order_Value \n",
" Discount_Rate \n",
" Num_Previous_Orders \n",
" Delivery_Time_Variation \n",
" Region \n",
" Product_Category \n",
" Order_Priority \n",
" Payment_Method \n",
" Correlated_Order_Value \n",
" Order_Cancelled \n",
" \n",
" \n",
" \n",
" \n",
" 3995 \n",
" 5.471104 \n",
" 17 \n",
" 432.846658 \n",
" 0.041176 \n",
" 6 \n",
" 0.124429 \n",
" LATAM \n",
" On-premise \n",
" Medium \n",
" Bitcoin \n",
" 410.113830 \n",
" Cancelled \n",
" \n",
" \n",
" 3996 \n",
" 0.689275 \n",
" 10 \n",
" 321.994624 \n",
" 0.180283 \n",
" 7 \n",
" 1.824347 \n",
" EMEA \n",
" Cloud \n",
" Low \n",
" Credit Card \n",
" 303.608794 \n",
" Cancelled \n",
" \n",
" \n",
" 3997 \n",
" 5.855895 \n",
" 6 \n",
" 416.044214 \n",
" 0.249029 \n",
" 2 \n",
" 0.954662 \n",
" EMEA \n",
" On-premise \n",
" Medium \n",
" PayPal \n",
" 402.431913 \n",
" Cancelled \n",
" \n",
" \n",
" 3998 \n",
" 5.337336 \n",
" 11 \n",
" 489.287493 \n",
" 0.215548 \n",
" 5 \n",
" 0.560760 \n",
" LATAM \n",
" SaaS \n",
" Low \n",
" PayPal \n",
" 466.154060 \n",
" Cancelled \n",
" \n",
" \n",
" 3999 \n",
" 4.185158 \n",
" 14 \n",
" 559.004143 \n",
" 0.140213 \n",
" 0 \n",
" 2.450295 \n",
" EMEA \n",
" Cloud \n",
" Medium \n",
" Bitcoin \n",
" 532.405465 \n",
" Cancelled \n",
" \n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Days_to_Delivery Num_Items_Ordered Order_Value Discount_Rate \\\n",
"3995 5.471104 17 432.846658 0.041176 \n",
"3996 0.689275 10 321.994624 0.180283 \n",
"3997 5.855895 6 416.044214 0.249029 \n",
"3998 5.337336 11 489.287493 0.215548 \n",
"3999 4.185158 14 559.004143 0.140213 \n",
"\n",
" Num_Previous_Orders Delivery_Time_Variation Region Product_Category \\\n",
"3995 6 0.124429 LATAM On-premise \n",
"3996 7 1.824347 EMEA Cloud \n",
"3997 2 0.954662 EMEA On-premise \n",
"3998 5 0.560760 LATAM SaaS \n",
"3999 0 2.450295 EMEA Cloud \n",
"\n",
" Order_Priority Payment_Method Correlated_Order_Value Order_Cancelled \n",
"3995 Medium Bitcoin 410.113830 Cancelled \n",
"3996 Low Credit Card 303.608794 Cancelled \n",
"3997 Medium PayPal 402.431913 Cancelled \n",
"3998 Low PayPal 466.154060 Cancelled \n",
"3999 Medium Bitcoin 532.405465 Cancelled "
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Reading last 5 Rows of the data\n",
"\n",
"df.tail()"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "CeeludwGM2Hp",
"outputId": "680afd73-d9d9-49ab-ec11-48c3a5e70bd0"
},
"outputs": [
{
"data": {
"text/plain": [
"(4000, 12)"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Checking the shape of the data\n",
"\n",
"df.shape"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "zeI-xGVtM2J-",
"outputId": "2124dda5-8b7f-44d7-f32b-92428fe1dd59"
},
"outputs": [
{
"data": {
"text/plain": [
"4000"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Checking the shape of the data\n",
"\n",
"df.shape[0]"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "ruMfNGlGLg74",
"outputId": "71c745fd-3fca-47bb-e94e-087c03a19479"
},
"outputs": [
{
"data": {
"text/plain": [
"12"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Checking the shape of the data\n",
"\n",
"df.shape[1]"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "I5gImKt3NN9t",
"outputId": "7cc7f093-73a2-49cb-ec5f-74f89365ad5b"
},
"outputs": [
{
"data": {
"text/plain": [
"Index(['Days_to_Delivery', 'Num_Items_Ordered', 'Order_Value', 'Discount_Rate',\n",
" 'Num_Previous_Orders', 'Delivery_Time_Variation', 'Region',\n",
" 'Product_Category', 'Order_Priority', 'Payment_Method',\n",
" 'Correlated_Order_Value', 'Order_Cancelled'],\n",
" dtype='object')"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"#Reading the name of the columns\n",
"\n",
"df.columns"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 460
},
"id": "BQISHgDLNOAh",
"outputId": "cbd91402-f6fa-485c-ad0f-283bc009cbce"
},
"outputs": [
{
"data": {
"text/plain": [
"Days_to_Delivery float64\n",
"Num_Items_Ordered int64\n",
"Order_Value float64\n",
"Discount_Rate float64\n",
"Num_Previous_Orders int64\n",
"Delivery_Time_Variation float64\n",
"Region object\n",
"Product_Category object\n",
"Order_Priority object\n",
"Payment_Method object\n",
"Correlated_Order_Value float64\n",
"Order_Cancelled object\n",
"dtype: object"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# View datatypes of allcolumns of dataset\n",
"\n",
"df.dtypes"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "eBJAd-ZJNODY",
"outputId": "1734aa11-77b4-42c7-e2c4-175dc024baa7"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"RangeIndex: 4000 entries, 0 to 3999\n",
"Data columns (total 12 columns):\n",
" # Column Non-Null Count Dtype \n",
"--- ------ -------------- ----- \n",
" 0 Days_to_Delivery 4000 non-null float64\n",
" 1 Num_Items_Ordered 4000 non-null int64 \n",
" 2 Order_Value 4000 non-null float64\n",
" 3 Discount_Rate 4000 non-null float64\n",
" 4 Num_Previous_Orders 4000 non-null int64 \n",
" 5 Delivery_Time_Variation 4000 non-null float64\n",
" 6 Region 4000 non-null object \n",
" 7 Product_Category 4000 non-null object \n",
" 8 Order_Priority 4000 non-null object \n",
" 9 Payment_Method 4000 non-null object \n",
" 10 Correlated_Order_Value 4000 non-null float64\n",
" 11 Order_Cancelled 4000 non-null object \n",
"dtypes: float64(5), int64(2), object(5)\n",
"memory usage: 375.1+ KB\n"
]
}
],
"source": [
"# View info of Columns of the dataset such as number of entries, name of columns and data type\n",
"\n",
"df.info()"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 425
},
"id": "rYTOY_8QNwD_",
"outputId": "4babd1fe-d66e-4cda-ef08-8bbb07fe153e"
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" \n",
" 0 \n",
" \n",
" \n",
" \n",
" \n",
" Days_to_Delivery \n",
" float64 \n",
" \n",
" \n",
" Num_Items_Ordered \n",
" int64 \n",
" \n",
" \n",
" Order_Value \n",
" float64 \n",
" \n",
" \n",
" Discount_Rate \n",
" float64 \n",
" \n",
" \n",
" Num_Previous_Orders \n",
" int64 \n",
" \n",
" \n",
" Delivery_Time_Variation \n",
" float64 \n",
" \n",
" \n",
" Region \n",
" object \n",
" \n",
" \n",
" Product_Category \n",
" object \n",
" \n",
" \n",
" Order_Priority \n",
" object \n",
" \n",
" \n",
" Payment_Method \n",
" object \n",
" \n",
" \n",
" Correlated_Order_Value \n",
" float64 \n",
" \n",
" \n",
" Order_Cancelled \n",
" object \n",
" \n",
" \n",
"
\n",
"
"
],
"text/plain": [
" 0\n",
"Days_to_Delivery float64\n",
"Num_Items_Ordered int64\n",
"Order_Value float64\n",
"Discount_Rate float64\n",
"Num_Previous_Orders int64\n",
"Delivery_Time_Variation float64\n",
"Region object\n",
"Product_Category object\n",
"Order_Priority object\n",
"Payment_Method object\n",
"Correlated_Order_Value float64\n",
"Order_Cancelled object"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Creating the Data Dictionary with first column being datatype.\n",
"\n",
"Data_dict = pd.DataFrame(df.dtypes)\n",
"Data_dict"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 425
},
"id": "AFEIru6ONOF_",
"outputId": "d3011442-0ae7-4184-82d5-e4ebc1bf80e7"
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" \n",
" 0 \n",
" UniqueVal \n",
" \n",
" \n",
" \n",
" \n",
" Days_to_Delivery \n",
" float64 \n",
" 4000 \n",
" \n",
" \n",
" Num_Items_Ordered \n",
" int64 \n",
" 19 \n",
" \n",
" \n",
" Order_Value \n",
" float64 \n",
" 4000 \n",
" \n",
" \n",
" Discount_Rate \n",
" float64 \n",
" 4000 \n",
" \n",
" \n",
" Num_Previous_Orders \n",
" int64 \n",
" 10 \n",
" \n",
" \n",
" Delivery_Time_Variation \n",
" float64 \n",
" 4000 \n",
" \n",
" \n",
" Region \n",
" object \n",
" 4 \n",
" \n",
" \n",
" Product_Category \n",
" object \n",
" 4 \n",
" \n",
" \n",
" Order_Priority \n",
" object \n",
" 3 \n",
" \n",
" \n",
" Payment_Method \n",
" object \n",
" 4 \n",
" \n",
" \n",
" Correlated_Order_Value \n",
" float64 \n",
" 4000 \n",
" \n",
" \n",
" Order_Cancelled \n",
" object \n",
" 2 \n",
" \n",
" \n",
"
\n",
"
"
],
"text/plain": [
" 0 UniqueVal\n",
"Days_to_Delivery float64 4000\n",
"Num_Items_Ordered int64 19\n",
"Order_Value float64 4000\n",
"Discount_Rate float64 4000\n",
"Num_Previous_Orders int64 10\n",
"Delivery_Time_Variation float64 4000\n",
"Region object 4\n",
"Product_Category object 4\n",
"Order_Priority object 3\n",
"Payment_Method object 4\n",
"Correlated_Order_Value float64 4000\n",
"Order_Cancelled object 2"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Identifying unique values . For this we used nunique() which returns unique elements in the object.\n",
"\n",
"Data_dict['UniqueVal'] = df.nunique()\n",
"Data_dict"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "vqLry_YsN2DX"
},
"source": [
"# **Discriptive Statistics**"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 300
},
"id": "7IZ56wrSNOJH",
"outputId": "8b7ca19f-d1d6-4cbf-9ec7-42844febaa89"
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" \n",
" Days_to_Delivery \n",
" Num_Items_Ordered \n",
" Order_Value \n",
" Discount_Rate \n",
" Num_Previous_Orders \n",
" Delivery_Time_Variation \n",
" Correlated_Order_Value \n",
" \n",
" \n",
" \n",
" \n",
" count \n",
" 4000.000000 \n",
" 4000.000000 \n",
" 4000.000000 \n",
" 4000.000000 \n",
" 4000.000000 \n",
" 4000.000000 \n",
" 4000.000000 \n",
" \n",
" \n",
" mean \n",
" 5.042184 \n",
" 10.116250 \n",
" 501.181391 \n",
" 0.245425 \n",
" 4.534500 \n",
" 1.497161 \n",
" 476.032221 \n",
" \n",
" \n",
" std \n",
" 1.969642 \n",
" 5.401438 \n",
" 100.872740 \n",
" 0.142842 \n",
" 2.877826 \n",
" 0.869979 \n",
" 95.924241 \n",
" \n",
" \n",
" min \n",
" -1.482535 \n",
" 1.000000 \n",
" 166.510729 \n",
" 0.000063 \n",
" 0.000000 \n",
" 0.003067 \n",
" 150.331593 \n",
" \n",
" \n",
" 25% \n",
" 3.707940 \n",
" 5.000000 \n",
" 433.520531 \n",
" 0.122622 \n",
" 2.000000 \n",
" 0.757221 \n",
" 411.936377 \n",
" \n",
" \n",
" 50% \n",
" 5.025536 \n",
" 10.000000 \n",
" 501.165442 \n",
" 0.243107 \n",
" 5.000000 \n",
" 1.509437 \n",
" 475.467864 \n",
" \n",
" \n",
" 75% \n",
" 6.309499 \n",
" 15.000000 \n",
" 569.059613 \n",
" 0.366216 \n",
" 7.000000 \n",
" 2.238121 \n",
" 540.587883 \n",
" \n",
" \n",
" max \n",
" 13.295790 \n",
" 19.000000 \n",
" 860.283214 \n",
" 0.499804 \n",
" 9.000000 \n",
" 2.999922 \n",
" 819.532127 \n",
" \n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Days_to_Delivery Num_Items_Ordered Order_Value Discount_Rate \\\n",
"count 4000.000000 4000.000000 4000.000000 4000.000000 \n",
"mean 5.042184 10.116250 501.181391 0.245425 \n",
"std 1.969642 5.401438 100.872740 0.142842 \n",
"min -1.482535 1.000000 166.510729 0.000063 \n",
"25% 3.707940 5.000000 433.520531 0.122622 \n",
"50% 5.025536 10.000000 501.165442 0.243107 \n",
"75% 6.309499 15.000000 569.059613 0.366216 \n",
"max 13.295790 19.000000 860.283214 0.499804 \n",
"\n",
" Num_Previous_Orders Delivery_Time_Variation Correlated_Order_Value \n",
"count 4000.000000 4000.000000 4000.000000 \n",
"mean 4.534500 1.497161 476.032221 \n",
"std 2.877826 0.869979 95.924241 \n",
"min 0.000000 0.003067 150.331593 \n",
"25% 2.000000 0.757221 411.936377 \n",
"50% 5.000000 1.509437 475.467864 \n",
"75% 7.000000 2.238121 540.587883 \n",
"max 9.000000 2.999922 819.532127 "
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# view the descriptive statistics of the dataset\n",
"\n",
"df.describe()"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 300
},
"id": "L9xbDtWNNOLu",
"outputId": "966985b4-6ec4-44e0-bdbc-756b14d08b6e"
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" \n",
" Days_to_Delivery \n",
" Num_Items_Ordered \n",
" Order_Value \n",
" Discount_Rate \n",
" Num_Previous_Orders \n",
" Delivery_Time_Variation \n",
" Correlated_Order_Value \n",
" \n",
" \n",
" \n",
" \n",
" count \n",
" 4000.000000 \n",
" 4000.000000 \n",
" 4000.000000 \n",
" 4000.000000 \n",
" 4000.000000 \n",
" 4000.000000 \n",
" 4000.000000 \n",
" \n",
" \n",
" mean \n",
" 5.042184 \n",
" 10.116250 \n",
" 501.181391 \n",
" 0.245425 \n",
" 4.534500 \n",
" 1.497161 \n",
" 476.032221 \n",
" \n",
" \n",
" std \n",
" 1.969642 \n",
" 5.401438 \n",
" 100.872740 \n",
" 0.142842 \n",
" 2.877826 \n",
" 0.869979 \n",
" 95.924241 \n",
" \n",
" \n",
" min \n",
" -1.482535 \n",
" 1.000000 \n",
" 166.510729 \n",
" 0.000063 \n",
" 0.000000 \n",
" 0.003067 \n",
" 150.331593 \n",
" \n",
" \n",
" 25% \n",
" 3.707940 \n",
" 5.000000 \n",
" 433.520531 \n",
" 0.122622 \n",
" 2.000000 \n",
" 0.757221 \n",
" 411.936377 \n",
" \n",
" \n",
" 50% \n",
" 5.025536 \n",
" 10.000000 \n",
" 501.165442 \n",
" 0.243107 \n",
" 5.000000 \n",
" 1.509437 \n",
" 475.467864 \n",
" \n",
" \n",
" 75% \n",
" 6.309499 \n",
" 15.000000 \n",
" 569.059613 \n",
" 0.366216 \n",
" 7.000000 \n",
" 2.238121 \n",
" 540.587883 \n",
" \n",
" \n",
" max \n",
" 13.295790 \n",
" 19.000000 \n",
" 860.283214 \n",
" 0.499804 \n",
" 9.000000 \n",
" 2.999922 \n",
" 819.532127 \n",
" \n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Days_to_Delivery Num_Items_Ordered Order_Value Discount_Rate \\\n",
"count 4000.000000 4000.000000 4000.000000 4000.000000 \n",
"mean 5.042184 10.116250 501.181391 0.245425 \n",
"std 1.969642 5.401438 100.872740 0.142842 \n",
"min -1.482535 1.000000 166.510729 0.000063 \n",
"25% 3.707940 5.000000 433.520531 0.122622 \n",
"50% 5.025536 10.000000 501.165442 0.243107 \n",
"75% 6.309499 15.000000 569.059613 0.366216 \n",
"max 13.295790 19.000000 860.283214 0.499804 \n",
"\n",
" Num_Previous_Orders Delivery_Time_Variation Correlated_Order_Value \n",
"count 4000.000000 4000.000000 4000.000000 \n",
"mean 4.534500 1.497161 476.032221 \n",
"std 2.877826 0.869979 95.924241 \n",
"min 0.000000 0.003067 150.331593 \n",
"25% 2.000000 0.757221 411.936377 \n",
"50% 5.000000 1.509437 475.467864 \n",
"75% 7.000000 2.238121 540.587883 \n",
"max 9.000000 2.999922 819.532127 "
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Get discriptive statistics on \"number\" datatypes\n",
"\n",
"df.describe(include = ['number'])"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 175
},
"id": "agAPiTi2N7Xt",
"outputId": "193b47cc-2de4-481e-cacf-80adc0136f5b"
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" \n",
" Region \n",
" Product_Category \n",
" Order_Priority \n",
" Payment_Method \n",
" Order_Cancelled \n",
" \n",
" \n",
" \n",
" \n",
" count \n",
" 4000 \n",
" 4000 \n",
" 4000 \n",
" 4000 \n",
" 4000 \n",
" \n",
" \n",
" unique \n",
" 4 \n",
" 4 \n",
" 3 \n",
" 4 \n",
" 2 \n",
" \n",
" \n",
" top \n",
" EMEA \n",
" SaaS \n",
" Low \n",
" Bank Transfer \n",
" Not-Cancelled \n",
" \n",
" \n",
" freq \n",
" 1022 \n",
" 1046 \n",
" 1356 \n",
" 1029 \n",
" 2614 \n",
" \n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Region Product_Category Order_Priority Payment_Method Order_Cancelled\n",
"count 4000 4000 4000 4000 4000\n",
"unique 4 4 3 4 2\n",
"top EMEA SaaS Low Bank Transfer Not-Cancelled\n",
"freq 1022 1046 1356 1029 2614"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Get discriptive statistics on \"objects\" datatypes\n",
"\n",
"df.describe(include = ['object'])"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 333
},
"id": "eS24nbz9s2Jo",
"outputId": "e48861f4-0722-4ecd-d0fa-53ca02518b89"
},
"outputs": [
{
"data": {
"image/png": "",
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# Drawing a count plot for the target column 'Order_Status'\n",
"import seaborn as sns\n",
"import matplotlib.pyplot as plt\n",
"\n",
"plt.figure(figsize=(5,3))\n",
"sns.countplot(x='Order_Cancelled', data=df, width=0.3) # Default width is 0.8, 0.3 is 70% shorter\n",
"plt.title('Count Plot of Order_Status')\n",
"plt.xlabel('Order Status')\n",
"plt.ylabel('Count')\n",
"plt.show()\n"
]
},
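{
"cell_type": "markdown",
"metadata": {},
"source": [
"The class balance shown in the count plot can also be expressed as proportions. The sketch below rebuilds the target from the counts reported by `df.describe(include=['object'])` (2614 of 4000 rows are Not-Cancelled, so the remaining 1386 are assumed to be Cancelled); on the real data the same two calls would be applied to `df['Order_Cancelled']`.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Rebuild the target column from the reported class counts\n",
"target = pd.Series(['Not-Cancelled'] * 2614 + ['Cancelled'] * 1386, name='Order_Cancelled')\n",
"\n",
"counts = target.value_counts()                         # absolute class counts\n",
"shares = target.value_counts(normalize=True).round(4)  # class proportions\n",
"print(pd.DataFrame({'count': counts, 'share': shares}))\n",
"```\n",
"\n",
"With roughly a 65/35 split, accuracy alone can look better than it is, so precision and recall on the Cancelled class are worth tracking as well."
]
},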
{
"cell_type": "code",
"execution_count": 23,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 1000
},
"id": "_t9iG84UtK2J",
"outputId": "aab70c7b-bafe-4b0d-dc90-323cc16b9781"
},
"outputs": [
{
"data": {
"image/png": "",
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# Redefining the cleaned categorical columns for plotting\n",
"categorical_columns = df.select_dtypes(include=['object']).columns.tolist()\n",
"\n",
"# Calculate the number of rows needed for subplots\n",
"num_rows = int(np.ceil(len(categorical_columns) / 3)) # Calculate rows needed\n",
"\n",
"# Step 2: Draw count plots for each categorical column\n",
"plt.figure(figsize=(16, 8 * num_rows)) # Adjust figure height based on rows\n",
"for i, column in enumerate(categorical_columns, 1):\n",
" plt.subplot(num_rows, 3, i) # Use calculated num_rows\n",
" sns.countplot(x=df[column], width=0.6) # Reduce width by 40% (default is 0.8)\n",
" plt.title(f'Count Plot of {column}')\n",
" plt.xticks(rotation=45)\n",
"\n",
"plt.tight_layout()\n",
"plt.show()\n"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 226
},
"id": "tZy5sPVDYY3g",
"outputId": "2440bb5d-6e96-4879-d4d7-1cc2687a5cae"
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" \n",
" Days_to_Delivery \n",
" Num_Items_Ordered \n",
" Order_Value \n",
" Discount_Rate \n",
" Num_Previous_Orders \n",
" Delivery_Time_Variation \n",
" Region \n",
" Product_Category \n",
" Order_Priority \n",
" Payment_Method \n",
" Correlated_Order_Value \n",
" Order_Cancelled \n",
" \n",
" \n",
" \n",
" \n",
" 0 \n",
" 5.993428 \n",
" 6 \n",
" 716.505607 \n",
" 0.098621 \n",
" 5 \n",
" 1.129165 \n",
" EMEA \n",
" On-premise \n",
" Low \n",
" Bitcoin \n",
" 677.114236 \n",
" Not-Cancelled \n",
" \n",
" \n",
" 1 \n",
" 4.723471 \n",
" 11 \n",
" 619.054859 \n",
" 0.106769 \n",
" 7 \n",
" 0.889863 \n",
" APAC \n",
" Cloud \n",
" Medium \n",
" Bitcoin \n",
" 591.127949 \n",
" Not-Cancelled \n",
" \n",
" \n",
" 2 \n",
" 6.295377 \n",
" 3 \n",
" 521.257403 \n",
" 0.338047 \n",
" 0 \n",
" 2.668340 \n",
" LATAM \n",
" On-premise \n",
" Medium \n",
" Bank Transfer \n",
" 502.195055 \n",
" Not-Cancelled \n",
" \n",
" \n",
" 3 \n",
" 8.046060 \n",
" 8 \n",
" 602.698626 \n",
" 0.202501 \n",
" 4 \n",
" 2.998095 \n",
" LATAM \n",
" On-premise \n",
" High \n",
" PayPal \n",
" 566.850797 \n",
" Not-Cancelled \n",
" \n",
" \n",
" 4 \n",
" 4.531693 \n",
" 3 \n",
" 610.590042 \n",
" 0.465772 \n",
" 2 \n",
" 2.011061 \n",
" APAC \n",
" Hardware \n",
" High \n",
" Bank Transfer \n",
" 581.693494 \n",
" Not-Cancelled \n",
" \n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Days_to_Delivery Num_Items_Ordered Order_Value Discount_Rate \\\n",
"0 5.993428 6 716.505607 0.098621 \n",
"1 4.723471 11 619.054859 0.106769 \n",
"2 6.295377 3 521.257403 0.338047 \n",
"3 8.046060 8 602.698626 0.202501 \n",
"4 4.531693 3 610.590042 0.465772 \n",
"\n",
" Num_Previous_Orders Delivery_Time_Variation Region Product_Category \\\n",
"0 5 1.129165 EMEA On-premise \n",
"1 7 0.889863 APAC Cloud \n",
"2 0 2.668340 LATAM On-premise \n",
"3 4 2.998095 LATAM On-premise \n",
"4 2 2.011061 APAC Hardware \n",
"\n",
" Order_Priority Payment_Method Correlated_Order_Value Order_Cancelled \n",
"0 Low Bitcoin 677.114236 Not-Cancelled \n",
"1 Medium Bitcoin 591.127949 Not-Cancelled \n",
"2 Medium Bank Transfer 502.195055 Not-Cancelled \n",
"3 High PayPal 566.850797 Not-Cancelled \n",
"4 High Bank Transfer 581.693494 Not-Cancelled "
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Reading first 5 Rows of the data\n",
"\n",
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "nUdbr1o_N_gS"
},
"source": [
"# **Data Cleaning**"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "QfjjBlsFN7am",
"outputId": "f435e4a4-651c-4207-8ccd-bca054be3edf"
},
"outputs": [
{
"data": {
"text/plain": [
"0"
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Count fully duplicated rows\n",
"df.duplicated().sum()"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 460
},
"id": "6--mdHoFN7dH",
"outputId": "398a2ad1-2fc8-4012-c604-aff6a95f6300"
},
"outputs": [
{
"data": {
"text/plain": [
"Days_to_Delivery 0\n",
"Num_Items_Ordered 0\n",
"Order_Value 0\n",
"Discount_Rate 0\n",
"Num_Previous_Orders 0\n",
"Delivery_Time_Variation 0\n",
"Region 0\n",
"Product_Category 0\n",
"Order_Priority 0\n",
"Payment_Method 0\n",
"Correlated_Order_Value 0\n",
"Order_Cancelled 0\n",
"dtype: int64"
]
},
"execution_count": 26,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Count null values in each column\n",
"\n",
"df.isnull().sum()"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "YTObg1D8QQFP"
},
"source": [
"## **Categorical Attributes: Dummy Encoding**"
]
},
{
"cell_type": "code",
"execution_count": 103,
"metadata": {
"id": "Lf0g2aV5uxKD"
},
"outputs": [],
"source": [
"# Encode the target and one-hot encode the categorical features\n",
"from sklearn.preprocessing import LabelEncoder, OneHotEncoder\n",
"\n",
"df_encoded = df.copy()\n",
"\n",
"# Target: LabelEncoder maps the string labels to integer codes (classes sorted alphabetically)\n",
"label_encoder = LabelEncoder()\n",
"df_encoded['Order_Cancelled'] = label_encoder.fit_transform(df_encoded['Order_Cancelled'])\n",
"\n",
"# Features: append one 0/1 column per category level; the original string columns remain in df_encoded\n",
"categorical_cols = ['Region', 'Product_Category', 'Order_Priority', 'Payment_Method']\n",
"ohe = OneHotEncoder()\n",
"for col in categorical_cols:\n",
"    # Fit first so that get_feature_names_out() reflects the column just encoded\n",
"    encoded = ohe.fit_transform(df_encoded[[col]]).toarray()\n",
"    df_encoded[ohe.get_feature_names_out()] = encoded"
]
},
{
"cell_type": "code",
"execution_count": 104,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "YMwpQYrHQrz-",
"outputId": "c0903189-bd37-4e01-a870-158cfa61af5c"
},
"outputs": [
{
"data": {
"text/plain": [
"Index(['Days_to_Delivery', 'Num_Items_Ordered', 'Order_Value', 'Discount_Rate',\n",
" 'Num_Previous_Orders', 'Delivery_Time_Variation', 'Region',\n",
" 'Product_Category', 'Order_Priority', 'Payment_Method',\n",
" 'Correlated_Order_Value', 'Order_Cancelled', 'Region_APAC',\n",
" 'Region_EMEA', 'Region_LATAM', 'Region_North America',\n",
" 'Product_Category_Cloud', 'Product_Category_Hardware',\n",
" 'Product_Category_On-premise', 'Product_Category_SaaS',\n",
" 'Order_Priority_High', 'Order_Priority_Low', 'Order_Priority_Medium',\n",
" 'Payment_Method_Bank Transfer', 'Payment_Method_Bitcoin',\n",
" 'Payment_Method_Credit Card', 'Payment_Method_PayPal'],\n",
" dtype='object')"
]
},
"execution_count": 104,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_encoded.columns"
]
},
{
"cell_type": "code",
"execution_count": 105,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" Days_to_Delivery Num_Items_Ordered Order_Value Discount_Rate \\\n",
"0 5.993428 6 716.505607 0.098621 \n",
"1 4.723471 11 619.054859 0.106769 \n",
"2 6.295377 3 521.257403 0.338047 \n",
"3 8.046060 8 602.698626 0.202501 \n",
"4 4.531693 3 610.590042 0.465772 \n",
"\n",
" Num_Previous_Orders Delivery_Time_Variation Region Product_Category \\\n",
"0 5 1.129165 EMEA On-premise \n",
"1 7 0.889863 APAC Cloud \n",
"2 0 2.668340 LATAM On-premise \n",
"3 4 2.998095 LATAM On-premise \n",
"4 2 2.011061 APAC Hardware \n",
"\n",
" Order_Priority Payment_Method ... Product_Category_Hardware \\\n",
"0 Low Bitcoin ... 0.0 \n",
"1 Medium Bitcoin ... 0.0 \n",
"2 Medium Bank Transfer ... 0.0 \n",
"3 High PayPal ... 0.0 \n",
"4 High Bank Transfer ... 1.0 \n",
"\n",
" Product_Category_On-premise Product_Category_SaaS Order_Priority_High \\\n",
"0 1.0 0.0 0.0 \n",
"1 0.0 0.0 0.0 \n",
"2 1.0 0.0 0.0 \n",
"3 1.0 0.0 1.0 \n",
"4 0.0 0.0 1.0 \n",
"\n",
" Order_Priority_Low Order_Priority_Medium Payment_Method_Bank Transfer \\\n",
"0 1.0 0.0 0.0 \n",
"1 0.0 1.0 0.0 \n",
"2 0.0 1.0 1.0 \n",
"3 0.0 0.0 0.0 \n",
"4 0.0 0.0 1.0 \n",
"\n",
" Payment_Method_Bitcoin Payment_Method_Credit Card Payment_Method_PayPal \n",
"0 1.0 0.0 0.0 \n",
"1 1.0 0.0 0.0 \n",
"2 0.0 0.0 0.0 \n",
"3 0.0 0.0 1.0 \n",
"4 0.0 0.0 0.0 \n",
"\n",
"[5 rows x 27 columns]"
]
},
"execution_count": 105,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_encoded.head()"
]
},
{
"cell_type": "code",
"execution_count": 106,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 1000
},
"id": "12myDXivQJi7",
"outputId": "290fa02b-5529-435d-f8fd-dcd06699f78a"
},
"outputs": [],
"source": [
"# Visualizing the pairplot of the complete dataset\n",
"\n",
"#sns.pairplot(df_encoded)"
]
},
{
"cell_type": "code",
"execution_count": 107,
"metadata": {
"id": "3gOaZeOc2EEC"
},
"outputs": [],
"source": [
"from sklearn.model_selection import train_test_split\n",
"from sklearn.preprocessing import LabelEncoder, StandardScaler\n",
"from sklearn.linear_model import LogisticRegression\n",
"from sklearn.metrics import roc_auc_score, roc_curve, confusion_matrix, classification_report\n",
"from sklearn.naive_bayes import GaussianNB\n",
"import statsmodels.api as sm\n",
"import matplotlib.pyplot as plt\n",
"import numpy as np\n",
"import pandas as pd\n",
"from statsmodels.stats.outliers_influence import variance_inflation_factor"
]
},
{
"cell_type": "code",
"execution_count": 108,
"metadata": {
"id": "nLcNJky812Mj"
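},
"outputs": [],
"source": [
"# Separate features and target.\n",
"# The raw categorical columns are dropped because their one-hot encoded versions are already present;\n",
"# string columns left in X cause the VIF computation and the scaler to fail.\n",
"X = df_encoded.drop(['Order_Cancelled', 'Correlated_Order_Value', 'Order_Value',\n",
"                     'Region', 'Product_Category', 'Order_Priority', 'Payment_Method'], axis=1)\n",
"y = df_encoded['Order_Cancelled']"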
]
},
{
"cell_type": "code",
"execution_count": 109,
"metadata": {
"id": "4IH0XfB012O4"
},
"outputs": [
{
"ename": "TypeError",
"evalue": "ufunc 'isfinite' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''",
"output_type": "error",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[0;31mTypeError\u001b[0m Traceback (most recent call last)",
"Cell \u001b[0;32mIn[109], line 4\u001b[0m\n\u001b[1;32m 2\u001b[0m vif_data \u001b[38;5;241m=\u001b[39m pd\u001b[38;5;241m.\u001b[39mDataFrame()\n\u001b[1;32m 3\u001b[0m vif_data[\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mfeature\u001b[39m\u001b[38;5;124m\"\u001b[39m] \u001b[38;5;241m=\u001b[39m X\u001b[38;5;241m.\u001b[39mcolumns\n\u001b[0;32m----> 4\u001b[0m vif_data[\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mVIF\u001b[39m\u001b[38;5;124m\"\u001b[39m] \u001b[38;5;241m=\u001b[39m [variance_inflation_factor(X\u001b[38;5;241m.\u001b[39mvalues, i) \u001b[38;5;28;01mfor\u001b[39;00m i \u001b[38;5;129;01min\u001b[39;00m \u001b[38;5;28mrange\u001b[39m(\u001b[38;5;28mlen\u001b[39m(X\u001b[38;5;241m.\u001b[39mcolumns))]\n\u001b[1;32m 5\u001b[0m \u001b[38;5;66;03m# Arrange VIF values in descending order\u001b[39;00m\n\u001b[1;32m 6\u001b[0m vif_data \u001b[38;5;241m=\u001b[39m vif_data\u001b[38;5;241m.\u001b[39msort_values(by\u001b[38;5;241m=\u001b[39m\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mVIF\u001b[39m\u001b[38;5;124m\"\u001b[39m, ascending\u001b[38;5;241m=\u001b[39m\u001b[38;5;28;01mFalse\u001b[39;00m)\n",
"File \u001b[0;32m/opt/homebrew/Caskroom/miniconda/base/envs/accelai/lib/python3.12/site-packages/statsmodels/stats/outliers_influence.py:196\u001b[0m, in \u001b[0;36mvariance_inflation_factor\u001b[0;34m(exog, exog_idx)\u001b[0m\n\u001b[1;32m 194\u001b[0m mask \u001b[38;5;241m=\u001b[39m np\u001b[38;5;241m.\u001b[39marange(k_vars) \u001b[38;5;241m!=\u001b[39m exog_idx\n\u001b[1;32m 195\u001b[0m x_noti \u001b[38;5;241m=\u001b[39m exog[:, mask]\n\u001b[0;32m--> 196\u001b[0m r_squared_i \u001b[38;5;241m=\u001b[39m OLS(x_i, x_noti)\u001b[38;5;241m.\u001b[39mfit()\u001b[38;5;241m.\u001b[39mrsquared\n\u001b[1;32m 197\u001b[0m vif \u001b[38;5;241m=\u001b[39m \u001b[38;5;241m1.\u001b[39m \u001b[38;5;241m/\u001b[39m (\u001b[38;5;241m1.\u001b[39m \u001b[38;5;241m-\u001b[39m r_squared_i)\n\u001b[1;32m 198\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m vif\n",
"File \u001b[0;32m/opt/homebrew/Caskroom/miniconda/base/envs/accelai/lib/python3.12/site-packages/statsmodels/regression/linear_model.py:924\u001b[0m, in \u001b[0;36mOLS.__init__\u001b[0;34m(self, endog, exog, missing, hasconst, **kwargs)\u001b[0m\n\u001b[1;32m 921\u001b[0m msg \u001b[38;5;241m=\u001b[39m (\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mWeights are not supported in OLS and will be ignored\u001b[39m\u001b[38;5;124m\"\u001b[39m\n\u001b[1;32m 922\u001b[0m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mAn exception will be raised in the next version.\u001b[39m\u001b[38;5;124m\"\u001b[39m)\n\u001b[1;32m 923\u001b[0m warnings\u001b[38;5;241m.\u001b[39mwarn(msg, ValueWarning)\n\u001b[0;32m--> 924\u001b[0m \u001b[38;5;28msuper\u001b[39m()\u001b[38;5;241m.\u001b[39m\u001b[38;5;21m__init__\u001b[39m(endog, exog, missing\u001b[38;5;241m=\u001b[39mmissing,\n\u001b[1;32m 925\u001b[0m hasconst\u001b[38;5;241m=\u001b[39mhasconst, \u001b[38;5;241m*\u001b[39m\u001b[38;5;241m*\u001b[39mkwargs)\n\u001b[1;32m 926\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mweights\u001b[39m\u001b[38;5;124m\"\u001b[39m \u001b[38;5;129;01min\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_init_keys:\n\u001b[1;32m 927\u001b[0m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_init_keys\u001b[38;5;241m.\u001b[39mremove(\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mweights\u001b[39m\u001b[38;5;124m\"\u001b[39m)\n",
"File \u001b[0;32m/opt/homebrew/Caskroom/miniconda/base/envs/accelai/lib/python3.12/site-packages/statsmodels/regression/linear_model.py:749\u001b[0m, in \u001b[0;36mWLS.__init__\u001b[0;34m(self, endog, exog, weights, missing, hasconst, **kwargs)\u001b[0m\n\u001b[1;32m 747\u001b[0m \u001b[38;5;28;01melse\u001b[39;00m:\n\u001b[1;32m 748\u001b[0m weights \u001b[38;5;241m=\u001b[39m weights\u001b[38;5;241m.\u001b[39msqueeze()\n\u001b[0;32m--> 749\u001b[0m \u001b[38;5;28msuper\u001b[39m()\u001b[38;5;241m.\u001b[39m\u001b[38;5;21m__init__\u001b[39m(endog, exog, missing\u001b[38;5;241m=\u001b[39mmissing,\n\u001b[1;32m 750\u001b[0m weights\u001b[38;5;241m=\u001b[39mweights, hasconst\u001b[38;5;241m=\u001b[39mhasconst, \u001b[38;5;241m*\u001b[39m\u001b[38;5;241m*\u001b[39mkwargs)\n\u001b[1;32m 751\u001b[0m nobs \u001b[38;5;241m=\u001b[39m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mexog\u001b[38;5;241m.\u001b[39mshape[\u001b[38;5;241m0\u001b[39m]\n\u001b[1;32m 752\u001b[0m weights \u001b[38;5;241m=\u001b[39m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mweights\n",
"File \u001b[0;32m/opt/homebrew/Caskroom/miniconda/base/envs/accelai/lib/python3.12/site-packages/statsmodels/regression/linear_model.py:203\u001b[0m, in \u001b[0;36mRegressionModel.__init__\u001b[0;34m(self, endog, exog, **kwargs)\u001b[0m\n\u001b[1;32m 202\u001b[0m \u001b[38;5;28;01mdef\u001b[39;00m \u001b[38;5;21m__init__\u001b[39m(\u001b[38;5;28mself\u001b[39m, endog, exog, \u001b[38;5;241m*\u001b[39m\u001b[38;5;241m*\u001b[39mkwargs):\n\u001b[0;32m--> 203\u001b[0m \u001b[38;5;28msuper\u001b[39m()\u001b[38;5;241m.\u001b[39m\u001b[38;5;21m__init__\u001b[39m(endog, exog, \u001b[38;5;241m*\u001b[39m\u001b[38;5;241m*\u001b[39mkwargs)\n\u001b[1;32m 204\u001b[0m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mpinv_wexog: Float64Array \u001b[38;5;241m|\u001b[39m \u001b[38;5;28;01mNone\u001b[39;00m \u001b[38;5;241m=\u001b[39m \u001b[38;5;28;01mNone\u001b[39;00m\n\u001b[1;32m 205\u001b[0m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_data_attr\u001b[38;5;241m.\u001b[39mextend([\u001b[38;5;124m'\u001b[39m\u001b[38;5;124mpinv_wexog\u001b[39m\u001b[38;5;124m'\u001b[39m, \u001b[38;5;124m'\u001b[39m\u001b[38;5;124mwendog\u001b[39m\u001b[38;5;124m'\u001b[39m, \u001b[38;5;124m'\u001b[39m\u001b[38;5;124mwexog\u001b[39m\u001b[38;5;124m'\u001b[39m, \u001b[38;5;124m'\u001b[39m\u001b[38;5;124mweights\u001b[39m\u001b[38;5;124m'\u001b[39m])\n",
"File \u001b[0;32m/opt/homebrew/Caskroom/miniconda/base/envs/accelai/lib/python3.12/site-packages/statsmodels/base/model.py:270\u001b[0m, in \u001b[0;36mLikelihoodModel.__init__\u001b[0;34m(self, endog, exog, **kwargs)\u001b[0m\n\u001b[1;32m 269\u001b[0m \u001b[38;5;28;01mdef\u001b[39;00m \u001b[38;5;21m__init__\u001b[39m(\u001b[38;5;28mself\u001b[39m, endog, exog\u001b[38;5;241m=\u001b[39m\u001b[38;5;28;01mNone\u001b[39;00m, \u001b[38;5;241m*\u001b[39m\u001b[38;5;241m*\u001b[39mkwargs):\n\u001b[0;32m--> 270\u001b[0m \u001b[38;5;28msuper\u001b[39m()\u001b[38;5;241m.\u001b[39m\u001b[38;5;21m__init__\u001b[39m(endog, exog, \u001b[38;5;241m*\u001b[39m\u001b[38;5;241m*\u001b[39mkwargs)\n\u001b[1;32m 271\u001b[0m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39minitialize()\n",
"File \u001b[0;32m/opt/homebrew/Caskroom/miniconda/base/envs/accelai/lib/python3.12/site-packages/statsmodels/base/model.py:95\u001b[0m, in \u001b[0;36mModel.__init__\u001b[0;34m(self, endog, exog, **kwargs)\u001b[0m\n\u001b[1;32m 93\u001b[0m missing \u001b[38;5;241m=\u001b[39m kwargs\u001b[38;5;241m.\u001b[39mpop(\u001b[38;5;124m'\u001b[39m\u001b[38;5;124mmissing\u001b[39m\u001b[38;5;124m'\u001b[39m, \u001b[38;5;124m'\u001b[39m\u001b[38;5;124mnone\u001b[39m\u001b[38;5;124m'\u001b[39m)\n\u001b[1;32m 94\u001b[0m hasconst \u001b[38;5;241m=\u001b[39m kwargs\u001b[38;5;241m.\u001b[39mpop(\u001b[38;5;124m'\u001b[39m\u001b[38;5;124mhasconst\u001b[39m\u001b[38;5;124m'\u001b[39m, \u001b[38;5;28;01mNone\u001b[39;00m)\n\u001b[0;32m---> 95\u001b[0m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mdata \u001b[38;5;241m=\u001b[39m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_handle_data(endog, exog, missing, hasconst,\n\u001b[1;32m 96\u001b[0m \u001b[38;5;241m*\u001b[39m\u001b[38;5;241m*\u001b[39mkwargs)\n\u001b[1;32m 97\u001b[0m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mk_constant \u001b[38;5;241m=\u001b[39m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mdata\u001b[38;5;241m.\u001b[39mk_constant\n\u001b[1;32m 98\u001b[0m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mexog \u001b[38;5;241m=\u001b[39m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mdata\u001b[38;5;241m.\u001b[39mexog\n",
"File \u001b[0;32m/opt/homebrew/Caskroom/miniconda/base/envs/accelai/lib/python3.12/site-packages/statsmodels/base/model.py:135\u001b[0m, in \u001b[0;36mModel._handle_data\u001b[0;34m(self, endog, exog, missing, hasconst, **kwargs)\u001b[0m\n\u001b[1;32m 134\u001b[0m \u001b[38;5;28;01mdef\u001b[39;00m \u001b[38;5;21m_handle_data\u001b[39m(\u001b[38;5;28mself\u001b[39m, endog, exog, missing, hasconst, \u001b[38;5;241m*\u001b[39m\u001b[38;5;241m*\u001b[39mkwargs):\n\u001b[0;32m--> 135\u001b[0m data \u001b[38;5;241m=\u001b[39m handle_data(endog, exog, missing, hasconst, \u001b[38;5;241m*\u001b[39m\u001b[38;5;241m*\u001b[39mkwargs)\n\u001b[1;32m 136\u001b[0m \u001b[38;5;66;03m# kwargs arrays could have changed, easier to just attach here\u001b[39;00m\n\u001b[1;32m 137\u001b[0m \u001b[38;5;28;01mfor\u001b[39;00m key \u001b[38;5;129;01min\u001b[39;00m kwargs:\n",
"File \u001b[0;32m/opt/homebrew/Caskroom/miniconda/base/envs/accelai/lib/python3.12/site-packages/statsmodels/base/data.py:675\u001b[0m, in \u001b[0;36mhandle_data\u001b[0;34m(endog, exog, missing, hasconst, **kwargs)\u001b[0m\n\u001b[1;32m 672\u001b[0m exog \u001b[38;5;241m=\u001b[39m np\u001b[38;5;241m.\u001b[39masarray(exog)\n\u001b[1;32m 674\u001b[0m klass \u001b[38;5;241m=\u001b[39m handle_data_class_factory(endog, exog)\n\u001b[0;32m--> 675\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m klass(endog, exog\u001b[38;5;241m=\u001b[39mexog, missing\u001b[38;5;241m=\u001b[39mmissing, hasconst\u001b[38;5;241m=\u001b[39mhasconst,\n\u001b[1;32m 676\u001b[0m \u001b[38;5;241m*\u001b[39m\u001b[38;5;241m*\u001b[39mkwargs)\n",
"File \u001b[0;32m/opt/homebrew/Caskroom/miniconda/base/envs/accelai/lib/python3.12/site-packages/statsmodels/base/data.py:88\u001b[0m, in \u001b[0;36mModelData.__init__\u001b[0;34m(self, endog, exog, missing, hasconst, **kwargs)\u001b[0m\n\u001b[1;32m 86\u001b[0m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mconst_idx \u001b[38;5;241m=\u001b[39m \u001b[38;5;28;01mNone\u001b[39;00m\n\u001b[1;32m 87\u001b[0m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mk_constant \u001b[38;5;241m=\u001b[39m \u001b[38;5;241m0\u001b[39m\n\u001b[0;32m---> 88\u001b[0m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_handle_constant(hasconst)\n\u001b[1;32m 89\u001b[0m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_check_integrity()\n\u001b[1;32m 90\u001b[0m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_cache \u001b[38;5;241m=\u001b[39m {}\n",
"File \u001b[0;32m/opt/homebrew/Caskroom/miniconda/base/envs/accelai/lib/python3.12/site-packages/statsmodels/base/data.py:133\u001b[0m, in \u001b[0;36mModelData._handle_constant\u001b[0;34m(self, hasconst)\u001b[0m\n\u001b[1;32m 131\u001b[0m check_implicit \u001b[38;5;241m=\u001b[39m \u001b[38;5;28;01mFalse\u001b[39;00m\n\u001b[1;32m 132\u001b[0m exog_max \u001b[38;5;241m=\u001b[39m np\u001b[38;5;241m.\u001b[39mmax(\u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mexog, axis\u001b[38;5;241m=\u001b[39m\u001b[38;5;241m0\u001b[39m)\n\u001b[0;32m--> 133\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m np\u001b[38;5;241m.\u001b[39misfinite(exog_max)\u001b[38;5;241m.\u001b[39mall():\n\u001b[1;32m 134\u001b[0m \u001b[38;5;28;01mraise\u001b[39;00m MissingDataError(\u001b[38;5;124m'\u001b[39m\u001b[38;5;124mexog contains inf or nans\u001b[39m\u001b[38;5;124m'\u001b[39m)\n\u001b[1;32m 135\u001b[0m exog_min \u001b[38;5;241m=\u001b[39m np\u001b[38;5;241m.\u001b[39mmin(\u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mexog, axis\u001b[38;5;241m=\u001b[39m\u001b[38;5;241m0\u001b[39m)\n",
"\u001b[0;31mTypeError\u001b[0m: ufunc 'isfinite' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''"
]
}
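],
"source": [
"# Check for multicollinearity using the Variance Inflation Factor (VIF).\n",
"# variance_inflation_factor needs a purely numeric array, so any string columns are excluded here.\n",
"X_numeric = X.select_dtypes(include='number').astype(float)\n",
"vif_data = pd.DataFrame()\n",
"vif_data[\"feature\"] = X_numeric.columns\n",
"vif_data[\"VIF\"] = [variance_inflation_factor(X_numeric.values, i) for i in range(X_numeric.shape[1])]\n",
"# Arrange VIF values in descending order\n",
"vif_data = vif_data.sort_values(by=\"VIF\", ascending=False)"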
]
},
{
"cell_type": "code",
"execution_count": 101,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 394
},
"id": "zi5sGQBQ12Rw",
"outputId": "a5171acd-7487-40a2-f959-14fa9083fb7c"
},
"outputs": [
{
"data": {
"text/plain": [
" feature\n",
"0 Days_to_Delivery\n",
"1 Num_Items_Ordered\n",
"2 Discount_Rate\n",
"3 Num_Previous_Orders\n",
"4 Delivery_Time_Variation\n",
"5 Region\n",
"6 Product_Category\n",
"7 Order_Priority\n",
"8 Payment_Method\n",
"9 Region_APAC\n",
"10 Region_EMEA\n",
"11 Region_LATAM\n",
"12 Region_North America\n",
"13 Product_Category_Cloud\n",
"14 Product_Category_Hardware\n",
"15 Product_Category_On-premise\n",
"16 Product_Category_SaaS\n",
"17 Order_Priority_High\n",
"18 Order_Priority_Low\n",
"19 Order_Priority_Medium\n",
"20 Payment_Method_Bank Transfer\n",
"21 Payment_Method_Bitcoin\n",
"22 Payment_Method_Credit Card\n",
"23 Payment_Method_PayPal\n",
"24 Order_Cancelled_0\n",
"25 Order_Cancelled_1"
]
},
"execution_count": 101,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"vif_data"
]
},
{
"cell_type": "code",
"execution_count": 92,
"metadata": {
"id": "tjvbGXiw12UX"
},
"outputs": [],
"source": [
"# Split dataset into training and testing\n",
"X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)"
]
},
{
"cell_type": "code",
"execution_count": 93,
"metadata": {
"id": "XI3z6Wn112b8"
},
"outputs": [
{
"ename": "ValueError",
"evalue": "could not convert string to float: 'APAC'",
"output_type": "error",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[0;31mValueError\u001b[0m Traceback (most recent call last)",
"\u001b[0;32m/var/folders/12/8kgz6g6j7r9d3hb2nqwh_q3w0000gn/T/ipykernel_28372/4251899716.py\u001b[0m in \u001b[0;36m?\u001b[0;34m()\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[0;31m# Standardize numeric features\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 2\u001b[0m \u001b[0;31m#scaler = StandardScaler()\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 3\u001b[0m \u001b[0;32mfrom\u001b[0m \u001b[0msklearn\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mpreprocessing\u001b[0m \u001b[0;32mimport\u001b[0m \u001b[0mMinMaxScaler\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 4\u001b[0m \u001b[0mscaler\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mMinMaxScaler\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 5\u001b[0;31m \u001b[0mX_train_scaled\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mscaler\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mfit_transform\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mX_train\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 6\u001b[0m \u001b[0mX_test_scaled\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mscaler\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mtransform\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mX_test\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;32m/opt/homebrew/Caskroom/miniconda/base/envs/accelai/lib/python3.12/site-packages/sklearn/utils/_set_output.py\u001b[0m in \u001b[0;36m?\u001b[0;34m(self, X, *args, **kwargs)\u001b[0m\n\u001b[1;32m 311\u001b[0m \u001b[0;34m@\u001b[0m\u001b[0mwraps\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mf\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 312\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0mwrapped\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mX\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m*\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 313\u001b[0;31m \u001b[0mdata_to_wrap\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mf\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mX\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m*\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 314\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0misinstance\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mdata_to_wrap\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mtuple\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 315\u001b[0m \u001b[0;31m# only wrap the first output for cross decomposition\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 316\u001b[0m return_tuple = (\n",
"\u001b[0;32m/opt/homebrew/Caskroom/miniconda/base/envs/accelai/lib/python3.12/site-packages/sklearn/base.py\u001b[0m in \u001b[0;36m?\u001b[0;34m(self, X, y, **fit_params)\u001b[0m\n\u001b[1;32m 1094\u001b[0m \u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1095\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1096\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0my\u001b[0m \u001b[0;32mis\u001b[0m \u001b[0;32mNone\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1097\u001b[0m \u001b[0;31m# fit method of arity 1 (unsupervised transformation)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 1098\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mfit\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mX\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mfit_params\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mtransform\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mX\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 1099\u001b[0m \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1100\u001b[0m \u001b[0;31m# fit method of arity 2 (supervised transformation)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1101\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mfit\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mX\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0my\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mfit_params\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mtransform\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mX\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;32m/opt/homebrew/Caskroom/miniconda/base/envs/accelai/lib/python3.12/site-packages/sklearn/preprocessing/_data.py\u001b[0m in \u001b[0;36m?\u001b[0;34m(self, X, y)\u001b[0m\n\u001b[1;32m 446\u001b[0m \u001b[0mFitted\u001b[0m \u001b[0mscaler\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 447\u001b[0m \"\"\"\n\u001b[1;32m 448\u001b[0m \u001b[0;31m# Reset internal state before fitting\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 449\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_reset\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 450\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mpartial_fit\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mX\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0my\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
"\u001b[0;32m/opt/homebrew/Caskroom/miniconda/base/envs/accelai/lib/python3.12/site-packages/sklearn/base.py\u001b[0m in \u001b[0;36m?\u001b[0;34m(estimator, *args, **kwargs)\u001b[0m\n\u001b[1;32m 1469\u001b[0m skip_parameter_validation=(\n\u001b[1;32m 1470\u001b[0m \u001b[0mprefer_skip_nested_validation\u001b[0m \u001b[0;32mor\u001b[0m \u001b[0mglobal_skip_validation\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1471\u001b[0m \u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1472\u001b[0m \u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 1473\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mfit_method\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mestimator\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m*\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
"\u001b[0;32m/opt/homebrew/Caskroom/miniconda/base/envs/accelai/lib/python3.12/site-packages/sklearn/preprocessing/_data.py\u001b[0m in \u001b[0;36m?\u001b[0;34m(self, X, y)\u001b[0m\n\u001b[1;32m 486\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 487\u001b[0m \u001b[0mxp\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0m_\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mget_namespace\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mX\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 488\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 489\u001b[0m \u001b[0mfirst_pass\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0mhasattr\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m\"n_samples_seen_\"\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 490\u001b[0;31m X = self._validate_data(\n\u001b[0m\u001b[1;32m 491\u001b[0m \u001b[0mX\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 492\u001b[0m \u001b[0mreset\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mfirst_pass\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 493\u001b[0m \u001b[0mdtype\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0m_array_api\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msupported_float_dtypes\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mxp\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;32m/opt/homebrew/Caskroom/miniconda/base/envs/accelai/lib/python3.12/site-packages/sklearn/base.py\u001b[0m in \u001b[0;36m?\u001b[0;34m(self, X, y, reset, validate_separately, cast_to_ndarray, **check_params)\u001b[0m\n\u001b[1;32m 629\u001b[0m \u001b[0mout\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0my\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 630\u001b[0m \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 631\u001b[0m \u001b[0mout\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mX\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0my\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 632\u001b[0m \u001b[0;32melif\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0mno_val_X\u001b[0m \u001b[0;32mand\u001b[0m \u001b[0mno_val_y\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 633\u001b[0;31m \u001b[0mout\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mcheck_array\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mX\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0minput_name\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m\"X\"\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mcheck_params\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 634\u001b[0m \u001b[0;32melif\u001b[0m \u001b[0mno_val_X\u001b[0m \u001b[0;32mand\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0mno_val_y\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 635\u001b[0m \u001b[0mout\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0m_check_y\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0my\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mcheck_params\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 636\u001b[0m \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;32m/opt/homebrew/Caskroom/miniconda/base/envs/accelai/lib/python3.12/site-packages/sklearn/utils/validation.py\u001b[0m in \u001b[0;36m?\u001b[0;34m(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_writeable, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator, input_name)\u001b[0m\n\u001b[1;32m 1009\u001b[0m \u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1010\u001b[0m \u001b[0marray\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mxp\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mastype\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0marray\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mdtype\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mcopy\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mFalse\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1011\u001b[0m \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1012\u001b[0m \u001b[0marray\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0m_asarray_with_order\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0marray\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0morder\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0morder\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mdtype\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mdtype\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mxp\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mxp\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 1013\u001b[0;31m \u001b[0;32mexcept\u001b[0m \u001b[0mComplexWarning\u001b[0m \u001b[0;32mas\u001b[0m \u001b[0mcomplex_warning\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 1014\u001b[0m raise ValueError(\n\u001b[1;32m 1015\u001b[0m \u001b[0;34m\"Complex data not 
supported\\n{}\\n\"\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mformat\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0marray\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1016\u001b[0m \u001b[0;34m)\u001b[0m \u001b[0;32mfrom\u001b[0m \u001b[0mcomplex_warning\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;32m/opt/homebrew/Caskroom/miniconda/base/envs/accelai/lib/python3.12/site-packages/sklearn/utils/_array_api.py\u001b[0m in \u001b[0;36m?\u001b[0;34m(array, dtype, order, copy, xp, device)\u001b[0m\n\u001b[1;32m 747\u001b[0m \u001b[0;31m# Use NumPy API to support order\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 748\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mcopy\u001b[0m \u001b[0;32mis\u001b[0m \u001b[0;32mTrue\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 749\u001b[0m \u001b[0marray\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mnumpy\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0marray\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0marray\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0morder\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0morder\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mdtype\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mdtype\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 750\u001b[0m \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 751\u001b[0;31m \u001b[0marray\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mnumpy\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0masarray\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0marray\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0morder\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0morder\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mdtype\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mdtype\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 752\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 753\u001b[0m \u001b[0;31m# At this point array is a NumPy ndarray. We convert it to an array\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 754\u001b[0m \u001b[0;31m# container that is consistent with the input's namespace.\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;32m/opt/homebrew/Caskroom/miniconda/base/envs/accelai/lib/python3.12/site-packages/pandas/core/generic.py\u001b[0m in \u001b[0;36m?\u001b[0;34m(self, dtype, copy)\u001b[0m\n\u001b[1;32m 2149\u001b[0m def __array__(\n\u001b[1;32m 2150\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mdtype\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0mnpt\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mDTypeLike\u001b[0m \u001b[0;34m|\u001b[0m \u001b[0;32mNone\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;32mNone\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mcopy\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0mbool_t\u001b[0m \u001b[0;34m|\u001b[0m \u001b[0;32mNone\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;32mNone\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 2151\u001b[0m \u001b[0;34m)\u001b[0m \u001b[0;34m->\u001b[0m \u001b[0mnp\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mndarray\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 2152\u001b[0m \u001b[0mvalues\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_values\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 2153\u001b[0;31m \u001b[0marr\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mnp\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0masarray\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mvalues\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mdtype\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mdtype\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 2154\u001b[0m if (\n\u001b[1;32m 2155\u001b[0m \u001b[0mastype_is_view\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mvalues\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mdtype\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0marr\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mdtype\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 2156\u001b[0m \u001b[0;32mand\u001b[0m 
\u001b[0musing_copy_on_write\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;31mValueError\u001b[0m: could not convert string to float: 'APAC'"
]
}
],
"source": [
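"# Scale features to the [0, 1] range with Min-Max normalization\n",
"# (StandardScaler alternative kept below for reference)\n",
"#scaler = StandardScaler()\n",
"from sklearn.preprocessing import MinMaxScaler\n",
"# Scalers need numeric input: encode or drop string columns\n",
"# (e.g. a region value such as 'APAC') before this step,\n",
"# otherwise fit_transform raises a ValueError\n",
"scaler = MinMaxScaler()\n",
"X_train_scaled = scaler.fit_transform(X_train)  # fit on training data only\n",
"X_test_scaled = scaler.transform(X_test)  # reuse the training min/max"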
]
},
{
"cell_type": "code",
"execution_count": 70,
"metadata": {
"id": "FwZw7TSe12ex"
},
"outputs": [],
"source": [
"# Logistic Regression\n",
"log_reg = LogisticRegression()\n",
"log_reg.fit(X_train_scaled, y_train)\n",
"y_pred_log = log_reg.predict(X_test_scaled)\n",
"y_prob_log = log_reg.predict_proba(X_test_scaled)[:, 1]"
]
},
{
"cell_type": "code",
"execution_count": 71,
"metadata": {
"id": "v07y43RI12hY"
},
"outputs": [],
"source": [
"# ROC and AUC for Logistic Regression\n",
"roc_auc_log = roc_auc_score(y_test, y_prob_log)\n",
"fpr_log, tpr_log, _ = roc_curve(y_test, y_prob_log)"
]
},
{
"cell_type": "code",
"execution_count": 72,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "ybtCuHEr12kN",
"outputId": "19d898c5-d339-4dcd-b99d-7f0e1c697bc3"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0.5309127199250899\n"
]
}
],
"source": [
"print(roc_auc_log)"
]
},
{
"cell_type": "code",
"execution_count": 73,
"metadata": {
"id": "bo7IJj_f4FxJ"
},
"outputs": [],
"source": [
"# Confusion Matrix and Classification Report for Logistic Regression\n",
"conf_matrix_log = confusion_matrix(y_test, y_pred_log)\n",
"class_report_log = classification_report(y_test, y_pred_log)"
]
},
{
"cell_type": "code",
"execution_count": 74,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "daorTXbW4F4d",
"outputId": "2e89160e-a73d-4373-acba-3aeabd917f47"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[[ 0 412]\n",
" [ 0 788]]\n"
]
}
],
"source": [
"print(conf_matrix_log)\n",
"\n",
"# Layout (rows = actual class, columns = predicted class):\n",
"# [ TN FP ]\n",
"# [ FN TP ]"
]
},
{
"cell_type": "code",
"execution_count": 75,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "kAl3fKZk4Jer",
"outputId": "e76468f0-fd0a-4cfb-ea98-0cc4eee22e9a"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" precision recall f1-score support\n",
"\n",
" 0 0.00 0.00 0.00 412\n",
" 1 0.66 1.00 0.79 788\n",
"\n",
" accuracy 0.66 1200\n",
" macro avg 0.33 0.50 0.40 1200\n",
"weighted avg 0.43 0.66 0.52 1200\n",
"\n"
]
}
],
"source": [
"print(class_report_log)"
]
},
{
"cell_type": "code",
"execution_count": 76,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 564
},
"id": "hhGaw9_S2dR2",
"outputId": "a4831af8-b817-4e94-ce24-d39faafce307"
},
"outputs": [
{
"data": {
"image/png": "",
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# Plot ROC Curves\n",
"plt.figure(figsize=(9, 6))\n",
"plt.plot(fpr_log, tpr_log, label=f'Logistic Regression (AUC = {roc_auc_log:.2f})')\n",
"#plt.plot(fpr_nb, tpr_nb, label=f'Naive Bayes (AUC = {roc_auc_nb:.2f})')\n",
"plt.plot([0, 1], [0, 1], linestyle='--', color='grey')\n",
"plt.xlabel('False Positive Rate')\n",
"plt.ylabel('True Positive Rate (Sensitivity/Recall)')\n",
"plt.title('ROC Curve Comparison')\n",
"plt.legend()\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "ElX1NSYQe3BC"
},
"source": [
"#### `In-Class Activity - 1:` Duration: 15 minutes\n",
"- Instead of using `StandardScaler` for feature scaling, apply Min-Max normalization and check its effect on the model's performance on the test set.\n",
"\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 49,
"metadata": {},
"outputs": [],
"source": [
"# Min-Max normalize the features (scale each to the [0, 1] range)\n",
"from sklearn.preprocessing import MinMaxScaler\n",
"scaler = MinMaxScaler()\n",
"X_train_scaled_1 = scaler.fit_transform(X_train)\n",
"X_test_scaled_1 = scaler.transform(X_test)"
]
},
{
"cell_type": "code",
"execution_count": 50,
"metadata": {},
"outputs": [],
"source": [
"# Logistic Regression\n",
"log_reg_1 = LogisticRegression()\n",
"log_reg_1.fit(X_train_scaled_1, y_train)\n",
"y_pred_log_1 = log_reg_1.predict(X_test_scaled_1)\n",
"y_prob_log_1 = log_reg_1.predict_proba(X_test_scaled_1)[:, 1]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# ROC and AUC for Logistic Regression trained on Min-Max scaled features\n",
"roc_auc_log_1 = roc_auc_score(y_test, y_prob_log_1)\n",
"fpr_log_1, tpr_log_1, _ = roc_curve(y_test, y_prob_log_1)"
]
}
],
"metadata": {
"accelerator": "TPU",
"colab": {
"gpuType": "V28",
"machine_shape": "hm",
"provenance": []
},
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.5"
}
},
"nbformat": 4,
"nbformat_minor": 4
}