The Question

P-values are common in scientific research, but they’re not often used in practical data science. Why?

Concept

Let’s revisit the core concept of the p-value in my own words.

A p-value is a “Surprise Index.”

It tells you how surprising your data would be if there were actually no effect or no difference (e.g., if the new drug you’re testing were just a sugar pill).

Low p-value (e.g., below 0.05): “This is a very surprising result! It’s very unlikely I’d see this data by pure random chance. I should probably reject the idea that there’s ‘no effect’.”

High p-value (e.g., above 0.05): “This result isn’t surprising at all. It could easily happen by random chance. I don’t have enough evidence to say there’s any real effect.”

Key Takeaway

A p-value is NOT the probability that your theory is correct.

It IS the probability of seeing your data, assuming your “no effect” theory (null hypothesis) is correct. If that probability is low, you have a good reason to think your null hypothesis is wrong.
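
To make this concrete, here is a minimal simulation sketch (a hypothetical coin-flip example, not from any real experiment): we count how often pure chance produces a result at least as extreme as the one we observed.

import numpy as np

rng = np.random.default_rng(42)

# Observed: 60 heads in 100 flips. Null hypothesis: the coin is fair.
observed_heads = 60
n_flips = 100

# Simulate many experiments under the null (a fair coin)
simulated_heads = rng.binomial(n=n_flips, p=0.5, size=100_000)

# Two-sided "surprise": how often does chance deviate from 50 by at least as much as we did?
deviation = abs(observed_heads - 50)
p_value = np.mean(np.abs(simulated_heads - 50) >= deviation)

print(f"Simulated p-value: {p_value:.4f}")  # roughly 0.057: borderline, not quite "surprising"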

Investigation

My first intuition was to use p-values for feature selection in ML models.

TL;DR: Yes, you can, but you must be very careful, and it’s often not the best tool for the job, especially for models like LightGBM.

Let’s break it down.

1. The Classic Use Case: Linear Models (like Linear Regression)

This is where p-values shine and where the technique originated.

When you fit a linear regression model, you are essentially trying to find the best coefficients (or weights) for each feature in a linear equation:

Target = C + (coeff_1 * feature_1) + (coeff_2 * feature_2) + ...

Statistical software (like Python’s statsmodels library) will run this regression and give you a summary that includes a p-value for each and every coefficient.

How to Interpret It:

  • The “no effect” assumption here is: “This feature’s true coefficient is zero” (i.e., it has no linear relationship with the target).
  • A low p-value (< 0.05) for feature_1 means: “It’s very surprising to see a coefficient this far from zero if the feature truly had no effect. We can reject the ‘no effect’ idea and conclude this feature has a statistically significant linear relationship with the target.”
  • A high p-value (> 0.05) for feature_2 means: “A coefficient of this size could easily happen by random chance. We don’t have enough evidence to say this feature is useful in our linear model.”

So, for linear regression, a common feature selection strategy is to remove features with high p-values.
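
Here is a minimal sketch of that workflow with statsmodels (synthetic data; the feature names are made up for illustration):

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200

# feature_1 truly drives the target; feature_2 is pure noise
feature_1 = rng.normal(size=n)
feature_2 = rng.normal(size=n)
target = 3.0 * feature_1 + rng.normal(size=n)

X = sm.add_constant(np.column_stack([feature_1, feature_2]))
result = sm.OLS(target, X).fit()

# p-values for [const, feature_1, feature_2]:
# expect a tiny value for feature_1 and a large one for feature_2
print(result.pvalues)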

2. The Big Problems and Caveats (Why it’s tricky for ML)

This is the most important part. Relying only on p-values for feature selection in a general ML context can be misleading.

Problem 1: P-values only test for LINEAR relationships. A feature could have a powerful non-linear relationship with your target, but a linear model would completely miss it, giving it a high p-value.

  • Example: Engine temperature vs. fuel efficiency. Efficiency is highest in the middle, and poor when it’s too cold or too hot (an inverted-U curve). A linear model would find no relationship, yielding a high p-value, and you might wrongly discard “engine temperature” as a feature.
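
A quick sketch of this failure mode, with synthetic data standing in for the engine example:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)

temperature = rng.uniform(-1, 1, size=300)
# Efficiency peaks in the middle and falls off at both extremes (inverted-U)
efficiency = 1.0 - temperature**2 + rng.normal(scale=0.1, size=300)

fit = sm.OLS(efficiency, sm.add_constant(temperature)).fit()

# Large p-value for temperature, despite a strong real relationship
print(f"p-value for temperature: {fit.pvalues[1]:.3f}")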

Problem 2: P-values are irrelevant to Tree-Based Models (LightGBM, XGBoost). Models like LightGBM don’t work by finding linear coefficients. They work by repeatedly splitting the data on the features that best separate the target variable (e.g., reducing impurity or error).

  • LightGBM doesn’t calculate p-values internally. It has its own method for measuring feature usefulness: Feature Importance. This tells you how often a feature was used to make a split and how much that split improved the model’s accuracy. This is a much more direct measure of a feature’s value to that specific model.

Problem 3: Multicollinearity messes up p-values. If you have two highly correlated features (e.g., house_age and year_built), the linear model gets confused about how to assign credit. Both features might get a high p-value, suggesting they are useless, even though both are highly predictive.
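
A sketch of the problem (synthetic data; the near-duplicate feature is deliberately exaggerated):

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 200

year_built = rng.uniform(1950, 2020, size=n)
house_age = 2024 - year_built + rng.normal(scale=0.1, size=n)  # an almost perfect twin
price = -2.0 * house_age + rng.normal(scale=10, size=n)        # age clearly drives price

X = sm.add_constant(np.column_stack([house_age, year_built]))
fit = sm.OLS(price, X).fit()

# The model can't decide which twin deserves the credit:
print(f"Individual p-values: {fit.pvalues[1:]}")    # both large
print(f"Joint F-test p-value: {fit.f_pvalue:.2e}")  # tiny: together they clearly predict price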

Problem 4: Statistical Significance ≠ Predictive Power. With a huge dataset, a feature with a tiny, practically meaningless effect can still have a very low p-value. The p-value just says the effect is “not zero”; it doesn’t say the effect is large or important for making predictions.
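
A sketch of this trap: with a million rows, even a negligible effect becomes “significant”:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 1_000_000

x = rng.normal(size=n)
y = 0.005 * x + rng.normal(size=n)  # a practically meaningless effect

fit = sm.OLS(y, sm.add_constant(x)).fit()

print(f"p-value:   {fit.pvalues[1]:.2e}")  # extremely small: statistically "significant"
print(f"R-squared: {fit.rsquared:.6f}")    # near zero: almost no predictive power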

3. Better Alternatives for Machine Learning Feature Selection

Instead of relying on p-values, ML practitioners typically use methods that directly measure a feature’s impact on predictive performance.

  1. Model-Based Feature Importance: (Best for LGBM, XGBoost, etc.)
    • Train your LightGBM model on all the features.
    • Then, simply look at the lgbm_model.feature_importances_ attribute.
    • This gives you a score for each feature based on how useful it was to the trained model. You can then discard features with zero or very low importance. This is the most common and effective method (see the sketch after this list).
  2. Permutation Importance: (Model-agnostic and very powerful)
    • Train a model and calculate its accuracy (e.g., on a validation set).
    • Take one feature column and randomly shuffle it. This breaks any relationship it had with the target.
    • Predict again with the shuffled data.
    • The decrease in model accuracy tells you exactly how important that feature was. Repeat for all features.
  3. Recursive Feature Elimination (RFE):
    • The model is trained on all features, and the least important one is removed. The process is repeated until the desired number of features is left. This is good for dealing with correlated features.
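
A minimal sketch of methods 1 and 2 on synthetic data (this assumes the lightgbm and scikit-learn packages are installed; the dataset and feature layout are made up):

import numpy as np
import lightgbm as lgb
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(4)
n = 2000

# Column 0: non-linear signal, column 1: linear signal, column 2: pure noise
X = rng.normal(size=(n, 3))
y = np.sin(3 * X[:, 0]) + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=n)

X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

model = lgb.LGBMRegressor(n_estimators=200, random_state=0)
model.fit(X_train, y_train)

# Method 1: built-in importance (split counts by default)
print("Built-in importances:   ", model.feature_importances_)

# Method 2: permutation importance on held-out data
result = permutation_importance(model, X_val, y_val, n_repeats=10, random_state=0)
print("Permutation importances:", result.importances_mean)

Expect the noise column to score near zero under both methods, while the non-linear column, which a p-value on a linear fit would miss, ranks highly.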

Summary

  • Your model is Linear Regression and your goal is interpretation (understanding why): Use p-values. They are a great tool to understand which features have a statistically significant linear relationship with the target.
  • Your model is LightGBM, XGBoost, or Random Forest, and your goal is predictive accuracy: Do NOT use p-values. Use built-in Feature Importance or Permutation Importance; these methods are directly tied to the model’s performance and respect its non-linear nature.
  • You are in the early exploratory data analysis (EDA) phase, before any modeling: Statistical tests that produce p-values (like the ANOVA F-test or Chi-Squared) can give a quick “first look” at relationships between individual features and the target as a pre-screening step. Just don’t let them be your final decision.

Second Thought

I conclude that p-values are not appropriate for feature selection in non-linear ML models.

Is there anything else we can use them for in practical work? Below are the use cases I found, with sample code.

Use Case 1: A/B Testing a Website Change

We test if a new green ‘Buy’ button (Group B) gets more conversions than the old blue button (Group A).

  • Key Question: Is the observed difference in conversions real, or just random luck?
  • Null Hypothesis (H₀): Button color has no effect on the conversion rate.
  • Statistical Test: Chi-Squared Test (for comparing proportions of categorical data).

Python Implementation & Interpretation

import pandas as pd
from scipy import stats

# Step 1: Define our observed data in a contingency table
# Group A (Blue Button): 100 converted, 4900 did not
# Group B (Green Button): 125 converted, 4875 did not
observed_data = pd.DataFrame([
    [100, 4900], # Group A
    [125, 4875]  # Group B
], columns=['Converted', 'Did Not Convert'], index=['Group A', 'Group B'])

print("Observed Data:\n", observed_data)

# Step 2: Run the Chi-Squared test
# It returns the test statistic, p-value, degrees of freedom, and expected frequencies
chi2_stat, p_value, dof, expected = stats.chi2_contingency(observed_data)

print(f"\nP-Value: {p_value:.4f}")

# Step 3: Make a decision based on the p-value
alpha = 0.05
if p_value < alpha:
    print("Conclusion: Result is statistically significant. We reject the null hypothesis.")
    print("Business Action: The green button has a different conversion rate. Deploy the new button.")
else:
    print("Conclusion: Result is not significant. We fail to reject the null hypothesis.")
    print("Business Action: We can't prove a difference. Stick with the old button.")


Use Case 2: Comparing Customer Segments

We want to know if ‘Premium’ customers have a different average monthly spending than ‘Free’ customers.

  • Key Question: Is the difference in average spending between the two groups statistically significant?
  • Null Hypothesis (H₀): The average spending for both ‘Premium’ and ‘Free’ groups is the same.
  • Statistical Test: Independent T-test (for comparing the means of two independent groups).

Python Implementation & Interpretation

import numpy as np
from scipy import stats

# Step 1: Create sample spending data for the two groups
premium_spending = np.array([55, 60, 45, 70, 80, 50, 65, 58, 62, 75])
free_spending    = np.array([30, 45, 50, 40, 35, 25, 40, 38, 42, 30])

print(f"Average Premium Spending: ${premium_spending.mean():.2f}")
print(f"Average Free Spending:    ${free_spending.mean():.2f}")

# Step 2: Run the Independent T-test
# (SciPy's default assumes equal variances; pass equal_var=False for Welch's t-test)
t_stat, p_value = stats.ttest_ind(premium_spending, free_spending)

print(f"\nP-Value: {p_value:.4f}")

# Step 3: Make a decision based on the p-value
alpha = 0.05
if p_value < alpha:
    print("Conclusion: Result is statistically significant. We reject the null hypothesis.")
    print("Business Action: Premium users spend a different amount. This justifies targeted pricing or marketing.")
else:
    print("Conclusion: Result is not significant. We fail to reject the null hypothesis.")
    print("Business Action: We can't prove a meaningful difference in spending exists.")

Use Case 3: Detecting Data Drift in a Production Model

An ML model was trained on user income data from 2021. We need to check if the income distribution of new users in 2023 has changed, as this could hurt model performance.

  • Key Question: Is our model seeing data from a different distribution than it was trained on?
  • Null Hypothesis (H₀): The new data comes from the same distribution as the training data.
  • Statistical Test: Kolmogorov-Smirnov (K-S) Test (for checking if two samples are drawn from the same distribution).

Python Implementation & Interpretation

import numpy as np
from scipy import stats

# Step 1: Generate sample data representing income distributions
np.random.seed(42)  # fixed seed so this sketch is reproducible
# Training data (2021): mean income of 50k
training_income = np.random.normal(loc=50000, scale=15000, size=1000)
# Live data (2023): let's assume mean income has drifted up to 55k
live_income = np.random.normal(loc=55000, scale=18000, size=1000)

print(f"Average Training Income: ${training_income.mean():,.2f}")
print(f"Average Live Income:     ${live_income.mean():,.2f}")

# Step 2: Run the K-S test
ks_stat, p_value = stats.ks_2samp(training_income, live_income)

print(f"\nP-Value: {p_value:.4f}")

# Step 3: Make a decision based on the p-value
alpha = 0.05
if p_value < alpha:
    print("Conclusion: Result is statistically significant. We reject the null hypothesis.")
    print("Business Action: Data drift detected! Alert the MLOps team; the model likely needs to be retrained.")
else:
    print("Conclusion: Result is not significant. We fail to reject the null hypothesis.")
    print("Business Action: No significant data drift detected. The model is stable for now.")