Prediction
What is prediction? Why is it different?
- Prediction is everywhere:
- Basically all modern AI systems employ prediction at their core
- Success of AI is largely a function of three advances:
- Computation
- Data
- Algorithms
- Some algorithms/Forms of Prediction:
- Machine Learning
- Linear regression, logistic regression, PCA, LASSO, random forests, K-NN, etc.
- Artificial Intelligence
- Deep Learning
- Why prediction?
- Prediction seeks to find a best guess for an outcome given some meaningfully associated data.
- What is machine learning?
- The study of algorithms that improve through repeated experience
- Given observations of an outcome and some important correlates, build models that predict outcomes.
- Machine learning is mostly concerned with prediction.
- Prediction vs. Explanatory Modeling:
- Different in goals:
- Explanatory Modeling (Causal Inference): What is the effect of X on Y?
- Prediction: Can we forecast what Y will be given X?
- Causal Inference: We want to know the effect of X on Y, so we
- Run an experiment randomizing X
- Use a DiD or Regression Discontinuity
- Run a regression controlling for all possible confounders.
- Use modeling to estimate unobservable potential outcomes, control for confounders, and quantify uncertainty around our estimated effect.
- Predictive Recipe: find a model for X and Y that minimizes my prediction error
$\text{Prediction Error} = \sum_{i=1}^{N} f(y_i - \hat{y}_i)$
- In words, over N observations, how far off from the truth am I with the predictions produced by my model? (A short computational sketch appears at the end of this block of notes.)
- Explanation can aid in some types of prediction problems but is not necessary in many cases.
- Even if we do not know the causal effect (or even if the causal effect is zero), there might be predictive information in knowing a treatment status.
- The key is that these features contain information about the outcome we are trying to predict.
- Rule of Thumb: avoid noise, add new information
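As a concrete illustration of the prediction-error formula above, here is a minimal sketch that computes it for a squared loss, f(e) = e². The outcome values, predicted values, and the choice of squared loss are all illustrative assumptions, not something the notes prescribe.

```python
import numpy as np

# Hypothetical observed outcomes and model predictions (illustrative values only).
y = np.array([3.1, 4.0, 5.2, 6.8, 8.1])
y_hat = np.array([2.9, 4.3, 5.0, 7.1, 7.8])

# Prediction Error = sum over i of f(y_i - y_hat_i); here f is the squared loss.
squared_errors = (y - y_hat) ** 2
prediction_error = squared_errors.sum()

# Dividing by N gives the familiar mean squared error (MSE).
mse = squared_errors.mean()
print(f"sum of squared errors: {prediction_error:.3f}, MSE: {mse:.3f}")
```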
- The main predictive question:
- Can I predict the value of Y given some correlates for an observation that was not included in the modeling step?
- Sometimes, we care about future observations that have not yet occurred (forecasting).
- Other times, we care about within-sample prediction ability.
- Types of Prediction:
- Static Prediction
- Key assumption: Environment is static (does not change)
- No policy change
- No Adaptation
- Dynamic prediction:
- Environment may change due to policy changes or adaptation.
Good and Bad Prediction
- What makes for good predictors:
- Avoid Redundancy
- Avoid pure noise
- Precise information reduces noise
- Predictors from a variety of domains provide different information
- Evaluating Predictive Performance:
- The higher the order of the polynomial, the less error we have in the training data (illustrated in the sketch after these bullets).
- Overfitting occurs when we tune a predictive model only on the data we observe.
- Since the goal of prediction is to minimize the prediction errors, a model that only accounts for the observed data will produce fits that are too specific for general prediction.
- We need to balance fitting the observed data with generality - favor parsimonious models over complex models while still avoiding underfitting.
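To see the point about polynomial order concretely, the sketch below fits polynomials of increasing degree to simulated noisy data and compares in-sample and held-out error. The data-generating process, the hold-out of 12 observations, and the particular degrees are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data: a quadratic signal plus noise (illustrative assumption).
x = np.linspace(-3, 3, 60)
y = 1.0 + 2.0 * x - 0.5 * x**2 + rng.normal(scale=2.0, size=x.size)

# Randomly hold out 12 of the 60 observations as a test set.
test_idx = rng.choice(x.size, size=12, replace=False)
train_mask = np.ones(x.size, dtype=bool)
train_mask[test_idx] = False

for degree in (1, 2, 5, 10):
    # Fit a polynomial of the given degree on the training data only.
    coefs = np.polyfit(x[train_mask], y[train_mask], deg=degree)
    train_mse = np.mean((np.polyval(coefs, x[train_mask]) - y[train_mask]) ** 2)
    test_mse = np.mean((np.polyval(coefs, x[test_idx]) - y[test_idx]) ** 2)
    # Training error keeps shrinking as the degree grows; test error typically
    # bottoms out near the true degree and then rises as the model overfits.
    print(f"degree {degree:2d}: train MSE {train_mse:.2f}, test MSE {test_mse:.2f}")
```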
- Overfitting/Bias-Variance Trade-off
- Overfitting: fitting the noise in the data (variance); the tension with underfitting is the bias-variance trade-off in machine learning parlance.
- Overreacts to every tiny random perturbation in the data.
- Excellent, maybe perfect fit to in-sample data
- Models noise -> won't extrapolate/generalize well to new out-of-sample data.
- Underfitting: misses the true patterns/relationships in the data (bias)
- Underreacts to real changes
- Won't describe in-sample or out-of-sample data well.
- Overfitting can occur in two circumstances:
- Model is too flexible to observed data
- Model includes many predictors that are only loosely related to the outcome - especially problematic when the predictors are correlated with one another (kitchen sink models).
How to avoid overfitting?
- Evaluating Predictive Performance:
- The prediction recipe:
- Collect predictors and outcomes for some number of observations.
- Hold out some portion of the observations as a test set.
- Find the optimal coefficients for the predictive model on the training subset of the data.
- Evaluate the prediction error on the test set.
- Sometimes, there is a natural training/test split: e.g. time intervals
- Other times, we need to randomly split the data into training and test sets, e.g., randomly hold out 20% of the data as a test set and fit the model on the other 80%. Quantify predictive performance on the held-out test set. (A sketch of this recipe follows below.)
- This is better than including all data in the predictive model from the start because it guards against overfitting to the data - if the model fits the training set well but doesn't fit the test set, then the model's predictions aren't very good.
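A minimal sketch of this train/test recipe, assuming scikit-learn is available and using plain linear regression on simulated data (both are illustrative choices rather than anything the notes require):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)

# Step 1: collect predictors X and outcomes y (simulated here for illustration).
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=1.0, size=200)

# Step 2: randomly hold out 20% of the observations as a test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Step 3: find the optimal coefficients on the training subset only.
model = LinearRegression().fit(X_train, y_train)

# Step 4: evaluate the prediction error on the held-out test set.
test_mse = mean_squared_error(y_test, model.predict(X_test))
print(f"held-out test MSE: {test_mse:.3f}")
```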
- Cross-Validation:
- The cross-validation prediction recipe:
- Collect predictors and outcomes for some number of observations.
- Divide the data into K equally sized folds.
- Treating fold 1 as the test set, use the remaining folds as the training set to find the predictive model that minimizes prediction error in fold 1.
- Repeat for each fold, treating them as the test set in turn.
- Average the predictions of each "best model" to get the K-fold cross-validated prediction.
- K-fold cross-validation addresses both the problem of selecting a test set and the problem of overfitting by ensuring that every observation is held out at one point or another.
- Prevents overfitting by averaging over the models - what's good for one may not be good for another, so find a happy medium by averaging them all together (sketched below).
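A sketch of K-fold cross-validation on the same kind of simulated data, again assuming scikit-learn. Note that the standard `cross_val_score` workflow shown here averages the per-fold test errors (rather than the models' predictions) to summarize performance; K = 5 is an illustrative choice.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(2)

# Simulated predictors and outcome (illustrative only).
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=1.0, size=200)

# Divide the data into K = 5 equally sized folds.
folds = KFold(n_splits=5, shuffle=True, random_state=0)

# Each fold serves as the test set once; the model is refit on the other folds.
scores = cross_val_score(
    LinearRegression(), X, y, cv=folds, scoring="neg_mean_squared_error"
)

# Average the per-fold errors to get the cross-validated estimate of prediction error.
print(f"5-fold cross-validated MSE: {-scores.mean():.3f}")
```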
- Avoid Spurious Predictors
- Overfitting can also happen if our model includes many predictors that are only loosely related to the outcome.
- Especially problematic when the predictors are correlated with one another (kitchen sink models).
- Is the predictor really related to Y, or is the correlation in this data due to chance/noise?
- Signal vs. Noise
- Let Z be an unimportant predictor
- When finding the model that minimizes prediction error, random chance dictates that there will be a relationship between Z and Y.
- If we try to apply this estimate to new observations, the spurious correlation between Z and Y can lead predictions to be off (see the sketch below).
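The signal-vs-noise point can be checked directly. In the sketch below (simulated data, illustrative only), Z has no relationship to Y by construction, yet the fitted model still assigns it a nonzero coefficient in-sample, and carrying that spurious coefficient to held-out data tends to hurt predictions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)

# X drives the outcome; Z is pure noise, unrelated to y by construction.
n = 100
X = rng.normal(size=(n, 1))
Z = rng.normal(size=(n, 1))
y = 2.0 * X[:, 0] + rng.normal(scale=1.0, size=n)

XZ = np.hstack([X, Z])
XZ_train, XZ_test, y_train, y_test = train_test_split(
    XZ, y, test_size=0.3, random_state=0
)

# Model that includes the spurious predictor Z alongside X.
with_z = LinearRegression().fit(XZ_train, y_train)
# Model that uses only the real predictor X.
without_z = LinearRegression().fit(XZ_train[:, :1], y_train)

print("coefficient on Z (nonzero purely by chance):", with_z.coef_[1])
print("test MSE with Z:   ", mean_squared_error(y_test, with_z.predict(XZ_test)))
print("test MSE without Z:", mean_squared_error(y_test, without_z.predict(XZ_test[:, :1])))
```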
- A final thought on preventing overfitting:
- A data scientist who has domain knowledge can prevent overfitting by only including predictors and potential models that make sense in the context of the problem.
- Even if the problem at hand isn't directly explanatory in nature, predictive models benefit from the same considerations that make a good causal model:
- Only include predictors that meaningfully correlate with the outcome of interest.
- Avoid having many correlated predictors.
- Carefully consider functional forms for the predictive model - use a best-fit line when a line will do.