7 Feature Selection and Kernels

This lecture discusses feature selection and kernel methods. We will cover feature engineering, feature selection methods, kernel methods, and the kernel trick.
Author: Jiuru Lyu

Published: February 18, 2025

Feature Engineering

  • Types of Variables:
    • Numerical: Discrete / Continuous.
    • Categorical
    • Ordinal: categorical with order. E.g., Temperature: Low, Medium, High.
  • Discretization of Numerical Features (Quantization / Binning)
    • Numerical \(\to\) Categorical
      • E.g., age: \([18,25], (25, 45], (45, 65], (65, 85], >85\) or binary: \(\leq 65, >65\)
    • Why? It allows a linear model to learn nonlinear relationships (see the sketch after this list)
  • Missing data:
    • Depending on how much data is missing:
      • Drop examples with missing values
      • Drop features with missing values
    • Imputation approaches:
      • Univariate imputation
      • Multivariate imputation
      • Nearest neighbor imputation
      • Missing indicators
  • Time Series Data:
    • Summary statistics:
      • mean, std
      • min, median, max, Q1, Q3
      • cumulative sum
      • count
      • count above/below threshold
    • Trend:
      • linear slopes
      • piece-wise slopes
    • Periodic trends: fast Fourier transform
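
As a concrete illustration of discretization (the sketch referenced in the binning item above), here is a minimal numpy sketch; the bin edges follow the age example, and the sample data itself is made up.

```python
import numpy as np

# Hypothetical ages to discretize.
ages = np.array([22, 31, 47, 70, 88])

# Bin edges matching [18,25], (25,45], (45,65], (65,85], >85.
edges = np.array([25, 45, 65, 85])

# np.digitize maps each continuous value to a bin index,
# turning the numerical feature into a categorical one.
age_bin = np.digitize(ages, edges, right=True)   # [0 1 2 3 4]

# The binary variant: <=65 vs. >65.
age_over_65 = (ages > 65).astype(int)            # [0 0 0 1 1]
```

One-hot encoding the resulting bin indices then lets a linear model fit a separate weight per age range, which is how binning lets a linear model capture a nonlinear relationship.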

Feature Selection

  • Motivation:
    • When we have few examples but a large number of features (\(d\gg N\)), it becomes very easy to overfit the model.
    • We need to remove uninformative features.
  • Methods:
    • Filter method: before model fitting
    • Wrapper method: after model fitting
    • Embedded method: during model fitting
  • Filter Method: Preprocessing
    • Rank individual features according to some statistical measure
    • Filter out features that fall below a certain threshold
      • E.g., sample variance, Pearson’s correlation coefficient between feature and label, mutual information, \(\chi^2\)-statistic for categorical/ordinal features.
  • Wrapper Method: Search
    • View feature selection as a model selection problem

    • Which features to use is a hyperparameter

      • E.g., brute-force, exhaustive search: if we have \(d\) features, there are \(2^d-1\) possible combinations. Computationally infeasible.
      • Greedy search: Forward selection, backward elimination.
    • Forward selection:

      \begin{algorithm} \caption{Forward Selection} \begin{algorithmic} \State initialize $F$ to the set of all features \State initialize $S$ to the empty set $\qquad$\Comment{Start with $0$ features} \For{$i=1:d-1$} \For{ each feature $f\in F$} \State train model using $(S, f)$ \State evaluate model \EndFor \State Select best set $\qty{S, f^*}$. $S=S+f^*$, and $F=F-f^*$ \EndFor \end{algorithmic} \end{algorithm}
      • This algorithm has complexity \(\sim\bigO(d^2)\). (A code sketch of forward selection appears after the comparison table below.)
    • Backward elimination:

      \begin{algorithm} \caption{Backward Elimination} \begin{algorithmic} \State initialize $F$ to the set of all features \For{$i=1:d-1$} \State $S=$ set of all subsets of $F$ of size $d-i$. $\qquad$\Comment{Start with $d$ features} \For{ each subset $s\in S$} \State train model using $s$ \State evaluate model \EndFor \State Select best subset $s^*$, and $F=s^*$ \EndFor \end{algorithmic} \end{algorithm}
      • This algorithm also has complexity \(\sim\bigO(d^2)\). However, forward selection is less computationally expensive because it trains smaller models on average.
  • Embedded Method: Regularization
    • Incorporate feature selection as part of the model fitting process
    • Use \(L_1\) (LASSO) regularization
|                          | Filter                                           | Wrapper                                | Embedded                                    |
|--------------------------|--------------------------------------------------|----------------------------------------|---------------------------------------------|
| Computational cost       | Fast; only run once, scalable to high dimensions | Slow                                   | Between filter and wrapper                  |
| Model/algorithm specific | Generic; agnostic to models and algorithms       | Specific; considers model performance  | Only applies to specific models/algorithms  |
| Overfitting              | Unlikely                                         | Higher risk of overfitting             | Regularization controls overfitting         |
  • In practice, we use combinations of these methods.
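
To make the wrapper approach concrete, here is a sketch of forward selection. The scikit-learn model and cross-validation scoring are illustrative stand-ins for the “train model / evaluate model” steps in the pseudocode above, not the lecture’s prescribed choices.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

def forward_selection(X, y, n_features):
    """Greedily grow the selected set S one best feature at a time."""
    F = set(range(X.shape[1]))  # candidate features
    S = []                      # start with 0 features
    while F and len(S) < n_features:
        # Train and evaluate a model on (S, f) for each remaining f.
        scores = {f: cross_val_score(LinearRegression(),
                                     X[:, S + [f]], y, cv=5).mean()
                  for f in F}
        f_star = max(scores, key=scores.get)
        S.append(f_star)   # S = S + f*
        F.remove(f_star)   # F = F - f*
    return S
```

Each pass over the remaining features trains one model per candidate, so the total number of models trained is \(d + (d-1) + \cdots \sim \bigO(d^2)\), matching the complexity noted above.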

Fitting Nonlinear Functions – Kernel Methods

  • Question: How can we use linear models to learn nonlinear trends?

    • Map to a higher-dimensional feature space: e.g., \(\mqty[1&x&x^2&\cdots&x^m]\)
  • For example, classification:

    Figure 1: Higher Dimensional Classification
    • Let \(\va x=\mqty[x_1&x_2]^\top\) and \(\phi(\va x)=\mqty[1&x_1&x_2&x_1x_2&x_1^2&x_2^2]^\top\).
    • Find a linear classifier \(\va\theta\) such that \(\va\theta\cdot\phi(\va x)=0\) perfectly separates the two classes.
    • Equation of a circle: \((x-h)^2+(y-k)^2=r^2\).
    • In this example, \((x_1-2)^2+(x_2-2)^2=1\). Expanding, \(7-4x_1-4x_2+x_1^2+x_2^2=0\).
    • Therefore, \(\phi(\va x)=\mqty[1&x_1&x_2&x_1x_2&x_1^2&x_2^2]^\top\) and \(\va\theta=\mqty[7&-4&-4&0&1&1]^\top\).
    • But, we don’t want to use this method. Why?
      • Because the dimension blows up. Suppose \(\va x\in\R^d\) and \(\phi(\va x)\in\R^p\). Then, \[p=\text{number of first- and second-order terms}=(d+1)+\dfrac{d(d+1)}{2}\sim\bigO(d^2).\]
      • We have \(p\gg d\). It’s also likely that \(p\gg N\).
      • In the perceptron algorithm, our update complexity would then be \(\sim\bigO(p)=\bigO(d^2)\), which is inefficient.
  • Key Observation:

    • If we initialize \(\va\theta^{(0)}=\va 0\), then \(\va\theta^{(k)}\) always lies in the span of the feature vectors and can be expressed as a linear combination of them: \[\va\theta^{(k)}=\sum_{i=1}^N\alpha_i\va x^{(i)}\quad\text{for some }\alpha_1,\alpha_2,\dots,\alpha_N.\]
    • Then, we can rewrite the classifier to get \[\begin{aligned}h(\va x;\va\theta)&=\operatorname{sign}(\va\theta\cdot\va x)\\&=\operatorname{sign}\qty(\qty(\sum_{i=1}^N\alpha_i\va x^{(i)})\cdot\va x)\\&=\operatorname{sign}\qty(\sum_{i=1}^N\alpha_i\va x^{(i)}\cdot\va x).\end{aligned}\]
    • Even if we are in higher dimensional feature space: \[h(\phi(\va x);\va\theta)=\operatorname{sign}\qty(\sum_{i=1}^N\alpha_i\qty(\phi(\va x^{(i)})\cdot\phi(\va x))).\]
    • The classifier is now written in the \(\alpha_i\) (\(N\)-dimensional) space, not in the \(\va x\) (\(d\)-dimensional) space.
    • Then, the update rule when example \(i\) is misclassified is \[\alpha_i^{(k+1)}=\alpha_i^{(k)}+y^{(i)}.\]
\begin{algorithm} \caption{New Perceptron} \begin{algorithmic} \State $\va\alpha^{(0)}=\va 0\in\R^N$, $k=0$ \While{not converged} \For{$i=1,\dots,N$} \If{$y^{(i)}\qty(\sum_{j=1}^N\alpha_j\phi(\va x^{(i)})\cdot\phi(\va x^{(j)}))\leq 0$} \State $\alpha_{i}^{(k+1)}=\alpha_i^{(k)}+y^{(i)}$ \State $\alpha_j^{(k+1)}=\alpha_j^{(k)}$ when $j\neq i$ \State $k++$ \EndIf \EndFor \EndWhile \end{algorithmic} \end{algorithm}
  • \(\phi(\va x^{(i)})\cdot\phi(\va x^{(j)})\) can be pre-computed; computing all \(N^2\) dot products costs \(\bigO(N^2p)\).
  • The update complexity is \(\bigO(1)\).
  • How can we speed up the dot product \(\phi(\va x)\cdot\phi(\va x')\)? We can compute \(\phi(\va x)\cdot\phi(\va x')\) without explicitly computing \(\phi(\va x)\) and \(\phi(\va x')\).

Example 1 Suppose \(\phi(\va x)=\mqty[x_1^2&x_2^2&\sqrt{2}x_1x_2]^\top\). Then, \[ \begin{aligned} \phi(\va u)\cdot\phi(\va v)&=\mqty[u_1^2&u_2^2&\sqrt{2}u_1u_2]^\top\cdot\mqty[v_1^2&v_2^2&\sqrt{2}v_1v_2]^\top\\ &=u_1^2v_1^2+u_2^2v_2^2+2u_1u_2v_1v_2\\ &=(u_1v_1+u_2v_2)^2\\ &=\qty(\va u\cdot\va v)^2. \end{aligned} \]
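
A quick numerical check of Example 1 (the two vectors here are arbitrary):

```python
import numpy as np

def phi(x):
    # Explicit feature map phi(x) = [x1^2, x2^2, sqrt(2) x1 x2].
    return np.array([x[0]**2, x[1]**2, np.sqrt(2) * x[0] * x[1]])

u = np.array([1.0, 2.0])
v = np.array([3.0, -1.0])

# Both sides of phi(u) . phi(v) = (u . v)^2 agree.
print(phi(u) @ phi(v))  # 1.0 (up to floating-point rounding)
print((u @ v) ** 2)     # 1.0
```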

Definition 1 (Kernel Function) A kernel function is an implicit feature mapping that allows us to compute the dot product in the feature space without explicitly computing the feature vectors. \[K:\underbrace{\R^d\times\R^d}_\text{feature vectors}\to\R\]

Theorem 1 (Common Kernels)  

  • Linear Kernel: \[K\qty(\va x^{(i)},\va x^{(j)})=\va x^{(i)}\cdot\va x^{(j)}.\]
  • Polynomial Kernel: \[K\qty(\va x^{(i)},\va x^{(j)})=\qty(\va x^{(i)}\cdot\va x^{(j)}+1)^d.\]
  • Radial Basis Function (RBF) Kernel: infinite-dimensional feature space \[\begin{aligned}K_\text{RBF}(\va u,\va v)&=\exp\qty(-\gamma\norm{\va u-\va v}^2)\\&=C\sum_{n=0}^\infty\dfrac{K_{\operatorname{poly}(n)}(\va u,\va v)}{n!}\end{aligned}\]
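
These kernels are one-liners in code. A sketch, with `gamma` and the polynomial `degree` left as free hyperparameters:

```python
import numpy as np

def linear_kernel(u, v):
    return u @ v

def polynomial_kernel(u, v, degree=2):
    return (u @ v + 1) ** degree

def rbf_kernel(u, v, gamma=1.0):
    # exp(-gamma * ||u - v||^2); implicit infinite-dimensional feature space.
    return np.exp(-gamma * np.sum((u - v) ** 2))
```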
\begin{algorithm} \caption{Kernelized Perceptron} \begin{algorithmic} \State $\alpha_j^{(0)}=0$, $k=0$ \While{not converged} \For{$i=1,\dots,N$} \If{$y^{(i)}\qty(\sum_{j=1}^N\alpha_j^{(k)}K(\va x^{(j)}, \va x^{(i)}))\leq 0$} \State $\alpha_{i}^{(k+1)}=\alpha_i^{(k)}+y^{(i)}$ \State $\alpha_j^{(k+1)}=\alpha_j^{(k)}$ when $j\neq i$ \State $k++$ \EndIf \EndFor \EndWhile \end{algorithmic} \end{algorithm}
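
A minimal numpy sketch of the kernelized perceptron above, assuming labels \(y^{(i)}\in\{-1,+1\}\); the Gram matrix is precomputed once, as noted earlier, and any of the kernel functions sketched above can be passed in.

```python
import numpy as np

def kernel_perceptron(X, y, kernel, max_epochs=100):
    """Learn one weight alpha_i per training example."""
    N = len(X)
    # Precompute the Gram matrix K[i, j] = kernel(x_i, x_j): O(N^2 p).
    K = np.array([[kernel(X[i], X[j]) for j in range(N)]
                  for i in range(N)])
    alpha = np.zeros(N)
    for _ in range(max_epochs):
        mistakes = 0
        for i in range(N):
            # Mistake on example i: update alpha_i only, an O(1) step.
            if y[i] * (alpha @ K[:, i]) <= 0:
                alpha[i] += y[i]
                mistakes += 1
        if mistakes == 0:  # converged: no misclassified examples
            break
    return alpha
```

Prediction on a new point \(\va x\) is then \(\operatorname{sign}\qty(\sum_i\alpha_i K(\va x^{(i)},\va x))\), so the training points must be kept around at test time.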