Course Link: Week 8, Unsupervised Learning & Dimensionality Reduction
24 Clustering
24.1 Unsupervised Learning: Introduction
 Unsupervised learning: learning from unlabeled data
 Supervised learning & unsupervised learning: Compare and contrast:
 Supervised learning:
 Given a set of labels, fit a hypothesis to it
 Unsupervised learning:
 Try to determine structure in the data
 A clustering algorithm groups data together based on data features
 What is clustering good for?
 Market segmentation: group customers into different market segments
 Social network analysis: Facebook “smartlists”
 Organizing computer clusters and data centers for network layout and location
 Astronomical data analysis: understanding galaxy formation
24.2 K-Means Algorithm
 Algorithm overview
 1) Randomly allocate two points as the cluster centroids
 Have as many cluster centroids as clusters you want to find ($K$ cluster centroids, in fact)
 In our example we just have two clusters, so there’re two cluster centroids (red and blue point)
 2) Cluster assignment step
 Go through each example and, depending on whether it's closer to the red or the blue centroid, assign each point to one of the two clusters
 To demonstrate this, we've gone through the data and colored each point red or blue
 3) Move centroid step
 Take each centroid and move to the average of the correspondingly assigned datapoints
 4) Repeat 2) and 3) until convergence
 More formal definition
 Input:
 $K$: number of clusters in the data
 Training set $\{x^{(1)}, x^{(2)}, \dots, x^{(m)}\}$, $x^{(i)} \in \Bbb{R}^n$ (drop the $x_0 = 1$ convention)
 Algorithm:
 Randomly initialize $K$ cluster centroids $\mu_1, \mu_2, \dots, \mu_K \in \Bbb{R}^n$
 Repeat until convergence:
 Cluster assignment: for $i = 1$ to $m$, set $c^{(i)} :=$ index (from $1$ to $K$) of the cluster centroid closest to $x^{(i)}$
 Move centroid: for $k = 1$ to $K$, set $\mu_k :=$ mean of the points currently assigned to cluster $k$
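A minimal NumPy sketch of this loop (the function name `kmeans` and the vectorized distance computation are my own illustration, not the course code):

```python
import numpy as np

def kmeans(X, K, n_iters=100):
    """Basic K-means on an (m, n) data matrix X with K clusters."""
    m, n = X.shape
    # Random initialization: pick K distinct training examples as centroids
    mu = X[np.random.choice(m, K, replace=False)].astype(float)
    for _ in range(n_iters):
        # Cluster assignment step: c[i] = index of the centroid closest to x^(i)
        dists = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)  # (m, K)
        c = dists.argmin(axis=1)
        # Move centroid step: mu_k = mean of the points assigned to cluster k
        for k in range(K):
            if np.any(c == k):
                mu[k] = X[c == k].mean(axis=0)
    return c, mu
```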
 K-means also works for non-separated clusters (e.g. T-shirt sizing: segmenting a continuum of heights and weights into S/M/L groups)
24.3 Optimization Objective

Notations:
 $c^{(i)}$ = index of cluster ($1, 2, \dots, K$) to which example $x^{(i)}$ is currently assigned
 $c^{(i)} = k \in \{1, 2, \dots, K\}$
 $i \in \{1, 2, \dots, m\}$
 $\mu_k$ = cluster centroid k ($\mu_k \in \Bbb{R}^n$)
 $\mu_{c^{(i)}}$ = cluster centroid of cluster to which example $x^{(i)}$ has been assigned

Optimization objective: the distortion (cost) function is

$$J(c^{(1)}, \dots, c^{(m)}, \mu_1, \dots, \mu_K) = \frac{1}{m}\sum_{i=1}^m \left\| x^{(i)} - \mu_{c^{(i)}} \right\|^2$$
If we consider the K-means algorithm
 The cluster assignment step minimizes $J(\cdot)$ with respect to the $c^{(i)}$ (holding the centroids fixed)
 i.e. find the centroid closest to each example
 Doesn’t change the centroids themselves
 The move centroid step
 This step chooses the values of $\mu_k$ that minimize $J(\cdot)$, holding the $c^{(i)}$ fixed
 So, we’re partitioning the algorithm into two parts:
 First part minimizes $J(\cdot)$ with respect to the $c$ variables
 Second part minimizes $J(\cdot)$ with respect to the $\mu$ variables
24.4 Random Initialization
 Have the number of centroids set to less than the number of examples ($K < m$)
 Randomly pick $K$ training examples
 Set $\mu_1$ up to $\mu_K$ to these examples' values
 K-means can converge to different solutions depending on the initialization setup
 Risk of local optimum
 We can do multiple random initializations and pick the clustering with the lowest cost $J(\cdot)$; if many runs give the same result, that is likely to indicate a global optimum
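A sketch of this, reusing the hypothetical `kmeans` helper from section 24.2 and assuming a data matrix `X`; the cost $J$ is the distortion defined in 24.3:

```python
import numpy as np

def distortion(X, c, mu):
    # J = average squared distance between each example and its assigned centroid
    return np.mean(np.sum((X - mu[c]) ** 2, axis=1))

best_c, best_mu, best_J = None, None, np.inf
for _ in range(100):               # e.g. 100 random initializations
    c, mu = kmeans(X, K=3)
    J = distortion(X, c, mu)
    if J < best_J:                 # keep the clustering with the lowest cost
        best_c, best_mu, best_J = c, mu, J
```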
24.5 Choosing the Number of Clusters
 What is the right value of $K$?
 Elbow method
 Vary $K$ and compute cost function at a range of $K$ values
 As $K$ increases, $J(\cdot)$ minimum value should decrease (i.e. you decrease the granularity so centroids can better optimize)
 Plot $K$ vs. $J(\cdot)$
 Choose the “elbow” number of clusters (if you get a nice plot this is a reasonable way of choosing $K$)
 Risk: normally you don’t get a nice curve, so there is no clear elbow
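A sketch of the elbow plot, reusing the hypothetical `kmeans` and `distortion` helpers above and assuming a data matrix `X`:

```python
import matplotlib.pyplot as plt

Ks = range(1, 11)
costs = []
for K in Ks:
    c, mu = kmeans(X, K)            # ideally keep the best of several random inits
    costs.append(distortion(X, c, mu))

plt.plot(list(Ks), costs, 'o-')
plt.xlabel('K (number of clusters)')
plt.ylabel('cost J')
plt.show()
```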
 Another method for choosing $K$
 Sometimes you’re running K-means to get clusters to use for some later/downstream purpose. Evaluate K-means based on a metric for how well it performs for that later purpose.
25 Motivation
25.1 Motivation I: Data Compression
 A second type of unsupervised learning algorithm: dimensionality reduction
 Data compression
 Speeds up algorithms
 Reduces the memory/disk space needed to store the data
25.2 Motivation II: Visualization
 It’s hard to visualize highly dimensional data
 Dimensionality reduction can improve how we display information in a tractable manner for human consumption
 Why do we care?
 Often helps to develop algorithms if we can understand our data better
 Dimensionality reduction helps us do this, letting us see the data in a helpful manner
 Good for explaining something to someone if you can “show” it in the data
26 Principal Component Analysis
26.1 Principal Component Analysis Problem Formulation

For the problem of dimensionality reduction, the most commonly used algorithm is PCA

Say we have a 2D data set which we wish to reduce to 1D

In other words, find a single line onto which to project this data. How do we determine this line?

The distance between each point and the projected version should be small (blue lines below are short)

PCA tries to find a lower dimensional surface such that the sum of squared projection distances onto that surface is minimized

The blue lines are sometimes called the projection error
 PCA tries to find the surface (a straight line in this case) which has the minimum projection error

As an aside, you should normally do mean normalization and feature scaling on your data before PCA



A more formal description is
 For 2D → 1D, we must find a direction: a vector $u^{(1)} \in \Bbb{R}^n$ (here $n = 2$)
 Onto which you can project the data so as to minimize the projection error
 $u^{(1)}$ can be positive or negative ($-u^{(1)}$), which makes no difference
 Either vector defines the same red line

In the more general case, to reduce from $n$D to $k$D we
 Find $k$ vectors ($u^{(1)} , u^{(2)}, \dots, u^{(k)}$) onto which to project the data to minimize the projection error
 So lots of vectors onto which we project the data
 Find a set of vectors and project the data onto the linear subspace spanned by that set of vectors
 The $k$ vectors define a $k$-dimensional plane (subspace) onto which we project the data
 e.g. 3D → 2D
 Find pair of vectors which define a 2D plane (surface) onto which you’re going to project your data
 Much like the “shallow box” example in compression, we’re trying to create the shallowest box possible (by defining two of its three dimensions, so the box’s depth is minimized)

PCA is NOT linear regression
 For linear regression, we fit a straight line to minimize the squared distance between the points and the line
 Note: this is the VERTICAL distance between each point and the line
 For PCA, we minimize the magnitude of the shortest orthogonal distance from each point to the line
 Gives very different effects
 More generally
 With linear regression we’re trying to predict “$y$”
 With PCA there is no “$y$”; instead we have a list of features and all features are treated equally
 If we have 3-dimensional data and reduce 3D → 2D
 Have 3 features treated symmetrically
26.2 Principal Component Analysis Algorithm

Before applying PCA, we must do data preprocessing
 Mean normalization
 Feature scaling (depending on data)

With preprocessing done, PCA finds the lower dimensional subspace which minimizes the sum of squared projection errors
 Need to compute two things: the $u$ vectors (the new directions) & the $z$ vectors (the new, lower-dimensional feature vectors)

Algorithm description

Reducing data from $n$-dimensional to $k$-dimensional

Compute the covariance matrix: $\Sigma = \frac{1}{m}\sum_{i=1}^m \left(x^{(i)} \right)\left(x^{(i)} \right)^T$
 $\Sigma$: This is commonly denoted as Σ (Greek upper case sigma), NOT the summation symbol
 Vectorized: $\Sigma = \frac{1}{m} X^TX$, where $X$ is the $m \times n$ design matrix whose rows are the examples
 $\Sigma$ is an $n \times n$ matrix
 $x^{(i)}$ is an $n \times 1$ matrix

Compute eigenvectors of matrix $\Sigma$:
[U, S, V] = svd(sigma)
svd = singular value decomposition

$U$, $S$ and $V$ are matrices

$U$ matrix is also an $n \times n$ matrix

Turns out the columns of $U$ are the $u$ vectors we want

So to reduce a system from $n$D to $k$D: just take the first $k$ vectors from $U$:



Next we need to find some way to change $x$ (which is $n$-dimensional) to $z$ (which is $k$-dimensional), i.e. reduce the dimensionality

Take the first $k$ columns of the $U$ matrix and stack them in columns to form $U_{\text{reduce}}$ (an $n \times k$ matrix):

$z = U_{\text{reduce}}^T * x = [k \times n] * [n \times 1] = [k \times 1]$


Exactly the same as with supervised learning except we’re now doing it with unlabeled data

So in summary
 Preprocessing
 Calculate $\Sigma$ (covariance matrix)
 Calculate eigenvectors with svd
 Take $k$ vectors from $U$, i.e. $U_{\text{reduce}} = U(:,1:k)$
 Calculate $z$ ($z =U_{\text{reduce}}^T * x$)
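A compact NumPy sketch of this summary (using `np.linalg.svd` in place of Octave's `svd`; names such as `pca` and `X_norm` are my own illustration):

```python
import numpy as np

def pca(X, k):
    """Reduce an (m, n) data matrix X to k dimensions."""
    m, n = X.shape
    # Preprocessing: mean normalization (add feature scaling if features differ in scale)
    mu = X.mean(axis=0)
    X_norm = X - mu
    # Covariance matrix: Sigma = (1/m) X^T X, an (n, n) matrix
    Sigma = (X_norm.T @ X_norm) / m
    # Eigenvectors via SVD; the columns of U are the u^(1), ..., u^(n) directions
    U, S, Vt = np.linalg.svd(Sigma)
    U_reduce = U[:, :k]            # first k columns, (n, k)
    Z = X_norm @ U_reduce          # z = U_reduce^T x for every example, (m, k)
    return Z, U_reduce, mu, S
```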

27 Applying PCA
27.1 Reconstruction from Compressed Representation

Earlier we spoke about PCA as a compression algorithm
 If this is the case, is there a way to decompress the data from low dimensionality back to a higher dimensionality format?

Reconstruction

Say we have an example as follows:

We have our examples ($x^{(1)}$, $x^{(2)}$ etc.)

Project onto the $z$ surface

Given a point $z^{(1)}$, how can we go back to the 2D space?


Considering
 $z = U_{\text{reduce}}^T * x$

To go in the opposite direction we must do
 $x_{\text{approx}} = U_{\text{reduce}} * z\,(=[n\times k] * [k \times 1] = [n \times 1])$

So this creates the following representation

We lose some of the information (i.e. everything is now perfectly on that line) but it is now projected back into 2D space
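Continuing the hypothetical `pca` sketch from 26.2, reconstruction is one matrix multiply (adding back the mean undoes the mean normalization done during preprocessing):

```python
X_approx = Z @ U_reduce.T + mu     # (m, k) @ (k, n) -> (m, n)
```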
27.2 Choosing the Number of Principal Components

PCA tries to minimize the average squared projection error: $\frac{1}{m}\sum_{i=1}^m \left\| x^{(i)} - x_{\text{approx}}^{(i)} \right\|^2$

Total variation in the data can be defined as the average, over the training examples, of how far they are from the origin: $\frac{1}{m}\sum_{i=1}^m \left\| x^{(i)} \right\|^2$

Typically, choose $k$ to be the smallest value so that $\frac{\frac{1}{m}\sum_{i=1}^m \left\| x^{(i)} - x_{\text{approx}}^{(i)} \right\|^2}{\frac{1}{m}\sum_{i=1}^m \left\| x^{(i)} \right\|^2} \le 0.01$
 This means “99% of variance is retained”

Algorithm:

Try PCA with $k = 1$

Compute $U_{\text{reduce}}, z^{(1)}, \dots, z^{(m)}, x_{\text{approx}}^{(1)}, \dots, x_{\text{approx}}^{(m)}$

Check whether the ratio above is less than (or equal to) 0.01
 If not, try $k = 2, 3, \dots$


Easier way:
[U, S, V] = svd(Sigma)

$S$ is an $n \times n$ diagonal matrix:

For a given $k$, the fraction of variance retained can be computed from the diagonal entries of $S$

Pick the smallest value of $k$ for which: $\frac{\sum_{i=1}^kS_{ii}}{\sum_{i=1}^nS_{ii}} \ge 0.99$ (i.e. 99% of variance retained)
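A NumPy sketch of this check, where `np.linalg.svd` returns the diagonal of $S$ as a 1-D array (variable names are illustrative):

```python
U, S, Vt = np.linalg.svd(Sigma)
variance_retained = np.cumsum(S) / np.sum(S)
k = int(np.argmax(variance_retained >= 0.99)) + 1   # smallest k with >= 99% variance retained
```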

27.3 Advice for Applying PCA
 Speeding up supervised learning algorithm
 Say you have a supervised learning problem
 Input $x$ and $y$
 $x$ is a 10,000 dimensional feature vector
 e.g. $100 \times100$ images = 10,000 pixels
 Such a huge feature vector will make the algorithm slow
 With PCA we can reduce the dimensionality and make it tractable
 How
 1) Extract the $x$ vectors (drop the $y$ labels)
 So we now have an unlabeled training set
 2) Apply PCA to $x$ vectors
 So we now have a reduced dimensional feature vector $z$
 3) This gives you a new training set
 Each vector can be reassociated with the label
 4) Take the reduced dimensionality data set and feed to a learning algorithm
 Use $y$ as labels and $z$ as feature vector
 5) If you have a new example, map it from the higher dimensionality vector to the lower dimensionality vector, then feed it into the learning algorithm
 PCA maps one vector to a lower dimensionality vector
 $x \rightarrow z$
 Defined by PCA only on the training set
 The mapping computes a set of parameters
 Feature scaling values
 $U_{\text{reduce}}$
 Parameter learned by PCA
 Should be obtained only by determining PCA on your training set
 So we use those learned parameters for our
 Cross validation set
 Test set
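A sketch of this workflow, reusing the hypothetical `pca` helper from 26.2 and assuming `X_train`, `X_cv`, `X_test` matrices already exist:

```python
# Learn the mapping x -> z on the training set only
# (e.g. reduce the 10,000-dimensional x to a much smaller z)
Z_train, U_reduce, mu, _ = pca(X_train, k=1000)

# Apply the *same* learned parameters (mu, U_reduce) to cross-validation and test sets
Z_cv   = (X_cv   - mu) @ U_reduce
Z_test = (X_test - mu) @ U_reduce

# Train the supervised learner on (Z_train, y_train); evaluate on (Z_cv, y_cv)
```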
 Applications of PCA
 Compression
 Reduce memory/disk needed to store data
 Speed up learning algorithm
 Visualization
 Typically choose $k = 2$ or $k = 3$
 Because we can plot these values!
 A bad use of PCA: Use it to prevent overfitting
 Reasoning
 If we have $x^{(i)}$ with $n$ features, then $z^{(i)}$ has $k$ features, which can be lower
 If we only have $k$ features then maybe we’re less likely to overfit…
 This doesn’t work
 BAD APPLICATION
 Might work OK, but not a good way to address overfitting
 Better to use regularization
 PCA throws away some data without knowing which values it’s losing
 Probably OK if you’re keeping most of the data
 But if you’re throwing away some crucial data, that’s bad
 So you have to go to something like 95–99% variance retained
 So here regularization will give you AT LEAST as good a way to address overfitting
 A second PCA myth
 Used for compression or visualization: good
 Sometimes used to
 Design ML system with PCA from the outset
 But, what if you did the whole thing without PCA? (try it)
 See how a system performs without PCA
 When to use PCA: ONLY if you have a reason to believe PCA will help should you then add PCA
 PCA is easy enough to add on as a processing step
 Try without first!
Ex7: K-means Clustering and PCA 👨‍💻
See this exercise on CourseraMachineLearningPython/ex7/ex7.ipynb