# 偏最小二乘回归和主成分回归

### 加载数据

```load spectra whos NIR octane ```
``` Name Size Bytes Class Attributes NIR 60x401 192480 double octane 60x1 480 double ```
```[dummy,h] = sort(octane); oldorder = get(gcf,'DefaultAxesColorOrder'); set(gcf,'DefaultAxesColorOrder',jet(60)); plot3(repmat(1:401,60,1)',repmat(octane(h),1,401)',NIR(h,:)'); set(gcf,'DefaultAxesColorOrder',oldorder); xlabel('Wavelength Index'); ylabel('Octane'); axis('tight'); grid on ```

### 使用两个成分拟合数据

```X = NIR; y = octane; [n,p] = size(X); [Xloadings,Yloadings,Xscores,Yscores,betaPLS10,PLSPctVar] = plsregress(... X,y,10); ```

```plot(1:10,cumsum(100*PLSPctVar(2,:)),'-bo'); xlabel('Number of PLS components'); ylabel('Percent Variance Explained in Y'); ```

```[Xloadings,Yloadings,Xscores,Yscores,betaPLS] = plsregress(X,y,2); yfitPLS = [ones(n,1) X]*betaPLS; ```

```[PCALoadings,PCAScores,PCAVar] = pca(X,'Economy',false); betaPCR = regress(y-mean(y), PCAScores(:,1:2)); ```

```betaPCR = PCALoadings(:,1:2)*betaPCR; betaPCR = [mean(y) - mean(X)*betaPCR; betaPCR]; yfitPCR = [ones(n,1) X]*betaPCR; ```

```plot(y,yfitPLS,'bo',y,yfitPCR,'r^'); xlabel('Observed Response'); ylabel('Fitted Response'); legend({'PLSR with 2 Components' 'PCR with 2 Components'}, ... 'location','NW'); ```

```TSS = sum((y-mean(y)).^2); RSS_PLS = sum((y-yfitPLS).^2); rsquaredPLS = 1 - RSS_PLS/TSS ```
```rsquaredPLS = 0.9466 ```
```RSS_PCR = sum((y-yfitPCR).^2); rsquaredPCR = 1 - RSS_PCR/TSS ```
```rsquaredPCR = 0.1962 ```

```plot3(Xscores(:,1),Xscores(:,2),y-mean(y),'bo'); legend('PLSR'); grid on; view(-30,30); ```

```plot3(PCAScores(:,1),PCAScores(:,2),y-mean(y),'r^'); legend('PCR'); grid on; view(-30,30); ```

```plot(1:10,100*cumsum(PLSPctVar(1,:)),'b-o',1:10, ... 100*cumsum(PCAVar(1:10))/sum(PCAVar(1:10)),'r-^'); xlabel('Number of Principal Components'); ylabel('Percent Variance Explained in X'); legend({'PLSR' 'PCR'},'location','SE'); ```

PCR 曲线更均匀的事实说明了为什么具有两个成分的 PCR 在拟合 `y` 方面比 PLSR 差。PCR 构造成分是为了最好地解释 `X`，因此前两个成分忽略了数据中对拟合观察到的 `y` 重要的信息。

### 使用更多成分拟合数据

```yfitPLS10 = [ones(n,1) X]*betaPLS10; betaPCR10 = regress(y-mean(y), PCAScores(:,1:10)); betaPCR10 = PCALoadings(:,1:10)*betaPCR10; betaPCR10 = [mean(y) - mean(X)*betaPCR10; betaPCR10]; yfitPCR10 = [ones(n,1) X]*betaPCR10; plot(y,yfitPLS10,'bo',y,yfitPCR10,'r^'); xlabel('Observed Response'); ylabel('Fitted Response'); legend({'PLSR with 10 components' 'PCR with 10 Components'}, ... 'location','NW'); ```

### 通过交叉验证选择成分数量

`plsregress` 有一个选项可以通过交叉验证来估计均方预测误差 (MSEP)，此处我们使用 10 折交叉验证。

```[Xl,Yl,Xs,Ys,beta,pctVar,PLSmsep] = plsregress(X,y,10,'CV',10); ```

```PCRmsep = sum(crossval(@pcrsse,X,y,'KFold',10),1) / n; ```

PLSR 的 MSEP 曲线表明，两个或三个成分的效果已经足够好。另一方面，PCR 需要四个成分才能获得同样的预测准确度。

```plot(0:10,PLSmsep(2,:),'b-o',0:10,PCRmsep,'r-^'); xlabel('Number of components'); ylabel('Estimated Mean Squared Prediction Error'); legend({'PLSR' 'PCR'},'location','NE'); ```

### 模型精简

PLS 权重是定义 PLS 成分的原始变量的线性组合，即，它们描述 PLSR 中的每个成分依赖原始变量的程度和方向。

```[Xl,Yl,Xs,Ys,beta,pctVar,mse,stats] = plsregress(X,y,3); plot(1:401,stats.W,'-'); xlabel('Variable'); ylabel('PLS Weight'); legend({'1st Component' '2nd Component' '3rd Component'}, ... 'location','NW'); ```

```plot(1:401,PCALoadings(:,1:4),'-'); xlabel('Variable'); ylabel('PCA Loading'); legend({'1st Component' '2nd Component' '3rd Component' ... '4th Component'},'location','NW'); ```