Description
1. Model identification using PCA
Consider the flow process shown in Fig. 1, which consists of five streams whose
flow rates are all measured. A data set (flowdata3.mat) consisting of 1000
samples corresponding to different steady states has been obtained.
(a) Apply PCA to identify the linear constraint model relating the variables
(assuming that you know the number of linear relations that exist between the
variables). In order to verify whether your constraint model is good, choose F3
and F5 as independent variables and obtain the relationship between the
dependent and independent variables (regression form of the model) using your
estimated constraint model and find the maximum absolute difference (maxdiff)
between estimated regression model coefficients and true regression model
coefficients. Report the eigenvalues and maxdiff value.
(b) Apply IPCA to estimate the error variances (diagonal error covariance matrix)
and identify the linear steady-state model relating the flow variables (assuming
that you know the number of linear relations that exist between the variables).
Report the estimated variances,
eigenvalues and maxdiff value.
(c) Apply IPCA assuming incorrectly that there are four constraints. Report the
eigenvalues obtained. Are you able to determine from the eigenvalues that the
number of constraints has been incorrectly guessed? Give reasons for your
answer.
(d) From the constraint model identified in (b), suggest a procedure (a measure)
by which you can determine a set of independent variables for the process.
Determine the best and worst possible choice of independent variable set for this
system based on your proposed measure and justify whether these inferences
(obtained from data) are consistent with the physical process.
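
The steps in (a) and (d) can be sketched as follows, in Python/NumPy rather than MATLAB and purely as an illustration. The network used here (F1 = F3 + F5, F2 = F3, F4 = F5) is hypothetical and is not the process in Fig. 1; the function names, the noise level and the use of an uncentered covariance matrix are assumptions. The condition number of the dependent-variable block of the constraint matrix is shown as one candidate measure for (d), not as the expected answer.

```python
import itertools
import numpy as np

def pca_constraint_model(Z, n_constraints):
    """Eigenvalues (descending) and constraint matrix A (n_constraints x n_vars)."""
    S = Z.T @ Z / Z.shape[0]               # uncentered covariance; rows of Z are samples
    eigvals, eigvecs = np.linalg.eigh(S)   # eigh returns eigenvalues in ascending order
    A = eigvecs[:, :n_constraints].T       # eigenvectors of the smallest eigenvalues
    return eigvals[::-1], A

def regression_form(A, dep, indep):
    """Solve A_d x_d + A_i x_i = 0 for x_d, giving x_d = R x_i."""
    return -np.linalg.solve(A[:, dep], A[:, indep])

def rank_independent_sets(A):
    """Condition number of A_d for each independent set, sorted from best to worst."""
    n_constraints, n_vars = A.shape
    scores = {}
    for indep in itertools.combinations(range(n_vars), n_vars - n_constraints):
        dep = [j for j in range(n_vars) if j not in indep]
        scores[indep] = np.linalg.cond(A[:, dep])
    return sorted(scores.items(), key=lambda kv: kv[1])

# Hypothetical network (NOT the one in Fig. 1): F1 = F3 + F5, F2 = F3, F4 = F5.
rng = np.random.default_rng(0)
dep, indep = [0, 1, 3], [2, 4]             # zero-based indices of (F1, F2, F4) and (F3, F5)
R_true = np.array([[1.0, 1.0], [1.0, 0.0], [0.0, 1.0]])
X = np.empty((1000, 5))
X[:, indep] = rng.uniform(10.0, 30.0, size=(1000, 2))
X[:, dep] = X[:, indep] @ R_true.T
Z = X + 0.1 * rng.standard_normal(X.shape)  # assumed: equal error variance on every stream

eigvals, A = pca_constraint_model(Z, n_constraints=3)
maxdiff = np.max(np.abs(regression_form(A, dep, indep) - R_true))
ranking = rank_independent_sets(A)
best, worst = ranking[0], ranking[-1]
print("eigenvalues:", eigvals)
print("maxdiff:", maxdiff)
print("best set:", best, "worst set:", worst)
```

In this made-up network, F2 and F3 carry the same flow, so choosing both as independent variables leaves the remaining flows undetermined and the corresponding block of A is nearly singular. With the real data, the variables would be loaded from flowdata3.mat and the true regression coefficients taken from Fig. 1.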
2. Multivariate calibration model using PCA
Multivariate calibration of spectral measurements is a technique that is used in
chemometrics to develop a model relating spectral measurements (obtained using
instruments such as UV, FIR, NIR or MS spectrophotometers) to properties of
species (usually liquids or gases), such as their concentrations. The
application we consider is to obtain a model relating UV absorbance spectra to
compositions (concentrations) of mixtures. Such a model is useful in online
monitoring of chemical and biochemical reactions.
Twenty-six samples of different concentrations of a mixture of Co, Cr, and Ni ions
in dilute nitric acid were prepared in a laboratory and their spectra recorded over
the range 300-650 nm using a HP 8452 UV diode array spectrophotometer (data
in Inorfull.mat). (Water and ethanol are generally used as solvents since they do
not absorb in the UV range. Nitrate ions also do not absorb in the UV range, so
an aqueous solution of nitric acid is used to dissolve the metals in this
experiment.) Five replicates of each mixture were obtained. The measurements
were made at 2 nm intervals giving rise to an absorbance matrix of size 130 x 176.
The concentrations of the 26 samples, which form a 26 x 3 matrix, are also given in
the data file. In order to predict the concentrations of a mixture from its absorbance
measurements, it is necessary to build a calibration model relating the concentrations
of mixtures to their absorbance spectra. According to the Beer-Lambert law, the
absorbance spectrum of a dilute mixture is a linear (weighted) combination of the
pure component spectra with the weights corresponding to the concentrations of
the species in the mixture.
If absorbances are measured at only a minimum number of wavelengths, then OLS
can be used to build a calibration model. For example, if a mixture contains ns
non-reacting species, then absorbances at ns wavelengths need to be measured.
Typically, the wavelengths are chosen corresponding to the maximum absorbing
wavelengths of individual species.
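
As a small illustration of the Beer-Lambert mixing rule and of OLS calibration at ns wavelengths, the sketch below (Python/NumPy) builds synthetic absorbances from made-up Gaussian pure-component spectra. The band centers, widths, concentration range and noise level are all assumptions and do not represent the real Co, Cr and Ni spectra in Inorfull.mat.

```python
import numpy as np

rng = np.random.default_rng(1)
wavelengths = np.arange(300, 651, 2)            # 300-650 nm in 2 nm steps: 176 points

# Hypothetical pure-component spectra, one Gaussian band per species (3 x 176).
centers = np.array([400.0, 500.0, 580.0])
S = np.exp(-0.5 * ((wavelengths[None, :] - centers[:, None]) / 30.0) ** 2)

C = rng.uniform(0.05, 0.5, size=(26, 3))        # made-up concentrations of 26 mixtures
A = C @ S + 0.002 * rng.standard_normal((26, 176))  # Beer-Lambert mixing plus noise

# Keep one wavelength per species: the maximum of each pure spectrum.
picked = S.argmax(axis=1)
A_sel = A[:, picked]                            # 26 x 3, full column rank

# OLS calibration without intercept: C is approximated by A_sel @ B.
B, *_ = np.linalg.lstsq(A_sel, C, rcond=None)
rmse = np.sqrt(np.mean((A_sel @ B - C) ** 2))
print("selected wavelengths (nm):", wavelengths[picked])
print("calibration RMSE:", rmse)
```

Here the concentrations are treated as the dependent variables and no intercept is fitted, since the Beer-Lambert law passes through the origin; both choices are assumptions of this sketch.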
However, if we measure absorbances at nw > ns wavelengths, then the absorbance
matrix will not have full column rank. In this
case, Principal Component Regression can be used to develop a multivariate
calibration model. In this method PCA is first applied to the absorbance matrix to
obtain the scores corresponding to different mixtures. In the second step, a
regression model is used to relate the concentrations to the scores using OLS
(assuming concentrations are the dependent variables). In order to use this model
for predicting the concentrations of a mixture whose absorbance spectrum is given,
we first obtain the scores and then use the OLS regression model to predict the
concentrations. Note that the true rank of the absorbance matrix is equal to the
number of species in the mixture.
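
The two PCR steps and the prediction step described above can be sketched as follows (Python/NumPy). The function names pcr_fit and pcr_predict are hypothetical, the data are synthetic stand-ins for Inorfull.mat, and the absorbances are not mean-centered, which is an assumption made here because the Beer-Lambert law has no intercept.

```python
import numpy as np

def pcr_fit(A, C, n_pcs):
    """Step 1: PCA of the absorbance matrix. Step 2: OLS of concentrations on scores."""
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    P = Vt[:n_pcs].T                               # loadings (n_wavelengths x n_pcs)
    B, *_ = np.linalg.lstsq(A @ P, C, rcond=None)  # regress C on the scores A @ P
    return P, B

def pcr_predict(A_new, P, B):
    """Compute the scores of the new spectra, then apply the OLS model."""
    return (np.atleast_2d(A_new) @ P) @ B

# Synthetic stand-in data: three made-up Gaussian pure spectra, 26 mixtures.
rng = np.random.default_rng(2)
wl = np.arange(300, 651, 2)
centers = np.array([400.0, 500.0, 580.0])
S = np.exp(-0.5 * ((wl[None, :] - centers[:, None]) / 30.0) ** 2)
C = rng.uniform(0.05, 0.5, size=(26, 3))
A = C @ S + 0.002 * rng.standard_normal((26, 176))

P, B = pcr_fit(A, C, n_pcs=3)

# Predict a new mixture from its (noisy) spectrum.
c_new = np.array([0.2, 0.3, 0.1])
a_new = c_new @ S + 0.002 * rng.standard_normal(176)
c_pred = pcr_predict(a_new, P, B)[0]
print("true:", c_new, "predicted:", c_pred)
```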
The quality of the linear calibration model is evaluated using leave-one-sample-out
cross-validation (LOOCV) and computing the root mean square error (RMSE)
in predicting the left out sample concentrations. Pick the first replicate for each
mixture to obtain a data matrix of size 26 x 176 and use it for the following different
multivariate calibration modelling methods. For each method report the LOOCV
RMSE results in the form of a table, with the number of PCs varied from 1 to 5.
Based on the RMSE values, indicate whether you are able to estimate the number
of species correctly.
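
A minimal sketch of the LOOCV loop for PCR is given below (Python/NumPy). The function name pcr_loocv_rmse is hypothetical, the data are synthetic with three made-up species, and the RMSE is pooled over all left-out samples and all species, which is one possible reading of the definition above.

```python
import numpy as np

def pcr_loocv_rmse(A, C, n_pcs):
    """Leave-one-sample-out RMSE of PCR predictions (no mean-centering)."""
    n = A.shape[0]
    err = np.empty(C.shape)
    for i in range(n):
        keep = np.arange(n) != i                    # drop sample i from the calibration set
        _, _, Vt = np.linalg.svd(A[keep], full_matrices=False)
        P = Vt[:n_pcs].T
        B, *_ = np.linalg.lstsq(A[keep] @ P, C[keep], rcond=None)
        err[i] = A[i] @ P @ B - C[i]                # prediction error for the left-out sample
    return np.sqrt(np.mean(err ** 2))

# Synthetic stand-in for the 26 x 176 first-replicate matrix (assumed spectra and noise).
rng = np.random.default_rng(3)
wl = np.arange(300, 651, 2)
centers = np.array([400.0, 500.0, 580.0])
S = np.exp(-0.5 * ((wl[None, :] - centers[:, None]) / 30.0) ** 2)
C = rng.uniform(0.05, 0.5, size=(26, 3))
A = C @ S + 0.002 * rng.standard_normal((26, 176))

rmse = {k: pcr_loocv_rmse(A, C, k) for k in range(1, 6)}
for k, v in rmse.items():
    print(f"{k} PCs: LOOCV RMSE = {v:.5f}")
```

On this synthetic data the RMSE drops sharply once the number of PCs reaches the number of simulated species; whether the real data behave the same way is exactly what the exercise asks you to check.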
(a) Develop a multivariate calibration model using PCR.
(b) The absorbances are very noisy near the ends of the instrument. Estimate the
standard deviation of errors in absorbance measurements using the five replicates
for each wavelength and for each mixture. Assume that the error standard
deviations vary significantly with respect to wavelength but are almost the same for all
mixtures (verify this by plotting the estimated standard deviations with respect to
wavelength and mixture). Therefore, obtain the average standard deviation of errors
for each wavelength. Use these standard deviations to scale the
absorbance measurements for each wavelength before applying PCR to develop
the calibration model (known as scaled PCR).
(c) Use IPCA to estimate the error variances with respect to wavelength in step 1
of PCR and use them to develop the calibration model (known as IPCR).
(d) If the error variances vary with respect to both mixtures and wavelengths,
then Maximum Likelihood PCA (MLPCA), proposed by Wentzell et al. (1997), can
be used to reduce the rank of the absorbance matrix, after which OLS is used to develop
the calibration model (also known as MLPCR). Write a MATLAB function to
implement MLPCA given a data matrix, corresponding error standard deviation
matrix, and number of factors (or PCs). The function should return the scores
matrix. Use this function and the standard deviation of errors for each wavelength
and mixture estimated directly from the replicate measurements to develop the
calibration model using MLPCR.
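
For orientation only, the sketch below shows one way the replicate-based standard deviations and an alternating-regression form of MLPCA for independent errors could look. It is written in Python/NumPy rather than MATLAB and is not a substitute for the requested function. The names replicate_std and mlpca, the stopping rule, the synthetic data and the assumption that the five replicates of each mixture occupy consecutive rows are all hypothetical and should be checked against Inorfull.mat and the original paper.

```python
import numpy as np

def replicate_std(A_all, n_mixtures, n_reps):
    """Std dev per mixture and wavelength. ASSUMES replicates are consecutive rows."""
    return A_all.reshape(n_mixtures, n_reps, -1).std(axis=1, ddof=1)

def mlpca(X, Xsd, n_factors, max_iter=200, tol=1e-8):
    """Alternating-regression MLPCA sketch for independent errors with std devs Xsd."""
    W = 1.0 / Xsd ** 2                                  # inverse error variances
    U = np.linalg.svd(X, full_matrices=False)[0][:, :n_factors]
    Xhat = np.empty(X.shape)
    prev = np.inf
    for _ in range(max_iter):
        for j in range(X.shape[1]):                     # weighted projection of each column on U
            UtW = U.T * W[:, j]
            Xhat[:, j] = U @ np.linalg.solve(UtW @ U, UtW @ X[:, j])
        V = np.linalg.svd(Xhat, full_matrices=False)[2][:n_factors].T
        for i in range(X.shape[0]):                     # weighted projection of each row on V
            VtW = V.T * W[i, :]
            Xhat[i, :] = V @ np.linalg.solve(VtW @ V, VtW @ X[i, :])
        obj = np.sum(W * (X - Xhat) ** 2)               # weighted sum of squared residuals
        U = np.linalg.svd(Xhat, full_matrices=False)[0][:, :n_factors]
        if abs(prev - obj) <= tol * max(obj, 1.0):
            break
        prev = obj
    Uf, s, Vt = np.linalg.svd(Xhat, full_matrices=False)
    scores = Uf[:, :n_factors] * s[:n_factors]
    loadings = Vt[:n_factors].T
    return scores, loadings, obj

# Synthetic stand-in: 26 mixtures x 5 replicates, noise growing toward both spectral ends.
rng = np.random.default_rng(4)
wl = np.arange(300, 651, 2)
centers = np.array([400.0, 500.0, 580.0])
S = np.exp(-0.5 * ((wl[None, :] - centers[:, None]) / 30.0) ** 2)
C = rng.uniform(0.05, 0.5, size=(26, 3))
sd_wl = 0.002 + 0.02 * np.abs(wl - 475.0) / 175.0
A_all = np.repeat(C @ S, 5, axis=0) + sd_wl * rng.standard_normal((130, 176))

Xsd = replicate_std(A_all, 26, 5)                       # 26 x 176 error std devs
X = A_all[::5]                                          # first replicate of each mixture
scores, loadings, obj = mlpca(X, Xsd, n_factors=3)
B, *_ = np.linalg.lstsq(scores, C, rcond=None)          # MLPCR: OLS of C on the ML scores
print("weighted objective:", obj)
```

Averaging Xsd over its rows (Xsd.mean(axis=0)) gives one standard deviation per wavelength, as needed for scaled PCR in (b). For cross-validation, the scores of a left-out spectrum would have to be obtained by the same weighted projection onto the loadings, using that sample's own error variances.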