Main Content

Visualizing and Preprocessing Hyphenated Mass Spectrometry Data Sets for Metabolite and Protein/Peptide Profiling

This example shows how to manipulate, preprocess and visualize data from Liquid Chromatography coupled with Mass Spectrometry (LC/MS). These large and high dimensional data sets are extensively utilized in proteomics and metabolomics research. Visualizing complex peptide or metabolite mixtures provides an intuitive method to evaluate the sample quality. In addition, methodical correction and preprocessing can lead to automated high throughput analysis of samples allowing accurate identification of significant metabolites and specific peptide features in a biological sample.


In a typical hyphenated mass spectrometry experiment, proteins and metabolites are harvested from cells, tissues, or body fluids, dissolved and denatured in solution, and enzymatically digested into mixtures. These mixtures are then separated either by High Performance Liquid Chromatography (HPLC), capillary electrophoresis (CE), or gas chromatography (GC) and coupled to a mass-spectrometry identification method, such as Electrospray Ionization Mass Spectrometry (ESI-MS), matrix assisted ionization (MALDI or SELDI TOF-MS), or tandem mass spectrometry (MS/MS).

Open Data Repositories and mzXML File Format

For this example, we use a test sample LC-ESI-MS data set with a seven protein mix. The data in this example (7MIX_STD_110802_1) is from the Sashimi Data Repository. The data set is not distributed with MATLAB®. To complete this example, you must download the data set into a local directory or your own repository. Alternatively, you can try other data sets available in other public databases for protein expression data such as the Peptide Atlas at the Institute of Systems Biology [1].

Most of the current mass spectrometers can translate or save the acquisition data using the mzXML schema. This standard is an XML (eXtensible Markup Language)-based common file format developed by the Sashimi project to address the challenges involved in representing data sets from different manufacturers and from different experimental setups into a common and expandable schema. mzXML files used in hyphenated mass spectrometry are usually very large. The MZXMLINFO function allows you to obtain basic information about the dataset without reading it into memory. For example, you can retrieve the number of scans, the range of the retention time, and the number of tandem MS instruments (levels).

info = mzxmlinfo('7MIX_STD_110802_1.mzXML','NUMOFLEVELS',true)
info = 

  struct with fields:

                            Filename: '7MIX_STD_110802_1.mzXML'
                         FileModDate: '01-Feb-2013 11:54:30'
                            FileSize: 26789612
                       NumberOfScans: 7161
                           StartTime: 'PT0.00683333S'
                             EndTime: 'PT200.036S'
       DataProcessingIntensityCutoff: 'N/A'
            DataProcessingCentroided: 'true'
            DataProcessingDeisotoped: 'N/A'
    DataProcessingChargeDeconvoluted: 'N/A'
       DataProcessingSpotIntegration: 'N/A'
                    NumberOfMSLevels: 2

The MZXMLREAD function reads the XML document into a MATLAB structure. The fields scan and index are placed at the first level of the output structure for improved access to the spectral data. The remainder of the mzXML document tree is parsed according to the schema specifications. This LC/MS data set contains 7161 scans with two MS levels. For this example you will use only the first level scans. Second level spectra are usually used for peptide/protein identification, and come at a later stage in some types of workflow analyses. MZXMLREAD can filter the desired scans without loading all the dataset into memory:

mzXML_struct = mzxmlread('7MIX_STD_110802_1.mzXML','LEVEL',1)
mzXML_struct = 

  struct with fields:

     scan: [2387×1 struct]
    mzXML: [1×1 struct]
    index: [1×1 struct]

If you receive any errors related to memory or java heap space during the loading, try increasing your java heap space as described here.

More detailed information pertaining the mass-spectrometer and the experimental conditions are found in the field msRun.

ans = 

  struct with fields:

         scanCount: 7161
         startTime: "PT0.00683333S"
           endTime: "PT200.036S"
        parentFile: [1×1 struct]
      msInstrument: [1×1 struct]
    dataProcessing: [1×1 struct]

To facilitate the handling of the data, the MZXML2PEAKS function extracts the list of peaks from each scan into a cell array (peaks]) and their respective retention time into a column vector (time). You can extract the spectra of certain level by setting the LEVEL input parameter.

[peaks,time] = mzxml2peaks(mzXML_struct);
numScans = numel(peaks)
numScans =


The MSDOTPLOT function creates an overview display of the most intense peaks in the entire data set. In this case, we visualize only the most intense 5% ion intensity peaks by setting the input parameter QUANTILE to 0.95.

h = msdotplot(peaks,time,'quantile',.95);
title('5 Percent Overall Most Intense Peaks')

You can also filter the peaks individually for each scan using a percentile of the base peak intensity. The base peak is the most intense peak found in each scan [2]. This parameter is given automatically by most of the spectrometers. This operation requires querying into the mxXML structure to obtain the base peak information. Note that you could also find the base peak intensity by iterating the MAX function over the peak list.

basePeakInt = [mzXML_struct.scan.basePeakIntensity]';
peaks_fil = cell(numScans,1);
for i = 1:numScans
    h = peaks{i}(:,2) > (basePeakInt(i).*0.75);
    peaks_fil{i} = peaks{i}(h,:);

title('Peaks Above (0.75 x Base Peak Intensity) for Each Scan')
  Name                Size               Bytes  Class     Attributes

  basePeakInt      2387x1                19096  double              
  peaks            2387x1             14031800  cell                
  peaks_fil        2387x1               289568  cell                

You can customize a 3-D overview of the filtered peaks using the STEM3 function. The STEM3 function requires to put the data into three vectors, whose elements form the triplets (the retention time, the mass/charge, and the intensity value) that represent every stem.

peaks_3D = cell(numScans,1);
for i = 1:numScans
    peaks_3D{i}(:,[2 3]) = peaks_fil{i};
    peaks_3D{i}(:,1) = time(i);
peaks_3D = cell2mat(peaks_3D);

axis([0 12000 400 1500 0 1e9])
xlabel('Retention Time (seconds)')
ylabel('Mass/Charge (M/Z)')
zlabel('Relative Ion Intensity')
title('Peaks Above (0.75 x Base Peak Intensity) for Each Scan')

You can plot colored stems using the PATCH function. For every triplet in peaks_3D, interleave a new triplet with the intensity value set to zero. Then create a color vector dependent on the intensity of the stem. A logarithmic transformation enhances the dynamic range of the colormap. For the interleaved triplets assign a NaN, so that PATCH function does not draw lines connecting contiguous stems.

peaks_patch = sortrows(repmat(peaks_3D,2,1));
peaks_patch(2:2:end,3) = 0;

col_vec = log(peaks_patch(:,3));
col_vec(2:2:end) = NaN;

axis([0 12000 400 1500 0 1e9])
xlabel('Retention Time (seconds)')
ylabel('Mass/Charge (M/Z)')
zlabel('Relative Ion Intensity')
title('Peaks Above (0.75 x Base Peak Intensity) for Each Scan')


Creating Heat Maps of LC/MS Data Sets

Common techniques in the industry work with peak information (a.k.a. centroided data) instead of raw signals. This may save memory, but some important details are not visible, especially when it is necessary to inspect samples with complex mixtures. To further analyze this data set, we can create a common grid in the mass/charge dimension. Since not all of the scans have enough information to reconstruct the original signal, we use a peak preserving resampling method. By choosing the appropriate parameters for the MSPPRESAMPLE function, you can ensure that the resolution of the spectra is not lost, and that the maximum values of the peaks correlate to the original peak information.

[MZ,Y] = msppresample(peaks,5000);
  Name         Size                 Bytes  Class     Attributes

  MZ        5000x1                  20000  single              
  Y         5000x2387            47740000  single              

With this matrix of ion intensities, Y, you can create a colored heat map. The MSHEATMAP function automatically adjusts the colorbar utilized to show the statistically significant peaks with hot colors and the noisy peaks with cold colors. The algorithm is based on clustering significant peaks and noisy peaks by estimating a mixture of Gaussians with an Expectation-Maximization approach. Additionally, you can use the MIDPOINT input parameter to select an arbitrary threshold to separate noisy peaks from significant peaks, or you can interactively shift the colormap to hide or unhide peaks. When working with heat maps, it is common to display the logarithm of the ion intensities, which enhances the dynamic range of the colormap.

fh1 = msheatmap(MZ,time,log(Y),'resolution',.1,'range',[500 900]);
title('Original LC/MS Data Set')

You can zoom to small regions of interest to observe the data, either interactively or programmatically using the AXIS function. We observe some regions with high relative ion intensity. These represent peptides in the biological sample.

axis([527 539 385 496])

You can overlay the original peak information of the LC/MS data set. This lets you evaluate the performance of the peak-preserving resampling operation. You can use the returned handle to hide/unhide the dots.

dp1 = msdotplot(peaks,time);

Calibrating the Mass/Charge Location of Peaks to a Common Grid

The two dimensional peaks appear to be noisy or they do not show a compact shape in contiguous spectra. This is a common problem for many mass spectrometers. Random fluctuations of the mass/charge value extracted from peaks of replicate profiles are reported to range from 0.1% to 0.3% [3]. Such variability can be caused by several factors, e.g. poor calibration of the detector, low signal-to-noise ratio, or problems in the peak extraction algorithms. The MSPALIGN function implements advanced data binning algorithms that synchronize all the spectra in a data set to a common mass/charge grid (CMZ). CMZ can be chosen arbitrarily or it can be estimated after analyzing the data [2,4,5]. The peak matching procedure can use either a nearest neighbor rule or a dynamic programming alignment.

[CMZ, peaks_CMZ] = mspalign(peaks);

Repeat the visualization process with the aligned peaks: perform peak preserving resampling, create a heat map, overlay the aligned peak information, and zoom into the same region of interest as before. When the spectrum is re-calibrated, it is possible to distinguish the isotope patterns of some of the peptides.

[MZ_A,Y_A] = msppresample(peaks_CMZ,5000);
fh2 = msheatmap(MZ_A,time,log(Y_A),'resolution',.10,'range',[500 900]);
title('LC/MS Data Set with the Mass/Charge Calibrated to a CMZ')
dp2 = msdotplot(peaks_CMZ,time);
axis([527 539 385 496])

Calibrating the Mass/Charge Location of Peaks Locally

MSPALIGN computes a single CMZ for the whole LC/MS data set. This may not be the ideal case for samples with more complex mixtures of peptides and/or metabolites than the data set utilized in this example. In the case of complex mixtures, you can align each spectrum to a local set of spectra that contain only informative peaks (high intensity) with similar retention times, otherwise the calibration process in regions with small peaks (low intensity) can be biased by other peaks that share similar mass/charge values but are at different retention times. To perform a finer calibration, you can employ the SAMPLEALIGN function. This function is a generalization of the Constrained Dynamic Time Warping (CDTW) algorithms commonly utilized in speech processing [6]. In the following for loop, we maintain a buffer with the intensities of the previous aligned spectra (LAI). The ion intensities of the spectra are scaled with the anonymous function SF (inside SAMPLEALIGN) to reduce the distance between large peaks. SAMPLEALIGN reduces the overall distance of all matched points and introduces gaps as necessary. In this case we use a finer MZ vector (FMZ), such that we preserve the correct value of the mass/charge of the peaks as much as possible. Note: this may take some time, as the CDTW algorithm is executed 2,387 times.

SF = @(x) 1-exp(-x./5e7); % scaling function
DF = @(R,S) sqrt((SF(R(:,2))-SF(S(:,2))).^2 + (R(:,1)-S(:,1)).^2);

FMZ = (500:0.15:900)';  % setup a finer MZ vector
LAI = zeros(size(FMZ)); % init buffer for the last alignment intensities

peaks_FMZ = cell(numScans,1);
for i = 1:numScans
    % show progress
    if ~rem(i,250)
        fprintf(' %d...',i);
    % align peaks in current scan to LAI
    [k,j] = samplealign([FMZ,LAI],double(peaks{i}),'band',1.5,'gap',[0,2],'dist',DF);
    % updating the LAI buffer
    LAI = LAI*.25;
    LAI(k) = LAI(k) + peaks{i}(j,2);
    % save the alignment
    peaks_FMZ{i} = [FMZ(k) peaks{i}(j,2)];
 250... 500... 750... 1000... 1250... 1500... 1750... 2000... 2250...

Repeat the visualization process and zoom to the region of interest.

[MZ_B,Y_B] = msppresample(peaks_FMZ,4000);
fh3 = msheatmap(MZ_B,time,log(Y_B),'resolution',.10,'range',[500 900]);
title('LC/MS Data Set with the Mass/Charge Calibrated Locally')
dp3 = msdotplot(peaks_FMZ,time);
axis([527 539 385 496])

As a final step to improve the image, you can apply a Gaussian filter in the chromatographic direction to smooth the whole data set.

Gpulse = exp(-.1*(-10:10).^2)./sum(exp(-.1*(-10:10).^2));
YF = convn(Y_B,Gpulse,'same');
fh4 = msheatmap(MZ_B,time,log(YF),'resolution',.10,'limits',[500 900]);
title('Final Enhanced LC/MS Data Set')
dp4 = msdotplot(peaks_FMZ,time);
axis([527 539 385 496])

You can link the axes of two heat maps, to interactively or programmatically compare regions between two data sets. In this case we compare the original and the final enhanced LC/MS matrices.

linkaxes(findobj([fh1 fh4],'Tag','MSHeatMap'))
axis([521 538 266 617])

Extracting Spectra Using the Total Ion Chromatogram

Once the LC/MS data set is smoothed and resampled into a regular grid, it is possible to extract the most informative spectra by looking at the local maxima of the Total Ion Chromatogram (TIC). The TIC is straightforwardly computed by summing the rows of YF. Then, use the MSPEAKS function to find the retention time values for extracting selected subsets of spectra.

TIC = mean(YF);
pt = mspeaks(time,TIC','multiplier',10,'overseg',100,'showplot',true);
title('Extracted peaks from the Total Ion Chromatogram (TIC)')
pt(pt(:,1)>4000,:) = []; % remove spectra above 4000 seconds
numPeaks = size(pt,1)
numPeaks =


Create a 3-D plot of the selected spectra.

xRows = samplealign(time,pt(:,1),'width',1); % finds the time index for every peak
xSpec = YF(:,xRows);                         % gets the signals to plot

hold on
box on

ax = gca;
ax.YTick = 1:numPeaks;
ax.YTickLabel = num2str(time(xRows));
axis([500 900 0 numPeaks 0 1e8])

xlabel('Mass/Charge (M/Z)')
zlabel('Relative Ion Intensity')
title('Extracted Spectra Subset')

Overlay markers for the extracted spectra over the enhanced heatmap.

hold on
for i = 1:numPeaks
    plot([400 1500],xRows([i i]),'m')
axis([500 900 100 925])
dp4.Visible = 'off';
title('Final Enhanced LC/MS Data Set with Extracted Spectra Marked')


[1] Desiere, F. et al., "The Peptide Atlas Project", Nucleic Acids Research, 34:D655-8, 2006.

[2] Purvine, S., Kolker, N., and Kolker, E., "Spectral Quality Assessment for High-Throughput Tandem Mass Spectrometry Proteomics", OMICS: A Journal of Integrative Biology, 8(3):255-65, 2004.

[3] Kazmi, A.S., et al., "Alignment of high resolution mass spectra: Development of a heuristic approach for metabolomics", Metabolomics, 2(2):75-83, 2006.

[4] Jeffries, N., "Algorithms for alignment of mass spectrometry proteomic data", Bioinformatics, 21(14):3066-3073, 2005.

[5] Yu, W., et al., "Multiple peak alignment in sequential data analysis: A scale-space based approach", IEEE®/ACM Trans. Computational Biology and Bioinformatics, 3(3):208-219, 2006.

[6] Sakoe, H. and Chiba s., "Dynamic programming algorithm optimization for spoken word recognition", IEEE Trans. Acoustics, Speech and Signal Processing, ASSP-26(1):43-9, 1978.