主要内容

Exploring Genome-Wide Differences in DNA Methylation Profiles

This example shows how to perform a genome-wide analysis of DNA methylation in the human by using genome sequencing.

Note: For enhanced performance, MathWorks recommends that you run this example on a 64-bit platform, because the memory footprint is close to 2 GB. On a 32-bit platform, if you receive "Out of memory" errors when running this example, try increasing the virtual memory (or swap space) of your operating system or try setting the 3GB switch (32-bit Windows® XP only). For details, see Resolve “Out of Memory” Errors.

Introduction

DNA methylation is an epigenetic modification that modulates gene expression and the maintenance of genomic organization in normal and disease processes. DNA methylation can define different states of the cell, and it is inheritable during cell replication. Aberrant DNA methylation patterns have been associated with cancer and tumor suppressor genes.

In this example you will explore the DNA methylation profiles of two human cancer cells: parental HCT116 colon cancer cells and DICERex5 cells. DICERex5 cells are derived from HCT116 cells after the truncation of the DICER1 alleles. Serre et al. in [1] proposed to study DNA methylation profiles by using the MBD2 protein as a methyl CpG binding domain and subsequently used high-throughput sequencing (HTseq). This technique is commonly known as MBD-Seq. Short reads for two replicates of the two samples have been submitted to NCBI's SRA archive by the authors of [1]. There are other technologies available to interrogate DNA methylation status of CpG sites in combination with HTseq, for example MeDIP-seq or the use of restriction enzymes. You can also analyze this type of data sets following the approach presented in this example.

Data Sets

You can obtain the unmapped single-end reads for four sequencing experiments from NCBI. Short reads were produced using Illumina®'s Genome Analyzer II. Average insert size is 120 bp, and the length of short reads is 36 bp.

This example assumes that you:

(1) downloaded the files SRR030222.sra, SRR030223.sra, SRR030224.sra and SRR030225.sra containing the unmapped short reads for two replicates of from the DICERex5 sample and two replicates from the HCT116 sample respectively, from NCBI SRA Run Selector and converted them to FASTQ-formatted files using the NCBI SRA Toolkit.

(2) produced SAM-formatted files by mapping the short reads to the reference human genome (NCBI Build 37.5) using the Bowtie [2] algorithm. Only uniquely mapped reads are reported.

(3) compressed the SAM formatted files to BAM and ordered them by reference name first, then by genomic position by using SAMtools [3].

This example also assumes that you downloaded the reference human genome (GRCh37.p5). You can use the bowtie-inspect command to reconstruct the human reference directly from the bowtie indices. Or you may download the reference from the NCBI repository by uncommenting the following line:

% getgenbank('NC_000009','FileFormat','fasta','tofile','hsch9.fasta');

Creating a MATLAB® Interface to the BAM-Formatted Files

To explore the signal coverage of the HCT116 samples you need to construct a BioMap. BioMap has an interface that provides direct access to the mapped short reads stored in the BAM-formatted file, thus minimizing the amount of data that is actually loaded into memory. Use the function baminfo to obtain a list of the existing references and the actual number of short reads mapped to each one.

info = baminfo('SRR030224.bam','ScanDictionary',true);
fprintf('%-35s%s\n','Reference','Number of Reads');
for i = 1:numel(info.ScannedDictionary)
    fprintf('%-35s%d\n',info.ScannedDictionary{i},...
        info.ScannedDictionaryCount(i));
end
Reference                          Number of Reads
gi|224589800|ref|NC_000001.10|     205065
gi|224589811|ref|NC_000002.11|     187019
gi|224589815|ref|NC_000003.11|     73986
gi|224589816|ref|NC_000004.11|     84033
gi|224589817|ref|NC_000005.9|      96898
gi|224589818|ref|NC_000006.11|     87990
gi|224589819|ref|NC_000007.13|     120816
gi|224589820|ref|NC_000008.10|     111229
gi|224589821|ref|NC_000009.11|     106189
gi|224589801|ref|NC_000010.10|     112279
gi|224589802|ref|NC_000011.9|      104466
gi|224589803|ref|NC_000012.11|     87091
gi|224589804|ref|NC_000013.10|     53638
gi|224589805|ref|NC_000014.8|      64049
gi|224589806|ref|NC_000015.9|      60183
gi|224589807|ref|NC_000016.9|      146868
gi|224589808|ref|NC_000017.10|     195893
gi|224589809|ref|NC_000018.9|      60344
gi|224589810|ref|NC_000019.9|      166420
gi|224589812|ref|NC_000020.10|     148950
gi|224589813|ref|NC_000021.8|      310048
gi|224589814|ref|NC_000022.10|     76037
gi|224589822|ref|NC_000023.10|     32421
gi|224589823|ref|NC_000024.9|      18870
gi|17981852|ref|NC_001807.4|       1015
Unmapped                           6805842

In this example you will focus on the analysis of chromosome 9. Create a BioMap for the two HCT116 sample replicates.

bm_hct116_1 = BioMap('SRR030224.bam','SelectRef','gi|224589821|ref|NC_000009.11|')
bm_hct116_2 = BioMap('SRR030225.bam','SelectRef','gi|224589821|ref|NC_000009.11|')
bm_hct116_1 = 

  BioMap with properties:

    SequenceDictionary: 'gi|224589821|ref|NC_000009.11|'
             Reference: [106189x1 File indexed property]
             Signature: [106189x1 File indexed property]
                 Start: [106189x1 File indexed property]
        MappingQuality: [106189x1 File indexed property]
                  Flag: [106189x1 File indexed property]
          MatePosition: [106189x1 File indexed property]
               Quality: [106189x1 File indexed property]
              Sequence: [106189x1 File indexed property]
                Header: [106189x1 File indexed property]
                 NSeqs: 106189
                  Name: ''



bm_hct116_2 = 

  BioMap with properties:

    SequenceDictionary: 'gi|224589821|ref|NC_000009.11|'
             Reference: [107586x1 File indexed property]
             Signature: [107586x1 File indexed property]
                 Start: [107586x1 File indexed property]
        MappingQuality: [107586x1 File indexed property]
                  Flag: [107586x1 File indexed property]
          MatePosition: [107586x1 File indexed property]
               Quality: [107586x1 File indexed property]
              Sequence: [107586x1 File indexed property]
                Header: [107586x1 File indexed property]
                 NSeqs: 107586
                  Name: ''


Using a binning algorithm provided by the getBaseCoverage method, you can plot the coverage of both replicates for an initial inspection. For reference, you can also add the ideogram for the human chromosome 9 to the plot using the chromosomeplot function.

figure
ha = gca;
hold on
n = 141213431;               % length of chromosome 9
[cov,bin] = getBaseCoverage(bm_hct116_1,1,n,'binWidth',100);
h1 = plot(bin,cov,'b');      % plots the binned coverage of bm_hct116_1
[cov,bin] = getBaseCoverage(bm_hct116_2,1,n,'binWidth',100);
h2 = plot(bin,cov,'g');      % plots the binned coverage of bm_hct116_2
chromosomeplot('hs_cytoBand.txt', 9, 'AddToPlot', ha) % plots an ideogram along the x-axis
axis(ha,[1 n 0 100])         % zooms-in the y-axis
fixGenomicPositionLabels(ha) % formats tick labels and adds datacursors
legend([h1 h2],'HCT116-1','HCT116-2','Location','NorthEast')
ylabel('Coverage')
title('Coverage for two replicates of the HCT116 sample')
fig = gcf;
fig.Position = max(fig.Position,[0 0 900 0]); % resize window

Because short reads represent the methylated regions of the DNA, there is a correlation between aligned coverage and DNA methylation. Observe the increased DNA methylation close to the chromosome telomeres; it is known that there is an association between DNA methylation and the role of telomeres for maintaining the integrity of the chromosomes. In the coverage plot you can also see a long gap over the chromosome centromere. This is due to the repetitive sequences present in the centromere, which prevent us from aligning short reads to a unique position in this region. In the data sets used in this example only about 30% of the short reads were uniquely mapped to the reference genome.

Correlating CpG Islands and DNA Methylation

DNA methylation normally occurs in CpG dinucleotides. Alteration of the DNA methylation patterns can lead to transcriptional silencing, especially in the gene promoter CpG islands. But, it is also known that DNA methylation can block CTCF binding and can silence miRNA transcription among other relevant functions. In general, it is expected that mapped reads should preferably align to CpG rich regions.

Load the human chromosome 9 from the reference file hs37.fasta. For this example, it is assumed that you recovered the reference from the Bowtie indices using the bowtie-inspect command; therefore hs37.fasta contains all the human chromosomes. To load only the chromosome 9 you can use the option nave-value pair BLOCKREAD with the fastaread function.

chr9 = fastaread('hs37.fasta','blockread',9);
chr9.Header
ans =

    'gi|224589821|ref|NC_000009.11| Homo sapiens chromosome 9, GRCh37 primary reference assembly'

Use the cpgisland function to find the CpG clusters. Using the standard definition for CpG islands [4], 200 or more bp islands with 60% or greater CpGobserved/CpGexpected ratio, leads to 1682 GpG islands found in chromosome 9.

cpgi = cpgisland(chr9.Sequence)
cpgi = 

  struct with fields:

    Starts: [10783 29188 73049 73686 113309 114488 116877 … ] (1×1682 double)
     Stops: [11319 29409 73624 73893 114336 114809 117105 … ] (1×1682 double)

Use the getCounts method to calculate the ratio of aligned bases that are inside CpG islands. For the first replicate of the sample HCT116, the ratio is close to 45%.

aligned_bases_in_CpG_islands = getCounts(bm_hct116_1,cpgi.Starts,cpgi.Stops,'method','sum')
aligned_bases_total = getCounts(bm_hct116_1,1,n,'method','sum')
ratio = aligned_bases_in_CpG_islands ./ aligned_bases_total
aligned_bases_in_CpG_islands =

     1724363


aligned_bases_total =

     3822804


ratio =

    0.4511

You can explore high resolution coverage plots of the two sample replicates and observe how the signal correlates with the CpG islands. For example, explore the region between 23,820,000 and 23,830,000 bp. This is the 5' region of the human gene ELAVL2.

r1 = 23820001; % set the region limits
r2 = 23830000;
fhELAVL2 = figure; % keep the figure handle to use it later
hold on
% plot high-resolution coverage of bm_hct116_1
h1 = plot(r1:r2,getBaseCoverage(bm_hct116_1,r1,r2,'binWidth',1),'b');
% plot high-resolution coverage of bm_hct116_2
h2 = plot(r1:r2,getBaseCoverage(bm_hct116_2,r1,r2,'binWidth',1),'g');

% mark the CpG islands within the [r1 r2] region
for i = 1:numel(cpgi.Starts)
    if cpgi.Starts(i)>r1 && cpgi.Stops(i)<r2 % is CpG island inside [r1 r2]?
        px = [cpgi.Starts([i i]) cpgi.Stops([i i])]; % x-coordinates for patch
        py = [0 max(ylim) max(ylim) 0];              % y-coordinates for patch
        hp = patch(px,py,'r','FaceAlpha',.1,'EdgeColor','r','Tag','cpgi');
    end
end

axis([r1 r2 0 20])            % zooms-in the y-axis
fixGenomicPositionLabels(gca) % formats tick labels and adds datacursors
legend([h1 h2 hp],'HCT116-1','HCT116-2','CpG Islands')
ylabel('Coverage')
xlabel('Chromosome 9 position')
title('Coverage for two replicates of the HCT116 sample')

Statistical Modelling of Count Data

To find regions that contain more mapped reads than would be expected by chance, you can follow a similar approach to the one described by Serre et al. [1]. The number of counts for non-overlapping contiguous 100 bp windows is statistically modeled.

First, use the getCounts method to count the number of mapped reads that start at each window. In this example you use a binning approach that considers only the start position of every mapped read, following the approach of Serre et al. However, you may also use the OVERLAP and METHOD name-value pairs in getCounts to compute more accurate statistics. For instance, to obtain the maximum coverage for each window considering base pair resolution, set OVERLAP to 1 and METHOD to MAX.

n = numel(chr9.Sequence); % length of chromosome
w = 1:100:n; % windows of 100 bp

counts_1 = getCounts(bm_hct116_1,w,w+99,'independent',true,'overlap','start');
counts_2 = getCounts(bm_hct116_2,w,w+99,'independent',true,'overlap','start');

First, try to model the counts assuming that all the windows with counts are biologically significant and therefore from the same distribution. Use the negative bionomial distribution to fit a model the count data.

nbp = nbinfit(counts_1);

Plot the fitted model over a histogram of the empirical data.

figure
hold on
edges = 0:100;
emphist = histcounts(counts_1,edges); % calculate the empirical distribution
binCenters = (edges(1:end-1) + edges(2:end)) / 2;
bar(binCenters,emphist./sum(emphist),'c','grouped') % plot histogram
plot(0:100,nbinpdf(0:100,nbp(1),nbp(2)),'b','linewidth',2); % plot fitted model
axis([0 50 0 .001])
legend('Empirical Distribution','Negative Binomial Fit')
ylabel('Frequency')
xlabel('Counts')
title('Frequency of counts, 100bp windows (HCT116-1)')

The poor fitting indicates that the observed distribution may be due to the mixture of two models, one that represents the background and one that represents the count data in methylated DNA windows.

A more realistic scenario would be to assume that windows with a small number of mapped reads are mainly the background (or null model). Serre et al. assumed that 100-bp windows containing four or more reads are unlikely to be generated by chance. To estimate a good approximation to the null model, you can fit the left body of the empirical distribution to a truncated negative binomial distribution. To fit a truncated distribution use the mle function. First you need to define an anonymous function that defines the right-truncated version of nbinpdf.

rtnbinpdf = @(x,p1,p2,t) nbinpdf(x,p1,p2) ./ nbincdf(t-1,p1,p2);

Define the fitting function using another anonymous function.

rtnbinfit = @(x2,t) mle(x2,'pdf',@(x3,p1,p2) rtnbinpdf(x3,p1,p2,t),'start',nbinfit(x2),'lower',[0 0]);

Before fitting the real data, let us assess the fitting procedure with some sampled data from a known distribution.

nbp = [0.5 0.2];              % Known coefficients
x = nbinrnd(nbp(1),nbp(2),10000,1); % Random sample
trun = 6;                     % Set a truncation threshold

nbphat1 = nbinfit(x);         % Fit non-truncated model to all data
nbphat2 = nbinfit(x(x<trun)); % Fit non-truncated model to truncated data (wrong)
nbphat3 = rtnbinfit(x(x<trun),trun); % Fit truncated model to truncated data

figure
hold on
edges = 0:100;
emphist = histcounts(x,edges);     % Calculate the empirical distribution
binCenters = (edges(1:end-1) + edges(2:end)) / 2;
bar(binCenters,emphist./sum(emphist),'c','grouped') % plot histogram
h1 = plot(0:100,nbinpdf(0:100,nbphat1(1),nbphat1(2)),'b-o','linewidth',2);
h2 = plot(0:100,nbinpdf(0:100,nbphat2(1),nbphat2(2)),'r','linewidth',2);
h3 = plot(0:100,nbinpdf(0:100,nbphat3(1),nbphat3(2)),'g','linewidth',2);
axis([0 25 0 .2])
legend([h1 h2 h3],'Neg-binomial fitted to all data',...
    'Neg-binomial fitted to truncated data',...
    'Truncated neg-binomial fitted to truncated data')
ylabel('Frequency')
xlabel('Counts')

Identifying Significant Methylated Regions

For the two replicates of the HCT116 sample, fit a right-truncated negative binomial distribution to the observed null model using the rtnbinfit anonymous function previously defined.

trun = 4;  % Set a truncation threshold (as in [1])
pn1 = rtnbinfit(counts_1(counts_1<trun),trun); % Fit to HCT116-1 counts
pn2 = rtnbinfit(counts_2(counts_2<trun),trun); % Fit to HCT116-2 counts

Calculate the p-value for each window to the null distribution.

pval1 = 1 - nbincdf(counts_1,pn1(1),pn1(2));
pval2 = 1 - nbincdf(counts_2,pn2(1),pn2(2));

Calculate the false discovery rate using the mafdr function. Use the name-value pair BHFDR to use the linear-step up (LSU) procedure ([6]) to calculate the FDR adjusted p-values. Setting the FDR < 0.01 permits you to identify the 100-bp windows that are significantly methylated.

fdr1 = mafdr(pval1,'bhfdr',true);
fdr2 = mafdr(pval2,'bhfdr',true);

w1 = fdr1<.01; % logical vector indicating significant windows in HCT116-1
w2 = fdr2<.01; % logical vector indicating significant windows in HCT116-2
w12 = w1 & w2; % logical vector indicating significant windows in both replicates

Number_of_sig_windows_HCT116_1 = sum(w1)
Number_of_sig_windows_HCT116_2 = sum(w2)
Number_of_sig_windows_HCT116 = sum(w12)
Number_of_sig_windows_HCT116_1 =

        1662


Number_of_sig_windows_HCT116_2 =

        1674


Number_of_sig_windows_HCT116 =

        1346

Overall, you identified 1662 and 1674 non-overlapping 100-bp windows in the two replicates of the HCT116 samples, which indicates there is significant evidence of DNA methylation. There are 1346 windows that are significant in both replicates.

For example, looking again in the promoter region of the ELAVL2 human gene you can observe that in both sample replicates, multiple 100-bp windows have been marked significant.

figure(fhELAVL2) % bring back to focus the previously plotted figure
plot(w(w1)+50,counts_1(w1),'bs', 'HandleVisibility','off') % plot significant windows in HCT116-1
plot(w(w2)+50,counts_2(w2),'gs', 'HandleVisibility','off') % plot significant windows in HCT116-2
axis([r1 r2 0 100])
title('Significant 100-bp windows in both replicates of the HCT116 sample')

Finding Genes With Significant Methylated Promoter Regions

DNA methylation is involved in the modulation of gene expression. For instance, it is well known that hypermethylation is associated with the inactivation of several tumor suppressor genes. You can study in this data set the methylation of gene promoter regions.

First, download from Ensembl a tab-separated-value (TSV) table with all protein encoding genes to a text file, ensemblmart_genes_hum37.txt. For this example, we are using Ensembl release 64. Using Ensembl's BioMart service, you can select a table with the following attributes: chromosome name, gene biotype, gene name, gene start/end, and strand direction.

Use the provided helper function ensemblmart2gff to convert the downloaded TSV file to a GFF formatted file. Then use GFFAnnotation to load the file into MATLAB and create a subset with the genes present in chromosome 9 only. This results 800 annotated protein-coding genes in the Ensembl database.

GFFfilename = ensemblmart2gff('ensemblmart_genes_hum37.txt');
a = GFFAnnotation(GFFfilename)
a9 = getSubset(a,'reference','9')
numGenes = a9.NumEntries
a = 

  GFFAnnotation with properties:

    FieldNames: {1×9 cell}
    NumEntries: 21184


a9 = 

  GFFAnnotation with properties:

    FieldNames: {1×9 cell}
    NumEntries: 800


numGenes =

   800

Find the promoter regions for each gene. In this example we consider the proximal promoter as the -500/100 upstream region.

downstream = 500;
upstream   = 100;

geneDir = strcmp(a9.Strand,'+');  % logical vector indicating strands in the forward direction

% calculate promoter's start position for genes in the forward direction
promoterStart(geneDir) = a9.Start(geneDir) - downstream;
% calculate promoter's end position for genes in the forward direction
promoterStop(geneDir) = a9.Start(geneDir) + upstream;
% calculate promoter's start position for genes in the reverse direction
promoterStart(~geneDir) = a9.Stop(~geneDir) - upstream;
% calculate promoter's end position for genes in the reverse direction
promoterStop(~geneDir) = a9.Stop(~geneDir) + downstream;

Use a dataset as a container for the promoter information, as we can later add new columns to store gene counts and p-values.

promoters = dataset({a9.Feature,'Gene'});
promoters.Strand = char(a9.Strand);
promoters.Start = promoterStart';
promoters.Stop = promoterStop';

Find genes with significant DNA methylation in the promoter region by looking at the number of mapped short reads that overlap at least one base pair in the defined promoter region.

promoters.Counts_1 = getCounts(bm_hct116_1,promoters.Start,promoters.Stop,...
    'overlap',1,'independent',true);
promoters.Counts_2 = getCounts(bm_hct116_2,promoters.Start,promoters.Stop,...
    'overlap',1,'independent',true);

Fit a null distribution for each sample replicate and compute the p-values:

trun = 5;  % Set a truncation threshold
pn1 = rtnbinfit(promoters.Counts_1(promoters.Counts_1<trun),trun); % Fit to HCT116-1 promoter counts
pn2 = rtnbinfit(promoters.Counts_2(promoters.Counts_2<trun),trun); % Fit to HCT116-2 promoter counts
promoters.pval_1 = 1 - nbincdf(promoters.Counts_1,pn1(1),pn1(2)); % p-value for every promoter in HCT116-1
promoters.pval_2 = 1 - nbincdf(promoters.Counts_2,pn2(1),pn2(2)); % p-value for every promoter in HCT116-2

Number_of_sig_promoters =  sum(promoters.pval_1<.01 & promoters.pval_2<.01)

Ratio_of_sig_methylated_promoters = Number_of_sig_promoters./numGenes
Number_of_sig_promoters =

    74


Ratio_of_sig_methylated_promoters =

    0.0925

Observe that only 74 (out of 800) genes in chromosome 9 have significantly DNA methylated regions (pval<0.01 in both replicates). Display a report of the 30 genes with the most significant methylated promoter regions.

[~,order] = sort(promoters.pval_1.*promoters.pval_2);
promoters(order(1:30),[1 2 3 4 5 7 6 8])
ans = 

    Gene                   Strand    Start        Stop         Counts_1
    {'DMRT3'      }        +            976464       977064    223     
    {'CNTFR'      }        -          34590021     34590621    219     
    {'GABBR2'     }        -         101471379    101471979    404     
    {'CACNA1B'    }        +         140771741    140772341    454     
    {'BARX1'      }        -          96717554     96718154    264     
    {'FAM78A'     }        -         134151834    134152434    497     
    {'FOXB2'      }        +          79634071     79634671    163     
    {'TLE4'       }        +          82186188     82186788    157     
    {'ASTN2'      }        -         120177248    120177848    141     
    {'FOXE1'      }        +         100615036    100615636    149     
    {'MPDZ'       }        -          13279489     13280089    129     
    {'PTPRD'      }        -          10612623     10613223    145     
    {'PALM2-AKAP2'}        +         112542089    112542689    134     
    {'FAM69B'     }        +         139606522    139607122    112     
    {'WNK2'       }        +          95946698     95947298    108     
    {'IGFBPL1'    }        -          38424344     38424944    110     
    {'AKAP2'      }        +         112542269    112542869    107     
    {'C9orf4'     }        -         111929471    111930071    102     
    {'COL5A1'     }        +         137533120    137533720     84     
    {'LHX3'       }        -         139096855    139097455     74     
    {'OLFM1'      }        +         137966768    137967368     75     
    {'NPR2'       }        +          35791651     35792251     68     
    {'DBC1'       }        -         122131645    122132245     61     
    {'SOHLH1'     }        -         138591274    138591874     56     
    {'PIP5K1B'    }        +          71320075     71320675     59     
    {'PRDM12'     }        +         133539481    133540081     53     
    {'ELAVL2'     }        -          23826235     23826835     50     
    {'ZFP37'      }        -         115818939    115819539     59     
    {'RP11-35N6.1'}        +         103790491    103791091     60     
    {'DMRT2'      }        +           1049854      1050454     54     


    pval_1        Counts_2    pval_2    
    6.6613e-16    253         5.5511e-16
    6.6613e-16    226         5.5511e-16
    6.6613e-16    400         5.5511e-16
    6.6613e-16    408         5.5511e-16
    6.6613e-16    286         5.5511e-16
    6.6613e-16    499         5.5511e-16
       1.4e-13    165         6.0352e-13
    3.5649e-13    151         4.7347e-12
    4.3566e-12    163         8.0969e-13
    1.2447e-12    133         6.7598e-11
    2.8679e-11    148         7.3682e-12
    2.3279e-12    127         1.6448e-10
    1.3068e-11    135         5.0276e-11
    4.1911e-10    144         1.3295e-11
     7.897e-10    125         2.2131e-10
    5.7523e-10    114         1.1364e-09
    9.2538e-10    106         3.7513e-09
    2.0467e-09     96         1.6795e-08
    3.6266e-08     97         1.4452e-08
    1.8171e-07     91         3.5644e-08
    1.5457e-07     69         1.0074e-06
    4.8093e-07     73         5.4629e-07
    1.5082e-06     62         2.9575e-06
    3.4322e-06     67         1.3692e-06
    2.0943e-06     63         2.5345e-06
    5.6364e-06     61         3.4518e-06
    9.2778e-06     62         2.9575e-06
    2.0943e-06     47         3.0746e-05
    1.7771e-06     42         6.8037e-05
    4.7762e-06     46         3.6016e-05

Finding Intergenic Regions that are Significantly Methylated

Serre et al. [1] reported that, in these data sets, approximately 90% of the uniquely mapped reads fall outside the 5' gene promoter regions. Using a similar approach as before, you can find genes that have intergenic methylated regions. To compensate for the varying lengths of the genes, you can use the maximum coverage, computed base-by-base, instead of the raw number of mapped short reads. Another alternative approach to normalize the counts by the gene length is to set the METHOD name-value pair to rpkm in the getCounts function.

intergenic = dataset({a9.Feature,'Gene'});
intergenic.Strand = char(a9.Strand);
intergenic.Start = a9.Start;
intergenic.Stop = a9.Stop;

intergenic.Counts_1 = getCounts(bm_hct116_1,intergenic.Start,intergenic.Stop,...
    'overlap','full','method','max','independent',true);
intergenic.Counts_2 = getCounts(bm_hct116_2,intergenic.Start,intergenic.Stop,...
    'overlap','full','method','max','independent',true);
trun = 10; % Set a truncation threshold
pn1 = rtnbinfit(intergenic.Counts_1(intergenic.Counts_1<trun),trun); % Fit to HCT116-1 intergenic counts
pn2 = rtnbinfit(intergenic.Counts_2(intergenic.Counts_2<trun),trun); % Fit to HCT116-2 intergenic counts
intergenic.pval_1 = 1 - nbincdf(intergenic.Counts_1,pn1(1),pn1(2)); % p-value for every intergenic region in HCT116-1
intergenic.pval_2 = 1 - nbincdf(intergenic.Counts_2,pn2(1),pn2(2)); % p-value for every intergenic region in HCT116-2

Number_of_sig_genes =  sum(intergenic.pval_1<.01 & intergenic.pval_2<.01)

Ratio_of_sig_methylated_genes = Number_of_sig_genes./numGenes

[~,order] = sort(intergenic.pval_1.*intergenic.pval_2);

intergenic(order(1:30),[1 2 3 4 5 7 6 8])
Number_of_sig_genes =

    62


Ratio_of_sig_methylated_genes =

    0.0775


ans = 

    Gene                  Strand    Start        Stop         Counts_1
    {'AL772363.1'}        -         140762377    140787022    106     
    {'CACNA1B'   }        +         140772241    141019076    106     
    {'SUSD1'     }        -         114803065    114937688     88     
    {'C9orf172'  }        +         139738867    139741797     99     
    {'NR5A1'     }        -         127243516    127269709     86     
    {'BARX1'     }        -          96713628     96717654     77     
    {'KCNT1'     }        +         138594031    138684992     58     
    {'GABBR2'    }        -         101050391    101471479     65     
    {'FOXB2'     }        +          79634571     79635869     51     
    {'NDOR1'     }        +         140100119    140113813     54     
    {'KIAA1045'  }        +          34957484     34984679     50     
    {'ADAMTSL2'  }        +         136397286    136440641     55     
    {'PAX5'      }        -          36833272     37034476     48     
    {'OLFM1'     }        +         137967268    138013025     55     
    {'PBX3'      }        +         128508551    128729656     45     
    {'FOXE1'     }        +         100615536    100618986     49     
    {'MPDZ'      }        -          13105703     13279589     51     
    {'ASTN2'     }        -         119187504    120177348     43     
    {'ARRDC1'    }        +         140500106    140509812     49     
    {'IGFBPL1'   }        -          38408991     38424444     45     
    {'LHX3'      }        -         139088096    139096955     44     
    {'PAPPA'     }        +         118916083    119164601     44     
    {'CNTFR'     }        -          34551430     34590121     41     
    {'DMRT3'     }        +            976964       991731     40     
    {'TUSC1'     }        -          25676396     25678856     46     
    {'ELAVL2'    }        -          23690102     23826335     35     
    {'SMARCA2'   }        +           2015342      2193624     36     
    {'GAS1'      }        -          89559279     89562104     34     
    {'GRIN1'     }        +         140032842    140063207     36     
    {'TLE4'      }        +          82186688     82341658     36     


    pval_1        Counts_2    pval_2    
    8.6597e-15     98         1.8763e-14
    8.6597e-15     98         1.8763e-14
    2.2904e-12    112         7.7716e-16
    7.4718e-14     96         3.5749e-14
     4.268e-12     90         2.5457e-13
    7.0112e-11     62          2.569e-09
    2.5424e-08     73         6.9019e-11
    2.9078e-09     58         9.5469e-09
    2.2131e-07     58         9.5469e-09
    8.7601e-08     55         2.5525e-08
    3.0134e-07     55         2.5525e-08
    6.4307e-08     45         6.7163e-07
     5.585e-07     49         1.8188e-07
    6.4307e-08     42         1.7861e-06
    1.4079e-06     51         9.4566e-08
    4.1027e-07     46         4.8461e-07
    2.2131e-07     42         1.7861e-06
    2.6058e-06     43         1.2894e-06
    4.1027e-07     36         1.2564e-05
    1.4079e-06     39         4.7417e-06
    1.9155e-06     36         1.2564e-05
    1.9155e-06     35         1.7377e-05
    4.8199e-06     37         9.0815e-06
    6.5537e-06     37         9.0815e-06
    1.0346e-06     31         6.3417e-05
    3.0371e-05     41         2.4736e-06
    2.2358e-05     40         3.4251e-06
    4.1245e-05     41         2.4736e-06
    2.2358e-05     38         6.5629e-06
    2.2358e-05     37         9.0815e-06

For instance, explore the methylation profile of the BARX1 gene, the sixth significant gene with intergenic methylation in the previous list. The GTF formatted file ensemblmart_barx1.gtf contains structural information for this gene obtained from Ensembl using the BioMart service.

Use GTFAnnotation to load the structural information into MATLAB. There are two annotated transcripts for this gene.

barx1 = GTFAnnotation('ensemblmart_barx1.gtf')
transcripts = getTranscriptNames(barx1)
barx1 = 

  GTFAnnotation with properties:

    FieldNames: {1×11 cell}
    NumEntries: 18


transcripts =

  2×1 cell array

    {'ENST00000253968'}
    {'ENST00000401724'}

Plot the DNA methylation profile for both HCT116 sample replicates with base-pair resolution. Overlay the CpG islands and plot the exons for each of the two transcripts along the bottom of the plot.

range = barx1.getRange;
r1 = range(1)-1000; % set the region limits
r2 = range(2)+1000;
figure
hold on
% plot high-resolution coverage of bm_hct116_1
h1 = plot(r1:r2,getBaseCoverage(bm_hct116_1,r1,r2,'binWidth',1),'b');
% plot high-resolution coverage of bm_hct116_2
h2 = plot(r1:r2,getBaseCoverage(bm_hct116_2,r1,r2,'binWidth',1),'g');

% mark the CpG islands within the [r1 r2] region
for i = 1:numel(cpgi.Starts)
    if cpgi.Starts(i)>r1 && cpgi.Stops(i)<r2 % is CpG island inside [r1 r2]?
        px = [cpgi.Starts([i i]) cpgi.Stops([i i])];  % x-coordinates for patch
        py = [0 max(ylim) max(ylim) 0];               % y-coordinates for patch
        hp = patch(px,py,'r','FaceAlpha',.1,'EdgeColor','r','Tag','cpgi');
    end
end

% mark the exons at the bottom of the axes
for i = 1:numel(transcripts)
    exons = getSubset(barx1,'Transcript',transcripts{i},'Feature','exon');
    for j = 1:exons.NumEntries
        px = [exons.Start([j j]);exons.Stop([j j])]'; % x-coordinates for patch
        py = [0 1 1 0]-i*2-1;                         % y-coordinates for patch
        hq = patch(px,py,'b','FaceAlpha',.1,'EdgeColor','b','Tag','exon');
    end
end

axis([r1 r2 -numel(transcripts)*2-2 80])  % zooms-in the y-axis
fixGenomicPositionLabels(gca) % formats tick labels and adds datacursors
ylabel('Coverage')
xlabel('Chromosome 9 position')
title('High resolution coverage in the BARX1 gene')
legend([h1 h2 hp hq],'HCT116-1','HCT116-2','CpG Islands','Exons','Location','NorthWest')

Observe the highly methylated region in the 5' promoter region (right-most CpG island). Recall that for this gene transcription occurs in the reverse strand. More interesting, observe the highly methylated regions that overlap the initiation of each of the two annotated transcripts (two middle CpG islands).

Differential Analysis of Methylation Patterns

In the study by Serre et al. another cell line is also analyzed. New cells (DICERex5) are derived from the same HCT116 colon cancer cells after truncating the DICER1 alleles. It has been reported in literature [5] that there is a localized change of DNA methylation at small number of gene promoters. In this example, you will find significant 100-bp windows in two sample replicates of the DICERex5 cells following the same approach as the parental HCT116 cells, and then you will search statistically significant differences between the two cell lines.

The helper function getWindowCounts captures the similar steps to find windows with significant coverage as before. getWindowCounts returns vectors with counts, p-values, and false discovery rates for each new replicate.

bm_dicer_1 = BioMap('SRR030222.bam','SelectRef','gi|224589821|ref|NC_000009.11|');
bm_dicer_2 = BioMap('SRR030223.bam','SelectRef','gi|224589821|ref|NC_000009.11|');
[counts_3,pval3,fdr3] = getWindowCounts(bm_dicer_1,4,w,100);
[counts_4,pval4,fdr4] = getWindowCounts(bm_dicer_2,4,w,100);
w3 = fdr3<.01; % logical vector indicating significant windows in DICERex5_1
w4 = fdr4<.01; % logical vector indicating significant windows in DICERex5-2
w34 = w3 & w4; % logical vector indicating significant windows in both replicates
Number_of_sig_windows_DICERex5_1 = sum(w3)
Number_of_sig_windows_DICERex5_2 = sum(w4)
Number_of_sig_windows_DICERex5 = sum(w34)
Number_of_sig_windows_DICERex5_1 =

   908


Number_of_sig_windows_DICERex5_2 =

        1041


Number_of_sig_windows_DICERex5 =

   759

To perform a differential analysis you use the 100-bp windows that are significant in at least one of the samples (either HCT116 or DICERex5).

wd = w34 | w12; % logical vector indicating windows included in the diff. analysis

counts = [counts_1(wd) counts_2(wd) counts_3(wd) counts_4(wd)];
ws = w(wd); % window start for each row in counts

Use the function manorm to normalize the data. The PERCENTILE name-value pair lets you filter out windows with very large number of counts while normalizing, since these windows are mainly due to artifacts, such as repetitive regions in the reference chromosome.

counts_norm = round(manorm(counts,'percentile',90).*100);

Use the function mattest to perform a two-sample t-test to identify differentially covered windows from the two different cell lines.

pval = mattest(counts_norm(:,[1 2]),counts_norm(:,[3 4]),'bootstrap',true,...
    'showhist',true,'showplot',true);

Create a report with the 25 most significant differentially covered windows. While creating the report use the helper function findClosestGene to determine if the window is intergenic, intragenic, or if it is in a proximal promoter region.

[~,ord] = sort(pval);
fprintf('Window Pos       Type                  p-value   HCT116     DICERex5\n\n');
for i = 1:25
    j = ord(i);
    [~,msg] = findClosestGene(a9,[ws(j) ws(j)+99]);
    fprintf('%10d  %-25s %7.6f%5d%5d %5d%5d\n',  ...
        ws(j),msg,pval(j),counts_norm(j,:));
end
Window Pos       Type                  p-value   HCT116     DICERex5

 140311701  Intergenic (EXD3)         0.000020   13   13   104  105
 139546501  Intragenic                0.001525   21   21    91   93
     10901  Intragenic                0.002222  258  257   434  428
 120176801  Intergenic (ASTN2)        0.002270  266  270   155  155
 139914801  Intergenic (ABCA2)        0.002482   64   63    26   25
 126128501  Intergenic (CRB2)         0.002664   94   93   129  130
  71939501  Prox. Promoter (FAM189A2) 0.004687  107  101     0    0
 124461001  Intergenic (DAB2IP)       0.004747   77   76    39   37
 140086501  Intergenic (TPRN)         0.005489   47   42   123  124
  79637201  Intragenic                0.006323   52   51    32   31
 136470801  Intragenic                0.006323   52   51    32   31
 140918001  Intergenic (CACNA1B)      0.006786  176  169    71   68
 100615901  Intergenic (FOXE1)        0.006969  262  253   123  118
  98221901  Intergenic (PTCH1)        0.008291   26   30   104   99
 138730601  Intergenic (CAMSAP1)      0.008602   26   21    97   93
  89561701  Intergenic (GAS1)         0.008653   77   76     6   12
    977401  Intergenic (DMRT3)        0.008678  236  245   129  124
  37002601  Intergenic (PAX5)         0.008822  133  127   207  211
 139744401  Intergenic (PHPT1)        0.009078   47   46    32   31
 126771301  Intragenic                0.009547   43   46    97   93
  93922501  Intragenic                0.009565   34   34   149  161
  94187101  Intragenic                0.009579   73   80     6    6
 136044401  Intragenic                0.009623   39   34   110  105
 139611201  Intergenic (FAM69B)       0.009623   39   34   110  105
 139716201  Intergenic (C9orf86)      0.009860   73   72   136  130

Plot the DNA methylation profile for the promoter region of gene FAM189A2, the most significant differentially covered promoter region from the previous list. Overlay the CpG islands and the FAM189A2 gene.

range = getRange(getSubset(a9,'Feature','FAM189A2'));
r1 = range(1)-1000;
r2 = range(2)+1000;
figure
hold on

% plot high-resolution coverage of all replicates
h1 = plot(r1:r2,getBaseCoverage(bm_hct116_1,r1,r2,'binWidth',1),'b');
h2 = plot(r1:r2,getBaseCoverage(bm_hct116_2,r1,r2,'binWidth',1),'g');
h3 = plot(r1:r2,getBaseCoverage(bm_dicer_1,r1,r2,'binWidth',1),'r');
h4 = plot(r1:r2,getBaseCoverage(bm_dicer_2,r1,r2,'binWidth',1),'m');

% mark the CpG islands within the [r1 r2] region
for i = 1:numel(cpgi.Starts)
    if cpgi.Starts(i)>r1 && cpgi.Stops(i)<r2 % is CpG island inside [r1 r2]?
        px = [cpgi.Starts([i i]) cpgi.Stops([i i])]; % x-coordinates for patch
        py = [0 max(ylim) max(ylim) 0];              % y-coordinates for patch
        hp = patch(px,py,'r','FaceAlpha',.1,'EdgeColor','r','Tag','cpgi');
    end
end

% mark the gene at the bottom of the axes
px = range([1 1 2 2]);
py = [0 1 1 0]-2;
hq = patch(px,py,'b','FaceAlpha',.1,'EdgeColor','b','Tag','gene');

axis([r1 r1+4000 -4 30]) % zooms-in
fixGenomicPositionLabels(gca) % formats tick labels and adds datacursors
ylabel('Coverage')
xlabel('Chromosome 9 position')
title('DNA Methylation profiles along the promoter region of the FAM189A2 gene')
legend([h1 h2 h3 h4 hp hq],...
    'HCT116-1','HCT116-2','DICERex5-1','DICERex5-2','CpG Islands','FAM189A2 Gene',...
    'Location','NorthEast')

Observe that the CpG islands are clearly unmethylated for both of the DICERex5 replicates.

References

[1] Serre, D., Lee, B.H., and Ting A.H., "MBD-isolated Genome Sequencing provides a high-throughput and comprehensive survey of DNA methylation in the human genome", Nucleic Acids Research, 38(2):391-9, 2010.

[2] Langmead, B., Trapnell, C., Pop, M., and Salzberg, S.L., "Ultrafast and Memory-efficient Alignment of Short DNA Sequences to the Human Genome", Genome Biology, 10(3):R25, 2009.

[3] Li, H., et al., "The Sequence Alignment/map (SAM) Format and SAMtools", Bioinformatics, 25(16):2078-9, 2009.

[4] Gardiner-Garden, M. and Frommer, M., "CpG islands in vertebrate genomes", Journal of Molecular Biology, 196(2):261-82, 1987.

[5] Ting, A.H., et al., "A Requirement for DICER to Maintain Full Promoter CpG Island Hypermethylation in Human Cancer Cells", Cancer Research, 68(8):2570-5, 2008.

[6] Benjamini, Y. and Hochberg, Y., "Controlling the false discovery rate: a practical and powerful approach to multiple testing", Journal of the Royal Statistical Society, 57(1):289-300, 1995.