This workflow demonstrates ChIP-Seq analysis in Escherichia coli by performing peak detection in a treatment sample using MACS. It is a Cuneiform transcription of a hands-on introduction to ChIP-Seq by Morgane Thomas-Chollier which itself is based on a study by Myers et al. 2013.

Introduction

In a ChIP-Seq experiment the DNA binding sites of transcription factors or other DNA-binding proteins are first selected using chromatin immunoprecipitation and then sequenced. By comparing the sequencing coverage in such a treatment sample to the coverage observed in a control sample where no DNA selection has been performed it is possible to find the DNA binding sites of the protein in question.

In this workflow we are given the sequenced reads of a treatment and control sample of an experiment conducted in Escherichia coli. After aligning both samples to the reference sequence we can perform peak calling. In this workflow we use the peak caller MACS.

The peak caller not only identifies peak regions and summits but also outputs coverage information which can be displayed in genome browsers like IGV (see Figure 1). The data produced by this workflow is ready to be used with downstream processing like, e.g., motif discovery with RSAT. This web-based tool not only produces statistics about peak length distrubtion (see Figure 2), single and dinucleotide composition (see Figure 3) as well as their position profiles around the peak (see Figures 4 and 5) but also gathers information about the motifs discovered in the peak regions. Figure 6 shows the forward and reverse sequence logos of the most often encountered motif. Figure 7 depicts the positions of that motif relative to the peak center. Figure 8 shows the probability of encountering a certain number of occurrences of this motif per peak region.

The Cuneiform workflow described in this post consists of four parts. First, the task definitions are given. All tasks used in this workflow are defined in Bash. In the next section we define the input data consisting of two SRA samples and a reference genome. Next, the workflow structure is established by applying the previously defined tasks to the input data. In the last section the query is given which defines what the output of the workflow is.

Setting up a machine with Chef (installing software and downloading data) takes around 90 minutes. Running this example workflow on a machine with 4 cores takes around 10 minutes.

chip-seq_peaks_igv.png

Figure 1: Read alignments and coverage visualization for both treatment and control sample with annotated genes, peak regions and summits viewed in IGV.

chip-seq_peak-motifs_test_seqlen_distrib.png

Figure 2: Peak region length distribution generated with RSAT. For each peak region length the absolute frequencies are given.

chip-seq_peak-motifs_test_heatmap-1str-ovlp_1nt.png chip-seq_peak-motifs_test_heatmap-1str-ovlp_2nt.png

Figure 3: Transition frequencies for both single nucleotides and dinucleotides.

chip-seq_peak-motifs_test_profiles-1str-ovlp_1nt_ci20.png

Figure 4: Single nucleotide composition profiles. Occurrences per position around the peak center for each nucleotide.

chip-seq_peak-motifs_test_profiles-1str-ovlp_2nt_ci20.png

Figure 5: Dinucleotide composition profiles. Occurrences per position around the peak center for each nucleotide 2-gram.

chip-seq_peak-motifs_oligos-2str-noov_6nt_mkv3_pssm_count_matrices_logo_m1.png chip-seq_peak-motifs_oligos-2str-noov_6nt_mkv3_pssm_count_matrices_logo_m1_rc.png

Figure 6: Forward and reverse sequence logos for most frequent motif.

chip-seq_peak-motifs_oligos_6nt_mkv3_m1_site_distrib.png

Figure 7: Distribution of predicted sites. For each position around the peak center the absolute frequencies for the motif are given.

chip-seq_peak-motifs_oligos_6nt_mkv3_m1_sites_per_peak.png

Figure 8: Number of predicted sites per peak.

Task Definitions

In this section we give the task definitions that wrap the command line tools used in this workflow. All tasks are assumed to terminate, be deterministic and be side effect-free.

Utility Functions

gunzip

def gunzip( gz : File ) -> <file : File>
in Bash *{
  file=unzipped_${gz%.gz}
  gzip -c -d $gz > $file
}*

Next-Generation Sequencing Tasks

sra-tools

We use sra-tools to extract the sequence reads in FastQ format from an archive in SRA format. The -Z flag makes fastq-dump output the FastQ sequence on its standard output channel.

def fastq-dump( sra : File ) -> <fastq : File>
in Bash *{
  fastq=$sra.fastq
  fastq-dump -Z $sra > $fastq
}*

FastQC

We use the tool FastQC to generate quality reports in HTML format. These HTML reports are output as zip archives which can be decompressed and viewed in a Browser.

def fastqc( fastq : File ) -> <zip : File>
in Bash *{
  fastqc -f fastq --noextract -o ./ $fastq
  zip=`ls *.zip`
}*

Bowtie

We use Bowtie to perform read mapping. The read mapper first needs to create an index of the reference sequence. We create this index in the task bowtie-build. Note that we do not return the file list constituting the index but instead create a tar archive to hold the index files. This is an easy way to keep Cuneiform from altering the index filenames.

def bowtie-build( fa : File ) -> <idx : File>
in Bash *{
  idx=idx.tar
  bowtie-build $fa idx
  tar cf $idx idx.*
}*

In a second step we use the previously created index to map sequence reads stored in FastQ format against the reference genome. The -S flag makes Bowtie output the alignments in SAM format while the -p 2 flag tells the Bowtie instance to use two cores.

def bowtie-align( idx : File, fastq : File ) -> <bam : File>
in Bash *{
  bam=$fastq.bam
  tar xf $idx
  bowtie -S idx $fastq | samtools view -b - > $bam
}*

MACS

The MACS tool is used to perform peak calling. It consumes a pair of alignments one for the treatment and one for the control group and outputs peak regions and summits in bed format. In addition it creates a number of tables in xls format as well as coverage information in the form of bed graphs.

def macs( tag : File, ctl : File ) ->
  <peaks   : File,
   summits : File,
   xls     : [File],
   tag_bed : File,
   ctl_bed : File>
in Bash *{
  macs14 -t $tag \
         -c $ctl \
         --format BAM \
         --gsize 4639675 \
         --name "macs14" \
         --bw 400 \
         --keep-dup 1 \
         --bdg \
         --single-profile \
         --diag

  peaks=macs14_peaks.bed
  summits=macs14_summits.bed
  xls=(macs14_diag.xls macs14_negative_peaks.xls)
  tag_bed=macs14_MACS_bedGraph/treat/macs14_treat_afterfiting_all.bdg.gz
  ctl_bed=macs14_MACS_bedGraph/control/macs14_control_afterfiting_all.bdg.gz
}*

SAMtools

While SAMtools is not directly needed for peak calling, we use it to remove duplicates from the alignments generated in read mapping to obtain an alternative pair of coverage graphs using deepTools. In addition we use SAMtools to create a reference index used by bedtools to extract the FastA sequences of the identified peak regions.

Sorting is a step necessary prior to removing duplicates. The task samtools-sort consumes an alignment file in SAM format, converts it to BAM and sorts it without writing the intermediate BAM file to disk.

def samtools-sort( bam : File ) -> <sorted : File>
in Bash *{
  sorted=sorted.bam
  samtools sort -m 2G $bam -o $sorted
}*

The task samtools-rmdup consumes a sorted alignment file in BAM format and outputs a BAM file containing only unique alignments.

def samtools-dedup( bam : File ) -> <dedup : File>
in Bash *{
  dedup=dedup.bam
  samtools rmdup -s $bam $dedup
}*

The task samtools-index creates an index on a given alignment file in BAM format.

def samtools-index( bam : File ) -> <bai : File>
in Bash *{
  bai=$bam.bai
  samtools index $bam
}*

The task samtools-faidx creates and index on a given reference sequence in FastA format.

def samtools-faidx( fa : File ) -> <fai : File>
in Bash *{
  samtools faidx $fa
  fai=$fa.fai
}*

deepTools

While MACS provides us with viable coverage information we can use deepTools to as an alternative for generating coverage information.

def bamcoverage( bam : File, bai : File ) -> <bedgraph : File>
in Bash *{
  bedgraph=$bam.bedgraph
  ln -sf $bai $bam.bai
  bamCoverage --bam $bam \
              --outFileName $bedgraph \
              --outFileFormat bedgraph
}*

Peak Restriction

We shorten the peak regions in a bed file, which is essentially a table storing the start and stop position of a peak region, by adding 100 bp to the start position while subtracting 100 bp from the ending position. This can be achieved by evaluating a short expression in Perl.

def restrict-peaks( bed : File ) -> <restricted : File>
in Bash *{
  restricted=$bed.100.bed
  perl -lane '$start=$F[1]+100; $end = $F[2]-100 ; print "$F[0]\t$start\t$end"' \
    $bed > $restricted
}*

bedtools

We can use bedtools to obtain the nucleotide sequence in FastA format for all peak regions stored in a bed file. Note that we have to rename the SAMtools FastA index to match the name of the corresponding FastA file because bedtools implicitly assumes the index to find the index under this name or create a new one.

def bedtools-getfasta( fa : File, fai : File, bed : File ) ->
  <bed_fa : File>
in Bash *{
  bed_fa=$bed.fa
  ln -sf $fai $fa.fai
  bedtools getfasta -fi $fa -bed $bed -fo $bed_fa
}*

Input Data

The input data for this workflow are a pair of sequence archive files in SRA format obtained from NCBI GEO and the Escherichia coli reference genome in FastA format. We use an older version of the reference genome here, because that genome was current when the original study was published.

let ecoli-fa-gz : File = 'GCF_000005845.2_ASM584v2_genomic.fna.gz';
let tag-sra     : File = 'SRR576933.sra';
let ctl-sra     : File = 'SRR576938.sra';

Workflow Definition

In this section we apply the previously defined functions to the input data defined in the input data section. This way we construct the data dependencies of the workflow.

Data Preprocessing

First, we decompress the Escherichia coli reference genome.

let <file = ecoli-fa : File> = gunzip( gz = ecoli-fa-gz );

We convert tag and control samples from SRA format to FastQ.

let <fastq = tag-fastq : File> = fastq-dump( sra = tag-sra );
let <fastq = ctl-fastq : File> = fastq-dump( sra = ctl-sra );

We perform quality control using FastQC.

let <zip = tag-qc : File> = fastqc( fastq = tag-fastq );
let <zip = ctl-qc : File> = fastqc( fastq = ctl-fastq );

Before we can perform read mapping, we need to create a Bowtie index of the Escherichia coli reference genome.

let <idx = ecoli-idx : File> = bowtie-build( fa = ecoli-fa );

Next, we perform read mapping for both treatment and control group using Bowtie.

let <bam = tag-bam : File> =
  bowtie-align( idx   = ecoli-idx,
                fastq = tag-fastq );

let <bam = ctl-bam : File> =
  bowtie-align( idx   = ecoli-idx,
                fastq = ctl-fastq );

Peak Calling with deepTools

We sort the alignments using SAMtools sort.

let <sorted = tag-sorted-bam : File> = samtools-sort( bam = tag-bam );
let <sorted = ctl-sorted-bam : File> = samtools-sort( bam = ctl-bam );

We deduplicate the sorted alignments using SAMtools rmdup.

let <dedup = tag-dedup-bam : File> =
  samtools-dedup( bam = tag-sorted-bam );

let <dedup = ctl-dedup-bam : File> =
  samtools-dedup( bam = ctl-sorted-bam );

We index the deduplicated alignments using SAMtools index.

let <bai = tag-dedup-bai : File> =
  samtools-index( bam = tag-dedup-bam );

let <bai = ctl-dedup-bai : File> =
  samtools-index( bam = ctl-dedup-bam );

Next, we call peaks with deepTools generating a coverage graph pair. The calls of the bamcoverage functions are independent and, thus, run in parallel if possible.

let <bedgraph = tag-deeptools-bedgraph : File> =
  bamcoverage( bam = tag-dedup-bam,
               bai = tag-dedup-bai );

let <bedgraph = ctl-deeptools-bedgraph : File> =
  bamcoverage( bam = ctl-dedup-bam,
               bai = ctl-dedup-bai );

Peak Calling with MACS

We call peaks using MACS. The macs function consumes the alignment pair in BAM format and produces an annotation file in bed format that contains the peak information.

let <peaks = peaks-macs-bed : File> =
  macs( tag = tag-bam,
        ctl = ctl-bam );

We restrict the peaks found by MACS.

let <restricted = restricted-macs-bed : File> =
  restrict-peaks( bed = peaks-macs-bed );

We index the reference genome using SAMtools. This reference index becomes important in the next step when we extract the peak sequences from the bed file. Bedtools creates this index automatically if it cannot find it. But it is good practice to create the index in an extra step, which Cuneiform can run in parallel.

let <fai = ecoli-fai : File> = samtools-faidx( fa = ecoli-fa );

We extract the sequence information from the peak regions discovered by MACS:

let <bed_fa = peaks-macs-fa : File> =
  bedtools-getfasta( fa  = ecoli-fa,
                     fai = ecoli-fai,
                     bed = restricted-macs-bed );

Query

Last, we query the variables we’re interested in. Only computing steps that contribute to the query are processed by Cuneiform.

We query the quality reports, the coverage graph for treatment and control group, and the peak and summit information discovered by MACS including the peak sequences in FastA format.

<tag-qc                 = tag-qc,                 % quality control
 ctl-qc                 = ctl-qc,

 tag-deeptools-bedgraph = tag-deeptools-bedgraph, % deepTools results
 ctl-deeptools-bedgraph = ctl-deeptools-bedgraph,

 restricted-macs-bed    = restricted-macs-bed,    % MACS results
 peaks-macs-fa          = peaks-macs-fa>;