Clover: Cis-eLement OVERrepresentation

Detect cis-elements in DNA sequences

Enter DNA sequences or GenBank identifiers:
AND/OR upload a file of sequences or GIs:
(Optional) Set subsequence From: To:
    Select one or more cis-elements:
TATA Sp1 CRE ERE NF-1 E2F Mef-2 Myf
CCAAT AP-1 Ets Myc GATA LSF SRF Tef
Use JASPAR matrices
AND/OR enter your own cis-elements:
(Get matrices from TRANSFAC; free registration)
AND/OR upload cis-elements from a file:
    Select background distribution

Human chromosome 20 (44.1% C+G) - finished sequence
Sequences 2000 bp upstream of human genes (49.8% C+G) - from UCSC 08-Jul-2003
Human CpG islands (68.8% C+G, median length = 557 bp) - from UCSC 14-Apr-2003
Mouse chromosome 19 (42.8% C+G) - NCBI Build 30
Sequences 2000 bp upstream of mouse genes (47.8% C+G) - from UCSC 25-Apr-2003
Drosophila chromosome 2 arm R (43.5% C+G) - from BDGP Release 3

AND/OR enter your own background sequences:
AND/OR upload a background sequence file:
Parameters: (use the defaults if in doubt)

Score threshold P-value threshold Pseudocount
Number of randomizations, with seed, for:
Sequence shuffles. Dinucleotide randomizations. Motif shuffles.
Mask lower case letters in sequences
Display detailed mapping. (Number of bases per line: )

(Optional) Command-line Clover output file:

 


Instructions


Clover is a program for identifying functional sites in DNA sequences. If you give it a set of DNA sequences that share a common function, it will compare them to a library of sequence motifs (e.g. transcription factor binding patterns), and identify which if any of the motifs are statistically overrepresented in the sequence set.

Sequence Format

Sequences may be entered in FASTA, raw, or GenBank format. Any non-alphabetic characters in the sequence will be ignored, and any alphabetic characters except A, C, G and T (uppercase or lowercase) will be converted to 'n' and excluded from matching motifs. If GenBank format is used, Clover will read and display any 'CDS' (protein-coding region) annotations. Limits: at most 20 sequences, of total length up to 100 kb.
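
The character handling described above can be sketched in Python. This is only an illustration of the stated rules, not Clover's own code; the function name normalize_sequence is hypothetical, and the sketch handles raw sequence text only (a FASTA title line or GenBank annotations would need to be stripped first).

```python
def normalize_sequence(text: str) -> str:
    """Apply the stated rules: ignore non-alphabetic characters and
    convert every letter other than A, C, G, T (either case) to 'n'."""
    cleaned = []
    for ch in text:
        if not ch.isalpha():
            continue  # digits, whitespace and punctuation are ignored
        cleaned.append(ch if ch in "ACGTacgt" else "n")
    return "".join(cleaned)


if __name__ == "__main__":
    # Numbered, spaced text such as a GenBank-style sequence line.
    print(normalize_sequence("1 acgt RYN\n61 GG-TT"))
```

Case is preserved here because the form offers a separate option to mask lower case letters.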

GenBank Identifiers

Sequences can also be specified by GenBank identifiers: accession numbers (e.g. NC_001669), 'accession.version' numbers (e.g. NC_001669.1), or GI numbers (e.g. 9628421).

Set Subsequence

Limit the search to a subsequence by entering its start and end coordinates. (The first nucleotide in the sequence has coordinate 1.) This option will be ignored if more than one sequence is entered.
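
As a small illustration of the 1-based coordinates, here is a Python sketch. The function name subsequence is hypothetical, and treating the end coordinate as inclusive is an assumption; the text above does not say either way.

```python
def subsequence(seq: str, start: int, end: int) -> str:
    """Return positions start..end of seq, counting the first nucleotide as 1.

    Assumption: the end coordinate is inclusive.
    """
    if not 1 <= start <= end <= len(seq):
        raise ValueError("coordinates must satisfy 1 <= start <= end <= length")
    # Python slices are 0-based and end-exclusive, hence start - 1.
    return seq[start - 1:end]


if __name__ == "__main__":
    print(subsequence("ACGTAC", 2, 4))
```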

Format for User-defined Cis-elements

Cis-elements can be entered as TRANSFAC-style matrices, which look like this:

NA   AML-1a
XX
DE   runt-factor AML-1
XX
BF   T02256; AML1a; Species: human, Homo sapiens.
XX
P0      A      C      G      T
01      5      1      2     49      T
02      2      2     52      1      G
03      4     14      1     38      T
04      0      0     57      0      G
05      1      0     55      1      G
06      1      4      0     52      T
You can copy and paste these directly from the TRANSFAC website. All lines except the name line (beginning with 'NA') and the nucleotide frequency lines (beginning with digits) are ignored and not required. The name line is required, and should be above the nucleotide frequency lines. Alternatively, each cis-element can have a title line starting with ">" followed by the name of the element, and then one line of 4 numbers for each position in the cis-element, giving the nucleotide frequencies at that position. For example:
>element1
0  4 2 14
12 0 0 8
8  0 1 11
20 0 0 0
>element2
13 1 1 5
...
These numbers might come from a multiple alignment of experimentally determined cis-elements. The first column indicates the number of adenines observed in each position, the second column the number of cytosines, the third column the number of guanines, and the fourth column the number of thymines. Gaps in the cis-element may be indicated by entering a single "b" or "n" on a line. Clover will use background nucleotide frequencies at these positions. This option allows users to specify multipartite cis-elements.

The two formats can be mixed.
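
For readers who want to check their own matrix files before uploading, the ">" format can be read with a few lines of Python. This is a sketch of the format as described above, not Clover's parser; the function name parse_simple_matrices is hypothetical, and it does not handle TRANSFAC-style entries.

```python
def parse_simple_matrices(text):
    """Parse '>'-titled count matrices into {name: [row, ...]}.

    Each row is a tuple of four numbers in the order A, C, G, T.
    A gap line (a single 'b' or 'n') is stored as None.
    """
    motifs = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            current = line[1:].strip()
            motifs[current] = []
        elif current is None:
            raise ValueError("count line found before any '>' title line")
        elif line in ("b", "n"):
            motifs[current].append(None)  # gap position
        else:
            counts = tuple(float(x) for x in line.split())
            if len(counts) != 4:
                raise ValueError(f"expected 4 numbers, got: {line!r}")
            motifs[current].append(counts)
    return motifs


if __name__ == "__main__":
    example = ">element1\n0  4 2 14\n12 0 0 8\nn\n20 0 0 0\n"
    for name, rows in parse_simple_matrices(example).items():
        print(name, rows)
```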

To use Clover, you will need a library of sequence motifs. We provide the JASPAR collection of transcription factor binding site patterns. JASPAR is described in the following publication; please give suitable credit:
Lenhard B, Sandelin A, Mendoza L, Engstrom P, Jareborg N, Wasserman WW. Identification of conserved regulatory elements by comparative genome analysis. J Biol 2003 2(2):13.

Another source of motifs is TRANSFAC: the commercial nature of this database prevents us from providing it directly.

Background Sequences

We recommend using background sequence sets to contrast with your sequences of interest. Which background sets to use depends on which sequences you are studying: e.g. they should come from the same taxonomic group and have similar GC compositions.

Score Threshold

Clover prints details for statistically significant motifs (all P-values <= some threshold, by default 0.01), and then finds instances of these motifs in the sequences. Motif instances are scored using the standard log likelihood ratio method:
score = log[ prob(sequence|motif) / prob(sequence|random) ]

Motif instances with log likelihood ratio scores higher than the score threshold will be reported. The default threshold is 6.
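
To make the formula concrete, here is a Python sketch that scores one candidate site against a count matrix. Several details are assumptions made for illustration and may differ from what Clover does: the natural logarithm, a uniform background of 0.25 per base, and a pseudocount of 1.0 added to each count. The function name llr_score is hypothetical. The matrix is the AML-1a example shown earlier on this page.

```python
import math

BASES = "ACGT"

# AML-1a counts from the TRANSFAC-style example above (columns A, C, G, T).
AML1A = [
    (5, 1, 2, 49),
    (2, 2, 52, 1),
    (4, 14, 1, 38),
    (0, 0, 57, 0),
    (1, 0, 55, 1),
    (1, 4, 0, 52),
]


def llr_score(site, counts, background=(0.25, 0.25, 0.25, 0.25), pseudocount=1.0):
    """log[ prob(site | motif) / prob(site | random) ], summed over positions.

    Assumptions: natural log, and the pseudocount is added to every count.
    """
    if len(site) != len(counts):
        raise ValueError("site and motif must have the same length")
    score = 0.0
    for base, row in zip(site.upper(), counts):
        i = BASES.find(base)
        if i < 0:
            raise ValueError("site contains a letter other than A, C, G, T")
        p_motif = (row[i] + pseudocount) / (sum(row) + 4 * pseudocount)
        score += math.log(p_motif / background[i])
    return score


if __name__ == "__main__":
    for site in ("TGTGGT", "AAAAAA"):
        print(site, round(llr_score(site, AML1A), 2))
```

A site matching the consensus scores well above zero, while a poor match scores below zero; the score threshold decides which instances are listed.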

Statistical Significance

Clover will compare each motif in turn to the sequence set, and calculate a "raw score" indicating how strongly the motif is present in the sequence set. Raw scores by themselves are hard to interpret, so Clover provides options (which we recommend you use) to determine the statistical significance of the raw scores. Four ways of determining statistical significance are available. The first involves providing Clover with one or more files of background DNA sequences. Each background file should contain sequences in FASTA format, with total length much greater than the target sequence set. For each background set, Clover will repeatedly extract random fragments matched by length to the target sequences, and calculate raw scores for these fragments. The proportion of times that the raw score of a fragment set exceeds or equals the raw score of the target set, e.g. 0.02, is called a P-value. The P-value indicates the probability that the motif's presence in the target set can be explained just by chance. For each motif, a separate P-value is calculated for each background file.
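
The background-fragment procedure can be sketched as follows. This is a simplified illustration, not Clover's implementation: the background is treated as one long string, the names sample_fragments and empirical_p_value are hypothetical, and raw_score stands in for Clover's raw score, whose calculation is not described on this page.

```python
import random


def sample_fragments(background, lengths, rng):
    """Cut one random fragment out of the background for each target length."""
    fragments = []
    for n in lengths:
        start = rng.randrange(len(background) - n + 1)
        fragments.append(background[start:start + n])
    return fragments


def empirical_p_value(target_seqs, background, raw_score, trials=1000, seed=1):
    """Proportion of random fragment sets scoring >= the target set."""
    rng = random.Random(seed)
    target = raw_score(target_seqs)
    lengths = [len(s) for s in target_seqs]
    hits = 0
    for _ in range(trials):
        control = sample_fragments(background, lengths, rng)
        if raw_score(control) >= target:
            hits += 1
    return hits / trials


if __name__ == "__main__":
    # Toy stand-in for a raw score: count exact matches to one word.
    def toy_score(seqs):
        return sum(s.count("TGTGGT") for s in seqs)

    background = "ACGT" * 500
    print(empirical_p_value(["AATGTGGTAA", "TGTGGTCCCC"], background, toy_score))
```

With a finite number of trials the smallest non-zero P-value is 1/trials, so more randomizations give finer resolution at the cost of running time.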

The second way of determining statistical significance is to repeatedly shuffle the letters within each target sequence, and use these shuffled sequence sets as controls. P-values are calculated as above. The third way is to create random sequences with the same dinucleotide compositions as each target sequence. The fourth way is to shuffle the motif matrices, and obtain control raw scores by comparing the shuffled motifs to the target sequences. When shuffling a motif, the counts of A, C, G and T within each position are not shuffled, but the positions are shuffled among one another.
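
Two of these controls are easy to sketch in Python: shuffling the letters of a sequence, and shuffling the positions of a motif while leaving the counts within each position untouched. The function names are hypothetical and this is not Clover's code. The dinucleotide randomization is omitted because the text does not say how those sequences are generated.

```python
import random


def shuffle_sequence(seq, rng):
    """Return seq with its letters in random order (same base composition)."""
    letters = list(seq)
    rng.shuffle(letters)
    return "".join(letters)


def shuffle_motif(counts, rng):
    """Permute the positions (rows) of a count matrix.

    The A, C, G, T counts inside each position are kept together,
    so only the order of positions changes.
    """
    positions = list(counts)
    rng.shuffle(positions)
    return positions


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so runs are repeatable
    print(shuffle_sequence("ACGTTTGACC", rng))
    print(shuffle_motif([(5, 1, 2, 49), (2, 2, 52, 1), (4, 14, 1, 38)], rng))
```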

In our experience to date, the use of background sequence sets works best. However, it is necessary to choose the background sets carefully: they should ideally come from the same taxonomic group as the target sequences, and have similar repetitive element and GC content. We like to cover our bases by using multiple background sets, e.g. for human target sequences, we might use a human chromosome, a set of human CpG islands, and a set of human gene upstream regions as backgrounds. The methods that randomize nucleotides and dinucleotides have a weakness: they tend to report motifs that lie in Alus and other common repetitive elements as significant. You should avoid including orthologous sequences from closely related species, e.g. human and mouse, as that will artefactually boost the significance of motifs in these sequences.

Get command-line Clover

  1. Download Clover by clicking on one of these links, and saving the file on your computer:
    Clover executable for Linux (RedHat 7.2/7.3)
    Clover executable for Sun (Solaris 8)
  2. Set execute permission for the file by typing chmod +x clover-linux (or whatever you saved it as).
  3. Clover is now ready to run.
OR
  1. Download the Clover source code.
  2. Uncompress: gunzip clover-src.tar.gz
  3. Un-archive: tar -xvf clover-src.tar
  4. Change directory: cd clover-src
  5. Compile (cross your fingers): make

Unfortunately the source code doesn't compile successfully on all systems. We'd love to hear your suggestions for making it more portable.

If you choose to feed command-line Clover output to the form at the top of this page for visualization, please make sure your input text file is intact. Do not use saved HTML files, since a browser may alter the format when it escapes characters.

Good luck finding those motifs!


Suggestions to: Martin Frith
Last modified: Saturday, 09-Aug-2003 14:29:38 EDT