Clover is a program for identifying functional sites in DNA sequences. If you give it a set of DNA sequences that share a common function, it will compare them to a library of sequence motifs (e.g. transcription factor binding patterns), and identify which if any of the motifs are statistically overrepresented in the sequence set.
Sequences may be entered in FASTA, raw, or GenBank format. Any non-alphabetic characters in the sequence will be ignored, and any alphabetic characters except A, C, G and T (uppercase or lowercase) will be converted to 'n' and excluded from matching motifs. If GenBank format is used, Clover will read and display any 'CDS' (protein-coding region) annotations. Limits: at most 20 sequences, with a total length of up to 100 kb.
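The character handling described above can be sketched in a few lines of Python. This is only an illustration of the stated rules, not Clover's own code: the function names are hypothetical, the sketch assumes raw sequence text (FASTA title lines and GenBank annotations would need to be stripped first), and it assumes "100 kb" means 100,000 nucleotides.

```python
MAX_SEQUENCES = 20          # limit stated on this page
MAX_TOTAL_LENGTH = 100_000  # assumption: "100 kb" read as 100,000 nucleotides


def normalize_sequence(raw: str) -> str:
    """Drop non-alphabetic characters; turn letters other than A/C/G/T into 'n'."""
    cleaned = []
    for ch in raw:
        if not ch.isalpha():
            continue  # digits, whitespace and punctuation are ignored
        cleaned.append(ch if ch.upper() in "ACGT" else "n")
    return "".join(cleaned)


def within_limits(sequences: list) -> bool:
    """Check the sequence-count and total-length limits on normalized sequences."""
    total = sum(len(s) for s in sequences)
    return len(sequences) <= MAX_SEQUENCES and total <= MAX_TOTAL_LENGTH
```

For instance, `normalize_sequence("1 acgt RYk-AC")` keeps the case of the A, C, G and T letters and replaces R, Y and k with 'n'.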
For example GenBank accession numbers (e.g. NC_001669), 'accession.version' numbers (e.g. NC_001669.1), or GI numbers (e.g. 9628421).
Limit the search to a subsequence by entering its start and end coordinates. (The first nucleotide in the sequence has coordinate 1.) This option will be ignored if more than 1 sequence is entered.
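As a small sketch of the coordinate convention, the following Python function maps 1-based coordinates onto a 0-based slice. It assumes the end coordinate is inclusive, which this page does not state, and the function name is hypothetical.

```python
def apply_range(sequences: list, start: int, end: int) -> list:
    """Restrict a single sequence to [start, end], using 1-based coordinates.

    Assumption: 'end' is inclusive. With more than one sequence the range
    is ignored, as described above.
    """
    if len(sequences) != 1:
        return list(sequences)
    seq = sequences[0]
    if not (1 <= start <= end <= len(seq)):
        raise ValueError("coordinates must satisfy 1 <= start <= end <= length")
    # 1-based inclusive [start, end] corresponds to the slice [start-1:end]
    return [seq[start - 1:end]]
```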
Cis-elements can be entered as TRANSFAC-style matrices, which look like this:
NA AML-1a
XX
DE runt-factor AML-1
XX
BF T02256; AML1a; Species: human, Homo sapiens.
XX
P0 A C G T
01 5 1 2 49 T
02 2 2 52 1 G
03 4 14 1 38 T
04 0 0 57 0 G
05 1 0 55 1 G
06 1 4 0 52 T

You can copy-and-paste these directly from the TRANSFAC website. All lines except the name line (beginning with 'NA') and the nucleotide frequency lines (beginning with digits) are ignored and not required. The name line is required, and should be above the base frequency lines. Alternatively, each cis-element can have a title line starting with ">" and then the name of the element, followed by 4 numbers per line describing nucleotide frequencies at each position in the cis-element. For example:
>element1
0 4 2 14
12 0 0 8
8 0 1 11
20 0 0 0
>element2
13 1 1 5
...

These numbers might come from a multiple alignment of experimentally determined cis-elements. The first column indicates the number of adenines observed in each position, the second column the number of cytosines, the third column the number of guanines, and the fourth column the number of thymines. Gaps in the cis-element may be indicated by entering a single "b" or "n" on a line. Clover will use background nucleotide frequencies at these positions. This option allows users to specify multipartite cis-elements.
The two formats can be mixed.
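To make the two matrix formats concrete, here is a minimal Python sketch of a reader that accepts both in one input. It is an assumption-laden illustration rather than Clover's parser: the name `parse_motifs` is hypothetical, a gap position is represented as `None`, and a frequency line is interpreted according to whichever title line ('NA' or ">") most recently opened the element, since lines in both formats begin with digits.

```python
from typing import Dict, List, Optional

Row = Optional[List[float]]  # counts for A, C, G, T; None marks a gap position


def parse_motifs(text: str) -> Dict[str, List[Row]]:
    """Read TRANSFAC-style and '>'-style matrices from one block of text."""
    motifs: Dict[str, List[Row]] = {}
    rows: Optional[List[Row]] = None  # rows of the element being read
    transfac = False                  # format of the element being read
    for line in text.splitlines():
        tokens = line.split()
        if not tokens:
            continue
        if line.lstrip().startswith(">"):
            rows = motifs[line.lstrip()[1:].strip()] = []
            transfac = False
        elif tokens[0] == "NA":
            rows = motifs[" ".join(tokens[1:])] = []
            transfac = True
        elif rows is not None and tokens[0][0].isdigit():
            # TRANSFAC lines start with a position index, then the four counts
            fields = tokens[1:5] if transfac else tokens[0:4]
            if len(fields) != 4:
                raise ValueError("expected 4 counts in line: " + line)
            rows.append([float(x) for x in fields])
        elif rows is not None and not transfac and line.strip().lower() in ("b", "n"):
            rows.append(None)  # gap: background frequencies apply here
        # every other line (XX, DE, BF, P0, ...) is ignored
    return motifs
```

Each element maps to a list of rows in A, C, G, T order, so a multipartite element simply contains `None` entries between its parts.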
Another source of motifs is TRANSFAC: the commercial nature of this database prevents us from providing it directly.
score = log[ prob(sequence|motif) / prob(sequence|random) ]
Motifs with log-likelihood scores higher than this value will be reported. The default is 6.
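The score formula can be illustrated with a short Python sketch. Several details here are assumptions that this page does not specify: natural logarithms, a uniform background of 0.25 per nucleotide, a pseudocount of 1 spread over the four bases, and scanning of the forward strand only. The names `llr_score` and `scan` are hypothetical.

```python
import math

BASES = "ACGT"
UNIFORM = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}  # assumed background


def llr_score(site, counts, background=UNIFORM, pseudocount=1.0):
    """log[ prob(site|motif) / prob(site|random) ] for one candidate site.

    'counts' holds one [A, C, G, T] row per position, or None for a gap.
    """
    score = 0.0
    for base, row in zip(site.upper(), counts):
        if base not in BASES:
            return float("-inf")  # 'n' is excluded from matching
        if row is None:
            continue  # gap: motif uses background frequencies, so the ratio is 1
        p_motif = (row[BASES.index(base)] + pseudocount * background[base]) / (
            sum(row) + pseudocount
        )
        if p_motif == 0.0:
            return float("-inf")  # only possible when pseudocount is 0
        score += math.log(p_motif / background[base])
    return score


def scan(seq, counts, threshold=6.0):
    """Return (1-based start, score) for windows scoring above the threshold."""
    width = len(counts)
    hits = []
    for start in range(len(seq) - width + 1):
        s = llr_score(seq[start:start + width], counts)
        if s > threshold:
            hits.append((start + 1, s))
    return hits
```

Under these assumptions, a perfect match to a strongly conserved six-position matrix scores a little above the default threshold of 6, while windows with one or two poor positions fall well below it.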
Clover will compare each motif in turn to the sequence set, and calculate a "raw score" indicating how strongly the motif is present in the sequence set. Raw scores by themselves are hard to interpret, so Clover provides options (which we recommend you use) to determine the statistical significance of the raw scores. Four ways of determining statistical significance are available.

The first involves providing Clover with one or more files of background DNA sequences. Each background file should contain sequences in FASTA format, with total length much greater than the target sequence set. For each background set, Clover will repeatedly extract random fragments matched by length to the target sequences, and calculate raw scores for these fragments. The proportion of times that the raw score of a fragment set exceeds or equals the raw score of the target set, e.g. 0.02, is called a P-value.
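The background-set procedure described above amounts to an empirical proportion, which can be sketched in Python as follows. This is an illustration under assumptions, not Clover's implementation: `raw_score` stands in for whatever function computes a raw score for a set of sequences (it is not defined on this page), fragments are drawn uniformly from background sequences that are long enough, and the number of trials is arbitrary.

```python
import random


def random_fragment(background, length, rng):
    """Pick a random fragment of the given length from the background set."""
    candidates = [s for s in background if len(s) >= length]
    if not candidates:
        raise ValueError("no background sequence is long enough")
    seq = rng.choice(candidates)
    start = rng.randrange(len(seq) - length + 1)
    return seq[start:start + length]


def background_p_value(targets, background, raw_score, trials=1000, seed=0):
    """Proportion of length-matched fragment sets scoring >= the target set."""
    rng = random.Random(seed)
    target_score = raw_score(targets)
    at_least = 0
    for _ in range(trials):
        # one fragment per target sequence, matched by length
        control = [random_fragment(background, len(t), rng) for t in targets]
        if raw_score(control) >= target_score:
            at_least += 1
    return at_least / trials
```

With several background sets, the same function would be called once per set, giving one proportion for each.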
The second way of determining statistical significance is to repeatedly shuffle the letters within each target sequence, and use these shuffled sequence sets as controls.
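The shuffling control can be sketched in the same way. Again, this is a hedged illustration: `raw_score` is a hypothetical stand-in for the raw score calculation, and the trial count and seed are arbitrary choices.

```python
import random


def shuffle_sequence(seq, rng):
    """Return the letters of one sequence in random order."""
    letters = list(seq)
    rng.shuffle(letters)
    return "".join(letters)


def shuffle_p_value(targets, raw_score, trials=1000, seed=0):
    """Proportion of shuffled sequence sets scoring >= the original set."""
    rng = random.Random(seed)
    target_score = raw_score(targets)
    at_least = 0
    for _ in range(trials):
        # shuffle within each sequence, so length and composition are kept
        control = [shuffle_sequence(s, rng) for s in targets]
        if raw_score(control) >= target_score:
            at_least += 1
    return at_least / trials
```

Because each sequence is shuffled on its own, every control set has the same lengths and nucleotide composition as the target set; only the order of the letters changes.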
In our experience to date, the use of background sequence sets works best. However, it is necessary to choose the background sets carefully: they should ideally come from the same taxonomic group as the target sequences, and have similar repetitive element and GC content. We like to cover our bases by using multiple background sets, e.g. for human target sequences, we might use a human chromosome, a set of human CpG islands, and a set of human gene upstream regions as backgrounds. The methods that randomize nucleotides and dinucleotides tend to report motifs that lie in Alus and other common repetitive elements as significant. You should avoid including orthologous sequences from closely related species, e.g. human and mouse, as that will artificially boost the significance of motifs in these sequences.
Unfortunately the source code doesn't compile successfully on all systems. We'd love to hear your suggestions for making it more portable.
If you choose to feed command-line Clover output to the top of this page for visualization purposes, please make sure that your input text file is intact. Do not use saved HTML files, since a browser may corrupt the format when it escapes characters.
Good luck finding those motifs!
Suggestions to: Martin Frith
Last modified: Saturday, 09-Aug-2003 14:29:38 EDT