CIpipe:
CRISPR indel pipe
Apr 22, 2016
Yingxiang Li
Content
Introduction
CRISPR-Cas9 is a powerful tool for sequence-specific genome editing. The Cas protein cuts genomic DNA at locations complementary to a single guide RNA. Insertions and deletions (indels) often result when the cuts are repaired. Currently, there is no easy-to-use computational pipeline to determine the locations, identities, and frequencies of the indels. We have developed a pipeline, named CIpipe (CRISPR Indel pipeline), to identify indels in high-throughput DNA sequencing data and provide the statistical characterization of these indels.
Installation
CIpipe runs only on Mac OS or Linux.
Install the following first: Python 2.7.10, R 3.2.2, BWA 0.7.5a, FastQC v0.11.2, SAMtools 1.3, and Java 1.7.0_95.
After installing pip, type in your terminal:
(sudo) pip install CIpipe (--upgrade)
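Before installing, it can help to confirm that the prerequisite programs are reachable. The following is a minimal Python 3 sketch and is not part of CIpipe. The executable names in REQUIRED_TOOLS are assumptions about how the tools are named on your system, and the sketch only checks that each one is on PATH, not which version is installed.

```python
import shutil

# Executable names are assumptions; adjust them to match your system.
REQUIRED_TOOLS = ["python", "R", "bwa", "fastqc", "samtools", "java"]


def find_missing(tools, which=shutil.which):
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = find_missing(REQUIRED_TOOLS)
    if missing:
        print("Missing from PATH: " + ", ".join(missing))
    else:
        print("All prerequisite executables were found on PATH.")
```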
Synopsis
CIpipe -R ref.fa -D data/ -O output/
CIpipe -R ref.fa -D data/ -O output/ -N test1
CIpipe -R ref.fa -D data/ -O output/ -N test1 -P 0.01 -B 15 -A 0.001
CIpipe -R ref.fa -D data/ -O output/ -N test1 -F -X -VI
CIpipe -R ref.fa -D data/ -O output/ -N test1 -VS -VC -VR
CIpipe -R ref.fa -D data/ -O output/ -N test1 -T chr1:100 -US 20 -DS 20
CIpipe -E
CIpipe -I test2.input.tab
Commands and Options
For a single sample analysis.
CIpipe -R reference -D data -O output
optional arguments:
-h, --help           show this help message and exit
-R, --reference      sample reference file, fasta format. (eg: my_ref.fa)
-D, --data           sample data directory, fastq-ONLY. one file for single end, two files for paired end. (eg: my_data/)
-O, --output         output directory, will be created if it does not exist. (eg: my_output/)
-N, --name           sample name, default is the name of the output directory. (eg: my_sample)
-RK, --rank          sample rank. (eg: 1)
-P, --pvalue         p-value threshold, default: 0.05.
-B, --basequality    minimal base quality, default: 30.
-A, --varfreq        minimal variant frequency, default: 0.0001.
-T, --target         CRISPR target position. indels in the target range will be picked out, multiple targets separated by ',', default: ''. (eg: gene1:100,gene2:200)
-US, --upstream      upstream distance from CRISPR target position, default: 20.
-DS, --downstream    downstream distance from CRISPR target position, default: 10.
-F, --fastqc         fastq quality control by FastQC, default: ON. -F will turn OFF.
-X, --index          build reference index by BWA, default: ON. -X will turn OFF.
-U, --unlimited      no read depth limit in mpileup by SAMtools, default: OFF.
-VI, --indel         search for indels by VarScan, default: ON. -VI will turn OFF.
-VS, --snp           search for SNPs by VarScan, default: OFF.
-VC, --consensus     search for consensus calls by VarScan, default: OFF.
-VR, --readcount     search for read counts by VarScan, default: OFF.
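To illustrate how the -T, -US and -DS values might combine, here is a small Python sketch. It is not CIpipe's code: parse_targets and target_window are hypothetical helpers, and the window is assumed to run from the position minus the upstream distance to the position plus the downstream distance, which the help text does not confirm.

```python
def parse_targets(target_arg):
    """Split a string such as 'gene1:100,gene2:200' into (name, position) pairs."""
    targets = []
    for item in target_arg.split(","):
        item = item.strip()
        if not item:
            continue  # the default '' means that no target was given
        name, _, position = item.rpartition(":")
        targets.append((name, int(position)))
    return targets


def target_window(position, upstream=20, downstream=10):
    """Return the assumed (start, end) range around a target position.

    The defaults mirror the documented -US and -DS defaults; the start is
    clamped to 1 so that it never falls before the reference.
    """
    return max(1, position - upstream), position + downstream


for name, position in parse_targets("gene1:100,gene2:200"):
    start, end = target_window(position)
    print("%s: positions %d to %d" % (name, start, end))
```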
For multiple samples and advanced analysis.
CIpipe -E
CIpipe -I input
optional arguments:
-h, --help       show this help message and exit
-E, --example    create example input data. modify the example.input.tab to fit your data.
-I, --input      information table of all input data. all settings should be in it. (eg: example.input.tab)
Workflow Charts
Annotation of example.input.tab
Type CIpipe -E to create example.input.tab, then modify it to fit your real data.
Test Cases
Data: a subset of the data from
Li,Y. et al. (2015) A versatile reporter system for CRISPR-mediated chromosomal rearrangements. Genome Biol., 16, 111.
The full data set is PRJNA283020. Only the first 20,000 lines of each fastq file are used here.
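One way to produce such a subset is sketched below in Python. head_fastq is a hypothetical helper, not the script that was used to prepare the test data. It assumes uncompressed fastq with four lines per read, so 20,000 lines correspond to 5,000 reads.

```python
import io
from itertools import islice

LINES_PER_READ = 4  # a standard fastq record spans four lines


def head_fastq(src, dst, n_lines=20000):
    """Copy the first n_lines lines from src to dst.

    src and dst are open text file objects. Returns the number of
    complete reads that were copied.
    """
    if n_lines % LINES_PER_READ != 0:
        raise ValueError("n_lines should be a multiple of 4 to keep reads intact")
    copied = 0
    for line in islice(src, n_lines):
        dst.write(line)
        copied += 1
    return copied // LINES_PER_READ


if __name__ == "__main__":
    # Tiny in-memory demonstration with two made-up reads.
    demo = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nTTGA\n+\nIIII\n")
    out = io.StringIO()
    print(head_fastq(demo, out, n_lines=4))  # copies the first read only
```

With real files, the same function can be called on two handles opened with open(), one for reading the full fastq file and one for writing the subset.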
1. Download the data: SRR2007490, SRR2007491, SRR2007493. (12.5MB, 12.4MB, 11.8MB)
2. Download the reference: refer.zip (LSL_1008bp.fa, iGFP_448bp.fa). (2KB)
3. For a single sample analysis (name: test1):
3.1. Extract refer.zip to refer/
3.2. Extract SRR2007490 to data/SRR2007490/
3.3. After the installation of CIpipe, in the terminal, type:
CIpipe -R refer/LSL_1008bp.fa -D data/SRR2007490/ -O output/ -N test1
3.4. CIpipe will show the progress on your terminal screen like this:
3.5. The files in the output/test1/result folder include:
test1.data.infor.txt (the mapping and data information)
test1.indel.brief.tab (a brief version of the indel result in VarScan/test1.indel.tab)
test1.indel.potential.LSL_1008bp:88.tab (the indels within the target position range. If the user did not specify the cut position, CIpipe assumes that the position with the highest variant frequency is the cut position and adds 'potential' to the file name.)
test1.indel.potential.LSL_1008bp:88.pdf (detail plot of the indels in the target region, ordered by position from small to large and by indel type from deletion to insertion)
test1.indel.potential.LSL_1008bp:88.sort.pdf (detail plot of the indels in the target region, ordered by variant frequency from high to low)
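The 'potential' cut position rule described above can be sketched in Python as follows. This is an illustration under assumptions, not CIpipe's implementation: potential_cut_site is a hypothetical helper, the tuple layout is invented for the example, and the numbers are made up rather than taken from the test data.

```python
def potential_cut_site(indels):
    """Return 'reference:position' for the indel with the highest variant
    frequency, or None if there are no indels.

    indels: list of (reference_name, position, variant_frequency) tuples.
    """
    if not indels:
        return None
    reference, position, _ = max(indels, key=lambda row: row[2])
    return "%s:%d" % (reference, position)


# Made-up values for illustration only; they are not results from the test data.
example = [
    ("LSL_1008bp", 85, 0.002),
    ("LSL_1008bp", 88, 0.031),
    ("LSL_1008bp", 120, 0.0005),
]
label = potential_cut_site(example)  # "LSL_1008bp:88"
```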
4. For multiple samples and advanced analysis:
4.1. Extract refer.zip to refer/
4.2. Extract SRR2007490, SRR2007491, SRR2007493 to
data/SRR2007490/, data/SRR2007491/, data/SRR2007493/
4.3. In the terminal, type:
CIpipe -E
example.input.tab will be generated in the current working directory.
4.4. Open example.input.tab, modify it to fit your data, and save it as test2.input.tab.
4.5. In the terminal, type:
CIpipe -I input/test2.input.tab
4.6. CIpipe will show the progress on your terminal screen like this:
4.7. The result folder of each sample contains the following files (example: LSL2):
LSL2.data.infor.txt (the mapping and data information)
LSL2.indel.brief.tab (the brief indel result)
LSL2.indel.LSL_1008bp:88.tab (the indels within the target region only. If the user specified the cut position, there is no 'potential' in the file name.)
LSL2.indel.LSL_1008bp:88.pdf (detail plot of the indels in the target region)
LSL2.indel.LSL_1008bp:88.sort.pdf (detail plot of the indels in the target region, ordered by variant frequency)
4.8. The batch result folder (test2.result/) contains the following files:
test2.indel.iGFP_448bp.mat (indels across all iGFP samples)
test2.indel.LSL_1008bp.mat (indels across all LSL samples)
5. You can adjust the parameters to filter the results. For example, lower the p value to 0.01 to get a stricter indel result table, or lower the base quality to 15 to report more potential indels.
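The effect of a stricter p value can be illustrated with a short Python sketch. It is not CIpipe's filtering code: filter_indels and the field names are hypothetical, the records are made up, and the p value is assumed to act as an upper bound, which is consistent with 0.01 being stricter than the default of 0.05.

```python
def filter_indels(indels, max_p_value=0.05, min_var_freq=0.0001):
    """Keep indels whose p value is at most max_p_value and whose
    variant frequency is at least min_var_freq."""
    return [
        indel for indel in indels
        if indel["p_value"] <= max_p_value and indel["var_freq"] >= min_var_freq
    ]


# Made-up records for illustration only.
example = [
    {"position": 88, "var_freq": 0.031, "p_value": 0.0004},
    {"position": 91, "var_freq": 0.002, "p_value": 0.03},
    {"position": 120, "var_freq": 0.00005, "p_value": 0.2},
]
default_hits = filter_indels(example)                    # positions 88 and 91
strict_hits = filter_indels(example, max_p_value=0.01)   # position 88 only
```

Base quality is left out of this sketch because it applies to the reads during variant calling rather than to the finished result table.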
Thanks
We thank the members of the Weng and Xue laboratories for helpful discussions, in particular Chunqing Song, Pengpeng Liu, Yu Fu, Tyler Borrman, Michael Purcaro and Arjan van der Velde for their insightful suggestions.