This folder contains the data and scripts used to generate figures for the ENCODE main Nature paper (Fig2A and Supplementary FigZ1A; see below for links) and for the Dong X et al. Genome Biology paper.

Author: Xianjun Dong (xianjun.dong.umass@gmail.com)
Date: 2012.03.28

===============
Usage
===============

# For Supplementary FigZ1A of the ENCODE Nature paper (http://www.nature.com/nature/journal/v489/n7414/extref/nature11247-s2.pdf)
Rscript Fig2A_aggregation.R
(mv ./result/Tx-based.K562.pdf FigZ1A.pdf)

# For Fig2A: (http://www.nature.com/nature/journal/v489/n7414/fig_tab/nature11247_F2.html)
Rscript correlation.comparison.R cpkw TSS-based all
(mv ./result/figures/pdf/TSS-based.CAGE.PolyA+.K562.WholeCell.all.bestbin.lm.0.10.pdf Fig2A.pdf)

# or, for a cross-cell-line comparison (e.g. Fig4B in the companion paper: http://genomebiology.com/2012/13/9/r53)
Rscript correlation.comparison.R cpkw TSS-based all cpgw

===============
Main parameters
===============

Rscript correlation.comparison.R <exp1> <type> <group> [exp2]

<exp1> : the short code for the expression experiment, e.g. cpkc="CAGE.PolyA+.K562.Cytosol", meaning the CAGE experiment on PolyA+ RNA extracted from the cytosol of K562 cells. The full explanation of each short code is in the file ./data/expr.shortname.longname.txt; the codes can also be found in mylib.R.

<type> : the type of expression data to use. The RNA group in ENCODE provides two types of expression data: transcript (Tx)-based and TSS-based. Transcript-based data is simply the expression value (e.g. RPKM for RNAseq, or RPM for CAGE) of each transcript/isoform. To measure the expression level of a promoter, transcripts sharing a TSS are merged and their average value is calculated for that TSS.

<group> : the group of genes the correlation analysis operates on. all means all genes; HCP means genes with high-CpG-content promoters; LCP means genes with low-CpG-content promoters.
The HCP/LCP definition is based on the normalized CpG content score, with a threshold of 0.4.

[exp2] : the short code for a second expression experiment to compare against. If set, a cross-cell-line analysis is performed (see Fig4B in the companion paper #GRCP039: Dong X et al. Modeling gene expression using chromatin features in various cellular contexts, submitted, 2012). The codes follow the same scheme as for the first expression experiment.

===============
Other parameters
===============

In the script forNature.correlation.comparison.R, you can set other parameters, such as:

PSEUDOCOUNT = 0.03              # pseudocount for RNAseq log2 transformation
nfold=10                        # number of folds for cross-validation
bM='bestbin'                    # strategy for choosing representative bins. You can choose 'bestbin', 'best5bins', 'nobin' or 'bins.above.0.2'. See mylib.R for more detail.
LOW_SIGNAL_CUTOFF = log2(0.03)  # threshold to define on/off genes in the classification

===============
Requirements
===============

Input files:
------------
./data/gencode_v7_hg19_transcripts.gtf.cpg.tab
./data/Gencodev7_TSS_July2011.gff  # PPM per TSS (provided by Sarah Djebali)
./data/gencode_v7_hg19_tr_with115_cshl_long_quantif.gff  # transcript-based FPKM for all 115 CSHL long RNAseq (provided by Sarah Djebali)
./data/Jan11.Anshul.all.81bins.level12.L4100.HisoneDnase/
    -->output_K562.H3k4me1
    -->output_K562.H3k4me2
    -->output_K562.H3k4me3
    -->...
[These binned density files should be pre-generated if they do not exist. See ./_utility/README.txt for instructions]

R libraries required:
---------------------
install.packages(c('relaimpo','earth', 'randomForest','e1071', 'caTools'))

===============
Output
===============

All output files are stored in the ./result/ folder, with a structure like:

./result/data/
./result/figures/png/
./result/figures/pdf/
./result/figures/fig/
./result/TSS-based.cp1w.all.bestbin.10.results.html

If you run the R script in batch mode (for example, submitting jobs to the hpcc cluster for a list of experiments, e.g.
./_utility/correlation.comparison.qsub.sh), many html/txt files will be generated; you can concatenate them into a single file with the script ./_utility/correlation.comparison.cat.sh

The main figures for the correlation analysis are in the ./result/figures/png folder; PDF versions are in the ./result/figures/pdf folder. Other related figures, e.g. diagnostic plots, are in the ./result/figures/fig folder. Active genes and the corresponding table are in the ./result/data/ folder.
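
For readers who want to see what merging the per-experiment result files might look like, here is a minimal sketch. It is NOT the bundled correlation.comparison.cat.sh, whose contents are not shown in this README. The function name merge_results, the "### filename" separator line, and the assumption that result files are named *.results.<ext> are all hypothetical and chosen only for illustration.

```shell
#!/bin/sh
# Hypothetical helper, not part of this folder: concatenate all
# <result_dir>/*.results.<ext> files into one file, each preceded by a
# "### <filename>" line so the source experiment stays identifiable.
merge_results() {
    result_dir=$1     # folder holding the per-experiment files
    ext=$2            # file extension to merge, e.g. txt or html
    merged_file=$3    # output file; its name must not match *.results.<ext>

    : > "$merged_file" || return 1        # create or truncate the output
    for f in "$result_dir"/*.results."$ext"; do
        [ -e "$f" ] || continue           # skip the literal pattern if nothing matched
        printf '### %s\n' "$(basename "$f")" >> "$merged_file"
        cat "$f" >> "$merged_file"
    done
}
```

After sourcing the function, a call such as `merge_results ./result txt ./all_results.merged` would gather every matching file under ./result into ./all_results.merged, in the shell's glob (alphabetical) order. The output name is deliberately given a different suffix so that it cannot be picked up by the pattern on a later run.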