High-throughput genotyping software for short reads

Make sure you have Perl and the GD module installed.

(1) Build the pseudo-reference sequences of the RILs' parents from SNPs.

If you already have high-quality pseudo-molecule sequences for both parents, go to step (2).

Run the Perl script to build the pseudo-reference:

    perl PseudoMaker.pl snp_file reference.fa parent_genome

snp_file: The file listing all identified high-quality SNPs between one parent and the high-quality reference pseudo-molecule sequence. The file format should be as follows:

    chromosome01 325 T
    chromosome01 335 T
    chromosome01 362 A
    chromosome01 411 G
    chromosome01 482 C
    chromosome01 579 A
    chromosome01 872 T
    chromosome01 1032 T
    chromosome01 1141 G
    chromosome01 1350 A
    ...

One SNP per line. The three columns give the chromosome, the position, and the SNP base, respectively.

reference.fa: The high-quality pseudo-molecule sequences of the fully sequenced reference genome, in FASTA format, typically one sequence per chromosome.

parent_genome: The name of the parent. The output pseudo-reference file will be parent_genome.fa.

(2) Convert Solexa fastq files to Sanger fastq files.

After the Solexa base-calling pipeline for paired-end sequences, you will get two fastq files per lane, named like s_1_1_sequence.txt and s_1_2_sequence.txt. You may rename the files while converting the file format.

To convert Solexa fastq to Sanger fastq:

Before Solexa_Pipeline1.3: download MAQ from the MAQ page, http://maq.sourceforge.net/

    maq sol2sanger s_1_1_sequence.txt Lane1_1.fastq
    maq sol2sanger s_1_2_sequence.txt Lane1_2.fastq

Here, "Lane1" is the lane used for sequencing the RILs.

For Solexa_Pipeline1.3 or later: download slx2fastq from ftp://ftp.sanger.ac.uk/pub/zn1/solexa/slx2fastq/

    ./slx2fastq s_1_1_sequence.txt Lane1_1.fastq
    ./slx2fastq s_1_2_sequence.txt Lane1_2.fastq

(3) Sort and split the fastq files for each RIL.

For a 3-bp index, you can run 16 RIL samples per Solexa lane.

    perl Split16.pl Lane1_1.fastq
    perl Split16.pl Lane1_2.fastq

You will get files such as Lane1_1.AAT.fastq, Lane1_1.AAC.fastq, ... and Lane1_2.AAT.fastq, Lane1_2.AAC.fastq, ... Here, "AAT" is the tag for a particular RIL.

For a 4-bp index, you can run 64 RIL samples per Solexa lane.

    perl Split64.pl Lane1_1.fastq
    perl Split64.pl Lane1_2.fastq

You will get files such as Lane1_1.AAAT.fastq, Lane1_1.AAAC.fastq, ... and Lane1_2.AAAT.fastq, Lane1_2.AAAC.fastq, ... Here, "AAAT" is the tag for a particular RIL.

Then you must remove the unpaired sequences for each RIL (index).

    perl Get_Paired.pl Lane1 AAT

This gives you the sequence files for both ends for each index, such as Lane1_1.AAT.PE.fastq and Lane1_2.AAT.PE.fastq.

(4.1) The genotype assignment pipeline for SSAHA2 alignment results.

The fastq sequences should be aligned to both parent genomes with SSAHA2. The following commands show how to align Solexa paired-end (PE) sequences to both parent genomes using SSAHA2.

    ./ssaha2-2.3_x86_64 -rtype solexa -mthresh 30 -skip 2 -diff 0 -depth 500 -align 0 -pair 50,900 ./Parent1_genome.fa Lane1_1.AAT.PE.fastq Lane1_2.AAT.PE.fastq > Lane1.AAT.PE.fastq.p1
    ./ssaha2-2.3_x86_64 -rtype solexa -mthresh 30 -skip 2 -diff 0 -depth 500 -align 0 -pair 50,900 ./Parent2_genome.fa Lane1_1.AAT.PE.fastq Lane1_2.AAT.PE.fastq > Lane1.AAT.PE.fastq.p2

Make sure SSAHA2 (at least v2.3) is installed on your computer. To get the SSAHA2 package, download it from the SSAHA2 page: http://www.sanger.ac.uk/Software/analysis/SSAHA2/

Then run the genotype assignment pipeline:

    perl Ssaha2rlt.pl Lane1 AAT 36

Here, "36" is the length of your reads.

(4.2) The genotype assignment pipeline for MAQ alignment results.

The fastq sequences should be aligned to both parent genomes with MAQ.
First, both parent sequences need to be converted into MAQ *.bfa files with the following commands:

    ./maq fasta2bfa Parent1_genome.fa Parent1_genome.bfa
    ./maq fasta2bfa Parent2_genome.fa Parent2_genome.bfa

Second, the RILs' fastq sequences need to be converted into MAQ *.bfq files with the following commands:

    ./maq fastq2bfq Lane1_1.AAT.PE.fastq Lane1_1.AAT.PE.bfq
    ./maq fastq2bfq Lane1_2.AAT.PE.fastq Lane1_2.AAT.PE.bfq

Then run the MAQ alignment procedure:

    ./maq match -a 900 Lane1.AAT.PE.p1.map Parent1_genome.bfa Lane1_1.AAT.PE.bfq Lane1_2.AAT.PE.bfq
    ./maq match -a 900 Lane1.AAT.PE.p2.map Parent2_genome.bfa Lane1_1.AAT.PE.bfq Lane1_2.AAT.PE.bfq

After that, the *.map files should be converted to plain text files. The output file names are the same as for SSAHA2.

    ./maq mapview Lane1.AAT.PE.p1.map > Lane1.AAT.PE.fastq.p1
    ./maq mapview Lane1.AAT.PE.p2.map > Lane1.AAT.PE.fastq.p2

Then run the genotype assignment pipeline:

    perl Maq2rlt.pl Lane1 AAT 36

Here, "36" is the length of your reads.

(4.3) The genotype assignment pipeline for SOAPaligner alignment results.

To run SOAPaligner, you need to build index files for the reference genomes. To format the reference sequences:

    ./2bwt-builder Parent1_genome.fa
    ./2bwt-builder Parent2_genome.fa

Then you can search the reads against the formatted index files:

    ./soap -a Lane1_1.AAT.PE.fastq -b Lane1_2.AAT.PE.fastq -D Parent1_genome.fa.index -o Lane1.AAT.PE.fastq.p1.PE -2 Lane1.AAT.PE.fastq.p1.SE -m 50 -x 900
    ./soap -a Lane1_1.AAT.PE.fastq -b Lane1_2.AAT.PE.fastq -D Parent2_genome.fa.index -o Lane1.AAT.PE.fastq.p2.PE -2 Lane1.AAT.PE.fastq.p2.SE -m 50 -x 900

Then combine the paired-end (PE) and single-end (SE) results:

    cat Lane1.AAT.PE.fastq.p1.PE Lane1.AAT.PE.fastq.p1.SE > Lane1.AAT.PE.fastq.p1
    cat Lane1.AAT.PE.fastq.p2.PE Lane1.AAT.PE.fastq.p2.SE > Lane1.AAT.PE.fastq.p2

Then run the genotype assignment pipeline:

    perl Soap2rlt.pl Lane1 AAT 36

Here, "36" is the length of your reads.
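
Steps (3) and (4) have to be repeated for every index tag in a lane. The following is a minimal dry-run sketch in shell, assuming the SSAHA2 route from (4.1). The function name print_ril_commands and the tag list passed to it are illustrative and not part of the package; the commands it prints are copied from the steps above.

```shell
#!/bin/sh
# Hypothetical helper: print (do not run) the per-tag commands for one lane.
# Arguments: lane prefix, read length, then one or more index tags.
print_ril_commands() {
    lane=$1       # lane prefix, e.g. Lane1
    read_len=$2   # read length, e.g. 36
    shift 2
    for tag in "$@"; do
        pe1="${lane}_1.${tag}.PE.fastq"
        pe2="${lane}_2.${tag}.PE.fastq"
        # Remove unpaired reads for this tag (step 3).
        echo "perl Get_Paired.pl ${lane} ${tag}"
        # Align the paired reads to each parent genome (step 4.1).
        for p in 1 2; do
            echo "./ssaha2-2.3_x86_64 -rtype solexa -mthresh 30 -skip 2 -diff 0 -depth 500 -align 0 -pair 50,900 ./Parent${p}_genome.fa ${pe1} ${pe2} > ${lane}.${tag}.PE.fastq.p${p}"
        done
        # Assign genotypes from the two alignment results.
        echo "perl Ssaha2rlt.pl ${lane} ${tag} ${read_len}"
    done
}

# Example: two tags from Lane1 with 36-bp reads.
print_ril_commands Lane1 36 AAT AAC
```

Review the printed commands first; piping the output to sh would then execute them in order.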

(5) The genotype calling pipeline to get the recombination map.

    perl Seq2Bin.pl Lane1.AAT.PE.fastq.rlt jap_v4_length_list

Here, the second argument (jap_v4_length_list in this example) is the genome length list of the organism. The file format should be as follows:

    chromosome01 45064769
    chromosome02 36823111
    chromosome03 37257345
    chromosome04 35863200
    chromosome05 30039014
    chromosome06 32124789
    chromosome07 30357780
    chromosome08 28530027
    chromosome09 23843360
    chromosome10 23661561
    chromosome11 30828668
    chromosome12 27757321
    ...

Lane1.AAT.PE.fastq.rlt is the input file. Three output files are produced for each RIL.

Lane1.AAT.bin is the bin file, with one bin per line. The six columns give the chromosome, bin start, bin end, parent1 or 2, last SNP's read, and bin length.

Lane1.AAT.combine.png shows a combined figure for each RIL. Each grey background bar represents one chromosome, and there are three colored horizontal bars on each grey bar. The first, continuous bar shows the bin map: blue, red, and yellow represent regions from parent1, regions from parent2, and heterozygous regions, respectively. The second and third, discontinuous bars show the distribution of SNPs along the chromosome: blue for parent1 and red for parent2.

Lane1.AAT.PE.fastq.rlt.win15.edge is a temporary file used for genotyping.

(6) Build the bin map and the input files for linkage mapping.

First, make a bin file list for all RILs.

    ls *.bin > rils_file

Second, create a file containing all traits of each RIL. The file format is like this:

    RILs     Trait1  Trait2  Trait3
    RIL_001  94      94      94
    RIL_002  124.8   124.8   124.8
    RIL_003  103     103     103
    RIL_004  99.9    99.9    99.9
    RIL_005  121.3   121.3   121.3
    RIL_006  101.14  101.14  101.14
    RIL_007  124     124     124
    RIL_008  .       .       .
    RIL_009  153.6   153.6   153.6
    RIL_010  161.9   161.9   161.9
    RIL_011  136.6   136.6   136.6
    RIL_012  169.2   169.2   169.2
    ...

Here, the first column gives the RILs' names. The order of the rows should be the same as the order of the filenames in rils_file, though the RILs' names themselves can be chosen freely.
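
Because the trait rows are matched to the bin files by order, a quick sanity check before building the bin map can help. The following is a minimal sketch in shell, assuming the traits file has exactly one header line and one row per RIL; the function name check_ril_counts is hypothetical and not part of the package. It only compares counts and cannot verify that the order is right.

```shell
#!/bin/sh
# Hypothetical helper: compare the number of bin files listed in the first
# file with the number of trait rows (excluding the header) in the second.
check_ril_counts() {
    bins=$(grep -c . "$1")                  # non-empty lines in the bin list
    traits=$(tail -n +2 "$2" | grep -c .)   # non-empty lines after the header
    if [ "$bins" -eq "$traits" ]; then
        echo "OK: $bins RILs in both files"
    else
        echo "Mismatch: $bins bin files, $traits trait rows" >&2
        return 1
    fi
}

# Example call (file names as used in this step):
#   check_ril_counts rils_file rils_traits
```

A non-zero return status signals that the two files list different numbers of RILs.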

Then run the following Perl script to get the result.

    perl Bin2MCD.pl rils_file rils_traits

rils_file.map is the bin map file. Each line shows the genotypes of one bin. The first and second columns give the chromosome and the physical position (1.0 = 100 Kb). Each of the remaining columns represents one RIL: "A" is from parent1, "B" is from parent2, and "H" is heterozygous.

rils_file.map.mcd is the standard input file for linkage mapping software such as Windows QTL Cartographer.

The other three files, rils_file.bin.mark, rils_file.edge.comb, and rils_file.edge.sort, are temporary files used for bin map construction.

Thank you for your interest in our high-throughput genotyping software. If you have questions about this software, you can contact us at: zqiang@ncgr.ac.cn