On Mon, Aug 31, 2009 at 9:42 PM, Bastien Chevreux <bach@xxxxxxxxxxxx> wrote: > On Montag 31 August 2009 Martin A. Hansen wrote: > > MIRA has been running for over a week now digesting this rather simple > > de-novo assembly: > > [...] > > Hmmmm ... at which point is it? Should be through it, or almost. > I killed it - and zapped the directories - so I don't know :o( > > Then again, it shouldn't take that long, even on machines from 2 years ago. > > Since this is a fairly simple genome, and the coverage is huge, I was expecting it to run faster. Perhaps I am doing something wrong here: mira -project=M1 -job=denovo,genome,normal,454,solexa -GENERAL:number_of_threads=4 SOLEXA_SETTINGS -CO:msr=no -GE:uti=no:tismin=2000:tismax=3000 > Pseudo code mockup (with limited detail): > > [...] > > Memory consumption would be OK. Speed would be OK. > > How about it? > > I'd do it a bit differently: throw out every read which aligns in a contig > and > is more than 3kb away from one of the ends. You might want to try a bit > smaller hashes though, I'd try 30 or 25 first. > Not a bad idea. Lets try this with Bowtie and full length hits (no mismatches). I can simply do this using Biopieces like this: # Get the middle of the long contigs and index with bowtie. read_fasta -i M1_contigs.fna | grab -e 'SEQ_LEN>6035' | extract_seq -b 3000 | reverse_seq | extract_seq -b 3000 | reverse_seq | create_bowtie_index -i clipped_contigs -x # Match reads against the above index and write IDs of matches to a table. read_solexa -s -i HSZ09094_M1_1.fq,HSZ09094_M1_2.fq | bowtie_seq -i clipped_contigs/clipped_contigs | uniq_vals -k Q_ID | write_tab -k Q_ID -o read_id.tab -x and the read_id.tab files contains 1054724 out of the 3308974 total reads - so we _can_ limit the Solexa part of the full assembly by 1/3 = a lot! > One drawback: it throws out repetitive reads where one repeat copy is > present > in the contigs. > > Aha! But we can just find those reads that were repetitive - and rescue these by matching the stack of reads from before with the clipped contig ends: # Get the ends of the contigs read_fasta -i M1_contigs.fna | grab -e 'SEQ_LEN>6035' | extract_seq -b 1 -l 3000 | write_fasta -o M1_contigs_5end.fna -x read_fasta -i M1_contigs.fna | grab -e 'SEQ_LEN>6035' | reverse_seq | extract_seq -b 1 -l 3000 | reverse_seq | write_fasta -o M1_contigs_3end.fna -x # Index the contig ends with bowtie read_fasta -i M1_contigs_5end.fna,M1_contigs_3end.fna | create_bowtie_index -i contig_ends -x # Read the Solexa entries, grab the ones matching the middle part and match against the above index allowing for 3 mismatches: read_solexa -s -i HSZ09094_M1_1.fq,HSZ09094_M1_2.fq | grab -E read_id.tab | bowtie_seq -i contig_ends/contig_ends -m 3 | uniq_vals -k Q_ID | write_tab -k Q_ID -o repeat_reads.tab -x And that shows us that 8281 of the reads were repetitive with one match in the contig ends (3 mismatches) and one match in the contig middles (0 mismatches) - and thus needs rescue! # We compile a list of ids we can drop: read_tab -i read_id.tab | grab -i -E repeat_reads.tab | write_tab -o reads_to_drop.tab -x # And finally create input files for MIRA read_solexa -s -i HSZ09094_M1_1.fq,HSZ09094_M1_2.fq | grab -i -E reads_to_drop.tab | write_fasta -o M1_in.solexa.fasta -x read_solexa -s -i HSZ09094_M1_1.fq,HSZ09094_M1_2.fq | grab -i -E reads_to_drop.tab | rename_keys -k SCORES,SEQ | write_fasta -o M1_in.solexa.qual -x Thurs we reduced the input reads from 3308974 to 2262531. Viola This exercise should not be necessary for this small, well-covered, non-repetitive and GC% normal genome, but perhaps for the two other genomes I have in the pipeline (GC>65%) ...? Martin > Regards, > Bastien > > > -- > You have received this mail because you are subscribed to the mira_talk > mailing list. For information on how to subscribe or unsubscribe, please > visit http://www.chevreux.org/mira_mailinglists.html >