[mira_talk] Re: Assembling 454 and Solexa mate-pair data - rethinking ...

From: "Martin A. Hansen" <mail@xxxxxxxxx>
To: mira_talk@xxxxxxxxxxxxx
Date: Tue, 1 Sep 2009 11:03:56 +0200
On Mon, Aug 31, 2009 at 9:42 PM, Bastien Chevreux <bach@xxxxxxxxxxxx> wrote:

> On Montag 31 August 2009 Martin A. Hansen wrote:
> > MIRA has been running for over a week now digesting this rather simple
> > de-novo assembly:
> > [...]
>
> Hmmmm ... at which point is it? Should be through it, or almost.
>

I killed it - and zapped the directories - so I don't know :o(


>
> Then again, it shouldn't take that long, even on machines from 2 years ago.
>
>
Since this is a fairly simple genome, and the coverage is huge, I was
expecting it to run faster. Perhaps I am doing something wrong here:

mira -project=M1 -job=denovo,genome,normal,454,solexa
-GENERAL:number_of_threads=4 SOLEXA_SETTINGS -CO:msr=no
-GE:uti=no:tismin=2000:tismax=3000


> Pseudo code mockup (with limited detail):
> > [...]
> > Memory consumption would be OK. Speed would be OK.
> > How about it?
>
> I'd do it a bit differently: throw out every read which aligns in a contig
> and
> is more than 3kb away from one of the ends. You might want to try a bit
> smaller hashes though, I'd try 30 or 25 first.
>

Not a bad idea. Lets try this with Bowtie and full length hits (no
mismatches). I can simply do this using Biopieces like this:

# Get the middle of the long contigs and index with bowtie.
read_fasta -i M1_contigs.fna | grab -e 'SEQ_LEN>6035' | extract_seq -b 3000
| reverse_seq | extract_seq -b 3000 | reverse_seq | create_bowtie_index -i
clipped_contigs -x

# Match reads against the above index and write IDs of matches to a table.
read_solexa -s -i HSZ09094_M1_1.fq,HSZ09094_M1_2.fq | bowtie_seq -i
clipped_contigs/clipped_contigs | uniq_vals -k Q_ID | write_tab -k Q_ID -o
read_id.tab -x

and the read_id.tab files contains 1054724 out of the 3308974 total reads -
so we _can_ limit the Solexa part of the full assembly by 1/3 = a lot!


> One drawback: it throws out repetitive reads where one repeat copy is
> present
> in the contigs.
>
>
Aha! But we can just find those reads that were repetitive - and rescue
these by matching the stack of reads from before with the clipped contig
ends:

# Get the ends of the contigs
read_fasta -i M1_contigs.fna | grab -e 'SEQ_LEN>6035' | extract_seq -b 1 -l
3000 | write_fasta -o M1_contigs_5end.fna -x
read_fasta -i M1_contigs.fna | grab -e 'SEQ_LEN>6035' | reverse_seq |
extract_seq -b 1 -l 3000 | reverse_seq | write_fasta -o M1_contigs_3end.fna
-x

# Index the contig ends with bowtie
read_fasta -i M1_contigs_5end.fna,M1_contigs_3end.fna | create_bowtie_index
-i contig_ends -x

# Read the Solexa entries, grab the ones matching the middle part and match
against the above index allowing for 3 mismatches:
read_solexa -s -i HSZ09094_M1_1.fq,HSZ09094_M1_2.fq | grab -E read_id.tab |
bowtie_seq -i contig_ends/contig_ends -m 3 | uniq_vals -k Q_ID | write_tab
-k Q_ID -o repeat_reads.tab -x

And that shows us that 8281 of the reads were repetitive with one match in
the contig ends (3 mismatches) and one match in the contig middles (0
mismatches) - and thus needs rescue!

# We compile a list of ids we can drop:

read_tab -i read_id.tab | grab -i -E repeat_reads.tab | write_tab -o
reads_to_drop.tab -x

# And finally create input files for MIRA

read_solexa -s -i HSZ09094_M1_1.fq,HSZ09094_M1_2.fq | grab -i -E
reads_to_drop.tab | write_fasta -o M1_in.solexa.fasta -x
read_solexa -s -i HSZ09094_M1_1.fq,HSZ09094_M1_2.fq | grab -i -E
reads_to_drop.tab | rename_keys -k SCORES,SEQ | write_fasta -o
M1_in.solexa.qual -x

Thurs we reduced the input reads from 3308974 to 2262531.

Viola



This exercise should not be necessary for this small, well-covered,
non-repetitive and GC% normal genome, but perhaps for the two other genomes
I have in the pipeline (GC>65%) ...?



Martin




> Regards,
>  Bastien
>
>
> --
> You have received this mail because you are subscribed to the mira_talk
> mailing list. For information on how to subscribe or unsubscribe, please
> visit http://www.chevreux.org/mira_mailinglists.html
>
Follow-Ups:
- [mira_talk] Re: Assembling 454 and Solexa mate-pair data - rethinking ...
  - From: Bastien Chevreux
[mira_talk] Re: Assembling 454 and Solexa mate-pair data - rethinking ...

Other related posts: