[mira_talk] Re: Reference assembly issues...

  • From: Bastien Chevreux <bach@xxxxxxxxxxxx>
  • To: mira_talk@xxxxxxxxxxxxx
  • Date: Tue, 3 Apr 2012 19:17:55 +0200

On Apr 3, 2012, at 16:16 , Shankar Manoharan wrote:
> Thank you professor. :) Helped a LOT.

Hmmm ... "Prof. Dr. Chevreux" sounds good, but as I have no professor title 
(not even "h.c."), I think you shouldn't call me that :-)

> My next plan of work is to recover the 40k odd reads which are in the debris 
> of the reference assembly, try to do a de novo assembly of these and try to 
> fit them into the de novo assembly.

Good strategy, I use it quite often.

There is one caveat: you will also get all the error-ridden reads in the data 
set from the debris, and if you put all the debris into a de novo assembly, 
those error-rich reads may catch the statistics module off guard. You may want 
to assemble the debris as "est" instead of "genome". I know it sounds a bit 
weird, but it is the only workaround I can give at the moment for this special 
kind of data.

> I'd like your opinion on that professor. Plus, how can I extract debris reads 
> from the Sff file based on the headers that MIRA provides in the info 
> directory ? Do we have a script for that or should I write my own ? I'm a 
> rather lousy scripter :(

Then it would be a good opportunity to improve ;-)

On the other hand: you do not need to. convert_project comes with an option 
("-n") to supply a names file which tells it to extract only certain reads from 
a data set. I think this will come in handy in your case.
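
If the debris list from the info directory needs tidying before it is handed 
to convert_project via "-n", a few lines of Python are enough. This is only a 
minimal sketch under assumptions: it assumes the debris list has one read per 
line with the read name in the first whitespace-separated column, and that the 
names file should hold one read name per line. The function names and file 
paths are placeholders, not anything MIRA ships.

```python
import sys


def read_names(lines):
    """Return unique read names (first column), in order of first appearance.

    Assumption: one read per line, name in the first whitespace-separated
    column; blank lines and lines starting with '#' are skipped.
    """
    seen = set()
    names = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name = line.split()[0]
        if name not in seen:
            seen.add(name)
            names.append(name)
    return names


def write_names_file(debris_path, names_path):
    """Write one read name per line to names_path; return the number written."""
    with open(debris_path) as src:
        names = read_names(src)
    with open(names_path, "w") as dst:
        for name in names:
            dst.write(name + "\n")
    return len(names)


if __name__ == "__main__" and len(sys.argv) == 3:
    # Hypothetical usage: python debris_names.py <debris list> <names file>
    count = write_names_file(sys.argv[1], sys.argv[2])
    print(f"wrote {count} read names to {sys.argv[2]}")
```

The resulting file is what would be passed with "-n"; the remaining 
convert_project arguments depend on your MIRA version and are not shown here.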

And you may want to extract the reads from the last "readpool.maf" in the 
checkpoint directory. They are as clean as MIRA could get them, so if you tell 
convert_project to extract clipped data ("-c"), this would probably also help 
you a lot (remember to turn off all clipping in MIRA if you use that already 
clipped set as input).

B.
