[mira_talk] Re: Convert_project problem

On Wednesday 26 November 2008 16:44, mark.rose@xxxxxxxxxxxx wrote:
> I'm trying to use convert_project to convert a caf file into a clipped
> fasta file.  I use the following command:
>
> convert_project -f caf -t clippedfasta project.caf
> project.caf.clipped.fasta
>
> I get the following error:
>
> -t clippedfastais not a valid type!
>
> The documentation lists "clippedfasta" as a valid "totype".  What am I
> doing wrong?

Oops, an error. I've fixed this in the current CVS; the fix will get rolled out
in the next release.

> Also, what is the nature of the fasta files produced by default by mira
> in the "<project>_d_results" directory?  How are they different from
> what is supposed to be produced using the "clippedfasta" and
> "maskedfasta" when using convert_project to convert the output caf file?

When you use "clippedfasta" and "maskedfasta" in convert_project, this gives 
back the sequences of single reads, either clipped or masked, and not the 
consensus of an assembly (a CAF can also consist of unaligned reads).

When you apply a conversion from "caf" to "fasta" and the caf contains an 
assembly, you will get back the consensus of the assembly in FASTA format.

The difference between padded and unpadded is simply the gap characters: the
unpadded sequence is the padded sequence with the '*' characters removed. While
not really useful in a de-novo assembly, the padded version is quite useful when
performing mapping assemblies, as one can then quickly calculate base positions
of the mapped consensus against the reference sequences with a simple script.
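
A minimal sketch of what such a "simple script" could look like, in Python. It
only assumes that '*' is the gap character in the padded sequence; the function
names are made up for this illustration and are not part of MIRA or
convert_project.

```python
def unpad(padded, gap="*"):
    """Return the unpadded sequence: the padded one with gap characters removed."""
    return padded.replace(gap, "")


def padded_to_unpadded_positions(padded, gap="*"):
    """Map each 0-based padded index to its 0-based unpadded index.

    Gap columns have no unpadded counterpart and are mapped to None.
    """
    mapping = []
    next_unpadded = 0
    for base in padded:
        if base == gap:
            mapping.append(None)
        else:
            mapping.append(next_unpadded)
            next_unpadded += 1
    return mapping


if __name__ == "__main__":
    padded = "ACG*T**AC"  # toy example, not real MIRA output
    print(unpad(padded))
    print(padded_to_unpadded_positions(padded))
```

Positions here are 0-based; add 1 if you prefer the 1-based coordinates most
sequence viewers show.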

> Lastly, is it wrong to think of the sequence in the
> <project>_out.unpadded.fasta file as mira's best approximation of the
> actual sequence? 

No, it's not wrong, it's absolutely correct.

> If not, what is recommended to achieve a simple, 
> working sequence dataset short of extensive, manual editing of the
> assembly?

You do not need extensive manual editing to improve your results to a point 
where the users of the sequence will be *really* happy: load the project in 
an editor (e.g. gap4) and simply let the tags set by MIRA guide you to the 
problems that were spotted ("SRMc", "WRMc" and "IUPc" in the case of assembly 
of one strain with one sequencing technology, other tags like "SROc" when 
having multiple strains, "STMS" and "STMU" when using different sequencing 
technologies, etc.).

E.g.: for a Solexa mapping in a bacterium, I rarely need more than two or 
three hours to clean up the whole thing up to a point where my users have yet 
to find one case where I really missed a SNP (they and I prefer to have a 5% 
false discovery rate).

Hope this helps,
  Bastien

