[mira_talk] Re: Convert_project problem
- From: Bastien Chevreux <bach@xxxxxxxxxxxx>
- To: mira_talk@xxxxxxxxxxxxx
- Date: Thu, 4 Dec 2008 00:21:13 +0100
On Wednesday 26 November 2008 16:44, mark.rose@xxxxxxxxxxxx wrote:
> I'm trying to use convert_project to convert a caf file into a clipped
> fasta file. I use the following command:
>
> convert_project -f caf -t clippedfasta project.caf
> project.caf.clipped.fasta
>
> I get the following error:
>
> -t clippedfastais not a valid type!
>
> The documentation lists "clippedfasta" as a valid "totype". What am I
> doing wrong?
Oops, that's a bug. I've fixed it in the current CVS; the fix will be rolled
out in the next release.
> Also, what is the nature of the fasta files produced by default by mira
> in the "<project>_d_results" directory? How are they different from
> what is supposed to be produced using the "clippedfasta" and
> "maskedfasta" when using convert_project to convert the output caf file?
When you use "clippedfasta" and "maskedfasta" in convert_project, this gives
back the sequences of single reads, either clipped or masked, and not the
consensus of an assembly (a CAF can also consist of unaligned reads).
When you apply a conversion from "caf" to "fasta" and the caf contains an
assembly, you will get back the consensus of the assembly in FASTA format.
The difference between padded and unpadded is simply the gap characters:
unpadded is the same sequence as padded with the '*' characters removed. While
not really useful in a de-novo assembly, this is quite useful when performing
mapping assemblies, as one can then quickly calculate base positions of the
mapped consensus against the reference sequences with a simple script.
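
A minimal sketch of such a script, in Python. It assumes only what is stated above: the padded consensus uses '*' as the gap character, and the unpadded consensus is the same sequence with those characters removed. The function names and the example sequence are made up for illustration and are not part of MIRA.

```python
def strip_pads(padded_seq, pad="*"):
    """Return the unpadded sequence: the padded one with gap characters removed."""
    return padded_seq.replace(pad, "")


def padded_to_unpadded(padded_seq, pad="*"):
    """Map each 0-based padded index to its 0-based unpadded index.

    Positions that hold a gap character have no unpadded counterpart
    and are mapped to None.
    """
    mapping = []
    unpadded_pos = 0
    for base in padded_seq:
        if base == pad:
            mapping.append(None)
        else:
            mapping.append(unpadded_pos)
            unpadded_pos += 1
    return mapping


# Made-up example consensus, not real MIRA output.
padded = "ACG*T**A"
print(strip_pads(padded))
print(padded_to_unpadded(padded))
```

The mapping is a plain list, so looking up the unpadded coordinate of a padded position is a single index operation; add 1 to both sides if you work with 1-based coordinates.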
> Lastly, is it wrong to think of the sequence in the
> <project>_out.unpadded.fasta file as mira's best approximation of the
> actual sequence?
No, it's not wrong, it's absolutely correct.
> If not, what is recommended to achieve a simple,
> working sequence dataset short of extensive, manual editing of the
> assembly?
You do not need extensive manual editing to improve your results to a point
where the users of the sequence will be *really* happy: load the project in
an editor (e.g. gap4) and simply let the tags set by MIRA guide you to the
problems that were spotted ("SRMc", "WRMc" and "IUPc" when assembling one
strain with one sequencing technology; other tags such as "SROc" when there
are multiple strains, "STMS" and "STMU" when using different sequencing
technologies, etc.).
E.g.: for a Solexa mapping in a bacterium, I rarely need more than two or
three hours to clean up the whole thing up to a point where my users have yet
to find one case where I really missed a SNP (they and I prefer to have a 5%
false discovery rate).
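
To get a rough idea of how many tagged positions await you before opening the editor, the tags can be counted with a few lines of Python. This is only a sketch under a stated assumption: that tags appear in the CAF text as lines of the form `Tag <type> <start> <end> "<comment>"`. Check this against your own file first. The function name, the sample lines and the "OTHR" tag are hypothetical.

```python
import re
from collections import Counter

# Assumed layout of a tag line: Tag <type> <start> <end> "<comment>"
TAG_LINE = re.compile(r"^Tag\s+(\S+)\s+(\d+)\s+(\d+)")


def count_tags(lines, wanted=("SRMc", "WRMc", "IUPc")):
    """Count how often each wanted tag type occurs in the given lines."""
    counts = Counter()
    for line in lines:
        match = TAG_LINE.match(line)
        if match and match.group(1) in wanted:
            counts[match.group(1)] += 1
    return counts


# Made-up sample lines for illustration only, not real MIRA output.
sample = [
    'Sequence : contig1',
    'Is_contig',
    'Tag SRMc 120 120 "example comment"',
    'Tag IUPc 300 300 "example comment"',
    'Tag SRMc 512 512 "example comment"',
    'Tag OTHR 1 50 "hypothetical tag, not in the wanted list"',
]
print(count_tags(sample))
```

For a real project, pass an open file handle instead of the sample list, e.g. `count_tags(open("project.caf"))`, and adjust `wanted` to the tags relevant for your type of assembly.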
Hope this helps,
Bastien