[mira_talk] Re: caf2phdball

  • From: Lionel Guy <guy.lionel@xxxxxxxxx>
  • To: mira_talk@xxxxxxxxxxxxx
  • Date: Sun, 25 Oct 2009 17:53:26 +0100

Hi Sven,

Thanks for your suggestions, I'll implement them soon. About the chromatogram names, is it enough to give the name and positions in the phd file? Don't you need an actual file? Does it work for Sanger reads too (I guess I could link the actual abi files there, otherwise)?

Do the numbers (15, 19) in the calculation of $peakpos come from empirical data?

Cheers,

Lionel

On 25 Oct 2009, at 17:11 , Sven Klages wrote:

Hi Lionel,

if I find some time I'll test it as well.

We even have a phd.ball of almost 30G(!), for more or less historical reasons, as consed has supported loading more than one phd.ball since v17 AFAIK. We started using phd.balls quite early (we also wrote our own predPhrap), because we were not able to (efficiently) handle 400,000 or more single phd files in a single filesystem.

You should think about distinguishing Sanger and 454 data, as for 454 data you can probably
omit the following tags:

CALL_METHOD:
QUALITY_LEVELS:

I'd also think about adding real chromatogram names to the phd.ball, as only this option lets you edit single reads (and thus lets you change the consensus) ...

If you do so, you need to calculate the peak positions as well.
$peakpos = (++$basepos - 1)*19 + 15;
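
As a minimal sketch of that calculation, assuming evenly spaced synthetic peaks (spacing 19, offset 15, the constants from the line above). The sub name and the sample bases and qualities are hypothetical, made up for illustration only:

```perl
use strict;
use warnings;

# Hypothetical helper: returns one synthetic peak position per base,
# using the spacing (19) and offset (15) from the one-liner above.
sub synthetic_peak_positions {
    my ($nbases) = @_;
    my @peaks;
    my $basepos = 0;
    for (1 .. $nbases) {
        # ++$basepos makes the base counter 1-based; subtracting 1 again
        # places the first peak at 15, the next at 34, and so on.
        my $peakpos = (++$basepos - 1) * 19 + 15;
        push @peaks, $peakpos;
    }
    return @peaks;
}

# Made-up bases and qualities, printed as "base quality position" lines.
my @bases = qw(a c g t);
my @quals = (20, 30, 40, 35);
my @peaks = synthetic_peak_positions(scalar @bases);
print "$bases[$_] $quals[$_] $peaks[$_]\n" for 0 .. $#bases;
```

The positions only need to increase monotonically with the base index; they do not correspond to real trace data.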

just some thoughts,
Sven

2009/10/23 Lionel Guy <guy.lionel@xxxxxxxxx>
Hi there,

Following up on yesterday's message, I changed my original idea and finally
parsed the mira-produced caf file to obtain a phd.ball file to be used
with consed. The idea behind that is to have qualities associated with
reads when editing mira assemblies within consed. This is very important
for example when merging/tearing contigs, because the consensus is
recalculated in a very, very bad way if you don't have qualities
(especially because mira doesn't physically trim the vector sequences
from the reads...).

The result is a small perl script that works for my data, but I would be
glad if others could test it to see if it works with other types of
data. All comments are welcome!

CAVEAT: this script produces huuuuge files, because it writes one line
per base, plus headers. For example, I have 350'000 reads and some long
Sanger, and I get a file which is 1.4 Gb...

Cheers,

Lionel


============================================
Lionel Guy
Thunmansgatan 25, SE-75421 Uppsala

phone: +46 (0)18 245596
mobile: +46 (0)73 9760618
email: guy.lionel@xxxxxxxxx
============================================


--
You have received this mail because you are subscribed to the mira_talk mailing 
list. For information on how to subscribe or unsubscribe, please visit 
http://www.chevreux.org/mira_mailinglists.html
