[mira_talk] Re: caf2phdball

From: Lionel Guy <guy.lionel@xxxxxxxxx>
To: mira_talk@xxxxxxxxxxxxx
Date: Sun, 25 Oct 2009 17:53:26 +0100

Hi Sven,

Thanks for your suggestions, I'll implement them soon. About the chromatogram names, is it enough to give the name and positions in the phd file? Don't you need an actual file? Does it work for Sanger reads too (I guess I could link the actual abi files there, otherwise)?

Do the numbers (15, 19) in the calculation of $peakpos come from empirical data?


Cheers,

Lionel

On 25 Oct 2009, at 17:11 , Sven Klages wrote:

Hi Lionel,

if I find some time I'll test it as well.
We even have phd.balls of almost 30G(!), for more or less historical reasons, as consed has supported loading more than one phd.ball since v17, AFAIK. We started using phd.balls quite early (we also wrote our own predPhrap), because we were not able to (efficiently) handle 400,000 or more single phd files in a single filesystem.
You should think about distinguishing Sanger and 454 data, as for 454 data you can probably
omit the following tags:

CALL_METHOD:
QUALITY_LEVELS:
I'd also think about adding real chromatogram names to the phd.ball, as only this option lets you edit single reads (and thus lets you change the consensus) ...
If you do so, you need to calculate the peak positions as well.
$peakpos = (++$basepos - 1)*19 + 15;
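Read as a formula, the line above gives the n-th base (counting from 1) a synthetic peak position of (n - 1)*19 + 15, assuming $basepos starts at 0 before the first base. Here is a minimal sketch of how that could be used when writing the per-base lines of a phd entry. The helper names peak_position and dna_lines are hypothetical, and the "base quality peak" column order is an assumption about the phd layout, not something taken from the script discussed in this thread.

```perl
use strict;
use warnings;

# Hypothetical helper: synthetic peak position for the base at 1-based
# index $i, using the spacing (19) and offset (15) from the formula above.
sub peak_position {
    my ($i) = @_;
    return ($i - 1) * 19 + 15;
}

# Hypothetical helper: build one "base quality peak" line per base.
# The three-column layout is an assumption about the phd DNA block.
sub dna_lines {
    my ($bases, $quals) = @_;
    my @lines;
    for my $i (1 .. length($bases)) {
        my $base = lc(substr($bases, $i - 1, 1));
        push @lines, join(' ', $base, $quals->[$i - 1], peak_position($i));
    }
    return @lines;
}

# Example with made-up bases and qualities.
print "$_\n" for dna_lines('ACGT', [30, 32, 28, 40]);
```

With these made-up values the four lines carry the peak positions 15, 34, 53 and 72, i.e. evenly spaced placeholder peaks rather than positions read from a real trace.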

just some thoughts,
Sven

2009/10/23 Lionel Guy <guy.lionel@xxxxxxxxx>
Hi there,
Following up on yesterday's message, I changed my original idea and finally
parsed the mira-produced caf file to obtain a phd.ball file to be used
with consed. The idea behind that is to have qualities associated with
reads when editing mira assemblies within consed. This is very important
for example when merging/tearing contigs, because the consensus is
recalculated in a very, very bad way if you don't have qualities
(especially because mira doesn't physically trim the vector sequences
from the reads...).
The result is a small perl script that works for my data, but I would be
glad if others could test it to see if it works with other types of
data. All comments are welcome!

CAVEAT: this script produces huuuuge files, because it writes one line
per base, plus headers. For example, I have 350'000 reads and some long
Sanger reads, and I get a file which is 1.4 GB...

Cheers,

Lionel


============================================
Lionel Guy
Thunmansgatan 25, SE-75421 Uppsala

phone: +46 (0)18 245596
mobile: +46 (0)73 9760618
email: guy.lionel@xxxxxxxxx
============================================


