[mira_talk] Re: smalt doc additions

  • From: Martin Mokrejs <mmokrejs@xxxxxxxxxxxxxxxxxx>
  • To: Bastien Chevreux <bach@xxxxxxxxxxxx>, mira_talk@xxxxxxxxxxxxx
  • Date: Sat, 05 Mar 2011 16:17:43 +0100

Hi Bastien, (and the mira-talk-y people Cc-ed), ;)

Bastien Chevreux wrote:
> On Saturday 05 March 2011 13:46:59 you wrote:
>>   while trying to rip perfectly 454 B adaptors from my data I came across 
>> this section 
>> 
>> http://mira-assembler.sourceforge.net/docs/DefinitiveGuideToMIRA.html#sect_sanger_using_ssaha2_smalt_to_screen_for_vector_sequence
>> 
>> <quote>
>> Note
>> I need an example for SMALT ..
>> </quote>
>>
>> Maybe you were looking for
>> //www.freelists.org/post/mira_talk/454-cleaning,16
>>
>> $ smalt map -f ssaha -d -1 -m 7 idx seqs.fasta  > seqs.ssaha_out
> 
> Good find, thank you. I got it updated in the git repository, will be rolled 
> out with the next versions.

I haven't tested this for vector screening (long matches), but for a 30 nt
adaptor smalt always crashed on my 32-bit Linux (reported upstream). Matching
with ssaha2 was not good either (too many misses); blat seemed better. And of
course ;) my initial tests with 'blastall -p blastn -v 999999 -b 999999' from
the 'legacy' NCBI BLAST tools crashed because of too many initial seed hits.
At the moment water from the EMBOSS package is my favorite; I just need to
write a parser for its output.
Notably, I do not like ssaha2 because it does not try to include leading
nucleotides in the alignment if, say, there is a gap after the 3rd one. The
alignment then simply starts at the 4th position. Fiddling with gap-open or
gap-extension penalties would not help, were they available at all. :( I was
only able to restrict matches using '-minscore 100 -best 1', though it still
reports multiple matches when they have the same score, even in the same
region if I remember right.
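
To illustrate why a short leading stretch can drop out of a local alignment,
here is a minimal sketch of a generic Smith-Waterman-style scorer. It is NOT
ssaha2's actual algorithm or scoring scheme: the function name, the score
values (match +1, mismatch -2, gap -4) and the sequences are all hypothetical,
chosen only so that three leading matches are worth less than the gap that
follows them.

```python
def local_align(query, target, match=1, mismatch=-2, gap=-4):
    """Toy local alignment with a linear gap penalty (hypothetical scores).

    Returns (best_score, query_start, query_end), 0-based, end exclusive.
    """
    n, m = len(query), len(target)
    # score[i][j]: best score of a local alignment ending at query[i-1], target[j-1]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    # start[i][j]: query index where that alignment begins
    start = [[i] * (m + 1) for i in range(n + 1)]
    best = (0, 0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if query[i - 1] == target[j - 1] else mismatch
            candidates = [
                (0, i),                                          # restart here
                (score[i - 1][j - 1] + s, start[i - 1][j - 1]),  # match/mismatch
                (score[i - 1][j] + gap, start[i - 1][j]),        # gap in target
                (score[i][j - 1] + gap, start[i][j - 1]),        # gap in query
            ]
            score[i][j], start[i][j] = max(candidates, key=lambda c: c[0])
            if score[i][j] > best[0]:
                best = (score[i][j], start[i][j], i)
    return best


# Hypothetical 18 nt adaptor and a read with one extra base after position 3.
adaptor = "ACG" + "TTGCATGCAATCGGA"
read_with_insertion = "ACG" + "A" + "TTGCATGCAATCGGA"

print(local_align(adaptor, adaptor))              # full-length match
print(local_align(adaptor, read_with_insertion))  # leading 3 bases are dropped
```

With these toy scores the three leading matches earn +3 but bridging the gap
costs 4, so the best local alignment starts at the 4th adaptor base instead.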

>> Just am not sure if seqs.ssaha_out or $project_smaltvectorscreen_in.txt or
>> $project_ssaha2vectorscreen_in.txt is preferred. (smalt output is in ssaha2
>> format but does mira want to distinguish who did the work or just cares
>> about file format?)
> 
> It should just care about the format, but as always, there are slight 
> differences between the SSAHA2 output in ssaha2 format and the SMALT output 
> in ssaha2 format (*sigh*).

I hope you clarified in the docs whether one has to ask smalt to produce
ssaha2-like output or whether mira can use some other ... ;-)
 
> Therefore: if a *ssaha2* named file is present it knows it's from SSAHA2, if 
> it's named *smalt* it knows it's from SMALT. Ah, and if both are present it will 
> first do the ssaha2, then the smalt. But that's more a side-effect than a 
> feature ...

What will happen when the match regions are partially overlapping? Will
their union be used, or the intersection? What if there is no overlap
between the two methods? Yeah, the section in the docs about adapter clipping
is also unclear; I just wanted to find my own path before commenting on that.
So I quit until I have a full proposal. ;)
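
To make the union-versus-intersection question concrete, here is a minimal
sketch of the two options for a pair of match regions on one read. It says
nothing about what mira actually does; the function name and the half-open
(start, end) coordinates are assumptions made for illustration only.

```python
def combine_regions(a, b):
    """Compare two (start, end) half-open regions reported for the same read.

    Hypothetical helper: shows both candidate ways of merging the regions.
    """
    (a_start, a_end), (b_start, b_end) = a, b
    lo, hi = max(a_start, b_start), min(a_end, b_end)
    if lo < hi:
        # The regions overlap: one merged span, plus the shared part.
        return {
            "intersection": (lo, hi),
            "union": [(min(a_start, b_start), max(a_end, b_end))],
        }
    # No overlap: the intersection is empty and the union stays two pieces.
    return {"intersection": None, "union": sorted([a, b])}


# Partially overlapping hits, e.g. one from each screening tool.
print(combine_regions((0, 30), (20, 45)))
# Hits that do not overlap at all.
print(combine_regions((0, 10), (40, 60)))
```

For adaptor clipping the two choices differ in how aggressive they are: the
union masks everything either tool flagged, the intersection only what both
agree on, and with no overlap the intersection masks nothing.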
M.
