[mira_talk] Re: Question about kmer size settings in an EST assembly
- From: Bastien Chevreux <bach@xxxxxxxxxxxx>
- To: mira_talk@xxxxxxxxxxxxx
- Date: Sun, 24 Apr 2016 23:00:26 -0400
On 24 Apr 2016, at 17:10 , Robert Bruccoleri <bruc@xxxxxxxxxxxxxxxxxxxxx> wrote:
> I am attempting a very large scale EST assembly, and I notice that the
> overlap detection phase seems to go quicker when the kmer size is larger.
> What are the consequences of having the first pass start at a larger kmer
> size (say 50) instead of the usual 17?
Well, no overlap shorter than 50 bp will be found. That will lead to lowly expressed
genes not being reconstructed. The same applies to somewhat more highly expressed
genes whose ploidy variants make them look like different genes. And to
add insult to injury: this will also affect the 3’ end of some moderately
well expressed genes when the RNA-Seq data show the typical coverage profile, high
at the 5’ end and falling off towards the 3’ end. Bad luck may lead to low
coverage in some of the later areas, leading to missed joins and only partially
reconstructed genes.
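
To make the first point concrete, here is a small Python sketch. It is not MIRA code; it only assumes that an overlap candidate needs at least one exact kmer shared between two reads. The transcript, the read coordinates and the helper names (kmers, shared_kmers) are made up for illustration. Two 100 bp reads that overlap by 40 bp still share kmers at size 17 (there are 40 - 17 + 1 = 24 kmer positions inside the overlap), but none at size 50:

```python
import random


def kmers(seq, k):
    """Return the set of all k-mers in seq (empty if seq is shorter than k)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shared_kmers(read_a, read_b, k):
    """k-mers present in both reads; an empty set means no seed for an overlap."""
    return kmers(read_a, k) & kmers(read_b, k)


rng = random.Random(0)  # fixed seed so the toy example is reproducible
transcript = "".join(rng.choices("ACGT", k=160))  # hypothetical transcript

read_a = transcript[0:100]
read_b = transcript[60:160]  # overlaps read_a by 40 bp

seeds_k17 = shared_kmers(read_a, read_b, 17)
seeds_k50 = shared_kmers(read_a, read_b, 50)

print("shared 17-mers:", len(seeds_k17))
print("shared 50-mers:", len(seeds_k50))
```

With deep coverage some other read pair will usually overlap by more than 50 bp and bridge the gap; with low coverage there may be no such pair, which is the missed-join case described above.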
Next in line is what happens with slightly erroneous reads, e.g. 100mers with an
error at base position 51. On paper it looks as if only half of the read has a
valid overlap with other reads, and it may be that MIRA rejects this … though I
*think* I have some code which checks for that and, above a given kmer length,
still accepts it for subsequent Smith-Waterman validation. You might want to
test this with a couple of reads though.
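
The arithmetic behind the 100mer example can be sketched in a few lines of Python. Again, this is only an illustration and not how MIRA works internally; error_free_kmers is a hypothetical helper that counts the kmer windows of a read which do not cover the erroneous base (positions are 1-based, as in "base position 51"):

```python
def error_free_kmers(read_len, error_pos, k):
    """Count k-mer windows in a read that do not cover the erroneous base.

    error_pos is 1-based, matching the "base position 51" wording.
    """
    total = read_len - k + 1  # number of k-mer windows in the read
    if total <= 0:
        return 0
    clean = 0
    for start in range(1, total + 1):  # 1-based window start
        end = start + k - 1
        if not (start <= error_pos <= end):
            clean += 1
    return clean


clean_k17 = error_free_kmers(100, 51, 17)
clean_k50 = error_free_kmers(100, 51, 50)

print("error-free 17-mers:", clean_k17)  # 67 of 84 windows
print("error-free 50-mers:", clean_k50)  # 1 of 51 windows
```

At kmer size 50 the only error-free window is the one covering bases 1 to 50, so all exact-match evidence comes from the first half of the read. At size 17 there are clean windows on both sides of the error.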
B.