On Wednesday 18 February 2009 mark.rose@xxxxxxxxxxxx wrote:
> Could you list reasons for why input sequences are discarded into the
> *_info_debrislist.txt file (I can't find much of a description of it in
> the manual). I've been trying to assemble some fasta contigs originally
> derived from 454 read assemblies and roughly 3/4 of my input sequence
> wind up in this debris file.

Hello Mark,

MIRA puts everything it could not assemble into the debris file. Reads
can land there at various stages (I need to improve the debris file by
appending the reason each read ended up there):

- after loading (and clipping), if the reads are too short
  Important MIRA parameters: -AS:mrl and the complete -CL category

- after the SKIM process (quick comparison of each read against every
  other read), if a read does not potentially match any other read
  Important MIRA parameter: -SK:pr

- during the pre-assembly Smith-Waterman alignments, if it turns out
  that a read does not match any other read with the given criteria
  Important MIRA parameters: -AL:mo:mrs:egp:egpl:megpp

- during/after the contig assembly stage (less frequent, but still
  possible): if a read turns out to be "strange" and doesn't want to
  integrate fully into a contig. This happens, e.g., for chimeras or
  for sequences that contain a larger portion of sequencing vector or
  adaptor sequence
  Important MIRA parameters: -CO:rodirs, -CO:mr (and the parameters
  dependent on that) as well as the same -AL parameters as for the
  Smith-Waterman alignments

For assembly of genomic data, -AS:ard (automatic repeat detection) and
the parameters dependent on it can also play a role. Here's a
simplified example: assume you have an average coverage of 10 and for a
certain repeat you have 21 sequences. MIRA will then assume that you
have 2 copies of this repeat and allow a coverage of 10 for every
repeat copy.
This means you'll have 2 copies with 10 sequences each plus one
remaining sequence that could match but, due to the strict handling of
repetitive coverage, is not assembled into any repeat copy and
therefore lands in the debris file. The same case with 22 sequences
would result in 2 copies with 10 sequences each and a third copy with
2 sequences (very probably named "lrc", see below).

Now, if in your case 3/4 of the sequences land in the debris file, you
are in trouble. Possible reasons:

- bad sequence quality (so that the reads get clipped away)
- sequencing vector (or adaptor sequence) not removed at all from your
  input (if present: is the ancillary data loaded?)
- assembly of EST data with "--job=genome" (this is a big no-no, as
  options switched on there like -AS:ard or -CL:pec wreak havoc for EST
  data)
- other sequence related reasons I currently cannot think of
- a bug in MIRA (which I hope it is not)

> Also can you explain what "lrc" means in the names of assembled contigs?

It is a remnant of larger changes that are currently being made to the
algorithms. It was introduced more for debugging purposes, but a few
users thought it could help them, so I left it in.

Basically, MIRA starts to build contigs in areas that are "rock solid",
i.e., not a repetitive region (main decision point) and with nice
coverage of good reads. If during the assembly MIRA reaches a point
where it cannot start building a contig in a non-repetitive region, it
will name the contig "lrc" instead of "c".

In normal genome projects this happens mostly towards the end of the
assembly process, for debris of repetitive copies which were formerly
thrown out due to -AS:ard (see above). You know that something is wrong
if you have "lrc" contigs very early on in the assembly (with low
contig numbers).

Hope that helps,
  Bastien

--
You have received this mail because you are subscribed to the mira_talk
mailing list. For information on how to subscribe or unsubscribe,
please visit http://www.chevreux.org/mira_mailinglists.html
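
Addendum: the repeat-coverage example in the mail (average coverage 10,
21 or 22 sequences for one repeat) can be written down as a toy
calculation. This is only a sketch of the arithmetic, not MIRA's actual
code. The function name split_repeat_reads and the threshold
min_reads_per_contig=2 are assumptions, chosen so that the sketch
reproduces the two cases described in the mail.

```python
def split_repeat_reads(num_reads, avg_coverage, min_reads_per_contig=2):
    """Toy model of the strict repeat-coverage handling described in the mail.

    Returns a tuple (full_copies, extra_contig_reads, debris_reads).
    NOTE: min_reads_per_contig is a hypothetical threshold, not a
    documented MIRA default.
    """
    # Each assumed repeat copy is allowed exactly avg_coverage reads.
    full_copies, leftover = divmod(num_reads, avg_coverage)
    if leftover >= min_reads_per_contig:
        # Enough leftover reads to form one more (probably "lrc") contig.
        return full_copies, leftover, 0
    # Too few leftover reads to build anything: they end up in the debris.
    return full_copies, 0, leftover


for n in (21, 22):
    copies, extra, debris = split_repeat_reads(n, avg_coverage=10)
    print(f"{n} reads: {copies} full copies, "
          f"extra contig with {extra} reads, {debris} read(s) in debris")
```

With 21 reads the single leftover read goes to the debris; with 22 the
two leftover reads form a small third copy, matching the two cases in
the mail.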