Thanks Bastien.

Where are the debris reads? I have the debrislist.txt file, which lists the reads, but is there a file with the actual debris reads?

Tom

From: mira_talk-bounce@xxxxxxxxxxxxx [mailto:mira_talk-bounce@xxxxxxxxxxxxx] On Behalf Of Bastien Chevreux
Sent: Thursday, October 25, 2012 11:31 AM
To: mira_talk@xxxxxxxxxxxxx
Subject: [mira_talk] Re: assembly to identify a large insert

On Oct 25, 2012, at 20:08, Thomas Goldman <tomgoldman@xxxxxxxxxxx> wrote:

Secondly, I have an issue that I was hoping someone has had some experience with. I'm trying to identify the location of a large insert (~7.2 kb) in a MIRA 3.4 assembled genome for which I have MiSeq reads against a reference. Unfortunately, the reads from the inserted region (the cassette) do not assemble to the reference; I'm assuming this is simply because they are not in the reference and are thrown out. I also tried a de novo assembly. In this case the cassette is assembled, but unfortunately it produces a scaffold by itself without flanking reference sequence, so I still cannot determine where the cassette is inserted. I think this is because the 5' and 3' regions of the cassette itself have homology to other parts of the genome.

You cannot "force" reads to map against a non-existing reference ... that simply is not possible.

The easiest way out for you might be this: take the reads from the debris file of the mapping project and assemble them "denovo,est,accurate". EST is quite important there, as those reads will not behave like a normal genome. You should get your insert out of it. If you are lucky, the borders of that insert are non-repetitive with respect to the genome, so that you should be able to determine its place quickly.
If you are unlucky, find out which pairs in the insert have only one read in the insert, then fetch the positions of the mapped mates in the mapping project to give you an idea where your insert is placed (you will need to have a mapping project which did not use the short read merging option (you made sure of that)).

Alternative to the above approach (and pretty quick to do if there are not too many "SNPs" between your reference and the mapped data): use gap4 or gap5 to quickly step through all SNP locations marked by MIRA (marked with "SROc" tags). If the insert is in some non-repetitive area, you'll find a cluster of SROc/SRMc/UNSc tags together with the typical pattern for an insert in your genome.

Oh ... and you'll have to hope that your insert inserted itself in a region where the flanking repeats are not too long, because then even the paired-end reads will not help.

Hope that helps,
Bastien
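
For anyone following the thread who only has debrislist.txt and the original FASTQ, the two bookkeeping steps above (pulling the debris reads out by name, then finding pairs with only one read in the insert) could be sketched in Python roughly as follows. This is a minimal sketch, not something taken from MIRA itself. It assumes that the first whitespace-separated token on each line of the debris list is a read name, that the FASTQ uses strict 4-line records, and that mates are named with /1 and /2 suffixes. The function names load_read_names, filter_fastq and mates_outside are made up for illustration.

```python
def load_read_names(handle):
    """Collect read names from a list file.

    Assumption: the first whitespace-separated token of each non-empty
    line is the read name; any further columns are ignored.
    """
    names = set()
    for line in handle:
        fields = line.split()
        if fields:
            names.add(fields[0])
    return names


def filter_fastq(fastq_in, fastq_out, wanted):
    """Copy the FASTQ records whose read name is in `wanted`.

    Assumption: strict 4-line records (sequence and qualities not wrapped).
    Returns the number of records written.
    """
    written = 0
    while True:
        record = [fastq_in.readline() for _ in range(4)]
        if not record[0]:  # end of file
            break
        tokens = record[0][1:].split()  # drop the leading '@'
        name = tokens[0] if tokens else ""
        if name in wanted:
            fastq_out.writelines(record)
            written += 1
    return written


def mates_outside(insert_reads):
    """Return names of mates that are NOT among the reads in the insert.

    Assumption: mates are named <template>/1 and <template>/2. Reads
    without such a suffix are skipped.
    """
    outside = set()
    for name in insert_reads:
        if name.endswith("/1"):
            mate = name[:-2] + "/2"
        elif name.endswith("/2"):
            mate = name[:-2] + "/1"
        else:
            continue
        if mate not in insert_reads:
            outside.add(mate)
    return outside


# Hypothetical usage (file names other than debrislist.txt are placeholders):
#   with open("debrislist.txt") as f:
#       debris = load_read_names(f)
#   with open("reads.fastq") as fin, open("debris.fastq", "w") as fout:
#       filter_fastq(fin, fout, debris)
```

The names returned by mates_outside are the reads to look up in the mapping project; where they map should hint at the position of the insert. If your read names follow a different pairing scheme, the suffix handling would need to be adapted.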