I was referring to things like the files in the info folder. The contigstats file will give you some idea of the coverage for the individual contigs, etc. Contigs with excessive coverage are generally pile ups of repeats. This is not so much about closing gaps as it is understanding the data you are working with. The other files can be useful too but not if you are manually working with them. They really need to be parsed with perl scripts or something similar.

Shaun

From: Shankar Manoharan <shankarmanostar@xxxxxxxxx>
To: mira_talk@xxxxxxxxxxxxx
Date: 2012-05-07 07:22 AM
Subject: [mira_talk] Re: Gap closure
Sent by: mira_talk-bounce@xxxxxxxxxxxxx

Dear Shaun...

Thanks a LOT for that detailed and very helpful walk through. I work on whole genome sequencing of bacterial isolates. We have roughly 3-4 references for my species. The objective of my work is to establish a gene function relationship to certain characteristics of my isolate. Of 14 repeats, two coded for proteins, the remaining 12 were 23S, 16S and 5S rRNA encoding sequences (Yes multiple copies of the same thing). It is essential for me to complete the genomic gaps by whatever means that may be available. I'll first try to order the contigs before proceeding further. Thanks for suggesting the tools.

You mentioned something about ancillary data generated by MIRA. Would you mind explaining what it is and how it will help in gap closure?

Best regards
Shankar Manoharan
Graduate Student
Department of Genetics
Madurai Kamaraj University
Ph. +919790167534

I strongly believe in doing my best and leaving the rest to God

On Mon, May 7, 2012 at 4:46 AM, Shaun Tyler <Shaun.Tyler@xxxxxxxxxxxxxxx> wrote:

There is no “this is how you do it” when it comes to genome closure. Every project is different and will have different challenges which require different approaches. However, I can offer a few tips.

First off and most importantly is to know what it is you want to accomplish.
This is a question that should have been addressed before you even started with the sequencing, as it ultimately affects the route you need to take. Characterizing a novel organism is different from doing a comparative project looking at differences in gene content, SNPs, phylogenetic relatedness, etc. Too often people embark on these projects without having a clear idea of the questions they want answered, only to find out that the track they’ve taken is not really appropriate.

But your question was on closing gaps. Basically you’ll do this by PCR and conventional sequencing. Typically the genome coverage will be close to 100%, so the gaps you are left with are generally fairly short or are due to repeats like rRNA sequences, IS elements and things like that.

The first thing to do is to get to know your data. You mentioned 14 repeats. What are they? Are they different copies of the same thing? How big are they and what coverage do they have compared to the rest of the genome? They may be separate contigs but could still represent multiple copies of the same thing. Do you care about these sequences and closing the gaps they create? Start with doing some simple BLAST searches and looking over the ancillary data files created by MIRA. The better you understand the data, the better off you will be in planning a route of attack.

You should also use something like GAP5, Tablet or other pile up viewers to assess the contigs created. Inconsistencies in depth of coverage or paired end distribution could indicate misassemblies. MIRA is good but nothing is perfect. Contigs flanking repeats will typically have a higher depth of coverage at the terminal ends. Interrogate these regions to find out what repeat you are dealing with. It will also aid you in primer design. No point putting a primer at the end of a contig if it is likely to prime in multiple locations in the genome and give you multiple products.
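
As a rough illustration of the primer point above, here is a minimal Python sketch that counts how many places a candidate primer matches across a set of contigs, on both strands. The function names and the toy sequences are made up for the example. It only finds exact matches, so it is a quick first screen under that assumption and not a substitute for a BLAST search, since real primers can also anneal with mismatches.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    table = str.maketrans("ACGTacgt", "TGCAtgca")
    return seq.translate(table)[::-1]


def count_sites(primer, sequence):
    """Count exact primer matches in a sequence on both strands.

    Overlapping matches are counted. A palindromic primer is only
    searched once, because the forward and reverse forms are identical.
    """
    primer = primer.upper()
    sequence = sequence.upper()
    targets = {primer, reverse_complement(primer)}
    total = 0
    for target in targets:
        start = sequence.find(target)
        while start != -1:
            total += 1
            start = sequence.find(target, start + 1)
    return total


def primer_hits(primer, contigs):
    """Map contig name -> number of exact sites, for contigs with any hit."""
    hits = {}
    for name, seq in contigs.items():
        n = count_sites(primer, seq)
        if n:
            hits[name] = n
    return hits


# Toy data, purely hypothetical; real contigs would be read from a FASTA file.
toy_contigs = {
    "c1": "TTAAGGCTT",
    "c2": "AAGGCAAGCCTT",
    "c3": "CCCCCC",
}
hits = primer_hits("AAGGC", toy_contigs)
if sum(hits.values()) > 1:
    print("primer matches more than one site:", hits)
```

A primer that reports more than one site in total is a candidate for giving multiple PCR products and is probably worth redesigning further in from the contig end.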
Scaffolding is also going to be required in order to orient your contigs and plan your PCR experiments. If you don’t have paired end data to facilitate this you can use programs like Mauve, Mummer/Nucmer, ABACAS, Projector2, etc. to order your contigs based on a reference sequence. Just keep in mind that the resulting contig order is only as good as the reference sequence (and program) you use, so trying different reference sequences and looking for consistency is usually recommended. In the end it probably won’t be perfect, so expect some predictions to be wrong. ABACAS and Projector2 are nice because they will provide you with a primer list for closing gaps, but I sometimes question the ordering that they come up with and the primers designed are not always the best (e.g. they can target obvious repeat regions).

Another tip has to do with the repeats. In all likelihood you will have multiple copies of rRNA sequences or other large regions that will require primer walking. If you are looking to close these gaps, make use of the data you have to make things easier. The rRNA regions for example will typically be 5 Kb or so and would require multiple rounds of sequence, design primer, sequence, design primer, etc. But you can predesign the primers to cover these repetitive regions based on the data you have, so that you can use the same sequencing primers for all of the different copies.

I could go on and on about what to do in different situations but I think this covers the main areas.

Good Luck and Have Fun. Oh and be prepared for frustration and disappointment ;-)

Shaun
From: Shankar Manoharan <shankarmanostar@xxxxxxxxx>
To: mira_talk@xxxxxxxxxxxxx
Date: 2012-05-06 01:05 PM
Subject: [mira_talk] Gap closure
Sent by: mira_talk-bounce@xxxxxxxxxxxxx

Dear all...

Firstly thanks a LOT for your earlier support. I have now managed to assemble a ~4.5 Mb genome and obtain 48 contigs of which 14 are repeats. If I have to close gaps, what is the best possible way? Any input would be greatly appreciated.

Many thanks in advance.

Best regards,
Shankar Manoharan
Graduate Student
Department of Genetics
Madurai Kamaraj University
Ph. +919790167534

I strongly believe in doing my best and leaving the rest to God
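
Going back to the contigstats suggestion in the most recent reply at the top of this thread, here is a minimal Python sketch of the kind of parsing script it has in mind. It assumes the file is tab-separated with a header row, and that the contig name and average coverage columns are called "# name" and "av.cov". Those column names are assumptions, so check them against the header of your own file. The function name, the 2x cutoff and the toy table are invented for the example.

```python
import csv
import io
import statistics


def flag_high_coverage(handle, name_col="# name", cov_col="av.cov", factor=2.0):
    """Return (median coverage, flagged contigs) from a tab-separated table.

    A contig is flagged when its average coverage exceeds factor times the
    median over all contigs. Column names are assumptions; adjust them to
    match the header of the actual file.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    rows = [(row[name_col], float(row[cov_col])) for row in reader]
    median_cov = statistics.median(cov for _, cov in rows)
    flagged = [
        (name, cov, cov / median_cov)
        for name, cov in rows
        if cov > factor * median_cov
    ]
    return median_cov, flagged


# Toy table standing in for a real file (hypothetical values).
toy_table = (
    "# name\tlength\tav.cov\n"
    "c1\t120000\t30.0\n"
    "c2\t98000\t32.0\n"
    "rep_c3\t5100\t95.0\n"
    "c4\t150000\t28.0\n"
    "c5\t76000\t31.0\n"
)
median_cov, flagged = flag_high_coverage(io.StringIO(toy_table))
for name, cov, ratio in flagged:
    print(f"{name}: av. coverage {cov:.1f}, about {ratio:.1f}x the median")
```

For a real run, pass an open file instead of the StringIO object. The coverage ratio is only a rough hint at how many copies of a repeat have been collapsed into one contig, so treat it as a starting point for looking at the contig in a viewer.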