On Montag 13 Juli 2009 Michal Stuglik wrote: > I would ask you one small thing I cannot find this in manuals: I do > miraEST assembling for 454 data and in result files (*.ace) names of the > input reads are changed so I can't track back to the original reads. Is > there any switch for that? Hello Michal, miraEST? Which version are you working with because I renamed "miraEST" to "miraSearchESTSNPs" a while ago (people got confused with running mira is est mode and miraEST). Anyway .. the documentation on miraSearchESTSNPs (miraEST) is lacking quite a bit, I never came around writing a good one. Am I guessing right that you want to find out which reads go into which final contigs? If yes, I've searched my archive for a mail that I once sent for an earlier version of MIRA (2.4.x) but which should clarify how to find the information you want. Files and directories have moved around a bit or have changed name slightly, but basically the information is there. Please note that the program there is still called miraEST and that parameter options I gave in there may have changed quite a bit in the mean time (there were bigger changes in 2.8.x and 2.9.x). ----------------------------------------------------------------------------- Generally speaking, miraEST is a three-stage assembly system that was designed to catch SNPs in different strains. That is, one really should have different strains to analyse (and the information provided to the assembler) to make the most out of miraEST. Unfortunately, only the smaller of the demos (estdemo2) that are in the package shows this, and it even is not made very clear. Here's a quick overview on what miraEST does: Step 1: assemble everything together, not caring about strain information. Potential SNPs are not treated as SNPs, but as possible repeat marker bases and are tagged as such (temporarily) to catch each and every possible sequence alignment which might be important later. As a result of this stage, the following information is written out: Into "step1_snpsinSTRAIN_<strainname>.caf": - all the sequences of a given strain that are in contigs (can be aligned with at least another sequence) - also, all sequences that are singlets BUT have been tagged previously as containing tagged bases showing that they aligned previously (even to other strains) but were torn apart due to the SNP bases. Into "step1_nonsnps_remain.caf": - all the singlets of all the strains that have no such tag If no strain information is provided Obviously, if one did not provide strain information to the assembly of step 1, all the sequences belong to the same strain (named "default"). The CAF files above are the input sequences for the next step. Step 2: Now, miraEST assembles each strain independently from each other. Again, sequences containing SNPs are torn apart into different contigs (or singlets) to give a clean representation of the "really sequenced" ESTs. In the end, each of the contigs (or singlets) coming out of the assemblies for the strains is a representation of the mRNA that was floating around the given cell/strain/organism. The results of this step are written out into one big file (step2_reads.caf) and a new straindata that goes along with those results (step2_straindata.txt). Step 3: miraEST takes the result of the previous step (which should now be clean transcripts) and assembles them together, *this time* allowing transcripts from different strains with different SNP bases to be assembled together. The result is then written to step3_out.* files and directories. > Also, I generally find that people want to know which original input > sequence ended up in which contig (or singleton). The way Mira is > creating contigs, with other contigs as the member sequences, makes this > difficult. You are not the first making this remark. Fortunately, I think that I can help you out there (see below). > Is there a way to produce final contigs where the directed > assembly uses original input sequences as the building blocks? I > understand that Mira edits these but it would be nice. Yes, there is. What I didn't tell in the description of the working process above is that (like the normal 'mira'), miraEST keeps track on a lot of things and writes out quite a lot of additional information files after each step. For step 1, look at step1_* files. For step 2, look at <strainname>_* files (and directories). Step 3, it's step3_* again. To get what you want, you will have to write a small script that basically collects the information from different output files and makes a small re-assembly with the initial sequences for each transcript. It should do something like this: first have a look at the step3_info_contigreadlist.txt (as you correctly did) > # > # contig_name read_name > # > step3_c1 default_c2 > step3_c1 default_c5 > step3_c1 default_c1 > ... > step3_c1 default_s11 > step3_s2 default_c3 > ... The first contig of step3 (step3_c1) has as first read a transcript formed out of contig number 2 from the strain "default" (default_c2), then there is default_c5, then ..., then there is a singlet form the "default" strain (default_s11). After that, the next result of step 3 (a singlet called step3_s2) is formed out of a contig of the default strain (default_c3). Etc. What you (the script) must now do, is to search in the corresponding step2 contigreadlist of the strain on how the transcripts were built. E.g.: look at default_info_contigreadlist.txt and find something like this (made up by me as I do not have your corresponding file): # # contig_name read_name # default_c1 gnlti136479291 default_c1 gnlti136479409 default_c1 gnlti136479240 default_c1 gnlti136479336 default_c2 gnlti136478532 default_c2 gnlti136477760 default_c2 gnlti136477568 ... default_c5 gnlti136479043 default_c5 gnlti136479138 ... The transcript "default_c2" is made out of three reads (gnlti136478532, gnlti136477760 and gnlti136477568), "default_c5" out of two, etc. With that, you will have collected the necessary information to make a small assembly using only the reads that form one transcript of step3. Here, you will have to use the 'normal' mira binary. Put all sequences collected into one input suited for mira (EXP files, FASTA or CAF). Then call mira and make sure that the following parameters are included in the parameterlist: mira <your parameters here, especially the correct ones for naming and loading the input files> -SB:lsd=yes -AS:nol=2:rbl=2:sd=no:ugpf=no -DP:ure=no:tpae=yes -AL:egp=yes:egpl=low -CO:mr=yes:asir=yes Optionally include -CLIPPING and other options as you think right. That should do the trick. Incidentally, this is also quite near on how to call the normal mira binary for assembling ESTs when no strain information is present. ----------------------------------------------------------------------------- Hope this helps, Bastien -- You have received this mail because you are subscribed to the mira_talk mailing list. For information on how to subscribe or unsubscribe, please visit http://www.chevreux.org/mira_mailinglists.html