[mira_talk] Re: Reads names
- From: Bastien Chevreux <bach@xxxxxxxxxxxx>
- To: mira_talk@xxxxxxxxxxxxx
- Date: Mon, 13 Jul 2009 22:45:40 +0200
On Montag 13 Juli 2009 Michal Stuglik wrote:
> I would ask you one small thing I cannot find this in manuals: I do
> miraEST assembling for 454 data and in result files (*.ace) names of the
> input reads are changed so I can't track back to the original reads. Is
> there any switch for that?
Hello Michal,
miraEST? Which version are you working with because I renamed "miraEST" to
"miraSearchESTSNPs" a while ago (people got confused with running mira is est
mode and miraEST).
Anyway .. the documentation on miraSearchESTSNPs (miraEST) is lacking quite a
bit, I never came around writing a good one. Am I guessing right that you want
to find out which reads go into which final contigs?
If yes, I've searched my archive for a mail that I once sent for an earlier
version of MIRA (2.4.x) but which should clarify how to find the information
you want. Files and directories have moved around a bit or have changed name
slightly, but basically the information is there.
Please note that the program there is still called miraEST and that parameter
options I gave in there may have changed quite a bit in the mean time (there
were bigger changes in 2.8.x and 2.9.x).
-----------------------------------------------------------------------------
Generally speaking, miraEST is a three-stage assembly system that was
designed to catch SNPs in different strains. That is, one really should have
different strains to analyse (and the information provided to the assembler)
to make the most out of miraEST. Unfortunately, only the smaller of the
demos (estdemo2) that are in the package shows this, and it even is not made
very clear. Here's a quick overview on what miraEST does:
Step 1: assemble everything together, not caring about strain information.
Potential SNPs are not treated as SNPs, but as possible repeat marker bases
and are tagged as such (temporarily) to catch each and every possible
sequence alignment which might be important later. As a result of this
stage, the following information is written out:
Into "step1_snpsinSTRAIN_<strainname>.caf":
- all the sequences of a given strain that are in contigs (can be aligned
with at least another sequence)
- also, all sequences that are singlets BUT have been tagged previously as
containing tagged bases showing that they aligned previously (even to
other strains) but were torn apart due to the SNP bases.
Into "step1_nonsnps_remain.caf":
- all the singlets of all the strains that have no such tag
If no strain information is provided
Obviously, if one did not provide strain information to the assembly of step
1, all the sequences belong to the same strain (named "default"). The CAF
files above are the input sequences for the next step.
Step 2: Now, miraEST assembles each strain independently from each other.
Again, sequences containing SNPs are torn apart into different contigs (or
singlets) to give a clean representation of the "really sequenced" ESTs. In
the end, each of the contigs (or singlets) coming out of the assemblies for
the strains is a representation of the mRNA that was floating around the
given cell/strain/organism. The results of this step are written out into
one big file (step2_reads.caf) and a new straindata that goes along with
those results (step2_straindata.txt).
Step 3: miraEST takes the result of the previous step (which should now be
clean transcripts) and assembles them together, *this time* allowing
transcripts from different strains with different SNP bases to be assembled
together. The result is then written to step3_out.* files and directories.
> Also, I generally find that people want to know which original input
> sequence ended up in which contig (or singleton). The way Mira is
> creating contigs, with other contigs as the member sequences, makes this
> difficult.
You are not the first making this remark. Fortunately, I think that I can
help you out there (see below).
> Is there a way to produce final contigs where the directed
> assembly uses original input sequences as the building blocks? I
> understand that Mira edits these but it would be nice.
Yes, there is. What I didn't tell in the description of the working process
above is that (like the normal 'mira'), miraEST keeps track on a lot of
things and writes out quite a lot of additional information files after each
step. For step 1, look at step1_* files. For step 2, look at <strainname>_*
files (and directories). Step 3, it's step3_* again.
To get what you want, you will have to write a small script that basically
collects the information from different output files and makes a small
re-assembly with the initial sequences for each transcript.
It should do something like this: first have a look at the
step3_info_contigreadlist.txt (as you correctly did)
> #
> # contig_name read_name
> #
> step3_c1 default_c2
> step3_c1 default_c5
> step3_c1 default_c1
> ...
> step3_c1 default_s11
> step3_s2 default_c3
> ...
The first contig of step3 (step3_c1) has as first read a transcript formed
out of contig number 2 from the strain "default" (default_c2), then there is
default_c5, then ..., then there is a singlet form the "default" strain
(default_s11). After that, the next result of step 3 (a singlet called
step3_s2) is formed out of a contig of the default strain (default_c3). Etc.
What you (the script) must now do, is to search in the corresponding step2
contigreadlist of the strain on how the transcripts were built. E.g.: look
at default_info_contigreadlist.txt and find something like this (made up by
me as I do not have your corresponding file):
#
# contig_name read_name
#
default_c1 gnlti136479291
default_c1 gnlti136479409
default_c1 gnlti136479240
default_c1 gnlti136479336
default_c2 gnlti136478532
default_c2 gnlti136477760
default_c2 gnlti136477568
...
default_c5 gnlti136479043
default_c5 gnlti136479138
...
The transcript "default_c2" is made out of three reads (gnlti136478532,
gnlti136477760 and gnlti136477568), "default_c5" out of two, etc.
With that, you will have collected the necessary information to make a small
assembly using only the reads that form one transcript of step3. Here, you
will have to use the 'normal' mira binary. Put all sequences collected into
one input suited for mira (EXP files, FASTA or CAF). Then call mira and make
sure that the following parameters are included in the parameterlist:
mira <your parameters here, especially the correct ones for naming and
loading the input files> -SB:lsd=yes -AS:nol=2:rbl=2:sd=no:ugpf=no
-DP:ure=no:tpae=yes -AL:egp=yes:egpl=low -CO:mr=yes:asir=yes
Optionally include -CLIPPING and other options as you think right. That
should do the trick. Incidentally, this is also quite near on how to call
the normal mira binary for assembling ESTs when no strain information is
present.
-----------------------------------------------------------------------------
Hope this helps,
Bastien
--
You have received this mail because you are subscribed to the mira_talk mailing
list. For information on how to subscribe or unsubscribe, please visit
http://www.chevreux.org/mira_mailinglists.html
Other related posts: