EULER Portal User Guide
The EULER Web Portal is a service that allows users to
assemble DNA fragments using an Eulerian path approach.
Please see this page
for more information on the
EULER algorithm, and
this page
for related papers.
The EULER portal allows users to use this algorithm to assemble DNA fragments
from a target DNA sequence of BAC size (typically 200-300 kb long, with fewer
than 20,000 fragments). The server will be scaled up to whole bacterial
genome assembly, with typical lengths of a few megabases (in progress, please check
later!).
Using a single common interface, the user can choose to perform assembly
on one of several NPACI High-Performance Computing platforms, and then employ a range of tools
to verify and compare the results (also see
EULER-Compare
for assembly comparison).
Each user must register for his own
account to run
the EULER portal. Registration offers a range of benefits including:
- a personal, password-protected account in which to store files and results
- ability to track the progress of jobs after submission
- greater computing power: fewer restrictions on usage of the HPC platforms
The following sections explain how to use the portal in each of these modes.
Registered users
How to apply for an account
Fill out
this online form.
This will give you limited access to several
NPACI facilities, including
-
the portal system (the web server for accessing scientific applications);
-
the
Storage Resource Broker
(SRB, for file storage);
-
the NPACI
gridport;
- and the computer on which the job is finally run.
If you are an NPACI HPC user, you should probably
still use
the same form.
However, you may bypass it; to do so, you need to create a
portal
account and
an SRB account with the same username, and an account on each
HPC platform on which you will run the EULER assembler.
Contact us
if you need additional assistance setting up either type of
account.
Uploading and managing files
Data files can be uploaded by clicking on the
Upload File button.
You can upload four types of files:
- xxxx (fasta file):
The reads file, in FASTA format.
This would normally be output by Phred.
Vector sequences should be screened out of the reads
before you upload the file.
- xxxx.qual (quality file):
The Phred quality value file corresponding to the
reads file. It must
contain the same number of reads as the reads file and in the same order.
This would normally be output by Phred along with the reads file.
- name.rul or xxxx.name.rul (mate-pair file):
A file that describes the naming rules for mate-pairs and other
special reads.
This is a short ASCII file described below.
A sequencing center would normally develop conventions for naming
rules, and apply the same naming rule conventions to much or all of its
data, so you will probably use the same name.rul file
with many different reads files.
- step.inp or xxxx.step.inp (error correction file):
The error correction step control file,
which dictates the number of steps and
other parameters of error correction.
This is a short ASCII file described below.
It is likely you will use the same error correction file for many
reads files.
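Because the quality file must list the same reads in the same order as the reads file, it can be worth checking the pair locally before uploading. The following is a minimal sketch, not part of the portal: the function names are hypothetical, and it assumes that both files use FASTA-style ">" header lines with the read name as the first token.

```python
def record_names(lines):
    """Return the record names (first token after '>') in file order."""
    names = []
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            names.append(fields[0] if fields else "")
    return names


def check_reads_and_quals(fasta_lines, qual_lines):
    """Return a list of problems found; an empty list means consistent."""
    reads = record_names(fasta_lines)
    quals = record_names(qual_lines)
    problems = []
    if len(reads) != len(quals):
        problems.append(
            f"{len(reads)} reads but {len(quals)} quality records"
        )
    # Report only the first name mismatch; later ones usually follow from it.
    for index, (read, qual) in enumerate(zip(reads, quals), start=1):
        if read != qual:
            problems.append(
                f"record {index}: read '{read}' vs quality '{qual}'"
            )
            break
    return problems
```

Both arguments are iterables of lines, so open file handles can be passed directly, for example `check_reads_and_quals(open("xxxx"), open("xxxx.qual"))`.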
At any stage, a user
may list the contents of his file directories by clicking on List
& Download
Files. From this list, files can be downloaded or deleted from
the server. The reads file is required but
the other files are optional. The user can choose to use the default error
correction file and mate-pair file when submitting the job.
The defaults are as follows:
Mate-pair file sample (normally named name.rul):
/* Names are: Library_Plate_Well.primer
primer library plate length range */
Finishing reads: f ALL ALL
Single reads: s ALL ALL
Double-barreled reads: x y ALL 1-200 1500 3500
Double-barreled reads: x y ALL 201-400 3000 10000
The lines in the file name.rul define the sources of each
read.
The first line means that reads
from all libraries and plates
with primer label f are finishing
reads. The second line means all the reads with primer label
s are single reads.
The third line means the reads with primer labels x
and y from plates 1-200 form
mate-pairs and the clone length is about 1500-3500. The fourth
line is similar. The user can revise
this file to suit their own projects and upload the revised
file to the EULER portal.
By default, the names of
the reads in the FASTA file are interpreted as
Library_Plate_Well.primer,
where the symbols "_" and "." are used as separators.
For instance, the read with name 2K_00135_351.f10
is from library "2K", plate "135", and well
"351". The primer of this read is labeled "f", so it is
a finishing read.
Note that the primer of a read
is only defined by the first letter after the last
period, no matter what other characters follow it.
Also, unbalanced mate pairs (read 2K_00199_17.x is present
but read 2K_00199_17.y is not) are treated as single reads.
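The default naming convention can be illustrated with a short sketch. This is not EULER's own code: the function names are hypothetical, the primer labels x and y are taken from the sample name.rul above, and the grouping of mates by the part of the name before the last period is an assumption based on the example names.

```python
from collections import defaultdict


def parse_read_name(name):
    """Split a 'Library_Plate_Well.primer' read name into its parts."""
    clone, _, suffix = name.rpartition(".")
    if not clone or not suffix:
        raise ValueError(f"no primer suffix in read name: {name!r}")
    library, plate, well = clone.split("_")
    return {
        "clone": clone,
        "library": library,
        "plate": int(plate),   # '00135' is plate 135
        "well": well,
        "primer": suffix[0],   # only the first letter after the last period
    }


def split_mates(names, forward="x", reverse="y"):
    """Return (pairs, unpaired); unbalanced mates end up in unpaired."""
    by_clone = defaultdict(dict)
    unpaired = []
    for name in names:
        info = parse_read_name(name)
        if info["primer"] in (forward, reverse):
            by_clone[info["clone"]][info["primer"]] = name
        else:
            unpaired.append(name)
    pairs = []
    for ends in by_clone.values():
        if forward in ends and reverse in ends:
            pairs.append((ends[forward], ends[reverse]))
        else:
            # Only one end is present, so treat it as a single read.
            unpaired.extend(ends.values())
    return pairs, unpaired
```

For the read 2K_00135_351.f10, `parse_read_name` yields library "2K", plate 135, well "351", and primer "f".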
Error correction step file sample (normally named
step.inp):
20 20 10 2
20 5 5 2
40 20 10 2
60 20 10 2
20 20 10 2
20 20 10 2
20 20 10 2
Each line in the file defines one iteration step of error correction.
The above file defines an error correction procedure of
seven steps.
The first and the last numbers in each line are the l-tuple size and the
minimal coverage used in single error correction of each error correction step.
The middle two parameters define the length of string extension and
the minimal matched ends used in double error correction.
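To check a custom step.inp before uploading it, the four columns can be read into named fields. This is a sketch only: the field names are made up for illustration, and the order of the two middle columns (extension length first, then minimal matched ends) is assumed from the order in which they are described above.

```python
from typing import NamedTuple


class ECStep(NamedTuple):
    tuple_size: int        # l-tuple size (single error correction)
    extension_length: int  # assumed: length of string extension
    min_matched_ends: int  # assumed: minimal matched ends
    min_coverage: int      # minimal coverage (single error correction)


def parse_steps(text):
    """Parse step.inp content into one ECStep per non-blank line."""
    steps = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        fields = line.split()
        if not fields:
            continue
        if len(fields) != 4:
            raise ValueError(
                f"line {line_number}: expected 4 numbers, got {len(fields)}"
            )
        steps.append(ECStep(*map(int, fields)))
    return steps


SAMPLE = """\
20 20 10 2
20 5 5 2
40 20 10 2
60 20 10 2
20 20 10 2
20 20 10 2
20 20 10 2
"""

steps = parse_steps(SAMPLE)
print(len(steps), "error correction steps")
```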
Starting an assembly
After the required files have been uploaded, clicking on
Submit job brings up a form used to initiate an EULER assembly.
This job submission form contains a number of fields that
the user should fill in.
- Job Name
-
A name for the EULER job that the user is submitting.
The user may monitor the progress of this job afterwards
by looking for this job name under
the Job Status and History option.
The job name
does not have to be related to the filenames in the submission.
- Fasta Files
- Quality Value Files
-
These are the
names of the input reads files and the corresponding Phred quality files,
respectively,
which should have been uploaded through the previous step
(see Uploading files).
The read file is required,
while the quality file is optional.
If the user doesn't have
a quality file available, he may leave the field as no_file.
The read file and the quality file must be consistent with each other:
the number of reads in each file should be equal, and their names and order
should be the same. Otherwise, the job will stop with
an error message in the final report.
- Mate-Pair Naming Rules Files
- Error Correction Steps Control Files
-
These are the
names of the mate-pair files and error correction files respectively.
These files should have been uploaded as well if necessary
(see Uploading files).
The user may decide to use the default versions of these files by choosing
default for each of these fields.
Please note that the definition of the
sources of the reads in the mate-pair file is very important;
a wrong definition may cause errors in the EULER-DB and
EULER-SF assembly steps.
- Euler Cutoff
- Quality Value Cutoff
- Trim
-
Read trimming is an important step in EULER, in which the
ends of reads are trimmed because error rates are higher at
the ends than in the middle.
The user may choose one of these trimming schemes:
- phred: uses the Phred quality file.
This is the default, and is the recommended choice.
- estimate: EULER estimates the error distribution over
reads without having to use the quality file.
This option must be selected if there is no
quality file.
- both: Do the "estimate" trimming, followed by "phred"
trimming.
- default: Recommended. phred if there is a quality file,
estimate if there isn't.
The user needs to specify the parameters for the
corresponding trimming methods:
-
For Phred quality value based trimming,
fill in the field "Quality Value Cutoff."
The default value 15 means the read ends with quality value below 15
will be trimmed.
-
For estimate based trimming, fill in the field "Euler cutoff."
The default value 0.02 means the read ends with error rate above
0.02 based on EULER's estimation will be trimmed.
If the user doesn't provide
a quality value file, he MUST choose EULER-estimate
trimming for this job and choose the appropriate EULER cutoff.
- Machine
- # CPU's
- Queue
- Wall Clock Time
-
These are parameters for the NPACI grid portal, and are only shown for
NPACI account holders. EULER guest account holders will not see them,
as we have made arrangements for specific computing resources.
NPACI account holders may run jobs on
IBM's Blue Horizon
supercomputer, among others.
In the future, the machine selections may change.
At the present time, EULER only uses one processor.
All NPACI HPC machines are accessed by submitting jobs
to a queue which reflects the priority of the job.
EULER users may specify low, normal, high,
or express
priority in the field Queue.
The recommended selection is high.
Due to the heavy load on these machines, it is requested that the
express priority queue only be used when results must
be obtained quickly.
The user should fill in the estimated running time of the submitted job
in the field Wall Clock Time.
A typical EULER run on a BAC takes about 2-4 hrs on
Blue Horizon.
So for a normal job, the user may fill in 06:00 in this field.
- Email address
-
The user may specify an email address with each request,
to which notification will be sent when the job is complete.
Normally you will use the same address as in your account
application, but that is not required.
- Comments
-
The user may optionally write some comments describing the job.
Clicking the Submit button will copy the input files to
the selected HPC machine and add the job to the queue. Please
be patient, as this may take a few minutes.
When the submission process completes,
a message will appear, showing the query ID that may be used
to track it and retrieve results.
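The effect of the Quality Value Cutoff field described under Trim can be pictured with a small sketch. The guide only says that read ends with quality below the cutoff are trimmed, so the rule below (strip bases from each end until one reaches the cutoff) is an assumed, simplest-possible reading and not necessarily what EULER does internally. The function name is hypothetical.

```python
def trim_ends(sequence, qualities, cutoff=15):
    """Strip low-quality bases from both ends of a read.

    Bases are removed from each end while their quality is below
    the cutoff; low-quality bases in the interior are kept.
    """
    if len(sequence) != len(qualities):
        raise ValueError("sequence and qualities differ in length")
    start = 0
    end = len(qualities)
    while start < end and qualities[start] < cutoff:
        start += 1
    while end > start and qualities[end - 1] < cutoff:
        end -= 1
    return sequence[start:end], qualities[start:end]


trimmed_seq, trimmed_quals = trim_ends(
    "ACGTACGT", [5, 10, 20, 30, 12, 40, 14, 3]
)
print(trimmed_seq)
```

Raising the cutoff trims more aggressively; a read whose quality values are all below the cutoff is reduced to an empty sequence.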
Tracking the progress of an assembly job
At any point after submitting an assembly,
the user may view its current status by clicking on
Job Status and History. This screen lists all currently pending and
running jobs in the user's account. The status of all jobs is
automatically updated every five minutes; if for any reason the status
of the list has not been recently updated, the user may check the
current status of all jobs by clicking on the Update link at the top
of the page.
Clicking on the View Queue link pops up a list of all jobs
currently running on the appropriate machine, and all jobs waiting to
run. This gives the user an impression of how much time will pass
before an EULER assembly starts running.
How to cancel a job
Jobs which are pending or running can be cancelled
at any time by going to the Job Status and History page,
pressing the grey
buttons corresponding to the unwanted jobs and clicking the
Cancel Jobs button.
Please note that cancelling a currently running job
will prevent any results from being produced by it.
Retrieving and analyzing results
Once a job runs to completion, an email will be
sent to the address specified on the submission form. The
files are transferred to your SRB account, and you may use the
List & Download Files link to retrieve them.
You may download files individually, or groups of files
in various archive formats
(tar, tar.Z, tar.gz, zip).
Users must download the result files as soon as possible
after the job is done, to avoid their removal
by the account administrators due to periodic space checks.
The output files are the following
(xxxx represents the name of the reads file):
Cumulative:
- xxxx.out:
- The report of EULER for this assembly job.
- index.html:
-
An HTML file with links to all of these files, with brief descriptions.
Classify reads:
- xxxx.reads:
- The reads that will be used in assembly
(FASTA format).
- xxxx.reads.qual:
- The Phred quality values of the reads used.
- xxxx.fin
- xxxx.fin.qual:
- The finishing reads and quality values, according to the naming rules.
- xxxx.long
- xxxx.long.qual:
- The long reads (longer than the permitted length of a read).
- xxxx.ambig
- xxxx.ambig.qual:
- The ambiguous reads (those with a large proportion of N or X),
if there are any.
EULER-Trim:
- xxxx.clean:
- The reads after trimming the ends and low quality regions.
EULER-EC:
- xxxx.fix:
- The reads after error correction by EULER-EC.
EULER-EC fixes the sequencing errors in the reads
xxxx.clean and outputs the
result into this file in the same FASTA format.
EULER-ChimDet:
- xxxx.fil:
- The reads that survive after chimeric reads detection.
If a read combines two different regions
in the DNA sequence together, it is a chimeric read.
EULER-ChimDet detects suspected chimeric reads in its input file
xxxx.fix and writes them to a separate file
xxxx.fix.chim.
Meanwhile, this program trims the unreliable read
ends further and outputs the remaining reads into this
file in the same FASTA format.
- xxxx.fix.chim:
- The chimeric reads detected and discarded by EULER-ChimDet.
- xxxx.fix.unrel:
-
The unreliable reads detected and discarded by EULER-ChimDet.
EULER-ET:
- xxxx.fil.edge
xxxx.fil.path
xxxx.fil.graph:
-
The de Bruijn graph representation of the assembly after
equivalent transformation with reads.
- xxxx.fil_et_comp1.gvz:
- The graphviz format output of the
connected components (see the report file
xxxx.out
for details of the components) after
equivalent transformation with reads.
There can be several
of these files, one for each component, with "1" replaced by the
component number.
For components that are reverse-complements of each other,
only one component is generated.
- xxxx.fil.contig:
- The contigs after equivalent transformation with reads.
- xxxx.fil.singleton:
- The single-read and double-read contigs generated by EULER.
EULER-DB:
- xxxx.fil.mate:
- Table of all mate-pairs found in the reads used in
the assembly. Each line in this file gives the name of a clone,
the indices of the 1st and 2nd reads of that clone in the read file
xxxx.fil
and the type of clone (corresponding to the
description in the mate-pair naming rule file).
- xxxx.fil.mate.edge
xxxx.fil.mate.graph:
-
The de Bruijn
graph representation after equivalent transformation with mate-pairs.
- xxxx.fil_db_comp1.gvz:
- The graphviz format output of the
connected components (see
the report file xxxx.out for details of the components) after
equivalent transformation with mate-pairs.
- xxxx.fil.mate.contig:
- The contigs after equivalent transformation with mate-pairs.
- xxxx.fil.et.mate:
- Table of all mate pairs linking these contigs. This file
describes the mate-pairs which are not used by EULER-DB.
It has the same format as xxxx.fil.mate.
These mate-pairs may be used in EULER-SF.
EULER-Consensus:
- xxxx.fil.mate.con:
- The contigs after consensus with majority rule.
EULER-SF (pass 1):
- xxxx.fil.mate.con_sf_comp1.gvz:
- The possible connections between contigs after
EULER-Consensus.
Each node in this graph represents a contig in the
input file xxxx.fil.mate.con and each
edge indicates that the two contigs it joins may lie close together, as suggested
by some mate-pairs.
EULER-Connect:
- xxxx.fil.mate.con.connt:
- The contigs after EULER-Connect.
- xxxx.fil.mate.con.connt.chim:
- Chimeric reads detected by EULER-Connect.
This detection is more reliable than
the list in xxxx.fix.chim
from EULER-ChimDet.
- xxxx.fil.mate.con_connt.gvz:
- The possible connections (detected by EULER-Connect)
between contigs after EULER-DB.
EULER-SF (pass 2):
- xxxx.fil.mate.con.connt_sf_comp1.gvz:
- The possible connections between contigs
after EULER-Connect.
This file follows the same format as
xxxx.fil.mate.con_sf_comp1.gvz,
but uses the input file
xxxx.fil.mate.con.connt.
Users may treat the contigs in the file
xxxx.fil.mate.con.connt
as the final contigs. However, the contigs generated in previous steps,
xxxx.fil.contig (after EULER-ET) and
xxxx.fil.mate.contig
(after EULER-DB), are more reliable but often
shorter.
The *.gvz files are graphs in GraphViz format.
Please download
graphviz,
transform .gvz files into PostScript files, and
then use ghostview
to view them.
Alternatively, the GraphViz package also supports other output formats,
and has viewers for Microsoft Windows and for the X Window System.
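The conversion step can be scripted once GraphViz is installed. The sketch below assumes that the .gvz files are meant for GraphViz's dot layout program and that dot is on the PATH; the helper names are hypothetical. It uses dot's standard -T (output format) and -o (output file) options.

```python
import subprocess
from pathlib import Path


def dot_command(gvz_path, fmt="ps"):
    """Build the dot command that renders a .gvz file to the given format."""
    source = Path(gvz_path)
    target = source.with_suffix("." + fmt)  # xxxx.gvz -> xxxx.ps
    return ["dot", f"-T{fmt}", str(source), "-o", str(target)]


def convert(gvz_path, fmt="ps"):
    """Run dot on one file; raises CalledProcessError if dot fails."""
    subprocess.run(dot_command(gvz_path, fmt), check=True)


command = dot_command("reads.fil_et_comp1.gvz")
print(" ".join(command))
```

Passing another format supported by the local GraphViz installation, for example `fmt="png"`, produces a file that does not need a PostScript viewer.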
Please send bug reports, questions, and suggestions
about EULER to
euler-help@cs.ucsd.edu.
For information on the GridPort toolkit used to build the
EULER Portal, contact Jerry Greenberg
(jpg@sdsc.edu).