EULER Portal User Guide


The EULER Web Portal is a service that allows users to assemble DNA fragments using an Eulerian path approach. Please see this page for more information on the EULER algorithm, and this page for related papers.

The EULER portal allows users to use this algorithm to assemble DNA fragments from a target DNA sequence of BAC size (typically 200-300 Kb in length, with fewer than 20,000 fragments). The server will be scaled up to whole bacterial genome assembly, with typical lengths of a few megabases (in progress, please check later!).

Using a single common interface, the user can choose to perform assembly on one of several NPACI High-Performance Computing platforms, and then employ a range of tools to verify and compare the results (also see EULER-Compare for assembly comparison).

Each user must register for an account to use the EULER portal. Registration offers a range of benefits including:

The following sections explain how to use the portal in each of these modes.


Registered users

Getting an account

Uploading files

Starting an assembly

Tracking the progress of an assembly job

Retrieving results


Registered users


How to apply for an account


Fill out this online form.

This will give you limited access to several NPACI facilities, including

If you are an NPACI HPC user, you should probably still use the same form. However, you may bypass it: in that case you need to create a portal account and an SRB account with the same username, and an account on each HPC platform on which you will run the EULER assembler.

Contact us if you need additional assistance setting up either type of account.

Uploading and managing files


Data files can be uploaded by clicking on the Upload File button. You can upload four types of files:

  1. xxxx (fasta file):
    The FASTA format reads file. This would normally be output by Phred. The vector sequences in the read sequences should be screened before you upload the file.
  2. xxxx.qual (quality file):
    The Phred quality value file corresponding to the reads file. It must contain the same number of reads as the reads file and in the same order. This would normally be output by Phred along with the reads file.
  3. name.rul or xxxx.name.rul (mate-pair file):
    A file that describes the naming rules for mate-pairs and other special reads. This is a short ASCII file described below. A sequencing center would normally develop conventions for naming rules, and apply the same naming rule conventions to much or all of its data, so you will probably use the same name.rul file with many different reads files.
  4. step.inp or xxxx.step.inp (error correction file):
    The error correction step control file, which dictates the number of steps and other parameters of error correction. This is a short ASCII file described below. It is likely you will use the same error correction file for many reads files.
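
The requirement in item 2 (same number of reads, in the same order) can be checked locally before uploading. The following is a minimal sketch, not part of the portal. It assumes that both the reads file and the quality file use FASTA-style header lines beginning with ">", with the read name as the first word of the header. The function names read_names and check_consistency are hypothetical.

```python
def read_names(path):
    """Return the read names (first word of each '>' header line), in file order."""
    names = []
    with open(path) as handle:
        for line in handle:
            if line.startswith(">"):
                fields = line[1:].split()
                names.append(fields[0] if fields else "")
    return names


def check_consistency(fasta_path, qual_path):
    """Return a list of problems; an empty list means the two files agree."""
    fasta_names = read_names(fasta_path)
    qual_names = read_names(qual_path)
    problems = []
    if len(fasta_names) != len(qual_names):
        problems.append(
            f"read count differs: {len(fasta_names)} in reads file, "
            f"{len(qual_names)} in quality file"
        )
    # Report only the first name mismatch; later ones usually follow from it.
    for index, (f_name, q_name) in enumerate(zip(fasta_names, qual_names), start=1):
        if f_name != q_name:
            problems.append(
                f"record {index}: reads file has {f_name!r}, quality file has {q_name!r}"
            )
            break
    return problems
```

Calling check_consistency("xxxx", "xxxx.qual") on a matching pair of files should return an empty list. Any entries in the returned list describe why the pair would likely be rejected.
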

At any stage, a user may list the contents of their file directories by clicking on List & Download Files. From this list, files can be downloaded or deleted from the server. The reads file is required, but the other files are optional. The user can choose to use the default error correction file and mate-pair file when submitting the job. The defaults are as follows:

Mate-pair file sample (normally named name.rul):


/*                      Names are: Library_Plate_Well.primer
                                primer  library plate   length range    */

Finishing reads:                f       ALL     ALL
Single reads:                   s       ALL     ALL
Double-barreled reads:          x y     ALL     1-200     1500 3500
Double-barreled reads:          x y     ALL     201-400   3000 10000

The lines in the file name.rul define the source of each read. The first rule line means that reads from all libraries and plates with primer label f are finishing reads. The second means that all reads with primer label s are single reads. The third means that reads with primer labels x and y from plates 1-200 form mate-pairs, with a clone length of about 1500-3500. The fourth is similar, covering plates 201-400 with a clone length of about 3000-10000. Users can revise this file according to their projects and upload it to the EULER portal.
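
To make the column layout of the sample concrete, here is a minimal sketch of a reader for name.rul. It is not the parser EULER uses. It assumes the layout is exactly what the sample shows: an optional /* ... */ header comment, then lines of the form "label: primers library plate [min_length max_length]". The function name parse_name_rules and the dictionary keys are hypothetical.

```python
SAMPLE = """\
/*                      Names are: Library_Plate_Well.primer
                                primer  library plate   length range    */

Finishing reads:                f       ALL     ALL
Single reads:                   s       ALL     ALL
Double-barreled reads:          x y     ALL     1-200     1500 3500
Double-barreled reads:          x y     ALL     201-400   3000 10000
"""


def parse_name_rules(text):
    """Parse the text of a name.rul file into a list of rule dictionaries."""
    rules = []
    in_comment = False
    for raw in text.splitlines():
        line = raw.strip()
        # Skip the /* ... */ header comment, which may span several lines.
        if in_comment or line.startswith("/*"):
            in_comment = "*/" not in line
            continue
        label, sep, rest = line.partition(":")
        tokens = rest.split()
        if not sep or len(tokens) < 3:
            continue  # blank line, or a line this sketch does not understand
        length_range = None
        # Assumption: two trailing integers, when present, are the clone length range.
        if len(tokens) >= 5 and tokens[-1].isdigit() and tokens[-2].isdigit():
            length_range = (int(tokens[-2]), int(tokens[-1]))
            tokens = tokens[:-2]
        rules.append({
            "label": label.strip(),
            "primers": tokens[:-2],   # everything before library and plate
            "library": tokens[-2],
            "plate": tokens[-1],      # kept as text, e.g. "ALL" or "1-200"
            "length_range": length_range,
        })
    return rules


rules = parse_name_rules(SAMPLE)
for rule in rules:
    print(rule["label"], rule["primers"], rule["plate"], rule["length_range"])
```

Reading the columns from the right-hand side avoids having to guess how many primer labels a rule lists, which is why the length range and plate are peeled off first.
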

By default, the names of the reads in the FASTA file are interpreted as Library_Plate_Well.primer, where the symbols "_" and "." are used as separators. For instance, the read with name 2K_00135_351.f10 is from library "2K", plate "135", and well "351". The primer of this read is labeled "f", so it is a finishing read. Note that the primer of a read is only defined by the first letter after the last period, no matter what other characters follow it. Also, unbalanced mate pairs (read 2K_00199_17.x is present but read 2K_00199_17.y is not) are treated as single reads.
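
The default naming convention can be illustrated with a short sketch. This is an illustration of the rules as stated above, not EULER's own code, and the function names parse_read_name and split_mates are hypothetical. It assumes library names contain no "_" character, and it treats x and y as the mate-pair primer labels, as in the sample name.rul.

```python
from collections import defaultdict


def parse_read_name(name):
    """Split a read name of the form Library_Plate_Well.primer into its parts."""
    clone, _, suffix = name.rpartition(".")
    parts = clone.split("_")
    if not suffix or len(parts) != 3:
        raise ValueError(f"not in Library_Plate_Well.primer form: {name!r}")
    library, plate, well = parts
    return {
        "clone": clone,
        "library": library,
        "plate": int(plate),   # "00135" -> 135
        "well": well,
        "primer": suffix[0],   # only the first letter after the last period counts
    }


def split_mates(names, primers=("x", "y")):
    """Return (pairs, unbalanced) for reads carrying one of the two mate primers.

    Reads with any other primer label are ignored by this sketch.
    """
    by_clone = defaultdict(dict)
    for name in names:
        info = parse_read_name(name)
        if info["primer"] in primers:
            by_clone[info["clone"]][info["primer"]] = name
    pairs, unbalanced = [], []
    for ends in by_clone.values():
        if len(ends) == 2:
            pairs.append((ends[primers[0]], ends[primers[1]]))
        else:
            unbalanced.extend(ends.values())  # treated as single reads
    return pairs, unbalanced


info = parse_read_name("2K_00135_351.f10")
pairs, unbalanced = split_mates(
    ["2K_00199_17.x", "2K_00200_3.x", "2K_00200_3.y", "2K_00135_351.f10"]
)
```

Here info describes library "2K", plate 135, well "351" and primer "f". The read 2K_00199_17.x ends up in unbalanced because its y mate is absent.
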

Error correction step file sample (normally named step.inp):


20 20 10 2
20 5  5 2
40 20 10 2
60 20 10 2
20 20 10 2
20 20 10 2
20 20 10 2

Each line in the file defines one iteration step of error correction. The above file defines an error correction procedure of seven steps. The first and the last numbers in each line are the l-tuple size and the minimal coverage used in single error correction of each error correction step. The middle two parameters define the length of string extension and the minimal matched ends used in double error correction.
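
As an illustration of this layout, the sketch below reads a step.inp file into named fields. The field names and the function parse_steps are hypothetical. The order of the two middle columns (extension length first, then minimal matched ends) is assumed from the order in which they are described above.

```python
from typing import NamedTuple

SAMPLE_STEPS = """\
20 20 10 2
20 5  5 2
40 20 10 2
60 20 10 2
20 20 10 2
20 20 10 2
20 20 10 2
"""


class ECStep(NamedTuple):
    tuple_size: int        # l-tuple size (single error correction)
    extension_length: int  # length of string extension (double error correction)
    min_matched_ends: int  # minimal matched ends (double error correction)
    min_coverage: int      # minimal coverage (single error correction)


def parse_steps(text):
    """Parse step.inp text into a list of ECStep records, one per iteration."""
    steps = []
    for number, line in enumerate(text.splitlines(), start=1):
        fields = line.split()
        if not fields:
            continue  # ignore blank lines
        if len(fields) != 4:
            raise ValueError(f"line {number}: expected 4 integers, got {len(fields)} fields")
        steps.append(ECStep(*(int(field) for field in fields)))
    return steps


steps = parse_steps(SAMPLE_STEPS)
for index, step in enumerate(steps, start=1):
    print(index, step)
```

A check like this catches a line with a missing or extra column before the file is uploaded.
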


Starting an assembly


After the required files have been uploaded, clicking on Submit job brings up a form used to initiate an EULER assembly. This job submission form contains a number of fields that the user should fill in.

Job Name
A name for the EULER job that the user is submitting. The user may monitor the progress of this job afterwards by looking for this job name under the Job Status and History option. The job name does not have to be related to the filenames in the submission.
Fasta Files
Quality Value Files
These are the names of the input reads files and the corresponding Phred quality files, respectively, which should have been uploaded through the previous step (see Uploading files). The read file is required, while the quality file is optional. If the user doesn't have a quality file available, he may leave the field as no_file. The read file and the quality file must be consistent with each other: the number of reads in each file should be equal, and their names and order should be the same. Otherwise, the job will stop with an error message in the final report.
Mate-Pair Naming Rules Files
Error Correction Steps Control Files
These are the names of the mate-pair file and the error correction file, respectively. These files should have been uploaded as well if necessary (see Uploading files). The user may choose to use the default versions of these files by selecting default for each of these fields. Please note that the definition of the read sources in the mate-pair file is very important; a wrong definition may cause errors in the EULER-DB and EULER-SF assembly steps.
Euler Cutoff
Quality Value Cutoff
Trim
Read trimming is an important step in EULER, in which the ends of reads are trimmed because error rates are higher at the ends than in the middle. The user may choose one of these trimming schemes:
The user needs to specify the parameters for the chosen trimming method. If the user doesn't provide a quality value file, he MUST choose EULER-estimate trimming for this job and choose the appropriate EULER cutoff.
Machine
# CPU's
Queue
Wall Clock Time
These are parameters for the NPACI grid portal, and are only shown for NPACI account holders. EULER guest account holders will not see them, as we have made arrangements for specific computing resources. NPACI account holders may run their jobs on IBM's Blue Horizon supercomputer, among others. The machine selections may change in the future.

At the present time, EULER only uses one processor.

All NPACI HPC machines are accessed by submitting jobs to a queue which reflects the priority of the job. EULER users may specify low, normal, high, or express priority in the field Queue. The recommended selection is high. Due to the heavy load on these machines, it is requested that the express priority queue only be used when results must be obtained quickly.

The user should fill in the estimated running time of the submitted job in the field Wall Clock Time. A typical EULER run on a BAC takes about 2-4 hours on Blue Horizon, so for a normal job the user may enter 06:00 in this field.

Email address
The user may specify an email address with each request, to which notification will be sent when the job is complete. Normally you will use the same address as in your account application, but that is not required.
Comments
The user may optionally write some comments describing the job.
Clicking the Submit button will copy the input files to the selected HPC machine and add the job to the queue. Please be patient, as this may take a few minutes. When the submission process completes, a message will appear, showing the query ID that may be used to track it and retrieve results.

Tracking the progress of an assembly job


At any point after submitting an assembly, the user may view its current status by clicking on Job Status and History. This screen lists all currently pending and running jobs in the user's account. The status of all jobs is automatically updated every five minutes; if for any reason the status of the list has not been recently updated, the user may check the current status of all jobs by clicking on the Update link at the top of the page.

Clicking on the View Queue link pops up a list of all jobs currently running on the appropriate machine, and all jobs waiting to run. This gives the user an impression of how much time will pass before an EULER assembly starts running.

How to cancel a job


Jobs that are pending or running can be cancelled at any time by going to the Job Status and History page, selecting the grey buttons corresponding to the unwanted jobs, and clicking the Cancel Jobs button. Please note that cancelling a currently running job will prevent it from producing any results.

Retrieving and analyzing results


Once a job runs to completion, an email will be sent to the address specified on the submission form. The files are transferred to your SRB account, and you may use the List & Download Files link to retrieve them. You may download files individually, or groups of files in various archive formats (tar, tar.Z, tar.gz, zip).

Users must download the result files as soon as possible after the job is done, to avoid their removal by the account administrators due to periodic space checks.

The output files are the following (xxxx represents the name of the reads file):

Cumulative:

xxxx.out:
The report of EULER for this assembly job.
index.html:
An HTML file with links to all of these files, with brief descriptions.

Classify reads:

xxxx.reads:
The reads that will be used in assembly (FASTA format).
xxxx.reads.qual:
The Phred quality values of the reads used.
xxxx.fin
xxxx.fin.qual:
The finishing reads and quality values, according to the naming rules.
xxxx.long
xxxx.long.qual:
The long reads (longer than the permitted length of a read).
xxxx.ambig
xxxx.ambig.qual:
The ambiguous reads (those with a large proportion of N or X), if there are any.

EULER-Trim:

xxxx.clean:
The reads after trimming the ends and low quality regions.

EULER-EC:

xxxx.fix:
The reads after error correction by EULER-EC. EULER-EC fixes the sequencing errors in the reads xxxx.clean and outputs the result into this file in the same FASTA format.

EULER-ChimDet:

xxxx.fil:
The reads that survive after chimeric reads detection. If a read combines two different regions in the DNA sequence together, it is a chimeric read. EULER-ChimDet detects the suspicious chimeric reads in its input file xxxx.fix and discards them into a separate file xxxx.fix.chim. Meanwhile, this program trims the unreliable read ends further and outputs the remaining reads into this file in the same FASTA format.
xxxx.fix.chim:
The chimeric reads detected and discarded by EULER-ChimDet.
xxxx.fix.unrel:
The unreliable reads detected and discarded by EULER-ChimDet.

EULER-ET:

xxxx.fil.edge
xxxx.fil.path
xxxx.fil.graph:
The de Bruijn graph representation of the assembly after equivalent transformation with reads.
xxxx.fil_et_comp1.gvz:
The graphviz format output of the connected components (see the report file xxxx.out for details of the components) after equivalent transformation with reads. There can be several of these files, one for each component, with "1" replaced by the component number. For components that are reverse-complements of each other, only one component is generated.
xxxx.fil.contig:
The contigs after equivalent transformation with reads.
xxxx.fil.singleton:
The single-read and double-read contigs generated by EULER.

EULER-DB:

xxxx.fil.mate:
Table of all mate-pairs found in the reads used in the assembly. Each line in this file gives the name of a clone, the indices of the first and second reads of that clone in the read file xxxx.fil, and the type of clone (corresponding to the description in the mate-pair naming rule file).
xxxx.fil.mate.edge
xxxx.fil.mate.graph:
The de Bruijn graph representation after equivalent transformation with mate-pairs.
xxxx.fil_db_comp1.gvz:
The graphviz format output of the connected components (see the report file xxxx.out for details of the components) after equivalent transformation with mate-pairs.
xxxx.fil.mate.contig:
The contigs after equivalent transformation with mate-pairs.
xxxx.fil.et.mate:
Table of all mate pairs linking these contigs. This file describes the mate-pairs which are not used by EULER-DB. It has the same format as xxxx.fil.mate. These mate-pairs may be used in EULER-SF.

EULER-Consensus:

xxxx.fil.mate.con:
The contigs after consensus with majority rule.

EULER-SF (pass 1):

xxxx.fil.mate.con_sf_comp1.gvz:
The possible connections between contigs after EULER-Consensus. Each node in this graph represents a contig in the input file xxxx.fil.mate.con, and each edge indicates that the two contigs it joins might be positioned close together, as suggested by some mate-pairs.

EULER-Connect:

xxxx.fil.mate.con.connt:
The contigs after EULER-Connect.
xxxx.fil.mate.con.connt.chim:
Chimeric reads detected by EULER-Connect. This detection is more reliable than the one in EULER-ChimDet that produces xxxx.fix.chim.
xxxx.fil.mate.con_connt.gvz:
The possible connections (detected by EULER-Connect) between contigs after EULER-DB.

EULER-SF (pass 2):

xxxx.fil.mate.con.connt_sf_comp1.gvz:
The possible connections between contigs after EULER-Connect. This file follows the same format as xxxx.fil.mate.con_sf_comp1.gvz, but uses the input file xxxx.fil.mate.con.connt.

Users may treat the contigs in the file xxxx.fil.mate.con.connt as the final contigs. However, the contigs generated in the previous steps, xxxx.fil.contig (after EULER-ET) and xxxx.fil.mate.contig (after EULER-DB), are more reliable, though often shorter.

The *.gvz files are graphs in GraphViz format. Please download GraphViz, transform the .gvz files into PostScript files, and then use ghostview to view them. Alternatively, the GraphViz package also supports other output formats, and has viewers for Microsoft Windows and for the X Window System.
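
The conversion to PostScript can be scripted. The sketch below assumes that the GraphViz dot program is installed and on the PATH, and that the .gvz files can be read directly by dot; neither is confirmed by this guide. The helper names dot_command and convert are hypothetical. The -T and -o options are standard dot options for selecting the output format and the output file.

```python
import subprocess
from pathlib import Path


def dot_command(gvz_path, fmt="ps"):
    """Build the dot command line that renders one .gvz file to the given format."""
    source = Path(gvz_path)
    # Replace the trailing .gvz with the output format, e.g. .ps
    target = source.with_suffix("." + fmt)
    return ["dot", f"-T{fmt}", str(source), "-o", str(target)]


def convert(gvz_path, fmt="ps"):
    """Run dot on one file; raises CalledProcessError if dot reports failure."""
    subprocess.run(dot_command(gvz_path, fmt), check=True)


def convert_all(directory=".", fmt="ps"):
    """Convert every .gvz file found directly in the given directory."""
    for path in sorted(Path(directory).glob("*.gvz")):
        convert(path, fmt)


command = dot_command("xxxx.fil_et_comp1.gvz")
```

With the default format, command is the list for "dot -Tps xxxx.fil_et_comp1.gvz -o xxxx.fil_et_comp1.ps". Passing fmt="png" or another format supported by the local GraphViz installation produces that format instead.
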

Please send bug reports, questions, and suggestions about EULER to euler-help@cs.ucsd.edu.

For information on the GridPort toolkit used to build the EULER Portal, contact Jerry Greenberg (jpg@sdsc.edu).