Human-mouse-rat alignments
Glenn Tesler
March 16, 2004
This is the data for
G. Bourque, P. Pevzner, G. Tesler,
Reconstructing the Genomic Architecture of
Ancestral Mammals: Lessons from Human, Mouse, and Rat Genomes.
Genome Research, in press, 2004.
Please cite the above if you use this data.
If you use the PatternHunter anchors, please also cite
B. Ma, J. Tromp, M. Li, PatternHunter: Faster And More Sensitive Homology
Search. Bioinformatics. 18(3):440-445, 2002.
Assembly versions
This data was analyzed in May and August 2003.
Due to logistical issues (synchronizing data sets with 200+ other
people in the Rat Genome Analysis Consortium)
and publication delays,
the assembly versions below are the ones that we used.
Enough time has now passed that newer assemblies are available,
but we are not redoing our analysis on newer assemblies at this time.
| Organism | Assembly name and UCSC source | Chromosomes |
| Human | UCSC hg15 (NCBI build 33) | 1-22, X, Y; M ignored |
| Mouse | UCSC mm3 (NCBI build 30) | 1-19, X |
| Rat | UCSC rn3 (Baylor HGSC v. 3.1) | 1-20, X; Un ignored |
Downloadable data
| Species | PatternHunter alignment coordinates | GRIMM-Anchors uncompressed anchors | GRIMM-Anchors compressed anchors | GRIMM-Synteny 300 kb blocks | GRIMM-Synteny 1 Mb blocks |
| human-mouse | gzip (40,315,854 bytes) | gzip (11,612,600 bytes) | gzip (1,949,496 bytes) | html, text | html, text |
| human-rat | gzip (40,793,358 bytes) | gzip (10,828,461 bytes) | gzip (1,791,448 bytes) | html, text | html, text |
| mouse-rat | gzip (272,624,055 bytes) | gzip (24,567,603 bytes) | gzip (708,977 bytes) | html, text | html, text |
| human-mouse-rat | not applicable | gzip (6,828,062 bytes) | gzip (1,041,309 bytes) | html, text | html, text |
The formats of the above files are described below.
In addition, other members of the Rat Genome Sequencing Consortium
have made related data for the same assemblies available at
http://www.genboree.org.
PatternHunter alignments
We are not distributing the raw alignments. I processed them into
coordinate files, which are described below.
The assemblies were first masked at UCSC with RepeatMasker and
TandemRepeatFinder.
Bin Ma ran PatternHunter to align the non-masked portions of the
sequences. The coordinates of the regions that were masked
are available at UCSC within these directories:
http://genome.ucsc.edu/goldenPath/10april2003/bigZips/
http://genome.ucsc.edu/goldenPath/mmFeb2003/bigZips/
http://genome.ucsc.edu/goldenPath/rnJun2003/bigZips/
PatternHunter does base-by-base alignments of 2 sequences.
It was run with each combination of two species, and also
with each species vs. itself. (I did some processing of
the latter for the purpose of analyzing repeats, but we
will not be using it in the present set of papers.)
I parsed out the coordinates from the PatternHunter output and put them
into my own format. Even the coordinate files without the
detailed alignments are still very large.
If you have used PatternHunter directly, please note that the
format below is my own, designed for use with GRIMM-Synteny,
and differs from PatternHunter's output format.
Each line has these fields:
ID chr1 start1 end1 chr2 start2 end2 sign score evalue
for example:
1 1 28574062 28575652 X 54629503 54631092 - 1367 0.0
2 1 28574062 28575652 X 54726654 54728243 + 1355 0.0
3 1 243490236 243492219 X 15479260 15481161 - 1017 0.0
- ID
  Due to concatenating a large number of separate PatternHunter output
  files and then filtering them, PatternHunter's ID number
  isn't unique or useful any more.
- genome number
  human-mouse: human=1, mouse=2
  human-rat: human=1, rat=2
  mouse-rat: mouse=1, rat=2
  For example, in the file human-mouse.gz, (chr1,start1,end1) refer to human and
  (chr2,start2,end2) refer to mouse.
- chr1, chr2
  The chromosomes are denoted by their name as a string.
- start1, end1, start2, end2
  The start/end are 0-based on the + strand.
  The intervals are half-closed, half-open: [start,end).
- sign
  The sign is '+' or '-'.
  Negative means to take the string of bases from the interval as indicated
  on the + strands of both genomes/chromosomes, but then reverse complement
  one of those strings to align it to the other. See more on this
  below.
- score
  Higher scores are better alignments.
- evalue
  Lower evalues are better alignments.
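As an illustration of the field layout above, here is a minimal Python sketch that parses one line of a coordinate file. It assumes the fields are separated by whitespace, as in the example lines, and it stores score and evalue as floats; neither assumption is confirmed by the format description. The names Alignment and parse_alignment are made up for this example.

```python
from typing import NamedTuple


class Alignment(NamedTuple):
    """One line of a coordinate file (hypothetical container name)."""
    id: str        # not unique; kept only for completeness
    chr1: str      # chromosome names are strings ("1", "X", ...)
    start1: int    # 0-based, + strand, half-open [start, end)
    end1: int
    chr2: str
    start2: int
    end2: int
    sign: str      # '+' or '-'
    score: float   # assumption: parsed as float
    evalue: float


def parse_alignment(line: str) -> Alignment:
    """Split a whitespace-separated line into its 10 fields."""
    fields = line.split()
    if len(fields) != 10:
        raise ValueError(f"expected 10 fields, got {len(fields)}")
    id_, chr1, s1, e1, chr2, s2, e2, sign, score, evalue = fields
    if sign not in ("+", "-"):
        raise ValueError(f"bad sign: {sign!r}")
    return Alignment(id_, chr1, int(s1), int(e1), chr2, int(s2), int(e2),
                     sign, float(score), float(evalue))


aln = parse_alignment("1 1 28574062 28575652 X 54629503 54631092 - 1367 0.0")
# Half-open intervals: the length is simply end - start.
print(aln.chr1, aln.chr2, aln.end1 - aln.start1, aln.sign)
```

The gzip files can be read line by line with Python's gzip.open(path, "rt") and passed through a function like this one.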
Note on signs
The coordinates are 0-based on the positive strand. If the sign
field is negative, it indicates that you take the sequence from the
positive strand at the specified coordinates, and then reverse complement
it:
position   0  1  2  3  4  5  6  7  8  9
+ strand  [A] C  T  A  G  C  C  A  T  G
- strand  [T] G  A  T  C  G  G  T  A  C
so position 0 with a sign + is "A" in the leftmost column, + strand row,
and position 0 with a sign - is "T" in the leftmost column - strand row.
I've marked that position with brackets.
I spell this out because it is not the only convention in use.
I've also seen this:
position   0  1  2  3  4  5  6  7  8  9
+ strand  [A] C  T  A  G  C  C  A  T  G
- strand  [T] G  A  T  C  G  G  T  A  C
position   9  8  7  6  5  4  3  2  1  0
where that position is identified as 0 on the + strand and 9 on the -
strand.
I've also seen variations where one or both strands are 1-based,
the strands are indicated by positive/negative coordinates, etc.
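The convention used in these files can be sketched in a few lines of Python, using the 10-base toy sequence from the diagram. The function names are made up for this example, and the sketch handles only the bases A, C, G, T (in either case).

```python
# Complement table for plain A/C/G/T bases (upper and lower case).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Complement every base, then reverse the string."""
    return seq.translate(_COMPLEMENT)[::-1]


def oriented_bases(plus_strand: str, start: int, end: int, sign: str) -> str:
    """Bases for the 0-based, half-open interval [start, end).

    The coordinates always refer to the + strand. A '-' sign means:
    take the + strand bases first, then reverse complement them.
    """
    bases = plus_strand[start:end]
    if sign == "+":
        return bases
    if sign == "-":
        return reverse_complement(bases)
    raise ValueError(f"bad sign: {sign!r}")


toy = "ACTAGCCATG"  # the + strand from the diagram above
print(oriented_bases(toy, 0, 1, "+"))  # A
print(oriented_bases(toy, 0, 1, "-"))  # T
print(oriented_bases(toy, 0, 4, "-"))  # TAGT
```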
GRIMM-Anchors coordinate files
PatternHunter's alignment coordinates were heavily processed
to produce 2-way and 3-way anchors. These come in "uncompressed"
and (lossy) "compressed" versions. The compression is appropriate for
the purposes of GRIMM-Synteny; if it is not appropriate for your
purposes, use the uncompressed version. The compression method is this:
Sort the anchors in genomic order for every species.
Consecutive anchors in the first species are combined together if
- in all other species they are consecutive in the exact same order
as the first species or the exact reverse order (with all signs
inverted as well)
- and the anchors are close in all species (sum over species of gap
between two consecutive anchors is < 50000
in 2-species comparisons or < 75000 in 3-species comparisons).
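The rule above can be sketched in Python for the 2-species case. This is not the GRIMM-Anchors code, only an illustration under several assumptions of mine: two anchors are joined only if they share chromosomes and sign, "same order" is taken to correspond to sign '+' and "reverse order" to sign '-', overlapping anchors count as a gap of 0, and a merged anchor spans the union of its members. The names Anchor and compress are made up for this example.

```python
from typing import List, NamedTuple


class Anchor(NamedTuple):
    chr1: str
    start1: int
    end1: int
    chr2: str
    start2: int
    end2: int
    sign: str  # '+' or '-'


def _can_join(a: Anchor, b: Anchor, rank_a: int, rank_b: int, max_gap_sum: int) -> bool:
    """a and b are consecutive in genome 1; check genome 2 order and gaps."""
    if a.chr1 != b.chr1 or a.chr2 != b.chr2 or a.sign != b.sign:
        return False
    step = 1 if a.sign == "+" else -1   # same order for '+', reverse for '-'
    if rank_b - rank_a != step:         # must also be consecutive in genome 2
        return False
    gap1 = max(0, b.start1 - a.end1)
    gap2 = max(0, b.start2 - a.end2) if step == 1 else max(0, a.start2 - b.end2)
    return gap1 + gap2 < max_gap_sum


def compress(anchors: List[Anchor], max_gap_sum: int = 50000) -> List[Anchor]:
    """Greedily merge runs of joinable anchors (2-species sketch)."""
    by1 = sorted(anchors, key=lambda a: (a.chr1, a.start1))
    # rank2[i] = position of anchor by1[i] when sorted along genome 2
    order2 = sorted(range(len(by1)), key=lambda i: (by1[i].chr2, by1[i].start2))
    rank2 = [0] * len(by1)
    for rank, i in enumerate(order2):
        rank2[i] = rank
    merged: List[Anchor] = []
    for i, b in enumerate(by1):
        if merged and _can_join(by1[i - 1], b, rank2[i - 1], rank2[i], max_gap_sum):
            m = merged[-1]
            merged[-1] = Anchor(m.chr1, min(m.start1, b.start1), max(m.end1, b.end1),
                                m.chr2, min(m.start2, b.start2), max(m.end2, b.end2),
                                m.sign)
        else:
            merged.append(b)
    return merged
```

For a 3-species version the same check would be repeated for each additional species, with the gap summed over all three and compared against 75000.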
In the 2-species version, each line has these fields:
ID chr1 start1 end1 chr2 start2 end2 sign
In the 3-species version:
ID chr1 start1 end1 sign1 chr2 start2 end2 sign2 chr3 start3 end3 sign3
- ID
  Anchors are sorted in order by first genome and numbered
  consecutively.
- genome number
  human-mouse: human=1, mouse=2
  human-rat: human=1, rat=2
  mouse-rat: mouse=1, rat=2
  human-mouse-rat: human=1, mouse=2, rat=3
- chr1, chr2, chr3
  The chromosomes are denoted by their name as a string.
- start1, end1, start2, end2, start3, end3
  The start/end are 0-based on the + strand.
  The intervals are half-closed, half-open: [start,end).
- sign (2 species)
  The sign is '+' or '-'.
  Negative means to take the string of bases from the interval as indicated
  on the + strands of both genomes/chromosomes, but then reverse complement
  one of those strings to align it to the other. See more on this
  above.
- sign1, sign2, sign3 (3 species)
  Each sign is '+' or '-'.
  Positive means to take the string of bases from the interval as indicated
  on the + strand of the appropriate genome.
  Negative means to take from the positive positions for that genome
  and then reverse complement the string.
  See more on this
  above.
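Since the 2-species and 3-species anchor files have different layouts, a reader can tell them apart by field count (8 versus 13). The Python sketch below does this, assuming whitespace-separated fields. To give both layouts the same shape it records the single 2-species sign on genome 2 and treats genome 1 as '+'; that normalization is my choice for the example, not part of the file format. The names Placement, AnchorRecord and parse_anchor, and the sample lines, are hypothetical.

```python
from typing import List, NamedTuple


class Placement(NamedTuple):
    chrom: str
    start: int   # 0-based, + strand, half-open [start, end)
    end: int
    sign: str    # '+' or '-'


class AnchorRecord(NamedTuple):
    anchor_id: str
    placements: List[Placement]   # one entry per genome, in file order


def parse_anchor(line: str) -> AnchorRecord:
    """Parse a 2-species (8 fields) or 3-species (13 fields) anchor line."""
    f = line.split()
    if len(f) == 8:
        # ID chr1 start1 end1 chr2 start2 end2 sign
        groups = [(f[1], f[2], f[3], "+"), (f[4], f[5], f[6], f[7])]
    elif len(f) == 13:
        # ID, then (chr, start, end, sign) for each of the 3 genomes
        groups = [tuple(f[i:i + 4]) for i in (1, 5, 9)]
    else:
        raise ValueError(f"expected 8 or 13 fields, got {len(f)}")
    placements = []
    for chrom, start, end, sign in groups:
        if sign not in ("+", "-"):
            raise ValueError(f"bad sign: {sign!r}")
        placements.append(Placement(chrom, int(start), int(end), sign))
    return AnchorRecord(f[0], placements)


# Made-up sample lines, for illustration only.
print(parse_anchor("7 1 100 200 X 300 400 -"))
print(parse_anchor("7 1 100 200 + 4 300 400 - 5 500 600 +"))
```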
GRIMM-Synteny coordinate files
Next, there are several synteny block files. There are 2-way comparisons in
which only PatternHunter alignments between those two species were used,
and the 3rd species played no role whatsoever. There are also 3-way
comparisons.
The file blocks_hmr_1000000 is human-mouse-rat blocks at 1 Mb resolution.
The genome #'s in this file can be derived from the filename
(order h,m,r: genome 1 = human, genome 2 = mouse, genome 3 = rat).
Similarly, in hr_1000000, genome 1 = human and genome 2 = rat.
And so on.
Resolution x (300000 or 1000000) means
that the length of every block in the first genome is >= x
and the GRIMM-Synteny "gap threshold" is x (in the 2-species files)
or 1.5*x (in the 3-species files).
We may change the definitions and parameters in future runs.
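The two parameters implied by a resolution can be written out as a small Python sketch. The function names are made up for this example. The length check assumes the block start and end form a closed interval, as stated in the column descriptions for these files, so the length is end - start + 1.

```python
def gap_threshold(resolution: int, n_species: int) -> float:
    """Gap threshold for a given resolution x: x for 2 species, 1.5*x for 3."""
    if n_species == 2:
        return float(resolution)
    if n_species == 3:
        return 1.5 * resolution
    raise ValueError("only 2-species and 3-species files are described")


def block_meets_resolution(start: int, end: int, resolution: int) -> bool:
    """Check the genome 1 length of a block given as a closed interval."""
    return end - start + 1 >= resolution


print(gap_threshold(300000, 2))    # 300000.0
print(gap_threshold(1000000, 3))   # 1500000.0
```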
The first line of the file is "# " plus tab-separated names of the columns.
Each subsequent line is a 2 or 3-way synteny block.
The files are sorted in order by chromosome and then starting position
in genome 1.
- cluster_number
  The ID assigned to that synteny block.
- org1.chrom, org1.start, org1.end, org1.sign
  The chromosome # (as described above),
  start and end (as closed interval),
  and sign (1 or -1), for whatever genome 1 is in that file.
- org1.pos
  An integer sequence 1,2,... giving the order of the
  synteny blocks on chromosome org1.chrom of genome 1.
- org2.*, org3.*
  Similar.
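Because the first line names the columns, a block file can be read without hard-coding the column order. The Python sketch below does this. It assumes the header names are tab-separated after the leading "# " (as described above) and that data fields are separated by whitespace, which the description does not state. It also shows how a closed interval maps to the half-open form used by the anchor files. The function names and the sample data, including its column order, are made up for this example.

```python
import io
from typing import Dict, Iterator, TextIO, Tuple


def read_blocks(handle: TextIO) -> Iterator[Dict[str, str]]:
    """Yield one dict per synteny block, keyed by the header's column names."""
    header = handle.readline().rstrip("\n")
    if not header.startswith("# "):
        raise ValueError("first line should be '# ' plus column names")
    columns = header[2:].split("\t")
    for line in handle:
        line = line.strip()
        if not line:
            continue
        values = line.split()   # assumption: whitespace-separated fields
        if len(values) != len(columns):
            raise ValueError(f"expected {len(columns)} fields, got {len(values)}")
        yield dict(zip(columns, values))


def half_open_interval(block: Dict[str, str], org: str = "org1") -> Tuple[int, int]:
    """Convert the closed interval [start, end] to half-open [start, end + 1)."""
    return int(block[f"{org}.start"]), int(block[f"{org}.end"]) + 1


# Made-up sample with a hypothetical column order, for illustration only.
sample = (
    "# cluster_number\torg1.chrom\torg1.start\torg1.end\torg1.sign\torg1.pos\n"
    "1\t1\t1000\t1999\t-1\t1\n"
)
for block in read_blocks(io.StringIO(sample)):
    print(block["cluster_number"], block["org1.chrom"], half_open_interval(block))
```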