Human-mouse-rat alignments

Glenn Tesler
March 16, 2004

This is the data for
G. Bourque, P. Pevzner, G. Tesler, Reconstructing the Genomic Architecture of Ancestral Mammals: Lessons from Human, Mouse, and Rat Genomes. Genome Research, in press, 2004.
Please cite the above if you use this data. If you use the PatternHunter anchors, please also cite
B. Ma, J. Tromp, M. Li, PatternHunter: Faster And More Sensitive Homology Search. Bioinformatics. 18(3):440-445, 2002.

Assembly versions

This data was analyzed in May and August 2003. Due to logistical issues (synchronizing data sets with 200+ other people in the Rat Genome Analysis Consortium) and publication delays, the assembly versions below are the ones that we used. Enough time has now passed that newer assemblies are available, but we are not redoing our analysis on newer assemblies at this time.
Organism   Assembly name and UCSC source    Chromosomes
Human      UCSC hg15 (NCBI build 33)        1-22, X, Y; M ignored
Mouse      UCSC mm3 (NCBI build 30)         1-19, X
Rat        UCSC rn3 (Baylor HGSC v. 3.1)    1-20, X; Un ignored

Downloadable data

Comparison        PatternHunter             GRIMM-Anchors            GRIMM-Anchors           GRIMM-Synteny   GRIMM-Synteny
                  alignment coordinates     uncompressed anchors     compressed anchors      300 kb blocks   1 Mb blocks
human-mouse       gzip (40,315,854 bytes)   gzip (11,612,600 bytes)  gzip (1,949,496 bytes)  html, text      html, text
human-rat         gzip (40,793,358 bytes)   gzip (10,828,461 bytes)  gzip (1,791,448 bytes)  html, text      html, text
mouse-rat         gzip (272,624,055 bytes)  gzip (24,567,603 bytes)  gzip (708,977 bytes)    html, text      html, text
human-mouse-rat   not applicable            gzip (6,828,062 bytes)   gzip (1,041,309 bytes)  html, text      html, text

The formats of the above files are described below. In addition, other members of the Rat Genome Sequencing Consortium have made related data for the same assemblies available at http://www.genboree.org.

PatternHunter alignments

We are not distributing the raw alignments. I processed them into coordinate files, which are described below.

The assemblies were first masked at UCSC with RepeatMasker and TandemRepeatFinder. Bin Ma ran PatternHunter to align the non-masked portions of the sequences. The coordinates of the regions that were masked are available at UCSC within these directories:

http://genome.ucsc.edu/goldenPath/10april2003/bigZips/
http://genome.ucsc.edu/goldenPath/mmFeb2003/bigZips/
http://genome.ucsc.edu/goldenPath/rnJun2003/bigZips/

PatternHunter does base-by-base alignments of 2 sequences. It was run with each combination of two species, and also with each species vs. itself. (I did some processing of the latter for the purpose of analyzing repeats, but we will not be using it in the present set of papers.)

I parsed out the coordinates from the PatternHunter output and put them into my own format. Even the coordinate files without the detailed alignments are still very large.

If you have used PatternHunter directly, please note that the format below is my own format, useful to me for GRIMM-Synteny, and is different from PatternHunter's.

Each line has these fields:

	ID chr1 start1 end1 chr2 start2 end2 sign score evalue

for example:

	1 1 28574062 28575652 X 54629503 54631092 - 1367 0.0
	2 1 28574062 28575652 X 54726654 54728243 + 1355 0.0
	3 1 243490236 243492219 X 15479260 15481161 - 1017 0.0

ID
  Due to concatenating a large number of separate PatternHunter output files and then filtering them, PatternHunter's ID number isn't unique or useful any more.
genome number
  human-mouse: human=1, mouse=2
  human-rat: human=1, rat=2
  mouse-rat: mouse=1, rat=2
  For example, in the file human-mouse.gz, (chr1,start1,end1) refer to human and (chr2,start2,end2) refer to mouse.
chr1, chr2
  The chromosomes are denoted by their name as a string.
start1, end1, start2, end2
  The start/end are 0-based on the + strand. The intervals are half-closed, half-open: [start,end).
sign
  The sign is '+' or '-'. Negative means to take the string of bases from the interval as indicated on the + strands of both genomes/chromosomes, but then reverse complement one of those strings to align it to the other. See more on this below.
score
  Higher scores are better alignments.
evalue
  Lower evalues are better alignments.
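
As an illustration of the field layout above, here is a minimal Python sketch for reading one of these coordinate files. It is not part of the distributed tools. The names Alignment, parse_alignment and read_alignments are made up for this example, and it assumes the fields are separated by whitespace and that score and evalue can be read as floating-point numbers.

```python
import gzip
from typing import Iterator, NamedTuple


class Alignment(NamedTuple):
    """One line of a PatternHunter coordinate file (hypothetical container)."""
    id: str        # kept as text; the ID is not unique or meaningful
    chr1: str
    start1: int    # 0-based, + strand
    end1: int      # exclusive: the interval is [start1, end1)
    chr2: str
    start2: int
    end2: int
    sign: str      # '+' or '-'
    score: float   # higher is better
    evalue: float  # lower is better


def parse_alignment(line: str) -> Alignment:
    """Parse 'ID chr1 start1 end1 chr2 start2 end2 sign score evalue'."""
    f = line.split()
    if len(f) != 10:
        raise ValueError(f"expected 10 fields, got {len(f)}: {line!r}")
    if f[7] not in ("+", "-"):
        raise ValueError(f"bad sign {f[7]!r}")
    return Alignment(f[0], f[1], int(f[2]), int(f[3]),
                     f[4], int(f[5]), int(f[6]),
                     f[7], float(f[8]), float(f[9]))


def read_alignments(path: str) -> Iterator[Alignment]:
    """Stream alignments from a gzip-compressed coordinate file."""
    with gzip.open(path, "rt") as handle:
        for line in handle:
            if line.strip():
                yield parse_alignment(line)


# The first example line from above: human chr1 against mouse chrX.
a = parse_alignment("1 1 28574062 28575652 X 54629503 54631092 - 1367 0.0")
print(a.chr1, a.end1 - a.start1, a.chr2, a.end2 - a.start2, a.sign)
```

Because the intervals are half-open, end - start gives the length of each side directly. Streaming line by line keeps memory use low, which matters for files of this size.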

Note on signs

The coordinates are 0-based on the positive strand. If the sign field is negative, it indicates that you take the sequence from the positive strand at the specified coordinates, and then reverse complement it:
   position   0  1  2  3  4  5  6  7  8  9
   + strand  [A] C  T  A  G  C  C  A  T  G
   - strand  [T] G  A  T  C  G  G  T  A  C
So position 0 with sign + is "A" (leftmost column, + strand row), and position 0 with sign - is "T" (leftmost column, - strand row). I've marked that position with brackets.

In case that seems like the only possible convention, the reason I'm spelling it out is that I've also seen this:

   position   0  1  2  3  4  5  6  7  8  9
   + strand  [A] C  T  A  G  C  C  A  T  G
   - strand  [T] G  A  T  C  G  G  T  A  C
   position   9  8  7  6  5  4  3  2  1  0

where that position is identified as 0 on the + strand and 9 on the - strand. I've also seen variations where one or both strands are 1-based, the strands are indicated by positive/negative coordinates, etc.
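
To make the convention concrete, here is a small Python sketch using the ten-base example above. The function names (reverse_complement, aligned_bases, to_reversed_numbering) are invented for this illustration. It assumes the chromosome's + strand is available as a plain string.

```python
# Complement table for DNA bases; N stays N.
_COMPLEMENT = str.maketrans("ACGTacgtNn", "TGCAtgcaNn")


def reverse_complement(seq: str) -> str:
    """Complement every base, then reverse the string."""
    return seq.translate(_COMPLEMENT)[::-1]


def aligned_bases(plus_strand: str, start: int, end: int, sign: str) -> str:
    """Bases for [start, end) under the convention used in these files.

    Coordinates always index the + strand; a '-' sign only means the
    extracted string is reverse complemented afterwards.
    """
    segment = plus_strand[start:end]
    return reverse_complement(segment) if sign == "-" else segment


def to_reversed_numbering(pos: int, length: int) -> int:
    """Convert a 0-based + strand position to the alternative convention
    in which the - strand is numbered from its own start."""
    return length - 1 - pos


plus = "ACTAGCCATG"
print(aligned_bases(plus, 0, 1, "+"))       # the bracketed A
print(aligned_bases(plus, 0, 1, "-"))       # the bracketed T
print(to_reversed_numbering(0, len(plus)))  # 9 in the other convention
```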

GRIMM-Anchors coordinate files

PatternHunter's alignment coordinates were heavily processed to produce 2-way and 3-way anchors. These come in "uncompressed" and (lossy) "compressed" versions. The compression is appropriate for the purposes of GRIMM-Synteny; if it is not appropriate for your purposes, use the uncompressed version.

The compression method is this: Sort the anchors in order for every species. Consecutive anchors in the first species are combined together if
  1. in all other species they are consecutive in the exact same order as the first species or the exact reverse order (with all signs inverted as well)
  2. and the anchors are close in all species (sum over species of gap between two consecutive anchors is < 50000 in 2-species comparisons or < 75000 in 3-species comparisons).
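
The two conditions above can be sketched in Python for the 2-species case. This is only an illustration under stated assumptions, not the GRIMM-Anchors code. The assumptions are: the two anchors must share chromosomes and sign; sign '+' requires the same order in species 2 and sign '-' the reverse order; the gap is measured between the facing ends of the two anchors, with overlaps counted as zero; and a combined anchor spans the bounding interval of its parts. The names Anchor and compress are hypothetical.

```python
from typing import List, NamedTuple


class Anchor(NamedTuple):
    chr1: str
    start1: int
    end1: int
    chr2: str
    start2: int
    end2: int
    sign: str  # '+' or '-'


def compress(anchors: List[Anchor], max_gap: int = 50000) -> List[Anchor]:
    """Merge runs of consecutive, collinear, nearby anchors (2 species)."""
    # Order in species 1, and each anchor's rank in species 2.
    by1 = sorted(anchors, key=lambda a: (a.chr1, a.start1))
    order2 = sorted(range(len(by1)),
                    key=lambda i: (by1[i].chr2, by1[i].start2))
    rank2 = [0] * len(by1)
    for rank, i in enumerate(order2):
        rank2[i] = rank

    def mergeable(i: int, j: int) -> bool:
        a, b = by1[i], by1[j]
        if (a.chr1, a.chr2, a.sign) != (b.chr1, b.chr2, b.sign):
            return False
        step = 1 if a.sign == "+" else -1  # same or reverse order
        if rank2[j] - rank2[i] != step:
            return False                   # not consecutive in species 2
        gap1 = max(0, b.start1 - a.end1)
        if step == 1:
            gap2 = max(0, b.start2 - a.end2)
        else:
            gap2 = max(0, a.start2 - b.end2)
        return gap1 + gap2 < max_gap       # sum of gaps over both species

    merged: List[Anchor] = []
    for j, b in enumerate(by1):
        if merged and mergeable(j - 1, j):
            m = merged[-1]
            merged[-1] = Anchor(m.chr1, min(m.start1, b.start1),
                                max(m.end1, b.end1), m.chr2,
                                min(m.start2, b.start2),
                                max(m.end2, b.end2), m.sign)
        else:
            merged.append(b)
    return merged


example = [
    Anchor("1", 100, 200, "X", 1000, 1100, "+"),
    Anchor("1", 300, 400, "X", 1200, 1300, "+"),
    Anchor("1", 100000, 100100, "X", 5000, 5100, "-"),
    Anchor("1", 100200, 100300, "X", 4800, 4900, "-"),
    Anchor("2", 0, 100, "X", 9000, 9100, "+"),
]
for anchor in compress(example):
    print(anchor)
```

In the example, the first two anchors merge in forward order, the next two merge in reverse order, and the last one stays alone because it is on a different chromosome. A 3-species version would apply the same test to every other species and use the 75000 threshold.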

In the 2-species version, each line has these fields:

	ID chr1 start1 end1 chr2 start2 end2 sign
In the 3-species version:
	ID chr1 start1 end1 sign1 chr2 start2 end2 sign2 chr3 start3 end3 sign3

ID
  Anchors are sorted in order by first genome and numbered consecutively.
genome number
  human-mouse: human=1, mouse=2
  human-rat: human=1, rat=2
  mouse-rat: mouse=1, rat=2
  human-mouse-rat: human=1, mouse=2, rat=3
chr1, chr2, chr3
  The chromosomes are denoted by their name as a string.
start1, end1, start2, end2, start3, end3
  The start/end are 0-based on the + strand. The intervals are half-closed, half-open: [start,end).
sign (2 species)
  The sign is '+' or '-'. Negative means to take the string of bases from the interval as indicated on the + strands of both genomes/chromosomes, but then reverse complement one of those strings to align it to the other. See more on this above.
sign1, sign2, sign3 (3 species)
  Each sign is '+' or '-'. Positive means to take the string of bases from the interval as indicated on the + strand of the appropriate genome. Negative means to take from the positive positions for that genome and then reverse complement the string. See more on this above.
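
Since the 2-species and 3-species anchor lines differ only in field count and in where the signs sit, one reader can handle both. The following Python sketch is an illustration only. The function name parse_anchor is invented, the fields are assumed to be whitespace-separated, and the sample lines are made-up values, not taken from the data files.

```python
from typing import List, Tuple

Segment = Tuple[str, int, int]  # (chromosome, start, end), half-open


def parse_anchor(line: str) -> Tuple[str, List[Segment], List[str]]:
    """Return (ID, one segment per genome, signs).

    2-species lines (8 fields) carry a single relative sign.
    3-species lines (13 fields) carry one sign per genome.
    """
    f = line.split()
    if len(f) == 8:
        segments = [(f[1], int(f[2]), int(f[3])),
                    (f[4], int(f[5]), int(f[6]))]
        signs = [f[7]]
    elif len(f) == 13:
        offsets = (1, 5, 9)  # each genome is: chr start end sign
        segments = [(f[i], int(f[i + 1]), int(f[i + 2])) for i in offsets]
        signs = [f[i + 3] for i in offsets]
    else:
        raise ValueError(f"expected 8 or 13 fields, got {len(f)}")
    if any(s not in ("+", "-") for s in signs):
        raise ValueError(f"bad sign in {line!r}")
    return f[0], segments, signs


# Made-up sample lines, for illustration only.
print(parse_anchor("1 1 100 200 X 300 400 -"))
print(parse_anchor("7 1 100 200 + 4 300 400 - 5 500 600 +"))
```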

GRIMM-Synteny coordinate files

Next, there are several synteny block files. There are 2-way comparisons in which only PatternHunter alignments between those two species were used, and the 3rd species played no role whatsoever. There are also 3-way comparisons.

The file blocks_hmr_1000000 is human-mouse-rat blocks at 1 Mb resolution. The genome #'s in this file can be derived from the filename (order h,m,r: genome 1 = human, genome 2 = mouse, genome 3 = rat). Similarly, in hr_1000000, genome 1 = human and genome 2 = rat. And so on.

Resolution x (300000 or 1000000) means that the length of every block in the first genome is >= x and the GRIMM-Synteny "gap threshold" is x (in the 2-species files) or 1.5*x (in the 3-species files). We may change the definitions and parameters in future runs.
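
The naming rule and the resolution rule can be written down as a short Python sketch. It is an illustration only: the function describe_block_file is hypothetical, and it assumes file names look like the two mentioned above (an optional "blocks_" prefix, then species letters, then the resolution), which may not cover every file.

```python
import re
from typing import Dict, Tuple

SPECIES = {"h": "human", "m": "mouse", "r": "rat"}


def describe_block_file(name: str) -> Tuple[Dict[int, str], int, float]:
    """Return (genome numbering, minimum block length, gap threshold)."""
    m = re.fullmatch(r"(?:blocks_)?([hmr]{2,3})_(\d+)", name)
    if m is None:
        raise ValueError(f"unrecognized block file name: {name!r}")
    letters, resolution = m.group(1), int(m.group(2))
    # Genome numbers follow the order of the letters in the name.
    genomes = {i + 1: SPECIES[c] for i, c in enumerate(letters)}
    # Gap threshold is x for 2 species and 1.5*x for 3 species.
    factor = 1.0 if len(letters) == 2 else 1.5
    return genomes, resolution, resolution * factor


print(describe_block_file("blocks_hmr_1000000"))
print(describe_block_file("hr_1000000"))
```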

The first line of the file is "# " plus tab-separated names of the columns. Each subsequent line is a 2 or 3-way synteny block. The files are sorted in order by chromosome and then starting position in genome 1.

cluster_number
  The ID assigned to that synteny block.
org1.chrom, org1.start, org1.end, org1.sign
  The chromosome # (as described above), start and end (as a closed interval), and sign (1 or -1), for whatever genome 1 is in that file.
org1.pos
  An integer sequence 1,2,... giving the order of the synteny blocks on chromosome org1.chrom of genome 1.
org2.*, org3.*
  The same fields for genome 2 and, in the 3-way files, genome 3.
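
A block file in the text format can be read by taking the column names from the "# " header line. The Python sketch below is an illustration under assumptions: the data lines are taken to be whitespace-separated like the header, the names read_blocks and block_length are invented, and the sample text is made up. Note that block intervals are closed, unlike the half-open intervals in the anchor files, so the length is end - start + 1.

```python
import io
from typing import Dict, Iterator, TextIO, Union

Row = Dict[str, Union[str, int]]
INT_SUFFIXES = (".start", ".end", ".sign", ".pos")


def read_blocks(handle: TextIO) -> Iterator[Row]:
    """Yield one dict per synteny block, keyed by the header's column names."""
    columns = None
    for line in handle:
        if not line.strip():
            continue
        if columns is None:
            if not line.startswith("#"):
                raise ValueError("first line must be the '# ' header")
            columns = line.lstrip("#").split()
            continue
        values = line.split()
        if len(values) != len(columns):
            raise ValueError(f"expected {len(columns)} fields: {line!r}")
        row: Row = dict(zip(columns, values))
        for key in columns:
            if key.endswith(INT_SUFFIXES):
                row[key] = int(row[key])  # chromosome names stay strings
        yield row


def block_length(row: Row, org: str = "org1") -> int:
    """Length of a block in one genome; start and end are both inclusive."""
    return row[f"{org}.end"] - row[f"{org}.start"] + 1


# Made-up two-line file, for illustration only.
sample = ("# cluster_number\torg1.chrom\torg1.start\torg1.end"
          "\torg1.sign\torg1.pos\n"
          "12\tX\t100\t199\t-1\t1\n")
for block in read_blocks(io.StringIO(sample)):
    print(block["org1.chrom"], block["org1.sign"], block_length(block))
```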