Methods
To bring the Y Chromosome Consortium (2002) tree up to date, we used three different approaches to map recently discovered mutations on the Y chromosomal binary haplogroup tree. Many of these mutations came from the laboratories of Michael Hammer and Peter Underhill ("P" and "M" mutations, respectively). See Supplemental Table 1 for a list of all markers tested. Markers numbered P45–P122 were discovered in the course of various resequencing projects in the Hammer lab (Hammer et al. 2003; Wilder et al. 2004a,b), while Underhill’s group published markers in the range of M226–M450 in the past 5 yr (Cruciani et al. 2002; Semino et al. 2002, 2004; Cinnioglu et al. 2004; Rootsi et al. 2004, 2007; Shi et al. 2005; Kayser et al. 2006; Regueiro et al. 2006; Sengupta et al. 2006; Hudjashov et al. 2007). Other mutations (e.g., most P mutations from 123–297) were mined from public databases in the following manner. In a recent study aimed at characterizing patterns of common DNA variation in three populations (Hinds et al. 2005), 334 Y-linked SNPs were typed in 33 males (13 European-Americans, 11 African-Americans, and nine Asian-Americans). The set of typed markers included SNPs ascertained during that study, as well as previously reported SNPs, some of which had been mapped onto the Y chromosome tree (Y Chromosome Consortium 2002). Using information from mapped SNPs, we provisionally assigned Y chromosome haplogroups to these 33 samples. For example, according to their allelic state at M9 (rs3900:C→G), it was possible to assign 18 males as belonging and 15 as not belonging to the KT branch of the Y chromosome tree. To confirm and better resolve the position of the newly reported SNPs in the tree, we performed further genotyping (either by direct resequencing or PCR-RFLP). During the process of mapping, several new SNPs were discovered and also mapped. Mutations that define major haplogroups are referred to as "defining" mutations, while those that mark lineages within a major haplogroup are referred to as "internal" mutations. We also tried to incorporate published markers other than P and M on the tree. The major challenge here was the absence of "positive control" samples (i.e., DNA samples known to carry the derived state at the polymorphic site), especially for singletons and very-low-frequency mutations. When more than one marker mapped to the same branch of the tree, we tried to identify the order of mutational events by cross-typing positive control samples for each mutation. Large insertion/deletion or simple repeat mutations were not included in this study.
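The provisional assignment step described above, in which samples are split by their state at an already-mapped SNP, can be illustrated with a minimal sketch. The function name `split_by_marker`, the sample identifiers, and the toy genotype calls are all hypothetical and are not data from the study; the sketch also assumes that G is the derived allele at M9.

```python
from typing import Dict, List, Tuple


def split_by_marker(
    genotypes: Dict[str, Dict[str, str]],
    marker: str,
    derived_allele: str,
) -> Tuple[List[str], List[str]]:
    """Partition samples by whether they carry the derived allele at `marker`.

    `genotypes` maps a sample ID to its allele calls keyed by marker name.
    Samples with no call for `marker` are left unassigned.
    """
    derived: List[str] = []
    ancestral: List[str] = []
    for sample_id, calls in genotypes.items():
        allele = calls.get(marker)
        if allele is None:
            continue  # marker not typed in this sample
        if allele == derived_allele:
            derived.append(sample_id)
        else:
            ancestral.append(sample_id)
    return derived, ancestral


# Hypothetical toy calls (not the 33 samples from the study).
toy_genotypes = {
    "sample_01": {"M9": "G"},
    "sample_02": {"M9": "C"},
    "sample_03": {"M9": "G"},
}

# Assumption: G is the derived state at M9, so carriers fall inside the branch.
in_branch, outside_branch = split_by_marker(toy_genotypes, "M9", derived_allele="G")
```

Applying the same split to each mapped marker in turn narrows every sample to a progressively smaller part of the tree, which is the sense in which the assignment is provisional until confirmed by further genotyping.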
Most of the SNPs reported by Hinds et al. (2005) were discovered (or rediscovered) in the same set of samples. These SNPs have a uniform ascertainment scheme, which makes them useful for estimating the length of particular branches on the tree. However, the number of mutations may be under-represented on some lineages, especially on those branches that are present at low frequency in the ascertainment sample. Because haplogroups R-M269, E-P1, and I-P30 are found at relatively high frequency in the European and African populations used in the ascertainment process (Hinds et al. 2005), we expect the number of SNPs discovered on these lineages to be fairly representative of the relative length of these branches of the tree.
We used a novel method to estimate the relative ages of internal nodes of the tree that relies on a uniform probability distribution for the age of mutations in the ancestry of a lineage. If time is partitioned into k subintervals and n mutational events occur at a constant rate, then the numbers of events in the subintervals, X_1, ..., X_k, follow a multinomial distribution with parameters n and p_1, ..., p_k, where p_i is the length of subinterval i relative to the whole interval. To estimate the age of nodes, we partitioned the time interval between the most recent common ancestor (MRCA) of all lineages in branches C through T (denoted MRCA-CT) and the present into two subintervals: one extending from the MRCA-CT to the internal node whose age is estimated, and a second extending from the internal node to the present. With two subintervals, the number of mutations in each follows a binomial distribution with parameter p_i equal to the relative length of that subinterval. The relative length of the second subinterval is multiplied by the assumed age of the MRCA-CT to obtain the age in years of the internal node.
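As a rough illustration of the two-subinterval case, the sketch below computes a point estimate of a node's age. It assumes that the relative length of the recent subinterval is estimated by the observed fraction m/n of mutations falling between the node and the present (the maximum-likelihood estimate under the binomial model); the text does not state the estimator explicitly, so this is an assumption. The function names and the MRCA-CT age used in the example are placeholders, not values from the study.

```python
from math import comb


def binomial_pmf(m: int, n: int, p: float) -> float:
    """Probability that m of n mutations fall in a subinterval of relative length p."""
    return comb(n, m) * p**m * (1.0 - p) ** (n - m)


def node_age_estimate(m: int, n: int, mrca_age_years: float) -> float:
    """Point estimate of the age of an internal node.

    m: mutations observed between the internal node and the present
    n: total mutations observed between the MRCA and the present
    mrca_age_years: assumed age of the MRCA (an input, not estimated here)

    Assumption: the relative length of the recent subinterval is estimated
    as m / n, which maximizes binomial_pmf(m, n, p) over p.
    """
    if n <= 0 or not 0 <= m <= n:
        raise ValueError("require n > 0 and 0 <= m <= n")
    relative_length = m / n
    return relative_length * mrca_age_years


# Placeholder numbers for illustration only.
example_age = node_age_estimate(m=25, n=100, mrca_age_years=60_000)
```

Because the MRCA age enters only as a multiplier, any uncertainty in that assumed age scales the node age proportionally.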
To establish confidence intervals, we find the range of relative ages of the internal node such that the distribution of the number of mutations in each subinterval is not too extreme (i.e., neither too many mutations from the MRCA-CT to the internal node nor too many from the internal node to the present). We choose the endpoints of the intervals such that the probability of the observed or more extreme distributions of the mutations in the ancestry is equal to or smaller than 0.05. If the total number of mutations in the lineage (N) equals the observed number of mutations (n) and the observed number of mutations occurring between the node and the present is m, we choose p_max and p_min, the maximum and minimum relative lengths of the recent subinterval, respectively, according to:
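The tail conditions described in this paragraph can be explored numerically. The following is a minimal sketch and not the authors' implementation: it assumes an equal-tailed construction in which each binomial tail is set to 0.025 (0.05 in total), which is one reading of the threshold above, and it solves for the endpoints by bisection. The names `interval_for_relative_length` and `tail` are hypothetical.

```python
from math import comb
from typing import Callable, Tuple


def _pmf(k: int, n: int, p: float) -> float:
    return comb(n, k) * p**k * (1.0 - p) ** (n - k)


def _lower_tail(m: int, n: int, p: float) -> float:
    """P(X <= m) for X ~ Binomial(n, p)."""
    return sum(_pmf(k, n, p) for k in range(0, m + 1))


def _upper_tail(m: int, n: int, p: float) -> float:
    """P(X >= m) for X ~ Binomial(n, p)."""
    return sum(_pmf(k, n, p) for k in range(m, n + 1))


def _root(f: Callable[[float], float], iters: int = 100) -> float:
    """Bisection on [0, 1] for a monotone function that changes sign."""
    lo, hi = 0.0, 1.0
    f_lo = f(lo)
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        f_mid = f(mid)
        if (f_mid > 0) == (f_lo > 0):
            lo, f_lo = mid, f_mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def interval_for_relative_length(m: int, n: int, tail: float = 0.025) -> Tuple[float, float]:
    """Return (p_min, p_max) for the relative length of the recent subinterval.

    p_min: smallest p at which seeing m or more recent mutations has probability `tail`.
    p_max: largest p at which seeing m or fewer recent mutations has probability `tail`.
    Assumption: `tail` = 0.025 per side; the source states only an overall 0.05.
    """
    if n <= 0 or not 0 <= m <= n:
        raise ValueError("require n > 0 and 0 <= m <= n")
    # With m == 0 the upper tail is 1 for every p, so the lower bound is 0.
    p_min = 0.0 if m == 0 else _root(lambda p: _upper_tail(m, n, p) - tail)
    # With m == n the lower tail is 1 for every p, so the upper bound is 1.
    p_max = 1.0 if m == n else _root(lambda p: _lower_tail(m, n, p) - tail)
    return p_min, p_max
```

Multiplying `p_min` and `p_max` by the assumed age of the MRCA-CT converts the interval on relative length into an interval on the node's age in years. Under the per-tail assumption used here, the construction coincides with the standard exact (Clopper-Pearson) binomial interval.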