
Y-Chromosome Nomenclature System

A Nomenclature System for the Tree of Human Y-Chromosomal Binary Haplogroups

Results and Discussion

NRY Haplogroup Tree and Haplogroup Nomenclature

We constructed a comprehensive haplogroup tree for the human NRY by genotyping most of the known polymorphisms on the NRY in a single set of samples (74 male YCC cell lines). Some polymorphisms known to be variable in other DNAs showed no variation in the YCC panel; therefore, additional samples were included to improve the resolution of the phylogeny. This increased the number of polymorphic sites mapped onto the haplogroup tree to 237. Two mutational events occurred at each of eight sites; however, these recurrent mutations were found on different haplogroup backgrounds and thus were distinguishable events. Counting each recurrence separately, the 237 sites yield 245 mutational events (237 + 8), which gave rise to 153 NRY haplogroups. The single most parsimonious tree for these 153 NRY haplogroups is shown in Figure 1, with mutational events shown along the branches.

The tree was drawn as asymmetrically as possible by sorting the descendants of each interior node so that the bottom-most descendant had the greatest number of immediate descendants. The position of the root in Figure 1 (indicated by an arrow) was determined by outgroup comparisons: whenever possible, homologous regions on the NRY of closely related species (e.g., chimpanzees, gorillas, and orangutans) were sequenced to determine the ancestral states at human polymorphic sites (see Underhill et al. 2000, Hammer et al. 2001). The root of the tree falls between a clade defined by M91 and a clade defined by a set of markers: SRY10831a, M42, M94, and M139. The NRY tree in Figure 1 can be seen as a series of nested monophyletic clades (i.e., sets of lineages related by a shared, derived state at a single site or a set of sites). In order to devise a nomenclature system at a reasonable scale, we assigned a capital letter to several of the major clades, beginning with the letter A (for the haplogroup above the position of the root in Figure 1) and continuing through the alphabet to the letter R. The letter Y was assigned to the most inclusive haplogroup comprising haplogroups A-R. Deciding which clades receive the highest labeling level is necessarily somewhat arbitrary. Here we label with single capital letters those clades that seem to us to represent the major divisions of human NRY diversity. Only 19 letters (A-R and Y) have been assigned to clades, which leaves room for the possible expansion and further resolution of this phylogeny (the implications of which are discussed below).
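The drawing rule described above (the bottom-most descendant of each interior node has the greatest number of immediate descendants) can be illustrated with a minimal sketch. The tree representation, the function name `sort_for_asymmetry`, and the node labels are hypothetical and chosen only for illustration; they do not reproduce the topology of Figure 1.

```python
from typing import Dict, List


def sort_for_asymmetry(children: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Return a copy of the tree with each node's children reordered so that
    the child with the most immediate descendants comes last (bottom-most)."""

    def immediate_descendants(node: str) -> int:
        # Nodes missing from the mapping are tips and have no descendants.
        return len(children.get(node, []))

    # sorted() is stable, so children with equal counts keep their input order.
    return {
        node: sorted(kids, key=immediate_descendants)
        for node, kids in children.items()
    }


# Hypothetical toy tree (labels are NOT real haplogroups).
toy_tree = {
    "root": ["n1", "n2", "n3"],
    "n1": ["n1a", "n1b", "n1c"],
    "n3": ["n3a"],
}

sorted_tree = sort_for_asymmetry(toy_tree)
print(sorted_tree["root"])  # ['n2', 'n3', 'n1']
```

Here `n1` has three immediate descendants, `n3` has one and `n2` has none, so `n1` is placed last.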

We propose here two complementary nomenclatures. The first is hierarchical and uses selected aspects of set theory to enable clades at all levels to be named unambiguously. The capital letters (A-R) used to identify the major clades constitute the front symbols of all subsequent subclades (Figure 1). Unlabeled clades can be named as the 'join' of two subclades; for example, clade CR includes all chromosomes that share the derived state of the M168 and P9 polymorphisms. Note that this is distinct from the set-theoretic 'union', which, in the above example, would not define a monophyletic clade. Lineages that are not defined on the basis of a derived character represent interior nodes of the haplogroup tree and are potentially paraphyletic (i.e., they comprise basal lineages and monophyletic subclades). Thus, we suggest the term "paragroup" rather than haplogroup to describe these lineages. Paragroups are distinguished from haplogroups (i.e., monophyletic groupings) by using the * (star) symbol, which represents chromosomes belonging to a clade but not its subclades. For example, paragroup B* belongs to the B clade; however, it does not fall into haplogroup B1 or B2. As illustrated in Figure 2, internal nodes are highly sensitive to changes in tree topology. Thus, the * symbol cautions that a given paragroup name may refer to different sets of chromosomes in succeeding versions of the phylogeny.
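The difference between the 'join' of two subclades and their set-theoretic union can be sketched as follows. The join is taken here to be the smallest clade that contains both subclades, that is, their most recent common ancestor node. The parent map, the function names, and the labels T, U, V, W, and S are hypothetical and do not correspond to real haplogroups.

```python
from typing import Dict, List, Set


def ancestors(node: str, parent: Dict[str, str]) -> List[str]:
    """Return the path from a node up to the root, starting with the node."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path


def join(a: str, b: str, parent: Dict[str, str]) -> str:
    """Return the smallest clade (deepest common ancestor) containing a and b."""
    seen = set(ancestors(a, parent))
    for node in ancestors(b, parent):
        if node in seen:
            return node
    raise ValueError("nodes do not share an ancestor")


def clade_members(clade: str, parent: Dict[str, str]) -> Set[str]:
    """Return every node that descends from (or is) the given clade."""
    nodes = set(parent) | set(parent.values())
    return {n for n in nodes if clade in ancestors(n, parent)}


# Hypothetical toy tree: Root -> (T, S); T -> (U, V, W).
toy_parent = {"U": "T", "V": "T", "W": "T", "T": "Root", "S": "Root"}

joined = join("U", "W", toy_parent)
union = clade_members("U", toy_parent) | clade_members("W", toy_parent)

print(joined)                                     # T
print("V" in clade_members(joined, toy_parent))   # True
print("V" in union)                               # False
```

The join of U and W is the clade T, which also contains V; the union of U and W omits V and therefore does not form a monophyletic group.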

Subclades nested within each major haplogroup defined by a capital letter are named using an alternating alphanumeric system. For example, within haplogroup E, there are three basal haplogroups that are named E1, E2, and E3, and the underived paragroup becomes E*. Nested clades within each of these haplogroups are named in a similar way, except that lower case letters are used instead of numerals. Again, paragroups are labeled with a * symbol, and the remaining haplogroups are labeled "a", "b", "c", and so on. This naming system continues to alternate between numerals and lower case letters until the most terminal branches are labeled (tip haplogroups). Therefore, the name of each haplogroup contains the information needed to find its location on the tree.
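The alternating alphanumeric rule can be sketched in a few lines. The function names below are hypothetical, and the sketch assumes that every major clade is a single capital letter and that no node has more than 26 lettered subclades.

```python
import re
import string
from typing import List


def subclade_names(parent: str, count: int) -> List[str]:
    """Name `count` subclades of `parent`: lower case letters follow a
    numeral, and numerals follow a letter (capital or lower case)."""
    if parent[-1].isdigit():
        if count > len(string.ascii_lowercase):
            raise ValueError("more subclades than lower case letters")
        suffixes = list(string.ascii_lowercase[:count])
    else:
        suffixes = [str(i) for i in range(1, count + 1)]
    return [parent + suffix for suffix in suffixes]


def paragroup_name(parent: str) -> str:
    """Name the underived chromosomes of a clade with the * symbol."""
    return parent + "*"


def path_from_major_clade(name: str) -> List[str]:
    """Recover the chain of nested clades encoded in a lineage-based name."""
    tokens = re.findall(r"[A-Z]|\d+|[a-z]", name.rstrip("*"))
    return ["".join(tokens[: i + 1]) for i in range(len(tokens))]


print(subclade_names("E", 3))         # ['E1', 'E2', 'E3']
print(paragroup_name("E"))            # E*
print(subclade_names("E3", 2))        # ['E3a', 'E3b']
print(path_from_major_clade("E3a1"))  # ['E', 'E3', 'E3a', 'E3a1']
```

The last call shows how a name can be read back as a path through the tree, from the major clade down to the named haplogroup.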

Alternatively, haplogroups can be named by the mutations that define lineages rather than by the lineages themselves. Thus, we propose a second nomenclature that retains the major haplogroup information (i.e., 19 capital letters) followed by the name of the terminal mutation that defines a given haplogroup. We distinguish haplogroup names identified "by mutation" from those identified "by lineage" by including a dash between the capital letter and the mutation name. For example, haplogroup H1a would be called H-M36 (Figure 2). When multiple phylogenetically equivalent markers define a haplogroup, the one typed is used. For example, if M39 but not M138 were typed within haplogroup H1, then H1c becomes H-M39. If multiple equivalent markers were typed, this notation system omits some marker information, and a statement of which additional markers were typed should be included in the Methods section. Note that the mutation-based nomenclature has the important property of being more robust to changes in topology (Figure 2).
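A minimal sketch of the mutation-based naming rule follows. The function name and its arguments are hypothetical. It assumes the major clade is the first character of the lineage-based name, and that when several equivalent markers were typed the first one listed supplies the name; the source does not say which marker to prefer in that case. The remaining typed markers are returned so that they can be reported in the Methods section.

```python
from typing import Iterable, List, Tuple


def mutation_based_name(
    lineage_name: str,
    equivalent_markers: List[str],
    typed_markers: Iterable[str],
) -> Tuple[str, List[str]]:
    """Return the mutation-based name and any additional typed markers
    that the name omits."""
    typed = set(typed_markers)
    used = [m for m in equivalent_markers if m in typed]
    if not used:
        raise ValueError("none of the defining markers was typed")
    major_clade = lineage_name[0]  # assumption: single capital letter
    return f"{major_clade}-{used[0]}", used[1:]


print(mutation_based_name("H1a", ["M36"], ["M36"]))
# ('H-M36', [])
print(mutation_based_name("H1c", ["M39", "M138"], ["M39"]))
# ('H-M39', [])
print(mutation_based_name("H1c", ["M39", "M138"], ["M39", "M138"]))
# ('H-M39', ['M138'])
```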

While it is straightforward to name monophyletic clades, it is more challenging to devise a simple and flexible system to name underived interior nodes. This is especially important to facilitate the naming of haplogroups in studies where not all markers are typed, and to provide a standard set of names for previously described haplogroups (and paragroups). For instances where not all markers within a clade are typed, we introduce a bracketing system that encloses an 'x' (for 'excluding') and the lineages that have been shown to be absent. This system can be applied equally well to the lineage-based and mutation-based nomenclatures. The following examples portray the lineage-based nomenclature first, followed by the mutation-based nomenclature. Lineages (or markers) excluded from a haplogroup are listed within parentheses after the name of the haplogroup (or the last derived marker in the case of the mutation-based nomenclature). For example, if M82-derived chromosomes are typed with all downstream markers, then the underived chromosomes belong to H1* or H-M82* (Figure 3A). However, if M82-derived chromosomes are typed only with M36, then the underived chromosomes belong to H1*(xH1a) or H-M82*(xM36) (Figure 3B). If we apply this bracketing method to the naming of Underhill et al.'s (2000) paraphyletic haplogroup VI, then its label becomes F*(xK) or F-M89*(xM9) (Table 2). In the more extreme case of a study genotyping only the YAP and M3 markers, chromosomes ancestral for both markers would be named Y*(xDE,Q3) or Y*(xYAP,M3), where Y refers to the most inclusive haplogroup encompassing the total cladogram. See Table 2 for application of this bracketing system to lineage-based names of previously published haplogroups. When using the mutation-based nomenclature, the adoption of this bracketing system is optional, as long as full lineage-based names of haplogroups have been given elsewhere in the manuscript (e.g., in the form of a table or a tree).

The lineage- and mutation-based nomenclatures each have advantages and disadvantages, and each can be used where most appropriate.
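The bracketing system for underived chromosomes can be sketched as a small formatting function. The function name and the `all_downstream_typed` flag are hypothetical; the example labels are those given in the text. The same function serves both nomenclatures, because only the strings passed in differ.

```python
from typing import Sequence


def paragroup_label(
    clade: str,
    excluded: Sequence[str],
    all_downstream_typed: bool = False,
) -> str:
    """Label chromosomes that are derived for `clade` but underived for
    every lineage (or marker) listed in `excluded`."""
    if all_downstream_typed:
        # Every downstream marker was typed, so no bracket is needed.
        return f"{clade}*"
    if not excluded:
        raise ValueError("list at least one excluded lineage or marker")
    return f"{clade}*(x{','.join(excluded)})"


print(paragroup_label("H1", ["H1a"], all_downstream_typed=True))  # H1*
print(paragroup_label("H1", ["H1a"]))                             # H1*(xH1a)
print(paragroup_label("H-M82", ["M36"]))                          # H-M82*(xM36)
print(paragroup_label("F", ["K"]))                                # F*(xK)
print(paragroup_label("Y", ["DE", "Q3"]))                         # Y*(xDE,Q3)
print(paragroup_label("Y", ["YAP", "M3"]))                        # Y*(xYAP,M3)
```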

Cross-referencing to previous nomenclatures

A number of investigators have developed nomenclature systems based on overlapping subsets of the markers typed here. In order to facilitate comparisons among seven previously published nomenclatures and our present proposed nomenclature, Figure 1 and Table 2 illustrate direct comparisons among these different systems. These nomenclature systems are extremely inconsistent (i.e., non-isomorphic) in how they define haplogroups. Moreover, when there is consistency between two systems (e.g., between Underhill et al.'s (2000) haplogroup V and Hammer et al.'s (2000) haplogroup 1F), different names are used for the same haplogroups. All of the major human NRY nomenclature schemes used thus far have included paraphyletic groupings (see Figure 1), and these paragroups can be misinterpreted as being necessarily ancestral to "downstream" haplogroups containing derived characters. Three major benefits of the proposed system are its (1) ability to distinguish between underived interior nodes (paragroups) and monophyletic clades (haplogroups), (2) flexibility in naming haplogroups at different levels of the phylogenetic hierarchy, and (3) ability to accommodate new haplogroups as new mutations are discovered (see below). If broadly accepted and utilized, this system will also serve to standardize the names of NRY haplogroups in the literature.

Caveats and Changes in Nomenclature

In addition to the long-term challenges posed by any attempt to form a stable nomenclature system, there are several caveats that should be raised relating to the way the current tree topology was inferred. First, it is important to point out that not all polymorphisms were genotyped in all individuals. Indeed, continued genotyping of these polymorphisms may result in slight changes in the topology of the tree in Figure 1. It is also possible that some mutational events that were assumed to be unique are actually recurrent on the tree (i.e., there are undetected multiple hits at some additional sites). More importantly, because it is extremely difficult to devise a nomenclature system that is both informative in a phylogenetic sense and impervious to the need for renaming groups as new polymorphisms are discovered, a set of guidelines is needed to minimize the impact of future structural changes in the tree.

In order to facilitate the evolution of the present nomenclature we make a number of proposals. Firstly, a nomenclature committee comprising some of the current participants in the YCC will receive requests from investigators who wish new binary markers or haplogroups to be incorporated into the nomenclature, and will decide on the changes to be made to the existing system. At any one time, the current nomenclature and the committee's contact details will be available at the following URL: http://ycc.biosci.arizona.edu/. Secondly, we recommend that if investigators wish to use new markers prior to their incorporation into the nomenclature, they distinguish between consensus and novel parts of the clade labels by use of a forward slash. For example, a new mutation (m) that divides clade D1 in two creates D1/-m and D1/-M15*. This makes it clear to the reader which parts of the label are specific to that study and which can be cross-referenced to other publications. This will minimize confusion should two contemporaneous papers introduce novel markers within the same clade. In this manner, information from VNTR and STR haplotypes can also be incorporated; a standard nomenclature for Y-STRs is already available (Gill et al. 2001). Because new versions of the YCC nomenclature will be published annually to reflect changes in the tree topology resulting from newly discovered mutations, we suggest that each paper cite the particular version of the YCC NRY tree that was used (e.g., YCC NRY Tree 2002).
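The forward-slash convention for provisional labels can be sketched as follows. The function names are hypothetical. The sketch reads M15 in the example above as the consensus marker that defines D1, which is an assumption drawn from the label D1/-M15* rather than something the text states outright.

```python
from typing import Tuple


def provisional_labels(
    consensus_clade: str,
    defining_marker: str,
    new_marker: str,
) -> Tuple[str, str]:
    """Return labels for the two groups created when a new marker splits
    a consensus clade: the derived group and the underived remainder."""
    derived = f"{consensus_clade}/-{new_marker}"
    underived = f"{consensus_clade}/-{defining_marker}*"
    return derived, underived


def split_label(label: str) -> Tuple[str, str]:
    """Separate the consensus part of a label from its study-specific part."""
    consensus, _, novel = label.partition("/")
    return consensus, novel


print(provisional_labels("D1", "M15", "m"))  # ('D1/-m', 'D1/-M15*')
print(split_label("D1/-m"))                  # ('D1', '-m')
print(split_label("D1"))                     # ('D1', '')
```

Splitting on the slash recovers the part of the label that can be cross-referenced to other publications; a label without a slash has no study-specific part.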




