| Y-Chromosome Nomenclature System |
A Nomenclature System for the Tree of Human Y-Chromosomal Binary Haplogroups

Results and Discussion

NRY Haplogroup Tree and Haplogroup Nomenclature

We constructed a comprehensive haplogroup tree for the human NRY by genotyping most of the known polymorphisms on the NRY in a single set of samples (74 male YCC cell lines). Some polymorphisms known to be variable in other DNA samples showed no variation in the YCC panel; therefore, additional samples were included to improve the resolution of the phylogeny. This increased the number of polymorphic sites mapped onto the haplogroup tree to 237. Two mutational events occurred at each of eight sites. However, these recurrent mutations were found on different haplogroup backgrounds and thus were distinguishable events. The resulting 245 mutational events (237 sites plus one additional event at each of the eight recurrent sites) gave rise to 153 NRY haplogroups. The single most parsimonious tree for these 153 NRY haplogroups is shown in Figure 1, with mutational events shown along the branches. The tree was drawn as asymmetrically as possible by
sorting the descendants of each interior node so that the bottom-most
descendant had the greatest number of immediate descendants. The position
of the root in Figure
1 (indicated by an arrow) was determined by outgroup comparisons.
In other words, whenever possible, homologous regions on the NRY of
closely related species (e.g., chimpanzees, gorillas, and orangutans)
were sequenced to determine the ancestral states at human polymorphic
sites (see Underhill et al. 2000, Hammer et al. 2001). The root of
the tree falls between a clade defined by M91 and a clade defined
by a set of markers: SRY10831a, M42, M94, and M139. The NRY tree in
Figure
1 can be seen as a series of nested monophyletic clades (i.e.,
a set of lineages related by a shared, derived state at a single site or
a set of sites). To devise a nomenclature system at a reasonable
scale, we assigned a capital letter to several of the major clades,
beginning with the letter A (for the haplogroup above the position
of the root in Figure
1) and continuing through the alphabet to the letter R. The letter
Y was assigned to the most inclusive haplogroup comprising haplogroups
A-R. Deciding which clades receive the highest labeling level
is, to some extent, necessarily arbitrary. Here we label with single
capital letters those clades that seem to us to represent the major
divisions of human NRY diversity. Only 19 letters have been assigned
to clades to allow for the possible expansion and further resolution
of this phylogeny (the implications of which are discussed below).
We propose here two complementary nomenclatures. The
first is hierarchical and uses selected aspects of set theory to enable
clades at all levels to be named unambiguously. The capital letters
(A-R) used to identify the major clades constitute the front symbols
of all subsequent subclades (Figure
1). Unlabeled clades can be named as the 'join' of two subclades;
for example, clade CR includes all chromosomes that share the derived
state of the M168 and P9 polymorphisms. Note that this is distinct
from the set theoretic 'union', which, in the above example, would
not define a monophyletic clade. Lineages that are not defined on
the basis of a derived character represent interior nodes of the haplogroup
tree and are potentially paraphyletic (i.e., they comprise
basal lineages and monophyletic subclades). Thus, we suggest the term
"paragroup" rather than haplogroup to describe these lineages.
Paragroups are distinguished from haplogroups (i.e., monophyletic
groupings) by using the * (star) symbol, which represents chromosomes
belonging to a clade but not its subclades. For example, paragroup
B* belongs to the B clade; however, it does not fall into haplogroup
B1 or B2. As illustrated in Figure
2, internal nodes are highly sensitive to changes in tree topology.
Thus, the * symbol cautions that a given paragroup name may refer
to different sets of chromosomes in succeeding versions of the phylogeny.

Alternatively, haplogroups can be named by the mutations
that define lineages rather than by the lineages themselves. Thus,
we propose a second nomenclature that retains the major haplogroup
information (i.e., 19 capital letters) followed by the name of the
terminal mutation that defines a given haplogroup. We distinguish
haplogroup names identified "by mutation" from those identified
"by lineage" by including a dash between the capital letter
and the mutation name. For example, haplogroup H1a would be called
H-M36 (Figure
2). When multiple phylogenetically equivalent markers define a
haplogroup, the one typed is used. For example, if M39 but not M138
were typed within haplogroup H1, then H1c becomes H-M39. If multiple
equivalent markers were typed, this notation system omits some marker
information, and a statement of which additional markers were typed
should be included in the Methods section. Note that the mutation-based
nomenclature has the important property of being more robust to changes
in topology (Figure
2).

While it is straightforward to name monophyletic clades, it is more challenging to devise a simple and flexible system to name underived interior nodes. Such a system is especially important for naming haplogroups in studies where not all markers are typed, and for providing a standard set of names for previously described haplogroups (and paragroups). For instances where not all markers within a clade are typed, we introduce a bracketing system that encloses an 'x' (for 'excluding') and the lineages that have been shown to be absent. This system can be applied equally well to the lineage-based and mutation-based nomenclatures. The following examples give the lineage-based name first, followed by the mutation-based name. Lineages (or markers) excluded from a haplogroup are listed within parentheses after the name of the haplogroup (or the last derived marker in the case of the mutation-based nomenclature). For example, if M82-derived chromosomes are typed with all downstream markers, then the underived chromosomes belong to H1* or H-M82* (Figure 3A). However, if M82-derived chromosomes are typed only with M36, then the underived chromosomes belong to H1*(xH1a) or H-M82*(xM36) (Figure 3B). If we apply this bracketing method to the naming of Underhill et al.'s (2000) paraphyletic haplogroup VI, then its label becomes F*(xK) or F-M89*(xM9) (Table 2). In the more extreme case of a study genotyping only the YAP and M3 markers, chromosomes ancestral for both markers would be named Y*(xDE,Q3) or Y*(xYAP,M3), where Y refers to the most inclusive haplogroup encompassing the total cladogram. See Table 2 for application of this bracketing system to lineage-based names of previously published haplogroups. When using the mutation-based nomenclature, the adoption of this bracketing system is optional, as long as full lineage-based names of haplogroups have been given elsewhere in the manuscript (e.g., in the form of a table or a tree).
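The naming rules described above can be sketched in a few lines of Python. This is only an illustrative sketch, not part of the YCC proposal: the function names `mutation_name` and `underived_name` and their parameters are hypothetical, and the inputs are the examples quoted in the text rather than a full encoding of the tree in Figure 1.

```python
def mutation_name(major_clade, marker):
    """Mutation-based name: major-clade letter, a dash, then the typed marker."""
    return f"{major_clade}-{marker}"


def underived_name(base, excluded=(), all_downstream_typed=False):
    """Name chromosomes derived at `base` but ancestral at the typed downstream markers.

    base                 -- lineage-based ("H1") or mutation-based ("H-M82") name
    excluded             -- downstream lineages (or markers) shown to be absent
    all_downstream_typed -- True when every known downstream marker was typed
    """
    if all_downstream_typed:
        # Every subclade has been ruled out, so the plain paragroup star suffices.
        return f"{base}*"
    if excluded:
        # Only some subclades were ruled out: list them after an 'x'.
        return f"{base}*(x{','.join(excluded)})"
    # Nothing downstream was typed (an assumption of this sketch):
    # fall back to the name of the whole clade.
    return base


# Examples taken from the text (hypothetical helper, real labels).
print(mutation_name("H", "M36"))                          # H-M36
print(underived_name("H1", all_downstream_typed=True))    # H1*
print(underived_name("H1", excluded=["H1a"]))             # H1*(xH1a)
print(underived_name("H-M82", excluded=["M36"]))          # H-M82*(xM36)
print(underived_name("Y", excluded=["DE", "Q3"]))         # Y*(xDE,Q3)
```

The same function serves both nomenclatures because the bracketing rule depends only on the base name and the list of exclusions, not on whether those are lineages or markers.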
The lineage- and mutation-based nomenclatures each have advantages and disadvantages, and each can be used where most appropriate.

Cross-referencing to previous nomenclatures

A number of investigators have developed nomenclature systems based on overlapping subsets of the markers typed here. To facilitate comparisons among seven previously published nomenclatures and the nomenclature proposed here, Figure 1 and Table 2 cross-reference these systems directly. These nomenclature systems are extremely inconsistent (i.e., non-isomorphic) in how they define haplogroups. Moreover, when there is consistency between two systems (e.g., between Underhill et al.'s (2000) haplogroup V and Hammer et al.'s (2000) haplogroup 1F), different names are used for the same haplogroups. All of the major human NRY nomenclature schemes used thus far have included paraphyletic groupings (see Figure 1), and these paragroups can be misinterpreted as being necessarily ancestral to "downstream" haplogroups containing derived characters. Three major benefits of the proposed system are its (1) ability to distinguish between underived interior nodes (paragroups) and monophyletic clades (haplogroups), (2) flexibility in naming haplogroups at different levels of the phylogenetic hierarchy, and (3) ability to accommodate new haplogroups as new mutations are discovered (see below). If broadly accepted and utilized, this system will also serve to standardize the names of NRY haplogroups in the literature.

Caveats and Changes in Nomenclature

In addition to the long-term challenges posed by any
attempt to form a stable nomenclature system, there are several caveats
that should be raised relating to the way the current tree topology
was inferred. First, it is important to point out that not all polymorphisms
were genotyped in all individuals. Indeed, continued genotyping of
these polymorphisms may result in slight changes in the topology of
the tree in Figure
1. It is also possible that some mutational events that were assumed
to be unique are actually recurrent on the tree (i.e., there are undetected
multiple hits at some additional sites). More importantly, because
it is extremely difficult to devise a nomenclature system that is
both informative in a phylogenetic sense and impervious to the need
for renaming groups as new polymorphisms are discovered, a set of
guidelines is needed to minimize the impact of future structural changes
in the tree. In order to facilitate the evolution of the present
nomenclature we make a number of proposals. Firstly, a nomenclature
committee comprising some of the current participants in the YCC will
receive requests from investigators who wish new binary markers or
haplogroups to be incorporated into the nomenclature, and will decide
on the changes to be made to the existing system. At any one time,
the current nomenclature and the committee's contact details will
be made available at the following URL: http://ycc.biosci.arizona.edu/.
Secondly, we recommend that if investigators wish to use new markers
prior to their incorporation into the nomenclature, they distinguish
between consensus and novel parts of the clade labels by use of a
forward slash. For example, a new mutation (m) which divides clade
D1 in two creates D1/-m and D1/-M15*. This makes it clear to the reader
which parts of the label are specific to that study and which can
be cross-referenced to other publications. This will minimize confusion
should two contemporaneous papers introduce novel markers within the
same clade. In this manner, information from VNTR and STR haplotypes
can also be incorporated; a standard nomenclature for Y-STRs is already
available (Gill et al. 2001). Because new versions of the YCC nomenclature
will be published annually to reflect changes in the tree topology
resulting from newly discovered mutations, we suggest that each paper
cite the particular version of the YCC NRY tree that was used (e.g.,
YCC NRY Tree 2002). |
|
Site maintained by Al Agellon. Copyright © 2002 Arizona Research Laboratories |