FRONTIERS IN BIOSCIENCE;
GUIDELINES FOR HUMAN GENE NOMENCLATURE (1997)



The following guidelines were prepared by J.A. White, P.J. McAlpine, S. Antonarakis, H. Cann, K. Frazer, J. Frezal, D. Lancet, J. Nahmias, P. Pearson, J. Peters, A. Scott, H. and the attendees at the nomenclature meeting held on 5th March 1997.

CONTENTS

Definition: A gene is a DNA segment that contributes to phenotype/function. In the absence of demonstrated function a gene may be characterized by sequence, transcription or homology.

1. General Rules for Gene Nomenclature

1.1. Requirements for designation by gene symbol

1.1.1. A gene symbol may be used to designate a clearly defined phenotype shown to be inherited as a monogenic Mendelian trait. Example: TSC1.
1.1.2. Gene symbols may be allocated to as yet unidentified genes contributing to a complex trait shown by linkage or association with a known marker (for example IDDM6).
1.1.3. A gene symbol may be used to designate a cloned segment of DNA with sufficient structural, functional, and expression data to identify it as a transcribed entity. However, alternate transcripts from the same gene should not in general be given different gene symbols.
1.1.4. Gene symbols are also allocated to non-functional copies of genes (pseudogenes).
1.1.5. Genes encoded by the opposite (anti-sense) strand of a known gene will be given their own symbols.
1.1.6. A gene symbol may be given to a transcribed but untranslated DNA segment. Example: XIST.
1.1.7. A cellular phenotype from which the existence of a gene or genes can be inferred may have its own designation. Example: LOH#CR#.
1.1.8. If insufficient data are available to allocate a unique and meaningful gene symbol, a putative gene may be designated by the symbol C#ORF#. This symbol will also be used for EST clusters. Other fragments of expressed sequence will be designated by a D-number.

1.2. Gene symbols

1.2.1. Gene symbols are designated by upper case Latin letters or by a combination of upper-case letters and Arabic numbers. Symbols should be short in order to be useful, and should not attempt to represent all known information about a gene. Ideally symbols should be no longer than six characters in length. Based on classical genetic guidelines, gene symbols are always either underlined or italicized when referring to genotypic information (phenotypic information is represented in standard fonts). Exceptions to this rule are in catalogs of known genes, and when fragments or synthesized segments of genes are referred to. New symbols must not duplicate existing gene symbols (check the Genome Database, or the HUGO/GDB Nomenclature Committee list of approved gene symbols).
1.2.2. The initial character of the symbol should always be a letter. Subsequent characters may be other letters, or if necessary, Arabic numerals.
1.2.3. All characters of the symbol should be written on the same line; no superscripts or subscripts may be used.
1.2.4. No Roman numerals may be used. Roman numbers in previously used symbols should be changed to their Arabic equivalents.
1.2.5. Greek letters are not used in gene symbols. All Greek letters should be changed to letters in the Latin alphabet (Table 2).
1.2.6. A Greek letter prefixing a gene name must be changed to its Latin alphabet equivalent and placed at the end of the gene symbol. This permits alphabetical ordering of the gene in listings with similar properties, such as substrate specificities. Examples: GLA (galactosidase, alpha); GLB (galactosidase, beta).
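The mechanical parts of rules 1.2.1 to 1.2.3 can be sketched as a small checker. This is only an illustration, not an official validation tool: the function name `check_symbol`, the `existing` argument and the wording of the messages are assumptions made for this example, and Roman numerals (rule 1.2.4) are not detected because they cannot be told apart from ordinary letters.

```python
import re

RECOMMENDED_MAX_LENGTH = 6  # rule 1.2.1 says "ideally", so this is a warning, not a hard limit


def check_symbol(symbol, existing=frozenset()):
    """Return a list of problems found in a proposed gene symbol (empty if none)."""
    problems = []
    # Rules 1.2.1 and 1.2.3: upper-case Latin letters and Arabic numerals only,
    # all on one line (so no superscript or subscript characters).
    if not re.fullmatch(r"[A-Z0-9]+", symbol):
        problems.append("use only upper-case Latin letters and Arabic numerals")
    # Rule 1.2.2: the initial character is always a letter.
    if not re.match(r"[A-Za-z]", symbol):
        problems.append("first character must be a letter")
    if len(symbol) > RECOMMENDED_MAX_LENGTH:
        problems.append("longer than the recommended six characters")
    # Rule 1.2.1: new symbols must not duplicate existing ones.
    if symbol in existing:
        problems.append("duplicates an existing symbol")
    return problems
```

For instance, `check_symbol("TSC1")` returns an empty list, while `check_symbol("6PGD")` reports that the first character must be a letter. In practice the `existing` set would have to come from the approved symbol list mentioned in 1.2.1.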

1.3. Gene names

1.3.1. Gene names should be brief and specific and should convey the character or function of the gene.
1.3.2. The first letter of the symbol should be the same as that of the name in order to facilitate alphabetical listing and grouping, with the exception of the abbreviations noted in 2.6.2.

1.4. DNA segments

The following guidelines determine each part of the symbol:

Part I: D for DNA

Part II: 0,1,2,...22,X,Y,XY for the chromosomal assignment, where XY is for segments homologous on the X and Y chromosomes, and 0 is for unknown chromosomal assignment.

Part III: A symbol indicating the complexity of the DNA segment detected by the probe, with S for a unique DNA segment, and Z for repetitive DNA segments found at a single chromosome site or F for small undefined families of homologous sequences found on multiple chromosomes.

Part IV: 1,2,3,..., a sequential number to give uniqueness to the above concatenated characters.

Part V: When the DNA segment is known to be an expressed sequence the suffix E can be added to indicate this fact.

These numbers can now be generated automatically in the Genome Database, following entry of clone details.
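The five parts described above can be read back out of a D-number with a regular expression. The following is a minimal sketch under the assumption that the parts are simply concatenated in order; the names `DSegment`, `parse_d_segment` and `build_d_segment` are invented for this example and do not describe how the Genome Database generates its numbers.

```python
import re
from typing import NamedTuple


class DSegment(NamedTuple):
    chromosome: str   # Part II: "0", "1".."22", "X", "Y" or "XY"
    complexity: str   # Part III: "S", "Z" or "F"
    number: int       # Part IV: sequential number
    expressed: bool   # Part V: True when the E suffix is present


# Part I is the literal D. Two-digit chromosomes and XY are listed before
# their one-character prefixes so that they are tried first.
_D_SEGMENT = re.compile(r"D(1[0-9]|2[0-2]|[1-9]|0|XY|X|Y)([SZF])([1-9][0-9]*)(E?)")


def parse_d_segment(symbol):
    """Split a D-number into its parts, or return None if it does not fit."""
    match = _D_SEGMENT.fullmatch(symbol)
    if match is None:
        return None
    chromosome, complexity, number, suffix = match.groups()
    return DSegment(chromosome, complexity, int(number), suffix == "E")


def build_d_segment(chromosome, complexity, number, expressed=False):
    """Concatenate the parts and check that the result parses back."""
    symbol = f"D{chromosome}{complexity}{number}{'E' if expressed else ''}"
    if parse_d_segment(symbol) is None:
        raise ValueError(f"not a valid DNA segment symbol: {symbol!r}")
    return symbol
```

With these definitions, `parse_d_segment("D21S11")` yields chromosome "21", complexity "S", number 11 and no E suffix, while a string such as "D23S1" is rejected because 23 is not a permitted chromosomal assignment.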

2. Recommendations for symbol construction

2.1. Hierarchical symbols, gene families and series

2.1.1. Every attempt should be made to represent information in a hierarchical form to facilitate retrieval of sets of related genes from computerized databases.
2.1.2. Where gene products of similar function are encoded by different genes, the corresponding loci are designated by Arabic numerals placed immediately after the gene symbol, without any space between the letters and numbers used. Examples: PGM1, PGM2, PGM3 (three loci for phosphoglucomutase activity); ADH1, ADH2, ADH3 (three alcohol dehydrogenase loci); HBA1, HBA2 (duplicated forms of the alpha-hemoglobin gene). However, if they exist historically, single-letter suffixes may be used to designate these different loci. Example: LDHA, LDHB, LDHC (three lactate dehydrogenase loci).
2.1.3. A final character in the gene symbol may be used to specify a characteristic of the gene. While letters to specify tissue distribution have been used historically, Arabic numbers are now preferred as experience has shown that tissue specificity may not be as restricted as described initially.
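The retrieval benefit mentioned in 2.1.1 can be illustrated by grouping symbols on the stem that remains once the trailing Arabic numerals of rule 2.1.2 are removed. This is only a sketch: the helper names `family_stem` and `group_by_family` are assumptions for this example, and historical single-letter series such as LDHA, LDHB, LDHC are deliberately left ungrouped because a trailing letter cannot be separated from the stem without extra knowledge.

```python
from collections import defaultdict


def family_stem(symbol):
    """Strip trailing Arabic numerals: PGM1 -> PGM, HBA2 -> HBA, G6PD -> G6PD."""
    return symbol.rstrip("0123456789")


def group_by_family(symbols):
    """Map each stem to the sorted list of symbols that share it."""
    families = defaultdict(list)
    for symbol in sorted(symbols):
        families[family_stem(symbol)].append(symbol)
    return dict(families)
```

Applied to PGM1, PGM2, PGM3, ADH1, HBA1 and HBA2, this yields three groups keyed by PGM, ADH and HBA. Note that the plain string sort used here would place a hypothetical PGM10 before PGM2.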

2.2. Homologies with other species

2.2.1. Homologous genes in different species (orthologs) should where possible have the same gene nomenclature.
2.2.2. Human homologs of genes first identified in other species should not be designated by a symbol beginning with H for human.
2.2.3. When a locus or series of genes has been defined in one species, and it is reasonable to expect that a homologous gene will be identified in man in the future, we recommend that the designated symbol be reserved for the human loci. We recommend that the same be done in other species for genes first identified in human.
2.2.4. When it is necessary to distinguish the species of origin for homologous genes with the same gene symbol, the three-letter code for different species already established by the Committee on Standardization in Human Cytogenetics (see Table 1) is recommended. The code is for use in publications only and is not incorporated as part of the gene symbol. The species designation is added as a prefix to the gene symbol. For example, HSA signifies Homo sapiens and MMU stands for Mus musculus. Examples of the species designation used with the gene symbol: human loci (HSA)G6PD, (HSA)HBB, (HSA)ALB; homologous mouse loci (MMU)G6pd, (MMU)Hbb, (MMU)Alb.
2.2.5. The agreement between human and mouse gene nomenclature for many homologous gene loci should be continued and extended to other species where possible.
2.2.6. Human homologs of genes in invertebrate or prokaryote species may be represented by the symbol used in the other species followed by an L to represent like. The use of H to represent homolog is no longer recommended, and will be discontinued.
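The species prefix of 2.2.4 is a display convention rather than part of the symbol, which the following sketch tries to reflect by keeping the code and the symbol as separate values. The dictionary holds only a few of the Table 1 codes, and the function names `with_species` and `split_species` are assumptions made for this example.

```python
import re

# A small subset of the three-letter codes listed in Table 1.
SPECIES = {
    "HSA": "Homo sapiens",
    "MMU": "Mus musculus",
    "RNO": "Rattus norvegicus",
    "PTR": "Pan troglodytes",
}


def with_species(code, symbol):
    """Prefix a gene symbol with its species code for use in a publication."""
    if code not in SPECIES:
        raise KeyError(f"unknown species code: {code!r}")
    return f"({code}){symbol}"


def split_species(text):
    """Separate '(MMU)Hbb' into ('MMU', 'Hbb'); the code is None if absent."""
    match = re.fullmatch(r"\(([A-Z]{3})\)(.+)", text)
    if match is None:
        return None, text
    return match.group(1), match.group(2)
```

The symbol is passed through unchanged, so the mixed-case mouse symbols in the examples above, such as (MMU)G6pd, keep their capitalization.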

2.3. Genes identified from sequence information

2.3.1. Predicted genes.

Genes predicted from EST clusters or from genomic sequence alone are regarded as putative, and are designated by the chromosome of origin and an arbitrary number. Example: C2ORF1.

2.3.2. Pseudogenes

Molecular technology has identified sequences (generally not transcribed) that bear striking homologies to structural gene sequences. These sequences are termed pseudogenes. In order to show the relatedness of pseudogenes to functional genes, pseudogenes will be identified with the gene symbol of the structural gene followed by a P for pseudogene. In order to reserve P for pseudogenes, the use of P as the last character of a structural gene symbol should be avoided where possible. Examples: HBBP1 (hemoglobin, beta pseudogene 1); ACTBP1 (actin, beta pseudogene 1); ACTBP2 (actin, beta pseudogene 2), etc. Pseudogenes may be on different chromosomes or closely linked to the functional gene and occur in varying numbers.

2.3.3. Related sequences

Related sequences identified by cross-hybridisation and/or by computer searching of sequence databases (BLAST, FASTA), where no other functional information is available for the construction of a symbol, are designated with the symbol of the known gene followed by an L for like (see also the homology section, 2.2).
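The three constructions in section 2.3 are simple concatenations, which the following sketch makes explicit. The function names are invented for this example, and `like_symbol` assumes that the L is appended without a number, which the text above does not specify.

```python
def putative_symbol(chromosome, number):
    """2.3.1: chromosome of origin plus an arbitrary number, e.g. C2ORF1."""
    return f"C{chromosome}ORF{number}"


def pseudogene_symbol(gene, number):
    """2.3.2: structural gene symbol followed by P and a number, e.g. HBBP1."""
    return f"{gene}P{number}"


def like_symbol(gene):
    """2.3.3: known gene symbol followed by L for like (assumed unnumbered)."""
    return f"{gene}L"
```

For example, `pseudogene_symbol("ACTB", 2)` gives ACTBP2, matching the actin, beta pseudogene 2 example above.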

2.4. Enzymes and proteins

2.4.1. Names of genes coding for enzymes are based on those recommended by the Nomenclature Committee of the International Union of Biochemistry. Names of plasma proteins, hemoglobins, and specialized proteins are based on standard names and those recommended by their respective committees (??refs).

2.5. Clinical disorders

2.5.1. Inherited clinical disorders (monogenic Mendelian inheritance).

The first gene symbol allocated to an inherited clinical phenotype may be based on an acronym which has been established as a name for the disorder, whilst following the rules described in section 1. Example: ACH for achondroplasia. However, it is usual for this symbol to change when the gene product or function is identified. In some cases a gene symbol based on product or function will already exist, and this will take precedence over the symbol derived from the clinical disorder when the gene descriptions are merged. For example, in the case of achondroplasia the symbol changed to FGFR3 and the name to fibroblast growth factor receptor 3 (achondroplasia, thanatophoric dwarfism).

2.5.2. Complex/polygenic traits

Genome searches may suggest a contributing locus in a complex trait, which may for convenience be given a gene symbol, although a proportion of these will disappear in time. A symbol allocated to such a gene will not be re-used.

2.5.3. Contiguous gene syndromes.

Syndromes clearly associated with multiple loci should not be given gene symbols. Syndromes associated with a regional deletion or duplication may be assigned the letters CR (for chromosome region), in place of S for syndrome. Examples: ANCR (Angelman syndrome chromosome region), DCR (Down syndrome chromosome region). However, as advances in database design have now increased the possible ways of representing this type of information, we recommend that such symbols are now classified as syndromic region symbols and not gene symbols.

2.5.4. Loss of heterozygosity.

A chromosomal region in which the existence of genes may be inferred by loss of heterozygosity can be designated by a symbol consisting of the letters LOH, the chromosome number, CR (for chromosomal region) and then an arbitrary number.
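The LOH symbol described above has a fixed layout that can be both built and parsed. In this sketch the permitted chromosome values (1 to 22, X, Y) are an assumption borrowed from section 1.4, and the names `loh_symbol` and `parse_loh` are invented for the example.

```python
import re

# LOH + chromosome + CR + arbitrary number, e.g. LOH11CR1.
_LOH = re.compile(r"LOH(1[0-9]|2[0-2]|[1-9]|X|Y)CR([1-9][0-9]*)")


def loh_symbol(chromosome, number):
    """Build a loss-of-heterozygosity region symbol and check its layout."""
    symbol = f"LOH{chromosome}CR{number}"
    if _LOH.fullmatch(symbol) is None:
        raise ValueError(f"not a valid LOH symbol: {symbol!r}")
    return symbol


def parse_loh(symbol):
    """Return (chromosome, number), or None if the layout does not match."""
    match = _LOH.fullmatch(symbol)
    if match is None:
        return None
    return match.group(1), int(match.group(2))
```

The symbol LOH11CR1 used in the comment is illustrative only and is not taken from the guidelines.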

2.6. Letters reserved for specific usage

2.6.1. Certain letters, or combinations of letters, are used as the last letter in a symbol to represent a specific meaning: P for pseudogene (but note also BP for binding protein), L for like (see 2.2.6 and 2.3.3), R for receptor or regulator, N or NH for inhibitor. The use of these for other meanings should be avoided where possible.
2.6.2. If the name of a gene contains a character or property for which there is a recognized abbreviation, the abbreviation should be used. Example: the single-letter abbreviation for amino acids (Table 3) used in aminoacyl residues, or approved biochemical abbreviations such as GLC for glucose and GSH for glutathione.

3. Allele terminology

Allele terminology is now the responsibility of the Mutation Database (ref/URL)

4. Printing Gene and Allele Symbols

Gene and allele symbols are underlined in manuscript and italicized in print. Italics need not be used in catalogs. It may be convenient in manuscripts, computer printouts and in printed text to designate a gene symbol by following it with an asterisk (e.g. PGM1*). When only allele symbols are displayed they can be preceded by an asterisk. For example, for PGM1*1, the allele is printed as *1.
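The asterisk convention lends itself to a simple split on the first asterisk. The sketch below assumes that a designation contains the gene symbol, one asterisk, and then an optional allele symbol; the function names `split_allele` and `allele_only` are invented for this example.

```python
def split_allele(designation):
    """Split 'PGM1*1' into ('PGM1', '1'); 'PGM1*' gives ('PGM1', '')."""
    gene, asterisk, allele = designation.partition("*")
    if not asterisk or not gene:
        raise ValueError(f"expected GENE*ALLELE, got {designation!r}")
    return gene, allele


def allele_only(designation):
    """Return the allele symbol preceded by an asterisk, e.g. '*1'."""
    _, allele = split_allele(designation)
    if not allele:
        raise ValueError(f"no allele symbol in {designation!r}")
    return "*" + allele
```

With these helpers, `allele_only("PGM1*1")` returns `*1`, as in the example above.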

Table 1: Species Abbreviations

Abbreviation   Species
HSA            Homo sapiens
PTR            Pan troglodytes (chimpanzee)
GGO            Gorilla gorilla
PPY            Pongo pygmaeus (orangutan)
MMU            Mus musculus
RNO            Rattus norvegicus
MML            Macaca mulatta (Rhesus monkey)
CAE            Cercopithecus aethiops (African green monkey)
PPA            Papio papio (baboon)
FCA            Felis catus (cat)
CGR            Cricetulus griseus (hamster)
OOV            Ovies ovies (sheep)
BBO            Bos bovinus (cattle)
SSC            Sus scrofa (pig)
OCU            Oryctolagus cuniculus (rabbit)
MRU            Macropus rufus (red kangaroo)

Table 2: Greek-to-Latin alphabet conversion

Greek lower case   Name      Latin upper case conversion
α                  alpha     A
β                  beta      B
γ                  gamma     G
δ                  delta     D
ε                  epsilon   E
ζ                  zeta      Z
η                  eta       H
θ                  theta     Q
ι                  iota      I
κ                  kappa     K
λ                  lambda    L
μ                  mu        M
ν                  nu        N
ξ                  xi        X
ο                  omicron   O
π                  pi        P
ρ                  rho       R
σ                  sigma     S
τ                  tau       T
υ                  upsilon   Y
φ                  phi       F
χ                  chi       C
ψ                  psi       U
ω                  omega     W
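Table 2 together with rule 1.2.6 amounts to a lookup followed by a suffix. The sketch below transcribes the table into a dictionary keyed by the spelled-out Greek letter name; the function name `append_greek` is an assumption for this example, and so is the stem "GL" used to reproduce the GLA and GLB symbols given in 1.2.6.

```python
# Greek letter name -> Latin upper-case letter, transcribed from Table 2.
GREEK_TO_LATIN = {
    "alpha": "A", "beta": "B", "gamma": "G", "delta": "D",
    "epsilon": "E", "zeta": "Z", "eta": "H", "theta": "Q",
    "iota": "I", "kappa": "K", "lambda": "L", "mu": "M",
    "nu": "N", "xi": "X", "omicron": "O", "pi": "P",
    "rho": "R", "sigma": "S", "tau": "T", "upsilon": "Y",
    "phi": "F", "chi": "C", "psi": "U", "omega": "W",
}


def append_greek(stem, greek_name):
    """Rule 1.2.6: place the converted Greek letter at the end of the symbol."""
    return stem + GREEK_TO_LATIN[greek_name.lower()]
```

Placing the converted letter last keeps related genes adjacent in an alphabetical listing: `append_greek("GL", "alpha")` and `append_greek("GL", "beta")` give GLA and GLB.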

Table 3: Single-letter amino acid symbols

Amino acid      Three-letter symbol   One-letter symbol
Alanine         Ala                   A
Arginine        Arg                   R
Asparagine      Asn                   N
Aspartic acid   Asp                   D
Asn + Asp       Asx                   B
Cysteine        Cys                   C
Glutamine       Gln                   Q
Glutamic acid   Glu                   E
Gln + Glu       Glx                   Z
Glycine         Gly                   G
Histidine       His                   H
Isoleucine      Ile                   I
Leucine         Leu                   L
Lysine          Lys                   K
Methionine      Met                   M
Phenylalanine   Phe                   F
Proline         Pro                   P
Serine          Ser                   S
Threonine       Thr                   T
Tryptophan      Trp                   W
Tyrosine        Tyr                   Y
Valine          Val                   V

5. Acknowledgements:

The Nomenclature meeting held in Toronto on 5th March 1997 was made possible by the support of the EU, through a contract to HUGO.