A Guide to Associating Drug Target Names with Sequences for Querying Databases

 

Chris Southan, Nov 07

 

“The Naming of Cats is a difficult matter, It isn't just one of your holiday games; You may think at first I'm as mad as a hatter When I tell you, a cat must have Three different names” (Eliot, TS 1939)

 

The precision with which we can recall all types of internal and external information associated with genes of interest to pharmaceutical R&D is crucial. In most cases the established or putative drug target is a biologically functional protein, or complex, for which we can uniquely define the amino acid sequence(s). We typically use three basic descriptors for this “target” entity: a short functional name (e.g. Beta-site amyloid precursor protein cleaving enzyme 1), a symbolic abbreviation (e.g. BACE1) and an accession number (e.g. P56817). Some idea of the precision problem is given by the Google recall figures of 38K, 82K and 1.3K hits, respectively. It is also important to be able to recall information associated with human paralogues (e.g. BACE2 or Q9Y5Z0) and 1:1 orthologues in model organisms (e.g. P56819 for rat and P56818 for mouse). There are of course many other identifiers with useful specificity: for example, AP000892 refers to the section of genome sequence in which BACE1 is located, AF201468 to one of the human BACE1 mRNA entries, 2is0 to one of the BACE1 crystal structures with an inhibitor bound, and PMID: 10656250 to one of the publications characterising BACE1.

A loss of precision that includes some false-positives (retrieval of irrelevant information) and/or redundancy is onerous but manageable. However, the consequences of false-negatives (information loss) are more serious. Database choice makes a big difference: Googling “BACE” gives 722K hits, including “Board of Adult and Community Education”. A wild card text search for “BACE” in Swiss-Prot gives 9 matches out of 252616 entries, including BACE1 and BACE2 but with three false-positives (Putative bacilysin exporter, bacE). Extending the search to include UniProtKB/TrEMBL gives 28 matches out of 3513861 entries, with several redundant entries for human and mouse.
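The identifier types mentioned above can often be told apart by their format alone, which is useful when deciding which database to send a query to. The following is a minimal sketch, not a definitive parser: the function name `classify_identifier` and the labels are hypothetical, the UniProt pattern follows the published accession format, and the GenBank and PDB patterns are deliberately simplified assumptions that cover only the examples given here.

```python
import re

# Simplified, illustrative patterns; real identifier formats have more
# variants than are covered here. Order matters: the first match wins.
PATTERNS = [
    ("PubMed ID", re.compile(r"PMID:\s*\d+")),
    ("UniProt accession", re.compile(
        r"[OPQ][0-9][A-Z0-9]{3}[0-9]"
        r"|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}")),
    # Assumption: two letters + six digits, or one letter + five digits.
    ("GenBank nucleotide accession", re.compile(r"[A-Z]{2}\d{6}|[A-Z]\d{5}")),
    # Assumption: a digit followed by three alphanumeric characters.
    ("PDB ID", re.compile(r"[0-9][A-Za-z0-9]{3}")),
]


def classify_identifier(token):
    """Return a label for the first pattern that matches the whole token."""
    token = token.strip()
    for label, pattern in PATTERNS:
        if pattern.fullmatch(token):
            return label
    return "name or symbol"


if __name__ == "__main__":
    for example in ["P56817", "Q9Y5Z0", "AP000892", "AF201468",
                    "2is0", "PMID: 10656250", "BACE1"]:
        print(example, "->", classify_identifier(example))
```

Format alone is not always decisive. With these patterns a four-digit Entrez Gene ID such as 2864 would be labelled as a PDB ID, so purely numeric identifiers still need to be read in context.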

 

This specificity problem with gene names is a well-known issue in modern biology, which is now faced with having to provide short functional names, symbolic abbreviations and cross-referenced database identifiers for millions of proteins from hundreds of complete genomes and thousands of species, of which only a small proportion have been experimentally characterised. Detailed reasons for, and consequences of, this are outside the scope of this document but the main problems arise from:

 

  1. a rich variety of  historical naming practices from over 50 years of biochemistry; based on functional characterisation, purification behaviour, genetic data, tissue location, or polypeptide size
  2. independent discovery and re-naming of proteins that later prove to have the same sequence
  3. perpetuation of multiple names in the literature that cannot be updated
  4. the absence of enforced naming guidelines for primary mRNA submissions to GenBank
  5. obstinate use of alternative individual names or competing gene family nomenclatures by experts
  6. conceptual and technical differences in the annotation pipelines of major secondary sequence databases
  7. the necessity for transitive annotation of  sequences predicted from genomic data. This means names and associated properties have to be transferred to new sequences solely by homology-based inferences in the absence of experimental verification (one consequence of this is the use of the notorious ”-like” in some gene names)

 

Thus, for describing drug targets, it is often necessary to disambiguate synonyms (different terms for the same entity) and homonyms (the same term for different entities), to establish the veracity of links between gene names and their underlying sequences, and to find a reference protein primary sequence that represents the most common individual (splice) and population (polymorphism) form. The complexity of the relationships between the entities of gene, coding locus, transcripts, polypeptide chains, active proteins and biochemically functional oligomeric complexes means that the problem cannot be completely solved. However, three things have made it at least manageable for most human genes. The first is that the general problem is now well recognised by the major databases, which have consequently made great efforts to improve the cross-mapping and standardisation of names, symbols and accession numbers between databases and the literature (Kersey & Apweiler, 2006). The second is that the human genome turned out to encode a much lower basal number of proteins than was expected, which now looks to be well below 25,000 (Southan 2004, Clamp et al. 2007). The third is the formation of the HUGO Gene Nomenclature Committee in 1979, whose remit is to give unique and meaningful names to every human gene (Wright & Bruford 2006). The Guidelines for Human Gene Nomenclature define a gene as: "a DNA segment that contributes to phenotype/function. In the absence of demonstrated function a gene may be characterised by sequence, transcription or homology". Thus, the definition is based on mapping biological activity to genomic DNA, although the names are linked with the protein that is expressed via the translation of an mRNA transcript derived from that gene locus.
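The distinction between synonyms and homonyms drawn at the start of this section can be made concrete with two small indexes, one from each alias to the entities it denotes and one from each entity to its aliases. This is only an illustrative sketch: the alias pairs are taken from examples in this guide, the entity labels for the elastases are placeholders and not approved symbols, and the function names are hypothetical.

```python
from collections import defaultdict

# Illustrative (alias, entity) pairs drawn from examples in this guide.
# The elastase entity labels are placeholders, not approved gene symbols.
ALIAS_PAIRS = [
    ("GPR40", "FFAR1"),
    ("FFAR1", "FFAR1"),
    ("free fatty acid receptor 1", "FFAR1"),
    ("BACE1", "BACE1"),
    ("Beta-site amyloid precursor protein cleaving enzyme 1", "BACE1"),
    ("elastase", "elastase 1"),
    ("elastase", "elastase 2"),
    ("elastase", "elastase 3a"),
    ("elastase", "elastase 3b"),
]


def build_indexes(pairs):
    """Build alias -> entities and entity -> aliases lookups."""
    by_alias = defaultdict(set)
    by_entity = defaultdict(set)
    for alias, entity in pairs:
        key = alias.casefold()  # compare aliases case-insensitively
        by_alias[key].add(entity)
        by_entity[entity].add(key)
    return by_alias, by_entity


def synonyms(entity, by_entity):
    """All terms recorded for one entity (different terms, same entity)."""
    return sorted(by_entity.get(entity, ()))


def homonyms(by_alias):
    """Terms that point at more than one entity (same term, different entities)."""
    return {alias: sorted(entities)
            for alias, entities in by_alias.items() if len(entities) > 1}


if __name__ == "__main__":
    by_alias, by_entity = build_indexes(ALIAS_PAIRS)
    print(synonyms("FFAR1", by_entity))
    print(homonyms(by_alias))
```

A query term that appears in the homonym index cannot be resolved to a single sequence without further evidence, which is the false-positive risk; an alias missing from the index altogether is the false-negative risk.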

 

However, establishing an unequivocal link between a proposed therapeutic target or a protein name from the literature and the exact sequence is still not trivial, especially for large multigene families. In patent documentation the use of non-standard internal sequence designations means the false-negative retrieval problem is even worse than that for the literature. Within the context of the drug discovery enterprise this needs to be clarified in the first instance by a target “champion” who can establish the link from the primary literature or internal target discovery data, ideally with the help of a bioinformatician. Most difficulties arise from ambiguous or conflicting descriptions in publications, but it is usually possible to track back to a stable protein sequence identifier from a secondary database. Secondary databases include curation efforts to reduce redundancy by aggregating primary data links. For example, the Swiss-Prot entry for BACE1 has a single accession number, P56817, that links to one reference protein sequence, 11 mRNA entries, 18 PDB entries, three alternative splice forms and 16 publications. Stable in this context means that the link, e.g. via P56817 or BACE1_HUMAN, will remain unambiguous even if changes or new entries appear in any of the linked primary databases (see Southan 2003 for a discussion of primary and secondary accession numbers). Examples of key identifiers for a small set of drug targets are given below in Table 1.

 

Table 1. Identifiers for a GPCR sub-family example, three of which have publications implicating them as drug targets

 

First name  HGNC approved symbol  HGNC approved name             Swiss-Prot  Entrez Gene  RefSeq Protein  Ensembl Gene     CAS Registry Number
GPR40       FFAR1                 free fatty acid receptor 1     O14842      2864         NP_005294       ENSG00000126266  199397-45-0
GPR41       FFAR3                 free fatty acid receptor 3     O14843      2865         NP_005295       ENSG00000185897  199397-46-1
GPR42       GPR42                 G protein-coupled receptor 42  O15529      2866         NP_005296       ENSG00000126251  199397-47-2
GPR43       FFAR2                 free fatty acid receptor 2     O15552      2867         NP_005297       ENSG00000126262  199397-48-3

 

The entries in Table 1 illustrate some of the principles and the differing utility of certain identifiers. It should be pointed out that web-based bioinformatics resources are now so richly interconnected that, from this table, every identifier (except the CAS Registry Number) can be reached from every other identifier, in most cases via only two or three mouse clicks. Some brief explanations are as follows:

 

The First name given to a gene product that is at least partially characterised in a publication is useful because it usually persists in databases as a synonym even if more appropriate functional re-naming occurs on the basis of new data. In this case the arbitrary start at 40 is simply because of the productivity of the O'Dowd team in GPCR cloning during the 1990s.

 

 

The HGNC approved symbol is increasingly used for human genes and is very useful for database queries, especially since “stemming” can be used for sub-family queries. For example, FFAR will retrieve FFARs 1, 2 and 3 (but not GPR42). The HGNC symbol web link also provides a useful “Atlantic crossing”, with outlinks to both European and US identifiers. However, there are caveats to be aware of.
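The stemming query described above amounts to a prefix match on the approved symbol. A minimal sketch using the symbols from Table 1 follows; the function name `stem_query` is hypothetical and no particular database interface is implied.

```python
# First name -> HGNC approved symbol, as listed in Table 1.
APPROVED_SYMBOLS = {
    "GPR40": "FFAR1",
    "GPR41": "FFAR3",
    "GPR42": "GPR42",
    "GPR43": "FFAR2",
}


def stem_query(stem, symbols):
    """Return the symbols that start with the given stem (case-insensitive)."""
    stem = stem.upper()
    return sorted(s for s in symbols if s.upper().startswith(stem))


if __name__ == "__main__":
    print(stem_query("FFAR", APPROVED_SYMBOLS.values()))  # FFAR1, FFAR2, FFAR3
    print(stem_query("GPR", APPROVED_SYMBOLS.values()))   # GPR42 only
```

The second query shows the other side of stemming: once GPR40, GPR41 and GPR43 were renamed, a GPR stem against approved symbols no longer retrieves them, so the first names have to be searched as synonyms.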

 

 

The HGNC approved name is also being adopted by major databases but it is not without problems, the main one being spelling or punctuation differences (generated systematically or simply by error) in many databases. Correct use of approved names in publications is patchy, and is also confounded by the fact that while Greek symbols are used freely in print they have to be spelled out in databases. Some HGNC names actually include synonyms in brackets, e.g. A3GALT2, alpha 1,3-galactosyltransferase 2 (isoglobotriaosylceramide synthase), which seems to defeat the object, but HGNC are trying to move these to the alias fields. Names do of course have the advantage over symbols of being descriptive, and they allow some useful stemming searches, e.g. “fatty acid”.
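One way to reduce the impact of the spelling, punctuation and Greek-symbol differences described above is to normalise names before comparing them. The sketch below is an assumption about how this might be done, not a description of any database's practice: the function name `normalise_name` and the small Greek-letter table are hypothetical, and the print-style variants in the example are invented for illustration.

```python
import re

# A small, deliberately incomplete table of Greek letters (assumption).
GREEK = {"α": "alpha", "β": "beta", "γ": "gamma", "δ": "delta", "κ": "kappa"}


def normalise_name(name):
    """Lower-case, spell out Greek letters and collapse punctuation to spaces."""
    name = name.lower()
    for symbol, spelled in GREEK.items():
        name = name.replace(symbol, f" {spelled} ")
    name = re.sub(r"[^a-z0-9]+", " ", name)  # punctuation and spaces -> one space
    return " ".join(name.split())


if __name__ == "__main__":
    database_style = "Beta-site amyloid precursor protein cleaving enzyme 1"
    print_style = "β-site amyloid precursor protein-cleaving enzyme 1"  # invented variant
    print(normalise_name(database_style) == normalise_name(print_style))
```

Normalisation of this kind trades precision for recall: names that differ only in punctuation are merged, which is usually what is wanted, but occasionally two genuinely different names may collapse to the same string.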

 

Swiss-Prot, also known as UniProtKB/Swiss-Prot, is the world's leading source of protein annotation, and in most cases the HGNC symbol will link to it. With over 50 outlinks it provides comprehensive context information on the bioinformatics, structure and function of any protein that is indexed. It also now includes links to DrugBank for 413 human drug targets.

 

The Entrez Gene ID (formerly called Locus Link) defines a unique gene locus in a particular species. From Table 1, 2864 defines FFAR1 specifically from Homo sapiens, whereas 233081 defines the mouse orthologue. The Entrez Gene system is gene-centric (not protein-centric like Swiss-Prot nor transcript-centric like UniGene), so technically there may be multiple protein and mRNA products derived from that “locus”. However, in most cases you will find a single RefSeq protein reference sequence, as for this entry, NP_005294. In most cases this will be identical to the Swiss-Prot entry. The Entrez Gene ID is the highest-specificity query for databases. There is a particular problem in older literature arising from the use of names that later expanded into gene families. Elastase is a good example, which eventually became elastases 1, 2, 3a and 3b. In these cases, despite the authenticity and value of the retrospective biochemical data, it may not be possible to make an unequivocal sequence identifier mapping.

 

Ensembl Gene provides a gene-centric entry point and is conceptually similar to Entrez Gene.  A key use for target mapping is the comprehensive indexing of increasing numbers of orthologues from species with completed genomes.

 

CAS Registry Number is a unique numeric identifier that designates any substance identified from the scientific literature, including protein sequences. These numbers can be used as entry points to SciFinder but are not mapped or linked outside the SciFinder system.
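Although CAS Registry Numbers cannot be cross-mapped outside SciFinder, their format can at least be checked locally before a query is made. The final digit is a check digit: the remaining digits are weighted 1, 2, 3 and so on from right to left, and the sum modulo 10 must equal it. This rule comes from the general CAS number format and is not specific to this guide; the function name `cas_check_digit_ok` is hypothetical. A minimal sketch:

```python
import re


def cas_check_digit_ok(cas_number):
    """Check the format and check digit of a CAS Registry Number string."""
    match = re.fullmatch(r"(\d{2,7})-(\d{2})-(\d)", cas_number.strip())
    if match is None:
        return False
    body = match.group(1) + match.group(2)  # all digits except the check digit
    check = int(match.group(3))
    # Weight the digits 1, 2, 3, ... starting from the rightmost one.
    total = sum(weight * int(digit)
                for weight, digit in enumerate(reversed(body), start=1))
    return total % 10 == check


if __name__ == "__main__":
    for cas in ["199397-45-0", "199397-46-1", "199397-47-2", "199397-48-3"]:
        print(cas, cas_check_digit_ok(cas))
```

A passing check only shows that the number is well formed, which catches most transcription errors; it does not show that the number has been assigned or that it refers to the intended protein.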