A guide to Associating Drug Target Names with
Sequences for Querying Databases
Chris Southan, Nov 07
“The Naming of Cats is a difficult matter, It isn't just one of your holiday games; You may
think at first I'm as mad as a hatter When I tell you, a cat must have Three
different names” (Eliot, T.S. 1939)
The precision with
which we can recall all types of internal and external information associated
with genes of interest to pharmaceutical R&D is crucial. In most cases the
established or putative drug target is a biologically functional protein, or
complex, for which we can uniquely define the amino acid sequence(s). We
typically use three basic descriptors for this “target” entity: a short functional name (e.g. Beta-site amyloid precursor protein cleaving enzyme 1), a symbolic
abbreviation (e.g. BACE1) and an accession
number (e.g. P56817). Some idea of the precision problem is given
by the Google hit counts for these three descriptors: 38K, 82K and 1.3K,
respectively. It is also important to be
able to recall information associated with human paralogues
(e.g. BACE2 or Q9Y5Z0)
and 1:1 orthologues in model organisms (e.g. P56819
for rat and P56818
for mouse). There are of course many other
identifiers with useful specificity. For
example, AP000892
refers to the section of genome sequence in which BACE1 is located, AF201468
refers to one of the human BACE1 mRNA entries,
2is0 to one of the BACE1 crystal structures with an
inhibitor bound, and PMID: 10656250 to one of the publications characterising
BACE1. A loss of precision that includes
some false-positives (retrieval of irrelevant information) and/or redundancy is
onerous but manageable. However, the consequences of false-negatives
(information loss) are more serious.
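As a rough illustration of how differently shaped these identifiers are, the following Python sketch sorts the example tokens above into identifier types before they are sent to a database. The patterns are simplified assumptions (the UniProt pattern follows the published accession format; the others are deliberately loose), and the function name `classify_identifier` is hypothetical.

```python
import re

# Simplified, illustrative patterns; they are assumptions, not full format specifications.
PATTERNS = [
    ("pubmed_id", re.compile(r"^PMID:\s*\d+$")),
    ("uniprot_accession", re.compile(
        r"^(?:[OPQ][0-9][A-Z0-9]{3}[0-9]"
        r"|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2})$")),
    # Two letters plus six digits, as in AP000892 and AF201468.
    ("nucleotide_accession", re.compile(r"^[A-Z]{2}\d{6}$")),
    # A digit followed by three alphanumeric characters, as in 2is0.
    ("pdb_id", re.compile(r"^[0-9][A-Za-z0-9]{3}$")),
]


def classify_identifier(token: str) -> str:
    """Return the first identifier type whose pattern matches, else 'free_text'."""
    token = token.strip()
    for label, pattern in PATTERNS:
        if pattern.match(token):
            return label
    return "free_text"


for example in ["P56817", "Q9Y5Z0", "AP000892", "AF201468", "2is0",
                "PMID: 10656250", "BACE1"]:
    print(example, "->", classify_identifier(example))
```

Note that a purely numeric four-character token would also match the loose PDB pattern here, which is one reason to record the identifier type alongside its value rather than relying on its shape alone.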
Database choice makes a big difference; Googling “BACE” gives 722K hits including “Board of Adult and
Community Education”. A wildcard text search for “BACE” in Swiss-Prot gives 9 matches out of 252616
entries, including BACE1 and BACE2 but with three false-positives (Putative bacilysin exporter, bacE). Extending the search to include UniProtKB/TrEMBL gives 28 out of 3513861 entries with
several redundant entries for human and mouse.
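The effect of a loose wildcard match can be sketched on a toy list. This is a minimal illustration, not a model of how Swiss-Prot searches work: the `ENTRIES` list, the function names and the set of "relevant" hits are all assumptions made for the example, and only the symbols BACE1, BACE2 and bacE come from the text.

```python
import re

# Toy stand-in for a protein database: (symbol, description) pairs.
# The BACE2 description and the FFAR1 control row are filler for illustration.
ENTRIES = [
    ("BACE1", "Beta-site amyloid precursor protein cleaving enzyme 1"),
    ("BACE2", "Beta-site amyloid precursor protein cleaving enzyme 2"),
    ("bacE", "Putative bacilysin exporter"),
    ("FFAR1", "free fatty acid receptor 1"),
]


def wildcard_search(term, entries):
    """Case-insensitive substring match on the symbol, like a loose wildcard query."""
    term = term.lower()
    return [symbol for symbol, _ in entries if term in symbol.lower()]


def family_search(stem, entries):
    """Case-sensitive match on the stem followed by digits only."""
    pattern = re.compile(rf"^{re.escape(stem)}\d+$")
    return [symbol for symbol, _ in entries if pattern.match(symbol)]


def precision(hits, relevant):
    """Fraction of retrieved hits that are relevant (0.0 when nothing is retrieved)."""
    if not hits:
        return 0.0
    return sum(1 for hit in hits if hit in relevant) / len(hits)


RELEVANT = {"BACE1", "BACE2"}
loose = wildcard_search("BACE", ENTRIES)
strict = family_search("BACE", ENTRIES)
print(loose, precision(loose, RELEVANT))
print(strict, precision(strict, RELEVANT))
```

The stricter pattern removes the bacE false-positive in this toy case, but it would equally miss a relevant entry whose symbol was recorded in a different case or style, which is the false-negative risk described above.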
This
specificity problem with gene names is a well-known issue in modern biology,
which is now faced with having to provide short functional names, symbolic
abbreviations and cross-referenced database identifiers for millions of proteins from hundreds of complete genomes
and thousands of species, of which only a small proportion have been experimentally
characterised. Detailed reasons for, and
consequences of, this are outside the scope of this document but the main
problems arise from:
Thus, for
describing drug targets, it is often necessary to disambiguate synonyms
(different terms for the same entity) and homonyms (the same term for different
entities), to establish the veracity of links between gene names and their
underlying sequences, and to find a reference protein primary sequence which
represents the most common individual (splice) and population (polymorphism)
form. The complexity of the relationships
between the entities involved (gene, coding locus,
transcripts, polypeptide chains, active proteins and biochemically
functional oligomeric complexes) means that the
problem cannot be completely solved. However, three things have made it at
least manageable for most human genes. The first is that the general problem is
now well recognised by the major databases and they have consequently made
great efforts to improve the cross-mapping and standardisation of names,
symbols and accession numbers between databases and the literature (Kersey
& Apweiler, 2006). The second is that
the human genome turned out to encode a much lower basal number of proteins
than was expected, now thought to be well below 25,000 (Southan 2004, Clamp
et al. 2007). The third is the formation of the HUGO Gene Nomenclature Committee in 1979, whose remit is to give unique and
meaningful names to every human gene (Wright
& Bruford 2006). The Guidelines for Human Gene Nomenclature define a gene
as: "a DNA segment that contributes
to phenotype/function. In the absence of demonstrated function a gene may be
characterised by sequence, transcription or homology". Thus, the definition is based on mapping
biological activity to genomic DNA although the names are linked with the
protein that is expressed via the translation of an mRNA transcript derived from that
gene locus.
However,
establishing an unequivocal link between a proposed therapeutic target or a
protein name from the literature and the exact sequence is still not trivial,
especially for large multigene families. In patent documentation the use of
non-standard internal sequence designations means the false-negative retrieval
problem is even worse than that for the literature. Within the context of the drug discovery enterprise this needs
to be clarified in the first instance by a target “champion” who can establish
the link from the primary literature or internal target discovery data, ideally
with the help of a bioinformatician. Most difficulties arise from ambiguous or
conflicting descriptions in publications, but it is usually possible to
track back to a stable protein sequence identifier from a secondary database.
Secondary databases include curation efforts to reduce redundancy by
aggregating primary data links. For example, the Swiss-Prot
entry for BACE1 has a single accession number, P56817, that links to one reference protein
sequence, 11 mRNA entries, 18 PDB entries, three
alternative splice forms and 16 publications. Stable in this context means that
the link, e.g. via P56817 or BACE1_HUMAN,
will remain unambiguous even if changes or new entries appear in any of the
linked primary databases (see Southan 2003 for a discussion of primary and secondary
accession numbers). Examples of key identifiers for a small set of drug targets
are given below in Table 1.
Table 1. Identifiers for a GPCR sub-family example, three of which have publications implicating them as drug targets

| First name | HGNC approved name | CAS Registry Number |
|---|---|---|
| GPR40 | free fatty acid receptor 1 | 199397-45-0 |
| GPR41 | free fatty acid receptor 3 | 199397-46-1 |
| GPR42 | G protein-coupled receptor 42 | 199397-47-2 |
| GPR43 | free fatty acid receptor 2 | 199397-48-3 |

These examples illustrate
some of the principles and the different utility of certain identifiers. It should be pointed out that web-based
bioinformatics resources are now so richly interconnected that, from
this table, every identifier (except the CAS number) can be accessed from every other identifier,
in most cases via only two or three mouse clicks. Some brief explanations are as
follows:
The first name given to a gene product that is at
least partially characterised in a publication is useful because it usually
persists in databases as a synonym even if more appropriate functional re-naming
occurs on the basis of new data. In this
case the arbitrary start at 40 is simply because of the productivity of the O'Dowd
team in GPCR cloning during the 1990s.
The HGNC approved symbol is being increasingly used for human
genes and is very useful for database queries, especially since “stemming” can
be used for sub-family queries. For example, FFAR will
retrieve FFAR1, FFAR2 and FFAR3
(but not GPR42). The HGNC
symbol web link also provides a useful
“Atlantic crossing”, where outlinks to both European
and US identifiers are included. However,
there are caveats to be aware of.
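A stemming query of this kind amounts to a prefix match on the symbol. The sketch below is a minimal illustration: the mapping from first names to approved symbols is inferred from the approved names in Table 1 (for instance GPR40 to FFAR1), and the function names are hypothetical.

```python
# First name -> HGNC approved symbol, inferred from the approved names in Table 1.
APPROVED_SYMBOL = {
    "GPR40": "FFAR1",
    "GPR41": "FFAR3",
    "GPR42": "GPR42",
    "GPR43": "FFAR2",
}


def stem_query(stem, symbols):
    """Return, in sorted order, every symbol that starts with the given stem."""
    return sorted(symbol for symbol in symbols if symbol.startswith(stem))


def first_names_for(stem, mapping):
    """Return the first names whose approved symbol starts with the stem."""
    return sorted(first for first, symbol in mapping.items()
                  if symbol.startswith(stem))


print(stem_query("FFAR", APPROVED_SYMBOL.values()))
print(first_names_for("FFAR", APPROVED_SYMBOL))
print(stem_query("GPR4", APPROVED_SYMBOL.keys()))
```

The same prefix idea applied to the first names (the stem GPR4) retrieves all four family members, including GPR42, so which name type is stemmed determines which sub-family comes back.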
The HGNC approved name is also being adopted by major
databases but it is not without problems, the main one being spelling or
punctuation differences (generated systematically or simply errors) in many
databases. The correct use of approved names in
publications is patchy and is also confounded by the fact that, while Greek symbols
are used freely in print, they have to be spelled out in databases. Some HGNC names
actually include synonyms in brackets, e.g. A3GALT2, alpha 1,3-galactosyltransferase
2 (isoglobotriaosylceramide synthase),
which seems to defeat the object, but HGNC is trying to move these to the alias
fields. Names of course have the advantage over
symbols of being descriptive, and they allow some useful stemming searches,
e.g. “fatty acid”.
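One way to reduce the effect of spelling, punctuation and Greek-symbol differences is to normalise names before comparing them. The sketch below is only an assumption about how this might be done: the `GREEK` table covers a handful of lower-case letters, and `normalise_name` is a hypothetical helper, not part of any database interface.

```python
import re

# Small illustrative subset of Greek letters; a real table would be longer.
GREEK = {"α": "alpha", "β": "beta", "γ": "gamma", "δ": "delta", "κ": "kappa"}


def normalise_name(name: str) -> str:
    """Spell out Greek letters, lower-case, and collapse punctuation to single spaces."""
    for symbol, word in GREEK.items():
        name = name.replace(symbol, f" {word} ")
    name = name.lower()
    name = re.sub(r"[^a-z0-9]+", " ", name)
    return " ".join(name.split())


variants = [
    "α 1,3-galactosyltransferase 2",
    "alpha 1,3-galactosyltransferase 2",
    "Alpha-1,3-Galactosyltransferase 2",
]
print({normalise_name(v) for v in variants})
```

All three variants collapse to the same key, so a comparison on the normalised form treats them as one name. The trade-off is that aggressive normalisation can also merge names that are genuinely different, so it is better suited to suggesting candidate matches than to confirming them.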
Swiss-Prot, also known as UniProtKB/Swiss-Prot, is the world's leading source of
protein annotation and in most cases the HGNC symbol
will link to it. With over 50 outlinks it provides comprehensive context information on
the bioinformatics, structure and function of any protein that is
indexed. It also now includes links to DrugBank for 413 human drug targets.
The Entrez Gene ID (formerly called LocusLink) defines a
unique gene locus in a particular species. From Table 1, 2864
defines FFAR1
specifically from Homo sapiens,
whereas 233081
defines the mouse orthologue. The Entrez Gene
system is gene-centric (not protein-centric like Swiss-Prot nor transcript-centric like UniGene), so technically
there may be multiple protein and mRNA products derived from that “locus”. However, in most cases you will find a single RefSeq
protein reference sequence, as for this entry,
NP_005294. In most cases this will be identical to the
Swiss-Prot entry. The Entrez Gene ID provides the highest-specificity database query. There is a particular problem in older
literature from the use of names which later became expanded gene families. Elastase is a good example, which eventually became elastase 1, 2, 3a and 3b. In these cases, despite the authenticity and
value of the retrospective biochemical data, it may not be possible to make an
unequivocal sequence identifier mapping.
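The gene-centric view can be pictured as one record per locus that may carry one or several protein products. The following sketch assumes a hypothetical `GeneRecord` structure; the second record, with its placeholder accessions, is invented purely to show the ambiguous case and does not correspond to a real gene.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class GeneRecord:
    """Hypothetical gene-centric record: one locus, possibly several proteins."""
    gene_id: int
    species: str
    symbol: str
    refseq_proteins: List[str] = field(default_factory=list)


def reference_protein(record: GeneRecord) -> Optional[str]:
    """Return the protein accession only when the locus maps to exactly one."""
    if len(record.refseq_proteins) == 1:
        return record.refseq_proteins[0]
    return None


# Values taken from the text above.
ffar1 = GeneRecord(2864, "Homo sapiens", "FFAR1", ["NP_005294"])

# Invented placeholder record with two products, to show the ambiguous case.
multi = GeneRecord(0, "hypothetical", "GENE_X", ["NP_EXAMPLE_A", "NP_EXAMPLE_B"])

print(reference_protein(ffar1))
print(reference_protein(multi))
```

Returning nothing rather than picking the first product makes the ambiguity explicit, so that a person can decide which sequence the target name is meant to refer to.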
Ensembl Gene provides a gene-centric entry point and is
conceptually similar to Entrez Gene. A key use for target mapping is the
comprehensive indexing of increasing numbers of orthologues
from species with completed genomes.
The CAS Registry Number is a unique numeric identifier that designates any substance
identified from the scientific literature, including protein sequences. These numbers can
be used as entry points to SciFinder but are not mapped or linked outside
the SciFinder system.
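Because CAS Registry Numbers carry a check digit, a transcription error can often be caught before a query is made. The sketch below applies the standard check-digit rule (digits before the final one are weighted 1, 2, 3 and so on from the right, and the sum modulo 10 must equal the final digit) to the four numbers in Table 1. The function name `cas_check_digit_ok` is hypothetical, and the format checks are deliberately minimal.

```python
def cas_check_digit_ok(cas: str) -> bool:
    """Validate the check digit of a CAS Registry Number such as '199397-45-0'."""
    parts = cas.strip().split("-")
    if len(parts) != 3 or not all(part.isdigit() for part in parts):
        return False
    if len(parts[1]) != 2 or len(parts[2]) != 1:
        return False
    body = parts[0] + parts[1]
    # Weight the digits 1, 2, 3, ... starting from the rightmost digit of the body.
    total = sum(weight * int(digit)
                for weight, digit in enumerate(reversed(body), start=1))
    return total % 10 == int(parts[2])


for number in ["199397-45-0", "199397-46-1", "199397-47-2", "199397-48-3"]:
    print(number, cas_check_digit_ok(number))
```

A passing check only shows that the number is well formed; it does not confirm that the number is assigned to the intended substance.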