https://msystems.asm.org/content/4/2/e00072-19
Curated BLAST for Genomes finds candidate genes for a process or an enzymatic activity within a genome of interest. In contrast to annotation tools, which usually predict a single activity for each protein, Curated BLAST asks if any of the proteins in the genome are similar to characterized proteins that are relevant. Given a query such as an enzyme’s name or an EC number, Curated BLAST searches the curated descriptions of over 100,000 characterized proteins, and it compares the relevant characterized proteins to the predicted proteins in the genome of interest. In case of errors in the gene models, Curated BLAST also searches the six-frame translation of the genome. Curated BLAST is available at http://papers.genomics.lbl.gov/curated.
IMPORTANCE Given a microbe’s genome sequence, we often want to predict what capabilities the organism has, such as which nutrients it requires or which energy sources it can use. Or, we know the organism has a capability and we want to find the genes involved. Scientists often use automated gene annotations to find relevant genes, but automated annotations are often vague or incorrect. Curated BLAST finds candidate genes for a capability without relying on automated annotations. First, Curated BLAST finds proteins (usually from other organisms) whose functions have been studied experimentally and whose curated descriptions match a query. Then, it searches the genome of interest for similar proteins and returns a list of candidates. Curated BLAST is fast and often finds relevant genes that are missed by automated annotation.
Given the genome sequence for an organism of interest, we often want to know whether or not it encodes a certain capability and which proteins might be involved. To this end, many genomics websites allow searching for proteins whose annotations match a text query. However, annotation tools usually provide a single predicted function for each protein, and these predictions are often incorrect (e.g., reference 1). So, searching through annotations may not be the best way to find proteins that are involved in a process.
Instead, we propose that given a text query, we can identify experimentally characterized proteins (usually from other organisms) that are relevant. Then, we can search in the genome of interest for proteins that are similar to these characterized proteins. For enzymes, this approach obviates the need to predict the substrate specificity (which is often not possible). Instead, we identify candidates that are similar to characterized proteins that have activities of interest.
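The two steps described above can be illustrated with a minimal Python sketch. This is not the implementation behind Curated BLAST: the `Protein` class, the function names, the example sequences, and the 0.3 threshold are all hypothetical, and a toy k-mer overlap score stands in for a real sequence alignment such as BLAST.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Protein:
    identifier: str
    description: str  # curated description (empty for genome proteins)
    sequence: str     # amino acid sequence in one-letter code


def match_characterized(query, characterized):
    """Step 1: keep characterized proteins whose description contains the query."""
    needle = query.lower()
    return [p for p in characterized if needle in p.description.lower()]


def kmers(sequence, k=3):
    """Return the set of overlapping k-mers in a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def similarity(a, b, k=3):
    """Jaccard index of k-mer sets; a toy stand-in for an alignment score."""
    ka, kb = kmers(a, k), kmers(b, k)
    if not ka or not kb:
        return 0.0
    return len(ka & kb) / len(ka | kb)


def find_candidates(query, characterized, genome, threshold=0.3):
    """Step 2: list genome proteins similar to any relevant characterized protein."""
    hits = []
    for ref in match_characterized(query, characterized):
        for prot in genome:
            score = similarity(ref.sequence, prot.sequence)
            if score >= threshold:
                hits.append((prot.identifier, ref.identifier, score))
    # Best-scoring candidates first.
    return sorted(hits, key=lambda hit: hit[2], reverse=True)


# Hypothetical data: made-up sequences, not real proteins.
characterized = [
    Protein("refA", "perchlorate reductase subunit", "MKTAYIAKQRQISFVKSHFSRQ"),
    Protein("refB", "alcohol dehydrogenase", "MSTAGKVIKCKAAVLWE"),
]
genome = [
    Protein("gene1", "", "MKTAYIAKQRQISFVKAHFSRQ"),
    Protein("gene2", "", "MLLPWWGGHHNNDDEE"),
]

candidates = find_candidates("perchlorate", characterized, genome)
```

With this made-up data, the query "perchlorate" selects only `refA` in the first step, and only `gene1` passes the similarity threshold in the second. Note that no substrate specificity is predicted for `gene1`; it is reported as a candidate solely because it resembles a characterized protein whose description matches the query.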
We implemented this approach in a web-based tool called Curated BLAST for Genomes (http://papers.genomics.lbl.gov/curated). It relies on a collection of over 100,000 characterized proteins, and it usually takes just a few seconds per query.