library(tidyverse) # loads dplyr and readr, among others
library(janitor)   # clean_names() for tidying column names

Lab 16.2 Introduction to NCBI

The National Center for Biotechnology Information (NCBI) is a branch of the U.S. National Library of Medicine. It maintains some of the most widely used databases in all of biology.

Think of NCBI as a large, semi-organized library of biological data. When a researcher sequences a new organism, they deposit their sequences here so the entire scientific community can use them.

Website: https://www.ncbi.nlm.nih.gov

Learning Goals

At the end of this exercise, you will be able to:

1. Describe what NCBI is and why it is important in biology.
2. Perform a BLAST search on the NCBI website and interpret the results.
3. Identify the five major BLAST types and choose the correct one for a given query.

In this lesson, you will be introduced to NCBI, the world’s largest repository of biological sequence data, and use the command line to work with genomic data.

BLAST — Basic Local Alignment Search Tool

BLAST is a tool that finds sequences in a database that are similar to a query sequence you provide. It works by breaking your query into short “words,” finding exact matches in the database, and then extending those matches to find the best local alignment.

It’s like a search engine, but instead of searching for words in web pages, it searches for sequence similarity in biological databases.
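The word-and-extend idea described above can be sketched in a few lines of base R. This is a toy illustration only, not how BLAST is actually implemented: the sequences are made up, the word size of 4 is an assumption chosen for readability, and the extension step only moves to the right without any scoring or gaps.

```r
# Toy illustration of BLAST-style seeding (not the real BLAST implementation).
# Both sequences are made up for this example.
query     <- "ACGTTGCA"
subject   <- "TTACGTTGCATT"
word_size <- 4  # assumption: a small word size so the output stays readable

# Break a sequence into all overlapping words of length k
get_words <- function(seq, k) {
  starts <- seq_len(nchar(seq) - k + 1)
  substring(seq, starts, starts + k - 1)
}

query_words   <- get_words(query, word_size)
subject_words <- get_words(subject, word_size)

# Seeds: every (query position, subject position) pair whose words match exactly
seeds <- which(outer(query_words, subject_words, "=="), arr.ind = TRUE)
colnames(seeds) <- c("query_pos", "subject_pos")

# Extend a seed to the right for as long as the bases keep matching
extend_right <- function(q, s, q_pos, s_pos) {
  len <- 0
  while (q_pos + len <= nchar(q) && s_pos + len <= nchar(s) &&
         substr(q, q_pos + len, q_pos + len) == substr(s, s_pos + len, s_pos + len)) {
    len <- len + 1
  }
  len
}

first_seed   <- seeds[1, ]
match_length <- extend_right(query, subject,
                             first_seed[["query_pos"]], first_seed[["subject_pos"]])

seeds         # positions of the exact word matches
match_length  # length of the ungapped match that starts at the first seed
```

Real BLAST scores the extension in both directions and stops when the score drops too far, so treat this only as a picture of the seeding step.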

The Five BLAST Types

There are five main types of BLAST, each designed for a specific combination of query type and database type:

| BLAST Type | Query | Database | Best Used When… |
|---|---|---|---|
| blastn | Nucleotide | Nucleotide | You have a DNA/RNA sequence and want to find similar DNA/RNA sequences |
| blastp | Protein | Protein | You have a protein sequence and want to find similar proteins |
| blastx | Nucleotide (translated) | Protein | You have a DNA sequence and want to search for similar proteins (BLAST translates your DNA in all 6 reading frames) |
| tblastn | Protein | Nucleotide (translated) | You have a protein and want to find the gene(s) that might encode it in a nucleotide database |
| tblastx | Nucleotide (translated) | Nucleotide (translated) | You have a DNA sequence and want to find distantly related DNA sequences by comparing at the protein level |

💡 How to remember them:

- The “n” in blastn and tblastn refers to nucleotide
- The “p” in blastp refers to protein
- The “x” means translation is happening
- The “t” prefix means the database is being translated
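One way to practice choosing a BLAST type is to write the table down as a small lookup function. The sketch below is only a study aid: `choose_blast()` and its `translate_both` argument are hypothetical names made up for this lesson and are not part of any package or of BLAST itself.

```r
# Pick a BLAST program from the query type and the database type.
# choose_blast() is a hypothetical helper for this lesson, not a real package function.
choose_blast <- function(query, database, translate_both = FALSE) {
  query    <- match.arg(query, c("nucleotide", "protein"))
  database <- match.arg(database, c("nucleotide", "protein"))

  if (query == "nucleotide" && database == "nucleotide") {
    # translate_both = TRUE compares query and database at the protein level
    if (translate_both) "tblastx" else "blastn"
  } else if (query == "protein" && database == "protein") {
    "blastp"
  } else if (query == "nucleotide" && database == "protein") {
    "blastx"   # the nucleotide query is translated in all 6 reading frames
  } else {
    "tblastn"  # the nucleotide database is translated
  }
}

choose_blast("protein", "protein")
choose_blast("protein", "nucleotide")
choose_blast("nucleotide", "nucleotide", translate_both = TRUE)
```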

The Central Dogma. Source: Khan Academy

View FASTA Files in R

Install Biostrings

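# Install BiocManager from CRAN if it is not already available
if (!requireNamespace("BiocManager", quietly = TRUE))
  install.packages("BiocManager")

# Uncomment and run once to install Biostrings from Bioconductor
# BiocManager::install("Biostrings")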

Load Library

library(Biostrings)

Let’s take a look at a DNA FASTA file
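A minimal sketch of reading a DNA FASTA file with Biostrings. The lab’s DNA file name is not given here, so the example writes a tiny made-up FASTA to a temporary file to stay self-contained; in practice you would set `fasta_path` to your own file.

```r
library(Biostrings)

# Toy FASTA written to a temporary file so the example runs on its own.
# The record names and sequences are made up; replace fasta_path with your file.
fasta_path <- tempfile(fileext = ".fasta")
writeLines(c(">toy_seq_1", "ATGCGTACGTTA",
             ">toy_seq_2", "GGCCTTAA"), fasta_path)

dna <- readDNAStringSet(fasta_path)  # one DNAString per FASTA record
dna                                  # prints names, widths, and sequences
names(dna)                           # the header line of each record
width(dna)                           # sequence lengths in bases
```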

Let’s look at the head of the FASTA file
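Because a FASTA file is plain text, its first few lines can be inspected with base R alone. The sketch below uses a made-up file written to a temporary location; the file content is an assumption for illustration, not the lab’s data.

```r
# Toy FASTA (made-up content) so the example runs on its own
fasta_path <- tempfile(fileext = ".fasta")
writeLines(c(">toy_seq_1 example header",
             "ATGCGTACGTTA",
             "GGCCTTAACCGG",
             ">toy_seq_2",
             "GGCCTTAA"), fasta_path)

# Read only the first three lines, similar to `head -n 3` on the command line
first_lines <- readLines(fasta_path, n = 3)
first_lines

# Header lines start with ">", so counting them gives the number of records
all_lines <- readLines(fasta_path)
n_records <- sum(startsWith(all_lines, ">"))
n_records
```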

Let’s take a look at an amino acid FASTA file
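Protein FASTA files are read the same way, using `readAAStringSet()` instead of the DNA reader. The sequence below is made up and written to a temporary file so the sketch is self-contained; in the lab you would point `aa_path` at your own amino acid file.

```r
library(Biostrings)

# Toy protein FASTA; the record name and sequence are made up.
aa_path <- tempfile(fileext = ".fasta")
writeLines(c(">toy_protein", "MKTAYIAKQR"), aa_path)

aa <- readAAStringSet(aa_path)  # one AAString per FASTA record
aa                              # prints the name, width, and sequence
width(aa)                       # sequence length in amino acids
as.character(aa[[1]])           # the first sequence as a plain string
```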

Let’s try Protein BLAST (blastp). Use unknown_aminoaacid_sequence.txt as the query, select the “RefSeq Protein” (refseq_protein) database, and choose the blastp algorithm.

Practice (if time allows)

1. Download the Description Table in CSV format, rename it to “blastp_results.csv” using the command line, and store it in the “blast_results” folder. What commands did you use to rename the file and move it into the blast_results folder?

2. Read the CSV file into R and clean the column names. Save the cleaned data frame as blastp_results.

3. Now try tblastn with the same query, this time searching the core nucleotide database (core_nt). Download the Description Table in CSV format, rename it to “tblastn_results.csv” using the command line, and store it in the “blast_results” folder. What commands did you use to rename the file and move it into the blast_results folder?

4. What were the top 20 hits, based on e_value, for each BLAST search? How do they compare? (Hint: look at the description column.)
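The read, clean, and rank steps from the practice questions can be sketched as follows. The CSV here is a made-up stand-in written to a temporary file, and its column headers (“Description”, “E value”, “Accession”) are an assumption about what the downloaded table looks like; check the headers in your own file.

```r
library(readr)
library(dplyr)
library(janitor)

# Toy stand-in for a downloaded Description Table; every value is made up.
# In the lab you would read "blast_results/blastp_results.csv" instead.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "Description,E value,Accession",
  "toy hit A,1e-05,TOY_0001",
  "toy hit B,0,TOY_0002",
  "toy hit C,1e-50,TOY_0003",
  "toy hit D,0.01,TOY_0004"
), csv_path)

blastp_results <- read_csv(csv_path, show_col_types = FALSE) %>%
  clean_names()  # "E value" becomes e_value

# Smallest E-values first; n = 2 for the toy data, use n = 20 in the lab
top_hits <- blastp_results %>%
  arrange(e_value) %>%
  slice_head(n = 2)

top_hits$description
```

Sorting with `arrange()` and then taking the first rows always returns exactly `n` hits, even when several hits share the same E-value, which is common when many E-values are reported as 0.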

That’s it! Let’s take a break and then move on to the homework.
