๐ This tutorial will show how to use SNP2TFBS, an R script that identifies the enrichment or loss of transcription factor binding sites at single nucleotide polymorphisms (SNPs) associated with changes in ChIP-seq/ATAC-seq signal across multiple genomic regions.
๐ We will use the mouse genome as an example, but this script supports any species from the JASPAR database.
๐ Here are the things that you will need to run this script:
๐ This script requires the following libraries:
๐ The script takes the following parameters:
๐ Here are the different columns that you will find in the output .csv file:
๐ Letโs run this script with a Oct4 ChIP-seq from the mouse genome to exemplify how it works
๐ First, we load the required libraries
๐ Letโs also import our script
source("../R/SNP2TFBS.R")
๐ Next, we obtain a list of PWM from the mouse genome using JASPAR2022
pwmList <- getMatrixSet(JASPAR2022, opts = list(collection = "CORE",
species = "Mus musculus",
matrixtype = "PWM"))
๐ First, letโs run our script with a test subset of 5 regions:
OCT4_test <- findMotifs(
pwmList = pwmList,
input_high = c("../data/OCT4_input_high.fa"),
input_low = c("../data/OCT4_input_low.fa"),
background_1 = c("../data/OCT4_background_1.fa"),
background_2 = c("../data/OCT4_background_2.fa"),
seq_width = 400,
percentage = "85%",
output_file = F,
test_run = 5)
## [1] "Testing mode on, subsetting the first 5 regions from fasta!!!"
## [1] "Regions have the expected sizes, continuing..."
## [1] "Testing mode on, subsetting the first 5 regions from fasta!!!"
## [1] "Regions have the expected sizes, continuing..."
## [1] "Searching for motifs within input_high..."
## [1] "Found a total of 104 different TF motifs within input sequences with high signal."
## [1] "Searching for motifs within input_low..."
## [1] "Found a total of 102 different TF motifs within input sequences with low signal."
## [1] "Finding motifs enriched in high vs low input signal..."
## [1] "Testing mode on, subsetting the first 5 regions from fasta!!!"
## [1] "Regions have the expected sizes, continuing..."
## [1] "Testing mode on, subsetting the first 5 regions from fasta!!!"
## [1] "Regions have the expected sizes, continuing..."
## [1] "Searching for motifs within background_1..."
## [1] "Found a total of 105 different TF motifs within background_1 sequences."
## [1] "Searching for motifs within background_2..."
## [1] "Found a total of 104 different TF motifs within background_2 sequences."
## [1] "Finding motifs enriched in background_1 vs background_2..."
head(OCT4_test, 10)
๐ Looks like it is working, letโs run the full analysis now! This might take a whileโฆ
OCT4_analysis <- findMotifs(
pwmList = pwmList,
input_high = c("../data/OCT4_input_high.fa"),
input_low = c("../data/OCT4_input_low.fa"),
background_1 = c("../data/OCT4_background_1.fa"),
background_2 = c("../data/OCT4_background_2.fa"),
seq_width = 400,
percentage = "85%",
output_file = F)
## [1] "Regions have the expected sizes, continuing..."
## [1] "Regions have the expected sizes, continuing..."
## [1] "Searching for motifs within input_high..."
## [1] "Found a total of 139 different TF motifs within input sequences with high signal."
## [1] "Searching for motifs within input_low..."
## [1] "Found a total of 139 different TF motifs within input sequences with low signal."
## [1] "Finding motifs enriched in high vs low input signal..."
## [1] "Regions have the expected sizes, continuing..."
## [1] "Regions have the expected sizes, continuing..."
## [1] "Searching for motifs within background_1..."
## [1] "Found a total of 139 different TF motifs within background_1 sequences."
## [1] "Searching for motifs within background_2..."
## [1] "Found a total of 139 different TF motifs within background_2 sequences."
## [1] "Finding motifs enriched in background_1 vs background_2..."
head(OCT4_analysis, 10)
โ Great, it looks like we have a few significant hits!
OCT4_analysis %>%
filter(p.adj < 0.05)
๐ The top result is the Pou5f1::Sox2 motif, which is a dimer that Oct4 (Pou5f1) is a part of!