SNP2TFBS

👉 This tutorial will show how to use SNP2TFBS, an R script that identifies the enrichment or loss of transcription factor binding sites at single nucleotide polymorphisms (SNPs) associated with changes in ChIP-seq/ATAC-seq signal across multiple genomic regions.

👉 We will use the mouse genome as an example, but this script supports any species from the JASPAR database.

Usage

👇 Here are the things that you will need to run this script:

A list of position weight matrices (PWMs): this will be generated using JASPAR2022.
Two input .fasta files containing the regions to be compared. One file must contain the sequences that show high ChIP-seq/ATAC-seq signal; the other must contain the sequences that show low ChIP-seq/ATAC-seq signal. Both files must cover the SAME regions, with and without a SNP (for example, WT and mutated sequences).
Two background .fasta files containing regions to be used as a control. As with the input set, both files must cover the SAME regions, with and without a SNP. Here, however, you should use regions that do NOT show any gain/loss of ChIP-seq/ATAC-seq signal.
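
👇 Before running the script, it can help to confirm that a pair of .fasta files really describes the same regions. The sketch below is not part of SNP2TFBS: `read_fasta()` and `count_mismatches()` are hypothetical base R helpers, and the two temporary files stand in for your own WT and mutated files. It assumes every record has a header line followed by at least one sequence line.

```r
# Hypothetical helper: read a FASTA file into a named character vector.
read_fasta <- function(path) {
  lines <- readLines(path)
  lines <- lines[nzchar(lines)]            # drop empty lines
  is_header <- startsWith(lines, ">")
  record <- cumsum(is_header)              # record index for every line
  seqs <- vapply(
    split(lines[!is_header], record[!is_header]),
    paste, character(1), collapse = ""     # join multi-line sequences
  )
  names(seqs) <- sub("^>", "", lines[is_header])
  toupper(seqs)
}

# Hypothetical helper: number of positions at which two equal-length sequences differ.
count_mismatches <- function(a, b) {
  sum(strsplit(a, "")[[1]] != strsplit(b, "")[[1]])
}

# Toy files standing in for a WT/mutated pair (made-up sequences).
wt_file <- tempfile(fileext = ".fa")
mut_file <- tempfile(fileext = ".fa")
writeLines(c(">region_1", "ACGTACGT", ">region_2", "TTGACCAA"), wt_file)
writeLines(c(">region_1", "ACGTACGA", ">region_2", "TTGGCCAA"), mut_file)

wt <- read_fasta(wt_file)
mut <- read_fasta(mut_file)

# Same number of regions, same names, same widths.
stopifnot(
  length(wt) == length(mut),
  identical(names(wt), names(mut)),
  all(nchar(wt) == nchar(mut))
)

# Per-region number of differing bases; a single SNP per region gives 1.
n_diff <- mapply(count_mismatches, wt, mut)
print(n_diff)
```

The same check can be applied to the background pair.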

Required libraries

👉 This script requires the following libraries:

tidyverse
TFBSTools
JASPAR2022
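
👇 A quick way to see which of these libraries are still missing is sketched below. The install commands in the comments assume the usual sources (tidyverse from CRAN, TFBSTools and JASPAR2022 from Bioconductor via BiocManager); adjust them to your own setup.

```r
required <- c("tidyverse", "TFBSTools", "JASPAR2022")

# requireNamespace() returns TRUE/FALSE without attaching the package.
installed <- vapply(required, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- required[!installed]

if (length(missing_pkgs) > 0) {
  message("Missing packages: ", paste(missing_pkgs, collapse = ", "))
  # Assumed install routes (uncomment what you need):
  # install.packages("tidyverse")
  # install.packages("BiocManager")
  # BiocManager::install(c("TFBSTools", "JASPAR2022"))
} else {
  message("All required packages are installed.")
}
```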

Settings

👉 The script takes the following parameters:

pwmList: PWM list from JASPAR2022.
input_high: .fasta file containing input sequences with high ChIP-seq/ATAC-seq signal.
input_low: .fasta file containing input sequences with low ChIP-seq/ATAC-seq signal.
background_1: .fasta file containing background sequences, must contain the same number of sequences as input_high.
background_2: .fasta file containing background sequences, must contain the same number of sequences as input_low.
seq_width: the length of each sequence within each .fasta file.
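percentage: the minimum score threshold for detecting a motif in TFBSTools, given as a percentage string (for example, "85%"). Default: 85%.
output_file: path to the output .csv file. If set to FALSE, the results are returned instead, so they can be assigned to a variable. Default: FALSE.
test_run: number of regions (N) to subset from each .fasta file in order to test the script. Default: FALSE.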
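
👇 To make the percentage setting more concrete, here is a small base R sketch. It assumes the value is passed to TFBSTools as the min.score argument of searchSeq, where a percentage is interpreted relative to the range between the lowest and highest score a PWM can produce. The toy matrix and its values are made up for illustration.

```r
# Toy log-odds PWM: rows are A, C, G, T; columns are motif positions.
toy_pwm <- matrix(
  c( 1.2, -0.8, -1.5,
    -0.5,  1.0, -0.7,
    -1.0, -0.3,  1.4,
    -0.2, -1.1, -0.9),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("A", "C", "G", "T"), NULL)
)

# Best and worst possible scores: sum of per-position maxima and minima.
max_score <- sum(apply(toy_pwm, 2, max))
min_score <- sum(apply(toy_pwm, 2, min))

# Convert "85%" to a fraction and place it within the score range.
fraction <- as.numeric(sub("%", "", "85%", fixed = TRUE)) / 100
threshold <- min_score + fraction * (max_score - min_score)

print(c(min = min_score, max = max_score, threshold = threshold))
```

Under this reading, a higher percentage keeps only sites that score close to the best possible match, so fewer (but more confident) motif hits are reported.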

Output

👇 Here are the different columns that you will find in the output .csv file:

ID: JASPAR2022 motif ID.
TF: Transcription factor name.
input_high: number of times a motif appeared within input_high sequences.
input_low: number of times a motif appeared within input_low sequences.
input_diff: the difference in the number of motifs between input_high and input_low.
background_1: number of times a motif appeared within background_1 sequences.
background_2: number of times a motif appeared within background_2 sequences.
background_diff: the difference in the number of motifs between background_1 and background_2.
p: p value from a Fisher’s exact test on the absolute difference between high and low and the larger of the two motif counts (high or low).
p.adj: FDR-adjusted p value.
p.adj.signif: Significance level (*** < 0.001, ** < 0.01, * < 0.05, ns = non-significant).
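
👇 The last three columns can be illustrated with base R. This is only one plausible reading of the description above, not the script's actual code: for each motif it builds a hypothetical 2x2 table (input vs. background; absolute difference vs. larger count), runs fisher.test, adjusts with p.adjust(method = "fdr"), and labels significance. All counts and TF names are made up.

```r
# Made-up motif counts for three hypothetical TFs.
counts <- data.frame(
  TF = c("TF_A", "TF_B", "TF_C"),
  input_high = c(120, 40, 15),
  input_low = c(60, 38, 14),
  background_1 = c(50, 41, 16),
  background_2 = c(48, 40, 15)
)
counts$input_diff <- counts$input_high - counts$input_low
counts$background_diff <- counts$background_1 - counts$background_2

# Assumed table layout: rows = input / background,
# columns = |difference| / larger of the two counts.
fisher_p <- function(i) {
  tab <- matrix(
    c(abs(counts$input_diff[i]),
      max(counts$input_high[i], counts$input_low[i]),
      abs(counts$background_diff[i]),
      max(counts$background_1[i], counts$background_2[i])),
    nrow = 2, byrow = TRUE
  )
  fisher.test(tab)$p.value
}

counts$p <- vapply(seq_len(nrow(counts)), fisher_p, numeric(1))
counts$p.adj <- p.adjust(counts$p, method = "fdr")

# *** < 0.001, ** < 0.01, * < 0.05, otherwise ns.
counts$p.adj.signif <- as.character(cut(
  counts$p.adj,
  breaks = c(-Inf, 0.001, 0.01, 0.05, Inf),
  labels = c("***", "**", "*", "ns"),
  right = FALSE
))

print(counts)
```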

Tutorial

👉 Let’s run this script on an Oct4 ChIP-seq dataset from the mouse genome to show how it works.

👇 First, we load the required libraries
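
A minimal sketch of this step, assuming the three packages listed under "Required libraries" are already installed:

```r
# Attach the packages used by the script and by this tutorial.
library(tidyverse)   # data wrangling (%>%, filter, ...)
library(TFBSTools)   # motif matrices and sequence scanning
library(JASPAR2022)  # JASPAR 2022 motif database
```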

👇 Let’s also import our script

source("../R/SNP2TFBS.R")

👇 Next, we obtain a list of PWMs for mouse (Mus musculus) using JASPAR2022

pwmList <- getMatrixSet(JASPAR2022, opts = list(collection = "CORE",
                                                     species = "Mus musculus",
                                                     matrixtype = "PWM"))

👇 Before the full analysis, let’s run our script with a test subset of 5 regions:

OCT4_test <- findMotifs(
  pwmList = pwmList,
  input_high = c("../data/OCT4_input_high.fa"),
  input_low = c("../data/OCT4_input_low.fa"),
  background_1 = c("../data/OCT4_background_1.fa"),
  background_2 = c("../data/OCT4_background_2.fa"),
  seq_width = 400,
  percentage = "85%",
  output_file = FALSE,
  test_run = 5)

## [1] "Testing mode on, subsetting the first 5 regions from fasta!!!"
## [1] "Regions have the expected sizes, continuing..."
## [1] "Testing mode on, subsetting the first 5 regions from fasta!!!"
## [1] "Regions have the expected sizes, continuing..."
## [1] "Searching for motifs within input_high..."
## [1] "Found a total of 104 different TF motifs within input sequences with high signal."
## [1] "Searching for motifs within input_low..."
## [1] "Found a total of 102 different TF motifs within input sequences with low signal."
## [1] "Finding motifs enriched in high vs low input signal..."
## [1] "Testing mode on, subsetting the first 5 regions from fasta!!!"
## [1] "Regions have the expected sizes, continuing..."
## [1] "Testing mode on, subsetting the first 5 regions from fasta!!!"
## [1] "Regions have the expected sizes, continuing..."
## [1] "Searching for motifs within background_1..."
## [1] "Found a total of 105 different TF motifs within background_1 sequences."
## [1] "Searching for motifs within background_2..."
## [1] "Found a total of 104 different TF motifs within background_2 sequences."
## [1] "Finding motifs enriched in background_1 vs background_2..."

head(OCT4_test, 10)

👇 Looks like it is working, let’s run the full analysis now! This might take a while…

OCT4_analysis <- findMotifs(
  pwmList = pwmList,
  input_high = c("../data/OCT4_input_high.fa"),
  input_low = c("../data/OCT4_input_low.fa"),
  background_1 = c("../data/OCT4_background_1.fa"),
  background_2 = c("../data/OCT4_background_2.fa"),
  seq_width = 400,
  percentage = "85%",
  output_file = FALSE)

## [1] "Regions have the expected sizes, continuing..."
## [1] "Regions have the expected sizes, continuing..."
## [1] "Searching for motifs within input_high..."
## [1] "Found a total of 139 different TF motifs within input sequences with high signal."
## [1] "Searching for motifs within input_low..."
## [1] "Found a total of 139 different TF motifs within input sequences with low signal."
## [1] "Finding motifs enriched in high vs low input signal..."
## [1] "Regions have the expected sizes, continuing..."
## [1] "Regions have the expected sizes, continuing..."
## [1] "Searching for motifs within background_1..."
## [1] "Found a total of 139 different TF motifs within background_1 sequences."
## [1] "Searching for motifs within background_2..."
## [1] "Found a total of 139 different TF motifs within background_2 sequences."
## [1] "Finding motifs enriched in background_1 vs background_2..."

head(OCT4_analysis, 10)

✅ Great, it looks like we have a few significant hits!

OCT4_analysis %>%
  filter(p.adj < 0.05)

👉 The top result is the Pou5f1::Sox2 motif, which is a dimer that Oct4 (Pou5f1) is a part of!

Luis E. Abatti

2023-05-15
