SNP2TFBS

๐Ÿ‘‰ This tutorial will show how to use SNP2TFBS, an R script that identifies the enrichment or loss of transcription factor binding sites at single nucleotide polymorphisms (SNPs) associated with changes in ChIP-seq/ATAC-seq signal across multiple genomic regions.

๐Ÿ‘‰ We will use the mouse genome as an example, but this script supports any species from the JASPAR database.

Usage

๐Ÿ‘‡ Here are the things that you will need to run this script:

  1. A list of Position-specific Weight Matrices (PWMs): this will be generated using JASPAR2022.
  2. Two sets of input .fasta files containing the regions to be compared. One file must contain the sequences that show high ChIP-seq/ATAC-seq signal. The other file must contain the sequences that show low ChIP-seq/ATAC-seq signal. They must be the ** SAME ** regions but with and without a SNP (for example, WT and mutated sequences).
  3. Two sets of background .fasta files containing regions to be used as a control. Like the input set, you must use the ** SAME ** regions, with and without a SNP. However, over here you should use regions that do NOT show any gain/loss of ChIP-seq/ATAC-seq signal.

Required libraries

๐Ÿ‘‰ This script requires the following libraries:

  1. tidyverse
  2. TFBSTools
  3. JASPAR2022

Settings

๐Ÿ‘‰ The script takes the following parameters:

  1. pwmList: PWM list from JASPAR2022.
  2. input_high: .fasta file containing input sequences with high ChIP-seq/ATAC-seq signal.
  3. input_low: .fasta file containing input sequences with low ChIP-seq/ATAC-seq signal.
  4. background_1: .fasta file containing background sequences, must contain the same number of sequences as input_high.
  5. background_2: .fasta file containing background sequences, must contain the same number of sequences as input_low.
  6. seq_width: the length of each sequence within each .fasta file.
  7. percentage: the threshold for detecting a motif in TFBSTools. Default: 85%.
  8. output_file: path to output .csv file. If set to False, return the results to a variable. Default: False.
  9. test_run: subset the .fasta files to N regions to test the script. Default: False.

Output

๐Ÿ‘‡ Here are the different columns that you will find in the output .csv file:

  1. ID: JASPAR2022 motif ID.
  2. TF: Transcription factor name.
  3. input_high: number of times a motif appeared within input_high sequences.
  4. input_low: number of times a motif appeared within input_low sequences.
  5. input_diff: the difference in the number of motifs between input_high and input_low.
  6. background_1: number of times a motif appeared within background_1 sequences.
  7. background_2: number of times a motif appeared within background_2 sequences.
  8. background_diff: the difference in the number of motifs between background_1 and background_2.
  9. p: Fisherโ€™s exact test of the absolute difference between high vs.ย low and the highest amount of motifs in either high or low.
  10. p.adj: FDR-adjusted p value.
  11. p.adj.signif: Significance level (*** < 0.001, ** < 0.01, * < 0.05, ns = non-significant).

Tutorial

๐Ÿ‘‰ Letโ€™s run this script with a Oct4 ChIP-seq from the mouse genome to exemplify how it works

๐Ÿ‘‡ First, we load the required libraries

๐Ÿ‘‡ Letโ€™s also import our script

source("../R/SNP2TFBS.R")

๐Ÿ‘‡ Next, we obtain a list of PWM from the mouse genome using JASPAR2022

pwmList <- getMatrixSet(JASPAR2022, opts = list(collection = "CORE",
                                                     species = "Mus musculus",
                                                     matrixtype = "PWM"))

๐Ÿ‘‡ First, letโ€™s run our script with a test subset of 5 regions:

OCT4_test <- findMotifs(
  pwmList = pwmList,
  input_high = c("../data/OCT4_input_high.fa"),
  input_low = c("../data/OCT4_input_low.fa"),
  background_1 = c("../data/OCT4_background_1.fa"),
  background_2 = c("../data/OCT4_background_2.fa"),
  seq_width = 400,
  percentage = "85%",
  output_file = F,
  test_run = 5)
## [1] "Testing mode on, subsetting the first 5 regions from fasta!!!"
## [1] "Regions have the expected sizes, continuing..."
## [1] "Testing mode on, subsetting the first 5 regions from fasta!!!"
## [1] "Regions have the expected sizes, continuing..."
## [1] "Searching for motifs within input_high..."
## [1] "Found a total of 104 different TF motifs within input sequences with high signal."
## [1] "Searching for motifs within input_low..."
## [1] "Found a total of 102 different TF motifs within input sequences with low signal."
## [1] "Finding motifs enriched in high vs low input signal..."
## [1] "Testing mode on, subsetting the first 5 regions from fasta!!!"
## [1] "Regions have the expected sizes, continuing..."
## [1] "Testing mode on, subsetting the first 5 regions from fasta!!!"
## [1] "Regions have the expected sizes, continuing..."
## [1] "Searching for motifs within background_1..."
## [1] "Found a total of 105 different TF motifs within background_1 sequences."
## [1] "Searching for motifs within background_2..."
## [1] "Found a total of 104 different TF motifs within background_2 sequences."
## [1] "Finding motifs enriched in background_1 vs background_2..."
head(OCT4_test, 10)

๐Ÿ‘‡ Looks like it is working, letโ€™s run the full analysis now! This might take a whileโ€ฆ

OCT4_analysis <- findMotifs(
  pwmList = pwmList,
  input_high = c("../data/OCT4_input_high.fa"),
  input_low = c("../data/OCT4_input_low.fa"),
  background_1 = c("../data/OCT4_background_1.fa"),
  background_2 = c("../data/OCT4_background_2.fa"),
  seq_width = 400,
  percentage = "85%",
  output_file = F)
## [1] "Regions have the expected sizes, continuing..."
## [1] "Regions have the expected sizes, continuing..."
## [1] "Searching for motifs within input_high..."
## [1] "Found a total of 139 different TF motifs within input sequences with high signal."
## [1] "Searching for motifs within input_low..."
## [1] "Found a total of 139 different TF motifs within input sequences with low signal."
## [1] "Finding motifs enriched in high vs low input signal..."
## [1] "Regions have the expected sizes, continuing..."
## [1] "Regions have the expected sizes, continuing..."
## [1] "Searching for motifs within background_1..."
## [1] "Found a total of 139 different TF motifs within background_1 sequences."
## [1] "Searching for motifs within background_2..."
## [1] "Found a total of 139 different TF motifs within background_2 sequences."
## [1] "Finding motifs enriched in background_1 vs background_2..."
head(OCT4_analysis, 10)

โœ… Great, it looks like we have a few significant hits!

OCT4_analysis %>%
  filter(p.adj < 0.05)

๐Ÿ‘‰ The top result is the Pou5f1::Sox2 motif, which is a dimer that Oct4 (Pou5f1) is a part of!