Tn-Seq Analysis using DESeq2
I have adapted Keith Turner’s Tn-Seq analysis pipeline (https://github.com/khturner/Tn-seq) to run locally on any Mac using DESeq2 and 2023 versions of samtools (v1.16.1) and flexbar (v3.5.0). This document gives a brief overview of the pipeline and provides information about how to set up and run the scripts.
DESeq2 analysis of Tn-Seq data is split into 2 parts.
Map reads to genome and count reads per Tn insertion site
Find reads with the Tn sequence
Clip Tn sequence from the beginning of the reads
Map clipped reads to reference genome with Bowtie2
Generate .bam file using Samtools for read visualization with IGV
Generate a table of reads per Tn insertion site
Determine genes with significantly decreased Tn insertions in output relative to input
Loess smoothing to reduce genomic-position sequencing/count bias
Normalize reads per insertion site across replicates
Determine number of reads per gene
Calculate read count variance per gene
Negative binomial test for decreased reads/gene in output vs. input
Input files
1. Fastq files for Tn-Seq reads
2. Reference genome (PAO1 is available in Box/Jorth Lab/ref_genome/PAO1)
Output
1. Table of genes with fold-change output vs. input, include p-values and FDRs for each gene calculated by negative binomial test
2. Table of insertions per gene
3. Table of reads per insertion
4. Graph showing fit of dispersion (variance estimates per gene)
5. Graph showing log2 fold-change (output/input) vs. mean reads/gene
6. PCA plot showing relatedness of samples
7. R history (in case you want to make publication quality graphs in R later)
-
This is the most challenging part and can be affected by packages that you have installed already, as well as by the computer you are using (new Apple silicon processors can cause issues with some packages, but the approach below should fix that).
Before running the pipeline, you need to install dependencies and have the scripts accessible to your user environment. You may have some of these installed, but if you don’t, here are all the tools you should need.
#Install Xcode
Xcode is required for any mac running scripts in the terminal, if you don’t have it already you can download for the Mac App Store App
#Install Macports (pick the package for your OS)
https://www.macports.org/install.php
#Install homebrew:
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
#install bowtie2
wget https://github.com/BenLangmead/bowtie2/releases/download/v2.5.1/bowtie2-2.5.1-macos-arm64.zip
#install wget
brew install wget
#install R 4.2.2
wget https://cran.r-project.org/src/base/R-4/R-4.2.2.tar.gz
# Installing R dependencies
R
install.packages(“dplyr”)
install.packages(“mclust”)
if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("DESeq2")
#Install anaconda with homebrew:
brew install --cask anaconda
#Install anaconda with homebrew:
conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge
conda config --set channel_priority strict
#create a new anaconda environment in which to install flexbar
conda create -n flexbar
conda activate flexbar
conda install flexbar
#to leave flexbar anaconda environment:
conda deactivate
#install fqgrep
conda install -c bioconda fqgrep
#install gawk
sudo port install gawk
#Install samtools
wget https://github.com/samtools/htslib/releases/download/1.16/htslib-1.16.tar.bz2
wget https://github.com/samtools/samtools/releases/download/1.16.1/samtools-1.16.1.tar.bz2
tar -xvf htslib-1.16.tar.bz2
tar -xvf samtools-1.16.1.tar.bz2
cd samtools-1.16.1
autoheader
autoconf -Wno-syntax
./configure
make
sudo make install
Once you have everything installed you need to make sure all the scripts are in your PATH. I recommend moving executables to your /usr/local/bin/ directory, as this is where you will also copy your Tn-Seq scripts. You can update your path by changing your .zshrc profile:
#to edit your profile
nano ~/.zshrc
#add the following to your profile
path+=('/usr/local/bin')
export PATH
# hit “control + o” to save the .zshrc file, and “control + x” to exit nano
# now source your profile to implement the change to your PATH, or simply open a new
# terminal window
source ~/.zshrc
Now you should have all the dependencies for the Tn-Seq pipeline installed. Next step is to copy the Tn-Seq scripts to your /usr/local/bin/ and make them executable. Note, change /path/to/ to whatever directory you have the scripts in initially. If you have Box Drive installed, you can find the scripts using this path:
/Users/$USER/Library/CloudStorage/Box-Box/Jorth\ Lab/TnSeqScripts/
# Example copy commands
cp /path/to/TnGeneBin.pl /usr/local/bin
cp /path/to/trimmer /usr/local/bin
cp /path/to/TnSeq4.sh /usr/local/bin
cp /path/to/TnSeqAnalysis2022.sh /usr/local/bin
cp /path/to/TnSeqDeSeq2022.R /usr/local/bin
# Example commands to make files executable for your username in /usr/local/bin, note
# may have to do this for other scripts/dependencies too if you receive errors
sudo chmod 757 /usr/local/bin/trimmer
sudo chmod 757 /usr/local/bin/Tn*
# The last step is making sure that you have the reference genome in the right place.
# You are probably using P. aeruginosa PAO1. Follow these steps to make sure that the
# PAO1 reference genome files are in the correct directory.
# change to your home directory
cd ~
mkdir ref_genome
# copy PAO1 from Box to your new ref_genome directory
cp -R /Users/$USER/Library/CloudStorage/Box-Box/Jorth\ Lab/ref_genome/PAO1 ~/ref_genome
-
TnSeq4.sh
usage: TnSeq4.sh [-i <IR seq>] [-a <assembly>] <pfx>
Required parameters:
-i The sequence of the transposon end sequence remaining (for junction authentication)
-a The name of the genome assembly you're using (PAO1)
-m The number of mismatches/indels you want to tolerate during search
-t The number of processing threads you want to use for read mapping
The required parameters must precede the file prefix for your sequence file:
(e.g. if your sequence file is named condition1_R1_001.fastq, the prefix is "condition1")
Example3:
TnSeq4.sh -i GGTTCCGTCCAGGACGCTACTTGTGTATAAGAGTCAG -a PAO1 -m 4 -t 8 condition1
Produces a directory containing several useful files:
Directory:
./condition1
File Description
condition1-IR-clip.fastq Fastq file with Tn sequence trimmed off by fqgrep and trimmer
condition1-mapped.bam Binary mapping file produced by samtools from the .sam file
condition1-mapped.sam Tab-delimited mapping file produced by Bowtie2, filtered to remove unmapped reads(?)
condition1.sam Full tab-delimited mapping file produced by Bowtie2
condition1-sites.txt Reads per Tn site, feeds into the 2nd script
condition1-tailsites.txt ?
condition1-TnSeq.txt Summary file, reads with Tn, reads mapped, etc.
A directory within ./condition1:
/IGV
File Description
condition1-mapped.sorted.bam Can be opened with IGV to view read mapping
condition1-mapped.sorted.bam.bai Required by IGV to view read mapping with .bam
-
TnSeqAnalysis2022.sh
usage: TnSeqAnalysis2022.sh [-i <#>] [-a <assembly>] [-o <output>] [-c <name>] [-x <#>] [-t <name>] [-y <#>] <pfx1> <pfx2> <pfx3> ... <pfxn>
Required parameters:
-i The number of the most represented sites to ignore
-a The name of the assembly you're using (PAO1 only so far)
-o The name for the output file
-c The name for the control condition
-x The number of replicates for the control condition
-t The name for the test condition
-y The number of replicates for the test condition
The required parameters must precede the prefixes to be joined, listed with the control conditions followed by the test conditions. See example below.
Example:
TnSeqAnalysis2022.sh -i 50 -a PAO1 -o Example -c control -x 2 -t test -y 2 C1 C2 T1 T2
Where the options above refer to the following:
-i 50 Ignore the 50 sites with the most reads.
-a PAO1 Use PAO1 gene annotations.
-o Example All output files will have a prefix, “Example…”.
-c control The name of the control condition is “control”
-x 2 There are two replicates of the control.
-t test The name of the test condition is “test”.
-y 2 There are 2 replicates of the test.
C1 C2 T1 T2 The files to be analyzed were produced by separate TnSeq4.sh analysis of C1_R1_001.fastq, C2_R1_001.fastq, T1_R1_001.fastq, and T2_R1_001.fastq.
After running the TnSeqAnalysis2022.sh, you can look at the output files on your computer to see your results.