Tn-Seq Analysis using DESeq2

I have adapted Keith Turner’s Tn-Seq analysis pipeline (https://github.com/khturner/Tn-seq) to run locally on any Mac using DESeq2 and 2023 versions of samtools (v1.16.1) and flexbar (v3.5.0). This document gives a brief overview of the pipeline and provides information about how to set up and run the scripts. 

DESeq2 analysis of Tn-Seq data is split into 2 parts.

  • Map reads to genome and count reads per Tn insertion site

    • Find reads with the Tn sequence

    • Clip Tn sequence from the beginning of the reads

    • Map clipped reads to reference genome with Bowtie2

    • Generate .bam file using Samtools for read visualization with IGV

    • Generate a table of reads per Tn insertion site

  • Determine genes with significantly decreased Tn insertions in output relative to input

    • Loess smoothing to reduce genomic-position sequencing/count bias

    • Normalize reads per insertion site across replicates

    • Determine number of reads per gene

    • Calculate read count variance per gene

    • Negative binomial test for decreased reads/gene in output vs. input

Input files

1.     Fastq files for Tn-Seq reads

2.     Reference genome (PAO1 is available in Box/Jorth Lab/ref_genome/PAO1)

Output

1.     Table of genes with fold-change output vs. input, include p-values and FDRs for each gene calculated by negative binomial test

2.     Table of insertions per gene

3.     Table of reads per insertion

4.     Graph showing fit of dispersion (variance estimates per gene)

5.     Graph showing log2 fold-change (output/input) vs. mean reads/gene

6.     PCA plot showing relatedness of samples

7.     R history (in case you want to make publication quality graphs in R later)


  • This is the most challenging part and can be affected by packages that you have installed already, as well as by the computer you are using (new Apple silicon processors can cause issues with some packages, but the approach below should fix that).

    Before running the pipeline, you need to install dependencies and have the scripts accessible to your user environment. You may have some of these installed, but if you don’t, here are all the tools you should need.

    #Install Xcode

    Xcode is required for any mac running scripts in the terminal, if you don’t have it already you can download for the Mac App Store App

    #Install Macports (pick the package for your OS)

    https://www.macports.org/install.php

    #Install homebrew:

    /bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

    #install bowtie2

    wget https://github.com/BenLangmead/bowtie2/releases/download/v2.5.1/bowtie2-2.5.1-macos-arm64.zip

    #install wget

    brew install wget

    #install R 4.2.2

    wget https://cran.r-project.org/src/base/R-4/R-4.2.2.tar.gz

    # Installing R dependencies

    R

    install.packages(“dplyr”)

    install.packages(“mclust”)

    if (!require("BiocManager", quietly = TRUE))

    install.packages("BiocManager")

    BiocManager::install("DESeq2")

    #Install anaconda with homebrew:

    brew install --cask anaconda

    #Install anaconda with homebrew:

    conda config --add channels defaults

    conda config --add channels bioconda

    conda config --add channels conda-forge

    conda config --set channel_priority strict

    #create a new anaconda environment in which to install flexbar

    conda create -n flexbar

    conda activate flexbar

    conda install flexbar

    #to leave flexbar anaconda environment:

    conda deactivate

    #install fqgrep

    conda install -c bioconda fqgrep

    #install gawk

    sudo port install gawk

    #Install samtools

    wget https://github.com/samtools/htslib/releases/download/1.16/htslib-1.16.tar.bz2

    wget https://github.com/samtools/samtools/releases/download/1.16.1/samtools-1.16.1.tar.bz2

    tar -xvf htslib-1.16.tar.bz2

    tar -xvf samtools-1.16.1.tar.bz2

    cd samtools-1.16.1

    autoheader

    autoconf -Wno-syntax

    ./configure

    make

    sudo make install

    Once you have everything installed you need to make sure all the scripts are in your PATH. I recommend moving executables to your /usr/local/bin/ directory, as this is where you will also copy your Tn-Seq scripts. You can update your path by changing your .zshrc profile:

    #to edit your profile

    nano ~/.zshrc

    #add the following to your profile

    path+=('/usr/local/bin')

    export PATH

    # hit “control + o” to save the .zshrc file, and “control + x” to exit nano

    # now source your profile to implement the change to your PATH, or simply open a new

    # terminal window

    source ~/.zshrc

    Now you should have all the dependencies for the Tn-Seq pipeline installed. Next step is to copy the Tn-Seq scripts to your /usr/local/bin/ and make them executable. Note, change /path/to/ to whatever directory you have the scripts in initially. If you have Box Drive installed, you can find the scripts using this path:

    /Users/$USER/Library/CloudStorage/Box-Box/Jorth\ Lab/TnSeqScripts/

    # Example copy commands

    cp /path/to/TnGeneBin.pl /usr/local/bin

    cp /path/to/trimmer /usr/local/bin

    cp /path/to/TnSeq4.sh /usr/local/bin

    cp /path/to/TnSeqAnalysis2022.sh /usr/local/bin

    cp /path/to/TnSeqDeSeq2022.R /usr/local/bin

    # Example commands to make files executable for your username in /usr/local/bin, note

    # may have to do this for other scripts/dependencies too if you receive errors

    sudo chmod 757 /usr/local/bin/trimmer

    sudo chmod 757 /usr/local/bin/Tn*

    # The last step is making sure that you have the reference genome in the right place.

    # You are probably using P. aeruginosa PAO1. Follow these steps to make sure that the

    # PAO1 reference genome files are in the correct directory.

    # change to your home directory

    cd ~

    mkdir ref_genome

    # copy PAO1 from Box to your new ref_genome directory

    cp -R /Users/$USER/Library/CloudStorage/Box-Box/Jorth\ Lab/ref_genome/PAO1 ~/ref_genome

  • TnSeq4.sh

    usage: TnSeq4.sh [-i <IR seq>] [-a <assembly>] <pfx>

    Required parameters:

    -i The sequence of the transposon end sequence remaining (for junction authentication)

    -a The name of the genome assembly you're using (PAO1)

    -m The number of mismatches/indels you want to tolerate during search

    -t The number of processing threads you want to use for read mapping

    The required parameters must precede the file prefix for your sequence file:

    (e.g. if your sequence file is named condition1_R1_001.fastq, the prefix is "condition1")

    Example3:

    TnSeq4.sh -i GGTTCCGTCCAGGACGCTACTTGTGTATAAGAGTCAG -a PAO1 -m 4 -t 8 condition1

    Produces a directory containing several useful files:

    Directory:

    ./condition1

    File Description

    condition1-IR-clip.fastq Fastq file with Tn sequence trimmed off by fqgrep and trimmer

    condition1-mapped.bam Binary mapping file produced by samtools from the .sam file

    condition1-mapped.sam Tab-delimited mapping file produced by Bowtie2, filtered to remove unmapped reads(?)

    condition1.sam Full tab-delimited mapping file produced by Bowtie2

    condition1-sites.txt Reads per Tn site, feeds into the 2nd script

    condition1-tailsites.txt ?

    condition1-TnSeq.txt Summary file, reads with Tn, reads mapped, etc.

    A directory within ./condition1:

    /IGV

    File Description

    condition1-mapped.sorted.bam Can be opened with IGV to view read mapping

    condition1-mapped.sorted.bam.bai Required by IGV to view read mapping with .bam

  • TnSeqAnalysis2022.sh

    usage: TnSeqAnalysis2022.sh [-i <#>] [-a <assembly>] [-o <output>] [-c <name>] [-x <#>] [-t <name>] [-y <#>] <pfx1> <pfx2> <pfx3> ... <pfxn>

    Required parameters:

    -i The number of the most represented sites to ignore

    -a The name of the assembly you're using (PAO1 only so far)

    -o The name for the output file

    -c The name for the control condition

    -x The number of replicates for the control condition

    -t The name for the test condition

    -y The number of replicates for the test condition

    The required parameters must precede the prefixes to be joined, listed with the control conditions followed by the test conditions. See example below.

    Example:

    TnSeqAnalysis2022.sh -i 50 -a PAO1 -o Example -c control -x 2 -t test -y 2 C1 C2 T1 T2

    Where the options above refer to the following:

    -i 50 Ignore the 50 sites with the most reads.

    -a PAO1 Use PAO1 gene annotations.

    -o Example All output files will have a prefix, “Example…”.

    -c control The name of the control condition is “control”

    -x 2 There are two replicates of the control.

    -t test The name of the test condition is “test”.

    -y 2 There are 2 replicates of the test.

    C1 C2 T1 T2 The files to be analyzed were produced by separate TnSeq4.sh analysis of C1_R1_001.fastq, C2_R1_001.fastq, T1_R1_001.fastq, and T2_R1_001.fastq.

    After running the TnSeqAnalysis2022.sh, you can look at the output files on your computer to see your results.