Micro4all workflow
Introduction
The following tutorial shows the workflow we follow in our laboratory for the analysis of amplicon data. It is based on R and is intended to be easy and readable for all users, even those with little knowledge of R. Moreover, it integrates the main utilities of the micro4all package, which was created to collect the R functions we usually use. We will cover several steps, including:
- Produce an amplicon sequence variant (ASV) table, making use of dada2 package.
- Study microbial diversity (\(\alpha\) and \(\beta\) diversity).
- Get taxonomic profiles at phylum and genus level.
- Differential abundance analysis with ANCOMBC package.
- Ordination graphics and environmental fitting of soil physicochemical parameters.
To follow this tutorial, we will use an example dataset, which is detailed in the next section. The results you will see were produced with R version 4.1.2 (2021-11-01) on Ubuntu 20.04.3 LTS. It is important to know that we use Ubuntu, specifically for tools such as Figaro or cutadapt (we will see this in detail later on).
Dataset
In this tutorial, we will use a set of our own sequences. Specifically, we collected endosphere samples from 8 olive trees (Picual variety) at three different locations (let’s call them location 1, location 2 and location 3) and sequenced the V3-V4 region of the 16S rRNA gene (with Illumina MiSeq technology and 300 bp PE) in order to study the bacterial community. These paired-end fastq files (already demultiplexed and without barcodes) can be downloaded here. In this link you will also find a phyloseq object that will be used in the environmental fitting section of this tutorial and a modified RDP database file for classification (explained later).
Install package micro4all
Let’s install the micro4all package. This package is hosted on GitHub, so we will need devtools to install it from this repository. If it is the first time you install this package, it will take a while to get the full installation done. If not, you don’t need to install it again; just load the libraries before processing your dataset.
##Install devtools
install.packages("devtools")Now, we are ready to install micro4all…
##Install micro4all with devtools
library(devtools)
install_github("nuriamw/micro4all")…and load the library
#Load micro4all
library(micro4all)
Produce an amplicon sequence variant (ASV) table
The first part of this tutorial is mainly based on the information provided by dada2 (we highly recommend checking its online tutorial), but with our own implementations. We will start with raw sequences, get to know their quality profiles, clean them and infer Amplicon Sequence Variants (ASVs). Thus, we will end this section with an ASV table, ready for microbial diversity analysis.
Prepare data: load libraries and read sequences files
We will make use of several packages, such as dada2, ShortRead, ggplot2 and tidyverse. The first two can be installed via BiocManager as follows:
##dada2 installation
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("dada2")
##ShortRead installation
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("ShortRead")When installing tidyverse, ggplot2 is automatically installed as it is part of the tidyverse package collection:
##Install tidyverse
install.packages("tidyverse")Now, let’s load the libraries
##Load libraries
library(dada2); packageVersion("dada2")
#> [1] '1.24.0'
library(ShortRead);packageVersion("ShortRead")
#> [1] '1.54.0'
library(ggplot2);packageVersion("ggplot2")
#> [1] '3.4.1'
library(tidyverse);packageVersion("tidyverse")
#> [1] '2.0.0'
Great! We are ready to use the dada2 workflow. First of all, we will set the path to the FASTQ files. We will use them unzipped, but it is perfectly OK for dada2 to use them zipped. Just copy and paste your folder path next to path <-. Remember that "~/" is only valid on Unix-like systems (such as Ubuntu), where it points to your home folder. Windows users should write something like: "C:/Users/user_name/Desktop/PATH_to_your_Data".
##SET PATH, FILES AND SAMPLE.NAMES
path <- "/home/nuria/Escritorio/data_micro4all/" # CHANGE ME to the directory containing the fastq files
list.files(path) #list files in this folder
#> [1] "ASV_length_nochim.rds" "cutadapt" "figaro"
#> [4] "filtered" "filtN" "L1-1_S102_R1_001.fastq"
#> [7] "L1-1_S102_R2_001.fastq" "L1-2_S112_R1_001.fastq" "L1-2_S112_R2_001.fastq"
#> [10] "L1-3_S121_R1_001.fastq" "L1-3_S121_R2_001.fastq" "L1-4_S130_R1_001.fastq"
#> [13] "L1-4_S130_R2_001.fastq" "L1-5_S138_R1_001.fastq" "L1-5_S138_R2_001.fastq"
#> [16] "L1-6_S145_R1_001.fastq" "L1-6_S145_R2_001.fastq" "L1-7_S152_R1_001.fastq"
#> [19] "L1-7_S152_R2_001.fastq" "L1-8_S159_R1_001.fastq" "L1-8_S159_R2_001.fastq"
#> [22] "L2-1_S139_R1_001.fastq" "L2-1_S139_R2_001.fastq" "L2-2_S146_R1_001.fastq"
#> [25] "L2-2_S146_R2_001.fastq" "L2-3_S153_R1_001.fastq" "L2-3_S153_R2_001.fastq"
#> [28] "L2-4_S160_R1_001.fastq" "L2-4_S160_R2_001.fastq" "L2-5_S104_R1_001.fastq"
#> [31] "L2-5_S104_R2_001.fastq" "L2-6_S114_R1_001.fastq" "L2-6_S114_R2_001.fastq"
#> [34] "L2-7_S123_R1_001.fastq" "L2-7_S123_R2_001.fastq" "L2-8_S132_R1_001.fastq"
#> [37] "L2-8_S132_R2_001.fastq" "L3-1_S136_R1_001.fastq" "L3-1_S136_R2_001.fastq"
#> [40] "L3-2_S143_R1_001.fastq" "L3-2_S143_R2_001.fastq" "L3-3_S150_R1_001.fastq"
#> [43] "L3-3_S150_R2_001.fastq" "L3-4_S157_R1_001.fastq" "L3-4_S157_R2_001.fastq"
#> [46] "L3-5_S101_R1_001.fastq" "L3-5_S101_R2_001.fastq" "L3-6_S111_R1_001.fastq"
#> [49] "L3-6_S111_R2_001.fastq" "L3-7_S120_R1_001.fastq" "L3-7_S120_R2_001.fastq"
#> [52] "L3-8_S129_R1_001.fastq" "L3-8_S129_R2_001.fastq" "MOCK-1_S169_R1_001.fastq"
#> [55] "MOCK-1_S169_R2_001.fastq" "MOCK-2_S176_R1_001.fastq" "MOCK-2_S176_R2_001.fastq"
#> [58] "MOCK-3_S183_R1_001.fastq" "MOCK-3_S183_R2_001.fastq" "nochim"
#> [61] "num_seqs.txt" "rdp_train_set_18_H.fa" "tabla_otu_inicial.rds"
#> [64] "taxa_rdp_16S.rds"Second, we will store path names for every fastq file. We will do this for forward reads (fnfFs) and reverse reads (fnRs), respectively.
## SORT FORWARD AND REVERSE READS SEPARATELY; forward and reverse fastq filenames have format: SAMPLENAME_R1_001.fastq and SAMPLENAME_R2_001.fastq ##
fnFs <- sort(list.files(path, pattern="_R1_001.fastq", full.names = TRUE))
fnRs <- sort(list.files(path, pattern="_R2_001.fastq", full.names = TRUE))
# Extract sample names, assuming filenames have format: SAMPLENAME_XXX.fastq
sample.names <- sapply(strsplit(basename(fnFs), "_"), `[`, 1)
📝TIP: We get sample names with basename(fnFs), a function that removes all the previous path up to the last path separator (/). Then, strsplit works over the file names and splits them at every "_". Afterwards, sapply loops over the resulting list of split names and takes the first part (corresponding to SAMPLENAME, if file names have the format SAMPLENAME_XXX.fastq). For example, our first sample has the following name:
##See name of the first sample (fnFs)
basename(fnFs[1])
#> [1] "L1-1_S102_R1_001.fastq"
Then, strsplit returns…
##Apply strsplit
strsplit(basename(fnFs[1]), "_")#> [[1]]
#> [1] "L1-1" "S102" "R1" "001.fastq"
…and we take the first part, which corresponds to "L1-1". If your file names are more complex and you need to remove some previous pattern, you can always use gsub over sample.names. For example, to remove the final "-1" of sample.names[1], we would use gsub(pattern="-1", replacement="", sample.names[1]). Moreover, it is important for later analysis that your sample names are not purely numerical. If that’s the case, you can easily add a letter (for example, an "S") as follows: paste0("S",sample.names).
Check sequences quality
A crucial step in any sequencing analysis pipeline is quality checking. For this purpose, we will use several approaches in order to get different perspectives on this aspect. First, we will generate a table with the number of reads in each sample, just to check whether anything went wrong during the sequencing process. We will use the ShortRead package. As we have already loaded the library, the use of ShortRead:: is not strictly necessary, but it is useful to know which package each function comes from.
##COUNT NUMBER OF READS IN EACH SAMPLE BEFORE FILTERING
raw_reads_count <- NULL #Creates an empty object
for (i in 1:length(fnFs)){ #loop over fnFs and count the number of sequences with length() over readFastq()
raw_reads_count <- rbind(raw_reads_count, length(ShortRead::readFastq(fnFs[i])))}
#Format table to give it sample.names
rownames(raw_reads_count)<- sample.names
colnames(raw_reads_count)<- "Number of reads"With raw_reads_count we can easily now the minimum and
maximum number of sequences as follows:
#Get min and max number of sequences
min(raw_reads_count)
#> [1] 66426
max(raw_reads_count)
#> [1] 125873
As we can see, even the minimum number of sequences is good enough for our analysis. However, if that’s not the case, you can visually check the table with View(raw_reads_count) and write it to your device with write.table(raw_reads_count, file="raw_reads_count.txt", sep="\t").
Another approach to quality checking is to plot the distribution of sequence lengths as a histogram. Let’s see how it works!
##Histogram of sequences length
reads <- ShortRead::readFastq(fnFs) #store sequences in an R object
#Get lengths with unique
uniques <- unique(reads@quality@quality@ranges@width) #get lengths by accessing the width attribute within reads
#Count number of sequences for each length
counts <- NULL #Creates a null object for storing counts
for (i in 1:length(uniques)) {
counts<- rbind(counts,length(which(reads@quality@quality@ranges@width==uniques[i])))
}
#format histogram table
histogram <- cbind(uniques,counts)
colnames(histogram) <- c("Seq.length", "counts")Again, we can visually check the histogram table with
View(histogram) and write it to your device with
write.table(histogram, file="histogram.txt", sep="\t). For
now on, we will use head to see the main results, as well
as hist for plotting.
#Check histogram matrix
head(histogram[order(histogram[,1],decreasing = TRUE),]) #Most sequences fall in expected sequence length
#> Seq.length counts
#> [1,] 301 2154642
#> [2,] 300 71753
#> [3,] 299 16251
#> [4,] 298 61651
#> [5,] 297 10119
#> [6,] 296 377
# PLOT HISTOGRAM
hist(reads@quality@quality@ranges@width, main="Forward length distribution", xlab="Sequence length", ylab="Raw reads")
Moreover, the plotQualityProfile function of dada2 is very useful for a quick visualization of sequence quality profiles. We can visualize the profiles for both forward (fnFs) and reverse (fnRs) reads. Reverse reads commonly have worse quality profiles because of the sequencing process. In general, we can see that the average quality is good, as it only falls below 20 at the end of the reverse sequences. Again, we strongly suggest checking the dada2 tutorial for more information. We will visualize quality plots for the first two forward and reverse samples (indicated by [1:2]).
## VIEW AND SAVE QUALITY PLOT FOR FW AND RV ##
plotQualityProfile(fnFs[1:2])
plotQualityProfile(fnRs[1:2])
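Since plotQualityProfile returns ggplot objects, you can also save these plots to disk; a minimal sketch (the file names are our choice):
##Save quality profiles to PDF (hypothetical file names)
ggsave("quality_profile_fw.pdf", plotQualityProfile(fnFs[1:2]))
ggsave("quality_profile_rv.pdf", plotQualityProfile(fnRs[1:2]))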
Figaro: selection of trimming parameters
In this tutorial, we will use the Figaro tool. It can be downloaded and installed following the instructions in the provided link. Don’t forget that in this tutorial we are running it on Ubuntu, so some changes would be necessary for Windows distributions. This tool makes the selection of trimming parameters for dada2 more objective. Sadly, Figaro has not yet been updated to deal with variations in sequence length. We will solve this following the author’s recommendation (which you can find here). It consists of performing a pre-trimming step at about 295 bp. This pre-trimming will be used only for Figaro, so we will store these sequences in a new folder called ‘figaro’. Pre-trimming will be done with the filterAndTrim function from dada2. Let’s see how it works!
##FIGARO
##Creates a new folder called 'figaro' inside our path to store reads after trimming for Figaro
figFs <- file.path(path, "figaro", basename(fnFs))
figRs <- file.path(path, "figaro", basename(fnRs))
##TRIMMING AT 295 pb
out.figaro <- filterAndTrim(fnFs, figFs, fnRs, figRs, #inputFreads,outputFreads,inputRreads,outputRreads
                         compress=TRUE, multithread=TRUE, truncLen=c(295,295))
Once we have the reads stored in the figaro folder and cut at 295 bp, we will run Figaro. To call Figaro from R on Ubuntu, we use the function system, which gives us access to the command line. We have to call python3, as Figaro runs on it. Then, all the appropriate arguments for Figaro should be given: the path to the Figaro program, -i path to input, -o path to output, -a length of your amplicon without primers, -f forward primer length, -r reverse primer length. As you can see, before using Figaro, you will need to know your expected amplicon length without primers (you can calculate it based on the E. coli 16S rRNA gene) as well as both primer lengths.
##RUN FIGARO
figaro <- system(("python3 /home/programs/figaro/figaro/figaro.py -i /home/nuria/Escritorio/data_micro4all/figaro -o /home/nuria/Escritorio/data_micro4all/figaro -a 426 -f 17 -r 21"),
intern=TRUE) #path to figaro program -i path to input - path to output -a length of your amplicon without primers, -f primer forward length, -r primer reverse length
head(figaro)
#> [1] "Forward read length: 295"
#> [2] "Reverse read length: 295"
#> [3] "{\"trimPosition\": [279, 205], \"maxExpectedError\": [3, 2], \"readRetentionPercent\": 81.69, \"score\": 76.68937485688116}"
#> [4] "{\"trimPosition\": [278, 206], \"maxExpectedError\": [3, 2], \"readRetentionPercent\": 81.68, \"score\": 76.6750629722922}"
#> [5] "{\"trimPosition\": [277, 207], \"maxExpectedError\": [3, 2], \"readRetentionPercent\": 81.67, \"score\": 76.66838409281735}"
#> [6] "{\"trimPosition\": [280, 204], \"maxExpectedError\": [3, 2], \"readRetentionPercent\": 81.66, \"score\": 76.6617052133425}"
We will select the first result, which gives us the trimming positions and maximum expected errors that produce the greatest score according to the Figaro model. These are returned as [forward, reverse].
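If you prefer to extract these values programmatically, note that each result line is a JSON string; a minimal sketch using the jsonlite package (an assumption on our side, it is not used elsewhere in this tutorial):
##Parse the best Figaro result (the third element of the output above)
best <- jsonlite::fromJSON(figaro[3])
trunc.len <- best$trimPosition #c(279, 205)
max.ee <- best$maxExpectedError #c(3, 2)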
ITS considerations
The ITS region has a highly variable length. Consequently, no length trimming (truncLen argument) is performed in the filterAndTrim function from dada2, so it is not necessary to use Figaro.
Filter and trim sequences
Now that we have the best trimming and expected error values according to Figaro, we can apply them to our original sequences (fnFs and fnRs; remember that figFs and figRs were used only for Figaro). Let’s apply the Figaro parameters to the filterAndTrim function from dada2! You will see that we produce the filtFs and filtRs objects first. These are the files for storing the filtered sequences after applying filterAndTrim over the raw sequences (fnFs and fnRs). Afterwards, primers will be removed from filtFs and filtRs with the cutadapt tool.
### FILTER AND TRIM SEQUENCES ACCORDING TO FIGARO ####
## Place filtered files in filtered/ subdirectory
filtFs <- file.path(path, "filtered", basename(fnFs))
filtRs <- file.path(path, "filtered", basename(fnRs))
out <- filterAndTrim(fnFs, filtFs, fnRs, filtRs, truncLen=c(279,205),
maxN=0, maxEE=c(3,2), truncQ=2, rm.phix=TRUE,
compress=TRUE, multithread=TRUE, minLen=50)
head(out)
#>                        reads.in reads.out
#> L1-1_S102_R1_001.fastq 80054 61741
#> L1-2_S112_R1_001.fastq 76776 60781
#> L1-3_S121_R1_001.fastq 84947 67563
#> L1-4_S130_R1_001.fastq 87190 68997
#> L1-5_S138_R1_001.fastq 114810 93159
#> L1-6_S145_R1_001.fastq 88519 70075
ITS considerations
As explained in the previous ITS considerations box, we do not use Figaro on ITS reads. However, we should still use a filterAndTrim step to remove ambiguous bases, so that we can better recognize the primers afterwards. Remember not to use length-trimming parameters; a sketch is shown below.
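A minimal sketch of that pre-filtering step, assuming a ‘filtN’ output folder as in the dada2 ITS tutorial:
##ITS: remove reads with ambiguous bases, without length trimming
filtN.Fs <- file.path(path, "filtN", basename(fnFs))
filtN.Rs <- file.path(path, "filtN", basename(fnRs))
out.its <- filterAndTrim(fnFs, filtN.Fs, fnRs, filtN.Rs, maxN = 0, multithread = TRUE)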
Cutadapt: efficient removal of primers
Now that we have run Figaro and performed quality trimming, we can proceed with primer removal, a mandatory step in the dada2 workflow. This will be done with the cutadapt tool. You can also remove your primers with the argument trimLeft = c(FWD_PRIMER_LEN, REV_PRIMER_LEN) from the filterAndTrim function (dada2). However, we consider cutadapt a more accurate tool for this purpose, as trimLeft cuts the sequences at the given primer lengths regardless of whether the primers are actually found in the sequence. As said above, we are working on Ubuntu, so if you are using cutadapt on Windows, you should make some changes to the code below. Another consideration is that cutadapt should not be applied before Figaro, because it can modify the quality profiles on which Figaro is based.
First of all, we have to store our primer sequences in two different R objects (FWD and REV). Then, we will produce all the possible orientations of these primers (further details can be found in the dada2 tutorial). This is possible thanks to the function allOrients, which we create below.
#### IDENTIFY PRIMERS ####
FWD <- "CCTACGGGNBGCASCAG" ## CHANGE ME to your forward primer sequence
REV <- "GACTACNVGGGTATCTAATCC" ## CHANGE ME to your reverse primer sequence
## VERIFY PRESENCE AND ORIENTATION OF PRIMERS ##
allOrients <- function(primer) {
# Create all orientations of the input sequence
require(Biostrings)
    dna <- DNAString(primer) # Biostrings works with DNAString objects rather than character vectors
orients <- c(Forward = dna, Complement = Biostrings::complement(dna), Reverse = reverse(dna),
RevComp = reverseComplement(dna))
return(sapply(orients, toString)) # Convert back to character vector
}
FWD.orients <- allOrients(FWD)
REV.orients <- allOrients(REV)
FWD.orients
#> Forward Complement Reverse RevComp
#> "CCTACGGGNBGCASCAG" "GGATGCCCNVCGTSGTC" "GACSACGBNGGGCATCC" "CTGSTGCVNCCCGTAGG"
REV.orients
#> Forward Complement Reverse RevComp
#> "GACTACNVGGGTATCTAATCC" "CTGATGNBCCCATAGATTAGG" "CCTAATCTATGGGVNCATCAG" "GGATTAGATACCCBNGTAGTC"
ITS considerations
For ITS analysis, don’t forget to change the primers to the appropriate ones; in our case, we use fITS7 (FWD <- “GTGARTCATCGAATCTTTG”) and ITS4 (REV <- “TCCTCCGCTTATTGATATGC”) for the ITS2 region.
Afterwards, we want to count the appearance of the primers in our sequences with the function primerHits:
## COUNT THE APPEARANCE AND ORIENTATION OF PRIMERS ##
primerHits <- function(primer, fn) {
# Counts number of reads in which the primer is found
nhits <- vcountPattern(primer, sread(readFastq(fn)), fixed = FALSE)
return(sum(nhits > 0))
}
rbind(FWD.ForwardReads = sapply(FWD.orients, primerHits, fn = filtFs[[1]]),
FWD.ReverseReads = sapply(FWD.orients, primerHits, fn = filtRs[[1]]),
REV.ForwardReads = sapply(REV.orients, primerHits, fn = filtFs[[1]]),
      REV.ReverseReads = sapply(REV.orients, primerHits, fn = filtRs[[1]]))
#> Forward Complement Reverse RevComp
#> FWD.ForwardReads 59926 0 0 0
#> FWD.ReverseReads 0 0 0 0
#> REV.ForwardReads 2 0 0 0
#> REV.ReverseReads 60270 0 0 0
But… what is this table all about? On the left, we have the presence of forward and reverse primers (FWD and REV) in forward and reverse reads (FWD.ForwardReads, FWD.ReverseReads, REV.ForwardReads and REV.ReverseReads). On top of the table, we see the four possible orientations of the primers (Forward, Complement, Reverse and RevComp). Taking all this information into account, in our dataset we found 59926 sequences with the FWD primer in the forward reads in its forward orientation. As expected, the REV primer was found in the reverse reads in its forward orientation. Cutadapt is used in the dada2 ITS tutorial, so check it if you want more information.
ITS considerations
Replace filtFs and filtRs with fnFs and fnRs, respectively, since you did not perform the previous filtering step.
Now, let’s run cutadapt. The process is similar to what we did with Figaro. First, we save the path to the cutadapt program in cutadapt. Then, we use system2 to call cutadapt.
## RUN CUTADAPT
cutadapt <- "/usr/local/bin/cutadapt" #path to cutadapt
system2(cutadapt, args = c("--version")) # Run shell commands from RNow, it is crucial to create a variable call path.cut
where we will store the sequences after being processed by cutadapt. We
will give these sequences the same names as filtFsand
filtRs. For an easier use of the tool, we create some of
its arguments with paste0. The arguments we use are very
well explained in cutadapt
user guide. Specifically, we implement arguments for linked
adapters. In this sense, -a and -A are
preceding the adapters sequences, respectively for R1 (forward) and R2
(reverse) reads. We use argument ^ for anchoring primers, that is,
primers are only removed if the anchored primers is found. Moreover, we
combine it with the –discard-untrimed argument. In this way, we
make sure that a sequence is not retained if only the reverse primer is
found. Forward and reverse primers should be present in order to keep
the sequences. At the end, we use a for loop over system2
to run cutadapt in every file.
##Create path to cutadapt sequences
path.cut <- file.path(path, "cutadapt")
if(!dir.exists(path.cut)) dir.create(path.cut)
fnFs.cut <- file.path(path.cut, basename(filtFs))
fnRs.cut <- file.path(path.cut, basename(filtRs))
##Produce arguments for cutadapt. rc function creates the reverse complementary of the provided sequence (FWD).
FWD.RC <- dada2:::rc(FWD)
REV.RC <- dada2:::rc(REV)
# Trim FWD and the reverse-complement of REV off of R1 (forward reads)
R1.flags <- paste0("-a", " ", "^",FWD,"...", REV.RC)
# Trim REV and the reverse-complement of FWD off of R2 (reverse reads)
R2.flags <- paste0("-A"," ","^", REV, "...", FWD.RC)
# Run Cutadapt
for(i in seq_along(fnFs)) {
system2(cutadapt, args = c(R1.flags, R2.flags, "-n", 2,"-m", 1, # -n 2 required to remove FWD and REV from reads
#-m 1 is required to remove empty sequences for plotting quality plots
"--discard-untrimmed",
"-j",0,#automatically detect number of cores
"-o", fnFs.cut[i], "-p", fnRs.cut[i], # output files
filtFs[i], filtRs[i],# input files
"--report=minimal")) #Report minimal reports a summary
}
ITS considerations
Replace filtFs and filtRs with fnFs and fnRs, respectively, since you did not perform the previous filtering step.
Now, we have the filtered sequences without primers in fnFs.cut and fnRs.cut. From now on, these are the sequences that we will use for the analysis. Let’s check whether cutadapt has properly removed the primers with primerHits
#Check primer presence after cutadapt
rbind(FWD.ForwardReads = sapply(FWD.orients, primerHits, fn = fnFs.cut[[1]]),
FWD.ReverseReads = sapply(FWD.orients, primerHits, fn = fnRs.cut[[1]]),
REV.ForwardReads = sapply(REV.orients, primerHits, fn = fnFs.cut[[1]]),
      REV.ReverseReads = sapply(REV.orients, primerHits, fn = fnRs.cut[[1]]))
#> Forward Complement Reverse RevComp
#> FWD.ForwardReads 1 0 0 0
#> FWD.ReverseReads 0 0 0 0
#> REV.ForwardReads 0 0 0 0
#> REV.ReverseReads 0 0 0 0
As we can see, there is still a forward primer to be found in the forward reads. We should not worry about that: cutadapt and primerHits have different ways of looking for primers in a sequence. Moreover, if primers are still found in low-quality sequences, they will probably be removed in the next steps.
ITS considerations
If analysing ITS sequences, remember to apply the filtering step now. See the code below:
filtFs <- file.path(path.cut, "filtered", basename(fnFs.cut))
filtRs <- file.path(path.cut, "filtered", basename(fnRs.cut))
out <- filterAndTrim(fnFs.cut, filtFs, fnRs.cut, filtRs, maxN = 0, maxEE = c(2, 5), truncQ = 2,
minLen = 50, rm.phix = TRUE, compress = TRUE, multithread = TRUE)
head(out)
With head(out) you can check whether the maxEE you set was too strict. If you are losing too many sequences, consider increasing its values.
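For instance, a quick sketch of per-sample retention computed from the out matrix returned by filterAndTrim:
##Percentage of reads kept per sample after filtering
retention <- out[, "reads.out"] / out[, "reads.in"] * 100
summary(retention)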
In the next steps, we will apply the core of the DADA2 pipeline. As it is very well detailed in the dada2 tutorial, we will not comment on it extensively. Good luck! 🕓
Learn error rates
#### LEARN ERROR RATES ####
errF <- learnErrors(fnFs.cut, multithread=T, verbose=1)
#> 109967023 total bases in 419736 reads from 6 samples will be used for learning the error rates.
#> Initializing error rates to maximum possible estimate.
#> selfConsist step 1 ......
#> selfConsist step 2
#> selfConsist step 3
#> selfConsist step 4
#> selfConsist step 5
#> Convergence after 5 rounds.
errR <- learnErrors(fnRs.cut, multithread=T, verbose=1)
#> 102040114 total bases in 554558 reads from 8 samples will be used for learning the error rates.
#> Initializing error rates to maximum possible estimate.
#> selfConsist step 1 ........
#> selfConsist step 2
#> selfConsist step 3
#> selfConsist step 4
#> selfConsist step 5
#> selfConsist step 6
#> Convergence after 6 rounds.
ITS considerations
Replace fnFs.cut and fnRs.cut with filtFs and filtRs, respectively. Do the same in the next steps: dadaFs, dadaRs and mergers.
View error plots
As we can see, the estimated error rates follow the observed trend adequately.
#Plot errors
plotErrors(errF, nominalQ=TRUE)
#> Warning: Transformation introduced infinite values in continuous y-axis
#> Transformation introduced infinite values in continuous y-axis
plotErrors(errR, nominalQ=TRUE)
#> Warning: Transformation introduced infinite values in continuous y-axis
#> Transformation introduced infinite values in continuous y-axis
Sample inference
#Sample inference
dadaFs <- dada(fnFs.cut, err=errF, multithread=TRUE)
#> Sample 1 - 61506 reads in 15424 unique sequences.
#> Sample 2 - 60287 reads in 12454 unique sequences.
#> Sample 3 - 67108 reads in 13471 unique sequences.
#> Sample 4 - 68541 reads in 13392 unique sequences.
#> Sample 5 - 92672 reads in 19174 unique sequences.
#> Sample 6 - 69622 reads in 14664 unique sequences.
#> Sample 7 - 66191 reads in 14858 unique sequences.
#> Sample 8 - 68631 reads in 17064 unique sequences.
#> Sample 9 - 64360 reads in 12997 unique sequences.
#> Sample 10 - 73537 reads in 16302 unique sequences.
#> Sample 11 - 64206 reads in 12105 unique sequences.
#> Sample 12 - 74353 reads in 16945 unique sequences.
#> Sample 13 - 62585 reads in 14748 unique sequences.
#> Sample 14 - 72255 reads in 16401 unique sequences.
#> Sample 15 - 69370 reads in 15705 unique sequences.
#> Sample 16 - 56068 reads in 13202 unique sequences.
#> Sample 17 - 73197 reads in 14786 unique sequences.
#> Sample 18 - 58236 reads in 12100 unique sequences.
#> Sample 19 - 56801 reads in 13257 unique sequences.
#> Sample 20 - 56072 reads in 12635 unique sequences.
#> Sample 21 - 57847 reads in 12828 unique sequences.
#> Sample 22 - 65716 reads in 11388 unique sequences.
#> Sample 23 - 69528 reads in 13095 unique sequences.
#> Sample 24 - 52479 reads in 9714 unique sequences.
#> Sample 25 - 89584 reads in 7824 unique sequences.
#> Sample 26 - 93427 reads in 8209 unique sequences.
#> Sample 27 - 106500 reads in 9724 unique sequences.
dadaRs <- dada(fnRs.cut, err=errR, multithread=TRUE)
#> Sample 1 - 61506 reads in 15090 unique sequences.
#> Sample 2 - 60287 reads in 8972 unique sequences.
#> Sample 3 - 67108 reads in 10240 unique sequences.
#> Sample 4 - 68541 reads in 11098 unique sequences.
#> Sample 5 - 92672 reads in 15470 unique sequences.
#> Sample 6 - 69622 reads in 11810 unique sequences.
#> Sample 7 - 66191 reads in 10626 unique sequences.
#> Sample 8 - 68631 reads in 13787 unique sequences.
#> Sample 9 - 64360 reads in 9862 unique sequences.
#> Sample 10 - 73537 reads in 11339 unique sequences.
#> Sample 11 - 64206 reads in 8219 unique sequences.
#> Sample 12 - 74353 reads in 12900 unique sequences.
#> Sample 13 - 62585 reads in 11369 unique sequences.
#> Sample 14 - 72255 reads in 12843 unique sequences.
#> Sample 15 - 69370 reads in 12557 unique sequences.
#> Sample 16 - 56068 reads in 11103 unique sequences.
#> Sample 17 - 73197 reads in 12138 unique sequences.
#> Sample 18 - 58236 reads in 9899 unique sequences.
#> Sample 19 - 56801 reads in 9540 unique sequences.
#> Sample 20 - 56072 reads in 10430 unique sequences.
#> Sample 21 - 57847 reads in 9716 unique sequences.
#> Sample 22 - 65716 reads in 9249 unique sequences.
#> Sample 23 - 69528 reads in 11676 unique sequences.
#> Sample 24 - 52479 reads in 8641 unique sequences.
#> Sample 25 - 89584 reads in 6378 unique sequences.
#> Sample 26 - 93427 reads in 6198 unique sequences.
#> Sample 27 - 106500 reads in 7362 unique sequences.
dadaFs[[1]]
#> dada-class: object describing DADA2 denoising results
#> 758 sequence variants were inferred from 15424 input unique sequences.
#> Key parameters: OMEGA_A = 1e-40, OMEGA_C = 1e-40, BAND_SIZE = 16
As we can see, thanks to the DADA2 algorithm, 758 ASVs were inferred from 15424 unique sequences (remember that this is only for the first sample, as we wrote [[1]]). Next, we will give dadaFs and dadaRs the sample.names.
#Set sample names
names(dadaFs) <- sample.names
names(dadaRs) <- sample.names
Merge paired-end reads
#Merge sequences
mergers <- mergePairs(dadaFs, fnFs.cut, dadaRs, fnRs.cut, verbose=TRUE)
#> 58056 paired-reads (in 841 unique pairings) successfully merged out of 59510 (in 1465 pairings) input.
#> 57884 paired-reads (in 527 unique pairings) successfully merged out of 58969 (in 973 pairings) input.
#> 65180 paired-reads (in 492 unique pairings) successfully merged out of 66042 (in 881 pairings) input.
#> 66703 paired-reads (in 486 unique pairings) successfully merged out of 67552 (in 865 pairings) input.
#> 89414 paired-reads (in 932 unique pairings) successfully merged out of 90777 (in 1561 pairings) input.
#> 66897 paired-reads (in 665 unique pairings) successfully merged out of 68102 (in 1223 pairings) input.
#> 63906 paired-reads (in 609 unique pairings) successfully merged out of 64927 (in 1106 pairings) input.
#> 65300 paired-reads (in 860 unique pairings) successfully merged out of 66704 (in 1524 pairings) input.
#> 62656 paired-reads (in 606 unique pairings) successfully merged out of 63300 (in 889 pairings) input.
#> 71291 paired-reads (in 635 unique pairings) successfully merged out of 72263 (in 1082 pairings) input.
#> 62506 paired-reads (in 443 unique pairings) successfully merged out of 63204 (in 757 pairings) input.
#> 71825 paired-reads (in 733 unique pairings) successfully merged out of 72805 (in 1226 pairings) input.
#> 59867 paired-reads (in 626 unique pairings) successfully merged out of 61145 (in 1295 pairings) input.
#> 69397 paired-reads (in 807 unique pairings) successfully merged out of 70680 (in 1361 pairings) input.
#> 66800 paired-reads (in 773 unique pairings) successfully merged out of 67865 (in 1198 pairings) input.
#> 53498 paired-reads (in 691 unique pairings) successfully merged out of 54585 (in 1149 pairings) input.
#> 70984 paired-reads (in 668 unique pairings) successfully merged out of 71999 (in 1094 pairings) input.
#> 56672 paired-reads (in 449 unique pairings) successfully merged out of 57345 (in 744 pairings) input.
#> 54758 paired-reads (in 705 unique pairings) successfully merged out of 55509 (in 1006 pairings) input.
#> 54202 paired-reads (in 648 unique pairings) successfully merged out of 54911 (in 986 pairings) input.
#> 55901 paired-reads (in 585 unique pairings) successfully merged out of 56778 (in 1000 pairings) input.
#> 64397 paired-reads (in 475 unique pairings) successfully merged out of 65119 (in 738 pairings) input.
#> 67867 paired-reads (in 532 unique pairings) successfully merged out of 68681 (in 897 pairings) input.
#> 51487 paired-reads (in 353 unique pairings) successfully merged out of 51953 (in 540 pairings) input.
#> 89364 paired-reads (in 74 unique pairings) successfully merged out of 89388 (in 85 pairings) input.
#> 93212 paired-reads (in 82 unique pairings) successfully merged out of 93267 (in 96 pairings) input.
#> 106318 paired-reads (in 139 unique pairings) successfully merged out of 106393 (in 160 pairings) input.
# Inspect the merger data.frame from the first sample
head(mergers[[1]])
#> sequence
#> 1 TGGGGAATATTGCACAATGGGCGAAAGCCTGATGCAGCGACGCCGCGTGAGGGATGACGGCCTTCGGGTTGTAAACCTCTTTCAGCAGGGAAGAAGCGAAAGTGACGGTACCTGCAGAAGAAGCGCCGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGGCGCAAGCGTTGTCCGGAATTATTGGGCGTAAAGAGCTCGTAGGCGGTCTGTCGCGTCGGATGTGAAAGCCCGGGGCTTAACCCCGGGTCTGCATTCGATACGGGCAGACTAGAGTGTGGTAGGGGAGATCGGAATTCCTGGTGTAGCGGTGAAATGCGCAGATATCAGGAGGAACACCGGTGGCGAAGGCGGATCTCTGGGCCATTACTGACGCTGAGGAGCGAAAGCGTGGGGAGCGAACA
#> 2 TGGGGAATATTGCGCAATGGGCGAAAGCCTGACGCAGCGACGCCGCGTGAGGGATGACGGCCTTCGGGTTGTAAACCTCTTTCGCCAGGGACGAAGCTTTCGGGTGACGGTACCTGGAGAAGAAGCACCGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGGTGCGAGCGTTGTCCGGAATTATTGGGCGTAAAGAGCTCGTAGGCGGTTTGTTGCGTCGGCCGTGAAAACTTCACGCTTAACGTGGAGCTTGCGGTCGATACGGGCAGACTTGAGTTCGGTAGGGGAGACTGGAATTCCTGGTGTAGCGGTGAAATGCGCAGATATCAGGAGGAACACCGGTGGCGAAGGCGGGTCTCTGGGCCGATACTGACGCTGAGGAGCGAAAGCGTGGGGAGCGAACA
#> 3 TGGGGAATATTGCACAATGGGCGAAAGCCTGATGCAGCGACGCCGCGTGAGGGATGACGGCCTTCGGGTTGTAAACCTCTTTCAGCAGGGAAGAAGCGAAAGTGACGGTACCTGCAGAAGAAGCGCCGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGGCGCAAGCGTTGTCCGGAATTATTGGGCGTAAAGAGCTCGTAGGCGGTCTGTCACGTCGGATGTGAAAGCCCGGGGCTTAACCCCGGGTCTGCATTCGATACGGGCAGACTAGAGTGTGGTAGGGGAGATCGGAATTCCTGGTGTAGCGGTGAAATGCGCAGATATCAGGAGGAACACCGGTGGCGAAGGCGGATCTCTGGGCCATTACTGACGCTGAGGAGCGAAAGCGTGGGGAGCGAACA
#> 4 TGGGGAATATTGCGCAATGGGCGAAAGCCTGACGCAGCGACGCCGCGTGAGGGATGACGGCCTTCGGGTTGTAAACCTCTTTCGCCAGGGACGAAGGGTGACTGACGGTACCTGGAGAAGAAGCACCGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGGTGCGAGCGTTGTCCGGAATTATTGGGCGTAAAGAGCTCGTAGGCGGTTTGTTGCGTCGGCCGTGAAAACTTCACGCTTAACGTGGAGCTTGCGGTCGATACGGGCAGACTTGAGTTCGGTAGGGGAGACTGGAATTCCTGGTGTAGCGGTGAAATGCGCAGATATCAGGAGGAACACCGGTGGCGAAGGCGGGTCTCTGGGCCGATACTGACGCTGAGGAGCGAAAGCGTGGGGAGCGAACA
#> 5 TGGGGAATATTGCGCAATGGGCGGAAGCCTGACGCAGCGACGCCGCGTGGGGGATGACGGCCTTCGGGTTGTAAACCTCTTTCAGCAGGGACGAAGCTAACGTGACGGTACCTGCAGAAGAAGCGCCGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGGCGCAAGCGTTGTCCGGAATTATTGGGCGTAAAGAGCTCGTAGGCGGTTTGTTGCGTCCGTCGTGAAAGCCCACGGCTTAACTGTGGGTCTGCGGTGGATACGGGCAGACTAGAGGCAGGTAGGGGAGCATGGAATTCCCGGTGTAGCGGTGAAATGCGCAGATATCGGGAGGAACACCGGTGGCGAAGGCGGTGCTCTGGGCCTGTACTGACGCTGAGGAGCGAAAGCGTGGGGAGCGAACA
#> 6 TGGGGAATCTTGGACAATGGGCGAAAGCCCGATCCAGCAATATCGCGTGAGTGAAGAAGGGCAATGCCGCTTGTAAAGCTCTTTCATCGAGTGCGCGATCATGACAAGACTCGAGGAAGAAGCCCCGGCTAACTCCGTGCCAGCAGCCGCGGTAAGACGGGGGGGGCAAGTGTTCTTCGGAATGACTGGGCGTAAAGGGCACGTAGGCGGTGAATCGGGTTGAAAGTGAAAGTCGCCAAAAACTGGCGGAATGCTCTCGAAACCAATTCACTTGAGTGAGACAGAGGAGAGTGGAATTTCGTGTGTAGGGGTGAAATCCGGAGATCTACGAAGGAACGCCAAAAGCGAAGGCAGCTCTCTGGGTCCCTACCGACGCTGGGGTGCGAAAGCATGGGGAGCGAACA
#> abundance forward reverse nmatch nmismatch nindel prefer accept
#> 1 6997 1 1 39 0 0 2 TRUE
#> 2 5696 2 2 37 0 0 2 TRUE
#> 3 3738 3 1 39 0 0 2 TRUE
#> 4 2282 4 2 39 0 0 2 TRUE
#> 5 1950 5 5 39 0 0 2 TRUE
#> 6      1931       6       3     42         0      0      2   TRUE
Construct sequence table
#Construct sequence table
seqtab <- makeSequenceTable(mergers)
dim(seqtab)
#> [1]   27 7080
Our ASV table (seqtab) contains 7080 ASVs from 27
samples (24 samples + 3 MOCK).
Remove chimeras
#Remove chimeras
seqtab.nochim <- removeBimeraDenovo(seqtab, method="consensus", multithread=TRUE, verbose=TRUE)
#> Identified 2163 bimeras out of 7080 input sequences.
dim(seqtab.nochim)
#> [1]   27 4917
After chimera removal, we have reduced the number of ASVs from 7080 to 4917. It may seem a lot, but these ASVs do not account for a great number of sequences. Don’t worry, we will check that a few steps later!
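For instance, following the dada2 tutorial, we can compute the fraction of total sequences kept after chimera removal:
#Fraction of sequences retained after chimera removal
sum(seqtab.nochim)/sum(seqtab)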
Inspect ASV length and abundance
After all this process, we have an ASV table without chimeras (seqtab.nochim). Before further analysis, it is interesting to check the distribution of ASV lengths as well as the number of ASVs of each length. This gives us some valuable information as a checkpoint of the process. Moreover, we will remove those lengths that are not expected from our sequencing process and that represent a small number of ASVs and sequences, as they are probably a result of non-specific priming.
# Inspect distribution of sequence lengths
table(nchar(getSequences(seqtab.nochim))) #Number of ASVs of each length
#>
#> 244 247 253 255 259 262 264 269 273 276 299 304 311 316 321 327 338 339 356
#> 2 2 1 1 1 4 1 1 1 1 1 1 1 1 1 1 1 1 1
#> 359 361 364 369 370 373 376 380 382 384 385 386 390 396 397 400 401 402 403
#> 1 1 2 1 1 1 1 1 2 2 6 8 1 5 2 4 44 984 294
#> 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422
#> 202 156 20 387 82 44 24 20 3 11 7 5 50 16 18 32 27 216 552
#> 423 424 425 426 427 428 429 430 431 432
#> 40 29 57 265 1026 218 14 3 2 6
This table is really simple: the first line shows the ASV length and, below it, the number of ASVs. As expected, most ASVs have a length of 427 bp.
However, in this way we don’t know how many sequences these ASVs
account for in each length. Let’s calculate it and make a plot! We will
store the number of sequences in reads.per.seqlen
#Get number of sequences per length
reads.per.seqlen <- tapply(colSums(seqtab.nochim), factor(nchar(getSequences(seqtab.nochim))), sum) #number of sequences for each length
reads.per.seqlen
#> 244 247 253 255 259 262 264 269 273 276 299 304 311
#> 4 104 2 4 2 175 6 3 2 2 2 4 2
#> 316 321 327 338 339 356 359 361 364 369 370 373 376
#> 2 2 4 2 2 2 2 4 50 4 2 6 4
#> 380 382 384 385 386 390 396 397 400 401 402 403 404
#> 4 36451 8 40 1082 2 15 13 661 504 130086 99017 277172
#> 405 406 407 408 409 410 411 412 413 414 415 416 417
#> 3998 995 592438 116044 58285 1601 1328 4790 179 2455 93 2878 133
#> 418 419 420 421 422 423 424 425 426 427 428 429 430
#> 612 5778 1023 12092 29632 268 448 4098 12680 370689 13647 119 8
#> 431 432
#> 4 213
Now, we will produce a table ready for plotting
## Plot length distribution
table_reads_seqlen <- data.frame(length=as.numeric(names(reads.per.seqlen)), count=reads.per.seqlen) #Create a table
ggplot(data=table_reads_seqlen, aes(x=length, y=count)) + geom_col() #plot it!
As we can see from the table and the plot, we will choose the interval ranging from 402 to 428 bp.
#Filter ASV length
seqtab.nochim <- seqtab.nochim[,nchar(colnames(seqtab.nochim)) %in% seq(402,428)]
ITS considerations
The above analysis should be avoided because of the variable length of ITS2 region.
Check number of reads
At this point, it is important to check how many sequences we have lost in each step. Apart from filtering, no other step should have led to a substantial decrease in sequence number. First, as shown in the dada2 tutorial, we create the getN function and then apply it to produce a track table.
#Check number of sequences at each step
getN <- function(x) sum(getUniques(x))
track <- cbind(out, sapply(dadaFs, getN), sapply(dadaRs, getN), sapply(mergers, getN), rowSums(seqtab.nochim))
# If processing a single sample, remove the sapply calls: e.g. replace sapply(dadaFs, getN) with getN(dadaFs)
colnames(track) <- c("input", "filtered", "denoisedF", "denoisedR", "merged", "nonchim")
rownames(track) <- sample.names
#Let's apply a mutate to get percentages, making the table easier to analyze.
track <- track %>% as.data.frame() %>% mutate(Perc.filtered=filtered*100/input,
Perc.denoisedF= denoisedF*100/filtered,
Perc.denoisedR= denoisedR*100/filtered,
Perc.merged= merged*100/filtered,
Perc.nonchim= nonchim*100/merged,
Perc.retained=nonchim*100/input
)
head(track)
#> input filtered denoisedF denoisedR merged nonchim Perc.filtered Perc.denoisedF
#> L1-1 80054 61741 60003 60662 58056 55455 77.12419 97.18501
#> L1-2 76776 60781 59308 59694 57884 56153 79.16667 97.57655
#> L1-3 84947 67563 66299 66670 65180 63021 79.53548 98.12915
#> L1-4 87190 68997 67828 68092 66703 63831 79.13408 98.30572
#> L1-5 114810 93159 91321 91801 89414 84491 81.14189 98.02703
#> L1-6 88519 70075 68479 68988 66897 64905 79.16380 97.72244
#> Perc.denoisedR Perc.merged Perc.nonchim Perc.retained
#> L1-1 98.25238 94.03152 95.51984 69.27199
#> L1-2 98.21161 95.23371 97.00954 73.13874
#> L1-3 98.67827 96.47292 96.68763 74.18861
#> L1-4 98.68835 96.67522 95.69435 73.20908
#> L1-5 98.54228 95.97999 94.49415 73.59202
#> L1-6 98.44880 95.46486 97.02229 73.32324
Everything is as expected: the only step where we lose a greater number of sequences is the filtering step. If you want to visualize the full table and save it, remember to use View(track) and write.table(track, file="dada2_track.txt", sep="\t"). Let’s keep on with classification!
Taxonomy classification with RDP
As mentioned at the beginning of the tutorial, we will use a modified RDP database. We usually work with root samples, so the detection of mitochondrial sequences from host cells was crucial for us. Because of that, we have added mitochondrial sequences to the database. Moreover, the RDP database has the drawback that some taxa lack certain classification levels. This is the case for some Acidobacteria: we found phylum Acidobacteria, class Acidobacteria_gp1 and genus gp1. However, no family or order was found for this record, giving an unclassified result at these levels. Classification is easy to apply with assignTaxonomy; you only have to change the path to your RDP database file.
#Classification with RDP
taxa_rdp <- assignTaxonomy(seqtab.nochim, "/home/nuria/Escritorio/data_micro4all/rdp_train_set_18_H.fa", multithread=TRUE)
ITS considerations
For ITS classification, you can download the UNITE database from the DADA2 taxonomic reference data website; a sketch is shown below.
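A minimal sketch of such a classification (the file name is hypothetical; point it to the UNITE release you downloaded):
#Classification with UNITE (ITS)
taxa_unite <- assignTaxonomy(seqtab.nochim, "/path/to/your/UNITE_release.fasta", multithread = TRUE)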
Get ASV table
Now that we have a classification table (taxa_rdp) and an ASV table (seqtab.nochim), we will produce a table combining classification and ASV abundance. First, let’s assign seqtab.nochim to a new variable called ASV. Then, we will substitute ‘NA’ in taxa_rdp with unclassified.
#Combine taxonomy and ASV abundance
ASV <- seqtab.nochim
ASVt <- t(ASV)
## SUBSTITUTE 'NA' WITH 'UNCLASSIFIED'and remove species column
taxa_rdp_na <- apply(taxa_rdp,2, tidyr::replace_na, "unclassified")[,-7]
The next lines are used to create ASV names. We do it in such a way that if there are 1000 ASVs, they are named from ASV0001 to ASV1000. The code below automatically takes into account the number of zeros to add according to the total number of ASVs.
##Create names for each ASV: if there are 1000 ASVs, call them from ASV0001 to ASV1000
number.digit<- nchar(as.integer(nrow(ASVt)))
names <- paste0("ASV%0", number.digit, "d") #As many 0 as digits
ASV_names<- sprintf(names, 1:nrow(ASVt))
At last, we have everything we need to create our ASV table. We are going to bind its different elements and format it a little. Then, we save it as .txt. This allows us to manipulate the data manually if needed (you can open it with tools such as Excel), as well as to save the raw data before further steps are performed.
## CREATE AND SAVE ASV TABLE
ASV_table_classified<- cbind(as.data.frame(taxa_rdp_na,stringsAsFactors = FALSE),as.data.frame(ASV_names, stringsAsFactors = FALSE),as.data.frame(ASVt,stringsAsFactors = FALSE))
##Make rownames a new column, in order to keep sequences during the filtering process
ASV_seqs <- rownames(ASV_table_classified)
rownames(ASV_table_classified) <- NULL
ASV_table_classified <- cbind(ASV_seqs, ASV_table_classified)
write.table(ASV_table_classified, file="ASV_table_classified.txt", sep="\t")
If you already have an ASV table, you can load it here with ASV_table_classified <- read.table(file="yourfilepath", sep="yourseparationmethod", header=TRUE) and follow the tutorial from here on.
MOCK community
In this step, we want to filter out those ASVs that are misassigned to other samples. This can be done by analyzing MOCK community samples introduced in the same sequencing run. In this sense,
we always introduce in our sequencing three samples of a commercial MOCK
community (ZymoBIOMICS Microbial Community Standard II (Log
Distribution), ZYMORESEARCH, CA, USA). The logic of this analysis is as
follows: because we already know its composition, we can check if there
is any ASV identified by the sequencing process that is not a real
member of the MOCK community. Once identified, we calculate the
percentage of sequences it represents in MOCK samples. We make the
assumption that ASVs that represent less than this percentage in the
sequencing run should be discarded from our data. In order to automate
this analysis, micro4all implements a function called
MockCommunity. If you want to know more about this
function, you can always read the help information with
?MockCommunity. Basically, you have to give it an ASV table
where the samples are columns. MOCK samples should include the word
‘MOCK’ in capital letters. Moreover, the second argument corresponds to
a MOCK composition file which includes the members of the community at
genus level in decreasing order of abundance. You can see the structure
of this file as it is loaded with micro4all package
(use View(micro4all::mock_composition)). If you want to
create a different MOCK composition object, you can easily do it with
mock_comp <- setNames(as.data.frame(c("genera1", "genera2","genera3")), c("MockGenera")).
Lastly, you should change the argument ASV_column to the
name of the column where your ASV names are stored.
The MockCommunity function is user-interactive: it will ask you whether or not you want to use the first misassigned ASV detected to calculate the percentage. If the answer is ‘no’, it will look for the next one. Once the answer is ‘yes’, it will remove the MOCK community samples, apply the filtering and return a cleaned ASV table sorted in decreasing order of ASV abundance.
Let’s see how it works:
#MOCK community cleaning.
ASV_filtered_MOCK<-MockCommunity(ASV_table_classified, mock_composition, ASV_column = "ASV_names") #mock_composition is a data frame included as data in this package. See documentation for details.
#> ASV0811 does not belong to the MOCK community. It representes a 0.031757 perc. of the sequences
#> and it classifies as Limosilactobacillus
#> Do you want to use this ASV to calculate the percentage? [answer yes or no]ASV0002 does not belong to the MOCK community. It representes a 0.003881 perc. of the sequences
#> and it classifies as Streptomyces
#> Do you want to use this ASV to calculate the percentage? [answer yes or no]You made a decision! ASV0002 which classifies as Streptomyces
#> and represents a 0.003881 perc. of the sequences, was used to calculate the percentage
As explained above, MockCommunity asks the user a question before making the decision. Why? Sometimes, as in the example above, an ASV can be classified under a different name while still being the same genus. In this example, the first ASV not found in the MOCK community composition table is Limosilactobacillus. However, we can see that it represents a high percentage of the sequences. This is probably a Lactobacillus from the MOCK samples (Limosilactobacillus is a more recent genus name split from Lactobacillus). Because of that, we decide to choose the second option.
ITS considerations
If analysing ITS data, MOCK samples with a good number of fungal representatives should be included. If your bacteria and fungi are sequenced in the same run, the percentage calculated for bacteria can be applied to the fungal dataset.
When a MOCK community has not been analyzed, it is recommended to filter out those ASVs that represent less than 0.005% of the sequences, as studied by Bokulich et al. (2013). We can do it easily in R as follows:
###CLEAN according to Bokulich####
# Get number of sequences of each ASV
ASV_sums <- rowSums(ASV_table_classified[,9:ncol(ASV_table_classified)])
# Get total number of sequences
sum.total<-sum(ASV_sums)
# Apply percentage to sequence number
nseq_cutoff<-(0.005/100)*sum.total
# Filter table.
ASV_filtered_bokulich<- ASV_table_classified[which(ASV_sums>nseq_cutoff),]
# Sort table in ascending order of ASV names
ASV_table_bokulich<-ASV_filtered_bokulich[order(ASV_filtered_bokulich[["ASV_names"]]),]
Once the percentage is applied, we can remove sequences coming from chloroplasts and mitochondria at various levels
#### CLEAN CHLOROPLAST SEQUENCES ####
ASV_final<-ASV_filtered_MOCK[(which(ASV_filtered_MOCK$Genus!="Streptophyta"
& ASV_filtered_MOCK$Genus!="Chlorophyta"
&ASV_filtered_MOCK$Genus!="Bacillariophyta"
&ASV_filtered_MOCK$Family!="Streptophyta"
& ASV_filtered_MOCK$Family!="Chlorophyta"
&ASV_filtered_MOCK$Family!="Bacillariophyta"
& ASV_filtered_MOCK$Family!="Mitochondria"
& ASV_filtered_MOCK$Class!="Chloroplast"
& ASV_filtered_MOCK$Order!="Chloroplast"
& ASV_filtered_MOCK$Kingdom!="Eukaryota"
                 & ASV_filtered_MOCK$Kingdom!="unclassified")),]
ITS considerations
For ITS analysis, we should also remove those ASVs that were not classified as k__Fungi at kingdom level. We could do it as follows (note that we keep only the rows where Kingdom equals k__Fungi):
ASV_final<-ASV_filtered_MOCK[which(ASV_filtered_MOCK$Kingdom=="k__Fungi"),]
After this, the presence of Cyanobacteria/Chloroplast at phylum level should be checked. If we only find Cyanobacteria/Chloroplast at phylum level that are not further classified, we remove them, as they may also come from host DNA. To check this, we can use these lines (if working in RStudio, add a View(...) to visualize it more easily):
#Check Cyanobacteria
ASV_final[which(ASV_final$Phylum=="Cyanobacteria/Chloroplast"),]
#>  [1] ASV_seqs  Kingdom   Phylum    Class     Order    Family    Genus    ASV_names L1-1
#> [10] L1-2 L1-3 L1-4 L1-5 L1-6 L1-7 L1-8 L2-1 L2-2
#> [19] L2-3 L2-4 L2-5 L2-6 L2-7 L2-8 L3-1 L3-2 L3-3
#> [28] L3-4 L3-5 L3-6 L3-7 L3-8
#> <0 rows> (or 0-length row.names)
In our case, the table above shows no remaining rows; in general, if you find ASVs classified as Cyanobacteria/Chloroplast at phylum level but not classified at deeper levels, you can remove them as follows:
#Remove cyanobacteria at phylum level
ASV_final<-ASV_final[(which(ASV_final$Phylum!="Cyanobacteria/Chloroplast"
                 )),]
We will write this table to our local machine
##Save table
write.table(ASV_final, file="ASV_final.txt", sep="\t")Furthermore, if working with easily host contaminated samples (as it
could be endosphere samples or seed samples), it could happen that
unclassified sequences at phylum level are chimeric products of
microorganisms and host DNA. We will not take this into account in this
tutorial, but above we detail the lines to check it. It consists on
getting the ASV sequences from unclassified phyla. We saved it to the
local machine in case you want to perform an BLASTn manually. However,
it can also be performed from R thanks to system, running a
local blast. Let’s install some packages beforehand:
#Install seqinr for fasta writing
install.packages("seqinr", repos="http://R-Forge.R-project.org")
#Install annotate
BiocManager::install("annotate")First, we will store the unclassified phyla in a variable, then the
sequences and write them as a FASTA file to your local machine.
Furthermore, we will make an BLASTn with blastSequence (in
this function, change the [[1]] to analyze another
sequence).
##Analyze unclassified phyla
#Load libraries
library(seqinr); packageVersion("seqinr"); library(annotate); packageVersion("annotate")
#> [1] '4.2.23'
#> [1] '1.74.0'
#Get sequences of unclassified phyla
unclassified_phyla <-ASV_final[which(ASV_final$Phylum=="unclassified"),]
seq_unclassified_phyla<- as.list(unclassified_phyla$ASV_seqs)
#write fasta
write.fasta(seq_unclassified_phyla, names=unclassified_phyla$ASV_names, file.out="unclassified_phyla.fas", open = "w", nbchar =1000 , as.string = FALSE)
With this file you can perform a BLASTn online, or directly from R with blastSequences, as sketched below.
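A minimal sketch with annotate::blastSequences (online queries can be slow and may time out; change the [[1]] index to analyze another sequence):
##Online BLASTn of the first unclassified-phylum sequence
hits <- blastSequences(seq_unclassified_phyla[[1]], as = "data.frame")
head(hits)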
Nonetheless, you can also perform a local BLAST as follows. Remember that we are working on Ubuntu, so you should first download BLASTn on your machine, as well as the corresponding databases (in this case, the nt database). The arguments are:
- query. The path to the FASTA file to analyze.
- db. Path to database on your local machine.
- out. Name of the output file.
- outfmt. Output format. We use format 6 (tabular without headers). This format does not give you the species name; that’s why we add “stitle”.
- show_gis. It shows the NCBI GIs in deflines.
- max_target_seqs. Number of hits to return for each sequence.
- num_threads. Number of threads to use.
#Local BLASTn
system(("/home/programs/ncbi-blast-2.11.0+/bin/blastn -query unclassified_phyla.fas -db /mnt/datos/databases/nt -out unclassified_phyla_hits.txt -outfmt '6 std stitle' -show_gis -max_target_seqs 20 -parse_deflines -num_threads 10"),intern = TRUE)In the code above, remember to change "/home/..." to the
path where BLASTn program is stored in your machine.
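The tabular output can then be inspected from R; a minimal sketch, with column names following the outfmt 6 'std stitle' specification:
##Read the local BLASTn results
blast_hits <- read.table("unclassified_phyla_hits.txt", sep = "\t", quote = "", comment.char = "")
colnames(blast_hits) <- c("qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
                          "qstart", "qend", "sstart", "send", "evalue", "bitscore", "stitle")
head(blast_hits)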
Diversity analysis
We have finished the first part of the tutorial. Now, we should have a final ASV table, already cleaned of misassigned ASVs and host ASVs. We are ready to start with the diversity analysis! If you already have an ASV table, you can load it here with ASV_table_classified <- read.table(file="yourfilepath", sep="yourseparationmethod", header=TRUE) and follow the tutorial from here on.
For this purpose, we need a phyloseq object. The phyloseq package provides multiple tools for analyzing microbial community data. The first challenge that phyloseq solves is to wrap different pieces of information into a single object that is easy to use with multiple packages. First of all, let’s install phyloseq and load the library:
##Install phyloseq
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("phyloseq")##Load library
library(phyloseq); packageVersion("phyloseq")
#> [1] '1.40.0'
Specifically for this tutorial, we will need an ASV table with the number of reads per sample, their taxonomic classification, a metadata table, a phylogenetic tree and the DNA sequences. As you can see, we need to produce a phylogenetic tree. This will allow us to use UniFrac and Weighted UniFrac distances later on in the β diversity analysis. For this purpose, we will use tools such as MAFFT and FastTreeMP. Again, we should remind you that we are using these tools under an Ubuntu distribution and they should be appropriately installed beforehand. Moreover, if you have any problem with FastTreeMP, you can always switch to FastTree.
We will start by getting a FASTA file with all ASV sequences. As mentioned in the last section, we will use seqinr (if you haven’t loaded it yet, do it now with library(seqinr)). With this file, we will perform a multiple sequence alignment with MAFFT (first storing the path to the program and then using system2 to call it in the shell). We will use the default alignment options with the argument --auto. The output file will be called alignment. This file will be used by FastTreeMP to produce a phylogenetic tree. The steps to execute this program are the same as with MAFFT.
####GET PHYLOGENETIC TREE####
## GET THE FASTA FILE ##
ASVseqs_fasta<- as.list(ASV_final$ASV_seqs)
write.fasta(ASVseqs_fasta, names=ASV_final$ASV_names, file.out="ASV_tree.fas", open = "w", nbchar =1000 , as.string = FALSE)
##Get the alignment ##
mafft <- "/usr/bin/mafft" #path to program
system2(mafft, args="--version")
system2(mafft, args=c("--auto", "ASV_tree.fas>", "alignment"))
## Get the tree ##
FastTreeMP <- "/home/programs/FastTreeMP/FastTreeMP"
system2(FastTreeMP, args="--version" )
system2(FastTreeMP, args = c("-gamma", "-nt", "-gtr", "-spr",4 ,"-mlacc", 2, "-slownni", "<alignment>", "phy_tree"))# Run FastTreeMPITS considerations
The ITS region is not appropriate for calculating phylogenetic trees. As a result, neither UniFrac nor Weighted UniFrac distances can be applied.
The only thing left to produce our phyloseq object is the metadata table, which includes further information about your samples. This information is used later on to group samples together (for example, replicates from the same location). In this tutorial, we will use the metadata table that is loaded with micro4all. It is simply a data frame containing sample names and, in another column, their location (location1, location2 and location3). We will store it in another variable and make some modifications to replace "-" with "_", both in the sample column and in the row names.
##Metadata from micro4all
metadata.micro <-micro4all::metadata
metadata.micro$samples <- str_replace_all(metadata.micro$samples,"-", "_")
rownames(metadata.micro) <- str_replace_all(rownames(metadata.micro),"-", "_")
If you are using your own metadata table, you can create it outside R with any text editor and save it as a tab-separated file. Then, you can load it with mt <- read.table("yourmtfile.txt", sep="\t", header=TRUE) and row.names(mt) <- mt$samples. Remember that it must include a column with sample names (exactly the same sample names, and in the same order, as in the ASV table). In the following lines, remember to replace metadata.micro with your own mt object.
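Alternatively, a minimal sketch of building the metadata directly in R (the location column and its values are hypothetical and must match your own design; sample names must match the ASV table columns exactly):
##Build metadata in R (hypothetical example: 3 locations x 8 samples)
mt <- data.frame(samples = colnames(ASV_final)[9:ncol(ASV_final)],
                 location = rep(c("location1", "location2", "location3"), each = 8))
row.names(mt) <- mt$samples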
Create phyloseq
We will put all the information we have (classification, abundance, metadata, ASV sequences and phylogenetic tree) into one object thanks to phyloseq. First, each variable will be transformed to phyloseq format and then we will merge them.
####CREATE PHYLOSEQ OBJECT FROM TABLES
tax <- ASV_final[,2:8] #Tax
OTU <- ASV_final[,9:ncol(ASV_final)] #ASV
colnames(OTU) <- str_replace_all(colnames(OTU),"-", "_")
dna<-Biostrings::DNAStringSet(ASV_final$ASV_seqs) #Sequences (BioString is imported with phyloseq)
names(dna)<- ASV_final$ASV_names
##ADD ASV NAMES
row.names(tax)<-ASV_final$ASV_names
row.names(OTU)<-ASV_final$ASV_names
#Check rownames are equal
identical(rownames(OTU), rownames(tax))
#> [1] TRUE
##Introduce phylogenetic tree
phy_tree <- phyloseq::read_tree("phy_tree")
unrooted_tree<- phy_tree
ape::is.rooted(unrooted_tree)
#> [1] FALSE
##Produce root (we need a root for distance calculation)
tree_root<-ape::root(unrooted_tree, 1, resolve.root = T)
tree_root
#>
#> Phylogenetic tree with 1035 tips and 1034 internal nodes.
#>
#> Tip labels:
#> ASV0449, ASV0153, ASV0306, ASV0086, ASV0347, ASV0343, ...
#> Node labels:
#> Root, , 0.867, 0.000, 0.789, 0.900, ...
#>
#> Rooted; includes branch lengths.
ape::is.rooted(tree_root)
#> [1] TRUE
##CONVERT TO PHYLOSEQ FORMAT
phy_OTUtable<-otu_table(OTU, taxa_are_rows = T)
phy_taxonomy<-tax_table(as.matrix(tax))
phy_metadata<-sample_data(metadata.micro) #Change here metadata.micro with mt (your own metadata object)
#Put everything into a phyloseq object
loc_phyloseq<-phyloseq(phy_OTUtable,phy_taxonomy,phy_metadata,dna,tree_root) #Remove tree_root when working with ITS
Now we will load the libraries needed to carry out the diversity analysis. Warning! If some of these packages are missing, install them with the install.packages() function.
##Load packages
library(phyloseq); packageVersion("phyloseq")
#> [1] '1.40.0'
library(Biostrings); packageVersion("Biostrings")
#> [1] '2.64.1'
library(GUniFrac); packageVersion("GUniFrac")
#> [1] '1.7'
library(phangorn); packageVersion("phangorn")
#> [1] '2.11.1'
library(vegan); packageVersion("vegan")
#> [1] '2.6.4'
library(pheatmap); packageVersion("pheatmap")
#> [1] '1.0.12'
library(colorspace); packageVersion("colorspace")
#> [1] '2.1.0'
\(\alpha\) diversity
In this section, we will study the diversity of our community in terms of richness, diversity and evenness. For this purpose, several indices will be used, such as the Shannon index, the Inverse Simpson index and Pielou's index.
Data preparation
First, we will produce a table with \(\alpha\) diversity indices. To achieve this, we will rarefy the data to the number of sequences of the smallest sample. This is done to avoid a bias in the estimation of diversity caused only by variability in sequencing depth among samples.
Rarefaction curves should be produced first. They make it easy to visualize whether the rarefaction process entailed a great loss of information or not. This plot represents the number of ASVs (richness) against the number of sequences. Moreover, a vertical line will show the number of sequences at which normalization occurred. If this line falls in the asymptote of every rarefaction curve, we are safe: although we would get more sequences for every sample, we would not get a significantly greater number of ASVs.
#### RARECURVE ####
ASV <- as.data.frame(t(otu_table(loc_phyloseq)))
sample_names <- rownames(ASV)
#Generate rarefaction curves
rarecurve <- rarecurve(ASV, step = 100, label = F)
Function rarecurve produces a simple plot. However, we
can produce a nicer plot with ggplot2. This package was
already loaded in the DADA2 section, if not, load it with
library(ggplot2).
Our aim is to produce lines colored by location (or any other
condition in your metadata table). But to achieve this we need to
prepare our data. First of all, we will separate each rarefaction curve
from rarecurve into a data frame and then add a column with
sample names. For this step, we will use a function from
purrr package, but it doesn’t need installation as it
is included in the tidyverse collection, which was
installed in the first section of the tutorial.
#For each rarefaction curve, transform rarecurve output to a dataframe.
rarecurve_table <- lapply(rarecurve, function(x){
b <- as.data.frame(x)
b <- data.frame(ASV = b[,1], raw.read = rownames(b))
b$raw.read <- as.numeric(gsub("N", "", b$raw.read))
return(b)
})
#Produce a column with sample names and put everything in one data frame
names(rarecurve_table) <- sample_names
rarecurve_table <- purrr::map_dfr(rarecurve_table, function(x){ #for each rarecurve element, we produce a dataframe and bind them all together.
z <- data.frame(x)
return(z)
}, .id = "Sample")
Now, we will add a new column to rarecurve_table. In
this column, we will have the grouping variable of interest (for
example, location1) repeated as many times as necessary. Remember that,
in rarecurve_table, sample names are repeated, as it holds
the information of every rarefaction curve. If you want to adapt this to
your data, you should change metadata.micro to your
metadata variable name (for example, mt) and $location
to your variable of interest.
#To color lines according to group, let's create a new column
#Coloring
color <- NULL
for (i in (1:length(rarecurve_table$Sample))) {
color <- rbind(color,metadata.micro$location[which(metadata.micro$samples==rarecurve_table$Sample[i])])##Change metadata.micro to user mt and "Location" to variable of interest
}
#Bind this column
rarecurve_table_color <- cbind(rarecurve_table, color)
colnames(rarecurve_table_color) <- c(colnames(rarecurve_table), "Location") #Change Location to variable of interest
With all data prepared, it is time to use ggplot for graphical
representation. When using your own data, remember to change
colour=Location to your variable of interest. You will find
where to change the code next to each line, followed by a
#.
## RARECURVE WITH GGPLOT ##
rareggplot<-ggplot(rarecurve_table_color, aes(x=raw.read, y=ASV, colour=Location, group=Sample)) + ### Change location for your variable of interest and IMPORTANT TO GROUP BY SAMPLE
theme_bw()+
geom_point(aes(colour=Location), size=0.01)+ #Change location to your variable of interest
geom_line(aes(colour=Location),size=0.5)+ ##Change location to your variable of interest, Change size for line thickness
geom_vline(aes(xintercept = min(sample_sums(loc_phyloseq))), lty=1, colour="black")+
scale_fill_manual(values = c("location1"= "#33CC00", "location2"= "#ffd624", "location3"="#80CC88"))+ #Change location1, location2 and location3 by the values of your variable of interest (for example, control and inoculated)
scale_color_manual(values = c("location1"= "#33CC00","location2"= "#ffd624", "location3"="blue"),#Change location1, location2 and location3 by the values of your variable of interest (for example, control and inoculated)
name="Location", #change Location to variable of interest
breaks=c("location1", "location2","location3"),#change to values of your variable of interest
labels=c("Location 1","Location 2","Location 3"))+#change to values of your variable of interest
labs(title="Rarefaction curves", x="Number of sequences", y="Number of ASV")+
guides(alpha="none")+
theme(legend.key=element_blank(),
legend.title.align = 0.85,
legend.title = element_text(face="bold",size=14),
axis.text = element_text(size=14),
axis.title = element_text(size = 16),
plot.title = element_text(hjust=0.5, face="bold", size=16),
legend.text = element_text(size = 12))
rareggplot
As we can see, the rarefaction line falls in the asymptote of all samples.
For saving plots on your machine, an easy and versatile option is to
use ggsave from ggplot2. For publication, you
can save it as .tiff, but there are also other interesting formats. For
example, .svg allows you to open the file in
PowerPoint and easily modify it. If you have more
knowledge of graphical design programs such as Adobe
Illustrator, you can save it as .eps.
#tiff
ggsave(filename = "rarefaction_curve.tiff", plot = rareggplot,device = tiff(),width = 18, height = 16, units = "cm", dpi = 800)
#svg
library(svglite)
ggsave(filename = "rarefaction_curve.svg", plot = rareggplot,device = svg())
#eps
postscript("curve_wito_RE.1.10.eps")
rareggplot
dev.off() # Close the device
Let's continue! Now, we will compute α diversity indices. As everything worked well with rarefaction, we can now create a rarefied object from our phyloseq (loc_phyloseq).
#Rarefy phyloseq
set.seed(10403)
rarefaction <- rarefy_even_depth(loc_phyloseq, sample.size = min(sample_sums(loc_phyloseq)), rngseed = FALSE)
With this rarefaction object, we will compute diversity indices making use of the estimate_richness function. Pielou's index is calculated manually from Shannon and Observed richness.
#Estimate alpha indices and save it
alpha_diversity_rarefied <- estimate_richness(rarefaction,measures=c("InvSimpson", "Shannon", "Observed"))
## CALCULATE EVENNESS
Evenness_index <- as.data.frame(alpha_diversity_rarefied$Shannon/log(alpha_diversity_rarefied$Observed))
Evenness <- cbind(Evenness_index, rownames(alpha_diversity_rarefied))
colnames(Evenness) <- c("Evenness", "Samples")
In this step, it is worth mentioning that we have not used richness estimators (such as Chao1). These estimators rely on singletons. However, DADA2 does not produce singletons as ASVs; they can only appear because of the merging process (thus not being real singletons for a richness estimator to rely on). For more details, check dada2 issue #92.
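If you want to verify this on your own data, you can count dataset-wide singletons directly on the phyloseq object:
##Count ASVs observed exactly once across all samples
##(typically 0 for DADA2 output; any singletons can only arise from merging)
sum(taxa_sums(loc_phyloseq) == 1)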
In the code below, remember to change metadata.micro to your metadata variable if analysing your own data.
#Generate final table
alpha_table <- cbind(alpha_diversity_rarefied, Evenness$Evenness, metadata.micro) #change metadata to user provided metadata
colnames(alpha_table)<- c("Observed", "Shannon", "InvSimpson", "Evenness",colnames(metadata.micro))#change metadata to user provided metadata
#Write final table on local machine
write.table(alpha_table, file="alpha_table.txt", sep="\t")
We got the final table with the diversity indices we want to study, but it is also interesting to produce a table ready for publication, i.e., with mean and standard deviation grouped by the variable of interest (in this case, location). Remember to change location to your variable of interest.
#### ALPHA PUB TABLE
#Select numeric columns and compute mean and SD
alpha_mean <- aggregate(alpha_table[,1:4], list(grouping=alpha_table$location), mean)%>% mutate_if(is.numeric, round, digits=2) #Change location to variable of interest
alpha_sd <- aggregate(alpha_table[,1:4], list(grouping=alpha_table$location), sd)%>% mutate_if(is.numeric, round, digits=2)#Change location to variable of interest
#Paste mean ± SD
mean_sd <- NULL
for (i in 2:5){
mean_sd <- cbind(mean_sd,paste0(alpha_mean[,i], " +/- ", alpha_sd[,i]))}
#generate table and give it columns names
alpha_pub_table <- cbind(alpha_mean$grouping, mean_sd)
colnames(alpha_pub_table) <- c("Location","Observed richness", "Shannon", "InvSimpson", "Evenness")#Change location to variable of interest
#Write table on your local machine
write.table(alpha_pub_table, file="alpha_pub_table.txt", sep="\t")
Statistical analysis
Once diversity indices are calculated and stored in a data frame
(alpha_table), we can make statistical comparisons between
experimental conditions. In this tutorial, we are interested in
differences among locations. Here we will introduce several functions
from micro4all package. These functions are
levene.test.alpha, Shapiro,
BalancedAnova, UnbalancedAnova,
kruskal.wallis, Tukey.test, dunnT
and wilcoxon.test. They all rely on other statistical
functions (leveneTest, shapiro.test,
aov, kruskal.test, TukeyHSD,
HSD.test, dunn.test and
pairwise.wilcox.test) from packages such as
car, stats, agricolae
and dunn.test. The utility of these functions from
micro4all is that they loop over the different
calculated indices. In this way, we can perform the desired test in one
line for all indices and get the results in a single table, making the analysis
easier and faster. It is strongly recommended to check the help
page of each function with ?functionname. All
statistical analyses will be explained and exemplified with the
micro4all package, but a section will also be
included at the end of this tutorial to show the statistical analysis
without this package.
The basic arguments of all these functions are the same. First, we
have to provide an α diversity table with indices and metadata
variables (as the one produced here). If you have not followed the
tutorial with the example dataset, you can always check a similar table
with View(micro4all::alpha_diversity_table), as well as
checking the examples with ?functionOfInterest. Then, we
have to give the number of indices to be tested and the name of the
variable of interest. Specific details of each function will be provided
in due course.
First, we will check the homogeneity of variances and normality of
data, in order to choose a proper statistical method later on. We will
use levene.test.alpha and Shapiro functions
from micro4all. The result of these functions is the same
as that of the original functions (leveneTest and
shapiro.test), but computed for every index.
#### Apply levene and shapiro test ####
levene<- levene.test.alpha(alpha_table, 4, "location") #change location to your variable of interest
#> Warning in leveneTest.default(y = y, group = group, ...): group coerced to factor.
#> Warning in leveneTest.default(y = y, group = group, ...): group coerced to factor.
#> Warning in leveneTest.default(y = y, group = group, ...): group coerced to factor.
#> Warning in leveneTest.default(y = y, group = group, ...): group coerced to factor.
levene
#> Df F value Pr(>F) IndexColumn
#> group 2 0.4798205 0.6255148 Observed
#> 21 NA NA Observed
#> group1 2 0.2423080 0.7869784 Shannon
#> 1 21 NA NA Shannon
#> group2 2 0.5007044 0.6131581 InvSimpson
#> 2 21 NA NA InvSimpson
#> group3 2 0.3925887 0.6801599 Evenness
#> 3 21 NA NA Evenness
#Write table on local machine
write.table(levene, file="levene_test.txt", sep="\t")
shapiro <- Shapiro(alpha_table, 4, "location") #change location to your variable of interest
shapiro
#> p.value Index
#> shapiro 0.5529423 Observed
#> shapiro.1 0.8088936 Shannon
#> shapiro.2 0.4396685 InvSimpson
#> shapiro.3 0.9296042 Evenness
#Write table on local machine
write.table(shapiro, file="shapiro_test.txt", sep="\t")
None of them returns a significant value. That means, according to these statistical tests, every diversity index from our data follows a
normal distribution with no significant differences in variance
(homoscedasticity). With this information we can perform an ANOVA test.
With the BalancedAnova function from micro4all, it
is possible to run an ANOVA test for every index at once.
BalancedAnova returns a list with two elements; the first
element is a data frame with summary results for all indices.
#### Balanced anova ####
balanced_anova<- BalancedAnova(alpha_table, numberOfIndexes = 4, formula = "location") #change location to your variable of interest
balanced_anova[[1]]
#> Df Sum Sq Mean Sq F value Pr(>F) IndexColum
#> data[, formula] 2 6.190750e+03 3.095375e+03 1.125528 0.34328482 Observed
#> Residuals 21 5.775325e+04 2.750155e+03 NA NA Observed
#> data[, formula]1 2 2.739446e-01 1.369723e-01 1.242577 0.30900953 Shannon
#> Residuals 1 21 2.314882e+00 1.102325e-01 NA NA Shannon
#> data[, formula]2 2 2.837867e+02 1.418933e+02 3.693011 0.04223816 InvSimpson
#> Residuals 2 21 8.068648e+02 3.842213e+01 NA NA InvSimpson
#> data[, formula]3 2 3.978665e-03 1.989332e-03 1.155852 0.33402229 Evenness
#> Residuals 3 21 3.614302e-02 1.721096e-03 NA NA Evenness
#Write table on local machine
write.table(balanced_anova[[1]], file="balancedAnova_test.txt", sep="\t")
According to this ANOVA test, the only differences are found for the Inverse Simpson index, and even that result is only marginally significant.
Remember that, in case you have an unbalanced experiment, you can
always use the function UnbalancedAnova in a similar way.
Moreover, if you prefer a non-parametric test, the function
kruskal.wallis is also available. These could be used as
follows: kruskal.wallis(alpha_table, 4, "location") and
UnbalancedAnova(alpha_table, 4, "location").
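As a minimal sketch, following the same argument order as the calls above:
#### NON-PARAMETRIC / UNBALANCED ALTERNATIVES (sketch) ####
kruskal <- kruskal.wallis(alpha_table, 4, "location") #change location to your variable of interest
unbalanced <- UnbalancedAnova(alpha_table, 4, "location")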
Now that we have an ANOVA result, imagine we want to test differences
between every two groups, that is, location1 vs location2, location2 vs
location3 and location1 vs location3. Let’s do that, again for every
index, with Tukey.test from micro4all.
#### TUKEY ####
tukey <- Tukey.test(alpha_table, 4, "location", balanced=TRUE) #change location to your variable of interest
tukey
#> diff lwr upr p adj IndexColumn
#> location2.location1 23.12500000 -42.96676903 89.21676903 0.65731496 Observed
#> location3.location1 -16.00000000 -82.09176903 50.09176903 0.81616965 Observed
#> location3.location2 -39.12500000 -105.21676903 26.96676903 0.31475156 Observed
#> location2.location1.1 0.23860162 -0.17982909 0.65703233 0.34061351 Shannon
#> location3.location1.1 0.02620648 -0.39222423 0.44463718 0.98636386 Shannon
#> location3.location2.1 -0.21239514 -0.63082585 0.20603556 0.42190599 Shannon
#> location2.location1.2 7.74026543 -0.07168602 15.55221688 0.05242611 InvSimpson
#> location3.location1.2 0.99328997 -6.81866147 8.80524142 0.94510063 InvSimpson
#> location3.location2.2 -6.74697545 -14.55892690 1.06497600 0.09867956 InvSimpson
#> location2.location1.3 0.03124359 -0.02104072 0.08352790 0.30831191 Evenness
#> location3.location1.3 0.01189647 -0.04038784 0.06418078 0.83555221 Evenness
#> location3.location2.3 -0.01934712 -0.07163144 0.03293719 0.62618013 Evenness
#Write table on local machine
write.table(tukey, file="tukey_test.txt", sep="\t")
As we can see, there is no statistically significant comparison
among groups. Another pairwise option is Dunn's test, which is
less powerful but makes no assumptions about the data. It is run in a
similar way (dunn <- dunnT(alpha_table, 4, "location")).
For this function, the result is composed of two elements:
dunn[1] holds all results from the dunn.test
function and dunn[2] summarizes the results in a data frame.
When using the kruskal.wallis function, the pairwise test to
be applied should not be Tukey's test, because it assumes a normal
distribution of the data. Instead, it is recommended to use Dunn's test or
the Mann–Whitney–Wilcoxon test. micro4all also implements a
wilcoxon.test function, which would be used like this:
wilcoxon <- wilcoxon.test(alpha_table, 4, "location", p.adjust.method = "BH")
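Putting those calls together in one chunk:
#### DUNN AND WILCOXON (sketch) ####
dunn <- dunnT(alpha_table, 4, "location") #change location to your variable of interest
dunn[2] #summary data frame
wilcoxon <- wilcoxon.test(alpha_table, 4, "location", p.adjust.method = "BH")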
Plot \(\alpha\) diversity indices
We will produce a ggplot graphic with α-diversity indices. It
is necessary to have a formatted table for ggplot, which we can achieve
using the pivot_longer function from
tidyr.
#### PRODUCE A TABLE FOR ALPHA PLOT ####
alpha_plot_table <- tidyr::pivot_longer(data = alpha_table, names_to = "Measure", values_to = "Value", cols=c(Observed, Shannon, InvSimpson, Evenness))
In the next lines, remember to change location to your
variable of interest and also its values in limits,
labels and colors. For example, in
scale_fill_manual(values=c("location1"=...)), your grouping
variable could be health status, thus, colors should be assigned to
healthy and infected. Furthermore, the order of the panels can be
changed in the second line, specifically, changing the order of levels
in facet_wrap.
### ALPHA GRAPHIC GGPLOT ###
alpha_graphic <- ggplot(data = alpha_plot_table, aes(x = location, y = Value)) + #Change location to variable of interest
facet_wrap(~factor(Measure, levels=c("Observed", "InvSimpson","Shannon", "Evenness")), scale = "free") + #change levels order to reorder panels
geom_boxplot()+
theme(axis.text.x = element_text(size=13),
legend.position="bottom", strip.text = element_text(size = 20), axis.text.y = element_text(size=15), axis.title.y=element_text(size=20)) +
scale_x_discrete(limits=c("location1",
"location2","location3"), breaks=c("location1",
"location2","location3"), ##With breaks and labels you can change the name displayed on labels, #Change locations to values of variable of interest
labels=c("Location 1",#Change locations to values of variable of interest
"Location 2","Location 3")) +
aes(fill=location)+ scale_fill_manual(values = c( "location1"= "#33CC00",#Change locations to values of variable of interest
"location2"= "#ffd624", "location3"="blue"), na.translate=FALSE) +
theme(legend.key.size = unit(1, 'cm')) +
ylab("Alpha Diversity Measure")
alpha_graphic
#use ggsave for saving ggplots
ggsave("alpha_plot.tiff", plot = alpha_graphic,width = 17, height = 30, units = "cm", dpi = 800)Both statistical analysis and graphic show no significant differences between groups. Yet there could be distinct groups when studying β diversity, i.e. integrating abundance and taxonomic information.
\(\beta\) diversity analysis
Data preparation
A \(\beta\) diversity analysis allows us to study the structure of communities along an environmental gradient, e.g., to quantify differences between locations. Before computing these differences or distances, it is crucial to normalize the data. This is mainly needed to account for differences in library sizes between samples (sequencing depth), as well as to consider their compositionality, an inherent characteristic of high-throughput sequencing (HTS) data1. In this tutorial, normalization will be done with the edgeR package. When processing data with this package, variations in sequencing depth are automatically adjusted. Moreover, it implements the method of trimmed mean of M-values (TMM) through the function calcNormFactors. This method will be used because our data always sum up to 1 (compositionality). Because of that, an ASV could be under-represented in a sample only because it was under-sampled due to other abundant ASVs consuming the library size, a situation that could lead to untrustworthy statistical differences.
We will install the package first:
#Install edgeR
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("edgeR")And load the library:
#Load library
library(edgeR); packageVersion("edgeR")
#> [1] '3.38.4'
With DGEList we create a DGEList object needed for edgeR processing. With this object, calcNormFactors computes normalization factors to account for the effective library size (compositionality). Then, we will extract the normalized counts with cpm.
#### NORMALIZE ACCORDING TO EDGER ####
edgeR <- DGEList(counts = OTU, samples = metadata.micro, genes = tax)#Change metadata.micro for your own mt object
edgeR <- calcNormFactors(edgeR)
##EXTRACT NORMALIZED COUNTS
ASV_norm <- cpm(edgeR, normalized.lib.sizes=T, log=F)
Once normalized counts are produced, we need to construct a normalized phyloseq object for further analysis. Remember again to change metadata.micro to your own mt file and to remove tree_root when working with ITS data.
##CREATE NORMALIZED PHYLOSEQ OBJECT
phy_OTU_norm<-otu_table(as.data.frame(ASV_norm,row.names=F), taxa_are_rows = T)
phy_taxonomy_norm<-tax_table(as.matrix(tax))
phy_metadata_norm<-sample_data(metadata.micro) #Change metadata.micro to your own mt file
##Add taxa names
taxa_names(phy_OTU_norm)<- taxa_names(phy_taxonomy_norm)
#Check
identical(rownames(ASV_norm), rownames(tax))
#> [1] TRUE
##Merge
norm_phyloseq<-phyloseq(phy_OTU_norm,phy_taxonomy_norm,phy_metadata_norm,tree_root) #Remove tree_root when working with ITS
Everything is now ready to analyze \(\beta\) diversity. A distance method will be needed to measure the dissimilarities between samples. There are several ways to compute these dissimilarities. We will mainly use Bray-Curtis, UniFrac and Weighted UniFrac distances. Bray-Curtis is based on the abundances of shared members between samples without considering phylogeny, while UniFrac is based on shared phylogenetic branches. Weighted UniFrac is similar to UniFrac, but it also includes information about relative abundances. A very detailed and readable discussion of this can be found in Paliy et al., 20162
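If you want to inspect one of these distance matrices directly, a minimal sketch with phyloseq's distance function (not required for the pipeline below) would be:
##Compute a single distance matrix by hand (sketch)
bray_dist <- phyloseq::distance(norm_phyloseq, method = "bray")
as.matrix(bray_dist)[1:4, 1:4] #peek at the pairwise dissimilarities between the first four samples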
With these distance measures, a graphical representation can be
obtained to visualize if our samples are differently grouped according
to our variable of interest. In our case, we would like to know if
location shapes bacterial communities differently. However,
a statistical test is needed to confirm results. For this purpose, we
will use a permutational multivariate
analysis of variance
(PERMANOVA). The adonis function from vegan is
wrapped in the Permanova function of
micro4all with the aim of computing a PERMANOVA test
with different distances at once. This function works with a normalized
phyloseq object, a character vector with distances and a formula, where
the variable of interest should be included. Once performed,
permanova[[1]] returns a data frame with summary results
from adonis function. For a more detailed result, check
permanova[[2]].
#### PERMANOVA ####
permanova<- Permanova(norm_phyloseq,distances = c("bray", "unifrac", "wunifrac"), formula = "location") #Change location to variable of interest
The formula argument can be used to analyze more than
one factor at once. This can be done as follows:
formula="variable1+variable2" (in case of independent
factors) or formula="variable1*variable2" (in case of
interacting factors).
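For instance, assuming a hypothetical second metadata column called treatment:
##Two-factor PERMANOVA (sketch; "treatment" is a hypothetical column in your metadata)
Permanova(norm_phyloseq, distances = c("bray", "unifrac", "wunifrac"), formula = "location+treatment") #independent factors
Permanova(norm_phyloseq, distances = c("bray", "unifrac", "wunifrac"), formula = "location*treatment") #interacting factors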
Next, let’s see the result from Permanova:
##Permanova result
permanova[[1]]
#> Df SumOfSqs R2 F Pr(>F) distances
#> location 2 1.4040287 0.3004895 4.510496 0.001 bray
#> Residual 21 3.2684435 0.6995105 NA NA bray
#> Total 23 4.6724722 1.0000000 NA NA bray
#> location1 2 0.5468104 0.2156555 2.886975 0.005 unifrac
#> Residual1 21 1.9887628 0.7843445 NA NA unifrac
#> Total1 23 2.5355732 1.0000000 NA NA unifrac
#> location2 2 0.1343614 0.1647734 2.071439 0.025 wunifrac
#> Residual2 21 0.6810699 0.8352266 NA NA wunifrac
#> Total2 23 0.8154313 1.0000000 NA NA wunifrac
#Write table to local machine
write.table(permanova[[1]], file="permanova.txt",sep="\t")
With the table above, we can see that differences are statistically
significant with every distance method used. Moreover, Bray-Curtis
distance seems to explain the most variance (expressed as
R2). Nevertheless, another test is necessary
to confirm that these significant results are not due to differences
among group variances (dispersion of samples). Again, we can perform
the betadisper test (from the vegan
package) with several distance measures thanks to the
micro4all function Betadispersion. It works very
similarly to Permanova, so do not forget to
change location to your variable of interest.
#### BETADISPERSION ####
betadisper<- Betadispersion(loc_phyloseq,distances = c("bray", "unifrac", "wunifrac"), formula = "location") #change location to variable of interest
betadisper[1] #data frame with results for every distance method
#> [[1]]
#> Df Sum Sq Mean Sq F N.Perm Pr(>F) distances
#> Groups 2 0.009769604 0.004884802 0.7280460 999 0.482 bray
#> Residuals 21 0.140898839 0.006709469 NA NA NA bray
#> Groups1 2 0.009190666 0.004595333 0.8473928 999 0.460 unifrac
#> Residuals1 21 0.113881059 0.005422908 NA NA NA unifrac
#> Groups2 2 0.002940247 0.001470124 0.5231941 999 0.615 wunifrac
#> Residuals2 21 0.059007916 0.002809901 NA NA NA wunifrac
#write file to your local machine
write.table(betadisper[1], file="betadisper.txt", sep="\t")
In this example, dispersion was not significantly different between the three groups, so we can trust the PERMANOVA results. Nonetheless, it would be interesting to know the pairwise differences between groups. We can run a pairwiseAdonis test to check for them. Once more, micro4all gives us the option to loop over distances with PairwiseAdonisFun. The pval argument sets the p-value threshold used to filter the results.
#### PAIRWISE ADONIS ####
pairwise<- PairwiseAdonisFun(norm_phyloseq,distances = c("bray", "unifrac", "wunifrac"), formula = "location", pval=0.05) #Change location to your variable of interest
pairwise #data frame with all significant results
#> pairs Df SumsOfSqs F.Model R2 p.value p.adjusted sig distances
#> 1 location1 vs location2 1 0.3391022 2.076899 0.1291853 0.005 0.0050 * bray
#> 2 location1 vs location3 1 0.9248550 6.081317 0.3028346 0.001 0.0015 * bray
#> 3 location2 vs location3 1 0.8420859 5.555908 0.2841038 0.001 0.0015 * bray
#> 4 location1 vs location2 1 0.2696235 2.823642 0.1678377 0.029 0.0370 . unifrac
#> 5 location1 vs location3 1 0.2773369 2.785471 0.1659454 0.011 0.0330 . unifrac
#> 6 location2 vs location3 1 0.2732552 3.068367 0.1797692 0.037 0.0370 . unifrac
#write file to your local machine
write.table(pairwise, file="pairwise.txt", sep="\t")
It seems that, according to the statistical analysis used and with Bray-Curtis distances, statistically significant differences are found among all locations.
For visualization, we can use PCoA and
NMDS representations with every measure. While PCoA
produces artificial variables to explain the variance among samples,
NMDS tries to represent all the variance in two axes. In the code below,
we will see how to use a loop to produce every possible plot, combining
PCoA, NMDS and the three distances.
Every plot will be assigned to a variable called plot_list.
Then, calling plot_list[1] will give you the first plot. If
you want to use more types of distances, modify var_list.
Pay attention to the comments (#) to see where changes
should be applied when using your own data. To save the plots, you can
choose between several options which were detailed at the end of
Data preparation section in \(\alpha\) diversity.
#### BETA DIVERSITY PLOTS ####
var_list <- c("bray", "unifrac", "wunifrac")
plot_type <- c("PCoA", "NMDS")
combination_plot <- purrr::cross2(plot_type,var_list )
# Make plots.
plot_list = list()
for (i in 1:length(combination_plot)){
pcoa <- ordinate(norm_phyloseq,combination_plot[[i]][[1]],combination_plot[[i]][[2]]) #get distance matrix
plot <- plot_ordination(norm_phyloseq, pcoa, type="samples", color="location")+ #change location to variable of interest (e.g., health status)
geom_text(aes(label=location), hjust=0, vjust=0, show.legend=FALSE)+#change location variable of interest (e.g., health status)
geom_point(size=4)
if (combination_plot[[i]][[1]]=="NMDS"){ ## graphic details for NMDS
plot <- plot + xlab(paste("Axis 1"))+
ylab(paste("Axis 2"))+
theme(legend.position="bottom", plot.title = element_text(face="bold", hjust = 0.5), legend.title = element_blank(), legend.key = element_blank(),
panel.border = element_rect(linetype = "solid", fill = NA),
panel.background = element_rect(fill="white", colour="white"),
panel.grid.major = element_line(colour="aliceblue", size=0.4),
panel.grid.minor= element_line(colour = "aliceblue", size =0.4))+
scale_color_manual(values=c("location1"= "#33CC00",
"location2"= "#ffd624", "location3"="blue"))+#change locations to values of variable of interest (eg. healthy, infected)
guides(color=guide_legend(nrow=2,byrow=TRUE))+
guides(shape=guide_legend(nrow=2,byrow=TRUE))+
theme(plot.title = element_text(face="bold", hjust = 0.5))+
ggtitle(paste("Bacterial Community", combination_plot[[i]][[1]], "on", combination_plot[[i]][[2]], "distances", round(pcoa$stress, digits = 3)))
plot_list[[i]] = plot
}
else {## graphic details for PCoA
plot= plot + xlab(paste("PCo 1", paste("(",round(pcoa$values[1,2]*100,1),"%",")",sep=""),sep=" "))+
ylab(paste("PCo 2", paste("(",round(pcoa$values[2,2]*100,1),"%",")",sep=""),sep=" "))+
theme(legend.position="bottom", plot.title = element_text(face="bold", hjust = 0.5), legend.title = element_blank(), legend.key = element_blank(),
panel.border = element_rect(linetype = "solid", fill = NA),
panel.background = element_rect(fill="white", colour="white"),
panel.grid.major = element_line(colour="aliceblue", size=0.4),
panel.grid.minor= element_line(colour = "aliceblue", size =0.4))+
scale_color_manual(values=c("location1"= "#33CC00",
"location2"= "#ffd624", "location3"="blue"))+#change locations to values of variable of interest (eg. healthy, infected)
guides(color=guide_legend(nrow=2,byrow=TRUE))+
guides(shape=guide_legend(nrow=2,byrow=TRUE))+
theme(plot.title = element_text(face="bold", hjust = 0.5))+
ggtitle(paste("Bacterial Community", combination_plot[[i]][[1]], "on", combination_plot[[i]][[2]], "distances"))
plot_list[[i]] = plot
}
}
#Visualize plot
plot_list[[1]]
#Save plot on local machine
ggsave("bray-curtis_pcoa.tiff", plot = plot_list[[1]],width = 17, height = 30, units = "cm", dpi = 800 )
As shown by PERMANOVA, there are differences between all groups (locations), and location3 samples seem to differ more from the other groups (i.e., the distance is greater).
If you want to perform only one plot, you can change
var_list and plot_type or use the code below,
which shows you how to do an NMDS and a PCoA separately. Pay attention to
the comments (#) in order to adapt the code to your own
data:
####PCOA####
#Calculate distances
pcoa <- ordinate(norm_phyloseq,"PCoA","bray") #get distance matrix, change bray to the chosen distance
#Produce plot
pcoa_bray <- plot_ordination(norm_phyloseq, pcoa, type="samples", color="location")+ #change location to variable of interest
geom_text(aes(label=location), hjust=0, vjust=0, show.legend=FALSE)+#change location to variable of interest
geom_point(size=4) +
xlab(paste("PCo 1", paste("(",round(pcoa$values[1,2]*100,1),"%",")",sep=""),sep=" "))+
ylab(paste("PCo 2", paste("(",round(pcoa$values[2,2]*100,1),"%",")",sep=""),sep=" "))+
theme(legend.position="bottom", plot.title = element_text(face="bold", hjust = 0.5), legend.title = element_blank(), legend.key = element_blank(),
panel.border = element_rect(linetype = "solid", fill = NA),
panel.background = element_rect(fill="white", colour="white"),
panel.grid.major = element_line(colour="aliceblue", size=0.4),
panel.grid.minor= element_line(colour = "aliceblue", size =0.4))+
scale_color_manual(values=c("location1"= "#33CC00",
"location2"= "#ffd624", "location3"="blue"))+#change locations to values of variable of interest (eg. healthy, infected)
guides(color=guide_legend(nrow=2,byrow=TRUE))+
guides(shape=guide_legend(nrow=2,byrow=TRUE))+
theme(plot.title = element_text(face="bold", hjust = 0.5))+
ggtitle("Bacterial Community PCoA on Bray Curtis distances") #change title according to distance used
#Visualize plot
pcoa_bray
#Save it to local machine
ggsave("bray-curtis_pcoa.tiff", plot = pcoa_bray,width = 17, height = 30, units = "cm", dpi = 800 )
Let's see how it works for an NMDS!
#### NMDS####
#Calculate distances
nmds <- ordinate(norm_phyloseq,"NMDS","bray") #get distance matrix, change bray to the chosen distance
#> Square root transformation
#> Wisconsin double standardization
#> Run 0 stress 0.1895576
#> Run 1 stress 0.2136299
#> Run 2 stress 0.1907108
#> Run 3 stress 0.203114
#> Run 4 stress 0.1907108
#> Run 5 stress 0.1925871
#> Run 6 stress 0.2273544
#> Run 7 stress 0.1889755
#> ... New best solution
#> ... Procrustes: rmse 0.05124936 max resid 0.2188175
#> Run 8 stress 0.2846452
#> Run 9 stress 0.1904783
#> Run 10 stress 0.217401
#> Run 11 stress 0.1889754
#> ... New best solution
#> ... Procrustes: rmse 6.24282e-05 max resid 0.0001315906
#> ... Similar to previous best
#> Run 12 stress 0.1907108
#> Run 13 stress 0.1925871
#> Run 14 stress 0.203114
#> Run 15 stress 0.1925872
#> Run 16 stress 0.1889755
#> ... Procrustes: rmse 0.0001726498 max resid 0.0005084086
#> ... Similar to previous best
#> Run 17 stress 0.2277738
#> Run 18 stress 0.1889755
#> ... Procrustes: rmse 0.0002819056 max resid 0.0007794508
#> ... Similar to previous best
#> Run 19 stress 0.1895576
#> Run 20 stress 0.2647637
#> *** Best solution repeated 3 times
#Produce plot
nmds_bray <- plot_ordination(norm_phyloseq, nmds, type="samples", color="location")+ #change location to variable of interest (e.g., health status)
geom_text(aes(label=location), hjust=0, vjust=0, show.legend=FALSE)+#change location to variable of interest
geom_point(size=4)+
xlab(paste("Axis 1"))+
ylab(paste("Axis 2"))+
theme(legend.position="bottom", plot.title = element_text(face="bold", hjust = 0.5), legend.title = element_blank(), legend.key = element_blank(),
panel.border = element_rect(linetype = "solid", fill = NA),
panel.background = element_rect(fill="white", colour="white"),
panel.grid.major = element_line(colour="aliceblue", size=0.4),
panel.grid.minor= element_line(colour = "aliceblue", size =0.4))+
scale_color_manual(values=c("location1"= "#33CC00",
"location2"= "#ffd624", "location3"="blue"))+ #change locations to values of variable of interest (eg. healthy, infected)
guides(color=guide_legend(nrow=2,byrow=TRUE))+
guides(shape=guide_legend(nrow=2,byrow=TRUE))+
theme(plot.title = element_text(face="bold", hjust = 0.5))+
ggtitle(paste("Bacterial Community NMDS on Bray-Curtis distances", round(nmds$stress, digits = 3))) #change title according to distance used
#Visualize plot
nmds_bray
#Save it to local machine
ggsave("bray-curtis_nmds.tiff", plot = nmds_bray,width = 17, height = 30, units = "cm", dpi = 800 )
As mentioned before, PCoA produces artificial variables that
are represented as axes. Each of them explains a certain amount of
variance. Going deeper into the analysis, we can get a bar plot with the
variance explained by every axis. Moreover, we will use
length to see how many axes were needed for the PCoA
ordination.
#### ORDINATE PHYLOSEQ OBJECT ####
PCOA_bray <- ordinate(norm_phyloseq, "PCoA", "bray", formula="location") #Change to your variable of interest
#Get number of axes
length(PCOA_bray$values$Relative_eig)
#> [1] 23
#Get plot
plot_scree(PCOA_bray)
And with the next line, we can get the percentage of variance explained by the number of axes we choose. The number of axes needed to explain a certain amount of variance can give us an idea of the multidimensionality of our data.
##Get explained variance in the first 10 axes
sum(PCOA_bray$values$Relative_eig[1:10]) #change 1:10 to select the number of axes
#> [1] 0.833356
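Relatedly, cumsum makes it easy to find how many axes are needed to reach a given fraction of the variance, e.g., 80%:
##Number of axes needed to explain at least 80% of the variance (sketch)
which(cumsum(PCOA_bray$values$Relative_eig) >= 0.80)[1]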
Produce tables with relative abundance
With the α-diversity and β-diversity analyses, we got a global
perspective of the structure and diversity of microbial communities.
Nevertheless, it is also interesting to get information about the
taxonomic profile, i.e., which phyla and genera are more abundant in our
samples. Along these lines, several analyses can be performed. We will start
with producing and saving tables with relative abundances at different
levels using our phyloseq object without normalization
(loc_phyloseq). For example, we can produce a table with
relative abundance of ASV for each sample. Moreover, we can group this
by location, thus having ASV relative abundance for each location. This
can be done as well at genus and phylum level. We will start at ASV and
sample level. First, with the function
transform_sample_counts, our phyloseq object
(loc_phyloseq) will be converted to relative
abundances. Then, each element of this phyloseq object will be extracted
and bound into a table.
##CALCULATE RELATIVE ABUNDANCE, ASV, SAMPLES
sample_relabun_ASV <- transform_sample_counts(loc_phyloseq, function(x){x/sum(x)}*100)
##CONSTRUCT THE TABLE
OTU_sample <- as.data.frame((otu_table(sample_relabun_ASV)))
taxonomy_sample <- as.data.frame(tax_table(sample_relabun_ASV))
identical(rownames(OTU_sample),rownames(taxonomy_sample))
#> [1] TRUE
sample_relabun_ASV_table <- cbind(taxonomy_sample, OTU_sample)
#SORT IT
sample_relabun_ASV_table <- sample_relabun_ASV_table[sort(sample_relabun_ASV_table$ASV_names, decreasing = FALSE),]
With colSums we should check that summing ASV abundances for every sample (i.e., each column) gives a result of 100:
#Check relative abundance sums 100
colSums(sample_relabun_ASV_table[,8:ncol(sample_relabun_ASV_table)])
#> L1_1 L1_2 L1_3 L1_4 L1_5 L1_6 L1_7 L1_8 L2_1 L2_2 L2_3 L2_4 L2_5 L2_6 L2_7 L2_8 L3_1 L3_2 L3_3
#> 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100
#> L3_4 L3_5 L3_6 L3_7 L3_8
#> 100 100 100 100 100
We can have a look at the beginning of the table and save it on our local machine as follows:
##See table
head(sample_relabun_ASV_table)
#> Kingdom Phylum Class Order Family
#> ASV0002 Bacteria Actinobacteria Actinobacteria Streptomycetales Streptomycetaceae
#> ASV0003 Bacteria Proteobacteria unclassified unclassified unclassified
#> ASV0006 Bacteria Actinobacteria Actinobacteria Pseudonocardiales Pseudonocardiaceae
#> ASV0007 Bacteria Actinobacteria Actinobacteria Pseudonocardiales Pseudonocardiaceae
#> ASV0008 Bacteria Actinobacteria Actinobacteria Pseudonocardiales Pseudonocardiaceae
#> ASV0009 Bacteria Actinobacteria Actinobacteria Micrococcales Promicromonosporaceae
#> Genus ASV_names L1_1 L1_2 L1_3 L1_4 L1_5
#> ASV0002 Streptomyces ASV0002 13.8889992 4.937848 10.893415 22.1990687 9.5526244
#> ASV0003 unclassified ASV0003 3.8330224 27.683538 26.364099 15.4691786 10.6119119
#> ASV0006 unclassified ASV0006 2.5943864 6.496807 7.246131 10.9847116 19.6124558
#> ASV0007 Pseudonocardia ASV0007 1.3855254 3.001168 2.136059 0.9440983 0.4732119
#> ASV0008 unclassified ASV0008 11.3065227 2.218254 2.586520 1.1120798 7.1675279
#> ASV0009 Promicromonospora ASV0009 0.1905594 1.504018 2.310431 1.8605541 0.1060647
#> L1_6 L1_7 L1_8 L2_1 L2_2 L2_3 L2_4 L2_5
#> ASV0002 6.9848271 20.4381884 5.620269 7.6930658 7.2420046 11.2892520 3.0031259 14.4508131
#> ASV0003 24.8232220 10.3801344 5.665003 10.5471676 3.6577360 22.6933878 7.1731339 1.5160285
#> ASV0006 6.2825092 5.3759593 1.043176 6.6385461 4.5674805 7.4010722 7.1764071 1.9043707
#> ASV0007 1.8744457 2.3142073 0.925081 2.0439942 0.3016851 0.5514424 1.8673388 0.6254551
#> ASV0008 4.2882140 3.0677164 6.514932 14.3000749 12.2581049 7.5312739 1.3452695 4.4099251
#> ASV0009 0.4674129 0.3419619 3.689588 0.2030197 7.7359552 0.1582844 0.5007937 7.1880659
#> L2_6 L2_7 L2_8 L3_1 L3_2 L3_3 L3_4 L3_5
#> ASV0002 12.829278 8.654647 9.041784 13.6220214 5.024374 17.70502153 7.8901345 8.995795
#> ASV0003 0.921740 9.530792 9.079588 8.5899753 11.863936 10.48276173 11.3430493 3.270292
#> ASV0006 1.774789 6.913218 5.461540 0.3648316 1.406825 0.62233647 0.7892377 1.592300
#> ASV0007 1.442515 4.163499 3.624719 14.8167625 16.529120 5.98209655 9.9327354 4.862592
#> ASV0008 8.960207 1.828319 6.101981 0.0000000 0.000000 0.00000000 0.0000000 0.000000
#> ASV0009 2.551159 1.178451 1.461006 7.1783073 1.233644 0.03156779 1.8609865 21.526934
#> L3_6 L3_7 L3_8
#> ASV0002 19.96646697 19.7996690 5.78009910
#> ASV0003 14.42738258 9.9207386 21.88830856
#> ASV0006 2.58040771 0.1428447 0.82764162
#> ASV0007 16.93827060 3.8899051 13.41368689
#> ASV0008 0.00000000 0.0000000 0.00000000
#> ASV0009 0.01840227 0.5853149 0.03214142
#Save ASV sample table on your local machine
write.table(sample_relabun_ASV_table, file="ASV_sample_relabun.txt", sep="\t")
So, how could we do this at genus and phylum level? First, the
phyloseq object has to be grouped at the desired taxonomic level with
tax_glom and follow the same steps as before.
### GET TABLE BY SAMPLE AT GENUS LEVEL ###
#Glom phyloseq
loc_phyloseq_genus <- tax_glom(loc_phyloseq, taxrank = "Genus") #Change Genus to the desired taxonomic rank
sample_relabun_genus <- transform_sample_counts(loc_phyloseq_genus, function(x){x/sum(x)}*100)
##Extract elements from phyloseq
OTU_sample_genus <- as.data.frame((otu_table(sample_relabun_genus)))
taxonomy_sample_genus <- as.data.frame(tax_table(sample_relabun_genus)[,-7])
identical(rownames(OTU_sample_genus),rownames(taxonomy_sample_genus))
#> [1] TRUE
#Bind table
sample_table_genus <- cbind(taxonomy_sample_genus, OTU_sample_genus)
Furthermore, if we are not interested in the taxonomic profile of
unclassified sequences at genus level, we can group unclassified genera
together. First, we save the unclassified genera in a new variable
(sample_table_genus_unclass), where we sum up their
abundances. Then we bind sample_table_genus without
unclassifieds with this new variable:
##Agglomerate unclassified in otu table
#Get sum of unclassified sequences
sample_table_genus_unclass <- sample_table_genus %>% subset(Genus=="unclassified", select=c(7:ncol(sample_table_genus))) %>%
colSums() %>% t() %>% as.data.frame() %>% cbind(Kingdom="unclassified", Phylum="unclassified", Class="unclassified",
Order="unclassified", Family="unclassified", Genus="unclassified", .)
#Bind sample_table_genus without unclassifieds with sample_table_genus_unclass
sample_table_genus_final <- rbind(subset(sample_table_genus, Genus!="unclassified"), sample_table_genus_unclass)
#Check relative abundance sums 100
colSums(sample_table_genus_final[,7:ncol(sample_table_genus_final)]) #abundance columns start at 7 (6 taxonomy columns)
#> L1_1 L1_2 L1_3 L1_4 L1_5 L1_6 L1_7 L1_8 L2_1 L2_2 L2_3 L2_4 L2_5 L2_6 L2_7 L2_8 L3_1 L3_2 L3_3
#> 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100
#> L3_4 L3_5 L3_6 L3_7 L3_8
#> 100 100 100 100 100
#Sort table from most to least abundant genera (calculated as sum of abundance between samples)
sample_table_genus_final <- sample_table_genus_final[order(rowSums(sample_table_genus_final[,7:ncol(sample_table_genus_final)]), decreasing=TRUE),]
#See first lines of this table
head(sample_table_genus_final)
#> Kingdom Phylum Class Order Family
#> 1 unclassified unclassified unclassified unclassified unclassified
#> ASV0002 Bacteria Actinobacteria Actinobacteria Streptomycetales Streptomycetaceae
#> ASV0007 Bacteria Actinobacteria Actinobacteria Pseudonocardiales Pseudonocardiaceae
#> ASV0010 Bacteria Actinobacteria Actinobacteria Pseudonocardiales Pseudonocardiaceae
#> ASV0009 Bacteria Actinobacteria Actinobacteria Micrococcales Promicromonosporaceae
#> ASV0022 Bacteria Proteobacteria Alphaproteobacteria Rhizobiales Rhizobiaceae
#> Genus L1_1 L1_2 L1_3 L1_4 L1_5 L1_6
#> 1 unclassified 27.8534281 47.8401209 52.6289022 42.899062 50.3290726 53.4049234
#> ASV0002 Streptomyces 24.5027591 13.3266946 15.1413141 28.718450 10.8362796 10.6258539
#> ASV0007 Pseudonocardia 6.9494621 4.7215164 8.2560364 4.986285 8.2186565 6.6037057
#> ASV0010 Amycolatopsis 7.5211402 2.8123068 5.9238091 5.628442 5.3848246 8.7346293
#> ASV0009 Promicromonospora 0.2679741 1.5040176 2.4508973 1.860554 0.1060647 0.4674129
#> ASV0022 Neorhizobium 0.4069237 0.1201841 0.2567145 1.599013 0.1033451 0.4698099
#> L1_7 L1_8 L2_1 L2_2 L2_3 L2_4 L2_5 L2_6
#> 1 38.2838284 36.6489524 56.2679860 39.491043 49.1881542 41.2663863 19.876403 33.165067
#> ASV0002 29.6135035 11.0705531 9.3605866 14.072905 18.6341588 5.4039900 23.175445 13.396380
#> ASV0007 5.3998171 7.6189454 9.5892301 4.206396 5.5450600 3.7559531 4.025317 6.603939
#> ASV0010 5.1254523 7.5634763 11.6923562 11.939225 11.5905029 7.6575618 7.787383 11.818080
#> ASV0009 0.4731798 4.3910033 0.3666180 7.735955 0.1582844 0.5007937 7.188066 2.551159
#> ASV0022 0.3280449 0.3453397 0.3409942 4.917623 0.4952770 4.8377330 5.132466 6.952188
#> L2_7 L2_8 L3_1 L3_2 L3_3 L3_4 L3_5 L3_6
#> 1 38.769415 47.3192644 18.7625308 30.366459 34.09772486 27.253363 17.798282 24.87169526
#> ASV0002 13.127693 10.1047388 21.5250616 7.904302 23.50222102 10.071749 10.554216 24.77968389
#> ASV0007 9.020311 8.9928617 22.2925226 23.293851 11.42528580 16.302691 9.041631 25.36855665
#> ASV0010 3.939032 7.2983611 2.7198028 9.501411 4.71713004 5.627803 4.430140 5.16899421
#> ASV0009 1.178451 1.4610065 7.1783073 1.233644 0.03156779 1.860987 21.526934 0.01840227
#> ASV0022 2.463705 0.7560764 0.9235826 1.026255 0.11048727 1.004484 1.377070 1.25748870
#> L3_7 L3_8
#> 1 19.3136486 43.26771126
#> ASV0002 31.9118544 8.33266372
#> ASV0007 6.3827193 22.18829517
#> ASV0010 2.5067503 13.26101513
#> ASV0009 0.5853149 0.03214142
#> ASV0022 2.2715791 0.16606401
#Save to your local machine
write.table(sample_table_genus_final, file="sample_table_genus.txt", sep="\t")
If you want to keep the table with all unclassified taxa in
individual rows, you can always use write.table with
sample_table_genus.
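The same recipe works at phylum level. A minimal sketch following the genus code above (the column selections are our assumption for the glommed taxonomy):
### GET TABLE BY SAMPLE AT PHYLUM LEVEL (sketch) ###
loc_phyloseq_phylum <- tax_glom(loc_phyloseq, taxrank = "Phylum")
sample_relabun_phylum <- transform_sample_counts(loc_phyloseq_phylum, function(x){x/sum(x)}*100)
OTU_sample_phylum <- as.data.frame(otu_table(sample_relabun_phylum))
taxonomy_sample_phylum <- as.data.frame(tax_table(sample_relabun_phylum)[,1:2]) #keep Kingdom and Phylum only
sample_table_phylum <- cbind(taxonomy_sample_phylum, OTU_sample_phylum)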
With the tables produced above, we have information about relative
abundance in each sample. However, it is very useful to group these
tables by the variable of interest, that is, to know the mean
abundances of members of a taxonomic rank in each location, with standard
deviations. First, we will save genus abundance and taxonomy from
sample_table_genus_final in two variables. Then, mean and
SD will be computed according to the grouping variable of interest. For
your own data, it is crucial to substitute metadata.micro
by your own metadata variable (e.g., mt) and
location by your metadata variable of interest.
### GROUP GENUS TABLE BY LOCATION
#Save genus abundances
otu_genus <- sample_table_genus_final[,7:ncol(sample_table_genus_final)] %>% t() %>% as.data.frame()
#Save taxonomy data
tax_genus <- sample_table_genus_final[,1:6]
#Calculate OTU mean abundance based on grouping factor (e.g., location)
location_mean_genus <- aggregate(otu_genus, by=list(metadata.micro$location), FUN=mean)%>% column_to_rownames("Group.1") %>% t() #Change metadata.micro and location to your metadata and variable of interest
#Calculate OTU SD based on grouping factor (e.g., location) and change colnames
location_SD_genus <- aggregate(otu_genus, by=list(metadata.micro$location), FUN=sd)%>% column_to_rownames("Group.1") %>% t() %>% #Change metadata.micro and location to your metadata and variable of interest
as.data.frame() %>% rename_with(.fn= ~paste0(colnames(location_mean_genus), "SD"))
#Merge mean abundance, SD and taxonomy.
genus_location_table <- merge(tax_genus, location_mean_genus, by=0) %>%column_to_rownames("Row.names") %>%
merge(location_SD_genus, by=0) %>% column_to_rownames("Row.names")
#Check abundances sum 100
colSums(genus_location_table[,7:ncol(genus_location_table)])
#> location1 location2 location3 location1SD location2SD location3SD
#> 100.00000 100.00000 100.00000 51.40124 59.14292 66.73249
#Sort table from most to least abundant genera (calculated as sum of abundance between samples)
genus_location_table <- genus_location_table[order(rowSums(genus_location_table[,7:ncol(genus_location_table)]), decreasing=TRUE),]
#View table
head(genus_location_table)
#> Kingdom Phylum Class Order Family
#> 1 unclassified unclassified unclassified unclassified unclassified
#> ASV0002 Bacteria Actinobacteria Actinobacteria Streptomycetales Streptomycetaceae
#> ASV0007 Bacteria Actinobacteria Actinobacteria Pseudonocardiales Pseudonocardiaceae
#> ASV0010 Bacteria Actinobacteria Actinobacteria Pseudonocardiales Pseudonocardiaceae
#> ASV0009 Bacteria Actinobacteria Actinobacteria Micrococcales Promicromonosporaceae
#> ASV0023 Bacteria Actinobacteria Actinobacteria Micromonosporales Micromonosporaceae
#> Genus location1 location2 location3 location1SD location2SD location3SD
#> 1 unclassified 43.736036 40.6679647 26.966427 8.990417 11.046165 8.803745
#> ASV0002 Streptomyces 17.979426 13.4094871 17.322719 8.244194 5.536809 9.197919
#> ASV0007 Pseudonocardia 6.594303 6.4673835 17.036944 1.419046 2.447691 7.291772
#> ASV0010 Amycolatopsis 6.086760 9.2153128 5.991631 1.837111 2.974914 3.642673
#> ASV0009 Promicromonospora 1.440138 2.6425417 4.058412 1.461164 3.072715 7.449846
#> ASV0023 Micromonospora 1.409967 0.8506067 2.076555 2.588402 1.083355 2.368180
#Save table on your local machine
write.table(genus_location_table, file="genus_abund_location.txt", sep="\t")
If you want to do this at other taxonomic levels, you should simply
change sample_table_genus_final to the grouped table, e.g.,
for ASV level it would be sample_relabun_ASV_table.
Remember to change metadata.micro to your metadata file and
location by your variable of interest:
#Save OTU data
otu_location_ASV <- sample_relabun_ASV_table[,8:ncol(sample_relabun_ASV_table)] %>% t() %>% as.data.frame()
#Save taxonomy data
tax_location_ASV <- sample_relabun_ASV_table[,1:7] #taxonomy and ASV names only; abundances start at column 8
#Calculate OTU mean abundance based on grouping factor (e.g., PLOT)
location_ASV_mean <- aggregate(otu_location_ASV, by=list(metadata.micro$location), FUN=mean)%>% column_to_rownames("Group.1") %>% t() #Change metadata.micro and location to your metadata and variable of interest
#Calculate OTU SD based on grouping factor (e.g., PLOT) and change colnames
location_ASV_SD <- aggregate(otu_location_ASV, by=list(metadata.micro$location), FUN=sd)%>% column_to_rownames("Group.1") %>% t() %>% #Change metadata.micro and location to your metadata and variable of interest
as.data.frame() %>% rename_with(.fn= ~paste0(colnames(location_ASV_mean), "SD"))
#Merge mean abundance, SD and taxonomy.
ASV_location_table <- merge(tax_location_ASV, location_ASV_mean, by=0) %>%column_to_rownames("Row.names") %>%
merge(location_ASV_SD, by=0) %>% column_to_rownames("Row.names")
#Check abundances sum 100
colSums(ASV_location_table[,8:ncol(ASV_location_table)])
#> location1 location2 location3 location1SD location2SD location3SD
#> 100.00000 100.00000 100.00000 111.04717 115.16294 98.63739
#Save to local machine
write.table(ASV_location_table, file="ASV_abund_location.txt", sep="\t")
Taxonomic profile graphics
This section will cover the graphic representation of taxonomic
profiles at genus level. This kind of graphic can be interesting to
give a visual idea of the main taxonomic groups present in our
communities. It will be exemplified at genus level, but it can be easily
switched to phylum and specific changes will be explained after comments
# (check also the previous section, as it explains how to
produce the table).
First, we prepare a table for plotting. We create a column with the
mean abundance of each taxon across conditions. With this column, we can
sort the table and filter the number of taxa to plot. In our case, we
usually represent the 20 most abundant genera. For your own data, check
the number of columns first, as it could differ depending on the number
of samples or experimental settings (inspect your table on your local
machine or in RStudio with View(genus_location_table)).
###TAXONOMICAL PROFILE GENUS LEVEL###
#Format table.
#Add a column with the mean abundance between conditions and arrange it in decreasing order.
sample_table_genus_sort <- genus_location_table %>% mutate(mean=rowMeans(genus_location_table[,7:ncol(genus_location_table)])) %>% arrange(desc(mean))
#We have to use pivot_longer to get a table in the right format for ggplot
#We filter out unclassified
taxonomic_genus_location <- sample_table_genus_sort[1:21,] %>% filter(Genus!="unclassified") %>% dplyr::select(6:9) %>% pivot_longer(!Genus, names_to="Samples", values_to="Abundance") ##we take the top 21 rows and drop "unclassified" (the most abundant), leaving the 20 most abundant genera
The data is ready, but we still need to arrange colors and legend labels. We need to produce an automated expression for ggplot in order to show genera and phyla names in italics. We will use the package gdata. Let's install it and load the library:
#Install gdata
install.packages("gdata")#Load library
library(gdata);packageVersion("gdata")
#> [1] '2.18.0.1'
Now, we will produce the legend labels for genera in italics. First, we get the location names and the genera names. With sapply, we will build, for every genus, a named entry of the form genus = italic("genus name"). If plotting at phylum level, make the proper changes (pay attention to the # comments).
#Get labels for location
location_label <- levels(as.factor(unique(taxonomic_genus_location$Samples)))
#Get unique names for genera
unique_genera <- unique(as.vector(taxonomic_genus_location$Genus)) #Change Genus to Phylum when needed
#REORDER FACTORS LEVELS IN DATA FRAME
taxonomic_genus_location$Genus=reorder.factor(taxonomic_genus_location$Genus,new.order=rev(unique_genera))#Change Genus to Phylum when needed
sorted_labels_genus<- as.data.frame(unique_genera)
##CREATE AN EXPRESSION FOR GGPLOT WITH ITALICS WHEN NEEDED.
sorted_labels_ggplot <- sapply(sorted_labels_genus$unique_genera,
function(x) parse(text = paste0("italic('",as.character(x), "')")))
When having to color a great number of objects, the randomcoloR package can be very useful. It allows us to produce a set of random colors, which can be modified later to avoid repeats. You can combine it with the scales package to display the randomly generated colors, as well as their codes.
#Install randomcoloR
install.packages("randomcoloR")
#Install scales
install.packages("scales")#Load libraries
library(randomcoloR);packageVersion("randomcoloR")
#> [1] '1.1.0.1'
library(scales);packageVersion("scales")
#> [1] '1.2.1'
With randomColor you can produce a number of random colors. The count will be set by length(unique_genera), which tells us how many different genera there are.
#Produce random color and visualize them
#Get number of genera
length(unique_genera) #This will be the number of genera to color
#> [1] 20
colors_genus <- randomColor(count=length(unique_genera), hue=c("random"))
show_col(colors_genus)
If you want to change some of these colors, it can be easily done as follows:
#Modify colors
colors_genus[5] <- "#448800"In this tutorial, we will use the next palette, because it has been previously optimized to include very distinctive colors.
#Produce a palette of colours
c24 <- c(
"dodgerblue2", "#E31A1C", # red
"green4",
"#6A3D9A", # purple
"pink", # orange
"black", "gold1",
"skyblue2", "#FB9A99", # lt pink
"#CAB2D6", # lt purple
"#FDBF6F", # lt orange
"gray70", "khaki2",
"maroon", "steelblue4",
"darkturquoise", "green1", "yellow4", "yellow3",
"darkorange4", "blue","orange", "darkgrey"
)
Main taxa plotting
At last, let’s plot taxonomic profile at genus level! Comments
(#) include information to change the code for phylum
profile.
#### BACTERIAL GENERA TAXONOMIC PROFILE ####
ggplot_title <- "Bacterial genera by location"
ggGenera=ggplot(taxonomic_genus_location, aes(x=Samples, y=Abundance, fill=Genus, order=Genus)) + #Change genus to phylum
geom_bar(stat="identity", position="stack")+
scale_fill_manual(values=c24, #Change c24 to colors_genus if you are using random colors
breaks=unique_genera, #Include genera names
labels=sorted_labels_ggplot)+ #Include italic tags
labs(y="Mean relative abundance (%)", x=NULL, fill="Genus",title=ggplot_title)+
guides(fill = guide_legend(reverse = TRUE))+
scale_x_discrete(limits=location_label,
labels=location_label)+
scale_y_continuous(expand=c(0.01,0.01),
breaks=c(0,10,20,30,40,50,60,70),
labels=c("0","10", "20","30","40","50","60","70"),
limits = c(NA, 70))+
theme_bw()+
theme(panel.border = element_rect(colour="black"), axis.title.x=element_blank(),
plot.title = element_text(face="bold", hjust = 0.5, size=16), axis.text = element_text(size = 14),
axis.text.x = element_text(face="bold", size=10, angle = 45, vjust=1, hjust = 1), axis.title.y = element_text(size = 16),
legend.key.size = unit(0.9, "cm"), legend.text = element_text(size = 8),
legend.title = element_text(size=7, face="bold"), legend.title.align=0.5)
#Visualize plot
ggGenera
#Save it
ggsave("genera_profile.tiff", plot = ggGenera, width = 35, height = 30, units = "cm", dpi = 800)
ANCOM-BC analysis
There are several methods to perform differential abundance analysis on microbial community data. Among them, we chose ANCOMBC because it is specifically designed for this kind of data and accounts for its compositional structure. Unfortunately, when there are multiple groups to compare, the ancombc function only compares the first group against all the others. This is a big drawback, as it forces the user to change the input phyloseq object for every other combination of groups (a sketch of this manual workaround is shown below). To solve this problem, micro4all implements the function ancomloop. Apart from looping over groups, it also returns a table with ANCOM-corrected logarithmic abundances. First, we will install the microbiome package, as it is required by the ancombc function called from micro4all. The ANCOMBC package itself is already installed with micro4all. Although ANCOMBC has recently been updated to ANCOMBC-II, we continue to use ‘ancombc’ instead of ‘ancombc2’, as some of its functionality is still being revised. Consequently, a warning message stating that “ancombc is deprecated” is expected and can be safely ignored. Also, the updated ‘ancombc’ function includes the argument ‘tax_level’, so we no longer need to glom the phyloseq object at a specific taxonomic level. However, it is not possible to run ANCOM at ASV level this way, and the ‘tax.level’ argument must always be specified (using one of the ranks present in the taxonomy table); otherwise, an error will occur.
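For reference, this is roughly what the manual workaround looks like without ancomloop. It is a minimal sketch; argument names may differ slightly between ANCOMBC versions, and the grouping column is assumed to be called location:
#Sketch of the manual alternative: a single ancombc run only compares the first factor level against all others
library(ANCOMBC)
single_run <- ancombc(phyloseq = loc_phyloseq, formula = "location", group = "location", tax_level = "Genus", struc_zero = TRUE)
#To compare location2 against the rest, the grouping factor has to be releveled and ancombc run again
sample_data(loc_phyloseq)$location <- relevel(factor(sample_data(loc_phyloseq)$location), ref = "location2")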
##Install microbiome package
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("microbiome")#Load microbiome library
library(microbiome);packageVersion("microbiome")
#> [1] '1.18.0'
Now we will see how to use the ancomloop function. The input has to be a phyloseq object, in our case loc_phyloseq. Moreover, we have to give the grouping variable (change it to your variable of interest). If we do not want to analyze the unclassified genera, we can set out.unclassified to TRUE. When this argument is set to TRUE, a tax.level should be provided. Finally, we can also pass additional arguments to the original ANCOMBC function with ancom.options. When running this function, the comparisons performed will be printed on screen (for example, “location1”-“location2”-“location3” means that location1 is being compared with location2 and location3). After it is finished, it returns a list with as many elements as there were groups in the grouping variable, in this case, 3. Each element is a data frame containing the taxonomy, relative abundance corrected by ANCOMBC, standard deviations and all ANCOMBC results for every possible comparison. If you want to visualize each element, use View(ANCOM_location_genus[["location1"]]).
#### ANCOMBC COMPARISONS ####
ANCOM_location_genus <- ancomloop(input_object_phyloseq = loc_phyloseq, grouping = "location", ancom.options = list(global=FALSE, struc_zero=TRUE, n_cl=8), out.unclassified = TRUE, tax.level="Genus") #When performing ancombc at genus level, we can filter unclassified genera out with the out.unclassified and tax.level arguments. REMEMBER! Change location1 according to your metadata
#> [1] "location1" "location2" "location3"
#> [1] "location2" "location1" "location3"
#> [1] "location3" "location1" "location2"
#Visualize first lines
head(ANCOM_location_genus[["location1"]])
#> Genus Kingdom Phylum Class Order
#> 1 Achromobacter Bacteria Proteobacteria Betaproteobacteria Burkholderiales
#> 2 Acidaminococcus Bacteria Firmicutes Negativicutes Acidaminococcales
#> 3 Actinocorallia Bacteria Actinobacteria Actinobacteria Streptosporangiales
#> 4 Actinomadura Bacteria Actinobacteria Actinobacteria Streptosporangiales
#> 5 Actinophytocola Bacteria Actinobacteria Actinobacteria Pseudonocardiales
#> 6 Actinoplanes Bacteria Actinobacteria Actinobacteria Micromonosporales
#> Family ASV_names Intercept_lfc Intercept_se Intercept_W Intercept_pval
#> 1 Alcaligenaceae <NA> 2.8600907 0.2510751 11.3913736 4.616539e-30
#> 2 Acidaminococcaceae <NA> -0.1595666 0.3490052 -0.4572041 6.475244e-01
#> 3 Thermomonosporaceae <NA> -0.9505587 0.2369933 -4.0109101 6.048515e-05
#> 4 Thermomonosporaceae <NA> 4.0296225 0.5369732 7.5043276 6.174469e-14
#> 5 Pseudonocardiaceae <NA> 1.6747693 0.7152141 2.3416334 1.919956e-02
#> 6 Micromonosporaceae <NA> 3.4144532 0.4015234 8.5037460 1.835689e-17
#> Intercept_qval Intercept_diff location1vslocation2_lfc location1vslocation2_se
#> 1 7.109471e-28 TRUE -2.5689349 0.7038106
#> 2 1.000000e+00 FALSE 1.4555339 0.5739782
#> 3 6.955792e-03 TRUE 0.0675060 0.3812191
#> 4 8.582512e-12 TRUE 0.3953266 0.6812116
#> 5 1.000000e+00 FALSE 0.6683542 1.0329619
#> 6 2.588322e-15 TRUE 0.2784884 0.5147770
#> location1vslocation2_W location1vslocation2_p_val location1vslocation2_q_val
#> 1 -3.6500374 0.0002622022 0.04431216
#> 2 2.5358696 0.0112168507 1.00000000
#> 3 0.1770793 0.8594461208 1.00000000
#> 4 0.5803286 0.5616930167 1.00000000
#> 5 0.6470270 0.5176144794 1.00000000
#> 6 0.5409885 0.5885155182 1.00000000
#> location1vslocation2_diff_abn location1vslocation3_lfc location1vslocation3_se
#> 1 TRUE -4.1372748 0.0000000
#> 2 FALSE 0.2987000 0.6613357
#> 3 FALSE 2.0390103 0.4893351
#> 4 FALSE 0.1400161 0.7890357
#> 5 FALSE 1.3378918 0.8607808
#> 6 FALSE 1.3220703 0.4769997
#> location1vslocation3_W location1vslocation3_p_val location1vslocation3_q_val
#> 1 -14.1889557 0.000000000 0.000000000
#> 2 0.4516616 0.651512781 1.000000000
#> 3 4.1668999 0.000030877 0.005063829
#> 4 0.1774521 0.859153279 1.000000000
#> 5 1.5542770 0.120118350 1.000000000
#> 6 2.7716375 0.005577511 0.792006625
#> location1vslocation3_diff_abn
#> 1 TRUE
#> 2 FALSE
#> 3 TRUE
#> 4 FALSE
#> 5 FALSE
#> 6 FALSE
#Save any of the table on your local machine
write.table(ANCOM_location_genus[["location1"]], file="ancom_location1_genus.txt", sep="\t")
Once saved to your local machine, the ANCOM table can be manually filtered to keep only significant results. However, this can be done in R as well. You will see that we select a specific range of columns (1:19) in order to keep only the location1 vs. location2 comparison.
###Filter ANCOM Results
#Get results for location1
ANCOM_location1_genus <- ANCOM_location_genus[["location1"]] #Change location1 to your variable of interest
#Get only significant results
ANCOM_loc1vsloc2_genus_sig <- ANCOM_location1_genus[,1:19] %>% filter(location1vslocation2_diff_abn == TRUE) #Change [,1:19] and the column name according to the comparison of interest
#Write results on your local machine
write.table(ANCOM_loc1vsloc2_genus_sig, file="ANCOM_loc1loc2_sig.txt", sep="\t")
ANCOM-BC graphic
Graphical representation of differential abundance results can be achieved with two approaches: log fold change with corrected abundances, or relative abundance without correction. Here, we will show how to produce both graphics. For publication purposes, we prefer to combine both graphics into a single figure.
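For that combined figure, a minimal sketch with the patchwork package (an assumption on our side, as it is not used elsewhere in this tutorial; cowplot would work equally well) can stitch the two ggplot objects together once waterfall_location1 and pyramidalggplot have been created further down in this section:
#Sketch: combine the waterfall and pyramidal plots into a single figure (run after both plots exist)
#install.packages("patchwork")
library(patchwork)
combined_DA <- waterfall_location1 + pyramidalggplot + plot_annotation(tag_levels = "A") #Tag panels as A and B
ggsave("combined_DA_plots.tiff", plot = combined_DA, width = 35, height = 30, units = "cm", dpi = 800)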
We will start with the log fold change, represented in a waterfall plot. As we ran ANCOM at genus level, we will bind the phylum name to the genus name for easier identification. This is most useful when representing differential abundances at ASV level. The graphic will be exemplified with the location1 vs. location2 comparison.
#Bind phylum name to genus name
#Create a new variable for graphics
graphic_loc1vsloc2_genus<-ANCOM_loc1vsloc2_genus_sig
#Loop to add a new column called Classification and paste phylum and genus names
for (i in 1:nrow(graphic_loc1vsloc2_genus)){
graphic_loc1vsloc2_genus$Classification[i]<- paste(graphic_loc1vsloc2_genus$Phylum[i],graphic_loc1vsloc2_genus$Genus[i],sep="|")
}
When keeping unclassified genera or performing DA analysis at ASV level, it is useful to show the classification at higher levels (family, class, order or phylum). The next loop will paste names according to the deepest classified rank:
#Get the last classified taxonomy level
for (i in 1:nrow(graphic_loc1vsloc2_genus)){ #Loop over rows
if (isTRUE(graphic_loc1vsloc2_genus$Genus[i]=="unclassified")& isTRUE(graphic_loc1vsloc2_genus$Family[i]!="unclassified")){
graphic_loc1vsloc2_genus$Classification[i] <- paste0("Family", " ", graphic_loc1vsloc2_genus$Family[i],
"|", graphic_loc1vsloc2_genus$Genus[i])}
else if (isTRUE(graphic_loc1vsloc2_genus$Family[i]=="unclassified") &isTRUE(graphic_loc1vsloc2_genus$Order[i]!="unclassified")){
graphic_loc1vsloc2_genus$Classification[i] <- paste0("Order", " ", graphic_loc1vsloc2_genus$Order[i],
"|", graphic_loc1vsloc2_genus$Genus[i])
}
else if (isTRUE(graphic_loc1vsloc2_genus$Order[i]=="unclassified") &isTRUE(graphic_loc1vsloc2_genus$Class[i]!="unclassified")){
graphic_loc1vsloc2_genus$Classification[i] <- paste0("Class", " ", graphic_loc1vsloc2_genus$Class[i],
"|", graphic_loc1vsloc2_genus$Genus[i])
}
else if (isTRUE(graphic_loc1vsloc2_genus$Class[i]=="unclassified") &isTRUE(graphic_loc1vsloc2_genus$Phylum[i]!="unclassified")){
graphic_loc1vsloc2_genus$Classification[i] <- paste0("Phylum", " ", graphic_loc1vsloc2_genus$Phylum[i],
"|", graphic_loc1vsloc2_genus$Genus[i])
}
else if (isTRUE(graphic_loc1vsloc2_genus$Phylum[i]=="unclassified")){
graphic_loc1vsloc2_genus$Classification[i] <- paste0("Phylum", " ", "unclassified")
}
else if (isTRUE(graphic_loc1vsloc2_genus$Genus[i]!="unclassified")) {
graphic_loc1vsloc2_genus$Classification[i] <- paste0("Phylum", " ", graphic_loc1vsloc2_genus$Phylum[i],
"|", graphic_loc1vsloc2_genus$Genus[i])
}}
Next, we can filter out unclassified phyla:
#Filter out unclassified phyla
graphic_loc1vsloc2_genus <- graphic_loc1vsloc2_genus %>% filter(Phylum!="unclassified")
If using ANCOM at ASV level, the code would be as follows:
#Get the last classified taxonomy level for ANCOM at ASV level
for (i in 1:nrow(graphic_loc1vsloc2_ASV)){
if (isTRUE(graphic_loc1vsloc2_ASV$Genus[i]=="unclassified")& isTRUE(graphic_loc1vsloc2_ASV$Family[i]!="unclassified")){
graphic_loc1vsloc2_ASV$Classification[i] <- paste0("Family", " ", graphic_loc1vsloc2_ASV$Family[i],
"|", graphic_loc1vsloc2_ASV$Genus[i], "|", graphic_loc1vsloc2_ASV$ASV_names[i])}
else if (isTRUE(graphic_loc1vsloc2_ASV$Family[i]=="unclassified") &isTRUE(graphic_loc1vsloc2_ASV$Order[i]!="unclassified")){
graphic_loc1vsloc2_ASV$Classification[i] <- paste0("Order", " ", graphic_loc1vsloc2_ASV$Order[i],
"|", graphic_loc1vsloc2_ASV$Genus[i],"|", graphic_loc1vsloc2_ASV$ASV_names[i])
}
else if (isTRUE(graphic_loc1vsloc2_ASV$Order[i]=="unclassified") &isTRUE(graphic_loc1vsloc2_ASV$Class[i]!="unclassified")){
graphic_loc1vsloc2_ASV$Classification[i] <- paste0("Class", " ", graphic_loc1vsloc2_ASV$Class[i],
"|", graphic_loc1vsloc2_ASV$Genus[i],"|", graphic_loc1vsloc2_ASV$ASV_names[i])
}
else if (isTRUE(graphic_loc1vsloc2_ASV$Class[i]=="unclassified") &isTRUE(graphic_loc1vsloc2_ASV$Phylum[i]!="unclassified")){
graphic_loc1vsloc2_ASV$Classification[i] <- paste0("Phylum", " ", graphic_loc1vsloc2_ASV$Phylum[i],
"|", graphic_loc1vsloc2_ASV$Genus[i],"|", graphic_loc1vsloc2_ASV$ASV_names[i])
}
else if (isTRUE(graphic_loc1vsloc2_ASV$Phylum[i]=="unclassified")){
graphic_loc1vsloc2_ASV$Classification[i] <- paste0("Phylum", " ", "unclassified")
}
else if (isTRUE(graphic_loc1vsloc2_ASV$Genus[i]!="unclassified")) {
graphic_loc1vsloc2_ASV$Classification[i] <- paste0("Phylum", " ", graphic_loc1vsloc2_ASV$Phylum[i],
"|", graphic_loc1vsloc2_ASV$Genus[i],"|", graphic_loc1vsloc2_ASV$ASV_names[i])
}}
Now we have a graphic_loc1vsloc2_genus variable with the ANCOM results and a new column called Classification. Next, we create a data frame including only the data we want to represent: the Classification column, the log fold change (the ‘_lfc’ columns in the ANCOM results) and the standard error (the ‘_se’ columns; named SD in the code below). In this data frame, genera whose log fold change is 0 or below a chosen threshold will be filtered out to reduce the size of the graphic.
##Create the data frame for representation, with Classification, Log fold change and standard deviation
logfold_df <- graphic_loc1vsloc2_genus %>% dplyr::select(Classification, location1vslocation2_lfc, location1vslocation2_se)
colnames(logfold_df)<- c("Genus", "LogFold_location1vslocation2", "SD")
##Filter out 0 log fold change and assign name according to the direction of change
logfold_df <- logfold_df %>%
filter(LogFold_location1vslocation2 != 0) %>% arrange(desc(LogFold_location1vslocation2)) %>%
mutate(group = ifelse(LogFold_location1vslocation2 > 0, "Location2", "Location1")) #Change location to your variable of interest from your mt object
logfold_df$Genus = factor(logfold_df$Genus, levels = logfold_df$Genus)
Let’s write the table to your local machine:
write.table(file="waterfall_table.txt", logfold_df, sep="\t")
Here we have the filtering step. First, the range function is used to see the minimum and maximum log fold change. Then, we filter the data frame. Remember to change the threshold to your desired value; in this tutorial we will keep genera with an absolute log fold change of at least 1.5.
#FILTER WHEN NEEDED. This can be used to filter the results according to log fold change level
range(logfold_df$LogFold_location1vslocation2)
#> [1] -2.880552 3.241747
logfold_df_filtered <- rbind(logfold_df[which(logfold_df$LogFold_location1vslocation2>=1.5),], logfold_df[which(logfold_df$LogFold_location1vslocation2<=-1.5),])
Now that we have prepared all the data, we can use ggplot for plotting the log fold change.
##Waterfall plot
waterfall_location1 <- ggplot(data = logfold_df_filtered,#If not filtering, use logfold_df in plotting
aes(x = Genus, y = LogFold_location1vslocation2, fill = group, color = group)) + #Change Genus to taxonomic rank of interest and logFold_location1vslocation2 to your object
geom_bar(stat = "identity", width = 0.7,
position = position_dodge(width = 0.4)) +
geom_errorbar(aes(ymin = LogFold_location1vslocation2 - SD, ymax = LogFold_location1vslocation2 + SD), width = 0.2, #Change LogFold_location1vslocation2 to your object
position = position_dodge(0.05), color = "black") +
labs(x = NULL, y = "Log fold change",
title = "Waterfall Plot for the Location Effect") + #change title
theme_bw() +
theme(legend.position = "right",
legend.title =element_blank(),
plot.title = element_text(hjust = 0.5),
panel.grid.minor.y = element_blank(),
axis.text.x = element_text(angle = 60, hjust = 1), plot.margin=margin(10,10,10,100))
#Visualize plot
waterfall_location1
#Save it
ggsave("waterfall_location1.tiff", plot = waterfall_location1, width = 17, height = 30, units = "cm", dpi = 800)
As mentioned at the beginning of this section, we will also produce
a graphic with relative abundance, specifically, a back-to-back plot or
pyramidal plot. Again, we will show how to do this plot with location1
vs. location2 comparison. However, we need to start with a relative
abundance table, in our case, genus_location_table
(relative abundance at genus level and grouped by location). First, we
will filter out those genera that were not statistically different
according to ANCOM and sort the table as
logfold_df_filtered (waterfall plot table). Then, we will
select columns with classification and abundance and use
pivot_longer to give it a format suitable for
ggplot2.
#Again, bind phylum name and genus name
pyramidal_df<- genus_location_table
for (i in 1:nrow(pyramidal_df)){
pyramidal_df$Classification[i]<- paste(pyramidal_df$Phylum[i],pyramidal_df$Genus[i],sep="|")
}
#Filter to include only significant genera according to ANCOM
pyramidal_loc1vsloc2_filt <- pyramidal_df[which(pyramidal_df$Classification %in%logfold_df_filtered$Genus ),]
#Sort it in the same way as waterfall plot
pyramidal_loc1vsloc2_filt <- pyramidal_loc1vsloc2_filt[match(logfold_df_filtered$Genus,pyramidal_loc1vsloc2_filt$Classification),]
#Now, transform it to ggplot format
pyramidal_loc1vsloc2 <- pyramidal_loc1vsloc2_filt %>% dplyr::select(c(13, 7:8)) %>% #select columns with Classification and abundance
pivot_longer(!Classification, names_to="Samples", values_to="Abundance")
We can write this table to our local machine:
write.table(file="pyramidal_table.txt", pyramidal_loc1vsloc2, sep="\t")Finally, the data frame (pyramidal_loc1vsloc2_filt) has
to be split into two data frames for each location and then create a
vector variable with factors (this is needed for keeping the order in
ggplot). This is essential for the back-to-back plot.
#Split table according to location
loc1pyrm <- pyramidal_loc1vsloc2[which(pyramidal_loc1vsloc2$Samples=="location1"),]#Change location1 to variable of interest
loc2pyrm <- pyramidal_loc1vsloc2[which(pyramidal_loc1vsloc2$Samples=="location2"),]#Change location2 to variable of interest
#Create vector with factors
loc1pyrm$Classification <- factor(loc1pyrm$Classification, levels=loc1pyrm$Classification)
loc2pyrm$Classification <- factor(loc2pyrm$Classification, levels=loc2pyrm$Classification)
Before plotting, it is important to know the maximum and minimum values of abundances to set the limits of the graphic:
#Get limits for plotting
min(pyramidal_loc1vsloc2$Abundance)
#> [1] 0
max(pyramidal_loc1vsloc2$Abundance)
#> [1] 3.237008
As we can see, we should set the limits to a maximum of 4. Check the comments below (#):
#### PYRAMIDAL PLOT ####
pyramidalggplot=ggplot(data = pyramidal_loc1vsloc2, mapping= aes(x = Classification, y=Abundance, fill = Samples), colour="white")+
geom_bar(loc1pyrm, stat="identity", mapping=aes(y=-Abundance))+ #in data, give the first split dataframe
geom_bar(data=loc2pyrm, stat="identity")+ #in data, give the second split dataframe
theme_bw()+
scale_y_continuous(expand=c(0,0), labels=abs, limits=c(-4,4), breaks=seq(-4,4,1))+ #Change limits here
labs(y="Relative abundance (%)")+
theme(legend.title=element_blank(), axis.title.y=element_blank(), axis.text.y= element_text(size = 8, face="bold"),
axis.text.x = element_text(size=14, face="bold"), axis.title.x = element_text(size=18, face="bold"), legend.key.size = unit(1.1, "cm"),legend.text = element_text(size = 16), panel.grid.minor.y = element_blank(), panel.grid.major.y = element_blank(),panel.grid.minor.x = element_blank())+
coord_flip()
#Visualize plot
pyramidalggplot
#Save it
ggsave("pyramidal_loc1_loc2.tiff", plot = pyramidalggplot, width = 17, height = 30, units = "cm", dpi = 800)
Another way to represent the previous plot is ordered by abundance. We will generate the needed tables from the original genus table in reversed order, skipping the step that sorts the pyramidal data frame like the waterfall plot.
#Get genus location table in reverse order
genus_location_table_rev <- genus_location_table[nrow(genus_location_table):1,]
#Again, bind phylum name and genus name
pyramidal_df_abun<- genus_location_table_rev
for (i in 1:nrow(pyramidal_df_abun)){
pyramidal_df_abun$Classification[i]<- paste(pyramidal_df_abun$Phylum[i],pyramidal_df_abun$Genus[i],sep="|")
}
#Filter to include only significant genera according to ANCOM
pyramidal_loc1vsloc2_filt_abund <- pyramidal_df_abun[which(pyramidal_df_abun$Classification %in%logfold_df_filtered$Genus ),]
#Now, transform it to ggplot format
pyramidal_loc1vsloc2_abund <- pyramidal_loc1vsloc2_filt_abund %>% dplyr::select(c(13, 7:8)) %>% #select columns with Classification and abundance
pivot_longer(!Classification, names_to="Samples", values_to="Abundance")
#Split table according to location
loc1pyrm_abund <- pyramidal_loc1vsloc2_abund[which(pyramidal_loc1vsloc2_abund$Samples=="location1"),]#Change location1 to variable of interest
loc2pyrm_abund <- pyramidal_loc1vsloc2_abund[which(pyramidal_loc1vsloc2_abund$Samples=="location2"),]#Change location2 to variable of interest
#Create vector with factors
loc1pyrm_abund$Classification <- factor(loc1pyrm_abund$Classification, levels=loc1pyrm_abund$Classification)
loc2pyrm_abund$Classification <- factor(loc2pyrm_abund$Classification, levels=loc2pyrm_abund$Classification)
Let’s see how it works:
#### PYRAMIDAL PLOT ####
pyramidalggplot_abund=ggplot(data = pyramidal_loc1vsloc2_abund, mapping= aes(x = Classification, y=Abundance, fill = Samples), colour="white")+
geom_bar(data=loc1pyrm_abund, stat="identity", mapping=aes(y=-Abundance))+ #in data, give the first split dataframe
geom_bar(data=loc2pyrm_abund, stat="identity")+ #in data, give the second split dataframe
theme_bw()+
scale_y_continuous(expand=c(0,0), labels=abs, limits=c(-4,4), breaks=seq(-4,4,1))+ #Change limits here
labs(y="Relative abundance (%)")+
theme(legend.title=element_blank(), axis.title.y=element_blank(), axis.text.y= element_text(size = 8, face="bold"),
axis.text.x = element_text(size=14, face="bold"), axis.title.x = element_text(size=18, face="bold"), legend.key.size = unit(1.1, "cm"),legend.text = element_text(size = 16), panel.grid.minor.y = element_blank(), panel.grid.major.y = element_blank(),panel.grid.minor.x = element_blank())+
coord_flip()
#Visualize plot
pyramidalggplot_abund
#Save it
ggsave("pyramidal_loc1_loc2_abund.tiff", plot = pyramidalggplot_abund, width = 17, height = 30, units = "cm", dpi = 800)
Constrained analysis of proximities (CAP)
Microbial communities are greatly influenced by environmental factors. Thus, it is interesting to analyze how these variables relate to community structure. This can be achieved by studying different soil physicochemical parameters and fitting them to an ordination plot. The package vegan includes several functions and a very useful tutorial.
Among the options offered by vegan, we have arranged them in a way that fits the logic of our daily studies. We will exemplify this with some data from pine trees (which can be downloaded from the link at the beginning of the tutorial).
We will read this phyloseq file (saved as .RDS), convert the abundances to relative abundances and get the soil physicochemical parameters from metadata.
### PREPARE DATA FOR CAP ####
#Read phyloseq and get relative abundance
cap_phyloseq <- readRDS(file = "cap_phyloseq")
cap_phyloseq_rel <- transform_sample_counts(cap_phyloseq, function(x){x/sum(x)*100})
#Get ASV and metadata
ASV_cap<-as.data.frame(t(otu_table(cap_phyloseq_rel)))
metadata_cap <- as.matrix(sample_data(cap_phyloseq))
metadata_cap <- as.data.frame(metadata_cap)
#metadata_cap <- metadata_cap[1:(ncol(metadata_cap)-1)] #Remove column Season_NucleicAcid
If using your own data, you can use your previously generated phyloseq object. If the physicochemical metadata was not originally in your metadata file, you can always read it into a variable in R with metadata_cap <- read.table("yourmtfile.txt", sep="\t", header=TRUE) and row.names(metadata_cap) <- metadata_cap$samples. In this file, row names should match the sample names and columns should correspond to the physicochemical parameters, as in the sketch below.
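As a minimal sketch (the file name and the samples column are placeholders; adapt them to your data):
#Sketch: read external physicochemical metadata (file and column names are placeholders)
metadata_cap <- read.table("yourmtfile.txt", sep = "\t", header = TRUE)
row.names(metadata_cap) <- metadata_cap$samples
#Sanity check: every row name should match a sample name in the phyloseq object
all(rownames(metadata_cap) %in% sample_names(cap_phyloseq_rel))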
Now, we will generate two models. CAP1 includes all variables in the constrained analysis, that is, it explains the variance associated with all physicochemical variables. CAP0 is the opposite, i.e., a fully unconstrained model (explained variance without the influence of the physicochemical parameters).
These two models will be the input for the ordistep function. This function selects a model containing only the physicochemical variables that significantly explain part of the variance in the community and that are not collinear with each other. It starts from the unconstrained model (CAP0) and, at each step, adds a new variable to the constrained model, followed by a checking step. When the remaining variables are no longer significant, or add no further explained variance to the model, the function stops. The output is a formula with the variables that explain most of the variance in the data. In this way, when many variables are available, the model can be reduced by removing those that are interdependent.
#Compute CAP1 and CAP0 models
CAP1<-capscale(ASV_cap~Water+ln_P+SOM+N+ph_KCl+CN+Na_Exch+Clay+Sand+Slime+Texture, data=metadata_cap,distance = "bray")#Add here, after ASV_cap, all your parameters
CAP0<-capscale(ASV_cap~1, data=metadata_cap, distance = "bray")
#Get formula with ordistep
CAP_ordi<-ordistep(CAP0, scope=formula(CAP1))
#>
#> Start: ASV_cap ~ 1
#>
#> Df AIC F Pr(>F)
#> + ph_KCl 11 28 1.2601 0.010 **
#> + Clay 15 10 1.0955 0.335
#> + Water 15 10 1.0896 0.355
#> + ln_P 8 33 1.0101 0.495
#> + Na_Exch 10 33 0.9478 0.735
#> + SOM 14 23 0.8758 0.905
#> + N 13 28 0.8611 0.910
#> + CN 16 -Inf
#> + Sand 16 -Inf
#> + Slime 16 -Inf
#> + Texture 16 -Inf
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Step: ASV_cap ~ ph_KCl
#>
#> Df AIC F Pr(>F)
#> - ph_KCl 11 28.737 1.2601 0.02 *
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Df AIC F Pr(>F)
#> + Na_Exch 4 8 1.0277 0.465
#> + ln_P 4 10 0.9176 0.540
#> + Water 5 -Inf
#> + SOM 5 -Inf
#> + N 5 -Inf
#> + CN 5 -Inf
#> + Clay 5 -Inf
#> + Sand 5 -Inf
#> + Slime 5 -Inf
#> + Texture 5 -Inf
Thus, we now know which variables influence our microbial community according to ordistep; in this case, pH. With this parameter, we will use the envfit function to fit this environmental factor as a vector onto an ordination plot and get its correlation with each ordination axis. Remember that these correlation values correspond to an unconstrained analysis, i.e., correlation with the ordination plots produced in the β-diversity analysis.
#### ENVFIT ####
ef <- envfit(CAP1~ph_KCl, data = metadata_cap, perm = 999) #After CAP1, introduce the significant parameters
Now, we can adjust the p-values for multiple comparisons and visualize the result:
#Now we can adjust pvalues of envfit
ef.adj <- ef
pvals.adj <- p.adjust (ef$vectors$pvals, method = 'BH')
ef.adj$vectors$pvals <- pvals.adj
ef.adj
#>
#> ***VECTORS
#>
#> $pvals
#> numeric(0)
#>
#>
#> ***FACTORS:
#>
#> Centroids:
#> CAP1 CAP2
#> ph_KCl5.6 -0.5680 -1.0802
#> ph_KCl5.7 -0.6385 -1.0075
#> ph_KCl5.8 -0.4246 -0.1969
#> ph_KCl5.9 -0.5018 0.1034
#> ph_KCl6.0 0.6368 0.3278
#> ph_KCl6.1 0.1524 0.6719
#> ph_KCl6.2 -0.3268 1.0491
#> ph_KCl6.4 -0.0529 1.3443
#> ph_KCl6.5 1.0748 0.1100
#> ph_KCl6.6 0.9464 -0.2098
#> ph_KCl7.0 0.4793 -0.3622
#> ph_KCl7.7 1.6547 -0.9666
#>
#> Goodness of fit:
#> r2 Pr(>r)
#> ph_KCl 0.818 0.078 .
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> Permutation: free
#> Number of permutations: 999
With plot_ordination we can get a graphical representation of the constrained ordination. In this way, we can visualize the direction of change of the environmental variables. We will introduce another variable, although it was not significant, to show how to plot several parameters at once.
#### CAP PLOT ####
#First, get distance matrix
distance_matrix <- phyloseq::distance(physeq = cap_phyloseq_rel, method = "wunifrac")
#Then, generate ordination with CAP (that is, constrained ordination to ph_KCl)
CAP_wunifrac <- ordinate(physeq = cap_phyloseq_rel, method = "CAP",distance = distance_matrix,
formula = ~ph_KCl+Clay)#Change parameters to yours
#Produce plot
CAP_wunifrac_plot <- plot_ordination(physeq = cap_phyloseq_rel, ordination = CAP_wunifrac,
color = "Species", axes = c(1,2)) +
aes(shape = Species) +
geom_point(aes(colour = Species), alpha = 1, size = 3.5) +
scale_shape_manual(values=c("Pinaster"=16,"Other"=15), #Change Pinaster and other according to your variable of interest
breaks=c("Pinaster","Other"))+#Change Pinaster and other according to your variable of interest
scale_color_manual(values = c("Pinaster"="brown", "Other"="orange"), #Change Pinaster and other according to your variable of interest
name="Host genotype",
breaks=c("Pinaster", "Other"),
labels=expression(paste("Confirmed ", italic("P. pinaster")), "Pine forest samples"))+ #This is used to set the legend, change it according to your preference
guides(shape="none")+
ggtitle("CAP on Weighted Unifrac distance")+
theme_bw()+
theme(legend.key=element_blank(),
legend.title.align = 0.85,
legend.title = element_text(face="bold"),
axis.text = element_text(size=14),
axis.title = element_text(size = 16),
plot.title = element_text(hjust=0.5, face="bold"),
legend.text = element_text(size = 16))+
geom_hline(aes(yintercept = c(0.00)), lty=2, colour="grey")+
geom_vline(aes(xintercept = c(0.00)), lty=2, colour="grey")
#Define arrows aesthetic and labels
arrowmat <- vegan::scores(CAP_wunifrac, display = "bp")
arrowdf <- data.frame(labels = rownames(arrowmat), arrowmat)
arrow_map <- aes(xend = CAP1, yend = CAP2,x = 0, y = 0, shape = NULL, color = NULL, label=labels)
label_map <- aes(x = 1.1 * CAP1, y = 1.1 * CAP2, shape = NULL,color = NULL, label=labels)
arrowhead <- arrow(length = unit(0.02, "npc"))
#Introduce them to plot
CAP_wunifrac_plot +
geom_segment(mapping = arrow_map, size = c(ph_KCl=1,Clay=1), data = arrowdf, color =
c(ph_KCl="black",Clay="black"), arrow = arrowhead) + #Change ph_KCl and Clay to your physicochemical variables of interest
geom_text(mapping = label_map, size = 4, data = arrowdf, show.legend = FALSE)
BONUS: statistical analysis without micro4all
In this section, we will show how to perform the statistical analyses of this tutorial without using functions from the micro4all package. This section does not include detailed explanations, as everything was already covered throughout the tutorial. We will start with α-diversity. First of all, let’s load some libraries (which were installed with micro4all):
##Load library
library(car);packageVersion("car")
#> Loading required package: carData
#>
#> Attaching package: 'car'
#> The following object is masked from 'package:DescTools':
#>
#> Recode
#> The following object is masked from 'package:dplyr':
#>
#> recode
#> The following object is masked from 'package:purrr':
#>
#> some
#> [1] '3.1.1'
library(stats);packageVersion("stats")
#> [1] '4.2.2'
library(BiocGenerics); packageVersion("BiocGenerics")
#> [1] '0.42.0'
library(agricolae);packageVersion("agricolae")
#> Registered S3 methods overwritten by 'klaR':
#> method from
#> predict.rda vegan
#> print.rda vegan
#> plot.rda vegan
#>
#> Attaching package: 'agricolae'
#> The following object is masked from 'package:ape':
#>
#> consensus
#> The following object is masked from 'package:seqinr':
#>
#> consensus
#> [1] '1.3.5'
library(pairwiseAdonis);packageVersion("pairwiseAdonis")
#> Loading required package: cluster
#> [1] '0.4.1'
library(dunn.test);packageVersion("dunn.test")
#> [1] '1.3.5'
For a balanced ANOVA, the next lines should be repeated for every diversity index (a loop sketch follows the output below):
#### ALPHA DIVERSITY WITHOUT MICRO4ALL ####
##Balanced anova
res.aov_observed <- aov(Observed ~ location, data = alpha_table)#Change OBSERVED for index of interest and location to mt variable
summary(res.aov_observed)
#> Df Sum Sq Mean Sq F value Pr(>F)
#> location 2 6191 3095 1.126 0.343
#> Residuals 21 57753 2750
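To avoid copy-pasting these lines for every index, a minimal loop sketch can fit one model per index. The index names used here (Observed, Shannon, Simpson) are assumptions; adjust them to the columns actually present in your alpha_table:
#Sketch: fit a balanced ANOVA for every diversity index at once
indices <- c("Observed", "Shannon", "Simpson") #Assumed column names; adjust to your alpha_table
aov_list <- lapply(indices, function(idx) aov(as.formula(paste(idx, "~ location")), data = alpha_table))
names(aov_list) <- indices
lapply(aov_list, summary)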
The same logic can be applied to unbalanced designs:
##Unbalanced anova
res.aov_Observed_u <- Anova(res.aov_observed, type = "III")# Change res.aov_observed for every index (res.aov_shannon, etc...)
res.aov_Observed_u
#> Anova Table (Type III tests)
#>
#> Response: Observed
#> Sum Sq Df F value Pr(>F)
#> (Intercept) 568711 1 206.7924 2.421e-12 ***
#> location 6191 2 1.1255 0.3433
#> Residuals 57753 21
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
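If the aov_list from the previous sketch is available, the type III test can likewise be applied to every index in one call:
#Sketch: type III ANOVA for every fitted model (requires aov_list from the loop sketch above)
lapply(aov_list, Anova, type = "III")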
As a post-hoc test, the Tukey test or the Dunn test can be applied. Again, this should be repeated for every index (see the sketch after the Dunn test output):
##POST-HOC TEST, TUKEYHSD
post_observed <- TukeyHSD(res.aov_observed, which = "location")$location # Change res.aov_observed for every index (res.aov_shannon, etc...) and location to mt variable
which(post_observed[,"p adj"]<0.05) #> named integer(0)
##Post-hoc, dunn Test
dunn.result <- dunn.test(x=alpha_table$Observed, g=alpha_table$location, method = "BH")
#> Kruskal-Wallis rank sum test
#>
#> data: x and group
#> Kruskal-Wallis chi-squared = 1.4, df = 2, p-value = 0.5
#>
#>
#> Comparison of x by group
#> (Benjamini-Hochberg)
#> Col Mean-|
#> Row Mean | location location
#> ---------+----------------------
#> location | -0.901953
#> | 0.2753
#> |
#> location | 0.212224 1.114177
#> | 0.4160 0.3978
#>
#> alpha = 0.05
#> Reject Ho if p <= alpha/2
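Again, assuming the aov_list from the earlier sketch, the Tukey post-hoc test can be looped over all indices:
#Sketch: TukeyHSD for every fitted model (requires aov_list from the loop sketch above)
tukey_list <- lapply(aov_list, TukeyHSD, which = "location")
lapply(tukey_list, function(x) which(x$location[,"p adj"] < 0.05)) #Report significant pairs per index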
Now, normality and homoscedasticity will be checked:
##Levene test and shapiro test
leveneTest(Observed ~ location, data = alpha_table) #Change Observed for the index of interest and location to your mt variable
#> Warning in leveneTest.default(y = y, group = group, ...): group coerced to factor.
#> Levene's Test for Homogeneity of Variance (center = median)
#> Df F value Pr(>F)
#> group 2 0.4798 0.6255
#> 21
#Shapiro
aov_residuals_observed = residuals(object = res.aov_observed) # Change res.aov_observed for every index (res.aov_shannon, etc...)
shapiro.test(x = aov_residuals_observed)
#>
#> Shapiro-Wilk normality test
#>
#> data: aov_residuals_observed
#> W = 0.96527, p-value = 0.5529
For the Kruskal-Wallis test and pairwise comparisons (Wilcoxon test), a similar procedure is followed:
##Kruskal Wallis
k_observed <- kruskal.test(Observed ~ location, data = alpha_table) #Change OBSERVED for index of interest and location to mt variable
k_observed#>
#> Kruskal-Wallis rank sum test
#>
#> data: Observed by location
#> Kruskal-Wallis chi-squared = 1.4, df = 2, p-value = 0.4966
#Wilcoxon test
pairwise.wilcox.test(alpha_table$Observed, alpha_table$location, p.adjust.method = "BH") #Change Observed for the index of interest and location to your mt variable
#> Warning in wilcox.test.default(xi, xj, paired = paired, ...): cannot compute exact p-value with
#> ties
#> Warning in wilcox.test.default(xi, xj, paired = paired, ...): cannot compute exact p-value with
#> ties
#>
#> Pairwise comparisons using Wilcoxon rank sum exact test
#>
#> data: alpha_table$Observed and alpha_table$location
#>
#> location1 location2
#> location2 0.66 -
#> location3 0.92 0.66
#>
#> P value adjustment method: BH
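The non-parametric tests can be looped in the same way (a sketch reusing the assumed indices vector):
#Sketch: Kruskal-Wallis for every diversity index (requires the indices vector from the ANOVA sketch)
lapply(indices, function(idx) kruskal.test(as.formula(paste(idx, "~ location")), data = alpha_table))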
For β-diversity analysis, the PERMANOVA test can be applied to a single distance method as follows.
#### BETA DIVERSITY WITHOUT MICRO4ALL ####
#PERMANOVA with bray curtis distances
df <- data.frame(sample_data(norm_phyloseq))
bray <- phyloseq::distance(norm_phyloseq, method="bray") #Change bray to include another distance method
permanova_all <- adonis2(bray~ location, data=df) #Change location to mt variable of interest
permanova_all
#> Permutation test for adonis under reduced model
#> Terms added sequentially (first to last)
#> Permutation: free
#> Number of permutations: 999
#>
#> adonis2(formula = bray ~ location, data = df)
#> Df SumOfSqs R2 F Pr(>F)
#> location 2 1.4040 0.30049 4.5105 0.001 ***
#> Residual 21 3.2684 0.69951
#> Total 23 4.6725 1.00000
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
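If several distance metrics are of interest, a minimal sketch can loop adonis2 over a vector of methods; the UniFrac options are assumptions that require a phylogenetic tree in the phyloseq object:
#Sketch: PERMANOVA over several distance methods (UniFrac methods need a tree in norm_phyloseq)
distance_methods <- c("bray", "jaccard", "unifrac", "wunifrac")
permanova_list <- lapply(distance_methods, function(m) {
  dist_m <- phyloseq::distance(norm_phyloseq, method = m)
  adonis2(dist_m ~ location, data = df)
})
names(permanova_list) <- distance_methods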
Dispersion will be checked with the betadisper function. If working with another distance, make the corresponding change, as was done for adonis2:
#Betadisper with bray curtis distances
disper_d <- betadisper(bray, df$location) #Change location to mt variable of interest
permutest(disper_d) # NOT SIGNIFICANT
#>
#> Permutation test for homogeneity of multivariate dispersions
#> Permutation: free
#> Number of permutations: 999
#>
#> Response: Distances
#> Df Sum Sq Mean Sq F N.Perm Pr(>F)
#> Groups 2 0.004001 0.0020003 0.317 999 0.706
#> Residuals 21 0.132519 0.0063104
Next, the pairwise.adonis function will be applied to check for differences between groups:
#Pairwise adonis
#df was already created for the PERMANOVA above; OTU holds the ASV table (sketch assumption: taxa are rows in the phyloseq object)
OTU <- as.data.frame(otu_table(norm_phyloseq))
table_vegan <- cbind(t(OTU), df)
ncolum <- ncol(t(OTU))
## PAIRWISE ADONIS ON BRAY CURTIS ##
pw_result <- pairwise.adonis(table_vegan[,1:ncolum], table_vegan$location, p.adjust.m = "BH") #Change location to mt variable of interest
pw_result_sig <- pw_result[which(pw_result$p.adjusted<0.05),]
pw_result_sig #All three comparisons are significant after BH adjustment
#> pairs Df SumsOfSqs F.Model R2 p.value p.adjusted sig
#> 1 location1 vs location2 1 0.3254020 2.020915 0.1261423 0.011 0.0110 .
#> 2 location1 vs location3 1 0.8502918 5.871144 0.2954608 0.001 0.0015 *
#> 3 location2 vs location3 1 0.8446613 5.871946 0.2954892 0.001 0.0015 *