Micro4all workflow
Introduction
The following tutorial shows the workflow we follow in our laboratory for the analysis of amplicon data. It is based on R and is intended to be easy and readable for all users, even those with little knowledge of R. Moreover, it integrates the main utilities of the micro4all package, which was created to gather the R functions we routinely use. We will cover several steps, including:
- Produce an amplicon sequence variant (ASV) table, making use of the dada2 package.
- Study microbial diversity (\(\alpha\) and \(\beta\) diversity).
- Get taxonomic profiles at phylum and genus level.
- Differential abundance analysis with ANCOMBC package.
- Ordination graphics and environmental fitting of soil physicochemical parameters.
To follow this tutorial, we will use an example dataset, which is detailed in the next section. The results you will see were produced with R version 4.1.2 (2021-11-01) on Ubuntu 20.04.3 LTS. It is important to know that we use Ubuntu, specifically for tools such as figaro or cutadapt (we will see this in detail later on).
Dataset
In this tutorial, we will use a set of sequences of our own. Specifically, we collected endosphere samples from 8 olive trees (Picual variety) at three different locations (let’s call them location 1, location 2 and location 3) and sequenced the V3-V4 region of the 16S rRNA gene (with Illumina MiSeq technology and 300 bp PE) in order to study the bacterial community. These paired-end fastq files (already demultiplexed and without barcodes) can be downloaded here. In this link you will also find a phyloseq object that will be used in the environmental fitting section of this tutorial, and a modified RDP database file for classification (explained later).
Install package micro4all
Let’s install the micro4all package. This package is hosted on GitHub, so we will need devtools to install it from this repository. If it’s the first time you install this package, it will take a while to get the full installation done. If not, you don’t need to install it again; just load the libraries before processing your dataset.
##Install devtools
install.packages("devtools")
Now, we are ready to install micro4all…
##Install micro4all with devtools
library(devtools)
install_github("nuriamw/micro4all")
…and load the library
#Load micro4all
library(micro4all)
Produce an amplicon sequence variant (ASV) table
The first part of this tutorial is mainly based on the information provided by dada2 (we highly recommend checking its online tutorial), but with our own implementations. We will start with raw sequences, get to know their quality profiles, clean them and infer Amplicon Sequence Variants (ASVs). Thus, we will end this section with an ASV table, ready for microbial diversity analysis.
Prepare data: load libraries and read sequences files
We will make use of several packages, such as dada2, ShortRead, ggplot2 and tidyverse. The first two can be installed with BiocManager as follows:
##dada2 installation
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("dada2")

##ShortRead installation
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("ShortRead")
When installing tidyverse, ggplot2 is automatically installed as it is part of the tidyverse package collection:
##Install tidyverse
install.packages("tidyverse")
Now, let’s load the libraries
##Load libraries
library(dada2); packageVersion("dada2")
#> [1] '1.24.0'
library(ShortRead);packageVersion("ShortRead")
#> [1] '1.54.0'
library(ggplot2);packageVersion("ggplot2")
#> [1] '3.4.1'
library(tidyverse);packageVersion("tidyverse")
#> [1] '2.0.0'
Great! We are ready to use the dada2 workflow. First of all, we will set the path to the FASTQ files. We will use them unzipped, but it is perfectly OK for dada2 to use them zipped. Just copy and paste your folder path next to path <-, and remember that “~/” is only valid on Ubuntu, where it links to your home folder. Windows users should write something like: "C:/Users/user_name/Desktop/PATH_to_your_Data".
##SET PATH, FILES AND SAMPLE.NAMES
<- "/home/nuria/Escritorio/data_micro4all/" # CHANGE ME to the directory containing the fastq files
path list.files(path) #list files in this folder
#> [1] "ASV_length_nochim.rds" "cutadapt" "figaro"
#> [4] "filtered" "filtN" "L1-1_S102_R1_001.fastq"
#> [7] "L1-1_S102_R2_001.fastq" "L1-2_S112_R1_001.fastq" "L1-2_S112_R2_001.fastq"
#> [10] "L1-3_S121_R1_001.fastq" "L1-3_S121_R2_001.fastq" "L1-4_S130_R1_001.fastq"
#> [13] "L1-4_S130_R2_001.fastq" "L1-5_S138_R1_001.fastq" "L1-5_S138_R2_001.fastq"
#> [16] "L1-6_S145_R1_001.fastq" "L1-6_S145_R2_001.fastq" "L1-7_S152_R1_001.fastq"
#> [19] "L1-7_S152_R2_001.fastq" "L1-8_S159_R1_001.fastq" "L1-8_S159_R2_001.fastq"
#> [22] "L2-1_S139_R1_001.fastq" "L2-1_S139_R2_001.fastq" "L2-2_S146_R1_001.fastq"
#> [25] "L2-2_S146_R2_001.fastq" "L2-3_S153_R1_001.fastq" "L2-3_S153_R2_001.fastq"
#> [28] "L2-4_S160_R1_001.fastq" "L2-4_S160_R2_001.fastq" "L2-5_S104_R1_001.fastq"
#> [31] "L2-5_S104_R2_001.fastq" "L2-6_S114_R1_001.fastq" "L2-6_S114_R2_001.fastq"
#> [34] "L2-7_S123_R1_001.fastq" "L2-7_S123_R2_001.fastq" "L2-8_S132_R1_001.fastq"
#> [37] "L2-8_S132_R2_001.fastq" "L3-1_S136_R1_001.fastq" "L3-1_S136_R2_001.fastq"
#> [40] "L3-2_S143_R1_001.fastq" "L3-2_S143_R2_001.fastq" "L3-3_S150_R1_001.fastq"
#> [43] "L3-3_S150_R2_001.fastq" "L3-4_S157_R1_001.fastq" "L3-4_S157_R2_001.fastq"
#> [46] "L3-5_S101_R1_001.fastq" "L3-5_S101_R2_001.fastq" "L3-6_S111_R1_001.fastq"
#> [49] "L3-6_S111_R2_001.fastq" "L3-7_S120_R1_001.fastq" "L3-7_S120_R2_001.fastq"
#> [52] "L3-8_S129_R1_001.fastq" "L3-8_S129_R2_001.fastq" "MOCK-1_S169_R1_001.fastq"
#> [55] "MOCK-1_S169_R2_001.fastq" "MOCK-2_S176_R1_001.fastq" "MOCK-2_S176_R2_001.fastq"
#> [58] "MOCK-3_S183_R1_001.fastq" "MOCK-3_S183_R2_001.fastq" "nochim"
#> [61] "num_seqs.txt" "rdp_train_set_18_H.fa" "tabla_otu_inicial.rds"
#> [64] "taxa_rdp_16S.rds"
Second, we will store the path names of every fastq file, for forward reads (fnFs) and reverse reads (fnRs), respectively.
## SORT FORWARD AND REVERSE READS SEPARATELY; forward and reverse fastq filenames have format: SAMPLENAME_R1_001.fastq and SAMPLENAME_R2_001.fastq ##
fnFs <- sort(list.files(path, pattern="_R1_001.fastq", full.names = TRUE))
fnRs <- sort(list.files(path, pattern="_R2_001.fastq", full.names = TRUE))
# Extract sample names, assuming filenames have format: SAMPLENAME_XXX.fastq
sample.names <- sapply(strsplit(basename(fnFs), "_"), `[`, 1)
📝 TIP: We get sample names with basename(fnFs), a function that removes all the previous path up to the last path separator (/). Then, strsplit works over the base names and splits each name at every “_“. Afterwards, sapply loops over the resulting list of split names and takes the first part (corresponding to SAMPLENAME, if file names have the format SAMPLENAME_XXX.fastq). For example, our first sample has the following name:
##See name of the first sample (fnFs)
basename(fnFs[1])
#> [1] "L1-1_S102_R1_001.fastq"
Then, strsplit returns…
##Apply strsplit
strsplit(basename(fnFs[1]), "_")
#> [[1]]
#> [1] "L1-1" "S102" "R1" "001.fastq"
…and we take the first part, which corresponds to “L1-1”. If your file names are more complex and you need to remove some previous pattern, you can always use gsub over sample.names. For example, to remove the last “-1” of sample.names[1], we would use gsub(pattern="-1", replacement="", sample.names[1]). Moreover, it is important for later analyses that your sample names are not purely numerical. If that’s the case, you can easily add a letter (for example, an “S”) as follows: paste0("S", sample.names).
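Putting those two tips together, here is a minimal sketch (the names below are hypothetical, just to show the two calls in action):

##Hypothetical examples of sample name cleaning
messy_name <- "L1-1_extra"                          # a made-up messy sample name
gsub(pattern="_extra", replacement="", messy_name)  # returns "L1-1"
numeric_names <- c("1", "2", "3")                   # purely numerical names
paste0("S", numeric_names)                          # returns "S1" "S2" "S3"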
Check sequence quality
A crucial step in any sequencing analysis pipeline is quality checking. For this purpose, we will use several approaches in order to get different perspectives on this aspect. First, we will generate a table with the number of reads in each sample, just to check if anything went wrong during the sequencing process. We will use the ShortRead package. As we have already loaded the library, the ShortRead:: prefix is not strictly necessary, but it is useful to know which package each function comes from.
##COUNT NUMBER OF READS IN EACH SAMPLE BEFORE FILTERING
raw_reads_count <- NULL #Creates an empty object

for (i in 1:length(fnFs)){ #loop over fnFs and count sequences with length() over readFastq()
  raw_reads_count <- rbind(raw_reads_count, length(ShortRead::readFastq(fnFs[i])))}

#Format table to give it sample.names
rownames(raw_reads_count) <- sample.names
colnames(raw_reads_count) <- "Number of reads"
With raw_reads_count we can easily know the minimum and maximum number of sequences as follows:
#Get min and max number of sequences
min(raw_reads_count)
#> [1] 66426
max(raw_reads_count)
#> [1] 125873
As we can see, even the minimum number of sequences is good enough for our analysis. If that’s not the case for you, you can visually check the table with View(raw_reads_count) and write it to your device with write.table(raw_reads_count, file="raw_reads_count.txt", sep="\t").
Another quality-checking approach is to plot the distribution of sequence lengths as a histogram. Let’s see how it works!
##Histogram of sequence lengths
reads <- ShortRead::readFastq(fnFs) #store sequences in an R object

#Get lengths with unique
uniques <- unique(reads@quality@quality@ranges@width) #get lengths by accessing the width attribute within reads

#Count number of sequences for each length
counts <- NULL #Creates a null object for storing counts
for (i in 1:length(uniques)) {
  counts <- rbind(counts, length(which(reads@quality@quality@ranges@width==uniques[i])))
}

#format histogram table
histogram <- cbind(uniques, counts)
colnames(histogram) <- c("Seq.length", "counts")
Again, we can visually check the histogram table with View(histogram) and write it to your device with write.table(histogram, file="histogram.txt", sep="\t"). From now on, we will use head to see the main results, as well as hist for plotting.
#Check histogram matrix
head(histogram[order(histogram[,1],decreasing = TRUE),]) #Most sequences fall in expected sequence length
#> Seq.length counts
#> [1,] 301 2154642
#> [2,] 300 71753
#> [3,] 299 16251
#> [4,] 298 61651
#> [5,] 297 10119
#> [6,] 296 377
# PLOT HISTOGRAM
hist(reads@quality@quality@ranges@width, main="Forward length distribution", xlab="Sequence length", ylab="Raw reads")
Moreover, the plotQualityProfile function of dada2 is very useful for a quick visualization of sequence quality profiles. We can visualize the profiles for both forward (fnFs) and reverse (fnRs) reads. Reverse reads commonly have worse quality profiles because of the sequencing process. In general, we can see that the average quality is good, as it only falls below 20 at the end of the reverse sequences. Again, we strongly suggest checking the dada2 tutorial for more information. We will visualize quality plots for the first two forward and reverse samples (indicated by [1:2]).
## VIEW AND SAVE QUALITY PLOT FOR FW AND RV ##
plotQualityProfile(fnFs[1:2])
plotQualityProfile(fnRs[1:2])
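The comment above mentions saving the plots. Since plotQualityProfile returns a ggplot object, one simple option (a sketch; the file names are our own choice) is to save them with ggsave:

##Save quality profiles to disk (hypothetical file names)
quality_fw <- plotQualityProfile(fnFs[1:2])
ggsave("quality_profile_fw.pdf", quality_fw)
quality_rv <- plotQualityProfile(fnRs[1:2])
ggsave("quality_profile_rv.pdf", quality_rv)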
Figaro: selection of trimming parameters
In this tutorial, we will use the Figaro tool. It can be downloaded and installed following the instructions in the provided link. Don’t forget that in this tutorial we are running it on Ubuntu, so some changes would be necessary for Windows distributions. This tool makes the selection of trimming parameters for dada2 more objective. Sadly, figaro has not yet been updated to deal with variations in sequence length. We will solve this following the author’s recommendation (which you can find here). It consists of performing a pre-trimming step at about 295 bp. This pre-trimming will be used only for figaro, so we will store these sequences in a new folder called ‘figaro’. Pre-trimming will be done with the filterAndTrim function from dada2. Let’s see how it works!
##FIGARO
##Create a new folder called 'figaro' inside our path to store the reads trimmed for figaro
figFs <- file.path(path, "figaro", basename(fnFs))
figRs <- file.path(path, "figaro", basename(fnRs))

##TRIMMING AT 295 bp
out.figaro <- filterAndTrim(fnFs, figFs, fnRs, figRs, #inputFreads, outputFreads, inputRreads, outputRreads
                            compress=TRUE, multithread=TRUE, truncLen=c(295,295))
Once we have the reads stored in the figaro folder and cut at 295 bp, we will run figaro. To call figaro from R on Ubuntu, we use the function system, which gives us access to the Ubuntu command line. We have to call python3, as figaro runs on it. Then, all the appropriate arguments for figaro should be given: the path to the figaro program, -i path to input, -o path to output, -a length of your amplicon without primers, -f forward primer length, -r reverse primer length. As you can see, before using figaro, you will need to know your expected amplicon length without primers (you can calculate it based on the E. coli 16S rRNA gene) as well as both primer lengths.
##RUN FIGARO
<- system(("python3 /home/programs/figaro/figaro/figaro.py -i /home/nuria/Escritorio/data_micro4all/figaro -o /home/nuria/Escritorio/data_micro4all/figaro -a 426 -f 17 -r 21"),
figaro intern=TRUE) #path to figaro program -i path to input - path to output -a length of your amplicon without primers, -f primer forward length, -r primer reverse length
head(figaro)
#> [1] "Forward read length: 295"
#> [2] "Reverse read length: 295"
#> [3] "{\"trimPosition\": [279, 205], \"maxExpectedError\": [3, 2], \"readRetentionPercent\": 81.69, \"score\": 76.68937485688116}"
#> [4] "{\"trimPosition\": [278, 206], \"maxExpectedError\": [3, 2], \"readRetentionPercent\": 81.68, \"score\": 76.6750629722922}"
#> [5] "{\"trimPosition\": [277, 207], \"maxExpectedError\": [3, 2], \"readRetentionPercent\": 81.67, \"score\": 76.66838409281735}"
#> [6] "{\"trimPosition\": [280, 204], \"maxExpectedError\": [3, 2], \"readRetentionPercent\": 81.66, \"score\": 76.6617052133425}"
We will select the first result, which gives the trimming positions and maximum expected errors that produce the highest score according to the Figaro model. These are returned as [forward, reverse].
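Each of those result lines is JSON, so instead of copying the numbers by hand you can parse them programmatically. This is only a sketch, and it assumes the jsonlite package is installed:

##Parse the best figaro result (elements 1-2 are the read lengths, element 3 is the top-scoring result)
library(jsonlite)
best_result <- fromJSON(figaro[3])
best_result$trimPosition     # c(279, 205), to be used as truncLen
best_result$maxExpectedError # c(3, 2), to be used as maxEE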
ITS considerations
The ITS region has a very variable length. Consequently, no length trimming (truncLen argument) is performed in the filterAndTrim function from dada2, and therefore it is not necessary to use Figaro.
Filter and trim sequences
Now that we have the best trimming and expected-error values according to figaro, we can apply them to our original sequences (fnFs and fnRs; remember that figFs and figRs were used only for figaro). Let’s apply the figaro parameters to the filterAndTrim function from dada2! You will see that we produce the filtFs and filtRs objects first. These are the files that will store the filtered sequences after applying filterAndTrim over the raw sequences (fnFs and fnRs). Afterwards, primers will be removed from filtFs and filtRs with the cutadapt tool.
### FILTER AND TRIM SEQUENCES ACCORDING TO FIGARO ####
## Place filtered files in filtered/ subdirectory
<- file.path(path, "filtered", basename(fnFs))
filtFs <- file.path(path, "filtered", basename(fnRs))
filtRs
<- filterAndTrim(fnFs, filtFs, fnRs, filtRs, truncLen=c(279,205),
out maxN=0, maxEE=c(3,2), truncQ=2, rm.phix=TRUE,
compress=TRUE, multithread=TRUE, minLen=50)
head(out)
#> reads.in reads.out
#> L1-1_S102_R1_001.fastq 80054 61741
#> L1-2_S112_R1_001.fastq 76776 60781
#> L1-3_S121_R1_001.fastq 84947 67563
#> L1-4_S130_R1_001.fastq 87190 68997
#> L1-5_S138_R1_001.fastq 114810 93159
#> L1-6_S145_R1_001.fastq 88519 70075
ITS considerations
As explained in the last ITS considerations box, we do not use figaro on ITS reads. However, we should still use a filterAndTrim step to remove ambiguous bases so that we can better recognize the primers afterwards. Remember not to use trimming parameters; a minimal sketch is shown below.
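A minimal sketch of that pre-filter, following the dada2 ITS tutorial (only reads with ambiguous bases are removed, no length trimming; the 'filtN' folder name is a convention from that tutorial):

##ITS pre-filter: remove reads containing Ns, without any length trimming
fnFs.filtN <- file.path(path, "filtN", basename(fnFs))
fnRs.filtN <- file.path(path, "filtN", basename(fnRs))
filterAndTrim(fnFs, fnFs.filtN, fnRs, fnRs.filtN, maxN = 0, multithread = TRUE)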
Cutadapt: efficient removal of primers
Now that we have run Figaro and performed quality trimming, we can continue with primer removal, an obligatory step in the dada2 workflow. This will be done with the cutadapt tool. You can also remove your primers with the argument trimLeft = c(FWD_PRIMER_LEN, REV_PRIMER_LEN) of the filterAndTrim function (dada2). However, we consider cutadapt a more accurate tool for this purpose, as trimLeft indistinctly cuts the sequences at the given primer lengths, whether or not these primers are actually found in the sequence. As mentioned above, we are working on Ubuntu, so if you are using cutadapt on Windows, you should make some changes to the code below. Another consideration is that cutadapt should not be applied before figaro, because it can modify the quality profiles on which figaro is based.
First of all, we have to store our primer sequences in two different R objects (FWD and REV). Then, we will produce all the possible orientations of these primers (further details can be found in the dada2 tutorial). This is possible thanks to the function allOrients, which we create below.
#### IDENTIFY PRIMERS ####
<- "CCTACGGGNBGCASCAG" ## CHANGE ME to your forward primer sequence
FWD <- "GACTACNVGGGTATCTAATCC" ## CHANGE ME to your reverse primer sequence
REV
## VERIFY PRESENCE AND ORENTATION OF PRIMERS ##
<- function(primer) {
allOrients # Create all orientations of the input sequence
require(Biostrings)
<- DNAString(primer) # The Biostrings works with DNAString objects rather than character vectors
dna <- c(Forward = dna, Complement = Biostrings::complement(dna), Reverse = reverse(dna),
orients RevComp = reverseComplement(dna))
return(sapply(orients, toString)) # Convert back to character vector
}<- allOrients(FWD)
FWD.orients <- allOrients(REV)
REV.orients FWD.orients
#> Forward Complement Reverse RevComp
#> "CCTACGGGNBGCASCAG" "GGATGCCCNVCGTSGTC" "GACSACGBNGGGCATCC" "CTGSTGCVNCCCGTAGG"
REV.orients
#> Forward Complement Reverse RevComp
#> "GACTACNVGGGTATCTAATCC" "CTGATGNBCCCATAGATTAGG" "CCTAATCTATGGGVNCATCAG" "GGATTAGATACCCBNGTAGTC"
ITS considerations
For ITS analysis, don’t forget to change the primers to the appropriate ones; in our case, we use fITS7 (FWD <- “GTGARTCATCGAATCTTTG”) and ITS4 (REV <- “TCCTCCGCTTATTGATATGC”) for the ITS2 region.
Afterwards, we want to count the appearances of the primers in our sequences with the function primerHits:
## COUNT THE APPEARANCE AND ORIENTATION OF PRIMERS ##
primerHits <- function(primer, fn) {
  # Counts number of reads in which the primer is found
  nhits <- vcountPattern(primer, sread(readFastq(fn)), fixed = FALSE)
  return(sum(nhits > 0))
}
rbind(FWD.ForwardReads = sapply(FWD.orients, primerHits, fn = filtFs[[1]]),
      FWD.ReverseReads = sapply(FWD.orients, primerHits, fn = filtRs[[1]]),
      REV.ForwardReads = sapply(REV.orients, primerHits, fn = filtFs[[1]]),
      REV.ReverseReads = sapply(REV.orients, primerHits, fn = filtRs[[1]]))
#> Forward Complement Reverse RevComp
#> FWD.ForwardReads 59926 0 0 0
#> FWD.ReverseReads 0 0 0 0
#> REV.ForwardReads 2 0 0 0
#> REV.ReverseReads 60270 0 0 0
But… what is this table all about? On the left, we have the presence of forward and reverse primers (FWD and REV) in forward and reverse reads (FWD.ForwardReads, FWD.ReverseReads, REV.ForwardReads and REV.ReverseReads). At the top of the table, we see the four possible orientations of the primers (Forward, Complement, Reverse and RevComp). Taking all this information into account, in our dataset we found 59926 sequences with the FWD primer in the forward reads in its forward orientation. As expected, the REV primer was found in the reverse reads in its forward orientation. Cutadapt is used in the dada2 ITS tutorial, so check it if you want more information.
ITS considerations
Change filtFs and filtRs to fnFs and fnRs, respectively, since you didn’t perform the previous filtering step.
Now, let’s run cutadapt. The process is similar to what we did with figaro. First, we save the path to the cutadapt program in cutadapt. Then, we use system2 to call cutadapt.
## RUN CUTADAPT
<- "/usr/local/bin/cutadapt" #path to cutadapt
cutadapt
system2(cutadapt, args = c("--version")) # Run shell commands from R
Now, it is crucial to create a variable called path.cut, where we will store the sequences after being processed by cutadapt. We will give these sequences the same names as filtFs and filtRs. For an easier use of the tool, we build some of its arguments with paste0. The arguments we use are very well explained in the cutadapt user guide. Specifically, we implement arguments for linked adapters. In this sense, -a and -A precede the adapter sequences, respectively for R1 (forward) and R2 (reverse) reads. We use the ^ argument for anchoring primers, that is, primers are only removed if the anchored primer is found. Moreover, we combine it with the --discard-untrimmed argument. In this way, we make sure that a sequence is not retained if only the reverse primer is found: both forward and reverse primers should be present in order to keep a sequence. At the end, we use a for loop over system2 to run cutadapt on every file.
##Create path to cutadapt sequences
<- file.path(path, "cutadapt")
path.cut
if(!dir.exists(path.cut)) dir.create(path.cut)
<- file.path(path.cut, basename(filtFs))
fnFs.cut <- file.path(path.cut, basename(filtRs))
fnRs.cut
##Produce arguments for cutadapt. rc function creates the reverse complementary of the provided sequence (FWD).
<- dada2:::rc(FWD)
FWD.RC <- dada2:::rc(REV)
REV.RC
# Trim FWD and the reverse-complement of REV off of R1 (forward reads)
<- paste0("-a", " ", "^",FWD,"...", REV.RC)
R1.flags
# Trim REV and the reverse-complement of FWD off of R2 (reverse reads)
<- paste0("-A"," ","^", REV, "...", FWD.RC)
R2.flags
# Run Cutadapt
for(i in seq_along(fnFs)) {
system2(cutadapt, args = c(R1.flags, R2.flags, "-n", 2,"-m", 1, # -n 2 required to remove FWD and REV from reads
#-m 1 is required to remove empty sequences for plotting quality plots
"--discard-untrimmed",
"-j",0,#automatically detect number of cores
"-o", fnFs.cut[i], "-p", fnRs.cut[i], # output files
# input files
filtFs[i], filtRs[i],"--report=minimal")) #Report minimal reports a summary
}
ITS considerations
Change filtFs and filtRs to fnFs and fnRs, respectively, since you didn’t perform the previous filtering step.
Now, we have the filtered sequences without primers in fnFs.cut and fnRs.cut. From now on, these will be the sequences we use for the analysis. Let’s see if cutadapt has properly removed the primers with primerHits:
#Check primer presence after cutadapt
rbind(FWD.ForwardReads = sapply(FWD.orients, primerHits, fn = fnFs.cut[[1]]),
FWD.ReverseReads = sapply(FWD.orients, primerHits, fn = fnRs.cut[[1]]),
REV.ForwardReads = sapply(REV.orients, primerHits, fn = fnFs.cut[[1]]),
REV.ReverseReads = sapply(REV.orients, primerHits, fn = fnRs.cut[[1]]))
#> Forward Complement Reverse RevComp
#> FWD.ForwardReads 1 0 0 0
#> FWD.ReverseReads 0 0 0 0
#> REV.ForwardReads 0 0 0 0
#> REV.ReverseReads 0 0 0 0
As we can see, one forward primer can still be found in the forward reads. We should not worry about that: cutadapt and primerHits have different ways of looking for primers in the sequences. Moreover, if primers remain in low-quality sequences, they will probably be removed in the next steps.
ITS considerations
If analysing ITS sequences, remember to apply the filtering step now. See the code below:
filtFs <- file.path(path.cut, "filtered", basename(fnFs.cut))
filtRs <- file.path(path.cut, "filtered", basename(fnRs.cut))
out <- filterAndTrim(fnFs.cut, filtFs, fnRs.cut, filtRs, maxN = 0, maxEE = c(2, 5), truncQ = 2,
minLen = 50, rm.phix = TRUE, compress = TRUE, multithread = TRUE)
head(out)

With head(out) you can check whether the maxEE you set was too strict. If you are losing too many sequences, consider increasing its values.
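As a quick sketch, you can also compute the percentage of reads retained per sample directly from the out matrix returned by filterAndTrim:

##Percentage of reads retained per sample after filterAndTrim
perc_retained <- out[,"reads.out"] / out[,"reads.in"] * 100
summary(perc_retained) #if many samples retain few reads, consider relaxing maxEE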
In the next steps, we will apply the main core of the DADA2 pipeline. As it is very well detailed in its tutorial, we will not comment on it at length. Good luck! 🕓
Learn error rates
#### LEARN ERROR RATES ####
errF <- learnErrors(fnFs.cut, multithread=T, verbose=1)
#> 109967023 total bases in 419736 reads from 6 samples will be used for learning the error rates.
#> Initializing error rates to maximum possible estimate.
#> selfConsist step 1 ......
#> selfConsist step 2
#> selfConsist step 3
#> selfConsist step 4
#> selfConsist step 5
#> Convergence after 5 rounds.
errR <- learnErrors(fnRs.cut, multithread=T, verbose=1)
#> 102040114 total bases in 554558 reads from 8 samples will be used for learning the error rates.
#> Initializing error rates to maximum possible estimate.
#> selfConsist step 1 ........
#> selfConsist step 2
#> selfConsist step 3
#> selfConsist step 4
#> selfConsist step 5
#> selfConsist step 6
#> Convergence after 6 rounds.
ITS considerations
Change fnFs.cut and fnRs.cut to filtFs and filtRs, respectively. Do it also in the next steps: dadaFs, dadaRs and mergers.
View error plots
As we can see, the plots follow the expected trend adequately.
#Plot errors
plotErrors(errF, nominalQ=TRUE)
#> Warning: Transformation introduced infinite values in continuous y-axis
#> Transformation introduced infinite values in continuous y-axis
plotErrors(errR, nominalQ=TRUE)
#> Warning: Transformation introduced infinite values in continuous y-axis
#> Transformation introduced infinite values in continuous y-axis
Sample inference
#Sample inference
dadaFs <- dada(fnFs.cut, err=errF, multithread=TRUE)
#> Sample 1 - 61506 reads in 15424 unique sequences.
#> Sample 2 - 60287 reads in 12454 unique sequences.
#> Sample 3 - 67108 reads in 13471 unique sequences.
#> Sample 4 - 68541 reads in 13392 unique sequences.
#> Sample 5 - 92672 reads in 19174 unique sequences.
#> Sample 6 - 69622 reads in 14664 unique sequences.
#> Sample 7 - 66191 reads in 14858 unique sequences.
#> Sample 8 - 68631 reads in 17064 unique sequences.
#> Sample 9 - 64360 reads in 12997 unique sequences.
#> Sample 10 - 73537 reads in 16302 unique sequences.
#> Sample 11 - 64206 reads in 12105 unique sequences.
#> Sample 12 - 74353 reads in 16945 unique sequences.
#> Sample 13 - 62585 reads in 14748 unique sequences.
#> Sample 14 - 72255 reads in 16401 unique sequences.
#> Sample 15 - 69370 reads in 15705 unique sequences.
#> Sample 16 - 56068 reads in 13202 unique sequences.
#> Sample 17 - 73197 reads in 14786 unique sequences.
#> Sample 18 - 58236 reads in 12100 unique sequences.
#> Sample 19 - 56801 reads in 13257 unique sequences.
#> Sample 20 - 56072 reads in 12635 unique sequences.
#> Sample 21 - 57847 reads in 12828 unique sequences.
#> Sample 22 - 65716 reads in 11388 unique sequences.
#> Sample 23 - 69528 reads in 13095 unique sequences.
#> Sample 24 - 52479 reads in 9714 unique sequences.
#> Sample 25 - 89584 reads in 7824 unique sequences.
#> Sample 26 - 93427 reads in 8209 unique sequences.
#> Sample 27 - 106500 reads in 9724 unique sequences.
dadaRs <- dada(fnRs.cut, err=errR, multithread=TRUE)
#> Sample 1 - 61506 reads in 15090 unique sequences.
#> Sample 2 - 60287 reads in 8972 unique sequences.
#> Sample 3 - 67108 reads in 10240 unique sequences.
#> Sample 4 - 68541 reads in 11098 unique sequences.
#> Sample 5 - 92672 reads in 15470 unique sequences.
#> Sample 6 - 69622 reads in 11810 unique sequences.
#> Sample 7 - 66191 reads in 10626 unique sequences.
#> Sample 8 - 68631 reads in 13787 unique sequences.
#> Sample 9 - 64360 reads in 9862 unique sequences.
#> Sample 10 - 73537 reads in 11339 unique sequences.
#> Sample 11 - 64206 reads in 8219 unique sequences.
#> Sample 12 - 74353 reads in 12900 unique sequences.
#> Sample 13 - 62585 reads in 11369 unique sequences.
#> Sample 14 - 72255 reads in 12843 unique sequences.
#> Sample 15 - 69370 reads in 12557 unique sequences.
#> Sample 16 - 56068 reads in 11103 unique sequences.
#> Sample 17 - 73197 reads in 12138 unique sequences.
#> Sample 18 - 58236 reads in 9899 unique sequences.
#> Sample 19 - 56801 reads in 9540 unique sequences.
#> Sample 20 - 56072 reads in 10430 unique sequences.
#> Sample 21 - 57847 reads in 9716 unique sequences.
#> Sample 22 - 65716 reads in 9249 unique sequences.
#> Sample 23 - 69528 reads in 11676 unique sequences.
#> Sample 24 - 52479 reads in 8641 unique sequences.
#> Sample 25 - 89584 reads in 6378 unique sequences.
#> Sample 26 - 93427 reads in 6198 unique sequences.
#> Sample 27 - 106500 reads in 7362 unique sequences.
dadaFs[[1]]
#> dada-class: object describing DADA2 denoising results
#> 758 sequence variants were inferred from 15424 input unique sequences.
#> Key parameters: OMEGA_A = 1e-40, OMEGA_C = 1e-40, BAND_SIZE = 16
As we can see, thanks to the DADA2 algorithm, 758 ASVs were inferred from 15424 unique input sequences (remember that this is only for the first sample, as we wrote [[1]]). Next, we will give dadaFs and dadaRs the sample.names.
#Set sample names
names(dadaFs) <- sample.names
names(dadaRs) <- sample.names
Merge paired-end reads
#Merge sequences
mergers <- mergePairs(dadaFs, fnFs.cut, dadaRs, fnRs.cut, verbose=TRUE)
#> 58056 paired-reads (in 841 unique pairings) successfully merged out of 59510 (in 1465 pairings) input.
#> 57884 paired-reads (in 527 unique pairings) successfully merged out of 58969 (in 973 pairings) input.
#> 65180 paired-reads (in 492 unique pairings) successfully merged out of 66042 (in 881 pairings) input.
#> 66703 paired-reads (in 486 unique pairings) successfully merged out of 67552 (in 865 pairings) input.
#> 89414 paired-reads (in 932 unique pairings) successfully merged out of 90777 (in 1561 pairings) input.
#> 66897 paired-reads (in 665 unique pairings) successfully merged out of 68102 (in 1223 pairings) input.
#> 63906 paired-reads (in 609 unique pairings) successfully merged out of 64927 (in 1106 pairings) input.
#> 65300 paired-reads (in 860 unique pairings) successfully merged out of 66704 (in 1524 pairings) input.
#> 62656 paired-reads (in 606 unique pairings) successfully merged out of 63300 (in 889 pairings) input.
#> 71291 paired-reads (in 635 unique pairings) successfully merged out of 72263 (in 1082 pairings) input.
#> 62506 paired-reads (in 443 unique pairings) successfully merged out of 63204 (in 757 pairings) input.
#> 71825 paired-reads (in 733 unique pairings) successfully merged out of 72805 (in 1226 pairings) input.
#> 59867 paired-reads (in 626 unique pairings) successfully merged out of 61145 (in 1295 pairings) input.
#> 69397 paired-reads (in 807 unique pairings) successfully merged out of 70680 (in 1361 pairings) input.
#> 66800 paired-reads (in 773 unique pairings) successfully merged out of 67865 (in 1198 pairings) input.
#> 53498 paired-reads (in 691 unique pairings) successfully merged out of 54585 (in 1149 pairings) input.
#> 70984 paired-reads (in 668 unique pairings) successfully merged out of 71999 (in 1094 pairings) input.
#> 56672 paired-reads (in 449 unique pairings) successfully merged out of 57345 (in 744 pairings) input.
#> 54758 paired-reads (in 705 unique pairings) successfully merged out of 55509 (in 1006 pairings) input.
#> 54202 paired-reads (in 648 unique pairings) successfully merged out of 54911 (in 986 pairings) input.
#> 55901 paired-reads (in 585 unique pairings) successfully merged out of 56778 (in 1000 pairings) input.
#> 64397 paired-reads (in 475 unique pairings) successfully merged out of 65119 (in 738 pairings) input.
#> 67867 paired-reads (in 532 unique pairings) successfully merged out of 68681 (in 897 pairings) input.
#> 51487 paired-reads (in 353 unique pairings) successfully merged out of 51953 (in 540 pairings) input.
#> 89364 paired-reads (in 74 unique pairings) successfully merged out of 89388 (in 85 pairings) input.
#> 93212 paired-reads (in 82 unique pairings) successfully merged out of 93267 (in 96 pairings) input.
#> 106318 paired-reads (in 139 unique pairings) successfully merged out of 106393 (in 160 pairings) input.
# Inspect the merger data.frame from the first sample
head(mergers[[1]])
#> sequence
#> 1 TGGGGAATATTGCACAATGGGCGAAAGCCTGATGCAGCGACGCCGCGTGAGGGATGACGGCCTTCGGGTTGTAAACCTCTTTCAGCAGGGAAGAAGCGAAAGTGACGGTACCTGCAGAAGAAGCGCCGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGGCGCAAGCGTTGTCCGGAATTATTGGGCGTAAAGAGCTCGTAGGCGGTCTGTCGCGTCGGATGTGAAAGCCCGGGGCTTAACCCCGGGTCTGCATTCGATACGGGCAGACTAGAGTGTGGTAGGGGAGATCGGAATTCCTGGTGTAGCGGTGAAATGCGCAGATATCAGGAGGAACACCGGTGGCGAAGGCGGATCTCTGGGCCATTACTGACGCTGAGGAGCGAAAGCGTGGGGAGCGAACA
#> 2 TGGGGAATATTGCGCAATGGGCGAAAGCCTGACGCAGCGACGCCGCGTGAGGGATGACGGCCTTCGGGTTGTAAACCTCTTTCGCCAGGGACGAAGCTTTCGGGTGACGGTACCTGGAGAAGAAGCACCGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGGTGCGAGCGTTGTCCGGAATTATTGGGCGTAAAGAGCTCGTAGGCGGTTTGTTGCGTCGGCCGTGAAAACTTCACGCTTAACGTGGAGCTTGCGGTCGATACGGGCAGACTTGAGTTCGGTAGGGGAGACTGGAATTCCTGGTGTAGCGGTGAAATGCGCAGATATCAGGAGGAACACCGGTGGCGAAGGCGGGTCTCTGGGCCGATACTGACGCTGAGGAGCGAAAGCGTGGGGAGCGAACA
#> 3 TGGGGAATATTGCACAATGGGCGAAAGCCTGATGCAGCGACGCCGCGTGAGGGATGACGGCCTTCGGGTTGTAAACCTCTTTCAGCAGGGAAGAAGCGAAAGTGACGGTACCTGCAGAAGAAGCGCCGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGGCGCAAGCGTTGTCCGGAATTATTGGGCGTAAAGAGCTCGTAGGCGGTCTGTCACGTCGGATGTGAAAGCCCGGGGCTTAACCCCGGGTCTGCATTCGATACGGGCAGACTAGAGTGTGGTAGGGGAGATCGGAATTCCTGGTGTAGCGGTGAAATGCGCAGATATCAGGAGGAACACCGGTGGCGAAGGCGGATCTCTGGGCCATTACTGACGCTGAGGAGCGAAAGCGTGGGGAGCGAACA
#> 4 TGGGGAATATTGCGCAATGGGCGAAAGCCTGACGCAGCGACGCCGCGTGAGGGATGACGGCCTTCGGGTTGTAAACCTCTTTCGCCAGGGACGAAGGGTGACTGACGGTACCTGGAGAAGAAGCACCGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGGTGCGAGCGTTGTCCGGAATTATTGGGCGTAAAGAGCTCGTAGGCGGTTTGTTGCGTCGGCCGTGAAAACTTCACGCTTAACGTGGAGCTTGCGGTCGATACGGGCAGACTTGAGTTCGGTAGGGGAGACTGGAATTCCTGGTGTAGCGGTGAAATGCGCAGATATCAGGAGGAACACCGGTGGCGAAGGCGGGTCTCTGGGCCGATACTGACGCTGAGGAGCGAAAGCGTGGGGAGCGAACA
#> 5 TGGGGAATATTGCGCAATGGGCGGAAGCCTGACGCAGCGACGCCGCGTGGGGGATGACGGCCTTCGGGTTGTAAACCTCTTTCAGCAGGGACGAAGCTAACGTGACGGTACCTGCAGAAGAAGCGCCGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGGCGCAAGCGTTGTCCGGAATTATTGGGCGTAAAGAGCTCGTAGGCGGTTTGTTGCGTCCGTCGTGAAAGCCCACGGCTTAACTGTGGGTCTGCGGTGGATACGGGCAGACTAGAGGCAGGTAGGGGAGCATGGAATTCCCGGTGTAGCGGTGAAATGCGCAGATATCGGGAGGAACACCGGTGGCGAAGGCGGTGCTCTGGGCCTGTACTGACGCTGAGGAGCGAAAGCGTGGGGAGCGAACA
#> 6 TGGGGAATCTTGGACAATGGGCGAAAGCCCGATCCAGCAATATCGCGTGAGTGAAGAAGGGCAATGCCGCTTGTAAAGCTCTTTCATCGAGTGCGCGATCATGACAAGACTCGAGGAAGAAGCCCCGGCTAACTCCGTGCCAGCAGCCGCGGTAAGACGGGGGGGGCAAGTGTTCTTCGGAATGACTGGGCGTAAAGGGCACGTAGGCGGTGAATCGGGTTGAAAGTGAAAGTCGCCAAAAACTGGCGGAATGCTCTCGAAACCAATTCACTTGAGTGAGACAGAGGAGAGTGGAATTTCGTGTGTAGGGGTGAAATCCGGAGATCTACGAAGGAACGCCAAAAGCGAAGGCAGCTCTCTGGGTCCCTACCGACGCTGGGGTGCGAAAGCATGGGGAGCGAACA
#> abundance forward reverse nmatch nmismatch nindel prefer accept
#> 1 6997 1 1 39 0 0 2 TRUE
#> 2 5696 2 2 37 0 0 2 TRUE
#> 3 3738 3 1 39 0 0 2 TRUE
#> 4 2282 4 2 39 0 0 2 TRUE
#> 5 1950 5 5 39 0 0 2 TRUE
#> 6 1931 6 3 42 0 0 2 TRUE
Construct sequence table
#Construct sequence table
seqtab <- makeSequenceTable(mergers)
dim(seqtab)
#> [1] 27 7080
Our ASV table (seqtab) contains 7080 ASVs from 27 samples (24 samples + 3 MOCK).
Remove chimeras
#Remove chimeras
seqtab.nochim <- removeBimeraDenovo(seqtab, method="consensus", multithread=TRUE, verbose=TRUE)
#> Identified 2163 bimeras out of 7080 input sequences.
dim(seqtab.nochim)
#> [1] 27 4917
After chimera removal, we have reduced the number of ASVs from 7080 to 4917. It may seem a lot, but these chimeras do not account for a great number of sequences. Don’t worry, we will check that below!
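To verify this, the dada2 tutorial suggests comparing the total read counts before and after chimera removal:

#Fraction of reads (not ASVs) kept after chimera removal
sum(seqtab.nochim)/sum(seqtab) #usually close to 1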
Inspect ASV length and abundance
After all this process, we have an ASV table without chimeras (seqtab.nochim). Before further analysis, it is interesting to check the distribution of ASV lengths as well as the number of ASVs at each length. This gives us valuable information as a checkpoint of the process. Moreover, we will remove those lengths that are not expected from our sequencing process and that represent a small number of ASVs and sequences, as they are probably a result of non-specific priming.
# Inspect distribution of sequence lengths
table(nchar(getSequences(seqtab.nochim))) #Number of ASV of each length
#>
#> 244 247 253 255 259 262 264 269 273 276 299 304 311 316 321 327 338 339 356
#> 2 2 1 1 1 4 1 1 1 1 1 1 1 1 1 1 1 1 1
#> 359 361 364 369 370 373 376 380 382 384 385 386 390 396 397 400 401 402 403
#> 1 1 2 1 1 1 1 1 2 2 6 8 1 5 2 4 44 984 294
#> 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422
#> 202 156 20 387 82 44 24 20 3 11 7 5 50 16 18 32 27 216 552
#> 423 424 425 426 427 428 429 430 431 432
#> 40 29 57 265 1026 218 14 3 2 6
This table is really simple: the first line shows the ASV length and, below it, the number of ASVs. As expected, most ASVs have a length of 427 bp.
However, this way we don’t know how many sequences these ASVs account for at each length. Let’s calculate it and make a plot! We will store the number of sequences in reads.per.seqlen.
#Get number of sequences per length
reads.per.seqlen <- tapply(colSums(seqtab.nochim), factor(nchar(getSequences(seqtab.nochim))), sum) #number of sequences for each length
reads.per.seqlen
#> 244 247 253 255 259 262 264 269 273 276 299 304 311
#> 4 104 2 4 2 175 6 3 2 2 2 4 2
#> 316 321 327 338 339 356 359 361 364 369 370 373 376
#> 2 2 4 2 2 2 2 4 50 4 2 6 4
#> 380 382 384 385 386 390 396 397 400 401 402 403 404
#> 4 36451 8 40 1082 2 15 13 661 504 130086 99017 277172
#> 405 406 407 408 409 410 411 412 413 414 415 416 417
#> 3998 995 592438 116044 58285 1601 1328 4790 179 2455 93 2878 133
#> 418 419 420 421 422 423 424 425 426 427 428 429 430
#> 612 5778 1023 12092 29632 268 448 4098 12680 370689 13647 119 8
#> 431 432
#> 4 213
Now, we will produce a table ready for plotting
## Plot length distribution
table_reads_seqlen <- data.frame(length=as.numeric(names(reads.per.seqlen)), count=reads.per.seqlen) #Create a table
ggplot(data=table_reads_seqlen, aes(x=length, y=count)) + geom_col() #plot it!
As we can see from the table and the plot, we will choose the interval ranging from 402 to 428 bp.
#Filter ASV length
seqtab.nochim <- seqtab.nochim[,nchar(colnames(seqtab.nochim)) %in% seq(402,428)]
ITS considerations
This length-based filtering should be skipped because of the variable length of the ITS2 region.
Check number of reads
At this point, it is important to check how many sequences we have lost in each step. Apart from filtering, no other step should have led to a substantial decrease in sequence numbers. First, as shown in the dada2 tutorial, we create the getN function and then apply it to produce a track table.
#Check number of sequences at each step
getN <- function(x) sum(getUniques(x))
track <- cbind(out, sapply(dadaFs, getN), sapply(dadaRs, getN), sapply(mergers, getN), rowSums(seqtab.nochim))
# If processing a single sample, remove the sapply calls: e.g. replace sapply(dadaFs, getN) with getN(dadaFs)
colnames(track) <- c("input", "filtered", "denoisedF", "denoisedR", "merged", "nonchim")
rownames(track) <- sample.names

#Let's apply mutate to get percentages, making the table easier to analyze.
track <- track %>% as.data.frame() %>% mutate(Perc.filtered=filtered*100/input,
                                              Perc.denoisedF= denoisedF*100/filtered,
                                              Perc.denoisedR= denoisedR*100/filtered,
                                              Perc.merged= merged*100/filtered,
                                              Perc.nonchim= nonchim*100/merged,
                                              Perc.retained=nonchim*100/input)
head(track)
#> input filtered denoisedF denoisedR merged nonchim Perc.filtered Perc.denoisedF
#> L1-1 80054 61741 60003 60662 58056 55455 77.12419 97.18501
#> L1-2 76776 60781 59308 59694 57884 56153 79.16667 97.57655
#> L1-3 84947 67563 66299 66670 65180 63021 79.53548 98.12915
#> L1-4 87190 68997 67828 68092 66703 63831 79.13408 98.30572
#> L1-5 114810 93159 91321 91801 89414 84491 81.14189 98.02703
#> L1-6 88519 70075 68479 68988 66897 64905 79.16380 97.72244
#> Perc.denoisedR Perc.merged Perc.nonchim Perc.retained
#> L1-1 98.25238 94.03152 95.51984 69.27199
#> L1-2 98.21161 95.23371 97.00954 73.13874
#> L1-3 98.67827 96.47292 96.68763 74.18861
#> L1-4 98.68835 96.67522 95.69435 73.20908
#> L1-5 98.54228 95.97999 94.49415 73.59202
#> L1-6 98.44880 95.46486 97.02229 73.32324
Everything is as expected: the only step where we lose a greater number of sequences is the filtering step. If you want to visualize the full table and save it, remember to use View(track) and write.table(track, file="dada2_track.txt", sep="\t"). Let’s keep on with classification!
Taxonomy classification with RDP
As mentioned at the beginning of the tutorial, we will use a modified RDP database. We usually work with root samples, so the detection of mitochondrial sequences from host cells was crucial for us. Because of that, we have added mitochondrial sequences to the database. Moreover, the RDP database has the drawback that some taxa lack certain classification levels. This is the case for some Acidobacteria: we found phylum Acidobacteria, class Acidobacteria_gp1 and genus gp1. However, no family or order was found for this record, giving an unclassified result at these levels. Classification is easy to apply with assignTaxonomy; you only have to change the path to your RDP database file.
#Classification with RDP
taxa_rdp <- assignTaxonomy(seqtab.nochim, "/home/nuria/Escritorio/data_micro4all/rdp_train_set_18_H.fa", multithread=TRUE)
ITS considerations
For ITS classification, you can download the UNITE database from the DADA2 taxonomic reference data website.
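Classification then works just as above; this is only a sketch, and the file name below is a placeholder for the UNITE release you actually download:

#ITS classification with UNITE (placeholder file path)
taxa_unite <- assignTaxonomy(seqtab.nochim, "path/to/your_UNITE_release.fasta", multithread=TRUE)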
Get ASV table
Now that we have a classification table (taxa_rdp) and an ASV table (seqtab.nochim), we will produce a table combining classification and ASV abundance. First, let’s assign seqtab.nochim to a new variable called ASV. Then, we will substitute ‘NA’ in taxa_rdp with unclassified.
#Combine taxonomy and ASV abundance
ASV <- seqtab.nochim
ASVt <- t(ASV)

## SUBSTITUTE 'NA' WITH 'UNCLASSIFIED' and remove species column
taxa_rdp_na <- apply(taxa_rdp, 2, tidyr::replace_na, "unclassified")[,-7]
The next lines create the ASV names, in such a way that if there are 1000 ASVs, they will be named from ASV0001 to ASV1000. The code below automatically takes into account the number of zeros to add according to the total number of ASVs.
##Create names for each ASV: if there are 1000 ASVs, call them from ASV0001 to ASV1000
number.digit <- nchar(as.integer(nrow(ASVt)))
names <- paste0("ASV%0", number.digit, "d") #As many 0 as digits
ASV_names <- sprintf(names, 1:nrow(ASVt))
At last, we have everything we need to create our ASV table. We are going to bind its different elements and format it a little. Then, we save it as .txt. This allows us to manipulate the data manually if needed (you can open it with tools such as Excel), as well as to save the raw data before further steps are performed.
## CREATE AND SAVE ASV TABLE
ASV_table_classified <- cbind(as.data.frame(taxa_rdp_na, stringsAsFactors = FALSE), as.data.frame(ASV_names, stringsAsFactors = FALSE), as.data.frame(ASVt, stringsAsFactors = FALSE))

##Make rownames a new column, in order to keep sequences during the filtering process
ASV_seqs <- rownames(ASV_table_classified)
rownames(ASV_table_classified) <- NULL
ASV_table_classified <- cbind(ASV_seqs, ASV_table_classified)
write.table(ASV_table_classified, file="ASV_table_classified.txt", sep="\t")
If you already have an ASV table, you can load it here with ASV_table_classified <- read.table(file="yourfilepath", sep="yourseparationmethod", header=TRUE) and follow the tutorial from this point on.
MOCK community
In this step, we want to filter out those ASVs that are misassigned to other samples. This can be done by analysing MOCK community samples introduced in the same sequencing run. In this sense, we always include in our sequencing runs three samples of a commercial MOCK community (ZymoBIOMICS Microbial Community Standard II (Log Distribution), ZYMORESEARCH, CA, USA). The logic of this analysis is as follows: because we already know its composition, we can check whether any ASV identified by the sequencing process is not a real member of the MOCK community. Once identified, we calculate the percentage of sequences it represents in the MOCK samples. We make the assumption that ASVs representing less than this percentage in the sequencing run should be discarded from our data. In order to automate this analysis, micro4all implements a function called MockCommunity. If you want to know more about this function, you can always read the help information with ?MockCommunity. Basically, you have to give it an ASV table where the samples are columns. MOCK samples should include the word ‘MOCK’ in capital letters. Moreover, the second argument corresponds to a MOCK composition file which includes the members of the community at genus level in decreasing order of abundance. You can see the structure of this file as it is loaded with the micro4all package (use View(micro4all::mock_composition)). If you want to create a different MOCK composition object, you can easily do it with mock_comp <- setNames(as.data.frame(c("genera1", "genera2","genera3")), c("MockGenera")). Lastly, you should change the argument ASV_column to the name of the column where your ASV names are stored. The MockCommunity function is user-interactive: it will ask you whether or not you want to select the first misassigned ASV detected to calculate the percentage. If the answer is ‘no’, it will look for the next one. Once the answer is ‘yes’, it will remove the MOCK community samples, apply the filtering and return a cleaned ASV table sorted in decreasing ASV abundance.
Let’s see how it works:
#MOCK community cleaning.
ASV_filtered_MOCK <- MockCommunity(ASV_table_classified, mock_composition, ASV_column = "ASV_names") #mock_composition is a data frame included as data in this package. See documentation for details.
#> ASV0811 does not belong to the MOCK community. It representes a 0.031757 perc. of the sequences
#> and it classifies as Limosilactobacillus
#> Do you want to use this ASV to calculate the percentage? [answer yes or no]ASV0002 does not belong to the MOCK community. It representes a 0.003881 perc. of the sequences
#> and it classifies as Streptomyces
#> Do you want to use this ASV to calculate the percentage? [answer yes or no]You made a decision! ASV0002 which classifies as Streptomyces
#> and represents a 0.003881 perc. of the sequences, was used to calculate the percentage
As explained above, MockCommunity asks the user before making the decision. Why? Sometimes, as in the example above, an ASV could be classified under a different name while still belonging to the same genus. In this example, the first ASV not found in the MOCK community composition table is Limosilactobacillus. However, we can see that it represents a high percentage of sequences. This is probably a Lactobacillus from the MOCK samples under an updated genus name. Because of that, we decide to choose the second option.
ITS considerations
If analysing ITS data, MOCK samples with a good number of fungal representatives should be included. If your bacteria and fungi are sequenced in the same run, the percentage calculated for bacteria can be applied to the fungal dataset.
When a MOCK community has not been analyzed, it is recommended to filter out those ASVs which represent less than 0.005% of the sequences, as studied by Bokulich et al. (2013). We can do it easily in R as follows:
###CLEAN according to Bokulich####
# Get number of sequences of each ASV
ASV_sums <- rowSums(ASV_table_classified[,9:ncol(ASV_table_classified)])
# Get total number of sequences
sum.total <- sum(ASV_sums)
# Apply percentage to sequence number
nseq_cutoff <- (0.005/100)*sum.total
# Filter table.
ASV_filtered_bokulich <- ASV_table_classified[which(ASV_sums>nseq_cutoff),]
# Sort table in ascending order of ASV names
ASV_table_bokulich <- ASV_filtered_bokulich[order(ASV_filtered_bokulich[["ASV_names"]]),]
Once the percentage is applied, we can remove sequences coming from chloroplasts and mitochondria at various taxonomic levels:
#### CLEAN CHLOROPLAST SEQUENCES ####
ASV_final <- ASV_filtered_MOCK[(which(ASV_filtered_MOCK$Genus!="Streptophyta"
                                      & ASV_filtered_MOCK$Genus!="Chlorophyta"
                                      & ASV_filtered_MOCK$Genus!="Bacillariophyta"
                                      & ASV_filtered_MOCK$Family!="Streptophyta"
                                      & ASV_filtered_MOCK$Family!="Chlorophyta"
                                      & ASV_filtered_MOCK$Family!="Bacillariophyta"
                                      & ASV_filtered_MOCK$Family!="Mitochondria"
                                      & ASV_filtered_MOCK$Class!="Chloroplast"
                                      & ASV_filtered_MOCK$Order!="Chloroplast"
                                      & ASV_filtered_MOCK$Kingdom!="Eukaryota"
                                      & ASV_filtered_MOCK$Kingdom!="unclassified")),]
ITS considerations
For ITS analysis, we should also remove those ASVs that were not classified as _k__Fungi_ at kingdom level. We could do it as follows (keeping only the rows classified as k__Fungi):
ASV_final <- ASV_filtered_MOCK[which(ASV_filtered_MOCK$Kingdom=="k__Fungi"),]
After this, the presence of Cyanobacteria/Chloroplast at phylum level should be checked. If we only find Cyanobacteria/Chloroplast records at phylum level that are not further classified, we remove them, as they may also come from host DNA. To check this, we can use these lines (if working in RStudio, add a View(...) to visualize the result more easily):
#Check Cyanobacteria
which(ASV_final$Phylum=="Cyanobacteria/Chloroplast"),] ASV_final[
#> [1] ASV_seqs Kingdom Phylum Class Order Family Genus ASV_names L1-1
#> [10] L1-2 L1-3 L1-4 L1-5 L1-6 L1-7 L1-8 L2-1 L2-2
#> [19] L2-3 L2-4 L2-5 L2-6 L2-7 L2-8 L3-1 L3-2 L3-3
#> [28] L3-4 L3-5 L3-6 L3-7 L3-8
#> <0 rows> (or 0-length row.names)
In this particular case, no ASV classified as Cyanobacteria/Chloroplast at phylum level remains in the table. If you do find such ASVs not classified at deeper levels, you can remove them as follows:
#Remove cyanobacteria at phylum level
ASV_final <- ASV_final[(which(ASV_final$Phylum!="Cyanobacteria/Chloroplast")),]
We will write this table on our local machine
##Save table
write.table(ASV_final, file="ASV_final.txt", sep="\t")
Furthermore, if working with samples prone to host contamination (such as endosphere or seed samples), it could happen that sequences unclassified at phylum level are chimeric products of microorganism and host DNA. We will not take this into account in this tutorial, but below we detail the lines to check it. It consists of getting the ASV sequences from the unclassified phyla. We save them to the local machine in case you want to perform a BLASTn manually. However, it can also be performed from R thanks to system, running a local BLAST. Let’s install some packages beforehand:
#Install seqinr for fasta writing
install.packages("seqinr", repos="http://R-Forge.R-project.org")
#Install annotate
::install("annotate") BiocManager
First, we will store the unclassified phyla in a variable, then extract the sequences and write them as a FASTA file to your local machine. Furthermore, we can run an online BLASTn with blastSequences from the annotate package (change the [[1]] to analyze another sequence); a sketch is shown after the code below.
##Analyze unclassified phyla
#Load libraries
library(seqinr); packageVersion("seqinr"); library(annotate); packageVersion("annotate")
#> [1] '4.2.23'
#> [1] '1.74.0'
#Get sequences of unclassified phyla
unclassified_phyla <- ASV_final[which(ASV_final$Phylum=="unclassified"),]
seq_unclassified_phyla <- as.list(unclassified_phyla$ASV_seqs)
#write fasta
write.fasta(seq_unclassified_phyla, names=unclassified_phyla$ASV_names, file.out="unclassified_phyla.fas", open = "w", nbchar = 1000, as.string = FALSE)
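The online query with annotate::blastSequences mentioned above could look like this (a sketch; queries to NCBI can be slow or time out, and defaults may vary between annotate versions):

##Online BLASTn of the first unclassified-phylum sequence (sketch)
blast_hits <- blastSequences(seq_unclassified_phyla[[1]], hitListSize = 10,
                             timeout = 200, as = "data.frame")
head(blast_hits)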
With this file you can perform a BLASTn online. Nonetheless, you can also perform a local BLAST as follows. Remember that we are working on Ubuntu, so you should first download BLASTn on your machine, as well as the corresponding databases (in this case, the nt database). The arguments are:
- query. The path to the FASTA file to analyze.
- db. Path to the database on your local machine.
- out. Name of the output file.
- outfmt. Output format. We use format 6 (tabular without headers). This format does not give you the species name; that’s why we add “stitle”.
- show_gis. Shows the NCBI GIs in deflines.
- max_target_seqs. Number of hits to return for each sequence.
- num_threads. Number of threads to use.
#Local BLASTn
system(("/home/programs/ncbi-blast-2.11.0+/bin/blastn -query unclassified_phyla.fas -db /mnt/datos/databases/nt -out unclassified_phyla_hits.txt -outfmt '6 std stitle' -show_gis -max_target_seqs 20 -parse_deflines -num_threads 10"),intern = TRUE)
In the code above, remember to change "/home/..." to the path where the BLASTn program is stored on your machine.
Diversity analysis
We have finished the first part of the tutorial. Now, we should have
a final ASV table, already cleaned from misassigned ASV and host ASV. We
are ready to start with diversity analysis! If you already have an ASV
table, you can load it here with
ASV_table_classified <- read.table(file="yourfilepath",sep="yourseparationmethod", header=TRUE)
and follow the tutorial from now on.
For this purpose, we need a phyloseq object. The phyloseq package provides multiple tools for analyzing microbial community data. The first challenge that phyloseq solves is to wrap different pieces of information into a single object, easy to use with multiple packages. First of all, let’s install phyloseq and load the library:
##Install phyloseq
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
::install("phyloseq") BiocManager
##Load library
library(phyloseq); packageVersion("phyloseq")
#> [1] '1.40.0'
Specifically for this tutorial, we will need an ASV table with the number of reads per sample, their taxonomic classification, a metadata table, a phylogenetic tree and the DNA sequences. As we can see, we need to produce a phylogenetic tree. This will allow us to use UniFrac and Weighted UniFrac distances later on in the β diversity analysis. For this purpose, we will use tools such as MAFFT and FastTreeMP. Again, we should remind you that we are using these tools on Ubuntu and that they should be appropriately installed beforehand. Moreover, if you have any problem with FastTreeMP, you can always switch to FastTree.
We will start by getting a FASTA file with all the ASV sequences. As mentioned in the last section, we will use seqinr (if you haven’t loaded it yet, do it now with library(seqinr)). With this file, we will perform a multiple sequence alignment with MAFFT (first storing the path to the program and then using system2 to call it in the shell). We will use the default alignment options with the argument --auto. The output file will be called alignment. This file will be used by FastTreeMP to produce a phylogenetic tree. The steps to execute this program are the same as with MAFFT.
####GET PHYLOGENETIC TREE####
## GET THE FASTA FILE ##
ASVseqs_fasta <- as.list(ASV_final$ASV_seqs)
write.fasta(ASVseqs_fasta, names=ASV_final$ASV_names, file.out="ASV_tree.fas", open = "w", nbchar = 1000, as.string = FALSE)

##Get the alignment ##
mafft <- "/usr/bin/mafft" #path to program

system2(mafft, args="--version")
system2(mafft, args=c("--auto", "ASV_tree.fas>", "alignment"))

## Get the tree ##
FastTreeMP <- "/home/programs/FastTreeMP/FastTreeMP"

system2(FastTreeMP, args="--version")
system2(FastTreeMP, args = c("-gamma", "-nt", "-gtr", "-spr", 4, "-mlacc", 2, "-slownni", "<alignment>", "phy_tree")) # Run FastTreeMP
ITS considerations
The ITS region is not appropriate for calculating phylogenetic trees. As a result, neither UniFrac nor Weighted UniFrac distances can be applied.
The only thing left to produce our phyloseq object is the metadata table, which includes further information about your samples. This information is used later on to group samples together (for example, replicates from the same location). In this tutorial, we will use the metadata table that is loaded with micro4all. It is simply a data frame containing sample names and, in another column, their location (location1, location2 and location3). We will store it in another variable and make some modifications to replace “-” with “_“, both in the samples column and in the row names.
##Metadata from micro4all
metadata.micro <- micro4all::metadata
metadata.micro$samples <- str_replace_all(metadata.micro$samples, "-", "_")
rownames(metadata.micro) <- str_replace_all(rownames(metadata.micro), "-", "_")
If you are using your own metadata table, you can create it outside R with any text editor and save it as a tab-separated file. Then, you can load it with mt <- read.table("yourmtfile.txt", sep="\t", header=TRUE) and row.names(mt) <- mt$samples. Remember that it must include a column with sample names (exactly the same sample names and in the same order as in the ASV table). In the following lines, remember to replace metadata.micro with your own mt object.
Create phyloseq
We will put all the information we have (classification, abundance, metadata, ASV sequences and phylogenetic tree) in one object thanks to phyloseq. First, each variable will be transformed to phyloseq and then we will merge them.
####CREATE PHYLOSEQ OBJECT FROM TABLES
tax <- ASV_final[,2:8] #Tax
OTU <- ASV_final[,9:ncol(ASV_final)] #ASV
colnames(OTU) <- str_replace_all(colnames(OTU), "-", "_")
dna <- Biostrings::DNAStringSet(ASV_final$ASV_seqs) #Sequences (Biostrings is imported with phyloseq)
names(dna) <- ASV_final$ASV_names

##ADD ASV NAMES
row.names(tax) <- ASV_final$ASV_names
row.names(OTU) <- ASV_final$ASV_names

#Check rownames are equal
identical(rownames(OTU), rownames(tax))
#> [1] TRUE
##Introduce phylogenetic tree
phy_tree <- phyloseq::read_tree("phy_tree")
unrooted_tree <- phy_tree
ape::is.rooted(unrooted_tree)
#> [1] FALSE

##Produce root (we need a root for distance calculation)
tree_root <- ape::root(unrooted_tree, 1, resolve.root = T)
tree_root
#>
#> Phylogenetic tree with 1035 tips and 1034 internal nodes.
#>
#> Tip labels:
#> ASV0449, ASV0153, ASV0306, ASV0086, ASV0347, ASV0343, ...
#> Node labels:
#> Root, , 0.867, 0.000, 0.789, 0.900, ...
#>
#> Rooted; includes branch lengths.
ape::is.rooted(tree_root)
#> [1] TRUE
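Here we rooted the tree on an arbitrary tip (taxon 1), which is enough for distance calculations. As an alternative sketch, phangorn offers midpoint rooting (assuming phangorn is installed; we store the result in a separate variable so the pipeline above is unchanged):
##Alternative: midpoint rooting with phangorn
tree_mid <- phangorn::midpoint(unrooted_tree)
ape::is.rooted(tree_mid)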
##CONVERT TO PHYLOSEQ FORMAT
phy_OTUtable <- otu_table(OTU, taxa_are_rows = T)
phy_taxonomy <- tax_table(as.matrix(tax))
phy_metadata <- sample_data(metadata.micro) #Change here metadata.micro with mt (your own metadata object)

#Put everything into a phyloseq object
loc_phyloseq <- phyloseq(phy_OTUtable, phy_taxonomy, phy_metadata, dna, tree_root) #Remove tree_root when working with ITS
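A quick sanity check of the assembled object can save trouble later; printing it summarizes its components, and the phyloseq accessors report its dimensions:
##Quick checks on the new phyloseq object
loc_phyloseq #prints a summary of all components
nsamples(loc_phyloseq) #number of samples
ntaxa(loc_phyloseq) #number of ASVs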
Now we will load the libraries needed to carry out the diversity analysis. Warning! If any of these packages are missing, install them with the install.packages() function.
##Load packages
library(phyloseq); packageVersion("phyloseq")
#> [1] '1.40.0'
library(Biostrings); packageVersion("Biostrings")
#> [1] '2.64.1'
library(GUniFrac); packageVersion("GUniFrac")
#> [1] '1.7'
library(phangorn); packageVersion("phangorn")
#> [1] '2.11.1'
library(vegan); packageVersion("vegan")
#> [1] '2.6.4'
library(pheatmap); packageVersion("pheatmap")
#> [1] '1.0.12'
library(colorspace); packageVersion("colorspace")
#> [1] '2.1.0'
\(\alpha\) diversity
In this section, we will study the diversity of our community in terms of richness, diversity and evenness. For this purpose, several indices will be used, such as the Shannon index, the inverse Simpson index and Pielou's index.
Data preparation
First, we will produce a table with \(\alpha\) diversity indices. To achieve this, we will rarefy the data to the number of sequences of the smallest sample. This is done to avoid a bias in the estimation of diversity caused only by variability in sequencing depth among samples.
Rarefaction curves should be produced first. This makes it easy to visualize whether the rarefaction process caused a great loss of information. This plot represents the number of ASVs (richness) against the number of sequences. Moreover, a vertical line will show the number of sequences at which normalization occurred. If this line falls in the asymptote of every rarefaction curve, we are safe: although we would have more sequences for every sample, we would not recover a significantly greater number of ASVs.
#### RARECURVE ####
ASV <- as.data.frame(t(otu_table(loc_phyloseq)))
sample_names <- rownames(ASV)

#Generate rarefaction curves
rarecurve <- rarecurve(ASV, step = 100, label = F)
The rarecurve function produces a simple plot. However, we can produce a nicer plot with ggplot2. This package was already loaded in the DADA2 section; if not, load it with library(ggplot2).
Our aim is to produce lines colored by location (or any other condition in your metadata table). To achieve this, we need to prepare our data. First of all, we will separate each rarefaction curve from rarecurve into a data frame and then add a column with sample names. For this step, we will use a function from the purrr package, which doesn't need installation as it is included in the tidyverse collection, installed in the first section of the tutorial.
#For each rarefaction curve, transform rarecurve output to a dataframe.
rarecurve_table <- lapply(rarecurve, function(x){
  b <- as.data.frame(x)
  b <- data.frame(ASV = b[,1], raw.read = rownames(b))
  b$raw.read <- as.numeric(gsub("N", "", b$raw.read))
  return(b)
})

#Produce a column with sample names and put everything in one data frame
names(rarecurve_table) <- sample_names
rarecurve_table <- purrr::map_dfr(rarecurve_table, function(x){ #for each rarecurve element, we produce a dataframe and bind them all together.
  z <- data.frame(x)
  return(z)
}, .id = "Sample")
Now, we will add a new column to rarecurve_table. In this column, we will have the grouping variable of interest (for example, location1) repeated as many times as necessary. Remember that, in rarecurve_table, sample names are repeated, as it holds the information of every rarefaction curve. If you want to adapt this to your data, you should change metadata.micro to your metadata variable name (for example, mt) and $location to your own variable of interest.
#To color lines according to group, let's create a new column
#Coloring
color <- NULL
for (i in (1:length(rarecurve_table$Sample))) {
  color <- rbind(color, metadata.micro$location[which(metadata.micro$samples==rarecurve_table$Sample[i])]) ##Change metadata.micro to user mt and location to variable of interest
}

#Bind this column
rarecurve_table_color <- cbind(rarecurve_table, color)
colnames(rarecurve_table_color) <- c(colnames(rarecurve_table), "Location") #Change Location to variable of interest
With all data prepared, it is time to use ggplot for the graphical representation. When using your own data, remember to change colour=Location to your variable of interest. You will find where to change the code next to each line, after a #.
## RARECURVE WITH GGPLOT ##
rareggplot <- ggplot(rarecurve_table_color, aes(x=raw.read, y=ASV, colour=Location, group=Sample)) + ### Change Location to your variable of interest; IMPORTANT: group by Sample
  theme_bw()+
geom_point(aes(colour=Location), size=0.01)+ #Change location to your variable of interest
geom_line(aes(colour=Location),size=0.5)+ ##Change location to your variable of interest, Change size for line thickness
geom_vline(aes(xintercept = min(sample_sums(loc_phyloseq))), lty=1, colour="black")+
scale_fill_manual(values = c("location1"= "#33CC00", "location2"= "#ffd624", "location3"="#80CC88"))+ #Change location1, location2 and location3 by the values of your variable of interest (for example, control and inoculated)
scale_color_manual(values = c("location1"= "#33CC00","location2"= "#ffd624", "location3"="blue"),#Change location1, location2 and location3 by the values of your variable of interest (for example, control and inoculated)
name="Location", #change Location to variable of interest
breaks=c("location1", "location2","location3"),#change to values of your variable of interest
labels=c("Location 1","Location 2","Location 3"))+#change to values of your variable of interest
labs(title="Rarefaction curves", x="Number of sequences", y="Number of ASV")+
guides(alpha="none")+
theme(legend.key=element_blank(),
legend.title.align = 0.85,
legend.title = element_text(face="bold",size=14),
axis.text = element_text(size=14),
axis.title = element_text(size = 16),
plot.title = element_text(hjust=0.5, face="bold", size=16),
legend.text = element_text(size = 12))
rareggplot
As we can see, the normalization depth falls within the asymptote of all rarefaction curves.
For saving plots on your machine, an easy and versatile option is to use ggsave from ggplot2. For publication, you can save them as .TIFF, but there are also other interesting formats. For example, .svg allows you to open the file in PowerPoint and easily modify it. If you have more knowledge of graphical design programs such as Adobe Illustrator, you can save it as .eps.
#tiff
ggsave(filename = "rarefaction_curve.tiff", plot = rareggplot,device = tiff(),width = 18, height = 16, units = "cm", dpi = 800)
#svg
library(svglite)
ggsave(filename = "rarefaction_curve.svg", plot = rareggplot,device = svg())
#eps
postscript("curve_wito_RE.1.10.eps")
rareggplotdev.off() # Close the device
Let’s continue! Now, we will compute α diversity indices. As everything worked well with rarefaction, we can now create a rarefied object from our phyloseq (loc_phyloseq).
#Rarefy phyloseq
set.seed(10403)
rarefaction <- rarefy_even_depth(loc_phyloseq, sample.size = min(sample_sums(loc_phyloseq)), rngseed = FALSE)
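As a quick check, every sample in the rarefied object should now have exactly the same number of sequences (the minimum depth used above):
##Check all samples were rarefied to the same depth
unique(sample_sums(rarefaction)) #should return a single value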
With this rarefied object, we will compute diversity indices using the estimate_richness function. Pielou's index is calculated manually from the Shannon index and observed richness.
#Estimate alpha indices and save them
alpha_diversity_rarefied <- estimate_richness(rarefaction, measures=c("InvSimpson", "Shannon", "Observed"))

## CALCULATE EVENNESS
Evenness_index <- as.data.frame(alpha_diversity_rarefied$Shannon/log(alpha_diversity_rarefied$Observed))
Evenness <- cbind(Evenness_index, rownames(alpha_diversity_rarefied))
colnames(Evenness) <- c("Evenness", "Samples")
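For intuition, Pielou's evenness is J = H'/ln(S), where H' is the Shannon index and S the observed richness. A toy calculation with illustrative numbers:
##Toy example of Pielou's evenness (numbers are illustrative only)
3.0/log(100) #Shannon = 3.0 over 100 observed ASVs gives J of about 0.65
#Values close to 1 indicate a perfectly even community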
In this step, it is worth mentioning that we have not used richness estimators (e.g., Chao1). These estimators rely on singletons. However, DADA2 does not produce singletons as ASVs; they can only appear as a result of the merging process (and are thus not real singletons for a richness estimator to rely on). For more details, check dada2 issue #92. In the code below, remember to change metadata.micro to your metadata variable if analysing your own data.
#Generate final table
alpha_table <- cbind(alpha_diversity_rarefied, Evenness$Evenness, metadata.micro) #change metadata.micro to user-provided metadata
colnames(alpha_table) <- c("Observed", "Shannon", "InvSimpson", "Evenness", colnames(metadata.micro)) #change metadata.micro to user-provided metadata
#Write final table on local machine
write.table(alpha_table, file="alpha_table.txt", sep="\t")
We have the final table with the diversity indices we want to study, but it is also interesting to produce a table ready for publication, i.e., with mean and standard deviation grouped by the variable of interest (in this case, location). Remember to change location to your variable of interest.
#### ALPHA PUB TABLE
#Select numeric columns and compute mean and SD
alpha_mean <- aggregate(alpha_table[,1:4], list(grouping=alpha_table$location), mean) %>% mutate_if(is.numeric, round, digits=2) #Change location to variable of interest
alpha_sd <- aggregate(alpha_table[,1:4], list(grouping=alpha_table$location), sd) %>% mutate_if(is.numeric, round, digits=2) #Change location to variable of interest

#Paste mean ± SD
mean_sd <- NULL
for (i in 2:5){
  mean_sd <- cbind(mean_sd, paste0(alpha_mean[,i], " +/- ", alpha_sd[,i]))}

#Generate table and give it column names
alpha_pub_table <- cbind(alpha_mean$grouping, mean_sd)
colnames(alpha_pub_table) <- c("Location", "Observed richness", "Shannon", "InvSimpson", "Evenness") #Change Location to variable of interest
#Write table on your local machine
write.table(alpha_pub_table, file="alpha_pub_table.txt", sep="\t")
Statistical analysis
Once diversity indices are calculated and stored in a data frame (alpha_table), we can make statistical comparisons between experimental conditions. In this tutorial, we are interested in differences among locations. Here we will introduce several functions from the micro4all package: levene.test.alpha, Shapiro, BalancedAnova, UnbalancedAnova, kruskal.wallis, Tukey.test, dunnT and wilcoxon.test. They all rely on other statistical functions (levene.test, shapiro.test, aov, kruskal.test, TukeyHSD, HSD, dunn.test and pairwise.wilcox.test) from packages such as car, stats, agricolae and dunn.test. The utility of the micro4all functions is that they loop over the different calculated indices. In this way, we can perform the desired test in one line for all indices and get the results in a single table, making the analysis easier and faster. It is strongly recommended to check the help page of each function with ?functionname. All statistical analyses will be explained and exemplified with the micro4all package, but a section will also be included at the end of this tutorial to show the statistical analysis without this package.
The basic arguments of all these functions are the same. First, we have to provide an α diversity table with indices and metadata variables (such as the one produced here). If you have not followed the tutorial with the example dataset, you can always check a similar table with View(micro4all::alpha_diversity_table), as well as the examples with ?functionOfInterest. Then, we have to give the number of indices to be tested and the name of the variable of interest. Specific details of each function will be provided in due course.
First, we will check the homogeneity of variances and the normality of the data, in order to choose a proper statistical method later on. We will use the levene.test.alpha and Shapiro functions from micro4all. The result of these functions is the same as that of their underlying functions (levene.test and shapiro.test), but for every index.
#### Apply levene and shapiro test ####
levene <- levene.test.alpha(alpha_table, 4, "location") #change location to your variable of interest
#> Warning in leveneTest.default(y = y, group = group, ...): group coerced to factor.
#> Warning in leveneTest.default(y = y, group = group, ...): group coerced to factor.
#> Warning in leveneTest.default(y = y, group = group, ...): group coerced to factor.
#> Warning in leveneTest.default(y = y, group = group, ...): group coerced to factor.
levene
#> Df F value Pr(>F) IndexColumn
#> group 2 0.4798205 0.6255148 Observed
#> 21 NA NA Observed
#> group1 2 0.2423080 0.7869784 Shannon
#> 1 21 NA NA Shannon
#> group2 2 0.5007044 0.6131581 InvSimpson
#> 2 21 NA NA InvSimpson
#> group3 2 0.3925887 0.6801599 Evenness
#> 3 21 NA NA Evenness
#Write table on local machine
write.table(levene, file="levene_test.txt", sep="\t")
shapiro <- Shapiro(alpha_table, 4, "location") #change location to your variable of interest
shapiro
#> p.value Index
#> shapiro 0.5529423 Observed
#> shapiro.1 0.8088936 Shannon
#> shapiro.2 0.4396685 InvSimpson
#> shapiro.3 0.9296042 Evenness
#Write table on local machine
write.table(shapiro, file="shapiro_test.txt", sep="\t")
None of them returns a significant value. That means that, according to these statistical tests, every diversity index from our data follows a normal distribution with no significant differences in variance (homoscedasticity). With this information, we can perform an ANOVA test. With the BalancedAnova function from micro4all, it is possible to run an ANOVA test for every index at once. BalancedAnova returns a list with two elements; the first element is a data frame with summary results for all indices.
#### Balanced anova ####
balanced_anova <- BalancedAnova(alpha_table, numberOfIndexes = 4, formula = "location") #change location to your variable of interest

balanced_anova[[1]]
#> Df Sum Sq Mean Sq F value Pr(>F) IndexColum
#> data[, formula] 2 6.190750e+03 3.095375e+03 1.125528 0.34328482 Observed
#> Residuals 21 5.775325e+04 2.750155e+03 NA NA Observed
#> data[, formula]1 2 2.739446e-01 1.369723e-01 1.242577 0.30900953 Shannon
#> Residuals 1 21 2.314882e+00 1.102325e-01 NA NA Shannon
#> data[, formula]2 2 2.837867e+02 1.418933e+02 3.693011 0.04223816 InvSimpson
#> Residuals 2 21 8.068648e+02 3.842213e+01 NA NA InvSimpson
#> data[, formula]3 2 3.978665e-03 1.989332e-03 1.155852 0.33402229 Evenness
#> Residuals 3 21 3.614302e-02 1.721096e-03 NA NA Evenness
#Write table on local machine
write.table(balanced_anova[[1]], file="balancedAnova_test.txt", sep="\t")
According to this ANOVA test, differences are found only for the inverse Simpson index, and the result is marginally significant.
Remember that, in case you have an unbalanced experiment, you can always use the function UnbalancedAnova in a similar way. Moreover, if you prefer a non-parametric test, the function kruskal.wallis is also available. These could be used as follows: kruskal.wallis(alpha_table, 4, "location") and UnbalancedAnova(alpha_table, 4, "location").
Now that we have an ANOVA result, imagine we want to test differences between every two groups, that is, location1 vs location2, location2 vs location3 and location1 vs location3. Let's do that, again for every index, with Tukey.test from micro4all.
#### TUKEY ####
tukey <- Tukey.test(alpha_table, 4, "location", balanced=TRUE) #change location to your variable of interest
tukey
#> diff lwr upr p adj IndexColumn
#> location2.location1 23.12500000 -42.96676903 89.21676903 0.65731496 Observed
#> location3.location1 -16.00000000 -82.09176903 50.09176903 0.81616965 Observed
#> location3.location2 -39.12500000 -105.21676903 26.96676903 0.31475156 Observed
#> location2.location1.1 0.23860162 -0.17982909 0.65703233 0.34061351 Shannon
#> location3.location1.1 0.02620648 -0.39222423 0.44463718 0.98636386 Shannon
#> location3.location2.1 -0.21239514 -0.63082585 0.20603556 0.42190599 Shannon
#> location2.location1.2 7.74026543 -0.07168602 15.55221688 0.05242611 InvSimpson
#> location3.location1.2 0.99328997 -6.81866147 8.80524142 0.94510063 InvSimpson
#> location3.location2.2 -6.74697545 -14.55892690 1.06497600 0.09867956 InvSimpson
#> location2.location1.3 0.03124359 -0.02104072 0.08352790 0.30831191 Evenness
#> location3.location1.3 0.01189647 -0.04038784 0.06418078 0.83555221 Evenness
#> location3.location2.3 -0.01934712 -0.07163144 0.03293719 0.62618013 Evenness
#Write table on local machine
write.table(tukey, file="tukey_test.txt", sep="\t")
As we can see, no pairwise comparison among groups is statistically significant. Another pairwise option is Dunn's test, which is less powerful but makes no assumptions about the data. This is done in a similar way (dunn <- dunnT(alpha_table, 4, "location")). For this function, the result is composed of two elements: dunn[1] contains all results from the dunn.test function and dunn[2] summarizes the results in a data frame.
When using the kruskal.wallis function, the pairwise test to be applied should not be Tukey's test, because it assumes a normal distribution of the data. Instead, it is recommended to use Dunn's test or the Mann–Whitney–Wilcoxon test. micro4all also implements a wilcoxon.test function, which would be used like this: wilcoxon <- wilcoxon.test(alpha_table, 4, "location", p.adjust.method = "BH")
Plot \(\alpha\) diversity indices
We will produce a ggplot graphic with the α diversity indices. It is necessary to have a properly formatted table for ggplot, which we can achieve with the pivot_longer function from tidyr.
#### PRODUCE A TABLE FOR ALPHA PLOT ####
alpha_plot_table <- tidyr::pivot_longer(data = alpha_table, names_to = "Measure", values_to = "Value", cols=c(Observed, Shannon, InvSimpson, Evenness))
In the next lines, remember to change location to your variable of interest and also its values in limits, labels and colors. For example, in scale_fill_manual(values=c("location1"=...)), your grouping variable could be health status; thus, colors should be assigned to healthy and infected. Furthermore, the order of the panels can be changed in the second line, specifically by changing the order of levels in facet_wrap.
### ALPHA GRAPHIC GGPLOT ###
alpha_graphic <- ggplot(data = alpha_plot_table, aes(x = location, y = Value)) + #Change location to variable of interest
  facet_wrap(~factor(Measure, levels=c("Observed", "InvSimpson","Shannon", "Evenness")), scale = "free") + #change levels order to reorder panels
geom_boxplot()+
theme(axis.text.x = element_text(size=13),
legend.position="bottom", strip.text = element_text(size = 20), axis.text.y = element_text(size=15), axis.title.y=element_text(size=20)) +
scale_x_discrete(limits=c("location1",
"location2","location3"), breaks=c("location1",
"location2","location3"), ##With breaks and labels you can change the name displayed on labels, #Change locations to values of variable of interest
labels=c("Location 1",#Change locations to values of variable of interest
"Location 2","Location 3")) +
aes(fill=location)+ scale_fill_manual(values = c( "location1"= "#33CC00",#Change locations to values of variable of interest
"location2"= "#ffd624", "location3"="blue"), na.translate=FALSE) +
theme(legend.key.size = unit(1, 'cm')) +
ylab("Alpha Diversity Measure")
alpha_graphic
#use ggsave for saving ggplots
ggsave("alpha_plot.tiff", plot = alpha_graphic,width = 17, height = 30, units = "cm", dpi = 800)
Both the statistical analysis and the graphic show no significant differences between groups. Yet there could be distinct groups when studying β diversity, i.e., when integrating abundance and taxonomic information.
\(\beta\) diversity analysis
Data preparation
A \(\beta\) diversity analysis allows us to study the structure of communities along an environmental gradient, e.g., to quantify differences between locations. Before computing these differences or distances, it is crucial to normalize the data. This is mainly needed to account for differences in library sizes between samples (sequencing depth), as well as to consider its compositionality, an inherent characteristic of high-throughput sequencing (HTS) data1. In this tutorial, normalization will be done with the edgeR package. When processing data with this package, variations in sequencing depth are automatically adjusted. Moreover, it implements the trimmed mean of M-values (TMM) method through the function calcNormFactors. This method will be used because our data always sum up to 1 (compositionality). Because of that, an ASV could be under-represented in a sample only because it was under-sampled due to other abundant ASVs consuming the library size, a situation that could lead to untrustworthy statistical differences.
We will install the package first:
#Install edgeR
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
::install("edgeR") BiocManager
And load the library:
#Load library
library(edgeR); packageVersion("edgeR")
#> [1] '3.38.4'
With DGEList we create a DGEList object needed for edgeR processing. With this object, calcNormFactors computes normalization factors to account for effective library size (compositionality). Then, we will extract the normalized counts with cpm.
#### NORMALIZE ACCORDING TO EDGER ####
edgeR <- DGEList(counts = OTU, samples = metadata.micro, genes = tax) #Change metadata.micro for your own mt object
edgeR <- calcNormFactors(edgeR)

##EXTRACT NORMALIZED COUNTS
ASV_norm <- cpm(edgeR, normalized.lib.sizes=T, log=F)
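To see what TMM actually did, you can inspect the per-sample library sizes and normalization factors stored in the DGEList object; cpm then scales each sample by the effective library size derived from both:
##Inspect TMM normalization factors
edgeR$samples[, c("lib.size", "norm.factors")] #one row per sample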
Once normalized counts are produced, we need to construct a normalized phyloseq object for further analysis. Remember again to change metadata.micro to your own mt file and to remove tree_root when working with ITS data.
##CREATE NORMALIZED PHYLOSEQ OBJECT
phy_OTU_norm <- otu_table(as.data.frame(ASV_norm, row.names=F), taxa_are_rows = T)
phy_taxonomy_norm <- tax_table(as.matrix(tax))
phy_metadata_norm <- sample_data(metadata.micro) #Change metadata.micro to your own mt file

##Add taxa names
taxa_names(phy_OTU_norm) <- taxa_names(phy_taxonomy_norm)
#Check
identical(rownames(ASV_norm), rownames(tax))
#> [1] TRUE
##Merge
norm_phyloseq <- phyloseq(phy_OTU_norm, phy_taxonomy_norm, phy_metadata_norm, tree_root) #Remove tree_root when working with ITS
Everything is now ready to analyze \(\beta\) diversity. A distance method will be needed to measure the dissimilarities between samples. There are several ways to compute these dissimilarities. We will mainly use Bray-Curtis, UniFrac and Weighted UniFrac distances. Bray-Curtis takes into account the members shared between samples and their abundances but ignores phylogeny, while UniFrac is based on shared phylogenetic branches. Weighted UniFrac is similar to UniFrac, but it also includes information about relative abundances. A very detailed and readable discussion of this topic can be found in O Paliy et al., 20162.
With these distance measures, a graphical representation can be obtained to visualize whether our samples group differently according to our variable of interest. In our case, we would like to know if location shapes bacterial communities differently. However, a statistical test is needed to confirm the results. For this purpose, we will use a permutational multivariate analysis of variance (PERMANOVA). The adonis function from vegan is wrapped in the Permanova function of micro4all with the aim of computing a PERMANOVA test with different distances at once. This function works with a normalized phyloseq object, a character vector with distances and a formula, where the variable of interest should be included. Once performed, permanova[[1]] returns a data frame with summary results from the adonis function. For a more detailed result, check permanova[[2]].
#### PERMANOVA ####
permanova <- Permanova(norm_phyloseq, distances = c("bray", "unifrac", "wunifrac"), formula = "location") #Change location to variable of interest
The formula argument can be used to analyze more than one factor at once, as shown in the sketch below. This can be done as follows: formula="variable1+variable2" (in the case of independent factors) or formula="variable1*variable2" (in the case of dependent factors).
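For instance, a hypothetical two-factor call could look as follows (here treatment is a placeholder column that does not exist in the example metadata):
#### PERMANOVA WITH TWO FACTORS (hypothetical example) ####
permanova2 <- Permanova(norm_phyloseq, distances = c("bray"), formula = "location+treatment") #treatment is a placeholder variable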
Next, let’s see the result from Permanova
:
##Permanova result
permanova[[1]]
#> Df SumOfSqs R2 F Pr(>F) distances
#> location 2 1.4040287 0.3004895 4.510496 0.001 bray
#> Residual 21 3.2684435 0.6995105 NA NA bray
#> Total 23 4.6724722 1.0000000 NA NA bray
#> location1 2 0.5468104 0.2156555 2.886975 0.005 unifrac
#> Residual1 21 1.9887628 0.7843445 NA NA unifrac
#> Total1 23 2.5355732 1.0000000 NA NA unifrac
#> location2 2 0.1343614 0.1647734 2.071439 0.025 wunifrac
#> Residual2 21 0.6810699 0.8352266 NA NA wunifrac
#> Total2 23 0.8154313 1.0000000 NA NA wunifrac
#Write table to local machine
write.table(permanova[[1]], file="permanova.txt",sep="\t")
From the table above, we can see that differences are statistically significant with every distance method used. Moreover, the Bray-Curtis distance seems to explain the most variance (expressed as R2). Nevertheless, another test is necessary to confirm that these significant results are not due to differences among group variances (dispersion of samples). Again, we can perform multiple betadisper tests (from the vegan package) with several distance measures thanks to the micro4all function Betadispersion. It works very similarly to Permanova, so do not forget to change location to your variable of interest.
#### BETADISPERSION ####
betadisper <- Betadispersion(loc_phyloseq, distances = c("bray", "unifrac", "wunifrac"), formula = "location") #change location to variable of interest

betadisper[1] #data frame with results for every distance method
#> [[1]]
#> Df Sum Sq Mean Sq F N.Perm Pr(>F) distances
#> Groups 2 0.009769604 0.004884802 0.7280460 999 0.482 bray
#> Residuals 21 0.140898839 0.006709469 NA NA NA bray
#> Groups1 2 0.009190666 0.004595333 0.8473928 999 0.460 unifrac
#> Residuals1 21 0.113881059 0.005422908 NA NA NA unifrac
#> Groups2 2 0.002940247 0.001470124 0.5231941 999 0.615 wunifrac
#> Residuals2 21 0.059007916 0.002809901 NA NA NA wunifrac
#write file to your local machine
write.table(betadisper[1], file="betadisper.txt", sep="\t")
In this example, dispersion was not statistically significantly different between the three groups, so we can be confident in the PERMANOVA results. Nonetheless, it would be interesting to know the pairwise differences between groups. We can run a pairwiseAdonis test to check for differences. Once more, micro4all gives us the option to loop over distances with PairwiseAdonisFun. The pval argument refers to the p-value threshold used to filter results.
#### PAIRWISE ADONIS ####
pairwise <- PairwiseAdonisFun(norm_phyloseq, distances = c("bray", "unifrac", "wunifrac"), formula = "location", pval=0.05) #Change location to your variable of interest

pairwise #data frame with all significant results
#> pairs Df SumsOfSqs F.Model R2 p.value p.adjusted sig distances
#> 1 location1 vs location2 1 0.3391022 2.076899 0.1291853 0.005 0.0050 * bray
#> 2 location1 vs location3 1 0.9248550 6.081317 0.3028346 0.001 0.0015 * bray
#> 3 location2 vs location3 1 0.8420859 5.555908 0.2841038 0.001 0.0015 * bray
#> 4 location1 vs location2 1 0.2696235 2.823642 0.1678377 0.029 0.0370 . unifrac
#> 5 location1 vs location3 1 0.2773369 2.785471 0.1659454 0.011 0.0330 . unifrac
#> 6 location2 vs location3 1 0.2732552 3.068367 0.1797692 0.037 0.0370 . unifrac
#write file to your local machine
write.table(pairwise, file="pairwise.txt", sep="\t")
According to the statistical analysis used, with Bray-Curtis distances statistically significant differences are found among all locations.
For visualization, we can use PCoA and NMDS representations with every measure. While PCoA produces artificial variables to explain the variance among samples, NMDS tries to represent all the variance in two axes. In the code below, we will see how to use a loop to produce every possible plot, combining PCoA, NMDS and the three distances. Every plot will be assigned to a variable called plot_list. Then, calling plot_list[1] will give you the first plot. If you want to use more types of distances, modify var_list. Pay attention to the comments (#) to see where changes should be applied when using your own data. To save the plots, you can choose between the options detailed at the end of the Data preparation section in \(\alpha\) diversity.
#### BETA DIVERSITY PLOTS ####
<- c("bray", "unifrac", "wunifrac")
var_list <- c("PCoA", "NMDS")
plot_type <- purrr::cross2(plot_type,var_list )
combination_plot
# Make plots.
= list()
plot_list
for (i in 1:length(combination_plot)){
  pcoa <- ordinate(norm_phyloseq, combination_plot[[i]][[1]], combination_plot[[i]][[2]]) #get distance matrix
  plot <- plot_ordination(norm_phyloseq, pcoa, type="samples", color="location")+ #change location to variable of interest (e.g., health status)
    geom_text(aes(label=location), hjust=0, vjust=0, show.legend=FALSE)+ #change location to variable of interest (e.g., health status)
    geom_point(size=4)
  if (combination_plot[[i]][[1]]=="NMDS"){ ## graphic details for NMDS
    plot <- plot + xlab(paste("Axis 1"))+
      ylab(paste("Axis 2"))+
      theme(legend.position="bottom", plot.title = element_text(face="bold", hjust = 0.5), legend.title = element_blank(), legend.key = element_blank(),
            panel.border = element_rect(linetype = "solid", fill = NA),
            panel.background = element_rect(fill="white", colour="white"),
            panel.grid.major = element_line(colour="aliceblue", size=0.4),
            panel.grid.minor= element_line(colour = "aliceblue", size =0.4))+
      scale_color_manual(values=c("location1"= "#33CC00",
                                  "location2"= "#ffd624", "location3"="blue"))+ #change locations to values of variable of interest (e.g., healthy, infected)
      guides(color=guide_legend(nrow=2,byrow=TRUE))+
      guides(shape=guide_legend(nrow=2,byrow=TRUE))+
      theme(plot.title = element_text(face="bold", hjust = 0.5))+
      ggtitle(paste("Bacterial Community", combination_plot[[i]][[1]], "on", combination_plot[[i]][[2]], "distances", round(pcoa$stress, digits = 3)))
    plot_list[[i]] <- plot
  } else { ## graphic details for PCoA
    plot <- plot + xlab(paste("PCo 1", paste("(",round(pcoa$values[1,2]*100,1),"%",")",sep=""),sep=" "))+
      ylab(paste("PCo 2", paste("(",round(pcoa$values[2,2]*100,1),"%",")",sep=""),sep=" "))+
      theme(legend.position="bottom", plot.title = element_text(face="bold", hjust = 0.5), legend.title = element_blank(), legend.key = element_blank(),
            panel.border = element_rect(linetype = "solid", fill = NA),
            panel.background = element_rect(fill="white", colour="white"),
            panel.grid.major = element_line(colour="aliceblue", size=0.4),
            panel.grid.minor= element_line(colour = "aliceblue", size =0.4))+
      scale_color_manual(values=c("location1"= "#33CC00",
                                  "location2"= "#ffd624", "location3"="blue"))+ #change locations to values of variable of interest (e.g., healthy, infected)
      guides(color=guide_legend(nrow=2,byrow=TRUE))+
      guides(shape=guide_legend(nrow=2,byrow=TRUE))+
      theme(plot.title = element_text(face="bold", hjust = 0.5))+
      ggtitle(paste("Bacterial Community", combination_plot[[i]][[1]], "on", combination_plot[[i]][[2]], "distances"))
    plot_list[[i]] <- plot
  }
}
#Visualize plot
plot_list[[1]]
#Save plot on local machine
ggsave("bray-curtis_pcoa.tiff", plot = plot_list[[1]],width = 17, height = 30, units = "cm", dpi = 800 )
As shown by PERMANOVA, there are differences between all groups (locations), and location3 samples seem to differ more from the other groups (i.e., the distance is greater).
If you want to produce only one plot, you can change var_list and plot_type or use the code below, which shows how to do an NMDS and a PCoA separately. Pay attention to the comments (#) in order to adapt the code to your own data:
####PCOA####
#Calculate distances
<- ordinate(norm_phyloseq,"PCoA","bray") #get distance matrix, change bray to choosen distance
pcoa
#Produce plot
<- plot_ordination(norm_phyloseq, pcoa, type="samples", color="location")+ #change location to variable of interest
pcoa_bray geom_text(aes(label=location), hjust=0, vjust=0, show.legend=FALSE)+#change location to variable of interest
geom_point(size=4) +
xlab(paste("PCo 1", paste("(",round(pcoa$values[1,2]*100,1),"%",")",sep=""),sep=" "))+
ylab(paste("PCo 2", paste("(",round(pcoa$values[2,2]*100,1),"%",")",sep=""),sep=" "))+
theme(legend.position="bottom", plot.title = element_text(face="bold", hjust = 0.5), legend.title = element_blank(), legend.key = element_blank(),
panel.border = element_rect(linetype = "solid", fill = NA),
panel.background = element_rect(fill="white", colour="white"),
panel.grid.major = element_line(colour="aliceblue", size=0.4),
panel.grid.minor= element_line(colour = "aliceblue", size =0.4))+
scale_color_manual(values=c("location1"= "#33CC00",
"location2"= "#ffd624", "location3"="blue"))+#change locations to values of variable of interest (eg. healthy, infected)
guides(color=guide_legend(nrow=2,byrow=TRUE))+
guides(shape=guide_legend(nrow=2,byrow=TRUE))+
theme(plot.title = element_text(face="bold", hjust = 0.5))+
ggtitle("Bacterial Community PCoA on Bray Curtis distances") #change title according to distance used
#Visualize plot
pcoa_bray
#Save it to local machine
ggsave("bray-curtis_pcoa.tiff", plot = pcoa_bray,width = 17, height = 30, units = "cm", dpi = 800 )
Let’s see how it works for an NMDS!
#### NMDS####
#Calculate distances
<- ordinate(norm_phyloseq,"NMDS","bray") #get distance matrix, change bray to choosen distance
nmds #> Square root transformation
#> Wisconsin double standardization
#> Run 0 stress 0.1895576
#> Run 1 stress 0.2136299
#> Run 2 stress 0.1907108
#> Run 3 stress 0.203114
#> Run 4 stress 0.1907108
#> Run 5 stress 0.1925871
#> Run 6 stress 0.2273544
#> Run 7 stress 0.1889755
#> ... New best solution
#> ... Procrustes: rmse 0.05124936 max resid 0.2188175
#> Run 8 stress 0.2846452
#> Run 9 stress 0.1904783
#> Run 10 stress 0.217401
#> Run 11 stress 0.1889754
#> ... New best solution
#> ... Procrustes: rmse 6.24282e-05 max resid 0.0001315906
#> ... Similar to previous best
#> Run 12 stress 0.1907108
#> Run 13 stress 0.1925871
#> Run 14 stress 0.203114
#> Run 15 stress 0.1925872
#> Run 16 stress 0.1889755
#> ... Procrustes: rmse 0.0001726498 max resid 0.0005084086
#> ... Similar to previous best
#> Run 17 stress 0.2277738
#> Run 18 stress 0.1889755
#> ... Procrustes: rmse 0.0002819056 max resid 0.0007794508
#> ... Similar to previous best
#> Run 19 stress 0.1895576
#> Run 20 stress 0.2647637
#> *** Best solution repeated 3 times
#Produce plot
nmds_bray <- plot_ordination(norm_phyloseq, nmds, type="samples", color="location")+ #change location to variable of interest (e.g., health status)
  geom_text(aes(label=location), hjust=0, vjust=0, show.legend=FALSE)+ #change location to variable of interest
geom_point(size=4)+
xlab(paste("Axis 1"))+
ylab(paste("Axis 2"))+
theme(legend.position="bottom", plot.title = element_text(face="bold", hjust = 0.5), legend.title = element_blank(), legend.key = element_blank(),
panel.border = element_rect(linetype = "solid", fill = NA),
panel.background = element_rect(fill="white", colour="white"),
panel.grid.major = element_line(colour="aliceblue", size=0.4),
panel.grid.minor= element_line(colour = "aliceblue", size =0.4))+
scale_color_manual(values=c("location1"= "#33CC00",
"location2"= "#ffd624", "location3"="blue"))+ #change locations to values of variable of interest (eg. healthy, infected)
guides(color=guide_legend(nrow=2,byrow=TRUE))+
guides(shape=guide_legend(nrow=2,byrow=TRUE))+
theme(plot.title = element_text(face="bold", hjust = 0.5))+
ggtitle(paste("Bacterial Community NMDS on Bray-Curtis distances", round(nmds$stress, digits = 3))) #change title according to distance used
#Visualize plot
nmds_bray
#Save it to local machine
ggsave("bray-curtis_nmds.tiff", plot = nmds_bray,width = 17, height = 30, units = "cm", dpi = 800 )
As mentioned before, PCoA produces artificial variables that are represented as axes. Each of them explains a certain amount of variance. Going deeper into the analysis, we can get a bar plot with the variance explained by every axis. Moreover, we will use length to see how many axes were needed for the PCoA ordination.
#### ORDINATE PHYLOSEQ OBJECT ####
<- ordinate(norm_phyloseq, "PCoA", "bray", formula="location") #Change to your variable of interest
PCOA_bray #Get number of axes
length(PCOA_bray$values$Relative_eig)
#> [1] 23
#Get plot
plot_scree(PCOA_bray)
And with the next line, we can get the percentage of variance explained by the number of axes we choose. The number of axes needed to explain a certain amount of variance can give us an idea of the multidimensionality of our data.
##Get explained variance in the first 10 axes
sum(PCOA_bray$values$Relative_eig[1:10]) #change 1:10 to select the number of axes
#> [1] 0.833356
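A cumulative view can be handy too; this sketch uses cumsum on the same eigenvalues to find how many axes are needed for a given fraction of the variance:
##Cumulative variance explained across axes
cum_var <- cumsum(PCOA_bray$values$Relative_eig)
which(cum_var >= 0.8)[1] #number of axes needed to reach 80% of the variance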
Produce tables with relative abundance
With the α diversity and β diversity analyses, we got a global perspective of the structure and diversity of the microbial communities. Nevertheless, it is also interesting to get information about the taxonomic profile, i.e., which phyla and genera are more abundant in our samples. Along these lines, several analyses can be performed. We will start by producing and saving tables with relative abundances at different levels, using our phyloseq object without normalization (loc_phyloseq). For example, we can produce a table with the relative abundance of each ASV in each sample. Moreover, we can group this by location, thus obtaining ASV relative abundances for each location. This can be done at genus and phylum level as well. We will start at ASV and sample level. First, with the function transform_sample_counts, our phyloseq object (loc_phyloseq) will be converted to relative abundances. Then, each element of this phyloseq object will be extracted and bound into a table.
##CALCULATE RELATIVE ABUNDANCE, ASV, SAMPLES
sample_relabun_ASV <- transform_sample_counts(loc_phyloseq, function(x){x/sum(x)}*100)

##CONSTRUCT THE TABLE
OTU_sample <- as.data.frame((otu_table(sample_relabun_ASV)))
taxonomy_sample <- as.data.frame(tax_table(sample_relabun_ASV))
identical(rownames(OTU_sample), rownames(taxonomy_sample))
#> [1] TRUE
sample_relabun_ASV_table <- cbind(taxonomy_sample, OTU_sample)
#SORT IT
sample_relabun_ASV_table <- sample_relabun_ASV_table[sort(sample_relabun_ASV_table$ASV_names, decreasing = FALSE),]
With colSums we should check whether summing the ASV abundances for every sample (i.e., each column) gives a result of 100.
#Check relative abundance sums 100
colSums(sample_relabun_ASV_table[,8:ncol(sample_relabun_ASV_table)])
#> L1_1 L1_2 L1_3 L1_4 L1_5 L1_6 L1_7 L1_8 L2_1 L2_2 L2_3 L2_4 L2_5 L2_6 L2_7 L2_8 L3_1 L3_2 L3_3
#> 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100
#> L3_4 L3_5 L3_6 L3_7 L3_8
#> 100 100 100 100 100
We can have a look at the beginning of the table and save it on our local machine as follows:
##See table
head(sample_relabun_ASV_table)
#> Kingdom Phylum Class Order Family
#> ASV0002 Bacteria Actinobacteria Actinobacteria Streptomycetales Streptomycetaceae
#> ASV0003 Bacteria Proteobacteria unclassified unclassified unclassified
#> ASV0006 Bacteria Actinobacteria Actinobacteria Pseudonocardiales Pseudonocardiaceae
#> ASV0007 Bacteria Actinobacteria Actinobacteria Pseudonocardiales Pseudonocardiaceae
#> ASV0008 Bacteria Actinobacteria Actinobacteria Pseudonocardiales Pseudonocardiaceae
#> ASV0009 Bacteria Actinobacteria Actinobacteria Micrococcales Promicromonosporaceae
#> Genus ASV_names L1_1 L1_2 L1_3 L1_4 L1_5
#> ASV0002 Streptomyces ASV0002 13.8889992 4.937848 10.893415 22.1990687 9.5526244
#> ASV0003 unclassified ASV0003 3.8330224 27.683538 26.364099 15.4691786 10.6119119
#> ASV0006 unclassified ASV0006 2.5943864 6.496807 7.246131 10.9847116 19.6124558
#> ASV0007 Pseudonocardia ASV0007 1.3855254 3.001168 2.136059 0.9440983 0.4732119
#> ASV0008 unclassified ASV0008 11.3065227 2.218254 2.586520 1.1120798 7.1675279
#> ASV0009 Promicromonospora ASV0009 0.1905594 1.504018 2.310431 1.8605541 0.1060647
#> L1_6 L1_7 L1_8 L2_1 L2_2 L2_3 L2_4 L2_5
#> ASV0002 6.9848271 20.4381884 5.620269 7.6930658 7.2420046 11.2892520 3.0031259 14.4508131
#> ASV0003 24.8232220 10.3801344 5.665003 10.5471676 3.6577360 22.6933878 7.1731339 1.5160285
#> ASV0006 6.2825092 5.3759593 1.043176 6.6385461 4.5674805 7.4010722 7.1764071 1.9043707
#> ASV0007 1.8744457 2.3142073 0.925081 2.0439942 0.3016851 0.5514424 1.8673388 0.6254551
#> ASV0008 4.2882140 3.0677164 6.514932 14.3000749 12.2581049 7.5312739 1.3452695 4.4099251
#> ASV0009 0.4674129 0.3419619 3.689588 0.2030197 7.7359552 0.1582844 0.5007937 7.1880659
#> L2_6 L2_7 L2_8 L3_1 L3_2 L3_3 L3_4 L3_5
#> ASV0002 12.829278 8.654647 9.041784 13.6220214 5.024374 17.70502153 7.8901345 8.995795
#> ASV0003 0.921740 9.530792 9.079588 8.5899753 11.863936 10.48276173 11.3430493 3.270292
#> ASV0006 1.774789 6.913218 5.461540 0.3648316 1.406825 0.62233647 0.7892377 1.592300
#> ASV0007 1.442515 4.163499 3.624719 14.8167625 16.529120 5.98209655 9.9327354 4.862592
#> ASV0008 8.960207 1.828319 6.101981 0.0000000 0.000000 0.00000000 0.0000000 0.000000
#> ASV0009 2.551159 1.178451 1.461006 7.1783073 1.233644 0.03156779 1.8609865 21.526934
#> L3_6 L3_7 L3_8
#> ASV0002 19.96646697 19.7996690 5.78009910
#> ASV0003 14.42738258 9.9207386 21.88830856
#> ASV0006 2.58040771 0.1428447 0.82764162
#> ASV0007 16.93827060 3.8899051 13.41368689
#> ASV0008 0.00000000 0.0000000 0.00000000
#> ASV0009 0.01840227 0.5853149 0.03214142
#Save ASV sample table on your local machine
write.table(sample_relabun_ASV_table, file="ASV_sample_relabun.txt", sep="\t")
So, how could we do this at genus and phylum level? First, the phyloseq object has to be grouped at the desired taxonomic level with tax_glom; then we follow the same steps as before.
### GET TABLE BY SAMPLE AT GENUS LEVEL ###
#Glom phyloseq
loc_phyloseq_genus <- tax_glom(loc_phyloseq, taxrank = "Genus") #Change Genus to the desired taxonomic rank
sample_relabun_genus <- transform_sample_counts(loc_phyloseq_genus, function(x){x/sum(x)}*100)

##Extract elements from phyloseq
OTU_sample_genus <- as.data.frame((otu_table(sample_relabun_genus)))
taxonomy_sample_genus <- as.data.frame(tax_table(sample_relabun_genus)[,-7])
identical(rownames(OTU_sample_genus), rownames(taxonomy_sample_genus))
#> [1] TRUE
#Bind table
sample_table_genus <- cbind(taxonomy_sample_genus, OTU_sample_genus)
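For reference, the same two opening steps at phylum level would look like this (a minimal sketch; note that the column indices used in the following steps shift, as the glommed taxonomy keeps fewer ranks):
##Sketch of the same steps at phylum level
loc_phyloseq_phylum <- tax_glom(loc_phyloseq, taxrank = "Phylum")
sample_relabun_phylum <- transform_sample_counts(loc_phyloseq_phylum, function(x){x/sum(x)}*100)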
Furthermore, if we are not interested in the taxonomic profile of sequences unclassified at genus level, we can group unclassified genera together. First, we save the unclassified genera in a new variable (sample_table_genus_unclass) where we sum up their abundances. Then we bind sample_table_genus without unclassifieds to this new variable:
##Agglomerate unclassified in otu table
#Get sum of unclassified sequences
sample_table_genus_unclass <- sample_table_genus %>% subset(Genus=="unclassified", select=c(7:ncol(sample_table_genus))) %>%
  colSums() %>% t() %>% as.data.frame() %>% cbind(Kingdom="unclassified", Phylum="unclassified", Class="unclassified",
                                                  Order="unclassified", Family="unclassified", Genus="unclassified", .)
#Bind sample_table_genus without unclassifieds with sample_table_genus_unclass
sample_table_genus_final <- rbind(subset(sample_table_genus, Genus!="unclassified"), sample_table_genus_unclass)

#Check relative abundance sums 100
colSums(sample_table_genus_final[,8:ncol(sample_table_genus_final)])
#> L1_2 L1_3 L1_4 L1_5 L1_6 L1_7 L1_8 L2_1 L2_2 L2_3 L2_4 L2_5 L2_6 L2_7 L2_8 L3_1 L3_2 L3_3 L3_4
#> 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100
#> L3_5 L3_6 L3_7 L3_8
#> 100 100 100 100
#Sort table from most to least abundant genera (calculated as sum of abundance between samples)
sample_table_genus_final <- sample_table_genus_final[order(rowSums(sample_table_genus_final[,7:ncol(sample_table_genus_final)]), decreasing=TRUE),]
#See first lines of this table
head(sample_table_genus_final)
#> Kingdom Phylum Class Order Family
#> 1 unclassified unclassified unclassified unclassified unclassified
#> ASV0002 Bacteria Actinobacteria Actinobacteria Streptomycetales Streptomycetaceae
#> ASV0007 Bacteria Actinobacteria Actinobacteria Pseudonocardiales Pseudonocardiaceae
#> ASV0010 Bacteria Actinobacteria Actinobacteria Pseudonocardiales Pseudonocardiaceae
#> ASV0009 Bacteria Actinobacteria Actinobacteria Micrococcales Promicromonosporaceae
#> ASV0022 Bacteria Proteobacteria Alphaproteobacteria Rhizobiales Rhizobiaceae
#> Genus L1_1 L1_2 L1_3 L1_4 L1_5 L1_6
#> 1 unclassified 27.8534281 47.8401209 52.6289022 42.899062 50.3290726 53.4049234
#> ASV0002 Streptomyces 24.5027591 13.3266946 15.1413141 28.718450 10.8362796 10.6258539
#> ASV0007 Pseudonocardia 6.9494621 4.7215164 8.2560364 4.986285 8.2186565 6.6037057
#> ASV0010 Amycolatopsis 7.5211402 2.8123068 5.9238091 5.628442 5.3848246 8.7346293
#> ASV0009 Promicromonospora 0.2679741 1.5040176 2.4508973 1.860554 0.1060647 0.4674129
#> ASV0022 Neorhizobium 0.4069237 0.1201841 0.2567145 1.599013 0.1033451 0.4698099
#> L1_7 L1_8 L2_1 L2_2 L2_3 L2_4 L2_5 L2_6
#> 1 38.2838284 36.6489524 56.2679860 39.491043 49.1881542 41.2663863 19.876403 33.165067
#> ASV0002 29.6135035 11.0705531 9.3605866 14.072905 18.6341588 5.4039900 23.175445 13.396380
#> ASV0007 5.3998171 7.6189454 9.5892301 4.206396 5.5450600 3.7559531 4.025317 6.603939
#> ASV0010 5.1254523 7.5634763 11.6923562 11.939225 11.5905029 7.6575618 7.787383 11.818080
#> ASV0009 0.4731798 4.3910033 0.3666180 7.735955 0.1582844 0.5007937 7.188066 2.551159
#> ASV0022 0.3280449 0.3453397 0.3409942 4.917623 0.4952770 4.8377330 5.132466 6.952188
#> L2_7 L2_8 L3_1 L3_2 L3_3 L3_4 L3_5 L3_6
#> 1 38.769415 47.3192644 18.7625308 30.366459 34.09772486 27.253363 17.798282 24.87169526
#> ASV0002 13.127693 10.1047388 21.5250616 7.904302 23.50222102 10.071749 10.554216 24.77968389
#> ASV0007 9.020311 8.9928617 22.2925226 23.293851 11.42528580 16.302691 9.041631 25.36855665
#> ASV0010 3.939032 7.2983611 2.7198028 9.501411 4.71713004 5.627803 4.430140 5.16899421
#> ASV0009 1.178451 1.4610065 7.1783073 1.233644 0.03156779 1.860987 21.526934 0.01840227
#> ASV0022 2.463705 0.7560764 0.9235826 1.026255 0.11048727 1.004484 1.377070 1.25748870
#> L3_7 L3_8
#> 1 19.3136486 43.26771126
#> ASV0002 31.9118544 8.33266372
#> ASV0007 6.3827193 22.18829517
#> ASV0010 2.5067503 13.26101513
#> ASV0009 0.5853149 0.03214142
#> ASV0022 2.2715791 0.16606401
#Save to your local machine
write.table(sample_table_genus_final, file="sample_table_genus.txt", sep="\t")
If you want to keep the table with all unclassified taxa in individual rows, you can always use write.table with sample_table_genus.
The tables produced above give us information about relative abundance in each sample. However, it is very useful to group these tables by the variable of interest, that is, to know the mean abundance of the members of a taxonomic rank in each location, with standard deviations. First, we will save the genus abundances and the taxonomy from sample_table_genus_final in two variables. Then, mean and SD will be computed according to the grouping variable of interest. For your own data, it is crucial to substitute metadata.micro with your own metadata variable (e.g., mt) and location with your metadata variable of interest.
### GROUP GENUS TABLE BY LOCATION
#Save genus abundances
otu_genus <- sample_table_genus_final[,7:ncol(sample_table_genus_final)] %>% t() %>% as.data.frame()
#Save taxonomy data
tax_genus <- sample_table_genus_final[,1:6]

#Calculate OTU mean abundance based on grouping factor (e.g., location)
location_mean_genus <- aggregate(otu_genus, by=list(metadata.micro$location), FUN=mean) %>% column_to_rownames("Group.1") %>% t() #Change metadata.micro and location to your metadata and variable of interest
#Calculate OTU SD based on grouping factor (e.g., location) and change colnames
location_SD_genus <- aggregate(otu_genus, by=list(metadata.micro$location), FUN=sd) %>% column_to_rownames("Group.1") %>% t() %>% #Change metadata.micro and location to your metadata and variable of interest
  as.data.frame() %>% rename_with(.fn= ~paste0(colnames(location_mean_genus), "SD"))
#Merge mean abundance, SD and taxonomy.
genus_location_table <- merge(tax_genus, location_mean_genus, by=0) %>% column_to_rownames("Row.names") %>%
  merge(location_SD_genus, by=0) %>% column_to_rownames("Row.names")
#Check abundances sum 100
colSums(genus_location_table[,7:ncol(genus_location_table)])
#> location1 location2 location3 location1SD location2SD location3SD
#> 100.00000 100.00000 100.00000 51.40124 59.14292 66.73249
#Sort table from most to least abundant genera (calculated as sum of abundance between samples)
genus_location_table <- genus_location_table[order(rowSums(genus_location_table[,7:ncol(genus_location_table)]), decreasing=TRUE),]
#View table
head(genus_location_table)
#> Kingdom Phylum Class Order Family
#> 1 unclassified unclassified unclassified unclassified unclassified
#> ASV0002 Bacteria Actinobacteria Actinobacteria Streptomycetales Streptomycetaceae
#> ASV0007 Bacteria Actinobacteria Actinobacteria Pseudonocardiales Pseudonocardiaceae
#> ASV0010 Bacteria Actinobacteria Actinobacteria Pseudonocardiales Pseudonocardiaceae
#> ASV0009 Bacteria Actinobacteria Actinobacteria Micrococcales Promicromonosporaceae
#> ASV0023 Bacteria Actinobacteria Actinobacteria Micromonosporales Micromonosporaceae
#> Genus location1 location2 location3 location1SD location2SD location3SD
#> 1 unclassified 43.736036 40.6679647 26.966427 8.990417 11.046165 8.803745
#> ASV0002 Streptomyces 17.979426 13.4094871 17.322719 8.244194 5.536809 9.197919
#> ASV0007 Pseudonocardia 6.594303 6.4673835 17.036944 1.419046 2.447691 7.291772
#> ASV0010 Amycolatopsis 6.086760 9.2153128 5.991631 1.837111 2.974914 3.642673
#> ASV0009 Promicromonospora 1.440138 2.6425417 4.058412 1.461164 3.072715 7.449846
#> ASV0023 Micromonospora 1.409967 0.8506067 2.076555 2.588402 1.083355 2.368180
#Save table on your local machine
write.table(genus_location_table, file="genus_abund_location.txt", sep="\t")
If you want to do this at other taxonomic levels, you should simply change sample_table_genus_final to the corresponding table, e.g., for ASV level it would be sample_relabun_ASV_table. Remember to change metadata.micro to your metadata file and location to your variable of interest:
#Save OTU data
otu_location_ASV <- sample_relabun_ASV_table[,8:ncol(sample_relabun_ASV_table)] %>% t() %>% as.data.frame()
#Save taxonomy data
tax_location_ASV <- sample_relabun_ASV_table[,1:8]

#Calculate OTU mean abundance based on grouping factor (e.g., location)
location_ASV_mean <- aggregate(otu_location_ASV, by=list(metadata.micro$location), FUN=mean) %>% column_to_rownames("Group.1") %>% t() #Change metadata.micro and location to your metadata and variable of interest
#Calculate OTU SD based on grouping factor (e.g., location) and change colnames
location_ASV_SD <- aggregate(otu_location_ASV, by=list(metadata.micro$location), FUN=sd) %>% column_to_rownames("Group.1") %>% t() %>% #Change metadata.micro and location to your metadata and variable of interest
  as.data.frame() %>% rename_with(.fn= ~paste0(colnames(location_ASV_mean), "SD"))
#Merge mean abundance, SD and taxonomy.
ASV_location_table <- merge(tax_location_ASV, location_ASV_mean, by=0) %>% column_to_rownames("Row.names") %>%
  merge(location_ASV_SD, by=0) %>% column_to_rownames("Row.names")
#Check abundances sum 100
colSums(ASV_location_table[,8:ncol(ASV_location_table)])
#> L1_1 location1 location2 location3 location1SD location2SD location3SD
#> 100.00000 100.00000 100.00000 100.00000 111.04717 115.16294 98.63739
#Save to local machine
write.table(ASV_location_table, file="ASV_abund_location.txt", sep="\t")
Taxonomic profile graphics
This section will cover the graphical representation of taxonomic profiles at genus level. This kind of graphic can be interesting to give a visual idea of the main taxonomic groups present in our communities. It will be exemplified at genus level, but it can easily be switched to phylum level; the specific changes are explained after the comments (#) (check also the previous section, which explains how to produce the table).
First, we prepare a table for plotting. We create a column with the mean abundance of each taxon across conditions. With this column, we can sort the table and filter the number of taxa to plot. In our case, we usually represent the 20 most abundant genera. For your own data, check the number of columns first, as it could differ depending on the number of samples or experimental settings that you have (check your table on your local machine or in RStudio with View(genus_location_table)).
###TAXONOMICAL PROFILE GENUS LEVEL###
#Format table.
#Add a column with the mean abundance between conditions and arrange it in decreasing order.
sample_table_genus_sort <- genus_location_table %>% mutate(mean=rowMeans(genus_location_table[,7:ncol(genus_location_table)])) %>% arrange(desc(mean))

#We have to use pivot_longer to get a table in the right format for ggplot
#We filter out unclassified
taxonomic_genus_location <- sample_table_genus_sort[1:21,] %>% filter(Genus!="unclassified") %>% dplyr::select(6:9) %>% pivot_longer(!Genus, names_to="Samples", values_to="Abundance") ##we take the first 21 rows because the unclassified group is the most abundant and is filtered out
The data is ready, but we still need to arrange colors and legend labels. We need to produce an automated expression for ggplot in order to show genus and phylum names in italics. We will use the package gdata. Let's install it and load the library:
#Install gdata
install.packages("gdata")
#Load library
library(gdata);packageVersion("gdata")
#> [1] '2.18.0.1'
Now, we will produce the legend labels for genera in italics. First, we get the location names and the genus names. With sapply, we will build an expression of the form italic('genus name') for every genus. If plotting at phylum level, make the proper changes (pay attention to the # comments).
#Get labels for location
location_label <- levels(as.factor(unique(taxonomic_genus_location$Samples)))

#Get unique names for genera
unique_genera <- unique(as.vector(taxonomic_genus_location$Genus)) #Change Genus to Phylum when needed

#REORDER FACTOR LEVELS IN DATA FRAME
taxonomic_genus_location$Genus = reorder.factor(taxonomic_genus_location$Genus, new.order=rev(unique_genera)) #Change Genus to Phylum when needed

sorted_labels_genus <- as.data.frame(unique_genera)

##CREATE AN EXPRESSION FOR GGPLOT WITH ITALICS WHEN NEEDED.
sorted_labels_ggplot <- sapply(sorted_labels_genus$unique_genera,
                               function(x) parse(text = paste0("italic('", as.character(x), "')")))
When a great number of objects have to be colored, the randomcoloR package can be very useful. It allows us to produce any number of random colors, which can be modified later to avoid repeats. You can combine it with the scales package to display the randomly generated colors, as well as their codes.
#Install randomcoloR
install.packages("randomcoloR")
#Install scales
install.packages("scales")
#Load libraries
library(randomcoloR);packageVersion("randomcoloR")
#> [1] '1.1.0.1'
library(scales);packageVersion("scales")
#> [1] '1.2.1'
With randomColor you can produce a number of random colors. The length will be set by length(unique_genera), which tells us how many different genera there are.
#Produce random color and visualize them
#Get number of genera
length(unique_genera) #This will be the number of genera to color
#> [1] 20
colors_genus <- randomColor(count=length(unique_genera), hue=c("random"))
show_col(colors_genus)
If you want to change some of these colors, it can be easily done as follows:
#Modify colors
5] <- "#448800" colors_genus[
In this tutorial, we will use the following palette, because it has been previously optimized to include very distinctive colors.
#Produce a palette of colours
c24 <- c(
  "dodgerblue2", "#E31A1C", # red
"green4",
"#6A3D9A", # purple
"pink", # orange
"black", "gold1",
"skyblue2", "#FB9A99", # lt pink
"#CAB2D6", # lt purple
"#FDBF6F", # lt orange
"gray70", "khaki2",
"maroon", "steelblue4",
"darkturquoise", "green1", "yellow4", "yellow3",
"darkorange4", "blue","orange", "darkgrey"
)
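Before plotting, it is worth checking that the palette provides at least as many colors as there are genera to plot:
#Check the palette is long enough
length(c24) >= length(unique_genera) #must be TRUE; otherwise add more colors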
Main taxa plotting
At last, let’s plot the taxonomic profile at genus level! Comments (#) include information to change the code for a phylum profile.
#### BACTERIAL GENERA TAXONOMIC PROFILE ####
ggplot_title <- "Bacterial genera by location"

ggGenera <- ggplot(taxonomic_genus_location, aes(x=Samples, y=Abundance, fill=Genus, order=Genus)) + #Change Genus to Phylum
  geom_bar(stat="identity", position="stack")+
  scale_fill_manual(values=c24, #Change c24 to colors_genus if you are using random colors
                    breaks=unique_genera, #Include genera names
                    labels=sorted_labels_ggplot)+ #Include italic tags
  labs(y="Mean relative abundance (%)", x=NULL, fill="Genus", title=ggplot_title)+
  guides(fill = guide_legend(reverse = TRUE))+
  scale_x_discrete(limits=location_label,
                   labels=location_label)+
  scale_y_continuous(expand=c(0.01,0.01),
                     breaks=c(0,10,20,30,40,50,60,70),
                     labels=c("0","10","20","30","40","50","60","70"),
                     limits = c(NA, 70))+
  theme_bw()+
  theme(panel.border = element_rect(colour="black"), axis.title.x=element_blank(),
        plot.title = element_text(face="bold", hjust = 0.5, size=16), axis.text = element_text(size = 14),
        axis.text.x = element_text(face="bold", size=10, angle = 45, vjust=1, hjust = 1), axis.title.y = element_text(size = 16),
        legend.key.size = unit(0.9, "cm"), legend.text = element_text(size = 8),
        legend.title = element_text(size=7, face="bold"), legend.title.align=0.5)
#Visualize plot
ggGenera
#Save it
ggsave("genera_profile.tiff", plot = ggGenera, width = 35, height = 30, units = "cm", dpi = 800)
ANCOM-BC analysis
There are several methods to perform differential abundance analysis on microbial community data. Among them, we chose ANCOMBC because it is specifically designed for this kind of data and accounts for its compositional structure. Unfortunately, when there are multiple groups to compare, the ancombc function only compares the first group against all others. This is a big drawback, as it requires the user to change the input phyloseq object for every other combination of groups. To solve this problem, micro4all implements the function ancomloop. Apart from looping over groups, it also returns a table with ANCOM-corrected logarithmic abundances. First, we will install the microbiome package, as it is required by the ancombc function used inside micro4all. The ANCOMBC package itself is already installed with micro4all. Although ANCOMBC has recently been updated to ANCOMBC-II, we continue to use ‘ancombc’ instead of ‘ancombc2’, as some of its functionality is still being revised. Consequently, a harmless warning message stating that “ancombc is deprecated” will be displayed. Also, the new ‘ancombc’ function includes the argument ‘tax_level’, so we do not need to glom the phyloseq object at a specific taxonomic level. However, it is not possible to run ANCOM at ASV level, and the ‘tax.level’ argument must always be specified (one of the ranks allowed by taxRanks()), otherwise an error will occur.
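For context, a minimal sketch of what ancomloop automates under the hood: releveling the grouping factor so that a different location becomes the reference and rerunning ancombc. Treat the argument names as illustrative, since they vary between ANCOMBC versions:
##Hypothetical sketch of one iteration of ancomloop (check your ANCOMBC version)
ps <- loc_phyloseq
sample_data(ps)$location <- relevel(factor(sample_data(ps)$location), ref = "location2")
out_ref_loc2 <- ANCOMBC::ancombc(data = ps, tax_level = "Genus",
                                 formula = "location", group = "location")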
##Install microbiome package
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("microbiome")
#Load microbiome library
library(microbiome);packageVersion("microbiome")
#> [1] '1.18.0'
Now we will see how to use the ancomloop function. The input has to be a phyloseq object, in our case loc_phyloseq. Moreover, we have to give the grouping variable (change it to your variable of interest). If we do not want to analyze the unclassified genera, we can set out.unclassified to TRUE; when this argument is set to TRUE, a tax.level should be provided. Finally, we can pass additional arguments of the original ANCOMBC function through ancom.options. When running this function, the comparisons performed will be printed on screen (for example, “location1”-“location2”-“location3” means that location1 is being compared with location2 and location3). After it has finished, it returns a list with as many elements as there are groups in the grouping variable, in this case 3. Each element is a data frame containing the taxonomy, the relative abundance corrected by ANCOMBC, the standard deviations and all ANCOMBC results for every possible comparison. If you want to visualize an element, use View(ANCOM_location_genus[["location1"]]).
#### ANCOMBC COMPARISONS ####
#When performing ancombc at genus level, we can filter unclassified genera out with the out.unclassified and tax.level arguments
#REMEMBER! Change location according to your metadata
ANCOM_location_genus <- ancomloop(input_object_phyloseq = loc_phyloseq, grouping = "location", ancom.options = list(global=FALSE, struc_zero=TRUE, n_cl=8), out.unclassified = TRUE, tax.level="Genus")
#> [1] "location1" "location2" "location3"
#> [1] "location2" "location1" "location3"
#> [1] "location3" "location1" "location2"
#Visualize first lines
head(ANCOM_location_genus[["location1"]])
#> Genus Kingdom Phylum Class Order
#> 1 Achromobacter Bacteria Proteobacteria Betaproteobacteria Burkholderiales
#> 2 Acidaminococcus Bacteria Firmicutes Negativicutes Acidaminococcales
#> 3 Actinocorallia Bacteria Actinobacteria Actinobacteria Streptosporangiales
#> 4 Actinomadura Bacteria Actinobacteria Actinobacteria Streptosporangiales
#> 5 Actinophytocola Bacteria Actinobacteria Actinobacteria Pseudonocardiales
#> 6 Actinoplanes Bacteria Actinobacteria Actinobacteria Micromonosporales
#> Family ASV_names Intercept_lfc Intercept_se Intercept_W Intercept_pval
#> 1 Alcaligenaceae <NA> 2.8600907 0.2510751 11.3913736 4.616539e-30
#> 2 Acidaminococcaceae <NA> -0.1595666 0.3490052 -0.4572041 6.475244e-01
#> 3 Thermomonosporaceae <NA> -0.9505587 0.2369933 -4.0109101 6.048515e-05
#> 4 Thermomonosporaceae <NA> 4.0296225 0.5369732 7.5043276 6.174469e-14
#> 5 Pseudonocardiaceae <NA> 1.6747693 0.7152141 2.3416334 1.919956e-02
#> 6 Micromonosporaceae <NA> 3.4144532 0.4015234 8.5037460 1.835689e-17
#> Intercept_qval Intercept_diff location1vslocation2_lfc location1vslocation2_se
#> 1 7.109471e-28 TRUE -2.5689349 0.7038106
#> 2 1.000000e+00 FALSE 1.4555339 0.5739782
#> 3 6.955792e-03 TRUE 0.0675060 0.3812191
#> 4 8.582512e-12 TRUE 0.3953266 0.6812116
#> 5 1.000000e+00 FALSE 0.6683542 1.0329619
#> 6 2.588322e-15 TRUE 0.2784884 0.5147770
#> location1vslocation2_W location1vslocation2_p_val location1vslocation2_q_val
#> 1 -3.6500374 0.0002622022 0.04431216
#> 2 2.5358696 0.0112168507 1.00000000
#> 3 0.1770793 0.8594461208 1.00000000
#> 4 0.5803286 0.5616930167 1.00000000
#> 5 0.6470270 0.5176144794 1.00000000
#> 6 0.5409885 0.5885155182 1.00000000
#> location1vslocation2_diff_abn location1vslocation3_lfc location1vslocation3_se
#> 1 TRUE -4.1372748 0.0000000
#> 2 FALSE 0.2987000 0.6613357
#> 3 FALSE 2.0390103 0.4893351
#> 4 FALSE 0.1400161 0.7890357
#> 5 FALSE 1.3378918 0.8607808
#> 6 FALSE 1.3220703 0.4769997
#> location1vslocation3_W location1vslocation3_p_val location1vslocation3_q_val
#> 1 -14.1889557 0.000000000 0.000000000
#> 2 0.4516616 0.651512781 1.000000000
#> 3 4.1668999 0.000030877 0.005063829
#> 4 0.1774521 0.859153279 1.000000000
#> 5 1.5542770 0.120118350 1.000000000
#> 6 2.7716375 0.005577511 0.792006625
#> location1vslocation3_diff_abn
#> 1 TRUE
#> 2 FALSE
#> 3 TRUE
#> 4 FALSE
#> 5 FALSE
#> 6 FALSE
#Save any of the tables to your local machine
write.table(ANCOM_location_genus[["location1"]], file="ancom_location1_genus.txt", sep="\t")
Once saved to your local machine, the ANCOM table can be filtered manually to keep only significant results. However, this can be done in R as well. You will see that we select a specific range of columns (1:19) in order to keep only the location1 vs. location2 comparison.
###Filter ANCOM results
#Get results for location1
ANCOM_location1_genus <- ANCOM_location_genus[["location1"]] #Change location1 to your variable of interest
#Get only significant results
ANCOM_loc1vsloc2_genus_sig <- ANCOM_location1_genus[,1:19] %>% filter(location1vslocation2_diff_abn) #Change [,1:19] according to the comparison of interest

#Write results to your local machine
write.table(ANCOM_loc1vsloc2_genus_sig, file="ANCOM_loc1loc2_sig.txt", sep="\t")
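Selecting columns by position is fragile if the column layout changes. A more robust sketch (assuming the column layout shown in the head() output above) picks the taxonomy columns plus one comparison’s columns by name:
##Alternative (sketch): select comparison columns by name instead of position
ANCOM_loc1vsloc2_genus_sig <- ANCOM_location_genus[["location1"]] %>%
  dplyr::select(Genus:ASV_names, dplyr::starts_with("location1vslocation2")) %>%
  filter(location1vslocation2_diff_abn)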
ANCOM-BC graphic
Graphical representation of differential abundance results can be achieved with two approaches: log fold change of corrected abundances, or relative abundance without correction. Here, we will show how to produce both graphics. For publication purposes, we prefer to combine them in a single figure.
We will start with the log fold change represented in a waterfall plot. As we ran ANCOM at genus level, we will bind the phylum and genus names for easier identification (this is most useful when representing differential abundances at ASV level). The graphic will be exemplified with the location1 vs. location2 comparison.
#Bind phylum name to genus name
#Create a new variable for graphics
graphic_loc1vsloc2_genus <- ANCOM_loc1vsloc2_genus_sig

#Loop to add a new column called Classification and paste phylum and genus names
for (i in 1:nrow(graphic_loc1vsloc2_genus)){
  graphic_loc1vsloc2_genus$Classification[i] <- paste(graphic_loc1vsloc2_genus$Phylum[i], graphic_loc1vsloc2_genus$Genus[i], sep="|")
}
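Since paste is vectorized in R, the same column can also be created without an explicit loop; the one-liner below is equivalent:
#Equivalent vectorized alternative to the loop above
graphic_loc1vsloc2_genus$Classification <- paste(graphic_loc1vsloc2_genus$Phylum,
                                                 graphic_loc1vsloc2_genus$Genus, sep="|")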
When keeping unclassified genera, or when running the DA analysis at ASV level, it is useful to show the classification at higher levels (family, order, class or phylum). The next loop pastes the names according to the last classified level:
#Get the last classified taxonomy level
for (i in 1:nrow(graphic_loc1vsloc2_genus)){ #Loop over rows
  if (isTRUE(graphic_loc1vsloc2_genus$Genus[i]=="unclassified") & isTRUE(graphic_loc1vsloc2_genus$Family[i]!="unclassified")){
    graphic_loc1vsloc2_genus$Classification[i] <- paste0("Family", " ", graphic_loc1vsloc2_genus$Family[i],
                                                         "|", graphic_loc1vsloc2_genus$Genus[i])
  } else if (isTRUE(graphic_loc1vsloc2_genus$Family[i]=="unclassified") & isTRUE(graphic_loc1vsloc2_genus$Order[i]!="unclassified")){
    graphic_loc1vsloc2_genus$Classification[i] <- paste0("Order", " ", graphic_loc1vsloc2_genus$Order[i],
                                                         "|", graphic_loc1vsloc2_genus$Genus[i])
  } else if (isTRUE(graphic_loc1vsloc2_genus$Order[i]=="unclassified") & isTRUE(graphic_loc1vsloc2_genus$Class[i]!="unclassified")){
    graphic_loc1vsloc2_genus$Classification[i] <- paste0("Class", " ", graphic_loc1vsloc2_genus$Class[i],
                                                         "|", graphic_loc1vsloc2_genus$Genus[i])
  } else if (isTRUE(graphic_loc1vsloc2_genus$Class[i]=="unclassified") & isTRUE(graphic_loc1vsloc2_genus$Phylum[i]!="unclassified")){
    graphic_loc1vsloc2_genus$Classification[i] <- paste0("Phylum", " ", graphic_loc1vsloc2_genus$Phylum[i],
                                                         "|", graphic_loc1vsloc2_genus$Genus[i])
  } else if (isTRUE(graphic_loc1vsloc2_genus$Phylum[i]=="unclassified")){
    graphic_loc1vsloc2_genus$Classification[i] <- paste0("Phylum", " ", "unclassified")
  } else if (isTRUE(graphic_loc1vsloc2_genus$Genus[i]!="unclassified")){
    graphic_loc1vsloc2_genus$Classification[i] <- paste0("Phylum", " ", graphic_loc1vsloc2_genus$Phylum[i],
                                                         "|", graphic_loc1vsloc2_genus$Genus[i])
  }
}
Next, we can filter out unclassified phyla:
#Filter out unclassified phyla
graphic_loc1vsloc2_genus <- graphic_loc1vsloc2_genus %>% filter(Phylum!="unclassified")
If using ANCOM at ASV level, the code would be as follows:
#Get the last classified taxonomy level for ANCOM at ASV level
for (i in 1:nrow(graphic_loc1vsloc2_ASV)){
  if (isTRUE(graphic_loc1vsloc2_ASV$Genus[i]=="unclassified") & isTRUE(graphic_loc1vsloc2_ASV$Family[i]!="unclassified")){
    graphic_loc1vsloc2_ASV$Classification[i] <- paste0("Family", " ", graphic_loc1vsloc2_ASV$Family[i],
                                                       "|", graphic_loc1vsloc2_ASV$Genus[i], "|", graphic_loc1vsloc2_ASV$ASV_names[i])
  } else if (isTRUE(graphic_loc1vsloc2_ASV$Family[i]=="unclassified") & isTRUE(graphic_loc1vsloc2_ASV$Order[i]!="unclassified")){
    graphic_loc1vsloc2_ASV$Classification[i] <- paste0("Order", " ", graphic_loc1vsloc2_ASV$Order[i],
                                                       "|", graphic_loc1vsloc2_ASV$Genus[i], "|", graphic_loc1vsloc2_ASV$ASV_names[i])
  } else if (isTRUE(graphic_loc1vsloc2_ASV$Order[i]=="unclassified") & isTRUE(graphic_loc1vsloc2_ASV$Class[i]!="unclassified")){
    graphic_loc1vsloc2_ASV$Classification[i] <- paste0("Class", " ", graphic_loc1vsloc2_ASV$Class[i],
                                                       "|", graphic_loc1vsloc2_ASV$Genus[i], "|", graphic_loc1vsloc2_ASV$ASV_names[i])
  } else if (isTRUE(graphic_loc1vsloc2_ASV$Class[i]=="unclassified") & isTRUE(graphic_loc1vsloc2_ASV$Phylum[i]!="unclassified")){
    graphic_loc1vsloc2_ASV$Classification[i] <- paste0("Phylum", " ", graphic_loc1vsloc2_ASV$Phylum[i],
                                                       "|", graphic_loc1vsloc2_ASV$Genus[i], "|", graphic_loc1vsloc2_ASV$ASV_names[i])
  } else if (isTRUE(graphic_loc1vsloc2_ASV$Phylum[i]=="unclassified")){
    graphic_loc1vsloc2_ASV$Classification[i] <- paste0("Phylum", " ", "unclassified")
  } else if (isTRUE(graphic_loc1vsloc2_ASV$Genus[i]!="unclassified")){
    graphic_loc1vsloc2_ASV$Classification[i] <- paste0("Phylum", " ", graphic_loc1vsloc2_ASV$Phylum[i],
                                                       "|", graphic_loc1vsloc2_ASV$Genus[i], "|", graphic_loc1vsloc2_ASV$ASV_names[i])
  }
}
Now we have a graphic_loc1vsloc2_genus variable with the ANCOM results and a new column called Classification. Next, we create a data frame including only the data we want to represent: the Classification column, the log fold change (the *_lfc columns in the ANCOM results) and the standard deviation (the *_se columns). In this data frame, genera whose log fold change is 0, or below a specific threshold, will be filtered out to reduce the size of the graphic.
##Create the data frame for representation, with Classification, log fold change and standard deviation
logfold_df <- graphic_loc1vsloc2_genus %>% dplyr::select(Classification, location1vslocation2_lfc, location1vslocation2_se)
colnames(logfold_df) <- c("Genus", "LogFold_location1vslocation2", "SD")

##Filter out zero log fold change and assign a group according to the direction of change
logfold_df <- logfold_df %>%
  filter(LogFold_location1vslocation2 != 0) %>% arrange(desc(LogFold_location1vslocation2)) %>%
  mutate(group = ifelse(LogFold_location1vslocation2 > 0, "Location2", "Location1")) #Change location to your variable of interest from your mt object
logfold_df$Genus <- factor(logfold_df$Genus, levels = logfold_df$Genus)
Let’s write the table to your local machine:
write.table(file="waterfall_table.txt", logfold_df, sep="\t")
Here comes the filtering step. First, the range function is used to see the minimum and maximum log fold change. Then, we filter the data frame. Remember to change the threshold to your desired value; in this tutorial we will keep genera with an absolute log fold change of at least 1.5.
#FILTER WHEN NEEDED. This can be used to filter the results according to log fold change level
range(logfold_df$LogFold_location1vslocation2)
#> [1] -2.880552 3.241747

logfold_df_filtered <- rbind(logfold_df[which(logfold_df$LogFold_location1vslocation2>=1.5),], logfold_df[which(logfold_df$LogFold_location1vslocation2<=-1.5),])
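For reference, the same filtering can be written as a single dplyr call on the absolute log fold change, which here yields the same rows in the same order:
#Equivalent dplyr filter on the absolute log fold change
logfold_df_filtered <- logfold_df %>% filter(abs(LogFold_location1vslocation2) >= 1.5)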
Now that we have prepared all the data, we can use ggplot to plot the log fold change.
##Waterfall plot
waterfall_location1 <- ggplot(data = logfold_df_filtered, #If not filtering, use logfold_df in plotting
                              aes(x = Genus, y = LogFold_location1vslocation2, fill = group, color = group)) + #Change Genus to taxonomic rank of interest and LogFold_location1vslocation2 to your object
  geom_bar(stat = "identity", width = 0.7,
           position = position_dodge(width = 0.4)) +
  geom_errorbar(aes(ymin = LogFold_location1vslocation2 - SD, ymax = LogFold_location1vslocation2 + SD), width = 0.2, #Change LogFold_location1vslocation2 to your object
                position = position_dodge(0.05), color = "black") +
  labs(x = NULL, y = "Log fold change",
       title = "Waterfall Plot for the Location Effect") + #Change title
  theme_bw() +
  theme(legend.position = "right",
        legend.title = element_blank(),
        plot.title = element_text(hjust = 0.5),
        panel.grid.minor.y = element_blank(),
        axis.text.x = element_text(angle = 60, hjust = 1), plot.margin = margin(10,10,10,100))
#Visualize plot
waterfall_location1
#Save it
ggsave("waterfall_location1.tiff", plot = waterfall_location1, width = 17, height = 30, units = "cm", dpi = 800)
As said at the beginning of this section, we will also produce a graphic based on relative abundance, specifically a back-to-back (or pyramidal) plot. Again, we will show this plot with the location1 vs. location2 comparison. We need to start from a relative abundance table, in our case genus_location_table (relative abundance at genus level, grouped by location). First, we will filter out those genera that were not statistically different according to ANCOM and sort the table in the same order as logfold_df_filtered (the waterfall plot table). Then, we will select the columns with classification and abundance and use pivot_longer to give it a format suitable for ggplot2.
#Again, bind phylum name and genus name
pyramidal_df <- genus_location_table
for (i in 1:nrow(pyramidal_df)){
  pyramidal_df$Classification[i] <- paste(pyramidal_df$Phylum[i], pyramidal_df$Genus[i], sep="|")
}

#Filter to include only significant genera according to ANCOM
pyramidal_loc1vsloc2_filt <- pyramidal_df[which(pyramidal_df$Classification %in% logfold_df_filtered$Genus),]

#Sort it in the same way as the waterfall plot
pyramidal_loc1vsloc2_filt <- pyramidal_loc1vsloc2_filt[match(logfold_df_filtered$Genus, pyramidal_loc1vsloc2_filt$Classification),]

#Now, transform it to ggplot format
pyramidal_loc1vsloc2 <- pyramidal_loc1vsloc2_filt %>% dplyr::select(c(13, 7:8)) %>% #Select columns with Classification and abundance
  pivot_longer(!Classification, names_to="Samples", values_to="Abundance")
We can write this table to our local machine:
write.table(file="pyramidal_table.txt", pyramidal_loc1vsloc2, sep="\t")
Finally, the data frame (pyramidal_loc1vsloc2) has to be split into two data frames, one per location, and a factor variable has to be created (this is needed to keep the order in ggplot). This is essential for the back-to-back plot.
#Split table according to location
loc1pyrm <- pyramidal_loc1vsloc2[which(pyramidal_loc1vsloc2$Samples=="location1"),] #Change location1 to variable of interest
loc2pyrm <- pyramidal_loc1vsloc2[which(pyramidal_loc1vsloc2$Samples=="location2"),] #Change location2 to variable of interest

#Create vector with factors
loc1pyrm$Classification <- factor(loc1pyrm$Classification, levels=loc1pyrm$Classification)
loc2pyrm$Classification <- factor(loc2pyrm$Classification, levels=loc2pyrm$Classification)
Before plotting, it is important to know the maximum and minimum values of abundances to set the limits of the graphic:
#Get limits for plotting
min(pyramidal_loc1vsloc2$Abundance)
#> [1] 0
max(pyramidal_loc1vsloc2$Abundance)
#> [1] 3.237008
As we can see, we should set the limits to a maximum of 4. Check the comments (#) below:
#### PYRAMIDAL PLOT ####
pyramidalggplot <- ggplot(data = pyramidal_loc1vsloc2, mapping = aes(x = Classification, y=Abundance, fill = Samples), colour="white")+
  geom_bar(data=loc1pyrm, stat="identity", mapping=aes(y=-Abundance))+ #In data, give the first split dataframe
  geom_bar(data=loc2pyrm, stat="identity")+ #In data, give the second split dataframe
  theme_bw()+
  scale_y_continuous(expand=c(0,0), labels=abs, limits=c(-4,4), breaks=seq(-4,4,1))+ #Change limits here
  labs(y="Relative abundance (%)")+
  theme(legend.title=element_blank(), axis.title.y=element_blank(), axis.text.y= element_text(size = 8, face="bold"),
        axis.text.x = element_text(size=14, face="bold"), axis.title.x = element_text(size=18, face="bold"), legend.key.size = unit(1.1, "cm"), legend.text = element_text(size = 16), panel.grid.minor.y = element_blank(), panel.grid.major.y = element_blank(), panel.grid.minor.x = element_blank())+
  coord_flip()
#Visualize plot
pyramidalggplot
#Save it
ggsave("pyramidal_loc1_loc2.tiff", plot = pyramidalggplot, width = 17, height = 30, units = "cm", dpi = 800)
Another way to represent the previous plot is by abundance order. We will generate the tables needed from the original genus table in reversed order, skipping the step that sorts the pyramidal data frame like the waterfall plot.
#Get genus location table in reverse order
genus_location_table_rev <- genus_location_table[nrow(genus_location_table):1,]

#Again, bind phylum name and genus name
pyramidal_df_abun <- genus_location_table_rev
for (i in 1:nrow(pyramidal_df_abun)){
  pyramidal_df_abun$Classification[i] <- paste(pyramidal_df_abun$Phylum[i], pyramidal_df_abun$Genus[i], sep="|")
}

#Filter to include only significant genera according to ANCOM
pyramidal_loc1vsloc2_filt_abund <- pyramidal_df_abun[which(pyramidal_df_abun$Classification %in% logfold_df_filtered$Genus),]

#Now, transform it to ggplot format
pyramidal_loc1vsloc2_abund <- pyramidal_loc1vsloc2_filt_abund %>% dplyr::select(c(13, 7:8)) %>% #Select columns with Classification and abundance
  pivot_longer(!Classification, names_to="Samples", values_to="Abundance")

#Split table according to location
loc1pyrm_abund <- pyramidal_loc1vsloc2_abund[which(pyramidal_loc1vsloc2_abund$Samples=="location1"),] #Change location1 to variable of interest
loc2pyrm_abund <- pyramidal_loc1vsloc2_abund[which(pyramidal_loc1vsloc2_abund$Samples=="location2"),] #Change location2 to variable of interest

#Create vector with factors
loc1pyrm_abund$Classification <- factor(loc1pyrm_abund$Classification, levels=loc1pyrm_abund$Classification)
loc2pyrm_abund$Classification <- factor(loc2pyrm_abund$Classification, levels=loc2pyrm_abund$Classification)
Let’s see how it works:
#### PYRAMIDAL PLOT BY ABUNDANCE ORDER ####
pyramidalggplot_abund <- ggplot(data = pyramidal_loc1vsloc2_abund, mapping = aes(x = Classification, y=Abundance, fill = Samples), colour="white")+
  geom_bar(data=loc1pyrm_abund, stat="identity", mapping=aes(y=-Abundance))+ #In data, give the first split dataframe
  geom_bar(data=loc2pyrm_abund, stat="identity")+ #In data, give the second split dataframe
  theme_bw()+
  scale_y_continuous(expand=c(0,0), labels=abs, limits=c(-4,4), breaks=seq(-4,4,1))+ #Change limits here
  labs(y="Relative abundance (%)")+
  theme(legend.title=element_blank(), axis.title.y=element_blank(), axis.text.y= element_text(size = 8, face="bold"),
        axis.text.x = element_text(size=14, face="bold"), axis.title.x = element_text(size=18, face="bold"), legend.key.size = unit(1.1, "cm"), legend.text = element_text(size = 16), panel.grid.minor.y = element_blank(), panel.grid.major.y = element_blank(), panel.grid.minor.x = element_blank())+
  coord_flip()
#Visualize plot
pyramidalggplot_abund
#Save it (note we save the abundance-ordered plot, under a different file name)
ggsave("pyramidal_loc1_loc2_abund.tiff", plot = pyramidalggplot_abund, width = 17, height = 30, units = "cm", dpi = 800)
Constrained analysis of proximities (CAP)
Microbial communities are greatly influenced by environmental factors, so it is interesting to analyze how these variables relate to community structure. This can be achieved by measuring different soil physicochemical parameters and fitting them to an ordination plot. The vegan package includes several functions for this, along with a very useful tutorial.
Among the options offered by vegan, we have implemented those that fit the logic of our daily studies. We will exemplify them with some data from pine trees (which can be downloaded from the link at the beginning of the tutorial).
We will read this phyloseq file (saved as .RDS), convert the abundances to relative abundances and get the soil physicochemical parameters from the metadata.
### PREPARE DATA FOR CAP ####
#Read phyloseq and get relative abundance
cap_phyloseq <- readRDS(file = "cap_phyloseq")
cap_phyloseq_rel <- transform_sample_counts(cap_phyloseq, function(x){x/sum(x)}*100)

#Get ASV and metadata
ASV_cap <- as.data.frame(t(otu_table(cap_phyloseq_rel)))
metadata_cap <- as.matrix(sample_data(cap_phyloseq))
metadata_cap <- as.data.frame(metadata_cap)

#metadata_cap <- metadata_cap[1:ncol(metadata_cap)-1] #Remove column Season_NucleicAcid
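An optional check before modeling (assuming taxa are stored as rows in the phyloseq object, so that samples become rows after transposing): sample names must match between the ASV table and the metadata.
#Optional check: do sample names match between ASV table and metadata?
all(rownames(ASV_cap) == rownames(metadata_cap))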
If using your own data, you can use your previously generated phyloseq object. If the physicochemical measurements were not originally in your metadata file, you can always read them into R with metadata_cap <- read.table("yourmtfile.txt", sep="\t", header=TRUE) followed by row.names(metadata_cap) <- metadata_cap$samples. In this file, row names should match the sample names and columns should correspond to the physicochemical parameters.
Now, we will generate two models. CAP1 includes all variables in the constrained analysis, that is, it models the variance explained by all the physicochemical variables. CAP0 is the opposite: a fully unconstrained model (variance without the influence of the physicochemical parameters).
These two models will be the input for the ordistep function. This function selects a model containing the physicochemical variables that significantly explain part of the variance in the community and that are not collinear with each other. It starts from the unconstrained model (CAP0) and, at each step, adds a new variable to the constrained model and re-checks it. If the remaining variables are no longer significant, or add no further explained variance to the model, the function stops. The output is a formula with the variables that explain most of the variance in the data. In this way, when there are multiple variables, the model can be reduced by removing those that are interdependent.
#Compute CAP1 and CAP0 models
CAP1 <- capscale(ASV_cap~Water+ln_P+SOM+N+ph_KCl+CN+Na_Exch+Clay+Sand+Slime+Texture, data=metadata_cap, distance = "bray") #Add here, after ASV_cap, all your parameters
CAP0 <- capscale(ASV_cap~1, data=metadata_cap, distance = "bray")

#Get formula with ordistep
CAP_ordi <- ordistep(CAP0, scope=formula(CAP1))
#>
#> Start: ASV_cap ~ 1
#>
#> Df AIC F Pr(>F)
#> + ph_KCl 11 28 1.2601 0.010 **
#> + Clay 15 10 1.0955 0.335
#> + Water 15 10 1.0896 0.355
#> + ln_P 8 33 1.0101 0.495
#> + Na_Exch 10 33 0.9478 0.735
#> + SOM 14 23 0.8758 0.905
#> + N 13 28 0.8611 0.910
#> + CN 16 -Inf
#> + Sand 16 -Inf
#> + Slime 16 -Inf
#> + Texture 16 -Inf
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Step: ASV_cap ~ ph_KCl
#>
#> Df AIC F Pr(>F)
#> - ph_KCl 11 28.737 1.2601 0.02 *
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Df AIC F Pr(>F)
#> + Na_Exch 4 8 1.0277 0.465
#> + ln_P 4 10 0.9176 0.540
#> + Water 5 -Inf
#> + SOM 5 -Inf
#> + N 5 -Inf
#> + CN 5 -Inf
#> + Clay 5 -Inf
#> + Sand 5 -Inf
#> + Slime 5 -Inf
#> + Texture 5 -Inf
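Optionally, vegan can also quantify and permutation-test the variance explained by the model selected by ordistep; a short sketch using the CAP_ordi object from above:
##Optional: adjusted R2 and permutation test of the selected model
RsquareAdj(CAP_ordi)
anova(CAP_ordi, permutations = 999)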
Thus, we now know which variables influence our microbial community according to ordistep; in this case, pH. With this parameter, we will use the envfit function to fit the environmental factor as a vector onto an ordination plot and obtain its correlation with each ordination axis. Remember that these correlation values correspond to an unconstrained analysis, i.e., correlations with the ordination plots produced in the β-diversity analysis.
#### ENVFIT ####
ef <- envfit(CAP1~ph_KCl, data = metadata_cap, perm = 999) #After CAP1, introduce the significant parameters
Now, we can adjust the p-values for multiple comparisons and visualize the result:
#Adjust p-values of envfit
ef.adj <- ef
pvals.adj <- p.adjust(ef$vectors$pvals, method = 'BH')
ef.adj$vectors$pvals <- pvals.adj
ef.adj
#>
#> ***VECTORS
#>
#> $pvals
#> numeric(0)
#>
#>
#> ***FACTORS:
#>
#> Centroids:
#> CAP1 CAP2
#> ph_KCl5.6 -0.5680 -1.0802
#> ph_KCl5.7 -0.6385 -1.0075
#> ph_KCl5.8 -0.4246 -0.1969
#> ph_KCl5.9 -0.5018 0.1034
#> ph_KCl6.0 0.6368 0.3278
#> ph_KCl6.1 0.1524 0.6719
#> ph_KCl6.2 -0.3268 1.0491
#> ph_KCl6.4 -0.0529 1.3443
#> ph_KCl6.5 1.0748 0.1100
#> ph_KCl6.6 0.9464 -0.2098
#> ph_KCl7.0 0.4793 -0.3622
#> ph_KCl7.7 1.6547 -0.9666
#>
#> Goodness of fit:
#> r2 Pr(>r)
#> ph_KCl 0.818 0.078 .
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> Permutation: free
#> Number of permutations: 999
With plot_ordination we can get a graphical representation of the constrained ordination and visualize the direction of change of the environmental variables. We will include another variable, although it was not significant, to show how to plot several parameters.
#### CAP PLOT ####
#First, get distance matrix
distance_matrix <- phyloseq::distance(physeq = cap_phyloseq_rel, method = "wunifrac")

#Then, generate ordination with CAP (that is, ordination constrained to ph_KCl and Clay)
CAP_wunifrac <- ordinate(physeq = cap_phyloseq_rel, method = "CAP", distance = distance_matrix,
                         formula = ~ph_KCl+Clay) #Change parameters to yours

#Produce plot
CAP_wunifrac_plot <- plot_ordination(physeq = cap_phyloseq_rel, ordination = CAP_wunifrac,
                                     color = "Species", axes = c(1,2)) +
  aes(shape = Species) +
  geom_point(aes(colour = Species), alpha = 1, size = 3.5) +
  scale_shape_manual(values=c("Pinaster"=16,"Other"=15), #Change Pinaster and Other according to your variable of interest
                     breaks=c("Pinaster","Other"))+ #Change Pinaster and Other according to your variable of interest
  scale_color_manual(values = c("Pinaster"="brown", "Other"="orange"), #Change Pinaster and Other according to your variable of interest
                     name="Host genotype",
                     breaks=c("Pinaster", "Other"),
                     labels=expression(paste("Confirmed ", italic("P. pinaster")), "Pine forest samples"))+ #This sets the legend, change it according to your preference
  guides(shape="none")+
  ggtitle("CAP on Weighted Unifrac distance")+
  theme_bw()+
  theme(legend.key=element_blank(),
        legend.title.align = 0.85,
        legend.title = element_text(face="bold"),
        axis.text = element_text(size=14),
        axis.title = element_text(size = 16),
        plot.title = element_text(hjust=0.5, face="bold"),
        legend.text = element_text(size = 16))+
  geom_hline(aes(yintercept = c(0.00)), lty=2, colour="grey")+
  geom_vline(aes(xintercept = c(0.00)), lty=2, colour="grey")

#Define arrow aesthetics and labels
arrowmat <- vegan::scores(CAP_wunifrac, display = "bp")
arrowdf <- data.frame(labels = rownames(arrowmat), arrowmat)
arrow_map <- aes(xend = CAP1, yend = CAP2, x = 0, y = 0, shape = NULL, color = NULL, label=labels)
label_map <- aes(x = 1.1 * CAP1, y = 1.1 * CAP2, shape = NULL, color = NULL, label=labels)
arrowhead <- arrow(length = unit(0.02, "npc"))

#Introduce them to the plot
CAP_wunifrac_plot +
  geom_segment(mapping = arrow_map, size = c(ph_KCl=1, Clay=1), data = arrowdf, color = c(ph_KCl="black", Clay="black"), arrow = arrowhead) + #Change ph_KCl and Clay to your physicochemical variables of interest
  geom_text(mapping = label_map, size = 4, data = arrowdf, show.legend = FALSE)
BONUS: statistical analysis without micro4all
In this section, we will show how to perform the statistical analyses of this tutorial without the functions from the micro4all package. This section will not include detailed explanations, as everything was explained throughout the tutorial. We will start with α-diversity. First of all, let’s load some libraries (which were installed along with micro4all):
##Load library
library(car);packageVersion("car")
#> Loading required package: carData
#>
#> Attaching package: 'car'
#> The following object is masked from 'package:DescTools':
#>
#> Recode
#> The following object is masked from 'package:dplyr':
#>
#> recode
#> The following object is masked from 'package:purrr':
#>
#> some
#> [1] '3.1.1'
library(stats);packageVersion("stats")
#> [1] '4.2.2'
library(BiocGenerics); packageVersion("BiocGenerics")
#> [1] '0.42.0'
library(agricolae);packageVersion("agricolae")
#> Registered S3 methods overwritten by 'klaR':
#> method from
#> predict.rda vegan
#> print.rda vegan
#> plot.rda vegan
#>
#> Attaching package: 'agricolae'
#> The following object is masked from 'package:ape':
#>
#> consensus
#> The following object is masked from 'package:seqinr':
#>
#> consensus
#> [1] '1.3.5'
library(pairwiseAdonis);packageVersion("pairwiseAdonis")
#> Loading required package: cluster
#> [1] '0.4.1'
library(dunn.test);packageVersion("dunn.test")
#> [1] '1.3.5'
For balanced ANOVA, the next lines should be repeated for every diversity index:
#### ALPHA DIVERSITY WITHOUT MICRO4ALL ####
##Balanced anova
res.aov_observed <- aov(Observed ~ location, data = alpha_table) #Change Observed for the index of interest and location to your mt variable
summary(res.aov_observed)
#> Df Sum Sq Mean Sq F value Pr(>F)
#> location 2 6191 3095 1.126 0.343
#> Residuals 21 57753 2750
The same logic can be applied to unbalanced designs:
##Unbalanced anova
res.aov_Observed_u <- Anova(res.aov_observed, type = "III") #Change res.aov_observed for every index (res.aov_shannon, etc.)
res.aov_Observed_u
#> Anova Table (Type III tests)
#>
#> Response: Observed
#> Sum Sq Df F value Pr(>F)
#> (Intercept) 568711 1 206.7924 2.421e-12 ***
#> location 6191 2 1.1255 0.3433
#> Residuals 57753 21
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
As a post-hoc test, the Tukey test or the Dunn test can be applied. Again, this should be repeated for every index:
##POST-HOC TEST, TUKEYHSD
post_observed <- TukeyHSD(res.aov_observed, which = "location")$location #Change res.aov_observed for every index (res.aov_shannon, etc.) and location to your mt variable
which(post_observed[,"p adj"]<0.05)
#> named integer(0)
##Post-hoc, Dunn test
dunn.result <- dunn.test(x=alpha_table$Observed, g=alpha_table$location, method = "BH")
#> Kruskal-Wallis rank sum test
#>
#> data: x and group
#> Kruskal-Wallis chi-squared = 1.4, df = 2, p-value = 0.5
#>
#>
#> Comparison of x by group
#> (Benjamini-Hochberg)
#> Col Mean-|
#> Row Mean | location location
#> ---------+----------------------
#> location | -0.901953
#> | 0.2753
#> |
#> location | 0.212224 1.114177
#> | 0.4160 0.3978
#>
#> alpha = 0.05
#> Reject Ho if p <= alpha/2
Now, normality and homoscedasticity will be checked:
##Levene test and Shapiro test
leveneTest(Observed ~ location, data = alpha_table) #Change Observed for the index of interest and location to your mt variable
#> Warning in leveneTest.default(y = y, group = group, ...): group coerced to factor.
#> Levene's Test for Homogeneity of Variance (center = median)
#> Df F value Pr(>F)
#> group 2 0.4798 0.6255
#> 21
#Shapiro
aov_residuals_observed <- residuals(object = res.aov_observed) #Change res.aov_observed for every index (res.aov_shannon, etc.)
shapiro.test(x = aov_residuals_observed)
#>
#> Shapiro-Wilk normality test
#>
#> data: aov_residuals_observed
#> W = 0.96527, p-value = 0.5529
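If you have several indices, the repetition can be avoided with a small loop; a sketch (adjust the index names to the columns of your alpha_table):
##Compact sketch: run the ANOVA and normality check for every index
for (idx in c("Observed", "Shannon", "InvSimpson")) {
  res.aov <- aov(reformulate("location", response = idx), data = alpha_table)
  print(summary(res.aov))
  print(shapiro.test(residuals(res.aov)))
}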
For the Kruskal-Wallis test and pairwise comparisons (Wilcoxon test), a similar procedure is followed:
##Kruskal-Wallis
k_observed <- kruskal.test(Observed ~ location, data = alpha_table) #Change Observed for the index of interest and location to your mt variable
k_observed
#>
#> Kruskal-Wallis rank sum test
#>
#> data: Observed by location
#> Kruskal-Wallis chi-squared = 1.4, df = 2, p-value = 0.4966
#Wilcoxon test
pairwise.wilcox.test(alpha_table$Observed, alpha_table$location, p.adjust.method = "BH") #Change Observed for the index of interest and location to your mt variable
#> Warning in wilcox.test.default(xi, xj, paired = paired, ...): cannot compute exact p-value with
#> ties
#> Warning in wilcox.test.default(xi, xj, paired = paired, ...): cannot compute exact p-value with
#> ties
#>
#> Pairwise comparisons using Wilcoxon rank sum exact test
#>
#> data: alpha_table$Observed and alpha_table$location
#>
#> location1 location2
#> location2 0.66 -
#> location3 0.92 0.66
#>
#> P value adjustment method: BH
For β-diversity analysis, a PERMANOVA test can be applied on a given distance matrix.
#### BETA DIVERSITY WITHOUT MICRO4ALL ####
#PERMANOVA with Bray-Curtis distances
df <- data.frame(sample_data(norm_phyloseq))
bray <- phyloseq::distance(norm_phyloseq, method="bray") #Change bray to use another distance method
permanova_all <- adonis2(bray ~ location, data=df) #Change location to your mt variable of interest
permanova_all
#> Permutation test for adonis under reduced model
#> Terms added sequentially (first to last)
#> Permutation: free
#> Number of permutations: 999
#>
#> adonis2(formula = bray ~ location, data = df)
#> Df SumOfSqs R2 F Pr(>F)
#> location 2 1.4040 0.30049 4.5105 0.001 ***
#> Residual 21 3.2684 0.69951
#> Total 23 4.6725 1.00000
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
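If you want to repeat the test over several distance methods, a small loop is enough; a sketch (UniFrac-based distances would additionally require a phylogenetic tree in the phyloseq object):
##Sketch: PERMANOVA over several distance methods
for (m in c("bray", "jaccard")) {
  dist_m <- phyloseq::distance(norm_phyloseq, method = m)
  print(adonis2(dist_m ~ location, data = df))
}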
Dispersion will be checked with the betadisper function. If working with another distance, make the same change here as in adonis2:
#Betadisper with Bray-Curtis distances
disper_d <- betadisper(bray, df$location) #Change location to your mt variable of interest
permutest(disper_d) #Not significant
#>
#> Permutation test for homogeneity of multivariate dispersions
#> Permutation: free
#> Number of permutations: 999
#>
#> Response: Distances
#> Df Sum Sq Mean Sq F N.Perm Pr(>F)
#> Groups 2 0.004001 0.0020003 0.317 999 0.706
#> Residuals 21 0.132519 0.0063104
Next, the pairwise.adonis function will be applied to check differences between groups:
#Pairwise adonis
d <- data.frame(sample_data(norm_phyloseq))
table_vegan <- cbind(t(OTU), df)
ncolum <- ncol(t(OTU))

## PAIRWISE ADONIS ON BRAY CURTIS ##
pw_result <- pairwise.adonis(table_vegan[,1:ncolum], table_vegan$location, p.adjust.m = "BH") #Change location to your mt variable of interest
pw_result_sig <- pw_result[which(pw_result$p.adjusted<0.05),]
pw_result_sig
#> pairs Df SumsOfSqs F.Model R2 p.value p.adjusted sig
#> 1 location1 vs location2 1 0.3254020 2.020915 0.1261423 0.011 0.0110 .
#> 2 location1 vs location3 1 0.8502918 5.871144 0.2954608 0.001 0.0015 *
#> 3 location2 vs location3 1 0.8446613 5.871946 0.2954892 0.001 0.0015 *