Micro4all workflow

Introduction

The following tutorial shows the workflow we follow in our laboratory for the analysis of amplicon data. It is based on R and is intended to be easy to read for all users, even those with little knowledge of R. Moreover, it integrates the main utilities of the micro4all package, which was created to gather the R functions we routinely use. We will cover several steps, including:

  • Produce an amplicon sequence variant (ASV) table, making use of the dada2 package.
  • Study microbial diversity (α and β diversity).
  • Get taxonomic profiles at phylum and genus level.
  • Differential abundance analysis with the ANCOMBC package.
  • Ordination graphics and environmental fitting of soil physicochemical parameters.

To follow this tutorial, we will use an example dataset, which is detailed in the next section. The results that you will see were produced with R version 4.1.2 (2021-11-01) on Ubuntu 20.04.3 LTS. It is important to know that we use Ubuntu, specifically for tools such as Figaro or cutadapt (we will see this in detail later on).

Authorship

This tutorial has been mainly created and maintained by Nuria M. Wentzien. Ana V. Lasa has greatly contributed to code development and Antonio J. Fernández-González has contributed to the text revision and proposal of ideas.

Dataset

In this tutorial, we will use a set of sequences of our own. Specifically, we collected endosphere samples from 8 olive trees (Picual variety) at three different locations (let’s call them location 1, location 2 and location 3) and sequenced the V3-V4 region of the 16S rRNA gene (with Illumina MiSeq technology and 300 bp PE reads) in order to study the bacterial community. These paired-end FASTQ files (already demultiplexed and without barcodes) can be downloaded here. At this link you will also find a phyloseq object that will be used in the environmental fitting section of this tutorial, as well as a modified RDP database file for classification (explained later).

Install package micro4all

Let’s install the micro4all package. It is hosted on GitHub, so we will need devtools to install it from that repository. If this is the first time you install the package, it will take a while to complete. If not, you don’t need to install it again; just load the libraries before processing your dataset.

##Install devtools
install.packages("devtools")

Now, we are ready to install micro4all

##Install micro4all with devtools
library(devtools)
install_github("nuriamw/micro4all")

…and load the library

#Load micro4all
library(micro4all)

Produce an amplicon sequence variant (ASV) table

The first part of this tutorial is mainly based on the information provided by dada2 (we highly recommend checking its online tutorial), with our own implementations added. We will start with raw sequences, get to know their quality profiles, clean them and infer amplicon sequence variants (ASVs). Thus, we will end this section with an ASV table, ready for microbial diversity analysis.

Prepare data: load libraries and read sequence files

We will make use of several packages, such as dada2, ShortRead, ggplot2 and tidyverse. The first two can be installed from BiocManager as follows:

##dada2 installation
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("dada2")
##ShortRead installation
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("ShortRead")

When installing tidyverse, ggplot2 is automatically installed as it is part of the tidyverse package collection:

##Install tidyverse
install.packages("tidyverse")

Now, let’s load the libraries

##Load libraries
library(dada2); packageVersion("dada2")
#> [1] '1.24.0'
library(ShortRead);packageVersion("ShortRead")
#> [1] '1.54.0'
library(ggplot2);packageVersion("ggplot2")
#> [1] '3.4.1'
library(tidyverse);packageVersion("tidyverse")
#> [1] '2.0.0'

Great! We are ready to start the dada2 workflow. First of all, we will set the path to the FASTQ files. We will use them unzipped, but it is perfectly OK for dada2 to use them zipped. Just copy and paste your folder path next to path <-; remember that “~/” is only valid on Unix-like systems such as Ubuntu, where it points to your home folder. Windows users should write something like: "C:/Users/user_name/Desktop/PATH_to_your_Data".

##SET PATH, FILES AND SAMPLE.NAMES
path <- "/home/nuria/Escritorio/data_micro4all/" # CHANGE ME to the directory containing the fastq files
list.files(path) #list files in this folder
#>  [1] "ASV_length_nochim.rds"    "cutadapt"                 "figaro"                  
#>  [4] "filtered"                 "filtN"                    "L1-1_S102_R1_001.fastq"  
#>  [7] "L1-1_S102_R2_001.fastq"   "L1-2_S112_R1_001.fastq"   "L1-2_S112_R2_001.fastq"  
#> [10] "L1-3_S121_R1_001.fastq"   "L1-3_S121_R2_001.fastq"   "L1-4_S130_R1_001.fastq"  
#> [13] "L1-4_S130_R2_001.fastq"   "L1-5_S138_R1_001.fastq"   "L1-5_S138_R2_001.fastq"  
#> [16] "L1-6_S145_R1_001.fastq"   "L1-6_S145_R2_001.fastq"   "L1-7_S152_R1_001.fastq"  
#> [19] "L1-7_S152_R2_001.fastq"   "L1-8_S159_R1_001.fastq"   "L1-8_S159_R2_001.fastq"  
#> [22] "L2-1_S139_R1_001.fastq"   "L2-1_S139_R2_001.fastq"   "L2-2_S146_R1_001.fastq"  
#> [25] "L2-2_S146_R2_001.fastq"   "L2-3_S153_R1_001.fastq"   "L2-3_S153_R2_001.fastq"  
#> [28] "L2-4_S160_R1_001.fastq"   "L2-4_S160_R2_001.fastq"   "L2-5_S104_R1_001.fastq"  
#> [31] "L2-5_S104_R2_001.fastq"   "L2-6_S114_R1_001.fastq"   "L2-6_S114_R2_001.fastq"  
#> [34] "L2-7_S123_R1_001.fastq"   "L2-7_S123_R2_001.fastq"   "L2-8_S132_R1_001.fastq"  
#> [37] "L2-8_S132_R2_001.fastq"   "L3-1_S136_R1_001.fastq"   "L3-1_S136_R2_001.fastq"  
#> [40] "L3-2_S143_R1_001.fastq"   "L3-2_S143_R2_001.fastq"   "L3-3_S150_R1_001.fastq"  
#> [43] "L3-3_S150_R2_001.fastq"   "L3-4_S157_R1_001.fastq"   "L3-4_S157_R2_001.fastq"  
#> [46] "L3-5_S101_R1_001.fastq"   "L3-5_S101_R2_001.fastq"   "L3-6_S111_R1_001.fastq"  
#> [49] "L3-6_S111_R2_001.fastq"   "L3-7_S120_R1_001.fastq"   "L3-7_S120_R2_001.fastq"  
#> [52] "L3-8_S129_R1_001.fastq"   "L3-8_S129_R2_001.fastq"   "MOCK-1_S169_R1_001.fastq"
#> [55] "MOCK-1_S169_R2_001.fastq" "MOCK-2_S176_R1_001.fastq" "MOCK-2_S176_R2_001.fastq"
#> [58] "MOCK-3_S183_R1_001.fastq" "MOCK-3_S183_R2_001.fastq" "nochim"                  
#> [61] "num_seqs.txt"             "rdp_train_set_18_H.fa"    "tabla_otu_inicial.rds"   
#> [64] "taxa_rdp_16S.rds"

Second, we will store the path names of every FASTQ file. We will do this for forward reads (fnFs) and reverse reads (fnRs), respectively.

## SORT FORWARD AND REVERSE READS SEPARATELY; forward and reverse fastq filenames have format: SAMPLENAME_R1_001.fastq and SAMPLENAME_R2_001.fastq ##
fnFs <- sort(list.files(path, pattern="_R1_001.fastq", full.names = TRUE))
fnRs <- sort(list.files(path, pattern="_R2_001.fastq", full.names = TRUE))
# Extract sample names, assuming filenames have format: SAMPLENAME_XXX.fastq
sample.names <- sapply(strsplit(basename(fnFs), "_"), `[`, 1)

📝TIP: We get sample names with basename(fnFs), a function that removes everything up to the last path separator (/). Then, strsplit works over the file names and splits each one at every “_“. Afterwards, sapply loops over the resulting list of split names and takes the first part (corresponding to SAMPLENAME, if file names have the format SAMPLENAME_XXX.fastq). For example, our first sample has the following name:

##See name of the first sample (fnFs)
basename(fnFs[1])
#> [1] "L1-1_S102_R1_001.fastq"

Then, strsplit returns…

##Apply strsplit
strsplit(basename(fnFs[1]), "_")
#> [[1]]
#> [1] "L1-1"      "S102"      "R1"        "001.fastq"

…and we take the first part, which corresponds to “L1-1”. If your file names are more complex and you need to remove some additional pattern, you can always use gsub over sample.names. For example, to remove the trailing “-1” of sample.names[1], we would use gsub(pattern="-1", replacement="", sample.names[1]). Moreover, it is important for later analysis that your sample names are not purely numerical. If that’s the case, you can easily add a letter (for example, an “S”) as follows: paste0("S",sample.names).
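As an illustration, both clean-up operations look like this in code (the patterns below are hypothetical; adapt them to your own file names rather than running them on this dataset):

##Hypothetical sample-name clean-up (adapt the patterns to your own data)
#Remove a trailing "-1" from every sample name
sample.names.clean <- gsub(pattern = "-1$", replacement = "", sample.names)
#Prefix purely numeric sample names with an "S"
sample.names.clean <- paste0("S", sample.names.clean)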

Check sequence quality

A crucial step in any sequencing analysis pipeline is quality checking. For this purpose, we will use several approaches, so as to look at the data from different angles. First, we will generate a table with the number of reads in each sample, just to check whether anything went wrong during the sequencing process. We will use the ShortRead package. As we have already loaded the library, the ShortRead:: prefix is not strictly necessary, but it is useful to know which package each function comes from.

##COUNT NUMBER OF READS IN EACH SAMPLE BEFORE FILTERING 
raw_reads_count <- NULL #Creates an empty object

for (i in 1:length(fnFs)){ #loop over fnFs and count the number of reads in each file with readFastq

  raw_reads_count <- rbind(raw_reads_count, length(ShortRead::readFastq(fnFs[i])))}

#Format table to give it sample.names
rownames(raw_reads_count)<- sample.names
colnames(raw_reads_count)<- "Number of reads"

With raw_reads_count we can easily know the minimum and maximum number of sequences as follows:

#Get min and max number of sequences
min(raw_reads_count)
#> [1] 66426
max(raw_reads_count)
#> [1] 125873

As we can see, even the minimum number of sequences is good enough for our analysis. However, if that’s not the case, we can visually check the table with View(raw_reads_count) and write it to your device with write.table(raw_reads_count, file="raw_reads_count.txt", sep="\t").

Another approach to quality checking is to plot the distribution of sequence lengths as a histogram. Let’s see how it works!

##Histogram of sequences length

reads <- ShortRead::readFastq(fnFs) #store sequences in an R object

#Get lengths with unique
uniques <- unique(width(reads)) #get read lengths with the width accessor

#Count number of sequences for each length
counts <- NULL #Creates a null object for storing counts
for (i in 1:length(uniques)) {
  counts <- rbind(counts, length(which(width(reads) == uniques[i])))

}

#format histogram table
histogram <-  cbind(uniques,counts)
colnames(histogram) <- c("Seq.length", "counts")

Again, we can visually check the histogram table with View(histogram) and write it to your device with write.table(histogram, file="histogram.txt", sep="\t"). From now on, we will use head to see the main results, as well as hist for plotting.

#Check histogram matrix
head(histogram[order(histogram[,1],decreasing = TRUE),]) #Most sequences fall in expected sequence length
#>      Seq.length  counts
#> [1,]        301 2154642
#> [2,]        300   71753
#> [3,]        299   16251
#> [4,]        298   61651
#> [5,]        297   10119
#> [6,]        296     377

# PLOT HISTOGRAM
hist(width(reads), main="Forward length distribution", xlab="Sequence length", ylab="Raw reads")

Moreover, the plotQualityProfile function of dada2 is very useful for a quick visualization of sequence quality profiles. We can visualize the profiles for both forward (fnFs) and reverse (fnRs) reads. It is common to get worse profiles for reverse reads because of the sequencing process. In general, we can see that the average quality is good, as it only falls below 20 at the end of the reverse sequences. Again, we strongly suggest checking the dada2 tutorial for more information. We will visualize quality plots for the first two forward and reverse samples (indicated by [1:2]).

## VIEW AND SAVE QUALITY PLOT FOR FW AND RV ##
plotQualityProfile(fnFs[1:2])

plotQualityProfile(fnRs[1:2])

Figaro: selection of trimming parameters

In this tutorial, we will use the Figaro tool. It can be downloaded and installed following the instructions at the provided link. Don’t forget that in this tutorial we are running it on Ubuntu, so some changes would be necessary on Windows. This tool makes the selection of trimming parameters for dada2 more objective. Sadly, Figaro has not yet been updated to deal with variations in sequence length. We will solve this by following the author’s recommendation (which you can find here), which consists of performing a pre-trimming step at about 295 bp. This pre-trimming will be used only for Figaro, so we will store these sequences in a new folder called ‘figaro’. Pre-trimming will be done with the filterAndTrim function from dada2. Let’s see how it works!

##FIGARO

##Creates a new folder called 'figaro' inside our path to store reads after trimming for Figaro
figFs <- file.path(path, "figaro", basename(fnFs))
figRs <- file.path(path, "figaro", basename(fnRs))

##TRIMMING AT 295 bp
out.figaro <- filterAndTrim(fnFs, figFs, fnRs, figRs, #inputFreads,outputFreads,inputRreads,outputRreads
                     compress=TRUE, multithread=TRUE, truncLen=c(295,295)) 

Once we have the reads stored in the figaro folder and trimmed at 295 bp, we will run Figaro. To call Figaro from R on Ubuntu, we use the function system, which gives us access to the command line. We have to call python3, as Figaro runs on it. Then, all the appropriate arguments for Figaro should be given: the path to the Figaro program, -i path to input, -o path to output, -a length of your amplicon without primers, -f forward primer length, -r reverse primer length. As you can see, before using Figaro you will need to know your expected amplicon length without primers (you can calculate it based on the E. coli 16S rRNA gene) as well as both primer lengths.

##RUN FIGARO
figaro <- system(("python3 /home/programs/figaro/figaro/figaro.py -i /home/nuria/Escritorio/data_micro4all/figaro -o /home/nuria/Escritorio/data_micro4all/figaro -a 426 -f 17 -r 21"),
                 intern=TRUE) #path to figaro program -i path to input - path to output -a length of your amplicon without primers, -f primer forward length, -r primer reverse length
head(figaro)
#> [1] "Forward read length: 295"                                                                                                   
#> [2] "Reverse read length: 295"                                                                                                   
#> [3] "{\"trimPosition\": [279, 205], \"maxExpectedError\": [3, 2], \"readRetentionPercent\": 81.69, \"score\": 76.68937485688116}"
#> [4] "{\"trimPosition\": [278, 206], \"maxExpectedError\": [3, 2], \"readRetentionPercent\": 81.68, \"score\": 76.6750629722922}" 
#> [5] "{\"trimPosition\": [277, 207], \"maxExpectedError\": [3, 2], \"readRetentionPercent\": 81.67, \"score\": 76.66838409281735}"
#> [6] "{\"trimPosition\": [280, 204], \"maxExpectedError\": [3, 2], \"readRetentionPercent\": 81.66, \"score\": 76.6617052133425}"

We will select the first result, which gives us the trimming positions and maximum expected errors that produce the highest score according to the Figaro model. These are returned as [forward, reverse].
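If you prefer to extract these values programmatically, the best-scored record can be parsed from the captured output (a small sketch; it assumes the jsonlite package is installed and that, as in the run above, the first JSON record sits in the third element of figaro):

##Parse the top-ranked Figaro result (requires jsonlite)
library(jsonlite)
best.figaro <- fromJSON(figaro[3]) #third element holds the best-scored JSON record
best.figaro$trimPosition      #trimming positions, c(forward, reverse)
best.figaro$maxExpectedError  #maximum expected errors, c(forward, reverse)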

ITS considerations

The ITS region has a highly variable length. Consequently, no length trimming (truncLen argument) is performed in the filterAndTrim function from dada2, and therefore it is not necessary to use Figaro.

Filter and trim sequences

Now that we have the best trimming and expected error values according to Figaro, we can apply them to our original sequences (fnFs and fnRs; remember that figFs and figRs were used only for Figaro). Let’s apply the Figaro parameters to the filterAndTrim function from dada2! You will see that we create the filtFs and filtRs objects first. These are the files for storing the filtered sequences after applying filterAndTrim to the raw sequences (fnFs and fnRs). Afterwards, primers will be removed from filtFs and filtRs with the cutadapt tool.

### FILTER AND TRIM SEQUENCES ACCORDING TO FIGARO ####
## Place filtered files in filtered/ subdirectory
filtFs <- file.path(path, "filtered", basename(fnFs))
filtRs <- file.path(path, "filtered", basename(fnRs))

out <- filterAndTrim(fnFs, filtFs, fnRs, filtRs, truncLen=c(279,205),
                     maxN=0, maxEE=c(3,2), truncQ=2, rm.phix=TRUE,
                     compress=TRUE, multithread=TRUE, minLen=50)


head(out)
#>                        reads.in reads.out
#> L1-1_S102_R1_001.fastq    80054     61741
#> L1-2_S112_R1_001.fastq    76776     60781
#> L1-3_S121_R1_001.fastq    84947     67563
#> L1-4_S130_R1_001.fastq    87190     68997
#> L1-5_S138_R1_001.fastq   114810     93159
#> L1-6_S145_R1_001.fastq    88519     70075

ITS considerations

As explained in the last ITS considerations box, we do not use Figaro for ITS reads. However, we should still use the filterAndTrim step to remove ambiguous bases so that we can better recognize the primers afterwards. Remember not to use trimming parameters.

Cutadapt: efficient removal of primers

Now that we have run Figaro and performed quality trimming, we can continue with primer removal, an obligatory step in the dada2 workflow. This will be done with the cutadapt tool. You can also remove your primers with the argument trimLeft = c(FWD_PRIMER_LEN, REV_PRIMER_LEN) from the filterAndTrim function (dada2). However, we consider cutadapt a more accurate tool for this purpose, as trimLeft indiscriminately cuts the sequences at the given primer lengths, whether or not the primers are actually found in the sequence. As said above, we are working on Ubuntu, so if you are using cutadapt on Windows, you should make some changes to the code below. Another consideration is that cutadapt should not be applied before Figaro, because it can modify the quality profiles on which Figaro is based.

First of all, we have to store our primer sequences in two different R objects (FWD and REV). Then, we will produce all the possible orientations of these primers (further details can be found in the dada2 tutorial). This is possible thanks to the function allOrients, which we create below.

#### IDENTIFY PRIMERS ####

FWD <- "CCTACGGGNBGCASCAG"  ## CHANGE ME to your forward primer sequence
REV <- "GACTACNVGGGTATCTAATCC"  ## CHANGE ME to your reverse primer sequence

## VERIFY PRESENCE AND ORIENTATION OF PRIMERS ##
allOrients <- function(primer) {
  # Create all orientations of the input sequence
  require(Biostrings)
  dna <- DNAString(primer)  # Biostrings works with DNAString objects rather than character vectors
  orients <- c(Forward = dna, Complement = Biostrings::complement(dna), Reverse = reverse(dna),
               RevComp = reverseComplement(dna))
  return(sapply(orients, toString))  # Convert back to character vector
}
FWD.orients <- allOrients(FWD)
REV.orients <- allOrients(REV)
FWD.orients
#>             Forward          Complement             Reverse             RevComp 
#> "CCTACGGGNBGCASCAG" "GGATGCCCNVCGTSGTC" "GACSACGBNGGGCATCC" "CTGSTGCVNCCCGTAGG"
REV.orients
#>                 Forward              Complement                 Reverse                 RevComp 
#> "GACTACNVGGGTATCTAATCC" "CTGATGNBCCCATAGATTAGG" "CCTAATCTATGGGVNCATCAG" "GGATTAGATACCCBNGTAGTC"

ITS considerations

For ITS analysis, don’t forget to change the primers to the appropriate ones; in our case, we use fITS7 (FWD <- “GTGARTCATCGAATCTTTG”) and ITS4 (REV <- “TCCTCCGCTTATTGATATGC”) for the ITS2 region.

Afterwards, we want to count the appearances of the primers in our sequences with the function primerHits:

## COUNT THE APPEARANCE AND ORIENTATION OF PRIMERS ##
primerHits <- function(primer, fn) {
  # Counts number of reads in which the primer is found
  nhits <- vcountPattern(primer, sread(readFastq(fn)), fixed = FALSE)
  return(sum(nhits > 0))
}
rbind(FWD.ForwardReads = sapply(FWD.orients, primerHits, fn = filtFs[[1]]),
      FWD.ReverseReads = sapply(FWD.orients, primerHits, fn = filtRs[[1]]),
      REV.ForwardReads = sapply(REV.orients, primerHits, fn = filtFs[[1]]),
      REV.ReverseReads = sapply(REV.orients, primerHits, fn = filtRs[[1]]))
#>                  Forward Complement Reverse RevComp
#> FWD.ForwardReads   59926          0       0       0
#> FWD.ReverseReads       0          0       0       0
#> REV.ForwardReads       2          0       0       0
#> REV.ReverseReads   60270          0       0       0

But… what is this table all about? On the left, we have the presence of the forward and reverse primers (FWD and REV) in the forward and reverse reads (FWD.ForwardReads, FWD.ReverseReads, REV.ForwardReads and REV.ReverseReads). On top of the table, we see the four possible orientations of the primers (Forward, Complement, Reverse and RevComp). Taking all this information into account, in our dataset we found 59926 sequences with the FWD primer in the forward reads in its forward orientation. As expected, the REV primer was found in the reverse reads in its forward orientation. Cutadapt is used in the dada2 ITS tutorial, so check it if you want more information.

ITS considerations

Change filtFs and filtRs to fnFs and fnRs, respectively, since you didn’t perform the previous filtering step.

Now, let’s run cutadapt. The process is similar to what we did with Figaro. First, we save the path to the cutadapt program in cutadapt. Then, we use system2 to call cutadapt.

## RUN CUTADAPT
cutadapt <- "/usr/local/bin/cutadapt" #path to cutadapt 

system2(cutadapt, args = c("--version")) # Run shell commands from R

Now, it is crucial to create a variable called path.cut where we will store the sequences after being processed by cutadapt. We will give these sequences the same names as filtFs and filtRs. For an easier use of the tool, we create some of its arguments with paste0. The arguments we use are very well explained in the cutadapt user guide. Specifically, we implement arguments for linked adapters. In this sense, -a and -A precede the adapter sequences, for R1 (forward) and R2 (reverse) reads respectively. We use ^ for anchoring primers, that is, primers are only removed if the anchored primer is found. Moreover, we combine it with the --discard-untrimmed argument. In this way, we make sure that a sequence is not retained if only the reverse primer is found: both forward and reverse primers must be present in order to keep a sequence. At the end, we use a for loop over system2 to run cutadapt on every file.

##Create path to cutadapt sequences
path.cut <- file.path(path, "cutadapt") 

if(!dir.exists(path.cut)) dir.create(path.cut)

fnFs.cut <- file.path(path.cut, basename(filtFs))
fnRs.cut <- file.path(path.cut, basename(filtRs))


##Produce arguments for cutadapt. rc function creates the reverse complementary of the provided sequence (FWD).  
FWD.RC <- dada2:::rc(FWD)
REV.RC <- dada2:::rc(REV)

# Trim FWD and the reverse-complement of REV off of R1 (forward reads)
R1.flags <- paste0("-a", " ", "^",FWD,"...", REV.RC) 

# Trim REV and the reverse-complement of FWD off of R2 (reverse reads)
R2.flags <- paste0("-A"," ","^", REV, "...", FWD.RC)


# Run Cutadapt

for(i in seq_along(fnFs)) {
  system2(cutadapt, args = c(R1.flags, R2.flags, "-n", 2,"-m", 1, # -n 2 required to remove FWD and REV from reads
                             #-m 1 is required to remove empty sequences for plotting quality plots
                             "--discard-untrimmed",
                             "-j",0,#automatically detect number of cores
                             "-o", fnFs.cut[i], "-p", fnRs.cut[i], # output files
                             filtFs[i], filtRs[i],# input files
                             "--report=minimal")) #Report minimal reports a summary

}

ITS considerations

Change filtFs and filtRs to fnFs and fnRs, respectively, since you didn’t perform the previous filtering step.

Now, we have the filtered sequences without primers in fnFs.cut and fnRs.cut. From now on, these are the sequences we will use for the analysis.

Let’s see if cutadapt has properly removed the primers with primerHits:

#Check primer presence after cutadapt
rbind(FWD.ForwardReads = sapply(FWD.orients, primerHits, fn = fnFs.cut[[1]]),
      FWD.ReverseReads = sapply(FWD.orients, primerHits, fn = fnRs.cut[[1]]),
      REV.ForwardReads = sapply(REV.orients, primerHits, fn = fnFs.cut[[1]]),
      REV.ReverseReads = sapply(REV.orients, primerHits, fn = fnRs.cut[[1]]))
#>                  Forward Complement Reverse RevComp
#> FWD.ForwardReads       1          0       0       0
#> FWD.ReverseReads       0          0       0       0
#> REV.ForwardReads       0          0       0       0
#> REV.ReverseReads       0          0       0       0

As we can see, a forward primer hit can still be found in the forward reads. We should not worry about that: cutadapt and primerHits have different ways of looking for primers in a sequence. Moreover, if primers are still found in low-quality sequences, they will probably be removed in the next steps.

ITS considerations

If analysing ITS sequences, remember to apply the filtering step now. See the code below:

filtFs <- file.path(path.cut, "filtered", basename(fnFs.cut))
filtRs <- file.path(path.cut, "filtered", basename(fnRs.cut))
out <- filterAndTrim(fnFs.cut, filtFs, fnRs.cut, filtRs, maxN = 0, maxEE = c(2, 5),
                     truncQ = 2, minLen = 50, rm.phix = TRUE, compress = TRUE, multithread = TRUE)
head(out)

With head(out) you can check whether the maxEE you set was too strict. If you are losing too many sequences, consider increasing its values.
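As a quick check, the percentage of reads retained per sample can be computed directly from out (a minimal sketch; it works for both the 16S and the ITS runs of filterAndTrim):

#Percentage of reads kept by filterAndTrim in each sample
retained.perc <- out[, "reads.out"] / out[, "reads.in"] * 100
summary(retained.perc)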

In the next steps, we will apply the core of the DADA2 pipeline. As it is very well detailed in its tutorial, we will not comment on it at length. Good luck! 🕓

Learn error rates

#### LEARN ERROR RATES ####
errF <- learnErrors(fnFs.cut, multithread=T, verbose=1) #
#> 109967023 total bases in 419736 reads from 6 samples will be used for learning the error rates.
#> Initializing error rates to maximum possible estimate.
#> selfConsist step 1 ......
#>    selfConsist step 2
#>    selfConsist step 3
#>    selfConsist step 4
#>    selfConsist step 5
#> Convergence after  5  rounds.
errR <- learnErrors(fnRs.cut, multithread=T, verbose=1) #
#> 102040114 total bases in 554558 reads from 8 samples will be used for learning the error rates.
#> Initializing error rates to maximum possible estimate.
#> selfConsist step 1 ........
#>    selfConsist step 2
#>    selfConsist step 3
#>    selfConsist step 4
#>    selfConsist step 5
#>    selfConsist step 6
#> Convergence after  6  rounds.

ITS considerations

Change fnFs.cut and fnRs.cut to filtFs and filtRs, respectively. Do the same in the next steps: dadaFs, dadaRs and mergers.

View error plots

As we can see, the estimated error rates (black line) are a good fit to the observed rates (points), so we can proceed.

#Plot errors
plotErrors(errF, nominalQ=TRUE)
#> Warning: Transformation introduced infinite values in continuous y-axis
#> Transformation introduced infinite values in continuous y-axis

plotErrors(errR, nominalQ=TRUE)
#> Warning: Transformation introduced infinite values in continuous y-axis
#> Transformation introduced infinite values in continuous y-axis

Sample inference

#Sample inference
dadaFs <- dada(fnFs.cut, err=errF, multithread=TRUE)
#> Sample 1 - 61506 reads in 15424 unique sequences.
#> Sample 2 - 60287 reads in 12454 unique sequences.
#> Sample 3 - 67108 reads in 13471 unique sequences.
#> Sample 4 - 68541 reads in 13392 unique sequences.
#> Sample 5 - 92672 reads in 19174 unique sequences.
#> Sample 6 - 69622 reads in 14664 unique sequences.
#> Sample 7 - 66191 reads in 14858 unique sequences.
#> Sample 8 - 68631 reads in 17064 unique sequences.
#> Sample 9 - 64360 reads in 12997 unique sequences.
#> Sample 10 - 73537 reads in 16302 unique sequences.
#> Sample 11 - 64206 reads in 12105 unique sequences.
#> Sample 12 - 74353 reads in 16945 unique sequences.
#> Sample 13 - 62585 reads in 14748 unique sequences.
#> Sample 14 - 72255 reads in 16401 unique sequences.
#> Sample 15 - 69370 reads in 15705 unique sequences.
#> Sample 16 - 56068 reads in 13202 unique sequences.
#> Sample 17 - 73197 reads in 14786 unique sequences.
#> Sample 18 - 58236 reads in 12100 unique sequences.
#> Sample 19 - 56801 reads in 13257 unique sequences.
#> Sample 20 - 56072 reads in 12635 unique sequences.
#> Sample 21 - 57847 reads in 12828 unique sequences.
#> Sample 22 - 65716 reads in 11388 unique sequences.
#> Sample 23 - 69528 reads in 13095 unique sequences.
#> Sample 24 - 52479 reads in 9714 unique sequences.
#> Sample 25 - 89584 reads in 7824 unique sequences.
#> Sample 26 - 93427 reads in 8209 unique sequences.
#> Sample 27 - 106500 reads in 9724 unique sequences.
dadaRs <- dada(fnRs.cut, err=errR, multithread=TRUE)
#> Sample 1 - 61506 reads in 15090 unique sequences.
#> Sample 2 - 60287 reads in 8972 unique sequences.
#> Sample 3 - 67108 reads in 10240 unique sequences.
#> Sample 4 - 68541 reads in 11098 unique sequences.
#> Sample 5 - 92672 reads in 15470 unique sequences.
#> Sample 6 - 69622 reads in 11810 unique sequences.
#> Sample 7 - 66191 reads in 10626 unique sequences.
#> Sample 8 - 68631 reads in 13787 unique sequences.
#> Sample 9 - 64360 reads in 9862 unique sequences.
#> Sample 10 - 73537 reads in 11339 unique sequences.
#> Sample 11 - 64206 reads in 8219 unique sequences.
#> Sample 12 - 74353 reads in 12900 unique sequences.
#> Sample 13 - 62585 reads in 11369 unique sequences.
#> Sample 14 - 72255 reads in 12843 unique sequences.
#> Sample 15 - 69370 reads in 12557 unique sequences.
#> Sample 16 - 56068 reads in 11103 unique sequences.
#> Sample 17 - 73197 reads in 12138 unique sequences.
#> Sample 18 - 58236 reads in 9899 unique sequences.
#> Sample 19 - 56801 reads in 9540 unique sequences.
#> Sample 20 - 56072 reads in 10430 unique sequences.
#> Sample 21 - 57847 reads in 9716 unique sequences.
#> Sample 22 - 65716 reads in 9249 unique sequences.
#> Sample 23 - 69528 reads in 11676 unique sequences.
#> Sample 24 - 52479 reads in 8641 unique sequences.
#> Sample 25 - 89584 reads in 6378 unique sequences.
#> Sample 26 - 93427 reads in 6198 unique sequences.
#> Sample 27 - 106500 reads in 7362 unique sequences.
dadaFs[[1]]
#> dada-class: object describing DADA2 denoising results
#> 758 sequence variants were inferred from 15424 input unique sequences.
#> Key parameters: OMEGA_A = 1e-40, OMEGA_C = 1e-40, BAND_SIZE = 16

As we can see, thanks to the DADA2 algorithm, 758 ASVs were inferred from 15424 unique sequences (remember that this is only for the first sample, as we wrote [[1]]). Next, we will give dadaFs and dadaRs the sample.names.

#Set sample names
names(dadaFs) <- sample.names
names(dadaRs) <- sample.names

Merge paired-end reads

#Merge sequences
mergers <- mergePairs(dadaFs, fnFs.cut, dadaRs, fnRs.cut, verbose=TRUE)
#> 58056 paired-reads (in 841 unique pairings) successfully merged out of 59510 (in 1465 pairings) input.
#> 57884 paired-reads (in 527 unique pairings) successfully merged out of 58969 (in 973 pairings) input.
#> 65180 paired-reads (in 492 unique pairings) successfully merged out of 66042 (in 881 pairings) input.
#> 66703 paired-reads (in 486 unique pairings) successfully merged out of 67552 (in 865 pairings) input.
#> 89414 paired-reads (in 932 unique pairings) successfully merged out of 90777 (in 1561 pairings) input.
#> 66897 paired-reads (in 665 unique pairings) successfully merged out of 68102 (in 1223 pairings) input.
#> 63906 paired-reads (in 609 unique pairings) successfully merged out of 64927 (in 1106 pairings) input.
#> 65300 paired-reads (in 860 unique pairings) successfully merged out of 66704 (in 1524 pairings) input.
#> 62656 paired-reads (in 606 unique pairings) successfully merged out of 63300 (in 889 pairings) input.
#> 71291 paired-reads (in 635 unique pairings) successfully merged out of 72263 (in 1082 pairings) input.
#> 62506 paired-reads (in 443 unique pairings) successfully merged out of 63204 (in 757 pairings) input.
#> 71825 paired-reads (in 733 unique pairings) successfully merged out of 72805 (in 1226 pairings) input.
#> 59867 paired-reads (in 626 unique pairings) successfully merged out of 61145 (in 1295 pairings) input.
#> 69397 paired-reads (in 807 unique pairings) successfully merged out of 70680 (in 1361 pairings) input.
#> 66800 paired-reads (in 773 unique pairings) successfully merged out of 67865 (in 1198 pairings) input.
#> 53498 paired-reads (in 691 unique pairings) successfully merged out of 54585 (in 1149 pairings) input.
#> 70984 paired-reads (in 668 unique pairings) successfully merged out of 71999 (in 1094 pairings) input.
#> 56672 paired-reads (in 449 unique pairings) successfully merged out of 57345 (in 744 pairings) input.
#> 54758 paired-reads (in 705 unique pairings) successfully merged out of 55509 (in 1006 pairings) input.
#> 54202 paired-reads (in 648 unique pairings) successfully merged out of 54911 (in 986 pairings) input.
#> 55901 paired-reads (in 585 unique pairings) successfully merged out of 56778 (in 1000 pairings) input.
#> 64397 paired-reads (in 475 unique pairings) successfully merged out of 65119 (in 738 pairings) input.
#> 67867 paired-reads (in 532 unique pairings) successfully merged out of 68681 (in 897 pairings) input.
#> 51487 paired-reads (in 353 unique pairings) successfully merged out of 51953 (in 540 pairings) input.
#> 89364 paired-reads (in 74 unique pairings) successfully merged out of 89388 (in 85 pairings) input.
#> 93212 paired-reads (in 82 unique pairings) successfully merged out of 93267 (in 96 pairings) input.
#> 106318 paired-reads (in 139 unique pairings) successfully merged out of 106393 (in 160 pairings) input.

# Inspect the merger data.frame from the first sample
head(mergers[[1]])
#>                                                                                                                                                                                                                                                                                                                                                                                                                    sequence
#> 1   TGGGGAATATTGCACAATGGGCGAAAGCCTGATGCAGCGACGCCGCGTGAGGGATGACGGCCTTCGGGTTGTAAACCTCTTTCAGCAGGGAAGAAGCGAAAGTGACGGTACCTGCAGAAGAAGCGCCGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGGCGCAAGCGTTGTCCGGAATTATTGGGCGTAAAGAGCTCGTAGGCGGTCTGTCGCGTCGGATGTGAAAGCCCGGGGCTTAACCCCGGGTCTGCATTCGATACGGGCAGACTAGAGTGTGGTAGGGGAGATCGGAATTCCTGGTGTAGCGGTGAAATGCGCAGATATCAGGAGGAACACCGGTGGCGAAGGCGGATCTCTGGGCCATTACTGACGCTGAGGAGCGAAAGCGTGGGGAGCGAACA
#> 2 TGGGGAATATTGCGCAATGGGCGAAAGCCTGACGCAGCGACGCCGCGTGAGGGATGACGGCCTTCGGGTTGTAAACCTCTTTCGCCAGGGACGAAGCTTTCGGGTGACGGTACCTGGAGAAGAAGCACCGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGGTGCGAGCGTTGTCCGGAATTATTGGGCGTAAAGAGCTCGTAGGCGGTTTGTTGCGTCGGCCGTGAAAACTTCACGCTTAACGTGGAGCTTGCGGTCGATACGGGCAGACTTGAGTTCGGTAGGGGAGACTGGAATTCCTGGTGTAGCGGTGAAATGCGCAGATATCAGGAGGAACACCGGTGGCGAAGGCGGGTCTCTGGGCCGATACTGACGCTGAGGAGCGAAAGCGTGGGGAGCGAACA
#> 3   TGGGGAATATTGCACAATGGGCGAAAGCCTGATGCAGCGACGCCGCGTGAGGGATGACGGCCTTCGGGTTGTAAACCTCTTTCAGCAGGGAAGAAGCGAAAGTGACGGTACCTGCAGAAGAAGCGCCGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGGCGCAAGCGTTGTCCGGAATTATTGGGCGTAAAGAGCTCGTAGGCGGTCTGTCACGTCGGATGTGAAAGCCCGGGGCTTAACCCCGGGTCTGCATTCGATACGGGCAGACTAGAGTGTGGTAGGGGAGATCGGAATTCCTGGTGTAGCGGTGAAATGCGCAGATATCAGGAGGAACACCGGTGGCGAAGGCGGATCTCTGGGCCATTACTGACGCTGAGGAGCGAAAGCGTGGGGAGCGAACA
#> 4   TGGGGAATATTGCGCAATGGGCGAAAGCCTGACGCAGCGACGCCGCGTGAGGGATGACGGCCTTCGGGTTGTAAACCTCTTTCGCCAGGGACGAAGGGTGACTGACGGTACCTGGAGAAGAAGCACCGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGGTGCGAGCGTTGTCCGGAATTATTGGGCGTAAAGAGCTCGTAGGCGGTTTGTTGCGTCGGCCGTGAAAACTTCACGCTTAACGTGGAGCTTGCGGTCGATACGGGCAGACTTGAGTTCGGTAGGGGAGACTGGAATTCCTGGTGTAGCGGTGAAATGCGCAGATATCAGGAGGAACACCGGTGGCGAAGGCGGGTCTCTGGGCCGATACTGACGCTGAGGAGCGAAAGCGTGGGGAGCGAACA
#> 5   TGGGGAATATTGCGCAATGGGCGGAAGCCTGACGCAGCGACGCCGCGTGGGGGATGACGGCCTTCGGGTTGTAAACCTCTTTCAGCAGGGACGAAGCTAACGTGACGGTACCTGCAGAAGAAGCGCCGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGGCGCAAGCGTTGTCCGGAATTATTGGGCGTAAAGAGCTCGTAGGCGGTTTGTTGCGTCCGTCGTGAAAGCCCACGGCTTAACTGTGGGTCTGCGGTGGATACGGGCAGACTAGAGGCAGGTAGGGGAGCATGGAATTCCCGGTGTAGCGGTGAAATGCGCAGATATCGGGAGGAACACCGGTGGCGAAGGCGGTGCTCTGGGCCTGTACTGACGCTGAGGAGCGAAAGCGTGGGGAGCGAACA
#> 6      TGGGGAATCTTGGACAATGGGCGAAAGCCCGATCCAGCAATATCGCGTGAGTGAAGAAGGGCAATGCCGCTTGTAAAGCTCTTTCATCGAGTGCGCGATCATGACAAGACTCGAGGAAGAAGCCCCGGCTAACTCCGTGCCAGCAGCCGCGGTAAGACGGGGGGGGCAAGTGTTCTTCGGAATGACTGGGCGTAAAGGGCACGTAGGCGGTGAATCGGGTTGAAAGTGAAAGTCGCCAAAAACTGGCGGAATGCTCTCGAAACCAATTCACTTGAGTGAGACAGAGGAGAGTGGAATTTCGTGTGTAGGGGTGAAATCCGGAGATCTACGAAGGAACGCCAAAAGCGAAGGCAGCTCTCTGGGTCCCTACCGACGCTGGGGTGCGAAAGCATGGGGAGCGAACA
#>   abundance forward reverse nmatch nmismatch nindel prefer accept
#> 1      6997       1       1     39         0      0      2   TRUE
#> 2      5696       2       2     37         0      0      2   TRUE
#> 3      3738       3       1     39         0      0      2   TRUE
#> 4      2282       4       2     39         0      0      2   TRUE
#> 5      1950       5       5     39         0      0      2   TRUE
#> 6      1931       6       3     42         0      0      2   TRUE

Construct sequence table

#Construct sequence table
seqtab <- makeSequenceTable(mergers)
dim(seqtab) 
#> [1]   27 7080

Our ASV table (seqtab) contains 7080 ASVs from 27 samples (24 samples + 3 MOCK).

Remove chimeras

#Remove chimeras
seqtab.nochim <- removeBimeraDenovo(seqtab, method="consensus", multithread=TRUE, verbose=TRUE)
#> Identified 2163 bimeras out of 7080 input sequences.
dim(seqtab.nochim)
#> [1]   27 4917

After chimera removal, we have reduced the number of ASVs from 7080 to 4917. It may seem a lot, but these chimeras do not account for a great number of sequences. Don’t worry, we will check that more thoroughly a few steps later!
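In the meantime, a quick sanity check borrowed from the dada2 tutorial is the fraction of sequences (not ASVs) that survive chimera removal:

#Fraction of sequences kept after chimera removal
sum(seqtab.nochim)/sum(seqtab)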

Inspect ASV length and abundance

After all this process, we have an ASV table without chimeras (seqtab.nochim). Before further analysis, it is interesting to check the distribution of ASV lengths as well as the number of ASVs at each length. This gives us valuable information and serves as a checkpoint of the process. Moreover, we will remove those lengths that are not expected from our sequencing design and that represent a small number of ASVs and sequences, as they are probably a result of non-specific priming.

# Inspect distribution of sequence lengths
table(nchar(getSequences(seqtab.nochim)))  #Number of ASV of each length
#> 
#>  244  247  253  255  259  262  264  269  273  276  299  304  311  316  321  327  338  339  356 
#>    2    2    1    1    1    4    1    1    1    1    1    1    1    1    1    1    1    1    1 
#>  359  361  364  369  370  373  376  380  382  384  385  386  390  396  397  400  401  402  403 
#>    1    1    2    1    1    1    1    1    2    2    6    8    1    5    2    4   44  984  294 
#>  404  405  406  407  408  409  410  411  412  413  414  415  416  417  418  419  420  421  422 
#>  202  156   20  387   82   44   24   20    3   11    7    5   50   16   18   32   27  216  552 
#>  423  424  425  426  427  428  429  430  431  432 
#>   40   29   57  265 1026  218   14    3    2    6

This table is really simple: the first line shows the ASV length and, below it, the number of ASVs. As expected, most ASVs have a length of 427 bp.

However, this way we don’t know how many sequences these ASVs account for at each length. Let’s calculate it and make a plot! We will store the number of sequences in reads.per.seqlen.

#Get number of sequences per length
reads.per.seqlen <- tapply(colSums(seqtab.nochim), factor(nchar(getSequences(seqtab.nochim))), sum) #number of sequences for each length
reads.per.seqlen
#>    244    247    253    255    259    262    264    269    273    276    299    304    311 
#>      4    104      2      4      2    175      6      3      2      2      2      4      2 
#>    316    321    327    338    339    356    359    361    364    369    370    373    376 
#>      2      2      4      2      2      2      2      4     50      4      2      6      4 
#>    380    382    384    385    386    390    396    397    400    401    402    403    404 
#>      4  36451      8     40   1082      2     15     13    661    504 130086  99017 277172 
#>    405    406    407    408    409    410    411    412    413    414    415    416    417 
#>   3998    995 592438 116044  58285   1601   1328   4790    179   2455     93   2878    133 
#>    418    419    420    421    422    423    424    425    426    427    428    429    430 
#>    612   5778   1023  12092  29632    268    448   4098  12680 370689  13647    119      8 
#>    431    432 
#>      4    213

Now, we will produce a table ready for plotting

## Plot length distribution
table_reads_seqlen <- data.frame(length=as.numeric(names(reads.per.seqlen)), count=reads.per.seqlen) #Create a table

ggplot(data=table_reads_seqlen, aes(x=length, y=count)) + geom_col() #plot it!

Based on the table and the plot, we will keep the interval ranging from 402 to 428 bp.

#Filter ASV length
seqtab.nochim <- seqtab.nochim[,nchar(colnames(seqtab.nochim)) %in% seq(402,428)]

ITS considerations

The above analysis should be avoided because of the variable length of the ITS2 region.

Check number of reads

At this point, it is important to check how many sequences we have lost in each step. Apart from filtering, no other step should have led to a substantial decrease in sequence numbers. First, as shown in the dada2 tutorial, we create the getN function and then apply it to produce a track table.

#Check number of sequences at each step
getN <- function(x) sum(getUniques(x))
track <- cbind(out, sapply(dadaFs, getN), sapply(dadaRs, getN), sapply(mergers, getN), rowSums(seqtab.nochim))
# If processing a single sample, remove the sapply calls: e.g. replace sapply(dadaFs, getN) with getN(dadaFs)
colnames(track) <- c("input", "filtered", "denoisedF", "denoisedR", "merged", "nonchim")
rownames(track) <- sample.names


#Let's apply a mutate to get percentages, making the table easier to analyze.
track <- track %>% as.data.frame() %>% mutate(Perc.filtered=filtered*100/input,
                                             Perc.denoisedF= denoisedF*100/filtered,
                                             Perc.denoisedR= denoisedR*100/filtered,
                                             Perc.merged= merged*100/filtered,
                                             Perc.nonchim= nonchim*100/merged,
                                             Perc.retained=nonchim*100/input
                                             )

head(track)
#>       input filtered denoisedF denoisedR merged nonchim Perc.filtered Perc.denoisedF
#> L1-1  80054    61741     60003     60662  58056   55455      77.12419       97.18501
#> L1-2  76776    60781     59308     59694  57884   56153      79.16667       97.57655
#> L1-3  84947    67563     66299     66670  65180   63021      79.53548       98.12915
#> L1-4  87190    68997     67828     68092  66703   63831      79.13408       98.30572
#> L1-5 114810    93159     91321     91801  89414   84491      81.14189       98.02703
#> L1-6  88519    70075     68479     68988  66897   64905      79.16380       97.72244
#>      Perc.denoisedR Perc.merged Perc.nonchim Perc.retained
#> L1-1       98.25238    94.03152     95.51984      69.27199
#> L1-2       98.21161    95.23371     97.00954      73.13874
#> L1-3       98.67827    96.47292     96.68763      74.18861
#> L1-4       98.68835    96.67522     95.69435      73.20908
#> L1-5       98.54228    95.97999     94.49415      73.59202
#> L1-6       98.44880    95.46486     97.02229      73.32324

Everything is as expected: the only step where we lose a greater number of sequences is the filtering step. If you want to visualize the full table and save it, remember to use View(track) and write.table(track, file="dada2_track.txt", sep="\t"). Let’s keep on with the classification!

Taxonomy classification with RDP

As mentioned at the beginning of the tutorial, we will use a modified RDP database. We usually work with root samples, so the detection of mitochondrial sequences from host cells was crucial for us. Because of that, we have added mitochondrial sequences to the database. Moreover, the RDP database has the drawback that some taxa lack certain classification levels. This is the case for some Acidobacteria: we found the phylum Acidobacteria, the class Acidobacteria_gp1 and the genus gp1, but no family or order for this record, giving an unclassified result at those levels. Classification is easy to apply with assignTaxonomy; you only have to change the path to your RDP database file.

#Classification with RDP
taxa_rdp <- assignTaxonomy(seqtab.nochim, "/home/nuria/Escritorio/data_micro4all/rdp_train_set_18_H.fa", multithread=TRUE)
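Before moving on, it can be useful to take a quick look at the assignments (a small check taken from the dada2 tutorial; the sequence rownames are hidden only for display):

#Inspect the first classifications without the long sequence rownames
taxa.print <- taxa_rdp
rownames(taxa.print) <- NULL
head(taxa.print)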

ITS considerations

For ITS classification, you can download the UNITE database from the DADA2 taxonomic reference data website.

Get ASV table

Now that we have a classification table (taxa_rdp) and an ASV table (seqtab.nochim), we will produce a table combining classification and ASV abundance. First, let’s assign seqtab.nochim to a new variable called ASV. Then, we will substitute ‘NA’ in taxa_rdp with ‘unclassified’.

#Combine taxonomy and ASV abundance
ASV <- seqtab.nochim
ASVt <- t(ASV)

## SUBSTITUTE 'NA' WITH 'UNCLASSIFIED' and remove species column
taxa_rdp_na <- apply(taxa_rdp,2, tidyr::replace_na, "unclassified")[,-7]

The next lines are used to create the ASV names. We do it in such a way that if there are 1000 ASVs, they will be named from ASV0001 to ASV1000. The code below automatically takes into account the number of zeros to add according to the total number of ASVs.

##Create names for each ASV: if there are 1000 ASVs, call them from ASV0001 to ASV1000
number.digit<- nchar(as.integer(nrow(ASVt)))
names <- paste0("ASV%0", number.digit, "d") #As many 0 as digits
ASV_names<- sprintf(names, 1:nrow(ASVt))
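A quick look confirms the zero-padded naming scheme:

#Check the generated names (zero-padded to the width of the total ASV count)
head(ASV_names)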

At last, we have everything needed to create our ASV table. We are going to bind its different elements and format it a little. Then, we save it as .txt. This allows us to manipulate the data manually if needed (you can open it with tools such as Excel), as well as to keep a copy of the raw data before further steps are performed.

## CREATE AND SAVE ASV TABLE
ASV_table_classified<- cbind(as.data.frame(taxa_rdp_na,stringsAsFactors = FALSE),as.data.frame(ASV_names, stringsAsFactors = FALSE),as.data.frame(ASVt,stringsAsFactors = FALSE))

##Make rownames a new column, in order to keep sequences during the filtering process
ASV_seqs <- rownames(ASV_table_classified)
rownames(ASV_table_classified) <- NULL
ASV_table_classified <- cbind(ASV_seqs, ASV_table_classified)

write.table(ASV_table_classified, file="ASV_table_classified.txt", sep="\t")

If you already have an ASV table, you can load it here with ASV_table_classified <- read.table(file="yourfilepath",sep="yourseparationmethod", header=TRUE) and follow the tutorial from now on.

MOCK community

In this step we want to filter out those ASVs that are misassigned to other samples. This can be done by analyzing MOCK community samples included in the same sequencing run. In this sense, we always include three samples of a commercial MOCK community (ZymoBIOMICS Microbial Community Standard II (Log Distribution), ZYMORESEARCH, CA, USA) in our sequencing runs. The logic of this analysis is as follows: because we already know its composition, we can check whether any ASV identified by the sequencing process is not a real member of the MOCK community. Once identified, we calculate the percentage of sequences it represents in the MOCK samples. We make the assumption that ASVs representing less than this percentage in the sequencing run should be discarded from our data. To automate this analysis, micro4all implements a function called MockCommunity. If you want to know more about this function, you can always read the help page with ?MockCommunity. Basically, you have to give it an ASV table where the samples are columns. MOCK sample names should include the word ‘MOCK’ in capital letters. The second argument corresponds to a MOCK composition file which lists the members of the community at genus level in decreasing order of abundance. You can see the structure of this file as it is loaded with the micro4all package (use View(micro4all::mock_composition)). If you want to create a different MOCK composition object, you can easily do it with mock_comp <- setNames(as.data.frame(c("genera1", "genera2","genera3")), c("MockGenera")). Lastly, you should change the argument ASV_column to the name of the column where your ASV names are stored.

The MockCommunity function is user-interactive: it will ask you whether you want to use the first misassigned ASV detected to calculate the percentage or not. If the answer is ‘no’, it will look for the next one. Once the answer is ‘yes’, it will remove the MOCK community samples, apply the filtering and return a cleaned ASV table sorted in decreasing order of ASV abundance.

Let’s see how it works:

#MOCK community cleaning.
ASV_filtered_MOCK<-MockCommunity(ASV_table_classified, mock_composition, ASV_column = "ASV_names") #mock_composition is a data frame included as data in this package. See documentation for details.
#> ASV0811 does not belong to the MOCK community. It representes a  0.031757  perc. of the sequences 
#>  and it classifies as Limosilactobacillus 
#>  Do you want to use this ASV to calculate the percentage? [answer yes or no]ASV0002 does not belong to the MOCK community. It representes a  0.003881  perc. of the sequences 
#>  and it classifies as Streptomyces 
#>  Do you want to use this ASV to calculate the percentage? [answer yes or no]You made a decision! ASV0002 which classifies as Streptomyces 
#>  and represents a 0.003881 perc. of the sequences, was used to calculate the percentage

As explained above, MockCommunity asks the user a question before making the decision. Why? Sometimes, like in the example above, an ASV could be classified under a different name while still being the same genus. In this example, the first ASV that is not found in the MOCK community composition table classifies as Limosilactobacillus. However, we can see that it represents a high percentage of sequences: it most likely corresponds to a Lactobacillus from the MOCK samples (Limosilactobacillus was split off from Lactobacillus). Because of that, we decide to choose the second option.

ITS considerations

If analysing ITS data, MOCK samples with a good number of fungal representatives should be included. If your bacteria and fungi are sequenced in the same run, the percentage calculated for bacteria can be applied to the fungal dataset.

When a MOCK community has not been analyzed, it is recommended to filter out those ASVs which represent less than 0.005% of the sequences, as studied by Bokulich et al. (2013). We can do it easily in R as follows:

###CLEAN according to Bokulich####
# Get number of sequences of each ASV 
ASV_sums <- rowSums(ASV_table_classified[,9:ncol(ASV_table_classified)])
# Get total number of sequences
sum.total<-sum(ASV_sums)
# Apply percentage to sequence number
nseq_cutoff<-(0.005/100)*sum.total
# Filter table.
ASV_filtered_bokulich<- ASV_table_classified[which(ASV_sums>nseq_cutoff),]
# Sort table in ascending order of ASV names
ASV_table_bokulich<-ASV_filtered_bokulich[order(ASV_filtered_bokulich[["ASV_names"]]),]

Once the percentage filter is applied, we can remove sequences coming from chloroplasts and mitochondria at various levels:

#### CLEAN CHLOROPLAST SEQUENCES ####
ASV_final<-ASV_filtered_MOCK[(which(ASV_filtered_MOCK$Genus!="Streptophyta"
                                            & ASV_filtered_MOCK$Genus!="Chlorophyta"
                                            &ASV_filtered_MOCK$Genus!="Bacillariophyta"
                                            &ASV_filtered_MOCK$Family!="Streptophyta"
                                            & ASV_filtered_MOCK$Family!="Chlorophyta"
                                            &ASV_filtered_MOCK$Family!="Bacillariophyta"
                                            & ASV_filtered_MOCK$Family!="Mitochondria"
                                            & ASV_filtered_MOCK$Class!="Chloroplast"
                                            & ASV_filtered_MOCK$Order!="Chloroplast"
                                            & ASV_filtered_MOCK$Kingdom!="Eukaryota"
                                            & ASV_filtered_MOCK$Kingdom!="unclassified")),]

ITS considerations

For ITS analysis, we should also remove those ASVs that were not classified as k__Fungi at kingdom level, i.e. keep only the fungal ASVs. We could do it as follows: ASV_final<-ASV_filtered_MOCK[which(ASV_filtered_MOCK$Kingdom=="k__Fungi"),]

After this, the presence of Cyanobacteria/Chloroplast at phylum level should be checked. If we find Cyanobacteria/Chloroplast only at phylum level, with no further classification, we remove them, as they may also come from host DNA. To check this, we can use these lines (if working in RStudio, add a View(...) for easier visualization):

#Check Cyanobacteria
ASV_final[which(ASV_final$Phylum=="Cyanobacteria/Chloroplast"),]  
#>  [1] ASV_seqs  Kingdom   Phylum    Class     Order     Family    Genus     ASV_names L1-1     
#> [10] L1-2      L1-3      L1-4      L1-5      L1-6      L1-7      L1-8      L2-1      L2-2     
#> [19] L2-3      L2-4      L2-5      L2-6      L2-7      L2-8      L3-1      L3-2      L3-3     
#> [28] L3-4      L3-5      L3-6      L3-7      L3-8     
#> <0 rows> (or 0-length row.names)

In this particular case, no ASV classified as Cyanobacteria/Chloroplast at phylum level remains in our table (note the 0 rows above). If you do find such ASVs, not classified at deeper levels, you can remove them as follows:

#Remove Cyanobacteria/Chloroplast at phylum level
ASV_final<-ASV_final[(which(ASV_final$Phylum!="Cyanobacteria/Chloroplast")),]

We will write this table to our local machine:

##Save table
write.table(ASV_final, file="ASV_final.txt", sep="\t")

Furthermore, if working with samples prone to host contamination (such as endosphere or seed samples), it could happen that sequences unclassified at phylum level are chimeric products of microorganism and host DNA. We will not take this into account in this tutorial, but below we detail the lines to check it. The idea is to get the ASV sequences of the unclassified phyla. We save them to the local machine in case you want to perform a BLASTn manually. However, it can also be performed from R thanks to system, running a local BLAST. Let’s install some packages beforehand:

#Install seqinr for fasta writing
install.packages("seqinr", repos="http://R-Forge.R-project.org")
#Install annotate
BiocManager::install("annotate")

First, we will store the unclassified phyla in a variable, then extract the sequences and write them as a FASTA file to your local machine. Furthermore, we can run a BLASTn with blastSequences (in this function, change the [[1]] to analyze another sequence), as shown after the code below.

##Analyze unclassified phyla
#Load libraries
library(seqinr); packageVersion("seqinr"); library(annotate); packageVersion("annotate")
#> [1] '4.2.23'
#> [1] '1.74.0'

#Get sequences of unclassified phyla
unclassified_phyla <-ASV_final[which(ASV_final$Phylum=="unclassified"),]
seq_unclassified_phyla<- as.list(unclassified_phyla$ASV_seqs)
#write fasta
write.fasta(seq_unclassified_phyla, names=unclassified_phyla$ASV_names, file.out="unclassified_phyla.fas", open = "w", nbchar =1000 , as.string = FALSE)
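As mentioned above, the first sequence can be sent to NCBI directly from R with blastSequences from the annotate package (a sketch; it needs an internet connection, and the hit limit and timeout below are assumptions you may want to adjust):

#Online BLASTn of the first unclassified-phylum sequence (change [[1]] for others)
blast.first <- blastSequences(seq_unclassified_phyla[[1]], database = "nt",
                              hitListSize = 10, timeout = 400, as = "data.frame")
head(blast.first)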

With this file you can perform a BLASTn online. Nonetheless, you can also perform a local BLAST as follows. Remember that we are working on Ubuntu, so first you should install BLASTn on your machine, as well as the corresponding databases (in this case, the nt database). The arguments are:

  • query. The path to the FASTA file to analyze.
  • db. Path to database on your local machine.
  • out. Name of the output file
  • outfmt. Output format. We use format 6 (tabular without headers). This format does not give you the species name; that’s why we add “stitle”.
  • show_gis. It shows the NCBI GIs in deflines.
  • max_target_seqs. Number of hits to return for each sequence.
  • num_threads. Number of threads to use.
#Local BLASTn
system(("/home/programs/ncbi-blast-2.11.0+/bin/blastn -query unclassified_phyla.fas -db /mnt/datos/databases/nt -out unclassified_phyla_hits.txt -outfmt '6 std stitle' -show_gis -max_target_seqs 20 -parse_deflines -num_threads 10"),intern = TRUE)

In the code above, remember to change "/home/..." to the path where the BLASTn program is stored on your machine.

Diversity analysis

We have finished the first part of the tutorial. Now, we should have a final ASV table, cleaned of misassigned and host-derived ASVs. We are ready to start with the diversity analysis! If you already have an ASV table, you can load it here with ASV_table_classified <- read.table(file="yourfilepath",sep="yourseparationmethod", header=TRUE) and follow the tutorial from now on.

For this purpose, we need a phyloseq object. The phyloseq package provides multiple tools for analyzing microbial community data. The first challenge that phyloseq solves is to wrap the different pieces of information into a single object, easy to use with multiple packages. First of all, let’s install phyloseq and load the library:

##Install phyloseq
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("phyloseq")
##Load library
library(phyloseq); packageVersion("phyloseq")
#> [1] '1.40.0'

Specifically for this tutorial, we will need an ASV table with the number of reads per sample, their taxonomic classification, a metadata table, a phylogenetic tree and DNA sequences. As we can see, we still need to produce a phylogenetic tree. This will allow us to use UniFrac and Weighted UniFrac distances later on in β diversity analysis. For this purpose, we will use tools such as MAFFT and FastTreeMP. Again, we should remind you that we are using these tools under an Ubuntu distribution and they should be appropriately installed beforehand. Moreover, if you have any problem with FastTreeMP, you can always switch to FastTree (see the sketch after the code below).

We will start by getting a FASTA file with all ASV sequences. As mentioned in the last section, we will use seqinr (if you haven’t loaded it yet, do it now with library(seqinr)). With this file, we will perform a multiple sequence alignment with MAFFT (first storing the path to the program and then using system2 to call it from the shell). We will use default alignment options with the argument --auto. The output file will be called alignment. This file will be used by FastTreeMP to produce a phylogenetic tree. The steps to execute this program are the same as with MAFFT.

####GET PHYLOGENETIC TREE####

## GET THE FASTA FILE ##
ASVseqs_fasta<- as.list(ASV_final$ASV_seqs)
write.fasta(ASVseqs_fasta, names=ASV_final$ASV_names, file.out="ASV_tree.fas", open = "w", nbchar =1000 , as.string = FALSE)

##Get the alignment ##
mafft <- "/usr/bin/mafft"     #path to program

system2(mafft, args="--version")
system2(mafft, args=c("--auto", "ASV_tree.fas"), stdout="alignment")

## Get the tree ##
FastTreeMP <- "/home/programs/FastTreeMP/FastTreeMP"

system2(FastTreeMP, args="--version" )
system2(FastTreeMP, args = c("-gamma", "-nt", "-gtr", "-spr", 4, "-mlacc", 2, "-slownni"), stdin = "alignment", stdout = "phy_tree")# Run FastTreeMP
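If FastTreeMP gives you trouble, plain FastTree can be called in exactly the same way. The path below is only an example; adjust it to your own installation:

## Fallback: FastTree instead of FastTreeMP ##
FastTree <- "/home/programs/FastTree/FastTree" #example path, change to your installation
system2(FastTree, args = c("-gamma", "-nt", "-gtr", "-spr", 4, "-mlacc", 2, "-slownni"), stdin = "alignment", stdout = "phy_tree")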

ITS considerations

The ITS region is not appropriate for calculating phylogenetic trees. As a result, neither UniFrac nor Weighted UniFrac distances can be applied.
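For ITS data, a minimal sketch of the adjustments (using the objects built in the Create phyloseq section below) would be:

#ITS sketch: omit the tree when merging the phyloseq object...
its_phyloseq <- phyloseq(phy_OTUtable, phy_taxonomy, phy_metadata, dna)
#...and stick to non-phylogenetic distances, e.g. Bray-Curtis
bray_its <- phyloseq::distance(its_phyloseq, method = "bray")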

The only thing left to produce our phyloseq object is the metadata table, which includes further information about your samples. This information is used later on to group samples together (for example, replicates from the same location). In this tutorial, we will use the metadata table that is loaded with micro4all. It is simply a data frame containing sample names and, in another column, their location (location1, location2 and location3). We will store it in another variable and make some modifications to replace “-” with “_”, both in the samples column and in the row names (str_replace_all comes from stringr, part of the tidyverse).

##Metadata from micro4all
metadata.micro <-micro4all::metadata
metadata.micro$samples <- str_replace_all(metadata.micro$samples,"-", "_")
rownames(metadata.micro) <- str_replace_all(rownames(metadata.micro),"-", "_")

If you are using your own metadata table, you can create it outside R with any text editor and save it as a tab-separated file. Then, you could load it with mt <- read.table("yourmtfile.txt", sep="\t", header=TRUE) and row.names(mt) <- mt$samples. Remember that it must include a column with sample names (exactly the same sample names and in the same order as in the ASV table). In the next lines, remember to replace metadata.micro with your own mt object.

Create phyloseq

We will put all the information we have (classification, abundance, metadata, ASV sequences and phylogenetic tree) in one object thanks to phyloseq. First, each variable will be transformed to phyloseq and then we will merge them.

####CREATE PHYLOSEQ OBJECT FROM TABLES

tax <- ASV_final[,2:8] #Tax
OTU <-  ASV_final[,9:ncol(ASV_final)] #ASV
colnames(OTU) <- str_replace_all(colnames(OTU),"-", "_")

dna<-Biostrings::DNAStringSet(ASV_final$ASV_seqs) #Sequences (BioString is imported with phyloseq)
names(dna)<- ASV_final$ASV_names

##ADD ASV NAMES
row.names(tax)<-ASV_final$ASV_names
row.names(OTU)<-ASV_final$ASV_names

#Check rownames are equal
identical(rownames(OTU), rownames(tax))
#> [1] TRUE

##Introduce phylogenetic tree
phy_tree <-  phyloseq::read_tree("phy_tree")
unrooted_tree<- phy_tree
ape::is.rooted(unrooted_tree)
#> [1] FALSE

##Produce root (we need a root for distance calculation)
tree_root<-ape::root(unrooted_tree, 1, resolve.root = T)
tree_root
#> 
#> Phylogenetic tree with 1035 tips and 1034 internal nodes.
#> 
#> Tip labels:
#>   ASV0449, ASV0153, ASV0306, ASV0086, ASV0347, ASV0343, ...
#> Node labels:
#>   Root, , 0.867, 0.000, 0.789, 0.900, ...
#> 
#> Rooted; includes branch lengths.
ape::is.rooted(tree_root)
#> [1] TRUE

##CONVERT TO PHYLOSEQ FORMAT
phy_OTUtable<-otu_table(OTU, taxa_are_rows = T)
phy_taxonomy<-tax_table(as.matrix(tax))
phy_metadata<-sample_data(metadata.micro) #Change here metadata.micro with mt (your own metadata object)

#Put everything into a phyloseq object
loc_phyloseq<-phyloseq(phy_OTUtable,phy_taxonomy,phy_metadata,dna,tree_root)#Remove Tree_root when working with ITS

Now we will load the libraries needed to carry out the diversity analysis. Warning! If some of these packages are missing, install them with the install.packages() function (or BiocManager::install() for Bioconductor packages).

##Load packages
library(phyloseq); packageVersion("phyloseq")
#> [1] '1.40.0'
library(Biostrings); packageVersion("Biostrings")
#> [1] '2.64.1'
library(GUniFrac); packageVersion("GUniFrac")
#> [1] '1.7'
library(phangorn); packageVersion("phangorn")
#> [1] '2.11.1'
library(vegan); packageVersion("vegan")
#> [1] '2.6.4'
library(pheatmap); packageVersion("pheatmap")
#> [1] '1.0.12'
library(colorspace); packageVersion("colorspace")
#> [1] '2.1.0'

\(\alpha\) diversity

In this section, we will study the diversity of our community in terms of richness, diversity and evenness. For this purpose, several indices will be used, such as the Shannon index, the inverse Simpson index and Pielou’s index.

Data preparation

First, we will produce a table with \(\alpha\) diversity indices. To achieve this, we will rarefy the data to the number of sequences of the smallest sample. This is done to avoid bias in diversity estimation caused solely by variability in sequencing depth among samples.

Rarefaction curves should be produced first. They make it easy to visualize whether the rarefaction process entailed a great loss of information or not. This plot represents the number of ASVs (richness) against the number of sequences. Moreover, a vertical line will mark the number of sequences at which normalization occurred. If this line falls in the asymptote of every rarefaction curve, we are safe: even if we had more sequences for every sample, we would not recover a significantly greater number of ASVs.

#### RARECURVE ####
ASV <- as.data.frame(t(otu_table(loc_phyloseq)))
sample_names <- rownames(ASV)

#Generate rarefaction curves
rarecurve <- rarecurve(ASV, step = 100, label = FALSE)

Function rarecurve produces a simple plot. However, we can produce a nicer one with ggplot2. This package was already loaded in the DADA2 section; if not, load it with library(ggplot2).

Our aim is to produce lines colored by location (or any other condition in your metadata table). But to achieve this we need to prepare our data. First of all, we will separate each rarefaction curve from rarecurve into a data frame and then add a column with sample names. For this step, we will use a function from the purrr package, which doesn’t need installation as it is included in the tidyverse collection, installed in the first section of the tutorial.

#For each rarefaction curve, transform rarecurve output to a dataframe.
rarecurve_table <- lapply(rarecurve, function(x){
  b <- as.data.frame(x)
  b <- data.frame(ASV = b[,1], raw.read = rownames(b))
  b$raw.read <- as.numeric(gsub("N", "",  b$raw.read))
  return(b)
})

#Produce a column with sample names and put everything in one data frame
names(rarecurve_table) <- sample_names
rarecurve_table <- purrr::map_dfr(rarecurve_table, function(x){ #for each rarecurve element, we produce a dataframe and bind them all together.
  z <- data.frame(x)
  return(z)
}, .id = "Sample")

Now, we will add a new column to rarecurve_table. In this column, we will have the grouping variable of interest (for example, location1) repeated as many times as necessary. Remember that, in rarecurve_table, sample names are repeated, as it holds the information of every rarefaction curve. If you want to adapt this to your data, you should change metadata.micro to your metadata variable name (for example, mt) and $location to your variable of interest.

#To color lines according to group, let's create a new column
#Coloring
color <- NULL
for (i in (1:length(rarecurve_table$Sample))) {
  color <- rbind(color,metadata.micro$location[which(metadata.micro$samples==rarecurve_table$Sample[i])])##Change metadata.micro to your mt and location to your variable of interest
}

#Bind this column
rarecurve_table_color <- cbind(rarecurve_table, color)
colnames(rarecurve_table_color) <- c(colnames(rarecurve_table), "Location")#Change Location to variable of interest

With all data prepared, it is moment to use ggplot for graphical representation. When using your own data, remember to change colour=Location to your variable of interest. You will find where to change the code next to each line, followed by a #.

## RARECURVE WITH GGPLOT ##
rareggplot<-ggplot(rarecurve_table_color, aes(x=raw.read, y=ASV, colour=Location, group=Sample)) + ### Change location for your variable of interest and IMPORTANT TO GROUP BY SAMPLE
  theme_bw()+
  geom_point(aes(colour=Location), size=0.01)+ #Change location to your variable of interest
  geom_line(aes(colour=Location),size=0.5)+ ##Change location to your variable of interest, Change size for line thickness
  geom_vline(aes(xintercept = min(sample_sums(loc_phyloseq))), lty=1, colour="black")+
  scale_fill_manual(values = c("location1"= "#33CC00", "location2"= "#ffd624", "location3"="#80CC88"))+ #Change location1, location2 and location3 by the values of your variable of interest (for example, control and inoculated)
  scale_color_manual(values = c("location1"= "#33CC00","location2"= "#ffd624", "location3"="blue"),#Change location1, location2 and location3 by the values of your variable of interest (for example, control and inoculated)
                     name="Location", #change Location to variable of interest
                     breaks=c("location1", "location2","location3"),#change to values of your variable of interest
                     labels=c("Location 1","Location 2","Location 3"))+#change to values of your variable of interest
  labs(title="Rarefaction curves", x="Number of sequences", y="Number of ASV")+
  guides(alpha="none")+
  theme(legend.key=element_blank(),
        legend.title.align = 0.85,
        legend.title = element_text(face="bold",size=14),
        axis.text = element_text(size=14),
        axis.title = element_text(size = 16),
        plot.title = element_text(hjust=0.5, face="bold", size=16),
        legend.text = element_text(size = 12))

rareggplot

As we can see, the normalization point falls within the asymptote of every rarefaction curve.

For saving plots on your machine, an easy and versatile option is ggsave from ggplot2. For publication, you can save it as .tiff, but there are other interesting formats as well. For example, .svg allows you to open the file in PowerPoint and easily modify it. If you are experienced with graphical design programs such as Adobe Illustrator, you can save it as .eps.

#tiff
ggsave(filename = "rarefaction_curve.tiff", plot = rareggplot,device = tiff(),width = 18, height = 16, units = "cm", dpi = 800)

#svg
library(svglite)
ggsave(filename = "rarefaction_curve.svg", plot = rareggplot,device = svg())

#eps
postscript("curve_wito_RE.1.10.eps")
rareggplot
dev.off()  # Close the device

Let’s continue! Now, we will compute α diversity indices. As everything worked well with rarefaction, we can now create a rarefied object from our phyloseq (loc_phyloseq).

#Rarefy phyloseq
set.seed(10403)
rarefaction <- rarefy_even_depth(loc_phyloseq, sample.size = min(sample_sums(loc_phyloseq)), rngseed = FALSE)

With this rarefaction object, we will compute diversity indices making use of the estimate_richness function. Pielou’s index is calculated manually from Shannon and observed richness.

#Estimate alpha indices and save it
alpha_diversity_rarefied <- estimate_richness(rarefaction,measures=c("InvSimpson", "Shannon", "Observed"))

## CALCULATE EVENNESS
Evenness_index <- as.data.frame(alpha_diversity_rarefied$Shannon/log(alpha_diversity_rarefied$Observed))
Evenness <- cbind(Evenness_index, rownames(alpha_diversity_rarefied))
colnames(Evenness) <- c("Evenness", "Samples")

In this step, it is worth mentioning that we have not used richness estimators. These estimators rely on singletons. However, DADA2 does not produce singleton ASVs; they can only appear through the merging process (thus not being real singletons for a richness estimator to rely on). For more details, check dada2 issue #92. In the code below, remember to change metadata.micro to your metadata variable if analysing your own data.

#Generate final table
alpha_table <- cbind(alpha_diversity_rarefied, Evenness$Evenness, metadata.micro) #change metadata to user provided metadata
colnames(alpha_table)<- c("Observed", "Shannon", "InvSimpson", "Evenness",colnames(metadata.micro))#change metadata to user provided metadata

#Write final table on local machine
write.table(alpha_table, file="alpha_table.txt", sep="\t")

We got the final table with the diversity indices we want to study, but it is also interesting to produce a table ready for publication, i.e., with mean and standard deviation grouped by the variable of interest (in this case, location). Remember to change location to your variable of interest.

#### ALPHA PUB TABLE
#Select numeric columns and compute mean and SD
alpha_mean <- aggregate(alpha_table[,1:4], list(grouping=alpha_table$location), mean)%>%  mutate_if(is.numeric, round, digits=2) #Change location to variable of interest
alpha_sd <- aggregate(alpha_table[,1:4], list(grouping=alpha_table$location), sd)%>% mutate_if(is.numeric, round, digits=2)#Change location to variable of interest

#Paste mean ± SD 
mean_sd <- NULL
for (i in 2:5){
  mean_sd <- cbind(mean_sd,paste0(alpha_mean[,i], " +/- ", alpha_sd[,i]))}


#generate table and give it columns names
alpha_pub_table <- cbind(alpha_mean$grouping, mean_sd)
colnames(alpha_pub_table) <- c("Location","Observed richness", "Shannon", "InvSimpson", "Evenness")#Change location to variable of interest

#Write table on your local machine
write.table(alpha_pub_table, file="alpha_pub_table.txt", sep="\t")

Statistical analysis

Once diversity indices are calculated and stored in a data frame (alpha_table), we can make statistical comparisons between experimental conditions. In this tutorial, we are interested in differences among locations. Here we will introduce several functions from the micro4all package: levene.test.alpha, Shapiro, BalancedAnova, UnbalancedAnova, kruskal.wallis, Tukey.test, dunnT and wilcoxon.test. They all rely on other statistical functions (leveneTest, shapiro.test, aov, kruskal.test, TukeyHSD, HSD.test, dunn.test and pairwise.wilcox.test) from packages such as car, stats, agricolae and dunn.test. The utility of the micro4all functions is that they loop over the different calculated indices. In this way, we can perform the desired test for all indices in one line and get the results in a single table, making the analysis easier and faster. It is strongly recommended to check the help page of each function with ?functionname. All statistical analyses will be explained and exemplified with the micro4all package, but a section will also be included at the end of this tutorial showing the statistical analysis without this package.

The basic arguments of all these functions are the same. First, we have to provide an α diversity table with indices and metadata variables (as the one produced here). If you have not followed the tutorial with the example dataset, you can always check a similar table with View(micro4all::alpha_diversity_table), as well as checking the examples with ?functionOfInterest. Then, we have to give the number of indices to be tested and the name of the variable of interest. Specific details of each function will be provided in due course.

First, we will check the homogeneity of variances and the normality of the data, in order to choose a proper statistical method later on. We will use the levene.test.alpha and Shapiro functions from micro4all. The result of these functions is the same as that of the original functions (leveneTest and shapiro.test), but for every index.

#### Apply levene and shapiro test ####
levene<- levene.test.alpha(alpha_table, 4, "location") #change location to your variable of interest
#> Warning in leveneTest.default(y = y, group = group, ...): group coerced to factor.

#> Warning in leveneTest.default(y = y, group = group, ...): group coerced to factor.

#> Warning in leveneTest.default(y = y, group = group, ...): group coerced to factor.

#> Warning in leveneTest.default(y = y, group = group, ...): group coerced to factor.
levene
#>        Df   F value    Pr(>F) IndexColumn
#> group   2 0.4798205 0.6255148    Observed
#>        21        NA        NA    Observed
#> group1  2 0.2423080 0.7869784     Shannon
#>  1     21        NA        NA     Shannon
#> group2  2 0.5007044 0.6131581  InvSimpson
#>  2     21        NA        NA  InvSimpson
#> group3  2 0.3925887 0.6801599    Evenness
#>  3     21        NA        NA    Evenness
#Write table on local machine
write.table(levene, file="levene_test.txt", sep="\t")

shapiro <- Shapiro(alpha_table, 4, "location") #change location to your variable of interest
shapiro
#>             p.value      Index
#> shapiro   0.5529423   Observed
#> shapiro.1 0.8088936    Shannon
#> shapiro.2 0.4396685 InvSimpson
#> shapiro.3 0.9296042   Evenness
#Write table on local machine
write.table(shapiro, file="shapiro_test.txt", sep="\t")

None of them returns a significant value. That means that, according to these statistical tests, every diversity index in our data follows a normal distribution with no significant differences in variance (homoscedasticity). With this information, we can perform an ANOVA test. With the BalancedAnova function from micro4all, it is possible to run an ANOVA test for every index at once. BalancedAnova returns a list with two elements; the first element is a data frame with summary results for all indices.

#### Balanced anova ####
balanced_anova<- BalancedAnova(alpha_table, numberOfIndexes = 4, formula = "location") #change location to your variable of interest

balanced_anova[[1]]
#>                  Df       Sum Sq      Mean Sq  F value     Pr(>F) IndexColum
#> data[, formula]   2 6.190750e+03 3.095375e+03 1.125528 0.34328482   Observed
#> Residuals        21 5.775325e+04 2.750155e+03       NA         NA   Observed
#> data[, formula]1  2 2.739446e-01 1.369723e-01 1.242577 0.30900953    Shannon
#> Residuals      1 21 2.314882e+00 1.102325e-01       NA         NA    Shannon
#> data[, formula]2  2 2.837867e+02 1.418933e+02 3.693011 0.04223816 InvSimpson
#> Residuals      2 21 8.068648e+02 3.842213e+01       NA         NA InvSimpson
#> data[, formula]3  2 3.978665e-03 1.989332e-03 1.155852 0.33402229   Evenness
#> Residuals      3 21 3.614302e-02 1.721096e-03       NA         NA   Evenness
#Write table on local machine
write.table(balanced_anova[[1]], file="balancedAnova_test.txt", sep="\t")

According to this ANOVA test, the only differences are found for the inverse Simpson index, and the result is only marginally significant (p = 0.042).

Remember that, in case you have an unbalanced experiment, you can always use the function UnbalancedAnova in a similar way. Moreover, if you prefer a non-parametric test, the function kruskal.wallis is also available. These could be used as follows: kruskal.wallis(alpha_table, 4, "location") and UnbalancedAnova(alpha_table, 4, "location").

Now that we have an ANOVA result, imagine we want to test differences between every two groups, that is, location1 vs location2, location2 vs location3 and location1 vs location3. Let’s do that, again for every index, with Tukey.test from micro4all.

#### TUKEY ####
tukey <- Tukey.test(alpha_table, 4, "location", balanced=TRUE)#change location to your variable of interest

tukey
#>                               diff           lwr         upr      p adj IndexColumn
#> location2.location1    23.12500000  -42.96676903 89.21676903 0.65731496    Observed
#> location3.location1   -16.00000000  -82.09176903 50.09176903 0.81616965    Observed
#> location3.location2   -39.12500000 -105.21676903 26.96676903 0.31475156    Observed
#> location2.location1.1   0.23860162   -0.17982909  0.65703233 0.34061351     Shannon
#> location3.location1.1   0.02620648   -0.39222423  0.44463718 0.98636386     Shannon
#> location3.location2.1  -0.21239514   -0.63082585  0.20603556 0.42190599     Shannon
#> location2.location1.2   7.74026543   -0.07168602 15.55221688 0.05242611  InvSimpson
#> location3.location1.2   0.99328997   -6.81866147  8.80524142 0.94510063  InvSimpson
#> location3.location2.2  -6.74697545  -14.55892690  1.06497600 0.09867956  InvSimpson
#> location2.location1.3   0.03124359   -0.02104072  0.08352790 0.30831191    Evenness
#> location3.location1.3   0.01189647   -0.04038784  0.06418078 0.83555221    Evenness
#> location3.location2.3  -0.01934712   -0.07163144  0.03293719 0.62618013    Evenness
#Write table on local machine
write.table(tukey, file="tukey_test.txt", sep="\t")

As we can see, no comparison among groups is statistically significant. Another pairwise test option is Dunn’s test, which is less powerful but makes no assumptions about the data. It is run in a similar way (dunn <- dunnT(alpha_table, 4, "location")). The result of this function is composed of two elements: dunn[1] holds the full results from the dunn.test function and dunn[2] summarizes the results in a data frame.

When using the kruskal.wallis function, the pairwise test to be implemented should not be Tukey’s test, because it assumes normally distributed data. Instead, it is recommended to use Dunn’s test or the Mann–Whitney–Wilcoxon test. micro4all also implements a wilcoxon.test function, which would be used like this: wilcoxon <- wilcoxon.test(alpha_table, 4, "location", p.adjust.method = "BH"). Both alternatives are sketched below.
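As a minimal sketch, assuming the same argument order as the other micro4all helpers shown above:

#### Non-parametric pairwise tests (sketch) ####
dunn <- dunnT(alpha_table, 4, "location") #change location to your variable of interest
dunn[[2]] #summary data frame
wilcoxon <- wilcoxon.test(alpha_table, 4, "location", p.adjust.method = "BH")
#Write tables on local machine
write.table(dunn[[2]], file="dunn_test.txt", sep="\t")
write.table(wilcoxon, file="wilcoxon_test.txt", sep="\t")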

Plot \(\alpha\) diversity indices

We will produce a ggplot graphic with α-diversity indices. It is necessary to have a long-format table for ggplot, which we can achieve using the pivot_longer function from tidyr.

#### PRODUCE A TABLE FOR ALPHA PLOT ####
alpha_plot_table <- tidyr::pivot_longer(data = alpha_table, names_to = "Measure", values_to = "Value", cols=c(Observed, Shannon, InvSimpson, Evenness))

In the next lines, remember to change location to your variable of interest and also its values in limits, labels and colors. For example, in scale_fill_manual(values=c("location1"=...)), your grouping variable could be health status; thus, colors should be assigned to healthy and infected. Furthermore, the order of the panels can be changed in the second line, specifically by changing the order of levels in facet_wrap.

### ALPHA GRAPHIC GGPLOT ###
alpha_graphic <- ggplot(data = alpha_plot_table, aes(x = location, y = Value)) + #Change location to variable of interest
  facet_wrap(~factor(Measure, levels=c("Observed", "InvSimpson","Shannon", "Evenness")), scale = "free") + #change levels order to reorder panels
  geom_boxplot()+
  theme(axis.text.x = element_text(size=13),
        legend.position="bottom", strip.text = element_text(size = 20), axis.text.y = element_text(size=15), axis.title.y=element_text(size=20)) +
  scale_x_discrete(limits=c("location1",
                            "location2","location3"), breaks=c("location1",
                            "location2","location3"),   ##With breaks and labels you can change the name displayed on labels, #Change locations to values of variable of interest
                   labels=c("Location 1",#Change locations to values of variable of interest
                            "Location 2","Location 3")) +

  aes(fill=location)+ scale_fill_manual(values = c( "location1"= "#33CC00",#Change locations to values of variable of interest
                                                          "location2"= "#ffd624", "location3"="blue"), na.translate=FALSE) +
  theme(legend.key.size = unit(1, 'cm')) +
  ylab("Alpha Diversity Measure")

alpha_graphic

#use ggsave for saving ggplots
ggsave("alpha_plot.tiff", plot = alpha_graphic,width = 17, height = 30, units = "cm", dpi = 800)

Both the statistical analysis and the graphic show no significant differences between groups. Yet there could still be distinct groups when studying β diversity, i.e., when integrating abundance and taxonomic information.

\(\beta\) diversity analysis

Data preparation

A \(\beta\) diversity analysis allows us to study the structure of communities along an environmental gradient, e.g., to quantify differences between locations. Before computing these differences or distances, it is crucial to normalize the data. This is mainly needed to account for differences in library size between samples (sequencing depth), as well as to consider their compositionality, an inherent characteristic of high-throughput sequencing (HTS) data1. In this tutorial, normalization will be done with the edgeR package. When processing data with this package, variations in sequencing depth are automatically adjusted. Moreover, it implements the trimmed mean of M-values (TMM) method through the function calcNormFactors. This method is used because our data always sum to a fixed total (compositionality). Because of that, an ASV could appear under-represented in a sample only because other abundant ASVs consumed the library size, a situation that could lead to untrustworthy statistical differences.

We will install the package first:

#Install edgeR
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("edgeR")

And load the library:

#Load library
library(edgeR); packageVersion("edgeR")
#> [1] '3.38.4'

With DGEList we create a DGEList object needed for edgeR processing. With this object, calcNormFactors computes normalization factors to account for effective library size (compositionality). Then, we will extract the normalized counts with cpm.

#### NORMALIZE ACCORDING TO EDGER ####
edgeR <- DGEList(counts = OTU, samples = metadata.micro, genes = tax)#Change metadata.micro for your own mt object
edgeR <- calcNormFactors(edgeR)

##EXTRACT NORMALIZED COUNTS
ASV_norm <- cpm(edgeR, normalized.lib.sizes=T, log=F)

Once normalized counts are produced, we need to construct a normalized phyloseq object for further analysis. Remember again to change metadata.micro to your own mt file and to remove tree_root when working with ITS data.

##CREATE NORMALIZED PHYLOSEQ OBJECT
phy_OTU_norm<-otu_table(as.data.frame(ASV_norm,row.names=F), taxa_are_rows = T)
phy_taxonomy_norm<-tax_table(as.matrix(tax))
phy_metadata_norm<-sample_data(metadata.micro) #Change metadata.micro to your own mt file

##Add taxa names
taxa_names(phy_OTU_norm)<- taxa_names(phy_taxonomy_norm)
#Check
identical(rownames(ASV_norm), rownames(tax))
#> [1] TRUE

##Merge
norm_phyloseq<-phyloseq(phy_OTU_norm,phy_taxonomy_norm,phy_metadata_norm,tree_root) #Remove tree_root when working with ITS

Everything is now ready to analyze \(\beta\) diversity. A distance method is needed to measure the dissimilarities between samples, and there are several ways to compute them. We will mainly use Bray-Curtis, UniFrac and Weighted UniFrac distances. Bray-Curtis takes into account the abundances of taxa shared between samples but ignores phylogeny, unweighted UniFrac is based on the presence or absence of shared phylogenetic branches, and Weighted UniFrac additionally weights those branches by relative abundance. A very detailed and readable discussion of this can be found in O Paliy et al., 20162
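If you want to inspect these distance matrices directly, phyloseq’s distance function computes all three from the normalized object (a quick sketch):

#Sketch: compute the three distance matrices directly
bray_dist     <- phyloseq::distance(norm_phyloseq, method = "bray")
unifrac_dist  <- phyloseq::distance(norm_phyloseq, method = "unifrac")
wunifrac_dist <- phyloseq::distance(norm_phyloseq, method = "wunifrac")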

With these distance measures, a graphical representation can be obtained to visualize whether our samples group according to our variable of interest. In our case, we would like to know if location shapes bacterial communities differently. However, a statistical test is needed to confirm the results. For this purpose, we will use a permutational multivariate analysis of variance (PERMANOVA). The adonis function from vegan is wrapped in the Permanova function of micro4all with the aim of computing a PERMANOVA test with different distances at once. This function works with a normalized phyloseq object, a character vector of distances and a formula, where the variable of interest should be included. Once performed, permanova[[1]] returns a data frame with summary results from the adonis function. For a more detailed result, check permanova[[2]].

#### PERMANOVA ####
permanova<- Permanova(norm_phyloseq,distances = c("bray", "unifrac", "wunifrac"), formula = "location") #Change location to variable of interest

The formula argument can be used to analyze more than one factor at once. This can be done as follows: formula="variable1+variable2" (additive effects only) or formula="variable1*variable2" (also including their interaction).
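As a hypothetical sketch (treatment is an invented metadata column, used here only for illustration, so the line is left commented out):

#Two factors and their interaction (hypothetical "treatment" column)
#permanova_2f <- Permanova(norm_phyloseq, distances = c("bray", "unifrac", "wunifrac"), formula = "location*treatment")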

Next, let’s see the result from Permanova:

##Permanova result
permanova[[1]]
#>           Df  SumOfSqs        R2        F Pr(>F) distances
#> location   2 1.4040287 0.3004895 4.510496  0.001      bray
#> Residual  21 3.2684435 0.6995105       NA     NA      bray
#> Total     23 4.6724722 1.0000000       NA     NA      bray
#> location1  2 0.5468104 0.2156555 2.886975  0.005   unifrac
#> Residual1 21 1.9887628 0.7843445       NA     NA   unifrac
#> Total1    23 2.5355732 1.0000000       NA     NA   unifrac
#> location2  2 0.1343614 0.1647734 2.071439  0.025  wunifrac
#> Residual2 21 0.6810699 0.8352266       NA     NA  wunifrac
#> Total2    23 0.8154313 1.0000000       NA     NA  wunifrac
#Write table to local machine
write.table(permanova[[1]], file="permanova.txt",sep="\t")

With the table above, we can see that differences are statistically significant with every distance method used. Moreover, Bray-Curtis distance seems to explain the most variance (expressed as R2). Nevertheless, another test is necessary to confirm that these significant results are not due to differences in group variances (dispersion of samples). Again, we can perform multiple betadisper tests (from the vegan package) with several distance measures thanks to the micro4all function Betadispersion. It works very similarly to Permanova, so do not forget to change location to your variable of interest.

#### BETADISPERSION ####
betadisper<- Betadispersion(loc_phyloseq,distances = c("bray", "unifrac", "wunifrac"), formula = "location") #change location to variable of interest

betadisper[1] #data frame with results for every distance method
#> [[1]]
#>            Df      Sum Sq     Mean Sq         F N.Perm Pr(>F) distances
#> Groups      2 0.009769604 0.004884802 0.7280460    999  0.482      bray
#> Residuals  21 0.140898839 0.006709469        NA     NA     NA      bray
#> Groups1     2 0.009190666 0.004595333 0.8473928    999  0.460   unifrac
#> Residuals1 21 0.113881059 0.005422908        NA     NA     NA   unifrac
#> Groups2     2 0.002940247 0.001470124 0.5231941    999  0.615  wunifrac
#> Residuals2 21 0.059007916 0.002809901        NA     NA     NA  wunifrac
#write file to your local machine
write.table(betadisper[1], file="betadisper.txt", sep="\t")

In this example, dispersion was not significantly different between the three groups, so we can trust the PERMANOVA results. Nonetheless, it would be interesting to know the pairwise differences between groups. We can run a pairwise Adonis test to check for them. Once more, micro4all gives us the option to loop over distances with PairwiseAdonisFun. The pval argument refers to the p-value threshold used to filter results.

#### PAIRWISE ADONIS ####
pairwise<- PairwiseAdonisFun(norm_phyloseq,distances = c("bray", "unifrac", "wunifrac"), formula = "location", pval=0.05) #Change location to your variable of interest


pairwise #data frame with all significant results
#>                    pairs Df SumsOfSqs  F.Model        R2 p.value p.adjusted sig distances
#> 1 location1 vs location2  1 0.3391022 2.076899 0.1291853   0.005     0.0050   *      bray
#> 2 location1 vs location3  1 0.9248550 6.081317 0.3028346   0.001     0.0015   *      bray
#> 3 location2 vs location3  1 0.8420859 5.555908 0.2841038   0.001     0.0015   *      bray
#> 4 location1 vs location2  1 0.2696235 2.823642 0.1678377   0.029     0.0370   .   unifrac
#> 5 location1 vs location3  1 0.2773369 2.785471 0.1659454   0.011     0.0330   .   unifrac
#> 6 location2 vs location3  1 0.2732552 3.068367 0.1797692   0.037     0.0370   .   unifrac
#write file to your local machine
write.table(pairwise, file="pairwise.txt", sep="\t")

It seems that, according to the statistical analysis used and with Bray-Curtis distances, statistically significant differences are found among all locations.

For visualization, we can use PCoA and NMDS representations with every measure. While PCoA produces artificial variables to explain the variance among samples, NMDS tries to represent all the variance in two axes. In the code below, we will see how to use a loop to produce every possible plot, combining PCoA, NMDS and the three distances. Every plot will be stored in a list called plot_list. Then, calling plot_list[1] will give you the first plot. If you want to use more types of distances, modify var_list. Pay attention to the comments (#) to see where changes should be applied when using your own data. To save the plots, you can choose between the options detailed at the end of the Data preparation section in \(\alpha\) diversity.

#### BETA DIVERSITY PLOTS ####
var_list <-  c("bray", "unifrac", "wunifrac")
plot_type <-  c("PCoA", "NMDS")
combination_plot <- purrr::cross2(plot_type,var_list )


# Make plots.
plot_list = list()

for (i in 1:length(combination_plot)){
      pcoa <- ordinate(norm_phyloseq,combination_plot[[i]][[1]],combination_plot[[i]][[2]]) #get distance matrix
      plot <-  plot_ordination(norm_phyloseq, pcoa, type="samples", color="location")+ #change location to variable of interest (e.g., health status)
        geom_text(aes(label=location), hjust=0, vjust=0, show.legend=FALSE)+#change location variable of interest (e.g., health status)
        geom_point(size=4)
      if (combination_plot[[i]][[1]]=="NMDS"){ ## graphic details for NMDS
        plot <-  plot + xlab(paste("Axis 1"))+
        ylab(paste("Axis 2"))+
          theme(legend.position="bottom", plot.title = element_text(face="bold", hjust = 0.5), legend.title = element_blank(), legend.key = element_blank(),
                panel.border = element_rect(linetype = "solid", fill = NA),
                panel.background = element_rect(fill="white", colour="white"),
                panel.grid.major = element_line(colour="aliceblue", size=0.4),
                panel.grid.minor= element_line(colour = "aliceblue", size =0.4))+

          scale_color_manual(values=c("location1"= "#33CC00",
                                                          "location2"= "#ffd624", "location3"="blue"))+#change locations to values of variable of interest (eg. healthy, infected)
          guides(color=guide_legend(nrow=2,byrow=TRUE))+
          guides(shape=guide_legend(nrow=2,byrow=TRUE))+
          theme(plot.title = element_text(face="bold", hjust = 0.5))+
          ggtitle(paste("Bacterial Community", combination_plot[[i]][[1]], "on", combination_plot[[i]][[2]], "distances", round(pcoa$stress, digits = 3)))

        plot_list[[i]] = plot

      }
      else {## graphic details for PCoA
        plot= plot +   xlab(paste("PCo 1", paste("(",round(pcoa$values[1,2]*100,1),"%",")",sep=""),sep=" "))+
          ylab(paste("PCo 2", paste("(",round(pcoa$values[2,2]*100,1),"%",")",sep=""),sep=" "))+

          theme(legend.position="bottom", plot.title = element_text(face="bold", hjust = 0.5), legend.title = element_blank(), legend.key = element_blank(),
                panel.border = element_rect(linetype = "solid", fill = NA),
                panel.background = element_rect(fill="white", colour="white"),
                panel.grid.major = element_line(colour="aliceblue", size=0.4),
                panel.grid.minor= element_line(colour = "aliceblue", size =0.4))+

          scale_color_manual(values=c("location1"= "#33CC00",
                                                          "location2"= "#ffd624", "location3"="blue"))+#change locations to values of variable of interest (eg. healthy, infected)
          guides(color=guide_legend(nrow=2,byrow=TRUE))+
          guides(shape=guide_legend(nrow=2,byrow=TRUE))+
          theme(plot.title = element_text(face="bold", hjust = 0.5))+
          ggtitle(paste("Bacterial Community", combination_plot[[i]][[1]], "on", combination_plot[[i]][[2]], "distances"))

        plot_list[[i]] = plot

      }

       }


#Visualize plot
plot_list[[1]]

#Save plot on local machine
ggsave("bray-curtis_pcoa.tiff", plot = plot_list[[1]],width = 17, height = 30, units = "cm", dpi = 800 )

As shown by PERMANOVA, there are differences between all groups (locations), and location3 samples seem to differ most from the other groups (i.e., the distance is greater).

If you want to produce only one plot, you can change var_list and plot_type, or use the code below, which shows how to do an NMDS and a PCoA separately. Pay attention to the comments (#) in order to adapt the code to your own data:

####PCOA####
#Calculate distances
pcoa <- ordinate(norm_phyloseq,"PCoA","bray") #get ordination; change bray to the chosen distance

#Produce plot
pcoa_bray <-  plot_ordination(norm_phyloseq, pcoa, type="samples", color="location")+ #change location to variable of interest
      geom_text(aes(label=location), hjust=0, vjust=0, show.legend=FALSE)+#change location to variable of interest
        geom_point(size=4) + 
       xlab(paste("PCo 1", paste("(",round(pcoa$values[1,2]*100,1),"%",")",sep=""),sep=" "))+
          ylab(paste("PCo 2", paste("(",round(pcoa$values[2,2]*100,1),"%",")",sep=""),sep=" "))+
          theme(legend.position="bottom", plot.title = element_text(face="bold", hjust = 0.5), legend.title = element_blank(), legend.key = element_blank(),
                panel.border = element_rect(linetype = "solid", fill = NA),
                panel.background = element_rect(fill="white", colour="white"),
                panel.grid.major = element_line(colour="aliceblue", size=0.4),
                panel.grid.minor= element_line(colour = "aliceblue", size =0.4))+
          scale_color_manual(values=c("location1"= "#33CC00",
                                                          "location2"= "#ffd624", "location3"="blue"))+#change locations to values of variable of interest (eg. healthy, infected)
          guides(color=guide_legend(nrow=2,byrow=TRUE))+
          guides(shape=guide_legend(nrow=2,byrow=TRUE))+
          theme(plot.title = element_text(face="bold", hjust = 0.5))+
          ggtitle("Bacterial Community PCoA on Bray Curtis distances") #change title according to distance used
   
#Visualize plot 
 pcoa_bray

#Save it to local machine
ggsave("bray-curtis_pcoa.tiff", plot = pcoa_bray,width = 17, height = 30, units = "cm", dpi = 800 )

Let’s see how it works for a NMDS!

#### NMDS####
#Calculate distances
nmds <- ordinate(norm_phyloseq,"NMDS","bray") #get ordination; change bray to the chosen distance
#> Square root transformation
#> Wisconsin double standardization
#> Run 0 stress 0.1895576 
#> Run 1 stress 0.2136299 
#> Run 2 stress 0.1907108 
#> Run 3 stress 0.203114 
#> Run 4 stress 0.1907108 
#> Run 5 stress 0.1925871 
#> Run 6 stress 0.2273544 
#> Run 7 stress 0.1889755 
#> ... New best solution
#> ... Procrustes: rmse 0.05124936  max resid 0.2188175 
#> Run 8 stress 0.2846452 
#> Run 9 stress 0.1904783 
#> Run 10 stress 0.217401 
#> Run 11 stress 0.1889754 
#> ... New best solution
#> ... Procrustes: rmse 6.24282e-05  max resid 0.0001315906 
#> ... Similar to previous best
#> Run 12 stress 0.1907108 
#> Run 13 stress 0.1925871 
#> Run 14 stress 0.203114 
#> Run 15 stress 0.1925872 
#> Run 16 stress 0.1889755 
#> ... Procrustes: rmse 0.0001726498  max resid 0.0005084086 
#> ... Similar to previous best
#> Run 17 stress 0.2277738 
#> Run 18 stress 0.1889755 
#> ... Procrustes: rmse 0.0002819056  max resid 0.0007794508 
#> ... Similar to previous best
#> Run 19 stress 0.1895576 
#> Run 20 stress 0.2647637 
#> *** Best solution repeated 3 times

#Produce plot
nmds_bray <-  plot_ordination(norm_phyloseq, nmds, type="samples", color="location")+ #change location to variable of interest (e.g., health status)
        geom_text(aes(label=location), hjust=0, vjust=0, show.legend=FALSE)+#change location to variable of interest
        geom_point(size=4)+
         xlab(paste("Axis 1"))+
        ylab(paste("Axis 2"))+
          theme(legend.position="bottom", plot.title = element_text(face="bold", hjust = 0.5), legend.title = element_blank(), legend.key = element_blank(),
                panel.border = element_rect(linetype = "solid", fill = NA),
                panel.background = element_rect(fill="white", colour="white"),
                panel.grid.major = element_line(colour="aliceblue", size=0.4),
                panel.grid.minor= element_line(colour = "aliceblue", size =0.4))+

          scale_color_manual(values=c("location1"= "#33CC00",
                                                          "location2"= "#ffd624", "location3"="blue"))+ #change locations to values of variable of interest (eg. healthy, infected)
          guides(color=guide_legend(nrow=2,byrow=TRUE))+
          guides(shape=guide_legend(nrow=2,byrow=TRUE))+
          theme(plot.title = element_text(face="bold", hjust = 0.5))+
          ggtitle(paste("Bacterial Community NMDS on Bray-Curtis distances", round(nmds$stress, digits = 3))) #change title according to distance used
   
#Visualize plot 
 nmds_bray

#Save it to local machine
ggsave("bray-curtis_nmds.tiff", plot = nmds_bray,width = 17, height = 30, units = "cm", dpi = 800 )
      

As mentioned before, PCoA produces artificial variables that are represented as axes. Each of them explains a certain amount of variance. Going deeper into the analysis, we can get a bar plot with the variance explained by every axis. Moreover, we will use length to see how many axes were needed for the PCoA ordination.

#### ORDINATE PHYLOSEQ OBJECT ####
PCOA_bray  <-  ordinate(norm_phyloseq, "PCoA", "bray", formula="location") #Change to your variable of interest
#Get number of axes
length(PCOA_bray$values$Relative_eig)
#> [1] 23
#Get plot
plot_scree(PCOA_bray)

And with the next line, we can get the percentage of variance explained by the number of axes we choose. The number of axes needed to explain a certain amount of variance can give us an idea of the multidimensionality of our data.

##Get explained variance in first 10 axis
sum(PCOA_bray$values$Relative_eig[1:10]) #change 1:10 to select the number of axes
#> [1] 0.833356

Produce tables with relative abundance

With α-diversity and β-diversity analyses, we got a global perspective of the structure and diversity of microbial communities. Nevertheless, it is also interesting to get information about the taxonomic profile, i.e., which phyla and genera are more abundant in our samples. Along this line, several analyses can be performed. We will start by producing and saving tables with relative abundances at different levels, using our phyloseq object without normalization (loc_phyloseq). For example, we can produce a table with the relative abundance of ASVs for each sample. Moreover, we can group this by location, thus having ASV relative abundance for each location. This can be done at genus and phylum level as well. We will start at ASV and sample level. First, with the function transform_sample_counts, our phyloseq object (loc_phyloseq) will be converted to relative abundances. Then, each element of this phyloseq object will be extracted and bound into a table.

##CALCULATE RELATIVE ABUNDANCE, ASV, SAMPLES
sample_relabun_ASV <- transform_sample_counts(loc_phyloseq, function(x){x/sum(x)}*100)

##CONSTRUCT THE TABLE
OTU_sample <-  as.data.frame((otu_table(sample_relabun_ASV)))
taxonomy_sample <- as.data.frame(tax_table(sample_relabun_ASV))
identical(rownames(OTU_sample),rownames(taxonomy_sample))
#> [1] TRUE
sample_relabun_ASV_table  <- cbind(taxonomy_sample, OTU_sample)
#SORT IT
sample_relabun_ASV_table <- sample_relabun_ASV_table[sort(sample_relabun_ASV_table$ASV_names, decreasing = FALSE),]

With colSums we should check that summing ASV abundances for every sample (i.e., each column) gives a result of 100:

#Check relative abundance sums 100
colSums(sample_relabun_ASV_table[,8:ncol(sample_relabun_ASV_table)])
#> L1_1 L1_2 L1_3 L1_4 L1_5 L1_6 L1_7 L1_8 L2_1 L2_2 L2_3 L2_4 L2_5 L2_6 L2_7 L2_8 L3_1 L3_2 L3_3 
#>  100  100  100  100  100  100  100  100  100  100  100  100  100  100  100  100  100  100  100 
#> L3_4 L3_5 L3_6 L3_7 L3_8 
#>  100  100  100  100  100

We can have a look at the beginning of the table and save it on our local machine as follows:

##See table
head(sample_relabun_ASV_table)
#>          Kingdom         Phylum          Class             Order                Family
#> ASV0002 Bacteria Actinobacteria Actinobacteria  Streptomycetales     Streptomycetaceae
#> ASV0003 Bacteria Proteobacteria   unclassified      unclassified          unclassified
#> ASV0006 Bacteria Actinobacteria Actinobacteria Pseudonocardiales    Pseudonocardiaceae
#> ASV0007 Bacteria Actinobacteria Actinobacteria Pseudonocardiales    Pseudonocardiaceae
#> ASV0008 Bacteria Actinobacteria Actinobacteria Pseudonocardiales    Pseudonocardiaceae
#> ASV0009 Bacteria Actinobacteria Actinobacteria     Micrococcales Promicromonosporaceae
#>                     Genus ASV_names       L1_1      L1_2      L1_3       L1_4       L1_5
#> ASV0002      Streptomyces   ASV0002 13.8889992  4.937848 10.893415 22.1990687  9.5526244
#> ASV0003      unclassified   ASV0003  3.8330224 27.683538 26.364099 15.4691786 10.6119119
#> ASV0006      unclassified   ASV0006  2.5943864  6.496807  7.246131 10.9847116 19.6124558
#> ASV0007    Pseudonocardia   ASV0007  1.3855254  3.001168  2.136059  0.9440983  0.4732119
#> ASV0008      unclassified   ASV0008 11.3065227  2.218254  2.586520  1.1120798  7.1675279
#> ASV0009 Promicromonospora   ASV0009  0.1905594  1.504018  2.310431  1.8605541  0.1060647
#>               L1_6       L1_7     L1_8       L2_1       L2_2       L2_3      L2_4       L2_5
#> ASV0002  6.9848271 20.4381884 5.620269  7.6930658  7.2420046 11.2892520 3.0031259 14.4508131
#> ASV0003 24.8232220 10.3801344 5.665003 10.5471676  3.6577360 22.6933878 7.1731339  1.5160285
#> ASV0006  6.2825092  5.3759593 1.043176  6.6385461  4.5674805  7.4010722 7.1764071  1.9043707
#> ASV0007  1.8744457  2.3142073 0.925081  2.0439942  0.3016851  0.5514424 1.8673388  0.6254551
#> ASV0008  4.2882140  3.0677164 6.514932 14.3000749 12.2581049  7.5312739 1.3452695  4.4099251
#> ASV0009  0.4674129  0.3419619 3.689588  0.2030197  7.7359552  0.1582844 0.5007937  7.1880659
#>              L2_6     L2_7     L2_8       L3_1      L3_2        L3_3       L3_4      L3_5
#> ASV0002 12.829278 8.654647 9.041784 13.6220214  5.024374 17.70502153  7.8901345  8.995795
#> ASV0003  0.921740 9.530792 9.079588  8.5899753 11.863936 10.48276173 11.3430493  3.270292
#> ASV0006  1.774789 6.913218 5.461540  0.3648316  1.406825  0.62233647  0.7892377  1.592300
#> ASV0007  1.442515 4.163499 3.624719 14.8167625 16.529120  5.98209655  9.9327354  4.862592
#> ASV0008  8.960207 1.828319 6.101981  0.0000000  0.000000  0.00000000  0.0000000  0.000000
#> ASV0009  2.551159 1.178451 1.461006  7.1783073  1.233644  0.03156779  1.8609865 21.526934
#>                L3_6       L3_7        L3_8
#> ASV0002 19.96646697 19.7996690  5.78009910
#> ASV0003 14.42738258  9.9207386 21.88830856
#> ASV0006  2.58040771  0.1428447  0.82764162
#> ASV0007 16.93827060  3.8899051 13.41368689
#> ASV0008  0.00000000  0.0000000  0.00000000
#> ASV0009  0.01840227  0.5853149  0.03214142
#Save ASV sample table on your local machine
write.table(sample_relabun_ASV_table, file="ASV_sample_relabun.txt", sep="\t")

So, how could we do this at genus and phylum level? First, the phyloseq object has to be agglomerated at the desired taxonomic level with tax_glom; then we follow the same steps as before. A phylum-level sketch is shown after the genus code below.

### GET TABLE BY SAMPLE AT GENUS LEVEL ###
#Glom phyloseq
loc_phyloseq_genus <- tax_glom(loc_phyloseq, taxrank = "Genus") #Change Genus to the desired taxonomic rank 
sample_relabun_genus <- transform_sample_counts(loc_phyloseq_genus, function(x){x/sum(x)}*100)
##Extract elements from phyloseq
OTU_sample_genus <-  as.data.frame((otu_table(sample_relabun_genus)))
taxonomy_sample_genus  <- as.data.frame(tax_table(sample_relabun_genus)[,-7])
identical(rownames(OTU_sample_genus),rownames(taxonomy_sample_genus))
#> [1] TRUE
#Bind table
sample_table_genus  <- cbind(taxonomy_sample_genus, OTU_sample_genus)
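As mentioned above, the same recipe works at phylum level. A minimal sketch (after tax_glom at phylum rank, only the Kingdom and Phylum columns remain informative):

### GET TABLE BY SAMPLE AT PHYLUM LEVEL (sketch) ###
loc_phyloseq_phylum <- tax_glom(loc_phyloseq, taxrank = "Phylum")
sample_relabun_phylum <- transform_sample_counts(loc_phyloseq_phylum, function(x){x/sum(x)}*100)
OTU_sample_phylum <- as.data.frame(otu_table(sample_relabun_phylum))
taxonomy_sample_phylum <- as.data.frame(tax_table(sample_relabun_phylum)[,1:2]) #keep Kingdom and Phylum only
sample_table_phylum <- cbind(taxonomy_sample_phylum, OTU_sample_phylum)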

Furthermore, if we are not interested in the taxonomic profile of sequences unclassified at genus level, we can group unclassified genera together. First, we save the unclassified genera in a new variable (sample_table_genus_unclass) where we sum up their abundances. Then we bind sample_table_genus without unclassifieds to this new variable:

##Agglomerate unclassified in otu table

#Get sum of unclassified sequences
sample_table_genus_unclass <- sample_table_genus %>% subset(Genus=="unclassified", select=c(7:ncol(sample_table_genus))) %>%
  colSums() %>%  t() %>% as.data.frame() %>% cbind(Kingdom="unclassified", Phylum="unclassified", Class="unclassified",
                                                   Order="unclassified", Family="unclassified", Genus="unclassified",  .)

#Bind sample_table_genus without unclassified with sample_table_genus_unclass
sample_table_genus_final <- rbind(subset(sample_table_genus, Genus!="unclassified"), sample_table_genus_unclass)

#Check relative abundance sums 100
colSums(sample_table_genus_final[,7:ncol(sample_table_genus_final)])
#> L1_1 L1_2 L1_3 L1_4 L1_5 L1_6 L1_7 L1_8 L2_1 L2_2 L2_3 L2_4 L2_5 L2_6 L2_7 L2_8 L3_1 L3_2 L3_3 
#>  100  100  100  100  100  100  100  100  100  100  100  100  100  100  100  100  100  100  100 
#> L3_4 L3_5 L3_6 L3_7 L3_8 
#>  100  100  100  100  100
#Sort table from most to least abundant genera (calculated as sum of abundance between samples)
sample_table_genus_final <- sample_table_genus_final[order(rowSums(sample_table_genus_final[,7:ncol(sample_table_genus_final)]), decreasing=TRUE),]

#See first lines of this table
head(sample_table_genus_final)
#>              Kingdom         Phylum               Class             Order                Family
#> 1       unclassified   unclassified        unclassified      unclassified          unclassified
#> ASV0002     Bacteria Actinobacteria      Actinobacteria  Streptomycetales     Streptomycetaceae
#> ASV0007     Bacteria Actinobacteria      Actinobacteria Pseudonocardiales    Pseudonocardiaceae
#> ASV0010     Bacteria Actinobacteria      Actinobacteria Pseudonocardiales    Pseudonocardiaceae
#> ASV0009     Bacteria Actinobacteria      Actinobacteria     Micrococcales Promicromonosporaceae
#> ASV0022     Bacteria Proteobacteria Alphaproteobacteria       Rhizobiales          Rhizobiaceae
#>                     Genus       L1_1       L1_2       L1_3      L1_4       L1_5       L1_6
#> 1            unclassified 27.8534281 47.8401209 52.6289022 42.899062 50.3290726 53.4049234
#> ASV0002      Streptomyces 24.5027591 13.3266946 15.1413141 28.718450 10.8362796 10.6258539
#> ASV0007    Pseudonocardia  6.9494621  4.7215164  8.2560364  4.986285  8.2186565  6.6037057
#> ASV0010     Amycolatopsis  7.5211402  2.8123068  5.9238091  5.628442  5.3848246  8.7346293
#> ASV0009 Promicromonospora  0.2679741  1.5040176  2.4508973  1.860554  0.1060647  0.4674129
#> ASV0022      Neorhizobium  0.4069237  0.1201841  0.2567145  1.599013  0.1033451  0.4698099
#>               L1_7       L1_8       L2_1      L2_2       L2_3       L2_4      L2_5      L2_6
#> 1       38.2838284 36.6489524 56.2679860 39.491043 49.1881542 41.2663863 19.876403 33.165067
#> ASV0002 29.6135035 11.0705531  9.3605866 14.072905 18.6341588  5.4039900 23.175445 13.396380
#> ASV0007  5.3998171  7.6189454  9.5892301  4.206396  5.5450600  3.7559531  4.025317  6.603939
#> ASV0010  5.1254523  7.5634763 11.6923562 11.939225 11.5905029  7.6575618  7.787383 11.818080
#> ASV0009  0.4731798  4.3910033  0.3666180  7.735955  0.1582844  0.5007937  7.188066  2.551159
#> ASV0022  0.3280449  0.3453397  0.3409942  4.917623  0.4952770  4.8377330  5.132466  6.952188
#>              L2_7       L2_8       L3_1      L3_2        L3_3      L3_4      L3_5        L3_6
#> 1       38.769415 47.3192644 18.7625308 30.366459 34.09772486 27.253363 17.798282 24.87169526
#> ASV0002 13.127693 10.1047388 21.5250616  7.904302 23.50222102 10.071749 10.554216 24.77968389
#> ASV0007  9.020311  8.9928617 22.2925226 23.293851 11.42528580 16.302691  9.041631 25.36855665
#> ASV0010  3.939032  7.2983611  2.7198028  9.501411  4.71713004  5.627803  4.430140  5.16899421
#> ASV0009  1.178451  1.4610065  7.1783073  1.233644  0.03156779  1.860987 21.526934  0.01840227
#> ASV0022  2.463705  0.7560764  0.9235826  1.026255  0.11048727  1.004484  1.377070  1.25748870
#>               L3_7        L3_8
#> 1       19.3136486 43.26771126
#> ASV0002 31.9118544  8.33266372
#> ASV0007  6.3827193 22.18829517
#> ASV0010  2.5067503 13.26101513
#> ASV0009  0.5853149  0.03214142
#> ASV0022  2.2715791  0.16606401
#Save to your local machine
write.table(sample_table_genus_final, file="sample_table_genus.txt", sep="\t")

If you want to keep the table with all unclassified taxa in individual rows, you can always use write.table with sample_table_genus.
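For example (the output filename is arbitrary):

#Save the genus table keeping unclassified taxa in individual rows
write.table(sample_table_genus, file="sample_table_genus_all_rows.txt", sep="\t")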

With the tables produced above, we have information about relative abundance in each sample. However, it is very useful to group these tables by the variable of interest. That means knowing the mean abundance of the members of a taxonomic rank in each location, together with standard deviations. First, we will save genus abundances and taxonomy from sample_table_genus_final in two variables. Then, mean and SD will be computed according to the grouping variable of interest. For your own data, it is crucial to substitute metadata.micro with your own metadata variable (e.g., mt) and location with your metadata variable of interest.

### GROUP GENUS TABLE BY LOCATION

#Save genus abundances
otu_genus <- sample_table_genus_final[,7:ncol(sample_table_genus_final)] %>% t() %>% as.data.frame()
#Save taxonomy data
tax_genus <- sample_table_genus_final[,1:6]

#Calculate OTU mean abundance based on grouping factor (e.g., location)
location_mean_genus <- aggregate(otu_genus, by=list(metadata.micro$location), FUN=mean)%>% column_to_rownames("Group.1") %>% t() #Change metadata.micro and location to your metadata and variable of interest
#Calculate OTU SD  based on grouping factor (e.g., location) and change colnames
location_SD_genus <- aggregate(otu_genus, by=list(metadata.micro$location), FUN=sd)%>% column_to_rownames("Group.1")  %>% t()  %>% #Change metadata.micro and location to your metadata and variable of interest
  as.data.frame() %>% rename_with(.fn= ~paste0(colnames(location_mean_genus), "SD"))

#Merge mean abundance, SD and taxonomy.
genus_location_table <- merge(tax_genus, location_mean_genus, by=0) %>%column_to_rownames("Row.names") %>%
  merge(location_SD_genus, by=0) %>% column_to_rownames("Row.names")

#Check abundances sum 100
colSums(genus_location_table[,7:ncol(genus_location_table)])
#>   location1   location2   location3 location1SD location2SD location3SD 
#>   100.00000   100.00000   100.00000    51.40124    59.14292    66.73249
#Sort table from most to least abundant genera (calculated as sum of abundance between samples)
genus_location_table <- genus_location_table[order(rowSums(genus_location_table[,7:ncol(genus_location_table)]), decreasing=TRUE),]

#View table
head(genus_location_table)
#>              Kingdom         Phylum          Class             Order                Family
#> 1       unclassified   unclassified   unclassified      unclassified          unclassified
#> ASV0002     Bacteria Actinobacteria Actinobacteria  Streptomycetales     Streptomycetaceae
#> ASV0007     Bacteria Actinobacteria Actinobacteria Pseudonocardiales    Pseudonocardiaceae
#> ASV0010     Bacteria Actinobacteria Actinobacteria Pseudonocardiales    Pseudonocardiaceae
#> ASV0009     Bacteria Actinobacteria Actinobacteria     Micrococcales Promicromonosporaceae
#> ASV0023     Bacteria Actinobacteria Actinobacteria Micromonosporales    Micromonosporaceae
#>                     Genus location1  location2 location3 location1SD location2SD location3SD
#> 1            unclassified 43.736036 40.6679647 26.966427    8.990417   11.046165    8.803745
#> ASV0002      Streptomyces 17.979426 13.4094871 17.322719    8.244194    5.536809    9.197919
#> ASV0007    Pseudonocardia  6.594303  6.4673835 17.036944    1.419046    2.447691    7.291772
#> ASV0010     Amycolatopsis  6.086760  9.2153128  5.991631    1.837111    2.974914    3.642673
#> ASV0009 Promicromonospora  1.440138  2.6425417  4.058412    1.461164    3.072715    7.449846
#> ASV0023    Micromonospora  1.409967  0.8506067  2.076555    2.588402    1.083355    2.368180
#Save table on your local machine
write.table(genus_location_table, file="genus_abund_location.txt", sep="\t")

If you want to do this at other taxonomic levels, simply change sample_table_genus_final to the corresponding table; e.g., at ASV level it would be sample_relabun_ASV_table. Remember to change metadata.micro to your metadata file and location to your variable of interest:

#Save OTU data (samples start at column 8)
otu_location_ASV <- sample_relabun_ASV_table[,8:ncol(sample_relabun_ASV_table)] %>% t() %>% as.data.frame()
#Save taxonomy data (taxonomic columns only; samples start at column 8)
tax_location_ASV <- sample_relabun_ASV_table[,1:7]

#Calculate OTU mean abundance based on grouping factor (e.g., location)
location_ASV_mean <- aggregate(otu_location_ASV, by=list(metadata.micro$location), FUN=mean)%>% column_to_rownames("Group.1") %>% t() #Change metadata.micro and location to your metadata and variable of interest
#Calculate OTU SD based on grouping factor (e.g., location) and change colnames
location_ASV_SD <- aggregate(otu_location_ASV, by=list(metadata.micro$location), FUN=sd)%>% column_to_rownames("Group.1")  %>% t()  %>% #Change metadata.micro and location to your metadata and variable of interest
  as.data.frame() %>% rename_with(.fn= ~paste0(colnames(location_ASV_mean), "SD"))

#Merge mean abundance, SD and taxonomy.
ASV_location_table <- merge(tax_location_ASV, location_ASV_mean, by=0) %>%column_to_rownames("Row.names") %>%
  merge(location_ASV_SD, by=0) %>% column_to_rownames("Row.names")

#Check that mean abundances sum to 100 (the SD columns do not need to)
colSums(ASV_location_table[,8:ncol(ASV_location_table)])
#>   location1   location2   location3 location1SD location2SD location3SD 
#>   100.00000   100.00000   100.00000   111.04717   115.16294    98.63739
#Save to local machine
write.table(ASV_location_table, file="ASV_abund_location.txt", sep="\t")

Taxonomic profile graphics

This section will cover the graphic representation of taxonomic profiles at genus level. This kind of graphic is useful to give a visual idea of the main taxonomic groups present in our communities. It will be exemplified at genus level, but it can easily be switched to phylum; the specific changes are explained in the comments (#) (check also the previous section, which explains how to produce the table).

First, we prepare a table for plotting. We create a column with the mean abundance of each taxon across conditions. With this column, we can sort the table and filter the number of taxa to plot. In our case, we usually represent the 20 most abundant genera. Check the number of columns for your own data, as it could differ depending on the number of samples or experimental settings that you have (inspect your table on your local machine or in RStudio with View(genus_location_table)).

###TAXONOMICAL PROFILE GENUS LEVEL###
#Format table. 
#Add a column with the mean abundance between conditions and arrange it in decreasing order. 
sample_table_genus_sort <- genus_location_table %>% mutate(mean=rowMeans(genus_location_table[,7:ncol(genus_location_table)])) %>% arrange(desc(mean))

#We have to use pivot_longer to get a table in the right format for ggplot
#We filter out unclassified
taxonomic_genus_location <-  sample_table_genus_sort[1:21,] %>% filter(Genus!="unclassified") %>% dplyr::select(6:9) %>% pivot_longer(!Genus, names_to="Samples", values_to="Abundance") ##We take rows 1:21 because the unclassified group is the most abundant; filtering it out leaves the 20 most abundant genera

The data is ready, but we still need to arrange colors and legend labels. We need to produce an automated expression for ggplot in order to show genus and phylum names in italics. We will use the package gdata. Let’s install it and load the library:

#Install gdata
install.packages("gdata")
#Load library
library(gdata);packageVersion("gdata")
#> [1] '2.18.0.1'

Now, we will produce the legend labels for genera in italics. First, we get the location names and the genus names. With sapply, we will build the expression italic("name of genus") for every genus and parse it for ggplot. If plotting at phylum level, make the proper changes (pay attention to the # comments).

#Get labels for location
location_label <- levels(as.factor(unique(taxonomic_genus_location$Samples)))

#Get unique names for genera
unique_genera <- unique(as.vector(taxonomic_genus_location$Genus)) #Change Genus to Phylum when needed

#REORDER FACTOR LEVELS IN DATA FRAME
taxonomic_genus_location$Genus=reorder.factor(taxonomic_genus_location$Genus,new.order=rev(unique_genera))#Change Genus to Phylum when needed

sorted_labels_genus<- as.data.frame(unique_genera)

##CREATE AN EXPRESSION FOR GGPLOT WITH ITALICS WHEN NEEDED.
sorted_labels_ggplot <- sapply(sorted_labels_genus$unique_genera,
                                           function(x) parse(text = paste0("italic('",as.character(x), "')")))

When having to color a great number of objects, the randomcoloR package can be very useful. It allows us to produce any number of random colors, and it is possible to modify them later so that none are repeated. You can combine it with the package scales to display the randomly generated colors, as well as their hex codes.

#Install randomcoloR
install.packages("randomcoloR")
#Install scales
install.packages("scales")
#Load libraries
library(randomcoloR);packageVersion("randomcoloR")
#> [1] '1.1.0.1'
library(scales);packageVersion("scales")
#> [1] '1.2.1'

With randomColor you can produce a set of random colors. Its length will be set by length(unique_genera), which tells us how many different genera there are.

#Produce random color and visualize them
#Get number of genera
length(unique_genera) #This will be the number of genera to color
#> [1] 20

colors_genus <-  randomColor(count=length(unique_genera), hue=c("random"))
show_col(colors_genus)

If you want to change some of these colors, it can be easily done as follows:

#Modify colors
colors_genus[5] <-  "#448800"

In this tutorial, we will use the following palette, which has been previously optimized to include highly distinctive colors.

#Produce a palette of colours
c24 <- c(
  "dodgerblue2", "#E31A1C", # red
  "green4",
  "#6A3D9A", # purple
  "pink", # orange
  "black", "gold1",
  "skyblue2", "#FB9A99", # lt pink
  "#CAB2D6", # lt purple
  "#FDBF6F", # lt orange
  "gray70", "khaki2",
  "maroon", "steelblue4",
  "darkturquoise", "green1", "yellow4", "yellow3",
  "darkorange4", "blue","orange", "darkgrey"
)

Main taxa plotting

At last, let’s plot the taxonomic profile at genus level! The comments (#) include information on how to adapt the code for a phylum profile.

#### BACTERIAL GENERA TAXONOMIC PROFILE ####

ggplot_title <- "Bacterial genera by location"

ggGenera=ggplot(taxonomic_genus_location, aes(x=Samples, y=Abundance, fill=Genus, order=Genus)) + #Change genus to phylum
  geom_bar(stat="identity", position="stack")+
  scale_fill_manual(values=c24, #Change c24 to colors_genus if you are using random colors
                    breaks=unique_genera, #Include genera names
                    labels=sorted_labels_ggplot)+ #Include italic tags

  labs(y="Mean relative abundance (%)", x=NULL, fill="Genus",title=ggplot_title)+
  guides(fill = guide_legend(reverse = TRUE))+
  scale_x_discrete(limits=location_label,
                   labels=location_label)+
  scale_y_continuous(expand=c(0.01,0.01),
                     breaks=c(0,10,20,30,40,50,60,70),
                     labels=c("0","10", "20","30","40","50","60","70"), 
                     limits = c(NA, 70))+
  theme_bw()+
  theme(panel.border = element_rect(colour="black"), axis.title.x=element_blank(),
        plot.title = element_text(face="bold", hjust = 0.5, size=16), axis.text = element_text(size = 14),
        axis.text.x = element_text(face="bold", size=10, angle = 45, vjust=1, hjust = 1), axis.title.y = element_text(size = 16),
        legend.key.size = unit(0.9, "cm"), legend.text = element_text(size = 8),
        legend.title = element_text(size=7, face="bold"), legend.title.align=0.5)


#Visualize plot
ggGenera

#Save it
ggsave("genera_profile.tiff", plot = ggGenera,width = 35, height = 30, units = "cm", dpi = 800 )

ANCOM-BC analysis

There are several methods to perform differential abundance analysis on microbial community data. Among them, we chose ANCOMBC because it is specifically designed for this kind of data and accounts for its compositional structure. Unfortunately, when there are multiple groups to compare, the ancombc function only compares the first group against all others. This is a big drawback, as it requires the user to change the input phyloseq object for every other combination of groups. To solve this problem, micro4all implements the function ancomloop. Apart from looping over groups, it also returns a table with ANCOM-corrected logarithmic abundances. First, we will install the microbiome package, as it is required by the ancombc function used by micro4all. The ANCOMBC package itself is already installed with micro4all. Although ANCOMBC has recently been updated to ANCOMBC-II, we continue to use ‘ancombc’ instead of ‘ancombc2’, as some of the latter’s functionality is still being revised. Therefore, it is normal that a warning message stating that “ancombc is deprecated” is displayed. Also, the current ‘ancombc’ function contains the argument ‘tax_level’, so we do not need to glom the phyloseq object at a specific taxonomic level beforehand. However, it is not possible to implement ANCOM at ASV level, and the ‘tax.level’ argument must always be specified (with one of the ranks allowed by taxRanks()); otherwise an error will occur.

##Install microbiome package
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("microbiome")
#Load microbiome library
library(microbiome);packageVersion("microbiome")
#> [1] '1.18.0'

Now we will see how to use the ancomloop function. The input has to be a phyloseq object, in our case loc_phyloseq. Moreover, we have to give the grouping variable (change it to your variable of interest). If we do not want to analyze the unclassified genera, we can set out.unclassified to TRUE; when this argument is set to TRUE, a tax.level should be provided. Finally, we can pass further arguments of the original ANCOMBC function through ancom.options. When running this function, the comparisons performed will be printed on screen (for example, “location1”-“location2”-“location3” means that location1 is being compared with location2 and location3). Once finished, it returns a list with as many elements as there are groups in the grouping variable, in this case 3. Each element is a data frame containing the taxonomy, the relative abundance corrected by ANCOMBC, standard deviations and all ANCOMBC results for every possible comparison. If you want to visualize each element, use View(ANCOM_location_genus[["location1"]]).

#### ANCOMBC COMPARISONS ####
ANCOM_location_genus <- ancomloop(input_object_phyloseq = loc_phyloseq, grouping = "location", ancom.options = list(global=FALSE, struc_zero=TRUE, n_cl=8), out.unclassified = TRUE, tax.level="Genus") #When performing ancombc at genus level, we can filter unclassified genera out with the out.unclassified and tax.level arguments #REMEMBER! Change location to your metadata variable of interest
#> [1] "location1" "location2" "location3"
#> [1] "location2" "location1" "location3"
#> [1] "location3" "location1" "location2"
#Visualize first lines 
head(ANCOM_location_genus[["location1"]])
#>             Genus  Kingdom         Phylum              Class               Order
#> 1   Achromobacter Bacteria Proteobacteria Betaproteobacteria     Burkholderiales
#> 2 Acidaminococcus Bacteria     Firmicutes      Negativicutes   Acidaminococcales
#> 3  Actinocorallia Bacteria Actinobacteria     Actinobacteria Streptosporangiales
#> 4    Actinomadura Bacteria Actinobacteria     Actinobacteria Streptosporangiales
#> 5 Actinophytocola Bacteria Actinobacteria     Actinobacteria   Pseudonocardiales
#> 6    Actinoplanes Bacteria Actinobacteria     Actinobacteria   Micromonosporales
#>                Family ASV_names Intercept_lfc Intercept_se Intercept_W Intercept_pval
#> 1      Alcaligenaceae      <NA>     2.8600907    0.2510751  11.3913736   4.616539e-30
#> 2  Acidaminococcaceae      <NA>    -0.1595666    0.3490052  -0.4572041   6.475244e-01
#> 3 Thermomonosporaceae      <NA>    -0.9505587    0.2369933  -4.0109101   6.048515e-05
#> 4 Thermomonosporaceae      <NA>     4.0296225    0.5369732   7.5043276   6.174469e-14
#> 5  Pseudonocardiaceae      <NA>     1.6747693    0.7152141   2.3416334   1.919956e-02
#> 6  Micromonosporaceae      <NA>     3.4144532    0.4015234   8.5037460   1.835689e-17
#>   Intercept_qval Intercept_diff location1vslocation2_lfc location1vslocation2_se
#> 1   7.109471e-28           TRUE               -2.5689349               0.7038106
#> 2   1.000000e+00          FALSE                1.4555339               0.5739782
#> 3   6.955792e-03           TRUE                0.0675060               0.3812191
#> 4   8.582512e-12           TRUE                0.3953266               0.6812116
#> 5   1.000000e+00          FALSE                0.6683542               1.0329619
#> 6   2.588322e-15           TRUE                0.2784884               0.5147770
#>   location1vslocation2_W location1vslocation2_p_val location1vslocation2_q_val
#> 1             -3.6500374               0.0002622022                 0.04431216
#> 2              2.5358696               0.0112168507                 1.00000000
#> 3              0.1770793               0.8594461208                 1.00000000
#> 4              0.5803286               0.5616930167                 1.00000000
#> 5              0.6470270               0.5176144794                 1.00000000
#> 6              0.5409885               0.5885155182                 1.00000000
#>   location1vslocation2_diff_abn location1vslocation3_lfc location1vslocation3_se
#> 1                          TRUE               -4.1372748               0.0000000
#> 2                         FALSE                0.2987000               0.6613357
#> 3                         FALSE                2.0390103               0.4893351
#> 4                         FALSE                0.1400161               0.7890357
#> 5                         FALSE                1.3378918               0.8607808
#> 6                         FALSE                1.3220703               0.4769997
#>   location1vslocation3_W location1vslocation3_p_val location1vslocation3_q_val
#> 1            -14.1889557                0.000000000                0.000000000
#> 2              0.4516616                0.651512781                1.000000000
#> 3              4.1668999                0.000030877                0.005063829
#> 4              0.1774521                0.859153279                1.000000000
#> 5              1.5542770                0.120118350                1.000000000
#> 6              2.7716375                0.005577511                0.792006625
#>   location1vslocation3_diff_abn
#> 1                          TRUE
#> 2                         FALSE
#> 3                          TRUE
#> 4                         FALSE
#> 5                         FALSE
#> 6                         FALSE
#Save any of the tables to your local machine
write.table(ANCOM_location_genus[["location1"]], file="ancom_location1_genus.txt", sep="\t")

Once saved to your local machine, the ANCOM table can be manually filtered to keep only significant results. However, this can be done in R as well. You will see that we select a specific range of columns (1:19) in order to keep only the location1 vs. location2 comparison.

###Filter ANCOM Results
#Get results for location1
ANCOM_location1_genus <- ANCOM_location_genus[["location1"]] #Change location1 to your variable of interest
#Get only significant results
ANCOM_loc1vsloc2_genus_sig <- ANCOM_location1_genus[,1:19] %>%  filter(location1vslocation2_diff_abn == TRUE) #Change [,1:19] according to the comparison of interest

#Write results on your local machine
write.table(ANCOM_loc1vsloc2_genus_sig, file="ANCOM_loc1loc2_sig.txt", sep="\t")

ANCOM-BC graphic

Graphical representation of differential abundance results can be achieved with two approaches: log fold change with corrected abundances, or relative abundance without correction. Here, we will show how to accomplish both graphics. For publication purposes, we prefer to combine both graphics in one figure (see the sketch after the pyramidal plot below).

We will start with the log fold change, represented in a waterfall plot. As we ran ANCOM at genus level, we will bind the phylum name to the genus name for easier identification (this is most useful when representing differential abundances at ASV level). The graphic will be exemplified with the location1 vs. location2 comparison.

#Bind phylum name to genus name
#Create a new variable for graphics
graphic_loc1vsloc2_genus<-ANCOM_loc1vsloc2_genus_sig

#Loop to add a new column called Classification and paste phylum and genus names
for (i in 1:nrow(graphic_loc1vsloc2_genus)){
  graphic_loc1vsloc2_genus$Classification[i]<- paste(graphic_loc1vsloc2_genus$Phylum[i],graphic_loc1vsloc2_genus$Genus[i],sep="|")
  }
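Since paste is vectorized in R, the same column can be created without a loop:

#Vectorized alternative to the loop above
graphic_loc1vsloc2_genus$Classification <- paste(graphic_loc1vsloc2_genus$Phylum, graphic_loc1vsloc2_genus$Genus, sep="|")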

When working with unclassified genera or with DA analysis at ASV level, it is interesting to show the classification at the last classified level (family, order, class or phylum). The next loop pastes names according to the classification:

#Get the last classified taxonomy level
for (i in 1:nrow(graphic_loc1vsloc2_genus)){ #Loop over rows
  if (isTRUE(graphic_loc1vsloc2_genus$Genus[i]=="unclassified")& isTRUE(graphic_loc1vsloc2_genus$Family[i]!="unclassified")){
  graphic_loc1vsloc2_genus$Classification[i] <- paste0("Family", " ", graphic_loc1vsloc2_genus$Family[i],
         "|", graphic_loc1vsloc2_genus$Genus[i])}
  else if (isTRUE(graphic_loc1vsloc2_genus$Family[i]=="unclassified") &isTRUE(graphic_loc1vsloc2_genus$Order[i]!="unclassified")){
graphic_loc1vsloc2_genus$Classification[i] <- paste0("Order", " ", graphic_loc1vsloc2_genus$Order[i],
                                                     "|", graphic_loc1vsloc2_genus$Genus[i])
}
  else if (isTRUE(graphic_loc1vsloc2_genus$Order[i]=="unclassified") &isTRUE(graphic_loc1vsloc2_genus$Class[i]!="unclassified")){
    graphic_loc1vsloc2_genus$Classification[i] <- paste0("Class", " ", graphic_loc1vsloc2_genus$Class[i],
                                                         "|", graphic_loc1vsloc2_genus$Genus[i])
  }
  else if (isTRUE(graphic_loc1vsloc2_genus$Class[i]=="unclassified") &isTRUE(graphic_loc1vsloc2_genus$Phylum[i]!="unclassified")){
    graphic_loc1vsloc2_genus$Classification[i] <- paste0("Phylum", " ", graphic_loc1vsloc2_genus$Phylum[i],
                                                         "|", graphic_loc1vsloc2_genus$Genus[i])
  }
  else if (isTRUE(graphic_loc1vsloc2_genus$Phylum[i]=="unclassified")){
    graphic_loc1vsloc2_genus$Classification[i] <- paste0("Phylum", " ", "unclassified")
  }
  else if (isTRUE(graphic_loc1vsloc2_genus$Genus[i]!="unclassified")) {
    graphic_loc1vsloc2_genus$Classification[i] <- paste0("Phylum", " ", graphic_loc1vsloc2_genus$Phylum[i],
                                                         "|", graphic_loc1vsloc2_genus$Genus[i])
    
  }}
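The same logic can be expressed more compactly with dplyr::case_when. This is just a sketch of an equivalent formulation; note that case_when returns NA for rows matching none of the conditions, while the loop above keeps the value assigned previously:

#Compact alternative to the loop above using dplyr::case_when
graphic_loc1vsloc2_genus <- graphic_loc1vsloc2_genus %>%
  mutate(Classification = case_when(
    Genus == "unclassified" & Family != "unclassified" ~ paste0("Family ", Family, "|", Genus),
    Family == "unclassified" & Order != "unclassified" ~ paste0("Order ", Order, "|", Genus),
    Order == "unclassified" & Class != "unclassified" ~ paste0("Class ", Class, "|", Genus),
    Class == "unclassified" & Phylum != "unclassified" ~ paste0("Phylum ", Phylum, "|", Genus),
    Phylum == "unclassified" ~ "Phylum unclassified",
    Genus != "unclassified" ~ paste0("Phylum ", Phylum, "|", Genus)
  ))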

Next, we can filter out unclassified phyla:

#Filter out unclassified phyla
graphic_loc1vsloc2_genus <- graphic_loc1vsloc2_genus %>% filter(Phylum!="unclassified") 

If using ANCOM at ASV level, the code would be as follows:

#Get the last classified taxonomy level for ANCOM at ASV level
for (i in 1:nrow(graphic_loc1vsloc2_ASV)){
  if (isTRUE(graphic_loc1vsloc2_ASV$Genus[i]=="unclassified")& isTRUE(graphic_loc1vsloc2_ASV$Family[i]!="unclassified")){
  graphic_loc1vsloc2_ASV$Classification[i] <- paste0("Family", " ", graphic_loc1vsloc2_ASV$Family[i],
         "|", graphic_loc1vsloc2_ASV$Genus[i], "|", graphic_loc1vsloc2_ASV$ASV_names[i])}
  else if (isTRUE(graphic_loc1vsloc2_ASV$Family[i]=="unclassified") &isTRUE(graphic_loc1vsloc2_ASV$Order[i]!="unclassified")){
graphic_loc1vsloc2_ASV$Classification[i] <- paste0("Order", " ", graphic_loc1vsloc2_ASV$Order[i],
                                                     "|", graphic_loc1vsloc2_ASV$Genus[i],"|", graphic_loc1vsloc2_ASV$ASV_names[i])
}
  else if (isTRUE(graphic_loc1vsloc2_ASV$Order[i]=="unclassified") &isTRUE(graphic_loc1vsloc2_ASV$Class[i]!="unclassified")){
    graphic_loc1vsloc2_ASV$Classification[i] <- paste0("Class", " ", graphic_loc1vsloc2_ASV$Class[i],
                                                         "|", graphic_loc1vsloc2_ASV$Genus[i],"|", graphic_loc1vsloc2_ASV$ASV_names[i])
  }
  else if (isTRUE(graphic_loc1vsloc2_ASV$Class[i]=="unclassified") &isTRUE(graphic_loc1vsloc2_ASV$Phylum[i]!="unclassified")){
    graphic_loc1vsloc2_ASV$Classification[i] <- paste0("Phylum", " ", graphic_loc1vsloc2_ASV$Phylum[i],
                                                         "|", graphic_loc1vsloc2_ASV$Genus[i],"|", graphic_loc1vsloc2_ASV$ASV_names[i])
  }
  else if (isTRUE(graphic_loc1vsloc2_ASV$Phylum[i]=="unclassified")){
    graphic_loc1vsloc2_ASV$Classification[i] <- paste0("Phylum", " ", "unclassified")
  }
  else if (isTRUE(graphic_loc1vsloc2_ASV$Genus[i]!="unclassified")) {
    graphic_loc1vsloc2_ASV$Classification[i] <- paste0("Phylum", " ", graphic_loc1vsloc2_ASV$Phylum[i],
                                                         "|", graphic_loc1vsloc2_ASV$Genus[i],"|", graphic_loc1vsloc2_ASV$ASV_names[i])
    
  }}

Now we have a graphic_loc1vsloc2_genus variable with the ANCOM results and a new column called Classification. Next, we create a data frame including only the data we want to represent: the Classification column, the log fold change (the lfc columns in the ANCOM results) and the standard error (the se columns in the ANCOM results; here stored in a column named SD). In this data frame, genera whose log fold change is 0, or below a specific threshold, will be filtered out to reduce the size of the graphic.

##Create the data frame for representation, with Classification, Log fold change and standard deviation
logfold_df <-  graphic_loc1vsloc2_genus %>% dplyr::select(Classification, location1vslocation2_lfc, location1vslocation2_se)

colnames(logfold_df)<- c("Genus", "LogFold_location1vslocation2", "SD")

##Filter out 0 log fold change and assign name according to the direction of change
logfold_df  <-  logfold_df %>%
  filter(LogFold_location1vslocation2 != 0) %>% arrange(desc(LogFold_location1vslocation2)) %>%
  mutate(group = ifelse(LogFold_location1vslocation2 > 0, "Location2", "Location1")) #Change location to your variable of interest from your mt object

logfold_df$Genus = factor(logfold_df$Genus, levels = logfold_df$Genus)

Let’s write the table to your local machine:

write.table(file="waterfall_table.txt", logfold_df, sep="\t")

Here we have the filtering step. First, the range function is used to see the minimum and maximum log fold change. Then, we filter the data frame. Remember to change the numbers to your desired threshold; in this tutorial we will keep values below -1.5 and above 1.5.

#FILTER WHEN NEEDED. This can be used to filter the results according to log fold change level
range(logfold_df$LogFold_location1vslocation2)
#> [1] -2.880552  3.241747
logfold_df_filtered <- rbind(logfold_df[which(logfold_df$LogFold_location1vslocation2>=1.5),], logfold_df[which(logfold_df$LogFold_location1vslocation2<=-1.5),])

Now that we have prepared all the data, we can use ggplot to plot the log fold change.

##Waterfall plot
waterfall_location1 <-  ggplot(data = logfold_df_filtered,#If not filtering, use logfold_df in plotting
           aes(x = Genus, y = LogFold_location1vslocation2, fill = group, color = group)) + #Change Genus to taxonomic rank of interest and logFold_location1vslocation2 to your object
  geom_bar(stat = "identity", width = 0.7,
           position = position_dodge(width = 0.4)) +
  geom_errorbar(aes(ymin = LogFold_location1vslocation2 - SD, ymax = LogFold_location1vslocation2 + SD), width = 0.2, #Change LogFold_location1vslocation2 to your object
                position = position_dodge(0.05), color = "black") +
  labs(x = NULL, y = "Log fold change",
       title = "Waterfall Plot for the Location Effect") + #change title 
  theme_bw() +
  theme(legend.position = "right",
        legend.title =element_blank(),
        plot.title = element_text(hjust = 0.5),
        panel.grid.minor.y = element_blank(),
        axis.text.x = element_text(angle = 60, hjust = 1), plot.margin=margin(10,10,10,100))

#Visualize plot
waterfall_location1

#Save it
ggsave("waterfall_location1.tiff", plot = waterfall_location1,width = 17, height = 30, units = "cm", dpi = 800 )

As mentioned at the beginning of this section, we will also produce a graphic with relative abundances, specifically a back-to-back (or pyramidal) plot. Again, we will show how to do this plot with the location1 vs. location2 comparison. We need to start from a relative abundance table, in our case genus_location_table (relative abundance at genus level, grouped by location). First, we will filter out those genera that were not statistically different according to ANCOM and sort the table in the same order as logfold_df_filtered (the waterfall plot table). Then, we will select the columns with classification and abundance, and use pivot_longer to give it a format suitable for ggplot2.

#Again, bind phylum name and genus name 
pyramidal_df<- genus_location_table
for (i in 1:nrow(pyramidal_df)){
  pyramidal_df$Classification[i]<- paste(pyramidal_df$Phylum[i],pyramidal_df$Genus[i],sep="|")
  }
#Filter to include only significant genera according to ANCOM
pyramidal_loc1vsloc2_filt <- pyramidal_df[which(pyramidal_df$Classification %in%logfold_df_filtered$Genus ),]

#Sort it in the same way as waterfall plot
pyramidal_loc1vsloc2_filt <- pyramidal_loc1vsloc2_filt[match(logfold_df_filtered$Genus,pyramidal_loc1vsloc2_filt$Classification),]

#Now, transform it to ggplot format
pyramidal_loc1vsloc2 <- pyramidal_loc1vsloc2_filt %>% dplyr::select(c(13, 7:8)) %>% #select columns with Classification and abundance
  pivot_longer(!Classification, names_to="Samples", values_to="Abundance")#

We can write this table to our local machine:

write.table(file="pyramidal_table.txt", pyramidal_loc1vsloc2, sep="\t")

Finally, the data frame (pyramidal_loc1vsloc2) has to be split into one data frame per location, and the Classification column has to be converted to a factor (this is needed to keep the order in ggplot). This is essential for the back-to-back plot.

#Split table according to location
loc1pyrm <- pyramidal_loc1vsloc2[which(pyramidal_loc1vsloc2$Samples=="location1"),] #Change location1 to your group of interest
loc2pyrm <- pyramidal_loc1vsloc2[which(pyramidal_loc1vsloc2$Samples=="location2"),] #Change location2 to your group of interest

#Create vector with factors
loc1pyrm$Classification <- factor(loc1pyrm$Classification, levels=loc1pyrm$Classification)
loc2pyrm$Classification <- factor(loc2pyrm$Classification, levels=loc2pyrm$Classification)

Before plotting, it is important to know the maximum and minimum values of abundances to set the limits of the graphic:

#Get limits for plotting
min(pyramidal_loc1vsloc2$Abundance)
#> [1] 0
max(pyramidal_loc1vsloc2$Abundance)
#> [1] 3.237008

As we can see, we should change the limits to a maximum of 4. Check the comments below (#):

#### PYRAMIDAL PLOT ####
pyramidalggplot=ggplot(data = pyramidal_loc1vsloc2, mapping= aes(x = Classification, y=Abundance, fill = Samples), colour="white")+
  geom_bar(data=loc1pyrm, stat="identity", mapping=aes(y=-Abundance))+ #in data, give the first split dataframe
  geom_bar(data=loc2pyrm, stat="identity")+ #in data, give the second split dataframe
  theme_bw()+
  scale_y_continuous(expand=c(0,0), labels=abs, limits=c(-4,4), breaks=seq(-4,4,1))+ #Change limits here
  labs(y="Relative abundance (%)")+
  theme(legend.title=element_blank(), axis.title.y=element_blank(), axis.text.y= element_text(size = 8, face="bold"),
        axis.text.x = element_text(size=14, face="bold"), axis.title.x = element_text(size=18, face="bold"), legend.key.size = unit(1.1, "cm"),legend.text = element_text(size = 16), panel.grid.minor.y = element_blank(), panel.grid.major.y = element_blank(),panel.grid.minor.x = element_blank())+
  coord_flip()


#Visualize plot
pyramidalggplot

#Save it
ggsave("pyramidal_loc1_loc2.tiff", plot = pyramidalggplot,width = 17, height = 30, units = "cm", dpi = 800 )

Another way to represent the previous plot is by abundance order. We will generate the tables needed from the original genus table in reversed order, skipping the step that sorted the pyramidal data frame to match the waterfall plot.

#Get genus location table in reverse order
genus_location_table_rev <- genus_location_table[nrow(genus_location_table):1,]
#Again, bind phylum name and genus name 
pyramidal_df_abun<- genus_location_table_rev
for (i in 1:nrow(pyramidal_df_abun)){
  pyramidal_df_abun$Classification[i]<- paste(pyramidal_df_abun$Phylum[i],pyramidal_df_abun$Genus[i],sep="|")
  }
#Filter to include only significant genera according to ANCOM
pyramidal_loc1vsloc2_filt_abund <- pyramidal_df_abun[which(pyramidal_df_abun$Classification %in%logfold_df_filtered$Genus ),]

#Now, transform it to ggplot format
pyramidal_loc1vsloc2_abund <- pyramidal_loc1vsloc2_filt_abund %>% dplyr::select(c(13, 7:8)) %>% #select columns with Classification and abundance
  pivot_longer(!Classification, names_to="Samples", values_to="Abundance")#

#Split table according to location
loc1pyrm_abund <- pyramidal_loc1vsloc2_abund[which(pyramidal_loc1vsloc2_abund$Samples=="location1"),]#Change location1 to variable of interest
loc2pyrm_abund <- pyramidal_loc1vsloc2_abund[which(pyramidal_loc1vsloc2_abund$Samples=="location2"),]#Change location1 to variable of interest

#Create vector with factors
loc1pyrm_abund$Classification <- factor(loc1pyrm_abund$Classification, levels=loc1pyrm_abund$Classification)
loc2pyrm_abund$Classification <- factor(loc2pyrm_abund$Classification, levels=loc2pyrm_abund$Classification)

Let’s see how it works:

#### PYRAMIDAL PLOT ####
pyramidalggplot_abund=ggplot(data = pyramidal_loc1vsloc2_abund, mapping= aes(x = Classification, y=Abundance, fill = Samples), colour="white")+
  geom_bar(data=loc1pyrm_abund, stat="identity", mapping=aes(y=-Abundance))+ #in data, give the first split dataframe
  geom_bar(data=loc2pyrm_abund, stat="identity")+ #in data, give the second split dataframe
  theme_bw()+
  scale_y_continuous(expand=c(0,0), labels=abs, limits=c(-4,4), breaks=seq(-4,4,1))+ #Change limits here
  labs(y="Relative abundance (%)")+
  theme(legend.title=element_blank(), axis.title.y=element_blank(), axis.text.y= element_text(size = 8, face="bold"),
        axis.text.x = element_text(size=14, face="bold"), axis.title.x = element_text(size=18, face="bold"), legend.key.size = unit(1.1, "cm"),legend.text = element_text(size = 16), panel.grid.minor.y = element_blank(), panel.grid.major.y = element_blank(),panel.grid.minor.x = element_blank())+
  coord_flip()


#Visualize plot
pyramidalggplot_abund

#Save it
ggsave("pyramidal_loc1_loc2.tiff", plot = pyramidalggplot,width = 17, height = 30, units = "cm", dpi = 800 )

Constrained analysis of proximities (CAP)

Microbial communities are greatly influenced by environmental factors. Thus, it is interesting to analyze how these variables relate to community structure. This can be achieved by studying different soil physicochemical parameters and fitting them to an ordination plot. The package vegan includes several functions and a very useful tutorial.

Among the options offered by vegan, we have arranged them in a way that is logical for our daily studies. We will exemplify this with some data from pine trees (which can be downloaded from the link at the beginning of the tutorial).

We will read this phyloseq file (saved as .RDS), convert the abundances to relative abundances and get the soil physicochemical parameters from metadata.

### PREPARE DATA FOR CAP #### 
#Read phyloseq and get relative abundance
cap_phyloseq <- readRDS(file = "cap_phyloseq")
cap_phyloseq_rel <- transform_sample_counts(cap_phyloseq, function(x){100*x/sum(x)})

#Get ASV and metadata
ASV_cap<-as.data.frame(t(otu_table(cap_phyloseq_rel)))
metadata_cap <- as.matrix(sample_data(cap_phyloseq))
metadata_cap <- as.data.frame(metadata_cap)

#metadata_cap <- metadata_cap[1:ncol(metadata_cap)-1] #Remove column Season_NucleicAcid

If using your own data, you can use your previously generated phyloseq object. If physicochemical metadata was not originally in your metadata file, you can always read it into a variable in R with metadata_cap <- read.table("yourmtfile.txt", sep="\t", header=TRUE) and row.names(metadata_cap) <- metadata_cap$samples. In this file, row names should match sample names and columns should correspond to physicochemical parameters.
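As a sketch (the file name and the samples column are hypothetical):

#Read physicochemical metadata from a tab-separated file
metadata_cap <- read.table("yourmtfile.txt", sep="\t", header=TRUE)
#Row names must match the sample names of the phyloseq object
row.names(metadata_cap) <- metadata_cap$samples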

Now, we will generate two models. CAP1 will include all variables in the constrained analysis, that is, it will model the variance explained by all physicochemical variables. CAP0 is the opposite of CAP1, i.e., fully unconstrained (the variance is analyzed without the influence of the physicochemical parameters).

These two models will be the input for the ordistep function. This function allows us to obtain a model containing only the physicochemical variables that significantly explain a part of the variance in the community and that are not collinear with each other. It starts from the unconstrained model (CAP0) and, in each step, adds a new variable to the constrained model, followed by a checking step. If the remaining variables are no longer significant, or add no more explained variance to the model, the function stops. The output is a formula with the variables that explain most of the variance in the data. In this way, when there are many variables, the model can be reduced by removing those that are interdependent.

#Compute CAP1 and CAP0 models
CAP1<-capscale(ASV_cap~Water+ln_P+SOM+N+ph_KCl+CN+Na_Exch+Clay+Sand+Slime+Texture, data=metadata_cap,distance = "bray")#Add here, after ASV_cap, all your parameters

CAP0<-capscale(ASV_cap~1, data=metadata_cap, distance = "bray")

#Get formula with ordistep
CAP_ordi<-ordistep(CAP0, scope=formula(CAP1))
#> 
#> Start: ASV_cap ~ 1 
#> 
#>           Df  AIC      F Pr(>F)   
#> + ph_KCl  11   28 1.2601  0.010 **
#> + Clay    15   10 1.0955  0.335   
#> + Water   15   10 1.0896  0.355   
#> + ln_P     8   33 1.0101  0.495   
#> + Na_Exch 10   33 0.9478  0.735   
#> + SOM     14   23 0.8758  0.905   
#> + N       13   28 0.8611  0.910   
#> + CN      16 -Inf                 
#> + Sand    16 -Inf                 
#> + Slime   16 -Inf                 
#> + Texture 16 -Inf                 
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Step: ASV_cap ~ ph_KCl 
#> 
#>          Df    AIC      F Pr(>F)  
#> - ph_KCl 11 28.737 1.2601   0.02 *
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#>           Df  AIC      F Pr(>F)
#> + Na_Exch  4    8 1.0277  0.465
#> + ln_P     4   10 0.9176  0.540
#> + Water    5 -Inf              
#> + SOM      5 -Inf              
#> + N        5 -Inf              
#> + CN       5 -Inf              
#> + Clay     5 -Inf              
#> + Sand     5 -Inf              
#> + Slime    5 -Inf              
#> + Texture  5 -Inf
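If you want to double-check the model retained by ordistep, vegan’s permutation-based anova can be applied to it term by term (a quick sanity check; not part of the original workflow):

#Permutation test of the model selected by ordistep
anova(CAP_ordi, by = "terms", permutations = 999)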

Thus, we now know which variables influence our microbial community according to ordistep; in this case, pH. With this parameter, we will use the function envfit to fit this environmental factor as a vector onto an ordination plot and get its correlation with each ordination axis. Remember that these correlation values correspond to an unconstrained analysis, i.e., to the ordination plots produced in the β-diversity analysis.

#### ENVFIT ####
ef <-  envfit (CAP1~ph_KCl, data = metadata_cap, perm = 999) #After CAP1, introduce the significant parameters

Now, we can adjust the p-values for multiple comparisons and visualize the result:

#Now we can adjust pvalues of envfit
ef.adj <- ef
pvals.adj <- p.adjust (ef$vectors$pvals, method = 'BH')
ef.adj$vectors$pvals <- pvals.adj
ef.adj
#> 
#> ***VECTORS
#> 
#> $pvals
#> numeric(0)
#> 
#> 
#> ***FACTORS:
#> 
#> Centroids:
#>              CAP1    CAP2
#> ph_KCl5.6 -0.5680 -1.0802
#> ph_KCl5.7 -0.6385 -1.0075
#> ph_KCl5.8 -0.4246 -0.1969
#> ph_KCl5.9 -0.5018  0.1034
#> ph_KCl6.0  0.6368  0.3278
#> ph_KCl6.1  0.1524  0.6719
#> ph_KCl6.2 -0.3268  1.0491
#> ph_KCl6.4 -0.0529  1.3443
#> ph_KCl6.5  1.0748  0.1100
#> ph_KCl6.6  0.9464 -0.2098
#> ph_KCl7.0  0.4793 -0.3622
#> ph_KCl7.7  1.6547 -0.9666
#> 
#> Goodness of fit:
#>           r2 Pr(>r)  
#> ph_KCl 0.818  0.078 .
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> Permutation: free
#> Number of permutations: 999

With plot_ordination we can get a graphic representation of the constrained ordination. In this way, we can visualize the direction of change of the environmental variables. We will introduce another variable, although it was not significant, to show how to plot several parameters.

#### CAP PLOT ####
#First, get distance matrix
distance_matrix <- phyloseq::distance(physeq = cap_phyloseq_rel, method = "wunifrac")

#Then, generate ordination with CAP (that is, constrained ordination to ph_KCl)
CAP_wunifrac <-  ordinate(physeq = cap_phyloseq_rel, method = "CAP",distance = distance_matrix,
                                  formula = ~ph_KCl+Clay)#Change parameters to yours

#Produce plot
CAP_wunifrac_plot  <- plot_ordination(physeq = cap_phyloseq_rel, ordination = CAP_wunifrac,
                                             color = "Species", axes = c(1,2)) +
  aes(shape = Species) +
  geom_point(aes(colour = Species), alpha = 1, size = 3.5) +
  scale_shape_manual(values=c("Pinaster"=16,"Other"=15), #Change Pinaster and other according to your variable of interest
                     breaks=c("Pinaster","Other"))+#Change Pinaster and other according to your variable of interest
  scale_color_manual(values = c("Pinaster"="brown", "Other"="orange"), #Change Pinaster and other according to your variable of interest
                     name="Host genotype", 
                     breaks=c("Pinaster", "Other"),
                     labels=expression(paste("Confirmed ", italic("P. pinaster")), "Pine forest samples"))+ #This is used to set the legend, change it according to your preference
  guides(shape="none")+
  ggtitle("CAP on Weighted Unifrac distance")+
  theme_bw()+
  theme(legend.key=element_blank(),
        legend.title.align = 0.85,
        legend.title = element_text(face="bold"),
        axis.text = element_text(size=14),
        axis.title = element_text(size = 16),
        plot.title = element_text(hjust=0.5, face="bold"),
        legend.text = element_text(size = 16))+
  geom_hline(aes(yintercept = c(0.00)), lty=2, colour="grey")+
  geom_vline(aes(xintercept = c(0.00)), lty=2, colour="grey")

#Define arrows aesthetic and labels
arrowmat <-  vegan::scores(CAP_wunifrac, display = "bp")
arrowdf <-  data.frame(labels = rownames(arrowmat), arrowmat) 
arrow_map  <-  aes(xend = CAP1, yend = CAP2,x = 0, y = 0, shape = NULL, color = NULL, label=labels)
label_map <- aes(x = 1.1 * CAP1, y = 1.1 * CAP2, shape = NULL,color = NULL, label=labels)
arrowhead  <-  arrow(length = unit(0.02, "npc"))

#Introduce them to plot
CAP_wunifrac_plot +
  geom_segment(mapping = arrow_map, size = c(ph_KCl=1,Clay=1), data = arrowdf, color = 
                 c(ph_KCl="black",Clay="black"),arrow = arrowhead) + #Change ph_KCl and Clay to your physicochemical variables of interest
  geom_text(mapping = label_map, size = 4,data = arrowdf, show.legend = FALSE)
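As with the previous figures, the annotated ordination can be stored in a variable and saved with ggsave (a minimal sketch; the file name is arbitrary):

#Store the plot with arrows and save it to your local machine
CAP_final_plot <- CAP_wunifrac_plot +
  geom_segment(mapping = arrow_map, size = 1, data = arrowdf, color = "black", arrow = arrowhead) +
  geom_text(mapping = label_map, size = 4, data = arrowdf, show.legend = FALSE)

ggsave("CAP_wunifrac.tiff", plot = CAP_final_plot, width = 35, height = 30, units = "cm", dpi = 800)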

BONUS: statistical analysis without micro4all

In this section, we will show how to perform the statistical analysis presented in this tutorial without using functions from the micro4all package. This section will not include detailed explanations, as everything was already explained throughout the tutorial. We will start with α-diversity. First of all, let’s load some libraries (which were installed together with micro4all):

##Load library

library(car);packageVersion("car")
#> Loading required package: carData
#> 
#> Attaching package: 'car'
#> The following object is masked from 'package:DescTools':
#> 
#>     Recode
#> The following object is masked from 'package:dplyr':
#> 
#>     recode
#> The following object is masked from 'package:purrr':
#> 
#>     some
#> [1] '3.1.1'
library(stats);packageVersion("stats")
#> [1] '4.2.2'
library(BiocGenerics); packageVersion("BiocGenerics")
#> [1] '0.42.0'
library(agricolae);packageVersion("agricolae")
#> Registered S3 methods overwritten by 'klaR':
#>   method      from 
#>   predict.rda vegan
#>   print.rda   vegan
#>   plot.rda    vegan
#> 
#> Attaching package: 'agricolae'
#> The following object is masked from 'package:ape':
#> 
#>     consensus
#> The following object is masked from 'package:seqinr':
#> 
#>     consensus
#> [1] '1.3.5'
library(pairwiseAdonis);packageVersion("pairwiseAdonis")
#> Loading required package: cluster
#> [1] '0.4.1'
library(dunn.test);packageVersion("dunn.test")
#> [1] '1.3.5'
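The alpha_table variable used below was produced earlier in the tutorial. If you are starting from a phyloseq object instead, a similar table can be rebuilt with phyloseq’s estimate_richness (a minimal sketch, assuming norm_phyloseq and metadata.micro from previous sections):

#Rebuild an alpha-diversity table from a phyloseq object
alpha_table <- estimate_richness(norm_phyloseq, measures = c("Observed", "Shannon", "InvSimpson"))
alpha_table$location <- metadata.micro$location #Add the grouping variable (sample order must match)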

For balanced ANOVA, the next lines should be repeated for every diversity index:

#### ALPHA DIVERSITY WITHOUT MICRO4ALL ####
##Balanced anova
res.aov_observed <-  aov(Observed ~ location, data = alpha_table)#Change OBSERVED for index of interest and location to mt variable
summary(res.aov_observed)
#>             Df Sum Sq Mean Sq F value Pr(>F)
#> location     2   6191    3095   1.126  0.343
#> Residuals   21  57753    2750
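Instead of copying these lines for every index, a small loop can run all the ANOVAs at once (a sketch; adjust the index names to the columns of your alpha_table):

#Run the balanced ANOVA for several diversity indices at once
indices <- c("Observed", "Shannon", "InvSimpson")
aov_list <- lapply(indices, function(ind) aov(as.formula(paste(ind, "~ location")), data = alpha_table))
names(aov_list) <- indices
#Print a summary for each index
lapply(aov_list, summary)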

The same logic can be applied to unbalanced designs:

##Unbalanced anova
res.aov_Observed_u <-  Anova(res.aov_observed, type = "III")# Change res.aov_observed for every index (res.aov_shannon, etc...) 
res.aov_Observed_u 
#> Anova Table (Type III tests)
#> 
#> Response: Observed
#>             Sum Sq Df  F value    Pr(>F)    
#> (Intercept) 568711  1 206.7924 2.421e-12 ***
#> location      6191  2   1.1255    0.3433    
#> Residuals    57753 21                       
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

As a post-hoc test, the Tukey test or the Dunn test can be applied. Again, this should be repeated for every index:

##POST-HOC TEST, TUKEYHSD 
post_observed <- TukeyHSD(res.aov_observed, which = "location")$location # Change res.aov_observed for every index (res.aov_shannon, etc...) and location to mt variable
which(post_observed[,"p adj"]<0.05) 
#> named integer(0)
##Post-hoc, dunn Test
dunn.result<- dunn.test(x=alpha_table$Observed, g=alpha_table$location, method = "BH")
#>   Kruskal-Wallis rank sum test
#> 
#> data: x and group
#> Kruskal-Wallis chi-squared = 1.4, df = 2, p-value = 0.5
#> 
#> 
#>                            Comparison of x by group                            
#>                              (Benjamini-Hochberg)                              
#> Col Mean-|
#> Row Mean |   location   location
#> ---------+----------------------
#> location |  -0.901953
#>          |     0.2753
#>          |
#> location |   0.212224   1.114177
#>          |     0.4160     0.3978
#> 
#> alpha = 0.05
#> Reject Ho if p <= alpha/2

Now, normality and homoscedasticity will be checked:

##Levene test and shapiro test
leveneTest(Observed ~ location, data = alpha_table)#Change OBSERVED for index of interest  and location to mt variable
#> Warning in leveneTest.default(y = y, group = group, ...): group coerced to factor.
#> Levene's Test for Homogeneity of Variance (center = median)
#>       Df F value Pr(>F)
#> group  2  0.4798 0.6255
#>       21
#Shapiro
aov_residuals_observed = residuals(object = res.aov_observed) # Change res.aov_observed for every index (res.aov_shannon, etc...)
shapiro.test(x = aov_residuals_observed) 
#> 
#>  Shapiro-Wilk normality test
#> 
#> data:  aov_residuals_observed
#> W = 0.96527, p-value = 0.5529

For Kruskal-Wallis and pairwise (Wilcoxon test), a similar procedure will be followed:

##Kruskal Wallis
k_observed <- kruskal.test(Observed ~ location, data = alpha_table) #Change OBSERVED for index of interest and location to mt variable
k_observed
#> 
#>  Kruskal-Wallis rank sum test
#> 
#> data:  Observed by location
#> Kruskal-Wallis chi-squared = 1.4, df = 2, p-value = 0.4966
#Wilcoxon test
pairwise.wilcox.test(alpha_table$Observed, alpha_table$location, p.adjust.method = "BH") #Change OBSERVED for index of interest and location to mt variable
#> Warning in wilcox.test.default(xi, xj, paired = paired, ...): cannot compute exact p-value with
#> ties

#> Warning in wilcox.test.default(xi, xj, paired = paired, ...): cannot compute exact p-value with
#> ties
#> 
#>  Pairwise comparisons using Wilcoxon rank sum exact test 
#> 
#> data:  alpha_table$Observed and alpha_table$location 
#> 
#>           location1 location2
#> location2 0.66      -        
#> location3 0.92      0.66     
#> 
#> P value adjustment method: BH

For β-diversity analysis, PERMANOVA test can be applied to a single distance method.

#### BETA DIVERSITY WITHOUT MICRO4ALL ####
#PERMANOVA with bray curtis distances
df <- data.frame(sample_data(norm_phyloseq))
bray <- phyloseq::distance(norm_phyloseq, method="bray") #Change bray to include another distance method
permanova_all <- adonis2(bray~ location, data=df)  #Change location to mt variable of interest
permanova_all
#> Permutation test for adonis under reduced model
#> Terms added sequentially (first to last)
#> Permutation: free
#> Number of permutations: 999
#> 
#> adonis2(formula = bray ~ location, data = df)
#>          Df SumOfSqs      R2      F Pr(>F)    
#> location  2   1.4040 0.30049 4.5105  0.001 ***
#> Residual 21   3.2684 0.69951                  
#> Total    23   4.6725 1.00000                  
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
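The same test can be repeated over several distance methods with a loop (a sketch; UniFrac-based distances would additionally require a phylogenetic tree in norm_phyloseq):

#PERMANOVA over several distance methods
dist_methods <- c("bray", "jaccard")
permanova_list <- lapply(dist_methods, function(m){
  d <- phyloseq::distance(norm_phyloseq, method = m)
  adonis2(d ~ location, data = df)
})
names(permanova_list) <- dist_methods
permanova_list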

Dispersion will be checked with the betadisper function. If working with another distance, make the corresponding change in the distance object used for adonis:

#Betadisper with bray curtis distances
disper_d <- betadisper(bray, df$location) #Change location to mt variable of interest
permutest(disper_d) # NOT SIGNIFICANT 
#> 
#> Permutation test for homogeneity of multivariate dispersions
#> Permutation: free
#> Number of permutations: 999
#> 
#> Response: Distances
#>           Df   Sum Sq   Mean Sq     F N.Perm Pr(>F)
#> Groups     2 0.004001 0.0020003 0.317    999  0.706
#> Residuals 21 0.132519 0.0063104

Next, pairwise.adonis function will be applied to check differences between groups:

#Pairwise adonis
df <-  data.frame(sample_data(norm_phyloseq))
table_vegan <- cbind(t(OTU), df) #OTU is the ASV count table produced earlier in the tutorial
ncolum <- ncol(t(OTU))

## PAIRWISE ADONIS ON BRAY CURTIS ## 
pw_result <- pairwise.adonis(table_vegan[,1:ncolum], table_vegan$location, p.adjust.m = "BH") #Change location to mt variable of interest
pw_result_sig <- pw_result[which(pw_result$p.adjusted<0.05),]
pw_result_sig #View significant comparisons
#>                    pairs Df SumsOfSqs  F.Model        R2 p.value p.adjusted sig
#> 1 location1 vs location2  1 0.3254020 2.020915 0.1261423   0.011     0.0110   .
#> 2 location1 vs location3  1 0.8502918 5.871144 0.2954608   0.001     0.0015   *
#> 3 location2 vs location3  1 0.8446613 5.871946 0.2954892   0.001     0.0015   *
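As with the other results, the pairwise table can be written to your local machine:

#Save pairwise PERMANOVA results
write.table(pw_result, file="pairwise_adonis_bray.txt", sep="\t")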
