Shotgun Metagenomics Analysis Report

Project Name: Test
Report generated: June 28, 2024
Number of Samples: 21

Overview

This report summarizes the results of analyzing metagenomic sequence data. The Reads Per Sample plot shows the total number of reads in each sample. The Read assignment to taxonomic groups (classification) plot summarizes the classification of reads into different taxonomic groups. The taxonomic group unclassified contains reads that could not be uniquely assigned to a taxonomic group; reads in this group are excluded from subsequent analysis. The Top N most abundant taxa plot shows the N most abundant taxonomic groups for each sample, where N is an integer specified during analysis (by default, 10). The Alpha Diversity (Shannon Index) plot shows the diversity of each sample, measured with the Shannon index. The remaining two plots, a dendrogram and a PCA plot, are generated from the beta diversity matrix and show the relationships between samples.
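As a rough illustration of how the Top N selection described above could work, the following Python sketch ranks the taxa of one sample by read count after dropping the unclassified group. The function name top_n_taxa, the label string "unclassified", and the example counts are assumptions made for illustration; they are not taken from the scripts used to generate this report.

```python
def top_n_taxa(taxon_counts, n=10, unclassified_label="unclassified"):
    """Return the n most abundant taxa as (taxon, count) pairs, largest first.

    taxon_counts maps a taxon name to its read count for one sample.
    The unclassified group is removed before ranking (label is an assumption).
    """
    classified = {
        taxon: count
        for taxon, count in taxon_counts.items()
        if taxon != unclassified_label
    }
    ranked = sorted(classified.items(), key=lambda item: item[1], reverse=True)
    return ranked[:n]


# Hypothetical read counts for a single sample.
sample = {
    "unclassified": 900,
    "Species A": 500,
    "Species B": 120,
    "Species C": 45,
}
print(top_n_taxa(sample, n=2))
```

The default n=10 mirrors the default stated in the overview; pass a different n to match the value chosen during analysis.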

Performing Additional Analysis

To make it easier for experienced users to perform their own analysis, we provide abundance estimates (abundance_read_count.biom) in BIOM format, as determined by Bracken ³. The file contains the number of reads assigned to each species for each sample. This count table can be imported into popular metagenomics analysis tools such as QIIME 2 and mothur for additional analysis. Below are commands to load the data into QIIME 2 and mothur. Note that these commands were current as of June 2024 and might have changed since; check each tool's documentation for the most current commands.

QIIME 2

 
    # Load count data into QIIME 2
    > qiime tools import --input-path abundance_read_count.biom --type 'FeatureTable[Frequency]' --input-format BIOMV100Format --output-path qiime

mothur

 
    # Load count data into mothur
    > make.shared(biom=abundance_read_count.biom)

We also provide a tab-delimited text file (raw_read_count.txt) with raw read counts as determined by Kraken 2 ². Users who would like to estimate abundance with a different approach can use this raw read count file as input.

Data Analysis Protocol

Analysis followed the protocol in Lu et al. ¹, with UMGC in-house scripts used to generate this report and provide additional quality metrics. The main steps of the analysis are summarized below.

Step 1: Read Count

Reads in each sample were counted using UMGC in-house scripts.

Step 2: Read Classification

Kraken 2 ² was used to assign taxonomic labels to metagenomic DNA sequences (reads) using the nt database as reference.

Step 3: Abundance Estimation

Bracken ³ was used to estimate abundance. By default, estimation is done at the species level.

Step 4: Generate Data & Report

Python analysis scripts packaged with Bracken ³ were used to compute alpha and beta diversity metrics. The other accompanying data were generated using QIIME 2.
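For readers who want to check an alpha diversity value by hand, here is a minimal Python sketch of the Shannon index, H = -sum(p_i * ln(p_i)), and of the Berger-Parker dominance metric (the proportion of reads in the most abundant taxon). It assumes the natural logarithm, and the example counts are hypothetical. It is not the code packaged with Bracken, and values may differ if a different log base or filtering is applied.

```python
import math


def shannon_index(counts):
    """Shannon index H = -sum(p_i * ln(p_i)) over taxa with non-zero counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    h = 0.0
    for count in counts:
        if count > 0:
            p = count / total  # relative abundance of this taxon
            h -= p * math.log(p)
    return h


def berger_parker_dominance(counts):
    """Proportion of all reads that belong to the single most abundant taxon."""
    total = sum(counts)
    if total == 0:
        raise ValueError("sample has no reads")
    return max(counts) / total


# Hypothetical species-level read counts for one sample.
sample_counts = [50, 30, 20]
print(round(shannon_index(sample_counts), 4))
print(berger_parker_dominance(sample_counts))
```

A sample dominated by one taxon gives a Shannon index near 0 and a Berger-Parker value near 1; an even community gives the opposite pattern.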

Data

The data folder accompanying this report contains the following files:

Text Files

final_stats.txt: Tab-delimited text file containing the data used to generate the plots in this report, including additional alpha diversity metrics such as the Berger-Parker dominance metric.

raw_read_count.txt: Tab delimited text file of reads assigned to each taxon by Kraken 2 ².

abundance_read_count.biom: BIOM table containing abundance estimates (read counts) as determined by Bracken ³. Note that the total number of reads in abundance_read_count.biom will differ from the total number of reads in your FASTQ files. The abundance estimates (abundance_read_count.biom) do not contain unclassified reads.

abundance_read_count.txt: Same data as in abundance_read_count.biom but in a format that is easier to parse: a tab-delimited text file with taxa in rows and samples in columns.

bracken_beta_diversity_bray_curtis_dissimilarity.txt: Text file containing Bray-Curtis dissimilarity matrix computed using Bracken ³.

qiime_beta_diversity_bray_curtis_dissimilarity.txt: Text file containing Bray-Curtis dissimilarity matrix computed using QIIME 2.
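The taxa-by-sample table and the Bray-Curtis matrices listed above can be related with a short Python sketch. It reads a tab-delimited table (taxa in rows, samples in columns) and computes the Bray-Curtis dissimilarity, sum(|a_i - b_i|) / sum(a_i + b_i), for each pair of samples. The header layout (first column holds the taxon name, remaining columns are sample names), the function names, and the inline example data are assumptions; check the actual header of abundance_read_count.txt before relying on it. This is not the Bracken or QIIME 2 implementation.

```python
import csv
import io
from itertools import combinations


def read_abundance_table(handle):
    """Parse a taxa-by-sample tab-delimited table into {sample: {taxon: count}}."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    samples = header[1:]  # assumed layout: first column is the taxon name
    table = {sample: {} for sample in samples}
    for row in reader:
        if not row:
            continue  # skip blank lines
        taxon = row[0]
        for sample, value in zip(samples, row[1:]):
            table[sample][taxon] = int(float(value))
    return table


def bray_curtis(counts_a, counts_b):
    """Bray-Curtis dissimilarity between two {taxon: count} profiles."""
    taxa = set(counts_a) | set(counts_b)
    difference = sum(abs(counts_a.get(t, 0) - counts_b.get(t, 0)) for t in taxa)
    total = sum(counts_a.values()) + sum(counts_b.values())
    if total == 0:
        raise ValueError("both samples are empty")
    return difference / total


# Hypothetical stand-in for the contents of a taxa-by-sample file.
example_file = io.StringIO(
    "taxon\tS1\tS2\tS3\n"
    "Species A\t6\t2\t6\n"
    "Species B\t4\t8\t4\n"
)
table = read_abundance_table(example_file)
for a, b in combinations(sorted(table), 2):
    print(a, b, round(bray_curtis(table[a], table[b]), 3))
```

To try it on real data, replace the io.StringIO object with an opened file, for example open("abundance_read_count.txt", newline=""). Values range from 0 (identical profiles) to 1 (no shared taxa).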

q2-artifact/ QIIME 2 artifact files

QIIME 2 artifact files (.qza) contain the data and metadata from an analysis done using QIIME 2. These files can be analyzed further in QIIME 2, or you can view their contents using the online browser at https://view.qiime2.org.

abundance_read_count.qza: Abundance estimates (read counts). Same data as in abundance_read_count.biom.

qiime_braycurtis.qza: Bray-Curtis dissimilarity matrix.

Alpha Rarefaction Curves

If there is a large difference in sequencing depth between samples, a deeply sequenced sample might appear to have greater diversity than a sample with low sequencing depth. Rarefaction is the process of subsampling reads to the same depth to control for differences in sequencing depth. Alpha rarefaction curves are typically used to determine the depth at which to rarefy the data. Although they are common with 16S rRNA data, alpha rarefaction curves are unlikely to provide useful information with shotgun data, for the reasons discussed below.
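To make the idea of rarefaction concrete, the following Python sketch subsamples the reads of one sample to a fixed depth without replacement and then records how many taxa are observed at several depths, which is one simple way to build an alpha rarefaction curve. The use of observed taxa as the diversity measure, the function names, the fixed random seed, and the example counts are all assumptions for illustration; this is not the procedure used to produce the figure in this report.

```python
import bisect
import itertools
import random


def rarefy(taxon_counts, depth, seed=0):
    """Subsample `depth` reads without replacement; return {taxon: count}."""
    taxa = [t for t, c in taxon_counts.items() if c > 0]
    # cumulative[i] is the number of reads in taxa[0..i]; read indices below
    # that bound (and at or above the previous one) belong to taxa[i].
    cumulative = list(itertools.accumulate(taxon_counts[t] for t in taxa))
    total = cumulative[-1] if cumulative else 0
    if depth > total:
        raise ValueError("depth exceeds the number of reads in the sample")
    rng = random.Random(seed)
    rarefied = {}
    for read_index in rng.sample(range(total), depth):
        taxon = taxa[bisect.bisect_right(cumulative, read_index)]
        rarefied[taxon] = rarefied.get(taxon, 0) + 1
    return rarefied


def rarefaction_curve(taxon_counts, depths, seed=0):
    """Return (depth, number of observed taxa) pairs for each requested depth."""
    return [(d, len(rarefy(taxon_counts, d, seed))) for d in depths]


# Hypothetical read counts for one sample.
sample = {"Species A": 5, "Species B": 3, "Species C": 2}
print(rarefy(sample, 4))
print(rarefaction_curve(sample, [2, 5, 10]))
```

A curve that flattens as depth increases indicates that additional reads no longer reveal new taxa.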

A study by Weiss et al. ⁴ on normalization and microbial differential abundance strategies showed that rarefying is beneficial when analyzing samples with large differences in sequencing depth (~10X). Furthermore, a study by Hillmann et al. ⁵ showed that a depth of approximately 0.5 million reads was sufficient for most applications, with a depth of 2 million reads recommended for detecting rare species below a relative abundance of approximately 0.0005. Very few shotgun studies have differences greater than 10X in sequencing depth and fewer than 2 million reads. Considering the above, alpha rarefaction curves are not as useful for most shotgun data as they are for 16S rRNA data, so we did not plot them.
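The two thresholds quoted above can be checked against a project's own read counts with a few lines of Python. This sketch reports the fold difference between the deepest and shallowest samples and flags the dataset only when the difference exceeds 10X and the shallowest sample has fewer than 2 million reads. Combining the two thresholds with "and" is our reading of the paragraph, and the function name and example depths are hypothetical.

```python
def depth_summary(read_depths, max_fold=10.0, min_reads=2_000_000):
    """Summarize sequencing depth spread for {sample: total read count}."""
    shallowest = min(read_depths.values())
    deepest = max(read_depths.values())
    if shallowest <= 0:
        raise ValueError("every sample needs at least one read")
    fold = deepest / shallowest
    return {
        "fold_difference": fold,
        "shallowest": shallowest,
        # Assumed rule: both conditions from the text must hold.
        "consider_rarefying": fold > max_fold and shallowest < min_reads,
    }


# Hypothetical per-sample read totals.
depths = {"S1": 12_000_000, "S2": 9_500_000, "S3": 15_000_000}
print(depth_summary(depths))
```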

To confirm that alpha rarefaction curves are unlikely to provide additional useful information for shotgun data, we analyzed WGS data of the same samples analyzed in three different labs in the study by Forry et al. ⁶ and plotted alpha rarefaction curves. The results are shown in the figure below.

[Figure: alpha rarefaction curves (RarefactionCurve)]

As the plot above shows, saturation happens at approximately 20,000 reads, confirming that alpha rarefaction curves are unlikely to provide additional useful information.

References

1. Lu, J., Rincon, N., Wood, D.E. et al. Metagenome analysis using the Kraken software suite. Nat Protoc 17, 2815–2839 (2022). https://doi.org/10.1038/s41596-022-00738-y

2. Wood, D.E., Salzberg, S.L. Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome Biol 15, R46 (2014). https://doi.org/10.1186/gb-2014-15-3-r46

3. Lu, J., Breitwieser, F.P., Thielen, P., Salzberg, S.L. Bracken: estimating species abundance in metagenomics data. PeerJ Computer Science 3, e104 (2017). https://doi.org/10.7717/peerj-cs.104

4. Weiss, S., Xu, Z.Z., Peddada, S. et al. Normalization and microbial differential abundance strategies depend upon data characteristics. Microbiome 5, 27 (2017). https://doi.org/10.1186/s40168-017-0237-y

5. Hillmann, B., Al-Ghalith, G.A., Shields-Cutler, R.R., Zhu, Q., Gohl, D.M., Beckman, K.B., Knight, R., Knights, D. Evaluating the Information Content of Shallow Shotgun Metagenomics. mSystems 3, 10.1128/msystems.00069-18 (2018). https://doi.org/10.1128/msystems.00069-18

6. Forry, S.P., Servetas, S.L., Kralj, J.G. et al. Variability and bias in microbiome metagenomic sequencing: an interlaboratory study comparing experimental protocols. Sci Rep 14, 9785 (2024). https://doi.org/10.1038/s41598-024-57981-4