Project name: UMGC_Project_563
Run name: UMGC_Project_563
Read type: Single-end 151bp
Samples: 7
Total reads: 13,934,929
Mean reads per sample: 1,990,704
Report generated: Thu Jan 26 18:48:28 CST 2023

More details
Fastq folder: /scratch.global/jgarbe-pipelines/stacks2-UMGC_Project_563/fastq-gunzip
Commandline: /panfs/jay/groups/21/umgc/public/bin/gopher-pipelines/2.4/bin/stacks2-pipeline.pl --fastqfolder /spaces/umgc/umgc_download/umgc-staff/190503_NB551164_0140_AHK25NAFXY/UMGC_Project_563/ --enzyme sbfi --enzyme2 taqi --croplength 80
Gopher-pipelines version: 2.4
Program versions
/panfs/roc/soft/modulefiles.legacy/modulefiles.common/java/jdk1.7.0_45
/panfs/roc/soft/modulefiles.legacy/modulefiles.common/python-epd/canopy-1.3.0
R/4.0.4
bwa/0.7.17.CentOS7
bzip2/1.0.6-gnu7.2.0_PIC
curl/7.59.0_gcc7.2.0
cutadapt/1.8.1
ensembl/1.0
fastqc/0.11.7
gcc/7.2.0
gcc/8.1.0
genomes-other/1.0
genomes/1.0
gmp/6.1.2_gcc7.2.0
gmp/6.1.2_gcc8.1.0
gpipes/2.4
gtools/2.4
isl/0.18_gcc7.2.0
isl/0.19_gcc8.1.0
libgit/1.1.0
liblzma/5.2.2
libtiff/4.0.8
mpc/1.0.3_gcc7.2.0
mpc/1.1.0_gcc8.1.0
mpfr/3.1.6_gcc7.2.0
mpfr/4.0.1_gcc8.1.0
parallel/20210822
pcre2/10.34
perlmodules/1.0
plink/1.90b6.10
plotly/20221014
pysam/0.15.3
python3/3.8.3_anaconda2020.07_mamba
samtools/1.16.1
stacks/2.62
trimmomatic/0.33
umgc/1.0
usearch/10.0
vcftools/0.1.15
xz-utils/5.2.3_gcc7.2.0
zlib/1.2.11_gcc7.2.0
Metric table
sample general.projectname general.readlength general.runname subsample.rawreads subsample.subsampledreads fastqc.meanreadqualityR1 fastqc.pct.adapter fastqc.pct.deduplicated fastqc.pct.dimer fastqc.pct.gc fastqc.sequencelength fastqc.totalsequences gbstrim.inputreads gbstrim.no3cutsite gbstrim.no5cutsite gbstrim.pct.A gbstrim.pct.ACAGA gbstrim.pct.AGGTCCAA gbstrim.pct.CAG gbstrim.pct.CTCCGA gbstrim.pct.GAATCGA gbstrim.pct.GCTA gbstrim.pct.TCACGCTCA gbstrim.pct.TG gbstrim.pct.no3cutsite gbstrim.pct.no5cutsite gbstrim.pct.no_padding gbstrim.pct.readswithtrailingadapter gbstrim.pct.removed gbstrim.pct.tooshort gbstrim.readswithtrailingadapter gbstrim.removed gbstrim.tooshort gbs.enzyme gbs.enzyme2
NA12878 UMGC_Project_563 151 UMGC_Project_563 1644435 1644435 32.00 6.75 26.21 0.01 53.00 151 1644435 1644435 66759.00 102791.0 12.62 11.75 8.30 9.06 5.08 0.17 9.07 7.53 11.69 4.06 6.25 14.41 35.56 17.62 7.31 584745.0 289688.0 120138.0 sbfi taqi
NA24143 UMGC_Project_563 151 UMGC_Project_563 1842306 1842306 32.30 4.50 25.96 0.02 53.00 151 1842306 1842306 69496.00 119015.0 12.47 11.76 8.52 8.68 4.96 0.35 9.13 7.93 11.65 3.77 6.46 14.32 25.65 14.97 4.74 472613.0 275824.0 87313.0 sbfi taqi
NA24149 UMGC_Project_563 151 UMGC_Project_563 2307539 2307539 32.60 5.05 23.58 0.03 53.00 151 2307539 2307539 88138.00 139616.0 12.21 11.73 8.60 9.18 5.73 0.28 8.93 8.11 11.62 3.82 6.05 13.74 27.94 15.08 5.21 644757.0 347976.0 120222.0 sbfi taqi
NA24385 UMGC_Project_563 151 UMGC_Project_563 1978176 1978176 32.00 6.08 25.66 0.02 53.00 151 1978176 1978176 82773.00 132466.0 12.14 11.31 8.14 9.23 5.45 0.23 9.34 7.74 11.43 4.18 6.70 14.09 32.54 17.20 6.31 643685.0 340150.0 124911.0 sbfi taqi
NA24631 UMGC_Project_563 151 UMGC_Project_563 2007399 2007399 32.50 4.85 24.80 0.02 53.00 151 2007399 2007399 75627.00 117517.0 12.60 11.95 8.62 8.82 5.32 0.19 9.10 7.96 11.76 3.77 5.85 14.05 27.05 14.70 5.08 543015.0 295054.0 101910.0 sbfi taqi
NA24694 UMGC_Project_563 151 UMGC_Project_563 1734200 1734200 32.10 5.89 26.42 0.02 53.00 151 1734200 1734200 66378.00 112914.0 12.52 11.73 8.23 9.20 5.30 0.21 9.09 7.56 11.46 3.83 6.51 14.36 31.91 16.50 6.16 553322.0 286121.0 106829.0 sbfi taqi
NA24695 UMGC_Project_563 151 UMGC_Project_563 2420874 2420874 32.40 6.37 21.80 0.03 54.00 151 2420874 2420874 92343.00 133913.0 12.33 11.84 8.52 9.35 6.15 0.19 8.86 8.00 11.60 3.81 5.53 13.81 32.12 16.92 7.57 777502.0 409631.0 183375.0 sbfi taqi
Mean undef 151 UMGC_Project_563 1990704 1990704 32.27 5.64 24.92 0.02 53.14 151 1990704 1990704 77359.14 122604.6 12.41 11.72 8.42 9.07 5.43 0.23 9.07 7.83 11.60 3.89 6.19 14.11 30.40 16.14 6.05 602805.6 320634.9 120671.1 undef undef
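
The gbstrim count columns in the table are related to one another: in every sample row, gbstrim.removed equals gbstrim.no3cutsite + gbstrim.no5cutsite + gbstrim.tooshort, and each gbstrim.pct.* column is the matching count divided by gbstrim.inputreads. The following is a minimal Python sketch that checks this for the NA12878 row. The values are copied from the table; the helper name pct and the dictionary layout are illustrative only and are not part of the pipeline.

```python
def pct(count, total):
    """Percentage of `total` represented by `count`, rounded to 2 decimals."""
    return round(100.0 * count / total, 2)

# Values copied from the NA12878 row of the metric table.
row = {
    "inputreads": 1644435,
    "no3cutsite": 66759,
    "no5cutsite": 102791,
    "tooshort": 120138,
    "removed": 289688,
    "readswithtrailingadapter": 584745,
}

# Reads removed by trimming are those lacking a 3' or 5' cut site, or too short.
removed_sum = row["no3cutsite"] + row["no5cutsite"] + row["tooshort"]
pct_removed = pct(row["removed"], row["inputreads"])
pct_trailing = pct(row["readswithtrailingadapter"], row["inputreads"])

print(removed_sum == row["removed"], pct_removed, pct_trailing)
```

The same check can be repeated for the other rows by swapping in their counts.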

Reads per Sample

The number of reads per sample at the start of the analysis is shown.

Quality Summary

This plot shows the mean base quality score for each position in a fastq file. The higher the score, the better the base call. Green indicates very good quality, yellow indicates reasonable quality, and red indicates poor quality. The quality of calls degrades as the sequencing run progresses, so it is common to see base calls turning yellow towards the end of a read. It is also common to see base calls turning red in short-insert libraries (16S/18S, small RNA, amplicon). This metric is calculated by FastQC.
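
As an illustration of the metric, the sketch below computes a mean quality score per read position from FASTQ quality strings. It assumes Phred+33 encoding and is not FastQC's own implementation (FastQC may, for example, group positions into bins); the function name is hypothetical.

```python
def mean_quality_per_position(quality_strings, offset=33):
    """Mean Phred score at each read position across FASTQ quality strings."""
    totals, counts = [], []
    for qual in quality_strings:
        for i, ch in enumerate(qual):
            if i == len(totals):
                # First time this position is seen: extend the accumulators.
                totals.append(0)
                counts.append(0)
            totals[i] += ord(ch) - offset  # decode ASCII character to Phred score
            counts[i] += 1
    return [t / c for t, c in zip(totals, counts)]

# Two toy quality strings of different lengths ('I' = 40, '5' = 20, '#' = 2).
quals = ["IIII5", "II5#"]
means = mean_quality_per_position(quals)
print(means)
```

Positions covered by fewer reads are averaged over only the reads that reach them.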

More QC plots
Adapter Content

This plot shows, for each sample, the cumulative proportion of reads in which sequencing adapter sequences have been seen at each position. Once an adapter sequence has been seen in a read, it is counted as present through to the end of the read, so the percentages only increase along the read. It is common to see significant adapter sequence content at the ends of reads in short-insert libraries (16S/18S, small RNA, amplicon). This metric is calculated by FastQC.
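
The cumulative counting described here can be sketched in a few lines of Python. This is an illustration under simplifying assumptions (exact string matching, a single adapter, reads of equal length), not FastQC's code. The 12-base adapter prefix used in the example is taken from the start of the Illumina adapter sequence listed under Fastq files; the toy reads are made up.

```python
def cumulative_adapter_content(reads, adapter, read_length):
    """Percent of reads in which `adapter` has been seen at or before each position."""
    first_seen = [0] * read_length
    for read in reads:
        pos = read.find(adapter)  # -1 when the adapter is absent
        if pos != -1 and pos < read_length:
            first_seen[pos] += 1
    percents, running = [], 0
    for n in first_seen:
        running += n  # once seen, a read stays counted to the end of the read
        percents.append(100.0 * running / len(reads))
    return percents

adapter_prefix = "CTGTCTCTTATA"
reads = [
    "ACGT" + adapter_prefix,   # adapter starts at position 4
    "ACGTACGTACGTACGT",        # no adapter
    adapter_prefix + "ACGT",   # adapter starts at position 0
    "ACGTACGTACGTACGA",        # no adapter
]
percents = cumulative_adapter_content(reads, adapter_prefix, 16)
print(percents[0], percents[4], percents[-1])
```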

Library Diversity

This plot shows the percentage of reads that remain after deduplication (removing duplicated sequences), which is a measure of library diversity. Sequence data from genomic DNA libraries typically have high library diversity, and data from targeted sequencing libraries typically have low diversity. This metric is calculated by FastQC.
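
A simplified way to think about this metric is the number of distinct read sequences divided by the total number of reads. The sketch below computes that exact ratio for a small list of sequences. FastQC estimates the value rather than counting every read exactly, so its numbers can differ from this calculation; the function name is hypothetical.

```python
from collections import Counter

def percent_remaining_after_dedup(sequences):
    """Percent of reads left when each distinct sequence is kept only once."""
    counts = Counter(sequences)
    return 100.0 * len(counts) / len(sequences)

# Four reads, one of which is an exact duplicate of another.
toy_reads = ["AAA", "AAA", "CCC", "GGG"]
diversity = percent_remaining_after_dedup(toy_reads)
print(diversity)
```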

PCA plot

This is a principal component analysis (PCA) plot. The first three principal components are shown, and the percent of total variation explained by each component is shown in the axis titles. Samples with similar characteristics appear close to each other, and samples with dissimilar characteristics are farther apart. Ideally, samples will cluster by experimental condition and not by batch or other technical effects.

Mean depth per Sample

The mean read depth across all variants per sample is shown.
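
A per-sample mean depth of this kind can be approximated directly from a VCF file by averaging the DP value of each genotype call. The sketch below assumes a tab-delimited VCF whose FORMAT column includes DP, and it skips calls without a numeric DP; how the pipeline itself treats missing calls is not stated in this report, so that choice is an assumption. The toy VCF lines are made up.

```python
def mean_depth_per_sample(vcf_lines):
    """Mean of the per-genotype DP field for each sample in a VCF."""
    samples, sums, counts = [], [], []
    for line in vcf_lines:
        if line.startswith("##"):
            continue  # meta-information lines
        fields = line.rstrip("\n").split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample names follow the 9 fixed columns
            sums = [0] * len(samples)
            counts = [0] * len(samples)
            continue
        keys = fields[8].split(":")
        if "DP" not in keys:
            continue
        dp_index = keys.index("DP")
        for i, call in enumerate(fields[9:]):
            values = call.split(":")
            # Assumption: calls with a missing or absent DP are skipped.
            if dp_index < len(values) and values[dp_index].isdigit():
                sums[i] += int(values[dp_index])
                counts[i] += 1
    return {s: (sums[i] / counts[i] if counts[i] else None)
            for i, s in enumerate(samples)}

vcf = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tNA12878\tNA24143",
    "1\t10\t.\tA\tG\t.\tPASS\t.\tGT:DP\t0/1:12\t./.:.",
    "1\t20\t.\tC\tT\t.\tPASS\t.\tGT:DP\t1/1:8\t0/0:6",
]
depths = mean_depth_per_sample(vcf)
print(depths)
```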

Missing genotypes per Sample

The fraction of missing genotype calls per sample is shown.

Missing genotypes per site

The fraction of missing genotype calls per site is shown.
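
Both missingness summaries come from the same genotype matrix, read by column (per sample) or by row (per site). The sketch below is a minimal illustration on a made-up matrix of GT strings with sites as rows and samples as columns; treating any genotype that contains "." as missing is an assumption of this example.

```python
def is_missing(gt):
    """True for "./.", "." and partially missing calls such as "0/."."""
    return "." in gt

def missing_fractions(genotypes):
    """Return (per-sample, per-site) missing fractions for a sites x samples matrix."""
    n_sites = len(genotypes)
    n_samples = len(genotypes[0])
    per_site = [sum(is_missing(g) for g in row) / n_samples for row in genotypes]
    per_sample = [
        sum(is_missing(genotypes[s][i]) for s in range(n_sites)) / n_sites
        for i in range(n_samples)
    ]
    return per_sample, per_site

# Four sites (rows) by three samples (columns).
toy_genotypes = [
    ["0/1", "./.", "0/0"],
    ["1/1", "./.", "./."],
    ["0/0", "0/1", "0/0"],
    ["0/1", "./.", "1/1"],
]
per_sample, per_site = missing_fractions(toy_genotypes)
print(per_sample, per_site)
```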

Data

The output folder generated by this analysis pipeline contains the following folders and files:
fastqc/ FastQC html files
variants.vcf.gz VCF variant file (compressed) containing 7 samples and 4620 markers on 4005 loci
variants.filt.vcf.gz Filtered VCF variant file (compressed) containing 7 samples and 4306 markers on 3757 loci
- Samples with more than 50% missing genotypes are removed, as are variants with genotype calls in fewer than 80% of samples and variants with a minor allele frequency (MAF) below 1%
- 0 samples removed
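
The filtering thresholds listed above can be expressed as a short Python sketch. It is an illustration of the stated rules only, not the pipeline's code. It assumes samples are filtered before variants, treats every non-reference allele as the alternate allele when computing MAF, and uses made-up sample names and genotypes.

```python
def allele_counts(gts):
    """Count reference (0) and non-reference alleles among called genotypes."""
    ref = alt = 0
    for gt in gts:
        for allele in gt.replace("|", "/").split("/"):
            if allele == "0":
                ref += 1
            else:
                alt += 1
    return ref, alt

def filter_matrix(samples, genotypes, max_sample_missing=0.5,
                  min_call_rate=0.8, min_maf=0.01):
    """Apply sample-missingness, call-rate and MAF filters to a sites x samples matrix."""
    n_sites = len(genotypes)
    # Step 1 (assumed order): drop samples with more than 50% missing genotypes.
    keep = [i for i in range(len(samples))
            if sum("." in row[i] for row in genotypes) / n_sites <= max_sample_missing]
    kept_samples = [samples[i] for i in keep]
    kept_sites = []
    if not keep:
        return kept_samples, kept_sites
    for row in genotypes:
        gts = [row[i] for i in keep]
        called = [g for g in gts if "." not in g]
        # Step 2: drop variants called in fewer than 80% of remaining samples.
        if len(called) / len(gts) < min_call_rate:
            continue
        # Step 3: drop variants with minor allele frequency below 1%.
        ref, alt = allele_counts(called)
        if min(ref, alt) / (ref + alt) < min_maf:
            continue
        kept_sites.append(gts)
    return kept_samples, kept_sites

names = ["s1", "s2", "s3", "s4", "s5", "s6"]
matrix = [
    ["0/1", "0/0", "0/0", "1/1", "0/1", "./."],  # kept
    ["0/0", "0/0", "0/0", "0/0", "0/0", "./."],  # monomorphic: fails MAF
    ["0/1", "./.", "./.", "0/0", "0/1", "./."],  # low call rate
    ["0/1", "0/0", "./.", "0/0", "1/1", "0/1"],  # kept (4 of 5 called)
]
kept_samples, kept_sites = filter_matrix(names, matrix)
print(kept_samples, len(kept_sites))
```

In the example, s6 is removed because 3 of its 4 genotypes are missing, while s3 is kept because exactly 50% missing does not exceed the threshold.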

Fastq files

One of the following adapter sequences is expected to be present at the beginning of each read in the raw fastq files:
[no adapter sequence present]
A
TG
CAG
GCTA
ACAGA
CTCCGA
GAATCGA
AGGTCCAA
TCACGCTCA
Some reads may read through into the Illumina adapter sequence, which begins with CTGTCTCTTATACACATCTCCGAG.
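
The sketch below shows one way to recognize which of the leading sequences listed above is present on a read and to trim readthrough into the Illumina adapter. It is not the GBStrim.pl implementation. The sequence lists are copied from this section; the cut-site remnant "TGCAGG" used to anchor the match is an assumption (the bases expected at the start of an SbfI-cut fragment) and is not stated in this report. The function names and the toy read are hypothetical.

```python
# Leading sequences listed in this report ("" stands for no leading sequence).
PADDINGS = ["", "A", "TG", "CAG", "GCTA", "ACAGA", "CTCCGA",
            "GAATCGA", "AGGTCCAA", "TCACGCTCA"]
ADAPTER_START = "CTGTCTCTTATACACATCTCCGAG"

def strip_padding(read, cut_site_remnant):
    """Return (padding, trimmed read); (None, read) if no padding + remnant matches."""
    for pad in PADDINGS:
        if read.startswith(pad + cut_site_remnant):
            return pad, read[len(pad):]
    return None, read

def trim_trailing_adapter(read, adapter=ADAPTER_START):
    """Cut the read at the first exact occurrence of the adapter start, if any."""
    pos = read.find(adapter)
    return read if pos == -1 else read[:pos]

# Assumed remnant for SbfI; not taken from the report.
REMNANT = "TGCAGG"
raw = "CAG" + REMNANT + "ACGTACGT" + ADAPTER_START
padding, body = strip_padding(raw, REMNANT)
clean = trim_trailing_adapter(body)
print(padding, clean)
```

Anchoring on the remnant avoids mistaking an ordinary leading base (for example a read that happens to start with A) for a padding sequence.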

Methods

Quality of data in fastq files was assessed with FastQC. GBStrim.pl was used to trim 5' padding sequences and readthrough into 3' padding sequences (https://bitbucket.org/jgarbe/gbstrim). Stacks2 was used to process the dataset. Low-quality reads were removed with the process_radtags command using the options "-e bamHI --disable_rad_check -r -c -q -y gzfastq". The ustacks command was used to build de novo loci in each sample using the options "-p 2 --force-diff-len". All samples were used to build a catalog of loci with the cstacks command using the options "-p 8". All samples were matched against the cstacks catalog using the sstacks command. The tsv2bam command was used to store data by locus instead of by sample. The gstacks command was used to call variant sites and generate genotypes for each individual. The populations command was used to calculate various statistics and generate output files using the options "-r 0.65 --vcf --genepop --structure --fasta-loci --fstats --hwe -t 8".
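
The order of the Stacks steps described above can be summarized as a list of command lines. The Python sketch below only assembles the commands; it does not run them. The option strings for ustacks, cstacks and populations are copied from this section. The input and output flags (-f, -o, -i, -P, -M), the file names, the directory name and the population map are assumptions added for illustration, based on general Stacks command-line usage rather than on this report. The process_radtags step is left out because its input arguments are not given here.

```python
def stacks_commands(samples, stacks_dir, popmap):
    """Build argument lists for the de novo Stacks steps in the order described."""
    cmds = []
    # ustacks: build loci within each sample (one call per sample).
    for i, sample in enumerate(samples, start=1):
        cmds.append(["ustacks", "-f", f"{sample}.fq.gz", "-o", stacks_dir,
                     "-i", str(i), "-p", "2", "--force-diff-len"])
    # cstacks: build the catalog from all samples.
    cmds.append(["cstacks", "-P", stacks_dir, "-M", popmap, "-p", "8"])
    # sstacks, tsv2bam, gstacks: match to catalog, reorder by locus, call genotypes.
    for tool in ("sstacks", "tsv2bam", "gstacks"):
        cmds.append([tool, "-P", stacks_dir, "-M", popmap])
    # populations: statistics and output files.
    cmds.append(["populations", "-P", stacks_dir, "-M", popmap,
                 "-r", "0.65", "--vcf", "--genepop", "--structure",
                 "--fasta-loci", "--fstats", "--hwe", "-t", "8"])
    return cmds

# Hypothetical directory and population map names.
commands = stacks_commands(["NA12878", "NA24143"], "stacks_out", "popmap.txt")
for cmd in commands:
    print(" ".join(cmd))
```

Each list could be passed to subprocess.run once the paths are replaced with real ones.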