GBS Reference Report

Project name: umgc_563
Run name: umgc_563
Read type: Single-end 151bp
Samples: 7
Total reads: 13,934,928
Mean reads per sample: 1,990,704
Report generated: Wed Jan 25 11:40:00 CST 2023

Enzyme: sbfi-taqi

More details

Fastq folder: /panfs/jay/groups/9/umgc-staff/shared/project-analysis/gbs/umgc_563/umgc_563
Commandline: /panfs/jay/groups/21/umgc/public/bin/gopher-pipelines/2.4/bin/gbsref-pipeline.pl --freebayesthreads 12 --fastqfolder /home/umgc-staff/shared/project-analysis/gbs//umgc_563/umgc_563 --scratchfolder /scratch.global/jkuriger-pipelines/umgc_563_SbfI-TaqI_0 --bwaindex /home/umii/public/ensembl/Homo_sapiens/current/bwa/genome --referencefasta /home/umii/public/ensembl/Homo_sapiens/current/seq/genome.fa --samplespernode 6 --threadspersample 12 --minlength 50 --maxlength 100 --enzyme SbfI --enzyme2 TaqI --subsample 0
Gopher-pipelines version: 2.4

Program versions

/panfs/roc/soft/modulefiles.legacy/modulefiles.common/java/jdk1.7.0_45
R/4.0.4
advisor/2018/release
bedtools/v2.29.2
bowtie2/2.3.4.1.CentOS7
bwa/0.7.17.CentOS7
bzip2/1.0.6-gnu7.2.0_PIC
curl/7.59.0_gcc7.2.0
cutadapt/1.18
ensembl/1.0
fastqc/0.11.7
freebayes/1.3.6
gcc/7.2.0
gcc/8.1.0
genomes-other/1.0
genomes/1.0
gmp/6.1.2_gcc7.2.0
gmp/6.1.2_gcc8.1.0
gopher-pipelines/2.4
gtools/2.4
impi/2018/release_singlethread
inspector/2018/release
intel/2018/release
intel/cluster/2018
isl/0.18_gcc7.2.0
isl/0.19_gcc8.1.0
itac/2018/release
java/openjdk-8_202
libgit/1.1.0
liblzma/5.2.2
libtiff/4.0.8
mpc/1.0.3_gcc7.2.0
mpc/1.1.0_gcc8.1.0
mpfr/3.1.6_gcc7.2.0
mpfr/4.0.1_gcc8.1.0
parallel/20210822
pcre2/10.34
perlmodules/1.0
plotly/20221014
pysam/0.15.3
python3/3.8.3_anaconda2020.07_mamba
samtools/1.16.1
trimmomatic/0.33
umgc/1.0
variantbam/20190418
vcftools/0.1.15
vtune/2018/release
zlib/1.2.11_gcc7.2.0
zlib/1.2.8

Metric table

	general.projectname	general.readlength	general.runname	subsample.rawreads	subsample.subsampledreads	fastqc.meanreadqualityR1	fastqc.pct.adapter	fastqc.pct.deduplicated	fastqc.pct.dimer	fastqc.pct.gc	fastqc.sequencelength	fastqc.totalsequences	gbstrim.inputreads	gbstrim.no3cutsite	gbstrim.no5cutsite	gbstrim.pct.A	gbstrim.pct.ACAGA	gbstrim.pct.AGGTCCAA	gbstrim.pct.CAG	gbstrim.pct.CTCCGA	gbstrim.pct.GAATCGA	gbstrim.pct.GCTA	gbstrim.pct.TCACGCTCA	gbstrim.pct.TG	gbstrim.pct.no3cutsite	gbstrim.pct.no5cutsite	gbstrim.pct.no_padding	gbstrim.pct.readswithtrailingadapter	gbstrim.pct.removed	gbstrim.readswithtrailingadapter	gbstrim.removed	gbstrim.tooshort	picardalignmentplot.pct_alignedreads
NA12878	umgc_563	151	umgc_563	1644435	1644435	32.00	6.75	26.21	0.01	53.00	151	1644435	1644435	66759.00	102791.0	12.62	11.75	8.30	9.06	5.08	0.17	9.07	7.53	11.69	4.06	6.25	14.41	35.56	10.31	584745.0	169550.0	0.00	96.00
NA24143	umgc_563	151	umgc_563	1842306	1842306	32.30	4.50	25.96	0.02	53.00	151	1842306	1842306	69496.00	119015.0	12.47	11.76	8.52	8.68	4.96	0.35	9.13	7.93	11.65	3.77	6.46	14.32	25.65	10.23	472613.0	188511.0	0.00	97.01
NA24149	umgc_563	151	umgc_563	2307539	2307539	32.60	5.03	25.05	0.03	53.00	151	2000000	2307539	88138.00	139616.0	12.21	11.73	8.60	9.18	5.73	0.28	8.93	8.11	11.62	3.82	6.05	13.74	27.94	9.87	644757.0	227754.0	0.00	98.72
NA24385	umgc_563	151	umgc_563	1978176	1978176	32.00	6.08	25.66	0.02	53.00	151	1978176	1978176	82773.00	132466.0	12.14	11.31	8.14	9.23	5.45	0.23	9.34	7.74	11.43	4.18	6.70	14.09	32.54	10.88	643685.0	215239.0	0.00	99.00
NA24631	umgc_563	151	umgc_563	2007399	2007399	32.50	4.85	24.84	0.02	53.00	151	2000000	2007399	75627.00	117517.0	12.60	11.95	8.62	8.82	5.32	0.19	9.10	7.96	11.76	3.77	5.85	14.05	27.05	9.62	543015.0	193144.0	0.00	98.27
NA24694	umgc_563	151	umgc_563	1734200	1734200	32.10	5.89	26.42	0.02	53.00	151	1734200	1734200	66378.00	112914.0	12.52	11.73	8.23	9.20	5.30	0.21	9.09	7.56	11.46	3.83	6.51	14.36	31.91	10.34	553322.0	179292.0	0.00	98.76
NA24695	umgc_563	151	umgc_563	2420874	2420874	32.50	6.34	23.63	0.03	54.00	151	2000000	2420874	92343.00	133913.0	12.33	11.84	8.52	9.35	6.15	0.19	8.86	8.00	11.60	3.81	5.53	13.81	32.12	9.35	777502.0	226259.0	3.00	90.95
Mean	undef	151	umgc_563	1990704	1990704	32.29	5.63	25.40	0.02	53.14	151	1885588	1990704	77359.14	122604.6	12.41	11.72	8.42	9.07	5.43	0.23	9.07	7.83	11.60	3.89	6.19	14.11	30.40	10.09	602805.6	199964.1	0.43	96.96
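The percentage columns in the metric table appear to be derived from the count columns. As a minimal sketch, assuming gbstrim.pct.removed is gbstrim.removed divided by gbstrim.inputreads (the report does not state this explicitly), the values for two samples can be recomputed as follows. The helper name `pct` is hypothetical.

```python
def pct(part, whole):
    """Return part as a percentage of whole, rounded to two decimals."""
    return round(100.0 * part / whole, 2)

# (sample, gbstrim.inputreads, gbstrim.removed, reported gbstrim.pct.removed)
# Values copied from the metric table above.
rows = [
    ("NA12878", 1644435, 169550, 10.31),
    ("NA24143", 1842306, 188511, 10.23),
]

# Assumption: pct.removed = removed / inputreads * 100
recomputed = {name: pct(removed, total) for name, total, removed, _ in rows}

for name, _, _, reported in rows:
    print(name, recomputed[name], reported)
```

The recomputed values agree with the reported ones to two decimals, which is consistent with the assumed relationship.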


Reads per Sample

The number of reads per sample at the start of the analysis is shown.

Quality Summary

This plot shows the mean base quality score for each position in a fastq file. The higher the score, the better the base call. Green indicates very good quality, yellow indicates reasonable quality, and red indicates poor quality. The quality of calls degrades as the sequencing run progresses, so it is common to see base calls turning yellow towards the end of a read. It is common to see base calls turning red in short insert libraries (16S/18S, small RNA, amplicon). This metric is calculated by FastQC.


Adapter Content

This plot shows the cumulative proportion of reads in each sample in which sequencing adapter sequences have been seen at each position. Once an adapter sequence has been seen in a read, it is counted as present through to the end of the read, so the percentages only increase along the read. It is common to see significant adapter sequence content at the ends of reads in short insert libraries (16S/18S, small RNA, amplicon). This metric is calculated by FastQC.
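The cumulative counting described above can be illustrated with a simplified sketch. This is not FastQC's implementation; it assumes an exact substring match against a single adapter prefix, and the function name and toy reads are hypothetical.

```python
def cumulative_adapter_pct(reads, adapter, read_length):
    """Percent of reads in which the adapter has been seen at or before each position."""
    first_seen = [0] * read_length
    for read in reads:
        pos = read.find(adapter)  # first occurrence only; -1 if absent
        if pos != -1 and pos < read_length:
            first_seen[pos] += 1
    pcts = []
    running = 0
    for count in first_seen:
        # Once seen, a read stays counted for all later positions.
        running += count
        pcts.append(100.0 * running / len(reads))
    return pcts

# Toy data (not from this run): two of four reads contain the adapter.
toy_reads = ["AACTGT", "CTGTAA", "AAAAAA", "AAAACT"]
curve = cumulative_adapter_pct(toy_reads, "CTGT", 6)
print(curve)
```

Because each read is counted from its first adapter position onward, the resulting curve never decreases.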

Library Diversity

This plot shows the percentage of reads that remain after deduplication (removing duplicated sequences), which is a measure of library diversity. Sequence data from genomic DNA libraries typically have high library diversity, and data from targeted sequencing libraries typically have low diversity. This metric is calculated by FastQC.

Reads removed by GBStrim

The percent of reads discarded by GBStrim is shown. Less than 10% is good. Less than 20% is OK. Higher than 20% is unusual. Reads are removed if no 5’ cut site is found, if sequencing adapter is removed from the 3’ end but no 3’ cut site is found, or if the trimmed read is too short. Specifying the incorrect enzymes or specifying them in the wrong order are the most common causes of a high percent of discarded reads.
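The thresholds above can be written as a small classifier. This is a minimal sketch; the function name is hypothetical, and treating exactly 10% as "OK" and exactly 20% as "unusual" is an assumption, since the text does not say how boundary values are handled.

```python
def classify_removed(pct_removed):
    """Label a gbstrim.pct.removed value using the thresholds in the text."""
    if pct_removed < 10.0:
        return "good"
    if pct_removed < 20.0:
        return "OK"
    # Assumption: exactly 20% is grouped with "higher than 20%".
    return "unusual"

# Values from the gbstrim.pct.removed column of the metric table.
labels = {
    "NA12878": classify_removed(10.31),
    "NA24149": classify_removed(9.87),
}
print(labels)
```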

Alignment Rate

The percent of reads aligned to the reference is shown. These metrics are calculated by Picard CollectAlignmentSummaryMetrics.

PCA plot

This is a Principal Components (PCA) plot. The first three principal components are shown, and the percent of total variation explained by each component is shown in the axis titles. Samples with similar characteristics appear close to each other, and samples with dissimilar characteristics are farther apart. Ideally samples will cluster by experimental condition and not by batch or other technical effects.

Mean depth per Sample

The mean read depth across all variants per sample is shown.

Missing genotypes per Sample

The fraction of missing genotype calls per sample is shown.

Missing genotypes per site

The fraction of missing genotype calls per site is shown.

Markers per chromosome

Each point in the plot is a chromosome or scaffold in the reference genome assembly, plotted by chromosome length (x axis) and number of markers on the chromosome (y axis). Longer chromosomes should have more markers and shorter chromosomes should have fewer markers. Pseudo-chromosomes composed of unplaced contigs often have unexpectedly high or low numbers of markers.

Data

The output folder generated by this analysis pipeline contains the following folders and files:
fastqc/ FastQC html files
variants.vcf.gz VCF variant file (compressed) containing 7 samples and 12308 markers on 7481 loci
variants.filt.vcf.gz Filtered VCF variant file (compressed) containing 7 samples and 9252 markers on 6094 loci
- Samples with > 50% missing genotypes are removed; variants with genotype calls in less than 95% of samples are removed; variants with MAF < 1% are removed
- 0 samples removed
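The filtering rules listed above can be sketched on a small genotype matrix. This is an illustration under stated assumptions, not the pipeline's code: samples are filtered before variants (the report does not give the order), sites are treated as biallelic so that any non-reference allele counts toward the alternate frequency, and all names and toy genotypes are hypothetical.

```python
def is_missing(gt):
    """A genotype such as './.' (or any call containing '.') is treated as missing."""
    return gt is None or "." in gt

def filter_genotypes(samples, sites, max_sample_missing=0.5,
                     min_call_rate=0.95, min_maf=0.01):
    """sites: one list per variant, holding one genotype string per sample."""
    n_sites = len(sites)
    # Step 1 (assumed first): drop samples with > 50% missing genotypes.
    keep = [i for i in range(len(samples))
            if sum(is_missing(s[i]) for s in sites) / n_sites <= max_sample_missing]
    kept_samples = [samples[i] for i in keep]
    kept_sites = []
    for site in sites:
        gts = [site[i] for i in keep]
        called = [g for g in gts if not is_missing(g)]
        # Step 2: require genotype calls in at least 95% of remaining samples.
        if not called or len(called) / len(gts) < min_call_rate:
            continue
        # Step 3: minor allele frequency, assuming a biallelic site.
        alleles = [a for g in called for a in g.replace("|", "/").split("/")]
        alt_freq = sum(a != "0" for a in alleles) / len(alleles)
        if min(alt_freq, 1.0 - alt_freq) < min_maf:
            continue
        kept_sites.append(gts)
    return kept_samples, kept_sites

# Toy data (not from this run).
toy_samples = ["s1", "s2", "s3"]
toy_sites = [
    ["0/1", "0/0", "./."],
    ["0/0", "0/0", "./."],
    ["1/1", "0/1", "0/1"],
    ["0/1", "./.", "./."],
]
kept_samples, kept_sites = filter_genotypes(toy_samples, toy_sites)
print(kept_samples, kept_sites)
```

With 7 samples, as in this run, a 95% call-rate threshold means a variant is kept only if all 7 samples have a genotype call, since 6 of 7 is about 85.7%.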

Fastq files

One of the following adapter sequences is expected to be present at the beginning of each read in the raw fastq files:
[no adapter sequence present]
A
TG
CAG
GCTA
ACAGA
CTCCGA
GAATCGA
AGGTCCAA
TCACGCTCA
Some reads may read through into the Illumina adapter sequence, which begins with CTGTCTCTTATACACATCTCCGAG

Methods

The GBS dataset was analyzed using the reference /panfs/jay/groups/29/umii/public/ensembl/Homo_sapiens/GRCh38/bwa//genome. Quality of data in fastq files was assessed with FastQC. GBStrim.pl was used to trim 5’ padding sequences and readthrough into 3’ padding sequences (https://bitbucket.org/jgarbe/gbstrim). BWA mem was used to align reads to a reference genome (/panfs/jay/groups/29/umii/public/ensembl/Homo_sapiens/GRCh38/bwa//genome). The bam alignment files were sorted and indexed with Samtools. Regions of bam files with more than 500 reads were downsampled to a depth of 500 reads using VariantBam. Freebayes was used to call variants jointly across all samples using the options '--use-best-n-alleles 6 --min-coverage 14'. The raw VCF file generated by Freebayes was filtered to remove the lowest quality variants using vcffilter with the options '-f "QUAL > 20"'. Alignment rates were summarized with Picard.
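The effect of the QUAL > 20 filter can be illustrated in plain Python. This is a sketch of the filtering criterion only, not of vcffilter itself; it assumes tab-separated VCF records with QUAL in the sixth column, and dropping records whose QUAL is missing (".") is an assumption. The function name and toy records are hypothetical.

```python
def filter_by_qual(vcf_lines, min_qual=20.0):
    """Keep header lines and records whose QUAL is strictly greater than min_qual."""
    kept = []
    for line in vcf_lines:
        if line.startswith("#"):
            kept.append(line)  # header lines pass through unchanged
            continue
        fields = line.rstrip("\n").split("\t")
        qual = fields[5]  # VCF columns: CHROM POS ID REF ALT QUAL ...
        # Assumption: records with missing QUAL are dropped.
        if qual != "." and float(qual) > min_qual:
            kept.append(line)
    return kept

# Toy records (not from this run).
toy_vcf = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t100\t.\tA\tG\t35.2\t.\t.",
    "1\t200\t.\tC\tT\t20\t.\t.",
    "1\t300\t.\tG\tA\t.\t.\t.",
]
filtered = filter_by_qual(toy_vcf)
print(len(filtered))
```

The comparison is strict, so a record with QUAL exactly 20 is removed along with lower-quality records.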