Marine Genomics 61 (2022) 100914
1874-7787/© 2021 Elsevier B.V. All rights reserved.
Benefits of merging paired-end reads before pre-processing environmental
metagenomics data
Midhuna Immaculate Joseph Maran, Dicky John Davis G. *
Faculty of Engineering and Technology, Sri Ramachandra Institute of Higher Education and Research, Chennai, India
ARTICLE INFO
Keywords:
Metagenomics
High throughput sequencing
Environmental DNA
Sequence alignment
Quality processing
Trimmomatic
ABSTRACT
Background: High throughput sequencing of environmental DNA has applications in biodiversity monitoring, taxa
abundance estimation, understanding the dynamics of community ecology, and marine species studies and
conservation. Environmental DNA, especially marine eDNA, degrades quickly. Alongside the good
quality reads, the data can contain a significant number of reads that fall slightly below the default PHRED quality
threshold of 30 on sequencing. For quality control, trimming methods are employed, and these generally precede the
merging of the read pairs. However, in the case of eDNA, a significant percentage of reads within the acceptable
quality score range are also dropped.
Methods: To identify the merge tool best suited to eDNA, two HiSeq paired-end eDNA datasets were
merged, without preprocessing, using five tools: FLASH (Fast Length Adjustment of SHort reads), PANDAseq, COPE,
BBMerge, and VSEARCH. We assessed these tools on the following parameters: processing time,
read quality, and the number of merged reads.
Trimmomatic, a widely used preprocessing tool, was also assessed by preprocessing the datasets at different
parameter settings for its two quality trimming approaches: Sliding Window and Maximum Information. The
preprocessed read pairs were then merged using the ideal merge tool identified earlier.
Results: FLASH is the most efficient merge tool, balancing data conservation, quality of reads, and processing time.
We compared Trimmomatic’s two quality trimming options, at increasing stringency, against FLASH’s direct
merge. Raw reads that were processed with Trimmomatic and then merged showed a significant drop in read count
compared to the direct merge. An average of 29% of reads was dropped when directly merged with FLASH. The Maximum
Information option resulted in 30.7% to 68.05% read loss with the lowest and highest stringency parameters,
respectively. The Sliding Window approach conserves approximately 10% more reads when a PHRED score of 25 is set
as the threshold for a window of size 4. The lowered PHRED cutoff conserves about 50% of the reads that could
potentially be informative. We noted no significant reduction of data while optimizing the window size
at the ideal quality (Q) score.
Conclusions: Losing reads can negatively impact the downstream processing of the environmental data, especially
for sequence alignment studies. The quality trim-first-merge-later approach can significantly decrease the
number of reads conserved. In contrast, direct merging of paired-end reads using FLASH conserved more than 60% of
the reads. Therefore, direct merging of the paired-end reads can prevent the potential removal of informative reads
that do not comply with the trimming tool’s strict checks. We found FLASH to be an efficient tool for conserving reads while
carrying out quality trimming in moderation. Overall, our results show that merging paired-end reads of eDNA
data before trimming can conserve more reads.
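The effect of the Sliding Window threshold described in the abstract can be illustrated with a minimal Python sketch. It is a simplified approximation of sliding-window quality trimming, not Trimmomatic's actual implementation: the function names `phred_scores` and `sliding_window_trim`, the Phred+33 encoding, and the example read are all assumptions made for illustration. The read is cut at the start of the first window whose mean quality falls below the threshold.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string into PHRED scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality_string]


def sliding_window_trim(seq, quals, window=4, threshold=25):
    """Simplified sliding-window trim (illustrative, not Trimmomatic's code).

    Scans from the 5' end and cuts the read at the start of the first
    window whose mean quality is below `threshold`.
    """
    if len(seq) != len(quals):
        raise ValueError("sequence and quality lengths differ")
    for start in range(len(quals) - window + 1):
        win = quals[start:start + window]
        if sum(win) / window < threshold:
            return seq[:start], quals[:start]
    return seq, quals


if __name__ == "__main__":
    # Hypothetical read whose quality decays toward the 3' end.
    seq = "ACGTACGTACGT"
    quals = [35, 34, 33, 32, 30, 29, 28, 27, 26, 26, 25, 24]
    for q in (30, 25):
        kept, _ = sliding_window_trim(seq, quals, window=4, threshold=q)
        print(f"threshold Q{q}: kept {len(kept)} of {len(seq)} bases")
```

On this hypothetical read, the stricter threshold truncates the sequence early, while a threshold of 25 leaves it intact. Bases of this kind, slightly below Q30, are the ones the abstract argues may still be informative.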
1. Introduction
Metagenomics, also referred to as environmental genomics, is the
study of genetic material recovered directly from environmental samples
(skin and gut samples, soil and water samples) that can be preserved,
extracted, amplified, sequenced, and categorized based on its
Abbreviations: DNA, Deoxyribonucleic Acid; eDNA, Environmental DNA; FLASH, Fast Length Adjustment of SHort reads; rRNA, Ribosomal RNA; SW, Sliding
Window; MI, Maximum Information; NGS, Next Generation Sequencing.
* Corresponding author at: Faculty of Engineering and Technology, Sri Ramachandra Institute of Higher Education and Research, Chennai 600116, India.
E-mail address: dicky@sriramachandra.edu.in (D.J. Davis G.).
https://doi.org/10.1016/j.margen.2021.100914
Received 16 July 2021; Received in revised form 18 November 2021; Accepted 18 November 2021