“Sequencing as Service” – Powering NGS on AWS Cloud with RLCatalyst Research Gateway

Relevance Lab has been collaborating with AWS Partnership teams over the past year to create Genomics Cloud, enabling Next Generation Sequencing (NGS) on demand. This is one of the dominant use cases for scientific research in the cloud, driven by healthcare and life sciences groups exploring ways to make NGS better, faster, and cheaper so that researchers can focus on science rather than complex infrastructure.

Relevance Lab offers RLCatalyst Research Gateway, a product that facilitates scientific research with easier access to large-scale compute infrastructure, large data sets, powerful analytics tools, and a secure research environment, along with the ability to drive self-service research under tight cost and budget controls.

Taking forward the concept of making NGS processing frictionless, new functionality in RLCatalyst Research Gateway allows researchers to use “Sequencing as Service” by choosing their preferred pipeline processing engine, covering open-source platforms like Nextflow and Cromwell as well as commercial engines such as Illumina DRAGEN and NVIDIA Parabricks. The top use cases for AWS Genomics in the Cloud implemented by this product are given below. By providing an out-of-the-box solution, the product delivers significant cost and effort savings for customers.

Top Use Cases

Data Transfer and Storage
The high volume of genomics data requires efficient data transfer from sequencers and storing raw data for further quality checks and mapping in a cost-effective manner. AWS enables researchers to manage large-scale data that has outpaced the capacity of on-premises infrastructure. By transferring data to the AWS cloud, organizations can take advantage of high-throughput data ingestion, cost-effective storage options, secure access, and efficient searching to propel genomics research forward.
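
As a minimal sketch of the ingestion step (assuming boto3 credentials are configured; the bucket and file names below are placeholders), large FASTQ files can be pushed to S3 with multipart settings tuned for sequencer-scale output:

```python
import boto3
from boto3.s3.transfer import TransferConfig

# Placeholder bucket and file names for illustration only.
BUCKET = "my-ngs-raw-data"
LOCAL_FASTQ = "sample_R1.fastq.gz"

# Multipart settings suited to multi-gigabyte sequencer output:
# switch to multipart uploads at 64 MB and push 10 parts in parallel.
config = TransferConfig(
    multipart_threshold=64 * 1024 * 1024,
    multipart_chunksize=64 * 1024 * 1024,
    max_concurrency=10,
)

s3 = boto3.client("s3")
s3.upload_file(
    LOCAL_FASTQ,
    BUCKET,
    f"raw/{LOCAL_FASTQ}",
    Config=config,
    ExtraArgs={"StorageClass": "STANDARD_IA"},  # cheaper tier for rarely re-read raw data
)
print(f"Uploaded to s3://{BUCKET}/raw/{LOCAL_FASTQ}")
```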

Genomic Workflow Automation for Secondary Analysis
Genomics organizations can speed up secondary analysis and run reproducible, scalable workflows while minimizing IT overhead, using open-source solutions (Cromwell and Nextflow) or partner solutions (Illumina DRAGEN and NVIDIA Parabricks). AWS offers services for scalable, cost-effective data analysis and simplified orchestration for running and automating parallelizable workflows.

Data Aggregation
With growing volumes of sample data and the need for variant analysis on output data, there is a need to create a genomics data lake for research and interpretation of results, the foundation of precision medicine. AWS enables organizations to harmonize multi-omics datasets and enforce robust data access controls and permissions across a global infrastructure to maintain data integrity as research involves more collaborators and stakeholders. AWS simplifies the ability to store, query, and analyze genomics data and link it with clinical information.
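
Once variant data lands in the lake as a queryable table, analysis reduces to SQL. A hedged sketch using Amazon Athena (the database, table, and result-bucket names are assumptions for illustration):

```python
import boto3

athena = boto3.client("athena")

# Database, table, and output location are placeholders; the table would
# typically be registered in the AWS Glue Data Catalog over Parquet files.
response = athena.start_query_execution(
    QueryString="""
        SELECT chrom, pos, ref, alt, COUNT(*) AS sample_count
        FROM variants
        GROUP BY chrom, pos, ref, alt
        HAVING COUNT(*) > 10
    """,
    QueryExecutionContext={"Database": "genomics_lake"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
print("Query started:", response["QueryExecutionId"])
```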

Tertiary Analysis with Interpretation and Deep Learning
Precision medicine based on genomic sequencing and the analysis of patterns requires integrated datasets and knowledge bases, large computational power, big data analytics, and machine learning at scale; historically this has taken weeks or months, delaying time to insight. AWS accelerates the analysis of big genomics data by leveraging machine learning and high-performance computing. With AWS, researchers gain greater computing efficiency at scale, reproducible data processing, data integration capabilities to pull in multi-modal datasets, and public data for clinical annotation, all within a compliance-ready environment.

Open Data Sets
As more life science researchers move to the cloud and develop cloud-native workflows, they bring reference datasets with them, often in their own personal buckets, leading to duplication, silos, and poor version documentation of commonly used datasets. The AWS Open Data Program (ODP) helps democratize data access by making it readily available in Amazon S3, providing the research community with a single documented source of truth. This increases study reproducibility, stimulates community collaboration, and reduces data duplication. The ODP also covers the cost of Amazon S3 storage, egress, and cross-region transfer for accepted datasets.
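
Because ODP datasets sit in public S3 buckets, they can be browsed anonymously. A small sketch against the 1000 Genomes Project bucket (the prefix shown is illustrative):

```python
import boto3
from botocore import UNSIGNED
from botocore.config import Config

# The 1000 Genomes bucket is part of the AWS Open Data Program, so
# unsigned (anonymous) requests are sufficient to browse it.
s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))

resp = s3.list_objects_v2(Bucket="1000genomes", Prefix="phase3/", MaxKeys=5)
for obj in resp.get("Contents", []):
    print(obj["Key"], obj["Size"])
```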

Cost Optimization
Using large-scale compute resources and large data sets across many analysis jobs is resource-intensive, with significant cost impacts that call for proper capacity planning, tracking, and optimization. Researchers work with massive genomics datasets that require large-scale storage and powerful computational processing, which can be cost-prohibitive. AWS presents cost-saving opportunities for genomics researchers across the data lifecycle, from storage to interpretation, and its infrastructure and data services enable organizations to save time and money and devote more resources to science.

Concept of Sequencing as Service
The concept of “Sequencing as Service” on the cloud is illustrated below.


Key Building Blocks for “Sequencing as Service” Architecture
The solution for easy genomics sequencing in the cloud provides the following key components, meeting the needs of researchers, scientists, developers, and analysts who want to run their experiments efficiently without deep expertise in the backend computing capabilities.

Genomics Pipeline Processing Engine
The research community widely uses open-source tools like Nextflow and Cromwell to process large data sets on HPC systems, with these tools managing the orchestration layer.

Nextflow is a bioinformatics workflow manager that enables the development of portable and reproducible workflows. It supports deploying workflows on a variety of execution platforms, including local, HPC schedulers, AWS Batch, Google Cloud Life Sciences, and Kubernetes.
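
As a minimal sketch of driving such a run from a script (assumptions: Nextflow is installed, the pipeline defines the profiles used, and the S3 work directory is a placeholder):

```python
import subprocess

# Launch a Nextflow pipeline non-interactively. The pipeline, profiles,
# and work directory below are illustrative; an AWS Batch-aware profile
# must exist in the pipeline's configuration for cloud execution.
cmd = [
    "nextflow", "run", "nf-core/sarek",
    "-profile", "test,docker",
    "-work-dir", "s3://my-ngs-workdir/sarek",  # staging area for task inputs/outputs
    "-resume",                                 # reuse cached results after interruptions
]
result = subprocess.run(cmd)
print("Nextflow exited with code", result.returncode)
```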

Cromwell is a workflow execution engine that simplifies the orchestration of computing tasks needed for genomics analysis. Cromwell enables genomics researchers, scientists, developers, and analysts to efficiently run their experiments without the need for deep expertise in backend computing capabilities.
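
Cromwell is usually run in server mode and driven over its REST API. A hedged sketch of submitting a WDL workflow (the server address, WDL file name, and input key are assumptions):

```python
import json
import requests

# Submit a WDL workflow to a running Cromwell server (default port 8000).
CROMWELL_URL = "http://localhost:8000/api/workflows/v1"

with open("variant_calling.wdl", "rb") as wdl:
    response = requests.post(
        CROMWELL_URL,
        files={
            "workflowSource": wdl,
            # Input names are hypothetical and must match the WDL's declarations.
            "workflowInputs": json.dumps(
                {"VariantCalling.input_bam": "s3://my-bucket/sample.bam"}
            ),
        },
    )
response.raise_for_status()
print("Workflow id:", response.json()["id"])  # poll this id for status/outputs
```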

Many organizations also use commercial engines like Illumina DRAGEN and NVIDIA Parabricks, which are further optimized to reduce processing timelines but come at a price.

Open Source Repositories for Common Genomics Workflows
The solution needs to let researchers easily reuse the work of different communities, drawing on existing workflows and containers. Researchers can leverage any of the available pipelines and containers or create their own implementations built on existing standards.

GATK4 is the Genome Analysis Toolkit for variant discovery in high-throughput sequencing data. Developed by the Data Sciences Platform at the Broad Institute, the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size (a sample invocation is sketched after this list).

BioContainers – A community-driven project to create and manage bioinformatics software containers.

Dockstore – a free and open-source platform for sharing reusable and scalable analytical tools and workflows, developed by the Cancer Genome Collaboratory and used by the GA4GH (Global Alliance for Genomics and Health).

nf-core Pipelines – A community effort to collect a curated set of analysis pipelines built using Nextflow.

Workflow Description Language (WDL) is a way to specify data processing workflows with a human-readable and writable syntax.
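
As a hedged illustration of what a single pipeline step looks like underneath these abstractions (assuming GATK4 is installed and the reference, BAM, and output paths exist), a germline variant-calling invocation might be:

```python
import subprocess

# Call germline variants with GATK4 HaplotypeCaller. All file paths are
# placeholders; in a managed pipeline the engine stages them automatically.
cmd = [
    "gatk", "HaplotypeCaller",
    "-R", "reference.fasta",   # reference genome (indexed, with .dict)
    "-I", "sample.bam",        # aligned, sorted, indexed reads
    "-O", "sample.g.vcf.gz",   # per-sample calls
    "-ERC", "GVCF",            # emit a GVCF for later joint genotyping
]
subprocess.run(cmd, check=True)
```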

AWS Batch for High-Performance Computing
AWS offers many services that can be used for genomics. In this solution, the core of the architecture is AWS Batch, a managed service built on top of other AWS services such as Amazon EC2 and Amazon Elastic Container Service (ECS). Security is enforced through roles via AWS Identity and Access Management (IAM), a service that helps control who is authenticated (signed in) and authorized (has permissions) to use AWS resources.
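
Under the hood, each pipeline task becomes a Batch job. A minimal sketch with boto3 (the queue, job definition, and command are assumptions; Research Gateway provisions equivalents automatically):

```python
import boto3

batch = boto3.client("batch")

# The job queue and job definition must already be registered in the account.
response = batch.submit_job(
    jobName="fastqc-sample-001",
    jobQueue="genomics-spot-queue",   # Spot-backed queue for cost savings
    jobDefinition="fastqc:1",         # container image plus default settings
    containerOverrides={
        "command": ["fastqc", "/data/sample_R1.fastq.gz"],
        "resourceRequirements": [
            {"type": "VCPU", "value": "4"},
            {"type": "MEMORY", "value": "8192"},  # MiB
        ],
    },
)
print("Submitted Batch job:", response["jobId"])
```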

Large Data Sets Storage and Access to Open Data Sets
The AWS cloud is leveraged to meet the storage, processing, and analytics needs of large data sets using the following key products.

Amazon S3 for high-throughput data ingestion, cost-effective storage options, secure access, and efficient searching.

AWS DataSync for secure online data movement, automating and accelerating transfers between on-premises storage and AWS storage services (a task-trigger sketch follows this list).

AWS Open Data program for openly available reference data, with 40+ open life sciences data repositories.
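
As referenced in the DataSync item above, recurring transfers are modeled as tasks that are created once and then triggered per run. A hedged sketch (the task ARN is a placeholder):

```python
import boto3

datasync = boto3.client("datasync")

# Trigger an execution of a pre-configured DataSync task that mirrors an
# on-premises share of sequencer output into S3. The ARN is a placeholder;
# the task's source and destination locations are defined ahead of time.
response = datasync.start_task_execution(
    TaskArn="arn:aws:datasync:us-east-1:123456789012:task/task-0abc123def456"
)
print("DataSync execution started:", response["TaskExecutionArn"])
```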

Outputs Analysis and Monitoring Tools
A key building block for genomic data analysis is access to common tools like the following, integrated into the solution.

MultiQC searches a given directory for analysis logs and compiles an HTML report. It is a general-use tool, perfect for summarising the output of numerous bioinformatics tools (a sample invocation follows this list).

IGV (Integrated Genomics Viewer) is a high-performance, easy-to-use, interactive tool for the visual exploration of genomic data.

RStudio for Genomics is included since R is one of the most widely used and powerful programming languages in bioinformatics. R especially shines where a variety of statistical tools are required (e.g., RNA-Seq, population genomics) and in the generation of publication-quality graphs and figures.
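
As noted in the MultiQC item above, report generation is a single command over the pipeline's output directory. A small sketch (assuming MultiQC is installed and results have been synced locally; paths and title are placeholders):

```python
import subprocess

# Aggregate logs from FastQC, GATK, Nextflow trace files, etc. into one
# HTML report. Input and output directories are placeholders.
subprocess.run(
    ["multiqc", "results/", "--outdir", "reports/", "--title", "sarek-run-001"],
    check=True,
)
```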

Genomics Data Lake
Once secondary analysis generates its outputs, typically in Variant Call Format (VCF), that data needs to move into a genomics data lake for tertiary processing. Leveraging standard AWS tools and solution frameworks, a genomics data lake is implemented and integrated with the end-to-end sequencing pipeline.

The Variant Call Format (VCF) specification is used in bioinformatics for storing gene sequence variations, typically as a compressed text file. According to the specification, a VCF file has meta-information lines, a header line, and data lines. Compressed VCF files are indexed for fast random access to variants across a range of positions.

VCF files, though popular in bioinformatics, are a mixed file type, combining a metadata header with a more structured, table-like body. Converting VCF files to the columnar Parquet format makes them work well in distributed contexts like a data lake.
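
A minimal sketch of that conversion (assuming the pysam and pyarrow packages are installed; file names are placeholders, and only a few core columns are extracted):

```python
import pysam                  # pip install pysam
import pyarrow as pa          # pip install pyarrow
import pyarrow.parquet as pq

# Flatten the table-like body of a VCF into columnar Parquet,
# splitting multi-allelic sites into one row per ALT allele.
rows = {"chrom": [], "pos": [], "ref": [], "alt": [], "qual": []}
with pysam.VariantFile("sample.vcf.gz") as vcf:
    for rec in vcf:
        for alt in rec.alts or ():
            rows["chrom"].append(rec.chrom)
            rows["pos"].append(rec.pos)
            rows["ref"].append(rec.ref)
            rows["alt"].append(alt)
            rows["qual"].append(rec.qual)

pq.write_table(pa.table(rows), "sample.parquet")  # ready for Athena/Spark
```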

Cost Analysis of Workflows
One of the biggest concerns for users running genomic pipelines in the cloud is control over budget and cost. RLCatalyst Research Gateway addresses this by tracking spend at a granular level across projects, researchers, and workflow runs, and by enabling optimization through techniques like Spot instances and on-demand compute. Built-in guardrails provide appropriate controls and corrective actions, and users can run sequencing workflows in their own AWS accounts for transparent control and visibility.
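
Granular tracking of this kind rests on cost-allocation tags. A hedged sketch of the underlying query via the AWS Cost Explorer API (the tag key and date range are assumptions; Research Gateway surfaces this in its dashboards):

```python
import boto3

ce = boto3.client("ce")  # AWS Cost Explorer

# Break down one month's spend by a cost-allocation tag. The "Project"
# tag key is illustrative; resources must be tagged for costs to roll up.
response = ce.get_cost_and_usage(
    TimePeriod={"Start": "2022-05-01", "End": "2022-06-01"},
    Granularity="MONTHLY",
    Metrics=["UnblendedCost"],
    GroupBy=[{"Type": "TAG", "Key": "Project"}],
)
for group in response["ResultsByTime"][0]["Groups"]:
    print(group["Keys"][0], group["Metrics"]["UnblendedCost"]["Amount"])
```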

A typical researcher flow for using RLCatalyst Research Gateway for “Sequencing as Service” is explained in the workflow below.


Common Use Case Demonstration – Sarek
While the solution supports any public pipeline built to the Workflow Description Language (WDL), Common Workflow Language (CWL), or Nextflow specifications, for this blog we have chosen the popular nf-core/sarek pipeline as a sample.


Steps on how to use RLCatalyst Research Gateway for the Use Case

1. From the Available Products tab, provision an S3 product to create a bucket for your sample data. Once the bucket is created, use the “Explore” action to view its contents, and the “Add File” and “Add Folders” buttons to upload your input data. From the “Product Details” tab, copy the name of the bucket created.


2. Provision a Nextflow-Advanced product in the Research Gateway. Select the nf-core/sarek pipeline in the PipelineName field by searching for “sarek”.


Use the bucket name copied in step 1 as the InputDataLocation. Choose a key pair that allows you to connect to the head node, or create a new one.

3. Once provisioning is complete, use the “SSH to Server” button to connect to the head node. Change directory to the sarek folder (which contains a clone of the Git repository selected in the pipeline name). You can now run the pipeline with the command “nextflow run main.nf -profile test,docker,batch”.


4. Use Monitor Pipeline to monitor the progress of the job. This will launch the Nextflow Tower URL in a separate browser tab.


5. View the output files using the “View Outputs” button. Download the files by clicking on the links.


6. View Project Costs


7. View Researcher Costs


8. View Workspace Costs


Summary
To make large-scale genomic sequencing in the cloud easier for institutions, principal investigators, and researchers, we provide the fundamental building blocks for “Sequencing as Service”. The integrated product covers access to large data sets, support for popular pipeline engines, access to open-source pipelines and containers, AWS HPC environments, analytics tools, and cost tracking, taking away the pain of managing infrastructure, data, security, and costs so that researchers can focus on science.

To learn more about how you can start genomic sequencing in the AWS cloud in 30 minutes using our solution at https://research.rlcatalyst.com, feel free to contact marketing@relevancelab.com.

References
High-performance genetic datastore on AWS S3 using Parquet and Arrow
Parallelizing Genome Variant Analysis