
Major advances are happening in healthcare informatics, including sub-disciplines such as bioinformatics and clinical informatics, through the leverage of cloud technologies and large open data sets. These are being rapidly adopted by life sciences and healthcare institutions in both the commercial and public sectors. The domain has deep investments in scientific research and data analytics, focusing on the information and computation needs and the data acquisition techniques that optimize the acquisition, storage, retrieval, obfuscation, and secure use of information in health and biomedicine for evidence-based medicine and disease management.

In recent years, genomics and genetic data have emerged as an innovative area of research that could potentially transform healthcare. An emerging trend is personalized, or precision, medicine leveraging genomics. Early diagnosis of a disease can significantly increase the chances of successful treatment, and genomics can detect a disease long before symptoms present themselves. Many diseases, including cancers, are caused by alterations in our genes. Genomics can identify these alterations and search for them using an ever-growing number of genetic tests.

With AWS, genomics customers can dedicate more time and resources to science, speeding time to insights, achieving breakthrough research faster, and bringing life-saving products to market. AWS enables customers to innovate by making genomics data more accessible and useful. AWS delivers the breadth and depth of services to reduce the time between sequencing and interpretation, with secure and frictionless collaboration capabilities across multi-modal datasets. You can also choose the right tool for the job to get the best cost and performance at a global scale, accelerating the modern study of genomics.

Relevance Lab Research@Scale Architecture Blueprint
Working closely with AWS Healthcare and Clinical Informatics teams, Relevance Lab is bringing a scalable, secure, and compliant solution for enterprises to pursue Research@Scale on Cloud for intramural and extramural needs. The diagram below shows the architecture blueprint for Research@Scale. The solution offered on the AWS platform covers technology, solutions, and integrated services to help large enterprises manage research across global locations.


Leveraging AWS Biotech Blueprint with our Research Gateway
This use case leverages the AWS Biotech Blueprint, which provides a core template for deploying preclinical, cloud-based research infrastructure and optional informatics software on AWS.

This Quick Start sets up the following:

  • A highly available architecture that spans two availability zones
  • A preclinical virtual private cloud (VPC) configured with public and private subnets according to AWS best practices to provide you with your own virtual network on AWS. This is where informatics and research applications will run
  • A management VPC configured with public and private subnets to support the future addition of IT-centric workloads such as active directory, security appliances, and virtual desktop interfaces
  • Redundant, managed NAT gateways to allow outbound internet access for resources in the private subnets
  • Certificate-based virtual private network (VPN) services through the use of AWS Client VPN endpoints
  • Private, split-horizon Domain Name System (DNS) with Amazon Route 53
  • Best-practice AWS Identity and Access Management (IAM) groups and policies based on the separation of duties, designed to follow the U.S. National Institute of Standards and Technology (NIST) guidelines
  • A set of automated checks and alerts to notify you when AWS Config detects insecure configurations
  • Account-level logging, audit, and storage mechanisms designed to follow NIST guidelines
  • A secure way to remotely join the preclinical VPC network, using the AWS Client VPN endpoint
  • A prepopulated set of AWS Systems Manager Parameter Store key/value pairs for common resource IDs
  • (Optional) An AWS Service Catalog portfolio of common informatics software that can be easily deployed into your preclinical VPC

Using the Quick Start templates, these products were added to AWS Service Catalog and imported into RLCatalyst Research Gateway, as sketched below.
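For illustration, the sketch below shows the equivalent programmatic flow with boto3: looking up a shared AWS Service Catalog product and provisioning it. The product name, version choice, and parameters are placeholder assumptions; Research Gateway performs these calls on your behalf through its portal.

```python
# A minimal sketch of provisioning a catalog item programmatically with boto3.
# Product and parameter names below are hypothetical placeholders.
import boto3

sc = boto3.client("servicecatalog", region_name="us-east-1")

# Look up the product shared to this account (the name is an assumption).
product = sc.describe_product(Name="Nextflow-Workflow-Engine")
artifact_id = product["ProvisioningArtifacts"][-1]["Id"]  # latest version

# Provision it into the preclinical VPC.
sc.provision_product(
    ProductId=product["ProductViewSummary"]["ProductId"],
    ProvisioningArtifactId=artifact_id,
    ProvisionedProductName="nextflow-demo",
    ProvisioningParameters=[
        {"Key": "VpcId", "Value": "vpc-0123456789abcdef0"},  # placeholder
    ],
)
```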



Using the standard products, the Nextflow Workflow Orchestration engine was launched for genomics pipeline analysis. Nextflow helps create and orchestrate analysis workflows, using AWS Batch to run the workflow processes.

Nextflow is an open-source workflow framework and domain-specific language (DSL) for Linux, developed by the Comparative Bioinformatics group at the Barcelona Centre for Genomic Regulation (CRG). The tool enables you to create complex, data-intensive workflow pipeline scripts, and simplifies the implementation and deployment of genomics analysis workflows in the cloud.

This Quick Start sets up the following environment in a preclinical VPC:

  • In the public subnet, an optional Jupyter notebook in Amazon SageMaker is integrated with an AWS Batch environment.
  • In the private application subnets, an AWS Batch compute environment for managing Nextflow job definitions and queues and for running Nextflow jobs. AWS Batch containers have Nextflow installed and configured in an Auto Scaling group.
  • Because there are no databases required for Nextflow, this Quick Start does not deploy anything into the private database (DB) subnets created by the Biotech Blueprint core Quick Start.
  • An Amazon Simple Storage Service (Amazon S3) bucket to store your Nextflow workflow scripts, input and output files, and working directory (see the sketch after this list)
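Under the hood, launching a workflow in this environment amounts to staging the script in the S3 bucket and submitting a Nextflow "head" job to the AWS Batch queue. Below is a minimal sketch of that flow using boto3; the bucket, queue, and job definition names are placeholder assumptions, not values created by the Quick Start.

```python
# A minimal sketch of the flow the Quick Start automates: stage a workflow
# script in the S3 bucket, then submit a Nextflow "head" job to AWS Batch.
# Bucket, queue, and job definition names are assumptions for illustration.
import boto3

s3 = boto3.client("s3")
batch = boto3.client("batch")

bucket = "my-nextflow-workspace"          # placeholder workspace bucket
s3.upload_file("main.nf", bucket, "workflows/rnaseq/main.nf")

response = batch.submit_job(
    jobName="nextflow-rnaseq",
    jobQueue="nextflow-job-queue",        # placeholder queue name
    jobDefinition="nextflow-head-node",   # placeholder job definition
    containerOverrides={
        # The head job pulls the script from S3 and orchestrates the rest.
        "command": [f"s3://{bucket}/workflows/rnaseq/main.nf"]
    },
)
print("Submitted job:", response["jobId"])
```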

RStudio for Scientific Research
RStudio is a popular IDE, licensed either commercially or under AGPLv3, for working with R. RStudio is available in a desktop version or a server version that allows you to access R via a web browser.

After you’ve analyzed the results, you may want to visualize them. Shiny is a great R package, licensed either commercially or under AGPLv3, that you can use to create interactive dashboards. Shiny provides a web application framework for R. It turns your analyses into interactive web applications; no HTML, CSS, or JavaScript knowledge is required. Shiny Server can deliver your R visualization to your customers via a web browser and execute R functions, including database queries, in the background.

RStudio is provided as a standard catalog item in Research Gateway for 1-Click deployment and use. AWS provides a number of tools, such as Amazon Athena and AWS Glue, to connect to datasets for research analysis.
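As an illustration of connecting to such datasets, the sketch below runs an Amazon Athena query with boto3 and polls for the result. It is a minimal Python sketch; the database, table, and results bucket are hypothetical, and an RStudio user would typically reach Athena through an ODBC/JDBC driver or an R package instead.

```python
# A minimal sketch of querying a research dataset with Amazon Athena.
# Database, table, and result-bucket names are hypothetical placeholders.
import time
import boto3

athena = boto3.client("athena")

qid = athena.start_query_execution(
    QueryString="SELECT sample_id, gene, expression FROM expression_data LIMIT 10",
    QueryExecutionContext={"Database": "research_db"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)["QueryExecutionId"]

# Poll until the query finishes, then fetch the first page of results.
while True:
    state = athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"]
    for row in rows:  # the first row is the column header
        print([col.get("VarCharValue") for col in row["Data"]])
```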

Benefits of using AWS for Clinical Informatics

  • Data transfer and storage: The volume of genomics data poses challenges for transferring it from sequencers in a quick and controlled fashion, and then finding storage resources that can accommodate the scale and performance at a price that is not cost-prohibitive. AWS enables researchers to manage large-scale data that has outpaced the capacity of on-premises infrastructure. By transferring data to the AWS Cloud, organizations can take advantage of high-throughput data ingestion, cost-effective storage options, secure access, and efficient searching to propel genomics research forward.

  • Workflow automation for secondary analysis: Genomics organizations can struggle with tracking the origins of data when performing secondary analyses, and with running reproducible and scalable workflows while minimizing IT overhead. AWS offers services for scalable, cost-effective data analysis and simplified orchestration for running and automating parallelizable workflows. Options for automating workflows enable reproducible research or clinical applications, while AWS native, partner (NVIDIA and DRAGEN), and open-source (Cromwell and Nextflow) solutions provide flexible options for workflow orchestrators to help scale data analysis.

  • Data aggregation and governance: Successful genomics research and interpretation often depend on multiple, diverse, multi-modal datasets from large populations. AWS enables organizations to harmonize multi-omic datasets and govern robust data access controls and permissions across a global infrastructure, maintaining data integrity as research involves more collaborators and stakeholders. AWS simplifies the ability to store, query, and analyze genomics data, and to link it with clinical information.

  • Interpretation and deep learning for tertiary analysis: Analysis requires integrated multi-modal datasets and knowledge bases, intensive computational power, big data analytics, and machine learning at scale, which historically could take weeks or months, delaying time to insights. AWS accelerates the analysis of big genomics data by leveraging machine learning and high-performance computing. With AWS, researchers have access to greater computing efficiencies at scale, reproducible data processing, data integration capabilities to pull in multi-modal datasets, and public data for clinical annotation, all within a compliance-ready environment.

  • Clinical applications: Several hindrances impede the scale and adoption of genomics for clinical applications, including the speed of analysis, managing protected health information (PHI), and providing reproducible and interpretable results. By leveraging the capabilities of the AWS Cloud, organizations can establish a differentiated capability in genomics to advance their applications in precision medicine and patient practice. AWS services enable the use of genomics in the clinic by providing the data capture, compute, and storage capabilities needed to empower the modernized clinical lab to decrease the time to results, all while adhering to the most stringent patient privacy regulations.

  • Open datasets: As more life science researchers move to the cloud and develop cloud-native workflows, they bring reference datasets with them, often in their own personal buckets, leading to duplication, silos, and poor version documentation of commonly used datasets. The AWS Open Data Program (ODP) helps democratize data access by making it readily available in Amazon S3, providing the research community with a single documented source of truth. This increases study reproducibility, stimulates community collaboration, and reduces data duplication. The ODP also covers the cost of Amazon S3 storage, egress, and cross-region transfer for accepted datasets.

  • Cost optimization: Researchers utilize massive genomics datasets, which require large-scale storage options and powerful computational processing and can be cost-prohibitive. AWS presents cost-saving opportunities for genomics researchers across the data lifecycle, from storage to interpretation. AWS infrastructure and data services enable organizations to save time and money and devote more resources to science.

Summary
Relevance Lab is a specialist AWS partner working closely on Health Informatics and Genomics solutions, leveraging existing AWS solutions and complementing them with its Self-Service Cloud Portal, automation, and governance best practices.

To know more about how we can help standardize, scale, and speed up scientific research in the cloud, feel free to contact us at marketing@relevancelab.com.

References
AWS Whitepaper on Genomics Data Transfer, Analytics and Machine Learning
Genomics Workflows on AWS
HPC on AWS Video – Running Genomics Workflows with Nextflow
Workflow Orchestration with Nextflow on AWS Cloud
Biotech Blueprint on AWS Cloud
Running R on AWS
Advanced Bioinformatics Workshop





Non-scientific tasks such as setting up instances, installing software libraries, making models compile, and preparing input data are some of the biggest pain points for atmospheric scientists, or any scientists for that matter. These tasks are challenging because they demand strong technical skills and pull scientists away from their core areas of analysis and research data compilation. Adding to this, some of these tasks require high-performance computation, complicated software, and large data sets. Lastly, researchers need a real-time view of their actual spending, as research projects are often budget-bound. Relevance Lab helps researchers "focus on science and not servers", in partnership with AWS, leveraging the RLCatalyst Research Gateway (RG) product.

Why RLCatalyst Research Gateway?
Speeding up scientific research using the AWS cloud is a growing trend toward achieving "Research as a Service". However, the adoption of AWS Cloud can be challenging for researchers, with surprises on costs, security, governance, and right-sized architectures. Similarly, principal investigators can have a challenging time managing a research program with proper collaboration, tracking, and control. Research institutions would like to provide consistent and secure environments, standard approved products, and proper governance controls. The product was created to solve these common needs of researchers, principal investigators, and research institutions.


  • Available on AWS Marketplace and can be consumed in both SaaS as well as Enterprise mode
  • Provides a Self-Service Cloud Portal with the ability to manage the provisioning lifecycle of common research assets
  • Gives real-time visibility of the spend against the defined project budgets
  • The principal investigator can pause or stop the project if the budget is exceeded, until a new grant is approved

In this blog, we explain how the product has been used to solve a common research problem with GEOS-Chem, used in earth sciences. It covers a simple process that starts with access to large data sets on public S3 buckets, creation of an on-demand compute instance with the application loaded, copying the latest data for analysis, running the analysis, storing the output data, analyzing it using specialized AI/ML tools, and then deleting the instances. This is a common scenario faced by researchers daily, and the product demonstrates a simple, frictionless Self-Service capability to achieve this with tight controls on cost and compliance.

GEOS-Chem enables simulations of atmospheric composition on local to global scales. It can be used offline as a 3-D chemical transport model driven by assimilated meteorological observations from the Goddard Earth Observing System (GEOS) of the NASA Global Modeling and Assimilation Office (GMAO). The figure below shows the basic construct of GEOS-Chem input and output analysis.



Since this is a common use case, documentation on how to run GEOS-Chem on AWS Cloud is available in the public domain from researchers. The product makes the process simpler using a Self-Service Cloud portal. To know more about similar use cases and advanced computing options, refer to AWS HPC for Scientific Research.



Steps for GEOS-Chem Research Workflow on AWS Cloud
Prerequisites for the researcher before starting data analysis:

  • A valid AWS account and access to the RG portal
  • A publicly accessible S3 bucket containing large research data sets
  • An additional EBS volume for your ongoing operational research work (for occasional usage, it is recommended to store a snapshot in S3 for better cost management)
  • A pre-provisioned SageMaker Jupyter notebook to analyze output data

Once done, below are the steps to execute this use case; a short code sketch of the data-staging steps follows the list.

  • Login to the RG Portal and select the GEOS-Chem project
  • Launch an EC2 instance with GEOS-Chem AMI
  • Login to EC2 using SSH and configure AWS CLI
  • Connect to a public S3 bucket from AWS CLI to list NASA-NEX data
  • Run the simulation and copy the output data to a local S3 bucket
  • Link the local S3 bucket to an AWS SageMaker instance and launch a Jupyter notebook to analyze the output data
  • Once done, terminate the EC2 instance and check the cost spent on the use case
  • All costs related to the GEOS-Chem project and researcher consumption are tracked automatically
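The sketch below shows the data-staging portion of these steps (listing the public NASA-NEX data and copying simulation output to a local bucket) using boto3 instead of the AWS CLI. The output bucket and key names are placeholders.

```python
# A minimal sketch of the data-movement steps above using boto3 rather than
# the AWS CLI. The public NASA-NEX bucket name is real (hosted in us-west-2);
# the local output bucket and key names are placeholders for illustration.
import boto3

s3 = boto3.client("s3", region_name="us-west-2")

# Step: list objects in the public NASA-NEX dataset bucket.
listing = s3.list_objects_v2(Bucket="nasanex", Prefix="NEX-GDDP/", MaxKeys=10)
for obj in listing.get("Contents", []):
    print(obj["Key"], obj["Size"])

# Step: after the simulation run, copy output data to a local S3 bucket.
s3.upload_file(
    "OutputDir/GEOSChem.SpeciesConc.nc4",   # simulation output (placeholder)
    "my-geoschem-results",                  # local bucket (placeholder)
    "runs/2021-07/GEOSChem.SpeciesConc.nc4",
)
```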

Sample Output Analysis
Once you run the output files in the Jupyter notebook, it compiles them and provides the output data in a visual format, as shown in the sample below. The researcher can then create a snapshot, upload it to S3, and terminate the EC2 instance (without deleting the additional EBS volume created along with it).

Output analyzing the loss rate and air mass of hydroxyl (OH), pertaining to atmospheric science.
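For readers who want to reproduce this kind of plot, here is a minimal sketch of the notebook analysis, assuming the simulation wrote NetCDF diagnostics; the file path and variable name are assumptions and will vary with the GEOS-Chem configuration.

```python
# A minimal sketch of the notebook analysis described above, assuming the
# simulation produced NetCDF diagnostics. The file path and variable name
# are assumptions; actual GEOS-Chem diagnostic names vary by configuration.
import xarray as xr
import matplotlib.pyplot as plt

ds = xr.open_dataset("GEOSChem.SpeciesConc.nc4")   # placeholder output file

# Surface-level OH concentration, averaged over time.
oh = ds["SpeciesConc_OH"].isel(lev=0).mean(dim="time")
oh.plot()                                          # quick lat/lon map
plt.title("Surface OH concentration (time mean)")
plt.show()
```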


Summary
Scientific computing can take advantage of cloud computing to speed up research, scale up computing needs almost instantaneously, and do all this with much better cost-efficiency. Researchers no longer need to worry about the expertise required to set up the infrastructure in AWS, as they can leave this to tools like RLCatalyst Research Gateway, thus compressing the time it takes to complete their research computing tasks.

The steps demonstrated in this blog can easily be replicated for other, similar research domains. The solution can also be used to onboard new researchers with pre-built solution stacks provided in an easy-to-consume option. RLCatalyst Research Gateway is available in SaaS mode from AWS Marketplace, and research institutions can continue to use their existing AWS account to configure and enable the solution for more effective scientific research governance.

To learn more about GEOS-Chem use cases, click here.

If you want to learn more about the product or book a live demo, feel free to contact marketing@relevancelab.com.

References
Enabling Immediate Access to Earth Science Models through Cloud Computing: Application to the GEOS-Chem Model
Enabling High‐Performance Cloud Computing for Earth Science Modeling on Over a Thousand Cores: Application to the GEOS‐Chem Atmospheric Chemistry Model





AWS provides a comprehensive, elastic, and scalable cloud infrastructure to run your HPC applications. Working with AWS to explore HPC for driving scientific research, Relevance Lab leveraged its RLCatalyst Research Gateway product to provision an HPC cluster using AWS Service Catalog, with simple steps to launch a new environment for research. This blog captures the steps used to launch a simple HPC 1.0 cluster on AWS and the roadmap to extend the functionality to cover more advanced use cases with AWS ParallelCluster.

AWS delivers an integrated suite of services that provides everything needed to build and manage HPC clusters in the cloud. These clusters are deployed across various industry verticals to run the most compute-intensive workloads. HPC applications on AWS span traditional workloads such as genomics, computational chemistry, financial risk modeling, computer-aided engineering, weather prediction, and seismic imaging, as well as newer applications such as machine learning, deep learning, and autonomous driving. In the US alone, multiple organizations across different specializations are choosing the cloud to collaborate on scientific research.


Similar programs exist across different geographies and institutions in the EU and Asia, along with country-specific public sector programs. Our focus is to work with AWS and regional scientific institutions to bring the power of supercomputers to day-to-day researchers in a cost-effective manner, with proper governance and tracking. Also, with Self-Service models, the shift needs to happen from worrying about computation to focusing on data, workflows, and analytics. This requires a new paradigm that considers the prospects of serverless scientific computing, which we cover in later sections.

Relevance Lab RLCatalyst Research Gateway provides a Self-Service Cloud portal to provision AWS products with a 1-Click model based on AWS Service Catalog. While dealing with more complex AWS products like HPC, there is a need for a multi-step provisioning model and post-provisioning actions that are not always possible using standard AWS APIs. In these situations, which require complex orchestration and post-provisioning automation, RLCatalyst BOTs provide a flexible and scalable solution to complement standard Research Gateway features.

Building blocks of HPC on AWS
AWS offers various services that make it easy to set up an HPC environment.


An HPC solution in AWS uses the following components as building blocks.

  • EC2 instances are used for the master and worker nodes. The master nodes can use On-Demand instances, and the worker nodes can use a combination of On-Demand and Spot Instances.
  • The software for the master nodes is built into an AMI that is used for the creation of master nodes.
  • The agent software through which the manager communicates with the worker nodes is built into a second AMI that is then used for provisioning the worker nodes.
  • Data is shared between nodes using a file-sharing mechanism like Amazon FSx for Lustre.
  • Long-term storage uses Amazon S3.
  • Scaling of nodes is done via Auto Scaling.
  • AWS KMS for encrypting and decrypting keys.
  • Directory services to create the domain name for using HPC via the UI.
  • A Lambda function to create the user directory.
  • Elastic Load Balancing to distribute incoming application traffic across multiple targets, such as Amazon EC2 instances, containers, IP addresses, Lambda functions, and virtual appliances.
  • Amazon EFS, a regional service storing data within and across multiple Availability Zones (AZs) for high availability and durability; Amazon EC2 instances can access the file system across AZs.
  • Amazon VPC to launch the EC2 instances in a private network.

Evolution of HPC on AWS
  • HPC clusters first came into existence in AWS using the CfnCluster CloudFormation template, which creates a number of master and worker nodes in the cluster based on the input parameters. This product can be made available through AWS Service Catalog and is an item that can be provisioned from the RLCatalyst Research Gateway. Cluster manager software like Slurm, Torque, or SGE is pre-installed on the master nodes, and the agent software is pre-installed on the worker nodes. Also pre-installed is software that can provide a UI (like NICE EnginFrame) for the user to submit jobs to the cluster manager.
  • AWS ParallelCluster is a newer offering from AWS for provisioning an HPC cluster. It provides an open-source, CLI-based option for setting up a cluster. It sets up the master and worker nodes and also installs controlling software that watches the job queues and triggers scaling requests on the AWS side so that the overall cluster can grow or shrink based on the size of the job queue.

Steps to Launch HPC from RLCatalyst Research Gateway
A standard HPC launch involves the following steps.

  • Provide the input parameters for the cluster. These include:
    • The compute instance size for the master node (vCPUs, RAM, Disk)
    • The compute instance size for the worker nodes (vCPUs, RAM, Disk)
    • The minimum and maximum number of worker nodes.
    • Select the workload manager software (Slurm, Torque, SGE)
    • Connectivity options (SSH keys etc.)
  • Launch the product.
  • Once the product is in Active state, connect to the URL in the Output parameters on the Product Details page. This connects you to the UI from where you can submit jobs to the cluster.
  • You can SSH into the master nodes using the key pair selected in the Input form.

RLCatalyst Research Gateway uses the CfnCluster method to create an HPC cluster. This allows the HPC cluster to be created just like any other product in the Research Gateway catalog. Though provisioning may take up to 45 minutes to complete, it creates a URL in the outputs through which jobs can be submitted.
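For illustration, the sketch below shows the kind of CloudFormation call that such a provisioning step performs, mapping the input form's fields to template parameters. The template URL and parameter names are placeholders, not the actual Research Gateway template.

```python
# A minimal sketch of the kind of call the portal makes under the hood:
# launching the cluster's CloudFormation template with the form inputs.
# The template URL and parameter names are placeholders for illustration.
import boto3

cfn = boto3.client("cloudformation")

cfn.create_stack(
    StackName="hpc-cluster-demo",
    TemplateURL="https://s3.amazonaws.com/my-templates/cfncluster.yaml",  # placeholder
    Parameters=[
        {"ParameterKey": "MasterInstanceType", "ParameterValue": "c5.xlarge"},
        {"ParameterKey": "ComputeInstanceType", "ParameterValue": "c5.4xlarge"},
        {"ParameterKey": "MinComputeNodes", "ParameterValue": "0"},
        {"ParameterKey": "MaxComputeNodes", "ParameterValue": "16"},
        {"ParameterKey": "Scheduler", "ParameterValue": "slurm"},
        {"ParameterKey": "KeyName", "ParameterValue": "my-ssh-keypair"},
    ],
    Capabilities=["CAPABILITY_IAM"],
)

# Poll until the stack is ready, then read the submission URL from outputs.
waiter = cfn.get_waiter("stack_create_complete")
waiter.wait(StackName="hpc-cluster-demo")
outputs = cfn.describe_stacks(StackName="hpc-cluster-demo")["Stacks"][0]["Outputs"]
print({o["OutputKey"]: o["OutputValue"] for o in outputs})
```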

Advanced Use Cases for HPC

  • Computational Fluid Dynamics
  • Risk Management & Portfolio Optimization
  • Autonomous Vehicles – Driving Simulation
  • Research and Technical Computing on AWS
  • Cromwell on AWS
  • Genomics on AWS

We have specifically looked at a use case in bioinformatics, where much research uses the Cromwell server to process workflows defined in the WDL language. The Cromwell server acts as a manager that controls the worker nodes, which execute the tasks in the workflow. A typical Cromwell setup in AWS can use AWS Batch as the backend to scale the cluster up and down and execute containerized tasks on EC2 instances (On-Demand or Spot).
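To make the Batch backend concrete, here is a minimal boto3 sketch of how one containerized workflow task could be registered and submitted. Cromwell generates the equivalent calls itself; the image, queue, and command shown are placeholders.

```python
# A minimal sketch of how an AWS Batch backend executes one containerized
# workflow task. The image, queue, and command are placeholders.
import boto3

batch = boto3.client("batch")

# Register a job definition wrapping the task's container image.
jd = batch.register_job_definition(
    jobDefinitionName="bwa-mem-task",
    type="container",
    containerProperties={
        "image": "biocontainers/bwa:v0.7.17",       # placeholder image
        "vcpus": 4,
        "memory": 8192,
        "command": ["bwa", "mem", "ref.fa", "reads.fq"],
    },
)

# Submit the task to a queue backed by On-Demand or Spot compute.
batch.submit_job(
    jobName="bwa-mem-sample1",
    jobQueue="cromwell-queue",                      # placeholder queue
    jobDefinition=jd["jobDefinitionArn"],
)
```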



Prospect of Serverless Scientific Computing and HPC
With the advent of serverless computing and its availability on all major cloud platforms, it is now possible to take the computing that would be done on a high-performance cluster and run it as Lambda functions, a "Function as a Service" paradigm for HPC and scientific research workflows. The obvious advantage of this model is that the virtual cluster is highly elastic and is charged only for the exact execution time of each Lambda function executed.
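A minimal sketch of this fan-out pattern is shown below: each unit of work becomes one asynchronous Lambda invocation, billed only for its execution time. The function name and payload shape are assumptions; the function itself must already be deployed.

```python
# A minimal sketch of the fan-out pattern described above: one task per
# Lambda invocation, paid only for execution time. The function name and
# payload shape are assumptions; the function must already exist.
import json
import boto3

lam = boto3.client("lambda")

# Fan a parameter sweep out across many short-lived function invocations.
for i, params in enumerate({"seed": s} for s in range(100)):
    lam.invoke(
        FunctionName="simulate-chunk",        # placeholder function
        InvocationType="Event",               # asynchronous, fire-and-forget
        Payload=json.dumps({"task_id": i, **params}),
    )
```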

One current limitation of this model is that only a few runtimes are supported, such as Node.js and Python, while a lot of scientific computing code might use additional runtimes like C, C++, or Java. However, this is fast changing, and cloud providers are introducing new runtimes like Go and Rust.


Summary
Scientific computing can take advantage of cloud computing to speed up research, scale up computing needs almost instantaneously, and do all this with much better cost efficiency. Researchers no longer need to worry about the expertise required to set up the infrastructure in AWS, as they can leave this to tools like RLCatalyst Research Gateway, thus compressing the time it takes to complete their research computing tasks.

To learn more about this solution, or to participate in using it for your internal needs, feel free to contact marketing@relevancelab.com.

References
Getting started with HPC on AWS
HPC on AWS Whitepaper
AWS HPC Workshops
Genomics in the Cloud
Serverless Supercomputing: High Performance Function as a Service for Science
FaaSter, Better, Cheaper: The Prospect of Serverless Scientific Computing and HPC


