Some of the biggest success stories in data-intensive applications are coming from life sciences where innovation in large-scale data analysis facilitated the Human Genome Project. International researchers took 13 years, $2.7 billion and an array of supercomputers to sequence all 3 billion base pairs in the human genome. In April 2003, they completed our common DNA map, making this the biggest of Big Data projects of the time.
This amazing achievement created a new branch of health care — precision medicine. And the business of genome sequencing continues to drive development of faster, simpler, cheaper technology for compiling, storing, sharing, moving and analyzing prodigious amounts of data to uncover the insights within.
And what insights! From precision medicine comes precision oncology, where doctors and scientists seek to end the burden of cancer, or at least “end cancer as we know it today,” by tailoring medical plans and treatments to an individual patient’s genome, lifestyle and environment.
Data-Intensive Precision Medicine
Precision medicine, also called personalized medicine, relies heavily on artificial intelligence and machine learning algorithms and is an extremely data-hungry industry. The 3 billion base pairs of each human’s genome occupy around 6 GB of storage once fully sequenced. Processing can multiply that figure by 30 to 35 times through oversampling, or coverage (reading the same location of the DNA multiple times to build accuracy), and in some applications by as much as 800 times. At 30 to 35 times coverage, the sample has grown to about 200 GB, and intermediate data produced during sequencing can grow it to 700 GB. Per patient!
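The arithmetic behind those figures can be checked with a short sketch. The 6 GB, 30 to 35 times, and 700 GB values come from the paragraph above; the function names and the 1,000-patient cohort are hypothetical, chosen only to illustrate how quickly the totals grow.

```python
# Figures quoted in the article; everything else here is an illustrative assumption.
GENOME_GB = 6                  # one fully sequenced human genome
COVERAGE_RANGE = (30, 35)      # typical oversampling (coverage) factors
INTERMEDIATE_PEAK_GB = 700     # possible size including intermediate data

def volume_at_coverage(coverage: float, genome_gb: float = GENOME_GB) -> float:
    """Approximate data volume in GB when each position is read `coverage` times."""
    return genome_gb * coverage

def cohort_storage_tb(patients: int, gb_per_patient: float) -> float:
    """Total storage in TB (1 TB = 1000 GB) for a hypothetical cohort."""
    return patients * gb_per_patient / 1000

low, high = (volume_at_coverage(c) for c in COVERAGE_RANGE)
print(f"30x to 35x coverage: {low:.0f} to {high:.0f} GB per patient")
print(f"1,000 patients at peak: {cohort_storage_tb(1000, INTERMEDIATE_PEAK_GB):.0f} TB")
```

Multiplying 6 GB by 30 and by 35 gives 180 GB and 210 GB, which brackets the roughly 200 GB cited above. At the 700 GB peak, a hypothetical study of 1,000 patients would need about 700 TB.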
“There's a transition happening from gene paneling to whole exome sequencing to whole genome sequencing,” says Hemant Thapar, co-founder and CEO of OmniTier, a developer of memory-focused, application-specific, high-performance data products. “And as you move that direction, the amount of data that you have to process gets very large.” But the potential is also high: Personalized medicine could thrive with the discovery of more genetic variants — such as point substitutions, insertions, deletions and structural variations from the norm in individual genomes.
This modern explosion of data-centric and data-dependent applications requires new memory and storage technology, interfaces and software stacks. Researchers are working to bring the benefits of whole genome sequencing, for example, to more patients for wider research and development. “The key point here is that mass markets cannot rely on supercomputing,” says Thapar. “For a mass market like health care, you have to have very efficient ways to analyze the datasets. That’s why OmniTier made it a focus: How can we support this precision medicine initiative?”
Tiered Memory and Off-the-Shelf Servers
OmniTier has announced and is beta testing its CompStor Novos®, a memory-centric computer cluster solution for full DNA sequencing using “de novo” assembly techniques. De novo means assembling a whole genome from scratch, without a reference. Assembly (piecing together many short DNA fragments to reconstruct a longer sequence) is one of the first steps of DNA analysis. The standard approach follows a DNA template, typically the reference human genome sequence described above. But that approach tends to hide places where the patient’s personal genome has variants, which is key data to investigate in predictive medicine. Because de novo assembly uses no template, it is particularly useful in detecting structural variants.
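To make the idea of template-free assembly concrete, here is a minimal sketch of a greedy overlap assembler, a classic textbook technique. It is not OmniTier’s proprietary method, and real assemblers must also handle sequencing errors, repeats and billions of reads. The function names, the minimum overlap of 3 and the toy reads are all assumptions made for illustration.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of fragments with the largest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            return contigs  # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for k, c in enumerate(contigs)
                   if k not in (best_i, best_j)] + [merged]

# Toy fragments that overlap by three bases each; no reference sequence is used.
reads = ["ACCTGA", "ATGCGT", "CGTACC"]
print(greedy_assemble(reads))
```

Because the fragments are joined purely by their overlaps, any variant carried in the reads shows up in the assembled sequence rather than being forced to match a template.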
The company also developed a hardware-software solution for life sciences that overcomes the limitations of today’s compute paradigm, which is riddled with memory bottlenecks. These slowdowns contribute to low application performance, higher server power consumption and bigger space requirements. These and other inefficiencies add cost, which is the primary barrier to a system’s mass availability.
To avoid these bottlenecks, the CompStor assembly implementation uses OmniTier’s proprietary tiered-memory algorithm. The algorithm does not treat all information as equally time critical, so data can be accessed at different rates depending on how urgently it is needed. That helps researchers work faster and more efficiently.
OmniTier’s novel algorithms and dataflow optimize the multithreaded workload on each data center server. CompStor Novos achieves performance comparable to a high-capacity, all-DRAM memory subsystem by using two tiers of memory: DRAM and NAND flash NVMe™ solid-state drives (SSDs), which are more affordable and offer larger capacity. OmniTier is working directly with Micron to explore potential collaboration opportunities, and received an investment from Micron Ventures to help create value from new compute architectures and applied AI and machine learning solutions.
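The general principle of tiering can be sketched in a few lines. OmniTier’s algorithm is proprietary and not described here, so the example below swaps in a simple least-recently-used (LRU) policy as a stand-in: a small fast tier (playing the role of DRAM) holds recently used items, and everything else is demoted to a large slow tier (playing the role of an NVMe SSD). The class name `TwoTierStore` and all of its methods are hypothetical.

```python
from collections import OrderedDict

class TwoTierStore:
    """Toy key-value store: a small fast tier backed by a large slow tier."""

    def __init__(self, fast_capacity: int):
        self.fast_capacity = fast_capacity
        self.fast = OrderedDict()  # stands in for DRAM (ordered oldest to newest)
        self.slow = {}             # stands in for an NVMe SSD
        self.fast_hits = 0
        self.slow_hits = 0

    def put(self, key, value):
        self.slow.pop(key, None)   # keep exactly one copy of each key
        self.fast[key] = value
        self.fast.move_to_end(key)
        self._evict()

    def get(self, key):
        if key in self.fast:
            self.fast_hits += 1
            self.fast.move_to_end(key)   # mark as recently used
            return self.fast[key]
        value = self.slow.pop(key)       # raises KeyError for unknown keys
        self.slow_hits += 1
        self.fast[key] = value           # promote data that is needed again
        self._evict()
        return value

    def _evict(self):
        # Demote the least recently used items once the fast tier is full.
        while len(self.fast) > self.fast_capacity:
            old_key, old_value = self.fast.popitem(last=False)
            self.slow[old_key] = old_value

store = TwoTierStore(fast_capacity=2)
for name in ("a", "b", "c"):
    store.put(name, name.upper())
store.get("a")                           # served from the slow tier, then promoted
print(sorted(store.fast), sorted(store.slow), store.slow_hits)
```

The design choice worth noting is that only time-critical data occupies the expensive tier. A production system would decide placement from knowledge of the workload rather than from recency alone.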
Novos assembly is more precise and 10 to 20 times faster than existing assembly algorithms. Experiments also show that using the OmniTier algorithm and appliances, rather than working only through the host CPU, can cut energy use by up to a factor of three for some applications. “Researchers can now perform de novo genome assembly on organisms in a fraction of the time and cost, compared to standard assemblers,” says Thapar. “The reduced time to diagnostics for mutated DNAs and diseases can benefit patients and health care practitioners.”
As Fast as a Supercomputer
How fast is it? The Human Genome Project took 13 years on supercomputers. OmniTier’s CompStor assembly solution reduces genome sequencing to about eight minutes using commercial off-the-shelf (COTS) servers with tiered memory comprising DRAM and NVMe SSDs, along with proprietary algorithms and dataflow across the disparate memory types.
In tests using short-read, next-generation sequencing data on eight CompStor assembly server nodes, de novo assembly of a human genome on COTS servers matched the assembly time previously achieved with an advanced supercomputer.
Accelerating Precision Medicine
The goals of precision medicine are to help medical professionals better treat disease and to improve patient outcomes. A genome sequencing solution for the masses has to be affordable, scalable and deployable on premises or in the cloud. This solution’s success depends on memory. “We are using hardware solutions, but we are really coming from the memory point of view,” said Thapar. “We are bringing our knowledge of those different memory technologies like solid-state drives or hard-disk drives or other alternative technologies to bear to solve these particular problems.”
Health care and life science informatics demand high performance, especially when neural networks must process multiomic models, such as cross-indexing huge datasets for genomics, environment and lifestyle to identify the personalized treatment with the best outcome. “By delivering near supercomputing performance with low cost, we’re making whole genome sequencing more accessible to patients who struggle with diagnosis and treatment and to the researchers helping them,” said Thapar.
Tiering memory within a system to gain performance efficiencies is one way that Micron memory and storage solutions are being used to further precision medicine. Visit Micron.com/Insight to learn about other ways Micron is transforming how the world uses information to enrich life.