Get Parallel – Extreme Performance for Extremely Big Data

Brent Welch

CTO

In 2008 the first supercomputer broke the petascale barrier, exceeding 10¹⁵ operations per second. The Panasas parallel file system was instrumental in reaching that milestone, delivering extreme performance for very large data sets. Los Alamos National Laboratory (LANL) built that first petascale system, “Roadrunner,” using the Panasas parallel file system to maximize I/O performance for its most demanding simulation and modeling applications. Roadrunner still ranks among the Top 10 supercomputing systems in the world, and “Cielo,” also at LANL, is number six on the TOP500 list of the world’s supercomputers. Cielo also uses the Panasas file system, executing parallel programs across hundreds of thousands of compute cores and running non-stop for months at a time.

Gary Grider, deputy director of programs at LANL, has said: “[T]his requires our Panasas storage systems to deal with situations like 100,000 processes all trying to create a file in a directory all within a few milliseconds of one another and 100,000 processes all trying to write to 100,000 files or possibly to one file, all at the same time. So we have to sink 100 terabytes and typically we have to do that in a few minutes time (like 5-10). So we are talking about 100 to 200 gigabytes/second [performance from the file system]. So you have to organize thousands of disk systems in parallel to handle this load.”
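The arithmetic behind that quote is easy to check. The short Python sketch below works out the aggregate bandwidth needed to sink 100 terabytes in 5 to 10 minutes; the per-disk throughput figure is an illustrative assumption, not a measured number.

    # bandwidth.py - back-of-the-envelope check of the numbers in the
    # quote: sinking 100 terabytes of checkpoint data in 5 to 10 minutes.

    TOTAL_BYTES = 100e12                     # 100 TB (decimal)

    for minutes in (5, 10):
        gb_per_s = TOTAL_BYTES / (minutes * 60) / 1e9
        print(f"{minutes} min -> {gb_per_s:.0f} GB/s aggregate")

    # Prints: 5 min -> 333 GB/s, 10 min -> 167 GB/s; the 10-minute case
    # lands squarely in the 100-200 GB/s range quoted above. Assuming
    # roughly 100 MB/s of sustained write bandwidth per disk (an
    # illustrative value), the 10-minute case alone needs on the order
    # of 1,700 disks writing at once, hence "thousands of disk systems
    # in parallel".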

If you read the Wikipedia definition of big data, it quickly becomes clear that Panasas is one of the few storage vendors whose product is actually built to handle the workloads big data brings:

“Big data is a term applied to data sets whose size is beyond the ability of commonly used software tools to capture, manage, and process the data within a tolerable elapsed time. [Emphasis ours.] Big data sizes are a constantly moving target, currently ranging from a few dozen terabytes to many petabytes of data in a single data set.”

By contrast, other storage vendors touting big data use standard protocols such as NFS or CIFS in a very conventional way: clustered NAS filer heads, usually talking to traditional block storage devices managed by hardware RAID controllers. A system storing tens of terabytes hardly fits the definition above. As Gary Grider’s numbers show, individual data sets within a big data workload run to tens of terabytes, and the storage system has to support large numbers of them efficiently, with total problem sizes measured in petabytes.
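To make that workload concrete, here is a minimal sketch of the N-to-1 pattern Grider describes, where every process writes its own disjoint region of one shared file. It assumes mpi4py and an MPI runtime; the file name and chunk size are illustrative, and any MPI-IO binding would serve equally well.

    # n_to_1_write.py - minimal sketch of an N-to-1 checkpoint write:
    # every MPI rank writes a disjoint region of one shared file.
    # Assumes mpi4py; the file name and chunk size are illustrative.
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()

    CHUNK = 8 * 1024 * 1024              # 8 MiB of state per process
    buf = bytearray(CHUNK)               # stands in for real simulation state

    fh = MPI.File.Open(comm, "checkpoint.dat",
                       MPI.MODE_WRONLY | MPI.MODE_CREATE)

    # Each rank writes at its own byte offset; the collective call lets
    # the MPI-IO layer and the file system coordinate the parallel I/O.
    fh.Write_at_all(rank * CHUNK, buf)
    fh.Close()

Run it with, for example, "mpirun -n 8 python n_to_1_write.py". At LANL scale the same pattern runs across 100,000 ranks, which is exactly the load a single NAS filer head cannot absorb.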

To process big data sets well, applications and systems must be capable of massive parallelism, and the underlying storage system must be architected for parallelism too. Simply providing client-side support for the pNFS protocol will not deliver the performance achieved by extreme big data systems like Roadrunner and Cielo.
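As a rough picture of what “architected for parallelism” means, the sketch below implements a generic round-robin striping map: a file’s bytes are divided into fixed-size stripe units spread across many storage nodes, so a large read or write fans out to all of them at once. The parameters are assumptions for illustration; this is not the actual Panasas on-disk layout.

    # striping.py - generic round-robin striping map (illustrative only,
    # not the actual Panasas layout). A large transfer touches every
    # node in the stripe, so bandwidth scales with the node count.

    STRIPE_UNIT = 64 * 1024       # bytes per stripe unit (assumed)
    STRIPE_WIDTH = 8              # storage nodes in the stripe (assumed)

    def locate(offset):
        """Map a file byte offset to (storage_node, offset_within_node)."""
        unit = offset // STRIPE_UNIT       # which stripe unit overall
        node = unit % STRIPE_WIDTH         # round-robin across nodes
        row = unit // STRIPE_WIDTH         # full stripes before this one
        return node, row * STRIPE_UNIT + offset % STRIPE_UNIT

    if __name__ == "__main__":
        # A 1 MiB write covers 16 stripe units and lands on all 8 nodes,
        # so it can proceed at up to 8x the bandwidth of a single node.
        for off in range(0, 1024 * 1024, STRIPE_UNIT):
            print(off, "->", locate(off))

A clustered NAS head funnels that same transfer through one controller; a parallel file system lets clients reach all the nodes in the map directly.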

Panasas has been instrumental in enabling parallelism in scientific discovery, dramatically improving the ability to deliver extreme performance for big data. That breakthrough in core scientific research is now becoming the standard for the next generation of commercial applications and systems. Just recently, pNFS, the parallel file system standard pioneered by Dr. Garth Gibson at Panasas, was accepted into the mainline Linux kernel.

To deliver on the performance promise of petascale, a storage system must be designed from the ground up to support massive parallelism; it cannot be bolted onto a traditional architecture. Panasas parallel storage products have been solving these big data problems since 2004.