The Lustre parallel file system has been widely adopted by high-performance computing (HPC) centers as an effective system for managing large-scale storage resources. Lustre achieves unprecedented aggregate performance by parallelizing I/O over file system clients and storage targets at extreme scales. Today, 7 of the 10 fastest supercomputers in the world use Lustre for high-performance storage. To date, Lustre development has focused on improving the performance and scalability of large-scale scientific workloads. In particular, large-scale checkpoint storage and retrieval, which is characterized by bursty I/O from coordinated parallel clients, has been the primary driver of Lustre development over the last decade. With the advent of extreme-scale computing and Big Data computing, many HPC centers are seeing increased user interest in running diverse workloads that place new demands on Lustre.
In early March of 2016, the 2nd International Workshop on the Lustre Ecosystem was held in Hanover, Maryland. This workshop series is intended to help explore improvements in the performance and flexibility of Lustre for supporting diverse application workloads. The first workshop, held in March 2015, helped initiate a discussion on the open challenges associated with enhancing Lustre for diverse applications, the technological advances necessary, and the associated impacts to the Lustre ecosystem. This year's program featured a day of tutorials and a day of technical papers and invited presentations, with a keynote leading off each day.
The workshop kicked off on March 8th with a keynote speech from Brent Gorda, General Manager for HPC Storage in the High Performance Computing Group at Intel. The keynote, titled “HPC Storage Futures – a 5 year outlook”, provided a great starting point for the workshop by discussing new storage hardware technologies that are having a major impact on the design of next-generation scalable storage systems for commercial and HPC systems, and Intel’s vision for how Lustre will continue to play a critical role in future storage system architectures.
After the first day's keynote, a series of tutorials on Lustre administration, management, and monitoring approaches was presented by experts from BP, Indiana University, Lawrence Livermore National Laboratory, the National Institute for Computational Sciences (NICS), and the Oak Ridge Leadership Computing Facility (OLCF) at ORNL. The OLCF tutorial, presented by HPC Unix/Storage system administrator Jesse Hanley, covered OLCF's recent use and analysis of Lustre JobStats data, which provides a profile of I/O on a per-job basis. The first day concluded with a discussion panel, moderated by Sarp Oral of OLCF, in which the tutorial and keynote presenters were asked additional questions about how Lustre can adapt to new, diverse workloads and about current areas of concern on which the community should take action.
On March 9th, a keynote from Cory Spitz, Lead Developer of Storage R&D at Cray, started the day. The keynote, titled “Making Lustre Data-Aware”, discussed the challenges Lustre faces in adapting to new storage hardware technologies and data-focused workloads, and promoted a renewed community-led effort to define how Lustre will prioritize and address those challenges.
After the second day’s keynote, a collection of speakers from academia, industry, and national laboratories gave technical presentations and invited, special topics talks. The technical presentations covered a wide range of topics, including evaluating new Lustre features like Distributed Namespace (DNE) and Progressive File Layout (PFL), synchronization of petabyte-scale filesystems, long-haul file transfers involving Lustre, and a new approach for distributed file recovery in Lustre. The special topics talks focused on Lustre security and a comparison of Infiniband versus Ethernet as the underlying network transport for Lustre.
Presentation content for the keynotes, tutorials, invited talks, and technical presentations can be found at http://lustre.ornl.gov/ecosystem-2016/agenda.html
The workshop was organized by ORNL and sponsored by the Computational Research and Development Programs at ORNL, funded by the US Department of Defense. The workshop program co-chairs were Neena Imam, Michael Brim, and Sarp Oral. Richard Mohr of the University of Tennessee (NICS) served as the tutorials chair.
Changes in storage hardware are coming, and HPC is going to benefit in a major way. Intel and others are working on solid-state solutions that will disrupt how we view the storage hierarchy and bring massive changes in both performance and capacity nearer the CPU. As the compute-centric world of HPC starts to care about energy use, data movement, and efficiency, the storage community will see more of the limelight (read: resources) to help the overall system design. Lustre remains a critical technology for storage in this future, but it will be augmented with on-system hardware and a number of (object-oriented) software interfaces. The future of storage for HPC is about to undergo change for the better.
Brent Gorda is the General Manager for HPC Storage in the High Performance Computing Group within Intel. Brent co-founded and led Whamcloud, a startup focused on the Lustre technology, which was acquired by Intel in 2012. A long-time member of the HPC community, Brent was at Lawrence Livermore National Laboratory, where he was responsible for the BlueGene P/Q architectures as well as many of the large IB-based clusters in use at the Tri-Labs. Previously, Brent ran the Future Technologies Group at Lawrence Berkeley National Laboratory, and he has a long history with the National Energy Research Scientific Computing Center (NERSC).
Marc Stearman, Lawrence Livermore National Laboratory
From the beginning, Lustre used the Linux ext file system as the building block for its backend storage. As time went on, it became desirable to have a more robust, feature-rich file system underneath Lustre. ZFS is a combined file system, logical volume manager, and RAID engine with extreme scalability. With Lustre 2.4 and beyond, ZFS adds an additional OSD layer to Lustre. This tutorial will focus on how to configure, install, tune, and monitor a ZFS-based Lustre file system, pointing out some of the differences between ZFS and ldiskfs along the way.
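The basic workflow the tutorial covers can be sketched as follows. This is a minimal illustration, not the tutorial's exact configuration: the device names, pool layout, filesystem name, and MGS NID below are all assumptions.

```shell
# Sketch: back a Lustre OST with ZFS (Lustre 2.4+).
# Device names, pool/dataset names, and the MGS NID are illustrative.

# Create a RAID-Z2 pool from four example devices:
zpool create -o ashift=12 ostpool raidz2 /dev/sdb /dev/sdc /dev/sdd /dev/sde

# Format a ZFS dataset as a Lustre OST (note --backfstype=zfs):
mkfs.lustre --ost --backfstype=zfs --index=0 \
    --fsname=lustre1 --mgsnode=10.0.0.1@tcp ostpool/ost0

# Mount the OST to bring it online:
mount -t lustre ostpool/ost0 /mnt/lustre/ost0

# Monitoring differs from ldiskfs: pool health and I/O statistics come
# from the zpool tools rather than e2fsprogs.
zpool status ostpool
zpool iostat ostpool 5
```

One practical difference from ldiskfs this highlights: integrity checking and device management move from e2fsprogs into the ZFS layer, so familiarity with the `zpool`/`zfs` toolchain becomes part of day-to-day Lustre administration.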
Jesse Hanley, Oak Ridge National Laboratory, Oak Ridge Leadership Computing Facility
An environment such as OLCF can have hundreds to thousands of jobs simultaneously accessing large-scale parallel file systems. Determining the impact of a single job on a shared resource can be a time-consuming and difficult endeavor. This tutorial focuses on OLCF's investigation and initial deployment of Lustre Job Stats. We cover our progress to date, the analysis environment, and the useful information we can derive from gathered metrics.
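The per-job profiling the tutorial describes is exposed through Lustre's standard JobStats parameters. A minimal sketch of enabling and reading them, assuming a Slurm-scheduled site (the filesystem name and scheduler variable are illustrative):

```shell
# Sketch: enable and read Lustre JobStats.
# Tag client RPCs with the scheduler's job ID (run on the MGS);
# SLURM_JOB_ID is the Slurm case, other schedulers use their own variable.
lctl conf_param lustre1.sys.jobid_var=SLURM_JOB_ID

# On an OSS, dump per-job read/write statistics for each OST:
lctl get_param obdfilter.*.job_stats

# On an MDS, dump per-job metadata operation counts (open, close, mkdir, ...):
lctl get_param mdt.*.job_stats
```

Each `job_stats` entry is keyed by the job ID, which is what allows a shared file system's aggregate load to be attributed back to individual jobs, as the tutorial discusses.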
Shawn Hall and Kent Blancett, BP
"Lustre is a piece of cake," said no one ever. Yet many of us have a necessary addiction. This talk gives an industry perspective on using Lustre and the methods we use to cope with unlimited work in a finite amount of time. Topics include oil and gas applications on Lustre, using Lustre as a site-wide filesystem, Lustre administration and monitoring, pain points, and lessons learned.
Stephen Simms, Indiana University
Administration of a site-wide file system can be challenging. In most cases, the file system will be mounted on resources that are in your immediate control. However, in some instances, that file system could be mounted in another department or laboratory outside of your direct control. UID mapping allows a file system to span administrative domains while providing consistent ownership and permissions for all files. Suppose you would like to protect your filesystem against a man in the middle attack and verify that the incoming file system traffic is coming from a specific laboratory. Furthermore, you would like to encrypt the ePHI data coming from that lab. Shared Key Crypto provides Lustre with an option to verify the integrity of data passed between client and server as well as an option that will encrypt that data in addition to verifying its integrity. This tutorial will detail how to setup and use UID mapping and shared key crypto in your Lustre environment.
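The UID mapping side of this tutorial is built on Lustre's nodemap feature. A minimal sketch of mapping a remote lab's identities, in which the nodemap name, NID range, and ID offsets are all illustrative assumptions:

```shell
# Sketch: nodemap-based UID/GID mapping for clients in a remote
# administrative domain. Names, the NID range, and ID values are examples.

# Create a nodemap for the remote lab and assign its client NID range:
lctl nodemap_add remotelab
lctl nodemap_add_range --name remotelab --range 192.168.1.[2-50]@tcp

# Map the lab's local UID/GID 500 to UID/GID 11000 on this file system:
lctl nodemap_add_idmap --name remotelab --idtype uid --idmap 500:11000
lctl nodemap_add_idmap --name remotelab --idtype gid --idmap 500:11000

# Deny root/admin privileges to clients in this nodemap:
lctl nodemap_modify --name remotelab --property admin --value 0

# Turn the nodemap feature on globally:
lctl nodemap_activate 1
```

Unmapped identities are squashed, which is what lets a single file system span administrative domains while keeping ownership and permissions consistent.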
Richard Mohr, University of Tennessee, National Institute for Computational Sciences
Because Lustre is open-source, it is a natural fit for sites looking for a high performance file system without the high price tag. However, even running free software entails some operational costs. Deploying a reliable, fast parallel file system requires an understanding of those costs. This is particularly important for small sites working with limited hardware and staff. This tutorial will discuss some of the costs that must be considered when deploying Lustre and how trade-offs can be made among resources. Lustre operational experiences at the National Institute for Computational Sciences (NICS) will be used to illustrate some of these concepts.
Technologies that will breathe new life into storage systems are coming soon, but it isn't clear how they can best be leveraged. In fact, it isn't entirely evident that Lustre will be the software technology that will best take advantage of these new technologies. There are several problems that need careful consideration. One basic problem is the hardware organization of new systems. It is obvious that simply dropping in solid-state components with the same approach to deployment and software isn't going to work. Concepts such as burst buffers, which typically distribute storage components across and throughout a system, are being tested, but the ideas and execution of such methods are far from perfected. Other obstacles must also be taken into account. Perhaps more importantly, we also need to consider the new requirements of shared storage for diverse workloads. If our data is more broadly distributed across these components, we need to think about enabling the seamless movement of and access to that data. HPC power users could make the investment to carefully marshal their data to optimize their own personal workflows. However, I doubt that we should expect technical computing users, data scientists, or others to make the same investment. Today, Lustre doesn't provide the framework, the tools, or the technology to easily access broadly distributed data. To meet these kinds of expectations, Lustre must move on from so-called scratch storage and become both data aware and data-placement aware. Permanence and provenance of data will then be absolute requirements. While it is not yet clear how we can begin to solve these problems in the Lustre ecosystem, it is evident that we must begin to find solutions. If we abandon Lustre altogether, we risk starting over and setting ourselves further back. Our codes and workflows will likely be less portable as different providers and different vendors begin trying different emerging solutions.
However, Lustre is battle-tested, and it will take new technology years to catch up. Consequently, a better path forward is to adopt solutions that have been successfully demonstrated, evolve Lustre to incorporate the best of them, and thus ready Lustre to meet the needs of the future. If Lustre is a (sledge) hammer, we need to evolve it into a Swiss army knife (with a hammer).
Cory Spitz leads a development team in Cray's Storage R&D group. Cory is a graduate of Michigan State University and has been with Cray Inc. for fifteen years, working with Lustre and storage for roughly half that time. He has been an advocate of Lustre and an involved member of OpenSFS since its outset, and he is particularly active in OpenSFS working groups, recently serving as the interim lead of the Lustre Working Group. Cory is working at Cray and through OpenSFS to create a strong Lustre development community that will contribute to successful Lustre community releases. Cory works in St. Paul and lives in Minneapolis with his wife and two boys.
Building a More Secure Lustre System using Open Source Tools
Josh Judd (WARP Mechanics)
Abstract: There are a wide range of areas in which Lustre needs to be made more secure. These range from relatively simple steps to secure the base OS, to much more complex and comprehensive multi-layer security appliance solutions. This talk will cover WARP Mechanics' efforts to build an entirely open secure appliance, rather than locking community members into proprietary code. It will cover where these efforts stand today, and where they are going moving forward. Since the topic of "security as a whole" is too large to cover in depth, WARP will also present one specific aspect of secure Lustre as a deeper technical case study: how to create black and white lists for Lustre client NIDs without lowering performance or stability. This will illustrate how other needed open source elements can be linked in to form a complete solution.
Lustre Networking Technologies: Ethernet vs. Infiniband
Blake Caldwell (Oak Ridge National Laboratory, OLCF)
Abstract: While Lustre operates over numerous different network transports, the majority of performance studies and deployment best practices have focused on the Infiniband implementation. However, running Lustre over the TCP Ethernet transport driver offers distinct advantages with regards to fault tolerance and in lowering the barrier to entry for deploying Lustre. This presentation addresses the pros and cons of each in the context of a Lustre networking infrastructure, and summarizes the knowledge gained from tuning Lustre over Ethernet for the Spallation Neutron Source at Oak Ridge National Laboratory. Our findings from this study were that Lustre workloads with limited parallelism can achieve excellent throughput on Ethernet if the effects of TCP congestion control are mitigated. From attending this talk, the audience can expect a guided exploration of the relative technical merits between the Infiniband and Ethernet Lustre networking implementations.
Lustre Security - Today and in the Future
Andreas Dilger (Intel)
Abstract: A number of new security-related features have recently been implemented for Lustre or are currently in active development. The presentation will describe the various features that have landed in the Community Lustre 2.8 release or are planned to land in the upcoming Community Lustre 2.9 release. These features include SELinux Mandatory Access Controls (MACs) to ensure that access to files is secure by default. Secure network communication and client authentication via Kerberos or Shared-Key cryptography protects data on the network and ensures that clients are positively identified. Sub-directory mounts allow the client to mount a subdirectory of the filesystem, rather than the filesystem root. UID/GID mapping allows the MDS to map user and group identities from a remote administrative domain based on client node authentication, and to squash access to files with identifiers not in the identity mapping. Each of these features is useful independently, but in combination they could provide the framework to allow an administrator to completely isolate subsets of the filesystem namespace from each other. This would allow Lustre to be used in secure environments to segregate files with different classification levels, as well as in multi-tenant hosted or virtual environments where different users have complete control of the clients but should not be able to access files of other users. Finally, this presentation will identify the work remaining to integrate these features in order to achieve these security goals, such as restricting subdirectory mounts to specific client nodemaps, which will also be described.
Evaluating Progressive File Layouts for Lustre
Richard Mohr (University of Tennessee - NICS),
Michael Brim and Sarp Oral (Oak Ridge National Laboratory),
Andreas Dilger (Intel)
Abstract: Progressive File Layout (PFL) is a Lustre feature currently being developed to support multiple striping patterns within the same file. A file can be created with several non-overlapping extents, and each extent can have different Lustre striping parameters. In this paper, we discuss evaluation results for an early PFL prototype implementation. Results for metadata and streaming I/O tests showed comparable performance between standard Lustre striping and PFL striping. When compared to synthetic dynamic striping, PFL files had better streaming I/O performance. Additionally, object placement tests showed that a single PFL layout could be used for files of widely varying sizes while still producing an object distribution equivalent to customized stripe layouts.
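The "several non-overlapping extents, each with different striping parameters" idea maps onto the composite-layout syntax that `lfs setstripe` gained once PFL shipped (Lustre 2.10+; the paper evaluates an earlier prototype). The extent boundaries and stripe counts below are illustrative assumptions:

```shell
# Sketch: a single PFL layout covering files of widely varying sizes.
# Extent boundaries and stripe counts are illustrative examples.
#
#   first 4 MiB        -> 1 stripe  (small files stay on one OST)
#   4 MiB to 256 MiB   -> 4 stripes (medium files gain parallelism)
#   beyond 256 MiB     -> stripe across all available OSTs (-c -1)
lfs setstripe -E 4M -c 1 -E 256M -c 4 -E -1 -c -1 /mnt/lustre/pfl_file

# Inspect the resulting component layout:
lfs getstripe /mnt/lustre/pfl_file
```

Because later components are only instantiated as the file grows past each extent boundary, one such layout can serve both small and large files, which is the object-placement result the paper reports.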
Parallel Synchronization Of Multi-Pebibyte File Systems
Andy Loftus (National Center for Supercomputing Applications)
Abstract: When challenged to find a way to migrate an entire file system onto new hardware while maximizing availability and ensuring exact data and metadata duplication, NCSA found that existing file copy tools were insufficient to accomplish the task in a timely manner. So they set out to create a new tool, one that would scale to the limits of the file system and provide a robust interface to adjust to the dynamic needs of the cluster. The resulting tool, Psync (Parallel Sync), effectively manages many syncs running in parallel. As a synchronization tool, it correctly duplicates both the contents and metadata of every inode from the source file system to the target file system, including special files, such as soft and hard links, and additional metadata, such as ACLs and Lustre stripe information. Psync supports dynamic scalability features to add and remove worker nodes on the fly, as well as robust management capabilities that allow an entire run to be paused and unpaused or stopped and restarted. Psync has been run successfully on hundreds of nodes with multiple processes each. Spreading the load of synchronization across thousands of processes results in relatively short runtimes for an initial copy and even shorter runtimes for successive runs (to update a previous sync with new and changed files). This paper presents the overall design of Psync and its use as a general-purpose tool for copying large amounts of data as quickly as possible.
Measurements of File Transfer Rates Over Dedicated Long-Haul Connections
Nageswara Rao, Greg Hinkel, and Neena Imam (Oak Ridge National Laboratory),
Bradley Settlemyer (Los Alamos National Laboratory)
Abstract: Wide-area file transfers are an integral part of several High-Performance Computing (HPC) scenarios. Dedicated network connections with high capacity, low loss rates, and low competing traffic are increasingly being provisioned over current HPC infrastructures to support such transfers. To gain insights into these file transfers, we collected transfer rate measurements for Lustre and xfs file systems between dedicated multi-core servers over emulated 10 Gbps connections with round trip times (RTT) in the 0-366 ms range. Memory transfer throughput over these connections is measured using iperf, and file I/O throughput on host systems is measured using xddprof. We consider two file system configurations: Lustre over an IB network, and xfs over SSD connected to the PCI bus. Files are transferred using xdd across these connections, and the transfer rates are measured, which indicates the need to jointly optimize the connection and host file I/O parameters to achieve peak transfer rates. In particular, these measurements indicate that (i) peak file transfer rate is lower than peak connection and host I/O throughput, in some cases by as much as 50%, (ii) xdd request sizes that achieve peak throughput for host file I/O do not necessarily lead to peak file transfer rates, and (iii) parallelism in host I/O and TCP transport does not always improve the file transfer rates.
Distributed Name Space (DNE) Evaluation at the Oak Ridge Leadership
Computing Facility (OLCF)
James S. Simmons, Dustin Leverman, Jesse Hanley, and Sarp Oral (Oak Ridge National Laboratory)
Abstract: This paper describes the Lustre Distributed Name Space (DNE) evaluation carried out at the Oak Ridge Leadership Computing Facility (OLCF) between 2014 and 2015. DNE is a development project, funded by OpenSFS, to improve Lustre metadata performance and scalability. The development effort has been split into two parts: the first part (DNE P1) provides support for remote directories over remote Lustre Metadata Server (MDS) nodes and Metadata Target (MDT) devices, while the second phase (DNE P2) addresses directories split over multiple remote MDS nodes and MDT devices. OLCF has been actively evaluating the performance, reliability, and functionality of both DNE phases. For these tests, an internal OLCF testbed was used. Results are promising, and OLCF is planning a full DNE deployment on production systems in the mid-2016 timeframe.
Distributed File Recovery on the Lustre Distributed File System
Justin Albano, Remzi Seker, and Radu Babiceanu (Embry-Riddle Aeronautical University),
Sarp Oral (Oak Ridge National Laboratory)
Abstract: With the advancement of cloud-computing technologies and the growth in distributed software applications, a great deal of research has focused on the concepts and implementations of distributed file systems to support these applications. Since its inception in 1999 by Peter Braam at Carnegie Mellon University, the Lustre distributed file system has gained both the technical and financial interest of some of the largest technology entities, including Oracle, Seagate, Intel, Oak Ridge National Laboratory, and OpenSFS. With this immense backing, Lustre has been incorporated in over 60% of the TOP100 high-performance computers in the world and is slated to significantly increase this market share. Although the Lustre file system itself has seen a sharp increase in research since its infancy, support for many of the fields surrounding the file system has been greatly lacking. Primary among these deficiencies is file recovery on the Lustre file system. This paper attempts to fill this gap and provides a simplified solution, which is then developed into a distributed solution that can scale to meet the needs and requirements of various sizes of Lustre file system deployments. While this paper focuses on the Lustre file system, the concepts and solution provided here can be used on any similar metadata-based distributed file system. Although this paper does not provide an implementation of this solution, a complete solution architecture is provided, enabling further research and implementation.