2nd International Workshop on The Lustre Ecosystem:
Enhancing Lustre Support for Diverse Workloads



March 8, 2016 - Keynote & Tutorials

8:00am – Registration and breakfast

8:30 - 8:45am: Welcome and Introductions

Neena Imam - Oak Ridge National Laboratory

8:45 - 9:45am: Keynote Address: HPC Storage Futures - a 5-year outlook (Brent Gorda - Intel, General Manager HPC Storage)

Changes in storage hardware are coming, and HPC is going to benefit in a major way. Intel and others are working on solid-state solutions that will disrupt how we view the storage hierarchy and bring massive changes in both performance and capacity nearer the CPU. As the compute-centric world of HPC starts to care about energy use, data movement, and efficiency, the storage community will see more of the limelight (read: resources) to help the overall system design. Lustre remains a critical technology for storage in this future, but it will be augmented with on-system hardware and a number of (object-oriented) software interfaces. The future of storage for HPC is about to change for the better.

9:45 - 10:00am: Morning Break

10:00 - 11:00am: Tutorial 1: Installing, Tuning, and Monitoring a ZFS based Lustre file system (Marc Stearman, Lawrence Livermore National Laboratory)

From the beginning, Lustre used the Linux ext file system as the building block for its backend storage. As time went on, it became desirable to have a more robust, feature-rich file system underneath Lustre. ZFS is a combined file system, logical volume manager, and RAID engine with extreme scalability. With Lustre 2.4 and beyond, ZFS is supported as an additional OSD layer for Lustre. This tutorial will focus on how to configure, install, tune, and monitor a ZFS-based Lustre file system, pointing out some of the differences between ZFS and ldiskfs along the way.
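As a flavor of what the tutorial covers, here is a hedged sketch of formatting and mounting a ZFS-backed OST. All pool, device, filesystem, and node names are hypothetical, and the exact steps for a given site will differ:

```shell
# Hypothetical names throughout; assumes Lustre 2.4+ built with ZFS support.
# mkfs.lustre creates the zpool and its backing dataset in one step.
mkfs.lustre --ost --backfstype=zfs --fsname=demo --index=0 \
    --mgsnode=mgs@tcp0 ostpool/ost0 raidz2 /dev/sdb /dev/sdc /dev/sdd

# Mount the OST just like an ldiskfs target.
mount -t lustre ostpool/ost0 /mnt/lustre/ost0

# A ZFS property commonly tuned on Lustre OST pools:
zfs set atime=off ostpool
```

Unlike ldiskfs, there is no separate volume-manager or RAID layer here: the raidz2 vdev gives ZFS both roles, which is part of the robustness argument the tutorial examines.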

Bio: Marc Stearman leads the operational team for the parallel file systems at Lawrence Livermore National Laboratory, which manages more than 10 file systems totaling approximately 100 PB. He has been working on Lustre system administration for over a decade and is an avid fan of Star Wars, LEGO, and craft beer.

11:00 - 12:00pm: Tutorial 2: Lustre Job Stats Metric Aggregation at OLCF (Jesse Hanley and Rick Mohr, Oak Ridge National Laboratory/Oak Ridge Leadership Computing Facility and University of Tennessee/National Institute for Computational Sciences)

An environment such as OLCF can have hundreds to thousands of jobs simultaneously accessing large scale parallel file systems. Determining the impact of a single job on a shared resource can be a time-consuming and difficult endeavor. This tutorial focuses on OLCF's investigation and initial deployment of Lustre jobstats. We cover our progress to date, the current analysis environment, and the useful information we can derive from the gathered metrics.
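For reference, the basic jobstats plumbing looks roughly like the following. The scheduler variable shown is one common choice, not necessarily what OLCF uses:

```shell
# On clients: tag Lustre RPCs with the scheduler's job ID
# (SLURM shown as an example).
lctl set_param jobid_var=SLURM_JOB_ID

# On servers: read per-job I/O counters from OSTs and MDTs.
lctl get_param obdfilter.*.job_stats
lctl get_param mdt.*.job_stats

# Counters can be cleared between collection intervals.
lctl set_param obdfilter.*.job_stats=clear
```

An aggregation pipeline of the kind the tutorial describes periodically scrapes these per-target counters and attributes the I/O back to individual jobs.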

Bio: Jesse Hanley is a member of the storage team of the HPC Operations group at the OLCF. In addition to assisting with the administration and maintenance of the Spider file system, he works on SION (OLCF's Scalable I/O Network), and on monitoring of storage resources.

Bio: Richard Mohr runs Lustre file systems and other storage resources for the National Institute for Computational Sciences at the University of Tennessee. While finishing his Ph.D. in Physics at the Ohio State University, he became enamored with high performance computing. He has been working in the HPC field for the past 15 years and with Lustre for the past 6 years.

12:00 - 1:00pm: Working Lunch - Discussion and Feedback on Morning Tutorials

1:00 - 2:00pm: Tutorial 3: A 12 Step Program for Lustre Filesystem Addiction (Shawn Hall and Kent Blancett, BP)

"Lustre is a piece of cake," said no one ever. Yet many of us have a necessary addiction. This talk gives an industry perspective on using Lustre and the methods we use to cope with unlimited work in a finite amount of time. Topics include oil and gas applications on Lustre, using Lustre as a site-wide filesystem, Lustre administration and monitoring, pain points, and lessons learned.

Bio: Kent's experience has ranged from departmental computing to architecting, integrating, operating, and debugging cost-effective large scale HPC systems in industry. He has worked to help users effectively use computing resources by finding, understanding, and working around issues in applications, filesystems, and networks. Kent holds a B.S. degree in Computer Science from the University of Oklahoma.

Bio: Shawn's experience is in large scale system administration, having worked with HPC clusters in industry and academia. He has worked on many aspects of large scale systems, and his interests include parallel file systems, configuration management, performance analysis, and security. Shawn holds B.S. and M.S. degrees in Electrical and Computer Engineering from Ohio State University.

2:00 - 3:00pm: Tutorial 4: Using UID Mapping and Shared Key Crypto in Lustre (Stephen Simms, Indiana University)

Administration of a site-wide file system can be challenging. In most cases, the file system will be mounted on resources that are in your immediate control. However, in some instances, that file system could be mounted in another department or laboratory outside of your direct control. UID mapping allows a file system to span administrative domains while providing consistent ownership and permissions for all files.

Suppose you would like to protect your filesystem against a man-in-the-middle attack and verify that the incoming file system traffic is coming from a specific laboratory. Furthermore, you would like to encrypt the ePHI data coming from that lab. Shared Key Crypto provides Lustre with an option to verify the integrity of data passed between client and server, as well as an option to encrypt that data in addition to verifying its integrity.

This tutorial will detail how to set up and use UID mapping and shared key crypto in your Lustre environment.
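As a taste of the configuration involved, here is a hedged sketch of the nodemap commands behind UID mapping. The nodemap name, NID range, and ID values are hypothetical:

```shell
# Create a nodemap for a remote lab's clients and map their identities.
lctl nodemap_add remotelab
lctl nodemap_add_range --name remotelab --range 192.168.1.[2-50]@tcp

# Map the lab's local UID/GID 1000 to this site's UID/GID 60000.
lctl nodemap_add_idmap --name remotelab --idtype uid --idmap 1000:60000
lctl nodemap_add_idmap --name remotelab --idtype gid --idmap 1000:60000

# Deny root privileges to clients in this nodemap.
lctl nodemap_modify --name remotelab --property admin --value 0

# Turn identity mapping on filesystem-wide.
lctl nodemap_activate 1
```

With a mapping like this in place, files created by the remote lab carry consistent ownership on both sides of the administrative boundary.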

Bio: Stephen Simms currently manages the High Performance File Systems group at Indiana University as part of the Research Technologies division of University Information Technology Services. He has been involved in the Lustre file system community since 2005. Since that time he and his team have pioneered the use of the Lustre file system across wide area networks. He currently serves on the OpenSFS board as the Lustre community representative.

3:00 - 3:30pm: Afternoon Break

3:30 - 4:15pm: Tutorial 5: Managing Lustre on a Budget (Rick Mohr, University of Tennessee/National Institute for Computational Sciences)

Because Lustre is open-source, it is a natural fit for sites looking for a high performance file system without the high price tag. However, even running free software entails some operational costs. Deploying a reliable, fast parallel file system requires an understanding of those costs. This is particularly important for small sites working with limited hardware and staff. This tutorial will discuss some of the costs that must be considered when deploying Lustre and how trade-offs can be made among resources. Lustre operational experiences at the National Institute for Computational Sciences (NICS) will be used to illustrate some of these concepts.

Bio: Richard Mohr runs Lustre file systems and other storage resources for the National Institute for Computational Sciences at the University of Tennessee. While finishing his Ph.D. in Physics at the Ohio State University, he became enamored with high performance computing. He has been working in the HPC field for the past 15 years and with Lustre for the past 6 years.

4:15 - 5:00pm : Panel: Lustre as a Shared Resource (Keynote and Tutorial Presenters; moderator: Sarp Oral, ORNL)

March 9, 2016 - Keynote and Technical Presentations

8:45 - 9:00am: Welcome and Introductions

Neena Imam - Oak Ridge National Laboratory

9:00 - 10:00am: Keynote Address: Making Lustre Data-Aware (Cory Spitz - Cray, Lead Developer of Storage R&D)

Technologies that will breathe new life into storage systems are coming soon, but it isn't clear how they can best be leveraged. In fact, it isn't entirely evident that Lustre will be the software technology that will best take advantage of these new technologies. There are several problems that need careful consideration. One basic problem is the hardware organization of new systems. It is obvious that simply dropping in solid-state components with the same approach to deployment and software isn't going to work. Concepts such as burst buffers, which typically distribute storage components across and throughout a system, are being tested, but the ideas and execution of such methods are far from perfected.

Other obstacles must also be taken into account. Perhaps most importantly, we need to consider the new requirements of shared storage for diverse workloads. If our data is more broadly distributed across these components, we need to think about enabling the seamless movement of and access to that data. HPC power users could make the investment to carefully marshal their data to optimize their own personal workflows. However, I doubt that we should expect technical computing users, data scientists, or others to make the same investment. Today, Lustre doesn't provide the framework, the tools, or the technology to easily access broadly distributed data. To meet these kinds of expectations, Lustre must move on from so-called scratch storage and become both data aware and data placement aware. Permanence and provenance of data will then be absolute requirements.

While it is not yet clear how we can begin to solve these problems in the Lustre ecosystem, it is evident that we must begin to find solutions. If we abandon Lustre altogether we risk starting over and setting ourselves further back. Our codes and workflows will likely be less portable as different providers and different vendors begin trying different emerging solutions. However, Lustre is battle tested and it will take new technology years to catch up. Consequently, a better path forward is to adopt solutions that have been successfully demonstrated, evolve Lustre to adopt the best solutions, and thus ready Lustre to meet the needs of the future. If Lustre is a (sledge) hammer we need to evolve it into a Swiss army knife (with a hammer).

10:00 - 10:15am : Morning Break

10:15 - 11:00am : Special Topics Talk: Building a More Secure Lustre System using Open Source Tools (Josh Judd - WARP Mechanics)

There are a wide range of areas in which Lustre needs to be made more secure. These range from relatively simple steps to secure the base OS, to much more complex and comprehensive multi-layer security appliance solutions. This talk will cover WARP Mechanics' efforts to build an entirely open secure appliance, rather than locking community members into proprietary code. It will cover where these efforts stand today, and where they are going moving forward.

Since the topic of "security as a whole" is too large to cover in depth, WARP will also present one specific aspect of secure Lustre as a deeper technical case study: how to create black and white lists for Lustre client NIDs without lowering performance or stability. This will illustrate how other needed open source elements can be linked in to form a complete solution.

11:00 - 11:30am : Technical Presentation: Evaluating Progressive File Layouts for Lustre (Richard Mohr - University of Tennessee; Michael Brim, Sarp Oral - Oak Ridge National Laboratory; Andreas Dilger - Intel Corporation)

Progressive File Layout (PFL) is a Lustre feature currently being developed to support multiple striping patterns within the same file. A file can be created with several non-overlapping extents, and each extent can have different Lustre striping parameters. In this paper, we discuss evaluation results for an early PFL prototype implementation. Results for metadata and streaming I/O tests showed comparable performance between standard Lustre striping and PFL striping. When compared to synthetic dynamic striping, PFL files had better streaming I/O performance. Additionally, object placement tests showed that a single PFL layout could be used for files of widely varying sizes while still producing an object distribution equivalent to customized stripe layouts.
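To make the idea concrete, here is a hedged sketch of a composite layout expressed with the lfs setstripe extent syntax. The extent boundaries and stripe counts are illustrative only, and the prototype evaluated in the paper may expose a different interface:

```shell
# One progressive layout serving files of any size: the first 4 MiB stays
# on a single OST, the next region (up to 256 MiB) is striped over 4 OSTs,
# and everything beyond that is striped across all available OSTs (-c -1).
lfs setstripe -E 4M -c 1 -E 256M -c 4 -E -1 -c -1 /lustre/pfl_dir
```

This is exactly the property the object-placement tests examine: one layout that behaves like a small-file stripe setting for small files and like a wide stripe for large ones.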

11:30 - 12:00pm : Technical Presentation: Parallel Synchronization Of Multi-Pebibyte File Systems (Andy Loftus - National Center for Supercomputing Applications)

When challenged to find a way to migrate an entire file system onto new hardware while maximizing availability and ensuring exact data and metadata duplication, NCSA found that existing file copy tools were insufficient to accomplish the task in a timely manner. So they set out to create a new tool: one that would scale to the limits of the file system and provide a robust interface to adjust to the dynamic needs of the cluster. The resulting tool, Psync (Parallel Sync), effectively manages many syncs running in parallel. As a synchronization tool, it correctly duplicates both the contents and the metadata of every inode from the source file system to the target file system, including special files, such as soft and hard links, and additional metadata, such as ACLs and Lustre stripe information. Psync supports dynamic scalability features to add and remove worker nodes on the fly, and robust management capabilities that allow an entire run to be paused and unpaused or stopped and restarted. Psync has been run successfully on hundreds of nodes with multiple processes each. Spreading the load of synchronization across thousands of processes results in relatively short runtimes for an initial copy and even shorter runtimes for successive runs (to update a previous sync with new and changed files). This talk will present the overall design of Psync and its use as a general-purpose tool for copying large amounts of data as quickly as possible.
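Psync itself is not shown here, but the core idea of fanning a single sync out into many parallel workers can be illustrated with stock tools. The paths and degree of parallelism are hypothetical, and this simplification does not capture Psync's dynamic worker management or its handling of hard links that span the per-directory jobs:

```shell
SRC=/lustre/old DST=/lustre/new

# One rsync job per top-level directory, eight running at a time.
cd "$SRC" && find . -mindepth 1 -maxdepth 1 -type d -print0 |
    xargs -0 -P 8 -I{} rsync -aHAX "{}/" "$DST/{}/"

# Catch files that live directly in the root of the tree.
rsync -aHAX --exclude='*/' "$SRC/" "$DST/"
```

The limitations of exactly this kind of ad hoc approach, at multi-pebibyte scale, are what motivated building a purpose-built tool.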

12:00 - 1:00pm : Working Lunch - Discussion and Feedback on Morning Talks

1:00 - 1:45pm : Special Topics Talk: Lustre Networking Technologies: Ethernet vs. Infiniband (Blake Caldwell - Oak Ridge National Laboratory)

While Lustre operates over numerous different network transports, the
majority of performance studies and deployment best practices have
focused on the Infiniband implementation. However, running Lustre over
the TCP Ethernet transport driver offers distinct advantages with
regards to fault tolerance and in lowering the barrier to entry for
deploying Lustre. This presentation addresses the pros and cons of each
in the context of a Lustre networking infrastructure, and summarizes the
knowledge gained from tuning Lustre over Ethernet for the Spallation
Neutron Source at Oak Ridge National Laboratory. Our findings from this
study were that Lustre workloads with limited parallelism can achieve
excellent throughput on Ethernet if the effects of TCP congestion
control are mitigated. From attending this talk, the audience can expect
a guided exploration of the relative technical merits between the
Infiniband and Ethernet Lustre networking implementations.
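The kinds of mitigations discussed typically live in the kernel's TCP stack and in the socket LND. An illustrative, not prescriptive, set of knobs for a 10GbE LNET follows; every value here is site-specific and should be validated against the actual bandwidth-delay product of the network:

```shell
# Socket buffers sized to cover the bandwidth-delay product (64 MiB here).
sysctl -w net.core.rmem_max=67108864
sysctl -w net.core.wmem_max=67108864
sysctl -w net.ipv4.tcp_rmem="4096 87380 67108864"
sysctl -w net.ipv4.tcp_wmem="4096 65536 67108864"

# A congestion control algorithm that recovers quickly after loss events.
sysctl -w net.ipv4.tcp_congestion_control=htcp

# Socket-LND credits are set as ksocklnd module options, e.g. in modprobe.d:
#   options ksocklnd peer_credits=128 credits=1024
```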

1:45 - 2:15pm : Technical Presentation: Measurements of File Transfer Rates Over Dedicated Long-Haul Connections (Nageswara Rao, Greg Hinkel, Neena Imam - Oak Ridge National Laboratory; Bradley Settlemyer - Los Alamos National Laboratory)

Wide-area file transfers are an integral part of several High-Performance Computing (HPC) scenarios. Dedicated network connections with high capacity, low loss rates, and little competing traffic are increasingly being provisioned over current HPC infrastructures to support such transfers. To gain insights into these file transfers, we collected transfer rate measurements for Lustre and xfs file systems between dedicated multi-core servers over emulated 10 Gbps connections with round trip times (RTTs) in the 0-366 ms range. Memory transfer throughput over these connections is measured using iperf, and file IO throughput on the host systems is measured using xddprof. We consider two file system configurations: Lustre over an IB network, and xfs over SSDs connected to the PCI bus. Files are transferred using xdd across these connections, and the transfer rates are measured; the results indicate the need to jointly optimize the connection and host file IO parameters to achieve peak transfer rates. In particular, these measurements indicate that (i) the peak file transfer rate is lower than the peak connection and host IO throughput, in some cases by as much as 50%, (ii) xdd request sizes that achieve peak throughput for host file IO do not necessarily lead to peak file transfer rates, and (iii) parallelism in host IO and TCP transport does not always improve file transfer rates.
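The joint-optimization point follows directly from the bandwidth-delay product: at the study's longest emulated RTT, a single 10 Gbps stream must keep hundreds of megabytes in flight. A quick back-of-the-envelope check:

```shell
# BDP = rate * RTT; at 10 Gbps and 366 ms, the TCP window (and hence the
# host's socket buffers) must cover roughly 457.5 MB of in-flight data.
awk 'BEGIN { printf "BDP at 366 ms: %.1f MB\n", 10e9 / 8 * 0.366 / 1e6 }'
```

Any mismatch between that window and the host file IO request pipeline shows up directly as the transfer-rate shortfall the measurements report.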

2:15 - 2:45pm : Technical Presentation: Lustre Distributed Name Space (DNE) Evaluation (Jesse Hanley, Sarp Oral, Michael Brim, James Simmons, and Dustin Leverman - Oak Ridge National Laboratory)

This paper describes the Lustre Distributed Name Space (DNE) evaluation
carried out at the Oak Ridge Leadership Computing Facility (OLCF) between 2014 and 2015. DNE is a development project funded by OpenSFS to improve Lustre metadata performance and scalability. The development effort has been split into two parts: the first phase (DNE P1) provides support for remote directories on separate Lustre Metadata Server (MDS) nodes and Metadata Target (MDT) devices, while the second phase (DNE P2) addresses directories split over multiple remote MDS nodes and MDT devices. The OLCF has been actively evaluating the performance, reliability, and functionality of both DNE phases. An internal OLCF testbed was used for these tests. Results are promising, and OLCF is planning a full DNE deployment on production systems in the mid-2016 timeframe.
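For context, the two phases surface to users through lfs. The MDT indexes, stripe counts, and paths below are illustrative:

```shell
# DNE Phase 1: create a remote directory placed on a specific MDT.
lfs mkdir -i 1 /lustre/project_a        # -i selects the MDT index

# DNE Phase 2: stripe a single directory's entries across multiple MDTs.
lfs setdirstripe -c 2 /lustre/shared_metadata
```

Phase 1 spreads whole subtrees across MDTs; Phase 2 lets a single hot directory scale beyond one MDS.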

2:45 - 3:15pm : Afternoon Break

3:15 - 4:00pm : Special Topics Talk: Lustre Security - Today and in the Future (Andreas Dilger - Intel Corporation)

A number of new security-related features have recently been implemented
for Lustre or are currently in active development. The presentation will
describe the various features that have landed in the Community Lustre
2.8 release or are planning to land in the upcoming Community Lustre 2.9
release. These features include SELinux Mandatory Access Controls (MACs)
to ensure that access to files is secure by default. Secure network
communication and client authentication via Kerberos or Shared-Key
cryptography protects data on the network and ensures that clients are
positively identified. Sub-directory mounts allow the client to mount a
subdirectory of the filesystem, rather than the filesystem root. UID/GID
mapping allows the MDS to map user and group identities from a remote
administrative domain based on client node authentication, and to squash
access to files with identifiers not in the identity mapping.

Each of these features is useful independently, but in combination they
could provide the framework to allow an administrator to completely
isolate subsets of the filesystem namespace from each other. This would
allow Lustre to be used in secure environments to segregate files with
different classification levels, as well as in multi-tenant hosted or
virtual environments where different users have complete control of the
clients but should not be able to access the files of other tenants.

Finally, this presentation will identify the work remaining to integrate
these features in order to achieve these security goals, such as
restricting subdirectory mounts to specific client nodemaps.
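A hedged sketch of how two of these pieces might combine for tenant isolation; the filesystem name, NID range, and squash IDs are hypothetical:

```shell
# Sub-directory mount: the client sees only its part of the namespace.
mount -t lustre mgs@tcp:/demo/tenant_a /mnt/tenant_a

# Nodemap: squash any identity the mapping does not cover to nobody (65534).
lctl nodemap_add tenant_a
lctl nodemap_add_range --name tenant_a --range 10.0.1.[1-100]@tcp
lctl nodemap_modify --name tenant_a --property squash_uid --value 65534
lctl nodemap_modify --name tenant_a --property squash_gid --value 65534
```

The remaining integration work the talk describes includes tying the two together, so that a given nodemap can only mount its own subdirectory.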

4:00 - 4:30pm : Technical Presentation: Distributed File Recovery on the Lustre Distributed File System (Justin Albano, Remzi Seker, Radu Babiceanu - Embry-Riddle Aeronautical University; Sarp Oral - Oak Ridge National Laboratory)

With the advancement of cloud-computing technologies and the growth in distributed software applications, a great deal of research has focused on the concepts and implementations of distributed file systems to support these applications. Since its inception in 1999 by Peter Braam at Carnegie Mellon University, the Lustre distributed file system has gained both the technical and financial interest of some of the largest technology entities, including Oracle, Seagate, Intel, Oak Ridge National Laboratory, and OpenSFS. With this immense backing, Lustre has been incorporated in over 60% of the TOP100 high performance computers in the world and is slated to significantly increase this market share.

Although the Lustre file system itself has seen a sharp increase in research since its infancy, support for many of the fields surrounding the file system has been greatly lacking. Primary among these deficiencies is file recovery on the Lustre file system. This paper attempts to fill this gap by providing a simplified solution that is then developed into a distributed solution able to scale to meet the needs and requirements of Lustre deployments of various sizes. While this paper focuses on the Lustre file system, the concepts and solution provided can be applied to any similar metadata-based distributed file system. Although this paper does not provide an implementation of this solution, a complete solution architecture is provided, enabling further research and implementation.