International Workshop on The Lustre Ecosystem:
Challenges and Opportunities
March 3, 2015 - Keynote & Tutorials
8:00am - Registration and Breakfast
8:45am - Welcome and Introductions
Neena Imam - Oak Ridge National Laboratory
9:00am - Keynote Address
Eric Barton - Intel: Growing the Lustre Ecosystem
10:00am - Morning Break
10:15am - OLCF Lustre Overview (Sarp Oral, ORNL)
This talk will provide a brief introduction to the production Lustre environment of the Oak Ridge Leadership Computing Facility (OLCF) at Oak Ridge National Laboratory (ORNL). The subsequent tutorials provide information and techniques used in managing and optimizing this Lustre environment.
10:30am - Tutorial 1: Network Contention and Congestion Control
(Matt Ezell [Network Contention and Congestion Control: Lustre Fine-Grained Routing] and Feiyi Wang [Improve Large-Scale Storage System Performance via Topology-aware and Balanced Data Placement], ORNL)
High-performance storage systems are large and complex. Deployment and configuration of these systems must be carefully planned to avoid congestion points and extract maximum performance. This tutorial will discuss the Lustre LNET fine-grained routing (FGR) and balanced I/O placement technologies developed at OLCF to alleviate network congestion on past and present systems.
11:30am - Tutorial 2: LNET Configuration (Jason Hill, ORNL)
In this tutorial session, we cover the kernel module parameters for LNET, what they mean, and what reasonable values look like for several compute-environment scenarios.
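To make the discussion concrete, a minimal LNET module configuration for a hypothetical dual-homed node might look like the sketch below. The interface names, NID ranges, and credit values are illustrative assumptions, not recommendations from the tutorial; appropriate values depend heavily on the site's fabric and workload.

```shell
# /etc/modprobe.d/lustre.conf -- illustrative values only; tune for your site.

# Declare the LNET networks this node participates in:
# o2ib0 over InfiniBand interface ib0, tcp0 over Ethernet interface eth0.
options lnet networks="o2ib0(ib0),tcp0(eth0)"

# Reach the tcp0 network via two (hypothetical) router NIDs on the o2ib0 fabric.
options lnet routes="tcp0 10.10.0.[1-2]@o2ib0"

# Per-peer and per-interface credits bound concurrency on the IB LND;
# these are common starting points, not universal answers.
options ko2iblnd peer_credits=32 credits=256
```

Most of the tutorial's "what do these mean" discussion centers on parameters of this kind: network declarations, routes, and the various credit settings that govern message concurrency.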
12:30pm - Working Lunch
1:00pm - Tutorial 3: Robust Monitoring and Analysis (Blake Caldwell, ORNL)
Large-scale parallel filesystems have tens of thousands of moving parts, tied together by complex interconnects and software. This talk covers best practices for monitoring the center-wide parallel filesystem resources at OLCF, focusing on hardware-level monitoring, interconnect issues, and Lustre software monitoring.
2:00pm - Tutorial 4: Failure Handling (Jason Hill, ORNL)
Individual failures of hardware or software can be compounded by the seat-keyboard interface. Minimizing the impact on availability and the risk of data loss is of the utmost importance. Decisions made during the design phase can help mitigate this compounding; sometimes additional funds can overcome these constraints, but funding isn't always available. In this talk we will show how the OLCF has designed its systems to guard against human error, and cover a few events where human error nonetheless caused downtime and even data loss.
3:00pm - Afternoon Break
3:15pm - Tutorial 5: Scalability Limitations of Standard Linux Tools (Feiyi Wang, ORNL)
Standard Linux tools are single-threaded and run on a single client. This lack of parallelism has a substantial impact on user productivity when working with large data sets. There have been multiple efforts in the community to develop parallel Linux tools, but none so far has been widely adopted. This tutorial will focus on mpiFileUtils, an effort to develop parallelized, high-performance versions of standard Linux tools. mpiFileUtils is a collaboration between ORNL, Lawrence Livermore National Laboratory (LLNL), Los Alamos National Laboratory (LANL), and Data Direct Networks (DDN).
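The usage model is that each tool is an MPI program, so familiar operations become parallel jobs. A hypothetical invocation sketch (the tool names come from the mpiFileUtils project; process counts and paths are made up for illustration):

```shell
# Illustrative mpiFileUtils runs -- paths and -np values are hypothetical.
# Each tool is an MPI program, so the work is spread across many processes
# (and thus many Lustre clients) instead of a single-threaded utility.
mpirun -np 128 dcp   /lustre/src_dir /lustre/dst_dir   # parallel copy (cp)
mpirun -np 128 dwalk /lustre/dst_dir                   # parallel walk/stat (find/du)
mpirun -np 128 drm   /lustre/scratch_dir               # parallel remove (rm -rf)
```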
3:45pm - Q&A Session: Ask the OLCF (OLCF Staff)
This session will provide audience members the chance to ask the OLCF Lustre experts additional questions on large-scale Lustre administration and monitoring.
March 4, 2015 - Technical Presentations
8:00am - 8:45am : Registration and Breakfast
8:45am - 9:00am : Welcome and Introductions
Neena Imam - Oak Ridge National Laboratory
9:00am - 10:15am : Supporting Data-Intensive Workloads
Development of a Burst Buffer System for Data-Intensive Applications
Teng Wang, Michael Pritchard, Kevin Vasko, and Weikuan Yu - Auburn University, Auburn, Alabama, U.S.A.
Sarp Oral - Oak Ridge National Laboratory, Oak Ridge, Tennessee, U.S.A.
Modern parallel filesystems such as Lustre are designed to provide high, sustainable I/O bandwidth in response to soaring I/O requirements. To handle bursty I/O, a burst buffer system is needed to temporarily absorb I/O bursts and gradually flush datasets to the long-term parallel filesystem. In this paper, we explore the issues involved in developing a high-performance burst buffer system for data-intensive scientific applications. Our initial results demonstrate that a burst buffer system on top of Lustre is very promising for absorbing intensive I/O traffic from application checkpoints.
Evaluating Dynamic File Striping for Lustre
Joel Reed, Jeremy Archuleta, Michael J. Brim, and Joshua Lothian - Oak Ridge National Laboratory, Oak Ridge, Tennessee, U.S.A.
We define dynamic striping as the ability to assign different Lustre striping characteristics to contiguous segments of a file as it grows. In this paper, we evaluate the effects of dynamic striping on data analytic workloads using a watermark-based strategy where the stripe count or width is increased once a file's size exceeds one of the chosen watermarks. Initial results using a modified IOR on our Lustre testbed suggest that dynamic striping may provide performance benefits, but results using two data analytic workloads with significant random read phases are inconclusive on both our testbed and the production Lustre environment at the Oak Ridge Leadership Computing Facility (OLCF).
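The watermark policy itself is easy to state. A minimal sketch in Python (the watermark thresholds and stripe counts below are hypothetical examples, not the paper's actual configuration):

```python
import bisect

def stripe_count(segment_start, watermarks, counts):
    """Return the stripe count for a file segment beginning at segment_start bytes.

    watermarks: sorted size thresholds in bytes.
    counts: stripe counts, one more entry than watermarks; counts[i] applies
    once the file has grown past watermarks[i-1].
    """
    return counts[bisect.bisect_right(watermarks, segment_start)]

GiB = 1 << 30
# Hypothetical policy: 1 stripe below 1 GiB, 4 up to 64 GiB, 16 beyond.
watermarks = [1 * GiB, 64 * GiB]
counts = [1, 4, 16]

assert stripe_count(0, watermarks, counts) == 1          # small file: 1 stripe
assert stripe_count(2 * GiB, watermarks, counts) == 4    # past first watermark
assert stripe_count(128 * GiB, watermarks, counts) == 16 # past second watermark
```

Each new segment of a growing file would be striped with the count this policy returns, so small files stay on few OSTs while large files fan out across many.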
10:15am - 10:30am : Break
10:30am - 12:00pm : Lustre Management
Taking back control of HPC file systems with Robinhood Policy Engine
Thomas Leibovici - CEA/DAM, Arpajon, France
Today, the largest Lustre file systems store billions of entries. On such systems, classic tools based on namespace scanning become unusable. Operations such as managing file lifetime, scheduling data copies and generating overall filesystem statistics become painful as they require collecting, sorting and aggregating information for billions of records. Robinhood Policy Engine is an open source software developed to address these challenges. It makes it possible to schedule automatic actions on huge numbers of filesystem entries. It also gives a synthetic understanding of file system contents by providing overall statistics about data ownership, age and size profiles. Although it can be used with any POSIX filesystem, Robinhood supports Lustre-specific features like OSTs, pools, HSM, ChangeLogs, DNE, etc. It implements specific support for these features, and takes advantage of them to manage Lustre file systems efficiently.
Distributed Lustre Activity Tracking
Henri Doreau - CEA/DAM, Arpajon, France
Numerous administration tools and techniques require a near-real-time view of the activity occurring on a distributed filesystem. The changelog facility provided by Lustre to address this need suffers from limitations in scalability and flexibility. We have been working on reducing those limitations by enhancing Lustre itself and by developing external tools such as the Lustre ChangeLog Aggregate and Publish (LCAP) proxy. Beyond the ability to distribute changelog processing, this effort aims at opening new prospects by making the changelog stream simpler to leverage for various purposes.
12:00pm - 1:00pm : Lunch and Networking
1:00pm - 2:30pm : Lustre Monitoring
Monitoring Extreme-scale Lustre Toolkit
Michael J. Brim and Joshua K. Lothian - Oak Ridge National Laboratory, Oak Ridge, Tennessee, U.S.A.
We discuss the design and ongoing development of the Monitoring Extreme-scale Lustre Toolkit (MELT), a unified Lustre performance monitoring and analysis infrastructure that provides continuous, low-overhead summary information on the health and performance of Lustre, as well as on-demand, in-depth problem diagnosis and root-cause analysis. The MELT infrastructure leverages a distributed overlay network to enable monitoring of center-wide Lustre filesystems where clients are located across many network domains. We preview interactive command-line utilities that help administrators and users observe Lustre performance at various levels of resolution, from individual servers or clients to whole filesystems, including job-level reporting. Finally, we discuss our future plans for automating the root-cause analysis of common Lustre performance problems.
Monitoring Lustre - Getting More from Performance Info
Ben Evans - Terascala Inc., Boston, Massachusetts, U.S.A.
Everyone monitors Lustre: read/write speeds, I/O histograms, memory and CPU usage, metadata operations, and so on. This is good data to have; however, it does not allow a deep dive into why things are slow or how they can be improved. Everyone also monitors for errors at all levels of storage, from the disks up to Lustre itself, the network, and beyond. With added information, can we get more out of the data we already collect? Absolutely. Knowing how disks are positioned in their enclosures, where the enclosures are positioned in a rack, and where the rack is positioned in the data center, temperature anomalies can be isolated visually in a fraction of the time using a heatmap. Instead of monitoring just OST performance and usage, displaying performance within pools or groups of pools can yield quick, easy answers to "why is this slow?". To do this, you need many more layers of data and the tools to act on them. In this talk, I will discuss the construction of a system that can handle these sorts of issues and many others.
2:30pm - 3:00pm : Break
3:00pm - 5:00pm : Lustre Storage Advances
ZFS on RBODs - Leveraging RAID Controllers for Metrics and Enclosure Management
Marc Stearman - Lawrence Livermore National Laboratory, Livermore, California, U.S.A.
Traditionally, the Lustre file system has relied on the ldiskfs file system with reliable RAID (Redundant Array of Inexpensive Disks) storage underneath. As of Lustre 2.4, ZFS was added as a backend file system with built-in RAID, thereby removing the need for expensive RAID controllers. ZFS was designed to work with JBOD (Just a Bunch Of Disks) storage enclosures under the Solaris Operating System, which provided a rich device management system. Long-time users of the Lustre file system have relied on RAID controllers to provide metrics and enclosure monitoring and management services, with rich APIs and command-line interfaces. This talk will discuss a hybrid approach using an advanced RAID enclosure with all of these features, but presented to the host as a JBOD, allowing ZFS to do the RAID protection and management of the disks.
WARP-Z architecture: scalable & performant erasure code protection for Lustre
Josh Judd - Warp Mechanics Ltd., San Francisco, California, U.S.A.
WARP-Z takes the vertical Lustre/ZFS OSS design and turns it into a scale-out / erasure code protection style of design. It maintains the performance and client compatibility of Lustre, but achieves the cost and distributed protection of Object Storage. In this talk, Josh Judd (CTO of WARP Mechanics) will describe the WARP-Z architecture in detail, along with the current state of the art in testing.
Improving block-level efficiency with scsi-mq
Blake Caldwell - Oak Ridge National Laboratory, Oak Ridge, Tennessee, U.S.A.
Current generation solid-state storage devices are exposing a new bottleneck in the Linux kernel's SCSI and block layers, where small I/O throughput is limited by the common SMP scaling issues of lock contention, inefficient interrupt handling, and poor memory locality. Major re-writes of the Linux kernel block layer blk-mq and the SCSI component scsi-mq are addressing these concerns with a scalable multi-queue design. Beyond the use case of high-IOP SSD devices, this work explores the block-level throughput handling capacity in caching storage controllers that exceed previous limits of the SCSI subsystem. We then present an analysis of the impact of scsi-mq on Lustre filesystem performance and discuss opportunities for efficiency gains from improved SMP affinity of storage target I/O threads.
5:00pm - 5:30pm : Open Discussion
5:30pm : Closing Remarks
Neena Imam - Oak Ridge National Laboratory