Lustre 101

Lustre Administration Essentials

Lustre 101 Overview

The Lustre 101 web-based course series is focused on administration and monitoring of large-scale deployments of the Lustre parallel file system. Course content is drawn from nearly a decade of experience in deploying and operating leadership-class Lustre file systems at the Oak Ridge Leadership Computing Facility (OLCF) at Oak Ridge National Laboratory (ORNL).

A primary concern in deploying a large system such as Lustre is building the operational experience and insight needed to triage and resolve intermittent service problems. There is no replacement for experience, yet there is also no adequate training material for becoming a Lustre administration expert. The overall goal of the Lustre 101 course series is to distill the working knowledge of those with significant experience administering large-scale Lustre deployments and to share it with the Lustre community, in the hope that others can avoid the trials and tribulations of Lustre administration and monitoring at scale.

Topics in this Course

The Lustre Administration Essentials course is targeted at experienced system administrators who are relatively new to Lustre, but may have prior experience with other distributed and parallel file systems. Topics in this course include an introduction to Lustre, hardware selection and benchmarking strategies, Lustre software installation and basic configuration, Lustre tuning and LNet configuration, basic file system administration and monitoring, methods for problem diagnosis and analysis, and methods for evaluating Lustre functionality, performance, and reliability.

The course is structured as a series of short lessons that allow students to digest the material at their own pace. Each lesson contains roughly 15 minutes of presentation content, and most are available as both a standalone presentation (pdf) and a narrated slideshow recording (wmv, mp4).

 

Feedback

We welcome all feedback and suggestions for improving course content. Please send comments and suggestions to:

lustre101-feedback @ ornl.gov

Related Content

Lustre User Manual

Official Lustre Wiki Pages

·       Lustre Monitoring

·       Lustre Testing

 

Lustre Ecosystem Workshop Tutorials: 2015, 2016

Acknowledgments

The Lustre 101 course series is developed by the Computational Research and Development Programs at Oak Ridge National Laboratory (ORNL), with support from the U.S. Department of Defense and the Oak Ridge Leadership Computing Facility (OLCF). OLCF is supported by the Office of Science of the U.S. Department of Energy.

 

Course Lessons

1.     Introduction to Lustre (posted June 2015)   [ pdf  wmv  mp4 ]

This presentation provides a general overview of the Lustre file system for anyone wanting to learn more about basic Lustre functionality, features, and architecture. The basic components of Lustre are discussed, including the LNet transport layer. Information about Lustre file striping is also included.

2.     Hardware Selection and Benchmarking for Lustre (posted June 2015)   [ pdf  wmv  mp4 ]

This presentation covers topics relevant to the process of hardware selection for a Lustre file system. Recommendations for servers, clients, and networking are provided. In addition, general concepts for benchmarking a Lustre file system are discussed, and a list of some useful tools is provided.

3.     Basic Lustre Installation and Setup from Stock RPMs (posted June 2015)   [ pdf  wmv  mp4 ]

This presentation illustrates how to set up a simple Lustre file system using the stock Lustre RPM packages. It covers the process for installing the RPM packages and describes the configuration files that Lustre needs. Options for formatting and mounting the Lustre storage are also covered.

4.     Creating a Lustre Test System from Source with Virtual Machines (posted June 2015)   [ pdf  wmv  mp4 ]

This presentation describes how to build a small Lustre file system on virtual machines as a test platform for anyone wanting to experiment with Lustre. Rather than using the stock Lustre RPM packages, it demonstrates the process of building Lustre from source code.

5.     Lustre Tuning and Advanced LNet Configuration (posted June 2015)   [ pdf  wmv  mp4 ]

This presentation discusses several of the Lustre kernel modules and available performance tuning options, as well as server and client tuning options for Lustre. Examples of more complex LNet configurations are also illustrated.

6.     File System Administration and Monitoring (posted June 2015)   [ pdf  wmv  mp4 ]

This presentation covers some basic Lustre file system administration tasks such as starting and stopping a Lustre file system, mounting the file system on a client node, and usage reporting. An overview of several useful monitoring tools is also presented.

7.     Analysis of Crash Dumps and Log Files (posted February 2016)   [ pdf  wmv  mp4 ]

This presentation discusses how to gather diagnostic information for Lustre using kernel crash dumps and log files. An overview of crash dumps is given, including the necessary steps to generate dumps and use them for analyzing the cause of Lustre kernel module exceptions (aka LBUGs). The analysis of system logs and Lustre-specific logs to identify problems is also covered.

8.     Methods for Evaluating the Functionality, Performance, & Reliability of Lustre (posted April 2016)   [ pdf  wmv  mp4 ]

This presentation describes the methods used by the Oak Ridge Leadership Computing Facility (OLCF) to evaluate the functionality, performance, and reliability of new Lustre versions before deploying them into production.

9.     Recent LNET and Lustre Tool Improvements (posted March 2017)   [ mp4 ]

10.     LNET Bonding (posted May 2017)   [ pdf  mp4 ]

This lesson covers LNet bonding, the technique of combining multiple Ethernet or InfiniBand interfaces so that they function as a single logical interface.

11.     LUSTRE and Memory (posted December 2017)   [ pdf  mp4 ]

12.     Lustre Over Long-Haul Connections Using LNet Routers (posted December 2017)   [ pdf  mp4 ]

13.     Working with Problematic Nodes (posted December 2017)   [ pdf  mp4 ]

14.     Recovery and Eviction (posted December 2017)   [ pdf  mp4 ]
