Lustre 101
Lustre 101 Overview
The Lustre 101 web-based course series is focused on administration and
monitoring of large-scale deployments of the Lustre parallel file system. Course
content is drawn from nearly a decade of experience in deploying and
operating leadership-class Lustre file systems at the Oak Ridge Leadership
Computing Facility (OLCF) at Oak Ridge National Laboratory (ORNL).
A primary concern in deploying a large
system such as Lustre is building the operational experience and insight to
triage and resolve intermittent service problems. Although there is no replacement
for experience, there is also no adequate training material
for becoming a Lustre administration expert. The overall goal of the Lustre 101
course series is to distill and disseminate to the Lustre
community the working knowledge of those with significant experience
in administration of large-scale Lustre deployments,
in the hope that others can avoid the trials and
tribulations of Lustre administration and monitoring at scale.
Topics in this Course
The Lustre Administration Essentials course is targeted at experienced system administrators who are relatively new to Lustre but may have prior experience with other distributed and parallel file systems. Topics in this course include an introduction to Lustre, hardware selection and benchmarking strategies, Lustre software installation and basic configuration, Lustre tuning and LNet configuration, basic file system administration and monitoring, methods for problem diagnosis and analysis, and methods for evaluating Lustre functionality, performance, and reliability.
The course is structured as a series of short lessons that allow students to digest the material at their own pace. Each lesson contains roughly 15 minutes of presentation content and is available as a standalone presentation (pdf) and as narrated slideshow recordings (wmv, mp4).
Feedback
We welcome all feedback
and suggestions for improving course content.
Please send comments and suggestions to: lustre101-feedback @ ornl.gov
Related Content
Official Lustre Wiki Pages
Lustre Ecosystem Workshop Tutorials - 2015, 2016
Acknowledgments
The Lustre 101 course series is
developed by the Computational Research and Development Programs at Oak Ridge
National Laboratory (ORNL), with support from the U.S. Department of Defense and
the Oak Ridge Leadership Computing Facility (OLCF). OLCF is supported by the
Office of Science of the U.S. Department of Energy.
Course Lessons
1. Introduction to Lustre (posted on June 2015) [ pdf wmv mp4 ]
This presentation provides a general
overview of the Lustre file system for anyone wanting to learn more about
basic Lustre functionality, features, and architecture. The basic components
of Lustre are discussed, including the LNet transport layer. Information
about Lustre file striping is also included.
2. Hardware Selection and Benchmarking for Lustre (posted on June 2015) [ pdf wmv mp4 ]
This presentation covers topics relevant to
the process of hardware selection for a Lustre file system. Recommendations
for servers, clients, and networking are provided. In addition, general
concepts for benchmarking a Lustre file system are discussed, and a list of
some useful tools is provided.
3. Basic Lustre Installation and Setup from Stock RPMs (posted on June 2015) [ pdf wmv mp4 ]
This presentation illustrates how to set up
a simple Lustre file system using the stock Lustre RPM packages. The process
for installing the RPM packages is covered, along with a description of
the configuration files that are needed by Lustre. Options for formatting and
mounting the Lustre storage are also covered.
4. Creating a Lustre Test System from Source with Virtual Machines (posted on June 2015) [ pdf wmv mp4 ]
This presentation describes how to build a
small Lustre file system using virtual machines that can be used as a test
platform for anyone wanting to experiment with Lustre. Rather than use the
stock Lustre RPM packages, the process of building Lustre from source code
is demonstrated.
5. Lustre Tuning and Advanced LNet Configuration (posted on June 2015) [ pdf wmv mp4 ]
This presentation discusses several of the
Lustre kernel modules and available performance tuning options, as well as
server and client tuning options for Lustre. Examples of more complex LNet
configurations are also illustrated.
6. File System Administration and Monitoring (posted on June 2015) [ pdf wmv mp4 ]
This presentation covers some basic Lustre
file system administration tasks such as starting and stopping a Lustre file
system, mounting the file system on a client node, and usage reporting. An
overview of several useful monitoring tools is also presented.
7. Analysis of Crash Dumps and Log Files (posted on February 2016) [ pdf wmv mp4 ]
This presentation discusses how to gather diagnostic
information for Lustre using kernel crash dumps and log files. An overview of crash
dumps is given, including the necessary steps to generate dumps and use them
for analyzing the cause of Lustre kernel module exceptions (aka LBUGs). The analysis
of system logs and Lustre-specific logs to identify problems is also covered.
8. Methods for Evaluating the Functionality, Performance, & Reliability of Lustre (posted on April 2016) [ pdf wmv mp4 ]
This presentation describes the
methods used by the Oak Ridge Leadership Computing Facility (OLCF) to
evaluate the functionality, performance, and reliability of new
Lustre versions before deploying them into production.
9. Recent LNET and Lustre Tool Improvements (posted on March 2017) [ mp4 ]
10. LNET Bonding (posted on May 2017) [ pdf mp4 ]
This presentation covers LNet bonding, the technique of combining multiple Ethernet or InfiniBand interfaces to function as a single logical interface.
11. LUSTRE and Memory (posted on December 2017) [ pdf mp4 ]
12. Lustre Over Long-Haul Connections Using LNet Routers (posted on December 2017) [ pdf mp4 ]
13. Working with Problematic Nodes (posted on December 2017) [ pdf mp4 ]
14. Recovery and Eviction (posted on December 2017) [ pdf mp4 ]