Lustre 101

Lustre Administration Tutorials

Lustre101 Overview

The Lustre 101 web-based course series is focused on administration and monitoring of large-scale deployments of the Lustre parallel file system. Course content is drawn from nearly a decade of experience in deploying and operating leadership-class Lustre file systems at the Oak Ridge Leadership Computing Facility (OLCF) at Oak Ridge National Laboratory (ORNL), as well as contributions from other sites with large-scale Lustre experience.

A primary concern in deploying a large system such as Lustre is building the operational experience and insight to triage and resolve intermittent service problems. Although there is no replacement for experience, it is also true that there is no adequate training material for becoming a Lustre administration expert. The overall goal of the Lustre 101 course series is to distill and disseminate to the Lustre community the working knowledge of those with significant experience in administration of large-scale Lustre deployments in the hope that others can avoid the trials and tribulations of Lustre administration and monitoring at scale.

Topics in this Course

The Lustre Administration Tutorials course includes a series of tutorials on specific aspects of Lustre administration and monitoring. Lessons in this course are intended for system administrators who are relatively new to Lustre, as well as more experienced admins seeking to learn about new ways of using Lustre or improving its reliability or performance.

Each lesson contains roughly 45 minutes of presentation content, and is available as a standalone presentation (pdf). Questions on tutorial content should be directed to the appropriate contact person identified in each tutorial.
Feedback

We welcome all feedback and suggestions for improving course content. Please send comments and suggestions to:

lustre101-feedback@ornl.gov

Related Content

Lustre User Manual

Official Lustre Wiki Pages

- Lustre Monitoring

- Lustre Testing

Lustre Ecosystem Workshop Tutorials (2015, 2016)

Acknowledgments

The Lustre 101 course series is developed by the Computational Research and Development Programs at Oak Ridge National Laboratory (ORNL), with support from the U.S. Department of Defense and the Oak Ridge Leadership Computing Facility (OLCF). OLCF is supported by the Office of Science of the U.S. Department of Energy.

Course Lessons

1.     Installing, Tuning, and Monitoring a ZFS-based Lustre file system    [ pdf ]

From the beginning, Lustre used the Linux ext file system as the building block for its backend storage. Over time, it became desirable to have a more robust, feature-rich file system underneath Lustre. ZFS is a combined file system, logical volume manager, and RAID engine with extreme scalability. With Lustre 2.4 and beyond, ZFS is supported through an additional OSD layer in Lustre. This tutorial focuses on how to configure, install, tune, and monitor ZFS-based Lustre file systems, pointing out some of the differences between ZFS and ldiskfs along the way.

Note: This tutorial is made available courtesy of Marc Stearman from Lawrence Livermore National Laboratory. The tutorial was originally presented at the 2nd International Workshop on the Lustre Ecosystem in March 2016.
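To give a flavor of what the tutorial covers, the commands below sketch how a ZFS-backed Lustre OST is typically created and brought online with mkfs.lustre's --backfstype option. This is an illustrative sketch, not drawn from the tutorial itself; the pool name, device paths, file system name, and MGS address are all assumptions.

```shell
# Create an OST backed by a ZFS mirror; mkfs.lustre creates the pool itself.
# Pool name (ostpool), devices, fsname, and MGS NID below are hypothetical.
mkfs.lustre --ost --backfstype=zfs \
    --fsname=lustre --index=0 \
    --mgsnode=10.0.0.1@tcp \
    ostpool/ost0 mirror /dev/sdb /dev/sdc

# Mounting the target starts the OST service.
mkdir -p /mnt/lustre/ost0
mount -t lustre ostpool/ost0 /mnt/lustre/ost0

# Standard ZFS health and tuning tools apply unchanged underneath Lustre.
zpool status ostpool
zfs get recordsize ostpool/ost0
```

Because the backend is a ZFS dataset rather than a raw block device, pool-level redundancy (mirrors, raidz) replaces external RAID controllers, which is one of the ldiskfs/ZFS differences the tutorial examines.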

2.     Robust Health Monitoring    [ pdf ]

Large-scale parallel file systems have tens of thousands of moving parts, with complex interconnects and software layered on top. This tutorial covers best practices for monitoring the center-wide parallel file system resources at OLCF, focusing on hardware-level monitoring, interconnect issues, and Lustre software monitoring.

Note: This tutorial is made available courtesy of Blake Caldwell from Oak Ridge National Laboratory. The tutorial was originally presented at the 1st International Workshop on the Lustre Ecosystem in March 2015.
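On the Lustre software side, the kind of checks such monitoring builds on can be run by hand with lctl. A minimal sketch of common health queries on a Lustre server node (output names vary by version and backend):

```shell
# Overall node health: prints "healthy" when all local targets are OK.
lctl get_param health_check

# List configured Lustre devices and their current state.
lctl dl

# Per-target free space on an OSS or MDS (wildcards match local targets).
lctl get_param osd-*.*.kbytesavail
```

Production monitoring typically wraps queries like these in a collector (e.g. a cron job or metrics agent) rather than running them interactively, alongside the hardware- and fabric-level checks the tutorial describes.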

3.     Network Contention and Congestion Control    [ pdf ]

High-performance storage systems are large and complex. Their deployment and configuration must be carefully planned to avoid congestion points and extract maximum performance. This tutorial discusses the Lustre LNET fine-grained routing (FGR) technique developed at OLCF to alleviate network congestion on past and present systems.

Note: This tutorial is made available courtesy of Matt Ezell from Oak Ridge National Laboratory. The tutorial was originally presented at the 1st International Workshop on the Lustre Ecosystem in March 2015.
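LNET routing of the kind FGR refines is expressed through the lnet kernel module options. The fragment below is an illustrative sketch with hypothetical network names and router addresses; real FGR configurations pair specific groups of clients with specific sets of routers to balance load across the fabric.

```
# /etc/modprobe.d/lnet.conf  (illustrative fragment, values are assumptions)
# Clients on o2ib0 reach servers on o2ib1 via four dedicated LNET routers.
options lnet networks="o2ib0(ib0)" \
             routes="o2ib1 10.10.0.[1-4]@o2ib0"
```

FGR extends this basic mechanism by carving the routes table into fine-grained groups so that traffic from a given client region consistently uses the routers closest to it.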
