Lustre 101
Lustre 101 Overview
The Lustre 101 web-based course series is focused on administration and
monitoring of large-scale deployments of the Lustre parallel file system. Course
content is drawn from nearly a decade of experience in deploying and
operating leadership-class Lustre file systems at the Oak Ridge Leadership
Computing Facility (OLCF) at Oak Ridge National Laboratory (ORNL),
as well as contributions from other sites with large-scale Lustre experience.
A primary concern in deploying a large system such as Lustre is building the
operational experience and insight needed to triage and resolve intermittent
service problems. Experience has no substitute, but there is also no adequate
training material for becoming a Lustre administration expert.
The overall goal of the Lustre 101 course series is to distill the working
knowledge of those with significant experience administering large-scale
Lustre deployments and to share it with the Lustre community, in the hope that
others can avoid the trials and tribulations of Lustre administration and
monitoring at scale.
Topics in this Course
The Lustre Administration Tutorials course includes a series of tutorials on
specific aspects of Lustre administration and monitoring. Lessons in this
course are intended for system administrators who are relatively new to
Lustre, as well as for more experienced admins seeking to learn new ways of
using Lustre or of improving its reliability or performance. Each lesson
contains roughly 45 minutes of presentation content and is available as a
standalone presentation (PDF). Questions on tutorial content should be
directed to the contact person identified in each tutorial.
Feedback
We welcome all feedback
and suggestions for improving course content.
Please send comments and suggestions to: lustre101-feedback @ ornl.gov
Related Content
Official Lustre Wiki Pages
Lustre Ecosystem Workshop Tutorials - 2015, 2016
Acknowledgments
The Lustre 101 course series is
developed by the Computational Research and Development Programs at Oak Ridge
National Laboratory (ORNL), with support from the U.S. Department of Defense and
the Oak Ridge Leadership Computing Facility (OLCF). OLCF is supported by the
Office of Science of the U.S. Department of Energy.
Course Lessons
1. Installing, Tuning, and Monitoring a ZFS-based Lustre file system [ pdf ]
From the beginning, Lustre used the Linux ext file system as the building
block for its backend storage. Over time, it became desirable to have a more
robust, feature-rich file system underneath Lustre. ZFS is a combined file
system, logical volume manager, and RAID engine with extreme
scalability. With Lustre 2.4 and beyond, ZFS adds an additional OSD
layer to Lustre. This tutorial focuses on how to configure,
install, tune, and monitor ZFS-based Lustre file systems, pointing
out some of the differences between ZFS and ldiskfs along the way.
Note: This tutorial is made available courtesy of Marc Stearman from
Lawrence Livermore National Laboratory. The tutorial was originally
presented at the 2nd International Workshop on the Lustre Ecosystem
in March of 2016.
2. Robust Health Monitoring [ pdf ]
Large-scale parallel file systems have tens of thousands of moving parts,
along with complex interconnections and software on top of them. This
tutorial covers best practices for monitoring the center-wide parallel
file system resources at OLCF. It focuses on hardware-level monitoring,
interconnect issues, and Lustre software monitoring.
Note: This tutorial is made available courtesy of Blake Caldwell from
Oak Ridge National Laboratory. The tutorial was originally presented at
the 1st International Workshop on the Lustre Ecosystem
in March of 2015.
3. Network Contention and Congestion Control [ pdf ]
High-performance storage systems are large and complex. Deployment and
configuration of these systems must be carefully planned to avoid
congestion points and extract the maximum performance. This tutorial
discusses the Lustre LNET fine-grained routing (FGR) technique developed
at OLCF to alleviate network congestion on past and present systems.
Note: This tutorial is made available courtesy of Matt Ezell from
Oak Ridge National Laboratory. The tutorial was originally presented at
the 1st International Workshop on the Lustre Ecosystem
in March of 2015.