[EasyBuild & JSC logos]

EasyBuild

A New Hope

Damian Alvarez (JSC)

Disclaimer: I am not a Star Wars fan
Brief history of EasyBuild at JSC...

JURECA system details & requirements

JURECA system characteristics

  • 1.8 + 0.44 PFlops, #69 in Top500 (Nov'16)
  • 1872 compute nodes (Haswell)
  • 75 compute nodes, each with 2 NVIDIA K80 GPUs
  • 12 visualization nodes, each with 2 NVIDIA K40 GPUs
  • Mellanox EDR InfiniBand with fat tree topology

  • ~950 users and ~180 projects

  • Any guess on user requirements?

User requirements

"I want it all, I want it all, I want it all, and I want it now"♪♬♫♪♬♫
  • Intel compilers

  • GNU compilers

  • PGI compilers

  • ParaStationMPI

  • Intel MPI

User requirements

"I want it all, I want it all, I want it all, and I want it now"♪♬♫♪♬♫
  • CUDA

  • CUDA-aware MPI

  • And of course:

    • Tons of libraries

    • Compatibility

Designing the User View

Module Naming Scheme

Different ways to present modules to users:
  • Flat
    • More than 800 packages at once
    • Not all visible software is compatible

  • Toolchain based
    • Have to choose compiler, MPI and math libraries before seeing anything else
    • Toolchain names are either cryptic or long (pmvmklc vs. PGI_MVAPICH2_MKL_CUDA)
    • Visible software is compatible

Module Naming Scheme

Different ways to present modules to users:
  • Hierarchy of compilers and MPI runtimes
    • Modules available are shown in a staged fashion
    • Intuitive
    • Visible software is compatible

    • This is our default Module Naming Scheme

Lmod as modules tool

  • Lmod was designed with module hierarchies in mind
    • module spider and module key
    • Module families (family("compiler") or family("mpi"))

Lmod as modules tool

  • Lmod also has other interesting features
    • Good support for hidden modules (--show-hidden)
    • Cache
    • Properties
    • Hooks
    • ...
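The family() mechanism mentioned above is what powers Lmod's automatic branch swapping: a minimal modulefile sketch, with illustrative paths (not JURECA's actual layout):

```lua
-- Sketch of an Lmod modulefile for a compiler in a hierarchy.
-- Paths are illustrative, not the actual JURECA installation tree.
family("compiler")                 -- only one "compiler" module loadable at a time
prepend_path("MODULEPATH",         -- expose the compiler-dependent module branch
             "/opt/modules/Compiler/GCC/5.4.0")
prepend_path("PATH", "/opt/software/GCC/5.4.0/bin")
```

Loading a second compiler module then triggers a family swap instead of an error, and the MODULEPATH change activates the matching branch.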

EasyBuild Evolution on JURECA

Why EasyBuild on JURECA?

  • Designed exactly for this use case
  • Production ready
  • Easily configurable
  • Nice integration with Lmod and different MNS
  • Active and dynamic project
  • Support for over 1000 packages

Shortcomings

  1. Based on monolithic toolchains
    • Unnecessary redundancy
  2. Auxiliary libraries have their own module
    • Swamped module view
  3. Couldn't cope with GCC-only software
  4. Cryptic toolchain names led to confusion and support issues

Implemented enhancements

  • Enhanced dependency resolution
    • Minimal toolchains
    • Enables using software lower in the toolchain hierarchy
    • Toolchain hierarchy: Compiler → MPI → Math libraries
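As a sketch, minimal-toolchain dependency resolution is driven by a command-line option (the easyconfig name here is illustrative):

```shell
# Resolve each dependency with the least-specific toolchain in the
# hierarchy that can build it (e.g. plain libraries with the compiler-only
# toolchain rather than the full MPI+math one).
# The easyconfig name is an example; --dry-run just shows the plan.
eb HDF5-1.8.17-intel-2016a.eb --robot --minimal-toolchains --dry-run
```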

Implemented enhancements

  • Common base compiler (GCCcore)
    • Enables base layer for compilers, tools and auxiliary libraries
    • Toolchain hierarchy: GCCcore → Compiler → MPI → Math libraries

Implemented enhancements

  • Support for hidden modules
    • Eliminates clutter
    • Supported in various ways (command line options, environment variables, easyconfig parameters)
    • Can hide toolchains (GCCcore)
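For illustration, the same hiding can be requested in several equivalent ways (package names here are examples):

```shell
# Command line: hide specific dependencies and the GCCcore toolchain itself
eb Python-2.7.12-intel-2016a.eb --hide-deps=zlib,ncurses --hide-toolchains=GCCcore

# Environment variable equivalent of --hide-deps
export EASYBUILD_HIDE_DEPS=zlib,ncurses
```

Individual easyconfigs can also set the `hidden` parameter, so the resulting module only shows up with --show-hidden.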

Implemented enhancements

  • Custom module naming schemes
  • Naming scheme-independent software installation directories
  • Performance improvements
  • Refactoring of MPICH-based easyblocks
  • Support for compiler dependent --optarch (since 3.1.0)

Current State in JURECA

Toolchain hierarchy
Toolchain tree
User View and Hidden Modules
  • Initial user view:
    • Compilers (GCC 5.4.0, Intel [2016.4, 2017.0], PGI 16.9)
    • Binary tools (VTune, Advisor, TotalView, ...)

  • After loading a compiler:
    • MPI runtimes (ParaStationMPI, MVAPICH2, IntelMPI)
    • Packages built with GCCcore
    • Packages built with the chosen compiler

  • After loading an MPI runtime:
    • Packages built with the chosen compiler and MPI runtime

  • If another compiler or MPI runtime is loaded on top of the current ones
    • Lmod swaps branches and activates/deactivates modules accordingly (using Lmod's families)
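The staged view above can be sketched as an interactive session (module names and versions taken from the slides; output abridged):

```shell
$ module avail                  # initial view: compilers and binary tools only
$ module load Intel/2016.4      # unlocks MPI runtimes + Intel/GCCcore packages
$ module avail
$ module load ParaStationMPI    # unlocks packages built with Intel+ParaStationMPI
$ module load GCC/5.4.0         # family swap: Lmod deactivates the Intel branch
                                # and reactivates matching modules under GCC
```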
User View and Hidden Modules
Not all packages available for a given combination are visible!

  • There are more than 200 hidden packages in total!
  • Close to 400 in stages with duplicated GCCcore
Bundling Extensions
Python, R and Perl have "extensions"

  • 1 module per extension is totally excessive
  • → Bundles
Bundling Extensions
  • Python
    • Python (30 extensions)
    • SciPy-Stack (22 extensions)
    • PyCUDA (6 extensions)
    • numba (2 extensions)
  • R (365 extensions)
  • Perl (217 extensions)
  • X.Org (229 extensions)
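A bundle is expressed in an easyconfig by listing the extensions in exts_list; a heavily abridged, illustrative sketch (not the actual JURECA file):

```python
# Illustrative easyconfig fragment for a bundle; names/versions are examples.
easyblock = 'Bundle'
name = 'SciPy-Stack'
version = '2016a'
exts_defaultclass = 'PythonPackage'
exts_list = [
    ('numpy', '1.11.0'),
    ('scipy', '0.17.0'),
    # ... ~20 more extensions, all installed under this single module
]
```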
Finding Software
  • module [--show-hidden] available
    • Shows software immediately available

  • module [--show-hidden] spider something[/version]
    • Crawls the module tree looking for something in module names
    • Reports what it finds and how to get to it

  • module key something
    • Crawls the module tree looking for something in module descriptions
    • Reports which modules match
    • May require a follow-up spider to find out how to load them
    • Useful for finding the contents of a bundle (e.g.: numpy)
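A typical search session might look like this (module names are examples from the slides; output abridged):

```shell
$ module key numpy                  # search descriptions; might report SciPy-Stack
$ module spider SciPy-Stack         # list all available versions of the bundle
$ module spider SciPy-Stack/2016a   # shows which compiler/MPI to load first
```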
Upgrading and Retiring Software
Stage concept:

  • Software deployment area for a given timeframe
  • Simply a directory
  • Default stage upgraded every 6 months
  • There is a development stage to test software
  • Tested software is added to our Golden repository
  • (and deployed to production)
  • Close to seamless transitions between stages during maintenance
  • Development and old stages are available but not visible by default
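Since a stage is just a directory, switching stages amounts to pointing MODULEPATH at a different tree (the paths below are hypothetical; actual JSC paths differ):

```shell
# Hypothetical stage layout
ls /usr/local/software/Stages       # 2015b/  2016a/  Devel/ ...

# Opt in to the development stage, which is available but not visible by default:
module use /usr/local/software/Stages/Devel/modules/all
```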
Ensuring Consistency and Quality
Software team

  • Allowed to install software in the development stage
  • Can test different compilation options, dependencies, functionality, etc
  • Anybody in the team can modify any other installation
Ensuring Consistency and Quality
Software manager

  • Only account allowed to install software in the production stages
  • Supervises quality standards on easyconfigs before adding them to the Golden repository
    • Check for correct dependencies
    • Proper programming in the easyconfigs (no hardcoded paths, use of EB variables, etc)
  • Manages the whole infrastructure
Divergence from Upstream EasyBuild
  • Divergence motivated by
    • Use of latest versions available at deployment time
    • Re-positioning of packages in the toolchain hierarchy

  • Most differences are minimal
    • Different versions of software
    • Different versions of dependencies
    • Different toolchains
Divergence from Upstream EasyBuild
EasyConfigs used in JURECA*

EB upstream EasyConfigs 47
JSC EasyConfigs 777


*in the 2016a Stage
Divergence from Upstream EasyBuild
Toolchains used in JURECA*

                  EB upstream TCs   JSC TCs
Compilers                3             0
Comp.+MPI                3             3
Comp.+MPI+Math           3             3

*in the 2016a Stage
Divergence from Upstream EasyBuild
EasyBlocks used in JURECA*
EB upstream EasyBlocks ~65
JSC tweaked EasyBlocks 5
JSC merged EasyBlocks 5
JSC private EasyBlocks 4

*in the 2016a Stage

Demo

Porting to Other Clusters

Porting to Other Clusters
  • Besides JURECA, JSC also has JUROPA3 and JUAMS
    • Similarities: x86_64, InfiniBand, Red Hat based OS
    • Differences: Microarchitecture, different OSes, mix of Xeon Phi and GPUs


  • Minimal changes needed to reuse JURECA's setup
    • Fix erroneous easyconfigs
    • Provide, via EasyBuild, newer versions of obsolete OS packages
    • Obviously: set up the whole environment
Porting to Other Clusters
Software in JUAMS and JUROPA3*
Total packages in JUAMS 671
Total packages in JUROPA3 658
Ad-hoc packages in both 15

*in the 2016a Stage

Common problems

Common problems
  • User complains about software X not being available
    • My answer: "Your eyes can deceive you. Don't trust them"
    • Or: "I find your lack of faith disturbing"
    • Lmod's answer: "You underestimate my power"
    • Most of the time it is there but
      • Hidden
      • Bundled
      • In a different toolchain

Common problems
  • User complains about newly missing modules
    • Typically after a stage update
    • User specified module version
    • Old version is not available in the new stage
      • A new one is
Common problems
  • User complains about missing libraries
    • Typically after a stage update
    • Happens when linking against particular library versions (e.g.: libgsl.so.19)
    • Old version is not available in the new stage
Common problems
  • User complains about missing symbol versions: version `GLIBCXX_3.4.20' not found (required by ...)
    • Typically when loading certain binary tools
    • Happens when the tools ship older library versions (e.g.: libstdc++.so.6)
    • Depends on the module load order
      • Append to LD_LIBRARY_PATH instead of prepending?
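One way to diagnose such clashes (the library path is an example; adjust it to the tool's bundled copy):

```shell
# List the GLIBCXX symbol versions a given libstdc++ provides:
strings /usr/lib64/libstdc++.so.6 | grep GLIBCXX

# See which copy the dynamic linker would actually pick for a binary:
ldd $(which <some-tool>) | grep libstdc++
```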
Common problems
  • Component compatibility
    • Interactions between compilers
    • Intel and PGI need a particular range of underlying GCC versions
    • CUDA needs a particular range of underlying GCC/Intel/PGI versions
    • This forces us to duplicate compilers and/or GCCcore
Common problems
  • Stage updates
    • A lot of hassle to:
      • Find and update new software versions
      • Update dependencies

    • New software: new bugs
      • Particularly important for compilers and MPI runtimes
Common problems
  • Side effect of having a centralized software manager
    • Every software/modules problem is an "EasyBuild problem"
      • "These are not the droids experts you're looking for"

    • Not so frequent anymore
    • Need to clarify the different roles within the team

Common problems
  • User complains about not being able to unload a sticky module
    • My answer: "Use the force"
    • module --force purge
    • This has never been a problem, but I HAD TO make the pun

Future Work/What to change?

EasyBuild changes
  • Automatic upgrades
    • Of dependency versions (start during the hackathon!)
    • Of software versions
  • Linking with -rpath (experimental feature right now)
  • Octave extensions

  • "Fat" easyconfigs
Possible software management changes
  • Reshuffling packages
    • Move Python from compiler to GCCcore?
  • Tracking module usage
    • Lmod hooks and later on XALT?
  • Do not preload Stages by default?
  • Default module sets
    • Preselected packages/views for users that don't care about compilers and MPI runtimes
Deployment in new systems
  • JULIA
    • Cray system: KNL+OmniPath+NVM+Broadwell
  • JURON
    • IBM system: POWER8+NVIDIA P100+InfiniBand+NVM

  • JURECA Booster
    • ~10 PFlops KNL system
    • Integrated with JURECA
    • Jobs can simultaneously use Haswell+KNL partitions

Conclusions

  • EasyBuild enables a small team to deploy and manage a tremendous amount of software

  • We undertook significant effort to

    • Minimize SW replication

    • Provide the latest and greatest (at Stage deployment time)

    • Provide a meaningful user view

Conclusions

  • EasyBuild enables easy porting to similar systems

  • Active project that grows every day

  • Still room for improvement

    • Hackathon yay!

"May the build be with you"

Questions?

[XKCD comic] *Every nerdy technical slide deck needs an XKCD comic strip