High Performance Computing

Getting Started with henry2 Linux Cluster


Henry2 System Configuration

There are 1233 dual Xeon compute nodes in the henry2 cluster. Each node has two Xeon processors (a mix of dual-, quad-, six-, eight-, ten-, and twelve-core) and 2 to 6 gigabytes of memory per core. The total number of cores increases as more are purchased and now exceeds 10,000.

The nodes all have 64-bit processors. Generally, 64-bit x86 executables will run correctly. Some 32-bit executables may run, but they are no longer supported. 64-bit executables are required in order to access more than about 3 GB of memory for program data.

The compute nodes are managed by the LSF resource manager and must not be accessed except through LSF (accounts that access compute nodes directly are subject to immediate termination).

Logins for the cluster are handled by a set of login nodes which can be accessed as login.hpc.ncsu.edu using ssh.

Additional information on the initial henry2 configuration (c. 2004) is available in http://hpc.ncsu.edu/Documents/hpc_cluster_config.pdf.
Some additional information about the cluster architecture is available at http://hpc.ncsu.edu/Hardware/henry2_architecture.php.

Logging onto henry2 cluster

Normal login
    SSH access is supported to the login nodes (login.hpc.ncsu.edu), a set of nodes that use DNS round-robin load balancing. Logins are authenticated using Unity user names and passwords. Microsoft Windows users can use X-Win32 to log on to login.hpc.ncsu.edu. To obtain X-Win32, go to the ITECS Software page, click on the Downloads button near the bottom, log in, and then download and install X-Win32.

    If you are a Mac (or Linux) user, then in the Terminal you can use the command

    ssh -l yourUsername login.hpc.ncsu.edu

    to log on to the HPC login nodes.

    Login nodes should not be used for interactive jobs that consume any significant amount of system resources. The usual way to run CPU-intensive codes is to submit them as batch jobs to LSF, which schedules them for execution on compute nodes. Example LSF job submission files can be found in Intel Compilers. See LSF 9.1.2 for more complete documentation.

Alternative login nodes (VCL)

    It is sometimes necessary to use interactive, GUI-based serial pre- and post-processors for data resident in the HPC environment. Interactive computing in the HPC environment should be performed by requesting a Virtual Computing Lab (VCL) HPC environment.

    To use the VCL HPC environment you need to send e-mail to

    oit_hpc@help.ncsu.edu

    to request to be added to the "vcl" group first. After you are added to the vcl group, go to the web page http://vcl.ncsu.edu and click on "Make a Reservation". If you have not already authenticated with your Unity ID and password, you will be prompted to do so.

    From the list of environments, select "HPC (CentOS 7.1 64 bit VM)".

    (The entry "HPC (CentOS 7.1 64 bit VM)" is not visible if you have not been added to the "vcl" group.)

    When the environment is ready, VCL will provide information on how to log in. VCL provides a dedicated environment, so heavy interactive use will not interfere with other users. If you have problems using the VCL HPC environment, send e-mail to oit_hpc@help.ncsu.edu.

    For more information about VCL HPC, please go to Running Interactive jobs with the HPC VCL image.

File Systems

AFS files are not available from the cluster (but are available on the VCL HPC environments described above).

Home Directory

    Users have a home directory that is shared by all the cluster nodes. The /usr/local file system is also shared by all nodes. The home file system is backed up daily, with one copy of each file retained.

    The home file system quota is intentionally small so that the entire file system can be restored quickly if necessary, since cluster operation depends on the presence of the home file system. Large files and datasets should be stored on other file systems.

Scratch File Systems
    A shared scratch file system, /share, is available to all users. This file system is not backed up, and files may be deleted from it automatically at any time; use of this file system is at the user's own risk. There is a 10 TB group quota on /share. To find the group directory in which the user bfoo has permission to read and write, type the following on a login node:

    grep bfoo /etc/group
    

    A parallel file system, /gpfs_share, is also available. Directories on /gpfs_share can be requested. There is a 1 TB group quota on /gpfs_share. The /gpfs_share file system is not backed up, and files are subject to deletion at any time. Use is at the user's own risk.

Mass Storage

    Finally, the HPC mass storage file systems, /ncsu/volume1 and /ncsu/volume2, are available from the login nodes for storage in excess of what can be accommodated in /home. Since these file systems are not available from the compute nodes, they cannot be used for running jobs.

User files in /home, /ncsu/volume1, and /ncsu/volume2 are backed up daily. A single backup version is maintained for each file. User files in all other file systems are not backed up.

Important files should never be placed on storage that is not backed up unless another copy of the file exists in another location.

HPC projects are allocated 1TB of storage in one of the HPC mass storage systems (volume1 or volume2). Additional backed up space in these file systems can be purchased or leased.

Additional information about storage on HPC resources is available from http://hpc.ncsu.edu/Documents/GettingStartedstorage.php

Software

Many software packages have already been compiled to run on the cluster. The Software link in the left toolbar, or http://hpc.ncsu.edu/Software/Software.php, lists the available software. In many cases there are "HowTos" that explain how to get access and submit example jobs. Suggestions for documentation updates and additional software are encouraged.

Compiling

There are three compiler flavors available on the cluster: 1) the standard GNU compilers supplied with Linux, 2) the Intel compilers, and 3) the Portland Group compilers.

The default GNU compilers are okay for compiling utility programs but in most cases are not appropriate for computationally intensive applications.

Overall the best performance has been observed using the Intel compilers. However, the Intel compilers support very few extensions of the Fortran standard - so codes written using non-standard Fortran may fail to compile without modifications.

The Portland Group compilers tend to be somewhat less syntactically strict than the Intel compilers while still generating more efficient code than the GNU compilers.

For some pointers on using common tools to port codes, see Makefile, Configure, CMake. Additional information about use of the Intel, PGI, and GNU compilers is available from the following links. Generally, objects and libraries built with different compiler flavors should not be mixed, as unexpected behavior may result.

Users whose programs require more than ~1 GB of memory should review the following information.
A note on compiling executables with large (> ~1 GB) memory requirements

Running Jobs

The cluster is designed to run computationally intensive jobs on compute nodes. Running resource-intensive jobs on the login nodes, while technically possible, is not permitted.

Please limit your use of the login nodes to editing, compiling, and transferring files. Running more than one concurrent file transfer program (scp, sftp, cp) from the login nodes is also discouraged.

Running Serial Jobs

To run computationally intensive jobs on the cluster, use the compute nodes. Access to the compute nodes is managed by LSF. All tasks for the compute nodes should be submitted to LSF.

The following steps are used to submit jobs to LSF:

  • Create a script file containing the commands to be executed for your job:
    #!/bin/csh
    
    #BSUB -o standard_output
    #BSUB -e standard_error
    
    cp input /share/myuserid/input
    cd /share/myuserid
    ./job.exe < input
    cp output /home/myuserid
    
    
  • Use the bsub command to submit the script to the batch system. In the following example two hours of run time are requested:
    bsub -W 2:00 < script.csh
    

  • The bjobs command can be used to monitor the progress of a job.
  • The -e and -o options specify the files for standard error and standard output respectively. If these are not specified the standard output and standard error will be sent by email to the account submitting the job.
  • The bpeek command can be used to view standard output and standard error for a running job.
  • The bkill command can be used to remove a job from LSF (regardless of current job status).
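
The submit, monitor, and remove commands listed above can be tied together by capturing the job ID that bsub reports. The snippet below is a minimal sketch rather than an official henry2 recipe. It is written in sh-compatible syntax for use on a login node (the job scripts on this page use csh), the helper name parse_job_id is hypothetical, and it assumes bsub prints its usual confirmation line of the form "Job <12345> is submitted to queue <debug>."

```shell
# Hypothetical helper: extract the numeric job ID from bsub's confirmation
# line, assumed to look like "Job <12345> is submitted to queue <debug>."
parse_job_id() {
    sed -n 's/^Job <\([0-9][0-9]*\)>.*/\1/p'
}

# Example session on a login node (commented out because it needs LSF):
#   jobid=$(bsub -W 2:00 < script.csh | parse_job_id)
#   bjobs "$jobid"    # check the status of the job
#   bpeek "$jobid"    # view standard output and error so far
#   bkill "$jobid"    # remove the job from LSF
```
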
For running many instances of a job with differing data files or other systematic changes in the job submission file, you may want to automate the submission of jobs by writing scripts. A few examples are in the Perl HowTo.
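
As one possible way to automate submission without Perl, the sketch below generates a job script per input file and pipes it to bsub. It uses sh-compatible syntax on the login node, and the function names make_job_script and submit_all, the two-hour limit, and the input file names are illustrative assumptions. The generated job script itself stays in csh, like the examples on this page.

```shell
# Hypothetical helper: print a csh job script for one input file.
# The executable name job.exe follows the serial example on this page.
make_job_script() {
    input=$1
    cat <<EOF
#!/bin/csh
#BSUB -W 2:00
#BSUB -o out.$input.%J
#BSUB -e err.$input.%J
./job.exe < $input
EOF
}

# Hypothetical helper: submit one job per input file named on the command line.
submit_all() {
    for f in "$@"; do
        make_job_script "$f" | bsub
    done
}

# Example (run on a login node; file names are placeholders):
#   submit_all case1.dat case2.dat case3.dat
```

Printing the script to standard output first makes it easy to inspect what would be submitted before piping it to bsub.
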

Running MPI Parallel Jobs

Here is a sample bsub job submission file for running a code compiled with the Intel compilers and an Open MPI library. The job submission file "bfoo" runs "ringping" in parallel on 16 cores.

 
#!/bin/csh

#BSUB -W 15
#BSUB -n 16
#BSUB -R span[ptile=4]
#BSUB -q single_chassis
#BSUB -o out.%J
#BSUB -e err.%J

source /usr/local/apps/openmpi/intel2013_ompi.csh
mpirun ./ringping

The job should be submitted from login01, login02, login03, or login04 (as of April 2017) with the command

bsub < bfoo

The job asks for 16 cores (-n 16) and 15 minutes (-W 15), and runs with 4 cores per blade (-R span[ptile=4]). It generates a standard error file err.xxxxx and a standard output file out.xxxxx. It is submitted to the single chassis queue, for which up to 56 cores and 5760 minutes (4 days) can be requested.
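
To see how -n and ptile interact, the number of blades a request spans can be estimated as the core count divided by the cores per blade, rounded up. The following is a small sketch in sh-compatible syntax; the function name nodes_needed is hypothetical and the calculation only reflects the arithmetic, not how LSF actually places the job.

```shell
# Hypothetical helper: blades needed for N cores at PTILE cores per blade,
# rounded up with integer arithmetic.
nodes_needed() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

# The request above (-n 16 with span[ptile=4]) spreads over this many blades:
nodes_needed 16 4
```
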

To set up the Intel compilers before compiling, use the same source command as in the job submission file above. For the PGI compilers, use

 
source /usr/local/apps/openmpi/ompi184_pgi151.csh

For the GNU compilers (discouraged, because GNU-compiled codes typically run more slowly, and the usual point of parallel computing is to make jobs run faster), use
 
source /usr/local/apps/openmpi/openmpi_gcc.csh

Running MPI Parallel Jobs with Infiniband

The cluster nodes are connected by a Gigabit Ethernet network. A limited number of nodes are also connected by a lower-latency InfiniBand network.

For current MPI libraries, the same executable should be able to use either the Gigabit Ethernet or the InfiniBand network.

Running Shared Memory Parallel Jobs

Henry2 nodes have a mix of dual-, quad-, six-, eight-, ten-, and twelve-core processors, and each node has two processors. Thus the total number of processor cores per node ranges from 4 to 24. All the processor cores on a node share access to all of the memory on the node. Individual nodes can be used to run programs written using a shared memory programming model, such as OpenMP.

Below is a sample shared memory job script that uses 16 processor cores.

 
#!/bin/csh

#BSUB -o out.%J
#BSUB -e err.%J
#BSUB -n 16
#BSUB -R span[hosts=1]
#BSUB -W 15

setenv OMP_NUM_THREADS 16
./exec

The number of job slots requested, -n 16 in this example, needs to match the number of threads the parallel job will use (OMP_NUM_THREADS). The resource request must specify span[hosts=1] to ensure that LSF assigns all the requested job slots on the same node - so they will have access to the same physical memory.
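
One way to keep the thread count and the slot count from drifting apart is to derive the former from the latter. The sketch below assumes that LSF exports the variable LSB_DJOB_NUMPROC (the number of slots allocated to the job) in the job environment, as the LSF documentation describes; this has not been verified on henry2. It uses sh-compatible syntax, and the function name omp_threads is hypothetical.

```shell
# Hypothetical helper: print the thread count to use. Prefer the slot count
# that LSF is assumed to export as LSB_DJOB_NUMPROC; otherwise fall back to
# the value passed as the first argument (default 1).
omp_threads() {
    echo "${LSB_DJOB_NUMPROC:-${1:-1}}"
}

OMP_NUM_THREADS=$(omp_threads 1)
export OMP_NUM_THREADS

# In a csh job script the equivalent line would be:
#   setenv OMP_NUM_THREADS $LSB_DJOB_NUMPROC
```
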

See the individual compilers for the flags needed to compile codes with OpenMP shared memory parallelism enabled. Short course lecture notes on OpenMP from the fall of 2009 give some instructions for converting a Fortran or C code to use OpenMP parallelism.

Requesting Specific Amount of Memory

If your job needs a large amount of memory (RAM), you can use the syntax -R "rusage[mem=6000]", where 6000 is the amount of memory in MB. Note that rusage[mem=6000] is a per-job-slot resource request, so the total memory requested is 6000 multiplied by the number of job slots specified with -n.

Below is an example job script which has -n 16 and rusage[mem=8000]. Thus the total amount of memory that the example job script requests is 8000 x 16 = 128,000 MB, or about 128 GB.

#!/bin/csh 

#BSUB -o out.%J
#BSUB -e err.%J
#BSUB -n 16
#BSUB -R "rusage[mem=8000]  span[hosts=1]" 
#BSUB -W 15
#BSUB -q shared_memory 

setenv OMP_NUM_THREADS 16
./exec
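
Because the rusage value is multiplied by the slot count, it can help to compute the total before submitting and compare it with the memory of the target nodes. A minimal sketch in sh-compatible syntax follows; the function name total_mem_mb is hypothetical.

```shell
# Hypothetical helper: total memory request in MB for a job with
# SLOTS job slots (-n) and MEM_PER_SLOT MB per slot (rusage[mem=...]).
total_mem_mb() {
    echo $(( $1 * $2 ))
}

# The script above requests 16 slots at 8000 MB each:
total_mem_mb 16 8000
```
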

As of September 2016, the maximum amount of RAM available to a shared_memory queue job was 512 GB. Nine nodes have 128 GB of RAM, and three nodes have 512 GB.

Job Queues and LSF

A number of LSF queues are configured on the henry2 cluster. Often the best queue will be selected without the user specifying a queue to the bsub command. In some cases LSF may override user queue choices and assign jobs to a more appropriate queue.

Jobs requesting 16 or fewer processors and 100 minutes or less time are assigned to the debug queue and run with minimal wait times. Once a user is satisfied a job is running well, more time will typically be requested.

Queues available to all users support jobs running on up to 256 processors for two days or jobs running for up to 15 days on up to 16 processors. Jobs that need up to two hours and up to 32 processors are run in a queue that has access to nearly all cluster nodes [generally the queues open to all users only have access to nodes that were purchased with central funding]. Jobs that require 56 or fewer processors and up to 4 days are placed in the single chassis queue. Jobs in this queue are scheduled on nodes located within the same physical chassis - resulting in better message passing bandwidth and lower latency for messages.

Partners, those who have purchased nodes to add to the henry2 cluster, may add the bsub option -q partnerqueuename to place their job in the partner queue. Partner queues are dedicated to the use of the partner and their project and have priority access to the number of processors the partner has added to the cluster.

A note on LSF job scheduling provides some additional details regarding how LSF is configured on the henry2 cluster.

LSF writes some intermediate files in the user's home directory as jobs are starting and running. If the user's disk quota has been exceeded, the batch job will fail, often without any meaningful error messages or output. The quota command displays usage of the /home file system.