High Performance Computing

Getting Started with henry2 Linux Cluster

Page Contents:

Henry2 System Configuration

There are 1233 dual-Xeon compute nodes in the henry2 cluster. Each node has two Xeon processors (a mix of single-, dual-, quad-, six-, eight-, and ten-core) and 2 to 6 gigabytes of memory per core.

The nodes all have 64-bit processors. Generally, either 32-bit or 64-bit x86 executables will run correctly. 64-bit executables are required in order to access more than about 3GB of memory for program data.

The compute nodes are managed by the LSF resource manager and may be accessed only through LSF (accounts that access compute nodes directly are subject to immediate termination).

Logins for the cluster are handled by a set of login nodes which can be accessed as login.hpc.ncsu.edu using ssh.
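As a quick sketch, a login session can be opened from any terminal with an SSH client (replace "unityid" with your own Unity ID; the host name is the one given above):

```shell
# Log in to a henry2 login node; DNS round-robin selects one of the login nodes.
# "unityid" is a placeholder for your Unity ID.
ssh unityid@login.hpc.ncsu.edu
```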

Additional information on the initial henry2 configuration (c. 2004) is available at http://hpc.ncsu.edu/Documents/hpc_cluster_config.pdf.
Some additional information about the cluster architecture is available at http://hpc.ncsu.edu/Hardware/henry2_architecture.php.

Logging onto the henry2 cluster

Normal login
    SSH access is supported to the login nodes (login.hpc.ncsu.edu), a set of nodes using DNS round-robin load balancing. Logins are authenticated with Unity user names and passwords. Microsoft Windows users can use X-Win32 to log on to login.hpc.ncsu.edu. To obtain X-Win32, go to the ITECS Software page, click the Downloads button near the bottom, log in, and then download and install X-Win32.

    Login nodes should not be used for interactive jobs that take any significant amount of system resources. The usual way to run CPU-intensive codes is to submit them as batch jobs to LSF, which schedules them for execution on compute nodes. Example LSF job submission files can be found under Intel Compilers. See LSF 9.1.1 for more complete documentation.

Alternative login nodes (VCL)

    It is sometimes necessary to use interactive, GUI-based serial pre- and post-processors on data resident in the HPC environment. Interactive computing in the HPC environment should be performed by requesting a Virtual Computing Lab (VCL) HPC environment.

    To use the VCL HPC environment you need to send e-mail to


    to request to be added to the "vcl" group first. After you have been added to the vcl group, you can go to the web page http://vcl.ncsu.edu and click on "Make a Reservation". If you have not already authenticated with your Unity ID and password, you will be prompted to do so.

    From the list of environments, select "HPC (CentOS 7.1 64 bit VM)".

    (You cannot see the entry of "HPC (CentOS 7.1 64 bit VM)" if you have not been added to the "vcl" group.)

    When the environment is ready, VCL will provide information on how to log in. VCL provides a dedicated environment, so heavy interactive use will not interfere with other users. If you have problems using the VCL HPC environment, send e-mail to: oit_hpc at help dot ncsu dot edu.

    For more information about the VCL HPC environment, please go to Running Interactive jobs with the HPC VCL image.

File Systems

AFS files are not available from the cluster (but are available on the VCL HPC environments described above).
Home Directory

    Users have a home directory that is shared by all the cluster nodes; the /usr/local file system is also shared by all nodes. The home file system is backed up daily, with one copy of each file retained.

    The home file system quota is intentionally small so that the entire file system can be restored quickly if necessary, since cluster operation depends on the presence of the home file system. Large files and datasets should be stored on other file systems.

Scratch File Systems
    Three shared scratch file systems, /share, /share2, and /share3, are available to all users. These file systems are not backed up, and files may be deleted from them automatically at any time; use of these file systems is at the user's own risk. There is a 1 TB group quota on each of these file systems.

    A parallel file system, /gpfs_share, is also available, and directories on it can be requested. There is a 1 TB group quota imposed on /gpfs_share. The /gpfs_share file system is not backed up, and files are subject to deletion at any time. Use is at the user's own risk.

Mass Storage

    Finally, from the login nodes the HPC mass storage file systems, /ncsu/volume1 and /ncsu/volume2, are available for storage in excess of what can be accommodated in /home. Since these file systems are not available from the compute nodes, they cannot be used for running jobs.

User files in /home, /ncsu/volume1, and /ncsu/volume2 are backed up daily. A single backup version is maintained for each file. User files in all other file systems are not backed up.

Important files should never be placed on storage that is not backed up unless another copy of the file exists in another location.

HPC projects are allocated 1TB of storage in one of the HPC mass storage systems (volume1 or volume2). Additional backed up space in these file systems can be purchased or leased.

Additional information about storage on HPC resources is available from http://hpc.ncsu.edu/Documents/GettingStartedstorage.php


Many software packages have already been compiled to run on the cluster. If you click on Software in the left toolbar or go to http://hpc.ncsu.edu/Software/Software.php, you'll see a list of software. In many cases there are "HowTos" that explain how to get access and submit example jobs. Suggestions for documentation updates and for additional software are encouraged.


There are three compiler flavors available on the cluster: 1) the standard GNU compilers supplied with Linux, 2) the Intel compilers, and 3) the Portland Group compilers.

The default GNU compilers are okay for compiling utility programs but in most cases are not appropriate for computationally intensive applications.

Overall, the best performance has been observed using the Intel compilers. However, the Intel compilers support very few extensions of the Fortran standard, so codes written using non-standard Fortran may fail to compile without modification.

The Portland Group compilers tend to be somewhat less syntactically strict than the Intel compilers while still generating more efficient code than the GNU compilers.

Additional information about use of each of these compilers is available from the following links. Generally objects and libraries built with different compiler flavors should not be mixed as unexpected behavior may result.

Users whose programs have memory requirements of more than ~1 GB should review the following information:
A note on compiling executables with large (> ~1 GB) memory requirements

Running Jobs

The cluster is designed to run computationally intensive jobs on compute nodes. Running resource intensive jobs on the login nodes, while technically possible, is not permitted.

Please limit your use of the login nodes to editing, compiling, and transferring files. Running more than one concurrent file transfer program (scp, sftp, cp) from the login nodes is also undesirable.
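As a sketch, a single transfer from a local machine to the cluster home directory might look like the following (the file name and "unityid" are placeholders):

```shell
# Copy one input file from a local machine to the henry2 home directory.
# Run this on the local machine, not on a login node.
scp input.dat unityid@login.hpc.ncsu.edu:~/
```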

Running Serial Jobs

To run computationally intensive jobs on the cluster, use the compute nodes. Access to the compute nodes is managed by LSF. All tasks for the compute nodes should be submitted to LSF.

The following steps are used to submit jobs to LSF:

  • Create a script file containing the commands to be executed for your job:
    #BSUB -o standard_output
    #BSUB -e standard_error
    cp input /share/myuserid/input
    cd /share/myuserid
    ./job.exe < input
    cp output /home/myuserid
  • Use the bsub command to submit the script to the batch system. In the following example two hours of run time are requested:
    bsub -W 2:00 < script.csh

    In the transition between CentOS 5 and CentOS 7, jobs submitted from login01.hpc.ncsu.edu (or login02 or login03) run on blades with the CentOS 7 operating system. Jobs submitted from login blade login52.hpc.ncsu.edu run on the CentOS 5 operating system (May 2015).

  • The bjobs command can be used to monitor the progress of a job
  • The -e and -o options specify the files for standard error and standard output respectively. If these are not specified the standard output and standard error will be sent by email to the account submitting the job.
  • The bpeek command can be used to view standard output and standard error for a running job.
  • The bkill command can be used to remove a job from LSF (regardless of current job status).
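Putting the steps above together, a typical serial-job session might look like the following sketch (the job ID 12345 is hypothetical; bsub prints the real ID when the job is submitted):

```shell
bsub -W 2:00 < script.csh   # submit the script; LSF prints a job ID, e.g. <12345>
bjobs 12345                 # monitor the job's status (PEND, RUN, DONE, ...)
bpeek 12345                 # view standard output/error of the running job
bkill 12345                 # remove the job from LSF, whatever its status
```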

Running MPI Parallel Jobs with CentOS 7

Here's a sample bsub job submission file for running a code compiled with the Intel compilers and an OpenMPI library. To run on CentOS 7 blades, having compiled on login01, login02, or login03, the following job submission file "bfoo" runs "ringping" in parallel on 16 CentOS 7 cores.


#BSUB -W 15
#BSUB -n 16
#BSUB -R span[ptile=4]
#BSUB -q single_chassis
#BSUB -o out.%J
#BSUB -e err.%J

source /usr/local/apps/openmpi/intel2013_ompi.csh
mpirun ./ringping
The job should be submitted from login01, login02, or login03 (April 2016) by the command
bsub < bfoo
The job asks for 16 cores (-n 16) and 15 minutes (-W 15), and runs with 4 cores per blade (-R span[ptile=4]). It generates a standard error file err.xxxxx and a standard output file out.xxxxx. It is submitted to the single_chassis queue, in which up to 56 cores and 5760 minutes (4 days) can be requested.

To set up the Intel compilers before compiling, use the same source command as in the job submission file above. For the PGI compilers, you would use

source /usr/local/apps/openmpi/ompi184_pgi151.csh

For the GNU compilers (discouraged because GNU-compiled codes typically run more slowly, whereas the usual point of parallel computing is to get jobs to run more quickly):
source /usr/local/apps/openmpi/openmpi_gcc.csh
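For example, to build an MPI code with the Intel toolchain before submitting, a compile session might look like the following sketch (the source file name ringping.f90 is a placeholder):

```shell
# Set up the Intel + OpenMPI environment, then compile with the MPI wrappers.
source /usr/local/apps/openmpi/intel2013_ompi.csh
mpif90 -O2 -o ringping ringping.f90   # or mpicc for C codes
```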

The other "Running MPI Parallel Jobs" sections are for CentOS 5 and are still valid for jobs compiled on and submitted from login01 through login05.

Running MPI Parallel Jobs with Hydra MPICH2

After an LSF upgrade in November 2014, codes compiled with MPICH-2 libraries exhibited run time errors, "sendRegisterTask: nb_tcpConnect" was an example. Since the previous version of LSF had grown unstable and was no longer supported, we kept the upgrade. Compiling and running with the MPICH-2 parallel libraries requires some changes in the bsub job submission file.

As before, for MPICH-2, environment variables are set with "add pgi_hydra" or "add intel_hydra" for the PGI and Intel compilers, respectively. Alternatively, "source /usr/local/apps/env/pgi_hydra.csh" or "source /usr/local/apps/env/intel_hydra.csh".

We discourage use of the GNU compilers for parallel computation, but if you do need an equivalent parallel tool chain, please contact HPC support. For MPICH-2 compiled codes, a job submission script bfoo looks like


#BSUB -n 4
#BSUB -W 15
#BSUB -R span[ptile=2]
#BSUB -o standard_output.%J
#BSUB -e standard_error.%J

set tasks = `mkmach.pl mach`
mpihydra -n $tasks -f mach -bootstrap-exec blaunch ./mpi-hello

MPICH-2 compiled codes no longer support standard input, so you cannot specify a standard input file inputfile with

#BSUB -i inputfile
The following syntax for standard input also gives a runtime error.
mpihydra -n $tasks -f mach -bootstrap-exec blaunch ./mpi-hello < inputfile

If you want to use standard input for an MPI code compiled for the GigE or 10GigE interconnects, use the MPICH-3 libraries instead of MPICH-2 (see the next section).

The span[ptile=2] requests that job tasks be distributed two per node. This specification is optional, and the value can range from 1 to 12. Specifying a particular number of tasks per node may result in a longer wait in the queue for the available resources to match the request.

Setting setenv MPICH_NO_LOCAL 1 specifies that all MPI messages be passed through sockets rather than the shared memory available on a node. If setenv MPICH_NO_LOCAL 1 is omitted, the span[ptile=...] specification must remain. Some possible alternative lines are

#BSUB -R span[ptile=4]
which would allocate 4 MPI processes on each node, or
#BSUB -R span[ptile=8]
which would allocate 8 MPI processes each on quad-core (8 cores total) nodes. "span[ptile=8]" restricts the choice of nodes on which LSF can schedule jobs to empty quad-core nodes or to 12-core (hex) or 16-core (oct) nodes. Oct-core nodes are usually not empty, so asking for 16 cores on a node can entail a long wait before running.

If the number of MPI processes on each node (specified by -R span[ptile=...) is not given, then the line "setenv MPICH_NO_LOCAL 1" is necessary. But even with "setenv MPICH_NO_LOCAL 1", a ptile setting often helps job execution performance. (Absent a ptile setting, many processes may land on a few nodes, and runtime bottlenecks can occur as many processes communicate through a few sockets.)

Running MPI Parallel Jobs with Hydra MPICH3

Here's a sample bsub job submission file for running a code compiled with the Intel compilers and an MPICH-3 library.


#BSUB -W 15
#BSUB -n 16
#BSUB -R span[ptile=4]
#BSUB -q single_chassis
#BSUB -o out.%J
#BSUB -e err.%J

source /usr/local/apps/mpich3/int111_mpich3.csh
mpiexec.hydra ./ringping

To set up the Intel compilers before compiling, use the same source command as in the job submission file above. For the PGI compilers, you would use

source /usr/local/apps/mpich3/pgi_mpich3_hydra-134.csh

For the GNU compilers (discouraged because GNU-compiled codes typically run more slowly, whereas the usual point of parallel computing is to get jobs to run more quickly):
source /usr/local/apps/mpich3/gnu454_mpich3.csh

Running MPI Parallel Jobs with Infiniband

The cluster nodes are connected by a Gigabit network. A limited number of nodes are connected by a lower-latency InfiniBand network.

For the OpenMPI libraries we are supporting under CentOS 7, the same executable should be able to use either the GigE or the InfiniBand network (though few InfiniBand nodes are yet available for CentOS 7). To use CentOS 5 InfiniBand-connected blades, MPI codes need to be (re)compiled and linked with the mvapich libraries. For the PGI compilers,

add pgi_mvapich

or inside a bsub job submission script
source /usr/local/apps/env/pgi_mvapich.csh 

or for intel compilers
add intel_mvapich

or inside a bsub job submission script
source /usr/local/apps/env/intel_mvapich.csh

will set environment variables so that mpif90 and mpicc use the pgf90 and pgcc compilers (or ifort and icc) and link to the InfiniBand MPICH (mvapich) libraries.

In order to run InfiniBand jobs, a couple of environment variables need to be set on all nodes on which the job will run. To do that, edit the file .tcshrc in your home directory; .tcshrc is executed as part of the setup process for parallel jobs.

ls -l .tcshrc

will show whether you already have a .tcshrc file. Put or append the lines
setenv RLMIT_MEMLOCK 1000000
limit memorylocked unlimited

to the .tcshrc file.
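One way to append those two lines (assuming a tcsh login shell, as the document does) is with a here-document:

```shell
# Append the InfiniBand memory-lock settings to ~/.tcshrc
# (the file is created if it does not already exist).
cat >> ~/.tcshrc << 'EOF'
setenv RLMIT_MEMLOCK 1000000
limit memorylocked unlimited
EOF
```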

Once you have an mvapich-linked executable, you can submit your InfiniBand job to the standard_ib queue. Instead of mpiexec_hydra, use mpiexec_mvapich. A sample bsub script follows.


#BSUB -W 15
#BSUB -n 16 
#BSUB -q standard_ib 
#BSUB -R "span[ptile=4] same[chassis]" 
#BSUB -o out.%J
#BSUB -e err.%J

source /usr/local/apps/env/pgi_mvapich.csh
mpiexec_mvapich ./ringping

Note that if you use span[ptile=4] to specify 4 cores per node, you also need same[chassis]; otherwise the standard_ib queue may put jobs on two different chassis, causing a runtime InfiniBand error.

For performance reasons, we do not recommend using the GNU compilers. If you find a need to use them (for example, your code only compiles with GNU, or you want to make sure your code works with open-source compilers), please contact HPC support.

Running Shared Memory Parallel Jobs

Henry2 nodes are a mix of dual-, quad-, six-, and eight-core processors, where each node has two processors. Thus total processor cores per node range from 4 to 24. All the processor cores on a node share access to all of the memory on the node. Individual nodes can be used to run programs written using a shared memory programming model, such as OpenMP.

To submit a shared memory job that uses multiple cores on a single node, use the bsub options -n 16 -x, which request exclusive use of a node. An example submission file might be


#BSUB -o out.%J
#BSUB -e err.%J
#BSUB -n 16
#BSUB -R "rusage[mem=128000]  span[hosts=1]" 
#BSUB -W 15
#BSUB -q shared_memory 


If the above file is shmemjob, it could be submitted by the command

bsub < shmemjob
and will run on a node with 16 cores.

As of September 2013, the maximum amount of RAM available for a shared_memory queue job is 512 GB (2 nodes). 9 nodes have 128 GB of RAM, and 3 nodes have 128 GB. To request memory, use the -R rusage[mem=xxxx] flag, where mem is expressed in megabytes. The bsub file above used 128000 (128 followed by three zeros) to request 128 GB of RAM.
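Since mem is given in megabytes, the conversion from gigabytes is a simple multiplication; a small sketch:

```shell
# rusage[mem=...] takes megabytes: 128 GB -> 128000 MB, as in the file above.
mem_gb=128
mem_mb=$((mem_gb * 1000))
echo "#BSUB -R \"rusage[mem=${mem_mb}]\""   # prints: #BSUB -R "rusage[mem=128000]"
```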

Shared memory jobs can also be run on other nodes, but with access to fewer total processor cores. A script such as the following would use nodes with two quad-core processors to access 8 total processor cores.


#BSUB -o out.%J
#BSUB -e err.%J
#BSUB -n 8
#BSUB -R span[hosts=1]
#BSUB -W 15


The number of job slots requested, -n 8 in this example, needs to match the number of threads the parallel job will use (OMP_NUM_THREADS). The resource request must specify span[hosts=1] to ensure that LSF assigns all the requested job slots on the same node - so they will have access to the same physical memory.

See the individual compilers for the flags needed to compile codes to enable OpenMP shared memory parallelism. Short course lecture notes on Openmp from the fall of 2009 give some instructions for converting a Fortran or C code to use OpenMP parallelism.

Running Hybrid (MPI + Shared Memory) Parallel Jobs
Normally, when running a hybrid parallel job, you want to place 1 MPI process on each node and, under that MPI process, use all the cores available on that node. The following simple sample script can be used to run a hybrid parallel job "hybrid-job".

#BSUB -o standard_output.%J
#BSUB -e standard_error.%J
#BSUB -n 16 
#BSUB -x
#BSUB -R "qc span[ptile=1]"
#BSUB -W 15

source /usr/local/apps/env/intel_mpich2_hydra-101.csh

setenv OMP_NUM_THREADS `grep processor /proc/cpuinfo | wc -l`; mpiexec_hydra ./hybrid-job
If the script is named hybrid-job.csh, then it can be submitted by the command
bsub < hybrid-job.csh
The following specifications in the above script are necessary for running a hybrid parallel job:
  1. The specification of -x requests exclusive use of each node.
  2. The specification of span[ptile=1] requests that 1 MPI process be placed on each node. Thus, there are 16 nodes and each node gets 1 MPI process.
  3. The specification of qc means that you are requesting quad-core nodes. This enables you (most probably) to get nodes with same type of cores on each node and with same number of cores on each node. (If the nodes have different types of cores or different numbers of cores, then some nodes may be under-utilized.) You may change qc to dc to request dual-core nodes.
  4. The source step is necessary for setting up appropriate Hydra MPICH2 related environment variables. Depending on your situation, you may need to source a different file such as /usr/local/apps/env/pgi_mpich2_hydra-105.csh
  5. The command
    setenv OMP_NUM_THREADS `grep processor /proc/cpuinfo | wc -l`
    sets the environment variable OMP_NUM_THREADS to the number of cores on the node, regardless of how many cores the node has. This ensures that all the cores on each node are utilized.
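The backquoted command in step 5 can be tried on its own on any Linux node to see what value OMP_NUM_THREADS would receive; it simply counts the "processor" lines in /proc/cpuinfo (assumption: /proc/cpuinfo is available, as on the cluster's Linux nodes):

```shell
# Count logical cores the same way the hybrid script's backquoted command does.
cores=$(grep -c '^processor' /proc/cpuinfo)
echo "OMP_NUM_THREADS would be set to $cores"
```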

Job Queues and LSF

A number of LSF queues are configured on the henry2 cluster. Often the best queue will be selected without the user specifying a queue to the bsub command. In some cases LSF may override user queue choices and assign jobs to a more appropriate queue.

Jobs requesting 4 or fewer processors and 15 minutes or less time are assigned to the debug queue and run with minimal wait times. Once a user is satisfied a job is running well, more time will typically be requested.

Queues available to all users support jobs running on up to 128 processors for one day or jobs running for up to a week on up to 16 processors. Jobs that need up to two hours and up to 28 processors are run in a queue that has access to nearly all cluster nodes [generally the queues open to all users only have access to nodes that were purchased with central funding]. Jobs that require 28 or fewer processors (but more than 2 hours) are placed in the single chassis queue. Jobs in this queue are scheduled on nodes located within the same physical chassis - resulting in better message passing bandwidth and lower latency for messages.

Partners, those who have purchased nodes to add to the henry2 cluster, may add the bsub option -q partnerqueuename to place their job in the partner queue. Partner queues are dedicated to the partner and their project and have priority access to the number of processors the partner has added to the cluster.

A note on LSF job scheduling provides some additional details regarding how LSF is configured on henry2 cluster.

LSF writes some intermediate files in the user's home directory as jobs start and run. If the user's disk quota has been exceeded, the batch job will fail, often without any meaningful error messages or output. The quota command displays usage of the /home file system.
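For example, it can be worth checking home-directory usage from a login node before submitting jobs:

```shell
# Show /home usage and quota for the current user; a job can fail,
# often silently, if the /home quota is exceeded.
quota
```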