Learn best practices for creating a batch submission script.
If the queue is not specified, LSF will attempt to choose the most appropriate queue based on the given resource requirements. For either serial or MPI jobs that do not require special resources, let LSF choose a default queue.
For parallel jobs that do not use MPI, i.e., shared memory jobs, use the LSF specifier #BSUB -R span[hosts=1] to ensure that all cores requested are confined to one node.
Sometimes LSF will choose an inappropriate queue given a very specific set of requirements. For jobs with resource requirements, investigate the available queues in the LSF Resources documentation.
There are three common queues that are not default queues: gpu, standard_ib, and mixed_ib.
For users who have access to a partner queue, using the partner queue may shorten wait times; however, this may not be the case if the partner queue is heavily used by other group members. Not all partner queues have access to all types of hardware. For example, partner queues rarely contain GPUs.
To display the queues available to a user, use
bqueues -u user_name
To display the properties of a queue, use
bqueues -l queue_name
For help interpreting the output of bqueues, see this example.
Requesting more cores will not automatically make an application faster. The software must have been written in a way that allows the program to utilize more cores.
Please look at the video on Parallel Jobs, which explains what cores are and how many should be requested.
Serial jobs: If a program is serial, i.e., it does not know how to use multiple cores, then ask for 1 core only. Requesting more than 1 core to avoid queue limitations violates the Acceptable Use Policy (AUP).
#BSUB -n 1
Shared memory jobs: If a program is documented to be multithreaded or to use shared memory, it may be run with as many cores as exist on a given node. For more information on requesting a specific core count, see LSF specification for resource by processor type. The requested number of cores must be confined to a single node by using the span resource specifier:
#BSUB -R span[hosts=1]
or the ptile resource specifier:
#BSUB -n #numcores
#BSUB -R span[ptile=#numcores]
Distributed memory jobs: If a program is documented to be able to run in distributed memory or to use MPI, it may be run with many cores and distributed over several nodes. The optimal number of cores and nodes depends heavily not only on the software but also on the problem size. Consult the software documentation and conduct short experiments with a small sample data set, a subset of the original data, or the entire original data set for a limited number of time steps. Too few cores may result in a wall clock time longer than what is allowed in the queues, while a very high core count request could result in more time spent waiting in a queue.
Finally, perform a small test of your application with different numbers of cores, e.g., 2, 4, 8. If the code doesn't get faster, do not run it with more cores. Poor scaling may also indicate a buggy application.
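One way to read the results of such a test is to convert the measured wall clock times into speedup and parallel efficiency. The following is a minimal sketch, assuming the timings were recorded by hand; the function name scaling_report and the sample timings are hypothetical and do not come from any real application.

```shell
#!/bin/bash
# scaling_report BASE_CORES BASE_SECONDS CORES SECONDS
# Prints the speedup and parallel efficiency of a run relative to a baseline run.
scaling_report() {
    awk -v c0="$1" -v t0="$2" -v c="$3" -v t="$4" 'BEGIN {
        speedup = t0 / t                  # how much faster than the baseline
        efficiency = speedup / (c / c0)   # 1.00 would be perfect scaling
        printf "cores=%d speedup=%.2f efficiency=%.2f\n", c, speedup, efficiency
    }'
}

# Hypothetical wall clock times (seconds): 1000 on 2 cores, 520 on 4, 410 on 8.
scaling_report 2 1000 4 520
scaling_report 2 1000 8 410
```

With these made-up numbers, the efficiency drops noticeably between 4 and 8 cores, which is the kind of result that argues against requesting more cores.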
Please look at the video on Parallel Jobs, which explains the definitions for hardware (nodes and cores) or software (MPI) for parallel programming.
Running with the incorrect LSF specifications can result in violating the Acceptable Use Policy, and you may be asked to terminate your jobs. Read the documentation to determine the expected behavior of an application, then confirm the behavior with a short test.
Please look at the video on Parallel Jobs, which explains 'shared memory' and gives a demo of testing code behavior.
When searching through the application's documentation, search for words such as cores, threads, parallel, multithreading. An application usually has a default thread count, which could be a fixed number like 1 or 8, or it could be all cores available on the node. In some cases it is set to all cores detected minus 1, usually for applications developed for a PC, to leave one core for the OS. The default threading behavior of these programs can often be changed by adding a command line argument or a function call, e.g.:
-t
--threads
CPUCOUNT=
numThreads()
Some programming tools have parallel functions, including MATLAB's parpool and parfor, and also some R libraries including snow, parallel, doParallel, and foreach. Check the functions used in such scripts before running.
To confirm the threading behavior of an application, do a short interactive test using the following parameters:
bsub -Is -n 8 -R "span[hosts=1]" -x -W 10 tcsh
This will request a node with at least 8 cores. (Increase n to reserve a node with a higher minimum core count. All nodes currently have at least 8 cores.) It will ensure exclusive use of the node. Interactive debugging sessions using the exclusive option should be kept very short to avoid creating long lines in the queue. Make sure to exit the session promptly after the testing is complete.
Before running, confirm the session is on a compute node by doing
echo $HOSTNAME
The host name should not have login in it.
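The same check can be scripted so that a test does not start on a login node by accident. This is a minimal sketch, assuming bash is available on the node (the interactive session above starts tcsh, so the script would be run with bash explicitly); the function name is_login_node is hypothetical, and the only rule it applies is the one stated above, that login node names contain login.

```shell
#!/bin/bash
# Return 0 (true) if the given host name looks like a login node,
# i.e. the string "login" appears anywhere in the name.
is_login_node() {
    case "$1" in
        *login*) return 0 ;;
        *)       return 1 ;;
    esac
}

# Use HOSTNAME if it is set; otherwise ask the system for the node name.
host="${HOSTNAME:-$(uname -n)}"
if is_login_node "$host"; then
    echo "WARNING: $host looks like a login node; do not run tests here."
else
    echo "OK: $host does not look like a login node."
fi
```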
module load mymodule
./mycode &
htop
The command htop shows the cores active on the node. It also shows the amount of memory used. htop can be confusing as it is not static. To show a snapshot of processes and threads running, use top:
top -n 1 -H
Important: htop/top are to be used to confirm the code's behavior, not to determine it experimentally! The number of threads used may depend on the inputs, and multithreading may come in bursts that are not visible during the htop session. This could be from the code spawning threads as it enters a multithreaded function or subroutine. When in doubt about the threading behavior of an application, use the -x option.
When searching through the application's documentation, search for words such as multiple nodes, distributed memory, MPI. Also, if a code is running in distributed memory, it usually requires a module containing MPI (PrgEnv-intel or openmpi-gcc) and the use of mpirun:
module load openmpi-gcc
mpirun mycode
To test whether a code works properly over multiple nodes, i.e., works properly in distributed memory, do a short timing test using the following parameters:
#BSUB -n 2
#BSUB -R span[ptile=1]
#BSUB -x
#BSUB -W 10
This will reserve 1 core on each of 2 different nodes in LSF. If the code runs properly, the code will execute on both nodes. When the code doesn't work properly in distributed memory (this can be especially true if the user attempts to install their own version of MPI), one of two things may happen: the code may try to run on two nodes but the communication won't work, leaving the work for a single task; or the two tasks of the program will run on the first node the job lands on, resulting in more tasks on the node than requested through LSF. Here, the -x ensures the job doesn't interfere with someone else's job if it doesn't work as expected.
To make sure the code runs properly, do a timing test with two runs: the first using the above ptile=1 example (which guarantees that 2 nodes are requested with 1 task scheduled per node), and the second using ptile=2 (which guarantees that both tasks are scheduled on the same node). For the timing test to be meaningful, the nodes must have the same (or almost the same) clock speed and memory, or else one node may simply be faster than another. To ensure that, pick a host group or specify a particular resource.
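The two runs of the timing test should differ only in the ptile value, so one way to keep them consistent is to generate both job scripts from a single template. The sketch below is illustrative only: the function name make_timing_job and the output file name passed to -o are hypothetical, while the module, mpirun line, and #BSUB values are the ones used in the text above. A host group or resource line would still need to be added to pin both runs to comparable nodes.

```shell
#!/bin/bash
# make_timing_job PTILE
# Print a two-task LSF job script; only the ptile value changes between runs.
make_timing_job() {
    cat <<EOF
#!/bin/bash
#BSUB -n 2
#BSUB -R span[ptile=$1]
#BSUB -x
#BSUB -W 10
#BSUB -o timing_ptile$1.%J.out
module load openmpi-gcc
mpirun mycode
EOF
}

echo "### ptile=1: one task on each of two nodes"
make_timing_job 1
echo "### ptile=2: both tasks on the same node"
make_timing_job 2
```

Each generated script can be saved to a file and submitted with bsub < file; the run times reported in the two LSF output files can then be compared.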
Note the above timing test with -n 2 will be sufficient to show the code doesn't work properly if it fails, but it is not conclusive that it is correct if the expected speed-up does occur; it may simply mean the code is executing fine with both tasks on the first node.
HPC Staff have additional monitoring tools. If in doubt, contact HPC Staff to arrange for a staff monitored test. Staff generally do not have permissions to a user's application, and that is the preferred method of operation. Staff can schedule a time to monitor a job being run by the user.
Already sure the MPI code works properly? Do a timing test anyway. Do not request more cores if the code does not get better performance with more cores. The resources requested should be justified by the efficiency and performance of the code.
The LSF output file may have useful information regarding the parallel or threading behavior of an application. See the following example for more details.
Can I change my code's behavior based on the number of cores I am assigned?
If a code or script can be modified to take a command line argument, the LSF variable $LSB_DJOB_NUMPROC can be used to set the number of cores used in the program to the number of cores assigned by LSF.
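As a minimal sketch of this idea, the snippet below reads the assigned core count and passes it to a program. It assumes the variable is spelled LSB_DJOB_NUMPROC, as in IBM's LSF documentation; the helper name assigned_cores is hypothetical, and ./mycode with a --threads option stands in for whatever argument the real application documents.

```shell
#!/bin/bash
# Print the number of cores LSF assigned to this job.
# Falls back to 1 when the variable is unset, e.g. when testing outside LSF.
assigned_cores() {
    echo "${LSB_DJOB_NUMPROC:-1}"
}

ncores="$(assigned_cores)"

# Placeholder command: replace with the real program and its documented
# thread-count option. Printed here rather than executed.
echo "would run: ./mycode --threads $ncores"
```

Defaulting to 1 keeps the program from falling back to its own default, which may be every core on the node.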
High memory or multithreading.
Some serial jobs require a very large amount of memory. If the job is serial, the proper number of cores to request is -n 1; however, requesting one core on a 16 core node may result in 15 other jobs being assigned to the same node. That could lead to all jobs failing because of lack of memory. In this case, -x must be used, or the -R "rusage[mem=??]" must reserve enough memory such that LSF should not allocate any additional jobs to the node.
Some programs detect the number of cores on a processor and automatically spawn a number of threads equal to the number of cores on the node. This means that even though -n 4 was used, if the program starts on a node with 16 cores, it will use 16 cores regardless of whether four cores were requested. In this case, -x must be used, or the documentation must be examined to identify how to limit the number of threads spawned by the program.
The video on Parallel Jobs explains the various cases where exclusive should be used.
#BSUB -n 8
#BSUB -R span[hosts=1]
#BSUB -R select[qc]
Ask for the necessary amount of time, plus a reasonable buffer.
Specifying the maximum wall clock time for the chosen queue will result in longer queue waits, not only for the submitter but for other users as well. LSF must reserve the requested number of nodes for the full specified time, so jobs that might otherwise have run in the gaps between other scheduled jobs are forced to wait.
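Once a sample run has produced a measured run time in seconds (for example, the Run time line of an LSF output file), the request can be derived from it rather than guessed. The sketch below is one possible approach; the function name wall_limit and the 20 percent buffer are hypothetical choices, and the result is printed in the hours:minutes form that -W accepts.

```shell
#!/bin/bash
# wall_limit RUN_SECONDS BUFFER_PERCENT
# Print a wall clock limit as hours:minutes, rounded up to a whole minute.
# Both arguments must be whole numbers.
wall_limit() {
    local run="$1" pct="$2"
    local scaled=$(( run * (100 + pct) ))          # seconds x 100, buffer included
    local minutes=$(( (scaled + 5999) / 6000 ))    # divide by 60 x 100, rounding up
    printf '%d:%02d\n' $(( minutes / 60 )) $(( minutes % 60 ))
}

# A 52134 second sample run with a 20 percent buffer.
wall_limit 52134 20    # prints 17:23
```

The printed value could then be used as #BSUB -W 17:23 in the batch script.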
Do a sample run and examine the LSF output file.
Resource usage summary:
    CPU time :                209542.33 sec.   (1)
    Max Memory :              15725.28 MB      (2)
    Average Memory :          10982.08 MB
    Total Requested Memory :  -
    Delta Memory :            -
    Max Swap :                17773 MB
    Max Processes :           4
    Max Threads :             38
    Run time :                52134 sec.       (3)
    Turnaround time :         52125 sec.
1) CPU time should usually be close to the elapsed wall clock time multiplied by the number of cores. (If the number does not seem to reflect this, it is possible that there were performance issues or other problems.)
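The rule of thumb in note 1 can be checked with a quick calculation. The sketch below divides the reported CPU time by the run time multiplied by the core count; a result near 1 is consistent with the rule. The function name cpu_ratio is hypothetical, and the core count of 4 is an assumption for illustration, since the summary does not state how many cores the job requested.

```shell
#!/bin/bash
# cpu_ratio CPU_SECONDS RUN_SECONDS CORES
# Ratio of reported CPU time to (run time x cores).
cpu_ratio() {
    awk -v cpu="$1" -v run="$2" -v n="$3" \
        'BEGIN { printf "%.2f\n", cpu / (run * n) }'
}

# CPU time and run time from the summary above; 4 cores is an assumed value.
cpu_ratio 209542.33 52134 4    # prints 1.00
```

A ratio far below 1 would suggest that some of the requested cores sat idle for much of the run.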
This page is in development. To suggest a topic or request a clarification, send email.