Find out the properties of queues, compute nodes, and GPUs.

  • Why is my job still pending?
  • How does LSF determine job priority?
  • How many jobs are running in a particular queue?
  • Which resources have a GPU, and are there any GPUs available right now?
  • Someone is using the P100. How long until her job is finished?
  • How many nodes are there that have ... [M2070 GPUs, AVX2 instructions, dual quad-core nodes, etc.]
  • Using bjobs, I find that EXEC_HOST is bc2e4. What does that mean?
  • For scaling tests, I need to use the same piece of hardware. How do I specify this?
  • Why do I get drastically different run times for the same run script?
  • What kind of hardware does my advisor's partner queue have?
  • Which queues do I have access to?
  • What are the wall clock limits for the debug queue? The single_chassis queue?
  • How can I find out the maximum RAM I can ask for each queue?

    Why is my job still pending?

    Find more information about a job by using bjobs -l. The output includes a list of reasons the job is pending, and it may include an estimate of when the job will start, e.g., Job will start no sooner than indicated time stamp. The requested resources may simply be in use at the moment, but it is also possible that they do not exist on the system at all. For example, LSF will not give an error message upon requesting a 64 core node with 500 GB of memory; it will simply wait until such a node is installed, leaving the job pending forever.
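    For example, to see the pending reasons for a hypothetical job with ID 12345:
    bjobs -l 12345
    The PENDING REASONS section of that output explains why the job has not started.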


    How does LSF determine job priority?

    Job priority is determined by several factors including fair share priority, queue priority, and time of submission.
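    The priority of each queue is visible in the PRIO column of bqueues (a higher number means higher priority):
    bqueues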

    See further details on job priority.

    How many jobs are running in a particular queue?

    Search for jobs being run by all users and filter for those in that particular queue. For example, to check how many jobs are running in the gpu queue, use
    bjobs -u all | grep gpu
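    To get an actual count, filter by queue with -q (more precise than grep, which can also match job names) and by state with -r (running jobs only), then count the lines; subtract one for the header. A sketch:
    bjobs -u all -q gpu -r | wc -l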


    Which resources have a GPU, and are there any GPUs available right now?

    You can find which hosts have a GPU by using
    lshosts | grep gpu

    bqueues shows the total number of jobs in the queue (NJOBS), how many are actually running (RUN), and how many are pending (PEND). MAX is the maximum number of job slots available. For some queues, like gpu, MAX is not shown.
    bqueues -l gpu
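    To check whether the GPU hosts are busy right now, one approach (a sketch, combining the commands above) is to list all users' jobs on each GPU host with the -m option of bjobs:
    for h in $(lshosts | grep gpu | awk '{print $1}'); do
        echo "== $h =="
        bjobs -u all -m "$h"
    done
    A host for which bjobs finds no unfinished jobs should have its GPU free.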


    Someone is using the P100. How long until her job is finished?

    There are multiple steps to find this.

  • Find the hostname of the P100 (here, n3h39):
    lshosts | grep p100
  • Check whether someone is running a job on the P100:
    bjobs -u all | grep n3h39
  • Suppose the running job has JOBID 29400. To get more information on that job, use
    bjobs -l 29400
    RUNLIMIT is the maximum wall clock time allowed for this job, in minutes. The same bjobs command also reports the CPU time used so far, in seconds; it appears after Resource usage collected.
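    A rough upper bound on the remaining time is therefore the RUNLIMIT minus how long the job has already been running; the start time appears near the top of the bjobs -l output. To pull just the run limit out of the long report (a sketch):
    bjobs -l 29400 | grep -A 1 -i runlimit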

    How many nodes are there that have ... [M2070 GPUs, AVX2 instructions, dual quad-core nodes, etc.]

    The resources for each host are listed by lshosts. There is currently one P100 node:

    [unityID@login04 ~]$ lshosts | grep p100
    n3h39       LINUXRH E52650v4   1.0    24 262050M 32767M    Yes (gpu twc sse sse2 ssse3 sse4_1 sse4_2 avx avx2 p100)
    
    Here are some commands to find other resources:
    [unityID@login04 ~]$ lshosts | grep m2070
    [unityID@login04 ~]$ lshosts | grep avx2
    [unityID@login04 ~]$ lshosts | grep qc
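    Since the question asks for a count, pipe any of these through wc -l, e.g.,
    [unityID@login04 ~]$ lshosts | grep avx2 | wc -l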
    

    See LSF Resources for more information on specific resources.


    Using bjobs, I find that EXEC_HOST is bc2e4. What does that mean?

    EXEC_HOST is the host group the job is running on. To list the individual hosts in that group, use bmgroup:

    [unityID@login04 ~]$ bmgroup bc2e4
    GROUP_NAME    HOSTS                     
    bc2e4        n2e4-1 n2e4-2 n2e4-3 n2e4-4 n2e4-5 n2e4-6 n2e4-7 n2e4-8 n2e4-9 n2e4-10 n2e4-11 n2e4-12 n2e4-13 n2e4-14 
    
    To find out more about a specific host, e.g., n2e4-3, use
    [unityID@login04 ~]$ lshosts | grep n2e4-3
    n2e4-3      LINUXRH    E5405   1.0     8 16383M 32767M    Yes (qc sse sse2 ssse3 sse4_1)
    
    This shows that node n2e4-3 has processor model E5405 and 8 cores (two quad-core processors, the qc resource), has 16 GB of memory, does not support AVX instructions, and does not have InfiniBand (no ib resource).
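    To see the hardware of every host in the group at once, lshosts accepts a list of host names; a sketch that extracts the list from bmgroup (skipping its header line):
    lshosts $(bmgroup bc2e4 | awk 'NR==2 {$1=""; print}')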

    For scaling tests, I need to use the same piece of hardware. How do I specify this?

    If a node from the same group is needed, e.g., the same blade or the same rack on single_chassis, use the -m option with a host group name.
    #BSUB -m "group_name"
    Example:

    #BSUB -m "blade2a1" 
    

    If the exact same piece of hardware is needed, meaning the same actual node (host), use the -m option with a hostname.
    #BSUB -m "hostname"
    Example:

    #BSUB -m "n2e4-3"
    
    Note that pinning to a specific host may lead to very long queue wait times. Also verify that the requested host belongs to the resource pool or queue the job is submitted to.
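    Put together, a minimal job script for a pinned scaling run might look like the following sketch; the queue and host names are taken from this page's examples, and my_scaling_test is a hypothetical executable.

    #!/bin/bash
    #BSUB -q single_chassis    # queue (verify the host below belongs to it)
    #BSUB -n 8                 # number of cores
    #BSUB -W 60                # wall clock limit in minutes
    #BSUB -m "n2e4-3"          # pin the job to one specific host
    #BSUB -o scaling.%J.out    # stdout file (%J expands to the job ID)
    #BSUB -e scaling.%J.err    # stderr file
    ./my_scaling_test          # hypothetical executable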

    See LSF Resources for more information on specific resources.


    Why do I get drastically different run times for the same run script?

    If the resource type is not specified, the queuing system assigns the job wherever it fits. As a result, the same script may execute on different types of hardware: newer or older processors, more or fewer cores, and so on. For consistent run times, specify a particular resource, as in the example below.
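    For example, to restrict a job to hosts with the avx2 resource shown by lshosts, add a resource requirement to the job script:
    #BSUB -R "select[avx2]"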

    See LSF Resources for more information on specific resources.


    What kind of hardware does my advisor's partner queue have?

    Suppose the partner queue is named monkey. Run bqueues -l monkey:

    [unityID@login04 ~]$ bqueues -l monkey
    QUEUE: monkey 
    -- partner queue
    
    PARAMETERS/STATISTICS
    PRIO NICE STATUS          MAX JL/U JL/P JL/H NJOBS  PEND   RUN SSUSP USUSP  RSV PJOBS 
    100   10  Open:Active     264    -    -    -     0     0     0     0     0    0     0
    
    HOSTS:  monkey_ib+10 interconnect_ib+8 blade2h2+4 
    
    This shows that there are 264 cores (MAX) available on the partner queue monkey, and that the queue has access to the monkey_ib, interconnect_ib, and blade2h2 host groups. To find the hardware in one of them,
    [unityID@login04 ~]$ bmgroup monkey_ib 
    GROUP_NAME    HOSTS                     
    monkey_ib  n2g3-2 n2g3-3 n2g3-4 n2g3-5 n2g3-6 n2g3-7 n2g3-8 n2g3-9 n2g3-10 n2g3-11 n2g3-1 
    
    To get more specific hardware info,
    [unityID@login04 ~]$ lshosts n2g3-2
    HOST_NAME      type    model  cpuf ncpus maxmem maxswp server RESOURCES
    n2g3-2      LINUXRH E52650v4   1.0    24 130237M 32767M    Yes (twc sse sse2 ssse3 sse4_1 sse4_2 avx avx2 ib)
    
    This shows that the monkey_ib group consists of eleven 24-core nodes that support instructions up to AVX2 and have InfiniBand (ib).
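    As a check, summing the ncpus column of lshosts over the group reproduces the MAX of 264 seen above (a sketch; eleven nodes times 24 cores is 264):
    lshosts $(bmgroup monkey_ib | awk 'NR==2 {$1=""; print}') | awk 'NR>1 {sum += $5} END {print sum " cores"}'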


    Which queues do I have access to?

    If the queue is not specified in the job script, LSF will attempt to choose the most appropriate queue. To find the queues that a user has access to, use bqueues -u followed by the login name (Unity ID).
    bqueues -u unityID


    What are the wall clock limits for the debug queue? The single_chassis queue?

    Use bqueues -l:

    [unityID@login04 ~]$ bqueues -l debug
    MAXIMUM LIMITS:
    RUNLIMIT                
    100.0 min of servlsf
    
    [unityID@login04 ~]$ bqueues -l single_chassis
    MAXIMUM LIMITS:
    RUNLIMIT                
    5760.0 min of servlsf
    
    As of this writing, the limit for the debug queue was 1 hour 40 minutes and the limit for the single_chassis queue was 4 days (5760 minutes). Queue limits are subject to change without notice.
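    Since bqueues -l prints a long report, the run limit can be pulled out directly (a sketch):
    bqueues -l debug | grep -A 1 -i runlimit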

    How can I find out the maximum RAM I can ask for each queue?

    LSF reports an error if you ask for more processors or wall clock time than a queue allows, but if you request too much memory, the job is submitted and then never runs.
    Here is an example of how to explore this:

    [unityID@login03 ~]$ bqueues -l single_chassis
    HOSTS:  blade2a1+10 blade2b1+10 blade2b2+10 blade2b3+10 blade2c1+9 blade2c2+9 blade2c3+9 blade2d1+8
    blade2d2+8 blade2e1+8 blade2e2+8 blade2e4+8 blade2e5+8 blade2g1+7 blade2g2+7 blade2f1+5 blade2j1+5 blade3m3+5 
    
    [unityID@login03 ~]$ bmgroup blade2a1
    GROUP_NAME    HOSTS                     
    blade2a1     n2a1-1 n2a1-2 n2a1-3 n2a1-4 n2a1-5 n2a1-6 n2a1-7 n2a1-10 n2a1-8 n2a1-11 n2a1-9 n2a1-12 n2a1-13 n2a1-14 
    
    [unityID@login03 ~]$ lshosts n2a1-1
    HOST_NAME      type    model  cpuf ncpus maxmem maxswp server RESOURCES
    n2a1-1      LINUXRH    E5645   1.0    12 49140M 32767M    Yes (hc sse sse2 ssse3 sse4_1 sse4_2)
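
    The maxmem column is the key: n2a1-1 has 49140 MB (about 48 GB), so a memory request larger than that will pend forever on blade2a1 hosts. Because the host groups in a queue can differ, here is a sketch that reports the memory of the first host in a few of the groups listed above:
    for g in blade2a1 blade2c1 blade2g1; do
        h=$(bmgroup "$g" | awk 'NR==2 {print $2}')
        lshosts "$h" | awk -v grp="$g" 'NR==2 {print grp ": " $6}'
    done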
    
