Serial Profiling and Timing

  • Profiling and Timing A main reason to go to parallel processing is to make codes run faster. Serial speed-ups are a first step.

    The following is a laundry list of ways to improve performance in serial.

  • Profile and time to see which parts of the code are using time,
  • Try compiler optimization, How To Use
  • To search out to see if someone has done good optimized libraries, e.g. BLAS. BLAS libraries
  • Make sure we're looping in the right direction (row or column to make sure data called into cache is reused),
  • Try blocking for data locality,
  • Make sure that I/O is efficient,
  • Know what features of Fortran 90, C and C++, etc. are likely to slow code.

    The material in this web page focuses on how to time codes so that we know how long they take. Knowing which parts of codes take time is the first step to improving performance.
    A good on-line reference for timers and other performance tools is LLNL performance_tools tutorial

  • The Unix Time Command A straightforward timer is the unix shell command. If you've isolated a section of code which can run as a stand-alone program ./fooexec, then in the bash, csh, or tcsh shell, you can simply type

    >time ./fooexec

    to get screen output detailing how long the code took. For example, in a bash shell, I got the following

    >time ./fooexec
    real   0m0.121s
    user   0m0.000s
    sys    0m0.000s
    The real time here is wall clock time. The user and system time correspond to CPU time. In this particular instance, the code did very little, so that .121 seconds was essentially just the system overhead for calling the small executable from the harddrive into execution.

    Generally, if the wall clock time is consistently much longer than the total of the sys and user time, then the time for fetching data to and from hard drives may be taking a good deal of time. In parallel codes, the code may be spending a good deal of time waiting on communication between processors.

    For this code on a second run, the code was already resident in RAM

    >time ./fooexec
    real   0m0.007s
    user   0m0.000s
    sys    0m0.010s

    This timing is unusual in that user and sys (CPU) time should total less than the real or wall clock time. Under csh or tcsh shell the time command can give a good deal of other information about the code's run-time performance, e.g., how much memory it used. (See the man page).

  • In Line Timers
  • Wrapping the gettimeofday function to get a Fortran wall clock timer.
  • Using the MPI_Wtime wall clock timer

  • Wrapping the C clock function to get a Fortran CPU timer. Of course, for purposes of comparing codes or how long it takes for a subroutine to run, we can just call a timer. C and Fortran 90 have portable CPU timers. Fortran 77 does not.

    One way to get portability is to use the standard C clock function. Then from Fortran, use a wrapper to call it. Here's the wrapper. This timer usually works, but the gettimeofday function used above is typically of somewhat higher resolution, measuring "wall clock time" as opposed to "CPU time".

    #include < time.h>
    float ftime_()
    /* printf("Here we are:\n clock=%f12.8\n",(float)clock()); */ 
    {return (float)clock()/CLOCKS_PER_SEC;}

    In some instances, you may need to put an extra underscore after the ftime_ If you wanted to call this from C, you would take away the underscore. Similarly, to link to fortran codes compiled with the xlf compiler on the p690, remove the underscore. Compile it with

    >cc -c timd.c

    and then just include timd.o in the list of object files to be linked into your executable.

    The wrapped function is called as ftime. The Fortran or C code has calls to ftime() as follows:

    Having declared pretim and entim as real*4
    pretim = ftime()
    Code segment to be timed
    entim = ftime() - pretim

    Then entim is the elapsed CPU time for the "Code segment to be timed". In C the declation would be as "float" and semicolons would be required at the ends of lines.

    The clock function is portable in that as part of the C standard, it exists everywhere C does. A disadvantage with the C clock function or Fortran90 cpu_time is that the resolution is often pretty low. Frequently the smallest nonzero time is 1.e-2 or 1.e-3 seconds. So to time a bit of code you have to get it to repeat many times. Then the data stays in cache so the code runs artificially fast. Occasionally, the compiler figures out a loop is repeated with the same data and figures out it is unnecessary. Then times can get really short.

  • Wrapping the C wall clock timer gettimeofday to get a Fortran wall clock timer

    Another common C timing function appropriate for wall clock time is gettimeofday, which returns a struct of which the second component typically has a higher resolution than clock(). The following wrapper on gettimeofday allows a Fortran code to call a subroutine wtime.

    #include < sys/time.h >
    #include < stdlib.h > 
       This timer returns wall clock time.   
       The first entry is current time in seconds.
       The second entry is microseconds. 
    void wtime_(double tim[2], int *ierr2 )
       struct timeval time1; 
       long tim1, tim2; 
       int ierr; 
       double elap; 
       ierr = gettimeofday(&time1, NULL) ; 
       *ierr2 = ierr; 
       if (ierr != 0 ) printf("bad return of gettimeofday, ierr = %d \n",ierr);  
       tim[0] = time1.tv_sec;
       tim[1] = time1.tv_usec; 
    /*   printf("tim1 = %f \n",tim[0]);
       printf("tim2 = %f \n", tim[1]);  */ 

    The following sample Fortran routine calls wtime.
          program testtimers
    c  This program tests the calling of C timers       
    c    from fortran. 
          real*8 sum, tim(2), tim2(2), elapsed 
          integer i, ierr, ierr2
          i = 1
          sum = 0.d0 
          call wtime(tim, ierr)
          print*,' tim1 =', tim(1), tim(2) , ierr
          do while (sum .lt. 19)
            sum = sum + 1./i
            i = i+1
          end do
          call wtime(tim2, ierr2)
          tim(1) = tim2(1) - tim(1) 
          tim(2) = tim2(2) - tim(2) 
    c  It's important to do the differences separately as tim(1) and tim(2)
    c  are O(10^9) and are almost equal to one another.  So converting
    c  to a microsecond number before subtracting results in a coarse resolution. 
          elapsed = 1.e6*tim(1) + tim(2) 
          print*,' elapsed time is ', 1.e6*tim(1) + tim(2) ,' microseconds' 
          elapsed = elapsed/1.e6 
          print*,' elapsed time is ', elapsed,' seconds' 

    If the C code is timw.c and the fortran code is test.f, the compile and link phases could be performed by the following command lines.
    pgf77 -c test.f
    gcc -c timw.c
    pgf77 -o test test.o timw.o 

    For the pgf77 compiler, we could have used an intrinsic timer, but that timer would not be portable for example to g77 compiled code.

    This timer (gettimeofday) typically has a higher resolution than the clock function. For example, in Linux, Solair and AIX, the resolution is microseconds. Again to compile the code with the xlf compiler on the p690, delete the _ after the wtime.

  • Using the MPI_Wtime wall clock timer

    For wall clock time I often prefer just to use the MPI_Wtime() function. If you have an MPI library available, just link to it, then between MPI_INIT() and MPI_Finalize() calls you can even though your code is really serial, use the MPI_Wtime() wall clock function. It usually returns an answer with a resolution in microseconds. The following code was used above and is modified here to instead use the MPI_Wtime call.

    program testtimers c c This program tests the calling of C timers c from fortran. c include 'mpif.h' real*8 sum, startim, entim, elapsed integer i, ierr call MPI_Init(ierr) i = 1 sum = 0.d0 startim = MPI_Wtime() do while (sum .lt. 19) sum = sum + 1./i i = i+1 end do entim = MPI_Wtime() - startim print*,' elapsed time is ', entim, ' seconds' stop end Usually, the resolution of MPI_Wtime() is in microseconds.

  • Profilers Running a code under a profiler produces a table that details how long the code spent in various subroutines. Typically, the profiler is turned on by compiling the code with the -p flag. Then when the code is executed, a log file is produced. This can have various names, by

    >ls -lrt

    will list files in reverse order of time, so the newest files will appear last. So if you've just run a profiled code you can easily find the log file. Each line of the file will correspond to a subroutine, and will tell you how often the subroutine was called and the total time spent in that subroutine. Generally, if we want to speed a code, knowing where the code spends most of its time shows us where to concentrate. Or if a subroutine is called "oodles" of times maybe it should be inlined.

  • Example with the PGI Fortran compiler and profiler

    Here's an example run. It was compiled with the pgf77 compiler using the -p flag. What do we see? After the executable ran, the file "pgprof.out" was produced. It follows.

    PROF NODALL 0 a.out 1093292124 0
    h blade1-13 23023 0 1
    t 1 7
    p 0
    f zge062704_1.f
    r zgebrd110 1 238 1 49.3378 23.3085
    r zgebd3 1 702 1 7.52057e-05 6.07718e-05
    r zlabr2 1 1342 199 26.0292 1.0356
    r zgeupm 1 2064 1393 22.0503 0.918796
    r zgemver 1 2342 207 2.94329 2.94329
    r zgemvt2 1 3113 1393 21.1315 21.1315
    f zrivbrd2.f
    r zrivbrd 1 1 1 109.268 59.9301
    The driver is zrivbrd (contained in the fortran file zrivbrd2.f). According to the logfile, it takes 59 seconds. This is wallclock time, i.e., elapsed time for the entire code exectution. Wallclock time may include time for I/O and also time that the CPU has spent running other user's code. Thus it is important to profile where possible on computational nodes. There we are guarantted that no other user is using the processor. In contrast, on the head node, or in using the p690, there may be a good deal of contention for user resources, resulting in inconsistent runs.

    In this instance (from looking at the code), the driver calls some library routines from LAPACK, which has been linked to but not compiled with the -p flag. Since the LAPACK codes were not compiled with the profiler, the times spent in the library routines are attributed to the driver.

    Another time consuming routine is zgebrd110, which required 23.3 seconds most of which was actually spent in running the BLAS routine dgemm (matrix matrix multiply). The next longest time 21.1 secs, which calls the BLAS matrix vector multiply dgemv. zgemvt2 was called 1393 times, (called only from zgeupm also called 1393 times). The routine zgemver was called 2342 times and required 2.94 seconds (it calls BLAS dgemv and also dger which are rank one updates). For a prettier display of the results, I typed
    >pgprof pgprof.out
    which gave a GUI which gave some better explanations. For instance the 238 is the line of the file zge062704_1.f on which the routine zgebrd110 starts. It also give another informative column of times, which is how long a code spends in a routine and its subroutines. For example zrivbrd and its subroutines required 198.268 seconds.

    These results (blaming all the time on the BLAS calls) seems to indicate that we should make sure we have a good BLAS library. (The results of that experiment will appear below. For a tutorial on the PGPROF profiler, see PGPROF

    The pgprof profiler also has parallel capabilities.

  • gprof, Lint, ftnchek, etc. I happen to know the code we're examining pretty well, as I wrote it. But for someone coming in cold, the PGI log file would be a bit confusing. It doesn't tell us for example, which code calls which. Generally, when looking to modify a code you find yourself constructing a call table.

    One way of course, is by "brute force". Brute force is plausible with some Unix commands. The one indispensable "track it down" command is grep. So for example to search all the .f files in a directory for the ones that contain the characters ZLABR2(,

    >grep ZLABR2( *.f
    Or to search a bunch of .a archive files to find which one has the>GNU elusive function "foobar"

    >nm -A *.a | grep foobar

    where the | pipes the output of "nm -r *.a" which is a massive amount of symbols to grep. Grep throws away all the lines of the output except for those that contain "foobar". The -r flag made sure that the name of the .a file is included on each line of the nm generated symbol table.

    Then by tracking down all the subroutines, you'll eventually construct a tree showing which ones call which. Or you could have just used the -pg compile option. Most of the traditional computer vendors such as IBM, Digital, HP, have their own utitilies which will construct a call table for you. For the open source environments (gcc, g77, etc.), provided you have compiled with the -pg flag. There is also well-known public domain program GNU gprof (by Jay Fenlason), which works with C, Fortran, and Pascal. I hope to give you a longer demo, but "info gprof" would get you going. See also the web page GNU gprof or Class notes from Rice University.

    Call tables are also constructed by programs such as lint for C and ftnchek for Fortran 77. These are public domain, but some supported licensed programs can be purchased. Lint and ftnchek also give a good deal of info about possible programming problems such as mis-typed argument lists, non-portable language constructs, etc.

    One problem with Fortran 90 is that the standard open source tools such as lint, ftnchek and gprof don't yet work with it. So not only do you have to buy a Fortran 90 compiler, but then you have to purchase these as development tools. Fortunately, we have the pgi flavors here. Also on the p690, we have the IBM tools.

  • Speed-ups by BLAS, Compiler Optimization, Intel Compiler Initially, we had an LAPACK time (according to clock) of 52 seconds to bidiagonalize a 1600 by 1600 matrix, corresponding to about 206 Mflops. Where the trial routine requires only 48 seconds. 206 Mflops is not as horrible as it sounds as these are double precsion complex flops, where an add corresponds to two double precsion adds, and a multiply to 4 double precision multiplies. Assuming an equal number of adds and mults, there are 3 double precsion flops to a complex flop, so we have a speed of 620 Mflops. Still the peak speed of the processor is around 6 Gflops.

    Getting rid of the profiling and using the -O4 compiler option made little difference in the times.

    Swapping the PGI supplied BLAS for the ATLAS BLAS dropped the LAPACK time down to 25 seconds, i.e., 430 Mflops of complex, equivalently 1320 Mflops for double. Theory: using the Goto BLAS with Intel compiler and flags will push the rate above 500 Mflops. For directions on linking to ATLAS, Intel, and Goto BLAS libraries on the Blade Center, see BLAS libraries

    But the alternative bidiagonalization version was reduced in time only from 48 seconds to 34 seconds. A first problem was that the block size had been set to 3 for purposes of debugging. Returning the block size to the usual 16, the time went to about one half second more than the current LAPACK routine. Looking carefully at the profiler data, it turned out that the average zgemver calls took longer than the average call to zgemvt. I then realized that most of the zgemver calls (one per call to zlabr2) were actually zgemvt calls with a "zero" update. Fixing these calls so that they do not call a matrix update routine avoided one write of the matrix per call to zlabr2. And reduced the time of execution to .7 seconds less than the current LAPACK call. Some remaining functions that take more than one second are the call to zlabr2 and the call to zgeupm. remaining investigation is whether the conjugation (not expressed as a BLAS call) could be responsible. Perhaps some of the vector operations for which conjugation is done an element at a time need some optimization?

  • Hardware Counters--Papi

    In quantum mechanics or anthropology, observations influence the observed. So using profilers or timers entails using "interrupts" distorting program execution. A less intrusive approach to observing the code in the wild is to use a hard ware counter. It samples: Every so many steps -- say 1000 clock cycles -- it polls, asking what command is being issued on this clock cycle? Then after program execution is complete, you can look and see what proportion of the time was spent doing what.

    These timers are typically developed as part of the chip design process, and may or may not be available to the public. For example, the Digital Alpha chip had a very nice counter, but alas the sys admin never wanted to leave it on. This is because it would have some "drag" effect on the system.

    Dongarra, et. al, have proposed a portable counter PAPI. It runs on most processors, and is public domain. Performance Application Programming Interface

    Why? It's interesting to compare and see what bits of code can sustain the most flops per instruction.

  • Looking at the Assembler Code Most compilers have an option to let you look at the assembler code. Look in the compiler man page. Mostly commonly, it's a -S (as opposed to -c) option. If you've determined that some bit of code could be a problem area then you can look at the assembler code and try various coding options to try to decrease the number of load and store instructions, etc. Personally, I don't actually write the assembler, just rewrite the Fortran or C. It's also interesting to see what the compiler optimizer has done to your code.

  • Conclusions

    We've seen how to time and profile. In an example, we got a big speedup (factor of two) by changing to a different BLAS library. More generally, if we found a part of the code that took a significant amount of time and did not have a corresponding optimized library call, we could try some of the techniques from the next lecture to optimize it ourselves. For example, 10% or so of the time in the example profiled code was not in BLAS routines, so I may have to try to optimize that code by hand. Last modified: January 05 2007 17:12:58.