EPD with MKL support

The upcoming EPD 6.1, scheduled for release on March 2nd, 2010, links NumPy and SciPy dynamically against MKL (Intel Math Kernel Library) on all platforms where MKL is available, namely Windows, OSX and Linux.

Performance statistics

The following results were obtained by running a benchmark program. All tables show the execution time (in seconds) of different functions in numpy.linalg, using an N x N random matrix as input and selecting the fastest of three runs. We have also included timings for the NumPy included in older EPD versions, which was linked against the ATLAS library on Windows and Linux, and against the Accelerate framework on OSX.
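The benchmark program itself is not reproduced here. As a rough illustration of the method described above (fastest of three runs on an N x N random matrix), a minimal sketch could look like the following. The helper names best_time and run_benchmark, the default sizes, and the use of uniform random entries are assumptions made for this example, not details of the original program.

```python
import time


def best_time(func, arg, repeats=3):
    """Return the fastest wall-clock time (in seconds) of `repeats` calls to func(arg)."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func(arg)
        timings.append(time.perf_counter() - start)
    return min(timings)


def run_benchmark(sizes=(200, 500)):
    """Time several numpy.linalg functions on N x N random matrices.

    Returns a dict mapping (function name, N) to the best time in seconds.
    """
    # Imported here so that best_time() stays usable without NumPy installed.
    import numpy as np

    names = ["det", "eig", "eigh", "eigvals", "eigvalsh", "inv", "svd"]
    results = {}
    for n in sizes:
        # Assumption: uniform random entries; the original input distribution is not stated.
        a = np.random.rand(n, n)
        for name in names:
            # eigh and eigvalsh read only one triangle of the matrix,
            # so a non-symmetric input is still accepted.
            results[(name, n)] = best_time(getattr(np.linalg, name), a)
    return results
```

Calling run_benchmark() once per thread setting would give numbers comparable in form to the tables below, although absolute timings depend on the machine and on the libraries NumPy is linked against.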

Windows 7, 2.4 GHz Intel Core 2 Duo, 3GB Memory, 32-bit EPD

The ATLAS benchmark results were obtained by running the benchmark program on an EPD 5.1.1 install.

func  threads    500        1000        1500        2000
========================================================
det
  ATLAS        0.046       0.327       1.061       2.464
  MKL  1       0.015       0.125       0.390       0.842
       2       0.011       0.078       0.234       0.546
eig
  ATLAS        3.105      24.772      83.522     195.780
  MKL  1       0.764       4.960      15.756      35.708
       2       0.592       4.056      12.979      28.813
eigh
  ATLAS        0.530       3.588      11.716      26.972
  MKL  1       0.172       1.076       3.510       8.096
       2       0.109       0.671       2.246       5.350
eigvals
  ATLAS        1.185       9.484      31.574      74.240
  MKL  1       0.436       2.543       7.426      15.616
       2       0.405       2.184       6.068      12.714
eigvalsh
  ATLAS        0.437       3.510      12.464      29.983
  MKL  1       0.062       0.405       1.653       3.869
       2       0.032       0.265       1.092       3.026
inv
  ATLAS        0.187       1.341       4.306       9.937
  MKL  1       0.062       0.452       1.435       3.230
       2       0.031       0.280       0.842       1.887
svd
  ATLAS        1.045       7.442      24.024      55.427
  MKL  1       0.374       2.683       9.220      21.263
       2       0.265       1.779       6.428      15.740
  

MacOS 10.5.6, 2.33 GHz Intel Core 2 Duo, 3GB Memory, 32-bit EPD

The Accelerate framework always uses the maximum number of threads, i.e. 2 in the table below.

func   threads     200         500        1000        1500        2000
======================================================================
det
  Accelerate  0.003881    0.031205    0.159673    0.402085    0.825475
  MKL  1      0.002485    0.026606    0.156718    0.460423    0.996469
       2      0.002082    0.020490    0.111589    0.322130    0.685631
eig
  Accelerate  0.151646    2.034871   16.147479   54.945526  129.610227
  MKL  1      0.100915    0.791869    5.536657   17.653579   40.505470
       2      0.089906    0.684320    4.738974   15.291821   34.538202
eigh
  Accelerate  0.036075    0.278806    1.579896    4.928605   11.026954
  MKL  1      0.021456    0.193353    1.160592    3.860195    9.211294
       2      0.015369    0.133706    0.801191    2.635106    6.269004
eigvals
  Accelerate  0.066050    0.755720    6.156935   20.873578   51.116664
  MKL  1      0.062604    0.441933    2.737576    7.781239   17.362116
       2      0.058361    0.403419    2.445986    6.893551   15.415134
eigvalsh
  Accelerate  0.015713    0.170914    1.295294    5.317904   13.737133
  MKL  1      0.007585    0.071506    0.450339    1.801380    4.544002
       2      0.006159    0.051889    0.315961    1.321673    3.666880
inv
  Accelerate  0.011015    0.087874    0.457264    1.256251    2.642413
  MKL  1      0.008459    0.090927    0.573598    1.673382    3.686808
       2      0.006394    0.067684    0.369199    1.072284    2.349210
svd
  Accelerate  0.063233    0.481699    3.072801    9.891473   22.554218
  MKL  1      0.042538    0.409626    3.039787   10.266544   24.344418
       2      0.036743    0.304195    2.121625    7.680251   18.703893
  

Linux, 2.40 GHz Intel Core 2 Quad CPU, 8GB Memory, 64-bit EPD

The ATLAS routines always use one thread only.

func   threads     200         500        1000        1500        2000
======================================================================
det
  ATLAS       0.001978    0.019611    0.119601    0.371773    0.805141
  MKL  1      0.001423    0.015214    0.100316    0.319583    0.700174
       2      0.001041    0.010443    0.058342    0.181791    0.395840
       3      0.001087    0.008494    0.044751    0.137921    0.302166
       4      0.001120    0.007905    0.041605    0.120641    0.255480
eig
  ATLAS       0.146478    2.427373   21.558434   73.906174  186.378927
  MKL  1      0.080537    0.632089    4.160941   13.015139   29.167038
       2      0.087990    0.583108    3.597658   10.953161   24.393856
       3      0.088137    0.542243    3.596853   11.027849   24.375258
       4      0.087050    0.538182    3.282770   10.007089   22.272444
eigh
  ATLAS       0.028070    0.214199    1.306518    4.253866    9.625678
  MKL  1      0.017454    0.150351    0.904984    2.973798    6.900463
       2      0.012303    0.099450    0.536283    1.604131    3.838004
       3      0.011904    0.088765    0.464113    1.436107    3.457642
       4      0.011014    0.078823    0.395830    1.198582    2.961812
eigvals
  ATLAS       0.084425    0.969582    9.271397   35.484099   86.982823
  MKL  1      0.055514    0.382756    2.278501    6.140537   13.106408
       2      0.057068    0.380757    2.003261    5.291290   11.005101
       3      0.056770    0.373002    2.067459    5.336213   11.115501
       4      0.060181    0.365830    1.923270    5.063103   10.328758
eigvalsh
  ATLAS       0.009983    0.105646    0.785348    3.212356    8.172313
  MKL  1      0.006320    0.052221    0.330132    1.301236    3.204751
       2      0.005251    0.040802    0.214501    0.669560    1.795785
       3      0.004931    0.036338    0.177316    0.600432    1.735426
       4      0.005339    0.034990    0.164250    0.522566    1.568605
inv
  ATLAS       0.006228    0.066323    0.423782    1.392812    3.053785
  MKL  1      0.003913    0.050813    0.355910    1.186602    2.715534
       2      0.002944    0.031845    0.198924    0.663284    1.486224
       3      0.002529    0.024905    0.153636    0.513211    1.114543
       4      0.002375    0.021784    0.131506    0.437813    0.938628
svd
  ATLAS       0.047632    0.383549    2.719373    8.960831   20.414181
  MKL  1      0.029945    0.283172    2.192700    7.574861   17.597544
       2      0.028022    0.220260    1.305435    4.674676   11.568503
       3      0.027393    0.207454    1.244003    4.425071   11.162057
       4      0.027328    0.199932    1.073903    3.985288   10.409413
  

Linux, 2 x 1.8 GHz Dual-Core AMD Opteron, 4GB Memory, 64-bit EPD

Note that these results were obtained on an AMD processor. The ATLAS routines always use one thread only.

func  threads      200         500        1000        1500        2000
======================================================================
det
  ATLAS       0.004542    0.052334    0.344503    1.095520    2.448515
  MKL  1      0.003537    0.042843    0.291940    0.933989    2.084115
       2      0.002697    0.026416    0.166778    0.531760    1.181758
       3      0.002251    0.019843    0.123858    0.386968    0.842950
       4      0.002169    0.019846    0.125505    0.386224    0.840673
eig
  ATLAS       0.259763    4.393425   37.621182  135.050718  323.719892
  MKL  1      0.170918    1.779667   11.314629   33.659668   75.974679
       2      0.167206    1.473480    9.096502   26.676907   60.188044
       3      0.158445    1.386303    8.309512   24.419109   54.721468
       4      0.168030    1.354281    8.414269   24.262779   54.756768
eigh
  ATLAS       0.055759    0.478036    3.157796    9.529790   21.862535
  MKL  1      0.031404    0.339818    2.456596    7.765839   18.292306
       2      0.020818    0.204601    1.437134    4.721491   10.401662
       3      0.018524    0.165718    1.108386    3.743436    8.138211
       4      0.018394    0.163665    1.093884    3.839852    8.619979
eigvals
  ATLAS       0.122135    1.721329   15.441173   56.874484  139.097476
  MKL  1      0.100709    0.905109    5.638054   15.726575   34.627801
       2      0.093802    0.836252    4.697184   12.484315   26.458316
       3      0.094680    0.792386    4.620656   11.530036   24.523773
       4      0.094250    0.812030    4.414533   11.633799   24.092832
eigvalsh
  ATLAS       0.016672    0.248108    1.844125    6.192654   14.587824
  MKL  1      0.010020    0.120006    1.097865    3.446052    8.606070
       2      0.007688    0.078474    0.658583    2.407154    5.103678
       3      0.007492    0.065678    0.496656    2.129318    4.908329
       4      0.007383    0.066361    0.491305    1.814141    4.905298
inv
  ATLAS       0.016100    0.180923    1.267138    4.280442    9.528451
  MKL  1      0.012346    0.158630    1.106797    3.631172    8.235542
       2      0.007875    0.090208    0.599726    2.020295    4.473653
       3      0.006477    0.067686    0.448351    1.451953    3.236595
       4      0.006439    0.068643    0.452317    1.441642    3.255666
svd
  ATLAS       0.103712    0.930641    6.289478   19.882772   46.678009
  MKL  1      0.060373    0.779341    6.264379   19.597140   46.323370
       2      0.049049    0.467218    3.705661   11.203994   26.130302
       3      0.045428    0.371707    2.980605    8.911639   21.034134
       4      0.045731    0.377631    2.938632    9.917269   20.900861
  

Optimization example: eig function

Below we show the speed-up over ATLAS (Linux, Windows) and Accelerate (OSX) offered by MKL. All data comes from the eig benchmark results above.

[Figures: MKL on Linux, MKL on Windows, MKL on OSX]
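The speed-up can also be computed directly from the tables as the baseline time divided by the MKL time. The sketch below does this for eig at N = 2000, using timings copied from the tables above; the dictionary layout and the names EIG_2000, speedup and SPEEDUPS are chosen for this example only.

```python
# eig timings in seconds for N = 2000, copied from the benchmark tables.
# Each entry: (baseline library, MKL with 1 thread, MKL with the most threads measured).
EIG_2000 = {
    "Windows (ATLAS)": (195.780, 35.708, 28.813),
    "OSX (Accelerate)": (129.610227, 40.505470, 34.538202),
    "Linux, Intel (ATLAS)": (186.378927, 29.167038, 22.272444),
    "Linux, AMD (ATLAS)": (323.719892, 75.974679, 54.756768),
}


def speedup(baseline_seconds, mkl_seconds):
    """Speed-up factor of MKL over the baseline library (greater than 1 means MKL is faster)."""
    return baseline_seconds / mkl_seconds


# Map each platform to (speed-up with 1 MKL thread, speed-up with the most MKL threads).
SPEEDUPS = {
    platform: (speedup(base, mkl_one), speedup(base, mkl_max))
    for platform, (base, mkl_one, mkl_max) in EIG_2000.items()
}

for platform, (one_thread, max_threads) in SPEEDUPS.items():
    print(f"{platform:22s} 1 thread: {one_thread:4.1f}x   max threads: {max_threads:4.1f}x")
```

With a single MKL thread this gives a factor of roughly 5.5 on Windows, 3.2 on OSX and 6.4 on the Intel Linux machine for eig at N = 2000.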

MKL service interface

Along with the MKL runtime libraries, the MKL package in EPD also contains a small module which exposes some of the MKL service functions. The main reason we have added this interface module is that it allows setting the number of computational threads in MKL. This is done as follows:

>>> import mkl
>>> mkl.get_max_threads()
2
>>> mkl.set_num_threads(1)
>>> mkl.get_max_threads()
1
  

The mkl interface module currently contains the following functions: get_cpu_clocks, get_cpu_frequency, get_max_threads, get_version_string, set_num_threads, thread_free_buffers. These functions call the corresponding MKL service functions, which are declared in mkl_service.h, e.g. the function mkl.get_version_string calls mkl_get_version_string. For more details, see the function docstrings, as well as the MKL documentation.
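For benchmarks such as the ones above, it can be convenient to change the thread count only for a limited stretch of code and restore it afterwards. The following is a minimal sketch built on get_max_threads and set_num_threads; the limited_threads helper is hypothetical and not part of the mkl module, and it assumes (as in the session above) that get_max_threads reflects the value last passed to set_num_threads.

```python
from contextlib import contextmanager


@contextmanager
def limited_threads(mkl_module, num_threads):
    """Temporarily set the MKL thread count, restoring the previous value on exit.

    `mkl_module` is any object providing get_max_threads() and
    set_num_threads(n), such as the mkl interface module described above.
    """
    previous = mkl_module.get_max_threads()
    mkl_module.set_num_threads(num_threads)
    try:
        yield
    finally:
        # Restore the earlier setting even if the body raised an exception.
        mkl_module.set_num_threads(previous)


try:
    import mkl  # available only when an MKL interface module is installed
except ImportError:
    mkl = None

if mkl is not None:
    with limited_threads(mkl, 1):
        # Code in this block runs with a single MKL thread.
        print("threads inside block:", mkl.get_max_threads())
    print("threads after block:", mkl.get_max_threads())
```

Passing the module as an argument keeps the helper independent of whether mkl can be imported, which also makes it easy to try out with a stand-in object.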