
Towards further performance improvements

There may be considerable inefficiency associated with achieving all parallelism through calls to the PLAPACK global level-3 BLAS. Sources of inefficiency include:

We now show how the left-looking level-3 BLAS based algorithm given above can be implemented at a lower level, essentially in-lining the call to PLA_Syrk in the implementation given in Figure 8.4.

Consider the code in Figure 8.5. At the top of the loop, we changed the partitioning of the matrix so that the diagonal block $A_{11}$ is guaranteed to exist on one node. We do so by taking the block size to be the minimum of the algorithmic blocking size and the split sizes at the top and at the left of the current matrix. Next, the update of

$$A_{11} \leftarrow A_{11} - A_{10} A_{10}^T$$

is explicitly exposed by the following additions to the code:

Before entering the loop we create objects to hold duplicated versions of $A_{10}$ and local contributions to $A_{11}$. By taking views into these objects, we can reuse them during the execution of the loop. Comparing the two left-looking level-3 BLAS based implementations, we notice that the second is clearly more complex, but that this complexity is still manageable.
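To make the idea of local contributions concrete, the following is a minimal serial sketch in plain C. It is not PLAPACK code. It assumes the in-lined update is a symmetric rank-k update of the form block := block - panel * panel^T, and it simulates the nodes by giving each one a contiguous slice of the panel's columns, which is a simplification of the real distribution. The names local_syrk_contribution, inlined_syrk_update, panel, block and work are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: add panel(:, cols) * panel(:, cols)^T to contrib,
   where panel is b x k, contrib is b x b, both stored row-major, and
   cols = [col_begin, col_end). Only the lower triangle is computed. */
void local_syrk_contribution(size_t b, size_t k, const double *panel,
                             size_t col_begin, size_t col_end,
                             double *contrib)
{
    assert(col_begin <= col_end && col_end <= k);
    for (size_t i = 0; i < b; ++i) {
        for (size_t j = 0; j <= i; ++j) {
            double sum = 0.0;
            for (size_t p = col_begin; p < col_end; ++p)
                sum += panel[i * k + p] * panel[j * k + p];
            contrib[i * b + j] += sum;
        }
    }
}

/* Simulated in-lined update on `nodes` nodes (an assumption of this sketch):
   each node forms its local contribution in the reusable workspace `work`
   (b x b), and the contributions are then subtracted from the lower
   triangle of `block`, which plays the role of the reduction step. */
void inlined_syrk_update(size_t b, size_t k, const double *panel,
                         size_t nodes, double *work, double *block)
{
    assert(nodes > 0);
    for (size_t node = 0; node < nodes; ++node) {
        size_t begin = node * k / nodes;       /* first column owned by node */
        size_t end = (node + 1) * k / nodes;   /* one past the last column   */

        /* The workspace is created once by the caller and reused here. */
        for (size_t i = 0; i < b * b; ++i)
            work[i] = 0.0;
        local_syrk_contribution(b, k, panel, begin, end, work);

        /* Fold this node's contribution into the block being updated. */
        for (size_t i = 0; i < b; ++i)
            for (size_t j = 0; j <= i; ++j)
                block[i * b + j] -= work[i * b + j];
    }
}
```

The caller allocates work once, before its own loop, which mirrors the reuse of objects through views described above. The strictly upper triangle of block is left untouched.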

This more complex implementation will not always outperform the first. Notice that the implementation that uses only the global BLAS does not limit the algorithmic blocking size by the distribution blocking size. Thus, it may be possible to achieve better load balance by taking the distribution blocking size to be much smaller than the algorithmic blocking size. However, we could further rewrite chol_left_blas3_alt to decouple the algorithmic and distribution blocking sizes explicitly within the code.
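The coupling between the two blocking sizes can be illustrated with a small stand-alone C sketch. It is not PLAPACK code: the functions split_size, coupled_block_size and count_iterations are hypothetical, and the sketch assumes a square matrix of order n whose rows and columns use the same distribution blocking size, so that the split sizes at the top and at the left coincide.

```c
#include <assert.h>
#include <stddef.h>

static size_t min_size(size_t a, size_t b) { return a < b ? a : b; }

/* Hypothetical stand-in for a split-size query: the number of rows (or
   columns) from `offset` up to the next distribution block boundary,
   capped by what remains of the matrix. */
size_t split_size(size_t n, size_t nb_distr, size_t offset)
{
    assert(nb_distr > 0 && offset <= n);
    size_t to_boundary = nb_distr - offset % nb_distr;
    return min_size(to_boundary, n - offset);
}

/* Block size when the diagonal block must reside on one node: the
   algorithmic blocking size is limited by the split size. */
size_t coupled_block_size(size_t n, size_t nb_alg, size_t nb_distr,
                          size_t offset)
{
    return min_size(nb_alg, split_size(n, nb_distr, offset));
}

/* Count the loop iterations needed to sweep the matrix, either with the
   coupled block size or with the algorithmic blocking size alone. */
size_t count_iterations(size_t n, size_t nb_alg, size_t nb_distr, int coupled)
{
    assert(nb_alg > 0);
    size_t offset = 0;
    size_t iterations = 0;
    while (offset < n) {
        size_t size = coupled
            ? coupled_block_size(n, nb_alg, nb_distr, offset)
            : min_size(nb_alg, n - offset);
        offset += size;
        ++iterations;
    }
    return iterations;
}
```

With n = 100, nb_alg = 32 and nb_distr = 8, the coupled variant advances 8 rows at a time and needs 13 iterations, while the variant driven by the algorithmic blocking size alone needs 4. This is the sense in which a small distribution blocking size, chosen for load balance, limits the effective algorithmic block size in the lower-level implementation.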



rvdg@cs.utexas.edu