Next: Performance results Up: Example: Parallelizing Matrix-Matrix Multiplication Previous: Forming

## A more general approach

Let us further consider the matrix-matrix multiply. We claim that for each of the four cases ($C = AB$, $C = A^T B$, $C = A B^T$, and $C = A^T B^T$), there are actually eight different cases to be considered. The different cases correspond to each of the matrix dimensions $m$, $n$, and $k$ being either large or small. Since there are three dimensions and two possibilities for each dimension, there are $2^3 = 8$ cases. These different cases are tabulated in the two tables (Parts I and II) at the end of this section. In these tables, the diagrams indicating matrix shapes illustrate the layout of data in its non-transposed format; transposition is indicated by the superscript $^T$. We will discuss general approaches to each of the different cases in the rest of this section.

Let us reconsider the formation of the product; for concreteness, take the non-transposed case $C = AB$. Notice that if we partition a given matrix $X$ by panels of columns and by panels of rows,

$$
X = \left( X_1 \mid X_2 \mid \cdots \mid X_N \right)
\quad \mbox{and} \quad
X = \left( \begin{array}{c} \check{X}_1 \\ \check{X}_2 \\ \vdots \\ \check{X}_M \end{array} \right) ,
$$

where $X_j$ denotes the $j$th panel of columns of $X$ and $\check{X}_i$ the $i$th panel of rows, then $C$ can be computed by any of the following formulae:

$$
C = \sum_{p} A_p \check{B}_p ,
\qquad
C_j = A B_j \ \ (j = 1, \ldots, N) ,
\qquad
\check{C}_i = \check{A}_i B \ \ (i = 1, \ldots, M) ,
$$

which suggests that besides the panel-panel variant for forming $C$, there are also matrix-panel and panel-matrix variants. Similarly, there are panel-panel, matrix-panel, and panel-matrix variants for each of the three other matrix-matrix multiplies ($C = A^T B$, $C = A B^T$, and $C = A^T B^T$).
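As a serial illustration (ours, not part of the original text, and ignoring the parallel distribution entirely), the panel-panel, matrix-panel, and panel-matrix variants can be sketched in NumPy; the function names and panel widths below are arbitrary choices for the sketch:

```python
import numpy as np

def gemm_panel_panel(A, B, nb=2):
    # C = sum_p A_p * B_p: accumulate products of column panels
    # of A with the matching row panels of B (rank-nb updates).
    m, k = A.shape
    _, n = B.shape
    C = np.zeros((m, n))
    for p in range(0, k, nb):
        C += A[:, p:p+nb] @ B[p:p+nb, :]
    return C

def gemm_matrix_panel(A, B, nb=2):
    # C_j = A * B_j: form C one panel of columns at a time.
    m, _ = A.shape
    _, n = B.shape
    C = np.empty((m, n))
    for j in range(0, n, nb):
        C[:, j:j+nb] = A @ B[:, j:j+nb]
    return C

def gemm_panel_matrix(A, B, mb=2):
    # C_i = A_i * B: form C one panel of rows at a time.
    m, _ = A.shape
    _, n = B.shape
    C = np.empty((m, n))
    for i in range(0, m, mb):
        C[i:i+mb, :] = A[i:i+mb, :] @ B
    return C

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 7))
B = rng.standard_normal((7, 3))
for f in (gemm_panel_panel, gemm_matrix_panel, gemm_panel_matrix):
    assert np.allclose(f(A, B), A @ B)
```

All three variants compute the same product; in the parallel setting they differ in which operand stays in place and which panels must be communicated.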

### m large, n large, k large:

The choice of variants presented in the previous subsections is optimal for this case. The primary reason is that good load balance can be obtained, since all nodes have a reasonable portion of the individual panel-panel, matrix-panel, or panel-matrix operation to perform. A secondary reason is that, except for one of the cases, all communication can be performed using only broadcasts and reduction-to-one within one or the other dimension of the mesh of nodes. Thus, no packing needs to be performed in the copy and/or reduce; in other words, the individual panels need not be transposed as described in Section 1.4.2.

### m large, n large, k small:

Notice that in the tables, it is indicated that this case defaults to a rank-1 update when k = 1. From this, we deduce that if k is small, a panel-panel variant is the appropriate choice, since in the limit it becomes a rank-1 update. Notice that for the cases where A is transposed ($C = A^T B$ and $C = A^T B^T$), the contributions from A present themselves as row panels, which must be spread within rows, requiring a transpose of each panel, as described in Section 1.4.2. Similarly, for the cases where B is transposed ($C = A B^T$ and $C = A^T B^T$), the contributions from B present themselves as column panels, which must be spread within columns, also requiring a transpose of each panel.
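As a serial illustration of the limiting case (our sketch, not the parallel implementation): when k = 1, one step of the panel-panel variant is exactly a rank-1 update, the outer product of a column of A with a row of B:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 4, 5
a = rng.standard_normal((m, 1))   # A with k = 1: a single column
b = rng.standard_normal((1, n))   # B with k = 1: a single row

C = np.zeros((m, n))
C += a @ b                        # the panel-panel step with panel width 1

# The same update written explicitly as a rank-1 (outer-product) update.
assert np.allclose(C, np.outer(a, b))
```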

### m large, n small, k large:

Notice that in the tables, it is indicated that this case defaults to a matrix-vector multiply when n = 1. From this, we deduce that if n is small, a matrix-panel variant is the appropriate choice. For reasons similar to those of the last case, depending on whether A and B are to be transposed, the views of matrices C and B change, and the reduction and spreading may require a transpose operation for each panel of C and/or B.
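Again as a serial illustration of the limiting case (ours, not from the original text): with n = 1, the matrix-panel variant collapses to a matrix-vector multiply:

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((6, 5))
b = rng.standard_normal((5, 1))   # B with n = 1: a single column

C = A @ b                         # one matrix-panel step, panel width 1
# This is exactly the matrix-vector product y = A x.
assert np.allclose(C[:, 0], A @ b[:, 0])
```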

### m large, n small, k small:

In the tables, it is indicated that this defaults to an axpy-like operation when n = 1 and k = 1. From this, we deduce that if n and k are small, this operation can be performed at least as efficiently as the corresponding axpy operation. One can envision an efficient implementation that redistributes panels of rows and/or columns as multivectors and performs axpy-like operations using all nodes.
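The axpy structure of this case can be sketched serially (our sketch; the redistribution as multivectors is not shown): when n and k are small, each column of C is a short linear combination of the long columns of A, i.e., a few axpy operations:

```python
import numpy as np

def small_nk_gemm(A, B):
    # m large, n and k small: each column of C is built from a
    # handful of axpy operations (y += alpha * x) on columns of A.
    m, k = A.shape
    _, n = B.shape
    C = np.zeros((m, n))
    for j in range(n):
        for p in range(k):
            C[:, j] += B[p, j] * A[:, p]   # axpy with alpha = B[p, j]
    return C

rng = np.random.default_rng(2)
A = rng.standard_normal((1000, 2))  # m large
B = rng.standard_normal((2, 3))     # n and k small
assert np.allclose(small_nk_gemm(A, B), A @ B)
```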

### m small, n large, k large:

This case is similar to the one where m is large and n is small. This time, a panel-matrix variant appears appropriate.

### m small, n large, k small:

This case is similar to the one where m is large and n and k are small, suggesting an axpy-like approach.

### m small, n small, k large:

This defaults to an inner-product-like operation when m = 1 and n = 1. One approach could be to redistribute panels of rows and/or columns as multivectors and perform inner-product-like operations using all nodes.
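Serially, this case looks as follows (our illustrative sketch; the multivector redistribution is not shown): each of the few entries of C is one long inner product:

```python
import numpy as np

def small_mn_gemm(A, B):
    # m and n small, k large: each entry of C is a single long
    # inner product of a row of A with a column of B.
    m, k = A.shape
    _, n = B.shape
    C = np.empty((m, n))
    for i in range(m):
        for j in range(n):
            C[i, j] = np.dot(A[i, :], B[:, j])
    return C

rng = np.random.default_rng(3)
A = rng.standard_normal((2, 1000))   # m small, k large
B = rng.standard_normal((1000, 3))   # n small
assert np.allclose(small_mn_gemm(A, B), A @ B)
```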

### m small, n small, k small:

This operation becomes the totally degenerate case of a simple scalar multiplication when m = n = k = 1. From this, we deduce that if m, n, and k are all small, one should consider performing the operation on a single processor, i.e., redistributing the matrices as multiscalars.

Table: Different possible cases for matrix-matrix multiply, Part I.

Table: Different possible cases for matrix-matrix multiply, Part II.


rvdg@cs.utexas.edu