In this paper we give a straightforward, highly efficient, scalable implementation of common matrix multiplication operations. The algorithms are much simpler than previously published methods, yield better performance, and require less workspace. MPI implementations are given, as are performance results on the Intel Paragon system. © 1997 by John Wiley & Sons, Ltd.
Geijn et al. studied this question.