The Basic Linear Algebra Subprograms (BLAS) define one of the most heavily used performance-critical APIs in scientific computing today. It has long been understood that the most important of these routines, the dense Level 3 BLAS, may be written efficiently given a highly optimized general matrix multiply routine. In this paper, however, we show that an even larger set of operations can be efficiently maintained using a much simpler matrix multiply kernel. Indeed, this is how our own project, ATLAS (which provides one of the most widely used BLAS implementations today), supports a large variety of performance-critical routines. Copyright © 2004 John Wiley & Sons, Ltd.
Whaley et al. studied this question.
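To make the abstract's central idea concrete, the following is a minimal sketch of how one Level 3 BLAS operation, a symmetric rank-k update on the lower triangle (C += A·Aᵀ), might be built on top of a small matrix multiply kernel. It is an illustration under stated assumptions, not ATLAS's actual code: the names `gemm_nt_kernel` and `syrk_lower_via_gemm`, the block size `NB`, and the row-major layout are all hypothetical, and the kernel here is a plain triple loop standing in for an optimized one.

```c
#include <assert.h>
#include <stddef.h>

#define NB 2  /* hypothetical block size, kept tiny for illustration */

/* Hypothetical simple kernel: C += A * B^T.
   A is m x k, B is n x k, C is m x n, all row-major with
   leading dimensions lda, ldb, ldc. */
void gemm_nt_kernel(size_t m, size_t n, size_t k,
                    const double *A, size_t lda,
                    const double *B, size_t ldb,
                    double *C, size_t ldc)
{
    for (size_t i = 0; i < m; i++) {
        for (size_t j = 0; j < n; j++) {
            double sum = 0.0;
            for (size_t p = 0; p < k; p++)
                sum += A[i * lda + p] * B[j * ldb + p];
            C[i * ldc + j] += sum;
        }
    }
}

/* Symmetric rank-k update on the lower triangle: C += A * A^T,
   where A is n x k and C is n x n. Only entries with column <= row
   are written; the strict upper triangle is left untouched. */
void syrk_lower_via_gemm(size_t n, size_t k,
                         const double *A, size_t lda,
                         double *C, size_t ldc)
{
    for (size_t i = 0; i < n; i += NB) {
        size_t ib = (n - i < NB) ? n - i : NB;  /* rows in this block row */

        /* Blocks strictly below the diagonal are ordinary matrix
           multiplies, so the kernel updates C directly. Every such
           block is a full NB columns wide. */
        for (size_t j = 0; j < i; j += NB)
            gemm_nt_kernel(ib, NB, k, A + i * lda, lda,
                           A + j * lda, lda, C + i * ldc + j, ldc);

        /* Diagonal block: compute the full ib x ib product into a
           scratch buffer, then add only its lower triangle into C. */
        double tmp[NB * NB];
        for (size_t t = 0; t < NB * NB; t++)
            tmp[t] = 0.0;
        gemm_nt_kernel(ib, ib, k, A + i * lda, lda,
                       A + i * lda, lda, tmp, NB);
        for (size_t r = 0; r < ib; r++)
            for (size_t c = 0; c <= r; c++)
                C[(i + r) * ldc + (i + c)] += tmp[r * NB + c];
    }
}
```

In this sketch all of the floating-point work goes through the single kernel, so tuning that one routine would speed up the whole update. The diagonal blocks do some redundant arithmetic in the scratch buffer, which is a deliberate simplification here.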