The exponential increase in the (serial) performance of computers, which accompanied and enabled the advances in finite- and boundary-element methods and their application to increasingly complex electromagnetic problems throughout the second half of the twentieth century, ceased in the last decade as the conventional Dennard scaling of transistors came to an end. Since then, the main paradigm for increasing computer performance has been parallel computation: on (homogeneous) architectures composed of multiple identical compute cores, e.g., a multi-core CPU or a many-core general-purpose GPU; on (heterogeneous) architectures that combine multiple processor types, e.g., a multi-core CPU with a many-core coprocessor; and on clusters of such architectures. To continue benefiting from the increasing (parallel) performance of computers, finite- and boundary-element methods must be parallelized with algorithms appropriate for the underlying architecture.
The proposed methodology builds on the concept of regions of acceptable parallelization introduced in (F. Wei and A. E. Yılmaz, “A more scalable and efficient parallelization of the adaptive integral method part I: algorithm,” IEEE Trans. Antennas Propag., Feb. 2014) and extended to heterogeneous clusters in (F. Wei and A. E. Yılmaz, “A systematic approach to judging parallel algorithms: acceptable parallelization regions in the N-P plane,” FEM Int. Workshop, May 2014). While that concept is appropriate for judging different algorithms that parallelize a given (sequential) method on the same homogeneous/heterogeneous architecture, the proposed methodology is more general and can also be used to compare different methods or architectures. In this article, traditional direct and iterative method-of-moments solutions of surface and volume integral equations are implemented on the Stampede cluster, whose nodes consist of two eight-core Intel Xeon processors and one 61-core Intel Xeon Phi coprocessor. At the workshop, the proposed methodology will be used to evaluate the performance of the implementations with and without the coprocessor; implications for fast algorithms will also be discussed.
Presentation slides are available.