Sunday, December 23, 2012

Data Access optimization

Figure: typical latency and bandwidth numbers for data transfer to and from different devices in computer systems.
This sketch shows an overview of several data paths present in modern parallel
computer systems, and typical ranges for their bandwidths and latencies. The functional
units, which actually perform the computational work, sit at the top of this
hierarchy. In terms of bandwidth, the slowest data paths are three to four orders of
magnitude away, and eight in terms of latency. The deeper a data transfer must reach
down through the different levels in order to obtain required operands for some calculation,
the harder the impact on performance. Any optimization attempt should
therefore first aim at reducing traffic over slow data paths, or, should this turn out to
be infeasible, at least make data transfer as efficient as possible.

Optimization tips

Access memory in increasing address order.

 In particular:

  • scan arrays in increasing order;
  • scan multidimensional arrays using the rightmost index for innermost loops;
  • in class constructors and in assignment operators (operator=), access member variables in the order of declaration.
Data caches are optimized for memory accesses in increasing sequential order.
When a multidimensional array is scanned, the innermost loop should iterate on the last index, the next-to-innermost loop on the next-to-last index, and so on. In this way, array cells are guaranteed to be processed in the same order in which they are arranged in memory.

Memory alignment

Keep the compiler's default memory alignment.
By default, compilers use an alignment criterion for fundamental types, whereby objects may only have memory addresses that are a multiple of particular factors. This criterion guarantees top performance, but it may add padding (holes) between successive objects.
If it is necessary to avoid such padding for some structures, use the packing pragma directive only around those structure definitions.

Grouping functions in compilation units

Define in the same compilation unit all the member functions of a class, all the friend functions of that class, and all the member functions of friend classes of that class, except when the resulting file becomes unwieldy because of its size.
In this way, both the machine code resulting from the compilation of these functions and the static data defined in these classes and functions will have addresses near each other; in addition, even compilers that do not perform whole-program optimization may optimize the calls among these functions.

Grouping variables in compilation units

Define every global variable in the compilation unit in which it is used most often.
In this way, such variables will have addresses near each other and near the static variables defined in those compilation units; in addition, even compilers that do not perform whole-program optimization may optimize the accesses to these variables from the functions that use them most often.

Private functions and variables in compilation units

Declare in an anonymous namespace the variables and functions that are global to a compilation unit but are not used by other compilation units.
In the C language, and also in C++, such variables and functions may be declared static. However, in modern C++ the use of static global variables and functions is discouraged; they should be replaced by variables and functions declared in an anonymous namespace.
In both cases, the compiler is notified that such identifiers will never be used by other compilation units. This allows even compilers that do not perform whole-program optimization to optimize the usage of these variables and functions.




Source
Introduction to High Performance Computing for Scientists and Engineers,
Georg Hager, Gerhard Wellein
