Intel® Fortran Compiler 17.0 Developer Guide and Reference
Loops can be formed with the usual DO-END DO and DO WHILE, or by using an IF/GOTO and a label. Loops must have a single entry and a single exit to be vectorized. The following examples illustrate loop constructs that can and cannot be vectorized.
Example: Vectorizable structure
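For illustration, a loop of this kind might look like the following sketch (the names A, B, and C and the bound of 100 are placeholders). The loop is entered only at the top and left only when the iteration condition fails, so its structure does not prevent vectorization.

SUBROUTINE VEC_OK (A, B, C)
  REAL A(100), B(100), C(100)
  INTEGER I
  I = 1
  ! Single entry at the top of the loop ...
  DO WHILE (I .LE. 100)
    A(I) = B(I) * C(I)
    IF (A(I) .LT. 0.0) A(I) = 0.0
    I = I + 1
  ENDDO
  ! ... and a single exit at the bottom.
END SUBROUTINE VEC_OK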
The following example shows a loop that cannot be vectorized because of the inherent potential for an early exit from the loop.
Example: Non-vectorizable structure
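A sketch of such a structure (the same placeholder names as above): the conditional branch out of the loop body creates a second exit, so the loop cannot be vectorized.

SUBROUTINE VEC_BAD (A, B, C)
  REAL A(100), B(100), C(100)
  INTEGER I
  I = 1
  DO WHILE (I .LE. 100)
    A(I) = B(I) * C(I)
    ! The branch below can leave the loop before the trip count
    ! is reached, giving the loop a second exit and preventing
    ! vectorization.
    IF (A(I) .LT. 0.0) GOTO 10
    I = I + 1
  ENDDO
10 CONTINUE
END SUBROUTINE VEC_BAD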
Loop exit conditions determine the number of iterations a loop executes. For example, a loop with fixed bounds executes a fixed number of iterations. The loop iterations must be countable; in other words, the number of iterations must be expressible as one of the following:
A constant.
A loop invariant term.
A linear function of outermost loop indices.
When a loop's exit depends on a computation performed inside the loop, the loop is not countable. The examples below show loop constructs that are countable and non-countable.
Example: Countable Loop
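One possible countable form is sketched below (the names A, B, C, N, and LB are placeholders). The trip count, N - LB + 1, is known before the loop executes because neither N nor LB changes inside the loop.

SUBROUTINE COUNTABLE1 (A, B, C, N, LB)
  REAL A(N), B(N), C(N)
  INTEGER N, LB, I, COUNT
  I = 1
  COUNT = N
  ! Trip count is N - LB + 1; LB is loop invariant.
  DO WHILE (COUNT .GE. LB)
    A(I) = B(I) * C(I)
    COUNT = COUNT - 1
    I = I + 1
  ENDDO
END SUBROUTINE COUNTABLE1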
The following example demonstrates a different countable loop construct.
Example: Countable Loop
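Another countable construct is sketched below (again with placeholder names): a counted DO loop whose bounds and stride are loop invariant, giving a trip count of (N - M + 2) / 2 when N is at least M.

SUBROUTINE COUNTABLE2 (A, B, C, M, N)
  REAL A(N), B(N), C(N)
  INTEGER I, L, M, N
  I = 1
  ! Trip count is (N - M + 2) / 2; M and N are loop invariant.
  DO L = M, N, 2
    A(I) = B(I) * C(I)
    I = I + 1
  ENDDO
END SUBROUTINE COUNTABLE2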
The following example demonstrates a loop construct that is non-countable because the trip count depends on a loop-variant value.
Example: Non-Countable Loop
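A sketch of a non-countable loop (placeholder names): the exit condition tests A(I), a value that is recomputed inside the loop, so the number of iterations cannot be determined before the loop runs.

SUBROUTINE NOTCOUNTABLE (A, B, C)
  REAL A(100), B(100), C(100)
  INTEGER I
  I = 1
  ! The number of iterations depends on the values of A,
  ! which change inside the loop, so the trip count is unknown.
  DO WHILE (A(I) .GT. 0.0)
    A(I) = B(I) * C(I)
    I = I + 1
  ENDDO
END SUBROUTINE NOTCOUNTABLE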
Strip-mining, also known as loop sectioning, is a loop transformation technique for enabling SIMD encoding of loops, as well as a means of improving memory performance. By fragmenting a large loop into smaller segments or strips, this technique transforms the loop structure in two ways:
By increasing the temporal and spatial locality in the data cache if the data are reusable in different passes of an algorithm.
By reducing the number of iterations of the loop by a factor of the vector length, that is, the number of elements processed per SIMD operation. With Intel® Streaming SIMD Extensions (Intel® SSE), for example, the strip length for single-precision data is four: each Intel® SSE SIMD operation processes four single-precision floating-point data items, so the iteration count is reduced by a factor of four.
First introduced for vectorizers, this technique generates code in which each vector operation is done for a size less than or equal to the maximum vector length of a given vector machine.
The compiler automatically strip-mines your loop and generates a cleanup loop. For example, assume the compiler attempts to strip-mine the following loop:
Example: Before Vectorization
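A sketch of such a loop (the names A, B, C, and N are placeholders):

REAL A(N), B(N), C(N)
INTEGER I
I = 1
DO WHILE (I .LE. N)
  A(I) = B(I) + C(I)    ! original scalar loop body
  I = I + 1
ENDDO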
The compiler might handle the strip-mining and cleanup by restructuring the loop in the following manner:
Example: After Vectorization
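A sketch of the restructured form, assuming a vector length of four single-precision elements; the array-section assignment stands for one SIMD operation:

REAL A(N), B(N), C(N)
INTEGER I
! Strip-mined loop: four elements per pass.
I = 1
DO WHILE (I .LE. N - MOD(N, 4))
  A(I:I+3) = B(I:I+3) + C(I:I+3)
  I = I + 4
ENDDO
! Scalar cleanup loop for the remaining iterations.
DO WHILE (I .LE. N)
  A(I) = B(I) + C(I)
  I = I + 1
ENDDO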
Loop Blocking
It is possible to treat loop blocking as strip-mining in two or more dimensions. Loop blocking is a useful technique for memory performance optimization. The main purpose of loop blocking is to eliminate as many cache misses as possible. This technique transforms the memory domain into smaller chunks rather than sequentially traversing through the entire memory domain. Each chunk should be small enough to fit all the data for a given computation into the cache, thereby maximizing data reuse.
Consider the following example. In the loop below, the inner index J varies fastest, so the two-dimensional array A is traversed along its second (column) index while array B is traversed along its first (row) index. Because Fortran stores arrays in column-major order, the access strides of arrays A and B in this code are MAX and 1, respectively. BS is the block size; MAX must be evenly divisible by BS.
Consider the following loop example code:
Example: Original loop

REAL A(MAX,MAX), B(MAX,MAX)
DO I = 1, MAX
  DO J = 1, MAX
    A(I,J) = A(I,J) + B(J,I)
  ENDDO
ENDDO
The arrays could be blocked into smaller chunks so that the total combined size of the two blocked chunks is smaller than the cache size, which can improve data reuse. One possible way of doing this is demonstrated below:
Example: Transformed Loop after blocking
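One possible blocked form is sketched below (the loop index names I_OUTER, J_OUTER, I_INNER, and J_INNER are illustrative). Each pair of outer loops selects a BS-by-BS block, and the inner loops traverse only that block, so the working set of A and B for each block can remain in cache.

REAL A(MAX,MAX), B(MAX,MAX)
DO I_OUTER = 1, MAX, BS
  DO J_OUTER = 1, MAX, BS
    DO I_INNER = I_OUTER, I_OUTER + BS - 1
      DO J_INNER = J_OUTER, J_OUTER + BS - 1
        A(I_INNER,J_INNER) = A(I_INNER,J_INNER) + B(J_INNER,I_INNER)
      ENDDO
    ENDDO
  ENDDO
ENDDO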
Loop interchange is often used for improving memory access patterns. Matrix multiplication is commonly written as shown in the following example:
Example: Typical Matrix Multiplication
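A sketch of the usual formulation (the array names A, B, C and the bound N are illustrative), with K as the innermost loop:

REAL A(N,N), B(N,N), C(N,N)
DO I = 1, N
  DO J = 1, N
    DO K = 1, N
      C(I,J) = C(I,J) + A(I,K) * B(K,J)
    ENDDO
  ENDDO
ENDDO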
Written this way, not all of the array references are stride-1: with K as the innermost loop, A(I,K) is accessed with a stride of N, because Fortran stores arrays in column-major order, so the loop will not be vectorized efficiently.
If the loops are interchanged, however, all the references will become stride-1 as shown in the following example.
Example: Matrix Multiplication with Stride-1
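A sketch of the interchanged nest (same illustrative names): with I innermost, C(I,J) and A(I,K) are accessed with stride 1, and B(K,J) is invariant in the inner loop.

REAL A(N,N), B(N,N), C(N,N)
DO J = 1, N
  DO K = 1, N
    DO I = 1, N
      C(I,J) = C(I,J) + A(I,K) * B(K,J)
    ENDDO
  ENDDO
ENDDO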
Interchanging is not always possible because of data dependences, which an interchange could violate and thereby change the results.