Intel® Fortran Compiler 17.0 Developer Guide and Reference
OFFLOAD Compiler Directive: Enables statements to execute on the target. This directive only applies to Intel® MIC Architecture.
!DIR$ [OMP] OFFLOAD clause[[,] clause...]
clause
Can be any of the following:
The following arguments are used in the above clause items:
target-name
Is an identifier that represents the target. The only allowable target name is MIC.
target-number
(Required for SIGNAL and WAIT) Is an integer expression whose value is interpreted as shown in the following table. When target-number is specified, the implicit MANDATORY offload is overridden and execution on the CPU is allowed when either the OPTIONAL clause is also specified or optional is also specified in the [Q]offload option.
If you don't specify the target-number argument, the runtime system executes the code on a coprocessor and, if multiple coprocessors are available, chooses which one. If no coprocessor is available, the program fails with an error message. For example, in a system with 4 coprocessors:
Note: Leaving data values on the coprocessor from one execution of offloaded code to another is called "data persistence". In a system with multiple coprocessors, you need to specify a target-number to reliably use data persistence. When you use ALLOC_IF or FREE_IF to implement data persistence on the coprocessor, but do not specify a target-number, the runtime system randomly chooses a coprocessor, so the chosen coprocessor could be one on which the data is not available.
if-specifier
Is a Boolean expression. If the expression evaluates to true, then the program attempts to offload the statement. If the specified target coprocessor is absent from the system or not available at that time because it is fully loaded, then the offloaded code executes on the CPU. If the expression evaluates to false, then the offloaded code executes on the CPU and none of the other OFFLOAD clauses have any effect.
tag
Is a scalar integer expression. Its value is used to coordinate an asynchronous computation or an asynchronous data transfer. When used with SIGNAL, tag is an integer value associated with an asynchronous computation or an asynchronous data transfer. tag can be used in subsequent WAIT clauses in other OFFLOAD, OFFLOAD_TRANSFER, or OFFLOAD_WAIT directives. When used with WAIT, tag is an integer value associated with a previously initiated asynchronous computation or asynchronous data transfer. Use the same tag that you specified in the SIGNAL clause that started the asynchronous computation or data transfer with the OFFLOAD or OFFLOAD_TRANSFER directive.
offload-parameter
Can be any of the following data movement clauses: IN, OUT, INOUT, or NOCOPY.
When a program runs in a heterogeneous environment, program variables are copied back and forth between the CPU and the target. The offload-parameter is a specification for controlling the direction in which variables are copied, and for pointers, the amount of data that is copied. The data selected for transfer is a combination of variables implicitly transferred because they are lexically referenced within offload constructs, and variables explicitly listed in an offload-parameter.
An IN or OUT element-count-expr expression (see description below within modifier) is evaluated at a point in the program before the statement or clause in which it is used. An array variable whose size is known from its declaration is copied in its entirety. If a subset of an array is to be processed, use the name of the starting element of the subset and the element-count-expr to transfer the array subset. Because a data pointer variable not listed in an IN clause is uninitialized within the offload region, it must be assigned a value on the target before it can be referenced.
identifier
Is a variable, a subscripted variable, an array slice, or a component reference. The variable or the component reference may have the ALLOCATABLE or POINTER attribute. An array slice may be contiguous or non-contiguous.
modifier
Is one of the following:
The OFFLOAD directive both transfers data and offloads computation.
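The data persistence note for target-number can be illustrated with a minimal sketch. It assumes a coprocessor numbered 0 and uses the block form (OFFLOAD BEGIN ... END OFFLOAD) together with OFFLOAD_TRANSFER; the names persist_data, a, and total are hypothetical and chosen only for this illustration. Both directives name the same target-number, so the second one finds the data that the first one left behind. On a compiler without offload support the directives are treated as comments and the program runs on the host.

```fortran
module persist_data
  implicit none
  ! Hypothetical module variable, marked for use on the target.
  !dir$ options /offload_attribute_target=mic
  real, allocatable :: a(:)
  !dir$ end options
end module persist_data

program persistence_sketch
  use persist_data
  implicit none
  integer, parameter :: n = 1000
  real :: total

  allocate(a(n))
  a = 1.0
  total = 0.0

  ! Copy a to coprocessor 0 and keep its memory allocated there.
  !dir$ offload_transfer target(mic:0) in(a : alloc_if(.true.) free_if(.false.))

  ! Reuse the copy that persists on coprocessor 0, then free it.
  !dir$ offload begin target(mic:0) nocopy(a : alloc_if(.false.) free_if(.true.)) out(total)
  total = sum(a)
  !dir$ end offload

  ! Self-check: the sum of n ones must equal n.
  if (abs(total - real(n)) > 1.0e-3) error stop "unexpected sum"
  print *, "sum =", total
  deallocate(a)
end program persistence_sketch
```

If the second directive omitted the target-number, the runtime system could pick a coprocessor on which a was never allocated.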
OMP is optional in the syntax. When it is present, the next line, other than a comment, must be an OpenMP* PARALLEL, PARALLEL SECTIONS, or PARALLEL DO directive. Otherwise the compiler issues an error.
When OMP is not present in the syntax, the OFFLOAD directive must be followed by one of the following, or the compiler issues an error:
An OpenMP* PARALLEL, PARALLEL SECTIONS, or PARALLEL DO directive. This specifies remote execution of that top-level OpenMP* construct.
A CALL statement. This specifies remote execution of that single procedure call.
An assignment statement where the right side only executes a function. This specifies remote execution of that single function invocation.
You can choose whether to offload a statement based on runtime conditions, such as the size of a data set. The IF (if-specifier) clause lets you specify the condition.
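As a minimal sketch of a size-based condition, the program below offloads a CALL only when the array is larger than a cutoff. The routine scale_by_two, the module scale_mod, and the cutoff value threshold are hypothetical names introduced for this illustration; they are not part of the compiler documentation. On a compiler without offload support the directives are treated as comments and the call runs on the host.

```fortran
module scale_mod
  implicit none
contains
  ! Hypothetical routine, marked so that it is also built for the target.
  !dir$ attributes offload:mic :: scale_by_two
  subroutine scale_by_two(v, n)
    integer, intent(in) :: n
    real, intent(inout) :: v(n)
    v = 2.0 * v
  end subroutine scale_by_two
end module scale_mod

program if_clause_sketch
  use scale_mod
  implicit none
  integer, parameter :: threshold = 10000   ! hypothetical cutoff
  integer, parameter :: n = 50000
  real :: v(n)

  v = 1.0

  ! Attempt the offload only when the data set is large enough to be worth it.
  ! When the condition is false, the call executes on the CPU.
  !dir$ offload target(mic) if (n > threshold) inout(v)
  call scale_by_two(v, n)

  ! Self-check: every element must have been doubled, wherever the call ran.
  if (any(v /= 2.0)) error stop "unexpected result"
  print *, "v(1) =", v(1)
end program if_clause_sketch
```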
The SIGNAL and WAIT clauses refer to a specific target device, so you must specify target-number in the TARGET clause. If you query a signal before the signal has been initiated, it results in undefined behavior and a runtime abort of the application. For example, if you query a signal (SIG1) on target device 0 that was initiated for target device 1, it results in a runtime abort of the application. This is because the signal (SIG1) was initiated for target device 1, so there is no signal (SIG1) associated with target device 0.
If the if-specifier evaluates to false and a SIGNAL (tag) clause is used in the directive, then the SIGNAL is undefined and any WAIT on this SIGNAL has undefined behavior.
When you specify the STATUS clause, it affects the behavior of optional and mandatory offloads differently when the offload request is not successful:
For an optional offload, the computation is performed on the CPU and the status variable has an appropriate value.
For a mandatory offload, there is no CPU fallback. The program does not terminate. You must examine the value of the status variable, determine the reason the offload failed, and decide what action to take.
For both optional and mandatory offloads, when offload is successful, the status variable has the value OFFLOAD_SUCCESS.
In the data movement clauses (IN, OUT, INOUT, and NOCOPY) and the modifiers ALLOC and INTO, you can specify an array slice of any rank. For an assumed-size dummy array, you can specify the following syntax, interchangeably:
var:length(k), where k is a scalar integer expression computed at runtime representing the number of elements of var to be moved.
var( section-subscript-list ), where section-subscript-list is a comma-separated list of subscript triplets of the form [ subscript ] : [ subscript ] [ : stride ]
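The two forms can be compared side by side in a short sketch. It assumes an assumed-size dummy array z(n,*) of which the first m columns are to be transferred, and it uses the block form (OFFLOAD BEGIN ... END OFFLOAD); the names slice_mod, column_totals, t1, and t2 are hypothetical. On a compiler without offload support the directives are treated as comments and both regions run on the host.

```fortran
module slice_mod
  implicit none
contains
  subroutine column_totals(z, n, m, t1, t2)
    integer, intent(in) :: n, m
    real :: z(n,*)                 ! assumed-size dummy array
    real, intent(out) :: t1, t2

    ! Form 1: give the number of elements to move.
    !dir$ offload begin target(mic) in(z : length(n*m)) out(t1)
    t1 = sum(z(1:n,1:m))
    !dir$ end offload

    ! Form 2: give the same elements as an array section.
    !dir$ offload begin target(mic) in(z(:,1:m)) out(t2)
    t2 = sum(z(1:n,1:m))
    !dir$ end offload
  end subroutine column_totals
end module slice_mod

program slice_sketch
  use slice_mod
  implicit none
  real :: z(4,3), t1, t2

  z = 1.0
  call column_totals(z, 4, 3, t1, t2)

  ! Self-check: both forms describe the same 12 elements.
  if (t1 /= 12.0 .or. t2 /= 12.0) error stop "unexpected totals"
  print *, "totals:", t1, t2
end program slice_sketch
```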
Do not use the __MIC__ preprocessor symbol inside a statement following an OMP OFFLOAD directive. However, you can use it in a subprogram called from the directive.
Conceptually, this is the sequence of events when a statement marked for offload is encountered:
1. If there is no IF clause, go to step 3.
2. On the host, evaluate the IF expression. If it evaluates to true, go to step 3. Otherwise, execute the region on the host and go to step 17.
3. Attempt to acquire the target. If successful, go to step 4. Otherwise:
   a. If there is no MANDATORY clause, execute the region on the host and go to step 17.
   b. If there is a STATUS clause, set the var to indicate the error and go to step 17.
   c. Otherwise, terminate the program with an appropriate error message.
4. On the host, compute all ALLOC_IF, FREE_IF, and element-count-expr expressions used in IN and OUT clauses.
5. On the host, gather all variable values that are inputs to the offload.
6. Send the input values from the host to the target.
7. On the target, allocate memory for variable-length OUT variables.
8. On the target, copy input values into corresponding target variables.
9. On the target, execute the offloaded region.
10. On the target, compute all element-count-expr expressions used in OUT clauses.
11. On the target, gather all variable values that are outputs of the offload.
12. Send output values back from the target to the host.
13. On the host, copy values received into corresponding host variables.
14. If no error occurred on the target, go to step 17.
15. If there is a STATUS clause, set the var to indicate the error and go to step 17.
16. Otherwise, terminate the program with an appropriate error message.
17. Continue processing the program on the host.
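The decision points in this sequence (the IF test, target acquisition, and error handling) can be modeled as a small pure function. This is only an illustrative model of the conceptual sequence as written above, not the runtime system's implementation; the module offload_flow, the function outcome, and the outcome codes are all hypothetical names.

```fortran
module offload_flow
  implicit none
  ! Hypothetical outcome codes, introduced only for this sketch.
  integer, parameter :: RAN_ON_TARGET = 1, RAN_ON_HOST = 2
  integer, parameter :: STATUS_SET = 3, TERMINATED = 4
contains
  pure function outcome(has_if, if_value, target_ok, mandatory, &
                        has_status, target_error) result(res)
    logical, intent(in) :: has_if, if_value, target_ok
    logical, intent(in) :: mandatory, has_status, target_error
    integer :: res
    if (has_if .and. .not. if_value) then
      res = RAN_ON_HOST          ! IF evaluated to false
    else if (.not. target_ok) then
      if (.not. mandatory) then
        res = RAN_ON_HOST        ! no MANDATORY clause: fall back to the host
      else if (has_status) then
        res = STATUS_SET         ! failure is reported through the STATUS var
      else
        res = TERMINATED         ! no STATUS clause: program terminates
      end if
    else if (.not. target_error) then
      res = RAN_ON_TARGET        ! transfers and execution completed
    else if (has_status) then
      res = STATUS_SET           ! error on the target, reported via STATUS
    else
      res = TERMINATED           ! error on the target, no STATUS clause
    end if
  end function outcome
end module offload_flow

program flow_sketch
  use offload_flow
  implicit none
  ! Argument order: has_if, if_value, target_ok, mandatory, has_status, target_error
  if (outcome(.false., .false., .true., .true., .false., .false.) /= RAN_ON_TARGET) error stop 1
  if (outcome(.true., .false., .true., .true., .false., .false.) /= RAN_ON_HOST) error stop 2
  if (outcome(.false., .false., .false., .false., .false., .false.) /= RAN_ON_HOST) error stop 3
  if (outcome(.false., .false., .false., .true., .true., .false.) /= STATUS_SET) error stop 4
  if (outcome(.false., .false., .false., .true., .false., .false.) /= TERMINATED) error stop 5
  if (outcome(.false., .false., .true., .true., .true., .true.) /= STATUS_SET) error stop 6
  print *, "all modeled cases match the sequence"
end program flow_sketch
```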
The following example demonstrates offloading a CALL statement or assignment statement. Note that !DIR$ OFFLOAD TARGET (MIC) prefixes the statement designated for offload.
! Offload call of routine calc
!DIR$ OFFLOAD TARGET(MIC)
CALL calc(...)
! Offload call of function recalc
!DIR$ OFFLOAD TARGET(MIC)
X = recalc(...)
The following example demonstrates using the OFFLOAD directive in conjunction with the OpenMP* PARALLEL directive to specify remote execution of the OpenMP construct.
! Offload OpenMP parallel construct
!DIR$ OMP OFFLOAD TARGET(MIC)
!$omp parallel
...
!$omp end parallel
The following example demonstrates how to use a variable-length array to specify a number of elements copied between the CPU and target.
subroutine sample (Z,N,M)
  integer, intent(in) :: N,M
  real, dimension (N,*) :: Z
  ...
  !dir$ omp offload target(mic) in (Z:length(N*M))
  ...
end subroutine sample
The following example shows various forms of identifier and use of the ALLOC and INTO modifiers in IN clauses:
subroutine foo
  real a(1000,500), b(1000,500)
  real, target :: c(2000, 20)   ! TARGET is required so that p can point to c
  real, pointer :: p(:)
  p => c(1:40:2, 1)             ! rank-1, 20-element section of the rank-2 array c
  !dir$ offload target(mic) in( a : into (b) )
  ...
  !dir$ offload target(mic) in( c(i:j:k,l:m:n) ) ! k and n must be strides of 1
  ...
  !dir$ offload target(mic) in( p(1:20) : alloc (p(1:100)) )
  ...
end
The following example demonstrates using the OFFLOAD directive, as well as directives OFFLOAD_TRANSFER and OFFLOAD_WAIT.
! Sample use of OFFLOAD, OFFLOAD_TRANSFER, and OFFLOAD_WAIT
module M
  integer, parameter :: iter = 10
  integer, parameter :: count = 25000
  !dir$ options /offload_attribute_target=mic
  real, allocatable :: in1(:), in2(:), out1(:), out2(:)
  !dir$ end options
  integer :: sin1, sin2, sout1, sout2
contains
  !dir$ attributes offload:mic ::compute
  subroutine compute(x, y)
    real, allocatable :: x(:), y(:)
    integer :: i
    !$omp parallel do num_threads(96) private(i)
    do i = 1, count
      y(i) = x(i) * x(i)
    end do
  end subroutine compute
  subroutine do_async_in()
    integer :: i
    ! prime loop with initial in1 transfer to target
    !dir$ offload_transfer target(mic:0) signal(sin1) &
         in( in1 : alloc_if(.false.) free_if(.false.) )
    do i = 1, iter
      if (mod(i,2) == 0) then
        ! initiate another in1 data transfer to target, skip if last iteration
        !dir$ offload_transfer target(mic:0) if(i /= iter) signal(sin1) &
             in( in1 : alloc_if(.false.) free_if(.false.) )
        ! wait for in2 transfer to complete, then offload computation
        !dir$ offload target(mic:0) wait(sin2) &
             nocopy(in2) &
             out( out2 : alloc_if(.false.) free_if(.false.) )
        call compute(in2, out2)
        ! use out2 results on host
        call use_result(out2)
      else
        ! initiate another in2 data transfer to target, skip if last iteration
        !dir$ offload_transfer target(mic:0) if(i /= iter) signal(sin2) &
             in( in2 : alloc_if(.false.) free_if(.false.) )
        ! wait for in1 transfer to complete, then offload computation
        !dir$ offload target(mic:0) wait(sin1) &
             nocopy( in1 ) &
             out( out1 : alloc_if(.false.) free_if(.false.) )
        call compute(in1, out1)
        ! use out1 results on host
        call use_result(out1)
      endif
    enddo
  end subroutine do_async_in
  subroutine do_async_out()
    integer :: i
    do i = 1, (iter + 1)
      if ( mod(i,2) == 0 ) then
        if ( i < (iter + 1)) then
          ! offload computation, leave results on target
          !dir$ offload target(mic:0) &
               in( in2 : alloc_if(.false.) free_if(.false.) ) &
               nocopy( out2 )
          call compute(in2, out2)
          ! transfer out2 results (asynchronously) back to host
          !dir$ offload_transfer target(mic:0) signal(sout2) &
               out( out2 : alloc_if(.false.) free_if(.false.) )
        endif
        ! wait for out1 results on host
        !dir$ offload_wait target(mic:0) wait(sout1)
        ! use out1 results on host
        call use_result(out1)
      else
        if (i < (iter + 1)) then
          ! offload computation, leave results on target
          !dir$ offload target(mic:0) &
               in( in1 : alloc_if(.false.) free_if(.false.) ) &
               nocopy( out1 )
          call compute(in1, out1)
          ! transfer out1 results (asynchronously) back to host
          !dir$ offload_transfer target(mic:0) signal(sout1) &
               out( out1 : alloc_if(.false.) free_if(.false.) )
        endif
        if (i > 1) then
          ! wait for out2 results on host
          !dir$ offload_wait target(mic:0) wait(sout2)
          ! use out2 results on host
          call use_result(out2)
        endif
      endif
    enddo
  end subroutine do_async_out
  subroutine do_sync()
    integer :: i
    do i = 1, iter
      ! transfer data to target, compute, and return results synchronously
      !dir$ offload target(mic:0) &
           in( in1 : alloc_if(.false.) free_if(.false.) ) &
           out( out1 : alloc_if(.false.) free_if(.false.) )
      call compute(in1, out1)
      ! use out1 results on host
      call use_result(out1)
    enddo
  end subroutine do_sync
  subroutine use_result(x)
    ! use results from offload computations on host
    real, allocatable :: x(:)
    print*, "USE_RESULT *****************"
  end subroutine use_result
end module M
program main
  use M
  integer :: i
  allocate ( in1(count), in2(count), out1(count), out2(count) )
  !$omp parallel do num_threads(96) private(i)
  do i = 1, count
    in1(i) = REAL(i)
    in2(i) = REAL(i)
  enddo
  ! Initialize signal variables to unique values
  sin1 = 1
  sin2 = 2
  sout1 = 3
  sout2 = 4
  ! allocate memory on target only
  !dir$ offload_transfer target(mic:0) &
       nocopy(in1, out1, in2, out2 : alloc_if(.true.) free_if(.false.) )
  !$omp parallel do num_threads(96) private(i)
  do i = 1, count
    out1(i) = 0.
    out2(i) = 0.
  enddo
  ! synchronous transfer, compute
  call do_sync()
  ! asynchronous IN transfer, compute
  call do_async_in()
  ! compute, asynchronous OUT transfer
  call do_async_out()
  ! free memory on target only
  !dir$ offload_transfer target(mic:0) &
       nocopy(in1, out1, in2, out2 : alloc_if(.false.) free_if(.true.) )
  deallocate( in1, in2, out1, out2 )
end program main
The following example demonstrates the tag argument:
module mmod
  !dir$ attributes offload : mic :: x, y
  real :: x (1000) = 1.1
  real :: y(1000) = 2.2
  integer :: gtag = 1
end module mmod
program mmain
  ! compile with symbol name "doGLOB" defined to use a global tag
  ! compile with symbol name "doGLOB" undefined to use a local tag
  use mmod
  integer :: ktag = 2
  call f (ktag)
  call g (ktag)
  print *, (y (j),j=1,1000,100) ! print every 100th element == 5.1
end program mmain
subroutine f (kktag)
  use mmod
  integer :: kktag
#ifdef doGLOB
  !dir$ offload_transfer target (mic:0) signal (gtag) IN(x) ! copy X to MIC, signal when done
#else
  !dir$ offload_transfer target (mic:0) signal (kktag) IN(x) ! copy X to MIC, signal when done
#endif
end subroutine f
subroutine g (kktag)
  use mmod
  integer :: kktag
#ifdef doGLOB
  !dir$ offload begin target (mic:0) wait (gtag) OUT (y) ! wait for offload_transfer in f to copy x
#else
  !dir$ offload begin target (mic:0) wait (kktag) OUT (y) ! wait for offload_transfer in f to copy x
#endif
  y = x + 4 ! y is all 5.1s
  !dir$ end offload
end subroutine g
The following examples demonstrate using the STATUS clause.
Example 1
use mic_lib
type (offload_status) :: stat
!dir$ offload begin target (mic:1) status (stat)
  … ! code to be offloaded to the coprocessor
!dir$ end offload
if (stat%result .ne. OFFLOAD_SUCCESS) then
  … ! handle the error condition
end if
Example 2
use mic_lib
type (offload_status) :: stat
real :: p (20), q (20)
integer :: my_mic
!dir$ offload_transfer target (mic), status (stat), in (p, q)
if (stat%result == OFFLOAD_OUT_OF_MEMORY) then
  ! memory could not be allocated
  … ! abandon offload target – use CPU host
else
  ! data has been transferred, can continue using coprocessor
  my_mic = stat%device_number
  !dir$ offload target (mic:my_mic), status (stat)
  … ! do offload computation on device that was obtained
end if