Intel® Fortran Compiler 17.0 Developer Guide and Reference
This topic only applies when targeting Intel® Many Integrated Core Architecture (Intel® MIC Architecture).
To transfer data between the CPU and the target device, use the OFFLOAD_TRANSFER directive with either all in clauses or all out clauses. Without a signal clause the data transfer is synchronous: The next statement is executed only after the data transfer is complete.
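As a minimal sketch of the synchronous case, the program below issues an OFFLOAD_TRANSFER with a single in clause and no signal clause. The program name, the array size n, and the printed message are illustrative assumptions, not part of the reference; memory-management modifiers are left at their defaults. A compiler without offload support treats the directives as comments and runs everything on the CPU.

```fortran
program sync_transfer_sketch
  implicit none
  integer, parameter :: n = 1024          ! hypothetical array size
  real, allocatable :: f1(:)
  !dir$ attributes offload:mic :: f1

  allocate(f1(n))
  f1 = 1.0

  ! No signal clause: the CPU blocks here until f1 has been copied to target 0.
  !dir$ offload_transfer target(mic:0) in(f1)

  ! This statement is reached only after the transfer has completed.
  print *, 'transfer issued synchronously, f1(1) =', f1(1)
end program sync_transfer_sketch
```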
OFFLOAD_TRANSFER with a signal makes the data transfer asynchronous. The tag specified in the signal clause is an address expression associated with that dataset. The data transfer is initiated and the CPU can continue past the directive statement.
A later directive written with a wait clause causes the activity specified in the directive to begin only after all the data associated with the tag has been received or shared with the target device. The data is placed into the variables specified when the data transfer was initiated. These variables must still be accessible.
Alternatively, you can use the non-blocking API OFFLOAD_SIGNALED() to check whether a section of offloaded code has completed running on a specific target device.
On Intel® MIC Architecture, the signal and wait clauses, the OFFLOAD_WAIT construct and the OFFLOAD_SIGNALED() API refer to a specific target device, so you must specify target-number in the target() clause.
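The following sketch illustrates the target-number requirement under some assumptions: the program name, the array size, and the host-side print are hypothetical. The signal is initiated on target 0 and the OFFLOAD_WAIT construct names the same target number, so the wait refers to a signal that exists on that device.

```fortran
program wait_same_target_sketch
  implicit none
  integer, parameter :: n = 1024          ! hypothetical array size
  real, allocatable :: f1(:)
  integer :: signal_1                     ! tag used by signal and wait
  !dir$ attributes offload:mic :: f1

  allocate(f1(n))
  f1 = 2.0

  ! Initiate an asynchronous transfer to target 0; the CPU continues at once.
  !dir$ offload_transfer target(mic:0) in(f1) signal(signal_1)

  ! CPU work that does not modify f1 could overlap with the transfer here.

  ! Block until the transfer tagged signal_1 on target 0 has completed.
  ! The target number matches the one used when the signal was initiated.
  !dir$ offload_wait target(mic:0) wait(signal_1)

  print *, 'wait on target 0 finished, f1(n) =', f1(n)
end program wait_same_target_sketch
```

Using a different target number in the wait than in the signal would query a signal that was never initiated on that device.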
Querying a signal before it has been initiated results in undefined behavior and a runtime abort of the application. For example, suppose a signal (SIG1) is initiated for target device 1 but is queried on target device 0. No signal (SIG1) is associated with target device 0, so the application aborts.
If, during an asynchronous offload, a signal is created in one thread (Thread A) and waited for in a different thread (Thread B), you are responsible for ensuring that Thread B does not query the signal before Thread A has initiated the asynchronous offload that sets up the signal. If Thread B queries the signal too early, the application aborts at run time.
If if-specifier evaluates to false and you use a signal (tag) clause, then the signal is undefined and any wait on this signal has undefined behavior.
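One way to avoid waiting on an undefined signal is to guard the wait with the same condition that was passed as the if-specifier. The sketch below assumes a hypothetical logical variable do_offload and a hypothetical array size; it is an illustration of the pattern, not a prescribed method.

```fortran
program guarded_wait_sketch
  implicit none
  integer, parameter :: n = 1024          ! hypothetical array size
  real, allocatable :: f1(:)
  integer :: signal_1
  logical :: do_offload                   ! hypothetical run-time switch
  !dir$ attributes offload:mic :: f1

  allocate(f1(n))
  f1 = 1.0
  do_offload = .true.

  ! The signal is only set up when do_offload is true.
  !dir$ offload_transfer target(mic:0) if(do_offload) in(f1) signal(signal_1)

  ! Wait only under the same condition, so an undefined signal is never queried.
  if (do_offload) then
    !dir$ offload_wait target(mic:0) wait(signal_1)
  end if

  print *, 'do_offload =', do_offload
end program guarded_wait_sketch
```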
To transfer data asynchronously from the CPU to the target, use a signal clause in an OFFLOAD_TRANSFER directive with in clauses. The variables listed in the in clauses form a data set. The directive initiates the data transfer of those variables from the CPU to the target. A subsequent OFFLOAD directive with a wait clause that uses the same value for tag as that used in the signal clause causes the statement controlled by the directive to begin execution on the target only after the data transfer is complete.
To transfer data asynchronously from the target to the CPU, use the signal and wait clauses in two different directives. The first offload directive performs the computation, but only initiates the data transfer. The second directive causes a wait for the data transfer to complete.
In the following example, the data transfer of the floating-point array f1 is initiated at line 10, and the transfer of f2 is initiated at line 12. These offload_transfer directives do not initiate a computation; their only purpose is to start transferring f1 and f2 to the target. At lines 13 and 14 the CPU initiates the computation of the function foo on the target. The function uses f1 and f2, whose transfers were initiated earlier, so execution of the offloaded region on the target begins only after both transfers complete. The variable result receives the result of the computation.
01 integer, parameter:: n=4086
02 real, allocatable :: f1(:), f2(:), result
03 !dir$ attributes offload:mic :: f1, f2, foo
04 integer :: signal_1, signal_2
05 !dir$ attributes align : 64 :: f1
06 !dir$ attributes align : 64 :: f2
07 allocate(f1(n))
08 allocate(f2(n))
09 f1 = 1.0
10 !dir$ offload_transfer target (mic:0) in(f1) signal(signal_1)
11 f2 = 3.14
12 !dir$ offload_transfer target (mic:0) in(f2) signal(signal_2)
13 !dir$ offload begin target(mic:0) wait (signal_1, signal_2)
14 result = foo(n, f1, f2)
15 !dir$ end offload
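Multiple independent asynchronous data transfers can occur at any time. In the example below, the offload at line 10 sends f1 to the target and starts foo there without copying f2, and it sets signal_1 when the offload completes. The offload_transfer at line 13 waits on signal_1 and then copies f2 from the target back to the CPU.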
01 program main
02 integer, parameter:: n=4086
03 real, allocatable :: f1(:), f2(:), result
04 !dir$ attributes offload:mic :: f1, f2, foo
05 integer :: signal_1, signal_2
06 !dir$ attributes align : 64 :: f1
07 !dir$ attributes align : 64 :: f2
08 allocate(f1(n))
09 allocate(f2(n))
10 !dir$ offload begin target(mic:0) in (f1 ) nocopy (f2) signal(signal_1)
11 call foo(N, f1, f2)
12 !dir$ end offload
13 !dir$ offload_transfer target(mic:0) wait(signal_1) out (f2)
14 end program main
In the following example, the data transfer of the floating-point array in1 is initiated at line 15. That offload does not initiate a computation; its only purpose is to start transferring in1 to the target. Within the do loop, each iteration starts the transfer of one input array (in2 at line 18, in1 at line 22) and runs the computation on the other array, whose transfer was initiated earlier. At line 20 the CPU initiates the computation of the function compute on the target and tells it to work on in1. At line 24 the CPU initiates the computation of compute on the target but tells it to work on in2, whose transfer was initiated at line 18.
The following example double buffers inputs to an offload.
01 module M
02   integer, parameter :: NNN = 100
03   integer, parameter :: count = 25000000
04   integer :: arr(NNN)
05   real :: dd
06   !dir$ attributes offload:mic::arr, dd
07 end module M
08 subroutine do_async_in()
09   !dir$ attributes offload:mic :: compute
10   use m
11   integer i, signal_1, signal_2, iter
12   real, allocatable :: in1(:), in2(:)
13   real, allocatable :: out1(:), out2(:)
14   iter = 10
15   !dir$ offload_transfer target(mic:0) in(in1 : length(count) alloc_if(.false.) free_if(.false.) ) signal(signal_1)
16   do i=1, iter
17     if (mod(i,2) == 1) then   ! odd iterations use in1, which was sent first
18       !dir$ offload_transfer target(mic:0) if(i .ne. iter) in(in2 : length(count) alloc_if(.false.) free_if(.false.) ) signal(signal_2)
19       !dir$ offload target(mic:0) nocopy(in1) wait(signal_1) out(out1 : length(count) alloc_if(.false.) free_if(.false.) )
20       call compute(in1, out1)
21     else
22       !dir$ offload_transfer target(mic:0) if(i .ne. iter) in(in1 : length(count) alloc_if(.false.) free_if(.false.) ) signal(signal_1)
23       !dir$ offload target(mic:0) nocopy(in2) wait(signal_2) out(out2 : length(count) alloc_if(.false.) free_if(.false.) )
24       call compute(in2, out2)
25     endif
26   end do
27 end subroutine do_async_in