Asynchronous DSP packet queue example

Overview

The dspqueue example illustrates using the Asynchronous DSP packet queue for communication between the host CPU and the cDSP.

The API and the example are supported on Android target only, for Lahaina and later products. See the feature matrix for details on feature support.

Building and running the example

The dspqueue example includes a walkthrough script called dspqueue_walkthrough.py. Please review the generic setup and walkthrough_scripts instructions to learn more about setting up your device and using walkthrough scripts. Walkthrough script automates building and the example as discussed in this section.

Without the walkthrough script, the example can be built like other SDK examples, using make commands such as these:

make android BUILD=Debug
make hexagon BUILD=Debug DSP_ARCH=v68

For more information on the build syntax, please refer to the building reference instructions.

The packet queue API is only supported on target, so the example cannot be run on the simulator. To run the example on a target, push the generated binaries to the device:

adb push android_Debug_aarch64/ship/dspqueue_sample /vendor/bin/
adb shell chmod 777 /vendor/bin/dspqueue_sample
adb push hexagon_Debug_toolv87_v68/ship/libdspqueue_sample_skel.so /vendor/lib/rfsa/adsp/

After this simply execute dspqueue_sample on the target:

adb shell dspqueue_sample

Note that by default the example runs in an unsigned PD.

Example application details

The example application contains two separate test scenarios. The following sections discuss these and any common elements in detail.

Echo Test

The echo test illustrates a basic dspqueue scenario: A host CPU application sends requests to the DSP, which responds back by echoing the same message. The test is driven by echo_test() in src/dspqueue_sample.c. This section discusses the key steps in detail.

dspqueue_create(DSPQUEUE_CDSP, 
                0, // Flags
                256, // Request queue size
                256, // Response queue size
                packet_callback,
                error_callback_fatal,
                (void*)c,  // Callback context
                &queue);

This call creates a new queue. This example uses a relatively small queue (256 bytes for both requests and responses); for use cases with larger messages a larger queue can yield better performance. The two callback functions are used for handling responses from the DSP (packet_callback()) and error handling (error_callback_fatal()). In this example the error callback simply terminates the process with an error.

dspqueue_export(queue, &dsp_queue_id);
dspqueue_sample_start(sample_handle, dsp_queue_id);

These calls export the queue for use on the DSP, and pass the queue ID to the DSP side of the test application. The DSP side is implemented in src/dspqueue_sample_imp.c, with the IDL interface defined in inc/dspqueue_sample.idl as is typical for Hexagon SDK applications. The example uses a regular synchronous FastRPC call to pass the queue ID to the DSP and start the use case - this is common for packet queue applications.

The DSP side implementation uses the queue ID to create a local handle to the same queue:

dspqueue_import(dsp_queue_id, // Queue ID from dspqueue_export
                sample_packet_callback, // Packet callback
                sample_error_callback, // Error callback; no errors expected on the DSP
                (void*)c, // Callback context
                &c->queue);

The DSP implementation has its own set of callback functions. All requests are processed in the packet callback (sample_packet_fallback()). The current implementation does not use a DSP-side error callback, but it is included for completeness.

After the queue has been created and connected, the host CPU sends a sequence of requests to the DSP:

dspqueue_write(queue,
               0, // Flags
               0, NULL, // No buffer references in this packet
               2 * sizeof(uint32_t), (const uint8_t*)msg, // Message
               1000000); // Timeout

The packets consist of two 32-bit words, with no buffer references. The call uses a one-second timeout to catch error situations where the DSP may have stopped responding; alternatively clients can use DSPQUEUE_TIMEOUT_NONE.

On the DSP side incoming request packets are handled in the packet callback. The callback function structure is typical of most dspqueue clients:

while ( 1 ) {
    //...
    err = dspqueue_read_noblock(queue,
                                &flags,
                                SAMPLE_MAX_PACKET_BUFFERS, // Maximum number of buffer references
                                &num_bufs, // Number of buffer references
                                bufs, // Buffer references
                                sizeof(msg), // Max message length
                                &msg_length, // Message length
                                (uint8_t*)msg); // Message
    if ( err == AEE_EWOULDBLOCK ) {
        return;
    }
    switch ( msg[0] ) {
        case SAMPLE_MSG_ECHO:
            resp_msg[0] = SAMPLE_MSG_ECHO_RESP;
            resp_msg[1] = msg[1];
            dspqueue_write(queue,
                           0, // Flags
                           0, NULL, // No buffers
                           8, (const uint8_t*)resp_msg, // Message
                           DSPQUEUE_TIMEOUT_NONE);
        break;
            //...
    }
}

The callback repeatedly reads packets from the queue until it is empty (AEE_EWOULDBLOCK). This ensures all packets are consumed - the client does not necessarily get a separate callback for each packet, and multiple new packets can arrive while a previous callback is being handled.

In this example each received packet is handled directly in the callback function; for the echo test the DSP simply sends the same payload word back to the CPU. Real-world clients would likely start processing in background worker threads, and eventually send a response back to the CPU once all work is complete.

The host CPU receives response packets in a packet callback function virtually identical to the DSP one discussed above.

Finally at the end of the test the queue is closed first on the DSP followed by the host CPU:

dspqueue_sample_stop(sample_handle);
dspqueue_close(queue);

Data Processing Test

The second test in the dspqueue example illustrates using a packet queue to send data processing requests to the DSP. Its structure is very similar to the echo test, so this section only discusses the differences between the two.

The test uses a set of shared memory buffers to send input data and receive output from the DSP. The buffers are allocated and mapped to the DSP before processing can start:

buffers[i] = rpcmem_alloc(RPCMEM_HEAP_ID_SYSTEM, RPCMEM_DEFAULT_FLAGS,
                          BUFFER_SIZE);
fds[i] = rpcmem_to_fd(buffers[i]);
fastrpc_mmap(CDSP_DOMAIN_ID, fds[i], buffers[i], 0, BUFFER_SIZE, FASTRPC_MAP_FD);

rpcmem_alloc() allocates a shareable ION buffer using the RPCMEM library, rpcmem_to_fd() retrieves its corresponding File Descriptor number, and finally fastrpc_mmap() maps it to the DSP. In all calls BUFFER_SIZE is the size of the buffer in bytes.

Queue creation is similar to the echo test, except the processing test uses a larger queue to account for larger packets with buffer references:

dspqueue_create(DSPQUEUE_CDSP,
                0, // Flags
                4096, // Request queue size
                4096, // Response queue size
                packet_callback,
                error_callback_fatal,
                (void*)c,  // Callback context
                &queue);

Each request packet sent to the DSP includes two buffer references: One for input data, and one to hold output from the DSP:

struct dspqueue_buffer bufs[2];
memset(bufs, 0, sizeof(bufs));
bufs[0].fd = fds[input_buf];
bufs[0].flags = (DSPQUEUE_BUFFER_FLAG_REF | // Take a reference
                 DSPQUEUE_BUFFER_FLAG_FLUSH_SENDER | // Flush CPU
                 DSPQUEUE_BUFFER_FLAG_INVALIDATE_RECIPIENT); // Invalidate DSP
bufs[1].fd = fds[output_buf];
bufs[1].flags = (DSPQUEUE_BUFFER_FLAG_REF |
                 DSPQUEUE_BUFFER_FLAG_FLUSH_SENDER);
dspqueue_write(queue,
               0, // Flags - the framework will update this
               2, bufs, // Buffer references
               sizeof(msg), (const uint8_t*)msg, // Message
               1000000); // Timeout

On the host CPU side the example requests the framework to take references to both buffers, flush the buffers on the CPU side to ensure all input data is visible to other devices in the system and no dirty data remains in the caches for the output buffer, and invalidate it on the DSP to ensure it sees an up to date version in memory. The DSP will trigger cache maintenance operations on the output buffer as a part of its response packet.

On the DSP side the packet callback function (sample_packet_callback()) retrieves buffer information from the incoming packet, runs the processing algorithm, and constructs a response:

case SAMPLE_MSG_BYTE_SQUARE:
    len = bufs[0].size;
    byte_square(bufs[0].ptr, bufs[1].ptr, len);
    resp_msg[0] = SAMPLE_MSG_BYTE_SQUARE_RESP;
    memset(resp_bufs, 0, sizeof(resp_bufs));
    resp_bufs[0].fd = bufs[0].fd;
    resp_bufs[0].flags = DSPQUEUE_BUFFER_FLAG_DEREF; // Release reference
    // (If we had written to the input buffer, we'd also need to flush it)
    resp_bufs[1].fd = bufs[1].fd;
    resp_bufs[1].flags = (DSPQUEUE_BUFFER_FLAG_DEREF | // Release reference
                          DSPQUEUE_BUFFER_FLAG_FLUSH_SENDER | // Flush DSP
                          DSPQUEUE_BUFFER_FLAG_INVALIDATE_RECIPIENT); // Invalidate CPU
    dspqueue_write(queue,
                   0, // Flags
                   2, resp_bufs, // Buffer references
                   4, (const uint8_t*)resp_msg, // Message
                   DSPQUEUE_TIMEOUT_NONE);

Note that the response packet also includes a reference to the input buffer to release the reference taken in the request packet. The response also triggers cache maintenance operations for the output buffer to ensure the results are visible to the host CPU.

At the end of the test the processing test unmaps and frees buffers.

Finally the processing test also illustrates how to use early wakeup packets to reduce latency between the DSP and host CPU. See Early Wakeup for more discussion on how and when to use early wakeup packets.

Performance Measurements

The dspqueue example application performs some performance measurements to illustrate when using the packet queue can be more efficient than regular synchronous FastRPC calls. For each scenario the application measures and prints three sets of numbers for each configuration:

Total elapsed time
DSP processing time: The time taken on the DSP to run the processing algorithm
Overhead: The difference between DSP processing time and total elapsed time, divided by the number of operations. This accounts for inter-processor communication, cache maintenance operations, and other overhead from the framework.

The exact values seen will vary on depending on the target device used for testing, but generally the following results should be visible:

For individual operations without early wakeup the packet queue is not more efficient than synchronous FastRPC calls
Early wakeup can significantly reduce the latency for single-shot operations
The packet queue can have significantly lower overhead than synchronous FastRPC calls when the application can queue a larger number of requests.

Additionally, the example also illustrates using persistently mapped buffers to reduce overhead for regular synchronous FastRPC calls.