System optimizations
This page discusses OS and interprocessor considerations that impact performance. For more information on how to optimize algorithmic code running on the DSP, see the DSP optimization page instead.
Offload tasks onto the DSP
Using the DSP offers several benefits:
- Compared to the CPU, the Hexagon DSP typically consumes much less power and is less susceptible to thermal concerns.
- In many cases that vectorize well on HVX, the DSP performs the same computations in less time (while at a lower clock) than multiple CPU cores.
- Moving large blocks of computational software to the DSP keeps the CPU free for other tasks, including those that run well only on the CPU.
Separately, the DSP is best suited for signal-processing tasks and excels at any type of operation that can be parallelized. Running such tasks on the DSP uses the processor to its full potential and yields significant power savings.
In summary, prioritize moving large signal-processing tasks onto the DSP and let the CPU run the control-oriented code and short individual processing functions.
IPC performance considerations
Communication between the CPU and DSP is performed through shared memory with interrupts. Offloading tasks from the CPU onto the DSP comes with a communication overhead.
Because the CPU and DSP do not share a cache, maintenance operations are required on all buffers transacted between them. These operations can take a minimum of a few hundred microseconds (on DSPs without hardware IO-coherence with the CPU L2 cache, as discussed later in this section). Depending on system clock settings and CPU sleep modes enabled, the overhead for each invocation to the DSP could extend to several milliseconds. Hence, it is preferable to offload large tasks onto the DSP instead of invoking the DSP for small trivial tasks. To understand and improve overall performance, it is important to know what factors contribute to this overhead.
FastRPC latency
The latency of a FastRPC synchronous call is the amount of time from when the CPU thread initiates a call to the DSP until it can resume its operation, minus the time the DSP spends executing the task itself. Under optimized conditions, the average FastRPC round-trip latency is on the order of 200 to 700 microseconds on the latest targets. For consistent results, measure the average FastRPC latency over multiple RPC calls rather than a single call, because it depends on variable contributions such as CPU wakeup and scheduler delays.
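For illustration, a host-side measurement loop could look like the following sketch. Here, dsp_noop() is a hypothetical stub for a trivial method in your own FastRPC interface (not an SDK function); because its body does almost no work, the measured round trip approximates the FastRPC overhead.

#include <stdio.h>
#include <time.h>

// Hypothetical QAIC-generated stub for a trivial method in your own IDL interface.
extern int dsp_noop(void);

static double elapsed_us(struct timespec a, struct timespec b) {
    return (b.tv_sec - a.tv_sec) * 1e6 + (b.tv_nsec - a.tv_nsec) / 1e3;
}

int main(void) {
    struct timespec start, end;
    const int iterations = 1000;
    double total_us = 0.0;
    dsp_noop(); // warm-up call: loads the DSP module and opens the session
    for (int i = 0; i < iterations; i++) {
        clock_gettime(CLOCK_MONOTONIC, &start);
        dsp_noop(); // near-empty call, so the round trip approximates the overhead
        clock_gettime(CLOCK_MONOTONIC, &end);
        total_us += elapsed_us(start, end);
    }
    printf("Average FastRPC round trip: %.1f us\n", total_us / iterations);
    return 0;
}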
Reduce FastRPC overhead
We begin with a brief list of recommendations for achieving the best FastRPC performance and then detail each of the main factors that contribute to it.
Recommendations for best FastRPC performance
- Allocate ION buffers using the RPCMEM APIs
- Enable FastRPC QoS mode with a low latency tolerance using the remote APIs
- For best performance, use the PM_QOS mode and a recommended QoS latency of 100 microseconds
- Use HAP_power APIs to vote for DSP clocks and DCVS according to requirements
- For best performance, vote for TURBO or TURBO_L1, and vote for a 40 microsecond DSP sleep latency
- Make sure you are using an Android kernel build with performance kernel defconfig settings
- Use early wakeup hints through the HAP_send_early_signal API if possible
ION buffers
To achieve low FastRPC overhead, it is important to use ION buffers (available on Android targets), which do not require an extra copy when shared between the CPU and DSP. Each non-ION buffer passed to the DSP in a FastRPC call is automatically copied by the FastRPC framework, resulting in higher FastRPC overhead when large buffers are used.
Register ION buffers with the FastRPC library. Otherwise, the driver treats unregistered ION buffers as non-ION buffers, which results in an extra copy.
The RPCMEM library provides an API for allocating shared buffers and automatically registering buffers with the FastRPC library. For more information, see RPCMEM API.
Pre-allocated ION buffers can be directly registered with the FastRPC library using the remote_register_buf() function defined as part of the remote interface. For more information, see remote API.
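As a hedged illustration (bufSize, extBuf, extBufSize, and extBufFd are placeholder names, not SDK identifiers), allocating a shared buffer with RPCMEM and registering an externally allocated ION buffer could look like this:

#include "remote.h"
#include "rpcmem.h"
// Allocate a shared ION buffer that FastRPC can pass to the DSP without an extra copy.
uint8_t *buf = (uint8_t *)rpcmem_alloc(RPCMEM_HEAP_ID_SYSTEM, RPCMEM_DEFAULT_FLAGS, bufSize);
// An ION buffer allocated outside rpcmem_alloc() must be registered explicitly;
// otherwise the driver treats it as non-ION and copies it on every call.
remote_register_buf(extBuf, extBufSize, extBufFd);
//... FastRPC calls using buf and extBuf as parameters ...
remote_register_buf(extBuf, extBufSize, -1); // an fd of -1 unregisters the buffer
rpcmem_free(buf);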
Cache coherence
Coherency ensures that all processors see the latest data when accessing shared buffers through their respective caches. The FastRPC driver maintains cache coherency between the CPU and DSP for shared buffers that are accessed in a FastRPC call. Hardware-based IO coherency for the CPU is supported on most of the recent Snapdragon chipsets, which helps to reduce FastRPC latency significantly from several milliseconds to approximately one millisecond.
IO coherence, also called one-way coherence, allows a DSP that supports it to access the CPU caches on its load and store operations, so coherence is maintained continuously while the DSP operates on shared buffers. For example, when the DSP populates a shared coherent output buffer, the CPU can read the data immediately without invalidating cache lines.
IO coherency is enabled by default for all buffers. On chipsets without hardware support for IO coherency, the FastRPC software driver invalidates and flushes CPU cache lines as necessary to ensure coherency, which results in higher latency. For details, see the feature matrix.
The FastRPC software driver also invalidates and flushes DSP cache lines to ensure DSP cache coherency. Cache maintenance time varies and depends on the DSP clocks and size of total buffers used in an RPC call. The driver cleans all cache lines of a user process (instead of cleaning by buffer addresses and lines) when the total size of shared buffers to be cleaned exceeds 1 MB.
CPU wakeup and scheduling delays
Idle CPU cores can enter low-power sleep modes to save power after sending a message to the DSP. Some of these sleep modes shut down the L1/L2 caches and core clocks. The resulting CPU wakeup delay varies with the current system load and the sleep mode, and can reach several hundred microseconds.
After receiving a response from the DSP, the CPU handles the response in the interrupt handler and sets a signal to the actual thread waiting for job completion. The CPU scheduler is invoked to schedule the actual waiting thread for further processing; this can add variable latency to the FastRPC overhead.
The Hexagon SDK allows the user to select CPU modes that help manage FastRPC performance in typical conditions by disabling certain sleep states. These modes, referred to as PM_QOS and ADAPT_QOS, mitigate the CPU wakeup latency and may be selected using the remote_handle_control() APIs from the remote library. When selecting one of these modes, the user also specifies a wakeup latency, which the driver tries to satisfy using the features and techniques available on a given target.
Note: There is no guarantee that the driver can meet the requested latency.
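As a sketch, the following enables the PM_QOS mode with a 100 microsecond latency tolerance. It assumes the DSPRPC_CONTROL_LATENCY request ID and the remote_rpc_control_latency structure declared in remote.h on recent SDK versions; verify the exact names and accepted values against your SDK.

#include "remote.h"
// Request the PM_QOS mode with a 100 microsecond wakeup latency tolerance.
// RPC_ADAPTIVE_QOS selects the ADAPT_QOS mode instead.
struct remote_rpc_control_latency qos;
qos.enable = RPC_PM_QOS;
qos.latency = 100; // requested latency tolerance, in microseconds
int err = remote_handle_control(DSPRPC_CONTROL_LATENCY, (void *)&qos, sizeof(qos));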
Another way of reducing the wakeup latency is for the DSP to send anticipatory early completion signals that prompt the CPU to wake up. This approach is suited for situations in which the developer can determine when the DSP is about to complete its assigned task. If the CPU wakes up before the DSP has completed the task, it polls continuously until the FastRPC call completes. This feature is currently supported with the HAP_send_early_signal API, which replaces the deprecated fastrpc_send_early_signal API; both are defined in $HEXAGON_SDK_ROOT/incs/HAP_ps.h.
Pre-map buffers to the DSP
The FastRPC driver supports transient and persistent mapping of a buffer to the remote DSP. By default, the FastRPC driver maps and unmaps buffers passed as arguments of a FastRPC invocation at the beginning and end of that invocation, respectively, consuming several microseconds in each invocation. On SM8150, for example, mapping and unmapping a large buffer was observed to add roughly 20 microseconds of overhead on average when the L2 cache had been fully evicted by the previous RPC call.
The FastRPC library supports the RPCMEM_TRY_MAP_STATIC flag for implicitly mapping a buffer to the DSP during allocation. The library tries to map buffers allocated with this flag to the remote process of all current and new FastRPC sessions. If the mapping fails, the FastRPC library ignores the error and continues to open the session without pre-mapping the buffer. If it succeeds, buffers allocated with this flag are pre-mapped, reducing the latency of upcoming FastRPC calls. Pre-mapped buffers are automatically unmapped when the buffer is freed or the session is closed.
Note: The RPC memory flag RPCMEM_TRY_MAP_STATIC and the buffer attribute FASTRPC_ATTR_TRY_MAP_STATIC are supported on Lahaina and later targets only. Older targets ignore these flags, so they have no impact on latency there.
The FastRPC driver searches the static mapping list for each buffer passed to a remote call and reassigns the same virtual address on the DSP. As a result, pre-mapping many buffers can add extra latency to all FastRPC calls. Hence, pre-mapping a buffer with RPCMEM_TRY_MAP_STATIC is recommended only for large buffers used in latency-critical RPC calls, and only after profiling and estimating the actual reduction in FastRPC overhead.
Pre-map during memory allocation with the RPCMEM_TRY_MAP_STATIC flag using the following approach:
#include "rpcmem.h"
src = (uint8_t *)rpcmem_alloc(RPCMEM_HEAP_ID_SYSTEM, RPCMEM_DEFAULT_FLAGS | RPCMEM_TRY_MAP_STATIC, srcSize);
//... RPC calls with src buffer as parameter ...
rpcmem_free(src);
When an ION buffer is allocated without using the rpcmem_alloc() function, register the buffer with the FASTRPC_ATTR_TRY_MAP_STATIC attribute to pre-map it:
#include "remote.h"
remote_register_buf_attr(buffer, size, fd, FASTRPC_ATTR_TRY_MAP_STATIC);
//... RPC calls with buffer as parameter ...
remote_register_buf(buffer, size, -1); // the -1 argument results in unregistering the buffer
For more information, refer to the rpcmem API and remote API.
CPU build configuration
CPU software builds on development platforms might have additional logging and debugging code built in, which can significantly impact FastRPC performance. QTI recommends measuring FastRPC overhead with full performance builds; production devices always use performance builds. Development builds should use performance kernel defconfig settings.
DSP sleep latency
The DSP can enter a low-power sleep mode after sending a response to the CPU if it becomes idle with no further jobs to process. The time needed to bring the DSP out of sleep mode after it receives an interrupt from the CPU adds latency to the FastRPC overhead for the newly offloaded job.
The impact of sleep latency on FastRPC latency is higher when jobs are submitted periodically and the idle DSP enters sleep mode between jobs. If jobs are submitted back-to-back, the DSP might not go to sleep and the latency impact will be minimal.
The FastRPC driver votes for a default sleep latency during session open. QTI recommends overriding the sleep latency setting based on the use case and latency requirement. For more information, see the HAP power API.
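As a hedged DSP-side sketch combining the TURBO clock vote and the 40 microsecond sleep latency vote recommended earlier (the structure and enum names follow the HAP_power_set_DCVS_v2 request in HAP_power.h; confirm them against your SDK version):

#include <string.h>
#include "HAP_power.h"

// Vote for TURBO voltage corners and a 40 microsecond DSP sleep latency.
// 'ctx' is any unique pointer identifying this client to the power manager.
static int vote_for_performance(void *ctx)
{
    HAP_power_request_t request;
    memset(&request, 0, sizeof(request));
    request.type = HAP_power_set_DCVS_v2;
    request.dcvs_v2.dcvs_enable = 1;
    request.dcvs_v2.dcvs_option = HAP_DCVS_V2_PERFORMANCE_MODE;
    request.dcvs_v2.set_latency = 1;
    request.dcvs_v2.latency = 40; // sleep latency vote, in microseconds
    request.dcvs_v2.set_dcvs_params = 1;
    request.dcvs_v2.dcvs_params.target_corner = HAP_DCVS_VCORNER_TURBO;
    request.dcvs_v2.dcvs_params.min_corner = HAP_DCVS_VCORNER_DISABLE;
    request.dcvs_v2.dcvs_params.max_corner = HAP_DCVS_VCORNER_MAX;
    return HAP_power_set(ctx, &request);
}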
Speed and power management
Controlling the clock speed of the various performance-critical components on the chip allows the application to trade power for speed. The Hexagon SDK provides a set of APIs that allow the DSP to control its own performance and power modes. To learn more about these APIs, see Performance and power manager.