Resource management
Software running on the Hexagon DSP can use several different resources: hardware threads for code execution, external memory, caches, HVX, HMX, and VTCM. The DSP supports multiple applications running in parallel, each with multiple threads of execution, and thus the DSP resources must be shared across multiple processes.
While many of these resources are managed automatically by the hardware or operating system, others must be managed explicitly by the application (notably HMX and VTCM). In both cases, many applications are better off explicitly reserving the resources they need, executing one workload at a time at maximum efficiency while all resources are available, and then releasing the resources for other clients to use.
This section discusses how the different resources are shared explicitly or implicitly, the impact on performance, and the use of the Compute Resource Manager to manage resources.
For additional information see:
- Architecture overview for a discussion of Hexagon DSP hardware resources.
- DSP OS for information on the QuRT OS and OS-level resources such as threads.
- HAP Compute Resource Manager API for full documentation on the Compute Resource Manager APIs.
- Feature matrix for an overview of different Hexagon DSP versions in different products and their feature set differences.
Compute Resource Manager
Applications can use the HAP Compute Resource Manager APIs to reserve resources such as HMX and VTCM, and to serialize access to other resources shared automatically. In summary, most clients should do the following:
- Allocate and release all resources they need with one resource manager call.
- Use HAP_compute_res_attr_set_serialize() to serialize their access to DSP resources.
- Enable higher-priority clients to gain access to the shared resources. This step is needed when processing tasks require more than a few milliseconds (about 5 ms) to complete.
Allowing other higher-priority clients to acquire shared resources may be accomplished in one of two ways:
- Prior to Lahaina, the client should release and re-acquire resources frequently, every few milliseconds.
- Starting with Lahaina, the client should instead implement a resource release callback.
For details on the Compute Resource Manager, see the API documentation. The rest of this section discusses resource management performance implications and application design choices.
In most cases, applications should reserve all the resources they require with a single resource manager call for each frame or inference to be processed. This avoids receiving partial allocations and reduces the number of resource manager calls made. However, if a client requires different resources for significantly different time durations, it should make two separate requests instead. For example, a client that requires VTCM throughout the processing of an entire 30 ms frame, but will only use HMX for the first few milliseconds of processing, might allocate HMX separately and release it early for other clients to use. However, HMX is unusable without VTCM, so clients that use all VTCM in the system can keep HMX reserved for the duration of their VTCM reservation without additional impact.
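As an illustration, the following sketch acquires VTCM and HMX together in one request, runs one inference, and releases both. It assumes the attribute/acquire/release calls from HAP_compute_res.h as described in the HAP Compute Resource Manager documentation; the VTCM size and timeout values are illustrative, and exact signatures should be verified against your SDK version.

```c
// Minimal per-inference sketch, assuming HAP_compute_res.h APIs;
// sizes and timeouts are illustrative, not recommendations.
#include "HAP_compute_res.h"

int process_inference(void (*run_inference)(void *vtcm, unsigned int size))
{
    compute_res_attr_t attr;
    unsigned int context_id;

    HAP_compute_res_attr_init(&attr);

    // Request VTCM and HMX together in one call: 256 KB of VTCM
    // as a single page (illustrative size).
    HAP_compute_res_attr_set_vtcm_param(&attr, 256 * 1024, /*b_single_page=*/1);
    HAP_compute_res_attr_set_hmx_param(&attr, /*b_enable=*/1);

    // Serialize with other compute clients while we hold the resources.
    HAP_compute_res_attr_set_serialize(&attr, 1);

    // Wait up to 10 ms for the resources; 0 means the request failed.
    context_id = HAP_compute_res_acquire(&attr, /*timeout_us=*/10000);
    if (context_id == 0)
        return -1;

    void *vtcm = HAP_compute_res_attr_get_vtcm_ptr(&attr);
    run_inference(vtcm, 256 * 1024);

    // Release everything as soon as the inference is done.
    HAP_compute_res_release(context_id);
    return 0;
}
```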
Even clients that do not use explicitly managed resources such as HMX and VTCM can benefit from using the resource manager to serialize resource access. This avoids multiple applications competing to access processor cycles or memory, which can lead to unnecessary cache thrashing and context switches. In most cases it is more efficient to run each workload independently to completion, using all the DSP resources available, instead of running multiple workloads simultaneously on different hardware threads. Clients can reserve resources with the HAP_compute_res_attr_set_serialize() attribute set to serialize access to DSP resources; only one client with the serialize flag will run at a given time.
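For a client that uses no explicitly managed resources, a minimal serialized acquisition might look like the following sketch (signatures assumed per HAP_compute_res.h; the timeout value is illustrative):

```c
// Serialize-only client: no VTCM or HMX is requested; the resource
// manager is used purely so that only one such workload runs at a time.
#include "HAP_compute_res.h"

int run_serialized(void (*workload)(void))
{
    compute_res_attr_t attr;
    HAP_compute_res_attr_init(&attr);
    HAP_compute_res_attr_set_serialize(&attr, 1);

    unsigned int context_id = HAP_compute_res_acquire(&attr, /*timeout_us=*/5000);
    if (context_id == 0)
        return -1;        // timed out waiting for other serialized clients

    workload();           // run to completion using all hardware threads
    HAP_compute_res_release(context_id);
    return 0;
}
```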
Starting with Lahaina, the Compute Resource Manager supports release callbacks. Clients register a callback function with HAP_compute_res_attr_set_release_callback(), and the callback will be called when a higher-priority client requires some or all of the reserved resources. The client must finish its resource use within a short time window (around five milliseconds) and release the resources back to the resource manager so the higher-priority client can acquire them. Clients can attempt to re-acquire resources immediately after releasing them; the resource manager will ensure the highest-priority client gets the resources as they become available.
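The sketch below registers a release callback and checks a flag on tile boundaries so that resources can be released within the expected time window. The callback prototype is assumed from the HAP Compute Resource Manager documentation and may vary across SDK versions.

```c
// Hedged sketch of the Lahaina-and-later release-callback pattern.
#include "HAP_compute_res.h"

static volatile int g_release_requested = 0;

// Called by the resource manager when a higher-priority client needs
// our resources; set a flag and release from the worker thread.
static int release_cb(unsigned int context_id, void *client_context)
{
    (void)context_id;
    (void)client_context;
    g_release_requested = 1;
    return 0;
}

void setup_attr(compute_res_attr_t *attr)
{
    HAP_compute_res_attr_init(attr);
    HAP_compute_res_attr_set_vtcm_param(attr, 256 * 1024, 1);
    HAP_compute_res_attr_set_release_callback(attr, release_cb, NULL);
}

// Worker loop: check the flag between tiles and release within ~5 ms.
void worker(unsigned int context_id, int num_tiles)
{
    for (int tile = 0; tile < num_tiles; tile++) {
        if (g_release_requested) {
            HAP_compute_res_release(context_id);
            // ... re-acquire before continuing with the remaining tiles ...
            break;
        }
        // process_tile(tile);
    }
}
```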
On devices earlier than Lahaina, the resource manager does not support release callbacks. On those devices, clients must periodically release their resources to ensure that they do not starve other higher-priority clients. Typically, this is done on a frame or inference boundary; but for long-running operations, clients might need to release and re-acquire resources more frequently.
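One possible shape for that pattern is sketched below, with hypothetical wrappers standing in for the HAP_compute_res_acquire()/HAP_compute_res_release() flow shown earlier:

```c
// Pre-Lahaina pattern: no release callbacks, so a long-running client
// yields by releasing and re-acquiring on chunk boundaries.
unsigned int acquire_resources(void);   // hypothetical wrapper around HAP_compute_res_acquire()
void release_resources(unsigned int);   // hypothetical wrapper around HAP_compute_res_release()
void process_chunk(int chunk);          // should complete in a few milliseconds

void process_long_job(int num_chunks)
{
    for (int chunk = 0; chunk < num_chunks; chunk++) {
        unsigned int ctx = acquire_resources();
        process_chunk(chunk);       // keep each chunk to a few milliseconds
        release_resources(ctx);     // give higher-priority clients a window
    }
}
```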
On targets where the resource release callback is available through the HAP_compute_res_attr_set_release_callback() API, clients are not required to release resources periodically while they have work items available. Instead, they can continue processing further workloads until they receive a release callback or run out of work. However, it is important to release resources when the client becomes idle to ensure that lower-priority clients can make progress.
For information on which resource manager features are supported on which chipset version, see the Compute Resource Manager documentation and the Feature matrix.
External memory and caches
External memory and DSP internal L1/L2 caches are readily accessible by all active threads. For those resources, your only concern is to ensure that each thread uses these resources as efficiently as possible because they are shared. For example, it is typically preferable to rewrite an algorithm to consume less memory even if doing so does not show any improvement when the algorithm executes in a single-threaded environment; reducing memory bandwidth will reduce the pressure on the memory system and potentially show benefit in a multithreaded context.
Typically, the DSP L2 cache is fully configured as a cache and is managed automatically by the DSP cache controller. For most applications this is appropriate: you should aim to use the cache efficiently, maintaining good cache locality and using prefetch instructions where possible; but, you should let the system manage the cache automatically. However, the L2 cache also supports line locking, where portions of the L2 cache can be locked to specific sections of memory.
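As a simple illustration of cache-friendly code, the sketch below prefetches ahead of a streaming read using the generic __builtin_prefetch compiler builtin (the Hexagon toolchain also exposes the dcfetch instruction through intrinsics). The prefetch distance here is illustrative, not a tuned value.

```c
// Illustrative only: walk a buffer with software prefetch so that
// cache lines arrive before they are needed.
#include <stddef.h>

long sum_with_prefetch(const int *data, size_t n)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        // Prefetch a few cache lines ahead; 64 ints (256 bytes) is an
        // illustrative distance that should be tuned per workload.
        if (i + 64 < n)
            __builtin_prefetch(&data[i + 64]);
        sum += data[i];
    }
    return sum;
}
```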
L2 line locking is primarily required for two use cases: the camera streamer can use locked L2 as its target, and the UBWCDMA operates between external memory and locked L2 cache lines. Applications that require locked L2 cache can use the CDSP L2 Cache locking manager API to lock and unlock parts of the L2 cache. Only a subset of the L2 cache can be locked (for details, see the API).
L2 cache line locking is only available for signed PDs. For discussion on signed vs unsigned PDs, see the system integration document.
HVX
HVX is shared among software threads without any direct intervention. The DSP has several HVX contexts, which the DSP OS allocates automatically to threads as they execute HVX instructions. The OS also saves and restores HVX registers and state as necessary to switch contexts between threads.
Releasing and acquiring an HVX context is costly because HVX registers contain a large amount of data that must be saved and restored. Avoid unnecessary HVX context switches by ensuring that thread priorities are set appropriately, and do not attempt to use more HVX contexts than the system has. Use qurt_hvx_get_units() to query the number of HVX units that are available.
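For example, a client might size its HVX worker pool from that query, as in the sketch below. Per qurt_hvx.h, the return value encodes the available unit configuration; the bit layout used here (128-byte units counted in bits 8 through 15) is an assumption to verify against the QuRT documentation.

```c
// Sketch: query available HVX units before spawning HVX worker threads,
// so the client never oversubscribes HVX contexts.
#include "qurt_hvx.h"

int choose_hvx_thread_count(int requested_threads)
{
    int units = qurt_hvx_get_units();
    if (units <= 0)
        return 0;                 // no HVX available in this process domain

    // Illustrative policy, ASSUMING bits 8..15 count 128-byte units;
    // check the QuRT documentation for the actual encoding.
    int units_128b = (units >> 8) & 0xFF;
    return (requested_threads < units_128b) ? requested_threads : units_128b;
}
```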
HMX
HMX is not shared automatically by the OS or hardware. Instead, clients must use the HAP Compute Resource Manager API to reserve HMX before attempting to use it, and release it promptly after use. For more information on resource management, see Compute Resource Manager above.
Only one application can use HMX at a time, making resource management especially important for HMX-based applications.
Most developers do not use HMX directly, but instead access it through libraries such as QNN. In this case, the library takes care of HMX resource management.
HMX cannot be used without VTCM, so all clients that require HMX must also allocate at least some VTCM by using the resource manager.
VTCM
Like HMX, VTCM is not an automatically shared resource. Instead, clients must use the HAP Compute Resource Manager API to reserve VTCM before using it, and release it promptly when not in use.
Unlike HMX, however, VTCM is a pooled resource: the DSP has a single pool of VTCM, and multiple clients can allocate subsets of it simultaneously. Managing VTCM allocations across applications is important for achieving the best performance.
Many applications perform better when more VTCM is available to them; this is especially true for machine learning runtime libraries. For those applications, the best VTCM management strategy is to allocate all available VTCM, use it together with the other system resources until the workload completes or a higher-priority client requires some of the resources, and then release all resources. This yields the best performance for the application, but it blocks other applications from using VTCM in parallel: they must wait for the current VTCM user to finish before they can allocate any resources, which can lead to delays of several milliseconds.
On devices with critical use cases that require VTCM such as camera streaming, the system integrator can create dedicated VTCM partitions for dedicated use cases. This will reduce the amount of VTCM available for regular applications, but it will ensure that such critical applications always have VTCM available when required. For a discussion on VTCM partitions, see the System integration page.
NOTE: Clients that require HMX must wait for it to become available, even if they have a dedicated VTCM partition available.