aboutsummaryrefslogtreecommitdiffstats
path: root/Documentation/accel/qaic/aic100.rst
diff options
context:
space:
mode:
Diffstat (limited to 'Documentation/accel/qaic/aic100.rst')
-rw-r--r--Documentation/accel/qaic/aic100.rst515
1 files changed, 515 insertions, 0 deletions
diff --git a/Documentation/accel/qaic/aic100.rst b/Documentation/accel/qaic/aic100.rst
new file mode 100644
index 000000000000..590dae77ea12
--- /dev/null
+++ b/Documentation/accel/qaic/aic100.rst
@@ -0,0 +1,515 @@
+.. SPDX-License-Identifier: GPL-2.0-only
+
+===============================
+ Qualcomm Cloud AI 100 (AIC100)
+===============================
+
+Overview
+========
+
+The Qualcomm Cloud AI 100/AIC100 family of products (including SA9000P - part of
+Snapdragon Ride) are PCIe adapter cards which contain a dedicated SoC ASIC for
+the purpose of efficiently running Artificial Intelligence (AI) Deep Learning
+inference workloads. They are AI accelerators.
+
+The PCIe interface of AIC100 is capable of PCIe Gen4 speeds over eight lanes
+(x8). An individual SoC on a card can have up to 16 NSPs for running workloads.
+Each SoC has an A53 management CPU. On card, there can be up to 32 GB of DDR.
+
+Multiple AIC100 cards can be hosted in a single system to scale overall
+performance. AIC100 cards are multi-user capable and able to execute workloads
+from multiple users in a concurrent manner.
+
+Hardware Description
+====================
+
+An AIC100 card consists of an AIC100 SoC, on-card DDR, and a set of misc
+peripherals (PMICs, etc).
+
+An AIC100 card can either be a PCIe HHHL form factor (a traditional PCIe card),
+or a Dual M.2 card. Both use PCIe to connect to the host system.
+
+As a PCIe endpoint/adapter, AIC100 uses the standard VendorID(VID)/
+DeviceID(DID) combination to uniquely identify itself to the host. AIC100
+uses the standard Qualcomm VID (0x17cb). All AIC100 SKUs use the same
+AIC100 DID (0xa100).
+
+AIC100 does not implement FLR (function level reset).
+
+AIC100 implements MSI but does not implement MSI-X. AIC100 prefers 17 MSIs to
+operate (1 for MHI, 16 for the DMA Bridge). Falling back to 1 MSI is possible in
+scenarios where reserving 32 MSIs isn't feasible.
+
+As a PCIe device, AIC100 utilizes BARs to provide host interfaces to the device
+hardware. AIC100 provides 3, 64-bit BARs.
+
+* The first BAR is 4K in size, and exposes the MHI interface to the host.
+
+* The second BAR is 2M in size, and exposes the DMA Bridge interface to the
+ host.
+
+* The third BAR is variable in size based on an individual AIC100's
+ configuration, but defaults to 64K. This BAR currently has no purpose.
+
+From the host perspective, AIC100 has several key hardware components -
+
+* MHI (Modem Host Interface)
+* QSM (QAIC Service Manager)
+* NSPs (Neural Signal Processor)
+* DMA Bridge
+* DDR
+
+MHI
+---
+
+AIC100 has one MHI interface over PCIe. MHI itself is documented at
+Documentation/mhi/index.rst MHI is the mechanism the host uses to communicate
+with the QSM. Except for workload data via the DMA Bridge, all interaction with
+the device occurs via MHI.
+
+QSM
+---
+
+QAIC Service Manager. This is an ARM A53 CPU that runs the primary
+firmware of the card and performs on-card management tasks. It also
+communicates with the host via MHI. Each AIC100 has one of
+these.
+
+NSP
+---
+
+Neural Signal Processor. Each AIC100 has up to 16 of these. These are
+the processors that run the workloads on AIC100. Each NSP is a Qualcomm Hexagon
+(Q6) DSP with HVX and HMX. Each NSP can only run one workload at a time, but
+multiple NSPs may be assigned to a single workload. Since each NSP can only run
+one workload, AIC100 is limited to 16 concurrent workloads. Workload
+"scheduling" is under the purview of the host. AIC100 does not automatically
+timeslice.
+
+DMA Bridge
+----------
+
+The DMA Bridge is custom DMA engine that manages the flow of data
+in and out of workloads. AIC100 has one of these. The DMA Bridge has 16
+channels, each consisting of a set of request/response FIFOs. Each active
+workload is assigned a single DMA Bridge channel. The DMA Bridge exposes
+hardware registers to manage the FIFOs (head/tail pointers), but requires host
+memory to store the FIFOs.
+
+DDR
+---
+
+AIC100 has on-card DDR. In total, an AIC100 can have up to 32 GB of DDR.
+This DDR is used to store workloads, data for the workloads, and is used by the
+QSM for managing the device. NSPs are granted access to sections of the DDR by
+the QSM. The host does not have direct access to the DDR, and must make
+requests to the QSM to transfer data to the DDR.
+
+High-level Use Flow
+===================
+
+AIC100 is a multi-user, programmable accelerator typically used for running
+neural networks in inferencing mode to efficiently perform AI operations.
+AIC100 is not intended for training neural networks. AIC100 can be utilized
+for generic compute workloads.
+
+Assuming a user wants to utilize AIC100, they would follow these steps:
+
+1. Compile the workload into an ELF targeting the NSP(s)
+2. Make requests to the QSM to load the workload and related artifacts into the
+ device DDR
+3. Make a request to the QSM to activate the workload onto a set of idle NSPs
+4. Make requests to the DMA Bridge to send input data to the workload to be
+ processed, and other requests to receive processed output data from the
+ workload.
+5. Once the workload is no longer required, make a request to the QSM to
+ deactivate the workload, thus putting the NSPs back into an idle state.
+6. Once the workload and related artifacts are no longer needed for future
+ sessions, make requests to the QSM to unload the data from DDR. This frees
+ the DDR to be used by other users.
+
+
+Boot Flow
+=========
+
+AIC100 uses a flashless boot flow, derived from Qualcomm MSMs.
+
+When AIC100 is first powered on, it begins executing PBL (Primary Bootloader)
+from ROM. PBL enumerates the PCIe link, and initializes the BHI (Boot Host
+Interface) component of MHI.
+
+Using BHI, the host points PBL to the location of the SBL (Secondary Bootloader)
+image. The PBL pulls the image from the host, validates it, and begins
+execution of SBL.
+
+SBL initializes MHI, and uses MHI to notify the host that the device has entered
+the SBL stage. SBL performs a number of operations:
+
+* SBL initializes the majority of hardware (anything PBL left uninitialized),
+ including DDR.
+* SBL offloads the bootlog to the host.
+* SBL synchronizes timestamps with the host for future logging.
+* SBL uses the Sahara protocol to obtain the runtime firmware images from the
+ host.
+
+Once SBL has obtained and validated the runtime firmware, it brings the NSPs out
+of reset, and jumps into the QSM.
+
+The QSM uses MHI to notify the host that the device has entered the QSM stage
+(AMSS in MHI terms). At this point, the AIC100 device is fully functional, and
+ready to process workloads.
+
+Userspace components
+====================
+
+Compiler
+--------
+
+An open compiler for AIC100 based on upstream LLVM can be found at:
+https://github.com/quic/software-kit-for-qualcomm-cloud-ai-100-cc
+
+Usermode Driver (UMD)
+---------------------
+
+An open UMD that interfaces with the qaic kernel driver can be found at:
+https://github.com/quic/software-kit-for-qualcomm-cloud-ai-100
+
+Sahara loader
+-------------
+
+An open implementation of the Sahara protocol called kickstart can be found at:
+https://github.com/andersson/qdl
+
+MHI Channels
+============
+
+AIC100 defines a number of MHI channels for different purposes. This is a list
+of the defined channels, and their uses.
+
++----------------+---------+----------+----------------------------------------+
+| Channel name | IDs | EEs | Purpose |
++================+=========+==========+========================================+
+| QAIC_LOOPBACK | 0 & 1 | AMSS | Any data sent to the device on this |
+| | | | channel is sent back to the host. |
++----------------+---------+----------+----------------------------------------+
+| QAIC_SAHARA | 2 & 3 | SBL | Used by SBL to obtain the runtime |
+| | | | firmware from the host. |
++----------------+---------+----------+----------------------------------------+
+| QAIC_DIAG | 4 & 5 | AMSS | Used to communicate with QSM via the |
+| | | | DIAG protocol. |
++----------------+---------+----------+----------------------------------------+
+| QAIC_SSR | 6 & 7 | AMSS | Used to notify the host of subsystem |
+| | | | restart events, and to offload SSR |
+| | | | crashdumps. |
++----------------+---------+----------+----------------------------------------+
+| QAIC_QDSS | 8 & 9 | AMSS | Used for the Qualcomm Debug Subsystem. |
++----------------+---------+----------+----------------------------------------+
+| QAIC_CONTROL | 10 & 11 | AMSS | Used for the Neural Network Control |
+| | | | (NNC) protocol. This is the primary |
+| | | | channel between host and QSM for |
+| | | | managing workloads. |
++----------------+---------+----------+----------------------------------------+
+| QAIC_LOGGING | 12 & 13 | SBL | Used by the SBL to send the bootlog to |
+| | | | the host. |
++----------------+---------+----------+----------------------------------------+
+| QAIC_STATUS | 14 & 15 | AMSS | Used to notify the host of Reliability,|
+| | | | Accessibility, Serviceability (RAS) |
+| | | | events. |
++----------------+---------+----------+----------------------------------------+
+| QAIC_TELEMETRY | 16 & 17 | AMSS | Used to get/set power/thermal/etc |
+| | | | attributes. |
++----------------+---------+----------+----------------------------------------+
+| QAIC_DEBUG | 18 & 19 | AMSS | Not used. |
++----------------+---------+----------+----------------------------------------+
+| QAIC_TIMESYNC | 20 & 21 | SBL | Used to synchronize timestamps in the |
+| | | | device side logs with the host time |
+| | | | source. |
++----------------+---------+----------+----------------------------------------+
+| QAIC_TIMESYNC | 22 & 23 | AMSS | Used to periodically synchronize |
+| _PERIODIC | | | timestamps in the device side logs with|
+| | | | the host time source. |
++----------------+---------+----------+----------------------------------------+
+
+DMA Bridge
+==========
+
+Overview
+--------
+
+The DMA Bridge is one of the main interfaces to the host from the device
+(the other being MHI). As part of activating a workload to run on NSPs, the QSM
+assigns that network a DMA Bridge channel. A workload's DMA Bridge channel
+(DBC for short) is solely for the use of that workload and is not shared with
+other workloads.
+
+Each DBC is a pair of FIFOs that manage data in and out of the workload. One
+FIFO is the request FIFO. The other FIFO is the response FIFO.
+
+Each DBC contains 4 registers in hardware:
+
+* Request FIFO head pointer (offset 0x0). Read only by the host. Indicates the
+ latest item in the FIFO the device has consumed.
+* Request FIFO tail pointer (offset 0x4). Read/write by the host. Host
+ increments this register to add new items to the FIFO.
+* Response FIFO head pointer (offset 0x8). Read/write by the host. Indicates
+ the latest item in the FIFO the host has consumed.
+* Response FIFO tail pointer (offset 0xc). Read only by the host. Device
+ increments this register to add new items to the FIFO.
+
+The values in each register are indexes in the FIFO. To get the location of the
+FIFO element pointed to by the register: FIFO base address + register * element
+size.
+
+DBC registers are exposed to the host via the second BAR. Each DBC consumes
+4KB of space in the BAR.
+
+The actual FIFOs are backed by host memory. When sending a request to the QSM
+to activate a network, the host must donate memory to be used for the FIFOs.
+Due to internal mapping limitations of the device, a single contiguous chunk of
+memory must be provided per DBC, which hosts both FIFOs. The request FIFO will
+consume the beginning of the memory chunk, and the response FIFO will consume
+the end of the memory chunk.
+
+Request FIFO
+------------
+
+A request FIFO element has the following structure:
+
+.. code-block:: c
+
+ struct request_elem {
+ u16 req_id;
+ u8 seq_id;
+ u8 pcie_dma_cmd;
+ u32 reserved;
+ u64 pcie_dma_source_addr;
+ u64 pcie_dma_dest_addr;
+ u32 pcie_dma_len;
+ u32 reserved;
+ u64 doorbell_addr;
+ u8 doorbell_attr;
+ u8 reserved;
+ u16 reserved;
+ u32 doorbell_data;
+ u32 sem_cmd0;
+ u32 sem_cmd1;
+ u32 sem_cmd2;
+ u32 sem_cmd3;
+ };
+
+Request field descriptions:
+
+req_id
+ request ID. A request FIFO element and a response FIFO element with
+ the same request ID refer to the same command.
+
+seq_id
+ sequence ID within a request. Ignored by the DMA Bridge.
+
+pcie_dma_cmd
+ describes the DMA element of this request.
+
+ * Bit(7) is the force msi flag, which overrides the DMA Bridge MSI logic
+ and generates a MSI when this request is complete, and QSM
+ configures the DMA Bridge to look at this bit.
+ * Bits(6:5) are reserved.
+ * Bit(4) is the completion code flag, and indicates that the DMA Bridge
+ shall generate a response FIFO element when this request is
+ complete.
+ * Bit(3) indicates if this request is a linked list transfer(0) or a bulk
+ transfer(1).
+ * Bit(2) is reserved.
+ * Bits(1:0) indicate the type of transfer. No transfer(0), to device(1),
+ from device(2). Value 3 is illegal.
+
+pcie_dma_source_addr
+ source address for a bulk transfer, or the address of the linked list.
+
+pcie_dma_dest_addr
+ destination address for a bulk transfer.
+
+pcie_dma_len
+ length of the bulk transfer. Note that the size of this field
+ limits transfers to 4G in size.
+
+doorbell_addr
+ address of the doorbell to ring when this request is complete.
+
+doorbell_attr
+ doorbell attributes.
+
+ * Bit(7) indicates if a write to a doorbell is to occur.
+ * Bits(6:2) are reserved.
+ * Bits(1:0) contain the encoding of the doorbell length. 0 is 32-bit,
+ 1 is 16-bit, 2 is 8-bit, 3 is reserved. The doorbell address
+ must be naturally aligned to the specified length.
+
+doorbell_data
+ data to write to the doorbell. Only the bits corresponding to
+ the doorbell length are valid.
+
+sem_cmdN
+ semaphore command.
+
+ * Bit(31) indicates this semaphore command is enabled.
+ * Bit(30) is the to-device DMA fence. Block this request until all
+ to-device DMA transfers are complete.
+ * Bit(29) is the from-device DMA fence. Block this request until all
+ from-device DMA transfers are complete.
+ * Bits(28:27) are reserved.
+ * Bits(26:24) are the semaphore command. 0 is NOP. 1 is init with the
+ specified value. 2 is increment. 3 is decrement. 4 is wait
+ until the semaphore is equal to the specified value. 5 is wait
+ until the semaphore is greater or equal to the specified value.
+ 6 is "P", wait until semaphore is greater than 0, then
+ decrement by 1. 7 is reserved.
+ * Bit(23) is reserved.
+ * Bit(22) is the semaphore sync. 0 is post sync, which means that the
+ semaphore operation is done after the DMA transfer. 1 is
+ presync, which gates the DMA transfer. Only one presync is
+ allowed per request.
+ * Bit(21) is reserved.
+ * Bits(20:16) is the index of the semaphore to operate on.
+ * Bits(15:12) are reserved.
+ * Bits(11:0) are the semaphore value to use in operations.
+
+Overall, a request is processed in 4 steps:
+
+1. If specified, the presync semaphore condition must be true
+2. If enabled, the DMA transfer occurs
+3. If specified, the postsync semaphore conditions must be true
+4. If enabled, the doorbell is written
+
+By using the semaphores in conjunction with the workload running on the NSPs,
+the data pipeline can be synchronized such that the host can queue multiple
+requests of data for the workload to process, but the DMA Bridge will only copy
+the data into the memory of the workload when the workload is ready to process
+the next input.
+
+Response FIFO
+-------------
+
+Once a request is fully processed, a response FIFO element is generated if
+specified in pcie_dma_cmd. The structure of a response FIFO element:
+
+.. code-block:: c
+
+ struct response_elem {
+ u16 req_id;
+ u16 completion_code;
+ };
+
+req_id
+ matches the req_id of the request that generated this element.
+
+completion_code
+ status of this request. 0 is success. Non-zero is an error.
+
+The DMA Bridge will generate a MSI to the host as a reaction to activity in the
+response FIFO of a DBC. The DMA Bridge hardware has an IRQ storm mitigation
+algorithm, where it will only generate a MSI when the response FIFO transitions
+from empty to non-empty (unless force MSI is enabled and triggered). In
+response to this MSI, the host is expected to drain the response FIFO, and must
+take care to handle any race conditions between draining the FIFO, and the
+device inserting elements into the FIFO.
+
+Neural Network Control (NNC) Protocol
+=====================================
+
+The NNC protocol is how the host makes requests to the QSM to manage workloads.
+It uses the QAIC_CONTROL MHI channel.
+
+Each NNC request is packaged into a message. Each message is a series of
+transactions. A passthrough type transaction can contain elements known as
+commands.
+
+QSM requires NNC messages be little endian encoded and the fields be naturally
+aligned. Since there are 64-bit elements in some NNC messages, 64-bit alignment
+must be maintained.
+
+A message contains a header and then a series of transactions. A message may be
+at most 4K in size from QSM to the host. From the host to the QSM, a message
+can be at most 64K (maximum size of a single MHI packet), but there is a
+continuation feature where message N+1 can be marked as a continuation of
+message N. This is used for exceedingly large DMA xfer transactions.
+
+Transaction descriptions
+------------------------
+
+passthrough
+ Allows userspace to send an opaque payload directly to the QSM.
+ This is used for NNC commands. Userspace is responsible for managing
+ the QSM message requirements in the payload.
+
+dma_xfer
+ DMA transfer. Describes an object that the QSM should DMA into the
+ device via address and size tuples.
+
+activate
+ Activate a workload onto NSPs. The host must provide memory to be
+ used by the DBC.
+
+deactivate
+ Deactivate an active workload and return the NSPs to idle.
+
+status
+ Query the QSM about it's NNC implementation. Returns the NNC version,
+ and if CRC is used.
+
+terminate
+ Release a user's resources.
+
+dma_xfer_cont
+ Continuation of a previous DMA transfer. If a DMA transfer
+ cannot be specified in a single message (highly fragmented), this
+ transaction can be used to specify more ranges.
+
+validate_partition
+ Query to QSM to determine if a partition identifier is valid.
+
+Each message is tagged with a user id, and a partition id. The user id allows
+QSM to track resources, and release them when the user goes away (eg the process
+crashes). A partition id identifies the resource partition that QSM manages,
+which this message applies to.
+
+Messages may have CRCs. Messages should have CRCs applied until the QSM
+reports via the status transaction that CRCs are not needed. The QSM on the
+SA9000P requires CRCs for black channel safing.
+
+Subsystem Restart (SSR)
+=======================
+
+SSR is the concept of limiting the impact of an error. An AIC100 device may
+have multiple users, each with their own workload running. If the workload of
+one user crashes, the fallout of that should be limited to that workload and not
+impact other workloads. SSR accomplishes this.
+
+If a particular workload crashes, QSM notifies the host via the QAIC_SSR MHI
+channel. This notification identifies the workload by it's assigned DBC. A
+multi-stage recovery process is then used to cleanup both sides, and get the
+DBC/NSPs into a working state.
+
+When SSR occurs, any state in the workload is lost. Any inputs that were in
+process, or queued by not yet serviced, are lost. The loaded artifacts will
+remain in on-card DDR, but the host will need to re-activate the workload if
+it desires to recover the workload.
+
+Reliability, Accessibility, Serviceability (RAS)
+================================================
+
+AIC100 is expected to be deployed in server systems where RAS ideology is
+applied. Simply put, RAS is the concept of detecting, classifying, and
+reporting errors. While PCIe has AER (Advanced Error Reporting) which factors
+into RAS, AER does not allow for a device to report details about internal
+errors. Therefore, AIC100 implements a custom RAS mechanism. When a RAS event
+occurs, QSM will report the event with appropriate details via the QAIC_STATUS
+MHI channel. A sysadmin may determine that a particular device needs
+additional service based on RAS reports.
+
+Telemetry
+=========
+
+QSM has the ability to report various physical attributes of the device, and in
+some cases, to allow the host to control them. Examples include thermal limits,
+thermal readings, and power readings. These items are communicated via the
+QAIC_TELEMETRY MHI channel.