Diffstat (limited to 'Documentation/sparc/oradax/oracle-dax.txt')
-rw-r--r-- Documentation/sparc/oradax/oracle-dax.txt 429
1 files changed, 0 insertions, 429 deletions
diff --git a/Documentation/sparc/oradax/oracle-dax.txt b/Documentation/sparc/oradax/oracle-dax.txt
deleted file mode 100644
index 9d53ac93286f..000000000000
--- a/Documentation/sparc/oradax/oracle-dax.txt
+++ /dev/null
@@ -1,429 +0,0 @@
-Oracle Data Analytics Accelerator (DAX)
----------------------------------------
-
-DAX is a coprocessor which resides on the SPARC M7 (DAX1) and M8
-(DAX2) processor chips, and has direct access to the CPU's L3 caches
-as well as physical memory. It can perform several operations on data
-streams with various input and output formats. A driver provides a
-transport mechanism and has limited knowledge of the various opcodes
-and data formats. A user space library provides high level services
-and translates these into low level commands which are then passed
-into the driver and subsequently the Hypervisor and the coprocessor.
-The library is the recommended way for applications to use the
-coprocessor, and the driver interface is not intended for general use.
-This document describes the general flow of the driver, its
-structures, and its programmatic interface. It also provides example
-code sufficient to write user or kernel applications that use DAX
-functionality.
-
-The user library is open source and available at:
- https://oss.oracle.com/git/gitweb.cgi?p=libdax.git
-
-The Hypervisor interface to the coprocessor is described in detail in
-the accompanying document, dax-hv-api.txt, which is a plain text
-excerpt of the (Oracle internal) "UltraSPARC Virtual Machine
-Specification" version 3.0.20+15, dated 2017-09-25.
-
-
-High Level Overview
--------------------
-
-A coprocessor request is described by a Command Control Block
-(CCB). The CCB contains an opcode and various parameters. The opcode
-specifies what operation is to be done, and the parameters specify
-options, flags, sizes, and addresses. The CCB (or an array of CCBs)
-is passed to the Hypervisor, which handles queueing and scheduling of
-requests to the available coprocessor execution units. A status code
-returned indicates if the request was submitted successfully or if
-there was an error. One of the addresses given in each CCB is a
-pointer to a "completion area", which is a 128 byte memory block that
-is written by the coprocessor to provide execution status. No
-interrupt is generated upon completion; the completion area must be
-polled by software to find out when a transaction has finished, but
-the M7 and later processors provide a mechanism to pause the virtual
-processor until the completion status has been updated by the
-coprocessor. This is done using the monitored load and mwait
-instructions, which are described in more detail later. The DAX
-coprocessor was designed so that after a request is submitted, the
-kernel is no longer involved in the processing of it. The polling is
-done at the user level, which results in almost zero latency between
-completion of a request and resumption of execution of the requesting
-thread.
-
-
-Addressing Memory
------------------
-
-The kernel does not have access to physical memory in the Sun4v
-architecture, as there is an additional level of memory virtualization
-present. This intermediate level is called "real" memory, and the
-kernel treats this as if it were physical. The Hypervisor handles the
-translations between real memory and physical so that each logical
-domain (LDOM) can have a partition of physical memory that is isolated
-from that of other LDOMs. When the kernel sets up a virtual mapping,
-it specifies a virtual address and the real address to which it should
-be mapped.
-
-The DAX coprocessor can only operate on physical memory, so before a
-request can be fed to the coprocessor, all the addresses in a CCB must
-be converted into physical addresses. The kernel cannot do this since
-it has no visibility into physical addresses. So a CCB may contain
-either the virtual or real addresses of the buffers or a combination
-of them. An "address type" field is available for each address that
-may be given in the CCB. In all cases, the Hypervisor will translate
-all the addresses to physical before dispatching to hardware. Address
-translations are performed using the context of the process initiating
-the request.
-
-
-The Driver API
---------------
-
-An application makes requests to the driver via the write() system
-call, and gets results (if any) via read(). The completion areas are
-made accessible via mmap(), and are read-only for the application.
-
-The request may either be an immediate command or an array of CCBs to
-be submitted to the hardware.
-
-Each open instance of the device is exclusive to the thread that
-opened it, and must be used by that thread for all subsequent
-operations. The driver open function creates a new context for the
-thread and initializes it for use. This context contains pointers and
-values used internally by the driver to keep track of submitted
-requests. The completion area buffer is also allocated, and this is
-large enough to contain the completion areas for many concurrent
-requests. When the device is closed, any outstanding transactions are
-flushed and the context is cleaned up.
-
-On a DAX1 system (M7), the device will be called "oradax1", while on a
-DAX2 system (M8) it will be "oradax2". If an application requires one
-or the other, it should simply attempt to open the appropriate
-device. Only one of the devices will exist on any given system, so the
-name can be used to determine what the platform supports.
-
-The immediate commands are CCB_DEQUEUE, CCB_KILL, and CCB_INFO. For
-all of these, success is indicated by a return value from write()
-equal to the number of bytes given in the call. Otherwise -1 is
-returned and errno is set.
-
-CCB_DEQUEUE
-
-Tells the driver to clean up resources associated with past
-requests. Since no interrupt is generated upon the completion of a
-request, the driver must be told when it may reclaim resources. No
-further status information is returned, so the user should not
-subsequently call read().
-
-CCB_KILL
-
-Kills a CCB during execution. The CCB is guaranteed to not continue
-executing once this call returns successfully. On success, read() must
-be called to retrieve the result of the action.
-
-CCB_INFO
-
-Retrieves information about a currently executing CCB. Note that some
-Hypervisors might return 'notfound' when the CCB is in 'inprogress'
-state. To ensure a CCB in the 'notfound' state will never be executed,
-CCB_KILL must be invoked on that CCB. Upon success, read() must be
-called to retrieve the details of the action.
-
-Submission of an array of CCBs for execution
-
-A write() whose length is a multiple of the CCB size is treated as a
-submit operation. The file offset is treated as the index of the
-completion area to use, and may be set via lseek() or using the
-pwrite() system call. If -1 is returned then errno is set to indicate
-the error. Otherwise, the return value is the length of the array that
-was actually accepted by the coprocessor. If the accepted length is
-equal to the requested length, then the submission was completely
-successful and there is no further status needed; hence, the user
-should not subsequently call read(). Partial acceptance of the CCB
-array is indicated by a return value less than the requested length,
-and read() must be called to retrieve further status information. The
-status will reflect the error caused by the first CCB that was not
-accepted, and status_data will provide additional data in some cases.
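
The return-value rules for a submit can be condensed into a small classifier. This is only a sketch: the enum, the function name, and the CCB_SIZE constant (64 bytes, i.e. 8 64-bit words) are hypothetical and are not part of the driver API.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>

#define CCB_SIZE 64	/* assumed: 8 words of 64 bits each */

/* Hypothetical names, used for illustration only. */
enum submit_outcome {
	SUBMIT_ERROR,		/* write() returned -1; consult errno */
	SUBMIT_PARTIAL,		/* fewer bytes accepted; read() has the status */
	SUBMIT_COMPLETE		/* everything accepted; do not call read() */
};

/* Interpret the return value of write()/pwrite() for a CCB array. */
static enum submit_outcome classify_submit(ssize_t ret, size_t requested)
{
	/* A submit is only recognized for a non-zero multiple of the CCB size. */
	assert(requested != 0 && requested % CCB_SIZE == 0);

	if (ret < 0)
		return SUBMIT_ERROR;
	if ((size_t)ret < requested)
		return SUBMIT_PARTIAL;
	return SUBMIT_COMPLETE;
}
```

A caller would pass the value returned by write() or pwrite() together with the length it requested, and call read() only for the partial case.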
-
-MMAP
-
-The mmap() function provides access to the completion area allocated
-in the driver. Note that the completion area is not writeable by the
-user process, and the mmap call must not specify PROT_WRITE.
-
-
-Completion of a Request
------------------------
-
-The first byte in each completion area is the command status which is
-updated by the coprocessor hardware. Software may take advantage of
-new M7/M8 processor capabilities to efficiently poll this status byte.
-First, a "monitored load" is achieved via a Load from Alternate Space
-(ldxa, lduba, etc.) with ASI 0x84 (ASI_MONITOR_PRIMARY). Second, a
-"monitored wait" is achieved via the mwait instruction (a write to
-%asr28). This instruction is like pause in that it suspends execution
-of the virtual processor for the given number of nanoseconds, but in
-addition will terminate early when one of several events occurs. If the
-block of data containing the monitored location is modified, then the
-mwait terminates. This causes software to resume execution immediately
-(without a context switch or kernel to user transition) after a
-transaction completes. Thus the latency between transaction completion
-and resumption of execution may be just a few nanoseconds.
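
For testing the surrounding logic on a machine without these instructions, the polling loop can be approximated in portable C. The sketch below deliberately swaps the monitored load for a plain volatile load and replaces mwait with a bounded retry count, so it shows only the control flow, not the low-latency behavior. The function name and the max_polls parameter are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Portable stand-in for the monitored load + mwait loop.
 * Returns the first non-zero status byte seen, or 0 if the command
 * is still in progress after max_polls reads.
 */
static uint8_t poll_completion(const volatile uint8_t *completion_area,
			       unsigned long max_polls)
{
	assert(completion_area != NULL);

	for (unsigned long i = 0; i < max_polls; i++) {
		/* Plain load of the status byte, NOT a monitored load. */
		uint8_t status = completion_area[0];

		if (status != 0)	/* 0 indicates command in progress */
			return status;

		/* Real code would execute mwait here instead of spinning. */
	}
	return 0;
}
```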
-
-
-Application Life Cycle of a DAX Submission
-------------------------------------------
-
- - open dax device
- - call mmap() to get the completion area address
- - allocate a CCB and fill in the opcode, flags, parameters, addresses, etc.
- - submit CCB via write() or pwrite()
- - go into a loop executing monitored load + monitored wait and
- terminate when the command status indicates the request is complete
- (CCB_KILL or CCB_INFO may be used any time as necessary)
- - perform a CCB_DEQUEUE
- - call munmap() for completion area
- - close the dax device
-
-
-Memory Constraints
-------------------
-
-The DAX hardware operates only on physical addresses. Therefore, it is
-not aware of virtual memory mappings and the discontiguities that may
-exist in the physical memory that a virtual buffer maps to. There is
-no I/O TLB or any scatter/gather mechanism. All buffers, whether input
-or output, must reside in a physically contiguous region of memory.
-
-The Hypervisor translates all addresses within a CCB to physical
-before handing off the CCB to DAX. The Hypervisor determines the
-virtual page size for each virtual address given, and uses this to
-program a size limit for each address. This prevents the coprocessor
-from reading or writing beyond the bound of the virtual page, even
-though it is accessing physical memory directly. A simpler way of
-saying this is that a DAX operation will never "cross" a virtual page
-boundary. If an 8k virtual page is used, then the data is strictly
-limited to 8k. If a user's buffer is larger than 8k, then a larger
-page size must be used, or the transaction size will be truncated to
-8k.
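
Before building a CCB, an application can check that a buffer stays within a single page. The following is a minimal sketch; the helper name is hypothetical, and the caller is assumed to know the page size backing the buffer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Return true if [addr, addr + len) lies entirely within one page of
 * the given size. page_size must be a power of two.
 */
static bool dax_buf_fits_in_page(uintptr_t addr, size_t len, size_t page_size)
{
	assert(page_size != 0 && (page_size & (page_size - 1)) == 0);

	if (len == 0 || len > page_size)
		return false;

	uintptr_t mask = ~((uintptr_t)page_size - 1);

	/* First and last byte must fall in the same page. */
	return (addr & mask) == ((addr + len - 1) & mask);
}
```

With 8k pages, an 8k buffer passes only if it starts exactly on a page boundary; any other start address makes it straddle two pages.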
-
-Huge pages. A user may allocate huge pages using standard interfaces.
-Memory buffers residing on huge pages may be used to achieve much
-larger DAX transaction sizes, but the rules must still be followed,
-and no transaction will cross a page boundary, even a huge page. A
-major caveat is that Linux on SPARC presents 8MB as one of the huge
-page sizes. SPARC does not actually provide an 8MB hardware page size,
-and this size is synthesized by pasting together two 4MB pages. The
-reasons for this are historical, and it creates an issue because only
-half of this 8MB page can actually be used for any given buffer in a
-DAX request, and it must be either the first half or the second half;
-it cannot be a 4MB chunk in the middle, since that crosses a
-(hardware) page boundary. Note that this entire issue may be hidden by
-higher level libraries.
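
The 8MB rule can be expressed as a check on the buffer's offset within the huge page. This is a sketch with hypothetical names; it assumes the caller has already computed the offset of the buffer from the start of the 8MB page.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define HW_PAGE_4MB	((uintptr_t)4 << 20)	/* hardware page size */
#define HUGE_PAGE_8MB	(2 * HW_PAGE_4MB)	/* synthesized page size */

/*
 * Return 0 if the buffer lies in the first 4MB half, 1 if it lies in
 * the second half, and -1 if it is empty, overruns the 8MB page, or
 * straddles the two hardware pages.
 */
static int dax_hugepage_half(uintptr_t offset, size_t len)
{
	assert(offset < HUGE_PAGE_8MB);

	if (len == 0 || len > HUGE_PAGE_8MB - offset)
		return -1;

	uintptr_t first = offset / HW_PAGE_4MB;
	uintptr_t last = (offset + len - 1) / HW_PAGE_4MB;

	return first == last ? (int)first : -1;
}
```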
-
-
-CCB Structure
--------------
-A CCB is an array of 8 64-bit words. Several of these words provide
-command opcodes, parameters, flags, etc., and the rest are addresses
-for the completion area, output buffer, and various inputs:
-
- struct ccb {
- u64 control;
- u64 completion;
- u64 input0;
- u64 access;
- u64 input1;
- u64 op_data;
- u64 output;
- u64 table;
- };
-
-See libdax/common/sys/dax1/dax1_ccb.h for a detailed description of
-each of these fields, and see dax-hv-api.txt for a complete description
-of the Hypervisor API available to the guest OS (i.e., the Linux kernel).
-
-The first word (control) is examined by the driver for the following:
- - CCB version, which must be consistent with hardware version
- - Opcode, which must be one of the documented allowable commands
- - Address types, which must be set to "virtual" for all the addresses
- given by the user, thereby ensuring that the application can
- only access memory that it owns
-
-
-Example Code
-------------
-
-The DAX is accessible to both user and kernel code. The kernel code
-can make hypercalls directly while the user code must use wrappers
-provided by the driver. The setup of the CCB is nearly identical for
-both; the only difference is in preparation of the completion area. An
-example of user code is given now, with kernel code afterwards.
-
-In order to program using the driver API, the file
-arch/sparc/include/uapi/asm/oradax.h must be included.
-
-First, the proper device must be opened. For M7 it will be
-/dev/oradax1 and for M8 it will be /dev/oradax2. The simplest
-procedure is to attempt to open both, as only one will succeed:
-
- fd = open("/dev/oradax1", O_RDWR);
- if (fd < 0)
- fd = open("/dev/oradax2", O_RDWR);
- if (fd < 0) {
- /* No DAX found */
- }
-
-Next, the completion area must be mapped:
-
- completion_area = mmap(NULL, DAX_MMAP_LEN, PROT_READ, MAP_SHARED, fd, 0);
-
-All input and output buffers must be fully contained in one hardware
-page since, as explained above, the DAX is strictly constrained by
-virtual page boundaries. In addition, the output buffer must be
-64-byte aligned and its size must be a multiple of 64 bytes because
-the coprocessor writes in units of cache lines.
-
-This example demonstrates the DAX Scan command, which takes as input a
-vector and a match value, and produces a bitmap as the output. For
-each input element that matches the value, the corresponding bit is
-set in the output.
-
-In this example, the input vector consists of a series of single bits,
-and the match value is 0. So each 0 bit in the input will produce a 1
-in the output, and vice versa, which produces an output bitmap which
-is the input bitmap inverted.
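
Because the expected result here is simply the inverted input, the hardware output can be checked in software. The helper below is a sketch with a hypothetical name; it compares whole bytes only and does not model any trailing bits beyond the last full byte of the input.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Return true if every byte of output is the bitwise complement of the
 * corresponding byte of input, which is what a 1-bit Scan for the
 * value 0 is expected to produce.
 */
static bool is_inverted_bitmap(const uint8_t *input, const uint8_t *output,
			       size_t nbytes)
{
	assert(input != NULL && output != NULL);

	for (size_t i = 0; i < nbytes; i++) {
		if (output[i] != (uint8_t)~input[i])
			return false;
	}
	return true;
}
```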
-
-For details of all the parameters and bits used in this CCB, please
-refer to section 36.2.1.3 of the DAX Hypervisor API document, which
-describes the Scan command in detail.
-
- ccb->control = /* Table 36.1, CCB Header Format */
- (2L << 48) /* command = Scan Value */
- | (3L << 40) /* output address type = primary virtual */
- | (3L << 34) /* primary input address type = primary virtual */
- /* Section 36.2.1, Query CCB Command Formats */
- | (1 << 28) /* 36.2.1.1.1 primary input format = fixed width bit packed */
- | (0 << 23) /* 36.2.1.1.2 primary input element size = 0 (1 bit) */
- | (8 << 10) /* 36.2.1.1.6 output format = bit vector */
- | (0 << 5) /* 36.2.1.3 First scan criteria size = 0 (1 byte) */
- | (31 << 0); /* 36.2.1.3 Disable second scan criteria */
-
- ccb->completion = 0; /* Completion area address, to be filled in by driver */
-
- ccb->input0 = (unsigned long) input; /* primary input address */
-
- ccb->access = /* Section 36.2.1.2, Data Access Control */
- (2 << 24) /* Primary input length format = bits */
- | (nbits - 1); /* number of bits in primary input stream, minus 1 */
-
- ccb->input1 = 0; /* secondary input address, unused */
-
- ccb->op_data = 0; /* scan criteria (value to be matched) */
-
- ccb->output = (unsigned long) output; /* output address */
-
- ccb->table = 0; /* table address, unused */
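
The shifts above can be given names so that the control and access words are easier to audit. In this sketch the macro and function names are hypothetical; only the shift amounts and field values are taken from the example, and the width of the length field is not validated.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names for the bit positions used in the example. */
#define CCB_CMD_SHIFT		48	/* command */
#define CCB_OUT_TYPE_SHIFT	40	/* output address type */
#define CCB_IN0_TYPE_SHIFT	34	/* primary input address type */
#define CCB_IN0_FMT_SHIFT	28	/* primary input format */
#define CCB_IN0_ELEM_SHIFT	23	/* primary input element size */
#define CCB_OUT_FMT_SHIFT	10	/* output format */
#define CCB_CRIT1_SHIFT		5	/* first scan criteria size */
#define CCB_CRIT2_SHIFT		0	/* second scan criteria size */
#define CCB_LEN_FMT_SHIFT	24	/* primary input length format */

/* Control word for the 1-bit Scan Value request shown above. */
static uint64_t scan_control_word(void)
{
	return ((uint64_t)2 << CCB_CMD_SHIFT)		/* Scan Value */
	     | ((uint64_t)3 << CCB_OUT_TYPE_SHIFT)	/* primary virtual */
	     | ((uint64_t)3 << CCB_IN0_TYPE_SHIFT)	/* primary virtual */
	     | ((uint64_t)1 << CCB_IN0_FMT_SHIFT)	/* fixed width bit packed */
	     | ((uint64_t)0 << CCB_IN0_ELEM_SHIFT)	/* 1 bit elements */
	     | ((uint64_t)8 << CCB_OUT_FMT_SHIFT)	/* bit vector */
	     | ((uint64_t)0 << CCB_CRIT1_SHIFT)		/* 1 byte criteria */
	     | ((uint64_t)31 << CCB_CRIT2_SHIFT);	/* second criteria off */
}

/* Access word: length format = bits, length = nbits - 1. */
static uint64_t scan_access_word(uint64_t nbits)
{
	assert(nbits >= 1);
	return ((uint64_t)2 << CCB_LEN_FMT_SHIFT) | (nbits - 1);
}
```

Using uint64_t for every term also avoids mixing int and long operands in the shifts.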
-
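-The CCB submission is a write() or pwrite() system call to the
-driver. A return value of -1 means the call failed and errno is set. A
-return value smaller than the requested length means the CCB was not
-accepted, and a read() must then be used to retrieve the status:
-
- ssize_t n = pwrite(fd, ccb, 64, 0);
- if (n < 0) {
- /* errno describes the failure; bail out */
- } else if (n != 64) {
- struct ccb_exec_result status;
- read(fd, &status, sizeof(status));
- /* bail out */
- }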
-
-After a successful submission of the CCB, the completion area may be
-polled to determine when the DAX is finished. Detailed information on
-the contents of the completion area can be found in section 36.2.2 of
-the DAX HV API document.
-
- while (1) {
- /* Monitored Load */
- __asm__ __volatile__("lduba [%1] 0x84, %0\n"
- : "=r" (status)
- : "r" (completion_area));
-
- if (status) /* 0 indicates command in progress */
- break;
-
- /* MWAIT */
- __asm__ __volatile__("wr %%g0, 1000, %%asr28\n" ::); /* 1000 ns */
- }
-
-A completion area status of 1 indicates successful completion of the
-CCB and validity of the output bitmap, which may be used immediately.
-All other non-zero values indicate error conditions which are
-described in section 36.2.2.
-
- if (completion_area[0] != 1) { /* section 36.2.2, 1 = command ran and succeeded */
- /* completion_area[0] contains the completion status */
- /* completion_area[1] contains an error code, see 36.2.2 */
- }
-
-After the completion area has been processed, the driver must be
-notified that it can release any resources associated with the
-request. This is done via the dequeue operation:
-
- struct dax_command cmd;
- cmd.command = CCB_DEQUEUE;
- if (write(fd, &cmd, sizeof(cmd)) != sizeof(cmd)) {
- /* bail out */
- }
-
-Finally, normal program cleanup should be done, i.e., unmapping the
-completion area, closing the DAX device, freeing memory, etc.
-
-[Kernel example]
-
-The only difference in using the DAX in kernel code is the treatment
-of the completion area. Unlike user applications which mmap the
-completion area allocated by the driver, kernel code must allocate its
-own memory to use for the completion area, and this address and its
-type must be given in the CCB:
-
- ccb->control |= /* Table 36.1, CCB Header Format */
- (3L << 32); /* completion area address type = primary virtual */
-
- ccb->completion = (unsigned long) completion_area; /* Completion area address */
-
-The dax submit hypercall is made directly. The flags used in the
-ccb_submit call are documented in the DAX HV API in section 36.3.1.
-
-#include <asm/hypervisor.h>
-
- hv_rv = sun4v_ccb_submit((unsigned long)ccb, 64,
- HV_CCB_QUERY_CMD |
- HV_CCB_ARG0_PRIVILEGED | HV_CCB_ARG0_TYPE_PRIMARY |
- HV_CCB_VA_PRIVILEGED,
- 0, &bytes_accepted, &status_data);
-
- if (hv_rv != HV_EOK) {
- /* hv_rv is an error code, status_data contains */
- /* potential additional status, see 36.3.1.1 */
- }
-
-After the submission, the completion area polling code is identical to
-that in user land:
-
- while (1) {
- /* Monitored Load */
- __asm__ __volatile__("lduba [%1] 0x84, %0\n"
- : "=r" (status)
- : "r" (completion_area));
-
- if (status) /* 0 indicates command in progress */
- break;
-
- /* MWAIT */
- __asm__ __volatile__("wr %%g0, 1000, %%asr28\n" ::); /* 1000 ns */
- }
-
- if (completion_area[0] != 1) { /* section 36.2.2, 1 = command ran and succeeded */
- /* completion_area[0] contains the completion status */
- /* completion_area[1] contains an error code, see 36.2.2 */
- }
-
-The output bitmap is ready for consumption immediately after the
-completion status indicates success.