Diffstat (limited to 'Documentation/core-api')
-rw-r--r--  Documentation/core-api/assoc_array.rst            6
-rw-r--r--  Documentation/core-api/boot-time-mm.rst          71
-rw-r--r--  Documentation/core-api/gfp_mask-from-fs-io.rst    2
-rw-r--r--  Documentation/core-api/index.rst                  4
-rw-r--r--  Documentation/core-api/kernel-api.rst             6
-rw-r--r--  Documentation/core-api/memory-allocation.rst    124
-rw-r--r--  Documentation/core-api/memory-hotplug.rst       125
-rw-r--r--  Documentation/core-api/mm-api.rst                34
-rw-r--r--  Documentation/core-api/printk-formats.rst        29
-rw-r--r--  Documentation/core-api/xarray.rst               471
10 files changed, 787 insertions, 85 deletions
diff --git a/Documentation/core-api/assoc_array.rst b/Documentation/core-api/assoc_array.rst
index 8231b915c939..792bbf9939e1 100644
--- a/Documentation/core-api/assoc_array.rst
+++ b/Documentation/core-api/assoc_array.rst
@@ -34,7 +34,7 @@ properties:
8. The array can be iterated over. The objects will not necessarily come out in
key order.
-9. The array can be iterated over whilst it is being modified, provided the
+9. The array can be iterated over while it is being modified, provided the
RCU readlock is being held by the iterator. Note, however, under these
circumstances, some objects may be seen more than once. If this is a
problem, the iterator should lock against modification. Objects will not
@@ -42,7 +42,7 @@ properties:
10. Objects in the array can be looked up by means of their index key.
-11. Objects can be looked up whilst the array is being modified, provided the
+11. Objects can be looked up while the array is being modified, provided the
RCU readlock is being held by the thread doing the look up.
The implementation uses a tree of 16-pointer nodes internally that are indexed
@@ -273,7 +273,7 @@ The function will return ``0`` if successful and ``-ENOMEM`` if there wasn't
enough memory.
It is possible for other threads to iterate over or search the array under
-the RCU read lock whilst this function is in progress. The caller should
+the RCU read lock while this function is in progress. The caller should
lock exclusively against other modifiers of the array.
diff --git a/Documentation/core-api/boot-time-mm.rst b/Documentation/core-api/boot-time-mm.rst
index 03cb1643f46f..e5ec9f1a563d 100644
--- a/Documentation/core-api/boot-time-mm.rst
+++ b/Documentation/core-api/boot-time-mm.rst
@@ -5,54 +5,23 @@ Boot time memory management
Early system initialization cannot use "normal" memory management
simply because it is not set up yet. But there is still need to
allocate memory for various data structures, for instance for the
-physical page allocator. To address this, a specialized allocator
-called the :ref:`Boot Memory Allocator <bootmem>`, or bootmem, was
-introduced. Several years later PowerPC developers added a "Logical
-Memory Blocks" allocator, which was later adopted by other
-architectures and renamed to :ref:`memblock <memblock>`. There is also
-a compatibility layer called `nobootmem` that translates bootmem
-allocation interfaces to memblock calls.
+physical page allocator.
-The selection of the early allocator is done using
-``CONFIG_NO_BOOTMEM`` and ``CONFIG_HAVE_MEMBLOCK`` kernel
-configuration options. These options are enabled or disabled
-statically by the architectures' Kconfig files.
-
-* Architectures that rely only on bootmem select
- ``CONFIG_NO_BOOTMEM=n && CONFIG_HAVE_MEMBLOCK=n``.
-* The users of memblock with the nobootmem compatibility layer set
- ``CONFIG_NO_BOOTMEM=y && CONFIG_HAVE_MEMBLOCK=y``.
-* And for those that use both memblock and bootmem the configuration
- includes ``CONFIG_NO_BOOTMEM=n && CONFIG_HAVE_MEMBLOCK=y``.
-
-Whichever allocator is used, it is the responsibility of the
-architecture specific initialization to set it up in
-:c:func:`setup_arch` and tear it down in :c:func:`mem_init` functions.
+A specialized allocator called ``memblock`` performs the
+boot time memory management. The architecture specific initialization
+must set it up in :c:func:`setup_arch` and tear it down in
+:c:func:`mem_init` functions.
Once the early memory management is available it offers a variety of
functions and macros for memory allocations. The allocation request
may be directed to the first (and probably the only) node or to a
particular node in a NUMA system. There are API variants that panic
-when an allocation fails and those that don't. And more recent and
-advanced memblock even allows controlling its own behaviour.
-
-.. _bootmem:
-
-Bootmem
-=======
-
-(mostly stolen from Mel Gorman's "Understanding the Linux Virtual
-Memory Manager" `book`_)
-
-.. _book: https://www.kernel.org/doc/gorman/
-
-.. kernel-doc:: mm/bootmem.c
- :doc: bootmem overview
+when an allocation fails and those that don't.
-.. _memblock:
+Memblock also offers a variety of APIs that control its own behaviour.
-Memblock
-========
+Memblock Overview
+=================
.. kernel-doc:: mm/memblock.c
:doc: memblock overview
@@ -61,26 +30,6 @@ Memblock
Functions and structures
========================
-Common API
-----------
-
-The functions that are described in this section are available
-regardless of what early memory manager is enabled.
-
-.. kernel-doc:: mm/nobootmem.c
-
-Bootmem specific API
---------------------
-
-These interfaces available only with bootmem, i.e when ``CONFIG_NO_BOOTMEM=n``
-
-.. kernel-doc:: include/linux/bootmem.h
-.. kernel-doc:: mm/bootmem.c
- :nodocs:
-
-Memblock specific API
----------------------
-
Here is the description of memblock data structures, functions and
macros. Some of them are actually internal, but since they are
documented it would be silly to omit them. Besides, reading the
@@ -89,4 +38,4 @@ really happens under the hood.
.. kernel-doc:: include/linux/memblock.h
.. kernel-doc:: mm/memblock.c
- :nodocs:
+ :functions:
diff --git a/Documentation/core-api/gfp_mask-from-fs-io.rst b/Documentation/core-api/gfp_mask-from-fs-io.rst
index e0df8f416582..e7c32a8de126 100644
--- a/Documentation/core-api/gfp_mask-from-fs-io.rst
+++ b/Documentation/core-api/gfp_mask-from-fs-io.rst
@@ -1,3 +1,5 @@
+.. _gfp_mask_from_fs_io:
+
=================================
GFP masks used from FS/IO context
=================================
diff --git a/Documentation/core-api/index.rst b/Documentation/core-api/index.rst
index 26b735cefb93..3adee82be311 100644
--- a/Documentation/core-api/index.rst
+++ b/Documentation/core-api/index.rst
@@ -21,16 +21,20 @@ Core utilities
local_ops
workqueue
genericirq
+ xarray
flexible-arrays
librs
genalloc
errseq
printk-formats
circular-buffers
+ memory-allocation
mm-api
gfp_mask-from-fs-io
timekeeping
boot-time-mm
+ memory-hotplug
+
Interfaces for kernel debugging
===============================
diff --git a/Documentation/core-api/kernel-api.rst b/Documentation/core-api/kernel-api.rst
index 3431337ee4e6..cdd24943fbcc 100644
--- a/Documentation/core-api/kernel-api.rst
+++ b/Documentation/core-api/kernel-api.rst
@@ -291,12 +291,6 @@ Block Devices
.. kernel-doc:: block/blk-lib.c
:export:
-.. kernel-doc:: block/blk-tag.c
- :export:
-
-.. kernel-doc:: block/blk-tag.c
- :internal:
-
.. kernel-doc:: block/blk-integrity.c
:export:
diff --git a/Documentation/core-api/memory-allocation.rst b/Documentation/core-api/memory-allocation.rst
new file mode 100644
index 000000000000..8954a88ff5b7
--- /dev/null
+++ b/Documentation/core-api/memory-allocation.rst
@@ -0,0 +1,124 @@
+.. _memory-allocation:
+
+=======================
+Memory Allocation Guide
+=======================
+
+Linux provides a variety of APIs for memory allocation. You can
+allocate small chunks using `kmalloc` or `kmem_cache_alloc` families,
+large virtually contiguous areas using `vmalloc` and its derivatives,
+or you can directly request pages from the page allocator with
+`alloc_pages`. It is also possible to use more specialized allocators,
+for instance `cma_alloc` or `zs_malloc`.
+
+Most of the memory allocation APIs use GFP flags to express how that
+memory should be allocated. The GFP acronym stands for "get free
+pages", the underlying memory allocation function.
+
+Diversity of the allocation APIs combined with the numerous GFP flags
+makes the question "How should I allocate memory?" not that easy to
+answer, although very likely you should use
+
+::
+
+ kzalloc(<size>, GFP_KERNEL);
+
+Of course there are cases when other allocation APIs and different GFP
+flags must be used.
+
+Get Free Page flags
+===================
+
+The GFP flags control the allocator's behavior. They tell what memory
+zones can be used, how hard the allocator should try to find free
+memory, whether the memory can be accessed by the userspace etc. The
+:ref:`Documentation/core-api/mm-api.rst <mm-api-gfp-flags>` provides
+reference documentation for the GFP flags and their combinations and
+here we briefly outline their recommended usage:
+
+ * Most of the time ``GFP_KERNEL`` is what you need. Memory for the
+ kernel data structures, DMAable memory, inode cache, all these and
+ many other allocation types can use ``GFP_KERNEL``. Note that
+ using ``GFP_KERNEL`` implies ``__GFP_RECLAIM``, which means that
+ direct reclaim may be triggered under memory pressure; the calling
+ context must be allowed to sleep.
+ * If the allocation is performed from an atomic context, e.g. an interrupt
+ handler, use ``GFP_NOWAIT``. This flag prevents direct reclaim and
+ IO or filesystem operations. Consequently, under memory pressure
+ ``GFP_NOWAIT`` allocation is likely to fail. Allocations which
+ have a reasonable fallback should be using ``__GFP_NOWARN``.
+ * If you think that accessing memory reserves is justified and the kernel
+ will be stressed unless allocation succeeds, you may use ``GFP_ATOMIC``.
+ * Untrusted allocations triggered from userspace should be subject
+ to kmem accounting and must have the ``__GFP_ACCOUNT`` bit set. There
+ is the handy ``GFP_KERNEL_ACCOUNT`` shortcut for ``GFP_KERNEL``
+ allocations that should be accounted.
+ * Userspace allocations should use either of the ``GFP_USER``,
+ ``GFP_HIGHUSER`` or ``GFP_HIGHUSER_MOVABLE`` flags. The longer
+ the flag name the less restrictive it is.
+
+ ``GFP_HIGHUSER_MOVABLE`` does not require that allocated memory
+ will be directly accessible by the kernel and implies that the
+ data is movable.
+
+ ``GFP_HIGHUSER`` means that the allocated memory is not movable,
+ but it is not required to be directly accessible by the kernel. An
+ example may be a hardware allocation that maps data directly into
+ userspace but has no addressing limitations.
+
+ ``GFP_USER`` means that the allocated memory is not movable and it
+ must be directly accessible by the kernel.
+
+You may notice that quite a few allocations in the existing code
+specify ``GFP_NOIO`` or ``GFP_NOFS``. Historically, they were used to
+prevent recursion deadlocks caused by direct memory reclaim calling
+back into the FS or IO paths and blocking on already held
+resources. Since 4.12 the preferred way to address this issue is to
+use new scope APIs described in
+:ref:`Documentation/core-api/gfp_mask-from-fs-io.rst <gfp_mask_from_fs_io>`.
+
+Other legacy GFP flags are ``GFP_DMA`` and ``GFP_DMA32``. They are
+used to ensure that the allocated memory is accessible by hardware
+with limited addressing capabilities. So unless you are writing a
+driver for a device with such restrictions, avoid using these flags.
+And even with hardware with restrictions it is preferable to use
+`dma_alloc*` APIs.
+
+Selecting memory allocator
+==========================
+
+The most straightforward way to allocate memory is to use a function
+from the :c:func:`kmalloc` family. And, to be on the safe side it's
+best to use routines that set memory to zero, like
+:c:func:`kzalloc`. If you need to allocate memory for an array, there
+are :c:func:`kmalloc_array` and :c:func:`kcalloc` helpers.
+
+The maximal size of a chunk that can be allocated with `kmalloc` is
+limited. The actual limit depends on the hardware and the kernel
+configuration, but it is a good practice to use `kmalloc` for objects
+smaller than page size.
+
+For large allocations you can use :c:func:`vmalloc` and
+:c:func:`vzalloc`, or directly request pages from the page
+allocator. The memory allocated by `vmalloc` and related functions is
+not physically contiguous.
+
+If you are not sure whether the allocation size is too large for
+`kmalloc`, it is possible to use :c:func:`kvmalloc` and its
+derivatives. It will try to allocate memory with `kmalloc` and if the
+allocation fails it will be retried with `vmalloc`. There are
+restrictions on which GFP flags can be used with `kvmalloc`; please
+see :c:func:`kvmalloc_node` reference documentation. Note that
+`kvmalloc` may return memory that is not physically contiguous.
+
+If you need to allocate many identical objects you can use the slab
+cache allocator. The cache should be set up with
+:c:func:`kmem_cache_create` before it can be used. Afterwards
+:c:func:`kmem_cache_alloc` and its convenience wrappers can allocate
+memory from that cache.
+
+When the allocated memory is no longer needed it must be freed. You
+can use :c:func:`kvfree` for the memory allocated with `kmalloc`,
+`vmalloc` and `kvmalloc`. The slab caches should be freed with
+:c:func:`kmem_cache_free`. And don't forget to destroy the cache with
+:c:func:`kmem_cache_destroy`.
diff --git a/Documentation/core-api/memory-hotplug.rst b/Documentation/core-api/memory-hotplug.rst
new file mode 100644
index 000000000000..de7467e48067
--- /dev/null
+++ b/Documentation/core-api/memory-hotplug.rst
@@ -0,0 +1,125 @@
+.. _memory_hotplug:
+
+==============
+Memory hotplug
+==============
+
+Memory hotplug event notifier
+=============================
+
+Hotplugging events are sent to a notification queue.
+
+There are six types of notification defined in ``include/linux/memory.h``:
+
+MEM_GOING_ONLINE
+ Generated before new memory becomes available in order to be able to
+ prepare subsystems to handle memory. The page allocator is still unable
+ to allocate from the new memory.
+
+MEM_CANCEL_ONLINE
+ Generated if MEM_GOING_ONLINE fails.
+
+MEM_ONLINE
+ Generated when memory has been successfully brought online. The callback may
+ allocate pages from the new memory.
+
+MEM_GOING_OFFLINE
+ Generated to begin the process of offlining memory. Allocations are no
+ longer possible from the memory but some of the memory to be offlined
+ is still in use. The callback can be used to free memory known to a
+ subsystem from the indicated memory block.
+
+MEM_CANCEL_OFFLINE
+ Generated if MEM_GOING_OFFLINE fails. Memory is available again from
+ the memory block that we attempted to offline.
+
+MEM_OFFLINE
+ Generated after offlining memory is complete.
+
+A callback routine can be registered by calling::
+
+ hotplug_memory_notifier(callback_func, priority)
+
+Callback functions with higher values of priority are called before callback
+functions with lower values.
+
+A callback function must have the following prototype::
+
+ int callback_func(
+ struct notifier_block *self, unsigned long action, void *arg);
+
+The first argument of the callback function (self) is a pointer to the block
+of the notifier chain that points to the callback function itself.
+The second argument (action) is one of the event types described above.
+The third argument (arg) passes a pointer to a struct memory_notify::
+
+    struct memory_notify {
+        unsigned long start_pfn;
+        unsigned long nr_pages;
+        int status_change_nid_normal;
+        int status_change_nid_high;
+        int status_change_nid;
+    };
+
+- start_pfn is the first page frame number of the memory being onlined or
+  offlined.
+- nr_pages is the number of pages being onlined or offlined.
+- status_change_nid_normal is set to the node id when N_NORMAL_MEMORY in the
+  nodemask is (or will be) set or cleared. If it is -1, the nodemask status
+  does not change.
+- status_change_nid_high is set to the node id when N_HIGH_MEMORY in the
+  nodemask is (or will be) set or cleared. If it is -1, the nodemask status
+  does not change.
+- status_change_nid is set to the node id when N_MEMORY in the nodemask is (or
+  will be) set or cleared. This means that a new (memoryless) node gains
+  memory by onlining, or that a node loses all of its memory. If it is -1,
+  the nodemask status does not change.
+
+  If status_change_nid* >= 0, the callback should create/discard structures
+  for the node if necessary.
+
+The callback routine shall return one of the values
+NOTIFY_DONE, NOTIFY_OK, NOTIFY_BAD, NOTIFY_STOP
+defined in ``include/linux/notifier.h``.
+
+NOTIFY_DONE and NOTIFY_OK have no effect on the further processing.
+
+NOTIFY_BAD is used as response to the MEM_GOING_ONLINE, MEM_GOING_OFFLINE,
+MEM_ONLINE, or MEM_OFFLINE action to cancel hotplugging. It stops
+further processing of the notification queue.
+
+NOTIFY_STOP stops further processing of the notification queue.
+
+Locking Internals
+=================
+
+When adding/removing memory that uses memory block devices (i.e. ordinary RAM),
+the device_hotplug_lock should be held to:
+
+- synchronize against online/offline requests (e.g. via sysfs). This way, memory
+ block devices can only be accessed (.online/.state attributes) by user
+ space once memory has been fully added. And when removing memory, we
+ know nobody is in critical sections.
+- synchronize against CPU hotplug and similar (e.g. relevant for ACPI and PPC)
+
+Especially, there is a possible lock inversion that is avoided using
+device_hotplug_lock when adding memory and user space tries to online that
+memory faster than expected:
+
+- device_online() will first take the device_lock(), followed by
+ mem_hotplug_lock
+- add_memory_resource() will first take the mem_hotplug_lock, followed by
+ the device_lock() (while creating the devices, during bus_add_device()).
+
+As the device is visible to user space before taking the device_lock(), this
+can result in a lock inversion.
+
+Onlining/offlining of memory should be done via device_online()/
+device_offline() to make sure it is properly synchronized to actions
+via sysfs. Holding device_hotplug_lock is advised (e.g. to protect online_type).
+
+When adding/removing/onlining/offlining memory or adding/removing
+heterogeneous/device memory, we should always hold the mem_hotplug_lock in
+write mode to serialise memory hotplug (e.g. access to global/zone
+variables).
+
+In addition, mem_hotplug_lock (in contrast to device_hotplug_lock) in read
+mode allows for a quite efficient get_online_mems/put_online_mems
+implementation, so code accessing memory can protect from that memory
+vanishing.
diff --git a/Documentation/core-api/mm-api.rst b/Documentation/core-api/mm-api.rst
index 46ae3537fb12..aa8e54b85221 100644
--- a/Documentation/core-api/mm-api.rst
+++ b/Documentation/core-api/mm-api.rst
@@ -14,6 +14,8 @@ User Space Memory Access
.. kernel-doc:: mm/util.c
:functions: get_user_pages_fast
+.. _mm-api-gfp-flags:
+
Memory Allocation Controls
==========================
@@ -44,11 +46,20 @@ The Slab Cache
.. kernel-doc:: mm/slab.c
:export:
+.. kernel-doc:: mm/slab_common.c
+ :export:
+
.. kernel-doc:: mm/util.c
:functions: kfree_const kvmalloc_node kvfree
-More Memory Management Functions
-================================
+Virtually Contiguous Mappings
+=============================
+
+.. kernel-doc:: mm/vmalloc.c
+ :export:
+
+File Mapping and Page Cache
+===========================
.. kernel-doc:: mm/readahead.c
:export:
@@ -56,23 +67,28 @@ More Memory Management Functions
.. kernel-doc:: mm/filemap.c
:export:
-.. kernel-doc:: mm/memory.c
+.. kernel-doc:: mm/page-writeback.c
:export:
-.. kernel-doc:: mm/vmalloc.c
+.. kernel-doc:: mm/truncate.c
:export:
-.. kernel-doc:: mm/page_alloc.c
- :internal:
+Memory pools
+============
.. kernel-doc:: mm/mempool.c
:export:
+DMA pools
+=========
+
.. kernel-doc:: mm/dmapool.c
:export:
-.. kernel-doc:: mm/page-writeback.c
- :export:
+More Memory Management Functions
+================================
-.. kernel-doc:: mm/truncate.c
+.. kernel-doc:: mm/memory.c
:export:
+
+.. kernel-doc:: mm/page_alloc.c
diff --git a/Documentation/core-api/printk-formats.rst b/Documentation/core-api/printk-formats.rst
index 25dc591cb110..a7fae4538946 100644
--- a/Documentation/core-api/printk-formats.rst
+++ b/Documentation/core-api/printk-formats.rst
@@ -376,15 +376,15 @@ correctness of the format string and va_list arguments.
Passed by reference.
-kobjects
---------
+Device tree nodes
+-----------------
::
%pOF[fnpPcCF]
-For printing kobject based structs (device nodes). Default behaviour is
+For printing device tree node structures. Default behaviour is
equivalent to %pOFf.
- f - device node full_name
@@ -412,6 +412,24 @@ Examples::
Passed by reference.
+Time and date (struct rtc_time)
+-------------------------------
+
+::
+
+ %ptR YYYY-mm-ddTHH:MM:SS
+ %ptRd YYYY-mm-dd
+ %ptRt HH:MM:SS
+ %ptR[dt][r]
+
+For printing date and time as represented by struct rtc_time structure in
+human readable format.
+
+By default year will be incremented by 1900 and month by 1. Use %ptRr (raw)
+to suppress this behaviour.
+
+Passed by reference.
+
struct clk
----------
@@ -420,9 +438,8 @@ struct clk
%pC pll1
%pCn pll1
-For printing struct clk structures. %pC and %pCn print the name
-(Common Clock Framework) or address (legacy clock framework) of the
-structure.
+For printing struct clk structures. %pC and %pCn print the name of the clock
+(Common Clock Framework) or a unique 32-bit ID (legacy clock framework).
Passed by reference.
diff --git a/Documentation/core-api/xarray.rst b/Documentation/core-api/xarray.rst
new file mode 100644
index 000000000000..5d54b27c6eba
--- /dev/null
+++ b/Documentation/core-api/xarray.rst
@@ -0,0 +1,471 @@
+.. SPDX-License-Identifier: GPL-2.0+
+
+======
+XArray
+======
+
+:Author: Matthew Wilcox
+
+Overview
+========
+
+The XArray is an abstract data type which behaves like a very large array
+of pointers. It meets many of the same needs as a hash or a conventional
+resizable array. Unlike a hash, it allows you to sensibly go to the
+next or previous entry in a cache-efficient manner. In contrast to a
+resizable array, there is no need to copy data or change MMU mappings in
+order to grow the array. It is more memory-efficient, parallelisable
+and cache friendly than a doubly-linked list. It takes advantage of
+RCU to perform lookups without locking.
+
+The XArray implementation is efficient when the indices used are densely
+clustered; hashing the object and using the hash as the index will not
+perform well. The XArray is optimised for small indices, but still has
+good performance with large indices. If your index can be larger than
+``ULONG_MAX`` then the XArray is not the data type for you. The most
+important user of the XArray is the page cache.
+
+Each non-``NULL`` entry in the array has three bits associated with
+it called marks. Each mark may be set or cleared independently of
+the others. You can iterate over entries which are marked.
+
+Normal pointers may be stored in the XArray directly. They must be 4-byte
+aligned, which is true for any pointer returned from :c:func:`kmalloc` and
+:c:func:`alloc_page`. It isn't true for arbitrary user-space pointers,
+nor for function pointers. You can store pointers to statically allocated
+objects, as long as those objects have an alignment of at least 4.
+
+You can also store integers between 0 and ``LONG_MAX`` in the XArray.
+You must first convert the integer into an entry using :c:func:`xa_mk_value`.
+When you retrieve an entry from the XArray, you can check whether it is
+a value entry by calling :c:func:`xa_is_value`, and convert it back to
+an integer by calling :c:func:`xa_to_value`.
+
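As a rough illustration of the value-entry paragraph above, the following
user-space C sketch models one way an integer can share a pointer-sized slot
with aligned pointers: shift the integer left by one bit and set the low bit
as a marker. The ``model_*`` names are hypothetical and are not part of the
kernel API. The encoding is an assumption chosen to match the 0 to
``LONG_MAX`` range stated above; ``include/linux/xarray.h`` holds the real
helpers.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of xa_mk_value(): encode an integer as an entry. */
static inline void *model_mk_value(unsigned long v)
{
	/* Only 0..LONG_MAX fits once the low bit is used as a marker. */
	assert(v <= (unsigned long)LONG_MAX);
	return (void *)(uintptr_t)((v << 1) | 1UL);
}

/* Hypothetical model of xa_is_value(): value entries have the low bit set. */
static inline bool model_is_value(const void *entry)
{
	return ((uintptr_t)entry & 1UL) != 0;
}

/* Hypothetical model of xa_to_value(): drop the marker bit again. */
static inline unsigned long model_to_value(const void *entry)
{
	return (unsigned long)((uintptr_t)entry >> 1);
}
```

Because normal pointers stored in the array are at least 4-byte aligned,
their low bit is always clear, so a set low bit is enough to tell the two
kinds of entry apart in this model.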
+Some users want to store tagged pointers instead of using the marks
+described above. They can call :c:func:`xa_tag_pointer` to create an
+entry with a tag, :c:func:`xa_untag_pointer` to turn a tagged entry
+back into an untagged pointer and :c:func:`xa_pointer_tag` to retrieve
+the tag of an entry. Tagged pointers use the same bits that are used
+to distinguish value entries from normal pointers, so each user must
+decide whether they want to store value entries or tagged pointers in
+any particular XArray.
+
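The tagged-pointer helpers described above can be pictured with a small
user-space model that keeps the tag in the two low bits left free by 4-byte
alignment. This is a sketch under that assumption only: the ``model_*``
names and ``MODEL_TAG_MASK`` are hypothetical, and which tag values the
kernel actually permits is not stated here.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_TAG_MASK 3UL	/* low two bits of a 4-byte aligned pointer */

/* Hypothetical model of xa_tag_pointer(): fold a small tag into a pointer. */
static inline void *model_tag_pointer(void *p, unsigned long tag)
{
	assert(((uintptr_t)p & MODEL_TAG_MASK) == 0);	/* must be aligned */
	assert(tag <= MODEL_TAG_MASK);
	return (void *)((uintptr_t)p | tag);
}

/* Hypothetical model of xa_untag_pointer(): recover the original pointer. */
static inline void *model_untag_pointer(void *entry)
{
	return (void *)((uintptr_t)entry & ~(uintptr_t)MODEL_TAG_MASK);
}

/* Hypothetical model of xa_pointer_tag(): extract only the tag bits. */
static inline unsigned int model_pointer_tag(void *entry)
{
	return (unsigned int)((uintptr_t)entry & MODEL_TAG_MASK);
}
```

In this model a tag of 1 sets the low bit of the entry, which shows why a
given XArray has to choose between tagged pointers and value entries.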
+The XArray does not support storing :c:func:`IS_ERR` pointers as some
+conflict with value entries or internal entries.
+
+An unusual feature of the XArray is the ability to create entries which
+occupy a range of indices. Once stored to, looking up any index in
+the range will return the same entry as looking up any other index in
+the range. Setting a mark on one index will set it on all of them.
+Storing to any index will store to all of them. Multi-index entries can
+be explicitly split into smaller entries, or storing ``NULL`` into any
+entry will cause the XArray to forget about the range.
+
+Normal API
+==========
+
+Start by initialising an XArray, either with :c:func:`DEFINE_XARRAY`
+for statically allocated XArrays or :c:func:`xa_init` for dynamically
+allocated ones. A freshly-initialised XArray contains a ``NULL``
+pointer at every index.
+
+You can then set entries using :c:func:`xa_store` and get entries
+using :c:func:`xa_load`. xa_store will overwrite any entry with the
+new entry and return the previous entry stored at that index. You can
+use :c:func:`xa_erase` instead of calling :c:func:`xa_store` with a
+``NULL`` entry. There is no difference between an entry that has never
+been stored to, one that has been erased and one that has most recently
+had ``NULL`` stored to it.
+
+You can conditionally replace an entry at an index by using
+:c:func:`xa_cmpxchg`. Like :c:func:`cmpxchg`, it will only succeed if
+the entry at that index has the 'old' value. It also returns the entry
+which was at that index; if it returns the same entry which was passed as
+'old', then :c:func:`xa_cmpxchg` succeeded.
+
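The return-value convention described above can be sketched with a
single-slot, single-threaded user-space model. ``model_cmpxchg`` is a
hypothetical name; it does no locking and allocates no memory, so it only
illustrates how a caller tells success from failure.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical model of the xa_cmpxchg() contract on one slot: store
 * new_entry only if the slot currently holds old, and always return
 * what was in the slot before the call.
 */
static void *model_cmpxchg(void **slot, void *old, void *new_entry)
{
	void *curr;

	assert(slot != NULL);
	curr = *slot;
	if (curr == old)
		*slot = new_entry;
	return curr;	/* caller compares this against 'old' */
}
```

A caller treats the exchange as successful when the returned entry equals
the ``old`` argument it passed; any other return value means the slot was
left unchanged.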
+If you want to only store a new entry to an index if the current entry
+at that index is ``NULL``, you can use :c:func:`xa_insert` which
+returns ``-EEXIST`` if the entry is not empty.
+
+You can enquire whether a mark is set on an entry by using
+:c:func:`xa_get_mark`. If the entry is not ``NULL``, you can set a mark
+on it by using :c:func:`xa_set_mark` and remove the mark from an entry by
+calling :c:func:`xa_clear_mark`. You can ask whether any entry in the
+XArray has a particular mark set by calling :c:func:`xa_marked`.
+
+You can copy entries out of the XArray into a plain array by calling
+:c:func:`xa_extract`. Or you can iterate over the present entries in
+the XArray by calling :c:func:`xa_for_each`. You may prefer to use
+:c:func:`xa_find` or :c:func:`xa_find_after` to move to the next present
+entry in the XArray.
+
+Calling :c:func:`xa_store_range` stores the same entry in a range
+of indices. If you do this, some of the other operations will behave
+in a slightly odd way. For example, marking the entry at one index
+may result in the entry being marked at some, but not all of the other
+indices. Storing into one index may result in the entry retrieved by
+some, but not all of the other indices changing.
+
+Sometimes you need to ensure that a subsequent call to :c:func:`xa_store`
+will not need to allocate memory. The :c:func:`xa_reserve` function
+will store a reserved entry at the indicated index. Users of the
+normal API will see this entry as containing ``NULL``. If you do
+not need to use the reserved entry, you can call :c:func:`xa_release`
+to remove the unused entry. If another user has stored to the entry
+in the meantime, :c:func:`xa_release` will do nothing; if instead you
+want the entry to become ``NULL``, you should use :c:func:`xa_erase`.
+Using :c:func:`xa_insert` on a reserved entry will fail.
+
+If all entries in the array are ``NULL``, the :c:func:`xa_empty` function
+will return ``true``.
+
+Finally, you can remove all entries from an XArray by calling
+:c:func:`xa_destroy`. If the XArray entries are pointers, you may wish
+to free the entries first. You can do this by iterating over all present
+entries in the XArray using the :c:func:`xa_for_each` iterator.
+
+Allocating XArrays
+------------------
+
+If you use :c:func:`DEFINE_XARRAY_ALLOC` to define the XArray, or
+initialise it by passing ``XA_FLAGS_ALLOC`` to :c:func:`xa_init_flags`,
+the XArray changes to track whether entries are in use or not.
+
+You can call :c:func:`xa_alloc` to store the entry at any unused index
+in the XArray. If you need to modify the array from interrupt context,
+you can use :c:func:`xa_alloc_bh` or :c:func:`xa_alloc_irq` to disable
+interrupts while allocating the ID.
+
+Using :c:func:`xa_store`, :c:func:`xa_cmpxchg` or :c:func:`xa_insert`
+will mark the entry as being allocated. Unlike a normal XArray, storing
+``NULL`` will mark the entry as being in use, like :c:func:`xa_reserve`.
+To free an entry, use :c:func:`xa_erase` (or :c:func:`xa_release` if
+you only want to free the entry if it's ``NULL``).
+
+You cannot use ``XA_MARK_0`` with an allocating XArray as this mark
+is used to track whether an entry is free or not. The other marks are
+available for your use.
+
+Memory allocation
+-----------------
+
+The :c:func:`xa_store`, :c:func:`xa_cmpxchg`, :c:func:`xa_alloc`,
+:c:func:`xa_reserve` and :c:func:`xa_insert` functions take a gfp_t
+parameter in case the XArray needs to allocate memory to store this entry.
+If the entry is being deleted, no memory allocation needs to be performed,
+and the GFP flags specified will be ignored.
+
+It is possible for no memory to be allocatable, particularly if you pass
+a restrictive set of GFP flags. In that case, the functions return a
+special value which can be turned into an errno using :c:func:`xa_err`.
+If you don't need to know exactly which error occurred, using
+:c:func:`xa_is_err` is slightly more efficient.
+
+Locking
+-------
+
+When using the Normal API, you do not have to worry about locking.
+The XArray uses RCU and an internal spinlock to synchronise access:
+
+No lock needed:
+ * :c:func:`xa_empty`
+ * :c:func:`xa_marked`
+
+Takes RCU read lock:
+ * :c:func:`xa_load`
+ * :c:func:`xa_for_each`
+ * :c:func:`xa_find`
+ * :c:func:`xa_find_after`
+ * :c:func:`xa_extract`
+ * :c:func:`xa_get_mark`
+
+Takes xa_lock internally:
+ * :c:func:`xa_store`
+ * :c:func:`xa_store_bh`
+ * :c:func:`xa_store_irq`
+ * :c:func:`xa_insert`
+ * :c:func:`xa_insert_bh`
+ * :c:func:`xa_insert_irq`
+ * :c:func:`xa_erase`
+ * :c:func:`xa_erase_bh`
+ * :c:func:`xa_erase_irq`
+ * :c:func:`xa_cmpxchg`
+ * :c:func:`xa_cmpxchg_bh`
+ * :c:func:`xa_cmpxchg_irq`
+ * :c:func:`xa_store_range`
+ * :c:func:`xa_alloc`
+ * :c:func:`xa_alloc_bh`
+ * :c:func:`xa_alloc_irq`
+ * :c:func:`xa_reserve`
+ * :c:func:`xa_reserve_bh`
+ * :c:func:`xa_reserve_irq`
+ * :c:func:`xa_destroy`
+ * :c:func:`xa_set_mark`
+ * :c:func:`xa_clear_mark`
+
+Assumes xa_lock held on entry:
+ * :c:func:`__xa_store`
+ * :c:func:`__xa_insert`
+ * :c:func:`__xa_erase`
+ * :c:func:`__xa_cmpxchg`
+ * :c:func:`__xa_alloc`
+ * :c:func:`__xa_reserve`
+ * :c:func:`__xa_set_mark`
+ * :c:func:`__xa_clear_mark`
+
+If you want to take advantage of the lock to protect the data structures
+that you are storing in the XArray, you can call :c:func:`xa_lock`
+before calling :c:func:`xa_load`, then take a reference count on the
+object you have found before calling :c:func:`xa_unlock`. This will
+prevent stores from removing the object from the array between looking
+up the object and incrementing the refcount. You can also use RCU to
+avoid dereferencing freed memory, but an explanation of that is beyond
+the scope of this document.
+
+The XArray does not disable interrupts or softirqs while modifying
+the array. It is safe to read the XArray from interrupt or softirq
+context as the RCU lock provides enough protection.
+
+If, for example, you want to store entries in the XArray in process
+context and then erase them in softirq context, you can do that this way::
+
+    void foo_init(struct foo *foo)
+    {
+        xa_init_flags(&foo->array, XA_FLAGS_LOCK_BH);
+    }
+
+    int foo_store(struct foo *foo, unsigned long index, void *entry)
+    {
+        int err;
+
+        xa_lock_bh(&foo->array);
+        err = xa_err(__xa_store(&foo->array, index, entry, GFP_KERNEL));
+        if (!err)
+            foo->count++;
+        xa_unlock_bh(&foo->array);
+        return err;
+    }
+
+    /* foo_erase() is only called from softirq context */
+    void foo_erase(struct foo *foo, unsigned long index)
+    {
+        xa_lock(&foo->array);
+        __xa_erase(&foo->array, index);
+        foo->count--;
+        xa_unlock(&foo->array);
+    }
+
+If you are going to modify the XArray from interrupt or softirq context,
+you need to initialise the array using :c:func:`xa_init_flags`, passing
+``XA_FLAGS_LOCK_IRQ`` or ``XA_FLAGS_LOCK_BH``.
+
+The above example also shows a common pattern of wanting to extend the
+coverage of the xa_lock on the store side to protect some statistics
+associated with the array.
+
+Sharing the XArray with interrupt context is also possible, either
+using :c:func:`xa_lock_irqsave` in both the interrupt handler and process
+context, or :c:func:`xa_lock_irq` in process context and :c:func:`xa_lock`
+in the interrupt handler. Some of the more common patterns have helper
+functions such as :c:func:`xa_store_bh`, :c:func:`xa_store_irq`,
+:c:func:`xa_erase_bh`, :c:func:`xa_erase_irq`, :c:func:`xa_cmpxchg_bh`
+and :c:func:`xa_cmpxchg_irq`.
+
+Sometimes you need to protect access to the XArray with a mutex because
+that lock sits above another mutex in the locking hierarchy. That does
+not entitle you to use functions like :c:func:`__xa_erase` without taking
+the xa_lock; the xa_lock is used for lockdep validation and will be used
+for other purposes in the future.
+
+The :c:func:`__xa_set_mark` and :c:func:`__xa_clear_mark` functions are also
+available for situations where you look up an entry and want to atomically
+set or clear a mark. It may be more efficient to use the advanced API
+in this case, as it will save you from walking the tree twice.
+
+Advanced API
+============
+
+The advanced API offers more flexibility and better performance at the
+cost of an interface which can be harder to use and has fewer safeguards.
+No locking is done for you by the advanced API, and you are required
+to use the xa_lock while modifying the array. You can choose whether
+to use the xa_lock or the RCU lock while doing read-only operations on
+the array. You can mix advanced and normal operations on the same array;
+indeed the normal API is implemented in terms of the advanced API. The
+advanced API is only available to modules with a GPL-compatible license.
+
+The advanced API is based around the xa_state. This is an opaque data
+structure which you declare on the stack using the :c:func:`XA_STATE`
+macro. This macro initialises the xa_state ready to start walking
+around the XArray. It is used as a cursor to maintain the position
+in the XArray and let you compose various operations together without
+having to restart from the top every time.
+
+The xa_state is also used to store errors. You can call
+:c:func:`xas_error` to retrieve the error. All operations check whether
+the xa_state is in an error state before proceeding, so there's no need
+for you to check for an error after each call; you can make multiple
+calls in succession and only check at a convenient point. The only
+errors currently generated by the XArray code itself are ``ENOMEM`` and
+``EINVAL``, but it supports arbitrary errors in case you want to call
+:c:func:`xas_set_err` yourself.
+
+If the xa_state is holding an ``ENOMEM`` error, calling :c:func:`xas_nomem`
+will attempt to allocate more memory using the specified gfp flags and
+cache it in the xa_state for the next attempt. The idea is that you take
+the xa_lock, attempt the operation and drop the lock. The operation
+attempts to allocate memory while holding the lock, but that allocation
+is more likely to fail. Once you have dropped the lock, :c:func:`xas_nomem`
+can try harder to allocate more memory. It will return ``true`` if it
+is worth retrying the operation (i.e. that there was a memory error *and*
+more memory was allocated). If it has previously allocated memory, and
+that memory wasn't used, and there is no error (or some error that isn't
+``ENOMEM``), then it will free the memory previously allocated.
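+
+One way to write that retry loop is sketched below; ``foo_store_entry()``
+is a hypothetical helper, and ``GFP_KERNEL`` assumes the caller is able
+to sleep once the lock has been dropped::
+
+    int foo_store_entry(struct xarray *xa, unsigned long index, void *entry)
+    {
+            XA_STATE(xas, xa, index);
+
+            do {
+                    xas_lock(&xas);
+                    xas_store(&xas, entry);
+                    xas_unlock(&xas);
+                    /* Loop again only if ENOMEM was set and memory was found. */
+            } while (xas_nomem(&xas, GFP_KERNEL));
+
+            /* 0 on success, otherwise a negative errno. */
+            return xas_error(&xas);
+    }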
+
+Internal Entries
+----------------
+
+The XArray reserves some entries for its own purposes. These are never
+exposed through the normal API, but when using the advanced API, it's
+possible to see them. Usually the best way to handle them is to pass them
+to :c:func:`xas_retry`, and retry the operation if it returns ``true``.
+
+.. flat-table::
+   :widths: 1 1 6
+
+   * - Name
+     - Test
+     - Usage
+
+   * - Node
+     - :c:func:`xa_is_node`
+     - An XArray node. May be visible when using a multi-index xa_state.
+
+   * - Sibling
+     - :c:func:`xa_is_sibling`
+     - A non-canonical entry for a multi-index entry. The value indicates
+       which slot in this node has the canonical entry.
+
+   * - Retry
+     - :c:func:`xa_is_retry`
+     - This entry is currently being modified by a thread which has the
+       xa_lock. The node containing this entry may be freed at the end
+       of this RCU period. You should restart the lookup from the head
+       of the array.
+
+   * - Zero
+     - :c:func:`xa_is_zero`
+     - Zero entries appear as ``NULL`` through the normal API, but occupy
+       an entry in the XArray which can be used to reserve the index for
+       future use. This is used by allocating XArrays for allocated entries
+       which are ``NULL``.
+
+Other internal entries may be added in the future. As far as possible, they
+will be handled by :c:func:`xas_retry`.
+
+Additional functionality
+------------------------
+
+The :c:func:`xas_create_range` function allocates all the necessary memory
+to store every entry in a range. It will set ``ENOMEM`` in the xa_state if
+it cannot allocate memory.
+
+You can use :c:func:`xas_init_marks` to reset the marks on an entry
+to their default state. This is usually all marks clear, unless the
+XArray is marked with ``XA_FLAGS_TRACK_FREE``, in which case mark 0 is set
+and all other marks are clear. Replacing one entry with another using
+:c:func:`xas_store` will not reset the marks on that entry; if you want
+the marks reset, you should do that explicitly.
+
+The :c:func:`xas_load` function will walk the xa_state as close to the
+entry as it can. If you know the xa_state has already been walked to the
+entry and need to check that the entry hasn't changed, you can use
+:c:func:`xas_reload` to save a function call.
+
+If you need to move to a different index in the XArray, call
+:c:func:`xas_set`. This resets the cursor to the top of the tree, which
+will generally make the next operation walk the cursor to the desired
+spot in the tree. If you want to move to the next or previous index,
+call :c:func:`xas_next` or :c:func:`xas_prev`. Setting the index does
+not walk the cursor around the array so does not require a lock to be
+held, while moving to the next or previous index does.
+
+You can search for the next present entry using :c:func:`xas_find`. This
+is the equivalent of both :c:func:`xa_find` and :c:func:`xa_find_after`;
+if the cursor has been walked to an entry, then it will find the next
+entry after the one currently referenced. If not, it will return the
+entry at the index of the xa_state. Using :c:func:`xas_next_entry` to
+move to the next present entry instead of :c:func:`xas_find` will save
+a function call in the majority of cases at the expense of emitting more
+inline code.
+
+The :c:func:`xas_find_marked` function is similar. If the xa_state has
+not been walked, it will return the entry at the index of the xa_state,
+if it is marked. Otherwise, it will return the first marked entry after
+the entry referenced by the xa_state. The :c:func:`xas_next_marked`
+function is the equivalent of :c:func:`xas_next_entry`.
+
+When iterating over a range of the XArray using :c:func:`xas_for_each`
+or :c:func:`xas_for_each_marked`, it may be necessary to temporarily stop
+the iteration. The :c:func:`xas_pause` function exists for this purpose.
+After you have done the necessary work and wish to resume, the xa_state
+is in an appropriate state to continue the iteration after the entry
+you last processed. If you have interrupts disabled while iterating,
+then it is good manners to pause the iteration and reenable interrupts
+every ``XA_CHECK_SCHED`` entries.
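+
+A sketch of such a loop follows. It assumes the array was initialised
+with ``XA_FLAGS_LOCK_IRQ`` and that the caller runs in process context;
+``foo_scan()`` and ``foo_process()`` are hypothetical names::
+
+    void foo_scan(struct xarray *xa)
+    {
+            XA_STATE(xas, xa, 0);
+            unsigned int scanned = 0;
+            void *entry;
+
+            xas_lock_irq(&xas);
+            xas_for_each(&xas, entry, ULONG_MAX) {
+                    foo_process(entry);     /* hypothetical per-entry work */
+
+                    if (++scanned % XA_CHECK_SCHED)
+                            continue;
+
+                    /* Let the iteration resume after this entry. */
+                    xas_pause(&xas);
+                    xas_unlock_irq(&xas);
+                    cond_resched();
+                    xas_lock_irq(&xas);
+            }
+            xas_unlock_irq(&xas);
+    }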
+
+The :c:func:`xas_get_mark`, :c:func:`xas_set_mark` and
+:c:func:`xas_clear_mark` functions require the xa_state cursor to have
+been moved to the appropriate location in the XArray; they will do
+nothing if you have called :c:func:`xas_pause` or :c:func:`xas_set`
+immediately before.
+
+You can call :c:func:`xas_set_update` to have a callback function
+called each time the XArray updates a node. This is used by the page
+cache workingset code to maintain its list of nodes which contain only
+shadow entries.
+
+Multi-Index Entries
+-------------------
+
+The XArray has the ability to tie multiple indices together so that
+operations on one index affect all indices. For example, storing into
+any index will change the value of the entry retrieved from any index.
+Setting or clearing a mark on any index will set or clear the mark
+on every index that is tied together. The current implementation
+only allows tying ranges which are aligned powers of two together;
+e.g. indices 64-127 may be tied together, but 2-6 may not be. This may
+save substantial quantities of memory; for example tying 512 entries
+together will save over 4kB.
+
+You can create a multi-index entry by using :c:func:`XA_STATE_ORDER`
+or :c:func:`xas_set_order` followed by a call to :c:func:`xas_store`.
+Calling :c:func:`xas_load` with a multi-index xa_state will walk the
+xa_state to the right location in the tree, but the return value is not
+meaningful, potentially being an internal entry or ``NULL`` even when there
+is an entry stored within the range. Calling :c:func:`xas_find_conflict`
+will return the first entry within the range or ``NULL`` if there are no
+entries in the range. The :c:func:`xas_for_each_conflict` iterator will
+iterate over every entry which overlaps the specified range.
+
+If :c:func:`xas_load` encounters a multi-index entry, the xa_index
+in the xa_state will not be changed. When iterating over an XArray
+or calling :c:func:`xas_find`, if the initial index is in the middle
+of a multi-index entry, it will not be altered. Subsequent calls
+or iterations will move the index to the first index in the range.
+Each entry will only be returned once, no matter how many indices it
+occupies.
+
+Using :c:func:`xas_next` or :c:func:`xas_prev` with a multi-index xa_state
+is not supported. Using either of these functions on a multi-index entry
+will reveal sibling entries; these should be skipped over by the caller.
+
+Storing ``NULL`` into any index of a multi-index entry will set the entry
+at every index to ``NULL`` and dissolve the tie. Splitting a multi-index
+entry into entries occupying smaller ranges is not yet supported.
+
+Functions and structures
+========================
+
+.. kernel-doc:: include/linux/xarray.h
+.. kernel-doc:: lib/xarray.c