aboutsummaryrefslogtreecommitdiffstats
path: root/Documentation/core-api
diff options
context:
space:
mode:
Diffstat (limited to 'Documentation/core-api')
-rw-r--r--Documentation/core-api/asm-annotations.rst222
-rw-r--r--Documentation/core-api/atomic_ops.rst664
-rw-r--r--Documentation/core-api/cachetlb.rst103
-rw-r--r--Documentation/core-api/cpu_hotplug.rst649
-rw-r--r--Documentation/core-api/debugging-via-ohci1394.rst185
-rw-r--r--Documentation/core-api/dma-api-howto.rst915
-rw-r--r--Documentation/core-api/dma-api.rst846
-rw-r--r--Documentation/core-api/dma-attributes.rst132
-rw-r--r--Documentation/core-api/dma-isa-lpc.rst152
-rw-r--r--Documentation/core-api/entry.rst279
-rw-r--r--Documentation/core-api/genericirq.rst4
-rw-r--r--Documentation/core-api/idr.rst35
-rw-r--r--Documentation/core-api/index.rst32
-rw-r--r--Documentation/core-api/irq/concepts.rst24
-rw-r--r--Documentation/core-api/irq/index.rst11
-rw-r--r--Documentation/core-api/irq/irq-affinity.rst70
-rw-r--r--Documentation/core-api/irq/irq-domain.rst297
-rw-r--r--Documentation/core-api/irq/irqflags-tracing.rst52
-rw-r--r--Documentation/core-api/kernel-api.rst82
-rw-r--r--Documentation/core-api/kobject.rst40
-rw-r--r--Documentation/core-api/kref.rst323
-rw-r--r--Documentation/core-api/local_ops.rst2
-rw-r--r--Documentation/core-api/maple_tree.rst221
-rw-r--r--Documentation/core-api/memory-allocation.rst65
-rw-r--r--Documentation/core-api/memory-hotplug.rst3
-rw-r--r--Documentation/core-api/mm-api.rst71
-rw-r--r--Documentation/core-api/netlink.rst102
-rw-r--r--Documentation/core-api/packing.rst2
-rw-r--r--Documentation/core-api/padata.rst61
-rw-r--r--Documentation/core-api/pin_user_pages.rst88
-rw-r--r--Documentation/core-api/printk-basics.rst112
-rw-r--r--Documentation/core-api/printk-formats.rst155
-rw-r--r--Documentation/core-api/printk-index.rst137
-rw-r--r--Documentation/core-api/protection-keys.rst43
-rw-r--r--Documentation/core-api/rbtree.rst429
-rw-r--r--Documentation/core-api/symbol-namespaces.rst30
-rw-r--r--Documentation/core-api/this_cpu_ops.rst337
-rw-r--r--Documentation/core-api/timekeeping.rst1
-rw-r--r--Documentation/core-api/unaligned-memory-access.rst265
-rw-r--r--Documentation/core-api/watch_queue.rst343
-rw-r--r--Documentation/core-api/workqueue.rst409
-rw-r--r--Documentation/core-api/wrappers/atomic_bitops.rst18
-rw-r--r--Documentation/core-api/wrappers/atomic_t.rst19
-rw-r--r--Documentation/core-api/wrappers/memory-barriers.rst18
-rw-r--r--Documentation/core-api/xarray.rst28
45 files changed, 6987 insertions, 1089 deletions
diff --git a/Documentation/core-api/asm-annotations.rst b/Documentation/core-api/asm-annotations.rst
new file mode 100644
index 000000000000..11c96d3f9ad6
--- /dev/null
+++ b/Documentation/core-api/asm-annotations.rst
@@ -0,0 +1,222 @@
+Assembler Annotations
+=====================
+
+Copyright (c) 2017-2019 Jiri Slaby
+
+This document describes the new macros for annotation of data and code in
+assembly. In particular, it contains information about ``SYM_FUNC_START``,
+``SYM_FUNC_END``, ``SYM_CODE_START``, and similar.
+
+Rationale
+---------
+Some code like entries, trampolines, or boot code needs to be written in
+assembly. The same as in C, such code is grouped into functions and
+accompanied with data. Standard assemblers do not force users into precisely
+marking these pieces as code, data, or even specifying their length.
+Nevertheless, assemblers provide developers with such annotations to aid
+debuggers throughout assembly. On top of that, developers also want to mark
+some functions as *global* in order to be visible outside of their translation
+units.
+
+Over time, the Linux kernel has adopted macros from various projects (like
+``binutils``) to facilitate such annotations. So for historic reasons,
+developers have been using ``ENTRY``, ``END``, ``ENDPROC``, and other
+annotations in assembly. Due to the lack of their documentation, the macros
+are used in rather wrong contexts at some locations. Clearly, ``ENTRY`` was
+intended to denote the beginning of global symbols (be it data or code).
+``END`` used to mark the end of data or end of special functions with
+*non-standard* calling convention. In contrast, ``ENDPROC`` should annotate
+only ends of *standard* functions.
+
+When these macros are used correctly, they help assemblers generate a nice
+object with both sizes and types set correctly. For example, the result of
+``arch/x86/lib/putuser.S``::
+
+ Num: Value Size Type Bind Vis Ndx Name
+ 25: 0000000000000000 33 FUNC GLOBAL DEFAULT 1 __put_user_1
+ 29: 0000000000000030 37 FUNC GLOBAL DEFAULT 1 __put_user_2
+ 32: 0000000000000060 36 FUNC GLOBAL DEFAULT 1 __put_user_4
+ 35: 0000000000000090 37 FUNC GLOBAL DEFAULT 1 __put_user_8
+
+This is not only important for debugging purposes. When there are properly
+annotated objects like this, tools can be run on them to generate more useful
+information. In particular, on properly annotated objects, ``objtool`` can be
+run to check and fix the object if needed. Currently, ``objtool`` can report
+missing frame pointer setup/destruction in functions. It can also
+automatically generate annotations for the ORC unwinder
+(Documentation/arch/x86/orc-unwinder.rst)
+for most code. Both of these are especially important to support reliable
+stack traces which are in turn necessary for kernel live patching
+(Documentation/livepatch/livepatch.rst).
+
+Caveat and Discussion
+---------------------
+As one might realize, there were only three macros previously. That is indeed
+insufficient to cover all the combinations of cases:
+
+* standard/non-standard function
+* code/data
+* global/local symbol
+
+There was a discussion_ and instead of extending the current ``ENTRY/END*``
+macros, it was decided that brand new macros should be introduced instead::
+
+ So how about using macro names that actually show the purpose, instead
+ of importing all the crappy, historic, essentially randomly chosen
+ debug symbol macro names from the binutils and older kernels?
+
+.. _discussion: https://lore.kernel.org/r/20170217104757.28588-1-jslaby@suse.cz
+
+Macros Description
+------------------
+
+The new macros are prefixed with the ``SYM_`` prefix and can be divided into
+three main groups:
+
+1. ``SYM_FUNC_*`` -- to annotate C-like functions. This means functions with
+ standard C calling conventions. For example, on x86, this means that the
+ stack contains a return address at the predefined place and a return from
+ the function can happen in a standard way. When frame pointers are enabled,
+ save/restore of frame pointer shall happen at the start/end of a function,
+ respectively, too.
+
+ Checking tools like ``objtool`` should ensure such marked functions conform
+ to these rules. The tools can also easily annotate these functions with
+ debugging information (like *ORC data*) automatically.
+
+2. ``SYM_CODE_*`` -- special functions called with special stack. Be it
+ interrupt handlers with special stack content, trampolines, or startup
+ functions.
+
+ Checking tools mostly ignore checking of these functions. But some debug
+ information still can be generated automatically. For correct debug data,
+ this code needs hints like ``UNWIND_HINT_REGS`` provided by developers.
+
+3. ``SYM_DATA*`` -- obviously data belonging to ``.data`` sections and not to
+ ``.text``. Data do not contain instructions, so they have to be treated
+ specially by the tools: they should not treat the bytes as instructions,
+ nor assign any debug information to them.
+
+Instruction Macros
+~~~~~~~~~~~~~~~~~~
+This section covers ``SYM_FUNC_*`` and ``SYM_CODE_*`` enumerated above.
+
+``objtool`` requires that all code must be contained in an ELF symbol. Symbol
+names that have a ``.L`` prefix do not emit symbol table entries. ``.L``
+prefixed symbols can be used within a code region, but should be avoided for
+denoting a range of code via ``SYM_*_START/END`` annotations.
+
+* ``SYM_FUNC_START`` and ``SYM_FUNC_START_LOCAL`` are supposed to be **the
+ most frequent markings**. They are used for functions with standard calling
+ conventions -- global and local. Like in C, they both align the functions to
+ architecture specific ``__ALIGN`` bytes. There are also ``_NOALIGN`` variants
+ for special cases where developers do not want this implicit alignment.
+
+ ``SYM_FUNC_START_WEAK`` and ``SYM_FUNC_START_WEAK_NOALIGN`` markings are
+ also offered as an assembler counterpart to the *weak* attribute known from
+ C.
+
+ All of these **shall** be coupled with ``SYM_FUNC_END``. First, it marks
+ the sequence of instructions as a function and computes its size to the
+ generated object file. Second, it also eases checking and processing such
+ object files as the tools can trivially find exact function boundaries.
+
+ So in most cases, developers should write something like in the following
+ example, having some asm instructions in between the macros, of course::
+
+ SYM_FUNC_START(memset)
+ ... asm insns ...
+ SYM_FUNC_END(memset)
+
+ In fact, this kind of annotation corresponds to the now deprecated ``ENTRY``
+ and ``ENDPROC`` macros.
+
+* ``SYM_FUNC_ALIAS``, ``SYM_FUNC_ALIAS_LOCAL``, and ``SYM_FUNC_ALIAS_WEAK`` can
+ be used to define multiple names for a function. The typical use is::
+
+ SYM_FUNC_START(__memset)
+ ... asm insns ...
+ SYN_FUNC_END(__memset)
+ SYM_FUNC_ALIAS(memset, __memset)
+
+ In this example, one can call ``__memset`` or ``memset`` with the same
+ result, except the debug information for the instructions is generated to
+ the object file only once -- for the non-``ALIAS`` case.
+
+* ``SYM_CODE_START`` and ``SYM_CODE_START_LOCAL`` should be used only in
+ special cases -- if you know what you are doing. This is used exclusively
+ for interrupt handlers and similar where the calling convention is not the C
+ one. ``_NOALIGN`` variants exist too. The use is the same as for the ``FUNC``
+ category above::
+
+ SYM_CODE_START_LOCAL(bad_put_user)
+ ... asm insns ...
+ SYM_CODE_END(bad_put_user)
+
+ Again, every ``SYM_CODE_START*`` **shall** be coupled by ``SYM_CODE_END``.
+
+ To some extent, this category corresponds to deprecated ``ENTRY`` and
+ ``END``. Except ``END`` had several other meanings too.
+
+* ``SYM_INNER_LABEL*`` is used to denote a label inside some
+ ``SYM_{CODE,FUNC}_START`` and ``SYM_{CODE,FUNC}_END``. They are very similar
+ to C labels, except they can be made global. An example of use::
+
+ SYM_CODE_START(ftrace_caller)
+ /* save_mcount_regs fills in first two parameters */
+ ...
+
+ SYM_INNER_LABEL(ftrace_caller_op_ptr, SYM_L_GLOBAL)
+ /* Load the ftrace_ops into the 3rd parameter */
+ ...
+
+ SYM_INNER_LABEL(ftrace_call, SYM_L_GLOBAL)
+ call ftrace_stub
+ ...
+ retq
+ SYM_CODE_END(ftrace_caller)
+
+Data Macros
+~~~~~~~~~~~
+Similar to instructions, there is a couple of macros to describe data in the
+assembly.
+
+* ``SYM_DATA_START`` and ``SYM_DATA_START_LOCAL`` mark the start of some data
+ and shall be used in conjunction with either ``SYM_DATA_END``, or
+ ``SYM_DATA_END_LABEL``. The latter adds also a label to the end, so that
+ people can use ``lstack`` and (local) ``lstack_end`` in the following
+ example::
+
+ SYM_DATA_START_LOCAL(lstack)
+ .skip 4096
+ SYM_DATA_END_LABEL(lstack, SYM_L_LOCAL, lstack_end)
+
+* ``SYM_DATA`` and ``SYM_DATA_LOCAL`` are variants for simple, mostly one-line
+ data::
+
+ SYM_DATA(HEAP, .long rm_heap)
+ SYM_DATA(heap_end, .long rm_stack)
+
+ In the end, they expand to ``SYM_DATA_START`` with ``SYM_DATA_END``
+ internally.
+
+Support Macros
+~~~~~~~~~~~~~~
+All the above reduce themselves to some invocation of ``SYM_START``,
+``SYM_END``, or ``SYM_ENTRY`` at last. Normally, developers should avoid using
+these.
+
+Further, in the above examples, one could see ``SYM_L_LOCAL``. There are also
+``SYM_L_GLOBAL`` and ``SYM_L_WEAK``. All are intended to denote linkage of a
+symbol marked by them. They are used either in ``_LABEL`` variants of the
+earlier macros, or in ``SYM_START``.
+
+
+Overriding Macros
+~~~~~~~~~~~~~~~~~
+Architecture can also override any of the macros in their own
+``asm/linkage.h``, including macros specifying the type of a symbol
+(``SYM_T_FUNC``, ``SYM_T_OBJECT``, and ``SYM_T_NONE``). As every macro
+described in this file is surrounded by ``#ifdef`` + ``#endif``, it is enough
+to define the macros differently in the aforementioned architecture-dependent
+header.
diff --git a/Documentation/core-api/atomic_ops.rst b/Documentation/core-api/atomic_ops.rst
deleted file mode 100644
index 724583453e1f..000000000000
--- a/Documentation/core-api/atomic_ops.rst
+++ /dev/null
@@ -1,664 +0,0 @@
-=======================================================
-Semantics and Behavior of Atomic and Bitmask Operations
-=======================================================
-
-:Author: David S. Miller
-
-This document is intended to serve as a guide to Linux port
-maintainers on how to implement atomic counter, bitops, and spinlock
-interfaces properly.
-
-Atomic Type And Operations
-==========================
-
-The atomic_t type should be defined as a signed integer and
-the atomic_long_t type as a signed long integer. Also, they should
-be made opaque such that any kind of cast to a normal C integer type
-will fail. Something like the following should suffice::
-
- typedef struct { int counter; } atomic_t;
- typedef struct { long counter; } atomic_long_t;
-
-Historically, counter has been declared volatile. This is now discouraged.
-See :ref:`Documentation/process/volatile-considered-harmful.rst
-<volatile_considered_harmful>` for the complete rationale.
-
-local_t is very similar to atomic_t. If the counter is per CPU and only
-updated by one CPU, local_t is probably more appropriate. Please see
-:ref:`Documentation/core-api/local_ops.rst <local_ops>` for the semantics of
-local_t.
-
-The first operations to implement for atomic_t's are the initializers and
-plain writes. ::
-
- #define ATOMIC_INIT(i) { (i) }
- #define atomic_set(v, i) ((v)->counter = (i))
-
-The first macro is used in definitions, such as::
-
- static atomic_t my_counter = ATOMIC_INIT(1);
-
-The initializer is atomic in that the return values of the atomic operations
-are guaranteed to be correct reflecting the initialized value if the
-initializer is used before runtime. If the initializer is used at runtime, a
-proper implicit or explicit read memory barrier is needed before reading the
-value with atomic_read from another thread.
-
-As with all of the ``atomic_`` interfaces, replace the leading ``atomic_``
-with ``atomic_long_`` to operate on atomic_long_t.
-
-The second interface can be used at runtime, as in::
-
- struct foo { atomic_t counter; };
- ...
-
- struct foo *k;
-
- k = kmalloc(sizeof(*k), GFP_KERNEL);
- if (!k)
- return -ENOMEM;
- atomic_set(&k->counter, 0);
-
-The setting is atomic in that the return values of the atomic operations by
-all threads are guaranteed to be correct reflecting either the value that has
-been set with this operation or set with another operation. A proper implicit
-or explicit memory barrier is needed before the value set with the operation
-is guaranteed to be readable with atomic_read from another thread.
-
-Next, we have::
-
- #define atomic_read(v) ((v)->counter)
-
-which simply reads the counter value currently visible to the calling thread.
-The read is atomic in that the return value is guaranteed to be one of the
-values initialized or modified with the interface operations if a proper
-implicit or explicit memory barrier is used after possible runtime
-initialization by any other thread and the value is modified only with the
-interface operations. atomic_read does not guarantee that the runtime
-initialization by any other thread is visible yet, so the user of the
-interface must take care of that with a proper implicit or explicit memory
-barrier.
-
-.. warning::
-
- ``atomic_read()`` and ``atomic_set()`` DO NOT IMPLY BARRIERS!
-
- Some architectures may choose to use the volatile keyword, barriers, or
- inline assembly to guarantee some degree of immediacy for atomic_read()
- and atomic_set(). This is not uniformly guaranteed, and may change in
- the future, so all users of atomic_t should treat atomic_read() and
- atomic_set() as simple C statements that may be reordered or optimized
- away entirely by the compiler or processor, and explicitly invoke the
- appropriate compiler and/or memory barrier for each use case. Failure
- to do so will result in code that may suddenly break when used with
- different architectures or compiler optimizations, or even changes in
- unrelated code which changes how the compiler optimizes the section
- accessing atomic_t variables.
-
-Properly aligned pointers, longs, ints, and chars (and unsigned
-equivalents) may be atomically loaded from and stored to in the same
-sense as described for atomic_read() and atomic_set(). The READ_ONCE()
-and WRITE_ONCE() macros should be used to prevent the compiler from using
-optimizations that might otherwise optimize accesses out of existence on
-the one hand, or that might create unsolicited accesses on the other.
-
-For example consider the following code::
-
- while (a > 0)
- do_something();
-
-If the compiler can prove that do_something() does not store to the
-variable a, then the compiler is within its rights transforming this to
-the following::
-
- if (a > 0)
- for (;;)
- do_something();
-
-If you don't want the compiler to do this (and you probably don't), then
-you should use something like the following::
-
- while (READ_ONCE(a) > 0)
- do_something();
-
-Alternatively, you could place a barrier() call in the loop.
-
-For another example, consider the following code::
-
- tmp_a = a;
- do_something_with(tmp_a);
- do_something_else_with(tmp_a);
-
-If the compiler can prove that do_something_with() does not store to the
-variable a, then the compiler is within its rights to manufacture an
-additional load as follows::
-
- tmp_a = a;
- do_something_with(tmp_a);
- tmp_a = a;
- do_something_else_with(tmp_a);
-
-This could fatally confuse your code if it expected the same value
-to be passed to do_something_with() and do_something_else_with().
-
-The compiler would be likely to manufacture this additional load if
-do_something_with() was an inline function that made very heavy use
-of registers: reloading from variable a could save a flush to the
-stack and later reload. To prevent the compiler from attacking your
-code in this manner, write the following::
-
- tmp_a = READ_ONCE(a);
- do_something_with(tmp_a);
- do_something_else_with(tmp_a);
-
-For a final example, consider the following code, assuming that the
-variable a is set at boot time before the second CPU is brought online
-and never changed later, so that memory barriers are not needed::
-
- if (a)
- b = 9;
- else
- b = 42;
-
-The compiler is within its rights to manufacture an additional store
-by transforming the above code into the following::
-
- b = 42;
- if (a)
- b = 9;
-
-This could come as a fatal surprise to other code running concurrently
-that expected b to never have the value 42 if a was zero. To prevent
-the compiler from doing this, write something like::
-
- if (a)
- WRITE_ONCE(b, 9);
- else
- WRITE_ONCE(b, 42);
-
-Don't even -think- about doing this without proper use of memory barriers,
-locks, or atomic operations if variable a can change at runtime!
-
-.. warning::
-
- ``READ_ONCE()`` OR ``WRITE_ONCE()`` DO NOT IMPLY A BARRIER!
-
-Now, we move onto the atomic operation interfaces typically implemented with
-the help of assembly code. ::
-
- void atomic_add(int i, atomic_t *v);
- void atomic_sub(int i, atomic_t *v);
- void atomic_inc(atomic_t *v);
- void atomic_dec(atomic_t *v);
-
-These four routines add and subtract integral values to/from the given
-atomic_t value. The first two routines pass explicit integers by
-which to make the adjustment, whereas the latter two use an implicit
-adjustment value of "1".
-
-One very important aspect of these two routines is that they DO NOT
-require any explicit memory barriers. They need only perform the
-atomic_t counter update in an SMP safe manner.
-
-Next, we have::
-
- int atomic_inc_return(atomic_t *v);
- int atomic_dec_return(atomic_t *v);
-
-These routines add 1 and subtract 1, respectively, from the given
-atomic_t and return the new counter value after the operation is
-performed.
-
-Unlike the above routines, it is required that these primitives
-include explicit memory barriers that are performed before and after
-the operation. It must be done such that all memory operations before
-and after the atomic operation calls are strongly ordered with respect
-to the atomic operation itself.
-
-For example, it should behave as if a smp_mb() call existed both
-before and after the atomic operation.
-
-If the atomic instructions used in an implementation provide explicit
-memory barrier semantics which satisfy the above requirements, that is
-fine as well.
-
-Let's move on::
-
- int atomic_add_return(int i, atomic_t *v);
- int atomic_sub_return(int i, atomic_t *v);
-
-These behave just like atomic_{inc,dec}_return() except that an
-explicit counter adjustment is given instead of the implicit "1".
-This means that like atomic_{inc,dec}_return(), the memory barrier
-semantics are required.
-
-Next::
-
- int atomic_inc_and_test(atomic_t *v);
- int atomic_dec_and_test(atomic_t *v);
-
-These two routines increment and decrement by 1, respectively, the
-given atomic counter. They return a boolean indicating whether the
-resulting counter value was zero or not.
-
-Again, these primitives provide explicit memory barrier semantics around
-the atomic operation::
-
- int atomic_sub_and_test(int i, atomic_t *v);
-
-This is identical to atomic_dec_and_test() except that an explicit
-decrement is given instead of the implicit "1". This primitive must
-provide explicit memory barrier semantics around the operation::
-
- int atomic_add_negative(int i, atomic_t *v);
-
-The given increment is added to the given atomic counter value. A boolean
-is return which indicates whether the resulting counter value is negative.
-This primitive must provide explicit memory barrier semantics around
-the operation.
-
-Then::
-
- int atomic_xchg(atomic_t *v, int new);
-
-This performs an atomic exchange operation on the atomic variable v, setting
-the given new value. It returns the old value that the atomic variable v had
-just before the operation.
-
-atomic_xchg must provide explicit memory barriers around the operation. ::
-
- int atomic_cmpxchg(atomic_t *v, int old, int new);
-
-This performs an atomic compare exchange operation on the atomic value v,
-with the given old and new values. Like all atomic_xxx operations,
-atomic_cmpxchg will only satisfy its atomicity semantics as long as all
-other accesses of \*v are performed through atomic_xxx operations.
-
-atomic_cmpxchg must provide explicit memory barriers around the operation,
-although if the comparison fails then no memory ordering guarantees are
-required.
-
-The semantics for atomic_cmpxchg are the same as those defined for 'cas'
-below.
-
-Finally::
-
- int atomic_add_unless(atomic_t *v, int a, int u);
-
-If the atomic value v is not equal to u, this function adds a to v, and
-returns non zero. If v is equal to u then it returns zero. This is done as
-an atomic operation.
-
-atomic_add_unless must provide explicit memory barriers around the
-operation unless it fails (returns 0).
-
-atomic_inc_not_zero, equivalent to atomic_add_unless(v, 1, 0)
-
-
-If a caller requires memory barrier semantics around an atomic_t
-operation which does not return a value, a set of interfaces are
-defined which accomplish this::
-
- void smp_mb__before_atomic(void);
- void smp_mb__after_atomic(void);
-
-Preceding a non-value-returning read-modify-write atomic operation with
-smp_mb__before_atomic() and following it with smp_mb__after_atomic()
-provides the same full ordering that is provided by value-returning
-read-modify-write atomic operations.
-
-For example, smp_mb__before_atomic() can be used like so::
-
- obj->dead = 1;
- smp_mb__before_atomic();
- atomic_dec(&obj->ref_count);
-
-It makes sure that all memory operations preceding the atomic_dec()
-call are strongly ordered with respect to the atomic counter
-operation. In the above example, it guarantees that the assignment of
-"1" to obj->dead will be globally visible to other cpus before the
-atomic counter decrement.
-
-Without the explicit smp_mb__before_atomic() call, the
-implementation could legally allow the atomic counter update visible
-to other cpus before the "obj->dead = 1;" assignment.
-
-A missing memory barrier in the cases where they are required by the
-atomic_t implementation above can have disastrous results. Here is
-an example, which follows a pattern occurring frequently in the Linux
-kernel. It is the use of atomic counters to implement reference
-counting, and it works such that once the counter falls to zero it can
-be guaranteed that no other entity can be accessing the object::
-
- static void obj_list_add(struct obj *obj, struct list_head *head)
- {
- obj->active = 1;
- list_add(&obj->list, head);
- }
-
- static void obj_list_del(struct obj *obj)
- {
- list_del(&obj->list);
- obj->active = 0;
- }
-
- static void obj_destroy(struct obj *obj)
- {
- BUG_ON(obj->active);
- kfree(obj);
- }
-
- struct obj *obj_list_peek(struct list_head *head)
- {
- if (!list_empty(head)) {
- struct obj *obj;
-
- obj = list_entry(head->next, struct obj, list);
- atomic_inc(&obj->refcnt);
- return obj;
- }
- return NULL;
- }
-
- void obj_poke(void)
- {
- struct obj *obj;
-
- spin_lock(&global_list_lock);
- obj = obj_list_peek(&global_list);
- spin_unlock(&global_list_lock);
-
- if (obj) {
- obj->ops->poke(obj);
- if (atomic_dec_and_test(&obj->refcnt))
- obj_destroy(obj);
- }
- }
-
- void obj_timeout(struct obj *obj)
- {
- spin_lock(&global_list_lock);
- obj_list_del(obj);
- spin_unlock(&global_list_lock);
-
- if (atomic_dec_and_test(&obj->refcnt))
- obj_destroy(obj);
- }
-
-.. note::
-
- This is a simplification of the ARP queue management in the generic
- neighbour discover code of the networking. Olaf Kirch found a bug wrt.
- memory barriers in kfree_skb() that exposed the atomic_t memory barrier
- requirements quite clearly.
-
-Given the above scheme, it must be the case that the obj->active
-update done by the obj list deletion be visible to other processors
-before the atomic counter decrement is performed.
-
-Otherwise, the counter could fall to zero, yet obj->active would still
-be set, thus triggering the assertion in obj_destroy(). The error
-sequence looks like this::
-
- cpu 0 cpu 1
- obj_poke() obj_timeout()
- obj = obj_list_peek();
- ... gains ref to obj, refcnt=2
- obj_list_del(obj);
- obj->active = 0 ...
- ... visibility delayed ...
- atomic_dec_and_test()
- ... refcnt drops to 1 ...
- atomic_dec_and_test()
- ... refcount drops to 0 ...
- obj_destroy()
- BUG() triggers since obj->active
- still seen as one
- obj->active update visibility occurs
-
-With the memory barrier semantics required of the atomic_t operations
-which return values, the above sequence of memory visibility can never
-happen. Specifically, in the above case the atomic_dec_and_test()
-counter decrement would not become globally visible until the
-obj->active update does.
-
-As a historical note, 32-bit Sparc used to only allow usage of
-24-bits of its atomic_t type. This was because it used 8 bits
-as a spinlock for SMP safety. Sparc32 lacked a "compare and swap"
-type instruction. However, 32-bit Sparc has since been moved over
-to a "hash table of spinlocks" scheme, that allows the full 32-bit
-counter to be realized. Essentially, an array of spinlocks are
-indexed into based upon the address of the atomic_t being operated
-on, and that lock protects the atomic operation. Parisc uses the
-same scheme.
-
-Another note is that the atomic_t operations returning values are
-extremely slow on an old 386.
-
-
-Atomic Bitmask
-==============
-
-We will now cover the atomic bitmask operations. You will find that
-their SMP and memory barrier semantics are similar in shape and scope
-to the atomic_t ops above.
-
-Native atomic bit operations are defined to operate on objects aligned
-to the size of an "unsigned long" C data type, and are least of that
-size. The endianness of the bits within each "unsigned long" are the
-native endianness of the cpu. ::
-
- void set_bit(unsigned long nr, volatile unsigned long *addr);
- void clear_bit(unsigned long nr, volatile unsigned long *addr);
- void change_bit(unsigned long nr, volatile unsigned long *addr);
-
-These routines set, clear, and change, respectively, the bit number
-indicated by "nr" on the bit mask pointed to by "ADDR".
-
-They must execute atomically, yet there are no implicit memory barrier
-semantics required of these interfaces. ::
-
- int test_and_set_bit(unsigned long nr, volatile unsigned long *addr);
- int test_and_clear_bit(unsigned long nr, volatile unsigned long *addr);
- int test_and_change_bit(unsigned long nr, volatile unsigned long *addr);
-
-Like the above, except that these routines return a boolean which
-indicates whether the changed bit was set _BEFORE_ the atomic bit
-operation.
-
-
-.. warning::
- It is incredibly important that the value be a boolean, ie. "0" or "1".
- Do not try to be fancy and save a few instructions by declaring the
- above to return "long" and just returning something like "old_val &
- mask" because that will not work.
-
-For one thing, this return value gets truncated to int in many code
-paths using these interfaces, so on 64-bit if the bit is set in the
-upper 32-bits then testers will never see that.
-
-One great example of where this problem crops up are the thread_info
-flag operations. Routines such as test_and_set_ti_thread_flag() chop
-the return value into an int. There are other places where things
-like this occur as well.
-
-These routines, like the atomic_t counter operations returning values,
-must provide explicit memory barrier semantics around their execution.
-All memory operations before the atomic bit operation call must be
-made visible globally before the atomic bit operation is made visible.
-Likewise, the atomic bit operation must be visible globally before any
-subsequent memory operation is made visible. For example::
-
- obj->dead = 1;
- if (test_and_set_bit(0, &obj->flags))
- /* ... */;
- obj->killed = 1;
-
-The implementation of test_and_set_bit() must guarantee that
-"obj->dead = 1;" is visible to cpus before the atomic memory operation
-done by test_and_set_bit() becomes visible. Likewise, the atomic
-memory operation done by test_and_set_bit() must become visible before
-"obj->killed = 1;" is visible.
-
-Finally there is the basic operation::
-
- int test_bit(unsigned long nr, __const__ volatile unsigned long *addr);
-
-Which returns a boolean indicating if bit "nr" is set in the bitmask
-pointed to by "addr".
-
-If explicit memory barriers are required around {set,clear}_bit() (which do
-not return a value, and thus does not need to provide memory barrier
-semantics), two interfaces are provided::
-
- void smp_mb__before_atomic(void);
- void smp_mb__after_atomic(void);
-
-They are used as follows, and are akin to their atomic_t operation
-brothers::
-
- /* All memory operations before this call will
- * be globally visible before the clear_bit().
- */
- smp_mb__before_atomic();
- clear_bit( ... );
-
- /* The clear_bit() will be visible before all
- * subsequent memory operations.
- */
- smp_mb__after_atomic();
-
-There are two special bitops with lock barrier semantics (acquire/release,
-same as spinlocks). These operate in the same way as their non-_lock/unlock
-postfixed variants, except that they are to provide acquire/release semantics,
-respectively. This means they can be used for bit_spin_trylock and
-bit_spin_unlock type operations without specifying any more barriers. ::
-
- int test_and_set_bit_lock(unsigned long nr, unsigned long *addr);
- void clear_bit_unlock(unsigned long nr, unsigned long *addr);
- void __clear_bit_unlock(unsigned long nr, unsigned long *addr);
-
-The __clear_bit_unlock version is non-atomic, however it still implements
-unlock barrier semantics. This can be useful if the lock itself is protecting
-the other bits in the word.
-
-Finally, there are non-atomic versions of the bitmask operations
-provided. They are used in contexts where some other higher-level SMP
-locking scheme is being used to protect the bitmask, and thus less
-expensive non-atomic operations may be used in the implementation.
-They have names similar to the above bitmask operation interfaces,
-except that two underscores are prefixed to the interface name. ::
-
- void __set_bit(unsigned long nr, volatile unsigned long *addr);
- void __clear_bit(unsigned long nr, volatile unsigned long *addr);
- void __change_bit(unsigned long nr, volatile unsigned long *addr);
- int __test_and_set_bit(unsigned long nr, volatile unsigned long *addr);
- int __test_and_clear_bit(unsigned long nr, volatile unsigned long *addr);
- int __test_and_change_bit(unsigned long nr, volatile unsigned long *addr);
-
-These non-atomic variants also do not require any special memory
-barrier semantics.
-
-The routines xchg() and cmpxchg() must provide the same exact
-memory-barrier semantics as the atomic and bit operations returning
-values.
-
-.. note::
-
- If someone wants to use xchg(), cmpxchg() and their variants,
- linux/atomic.h should be included rather than asm/cmpxchg.h, unless the
- code is in arch/* and can take care of itself.
-
-Spinlocks and rwlocks have memory barrier expectations as well.
-The rule to follow is simple:
-
-1) When acquiring a lock, the implementation must make it globally
- visible before any subsequent memory operation.
-
-2) When releasing a lock, the implementation must make it such that
- all previous memory operations are globally visible before the
- lock release.
-
-Which finally brings us to _atomic_dec_and_lock(). There is an
-architecture-neutral version implemented in lib/dec_and_lock.c,
-but most platforms will wish to optimize this in assembler. ::
-
- int _atomic_dec_and_lock(atomic_t *atomic, spinlock_t *lock);
-
-Atomically decrement the given counter, and if will drop to zero
-atomically acquire the given spinlock and perform the decrement
-of the counter to zero. If it does not drop to zero, do nothing
-with the spinlock.
-
-It is actually pretty simple to get the memory barrier correct.
-Simply satisfy the spinlock grab requirements, which is make
-sure the spinlock operation is globally visible before any
-subsequent memory operation.
-
-We can demonstrate this operation more clearly if we define
-an abstract atomic operation::
-
- long cas(long *mem, long old, long new);
-
-"cas" stands for "compare and swap". It atomically:
-
-1) Compares "old" with the value currently at "mem".
-2) If they are equal, "new" is written to "mem".
-3) Regardless, the current value at "mem" is returned.
-
-As an example usage, here is what an atomic counter update
-might look like::
-
- void example_atomic_inc(long *counter)
- {
- long old, new, ret;
-
- while (1) {
- old = *counter;
- new = old + 1;
-
- ret = cas(counter, old, new);
- if (ret == old)
- break;
- }
- }
-
-Let's use cas() in order to build a pseudo-C atomic_dec_and_lock()::
-
- int _atomic_dec_and_lock(atomic_t *atomic, spinlock_t *lock)
- {
- long old, new, ret;
- int went_to_zero;
-
- went_to_zero = 0;
- while (1) {
- old = atomic_read(atomic);
- new = old - 1;
- if (new == 0) {
- went_to_zero = 1;
- spin_lock(lock);
- }
- ret = cas(atomic, old, new);
- if (ret == old)
- break;
- if (went_to_zero) {
- spin_unlock(lock);
- went_to_zero = 0;
- }
- }
-
- return went_to_zero;
- }
-
-Now, as far as memory barriers go, as long as spin_lock()
-strictly orders all subsequent memory operations (including
-the cas()) with respect to itself, things will be fine.
-
-Said another way, _atomic_dec_and_lock() must guarantee that
-a counter dropping to zero is never made visible before the
-spinlock being acquired.
-
-.. note::
-
- Note that this also means that for the case where the counter is not
- dropping to zero, there are no memory ordering requirements.
diff --git a/Documentation/core-api/cachetlb.rst b/Documentation/core-api/cachetlb.rst
index 93cb65d52720..889fc84ccd1b 100644
--- a/Documentation/core-api/cachetlb.rst
+++ b/Documentation/core-api/cachetlb.rst
@@ -88,13 +88,17 @@ changes occur:
This is used primarily during fault processing.
-5) ``void update_mmu_cache(struct vm_area_struct *vma,
- unsigned long address, pte_t *ptep)``
+5) ``void update_mmu_cache_range(struct vm_fault *vmf,
+ struct vm_area_struct *vma, unsigned long address, pte_t *ptep,
+ unsigned int nr)``
- At the end of every page fault, this routine is invoked to
- tell the architecture specific code that a translation
- now exists at virtual address "address" for address space
- "vma->vm_mm", in the software page tables.
+ At the end of every page fault, this routine is invoked to tell
+ the architecture specific code that translations now exists
+ in the software page tables for address space "vma->vm_mm"
+ at virtual address "address" for "nr" consecutive pages.
+
+ This routine is also invoked in various other places which pass
+ a NULL "vmf".
A port may use this information in any way it so chooses.
For example, it could use this event to pre-load TLB
@@ -213,9 +217,9 @@ Here are the routines, one by one:
there will be no entries in the cache for the kernel address
space for virtual addresses in the range 'start' to 'end-1'.
- The first of these two routines is invoked after map_vm_area()
+ The first of these two routines is invoked after vmap_range()
has installed the page table entries. The second is invoked
- before unmap_kernel_range() deletes the page table entries.
+ before vunmap_range() deletes the page table entries.
There exists another whole class of cpu cache issues which currently
require a whole different set of interfaces to handle properly.
@@ -269,12 +273,17 @@ maps this page at its virtual address.
If D-cache aliasing is not an issue, these two routines may
simply call memcpy/memset directly and do nothing more.
- ``void flush_dcache_page(struct page *page)``
+ ``void flush_dcache_folio(struct folio *folio)``
+
+ This routines must be called when:
- Any time the kernel writes to a page cache page, _OR_
- the kernel is about to read from a page cache page and
- user space shared/writable mappings of this page potentially
- exist, this routine is called.
+ a) the kernel did write to a page that is in the page cache page
+ and / or in high memory
+ b) the kernel is about to read from a page cache page and user space
+ shared/writable mappings of this page potentially exist. Note
+ that {get,pin}_user_pages{_fast} already call flush_dcache_folio
+ on any page found in the user address space and thus driver
+ code rarely needs to take this into account.
.. note::
@@ -284,38 +293,35 @@ maps this page at its virtual address.
handling vfs symlinks in the page cache need not call
this interface at all.
- The phrase "kernel writes to a page cache page" means,
- specifically, that the kernel executes store instructions
- that dirty data in that page at the page->virtual mapping
- of that page. It is important to flush here to handle
- D-cache aliasing, to make sure these kernel stores are
- visible to user space mappings of that page.
+ The phrase "kernel writes to a page cache page" means, specifically,
+ that the kernel executes store instructions that dirty data in that
+ page at the kernel virtual mapping of that page. It is important to
+ flush here to handle D-cache aliasing, to make sure these kernel stores
+ are visible to user space mappings of that page.
- The corollary case is just as important, if there are users
- which have shared+writable mappings of this file, we must make
- sure that kernel reads of these pages will see the most recent
- stores done by the user.
+ The corollary case is just as important, if there are users which have
+ shared+writable mappings of this file, we must make sure that kernel
+ reads of these pages will see the most recent stores done by the user.
- If D-cache aliasing is not an issue, this routine may
- simply be defined as a nop on that architecture.
+ If D-cache aliasing is not an issue, this routine may simply be defined
+ as a nop on that architecture.
- There is a bit set aside in page->flags (PG_arch_1) as
- "architecture private". The kernel guarantees that,
- for pagecache pages, it will clear this bit when such
- a page first enters the pagecache.
+ There is a bit set aside in folio->flags (PG_arch_1) as "architecture
+ private". The kernel guarantees that, for pagecache pages, it will
+ clear this bit when such a page first enters the pagecache.
This allows these interfaces to be implemented much more
- efficiently. It allows one to "defer" (perhaps indefinitely)
- the actual flush if there are currently no user processes
- mapping this page. See sparc64's flush_dcache_page and
- update_mmu_cache implementations for an example of how to go
- about doing this.
-
- The idea is, first at flush_dcache_page() time, if
- page->mapping->i_mmap is an empty tree, just mark the architecture
- private page flag bit. Later, in update_mmu_cache(), a check is
- made of this flag bit, and if set the flush is done and the flag
- bit is cleared.
+ efficiently. It allows one to "defer" (perhaps indefinitely) the
+ actual flush if there are currently no user processes mapping this
+ page. See sparc64's flush_dcache_folio and update_mmu_cache_range
+ implementations for an example of how to go about doing this.
+
+ The idea is, first at flush_dcache_folio() time, if
+ folio_flush_mapping() returns a mapping, and mapping_mapped() on that
+ mapping returns %false, just mark the architecture private page
+ flag bit. Later, in update_mmu_cache_range(), a check is made
+ of this flag bit, and if set the flush is done and the flag bit
+ is cleared.
.. important::
@@ -345,25 +351,12 @@ maps this page at its virtual address.
When the kernel needs to access the contents of an anonymous
page, it calls this function (currently only
- get_user_pages()). Note: flush_dcache_page() deliberately
+ get_user_pages()). Note: flush_dcache_folio() deliberately
doesn't work for an anonymous page. The default
implementation is a nop (and should remain so for all coherent
architectures). For incoherent architectures, it should flush
the cache of the page at vmaddr.
- ``void flush_kernel_dcache_page(struct page *page)``
-
- When the kernel needs to modify a user page is has obtained
- with kmap, it calls this function after all modifications are
- complete (but before kunmapping it) to bring the underlying
- page up to date. It is assumed here that the user has no
- incoherent cached copies (i.e. the original page was obtained
- from a mechanism like get_user_pages()). The default
- implementation is a nop and should remain so on all coherent
- architectures. On incoherent architectures, this should flush
- the kernel cache for page (using page_address(page)).
-
-
``void flush_icache_range(unsigned long start, unsigned long end)``
When the kernel stores into addresses that it will execute
@@ -375,7 +368,7 @@ maps this page at its virtual address.
``void flush_icache_page(struct vm_area_struct *vma, struct page *page)``
All the functionality of flush_icache_page can be implemented in
- flush_dcache_page and update_mmu_cache. In the future, the hope
+ flush_dcache_folio and update_mmu_cache_range. In the future, the hope
is to remove this interface completely.
The final category of APIs is for I/O to deliberately aliased address
diff --git a/Documentation/core-api/cpu_hotplug.rst b/Documentation/core-api/cpu_hotplug.rst
index 4a50ab7817f7..dcb0e379e5e8 100644
--- a/Documentation/core-api/cpu_hotplug.rst
+++ b/Documentation/core-api/cpu_hotplug.rst
@@ -2,12 +2,13 @@
CPU hotplug in the Kernel
=========================
-:Date: December, 2016
+:Date: September, 2021
:Author: Sebastian Andrzej Siewior <bigeasy@linutronix.de>,
- Rusty Russell <rusty@rustcorp.com.au>,
- Srivatsa Vaddagiri <vatsa@in.ibm.com>,
- Ashok Raj <ashok.raj@intel.com>,
- Joel Schopp <jschopp@austin.ibm.com>
+ Rusty Russell <rusty@rustcorp.com.au>,
+ Srivatsa Vaddagiri <vatsa@in.ibm.com>,
+ Ashok Raj <ashok.raj@intel.com>,
+ Joel Schopp <jschopp@austin.ibm.com>,
+ Thomas Gleixner <tglx@linutronix.de>
Introduction
============
@@ -30,33 +31,20 @@ which didn't support these methods.
Command Line Switches
=====================
``maxcpus=n``
- Restrict boot time CPUs to *n*. Say if you have fourV CPUs, using
+ Restrict boot time CPUs to *n*. Say if you have four CPUs, using
``maxcpus=2`` will only boot two. You can choose to bring the
other CPUs later online.
``nr_cpus=n``
- Restrict the total amount CPUs the kernel will support. If the number
- supplied here is lower than the number of physically available CPUs than
+ Restrict the total amount of CPUs the kernel will support. If the number
+ supplied here is lower than the number of physically available CPUs, then
those CPUs can not be brought online later.
-``additional_cpus=n``
- Use this to limit hotpluggable CPUs. This option sets
- ``cpu_possible_mask = cpu_present_mask + additional_cpus``
-
- This option is limited to the IA64 architecture.
-
``possible_cpus=n``
This option sets ``possible_cpus`` bits in ``cpu_possible_mask``.
This option is limited to the X86 and S390 architecture.
-``cede_offline={"off","on"}``
- Use this option to disable/enable putting offlined processors to an extended
- ``H_CEDE`` state on supported pseries platforms. If nothing is specified,
- ``cede_offline`` is set to "on".
-
- This option is limited to the PowerPC architecture.
-
``cpu0_hotplug``
Allow to shutdown CPU0.
@@ -98,9 +86,10 @@ Never use anything other than ``cpumask_t`` to represent bitmap of CPUs.
Using CPU hotplug
=================
+
The kernel option *CONFIG_HOTPLUG_CPU* needs to be enabled. It is currently
available on multiple architectures including ARM, MIPS, PowerPC and X86. The
-configuration is done via the sysfs interface: ::
+configuration is done via the sysfs interface::
$ ls -lh /sys/devices/system/cpu
total 0
@@ -120,35 +109,27 @@ configuration is done via the sysfs interface: ::
The files *offline*, *online*, *possible*, *present* represent the CPU masks.
Each CPU folder contains an *online* file which controls the logical on (1) and
-off (0) state. To logically shutdown CPU4: ::
+off (0) state. To logically shutdown CPU4::
$ echo 0 > /sys/devices/system/cpu/cpu4/online
smpboot: CPU 4 is now offline
Once the CPU is shutdown, it will be removed from */proc/interrupts*,
*/proc/cpuinfo* and should also not be shown visible by the *top* command. To
-bring CPU4 back online: ::
+bring CPU4 back online::
$ echo 1 > /sys/devices/system/cpu/cpu4/online
smpboot: Booting Node 0 Processor 4 APIC 0x1
-The CPU is usable again. This should work on all CPUs. CPU0 is often special
-and excluded from CPU hotplug. On X86 the kernel option
-*CONFIG_BOOTPARAM_HOTPLUG_CPU0* has to be enabled in order to be able to
-shutdown CPU0. Alternatively the kernel command option *cpu0_hotplug* can be
-used. Some known dependencies of CPU0:
-
-* Resume from hibernate/suspend. Hibernate/suspend will fail if CPU0 is offline.
-* PIC interrupts. CPU0 can't be removed if a PIC interrupt is detected.
-
-Please let Fenghua Yu <fenghua.yu@intel.com> know if you find any dependencies
-on CPU0.
+The CPU is usable again. This should work on all CPUs, but CPU0 is often special
+and excluded from CPU hotplug.
The CPU hotplug coordination
============================
The offline case
----------------
+
Once a CPU has been logically shutdown the teardown callbacks of registered
hotplug states will be invoked, starting with ``CPUHP_ONLINE`` and terminating
at state ``CPUHP_OFFLINE``. This includes:
@@ -163,105 +144,491 @@ at state ``CPUHP_OFFLINE``. This includes:
* Once all services are migrated, kernel calls an arch specific routine
``__cpu_disable()`` to perform arch specific cleanup.
-Using the hotplug API
----------------------
-It is possible to receive notifications once a CPU is offline or onlined. This
-might be important to certain drivers which need to perform some kind of setup
-or clean up functions based on the number of available CPUs: ::
-
- #include <linux/cpuhotplug.h>
-
- ret = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "X/Y:online",
- Y_online, Y_prepare_down);
-
-*X* is the subsystem and *Y* the particular driver. The *Y_online* callback
-will be invoked during registration on all online CPUs. If an error
-occurs during the online callback the *Y_prepare_down* callback will be
-invoked on all CPUs on which the online callback was previously invoked.
-After registration completed, the *Y_online* callback will be invoked
-once a CPU is brought online and *Y_prepare_down* will be invoked when a
-CPU is shutdown. All resources which were previously allocated in
-*Y_online* should be released in *Y_prepare_down*.
-The return value *ret* is negative if an error occurred during the
-registration process. Otherwise a positive value is returned which
-contains the allocated hotplug for dynamically allocated states
-(*CPUHP_AP_ONLINE_DYN*). It will return zero for predefined states.
-
-The callback can be remove by invoking ``cpuhp_remove_state()``. In case of a
-dynamically allocated state (*CPUHP_AP_ONLINE_DYN*) use the returned state.
-During the removal of a hotplug state the teardown callback will be invoked.
-
-Multiple instances
-~~~~~~~~~~~~~~~~~~
-If a driver has multiple instances and each instance needs to perform the
-callback independently then it is likely that a ''multi-state'' should be used.
-First a multi-state state needs to be registered: ::
-
- ret = cpuhp_setup_state_multi(CPUHP_AP_ONLINE_DYN, "X/Y:online,
- Y_online, Y_prepare_down);
- Y_hp_online = ret;
-
-The ``cpuhp_setup_state_multi()`` behaves similar to ``cpuhp_setup_state()``
-except it prepares the callbacks for a multi state and does not invoke
-the callbacks. This is a one time setup.
-Once a new instance is allocated, you need to register this new instance: ::
-
- ret = cpuhp_state_add_instance(Y_hp_online, &d->node);
-
-This function will add this instance to your previously allocated
-*Y_hp_online* state and invoke the previously registered callback
-(*Y_online*) on all online CPUs. The *node* element is a ``struct
-hlist_node`` member of your per-instance data structure.
-
-On removal of the instance: ::
- cpuhp_state_remove_instance(Y_hp_online, &d->node)
-
-should be invoked which will invoke the teardown callback on all online
-CPUs.
-
-Manual setup
-~~~~~~~~~~~~
-Usually it is handy to invoke setup and teardown callbacks on registration or
-removal of a state because usually the operation needs to performed once a CPU
-goes online (offline) and during initial setup (shutdown) of the driver. However
-each registration and removal function is also available with a ``_nocalls``
-suffix which does not invoke the provided callbacks if the invocation of the
-callbacks is not desired. During the manual setup (or teardown) the functions
-``get_online_cpus()`` and ``put_online_cpus()`` should be used to inhibit CPU
-hotplug operations.
-
-
-The ordering of the events
---------------------------
-The hotplug states are defined in ``include/linux/cpuhotplug.h``:
-
-* The states *CPUHP_OFFLINE* … *CPUHP_AP_OFFLINE* are invoked before the
- CPU is up.
-* The states *CPUHP_AP_OFFLINE* … *CPUHP_AP_ONLINE* are invoked
- just the after the CPU has been brought up. The interrupts are off and
- the scheduler is not yet active on this CPU. Starting with *CPUHP_AP_OFFLINE*
- the callbacks are invoked on the target CPU.
-* The states between *CPUHP_AP_ONLINE_DYN* and *CPUHP_AP_ONLINE_DYN_END* are
- reserved for the dynamic allocation.
-* The states are invoked in the reverse order on CPU shutdown starting with
- *CPUHP_ONLINE* and stopping at *CPUHP_OFFLINE*. Here the callbacks are
- invoked on the CPU that will be shutdown until *CPUHP_AP_OFFLINE*.
-
-A dynamically allocated state via *CPUHP_AP_ONLINE_DYN* is often enough.
-However if an earlier invocation during the bring up or shutdown is required
-then an explicit state should be acquired. An explicit state might also be
-required if the hotplug event requires specific ordering in respect to
-another hotplug event.
+
+The CPU hotplug API
+===================
+
+CPU hotplug state machine
+-------------------------
+
+CPU hotplug uses a trivial state machine with a linear state space from
+CPUHP_OFFLINE to CPUHP_ONLINE. Each state has a startup and a teardown
+callback.
+
+When a CPU is onlined, the startup callbacks are invoked sequentially until
+the state CPUHP_ONLINE is reached. They can also be invoked when the
+callbacks of a state are set up or an instance is added to a multi-instance
+state.
+
+When a CPU is offlined the teardown callbacks are invoked in the reverse
+order sequentially until the state CPUHP_OFFLINE is reached. They can also
+be invoked when the callbacks of a state are removed or an instance is
+removed from a multi-instance state.
+
+If a usage site requires only a callback in one direction of the hotplug
+operations (CPU online or CPU offline) then the other not-required callback
+can be set to NULL when the state is set up.
+
+The state space is divided into three sections:
+
+* The PREPARE section
+
+ The PREPARE section covers the state space from CPUHP_OFFLINE to
+ CPUHP_BRINGUP_CPU.
+
+ The startup callbacks in this section are invoked before the CPU is
+ started during a CPU online operation. The teardown callbacks are invoked
+ after the CPU has become dysfunctional during a CPU offline operation.
+
+ The callbacks are invoked on a control CPU as they can't obviously run on
+ the hotplugged CPU which is either not yet started or has become
+ dysfunctional already.
+
+ The startup callbacks are used to setup resources which are required to
+ bring a CPU successfully online. The teardown callbacks are used to free
+ resources or to move pending work to an online CPU after the hotplugged
+ CPU became dysfunctional.
+
+ The startup callbacks are allowed to fail. If a callback fails, the CPU
+ online operation is aborted and the CPU is brought down to the previous
+ state (usually CPUHP_OFFLINE) again.
+
+ The teardown callbacks in this section are not allowed to fail.
+
+* The STARTING section
+
+ The STARTING section covers the state space between CPUHP_BRINGUP_CPU + 1
+ and CPUHP_AP_ONLINE.
+
+ The startup callbacks in this section are invoked on the hotplugged CPU
+ with interrupts disabled during a CPU online operation in the early CPU
+ setup code. The teardown callbacks are invoked with interrupts disabled
+ on the hotplugged CPU during a CPU offline operation shortly before the
+ CPU is completely shut down.
+
+ The callbacks in this section are not allowed to fail.
+
+ The callbacks are used for low level hardware initialization/shutdown and
+ for core subsystems.
+
+* The ONLINE section
+
+ The ONLINE section covers the state space between CPUHP_AP_ONLINE + 1 and
+ CPUHP_ONLINE.
+
+ The startup callbacks in this section are invoked on the hotplugged CPU
+ during a CPU online operation. The teardown callbacks are invoked on the
+ hotplugged CPU during a CPU offline operation.
+
+ The callbacks are invoked in the context of the per CPU hotplug thread,
+ which is pinned on the hotplugged CPU. The callbacks are invoked with
+ interrupts and preemption enabled.
+
+ The callbacks are allowed to fail. When a callback fails the hotplug
+ operation is aborted and the CPU is brought back to the previous state.
+
+CPU online/offline operations
+-----------------------------
+
+A successful online operation looks like this::
+
+ [CPUHP_OFFLINE]
+ [CPUHP_OFFLINE + 1]->startup() -> success
+ [CPUHP_OFFLINE + 2]->startup() -> success
+ [CPUHP_OFFLINE + 3] -> skipped because startup == NULL
+ ...
+ [CPUHP_BRINGUP_CPU]->startup() -> success
+ === End of PREPARE section
+ [CPUHP_BRINGUP_CPU + 1]->startup() -> success
+ ...
+ [CPUHP_AP_ONLINE]->startup() -> success
+ === End of STARTUP section
+ [CPUHP_AP_ONLINE + 1]->startup() -> success
+ ...
+ [CPUHP_ONLINE - 1]->startup() -> success
+ [CPUHP_ONLINE]
+
+A successful offline operation looks like this::
+
+ [CPUHP_ONLINE]
+ [CPUHP_ONLINE - 1]->teardown() -> success
+ ...
+ [CPUHP_AP_ONLINE + 1]->teardown() -> success
+ === Start of STARTUP section
+ [CPUHP_AP_ONLINE]->teardown() -> success
+ ...
+ [CPUHP_BRINGUP_ONLINE - 1]->teardown()
+ ...
+ === Start of PREPARE section
+ [CPUHP_BRINGUP_CPU]->teardown()
+ [CPUHP_OFFLINE + 3]->teardown()
+ [CPUHP_OFFLINE + 2] -> skipped because teardown == NULL
+ [CPUHP_OFFLINE + 1]->teardown()
+ [CPUHP_OFFLINE]
+
+A failed online operation looks like this::
+
+ [CPUHP_OFFLINE]
+ [CPUHP_OFFLINE + 1]->startup() -> success
+ [CPUHP_OFFLINE + 2]->startup() -> success
+ [CPUHP_OFFLINE + 3] -> skipped because startup == NULL
+ ...
+ [CPUHP_BRINGUP_CPU]->startup() -> success
+ === End of PREPARE section
+ [CPUHP_BRINGUP_CPU + 1]->startup() -> success
+ ...
+ [CPUHP_AP_ONLINE]->startup() -> success
+ === End of STARTUP section
+ [CPUHP_AP_ONLINE + 1]->startup() -> success
+ ---
+ [CPUHP_AP_ONLINE + N]->startup() -> fail
+ [CPUHP_AP_ONLINE + (N - 1)]->teardown()
+ ...
+ [CPUHP_AP_ONLINE + 1]->teardown()
+ === Start of STARTUP section
+ [CPUHP_AP_ONLINE]->teardown()
+ ...
+ [CPUHP_BRINGUP_ONLINE - 1]->teardown()
+ ...
+ === Start of PREPARE section
+ [CPUHP_BRINGUP_CPU]->teardown()
+ [CPUHP_OFFLINE + 3]->teardown()
+ [CPUHP_OFFLINE + 2] -> skipped because teardown == NULL
+ [CPUHP_OFFLINE + 1]->teardown()
+ [CPUHP_OFFLINE]
+
+A failed offline operation looks like this::
+
+ [CPUHP_ONLINE]
+ [CPUHP_ONLINE - 1]->teardown() -> success
+ ...
+ [CPUHP_ONLINE - N]->teardown() -> fail
+ [CPUHP_ONLINE - (N - 1)]->startup()
+ ...
+ [CPUHP_ONLINE - 1]->startup()
+ [CPUHP_ONLINE]
+
+Recursive failures cannot be handled sensibly. Look at the following
+example of a recursive fail due to a failed offline operation: ::
+
+ [CPUHP_ONLINE]
+ [CPUHP_ONLINE - 1]->teardown() -> success
+ ...
+ [CPUHP_ONLINE - N]->teardown() -> fail
+ [CPUHP_ONLINE - (N - 1)]->startup() -> success
+ [CPUHP_ONLINE - (N - 2)]->startup() -> fail
+
+The CPU hotplug state machine stops right here and does not try to go back
+down again because that would likely result in an endless loop::
+
+ [CPUHP_ONLINE - (N - 1)]->teardown() -> success
+ [CPUHP_ONLINE - N]->teardown() -> fail
+ [CPUHP_ONLINE - (N - 1)]->startup() -> success
+ [CPUHP_ONLINE - (N - 2)]->startup() -> fail
+ [CPUHP_ONLINE - (N - 1)]->teardown() -> success
+ [CPUHP_ONLINE - N]->teardown() -> fail
+
+Lather, rinse and repeat. In this case the CPU left in state::
+
+ [CPUHP_ONLINE - (N - 1)]
+
+which at least lets the system make progress and gives the user a chance to
+debug or even resolve the situation.
+
+Allocating a state
+------------------
+
+There are two ways to allocate a CPU hotplug state:
+
+* Static allocation
+
+ Static allocation has to be used when the subsystem or driver has
+ ordering requirements versus other CPU hotplug states. E.g. the PERF core
+ startup callback has to be invoked before the PERF driver startup
+ callbacks during a CPU online operation. During a CPU offline operation
+ the driver teardown callbacks have to be invoked before the core teardown
+ callback. The statically allocated states are described by constants in
+ the cpuhp_state enum which can be found in include/linux/cpuhotplug.h.
+
+ Insert the state into the enum at the proper place so the ordering
+ requirements are fulfilled. The state constant has to be used for state
+ setup and removal.
+
+ Static allocation is also required when the state callbacks are not set
+ up at runtime and are part of the initializer of the CPU hotplug state
+ array in kernel/cpu.c.
+
+* Dynamic allocation
+
+ When there are no ordering requirements for the state callbacks then
+ dynamic allocation is the preferred method. The state number is allocated
+ by the setup function and returned to the caller on success.
+
+ Only the PREPARE and ONLINE sections provide a dynamic allocation
+ range. The STARTING section does not as most of the callbacks in that
+ section have explicit ordering requirements.
+
+Setup of a CPU hotplug state
+----------------------------
+
+The core code provides the following functions to setup a state:
+
+* cpuhp_setup_state(state, name, startup, teardown)
+* cpuhp_setup_state_nocalls(state, name, startup, teardown)
+* cpuhp_setup_state_cpuslocked(state, name, startup, teardown)
+* cpuhp_setup_state_nocalls_cpuslocked(state, name, startup, teardown)
+
+For cases where a driver or a subsystem has multiple instances and the same
+CPU hotplug state callbacks need to be invoked for each instance, the CPU
+hotplug core provides multi-instance support. The advantage over driver
+specific instance lists is that the instance related functions are fully
+serialized against CPU hotplug operations and provide the automatic
+invocations of the state callbacks on add and removal. To set up such a
+multi-instance state the following function is available:
+
+* cpuhp_setup_state_multi(state, name, startup, teardown)
+
+The @state argument is either a statically allocated state or one of the
+constants for dynamically allocated states - CPUHP_BP_PREPARE_DYN,
+CPUHP_AP_ONLINE_DYN - depending on the state section (PREPARE, ONLINE) for
+which a dynamic state should be allocated.
+
+The @name argument is used for sysfs output and for instrumentation. The
+naming convention is "subsys:mode" or "subsys/driver:mode",
+e.g. "perf:mode" or "perf/x86:mode". The common mode names are:
+
+======== =======================================================
+prepare For states in the PREPARE section
+
+dead For states in the PREPARE section which do not provide
+ a startup callback
+
+starting For states in the STARTING section
+
+dying For states in the STARTING section which do not provide
+ a startup callback
+
+online For states in the ONLINE section
+
+offline For states in the ONLINE section which do not provide
+ a startup callback
+======== =======================================================
+
+As the @name argument is only used for sysfs and instrumentation other mode
+descriptors can be used as well if they describe the nature of the state
+better than the common ones.
+
+Examples for @name arguments: "perf/online", "perf/x86:prepare",
+"RCU/tree:dying", "sched/waitempty"
+
+The @startup argument is a function pointer to the callback which should be
+invoked during a CPU online operation. If the usage site does not require a
+startup callback set the pointer to NULL.
+
+The @teardown argument is a function pointer to the callback which should
+be invoked during a CPU offline operation. If the usage site does not
+require a teardown callback set the pointer to NULL.
+
+The functions differ in the way how the installed callbacks are treated:
+
+ * cpuhp_setup_state_nocalls(), cpuhp_setup_state_nocalls_cpuslocked()
+ and cpuhp_setup_state_multi() only install the callbacks
+
+ * cpuhp_setup_state() and cpuhp_setup_state_cpuslocked() install the
+ callbacks and invoke the @startup callback (if not NULL) for all online
+ CPUs which have currently a state greater than the newly installed
+ state. Depending on the state section the callback is either invoked on
+ the current CPU (PREPARE section) or on each online CPU (ONLINE
+ section) in the context of the CPU's hotplug thread.
+
+ If a callback fails for CPU N then the teardown callback for CPU
+ 0 .. N-1 is invoked to rollback the operation. The state setup fails,
+ the callbacks for the state are not installed and in case of dynamic
+ allocation the allocated state is freed.
+
+The state setup and the callback invocations are serialized against CPU
+hotplug operations. If the setup function has to be called from a CPU
+hotplug read locked region, then the _cpuslocked() variants have to be
+used. These functions cannot be used from within CPU hotplug callbacks.
+
+The function return values:
+ ======== ===================================================================
+ 0 Statically allocated state was successfully set up
+
+ >0 Dynamically allocated state was successfully set up.
+
+ The returned number is the state number which was allocated. If
+ the state callbacks have to be removed later, e.g. module
+ removal, then this number has to be saved by the caller and used
+ as @state argument for the state remove function. For
+ multi-instance states the dynamically allocated state number is
+ also required as @state argument for the instance add/remove
+ operations.
+
+ <0 Operation failed
+ ======== ===================================================================
+
+Removal of a CPU hotplug state
+------------------------------
+
+To remove a previously set up state, the following functions are provided:
+
+* cpuhp_remove_state(state)
+* cpuhp_remove_state_nocalls(state)
+* cpuhp_remove_state_nocalls_cpuslocked(state)
+* cpuhp_remove_multi_state(state)
+
+The @state argument is either a statically allocated state or the state
+number which was allocated in the dynamic range by cpuhp_setup_state*(). If
+the state is in the dynamic range, then the state number is freed and
+available for dynamic allocation again.
+
+The functions differ in the way how the installed callbacks are treated:
+
+ * cpuhp_remove_state_nocalls(), cpuhp_remove_state_nocalls_cpuslocked()
+ and cpuhp_remove_multi_state() only remove the callbacks.
+
+ * cpuhp_remove_state() removes the callbacks and invokes the teardown
+ callback (if not NULL) for all online CPUs which have currently a state
+ greater than the removed state. Depending on the state section the
+ callback is either invoked on the current CPU (PREPARE section) or on
+ each online CPU (ONLINE section) in the context of the CPU's hotplug
+ thread.
+
+ In order to complete the removal, the teardown callback should not fail.
+
+The state removal and the callback invocations are serialized against CPU
+hotplug operations. If the remove function has to be called from a CPU
+hotplug read locked region, then the _cpuslocked() variants have to be
+used. These functions cannot be used from within CPU hotplug callbacks.
+
+If a multi-instance state is removed then the caller has to remove all
+instances first.
+
+Multi-Instance state instance management
+----------------------------------------
+
+Once the multi-instance state is set up, instances can be added to the
+state:
+
+ * cpuhp_state_add_instance(state, node)
+ * cpuhp_state_add_instance_nocalls(state, node)
+
+The @state argument is either a statically allocated state or the state
+number which was allocated in the dynamic range by cpuhp_setup_state_multi().
+
+The @node argument is a pointer to an hlist_node which is embedded in the
+instance's data structure. The pointer is handed to the multi-instance
+state callbacks and can be used by the callback to retrieve the instance
+via container_of().
+
+The functions differ in the way how the installed callbacks are treated:
+
+ * cpuhp_state_add_instance_nocalls() and only adds the instance to the
+ multi-instance state's node list.
+
+ * cpuhp_state_add_instance() adds the instance and invokes the startup
+ callback (if not NULL) associated with @state for all online CPUs which
+ have currently a state greater than @state. The callback is only
+ invoked for the to be added instance. Depending on the state section
+ the callback is either invoked on the current CPU (PREPARE section) or
+ on each online CPU (ONLINE section) in the context of the CPU's hotplug
+ thread.
+
+ If a callback fails for CPU N then the teardown callback for CPU
+ 0 .. N-1 is invoked to rollback the operation, the function fails and
+ the instance is not added to the node list of the multi-instance state.
+
+To remove an instance from the state's node list these functions are
+available:
+
+ * cpuhp_state_remove_instance(state, node)
+ * cpuhp_state_remove_instance_nocalls(state, node)
+
+The arguments are the same as for the cpuhp_state_add_instance*()
+variants above.
+
+The functions differ in the way how the installed callbacks are treated:
+
+ * cpuhp_state_remove_instance_nocalls() only removes the instance from the
+ state's node list.
+
+ * cpuhp_state_remove_instance() removes the instance and invokes the
+ teardown callback (if not NULL) associated with @state for all online
+ CPUs which have currently a state greater than @state. The callback is
+ only invoked for the to be removed instance. Depending on the state
+ section the callback is either invoked on the current CPU (PREPARE
+ section) or on each online CPU (ONLINE section) in the context of the
+ CPU's hotplug thread.
+
+ In order to complete the removal, the teardown callback should not fail.
+
+The node list add/remove operations and the callback invocations are
+serialized against CPU hotplug operations. These functions cannot be used
+from within CPU hotplug callbacks and CPU hotplug read locked regions.
+
+Examples
+--------
+
+Setup and teardown a statically allocated state in the STARTING section for
+notifications on online and offline operations::
+
+ ret = cpuhp_setup_state(CPUHP_SUBSYS_STARTING, "subsys:starting", subsys_cpu_starting, subsys_cpu_dying);
+ if (ret < 0)
+ return ret;
+ ....
+ cpuhp_remove_state(CPUHP_SUBSYS_STARTING);
+
+Setup and teardown a dynamically allocated state in the ONLINE section
+for notifications on offline operations::
+
+ state = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "subsys:offline", NULL, subsys_cpu_offline);
+ if (state < 0)
+ return state;
+ ....
+ cpuhp_remove_state(state);
+
+Setup and teardown a dynamically allocated state in the ONLINE section
+for notifications on online operations without invoking the callbacks::
+
+ state = cpuhp_setup_state_nocalls(CPUHP_AP_ONLINE_DYN, "subsys:online", subsys_cpu_online, NULL);
+ if (state < 0)
+ return state;
+ ....
+ cpuhp_remove_state_nocalls(state);
+
+Setup, use and teardown a dynamically allocated multi-instance state in the
+ONLINE section for notifications on online and offline operation::
+
+ state = cpuhp_setup_state_multi(CPUHP_AP_ONLINE_DYN, "subsys:online", subsys_cpu_online, subsys_cpu_offline);
+ if (state < 0)
+ return state;
+ ....
+ ret = cpuhp_state_add_instance(state, &inst1->node);
+ if (ret)
+ return ret;
+ ....
+ ret = cpuhp_state_add_instance(state, &inst2->node);
+ if (ret)
+ return ret;
+ ....
+ cpuhp_remove_instance(state, &inst1->node);
+ ....
+ cpuhp_remove_instance(state, &inst2->node);
+ ....
+ remove_multi_state(state);
+
Testing of hotplug states
=========================
+
One way to verify whether a custom state is working as expected or not is to
shutdown a CPU and then put it online again. It is also possible to put the CPU
to certain state (for instance *CPUHP_AP_ONLINE*) and then go back to
*CPUHP_ONLINE*. This would simulate an error one state after *CPUHP_AP_ONLINE*
which would lead to rollback to the online state.
-All registered states are enumerated in ``/sys/devices/system/cpu/hotplug/states``: ::
+All registered states are enumerated in ``/sys/devices/system/cpu/hotplug/states`` ::
$ tail /sys/devices/system/cpu/hotplug/states
138: mm/vmscan:online
@@ -275,7 +642,7 @@ All registered states are enumerated in ``/sys/devices/system/cpu/hotplug/states
168: sched:active
169: online
-To rollback CPU4 to ``lib/percpu_cnt:online`` and back online just issue: ::
+To rollback CPU4 to ``lib/percpu_cnt:online`` and back online just issue::
$ cat /sys/devices/system/cpu/cpu4/hotplug/state
169
@@ -283,14 +650,14 @@ To rollback CPU4 to ``lib/percpu_cnt:online`` and back online just issue: ::
$ cat /sys/devices/system/cpu/cpu4/hotplug/state
140
-It is important to note that the teardown callbac of state 140 have been
-invoked. And now get back online: ::
+It is important to note that the teardown callback of state 140 have been
+invoked. And now get back online::
$ echo 169 > /sys/devices/system/cpu/cpu4/hotplug/target
$ cat /sys/devices/system/cpu/cpu4/hotplug/state
169
-With trace events enabled, the individual steps are visible, too: ::
+With trace events enabled, the individual steps are visible, too::
# TASK-PID CPU# TIMESTAMP FUNCTION
# | | | | |
@@ -325,6 +692,7 @@ trace.
Architecture's requirements
===========================
+
The following functions and configurations are required:
``CONFIG_HOTPLUG_CPU``
@@ -346,11 +714,12 @@ The following functions and configurations are required:
User Space Notification
=======================
-After CPU successfully onlined or offline udev events are sent. A udev rule like: ::
+
+After CPU successfully onlined or offline udev events are sent. A udev rule like::
SUBSYSTEM=="cpu", DRIVERS=="processor", DEVPATH=="/devices/system/cpu/*", RUN+="the_hotplug_receiver.sh"
-will receive all events. A script like: ::
+will receive all events. A script like::
#!/bin/sh
@@ -366,6 +735,24 @@ will receive all events. A script like: ::
can process the event further.
+When changes to the CPUs in the system occur, the sysfs file
+/sys/devices/system/cpu/crash_hotplug contains '1' if the kernel
+updates the kdump capture kernel list of CPUs itself (via elfcorehdr),
+or '0' if userspace must update the kdump capture kernel list of CPUs.
+
+The availability depends on the CONFIG_HOTPLUG_CPU kernel configuration
+option.
+
+To skip userspace processing of CPU hot un/plug events for kdump
+(i.e. the unload-then-reload to obtain a current list of CPUs), this sysfs
+file can be used in a udev rule as follows:
+
+ SUBSYSTEM=="cpu", ATTRS{crash_hotplug}=="1", GOTO="kdump_reload_end"
+
+For a CPU hot un/plug event, if the architecture supports kernel updates
+of the elfcorehdr (which contains the list of CPUs), then the rule skips
+the unload-then-reload of the kdump capture kernel.
+
Kernel Inline Documentations Reference
======================================
diff --git a/Documentation/core-api/debugging-via-ohci1394.rst b/Documentation/core-api/debugging-via-ohci1394.rst
new file mode 100644
index 000000000000..cb3d3228dfc8
--- /dev/null
+++ b/Documentation/core-api/debugging-via-ohci1394.rst
@@ -0,0 +1,185 @@
+===========================================================================
+Using physical DMA provided by OHCI-1394 FireWire controllers for debugging
+===========================================================================
+
+Introduction
+------------
+
+Basically all FireWire controllers which are in use today are compliant
+to the OHCI-1394 specification which defines the controller to be a PCI
+bus master which uses DMA to offload data transfers from the CPU and has
+a "Physical Response Unit" which executes specific requests by employing
+PCI-Bus master DMA after applying filters defined by the OHCI-1394 driver.
+
+Once properly configured, remote machines can send these requests to
+ask the OHCI-1394 controller to perform read and write requests on
+physical system memory and, for read requests, send the result of
+the physical memory read back to the requester.
+
+With that, it is possible to debug issues by reading interesting memory
+locations such as buffers like the printk buffer or the process table.
+
+Retrieving a full system memory dump is also possible over the FireWire,
+using data transfer rates in the order of 10MB/s or more.
+
+With most FireWire controllers, memory access is limited to the low 4 GB
+of physical address space. This can be a problem on machines where memory is
+located mostly above that limit, but it is rarely a problem on more common
+hardware such as x86, x86-64 and PowerPC.
+
+At least LSI FW643e and FW643e2 controllers are known to support access to
+physical addresses above 4 GB, but this feature is currently not enabled by
+Linux.
+
+Together with a early initialization of the OHCI-1394 controller for debugging,
+this facility proved most useful for examining long debugs logs in the printk
+buffer on to debug early boot problems in areas like ACPI where the system
+fails to boot and other means for debugging (serial port) are either not
+available (notebooks) or too slow for extensive debug information (like ACPI).
+
+Drivers
+-------
+
+The firewire-ohci driver in drivers/firewire uses filtered physical
+DMA by default, which is more secure but not suitable for remote debugging.
+Pass the remote_dma=1 parameter to the driver to get unfiltered physical DMA.
+
+Because the firewire-ohci driver depends on the PCI enumeration to be
+completed, an initialization routine which runs pretty early has been
+implemented for x86. This routine runs long before console_init() can be
+called, i.e. before the printk buffer appears on the console.
+
+To activate it, enable CONFIG_PROVIDE_OHCI1394_DMA_INIT (Kernel hacking menu:
+Remote debugging over FireWire early on boot) and pass the parameter
+"ohci1394_dma=early" to the recompiled kernel on boot.
+
+Tools
+-----
+
+firescope - Originally developed by Benjamin Herrenschmidt, Andi Kleen ported
+it from PowerPC to x86 and x86_64 and added functionality, firescope can now
+be used to view the printk buffer of a remote machine, even with live update.
+
+Bernhard Kaindl enhanced firescope to support accessing 64-bit machines
+from 32-bit firescope and vice versa:
+- http://v3.sk/~lkundrak/firescope/
+
+and he implemented fast system dump (alpha version - read README.txt):
+- http://halobates.de/firewire/firedump-0.1.tar.bz2
+
+There is also a gdb proxy for firewire which allows to use gdb to access
+data which can be referenced from symbols found by gdb in vmlinux:
+- http://halobates.de/firewire/fireproxy-0.33.tar.bz2
+
+The latest version of this gdb proxy (fireproxy-0.34) can communicate (not
+yet stable) with kgdb over an memory-based communication module (kgdbom).
+
+Getting Started
+---------------
+
+The OHCI-1394 specification regulates that the OHCI-1394 controller must
+disable all physical DMA on each bus reset.
+
+This means that if you want to debug an issue in a system state where
+interrupts are disabled and where no polling of the OHCI-1394 controller
+for bus resets takes place, you have to establish any FireWire cable
+connections and fully initialize all FireWire hardware __before__ the
+system enters such state.
+
+Step-by-step instructions for using firescope with early OHCI initialization:
+
+1) Verify that your hardware is supported:
+
+ Load the firewire-ohci module and check your kernel logs.
+ You should see a line similar to::
+
+ firewire_ohci 0000:15:00.1: added OHCI v1.0 device as card 2, 4 IR + 4 IT
+ ... contexts, quirks 0x11
+
+ when loading the driver. If you have no supported controller, many PCI,
+ CardBus and even some Express cards which are fully compliant to OHCI-1394
+ specification are available. If it requires no driver for Windows operating
+ systems, it most likely is. Only specialized shops have cards which are not
+ compliant, they are based on TI PCILynx chips and require drivers for Windows
+ operating systems.
+
+ The mentioned kernel log message contains the string "physUB" if the
+ controller implements a writable Physical Upper Bound register. This is
+ required for physical DMA above 4 GB (but not utilized by Linux yet).
+
+2) Establish a working FireWire cable connection:
+
+ Any FireWire cable, as long at it provides electrically and mechanically
+ stable connection and has matching connectors (there are small 4-pin and
+ large 6-pin FireWire ports) will do.
+
+ If an driver is running on both machines you should see a line like::
+
+ firewire_core 0000:15:00.1: created device fw1: GUID 00061b0020105917, S400
+
+ on both machines in the kernel log when the cable is plugged in
+ and connects the two machines.
+
+3) Test physical DMA using firescope:
+
+ On the debug host, make sure that /dev/fw* is accessible,
+ then start firescope::
+
+ $ firescope
+ Port 0 (/dev/fw1) opened, 2 nodes detected
+
+ FireScope
+ ---------
+ Target : <unspecified>
+ Gen : 1
+ [Ctrl-T] choose target
+ [Ctrl-H] this menu
+ [Ctrl-Q] quit
+
+ ------> Press Ctrl-T now, the output should be similar to:
+
+ 2 nodes available, local node is: 0
+ 0: ffc0, uuid: 00000000 00000000 [LOCAL]
+ 1: ffc1, uuid: 00279000 ba4bb801
+
+ Besides the [LOCAL] node, it must show another node without error message.
+
+4) Prepare for debugging with early OHCI-1394 initialization:
+
+ 4.1) Kernel compilation and installation on debug target
+
+ Compile the kernel to be debugged with CONFIG_PROVIDE_OHCI1394_DMA_INIT
+ (Kernel hacking: Provide code for enabling DMA over FireWire early on boot)
+ enabled and install it on the machine to be debugged (debug target).
+
+ 4.2) Transfer the System.map of the debugged kernel to the debug host
+
+ Copy the System.map of the kernel be debugged to the debug host (the host
+ which is connected to the debugged machine over the FireWire cable).
+
+5) Retrieving the printk buffer contents:
+
+ With the FireWire cable connected, the OHCI-1394 driver on the debugging
+ host loaded, reboot the debugged machine, booting the kernel which has
+ CONFIG_PROVIDE_OHCI1394_DMA_INIT enabled, with the option ohci1394_dma=early.
+
+ Then, on the debugging host, run firescope, for example by using -A::
+
+ firescope -A System.map-of-debug-target-kernel
+
+ Note: -A automatically attaches to the first non-local node. It only works
+ reliably if only connected two machines are connected using FireWire.
+
+ After having attached to the debug target, press Ctrl-D to view the
+ complete printk buffer or Ctrl-U to enter auto update mode and get an
+ updated live view of recent kernel messages logged on the debug target.
+
+ Call "firescope -h" to get more information on firescope's options.
+
+Notes
+-----
+
+Documentation and specifications: http://halobates.de/firewire/
+
+FireWire is a trademark of Apple Inc. - for more information please refer to:
+https://en.wikipedia.org/wiki/FireWire
diff --git a/Documentation/core-api/dma-api-howto.rst b/Documentation/core-api/dma-api-howto.rst
new file mode 100644
index 000000000000..e8a55f9d61db
--- /dev/null
+++ b/Documentation/core-api/dma-api-howto.rst
@@ -0,0 +1,915 @@
+=========================
+Dynamic DMA mapping Guide
+=========================
+
+:Author: David S. Miller <davem@redhat.com>
+:Author: Richard Henderson <rth@cygnus.com>
+:Author: Jakub Jelinek <jakub@redhat.com>
+
+This is a guide to device driver writers on how to use the DMA API
+with example pseudo-code. For a concise description of the API, see
+Documentation/core-api/dma-api.rst.
+
+CPU and DMA addresses
+=====================
+
+There are several kinds of addresses involved in the DMA API, and it's
+important to understand the differences.
+
+The kernel normally uses virtual addresses. Any address returned by
+kmalloc(), vmalloc(), and similar interfaces is a virtual address and can
+be stored in a ``void *``.
+
+The virtual memory system (TLB, page tables, etc.) translates virtual
+addresses to CPU physical addresses, which are stored as "phys_addr_t" or
+"resource_size_t". The kernel manages device resources like registers as
+physical addresses. These are the addresses in /proc/iomem. The physical
+address is not directly useful to a driver; it must use ioremap() to map
+the space and produce a virtual address.
+
+I/O devices use a third kind of address: a "bus address". If a device has
+registers at an MMIO address, or if it performs DMA to read or write system
+memory, the addresses used by the device are bus addresses. In some
+systems, bus addresses are identical to CPU physical addresses, but in
+general they are not. IOMMUs and host bridges can produce arbitrary
+mappings between physical and bus addresses.
+
+From a device's point of view, DMA uses the bus address space, but it may
+be restricted to a subset of that space. For example, even if a system
+supports 64-bit addresses for main memory and PCI BARs, it may use an IOMMU
+so devices only need to use 32-bit DMA addresses.
+
+Here's a picture and some examples::
+
+ CPU CPU Bus
+ Virtual Physical Address
+ Address Address Space
+ Space Space
+
+ +-------+ +------+ +------+
+ | | |MMIO | Offset | |
+ | | Virtual |Space | applied | |
+ C +-------+ --------> B +------+ ----------> +------+ A
+ | | mapping | | by host | |
+ +-----+ | | | | bridge | | +--------+
+ | | | | +------+ | | | |
+ | CPU | | | | RAM | | | | Device |
+ | | | | | | | | | |
+ +-----+ +-------+ +------+ +------+ +--------+
+ | | Virtual |Buffer| Mapping | |
+ X +-------+ --------> Y +------+ <---------- +------+ Z
+ | | mapping | RAM | by IOMMU
+ | | | |
+ | | | |
+ +-------+ +------+
+
+During the enumeration process, the kernel learns about I/O devices and
+their MMIO space and the host bridges that connect them to the system. For
+example, if a PCI device has a BAR, the kernel reads the bus address (A)
+from the BAR and converts it to a CPU physical address (B). The address B
+is stored in a struct resource and usually exposed via /proc/iomem. When a
+driver claims a device, it typically uses ioremap() to map physical address
+B at a virtual address (C). It can then use, e.g., ioread32(C), to access
+the device registers at bus address A.
+
+If the device supports DMA, the driver sets up a buffer using kmalloc() or
+a similar interface, which returns a virtual address (X). The virtual
+memory system maps X to a physical address (Y) in system RAM. The driver
+can use virtual address X to access the buffer, but the device itself
+cannot because DMA doesn't go through the CPU virtual memory system.
+
+In some simple systems, the device can do DMA directly to physical address
+Y. But in many others, there is IOMMU hardware that translates DMA
+addresses to physical addresses, e.g., it translates Z to Y. This is part
+of the reason for the DMA API: the driver can give a virtual address X to
+an interface like dma_map_single(), which sets up any required IOMMU
+mapping and returns the DMA address Z. The driver then tells the device to
+do DMA to Z, and the IOMMU maps it to the buffer at address Y in system
+RAM.
+
+So that Linux can use the dynamic DMA mapping, it needs some help from the
+drivers, namely it has to take into account that DMA addresses should be
+mapped only for the time they are actually used and unmapped after the DMA
+transfer.
+
+The following API will work of course even on platforms where no such
+hardware exists.
+
+Note that the DMA API works with any bus independent of the underlying
+microprocessor architecture. You should use the DMA API rather than the
+bus-specific DMA API, i.e., use the dma_map_*() interfaces rather than the
+pci_map_*() interfaces.
+
+First of all, you should make sure::
+
+ #include <linux/dma-mapping.h>
+
+is in your driver, which provides the definition of dma_addr_t. This type
+can hold any valid DMA address for the platform and should be used
+everywhere you hold a DMA address returned from the DMA mapping functions.
+
+What memory is DMA'able?
+========================
+
+The first piece of information you must know is what kernel memory can
+be used with the DMA mapping facilities. There has been an unwritten
+set of rules regarding this, and this text is an attempt to finally
+write them down.
+
+If you acquired your memory via the page allocator
+(i.e. __get_free_page*()) or the generic memory allocators
+(i.e. kmalloc() or kmem_cache_alloc()) then you may DMA to/from
+that memory using the addresses returned from those routines.
+
+This means specifically that you may _not_ use the memory/addresses
+returned from vmalloc() for DMA. It is possible to DMA to the
+_underlying_ memory mapped into a vmalloc() area, but this requires
+walking page tables to get the physical addresses, and then
+translating each of those pages back to a kernel address using
+something like __va(). [ EDIT: Update this when we integrate
+Gerd Knorr's generic code which does this. ]
+
+This rule also means that you may use neither kernel image addresses
+(items in data/text/bss segments), nor module image addresses, nor
+stack addresses for DMA. These could all be mapped somewhere entirely
+different than the rest of physical memory. Even if those classes of
+memory could physically work with DMA, you'd need to ensure the I/O
+buffers were cacheline-aligned. Without that, you'd see cacheline
+sharing problems (data corruption) on CPUs with DMA-incoherent caches.
+(The CPU could write to one word, DMA would write to a different one
+in the same cache line, and one of them could be overwritten.)
+
+Also, this means that you cannot take the return of a kmap()
+call and DMA to/from that. This is similar to vmalloc().
+
+What about block I/O and networking buffers? The block I/O and
+networking subsystems make sure that the buffers they use are valid
+for you to DMA from/to.
+
+DMA addressing capabilities
+===========================
+
+By default, the kernel assumes that your device can address 32-bits of DMA
+addressing. For a 64-bit capable device, this needs to be increased, and for
+a device with limitations, it needs to be decreased.
+
+Special note about PCI: PCI-X specification requires PCI-X devices to support
+64-bit addressing (DAC) for all transactions. And at least one platform (SGI
+SN2) requires 64-bit consistent allocations to operate correctly when the IO
+bus is in PCI-X mode.
+
+For correct operation, you must set the DMA mask to inform the kernel about
+your devices DMA addressing capabilities.
+
+This is performed via a call to dma_set_mask_and_coherent()::
+
+ int dma_set_mask_and_coherent(struct device *dev, u64 mask);
+
+which will set the mask for both streaming and coherent APIs together. If you
+have some special requirements, then the following two separate calls can be
+used instead:
+
+ The setup for streaming mappings is performed via a call to
+ dma_set_mask()::
+
+ int dma_set_mask(struct device *dev, u64 mask);
+
+ The setup for consistent allocations is performed via a call
+ to dma_set_coherent_mask()::
+
+ int dma_set_coherent_mask(struct device *dev, u64 mask);
+
+Here, dev is a pointer to the device struct of your device, and mask is a bit
+mask describing which bits of an address your device supports. Often the
+device struct of your device is embedded in the bus-specific device struct of
+your device. For example, &pdev->dev is a pointer to the device struct of a
+PCI device (pdev is a pointer to the PCI device struct of your device).
+
+These calls usually return zero to indicate your device can perform DMA
+properly on the machine given the address mask you provided, but they might
+return an error if the mask is too small to be supportable on the given
+system. If it returns non-zero, your device cannot perform DMA properly on
+this platform, and attempting to do so will result in undefined behavior.
+You must not use DMA on this device unless the dma_set_mask family of
+functions has returned success.
+
+This means that in the failure case, you have two options:
+
+1) Use some non-DMA mode for data transfer, if possible.
+2) Ignore this device and do not initialize it.
+
+It is recommended that your driver print a kernel KERN_WARNING message when
+setting the DMA mask fails. In this manner, if a user of your driver reports
+that performance is bad or that the device is not even detected, you can ask
+them for the kernel messages to find out exactly why.
+
+The standard 64-bit addressing device would do something like this::
+
+ if (dma_set_mask_and_coherent(dev, DMA_BIT_MASK(64))) {
+ dev_warn(dev, "mydev: No suitable DMA available\n");
+ goto ignore_this_device;
+ }
+
+If the device only supports 32-bit addressing for descriptors in the
+coherent allocations, but supports full 64-bits for streaming mappings
+it would look like this::
+
+ if (dma_set_mask(dev, DMA_BIT_MASK(64))) {
+ dev_warn(dev, "mydev: No suitable DMA available\n");
+ goto ignore_this_device;
+ }
+
+The coherent mask will always be able to set the same or a smaller mask as
+the streaming mask. However for the rare case that a device driver only
+uses consistent allocations, one would have to check the return value from
+dma_set_coherent_mask().
+
+Finally, if your device can only drive the low 24-bits of
+address you might do something like::
+
+ if (dma_set_mask(dev, DMA_BIT_MASK(24))) {
+ dev_warn(dev, "mydev: 24-bit DMA addressing not available\n");
+ goto ignore_this_device;
+ }
+
+When dma_set_mask() or dma_set_mask_and_coherent() is successful, and
+returns zero, the kernel saves away this mask you have provided. The
+kernel will use this information later when you make DMA mappings.
+
+There is a case which we are aware of at this time, which is worth
+mentioning in this documentation. If your device supports multiple
+functions (for example a sound card provides playback and record
+functions) and the various different functions have _different_
+DMA addressing limitations, you may wish to probe each mask and
+only provide the functionality which the machine can handle. It
+is important that the last call to dma_set_mask() be for the
+most specific mask.
+
+Here is pseudo-code showing how this might be done::
+
+ #define PLAYBACK_ADDRESS_BITS DMA_BIT_MASK(32)
+ #define RECORD_ADDRESS_BITS DMA_BIT_MASK(24)
+
+ struct my_sound_card *card;
+ struct device *dev;
+
+ ...
+ if (!dma_set_mask(dev, PLAYBACK_ADDRESS_BITS)) {
+ card->playback_enabled = 1;
+ } else {
+ card->playback_enabled = 0;
+ dev_warn(dev, "%s: Playback disabled due to DMA limitations\n",
+ card->name);
+ }
+ if (!dma_set_mask(dev, RECORD_ADDRESS_BITS)) {
+ card->record_enabled = 1;
+ } else {
+ card->record_enabled = 0;
+ dev_warn(dev, "%s: Record disabled due to DMA limitations\n",
+ card->name);
+ }
+
+A sound card was used as an example here because this genre of PCI
+devices seems to be littered with ISA chips given a PCI front end,
+and thus retaining the 16MB DMA addressing limitations of ISA.
+
+Types of DMA mappings
+=====================
+
+There are two types of DMA mappings:
+
+- Consistent DMA mappings which are usually mapped at driver
+ initialization, unmapped at the end and for which the hardware should
+ guarantee that the device and the CPU can access the data
+ in parallel and will see updates made by each other without any
+ explicit software flushing.
+
+ Think of "consistent" as "synchronous" or "coherent".
+
+ The current default is to return consistent memory in the low 32
+ bits of the DMA space. However, for future compatibility you should
+ set the consistent mask even if this default is fine for your
+ driver.
+
+ Good examples of what to use consistent mappings for are:
+
+ - Network card DMA ring descriptors.
+ - SCSI adapter mailbox command data structures.
+ - Device firmware microcode executed out of
+ main memory.
+
+ The invariant these examples all require is that any CPU store
+ to memory is immediately visible to the device, and vice
+ versa. Consistent mappings guarantee this.
+
+ .. important::
+
+ Consistent DMA memory does not preclude the usage of
+ proper memory barriers. The CPU may reorder stores to
+ consistent memory just as it may normal memory. Example:
+ if it is important for the device to see the first word
+ of a descriptor updated before the second, you must do
+ something like::
+
+ desc->word0 = address;
+ wmb();
+ desc->word1 = DESC_VALID;
+
+ in order to get correct behavior on all platforms.
+
+ Also, on some platforms your driver may need to flush CPU write
+ buffers in much the same way as it needs to flush write buffers
+ found in PCI bridges (such as by reading a register's value
+ after writing it).
+
+- Streaming DMA mappings which are usually mapped for one DMA
+ transfer, unmapped right after it (unless you use dma_sync_* below)
+ and for which hardware can optimize for sequential accesses.
+
+ Think of "streaming" as "asynchronous" or "outside the coherency
+ domain".
+
+ Good examples of what to use streaming mappings for are:
+
+ - Networking buffers transmitted/received by a device.
+ - Filesystem buffers written/read by a SCSI device.
+
+ The interfaces for using this type of mapping were designed in
+ such a way that an implementation can make whatever performance
+ optimizations the hardware allows. To this end, when using
+ such mappings you must be explicit about what you want to happen.
+
+Neither type of DMA mapping has alignment restrictions that come from
+the underlying bus, although some devices may have such restrictions.
+Also, systems with caches that aren't DMA-coherent will work better
+when the underlying buffers don't share cache lines with other data.
+
+
+Using Consistent DMA mappings
+=============================
+
+To allocate and map large (PAGE_SIZE or so) consistent DMA regions,
+you should do::
+
+ dma_addr_t dma_handle;
+
+ cpu_addr = dma_alloc_coherent(dev, size, &dma_handle, gfp);
+
+where device is a ``struct device *``. This may be called in interrupt
+context with the GFP_ATOMIC flag.
+
+Size is the length of the region you want to allocate, in bytes.
+
+This routine will allocate RAM for that region, so it acts similarly to
+__get_free_pages() (but takes size instead of a page order). If your
+driver needs regions sized smaller than a page, you may prefer using
+the dma_pool interface, described below.
+
+The consistent DMA mapping interfaces, will by default return a DMA address
+which is 32-bit addressable. Even if the device indicates (via the DMA mask)
+that it may address the upper 32-bits, consistent allocation will only
+return > 32-bit addresses for DMA if the consistent DMA mask has been
+explicitly changed via dma_set_coherent_mask(). This is true of the
+dma_pool interface as well.
+
+dma_alloc_coherent() returns two values: the virtual address which you
+can use to access it from the CPU and dma_handle which you pass to the
+card.
+
+The CPU virtual address and the DMA address are both
+guaranteed to be aligned to the smallest PAGE_SIZE order which
+is greater than or equal to the requested size. This invariant
+exists (for example) to guarantee that if you allocate a chunk
+which is smaller than or equal to 64 kilobytes, the extent of the
+buffer you receive will not cross a 64K boundary.
+
+To unmap and free such a DMA region, you call::
+
+ dma_free_coherent(dev, size, cpu_addr, dma_handle);
+
+where dev, size are the same as in the above call and cpu_addr and
+dma_handle are the values dma_alloc_coherent() returned to you.
+This function may not be called in interrupt context.
+
+If your driver needs lots of smaller memory regions, you can write
+custom code to subdivide pages returned by dma_alloc_coherent(),
+or you can use the dma_pool API to do that. A dma_pool is like
+a kmem_cache, but it uses dma_alloc_coherent(), not __get_free_pages().
+Also, it understands common hardware constraints for alignment,
+like queue heads needing to be aligned on N byte boundaries.
+
+Create a dma_pool like this::
+
+ struct dma_pool *pool;
+
+ pool = dma_pool_create(name, dev, size, align, boundary);
+
+The "name" is for diagnostics (like a kmem_cache name); dev and size
+are as above. The device's hardware alignment requirement for this
+type of data is "align" (which is expressed in bytes, and must be a
+power of two). If your device has no boundary crossing restrictions,
+pass 0 for boundary; passing 4096 says memory allocated from this pool
+must not cross 4KByte boundaries (but at that time it may be better to
+use dma_alloc_coherent() directly instead).
+
+Allocate memory from a DMA pool like this::
+
+ cpu_addr = dma_pool_alloc(pool, flags, &dma_handle);
+
+flags are GFP_KERNEL if blocking is permitted (not in_interrupt nor
+holding SMP locks), GFP_ATOMIC otherwise. Like dma_alloc_coherent(),
+this returns two values, cpu_addr and dma_handle.
+
+Free memory that was allocated from a dma_pool like this::
+
+ dma_pool_free(pool, cpu_addr, dma_handle);
+
+where pool is what you passed to dma_pool_alloc(), and cpu_addr and
+dma_handle are the values dma_pool_alloc() returned. This function
+may be called in interrupt context.
+
+Destroy a dma_pool by calling::
+
+ dma_pool_destroy(pool);
+
+Make sure you've called dma_pool_free() for all memory allocated
+from a pool before you destroy the pool. This function may not
+be called in interrupt context.
+
+DMA Direction
+=============
+
+The interfaces described in subsequent portions of this document
+take a DMA direction argument, which is an integer and takes on
+one of the following values::
+
+ DMA_BIDIRECTIONAL
+ DMA_TO_DEVICE
+ DMA_FROM_DEVICE
+ DMA_NONE
+
+You should provide the exact DMA direction if you know it.
+
+DMA_TO_DEVICE means "from main memory to the device"
+DMA_FROM_DEVICE means "from the device to main memory"
+It is the direction in which the data moves during the DMA
+transfer.
+
+You are _strongly_ encouraged to specify this as precisely
+as you possibly can.
+
+If you absolutely cannot know the direction of the DMA transfer,
+specify DMA_BIDIRECTIONAL. It means that the DMA can go in
+either direction. The platform guarantees that you may legally
+specify this, and that it will work, but this may be at the
+cost of performance for example.
+
+The value DMA_NONE is to be used for debugging. One can
+hold this in a data structure before you come to know the
+precise direction, and this will help catch cases where your
+direction tracking logic has failed to set things up properly.
+
+Another advantage of specifying this value precisely (outside of
+potential platform-specific optimizations of such) is for debugging.
+Some platforms actually have a write permission boolean which DMA
+mappings can be marked with, much like page protections in the user
+program address space. Such platforms can and do report errors in the
+kernel logs when the DMA controller hardware detects violation of the
+permission setting.
+
+Only streaming mappings specify a direction, consistent mappings
+implicitly have a direction attribute setting of
+DMA_BIDIRECTIONAL.
+
+The SCSI subsystem tells you the direction to use in the
+'sc_data_direction' member of the SCSI command your driver is
+working on.
+
+For Networking drivers, it's a rather simple affair. For transmit
+packets, map/unmap them with the DMA_TO_DEVICE direction
+specifier. For receive packets, just the opposite, map/unmap them
+with the DMA_FROM_DEVICE direction specifier.
+
+Using Streaming DMA mappings
+============================
+
+The streaming DMA mapping routines can be called from interrupt
+context. There are two versions of each map/unmap, one which will
+map/unmap a single memory region, and one which will map/unmap a
+scatterlist.
+
+To map a single region, you do::
+
+ struct device *dev = &my_dev->dev;
+ dma_addr_t dma_handle;
+ void *addr = buffer->ptr;
+ size_t size = buffer->len;
+
+ dma_handle = dma_map_single(dev, addr, size, direction);
+ if (dma_mapping_error(dev, dma_handle)) {
+ /*
+ * reduce current DMA mapping usage,
+ * delay and try again later or
+ * reset driver.
+ */
+ goto map_error_handling;
+ }
+
+and to unmap it::
+
+ dma_unmap_single(dev, dma_handle, size, direction);
+
+You should call dma_mapping_error() as dma_map_single() could fail and return
+error. Doing so will ensure that the mapping code will work correctly on all
+DMA implementations without any dependency on the specifics of the underlying
+implementation. Using the returned address without checking for errors could
+result in failures ranging from panics to silent data corruption. The same
+applies to dma_map_page() as well.
+
+You should call dma_unmap_single() when the DMA activity is finished, e.g.,
+from the interrupt which told you that the DMA transfer is done.
+
+Using CPU pointers like this for single mappings has a disadvantage:
+you cannot reference HIGHMEM memory in this way. Thus, there is a
+map/unmap interface pair akin to dma_{map,unmap}_single(). These
+interfaces deal with page/offset pairs instead of CPU pointers.
+Specifically::
+
+ struct device *dev = &my_dev->dev;
+ dma_addr_t dma_handle;
+ struct page *page = buffer->page;
+ unsigned long offset = buffer->offset;
+ size_t size = buffer->len;
+
+ dma_handle = dma_map_page(dev, page, offset, size, direction);
+ if (dma_mapping_error(dev, dma_handle)) {
+ /*
+ * reduce current DMA mapping usage,
+ * delay and try again later or
+ * reset driver.
+ */
+ goto map_error_handling;
+ }
+
+ ...
+
+ dma_unmap_page(dev, dma_handle, size, direction);
+
+Here, "offset" means byte offset within the given page.
+
+You should call dma_mapping_error() as dma_map_page() could fail and return
+error as outlined under the dma_map_single() discussion.
+
+You should call dma_unmap_page() when the DMA activity is finished, e.g.,
+from the interrupt which told you that the DMA transfer is done.
+
+With scatterlists, you map a region gathered from several regions by::
+
+ int i, count = dma_map_sg(dev, sglist, nents, direction);
+ struct scatterlist *sg;
+
+ for_each_sg(sglist, sg, count, i) {
+ hw_address[i] = sg_dma_address(sg);
+ hw_len[i] = sg_dma_len(sg);
+ }
+
+where nents is the number of entries in the sglist.
+
+The implementation is free to merge several consecutive sglist entries
+into one (e.g. if DMA mapping is done with PAGE_SIZE granularity, any
+consecutive sglist entries can be merged into one provided the first one
+ends and the second one starts on a page boundary - in fact this is a huge
+advantage for cards which either cannot do scatter-gather or have very
+limited number of scatter-gather entries) and returns the actual number
+of sg entries it mapped them to. On failure 0 is returned.
+
+Then you should loop count times (note: this can be less than nents times)
+and use sg_dma_address() and sg_dma_len() macros where you previously
+accessed sg->address and sg->length as shown above.
+
+To unmap a scatterlist, just call::
+
+ dma_unmap_sg(dev, sglist, nents, direction);
+
+Again, make sure DMA activity has already finished.
+
+.. note::
+
+ The 'nents' argument to the dma_unmap_sg call must be
+ the _same_ one you passed into the dma_map_sg call,
+ it should _NOT_ be the 'count' value _returned_ from the
+ dma_map_sg call.
+
+Every dma_map_{single,sg}() call should have its dma_unmap_{single,sg}()
+counterpart, because the DMA address space is a shared resource and
+you could render the machine unusable by consuming all DMA addresses.
+
+If you need to use the same streaming DMA region multiple times and touch
+the data in between the DMA transfers, the buffer needs to be synced
+properly in order for the CPU and device to see the most up-to-date and
+correct copy of the DMA buffer.
+
+So, firstly, just map it with dma_map_{single,sg}(), and after each DMA
+transfer call either::
+
+ dma_sync_single_for_cpu(dev, dma_handle, size, direction);
+
+or::
+
+ dma_sync_sg_for_cpu(dev, sglist, nents, direction);
+
+as appropriate.
+
+Then, if you wish to let the device get at the DMA area again,
+finish accessing the data with the CPU, and then before actually
+giving the buffer to the hardware call either::
+
+ dma_sync_single_for_device(dev, dma_handle, size, direction);
+
+or::
+
+ dma_sync_sg_for_device(dev, sglist, nents, direction);
+
+as appropriate.
+
+.. note::
+
+ The 'nents' argument to dma_sync_sg_for_cpu() and
+ dma_sync_sg_for_device() must be the same passed to
+ dma_map_sg(). It is _NOT_ the count returned by
+ dma_map_sg().
+
+After the last DMA transfer call one of the DMA unmap routines
+dma_unmap_{single,sg}(). If you don't touch the data from the first
+dma_map_*() call till dma_unmap_*(), then you don't have to call the
+dma_sync_*() routines at all.
+
+Here is pseudo code which shows a situation in which you would need
+to use the dma_sync_*() interfaces::
+
+ my_card_setup_receive_buffer(struct my_card *cp, char *buffer, int len)
+ {
+ dma_addr_t mapping;
+
+ mapping = dma_map_single(cp->dev, buffer, len, DMA_FROM_DEVICE);
+ if (dma_mapping_error(cp->dev, mapping)) {
+ /*
+ * reduce current DMA mapping usage,
+ * delay and try again later or
+ * reset driver.
+ */
+ goto map_error_handling;
+ }
+
+ cp->rx_buf = buffer;
+ cp->rx_len = len;
+ cp->rx_dma = mapping;
+
+ give_rx_buf_to_card(cp);
+ }
+
+ ...
+
+ my_card_interrupt_handler(int irq, void *devid, struct pt_regs *regs)
+ {
+ struct my_card *cp = devid;
+
+ ...
+ if (read_card_status(cp) == RX_BUF_TRANSFERRED) {
+ struct my_card_header *hp;
+
+ /* Examine the header to see if we wish
+ * to accept the data. But synchronize
+ * the DMA transfer with the CPU first
+ * so that we see updated contents.
+ */
+ dma_sync_single_for_cpu(&cp->dev, cp->rx_dma,
+ cp->rx_len,
+ DMA_FROM_DEVICE);
+
+ /* Now it is safe to examine the buffer. */
+ hp = (struct my_card_header *) cp->rx_buf;
+ if (header_is_ok(hp)) {
+ dma_unmap_single(&cp->dev, cp->rx_dma, cp->rx_len,
+ DMA_FROM_DEVICE);
+ pass_to_upper_layers(cp->rx_buf);
+ make_and_setup_new_rx_buf(cp);
+ } else {
+ /* CPU should not write to
+ * DMA_FROM_DEVICE-mapped area,
+ * so dma_sync_single_for_device() is
+ * not needed here. It would be required
+ * for DMA_BIDIRECTIONAL mapping if
+ * the memory was modified.
+ */
+ give_rx_buf_to_card(cp);
+ }
+ }
+ }
+
+Handling Errors
+===============
+
+DMA address space is limited on some architectures and an allocation
+failure can be determined by:
+
+- checking if dma_alloc_coherent() returns NULL or dma_map_sg returns 0
+
+- checking the dma_addr_t returned from dma_map_single() and dma_map_page()
+ by using dma_mapping_error()::
+
+ dma_addr_t dma_handle;
+
+ dma_handle = dma_map_single(dev, addr, size, direction);
+ if (dma_mapping_error(dev, dma_handle)) {
+ /*
+ * reduce current DMA mapping usage,
+ * delay and try again later or
+ * reset driver.
+ */
+ goto map_error_handling;
+ }
+
+- unmap pages that are already mapped, when mapping error occurs in the middle
+ of a multiple page mapping attempt. These example are applicable to
+ dma_map_page() as well.
+
+Example 1::
+
+ dma_addr_t dma_handle1;
+ dma_addr_t dma_handle2;
+
+ dma_handle1 = dma_map_single(dev, addr, size, direction);
+ if (dma_mapping_error(dev, dma_handle1)) {
+ /*
+ * reduce current DMA mapping usage,
+ * delay and try again later or
+ * reset driver.
+ */
+ goto map_error_handling1;
+ }
+ dma_handle2 = dma_map_single(dev, addr, size, direction);
+ if (dma_mapping_error(dev, dma_handle2)) {
+ /*
+ * reduce current DMA mapping usage,
+ * delay and try again later or
+ * reset driver.
+ */
+ goto map_error_handling2;
+ }
+
+ ...
+
+ map_error_handling2:
+ dma_unmap_single(dma_handle1);
+ map_error_handling1:
+
+Example 2::
+
+ /*
+ * if buffers are allocated in a loop, unmap all mapped buffers when
+ * mapping error is detected in the middle
+ */
+
+ dma_addr_t dma_addr;
+ dma_addr_t array[DMA_BUFFERS];
+ int save_index = 0;
+
+ for (i = 0; i < DMA_BUFFERS; i++) {
+
+ ...
+
+ dma_addr = dma_map_single(dev, addr, size, direction);
+ if (dma_mapping_error(dev, dma_addr)) {
+ /*
+ * reduce current DMA mapping usage,
+ * delay and try again later or
+ * reset driver.
+ */
+ goto map_error_handling;
+ }
+ array[i].dma_addr = dma_addr;
+ save_index++;
+ }
+
+ ...
+
+ map_error_handling:
+
+ for (i = 0; i < save_index; i++) {
+
+ ...
+
+ dma_unmap_single(array[i].dma_addr);
+ }
+
+Networking drivers must call dev_kfree_skb() to free the socket buffer
+and return NETDEV_TX_OK if the DMA mapping fails on the transmit hook
+(ndo_start_xmit). This means that the socket buffer is just dropped in
+the failure case.
+
+SCSI drivers must return SCSI_MLQUEUE_HOST_BUSY if the DMA mapping
+fails in the queuecommand hook. This means that the SCSI subsystem
+passes the command to the driver again later.
+
+Optimizing Unmap State Space Consumption
+========================================
+
+On many platforms, dma_unmap_{single,page}() is simply a nop.
+Therefore, keeping track of the mapping address and length is a waste
+of space. Instead of filling your drivers up with ifdefs and the like
+to "work around" this (which would defeat the whole purpose of a
+portable API) the following facilities are provided.
+
+Actually, instead of describing the macros one by one, we'll
+transform some example code.
+
+1) Use DEFINE_DMA_UNMAP_{ADDR,LEN} in state saving structures.
+ Example, before::
+
+ struct ring_state {
+ struct sk_buff *skb;
+ dma_addr_t mapping;
+ __u32 len;
+ };
+
+ after::
+
+ struct ring_state {
+ struct sk_buff *skb;
+ DEFINE_DMA_UNMAP_ADDR(mapping);
+ DEFINE_DMA_UNMAP_LEN(len);
+ };
+
+2) Use dma_unmap_{addr,len}_set() to set these values.
+ Example, before::
+
+ ringp->mapping = FOO;
+ ringp->len = BAR;
+
+ after::
+
+ dma_unmap_addr_set(ringp, mapping, FOO);
+ dma_unmap_len_set(ringp, len, BAR);
+
+3) Use dma_unmap_{addr,len}() to access these values.
+ Example, before::
+
+ dma_unmap_single(dev, ringp->mapping, ringp->len,
+ DMA_FROM_DEVICE);
+
+ after::
+
+ dma_unmap_single(dev,
+ dma_unmap_addr(ringp, mapping),
+ dma_unmap_len(ringp, len),
+ DMA_FROM_DEVICE);
+
+It really should be self-explanatory. We treat the ADDR and LEN
+separately, because it is possible for an implementation to only
+need the address in order to perform the unmap operation.
+
+Platform Issues
+===============
+
+If you are just writing drivers for Linux and do not maintain
+an architecture port for the kernel, you can safely skip down
+to "Closing".
+
+1) Struct scatterlist requirements.
+
+ You need to enable CONFIG_NEED_SG_DMA_LENGTH if the architecture
+ supports IOMMUs (including software IOMMU).
+
+2) ARCH_DMA_MINALIGN
+
+ Architectures must ensure that kmalloc'ed buffer is
+ DMA-safe. Drivers and subsystems depend on it. If an architecture
+ isn't fully DMA-coherent (i.e. hardware doesn't ensure that data in
+ the CPU cache is identical to data in main memory),
+ ARCH_DMA_MINALIGN must be set so that the memory allocator
+ makes sure that kmalloc'ed buffer doesn't share a cache line with
+ the others. See arch/arm/include/asm/cache.h as an example.
+
+ Note that ARCH_DMA_MINALIGN is about DMA memory alignment
+ constraints. You don't need to worry about the architecture data
+ alignment constraints (e.g. the alignment constraints about 64-bit
+ objects).
+
+Closing
+=======
+
+This document, and the API itself, would not be in its current
+form without the feedback and suggestions from numerous individuals.
+We would like to specifically mention, in no particular order, the
+following people::
+
+ Russell King <rmk@arm.linux.org.uk>
+ Leo Dagum <dagum@barrel.engr.sgi.com>
+ Ralf Baechle <ralf@oss.sgi.com>
+ Grant Grundler <grundler@cup.hp.com>
+ Jay Estabrook <Jay.Estabrook@compaq.com>
+ Thomas Sailer <sailer@ife.ee.ethz.ch>
+ Andrea Arcangeli <andrea@suse.de>
+ Jens Axboe <jens.axboe@oracle.com>
+ David Mosberger-Tang <davidm@hpl.hp.com>
diff --git a/Documentation/core-api/dma-api.rst b/Documentation/core-api/dma-api.rst
new file mode 100644
index 000000000000..8e3cce3d0a23
--- /dev/null
+++ b/Documentation/core-api/dma-api.rst
@@ -0,0 +1,846 @@
+============================================
+Dynamic DMA mapping using the generic device
+============================================
+
+:Author: James E.J. Bottomley <James.Bottomley@HansenPartnership.com>
+
+This document describes the DMA API. For a more gentle introduction
+of the API (and actual examples), see Documentation/core-api/dma-api-howto.rst.
+
+This API is split into two pieces. Part I describes the basic API.
+Part II describes extensions for supporting non-consistent memory
+machines. Unless you know that your driver absolutely has to support
+non-consistent platforms (this is usually only legacy platforms) you
+should only use the API described in part I.
+
+Part I - dma_API
+----------------
+
+To get the dma_API, you must #include <linux/dma-mapping.h>. This
+provides dma_addr_t and the interfaces described below.
+
+A dma_addr_t can hold any valid DMA address for the platform. It can be
+given to a device to use as a DMA source or target. A CPU cannot reference
+a dma_addr_t directly because there may be translation between its physical
+address space and the DMA address space.
+
+Part Ia - Using large DMA-coherent buffers
+------------------------------------------
+
+::
+
+ void *
+ dma_alloc_coherent(struct device *dev, size_t size,
+ dma_addr_t *dma_handle, gfp_t flag)
+
+Consistent memory is memory for which a write by either the device or
+the processor can immediately be read by the processor or device
+without having to worry about caching effects. (You may however need
+to make sure to flush the processor's write buffers before telling
+devices to read that memory.)
+
+This routine allocates a region of <size> bytes of consistent memory.
+
+It returns a pointer to the allocated region (in the processor's virtual
+address space) or NULL if the allocation failed.
+
+It also returns a <dma_handle> which may be cast to an unsigned integer the
+same width as the bus and given to the device as the DMA address base of
+the region.
+
+Note: consistent memory can be expensive on some platforms, and the
+minimum allocation length may be as big as a page, so you should
+consolidate your requests for consistent memory as much as possible.
+The simplest way to do that is to use the dma_pool calls (see below).
+
+The flag parameter (dma_alloc_coherent() only) allows the caller to
+specify the ``GFP_`` flags (see kmalloc()) for the allocation (the
+implementation may choose to ignore flags that affect the location of
+the returned memory, like GFP_DMA).
+
+::
+
+ void
+ dma_free_coherent(struct device *dev, size_t size, void *cpu_addr,
+ dma_addr_t dma_handle)
+
+Free a region of consistent memory you previously allocated. dev,
+size and dma_handle must all be the same as those passed into
+dma_alloc_coherent(). cpu_addr must be the virtual address returned by
+the dma_alloc_coherent().
+
+Note that unlike their sibling allocation calls, these routines
+may only be called with IRQs enabled.
+
+
+Part Ib - Using small DMA-coherent buffers
+------------------------------------------
+
+To get this part of the dma_API, you must #include <linux/dmapool.h>
+
+Many drivers need lots of small DMA-coherent memory regions for DMA
+descriptors or I/O buffers. Rather than allocating in units of a page
+or more using dma_alloc_coherent(), you can use DMA pools. These work
+much like a struct kmem_cache, except that they use the DMA-coherent allocator,
+not __get_free_pages(). Also, they understand common hardware constraints
+for alignment, like queue heads needing to be aligned on N-byte boundaries.
+
+
+::
+
+ struct dma_pool *
+ dma_pool_create(const char *name, struct device *dev,
+ size_t size, size_t align, size_t alloc);
+
+dma_pool_create() initializes a pool of DMA-coherent buffers
+for use with a given device. It must be called in a context which
+can sleep.
+
+The "name" is for diagnostics (like a struct kmem_cache name); dev and size
+are like what you'd pass to dma_alloc_coherent(). The device's hardware
+alignment requirement for this type of data is "align" (which is expressed
+in bytes, and must be a power of two). If your device has no boundary
+crossing restrictions, pass 0 for alloc; passing 4096 says memory allocated
+from this pool must not cross 4KByte boundaries.
+
+::
+
+ void *
+ dma_pool_zalloc(struct dma_pool *pool, gfp_t mem_flags,
+ dma_addr_t *handle)
+
+Wraps dma_pool_alloc() and also zeroes the returned memory if the
+allocation attempt succeeded.
+
+
+::
+
+ void *
+ dma_pool_alloc(struct dma_pool *pool, gfp_t gfp_flags,
+ dma_addr_t *dma_handle);
+
+This allocates memory from the pool; the returned memory will meet the
+size and alignment requirements specified at creation time. Pass
+GFP_ATOMIC to prevent blocking, or if it's permitted (not
+in_interrupt, not holding SMP locks), pass GFP_KERNEL to allow
+blocking. Like dma_alloc_coherent(), this returns two values: an
+address usable by the CPU, and the DMA address usable by the pool's
+device.
+
+::
+
+ void
+ dma_pool_free(struct dma_pool *pool, void *vaddr,
+ dma_addr_t addr);
+
+This puts memory back into the pool. The pool is what was passed to
+dma_pool_alloc(); the CPU (vaddr) and DMA addresses are what
+were returned when that routine allocated the memory being freed.
+
+::
+
+ void
+ dma_pool_destroy(struct dma_pool *pool);
+
+dma_pool_destroy() frees the resources of the pool. It must be
+called in a context which can sleep. Make sure you've freed all allocated
+memory back to the pool before you destroy it.
+
+
+Part Ic - DMA addressing limitations
+------------------------------------
+
+::
+
+ int
+ dma_set_mask_and_coherent(struct device *dev, u64 mask)
+
+Checks to see if the mask is possible and updates the device
+streaming and coherent DMA mask parameters if it is.
+
+Returns: 0 if successful and a negative error if not.
+
+::
+
+ int
+ dma_set_mask(struct device *dev, u64 mask)
+
+Checks to see if the mask is possible and updates the device
+parameters if it is.
+
+Returns: 0 if successful and a negative error if not.
+
+::
+
+ int
+ dma_set_coherent_mask(struct device *dev, u64 mask)
+
+Checks to see if the mask is possible and updates the device
+parameters if it is.
+
+Returns: 0 if successful and a negative error if not.
+
+::
+
+ u64
+ dma_get_required_mask(struct device *dev)
+
+This API returns the mask that the platform requires to
+operate efficiently. Usually this means the returned mask
+is the minimum required to cover all of memory. Examining the
+required mask gives drivers with variable descriptor sizes the
+opportunity to use smaller descriptors as necessary.
+
+Requesting the required mask does not alter the current mask. If you
+wish to take advantage of it, you should issue a dma_set_mask()
+call to set the mask to the value returned.
+
+::
+
+ size_t
+ dma_max_mapping_size(struct device *dev);
+
+Returns the maximum size of a mapping for the device. The size parameter
+of the mapping functions like dma_map_single(), dma_map_page() and
+others should not be larger than the returned value.
+
+::
+
+ size_t
+ dma_opt_mapping_size(struct device *dev);
+
+Returns the maximum optimal size of a mapping for the device.
+
+Mapping larger buffers may take much longer in certain scenarios. In
+addition, for high-rate short-lived streaming mappings, the upfront time
+spent on the mapping may account for an appreciable part of the total
+request lifetime. As such, if splitting larger requests incurs no
+significant performance penalty, then device drivers are advised to
+limit total DMA streaming mappings length to the returned value.
+
+::
+
+ bool
+ dma_need_sync(struct device *dev, dma_addr_t dma_addr);
+
+Returns %true if dma_sync_single_for_{device,cpu} calls are required to
+transfer memory ownership. Returns %false if those calls can be skipped.
+
+::
+
+ unsigned long
+ dma_get_merge_boundary(struct device *dev);
+
+Returns the DMA merge boundary. If the device cannot merge any the DMA address
+segments, the function returns 0.
+
+Part Id - Streaming DMA mappings
+--------------------------------
+
+::
+
+ dma_addr_t
+ dma_map_single(struct device *dev, void *cpu_addr, size_t size,
+ enum dma_data_direction direction)
+
+Maps a piece of processor virtual memory so it can be accessed by the
+device and returns the DMA address of the memory.
+
+The direction for both APIs may be converted freely by casting.
+However the dma_API uses a strongly typed enumerator for its
+direction:
+
+======================= =============================================
+DMA_NONE no direction (used for debugging)
+DMA_TO_DEVICE data is going from the memory to the device
+DMA_FROM_DEVICE data is coming from the device to the memory
+DMA_BIDIRECTIONAL direction isn't known
+======================= =============================================
+
+.. note::
+
+ Not all memory regions in a machine can be mapped by this API.
+ Further, contiguous kernel virtual space may not be contiguous as
+ physical memory. Since this API does not provide any scatter/gather
+ capability, it will fail if the user tries to map a non-physically
+ contiguous piece of memory. For this reason, memory to be mapped by
+ this API should be obtained from sources which guarantee it to be
+ physically contiguous (like kmalloc).
+
+ Further, the DMA address of the memory must be within the
+ dma_mask of the device (the dma_mask is a bit mask of the
+ addressable region for the device, i.e., if the DMA address of
+ the memory ANDed with the dma_mask is still equal to the DMA
+ address, then the device can perform DMA to the memory). To
+ ensure that the memory allocated by kmalloc is within the dma_mask,
+ the driver may specify various platform-dependent flags to restrict
+ the DMA address range of the allocation (e.g., on x86, GFP_DMA
+ guarantees to be within the first 16MB of available DMA addresses,
+ as required by ISA devices).
+
+ Note also that the above constraints on physical contiguity and
+ dma_mask may not apply if the platform has an IOMMU (a device which
+ maps an I/O DMA address to a physical memory address). However, to be
+ portable, device driver writers may *not* assume that such an IOMMU
+ exists.
+
+.. warning::
+
+ Memory coherency operates at a granularity called the cache
+ line width. In order for memory mapped by this API to operate
+ correctly, the mapped region must begin exactly on a cache line
+ boundary and end exactly on one (to prevent two separately mapped
+ regions from sharing a single cache line). Since the cache line size
+ may not be known at compile time, the API will not enforce this
+ requirement. Therefore, it is recommended that driver writers who
+ don't take special care to determine the cache line size at run time
+ only map virtual regions that begin and end on page boundaries (which
+ are guaranteed also to be cache line boundaries).
+
+ DMA_TO_DEVICE synchronisation must be done after the last modification
+ of the memory region by the software and before it is handed off to
+ the device. Once this primitive is used, memory covered by this
+ primitive should be treated as read-only by the device. If the device
+ may write to it at any point, it should be DMA_BIDIRECTIONAL (see
+ below).
+
+ DMA_FROM_DEVICE synchronisation must be done before the driver
+ accesses data that may be changed by the device. This memory should
+ be treated as read-only by the driver. If the driver needs to write
+ to it at any point, it should be DMA_BIDIRECTIONAL (see below).
+
+ DMA_BIDIRECTIONAL requires special handling: it means that the driver
+ isn't sure if the memory was modified before being handed off to the
+ device and also isn't sure if the device will also modify it. Thus,
+ you must always sync bidirectional memory twice: once before the
+ memory is handed off to the device (to make sure all memory changes
+ are flushed from the processor) and once before the data may be
+ accessed after being used by the device (to make sure any processor
+ cache lines are updated with data that the device may have changed).
+
+::
+
+ void
+ dma_unmap_single(struct device *dev, dma_addr_t dma_addr, size_t size,
+ enum dma_data_direction direction)
+
+Unmaps the region previously mapped. All the parameters passed in
+must be identical to those passed in (and returned) by the mapping
+API.
+
+::
+
+ dma_addr_t
+ dma_map_page(struct device *dev, struct page *page,
+ unsigned long offset, size_t size,
+ enum dma_data_direction direction)
+
+ void
+ dma_unmap_page(struct device *dev, dma_addr_t dma_address, size_t size,
+ enum dma_data_direction direction)
+
+API for mapping and unmapping for pages. All the notes and warnings
+for the other mapping APIs apply here. Also, although the <offset>
+and <size> parameters are provided to do partial page mapping, it is
+recommended that you never use these unless you really know what the
+cache width is.
+
+::
+
+ dma_addr_t
+ dma_map_resource(struct device *dev, phys_addr_t phys_addr, size_t size,
+ enum dma_data_direction dir, unsigned long attrs)
+
+ void
+ dma_unmap_resource(struct device *dev, dma_addr_t addr, size_t size,
+ enum dma_data_direction dir, unsigned long attrs)
+
+API for mapping and unmapping for MMIO resources. All the notes and
+warnings for the other mapping APIs apply here. The API should only be
+used to map device MMIO resources, mapping of RAM is not permitted.
+
+::
+
+ int
+ dma_mapping_error(struct device *dev, dma_addr_t dma_addr)
+
+In some circumstances dma_map_single(), dma_map_page() and dma_map_resource()
+will fail to create a mapping. A driver can check for these errors by testing
+the returned DMA address with dma_mapping_error(). A non-zero return value
+means the mapping could not be created and the driver should take appropriate
+action (e.g. reduce current DMA mapping usage or delay and try again later).
+
+::
+
+ int
+ dma_map_sg(struct device *dev, struct scatterlist *sg,
+ int nents, enum dma_data_direction direction)
+
+Returns: the number of DMA address segments mapped (this may be shorter
+than <nents> passed in if some elements of the scatter/gather list are
+physically or virtually adjacent and an IOMMU maps them with a single
+entry).
+
+Please note that the sg cannot be mapped again if it has been mapped once.
+The mapping process is allowed to destroy information in the sg.
+
+As with the other mapping interfaces, dma_map_sg() can fail. When it
+does, 0 is returned and a driver must take appropriate action. It is
+critical that the driver do something, in the case of a block driver
+aborting the request or even oopsing is better than doing nothing and
+corrupting the filesystem.
+
+With scatterlists, you use the resulting mapping like this::
+
+ int i, count = dma_map_sg(dev, sglist, nents, direction);
+ struct scatterlist *sg;
+
+ for_each_sg(sglist, sg, count, i) {
+ hw_address[i] = sg_dma_address(sg);
+ hw_len[i] = sg_dma_len(sg);
+ }
+
+where nents is the number of entries in the sglist.
+
+The implementation is free to merge several consecutive sglist entries
+into one (e.g. with an IOMMU, or if several pages just happen to be
+physically contiguous) and returns the actual number of sg entries it
+mapped them to. On failure 0, is returned.
+
+Then you should loop count times (note: this can be less than nents times)
+and use sg_dma_address() and sg_dma_len() macros where you previously
+accessed sg->address and sg->length as shown above.
+
+::
+
+ void
+ dma_unmap_sg(struct device *dev, struct scatterlist *sg,
+ int nents, enum dma_data_direction direction)
+
+Unmap the previously mapped scatter/gather list. All the parameters
+must be the same as those and passed in to the scatter/gather mapping
+API.
+
+Note: <nents> must be the number you passed in, *not* the number of
+DMA address entries returned.
+
+::
+
+ void
+ dma_sync_single_for_cpu(struct device *dev, dma_addr_t dma_handle,
+ size_t size,
+ enum dma_data_direction direction)
+
+ void
+ dma_sync_single_for_device(struct device *dev, dma_addr_t dma_handle,
+ size_t size,
+ enum dma_data_direction direction)
+
+ void
+ dma_sync_sg_for_cpu(struct device *dev, struct scatterlist *sg,
+ int nents,
+ enum dma_data_direction direction)
+
+ void
+ dma_sync_sg_for_device(struct device *dev, struct scatterlist *sg,
+ int nents,
+ enum dma_data_direction direction)
+
+Synchronise a single contiguous or scatter/gather mapping for the CPU
+and device. With the sync_sg API, all the parameters must be the same
+as those passed into the sg mapping API. With the sync_single API,
+you can use dma_handle and size parameters that aren't identical to
+those passed into the single mapping API to do a partial sync.
+
+
+.. note::
+
+ You must do this:
+
+ - Before reading values that have been written by DMA from the device
+ (use the DMA_FROM_DEVICE direction)
+ - After writing values that will be written to the device using DMA
+ (use the DMA_TO_DEVICE) direction
+ - before *and* after handing memory to the device if the memory is
+ DMA_BIDIRECTIONAL
+
+See also dma_map_single().
+
+::
+
+ dma_addr_t
+ dma_map_single_attrs(struct device *dev, void *cpu_addr, size_t size,
+ enum dma_data_direction dir,
+ unsigned long attrs)
+
+ void
+ dma_unmap_single_attrs(struct device *dev, dma_addr_t dma_addr,
+ size_t size, enum dma_data_direction dir,
+ unsigned long attrs)
+
+ int
+ dma_map_sg_attrs(struct device *dev, struct scatterlist *sgl,
+ int nents, enum dma_data_direction dir,
+ unsigned long attrs)
+
+ void
+ dma_unmap_sg_attrs(struct device *dev, struct scatterlist *sgl,
+ int nents, enum dma_data_direction dir,
+ unsigned long attrs)
+
+The four functions above are just like the counterpart functions
+without the _attrs suffixes, except that they pass an optional
+dma_attrs.
+
+The interpretation of DMA attributes is architecture-specific, and
+each attribute should be documented in
+Documentation/core-api/dma-attributes.rst.
+
+If dma_attrs are 0, the semantics of each of these functions
+is identical to those of the corresponding function
+without the _attrs suffix. As a result dma_map_single_attrs()
+can generally replace dma_map_single(), etc.
+
+As an example of the use of the ``*_attrs`` functions, here's how
+you could pass an attribute DMA_ATTR_FOO when mapping memory
+for DMA::
+
+ #include <linux/dma-mapping.h>
+ /* DMA_ATTR_FOO should be defined in linux/dma-mapping.h and
+ * documented in Documentation/core-api/dma-attributes.rst */
+ ...
+
+ unsigned long attr;
+ attr |= DMA_ATTR_FOO;
+ ....
+ n = dma_map_sg_attrs(dev, sg, nents, DMA_TO_DEVICE, attr);
+ ....
+
+Architectures that care about DMA_ATTR_FOO would check for its
+presence in their implementations of the mapping and unmapping
+routines, e.g.:::
+
+ void whizco_dma_map_sg_attrs(struct device *dev, dma_addr_t dma_addr,
+ size_t size, enum dma_data_direction dir,
+ unsigned long attrs)
+ {
+ ....
+ if (attrs & DMA_ATTR_FOO)
+ /* twizzle the frobnozzle */
+ ....
+ }
+
+
+Part II - Non-coherent DMA allocations
+--------------------------------------
+
+These APIs allow to allocate pages that are guaranteed to be DMA addressable
+by the passed in device, but which need explicit management of memory ownership
+for the kernel vs the device.
+
+If you don't understand how cache line coherency works between a processor and
+an I/O device, you should not be using this part of the API.
+
+::
+
+ struct page *
+ dma_alloc_pages(struct device *dev, size_t size, dma_addr_t *dma_handle,
+ enum dma_data_direction dir, gfp_t gfp)
+
+This routine allocates a region of <size> bytes of non-coherent memory. It
+returns a pointer to first struct page for the region, or NULL if the
+allocation failed. The resulting struct page can be used for everything a
+struct page is suitable for.
+
+It also returns a <dma_handle> which may be cast to an unsigned integer the
+same width as the bus and given to the device as the DMA address base of
+the region.
+
+The dir parameter specified if data is read and/or written by the device,
+see dma_map_single() for details.
+
+The gfp parameter allows the caller to specify the ``GFP_`` flags (see
+kmalloc()) for the allocation, but rejects flags used to specify a memory
+zone such as GFP_DMA or GFP_HIGHMEM.
+
+Before giving the memory to the device, dma_sync_single_for_device() needs
+to be called, and before reading memory written by the device,
+dma_sync_single_for_cpu(), just like for streaming DMA mappings that are
+reused.
+
+::
+
+ void
+ dma_free_pages(struct device *dev, size_t size, struct page *page,
+ dma_addr_t dma_handle, enum dma_data_direction dir)
+
+Free a region of memory previously allocated using dma_alloc_pages().
+dev, size, dma_handle and dir must all be the same as those passed into
+dma_alloc_pages(). page must be the pointer returned by dma_alloc_pages().
+
+::
+
+ int
+ dma_mmap_pages(struct device *dev, struct vm_area_struct *vma,
+ size_t size, struct page *page)
+
+Map an allocation returned from dma_alloc_pages() into a user address space.
+dev and size must be the same as those passed into dma_alloc_pages().
+page must be the pointer returned by dma_alloc_pages().
+
+::
+
+ void *
+ dma_alloc_noncoherent(struct device *dev, size_t size,
+ dma_addr_t *dma_handle, enum dma_data_direction dir,
+ gfp_t gfp)
+
+This routine is a convenient wrapper around dma_alloc_pages that returns the
+kernel virtual address for the allocated memory instead of the page structure.
+
+::
+
+ void
+ dma_free_noncoherent(struct device *dev, size_t size, void *cpu_addr,
+ dma_addr_t dma_handle, enum dma_data_direction dir)
+
+Free a region of memory previously allocated using dma_alloc_noncoherent().
+dev, size, dma_handle and dir must all be the same as those passed into
+dma_alloc_noncoherent(). cpu_addr must be the virtual address returned by
+dma_alloc_noncoherent().
+
+::
+
+ struct sg_table *
+ dma_alloc_noncontiguous(struct device *dev, size_t size,
+ enum dma_data_direction dir, gfp_t gfp,
+ unsigned long attrs);
+
+This routine allocates <size> bytes of non-coherent and possibly non-contiguous
+memory. It returns a pointer to struct sg_table that describes the allocated
+and DMA mapped memory, or NULL if the allocation failed. The resulting memory
+can be used for struct page mapped into a scatterlist are suitable for.
+
+The return sg_table is guaranteed to have 1 single DMA mapped segment as
+indicated by sgt->nents, but it might have multiple CPU side segments as
+indicated by sgt->orig_nents.
+
+The dir parameter specified if data is read and/or written by the device,
+see dma_map_single() for details.
+
+The gfp parameter allows the caller to specify the ``GFP_`` flags (see
+kmalloc()) for the allocation, but rejects flags used to specify a memory
+zone such as GFP_DMA or GFP_HIGHMEM.
+
+The attrs argument must be either 0 or DMA_ATTR_ALLOC_SINGLE_PAGES.
+
+Before giving the memory to the device, dma_sync_sgtable_for_device() needs
+to be called, and before reading memory written by the device,
+dma_sync_sgtable_for_cpu(), just like for streaming DMA mappings that are
+reused.
+
+::
+
+ void
+ dma_free_noncontiguous(struct device *dev, size_t size,
+ struct sg_table *sgt,
+ enum dma_data_direction dir)
+
+Free memory previously allocated using dma_alloc_noncontiguous(). dev, size,
+and dir must all be the same as those passed into dma_alloc_noncontiguous().
+sgt must be the pointer returned by dma_alloc_noncontiguous().
+
+::
+
+ void *
+ dma_vmap_noncontiguous(struct device *dev, size_t size,
+ struct sg_table *sgt)
+
+Return a contiguous kernel mapping for an allocation returned from
+dma_alloc_noncontiguous(). dev and size must be the same as those passed into
+dma_alloc_noncontiguous(). sgt must be the pointer returned by
+dma_alloc_noncontiguous().
+
+Once a non-contiguous allocation is mapped using this function, the
+flush_kernel_vmap_range() and invalidate_kernel_vmap_range() APIs must be used
+to manage the coherency between the kernel mapping, the device and user space
+mappings (if any).
+
+::
+
+ void
+ dma_vunmap_noncontiguous(struct device *dev, void *vaddr)
+
+Unmap a kernel mapping returned by dma_vmap_noncontiguous(). dev must be the
+same the one passed into dma_alloc_noncontiguous(). vaddr must be the pointer
+returned by dma_vmap_noncontiguous().
+
+
+::
+
+ int
+ dma_mmap_noncontiguous(struct device *dev, struct vm_area_struct *vma,
+ size_t size, struct sg_table *sgt)
+
+Map an allocation returned from dma_alloc_noncontiguous() into a user address
+space. dev and size must be the same as those passed into
+dma_alloc_noncontiguous(). sgt must be the pointer returned by
+dma_alloc_noncontiguous().
+
+::
+
+ int
+ dma_get_cache_alignment(void)
+
+Returns the processor cache alignment. This is the absolute minimum
+alignment *and* width that you must observe when either mapping
+memory or doing partial flushes.
+
+.. note::
+
+ This API may return a number *larger* than the actual cache
+ line, but it will guarantee that one or more cache lines fit exactly
+ into the width returned by this call. It will also always be a power
+ of two for easy alignment.
+
+
+Part III - Debug drivers use of the DMA-API
+-------------------------------------------
+
+The DMA-API as described above has some constraints. DMA addresses must be
+released with the corresponding function with the same size for example. With
+the advent of hardware IOMMUs it becomes more and more important that drivers
+do not violate those constraints. In the worst case such a violation can
+result in data corruption up to destroyed filesystems.
+
+To debug drivers and find bugs in the usage of the DMA-API checking code can
+be compiled into the kernel which will tell the developer about those
+violations. If your architecture supports it you can select the "Enable
+debugging of DMA-API usage" option in your kernel configuration. Enabling this
+option has a performance impact. Do not enable it in production kernels.
+
+If you boot the resulting kernel will contain code which does some bookkeeping
+about what DMA memory was allocated for which device. If this code detects an
+error it prints a warning message with some details into your kernel log. An
+example warning message may look like this::
+
+ WARNING: at /data2/repos/linux-2.6-iommu/lib/dma-debug.c:448
+ check_unmap+0x203/0x490()
+ Hardware name:
+ forcedeth 0000:00:08.0: DMA-API: device driver frees DMA memory with wrong
+ function [device address=0x00000000640444be] [size=66 bytes] [mapped as
+ single] [unmapped as page]
+ Modules linked in: nfsd exportfs bridge stp llc r8169
+ Pid: 0, comm: swapper Tainted: G W 2.6.28-dmatest-09289-g8bb99c0 #1
+ Call Trace:
+ <IRQ> [<ffffffff80240b22>] warn_slowpath+0xf2/0x130
+ [<ffffffff80647b70>] _spin_unlock+0x10/0x30
+ [<ffffffff80537e75>] usb_hcd_link_urb_to_ep+0x75/0xc0
+ [<ffffffff80647c22>] _spin_unlock_irqrestore+0x12/0x40
+ [<ffffffff8055347f>] ohci_urb_enqueue+0x19f/0x7c0
+ [<ffffffff80252f96>] queue_work+0x56/0x60
+ [<ffffffff80237e10>] enqueue_task_fair+0x20/0x50
+ [<ffffffff80539279>] usb_hcd_submit_urb+0x379/0xbc0
+ [<ffffffff803b78c3>] cpumask_next_and+0x23/0x40
+ [<ffffffff80235177>] find_busiest_group+0x207/0x8a0
+ [<ffffffff8064784f>] _spin_lock_irqsave+0x1f/0x50
+ [<ffffffff803c7ea3>] check_unmap+0x203/0x490
+ [<ffffffff803c8259>] debug_dma_unmap_page+0x49/0x50
+ [<ffffffff80485f26>] nv_tx_done_optimized+0xc6/0x2c0
+ [<ffffffff80486c13>] nv_nic_irq_optimized+0x73/0x2b0
+ [<ffffffff8026df84>] handle_IRQ_event+0x34/0x70
+ [<ffffffff8026ffe9>] handle_edge_irq+0xc9/0x150
+ [<ffffffff8020e3ab>] do_IRQ+0xcb/0x1c0
+ [<ffffffff8020c093>] ret_from_intr+0x0/0xa
+ <EOI> <4>---[ end trace f6435a98e2a38c0e ]---
+
+The driver developer can find the driver and the device including a stacktrace
+of the DMA-API call which caused this warning.
+
+Per default only the first error will result in a warning message. All other
+errors will only silently counted. This limitation exist to prevent the code
+from flooding your kernel log. To support debugging a device driver this can
+be disabled via debugfs. See the debugfs interface documentation below for
+details.
+
+The debugfs directory for the DMA-API debugging code is called dma-api/. In
+this directory the following files can currently be found:
+
+=============================== ===============================================
+dma-api/all_errors This file contains a numeric value. If this
+ value is not equal to zero the debugging code
+ will print a warning for every error it finds
+ into the kernel log. Be careful with this
+ option, as it can easily flood your logs.
+
+dma-api/disabled This read-only file contains the character 'Y'
+ if the debugging code is disabled. This can
+ happen when it runs out of memory or if it was
+ disabled at boot time
+
+dma-api/dump This read-only file contains current DMA
+ mappings.
+
+dma-api/error_count This file is read-only and shows the total
+ numbers of errors found.
+
+dma-api/num_errors The number in this file shows how many
+ warnings will be printed to the kernel log
+ before it stops. This number is initialized to
+ one at system boot and be set by writing into
+ this file
+
+dma-api/min_free_entries This read-only file can be read to get the
+ minimum number of free dma_debug_entries the
+ allocator has ever seen. If this value goes
+ down to zero the code will attempt to increase
+ nr_total_entries to compensate.
+
+dma-api/num_free_entries The current number of free dma_debug_entries
+ in the allocator.
+
+dma-api/nr_total_entries The total number of dma_debug_entries in the
+ allocator, both free and used.
+
+dma-api/driver_filter You can write a name of a driver into this file
+ to limit the debug output to requests from that
+ particular driver. Write an empty string to
+ that file to disable the filter and see
+ all errors again.
+=============================== ===============================================
+
+If you have this code compiled into your kernel it will be enabled by default.
+If you want to boot without the bookkeeping anyway you can provide
+'dma_debug=off' as a boot parameter. This will disable DMA-API debugging.
+Notice that you can not enable it again at runtime. You have to reboot to do
+so.
+
+If you want to see debug messages only for a special device driver you can
+specify the dma_debug_driver=<drivername> parameter. This will enable the
+driver filter at boot time. The debug code will only print errors for that
+driver afterwards. This filter can be disabled or changed later using debugfs.
+
+When the code disables itself at runtime this is most likely because it ran
+out of dma_debug_entries and was unable to allocate more on-demand. 65536
+entries are preallocated at boot - if this is too low for you boot with
+'dma_debug_entries=<your_desired_number>' to overwrite the default. Note
+that the code allocates entries in batches, so the exact number of
+preallocated entries may be greater than the actual number requested. The
+code will print to the kernel log each time it has dynamically allocated
+as many entries as were initially preallocated. This is to indicate that a
+larger preallocation size may be appropriate, or if it happens continually
+that a driver may be leaking mappings.
+
+::
+
+ void
+ debug_dma_mapping_error(struct device *dev, dma_addr_t dma_addr);
+
+dma-debug interface debug_dma_mapping_error() to debug drivers that fail
+to check DMA mapping errors on addresses returned by dma_map_single() and
+dma_map_page() interfaces. This interface clears a flag set by
+debug_dma_map_page() to indicate that dma_mapping_error() has been called by
+the driver. When driver does unmap, debug_dma_unmap() checks the flag and if
+this flag is still set, prints warning message that includes call trace that
+leads up to the unmap. This interface can be called from dma_mapping_error()
+routines to enable DMA mapping error check debugging.
diff --git a/Documentation/core-api/dma-attributes.rst b/Documentation/core-api/dma-attributes.rst
new file mode 100644
index 000000000000..1887d92e8e92
--- /dev/null
+++ b/Documentation/core-api/dma-attributes.rst
@@ -0,0 +1,132 @@
+==============
+DMA attributes
+==============
+
+This document describes the semantics of the DMA attributes that are
+defined in linux/dma-mapping.h.
+
+DMA_ATTR_WEAK_ORDERING
+----------------------
+
+DMA_ATTR_WEAK_ORDERING specifies that reads and writes to the mapping
+may be weakly ordered, that is that reads and writes may pass each other.
+
+Since it is optional for platforms to implement DMA_ATTR_WEAK_ORDERING,
+those that do not will simply ignore the attribute and exhibit default
+behavior.
+
+DMA_ATTR_WRITE_COMBINE
+----------------------
+
+DMA_ATTR_WRITE_COMBINE specifies that writes to the mapping may be
+buffered to improve performance.
+
+Since it is optional for platforms to implement DMA_ATTR_WRITE_COMBINE,
+those that do not will simply ignore the attribute and exhibit default
+behavior.
+
+DMA_ATTR_NO_KERNEL_MAPPING
+--------------------------
+
+DMA_ATTR_NO_KERNEL_MAPPING lets the platform to avoid creating a kernel
+virtual mapping for the allocated buffer. On some architectures creating
+such mapping is non-trivial task and consumes very limited resources
+(like kernel virtual address space or dma consistent address space).
+Buffers allocated with this attribute can be only passed to user space
+by calling dma_mmap_attrs(). By using this API, you are guaranteeing
+that you won't dereference the pointer returned by dma_alloc_attr(). You
+can treat it as a cookie that must be passed to dma_mmap_attrs() and
+dma_free_attrs(). Make sure that both of these also get this attribute
+set on each call.
+
+Since it is optional for platforms to implement
+DMA_ATTR_NO_KERNEL_MAPPING, those that do not will simply ignore the
+attribute and exhibit default behavior.
+
+DMA_ATTR_SKIP_CPU_SYNC
+----------------------
+
+By default dma_map_{single,page,sg} functions family transfer a given
+buffer from CPU domain to device domain. Some advanced use cases might
+require sharing a buffer between more than one device. This requires
+having a mapping created separately for each device and is usually
+performed by calling dma_map_{single,page,sg} function more than once
+for the given buffer with device pointer to each device taking part in
+the buffer sharing. The first call transfers a buffer from 'CPU' domain
+to 'device' domain, what synchronizes CPU caches for the given region
+(usually it means that the cache has been flushed or invalidated
+depending on the dma direction). However, next calls to
+dma_map_{single,page,sg}() for other devices will perform exactly the
+same synchronization operation on the CPU cache. CPU cache synchronization
+might be a time consuming operation, especially if the buffers are
+large, so it is highly recommended to avoid it if possible.
+DMA_ATTR_SKIP_CPU_SYNC allows platform code to skip synchronization of
+the CPU cache for the given buffer assuming that it has been already
+transferred to 'device' domain. This attribute can be also used for
+dma_unmap_{single,page,sg} functions family to force buffer to stay in
+device domain after releasing a mapping for it. Use this attribute with
+care!
+
+DMA_ATTR_FORCE_CONTIGUOUS
+-------------------------
+
+By default DMA-mapping subsystem is allowed to assemble the buffer
+allocated by dma_alloc_attrs() function from individual pages if it can
+be mapped as contiguous chunk into device dma address space. By
+specifying this attribute the allocated buffer is forced to be contiguous
+also in physical memory.
+
+DMA_ATTR_ALLOC_SINGLE_PAGES
+---------------------------
+
+This is a hint to the DMA-mapping subsystem that it's probably not worth
+the time to try to allocate memory to in a way that gives better TLB
+efficiency (AKA it's not worth trying to build the mapping out of larger
+pages). You might want to specify this if:
+
+- You know that the accesses to this memory won't thrash the TLB.
+ You might know that the accesses are likely to be sequential or
+ that they aren't sequential but it's unlikely you'll ping-pong
+ between many addresses that are likely to be in different physical
+ pages.
+- You know that the penalty of TLB misses while accessing the
+ memory will be small enough to be inconsequential. If you are
+ doing a heavy operation like decryption or decompression this
+ might be the case.
+- You know that the DMA mapping is fairly transitory. If you expect
+ the mapping to have a short lifetime then it may be worth it to
+ optimize allocation (avoid coming up with large pages) instead of
+ getting the slight performance win of larger pages.
+
+Setting this hint doesn't guarantee that you won't get huge pages, but it
+means that we won't try quite as hard to get them.
+
+.. note:: At the moment DMA_ATTR_ALLOC_SINGLE_PAGES is only implemented on ARM,
+ though ARM64 patches will likely be posted soon.
+
+DMA_ATTR_NO_WARN
+----------------
+
+This tells the DMA-mapping subsystem to suppress allocation failure reports
+(similarly to __GFP_NOWARN).
+
+On some architectures allocation failures are reported with error messages
+to the system logs. Although this can help to identify and debug problems,
+drivers which handle failures (eg, retry later) have no problems with them,
+and can actually flood the system logs with error messages that aren't any
+problem at all, depending on the implementation of the retry mechanism.
+
+So, this provides a way for drivers to avoid those error messages on calls
+where allocation failures are not a problem, and shouldn't bother the logs.
+
+.. note:: At the moment DMA_ATTR_NO_WARN is only implemented on PowerPC.
+
+DMA_ATTR_PRIVILEGED
+-------------------
+
+Some advanced peripherals such as remote processors and GPUs perform
+accesses to DMA buffers in both privileged "supervisor" and unprivileged
+"user" modes. This attribute is used to indicate to the DMA-mapping
+subsystem that the buffer is fully accessible at the elevated privilege
+level (and ideally inaccessible or at least read-only at the
+lesser-privileged levels).
diff --git a/Documentation/core-api/dma-isa-lpc.rst b/Documentation/core-api/dma-isa-lpc.rst
new file mode 100644
index 000000000000..17b193603f0a
--- /dev/null
+++ b/Documentation/core-api/dma-isa-lpc.rst
@@ -0,0 +1,152 @@
+============================
+DMA with ISA and LPC devices
+============================
+
+:Author: Pierre Ossman <drzeus@drzeus.cx>
+
+This document describes how to do DMA transfers using the old ISA DMA
+controller. Even though ISA is more or less dead today the LPC bus
+uses the same DMA system so it will be around for quite some time.
+
+Headers and dependencies
+------------------------
+
+To do ISA style DMA you need to include two headers::
+
+ #include <linux/dma-mapping.h>
+ #include <asm/dma.h>
+
+The first is the generic DMA API used to convert virtual addresses to
+bus addresses (see Documentation/core-api/dma-api.rst for details).
+
+The second contains the routines specific to ISA DMA transfers. Since
+this is not present on all platforms make sure you construct your
+Kconfig to be dependent on ISA_DMA_API (not ISA) so that nobody tries
+to build your driver on unsupported platforms.
+
+Buffer allocation
+-----------------
+
+The ISA DMA controller has some very strict requirements on which
+memory it can access so extra care must be taken when allocating
+buffers.
+
+(You usually need a special buffer for DMA transfers instead of
+transferring directly to and from your normal data structures.)
+
+The DMA-able address space is the lowest 16 MB of _physical_ memory.
+Also the transfer block may not cross page boundaries (which are 64
+or 128 KiB depending on which channel you use).
+
+In order to allocate a piece of memory that satisfies all these
+requirements you pass the flag GFP_DMA to kmalloc.
+
+Unfortunately the memory available for ISA DMA is scarce so unless you
+allocate the memory during boot-up it's a good idea to also pass
+__GFP_RETRY_MAYFAIL and __GFP_NOWARN to make the allocator try a bit harder.
+
+(This scarcity also means that you should allocate the buffer as
+early as possible and not release it until the driver is unloaded.)
+
+Address translation
+-------------------
+
+To translate the virtual address to a bus address, use the normal DMA
+API. Do _not_ use isa_virt_to_bus() even though it does the same
+thing. The reason for this is that the function isa_virt_to_bus()
+will require a Kconfig dependency to ISA, not just ISA_DMA_API which
+is really all you need. Remember that even though the DMA controller
+has its origins in ISA it is used elsewhere.
+
+Note: x86_64 had a broken DMA API when it came to ISA but has since
+been fixed. If your arch has problems then fix the DMA API instead of
+reverting to the ISA functions.
+
+Channels
+--------
+
+A normal ISA DMA controller has 8 channels. The lower four are for
+8-bit transfers and the upper four are for 16-bit transfers.
+
+(Actually the DMA controller is really two separate controllers where
+channel 4 is used to give DMA access for the second controller (0-3).
+This means that of the four 16-bits channels only three are usable.)
+
+You allocate these in a similar fashion as all basic resources:
+
+extern int request_dma(unsigned int dmanr, const char * device_id);
+extern void free_dma(unsigned int dmanr);
+
+The ability to use 16-bit or 8-bit transfers is _not_ up to you as a
+driver author but depends on what the hardware supports. Check your
+specs or test different channels.
+
+Transfer data
+-------------
+
+Now for the good stuff, the actual DMA transfer. :)
+
+Before you use any ISA DMA routines you need to claim the DMA lock
+using claim_dma_lock(). The reason is that some DMA operations are
+not atomic so only one driver may fiddle with the registers at a
+time.
+
+The first time you use the DMA controller you should call
+clear_dma_ff(). This clears an internal register in the DMA
+controller that is used for the non-atomic operations. As long as you
+(and everyone else) uses the locking functions then you only need to
+reset this once.
+
+Next, you tell the controller in which direction you intend to do the
+transfer using set_dma_mode(). Currently you have the options
+DMA_MODE_READ and DMA_MODE_WRITE.
+
+Set the address from where the transfer should start (this needs to
+be 16-bit aligned for 16-bit transfers) and how many bytes to
+transfer. Note that it's _bytes_. The DMA routines will do all the
+required translation to values that the DMA controller understands.
+
+The final step is enabling the DMA channel and releasing the DMA
+lock.
+
+Once the DMA transfer is finished (or timed out) you should disable
+the channel again. You should also check get_dma_residue() to make
+sure that all data has been transferred.
+
+Example::
+
+ int flags, residue;
+
+ flags = claim_dma_lock();
+
+ clear_dma_ff();
+
+ set_dma_mode(channel, DMA_MODE_WRITE);
+ set_dma_addr(channel, phys_addr);
+ set_dma_count(channel, num_bytes);
+
+ dma_enable(channel);
+
+ release_dma_lock(flags);
+
+ while (!device_done());
+
+ flags = claim_dma_lock();
+
+ dma_disable(channel);
+
+ residue = dma_get_residue(channel);
+ if (residue != 0)
+ printk(KERN_ERR "driver: Incomplete DMA transfer!"
+ " %d bytes left!\n", residue);
+
+ release_dma_lock(flags);
+
+Suspend/resume
+--------------
+
+It is the driver's responsibility to make sure that the machine isn't
+suspended while a DMA transfer is in progress. Also, all DMA settings
+are lost when the system suspends so if your driver relies on the DMA
+controller being in a certain state then you have to restore these
+registers upon resume.
diff --git a/Documentation/core-api/entry.rst b/Documentation/core-api/entry.rst
new file mode 100644
index 000000000000..e12f22ab33c7
--- /dev/null
+++ b/Documentation/core-api/entry.rst
@@ -0,0 +1,279 @@
+Entry/exit handling for exceptions, interrupts, syscalls and KVM
+================================================================
+
+All transitions between execution domains require state updates which are
+subject to strict ordering constraints. State updates are required for the
+following:
+
+ * Lockdep
+ * RCU / Context tracking
+ * Preemption counter
+ * Tracing
+ * Time accounting
+
+The update order depends on the transition type and is explained below in
+the transition type sections: `Syscalls`_, `KVM`_, `Interrupts and regular
+exceptions`_, `NMI and NMI-like exceptions`_.
+
+Non-instrumentable code - noinstr
+---------------------------------
+
+Most instrumentation facilities depend on RCU, so intrumentation is prohibited
+for entry code before RCU starts watching and exit code after RCU stops
+watching. In addition, many architectures must save and restore register state,
+which means that (for example) a breakpoint in the breakpoint entry code would
+overwrite the debug registers of the initial breakpoint.
+
+Such code must be marked with the 'noinstr' attribute, placing that code into a
+special section inaccessible to instrumentation and debug facilities. Some
+functions are partially instrumentable, which is handled by marking them
+noinstr and using instrumentation_begin() and instrumentation_end() to flag the
+instrumentable ranges of code:
+
+.. code-block:: c
+
+ noinstr void entry(void)
+ {
+ handle_entry(); // <-- must be 'noinstr' or '__always_inline'
+ ...
+
+ instrumentation_begin();
+ handle_context(); // <-- instrumentable code
+ instrumentation_end();
+
+ ...
+ handle_exit(); // <-- must be 'noinstr' or '__always_inline'
+ }
+
+This allows verification of the 'noinstr' restrictions via objtool on
+supported architectures.
+
+Invoking non-instrumentable functions from instrumentable context has no
+restrictions and is useful to protect e.g. state switching which would
+cause malfunction if instrumented.
+
+All non-instrumentable entry/exit code sections before and after the RCU
+state transitions must run with interrupts disabled.
+
+Syscalls
+--------
+
+Syscall-entry code starts in assembly code and calls out into low-level C code
+after establishing low-level architecture-specific state and stack frames. This
+low-level C code must not be instrumented. A typical syscall handling function
+invoked from low-level assembly code looks like this:
+
+.. code-block:: c
+
+ noinstr void syscall(struct pt_regs *regs, int nr)
+ {
+ arch_syscall_enter(regs);
+ nr = syscall_enter_from_user_mode(regs, nr);
+
+ instrumentation_begin();
+ if (!invoke_syscall(regs, nr) && nr != -1)
+ result_reg(regs) = __sys_ni_syscall(regs);
+ instrumentation_end();
+
+ syscall_exit_to_user_mode(regs);
+ }
+
+syscall_enter_from_user_mode() first invokes enter_from_user_mode() which
+establishes state in the following order:
+
+ * Lockdep
+ * RCU / Context tracking
+ * Tracing
+
+and then invokes the various entry work functions like ptrace, seccomp, audit,
+syscall tracing, etc. After all that is done, the instrumentable invoke_syscall
+function can be invoked. The instrumentable code section then ends, after which
+syscall_exit_to_user_mode() is invoked.
+
+syscall_exit_to_user_mode() handles all work which needs to be done before
+returning to user space like tracing, audit, signals, task work etc. After
+that it invokes exit_to_user_mode() which again handles the state
+transition in the reverse order:
+
+ * Tracing
+ * RCU / Context tracking
+ * Lockdep
+
+syscall_enter_from_user_mode() and syscall_exit_to_user_mode() are also
+available as fine grained subfunctions in cases where the architecture code
+has to do extra work between the various steps. In such cases it has to
+ensure that enter_from_user_mode() is called first on entry and
+exit_to_user_mode() is called last on exit.
+
+Do not nest syscalls. Nested systcalls will cause RCU and/or context tracking
+to print a warning.
+
+KVM
+---
+
+Entering or exiting guest mode is very similar to syscalls. From the host
+kernel point of view the CPU goes off into user space when entering the
+guest and returns to the kernel on exit.
+
+kvm_guest_enter_irqoff() is a KVM-specific variant of exit_to_user_mode()
+and kvm_guest_exit_irqoff() is the KVM variant of enter_from_user_mode().
+The state operations have the same ordering.
+
+Task work handling is done separately for guest at the boundary of the
+vcpu_run() loop via xfer_to_guest_mode_handle_work() which is a subset of
+the work handled on return to user space.
+
+Do not nest KVM entry/exit transitions because doing so is nonsensical.
+
+Interrupts and regular exceptions
+---------------------------------
+
+Interrupts entry and exit handling is slightly more complex than syscalls
+and KVM transitions.
+
+If an interrupt is raised while the CPU executes in user space, the entry
+and exit handling is exactly the same as for syscalls.
+
+If the interrupt is raised while the CPU executes in kernel space the entry and
+exit handling is slightly different. RCU state is only updated when the
+interrupt is raised in the context of the CPU's idle task. Otherwise, RCU will
+already be watching. Lockdep and tracing have to be updated unconditionally.
+
+irqentry_enter() and irqentry_exit() provide the implementation for this.
+
+The architecture-specific part looks similar to syscall handling:
+
+.. code-block:: c
+
+ noinstr void interrupt(struct pt_regs *regs, int nr)
+ {
+ arch_interrupt_enter(regs);
+ state = irqentry_enter(regs);
+
+ instrumentation_begin();
+
+ irq_enter_rcu();
+ invoke_irq_handler(regs, nr);
+ irq_exit_rcu();
+
+ instrumentation_end();
+
+ irqentry_exit(regs, state);
+ }
+
+Note that the invocation of the actual interrupt handler is within a
+irq_enter_rcu() and irq_exit_rcu() pair.
+
+irq_enter_rcu() updates the preemption count which makes in_hardirq()
+return true, handles NOHZ tick state and interrupt time accounting. This
+means that up to the point where irq_enter_rcu() is invoked in_hardirq()
+returns false.
+
+irq_exit_rcu() handles interrupt time accounting, undoes the preemption
+count update and eventually handles soft interrupts and NOHZ tick state.
+
+In theory, the preemption count could be updated in irqentry_enter(). In
+practice, deferring this update to irq_enter_rcu() allows the preemption-count
+code to be traced, while also maintaining symmetry with irq_exit_rcu() and
+irqentry_exit(), which are described in the next paragraph. The only downside
+is that the early entry code up to irq_enter_rcu() must be aware that the
+preemption count has not yet been updated with the HARDIRQ_OFFSET state.
+
+Note that irq_exit_rcu() must remove HARDIRQ_OFFSET from the preemption count
+before it handles soft interrupts, whose handlers must run in BH context rather
+than irq-disabled context. In addition, irqentry_exit() might schedule, which
+also requires that HARDIRQ_OFFSET has been removed from the preemption count.
+
+Even though interrupt handlers are expected to run with local interrupts
+disabled, interrupt nesting is common from an entry/exit perspective. For
+example, softirq handling happens within an irqentry_{enter,exit}() block with
+local interrupts enabled. Also, although uncommon, nothing prevents an
+interrupt handler from re-enabling interrupts.
+
+Interrupt entry/exit code doesn't strictly need to handle reentrancy, since it
+runs with local interrupts disabled. But NMIs can happen anytime, and a lot of
+the entry code is shared between the two.
+
+NMI and NMI-like exceptions
+---------------------------
+
+NMIs and NMI-like exceptions (machine checks, double faults, debug
+interrupts, etc.) can hit any context and must be extra careful with
+the state.
+
+State changes for debug exceptions and machine-check exceptions depend on
+whether these exceptions happened in user-space (breakpoints or watchpoints) or
+in kernel mode (code patching). From user-space, they are treated like
+interrupts, while from kernel mode they are treated like NMIs.
+
+NMIs and other NMI-like exceptions handle state transitions without
+distinguishing between user-mode and kernel-mode origin.
+
+The state update on entry is handled in irqentry_nmi_enter() which updates
+state in the following order:
+
+ * Preemption counter
+ * Lockdep
+ * RCU / Context tracking
+ * Tracing
+
+The exit counterpart irqentry_nmi_exit() does the reverse operation in the
+reverse order.
+
+Note that the update of the preemption counter has to be the first
+operation on enter and the last operation on exit. The reason is that both
+lockdep and RCU rely on in_nmi() returning true in this case. The
+preemption count modification in the NMI entry/exit case must not be
+traced.
+
+Architecture-specific code looks like this:
+
+.. code-block:: c
+
+ noinstr void nmi(struct pt_regs *regs)
+ {
+ arch_nmi_enter(regs);
+ state = irqentry_nmi_enter(regs);
+
+ instrumentation_begin();
+ nmi_handler(regs);
+ instrumentation_end();
+
+ irqentry_nmi_exit(regs);
+ }
+
+and for e.g. a debug exception it can look like this:
+
+.. code-block:: c
+
+ noinstr void debug(struct pt_regs *regs)
+ {
+ arch_nmi_enter(regs);
+
+ debug_regs = save_debug_regs();
+
+ if (user_mode(regs)) {
+ state = irqentry_enter(regs);
+
+ instrumentation_begin();
+ user_mode_debug_handler(regs, debug_regs);
+ instrumentation_end();
+
+ irqentry_exit(regs, state);
+ } else {
+ state = irqentry_nmi_enter(regs);
+
+ instrumentation_begin();
+ kernel_mode_debug_handler(regs, debug_regs);
+ instrumentation_end();
+
+ irqentry_nmi_exit(regs, state);
+ }
+ }
+
+There is no combined irqentry_nmi_if_kernel() function available as the
+above cannot be handled in an exception-agnostic way.
+
+NMIs can happen in any context. For example, an NMI-like exception triggered
+while handling an NMI. So NMI entry code has to be reentrant and state updates
+need to handle nesting.
diff --git a/Documentation/core-api/genericirq.rst b/Documentation/core-api/genericirq.rst
index 8f06d885c310..4a460639ab1c 100644
--- a/Documentation/core-api/genericirq.rst
+++ b/Documentation/core-api/genericirq.rst
@@ -264,7 +264,7 @@ The following control flow is implemented (simplified excerpt)::
desc->irq_data.chip->irq_unmask();
desc->status &= ~pending;
handle_irq_event(desc->action);
- } while (status & pending);
+ } while (desc->status & pending);
desc->status &= ~running;
@@ -419,6 +419,7 @@ functions which are exported.
.. kernel-doc:: kernel/irq/manage.c
.. kernel-doc:: kernel/irq/chip.c
+ :export:
Internal Functions Provided
===========================
@@ -431,6 +432,7 @@ functions.
.. kernel-doc:: kernel/irq/handle.c
.. kernel-doc:: kernel/irq/chip.c
+ :internal:
Credits
=======
diff --git a/Documentation/core-api/idr.rst b/Documentation/core-api/idr.rst
index a2738050c4f0..18d724867064 100644
--- a/Documentation/core-api/idr.rst
+++ b/Documentation/core-api/idr.rst
@@ -17,51 +17,54 @@ solution to the problem to avoid everybody inventing their own. The IDR
provides the ability to map an ID to a pointer, while the IDA provides
only ID allocation, and as a result is much more memory-efficient.
+The IDR interface is deprecated; please use the :doc:`XArray <xarray>`
+instead.
+
IDR usage
=========
-Start by initialising an IDR, either with :c:func:`DEFINE_IDR`
-for statically allocated IDRs or :c:func:`idr_init` for dynamically
+Start by initialising an IDR, either with DEFINE_IDR()
+for statically allocated IDRs or idr_init() for dynamically
allocated IDRs.
-You can call :c:func:`idr_alloc` to allocate an unused ID. Look up
-the pointer you associated with the ID by calling :c:func:`idr_find`
-and free the ID by calling :c:func:`idr_remove`.
+You can call idr_alloc() to allocate an unused ID. Look up
+the pointer you associated with the ID by calling idr_find()
+and free the ID by calling idr_remove().
If you need to change the pointer associated with an ID, you can call
-:c:func:`idr_replace`. One common reason to do this is to reserve an
+idr_replace(). One common reason to do this is to reserve an
ID by passing a ``NULL`` pointer to the allocation function; initialise the
object with the reserved ID and finally insert the initialised object
into the IDR.
Some users need to allocate IDs larger than ``INT_MAX``. So far all of
these users have been content with a ``UINT_MAX`` limit, and they use
-:c:func:`idr_alloc_u32`. If you need IDs that will not fit in a u32,
+idr_alloc_u32(). If you need IDs that will not fit in a u32,
we will work with you to address your needs.
If you need to allocate IDs sequentially, you can use
-:c:func:`idr_alloc_cyclic`. The IDR becomes less efficient when dealing
+idr_alloc_cyclic(). The IDR becomes less efficient when dealing
with larger IDs, so using this function comes at a slight cost.
To perform an action on all pointers used by the IDR, you can
-either use the callback-based :c:func:`idr_for_each` or the
-iterator-style :c:func:`idr_for_each_entry`. You may need to use
-:c:func:`idr_for_each_entry_continue` to continue an iteration. You can
-also use :c:func:`idr_get_next` if the iterator doesn't fit your needs.
+either use the callback-based idr_for_each() or the
+iterator-style idr_for_each_entry(). You may need to use
+idr_for_each_entry_continue() to continue an iteration. You can
+also use idr_get_next() if the iterator doesn't fit your needs.
-When you have finished using an IDR, you can call :c:func:`idr_destroy`
+When you have finished using an IDR, you can call idr_destroy()
to release the memory used by the IDR. This will not free the objects
pointed to from the IDR; if you want to do that, use one of the iterators
to do it.
-You can use :c:func:`idr_is_empty` to find out whether there are any
+You can use idr_is_empty() to find out whether there are any
IDs currently allocated.
If you need to take a lock while allocating a new ID from the IDR,
you may need to pass a restrictive set of GFP flags, which can lead
to the IDR being unable to allocate memory. To work around this,
-you can call :c:func:`idr_preload` before taking the lock, and then
-:c:func:`idr_preload_end` after the allocation.
+you can call idr_preload() before taking the lock, and then
+idr_preload_end() after the allocation.
.. kernel-doc:: include/linux/idr.h
:doc: idr sync
diff --git a/Documentation/core-api/index.rst b/Documentation/core-api/index.rst
index 0897ad12c119..7a3a08d81f11 100644
--- a/Documentation/core-api/index.rst
+++ b/Documentation/core-api/index.rst
@@ -18,8 +18,12 @@ it.
kernel-api
workqueue
+ watch_queue
+ printk-basics
printk-formats
+ printk-index
symbol-namespaces
+ asm-annotations
Data structures and low-level utilities
=======================================
@@ -30,29 +34,44 @@ Library functionality that is used throughout the kernel.
:maxdepth: 1
kobject
+ kref
assoc_array
xarray
+ maple_tree
idr
circular-buffers
+ rbtree
generic-radix-tree
packing
+ this_cpu_ops
timekeeping
errseq
+ wrappers/atomic_t
+ wrappers/atomic_bitops
+
+Low level entry and exit
+========================
+
+.. toctree::
+ :maxdepth: 1
+
+ entry
Concurrency primitives
======================
How Linux keeps everything from happening at the same time. See
-:doc:`/locking/index` for more related documentation.
+Documentation/locking/index.rst for more related documentation.
.. toctree::
:maxdepth: 1
- atomic_ops
refcount-vs-atomic
+ irq/index
local_ops
padata
../RCU/index
+ wrappers/memory-barriers.rst
Low-level hardware management
=============================
@@ -72,12 +91,17 @@ Memory management
=================
How to allocate and use memory in the kernel. Note that there is a lot
-more memory-management documentation in :doc:`/vm/index`.
+more memory-management documentation in Documentation/mm/index.rst.
.. toctree::
:maxdepth: 1
memory-allocation
+ unaligned-memory-access
+ dma-api
+ dma-api-howto
+ dma-attributes
+ dma-isa-lpc
mm-api
genalloc
pin_user_pages
@@ -92,6 +116,7 @@ Interfaces for kernel debugging
debug-objects
tracepoint
+ debugging-via-ohci1394
Everything else
===============
@@ -102,6 +127,7 @@ Documents that don't fit elsewhere or which have yet to be categorized.
:maxdepth: 1
librs
+ netlink
.. only:: subproject and html
diff --git a/Documentation/core-api/irq/concepts.rst b/Documentation/core-api/irq/concepts.rst
new file mode 100644
index 000000000000..4273806a606b
--- /dev/null
+++ b/Documentation/core-api/irq/concepts.rst
@@ -0,0 +1,24 @@
+===============
+What is an IRQ?
+===============
+
+An IRQ is an interrupt request from a device.
+Currently they can come in over a pin, or over a packet.
+Several devices may be connected to the same pin thus
+sharing an IRQ.
+
+An IRQ number is a kernel identifier used to talk about a hardware
+interrupt source. Typically this is an index into the global irq_desc
+array, but except for what linux/interrupt.h implements the details
+are architecture specific.
+
+An IRQ number is an enumeration of the possible interrupt sources on a
+machine. Typically what is enumerated is the number of input pins on
+all of the interrupt controller in the system. In the case of ISA
+what is enumerated are the 16 input pins on the two i8259 interrupt
+controllers.
+
+Architectures can assign additional meaning to the IRQ numbers, and
+are encouraged to in the case where there is any manual configuration
+of the hardware involved. The ISA IRQs are a classic example of
+assigning this kind of additional meaning.
diff --git a/Documentation/core-api/irq/index.rst b/Documentation/core-api/irq/index.rst
new file mode 100644
index 000000000000..0d65d11e5420
--- /dev/null
+++ b/Documentation/core-api/irq/index.rst
@@ -0,0 +1,11 @@
+====
+IRQs
+====
+
+.. toctree::
+ :maxdepth: 1
+
+ concepts
+ irq-affinity
+ irq-domain
+ irqflags-tracing
diff --git a/Documentation/core-api/irq/irq-affinity.rst b/Documentation/core-api/irq/irq-affinity.rst
new file mode 100644
index 000000000000..29da5000836a
--- /dev/null
+++ b/Documentation/core-api/irq/irq-affinity.rst
@@ -0,0 +1,70 @@
+================
+SMP IRQ affinity
+================
+
+ChangeLog:
+ - Started by Ingo Molnar <mingo@redhat.com>
+ - Update by Max Krasnyansky <maxk@qualcomm.com>
+
+
+/proc/irq/IRQ#/smp_affinity and /proc/irq/IRQ#/smp_affinity_list specify
+which target CPUs are permitted for a given IRQ source. It's a bitmask
+(smp_affinity) or cpu list (smp_affinity_list) of allowed CPUs. It's not
+allowed to turn off all CPUs, and if an IRQ controller does not support
+IRQ affinity then the value will not change from the default of all cpus.
+
+/proc/irq/default_smp_affinity specifies default affinity mask that applies
+to all non-active IRQs. Once IRQ is allocated/activated its affinity bitmask
+will be set to the default mask. It can then be changed as described above.
+Default mask is 0xffffffff.
+
+Here is an example of restricting IRQ44 (eth1) to CPU0-3 then restricting
+it to CPU4-7 (this is an 8-CPU SMP box)::
+
+ [root@moon 44]# cd /proc/irq/44
+ [root@moon 44]# cat smp_affinity
+ ffffffff
+
+ [root@moon 44]# echo 0f > smp_affinity
+ [root@moon 44]# cat smp_affinity
+ 0000000f
+ [root@moon 44]# ping -f h
+ PING hell (195.4.7.3): 56 data bytes
+ ...
+ --- hell ping statistics ---
+ 6029 packets transmitted, 6027 packets received, 0% packet loss
+ round-trip min/avg/max = 0.1/0.1/0.4 ms
+ [root@moon 44]# cat /proc/interrupts | grep 'CPU\|44:'
+ CPU0 CPU1 CPU2 CPU3 CPU4 CPU5 CPU6 CPU7
+ 44: 1068 1785 1785 1783 0 0 0 0 IO-APIC-level eth1
+
+As can be seen from the line above IRQ44 was delivered only to the first four
+processors (0-3).
+Now lets restrict that IRQ to CPU(4-7).
+
+::
+
+ [root@moon 44]# echo f0 > smp_affinity
+ [root@moon 44]# cat smp_affinity
+ 000000f0
+ [root@moon 44]# ping -f h
+ PING hell (195.4.7.3): 56 data bytes
+ ..
+ --- hell ping statistics ---
+ 2779 packets transmitted, 2777 packets received, 0% packet loss
+ round-trip min/avg/max = 0.1/0.5/585.4 ms
+ [root@moon 44]# cat /proc/interrupts | 'CPU\|44:'
+ CPU0 CPU1 CPU2 CPU3 CPU4 CPU5 CPU6 CPU7
+ 44: 1068 1785 1785 1783 1784 1069 1070 1069 IO-APIC-level eth1
+
+This time around IRQ44 was delivered only to the last four processors.
+i.e counters for the CPU0-3 did not change.
+
+Here is an example of limiting that same irq (44) to cpus 1024 to 1031::
+
+ [root@moon 44]# echo 1024-1031 > smp_affinity_list
+ [root@moon 44]# cat smp_affinity_list
+ 1024-1031
+
+Note that to do this with a bitmask would require 32 bitmasks of zero
+to follow the pertinent one.
diff --git a/Documentation/core-api/irq/irq-domain.rst b/Documentation/core-api/irq/irq-domain.rst
new file mode 100644
index 000000000000..f88a6ee67a35
--- /dev/null
+++ b/Documentation/core-api/irq/irq-domain.rst
@@ -0,0 +1,297 @@
+===============================================
+The irq_domain interrupt number mapping library
+===============================================
+
+The current design of the Linux kernel uses a single large number
+space where each separate IRQ source is assigned a different number.
+This is simple when there is only one interrupt controller, but in
+systems with multiple interrupt controllers the kernel must ensure
+that each one gets assigned non-overlapping allocations of Linux
+IRQ numbers.
+
+The number of interrupt controllers registered as unique irqchips
+show a rising tendency: for example subdrivers of different kinds
+such as GPIO controllers avoid reimplementing identical callback
+mechanisms as the IRQ core system by modelling their interrupt
+handlers as irqchips, i.e. in effect cascading interrupt controllers.
+
+Here the interrupt number loose all kind of correspondence to
+hardware interrupt numbers: whereas in the past, IRQ numbers could
+be chosen so they matched the hardware IRQ line into the root
+interrupt controller (i.e. the component actually fireing the
+interrupt line to the CPU) nowadays this number is just a number.
+
+For this reason we need a mechanism to separate controller-local
+interrupt numbers, called hardware irq's, from Linux IRQ numbers.
+
+The irq_alloc_desc*() and irq_free_desc*() APIs provide allocation of
+irq numbers, but they don't provide any support for reverse mapping of
+the controller-local IRQ (hwirq) number into the Linux IRQ number
+space.
+
+The irq_domain library adds mapping between hwirq and IRQ numbers on
+top of the irq_alloc_desc*() API. An irq_domain to manage mapping is
+preferred over interrupt controller drivers open coding their own
+reverse mapping scheme.
+
+irq_domain also implements translation from an abstract irq_fwspec
+structure to hwirq numbers (Device Tree and ACPI GSI so far), and can
+be easily extended to support other IRQ topology data sources.
+
+irq_domain usage
+================
+
+An interrupt controller driver creates and registers an irq_domain by
+calling one of the irq_domain_add_*() or irq_domain_create_*() functions
+(each mapping method has a different allocator function, more on that later).
+The function will return a pointer to the irq_domain on success. The caller
+must provide the allocator function with an irq_domain_ops structure.
+
+In most cases, the irq_domain will begin empty without any mappings
+between hwirq and IRQ numbers. Mappings are added to the irq_domain
+by calling irq_create_mapping() which accepts the irq_domain and a
+hwirq number as arguments. If a mapping for the hwirq doesn't already
+exist then it will allocate a new Linux irq_desc, associate it with
+the hwirq, and call the .map() callback so the driver can perform any
+required hardware setup.
+
+Once a mapping has been established, it can be retrieved or used via a
+variety of methods:
+
+- irq_resolve_mapping() returns a pointer to the irq_desc structure
+ for a given domain and hwirq number, and NULL if there was no
+ mapping.
+- irq_find_mapping() returns a Linux IRQ number for a given domain and
+ hwirq number, and 0 if there was no mapping
+- irq_linear_revmap() is now identical to irq_find_mapping(), and is
+ deprecated
+- generic_handle_domain_irq() handles an interrupt described by a
+ domain and a hwirq number
+
+Note that irq domain lookups must happen in contexts that are
+compatible with a RCU read-side critical section.
+
+The irq_create_mapping() function must be called *at least once*
+before any call to irq_find_mapping(), lest the descriptor will not
+be allocated.
+
+If the driver has the Linux IRQ number or the irq_data pointer, and
+needs to know the associated hwirq number (such as in the irq_chip
+callbacks) then it can be directly obtained from irq_data->hwirq.
+
+Types of irq_domain mappings
+============================
+
+There are several mechanisms available for reverse mapping from hwirq
+to Linux irq, and each mechanism uses a different allocation function.
+Which reverse map type should be used depends on the use case. Each
+of the reverse map types are described below:
+
+Linear
+------
+
+::
+
+ irq_domain_add_linear()
+ irq_domain_create_linear()
+
+The linear reverse map maintains a fixed size table indexed by the
+hwirq number. When a hwirq is mapped, an irq_desc is allocated for
+the hwirq, and the IRQ number is stored in the table.
+
+The Linear map is a good choice when the maximum number of hwirqs is
+fixed and a relatively small number (~ < 256). The advantages of this
+map are fixed time lookup for IRQ numbers, and irq_descs are only
+allocated for in-use IRQs. The disadvantage is that the table must be
+as large as the largest possible hwirq number.
+
+irq_domain_add_linear() and irq_domain_create_linear() are functionally
+equivalent, except for the first argument is different - the former
+accepts an Open Firmware specific 'struct device_node', while the latter
+accepts a more general abstraction 'struct fwnode_handle'.
+
+The majority of drivers should use the linear map.
+
+Tree
+----
+
+::
+
+ irq_domain_add_tree()
+ irq_domain_create_tree()
+
+The irq_domain maintains a radix tree map from hwirq numbers to Linux
+IRQs. When an hwirq is mapped, an irq_desc is allocated and the
+hwirq is used as the lookup key for the radix tree.
+
+The tree map is a good choice if the hwirq number can be very large
+since it doesn't need to allocate a table as large as the largest
+hwirq number. The disadvantage is that hwirq to IRQ number lookup is
+dependent on how many entries are in the table.
+
+irq_domain_add_tree() and irq_domain_create_tree() are functionally
+equivalent, except for the first argument is different - the former
+accepts an Open Firmware specific 'struct device_node', while the latter
+accepts a more general abstraction 'struct fwnode_handle'.
+
+Very few drivers should need this mapping.
+
+No Map
+------
+
+::
+
+ irq_domain_add_nomap()
+
+The No Map mapping is to be used when the hwirq number is
+programmable in the hardware. In this case it is best to program the
+Linux IRQ number into the hardware itself so that no mapping is
+required. Calling irq_create_direct_mapping() will allocate a Linux
+IRQ number and call the .map() callback so that driver can program the
+Linux IRQ number into the hardware.
+
+Most drivers cannot use this mapping, and it is now gated on the
+CONFIG_IRQ_DOMAIN_NOMAP option. Please refrain from introducing new
+users of this API.
+
+Legacy
+------
+
+::
+
+ irq_domain_add_simple()
+ irq_domain_add_legacy()
+ irq_domain_create_simple()
+ irq_domain_create_legacy()
+
+The Legacy mapping is a special case for drivers that already have a
+range of irq_descs allocated for the hwirqs. It is used when the
+driver cannot be immediately converted to use the linear mapping. For
+example, many embedded system board support files use a set of #defines
+for IRQ numbers that are passed to struct device registrations. In that
+case the Linux IRQ numbers cannot be dynamically assigned and the legacy
+mapping should be used.
+
+As the name implies, the \*_legacy() functions are deprecated and only
+exist to ease the support of ancient platforms. No new users should be
+added. Same goes for the \*_simple() functions when their use results
+in the legacy behaviour.
+
+The legacy map assumes a contiguous range of IRQ numbers has already
+been allocated for the controller and that the IRQ number can be
+calculated by adding a fixed offset to the hwirq number, and
+visa-versa. The disadvantage is that it requires the interrupt
+controller to manage IRQ allocations and it requires an irq_desc to be
+allocated for every hwirq, even if it is unused.
+
+The legacy map should only be used if fixed IRQ mappings must be
+supported. For example, ISA controllers would use the legacy map for
+mapping Linux IRQs 0-15 so that existing ISA drivers get the correct IRQ
+numbers.
+
+Most users of legacy mappings should use irq_domain_add_simple() or
+irq_domain_create_simple() which will use a legacy domain only if an IRQ range
+is supplied by the system and will otherwise use a linear domain mapping.
+The semantics of this call are such that if an IRQ range is specified then
+descriptors will be allocated on-the-fly for it, and if no range is
+specified it will fall through to irq_domain_add_linear() or
+irq_domain_create_linear() which means *no* irq descriptors will be allocated.
+
+A typical use case for simple domains is where an irqchip provider
+is supporting both dynamic and static IRQ assignments.
+
+In order to avoid ending up in a situation where a linear domain is
+used and no descriptor gets allocated it is very important to make sure
+that the driver using the simple domain call irq_create_mapping()
+before any irq_find_mapping() since the latter will actually work
+for the static IRQ assignment case.
+
+irq_domain_add_simple() and irq_domain_create_simple() as well as
+irq_domain_add_legacy() and irq_domain_create_legacy() are functionally
+equivalent, except for the first argument is different - the former
+accepts an Open Firmware specific 'struct device_node', while the latter
+accepts a more general abstraction 'struct fwnode_handle'.
+
+Hierarchy IRQ domain
+--------------------
+
+On some architectures, there may be multiple interrupt controllers
+involved in delivering an interrupt from the device to the target CPU.
+Let's look at a typical interrupt delivering path on x86 platforms::
+
+ Device --> IOAPIC -> Interrupt remapping Controller -> Local APIC -> CPU
+
+There are three interrupt controllers involved:
+
+1) IOAPIC controller
+2) Interrupt remapping controller
+3) Local APIC controller
+
+To support such a hardware topology and make software architecture match
+hardware architecture, an irq_domain data structure is built for each
+interrupt controller and those irq_domains are organized into hierarchy.
+When building irq_domain hierarchy, the irq_domain near to the device is
+child and the irq_domain near to CPU is parent. So a hierarchy structure
+as below will be built for the example above::
+
+ CPU Vector irq_domain (root irq_domain to manage CPU vectors)
+ ^
+ |
+ Interrupt Remapping irq_domain (manage irq_remapping entries)
+ ^
+ |
+ IOAPIC irq_domain (manage IOAPIC delivery entries/pins)
+
+There are four major interfaces to use hierarchy irq_domain:
+
+1) irq_domain_alloc_irqs(): allocate IRQ descriptors and interrupt
+ controller related resources to deliver these interrupts.
+2) irq_domain_free_irqs(): free IRQ descriptors and interrupt controller
+ related resources associated with these interrupts.
+3) irq_domain_activate_irq(): activate interrupt controller hardware to
+ deliver the interrupt.
+4) irq_domain_deactivate_irq(): deactivate interrupt controller hardware
+ to stop delivering the interrupt.
+
+Following changes are needed to support hierarchy irq_domain:
+
+1) a new field 'parent' is added to struct irq_domain; it's used to
+ maintain irq_domain hierarchy information.
+2) a new field 'parent_data' is added to struct irq_data; it's used to
+ build hierarchy irq_data to match hierarchy irq_domains. The irq_data
+ is used to store irq_domain pointer and hardware irq number.
+3) new callbacks are added to struct irq_domain_ops to support hierarchy
+ irq_domain operations.
+
+With support of hierarchy irq_domain and hierarchy irq_data ready, an
+irq_domain structure is built for each interrupt controller, and an
+irq_data structure is allocated for each irq_domain associated with an
+IRQ. Now we could go one step further to support stacked(hierarchy)
+irq_chip. That is, an irq_chip is associated with each irq_data along
+the hierarchy. A child irq_chip may implement a required action by
+itself or by cooperating with its parent irq_chip.
+
+With stacked irq_chip, interrupt controller driver only needs to deal
+with the hardware managed by itself and may ask for services from its
+parent irq_chip when needed. So we could achieve a much cleaner
+software architecture.
+
+For an interrupt controller driver to support hierarchy irq_domain, it
+needs to:
+
+1) Implement irq_domain_ops.alloc and irq_domain_ops.free
+2) Optionally implement irq_domain_ops.activate and
+ irq_domain_ops.deactivate.
+3) Optionally implement an irq_chip to manage the interrupt controller
+ hardware.
+4) No need to implement irq_domain_ops.map and irq_domain_ops.unmap,
+ they are unused with hierarchy irq_domain.
+
+Hierarchy irq_domain is in no way x86 specific, and is heavily used to
+support other architectures, such as ARM, ARM64 etc.
+
+Debugging
+=========
+
+Most of the internals of the IRQ subsystem are exposed in debugfs by
+turning CONFIG_GENERIC_IRQ_DEBUGFS on.
diff --git a/Documentation/core-api/irq/irqflags-tracing.rst b/Documentation/core-api/irq/irqflags-tracing.rst
new file mode 100644
index 000000000000..bdd208259fb3
--- /dev/null
+++ b/Documentation/core-api/irq/irqflags-tracing.rst
@@ -0,0 +1,52 @@
+=======================
+IRQ-flags state tracing
+=======================
+
+:Author: started by Ingo Molnar <mingo@redhat.com>
+
+The "irq-flags tracing" feature "traces" hardirq and softirq state, in
+that it gives interested subsystems an opportunity to be notified of
+every hardirqs-off/hardirqs-on, softirqs-off/softirqs-on event that
+happens in the kernel.
+
+CONFIG_TRACE_IRQFLAGS_SUPPORT is needed for CONFIG_PROVE_SPIN_LOCKING
+and CONFIG_PROVE_RW_LOCKING to be offered by the generic lock debugging
+code. Otherwise only CONFIG_PROVE_MUTEX_LOCKING and
+CONFIG_PROVE_RWSEM_LOCKING will be offered on an architecture - these
+are locking APIs that are not used in IRQ context. (the one exception
+for rwsems is worked around)
+
+Architecture support for this is certainly not in the "trivial"
+category, because lots of lowlevel assembly code deal with irq-flags
+state changes. But an architecture can be irq-flags-tracing enabled in a
+rather straightforward and risk-free manner.
+
+Architectures that want to support this need to do a couple of
+code-organizational changes first:
+
+- add and enable TRACE_IRQFLAGS_SUPPORT in their arch level Kconfig file
+
+and then a couple of functional changes are needed as well to implement
+irq-flags-tracing support:
+
+- in lowlevel entry code add (build-conditional) calls to the
+ trace_hardirqs_off()/trace_hardirqs_on() functions. The lock validator
+ closely guards whether the 'real' irq-flags matches the 'virtual'
+ irq-flags state, and complains loudly (and turns itself off) if the
+ two do not match. Usually most of the time for arch support for
+ irq-flags-tracing is spent in this state: look at the lockdep
+ complaint, try to figure out the assembly code we did not cover yet,
+ fix and repeat. Once the system has booted up and works without a
+ lockdep complaint in the irq-flags-tracing functions arch support is
+ complete.
+- if the architecture has non-maskable interrupts then those need to be
+ excluded from the irq-tracing [and lock validation] mechanism via
+ lockdep_off()/lockdep_on().
+
+In general there is no risk from having an incomplete irq-flags-tracing
+implementation in an architecture: lockdep will detect that and will
+turn itself off. I.e. the lock validator will still be reliable. There
+should be no crashes due to irq-tracing bugs. (except if the assembly
+changes break other code by modifying conditions or registers that
+shouldn't be)
+
diff --git a/Documentation/core-api/kernel-api.rst b/Documentation/core-api/kernel-api.rst
index 4ac53a1363f6..ae92a2571388 100644
--- a/Documentation/core-api/kernel-api.rst
+++ b/Documentation/core-api/kernel-api.rst
@@ -24,11 +24,8 @@ String Conversions
.. kernel-doc:: lib/vsprintf.c
:export:
-.. kernel-doc:: include/linux/kernel.h
- :functions: kstrtol
-
-.. kernel-doc:: include/linux/kernel.h
- :functions: kstrtoul
+.. kernel-doc:: include/linux/kstrtox.h
+ :functions: kstrtol kstrtoul
.. kernel-doc:: lib/kstrtox.c
:export:
@@ -39,6 +36,9 @@ String Conversions
String Manipulation
-------------------
+.. kernel-doc:: include/linux/fortify-string.h
+ :internal:
+
.. kernel-doc:: lib/string.c
:export:
@@ -96,6 +96,12 @@ Command-line Parsing
.. kernel-doc:: lib/cmdline.c
:export:
+Error Pointers
+--------------
+
+.. kernel-doc:: include/linux/err.h
+ :internal:
+
Sorting
-------
@@ -121,6 +127,12 @@ Text Searching
CRC and Math Functions in Linux
===============================
+Arithmetic Overflow Checking
+----------------------------
+
+.. kernel-doc:: include/linux/overflow.h
+ :internal:
+
CRC Functions
-------------
@@ -150,8 +162,10 @@ Base 2 log and power Functions
.. kernel-doc:: include/linux/log2.h
:internal:
-Integer power Functions
------------------------
+Integer log and power Functions
+-------------------------------
+
+.. kernel-doc:: include/linux/int_log.h
.. kernel-doc:: lib/math/int_pow.c
:export:
@@ -168,9 +182,6 @@ Division Functions
.. kernel-doc:: include/linux/math64.h
:internal:
-.. kernel-doc:: lib/math/div64.c
- :functions: div_s64_rem div64_u64_rem div64_u64 div64_s64
-
.. kernel-doc:: lib/math/gcd.c
:export:
@@ -217,26 +228,38 @@ relay interface
Module Support
==============
-Module Loading
---------------
+Kernel module auto-loading
+--------------------------
-.. kernel-doc:: kernel/kmod.c
+.. kernel-doc:: kernel/module/kmod.c
:export:
+Module debugging
+----------------
+
+.. kernel-doc:: kernel/module/stats.c
+ :doc: module debugging statistics overview
+
+dup_failed_modules - tracks duplicate failed modules
+****************************************************
+
+.. kernel-doc:: kernel/module/stats.c
+ :doc: dup_failed_modules - tracks duplicate failed modules
+
+module statistics debugfs counters
+**********************************
+
+.. kernel-doc:: kernel/module/stats.c
+ :doc: module statistics debugfs counters
+
Inter Module support
--------------------
-Refer to the file kernel/module.c for more information.
+Refer to the files in kernel/module/ for more information.
Hardware Interfaces
===================
-Interrupt Handling
-------------------
-
-.. kernel-doc:: kernel/irq/manage.c
- :export:
-
DMA Channels
------------
@@ -288,6 +311,7 @@ Accounting Framework
Block Devices
=============
+.. kernel-doc:: include/linux/bio.h
.. kernel-doc:: block/blk-core.c
:export:
@@ -303,9 +327,6 @@ Block Devices
.. kernel-doc:: block/blk-settings.c
:export:
-.. kernel-doc:: block/blk-exec.c
- :export:
-
.. kernel-doc:: block/blk-flush.c
:export:
@@ -324,6 +345,9 @@ Block Devices
.. kernel-doc:: block/genhd.c
:export:
+.. kernel-doc:: block/bdev.c
+ :export:
+
Char devices
============
@@ -396,3 +420,15 @@ Read-Copy Update (RCU)
.. kernel-doc:: include/linux/rcu_sync.h
.. kernel-doc:: kernel/rcu/sync.c
+
+.. kernel-doc:: kernel/rcu/tasks.h
+
+.. kernel-doc:: kernel/rcu/tree_stall.h
+
+.. kernel-doc:: include/linux/rcupdate_trace.h
+
+.. kernel-doc:: include/linux/rcupdate_wait.h
+
+.. kernel-doc:: include/linux/rcuref.h
+
+.. kernel-doc:: include/linux/rcutree.h
diff --git a/Documentation/core-api/kobject.rst b/Documentation/core-api/kobject.rst
index 1f62d4d7d966..7310247310a0 100644
--- a/Documentation/core-api/kobject.rst
+++ b/Documentation/core-api/kobject.rst
@@ -6,7 +6,7 @@ Everything you never wanted to know about kobjects, ksets, and ktypes
:Last updated: December 19, 2007
Based on an original article by Jon Corbet for lwn.net written October 1,
-2003 and located at http://lwn.net/Articles/51437/
+2003 and located at https://lwn.net/Articles/51437/
Part of the difficulty in understanding the driver model - and the kobject
abstraction upon which it is built - is that there is no obvious starting
@@ -80,11 +80,11 @@ what is the pointer to the containing structure? You must avoid tricks
(such as assuming that the kobject is at the beginning of the structure)
and, instead, use the container_of() macro, found in ``<linux/kernel.h>``::
- container_of(pointer, type, member)
+ container_of(ptr, type, member)
where:
- * ``pointer`` is the pointer to the embedded kobject,
+ * ``ptr`` is the pointer to the embedded kobject,
* ``type`` is the type of the containing structure, and
* ``member`` is the name of the structure field to which ``pointer`` points.
@@ -118,7 +118,7 @@ Initialization of kobjects
Code which creates a kobject must, of course, initialize that object. Some
of the internal fields are setup with a (mandatory) call to kobject_init()::
- void kobject_init(struct kobject *kobj, struct kobj_type *ktype);
+ void kobject_init(struct kobject *kobj, const struct kobj_type *ktype);
The ktype is required for a kobject to be created properly, as every kobject
must have an associated kobj_type. After calling kobject_init(), to
@@ -140,7 +140,7 @@ the name of the kobject, call kobject_rename()::
int kobject_rename(struct kobject *kobj, const char *new_name);
-kobject_rename does not perform any locking or have a solid notion of
+kobject_rename() does not perform any locking or have a solid notion of
what names are valid so the caller must provide their own sanity checking
and serialization.
@@ -156,7 +156,7 @@ kobject_name()::
There is a helper function to both initialize and add the kobject to the
kernel at the same time, called surprisingly enough kobject_init_and_add()::
- int kobject_init_and_add(struct kobject *kobj, struct kobj_type *ktype,
+ int kobject_init_and_add(struct kobject *kobj, const struct kobj_type *ktype,
struct kobject *parent, const char *fmt, ...);
The arguments are the same as the individual kobject_init() and
@@ -210,7 +210,7 @@ statically and will warn the developer of this improper usage.
If all that you want to use a kobject for is to provide a reference counter
for your structure, please use the struct kref instead; a kobject would be
overkill. For more information on how to use struct kref, please see the
-file Documentation/kref.txt in the Linux kernel source tree.
+file Documentation/core-api/kref.rst in the Linux kernel source tree.
Creating "simple" kobjects
@@ -222,17 +222,17 @@ ksets, show and store functions, and other details. This is the one
exception where a single kobject should be created. To create such an
entry, use the function::
- struct kobject *kobject_create_and_add(char *name, struct kobject *parent);
+ struct kobject *kobject_create_and_add(const char *name, struct kobject *parent);
This function will create a kobject and place it in sysfs in the location
underneath the specified parent kobject. To create simple attributes
associated with this kobject, use::
- int sysfs_create_file(struct kobject *kobj, struct attribute *attr);
+ int sysfs_create_file(struct kobject *kobj, const struct attribute *attr);
or::
- int sysfs_create_group(struct kobject *kobj, struct attribute_group *grp);
+ int sysfs_create_group(struct kobject *kobj, const struct attribute_group *grp);
Both types of attributes used here, with a kobject that has been created
with the kobject_create_and_add(), can be of type kobj_attribute, so no
@@ -299,9 +299,10 @@ kobj_type::
struct kobj_type {
void (*release)(struct kobject *kobj);
const struct sysfs_ops *sysfs_ops;
- struct attribute **default_attrs;
+ const struct attribute_group **default_groups;
const struct kobj_ns_type_operations *(*child_ns_type)(struct kobject *kobj);
const void *(*namespace)(struct kobject *kobj);
+ void (*get_ownership)(struct kobject *kobj, kuid_t *uid, kgid_t *gid);
};
This structure is used to describe a particular type of kobject (or, more
@@ -311,10 +312,10 @@ call kobject_init() or kobject_init_and_add().
The release field in struct kobj_type is, of course, a pointer to the
release() method for this type of kobject. The other two fields (sysfs_ops
-and default_attrs) control how objects of this type are represented in
+and default_groups) control how objects of this type are represented in
sysfs; they are beyond the scope of this document.
-The default_attrs pointer is a list of default attributes that will be
+The default_groups pointer is a list of default attributes that will be
automatically created for any kobject that is registered with this ktype.
@@ -352,12 +353,12 @@ created and never declared statically or on the stack. To create a new
kset use::
struct kset *kset_create_and_add(const char *name,
- struct kset_uevent_ops *u,
- struct kobject *parent);
+ const struct kset_uevent_ops *uevent_ops,
+ struct kobject *parent_kobj);
When you are finished with the kset, call::
- void kset_unregister(struct kset *kset);
+ void kset_unregister(struct kset *k);
to destroy it. This removes the kset from sysfs and decrements its reference
count. When the reference count goes to zero, the kset will be released.
@@ -371,10 +372,9 @@ If a kset wishes to control the uevent operations of the kobjects
associated with it, it can use the struct kset_uevent_ops to handle it::
struct kset_uevent_ops {
- int (*filter)(struct kset *kset, struct kobject *kobj);
- const char *(*name)(struct kset *kset, struct kobject *kobj);
- int (*uevent)(struct kset *kset, struct kobject *kobj,
- struct kobj_uevent_env *env);
+ int (* const filter)(struct kobject *kobj);
+ const char *(* const name)(struct kobject *kobj);
+ int (* const uevent)(struct kobject *kobj, struct kobj_uevent_env *env);
};
diff --git a/Documentation/core-api/kref.rst b/Documentation/core-api/kref.rst
new file mode 100644
index 000000000000..c61eea6f1bf2
--- /dev/null
+++ b/Documentation/core-api/kref.rst
@@ -0,0 +1,323 @@
+===================================================
+Adding reference counters (krefs) to kernel objects
+===================================================
+
+:Author: Corey Minyard <minyard@acm.org>
+:Author: Thomas Hellstrom <thellstrom@vmware.com>
+
+A lot of this was lifted from Greg Kroah-Hartman's 2004 OLS paper and
+presentation on krefs, which can be found at:
+
+ - http://www.kroah.com/linux/talks/ols_2004_kref_paper/Reprint-Kroah-Hartman-OLS2004.pdf
+ - http://www.kroah.com/linux/talks/ols_2004_kref_talk/
+
+Introduction
+============
+
+krefs allow you to add reference counters to your objects. If you
+have objects that are used in multiple places and passed around, and
+you don't have refcounts, your code is almost certainly broken. If
+you want refcounts, krefs are the way to go.
+
+To use a kref, add one to your data structures like::
+
+ struct my_data
+ {
+ .
+ .
+ struct kref refcount;
+ .
+ .
+ };
+
+The kref can occur anywhere within the data structure.
+
+Initialization
+==============
+
+You must initialize the kref after you allocate it. To do this, call
+kref_init as so::
+
+ struct my_data *data;
+
+ data = kmalloc(sizeof(*data), GFP_KERNEL);
+ if (!data)
+ return -ENOMEM;
+ kref_init(&data->refcount);
+
+This sets the refcount in the kref to 1.
+
+Kref rules
+==========
+
+Once you have an initialized kref, you must follow the following
+rules:
+
+1) If you make a non-temporary copy of a pointer, especially if
+ it can be passed to another thread of execution, you must
+ increment the refcount with kref_get() before passing it off::
+
+ kref_get(&data->refcount);
+
+ If you already have a valid pointer to a kref-ed structure (the
+ refcount cannot go to zero) you may do this without a lock.
+
+2) When you are done with a pointer, you must call kref_put()::
+
+ kref_put(&data->refcount, data_release);
+
+ If this is the last reference to the pointer, the release
+ routine will be called. If the code never tries to get
+ a valid pointer to a kref-ed structure without already
+ holding a valid pointer, it is safe to do this without
+ a lock.
+
+3) If the code attempts to gain a reference to a kref-ed structure
+ without already holding a valid pointer, it must serialize access
+ where a kref_put() cannot occur during the kref_get(), and the
+ structure must remain valid during the kref_get().
+
+For example, if you allocate some data and then pass it to another
+thread to process::
+
+ void data_release(struct kref *ref)
+ {
+ struct my_data *data = container_of(ref, struct my_data, refcount);
+ kfree(data);
+ }
+
+ void more_data_handling(void *cb_data)
+ {
+ struct my_data *data = cb_data;
+ .
+ . do stuff with data here
+ .
+ kref_put(&data->refcount, data_release);
+ }
+
+ int my_data_handler(void)
+ {
+ int rv = 0;
+ struct my_data *data;
+ struct task_struct *task;
+ data = kmalloc(sizeof(*data), GFP_KERNEL);
+ if (!data)
+ return -ENOMEM;
+ kref_init(&data->refcount);
+
+ kref_get(&data->refcount);
+ task = kthread_run(more_data_handling, data, "more_data_handling");
+ if (task == ERR_PTR(-ENOMEM)) {
+ rv = -ENOMEM;
+ kref_put(&data->refcount, data_release);
+ goto out;
+ }
+
+ .
+ . do stuff with data here
+ .
+ out:
+ kref_put(&data->refcount, data_release);
+ return rv;
+ }
+
+This way, it doesn't matter what order the two threads handle the
+data, the kref_put() handles knowing when the data is not referenced
+any more and releasing it. The kref_get() does not require a lock,
+since we already have a valid pointer that we own a refcount for. The
+put needs no lock because nothing tries to get the data without
+already holding a pointer.
+
+In the above example, kref_put() will be called 2 times in both success
+and error paths. This is necessary because the reference count got
+incremented 2 times by kref_init() and kref_get().
+
+Note that the "before" in rule 1 is very important. You should never
+do something like::
+
+ task = kthread_run(more_data_handling, data, "more_data_handling");
+ if (task == ERR_PTR(-ENOMEM)) {
+ rv = -ENOMEM;
+ goto out;
+ } else
+ /* BAD BAD BAD - get is after the handoff */
+ kref_get(&data->refcount);
+
+Don't assume you know what you are doing and use the above construct.
+First of all, you may not know what you are doing. Second, you may
+know what you are doing (there are some situations where locking is
+involved where the above may be legal) but someone else who doesn't
+know what they are doing may change the code or copy the code. It's
+bad style. Don't do it.
+
+There are some situations where you can optimize the gets and puts.
+For instance, if you are done with an object and enqueuing it for
+something else or passing it off to something else, there is no reason
+to do a get then a put::
+
+ /* Silly extra get and put */
+ kref_get(&obj->ref);
+ enqueue(obj);
+ kref_put(&obj->ref, obj_cleanup);
+
+Just do the enqueue. A comment about this is always welcome::
+
+ enqueue(obj);
+ /* We are done with obj, so we pass our refcount off
+ to the queue. DON'T TOUCH obj AFTER HERE! */
+
+The last rule (rule 3) is the nastiest one to handle. Say, for
+instance, you have a list of items that are each kref-ed, and you wish
+to get the first one. You can't just pull the first item off the list
+and kref_get() it. That violates rule 3 because you are not already
+holding a valid pointer. You must add a mutex (or some other lock).
+For instance::
+
+ static DEFINE_MUTEX(mutex);
+ static LIST_HEAD(q);
+ struct my_data
+ {
+ struct kref refcount;
+ struct list_head link;
+ };
+
+ static struct my_data *get_entry()
+ {
+ struct my_data *entry = NULL;
+ mutex_lock(&mutex);
+ if (!list_empty(&q)) {
+ entry = container_of(q.next, struct my_data, link);
+ kref_get(&entry->refcount);
+ }
+ mutex_unlock(&mutex);
+ return entry;
+ }
+
+ static void release_entry(struct kref *ref)
+ {
+ struct my_data *entry = container_of(ref, struct my_data, refcount);
+
+ list_del(&entry->link);
+ kfree(entry);
+ }
+
+ static void put_entry(struct my_data *entry)
+ {
+ mutex_lock(&mutex);
+ kref_put(&entry->refcount, release_entry);
+ mutex_unlock(&mutex);
+ }
+
+The kref_put() return value is useful if you do not want to hold the
+lock during the whole release operation. Say you didn't want to call
+kfree() with the lock held in the example above (since it is kind of
+pointless to do so). You could use kref_put() as follows::
+
+ static void release_entry(struct kref *ref)
+ {
+ /* All work is done after the return from kref_put(). */
+ }
+
+ static void put_entry(struct my_data *entry)
+ {
+ mutex_lock(&mutex);
+ if (kref_put(&entry->refcount, release_entry)) {
+ list_del(&entry->link);
+ mutex_unlock(&mutex);
+ kfree(entry);
+ } else
+ mutex_unlock(&mutex);
+ }
+
+This is really more useful if you have to call other routines as part
+of the free operations that could take a long time or might claim the
+same lock. Note that doing everything in the release routine is still
+preferred as it is a little neater.
+
+The above example could also be optimized using kref_get_unless_zero() in
+the following way::
+
+ static struct my_data *get_entry()
+ {
+ struct my_data *entry = NULL;
+ mutex_lock(&mutex);
+ if (!list_empty(&q)) {
+ entry = container_of(q.next, struct my_data, link);
+ if (!kref_get_unless_zero(&entry->refcount))
+ entry = NULL;
+ }
+ mutex_unlock(&mutex);
+ return entry;
+ }
+
+ static void release_entry(struct kref *ref)
+ {
+ struct my_data *entry = container_of(ref, struct my_data, refcount);
+
+ mutex_lock(&mutex);
+ list_del(&entry->link);
+ mutex_unlock(&mutex);
+ kfree(entry);
+ }
+
+ static void put_entry(struct my_data *entry)
+ {
+ kref_put(&entry->refcount, release_entry);
+ }
+
+Which is useful to remove the mutex lock around kref_put() in put_entry(), but
+it's important that kref_get_unless_zero is enclosed in the same critical
+section that finds the entry in the lookup table,
+otherwise kref_get_unless_zero may reference already freed memory.
+Note that it is illegal to use kref_get_unless_zero without checking its
+return value. If you are sure (by already having a valid pointer) that
+kref_get_unless_zero() will return true, then use kref_get() instead.
+
+Krefs and RCU
+=============
+
+The function kref_get_unless_zero also makes it possible to use rcu
+locking for lookups in the above example::
+
+ struct my_data
+ {
+ struct rcu_head rhead;
+ .
+ struct kref refcount;
+ .
+ .
+ };
+
+ static struct my_data *get_entry_rcu()
+ {
+ struct my_data *entry = NULL;
+ rcu_read_lock();
+ if (!list_empty(&q)) {
+ entry = container_of(q.next, struct my_data, link);
+ if (!kref_get_unless_zero(&entry->refcount))
+ entry = NULL;
+ }
+ rcu_read_unlock();
+ return entry;
+ }
+
+ static void release_entry_rcu(struct kref *ref)
+ {
+ struct my_data *entry = container_of(ref, struct my_data, refcount);
+
+ mutex_lock(&mutex);
+ list_del_rcu(&entry->link);
+ mutex_unlock(&mutex);
+ kfree_rcu(entry, rhead);
+ }
+
+ static void put_entry(struct my_data *entry)
+ {
+ kref_put(&entry->refcount, release_entry_rcu);
+ }
+
+But note that the struct kref member needs to remain in valid memory for a
+rcu grace period after release_entry_rcu was called. That can be accomplished
+by using kfree_rcu(entry, rhead) as done above, or by calling synchronize_rcu()
+before using kfree, but note that synchronize_rcu() may sleep for a
+substantial amount of time.
diff --git a/Documentation/core-api/local_ops.rst b/Documentation/core-api/local_ops.rst
index 2ac3f9f29845..0b42ceaaf3c4 100644
--- a/Documentation/core-api/local_ops.rst
+++ b/Documentation/core-api/local_ops.rst
@@ -191,7 +191,7 @@ Here is a sample module which implements a basic per cpu counter using
static void __exit test_exit(void)
{
- del_timer_sync(&test_timer);
+ timer_shutdown_sync(&test_timer);
}
module_init(test_init);
diff --git a/Documentation/core-api/maple_tree.rst b/Documentation/core-api/maple_tree.rst
new file mode 100644
index 000000000000..ccdd1615cf97
--- /dev/null
+++ b/Documentation/core-api/maple_tree.rst
@@ -0,0 +1,221 @@
+.. SPDX-License-Identifier: GPL-2.0+
+
+
+==========
+Maple Tree
+==========
+
+:Author: Liam R. Howlett
+
+Overview
+========
+
+The Maple Tree is a B-Tree data type which is optimized for storing
+non-overlapping ranges, including ranges of size 1. The tree was designed to
+be simple to use and does not require a user written search method. It
+supports iterating over a range of entries and going to the previous or next
+entry in a cache-efficient manner. The tree can also be put into an RCU-safe
+mode of operation which allows reading and writing concurrently. Writers must
+synchronize on a lock, which can be the default spinlock, or the user can set
+the lock to an external lock of a different type.
+
+The Maple Tree maintains a small memory footprint and was designed to use
+modern processor cache efficiently. The majority of the users will be able to
+use the normal API. An :ref:`maple-tree-advanced-api` exists for more complex
+scenarios. The most important usage of the Maple Tree is the tracking of the
+virtual memory areas.
+
+The Maple Tree can store values between ``0`` and ``ULONG_MAX``. The Maple
+Tree reserves values with the bottom two bits set to '10' which are below 4096
+(ie 2, 6, 10 .. 4094) for internal use. If the entries may use reserved
+entries then the users can convert the entries using xa_mk_value() and convert
+them back by calling xa_to_value(). If the user needs to use a reserved
+value, then the user can convert the value when using the
+:ref:`maple-tree-advanced-api`, but are blocked by the normal API.
+
+The Maple Tree can also be configured to support searching for a gap of a given
+size (or larger).
+
+Pre-allocating of nodes is also supported using the
+:ref:`maple-tree-advanced-api`. This is useful for users who must guarantee a
+successful store operation within a given
+code segment when allocating cannot be done. Allocations of nodes are
+relatively small at around 256 bytes.
+
+.. _maple-tree-normal-api:
+
+Normal API
+==========
+
+Start by initialising a maple tree, either with DEFINE_MTREE() for statically
+allocated maple trees or mt_init() for dynamically allocated ones. A
+freshly-initialised maple tree contains a ``NULL`` pointer for the range ``0``
+- ``ULONG_MAX``. There are currently two types of maple trees supported: the
+allocation tree and the regular tree. The regular tree has a higher branching
+factor for internal nodes. The allocation tree has a lower branching factor
+but allows the user to search for a gap of a given size or larger from either
+``0`` upwards or ``ULONG_MAX`` down. An allocation tree can be used by
+passing in the ``MT_FLAGS_ALLOC_RANGE`` flag when initialising the tree.
+
+You can then set entries using mtree_store() or mtree_store_range().
+mtree_store() will overwrite any entry with the new entry and return 0 on
+success or an error code otherwise. mtree_store_range() works in the same way
+but takes a range. mtree_load() is used to retrieve the entry stored at a
+given index. You can use mtree_erase() to erase an entire range by only
+knowing one value within that range, or mtree_store() call with an entry of
+NULL may be used to partially erase a range or many ranges at once.
+
+If you want to only store a new entry to a range (or index) if that range is
+currently ``NULL``, you can use mtree_insert_range() or mtree_insert() which
+return -EEXIST if the range is not empty.
+
+You can search for an entry from an index upwards by using mt_find().
+
+You can walk each entry within a range by calling mt_for_each(). You must
+provide a temporary variable to store a cursor. If you want to walk each
+element of the tree then ``0`` and ``ULONG_MAX`` may be used as the range. If
+the caller is going to hold the lock for the duration of the walk then it is
+worth looking at the mas_for_each() API in the :ref:`maple-tree-advanced-api`
+section.
+
+Sometimes it is necessary to ensure the next call to store to a maple tree does
+not allocate memory, please see :ref:`maple-tree-advanced-api` for this use case.
+
+You can use mtree_dup() to duplicate an entire maple tree. It is a more
+efficient way than inserting all elements one by one into a new tree.
+
+Finally, you can remove all entries from a maple tree by calling
+mtree_destroy(). If the maple tree entries are pointers, you may wish to free
+the entries first.
+
+Allocating Nodes
+----------------
+
+The allocations are handled by the internal tree code. See
+:ref:`maple-tree-advanced-alloc` for other options.
+
+Locking
+-------
+
+You do not have to worry about locking. See :ref:`maple-tree-advanced-locks`
+for other options.
+
+The Maple Tree uses RCU and an internal spinlock to synchronise access:
+
+Takes RCU read lock:
+ * mtree_load()
+ * mt_find()
+ * mt_for_each()
+ * mt_next()
+ * mt_prev()
+
+Takes ma_lock internally:
+ * mtree_store()
+ * mtree_store_range()
+ * mtree_insert()
+ * mtree_insert_range()
+ * mtree_erase()
+ * mtree_dup()
+ * mtree_destroy()
+ * mt_set_in_rcu()
+ * mt_clear_in_rcu()
+
+If you want to take advantage of the internal lock to protect the data
+structures that you are storing in the Maple Tree, you can call mtree_lock()
+before calling mtree_load(), then take a reference count on the object you
+have found before calling mtree_unlock(). This will prevent stores from
+removing the object from the tree between looking up the object and
+incrementing the refcount. You can also use RCU to avoid dereferencing
+freed memory, but an explanation of that is beyond the scope of this
+document.
+
+.. _maple-tree-advanced-api:
+
+Advanced API
+============
+
+The advanced API offers more flexibility and better performance at the
+cost of an interface which can be harder to use and has fewer safeguards.
+You must take care of your own locking while using the advanced API.
+You can use the ma_lock, RCU or an external lock for protection.
+You can mix advanced and normal operations on the same array, as long
+as the locking is compatible. The :ref:`maple-tree-normal-api` is implemented
+in terms of the advanced API.
+
+The advanced API is based around the ma_state, this is where the 'mas'
+prefix originates. The ma_state struct keeps track of tree operations to make
+life easier for both internal and external tree users.
+
+Initialising the maple tree is the same as in the :ref:`maple-tree-normal-api`.
+Please see above.
+
+The maple state keeps track of the range start and end in mas->index and
+mas->last, respectively.
+
+mas_walk() will walk the tree to the location of mas->index and set the
+mas->index and mas->last according to the range for the entry.
+
+You can set entries using mas_store(). mas_store() will overwrite any entry
+with the new entry and return the first existing entry that is overwritten.
+The range is passed in as members of the maple state: index and last.
+
+You can use mas_erase() to erase an entire range by setting index and
+last of the maple state to the desired range to erase. This will erase
+the first range that is found in that range, set the maple state index
+and last as the range that was erased and return the entry that existed
+at that location.
+
+You can walk each entry within a range by using mas_for_each(). If you want
+to walk each element of the tree then ``0`` and ``ULONG_MAX`` may be used as
+the range. If the lock needs to be periodically dropped, see the locking
+section mas_pause().
+
+Using a maple state allows mas_next() and mas_prev() to function as if the
+tree was a linked list. With such a high branching factor the amortized
+performance penalty is outweighed by cache optimization. mas_next() will
+return the next entry which occurs after the entry at index. mas_prev()
+will return the previous entry which occurs before the entry at index.
+
+mas_find() will find the first entry which exists at or above index on
+the first call, and the next entry from every subsequent calls.
+
+mas_find_rev() will find the first entry which exists at or below the last on
+the first call, and the previous entry from every subsequent calls.
+
+If the user needs to yield the lock during an operation, then the maple state
+must be paused using mas_pause().
+
+There are a few extra interfaces provided when using an allocation tree.
+If you wish to search for a gap within a range, then mas_empty_area()
+or mas_empty_area_rev() can be used. mas_empty_area() searches for a gap
+starting at the lowest index given up to the maximum of the range.
+mas_empty_area_rev() searches for a gap starting at the highest index given
+and continues downward to the lower bound of the range.
+
+.. _maple-tree-advanced-alloc:
+
+Advanced Allocating Nodes
+-------------------------
+
+Allocations are usually handled internally to the tree, however if allocations
+need to occur before a write occurs then calling mas_expected_entries() will
+allocate the worst-case number of needed nodes to insert the provided number of
+ranges. This also causes the tree to enter mass insertion mode. Once
+insertions are complete calling mas_destroy() on the maple state will free the
+unused allocations.
+
+.. _maple-tree-advanced-locks:
+
+Advanced Locking
+----------------
+
+The maple tree uses a spinlock by default, but external locks can be used for
+tree updates as well. To use an external lock, the tree must be initialized
+with the ``MT_FLAGS_LOCK_EXTERN flag``, this is usually done with the
+MTREE_INIT_EXT() #define, which takes an external lock as an argument.
+
+Functions and structures
+========================
+
+.. kernel-doc:: include/linux/maple_tree.h
+.. kernel-doc:: lib/maple_tree.c
diff --git a/Documentation/core-api/memory-allocation.rst b/Documentation/core-api/memory-allocation.rst
index 4aa82ddd01b8..1c58d883b273 100644
--- a/Documentation/core-api/memory-allocation.rst
+++ b/Documentation/core-api/memory-allocation.rst
@@ -84,6 +84,50 @@ driver for a device with such restrictions, avoid using these flags.
And even with hardware with restrictions it is preferable to use
`dma_alloc*` APIs.
+GFP flags and reclaim behavior
+------------------------------
+Memory allocations may trigger direct or background reclaim and it is
+useful to understand how hard the page allocator will try to satisfy that
+or another request.
+
+ * ``GFP_KERNEL & ~__GFP_RECLAIM`` - optimistic allocation without _any_
+ attempt to free memory at all. The most light weight mode which even
+ doesn't kick the background reclaim. Should be used carefully because it
+ might deplete the memory and the next user might hit the more aggressive
+ reclaim.
+
+ * ``GFP_KERNEL & ~__GFP_DIRECT_RECLAIM`` (or ``GFP_NOWAIT``)- optimistic
+ allocation without any attempt to free memory from the current
+ context but can wake kswapd to reclaim memory if the zone is below
+ the low watermark. Can be used from either atomic contexts or when
+ the request is a performance optimization and there is another
+ fallback for a slow path.
+
+ * ``(GFP_KERNEL|__GFP_HIGH) & ~__GFP_DIRECT_RECLAIM`` (aka ``GFP_ATOMIC``) -
+ non sleeping allocation with an expensive fallback so it can access
+ some portion of memory reserves. Usually used from interrupt/bottom-half
+ context with an expensive slow path fallback.
+
+ * ``GFP_KERNEL`` - both background and direct reclaim are allowed and the
+ **default** page allocator behavior is used. That means that not costly
+ allocation requests are basically no-fail but there is no guarantee of
+ that behavior so failures have to be checked properly by callers
+ (e.g. OOM killer victim is allowed to fail currently).
+
+ * ``GFP_KERNEL | __GFP_NORETRY`` - overrides the default allocator behavior
+ and all allocation requests fail early rather than cause disruptive
+ reclaim (one round of reclaim in this implementation). The OOM killer
+ is not invoked.
+
+ * ``GFP_KERNEL | __GFP_RETRY_MAYFAIL`` - overrides the default allocator
+ behavior and all allocation requests try really hard. The request
+ will fail if the reclaim cannot make any progress. The OOM killer
+ won't be triggered.
+
+ * ``GFP_KERNEL | __GFP_NOFAIL`` - overrides the default allocator behavior
+ and all allocation requests will loop endlessly until they succeed.
+ This might be really dangerous especially for larger orders.
+
Selecting memory allocator
==========================
@@ -103,6 +147,10 @@ The address of a chunk allocated with `kmalloc` is aligned to at least
ARCH_KMALLOC_MINALIGN bytes. For sizes which are a power of two, the
alignment is also guaranteed to be at least the respective size.
+Chunks allocated with kmalloc() can be resized with krealloc(). Similarly
+to kmalloc_array(): a helper for resizing arrays is provided in the form of
+krealloc_array().
+
For large allocations you can use vmalloc() and vzalloc(), or directly
request pages from the page allocator. The memory allocated by `vmalloc`
and related functions is not physically contiguous.
@@ -122,7 +170,16 @@ should be used if a part of the cache might be copied to the userspace.
After the cache is created kmem_cache_alloc() and its convenience
wrappers can allocate memory from that cache.
-When the allocated memory is no longer needed it must be freed. You can
-use kvfree() for the memory allocated with `kmalloc`, `vmalloc` and
-`kvmalloc`. The slab caches should be freed with kmem_cache_free(). And
-don't forget to destroy the cache with kmem_cache_destroy().
+When the allocated memory is no longer needed it must be freed.
+
+Objects allocated by `kmalloc` can be freed by `kfree` or `kvfree`. Objects
+allocated by `kmem_cache_alloc` can be freed with `kmem_cache_free`, `kfree`
+or `kvfree`, where the latter two might be more convenient thanks to not
+needing the kmem_cache pointer.
+
+The same rules apply to _bulk and _rcu flavors of freeing functions.
+
+Memory allocated by `vmalloc` can be freed with `vfree` or `kvfree`.
+Memory allocated by `kvmalloc` can be freed with `kvfree`.
+Caches created by `kmem_cache_create` should be freed with
+`kmem_cache_destroy` only after freeing all the allocated objects first.
diff --git a/Documentation/core-api/memory-hotplug.rst b/Documentation/core-api/memory-hotplug.rst
index de7467e48067..682259ee633a 100644
--- a/Documentation/core-api/memory-hotplug.rst
+++ b/Documentation/core-api/memory-hotplug.rst
@@ -57,7 +57,6 @@ The third argument (arg) passes a pointer of struct memory_notify::
unsigned long start_pfn;
unsigned long nr_pages;
int status_change_nid_normal;
- int status_change_nid_high;
int status_change_nid;
}
@@ -65,8 +64,6 @@ The third argument (arg) passes a pointer of struct memory_notify::
- nr_pages is # of pages of online/offline memory.
- status_change_nid_normal is set node id when N_NORMAL_MEMORY of nodemask
is (will be) set/clear, if this is -1, then nodemask status is not changed.
-- status_change_nid_high is set node id when N_HIGH_MEMORY of nodemask
- is (will be) set/clear, if this is -1, then nodemask status is not changed.
- status_change_nid is set node id when N_MEMORY of nodemask is (will be)
set/clear. It means a new(memoryless) node gets new memory by online and a
node loses all memory. If this is -1, then nodemask status is not changed.
diff --git a/Documentation/core-api/mm-api.rst b/Documentation/core-api/mm-api.rst
index 2adffb3f7914..af8151db88b2 100644
--- a/Documentation/core-api/mm-api.rst
+++ b/Documentation/core-api/mm-api.rst
@@ -19,22 +19,16 @@ User Space Memory Access
Memory Allocation Controls
==========================
-Functions which need to allocate memory often use GFP flags to express
-how that memory should be allocated. The GFP acronym stands for "get
-free pages", the underlying memory allocation function. Not every GFP
-flag is allowed to every function which may allocate memory. Most
-users will want to use a plain ``GFP_KERNEL``.
-
-.. kernel-doc:: include/linux/gfp.h
+.. kernel-doc:: include/linux/gfp_types.h
:doc: Page mobility and placement hints
-.. kernel-doc:: include/linux/gfp.h
+.. kernel-doc:: include/linux/gfp_types.h
:doc: Watermark modifiers
-.. kernel-doc:: include/linux/gfp.h
+.. kernel-doc:: include/linux/gfp_types.h
:doc: Reclaim modifiers
-.. kernel-doc:: include/linux/gfp.h
+.. kernel-doc:: include/linux/gfp_types.h
:doc: Useful GFP flag combinations
The Slab Cache
@@ -43,7 +37,7 @@ The Slab Cache
.. kernel-doc:: include/linux/slab.h
:internal:
-.. kernel-doc:: mm/slab.c
+.. kernel-doc:: mm/slub.c
:export:
.. kernel-doc:: mm/slab_common.c
@@ -61,15 +55,30 @@ Virtually Contiguous Mappings
File Mapping and Page Cache
===========================
-.. kernel-doc:: mm/readahead.c
- :export:
+Filemap
+-------
.. kernel-doc:: mm/filemap.c
:export:
+Readahead
+---------
+
+.. kernel-doc:: mm/readahead.c
+ :doc: Readahead Overview
+
+.. kernel-doc:: mm/readahead.c
+ :export:
+
+Writeback
+---------
+
.. kernel-doc:: mm/page-writeback.c
:export:
+Truncate
+--------
+
.. kernel-doc:: mm/truncate.c
:export:
@@ -95,3 +104,39 @@ More Memory Management Functions
:export:
.. kernel-doc:: mm/page_alloc.c
+.. kernel-doc:: mm/mempolicy.c
+.. kernel-doc:: include/linux/mm_types.h
+ :internal:
+.. kernel-doc:: include/linux/mm_inline.h
+.. kernel-doc:: include/linux/page-flags.h
+.. kernel-doc:: include/linux/mm.h
+ :internal:
+.. kernel-doc:: include/linux/page_ref.h
+.. kernel-doc:: include/linux/mmzone.h
+.. kernel-doc:: mm/util.c
+ :functions: folio_mapping
+
+.. kernel-doc:: mm/rmap.c
+.. kernel-doc:: mm/migrate.c
+.. kernel-doc:: mm/mmap.c
+.. kernel-doc:: mm/kmemleak.c
+.. #kernel-doc:: mm/hmm.c (build warnings)
+.. kernel-doc:: mm/memremap.c
+.. kernel-doc:: mm/hugetlb.c
+.. kernel-doc:: mm/swap.c
+.. kernel-doc:: mm/zpool.c
+.. kernel-doc:: mm/memcontrol.c
+.. #kernel-doc:: mm/memory-tiers.c (build warnings)
+.. kernel-doc:: mm/shmem.c
+.. kernel-doc:: mm/migrate_device.c
+.. #kernel-doc:: mm/nommu.c (duplicates kernel-doc from other files)
+.. kernel-doc:: mm/mapping_dirty_helpers.c
+.. #kernel-doc:: mm/memory-failure.c (build warnings)
+.. kernel-doc:: mm/percpu.c
+.. kernel-doc:: mm/maccess.c
+.. kernel-doc:: mm/vmscan.c
+.. kernel-doc:: mm/memory_hotplug.c
+.. kernel-doc:: mm/mmu_notifier.c
+.. kernel-doc:: mm/balloon_compaction.c
+.. kernel-doc:: mm/huge_memory.c
+.. kernel-doc:: mm/io-mapping.c
diff --git a/Documentation/core-api/netlink.rst b/Documentation/core-api/netlink.rst
new file mode 100644
index 000000000000..9f692b02bfe6
--- /dev/null
+++ b/Documentation/core-api/netlink.rst
@@ -0,0 +1,102 @@
+.. SPDX-License-Identifier: BSD-3-Clause
+
+.. _kernel_netlink:
+
+===================================
+Netlink notes for kernel developers
+===================================
+
+General guidance
+================
+
+Attribute enums
+---------------
+
+Older families often define "null" attributes and commands with value
+of ``0`` and named ``unspec``. This is supported (``type: unused``)
+but should be avoided in new families. The ``unspec`` enum values are
+not used in practice, so just set the value of the first attribute to ``1``.
+
+Message enums
+-------------
+
+Use the same command IDs for requests and replies. This makes it easier
+to match them up, and we have plenty of ID space.
+
+Use separate command IDs for notifications. This makes it easier to
+sort the notifications from replies (and present them to the user
+application via a different API than replies).
+
+Answer requests
+---------------
+
+Older families do not reply to all of the commands, especially NEW / ADD
+commands. User only gets information whether the operation succeeded or
+not via the ACK. Try to find useful data to return. Once the command is
+added whether it replies with a full message or only an ACK is uAPI and
+cannot be changed. It's better to err on the side of replying.
+
+Specifically NEW and ADD commands should reply with information identifying
+the created object such as the allocated object's ID (without having to
+resort to using ``NLM_F_ECHO``).
+
+NLM_F_ECHO
+----------
+
+Make sure to pass the request info to genl_notify() to allow ``NLM_F_ECHO``
+to take effect. This is useful for programs that need precise feedback
+from the kernel (for example for logging purposes).
+
+Support dump consistency
+------------------------
+
+If iterating over objects during dump may skip over objects or repeat
+them - make sure to report dump inconsistency with ``NLM_F_DUMP_INTR``.
+This is usually implemented by maintaining a generation id for the
+structure and recording it in the ``seq`` member of struct netlink_callback.
+
+Netlink specification
+=====================
+
+Documentation of the Netlink specification parts which are only relevant
+to the kernel space.
+
+Globals
+-------
+
+kernel-policy
+~~~~~~~~~~~~~
+
+Defines whether the kernel validation policy is ``global`` i.e. the same for all
+operations of the family, defined for each operation individually - ``per-op``,
+or separately for each operation and operation type (do vs dump) - ``split``.
+New families should use ``per-op`` (default) to be able to narrow down the
+attributes accepted by a specific command.
+
+checks
+------
+
+Documentation for the ``checks`` sub-sections of attribute specs.
+
+unterminated-ok
+~~~~~~~~~~~~~~~
+
+Accept strings without the null-termination (for legacy families only).
+Switches from the ``NLA_NUL_STRING`` to ``NLA_STRING`` policy type.
+
+max-len
+~~~~~~~
+
+Defines max length for a binary or string attribute (corresponding
+to the ``len`` member of struct nla_policy). For string attributes terminating
+null character is not counted towards ``max-len``.
+
+The field may either be a literal integer value or a name of a defined
+constant. String types may reduce the constant by one
+(i.e. specify ``max-len: CONST - 1``) to reserve space for the terminating
+character so implementations should recognize such pattern.
+
+min-len
+~~~~~~~
+
+Similar to ``max-len`` but defines minimum length.
diff --git a/Documentation/core-api/packing.rst b/Documentation/core-api/packing.rst
index d8c341fe383e..3ed13bc9a195 100644
--- a/Documentation/core-api/packing.rst
+++ b/Documentation/core-api/packing.rst
@@ -161,6 +161,6 @@ xxx_packing() that calls it using the proper QUIRK_* one-hot bits set.
The packing() function returns an int-encoded error code, which protects the
programmer against incorrect API use. The errors are not expected to occur
-durring runtime, therefore it is reasonable for xxx_packing() to return void
+during runtime, therefore it is reasonable for xxx_packing() to return void
and simply swallow those errors. Optionally it can dump stack or print the
error description.
diff --git a/Documentation/core-api/padata.rst b/Documentation/core-api/padata.rst
index 9a24c111781d..05b73c6c105f 100644
--- a/Documentation/core-api/padata.rst
+++ b/Documentation/core-api/padata.rst
@@ -4,42 +4,34 @@
The padata parallel execution mechanism
=======================================
-:Date: December 2019
+:Date: May 2020
Padata is a mechanism by which the kernel can farm jobs out to be done in
-parallel on multiple CPUs while retaining their ordering. It was developed for
-use with the IPsec code, which needs to be able to perform encryption and
-decryption on large numbers of packets without reordering those packets. The
-crypto developers made a point of writing padata in a sufficiently general
-fashion that it could be put to other uses as well.
+parallel on multiple CPUs while optionally retaining their ordering.
-Usage
-=====
+It was originally developed for IPsec, which needs to perform encryption and
+decryption on large numbers of packets without reordering those packets. This
+is currently the sole consumer of padata's serialized job support.
+
+Padata also supports multithreaded jobs, splitting up the job evenly while load
+balancing and coordinating between threads.
+
+Running Serialized Jobs
+=======================
Initializing
------------
-The first step in using padata is to set up a padata_instance structure for
-overall control of how jobs are to be run::
+The first step in using padata to run serialized jobs is to set up a
+padata_instance structure for overall control of how jobs are to be run::
#include <linux/padata.h>
- struct padata_instance *padata_alloc_possible(const char *name);
+ struct padata_instance *padata_alloc(const char *name);
'name' simply identifies the instance.
-There are functions for enabling and disabling the instance::
-
- int padata_start(struct padata_instance *pinst);
- void padata_stop(struct padata_instance *pinst);
-
-These functions are setting or clearing the "PADATA_INIT" flag; if that flag is
-not set, other functions will refuse to work. padata_start() returns zero on
-success (flag set) or -EINVAL if the padata cpumask contains no active CPU
-(flag not set). padata_stop() clears the flag and blocks until the padata
-instance is unused.
-
-Finally, complete padata initialization by allocating a padata_shell::
+Then, complete padata initialization by allocating a padata_shell::
struct padata_shell *padata_alloc_shell(struct padata_instance *pinst);
@@ -50,7 +42,7 @@ padata_shells associated with it, each allowing a separate series of jobs.
Modifying cpumasks
------------------
-The CPUs used to run jobs can be changed in two ways, programatically with
+The CPUs used to run jobs can be changed in two ways, programmatically with
padata_set_cpumask() or via sysfs. The former is defined::
int padata_set_cpumask(struct padata_instance *pinst, int cpumask_type,
@@ -152,16 +144,33 @@ submitted.
Destroying
----------
-Cleaning up a padata instance predictably involves calling the three free
+Cleaning up a padata instance predictably involves calling the two free
functions that correspond to the allocation in reverse::
void padata_free_shell(struct padata_shell *ps);
- void padata_stop(struct padata_instance *pinst);
void padata_free(struct padata_instance *pinst);
It is the user's responsibility to ensure all outstanding jobs are complete
before any of the above are called.
+Running Multithreaded Jobs
+==========================
+
+A multithreaded job has a main thread and zero or more helper threads, with the
+main thread participating in the job and then waiting until all helpers have
+finished. padata splits the job into units called chunks, where a chunk is a
+piece of the job that one thread completes in one call to the thread function.
+
+A user has to do three things to run a multithreaded job. First, describe the
+job by defining a padata_mt_job structure, which is explained in the Interface
+section. This includes a pointer to the thread function, which padata will
+call each time it assigns a job chunk to a thread. Then, define the thread
+function, which accepts three arguments, ``start``, ``end``, and ``arg``, where
+the first two delimit the range that the thread operates on and the last is a
+pointer to the job's shared state, if any. Prepare the shared state, which is
+typically allocated on the main thread's stack. Last, call
+padata_do_multithreaded(), which will return once the job is finished.
+
Interface
=========
diff --git a/Documentation/core-api/pin_user_pages.rst b/Documentation/core-api/pin_user_pages.rst
index 2e939ff10b86..6b5f7e6e7155 100644
--- a/Documentation/core-api/pin_user_pages.rst
+++ b/Documentation/core-api/pin_user_pages.rst
@@ -33,7 +33,7 @@ all combinations of get*(), pin*(), FOLL_LONGTERM, and more. Also, the
pin_user_pages*() APIs are clearly distinct from the get_user_pages*() APIs, so
that's a natural dividing line, and a good point to make separate wrapper calls.
In other words, use pin_user_pages*() for DMA-pinned pages, and
-get_user_pages*() for other cases. There are four cases described later on in
+get_user_pages*() for other cases. There are five cases described later on in
this document, to further clarify that concept.
FOLL_PIN and FOLL_GET are mutually exclusive for a given gup call. However,
@@ -55,18 +55,17 @@ flags the caller provides. The caller is required to pass in a non-null struct
pages* array, and the function then pins pages by incrementing each by a special
value: GUP_PIN_COUNTING_BIAS.
-For huge pages (and in fact, any compound page of more than 2 pages), the
-GUP_PIN_COUNTING_BIAS scheme is not used. Instead, an exact form of pin counting
-is achieved, by using the 3rd struct page in the compound page. A new struct
-page field, hpage_pinned_refcount, has been added in order to support this.
+For large folios, the GUP_PIN_COUNTING_BIAS scheme is not used. Instead,
+the extra space available in the struct folio is used to store the
+pincount directly.
-This approach for compound pages avoids the counting upper limit problems that
-are discussed below. Those limitations would have been aggravated severely by
-huge pages, because each tail page adds a refcount to the head page. And in
-fact, testing revealed that, without a separate hpage_pinned_refcount field,
-page overflows were seen in some huge page stress tests.
+This approach for large folios avoids the counting upper limit problems
+that are discussed below. Those limitations would have been aggravated
+severely by huge pages, because each tail page adds a refcount to the
+head page. And in fact, testing revealed that, without a separate pincount
+field, refcount overflows were seen in some huge page stress tests.
-This also means that huge pages and compound pages (of order > 1) do not suffer
+This also means that huge pages and large folios do not suffer
from the false positives problem that is mentioned below.::
Function
@@ -113,6 +112,12 @@ pages:
This also leads to limitations: there are only 31-10==21 bits available for a
counter that increments 10 bits at a time.
+* Because of that limitation, special handling is applied to the zero pages
+ when using FOLL_PIN. We only pretend to pin a zero page - we don't alter its
+ refcount or pincount at all (it is permanent, so there's no need). The
+ unpinning functions also don't do anything to a zero page. This is
+ transparent to the caller.
+
* Callers must specifically request "dma-pinned tracking of pages". In other
words, just calling get_user_pages() will not suffice; a new set of functions,
pin_user_page() and related, must be used.
@@ -148,23 +153,48 @@ NOTE: Some pages, such as DAX pages, cannot be pinned with longterm pins. That's
because DAX pages do not have a separate page cache, and so "pinning" implies
locking down file system blocks, which is not (yet) supported in that way.
-CASE 3: Hardware with page faulting support
--------------------------------------------
-Here, a well-written driver doesn't normally need to pin pages at all. However,
-if the driver does choose to do so, it can register MMU notifiers for the range,
-and will be called back upon invalidation. Either way (avoiding page pinning, or
-using MMU notifiers to unpin upon request), there is proper synchronization with
-both filesystem and mm (page_mkclean(), munmap(), etc).
+.. _mmu-notifier-registration-case:
+
+CASE 3: MMU notifier registration, with or without page faulting hardware
+-------------------------------------------------------------------------
+Device drivers can pin pages via get_user_pages*(), and register for mmu
+notifier callbacks for the memory range. Then, upon receiving a notifier
+"invalidate range" callback , stop the device from using the range, and unpin
+the pages. There may be other possible schemes, such as for example explicitly
+synchronizing against pending IO, that accomplish approximately the same thing.
-Therefore, neither flag needs to be set.
+Or, if the hardware supports replayable page faults, then the device driver can
+avoid pinning entirely (this is ideal), as follows: register for mmu notifier
+callbacks as above, but instead of stopping the device and unpinning in the
+callback, simply remove the range from the device's page tables.
-In this case, ideally, neither get_user_pages() nor pin_user_pages() should be
-called. Instead, the software should be written so that it does not pin pages.
-This allows mm and filesystems to operate more efficiently and reliably.
+Either way, as long as the driver unpins the pages upon mmu notifier callback,
+then there is proper synchronization with both filesystem and mm
+(page_mkclean(), munmap(), etc). Therefore, neither flag needs to be set.
CASE 4: Pinning for struct page manipulation only
-------------------------------------------------
-Here, normal GUP calls are sufficient, so neither flag needs to be set.
+If only struct page data (as opposed to the actual memory contents that a page
+is tracking) is affected, then normal GUP calls are sufficient, and neither flag
+needs to be set.
+
+CASE 5: Pinning in order to write to the data within the page
+-------------------------------------------------------------
+Even though neither DMA nor Direct IO is involved, just a simple case of "pin,
+write to a page's data, unpin" can cause a problem. Case 5 may be considered a
+superset of Case 1, plus Case 2, plus anything that invokes that pattern. In
+other words, if the code is neither Case 1 nor Case 2, it may still require
+FOLL_PIN, for patterns like this:
+
+Correct (uses FOLL_PIN calls):
+ pin_user_pages()
+ write to the data within the pages
+ unpin_user_pages()
+
+INCORRECT (uses FOLL_GET calls):
+ get_user_pages()
+ write to the data within the pages
+ put_page()
page_maybe_dma_pinned(): the whole point of pinning
===================================================
@@ -198,12 +228,12 @@ Unit testing
============
This file::
- tools/testing/selftests/vm/gup_benchmark.c
+ tools/testing/selftests/mm/gup_test.c
has the following new calls to exercise the new pin*() wrapper functions:
-* PIN_FAST_BENCHMARK (./gup_benchmark -a)
-* PIN_BENCHMARK (./gup_benchmark -b)
+* PIN_FAST_BENCHMARK (./gup_test -a)
+* PIN_BASIC_TEST (./gup_test -b)
You can monitor how many total dma-pinned pages have been acquired and released
since the system was booted, via two new /proc/vmstat entries: ::
@@ -241,9 +271,9 @@ place.)
Other diagnostics
=================
-dump_page() has been enhanced slightly, to handle these new counting fields, and
-to better report on compound pages in general. Specifically, for compound pages
-with order > 1, the exact (hpage_pinned_refcount) pincount is reported.
+dump_page() has been enhanced slightly to handle these new counting
+fields, and to better report on large folios in general. Specifically,
+for large folios, the exact pincount is reported.
References
==========
diff --git a/Documentation/core-api/printk-basics.rst b/Documentation/core-api/printk-basics.rst
new file mode 100644
index 000000000000..2dde24ca7d9f
--- /dev/null
+++ b/Documentation/core-api/printk-basics.rst
@@ -0,0 +1,112 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===========================
+Message logging with printk
+===========================
+
+printk() is one of the most widely known functions in the Linux kernel. It's the
+standard tool we have for printing messages and usually the most basic way of
+tracing and debugging. If you're familiar with printf(3) you can tell printk()
+is based on it, although it has some functional differences:
+
+ - printk() messages can specify a log level.
+
+ - the format string, while largely compatible with C99, doesn't follow the
+ exact same specification. It has some extensions and a few limitations
+ (no ``%n`` or floating point conversion specifiers). See :ref:`How to get
+ printk format specifiers right <printk-specifiers>`.
+
+All printk() messages are printed to the kernel log buffer, which is a ring
+buffer exported to userspace through /dev/kmsg. The usual way to read it is
+using ``dmesg``.
+
+printk() is typically used like this::
+
+ printk(KERN_INFO "Message: %s\n", arg);
+
+where ``KERN_INFO`` is the log level (note that it's concatenated to the format
+string, the log level is not a separate argument). The available log levels are:
+
++----------------+--------+-----------------------------------------------+
+| Name | String | Alias function |
++================+========+===============================================+
+| KERN_EMERG | "0" | pr_emerg() |
++----------------+--------+-----------------------------------------------+
+| KERN_ALERT | "1" | pr_alert() |
++----------------+--------+-----------------------------------------------+
+| KERN_CRIT | "2" | pr_crit() |
++----------------+--------+-----------------------------------------------+
+| KERN_ERR | "3" | pr_err() |
++----------------+--------+-----------------------------------------------+
+| KERN_WARNING | "4" | pr_warn() |
++----------------+--------+-----------------------------------------------+
+| KERN_NOTICE | "5" | pr_notice() |
++----------------+--------+-----------------------------------------------+
+| KERN_INFO | "6" | pr_info() |
++----------------+--------+-----------------------------------------------+
+| KERN_DEBUG | "7" | pr_debug() and pr_devel() if DEBUG is defined |
++----------------+--------+-----------------------------------------------+
+| KERN_DEFAULT | "" | |
++----------------+--------+-----------------------------------------------+
+| KERN_CONT | "c" | pr_cont() |
++----------------+--------+-----------------------------------------------+
+
+
+The log level specifies the importance of a message. The kernel decides whether
+to show the message immediately (printing it to the current console) depending
+on its log level and the current *console_loglevel* (a kernel variable). If the
+message priority is higher (lower log level value) than the *console_loglevel*
+the message will be printed to the console.
+
+If the log level is omitted, the message is printed with ``KERN_DEFAULT``
+level.
+
+You can check the current *console_loglevel* with::
+
+ $ cat /proc/sys/kernel/printk
+ 4 4 1 7
+
+The result shows the *current*, *default*, *minimum* and *boot-time-default* log
+levels.
+
+To change the current console_loglevel simply write the desired level to
+``/proc/sys/kernel/printk``. For example, to print all messages to the console::
+
+ # echo 8 > /proc/sys/kernel/printk
+
+Another way, using ``dmesg``::
+
+ # dmesg -n 5
+
+sets the console_loglevel to print KERN_WARNING (4) or more severe messages to
+console. See ``dmesg(1)`` for more information.
+
+As an alternative to printk() you can use the ``pr_*()`` aliases for
+logging. This family of macros embed the log level in the macro names. For
+example::
+
+ pr_info("Info message no. %d\n", msg_num);
+
+prints a ``KERN_INFO`` message.
+
+Besides being more concise than the equivalent printk() calls, they can use a
+common definition for the format string through the pr_fmt() macro. For
+instance, defining this at the top of a source file (before any ``#include``
+directive)::
+
+ #define pr_fmt(fmt) "%s:%s: " fmt, KBUILD_MODNAME, __func__
+
+would prefix every pr_*() message in that file with the module and function name
+that originated the message.
+
+For debugging purposes there are also two conditionally-compiled macros:
+pr_debug() and pr_devel(), which are compiled-out unless ``DEBUG`` (or
+also ``CONFIG_DYNAMIC_DEBUG`` in the case of pr_debug()) is defined.
+
+
+Function reference
+==================
+
+.. kernel-doc:: include/linux/printk.h
+ :functions: printk pr_emerg pr_alert pr_crit pr_err pr_warn pr_notice pr_info
+ pr_fmt pr_debug pr_devel pr_cont
diff --git a/Documentation/core-api/printk-formats.rst b/Documentation/core-api/printk-formats.rst
index 8ebe46b1af39..4451ef501936 100644
--- a/Documentation/core-api/printk-formats.rst
+++ b/Documentation/core-api/printk-formats.rst
@@ -2,6 +2,8 @@
How to get printk format specifiers right
=========================================
+.. _printk-specifiers:
+
:Author: Randy Dunlap <rdunlap@infradead.org>
:Author: Andrew Murray <amurray@mpc-data.co.uk>
@@ -13,9 +15,10 @@ Integer types
If variable is of Type, use printk format specifier:
------------------------------------------------------------
- char %d or %x
+ signed char %d or %hhx
unsigned char %u or %x
- short int %d or %x
+ char %u or %x
+ short int %d or %hx
unsigned short int %u or %x
int %d or %x
unsigned int %u or %x
@@ -25,9 +28,9 @@ Integer types
unsigned long long %llu or %llx
size_t %zu or %zx
ssize_t %zd or %zx
- s8 %d or %x
+ s8 %d or %hhx
u8 %u or %x
- s16 %d or %x
+ s16 %d or %hx
u16 %u or %x
s32 %d or %x
u32 %u or %x
@@ -35,14 +38,13 @@ Integer types
u64 %llu or %llx
-If <type> is dependent on a config option for its size (e.g., sector_t,
-blkcnt_t) or is architecture-dependent for its size (e.g., tcflag_t), use a
-format specifier of its largest possible type and explicitly cast to it.
+If <type> is architecture-dependent for its size (e.g., cycles_t, tcflag_t) or
+is dependent on a config option for its size (e.g., blk_status_t), use a format
+specifier of its largest possible type and explicitly cast to it.
Example::
- printk("test: sector number/total blocks: %llu/%llu\n",
- (unsigned long long)sector, (unsigned long long)blockcount);
+ printk("test: latency: %llu cycles\n", (unsigned long long)time);
Reminder: sizeof() returns type size_t.
@@ -77,7 +79,19 @@ Pointers printed without a specifier extension (i.e unadorned %p) are
hashed to prevent leaking information about the kernel memory layout. This
has the added benefit of providing a unique identifier. On 64-bit machines
the first 32 bits are zeroed. The kernel will print ``(ptrval)`` until it
-gathers enough entropy. If you *really* want the address see %px below.
+gathers enough entropy.
+
+When possible, use specialised modifiers such as %pS or %pB (described below)
+to avoid the need of providing an unhashed address that has to be interpreted
+post-hoc. If not possible, and the aim of printing the address is to provide
+more information for debugging, use %p and boot the kernel with the
+``no_hash_pointers`` parameter during debugging, which will print all %p
+addresses unmodified. If you *really* always want the unmodified address, see
+%px below.
+
+If (and only if) you are printing addresses as a content of a virtual file in
+e.g. procfs or sysfs (using e.g. seq_printf(), not printk()) read by a
+userspace process, use the %pK modifier described below instead of %p or %px.
Error Pointers
--------------
@@ -112,6 +126,32 @@ used when printing stack backtraces. The specifier takes into
consideration the effect of compiler optimisations which may occur
when tail-calls are used and marked with the noreturn GCC attribute.
+If the pointer is within a module, the module name and optionally build ID is
+printed after the symbol name with an extra ``b`` appended to the end of the
+specifier.
+
+::
+
+ %pS versatile_init+0x0/0x110 [module_name]
+ %pSb versatile_init+0x0/0x110 [module_name ed5019fdf5e53be37cb1ba7899292d7e143b259e]
+ %pSRb versatile_init+0x9/0x110 [module_name ed5019fdf5e53be37cb1ba7899292d7e143b259e]
+ (with __builtin_extract_return_addr() translation)
+ %pBb prev_fn_of_versatile_init+0x88/0x88 [module_name ed5019fdf5e53be37cb1ba7899292d7e143b259e]
+
+Probed Pointers from BPF / tracing
+----------------------------------
+
+::
+
+ %pks kernel string
+ %pus user string
+
+The ``k`` and ``u`` specifiers are used for printing prior probed memory from
+either kernel memory (k) or user memory (u). The subsequent ``s`` specifier
+results in printing a string. For direct use in regular vsnprintf() the (k)
+and (u) annotation is ignored, however, when used out of BPF's bpf_trace_printk(),
+for example, it reads the memory it is pointing to without faulting.
+
Kernel Pointers
---------------
@@ -123,6 +163,11 @@ For printing kernel pointers which should be hidden from unprivileged
users. The behaviour of %pK depends on the kptr_restrict sysctl - see
Documentation/admin-guide/sysctl/kernel.rst for more details.
+This modifier is *only* intended when producing content of a file read by
+userspace from e.g. procfs or sysfs, not for dmesg. Please refer to the
+section about %p above for discussion about how to manage hashing pointers
+in printk().
+
Unmodified Addresses
--------------------
@@ -137,6 +182,13 @@ equivalent to %lx (or %lu). %px is preferred because it is more uniquely
grep'able. If in the future we need to modify the way the kernel handles
printing pointers we will be better equipped to find the call sites.
+Before using %px, consider if using %p is sufficient together with enabling the
+``no_hash_pointers`` kernel parameter during debugging sessions (see the %p
+description above). One valid scenario for %px might be printing information
+immediately before a panic, which prevents any sensitive information to be
+exploited anyway, and with %px there would be no need to reproduce the panic
+with no_hash_pointers.
+
Pointer Differences
-------------------
@@ -301,7 +353,7 @@ colon-separators. Leading zeros are always used.
The additional ``c`` specifier can be used with the ``I`` specifier to
print a compressed IPv6 address as described by
-http://tools.ietf.org/html/rfc5952
+https://tools.ietf.org/html/rfc5952
Passed by reference.
@@ -325,7 +377,7 @@ The additional ``p``, ``f``, and ``s`` specifiers are used to specify port
flowinfo a ``/`` and scope a ``%``, each followed by the actual value.
In case of an IPv6 address the compressed IPv6 address as described by
-http://tools.ietf.org/html/rfc5952 is being used if the additional
+https://tools.ietf.org/html/rfc5952 is being used if the additional
specifier ``c`` is given. The IPv6 address is surrounded by ``[``, ``]`` in
case of additional specifiers ``p``, ``f`` or ``s`` as suggested by
https://tools.ietf.org/html/draft-ietf-6man-text-addr-representation-07
@@ -468,21 +520,30 @@ Examples (OF)::
%pfwf /ocp@68000000/i2c@48072000/camera@10/port/endpoint - Full name
%pfwP endpoint - Node name
-Time and date (struct rtc_time)
--------------------------------
+Time and date
+-------------
::
- %ptR YYYY-mm-ddTHH:MM:SS
- %ptRd YYYY-mm-dd
- %ptRt HH:MM:SS
- %ptR[dt][r]
+ %pt[RT] YYYY-mm-ddTHH:MM:SS
+ %pt[RT]s YYYY-mm-dd HH:MM:SS
+ %pt[RT]d YYYY-mm-dd
+ %pt[RT]t HH:MM:SS
+ %pt[RT][dt][r][s]
+
+For printing date and time as represented by::
+
+ R struct rtc_time structure
+ T time64_t type
-For printing date and time as represented by struct rtc_time structure in
-human readable format.
+in human readable format.
-By default year will be incremented by 1900 and month by 1. Use %ptRr (raw)
-to suppress this behaviour.
+By default year will be incremented by 1900 and month by 1.
+Use %pt[RT]r (raw) to suppress this behaviour.
+
+The %pt[RT]s (space) will override ISO 8601 separator by using ' ' (space)
+instead of 'T' (Capital T) between date and time. It won't have any effect
+when date or time is omitted.
Passed by reference.
@@ -511,22 +572,30 @@ For printing bitmap and its derivatives such as cpumask and nodemask,
%*pb outputs the bitmap with field width as the number of bits and %*pbl
output the bitmap as range list with field width as the number of bits.
-Passed by reference.
+The field width is passed by value, the bitmap is passed by reference.
+Helper macros cpumask_pr_args() and nodemask_pr_args() are available to ease
+printing cpumask and nodemask.
-Flags bitfields such as page flags, gfp_flags
----------------------------------------------
+Flags bitfields such as page flags, page_type, gfp_flags
+--------------------------------------------------------
::
- %pGp referenced|uptodate|lru|active|private
+ %pGp 0x17ffffc0002036(referenced|uptodate|lru|active|private|node=0|zone=2|lastcpupid=0x1fffff)
+ %pGt 0xffffff7f(buddy)
%pGg GFP_USER|GFP_DMA32|GFP_NOWARN
%pGv read|exec|mayread|maywrite|mayexec|denywrite
For printing flags bitfields as a collection of symbolic constants that
would construct the value. The type of flags is given by the third
-character. Currently supported are [p]age flags, [v]ma_flags (both
-expect ``unsigned long *``) and [g]fp_flags (expects ``gfp_t *``). The flag
-names and print order depends on the particular type.
+character. Currently supported are:
+
+ - p - [p]age flags, expects value of type (``unsigned long *``)
+ - t - page [t]ype, expects value of type (``unsigned int *``)
+ - v - [v]ma_flags, expects value of type (``unsigned long *``)
+ - g - [g]fp_flags, expects value of type (``gfp_t *``)
+
+The flag names and print order depends on the particular type.
Note that this format should not be used directly in the
:c:func:`TP_printk()` part of a tracepoint. Instead, use the show_*_flags()
@@ -545,6 +614,34 @@ For printing netdev_features_t.
Passed by reference.
+V4L2 and DRM FourCC code (pixel format)
+---------------------------------------
+
+::
+
+ %p4cc
+
+Print a FourCC code used by V4L2 or DRM, including format endianness and
+its numerical value as hexadecimal.
+
+Passed by reference.
+
+Examples::
+
+ %p4cc BG12 little-endian (0x32314742)
+ %p4cc Y10 little-endian (0x20303159)
+ %p4cc NV12 big-endian (0xb231564e)
+
+Rust
+----
+
+::
+
+ %pA
+
+Only intended to be used from Rust code to format ``core::fmt::Arguments``.
+Do *not* use it from C.
+
Thanks
======
diff --git a/Documentation/core-api/printk-index.rst b/Documentation/core-api/printk-index.rst
new file mode 100644
index 000000000000..3062f37d119b
--- /dev/null
+++ b/Documentation/core-api/printk-index.rst
@@ -0,0 +1,137 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+============
+Printk Index
+============
+
+There are many ways how to monitor the state of the system. One important
+source of information is the system log. It provides a lot of information,
+including more or less important warnings and error messages.
+
+There are monitoring tools that filter and take action based on messages
+logged.
+
+The kernel messages are evolving together with the code. As a result,
+particular kernel messages are not KABI and never will be!
+
+It is a huge challenge for maintaining the system log monitors. It requires
+knowing what messages were updated in a particular kernel version and why.
+Finding these changes in the sources would require non-trivial parsers.
+Also it would require matching the sources with the binary kernel which
+is not always trivial. Various changes might be backported. Various kernel
+versions might be used on different monitored systems.
+
+This is where the printk index feature might become useful. It provides
+a dump of printk formats used all over the source code used for the kernel
+and modules on the running system. It is accessible at runtime via debugfs.
+
+The printk index helps to find changes in the message formats. Also it helps
+to track the strings back to the kernel sources and the related commit.
+
+
+User Interface
+==============
+
+The index of printk formats are split in into separate files. The files are
+named according to the binaries where the printk formats are built-in. There
+is always "vmlinux" and optionally also modules, for example::
+
+ /sys/kernel/debug/printk/index/vmlinux
+ /sys/kernel/debug/printk/index/ext4
+ /sys/kernel/debug/printk/index/scsi_mod
+
+Note that only loaded modules are shown. Also printk formats from a module
+might appear in "vmlinux" when the module is built-in.
+
+The content is inspired by the dynamic debug interface and looks like::
+
+ $> head -1 /sys/kernel/debug/printk/index/vmlinux; shuf -n 5 vmlinux
+ # <level[,flags]> filename:line function "format"
+ <5> block/blk-settings.c:661 disk_stack_limits "%s: Warning: Device %s is misaligned\n"
+ <4> kernel/trace/trace.c:8296 trace_create_file "Could not create tracefs '%s' entry\n"
+ <6> arch/x86/kernel/hpet.c:144 _hpet_print_config "hpet: %s(%d):\n"
+ <6> init/do_mounts.c:605 prepare_namespace "Waiting for root device %s...\n"
+ <6> drivers/acpi/osl.c:1410 acpi_no_auto_serialize_setup "ACPI: auto-serialization disabled\n"
+
+, where the meaning is:
+
+ - :level: log level value: 0-7 for particular severity, -1 as default,
+ 'c' as continuous line without an explicit log level
+ - :flags: optional flags: currently only 'c' for KERN_CONT
+ - :filename\:line: source filename and line number of the related
+ printk() call. Note that there are many wrappers, for example,
+ pr_warn(), pr_warn_once(), dev_warn().
+ - :function: function name where the printk() call is used.
+ - :format: format string
+
+The extra information makes it a bit harder to find differences
+between various kernels. Especially the line number might change
+very often. On the other hand, it helps a lot to confirm that
+it is the same string or find the commit that is responsible
+for eventual changes.
+
+
+printk() Is Not a Stable KABI
+=============================
+
+Several developers are afraid that exporting all these implementation
+details into the user space will transform particular printk() calls
+into KABI.
+
+But it is exactly the opposite. printk() calls must _not_ be KABI.
+And the printk index helps user space tools to deal with this.
+
+
+Subsystem specific printk wrappers
+==================================
+
+The printk index is generated using extra metadata that are stored in
+a dedicated .elf section ".printk_index". It is achieved using macro
+wrappers doing __printk_index_emit() together with the real printk()
+call. The same technique is used also for the metadata used by
+the dynamic debug feature.
+
+The metadata are stored for a particular message only when it is printed
+using these special wrappers. It is implemented for the commonly
+used printk() calls, including, for example, pr_warn(), or pr_once().
+
+Additional changes are necessary for various subsystem specific wrappers
+that call the original printk() via a common helper function. These needs
+their own wrappers adding __printk_index_emit().
+
+Only few subsystem specific wrappers have been updated so far,
+for example, dev_printk(). As a result, the printk formats from
+some subsystes can be missing in the printk index.
+
+
+Subsystem specific prefix
+=========================
+
+The macro pr_fmt() macro allows to define a prefix that is printed
+before the string generated by the related printk() calls.
+
+Subsystem specific wrappers usually add even more complicated
+prefixes.
+
+These prefixes can be stored into the printk index metadata
+by an optional parameter of __printk_index_emit(). The debugfs
+interface might then show the printk formats including these prefixes.
+For example, drivers/acpi/osl.c contains::
+
+ #define pr_fmt(fmt) "ACPI: OSL: " fmt
+
+ static int __init acpi_no_auto_serialize_setup(char *str)
+ {
+ acpi_gbl_auto_serialize_methods = FALSE;
+ pr_info("Auto-serialization disabled\n");
+
+ return 1;
+ }
+
+This results in the following printk index entry::
+
+ <6> drivers/acpi/osl.c:1410 acpi_no_auto_serialize_setup "ACPI: auto-serialization disabled\n"
+
+It helps matching messages from the real log with printk index.
+Then the source file name, line number, and function name can
+be used to match the string with the source code.
diff --git a/Documentation/core-api/protection-keys.rst b/Documentation/core-api/protection-keys.rst
index 49d9833af871..bf28ac0401f3 100644
--- a/Documentation/core-api/protection-keys.rst
+++ b/Documentation/core-api/protection-keys.rst
@@ -4,30 +4,29 @@
Memory Protection Keys
======================
-Memory Protection Keys for Userspace (PKU aka PKEYs) is a feature
-which is found on Intel's Skylake "Scalable Processor" Server CPUs.
-It will be avalable in future non-server parts.
-
-For anyone wishing to test or use this feature, it is available in
-Amazon's EC2 C5 instances and is known to work there using an Ubuntu
-17.04 image.
-
-Memory Protection Keys provides a mechanism for enforcing page-based
-protections, but without requiring modification of the page tables
-when an application changes protection domains. It works by
-dedicating 4 previously ignored bits in each page table entry to a
-"protection key", giving 16 possible keys.
-
-There is also a new user-accessible register (PKRU) with two separate
-bits (Access Disable and Write Disable) for each key. Being a CPU
-register, PKRU is inherently thread-local, potentially giving each
+Memory Protection Keys provide a mechanism for enforcing page-based
+protections, but without requiring modification of the page tables when an
+application changes protection domains.
+
+Pkeys Userspace (PKU) is a feature which can be found on:
+ * Intel server CPUs, Skylake and later
+ * Intel client CPUs, Tiger Lake (11th Gen Core) and later
+ * Future AMD CPUs
+
+Pkeys work by dedicating 4 previously Reserved bits in each page table entry to
+a "protection key", giving 16 possible keys.
+
+Protections for each key are defined with a per-CPU user-accessible register
+(PKRU). Each of these is a 32-bit register storing two bits (Access Disable
+and Write Disable) for each of 16 keys.
+
+Being a CPU register, PKRU is inherently thread-local, potentially giving each
thread a different set of protections from every other thread.
-There are two new instructions (RDPKRU/WRPKRU) for reading and writing
-to the new register. The feature is only available in 64-bit mode,
-even though there is theoretically space in the PAE PTEs. These
-permissions are enforced on data access only and have no effect on
-instruction fetches.
+There are two instructions (RDPKRU/WRPKRU) for reading and writing to the
+register. The feature is only available in 64-bit mode, even though there is
+theoretically space in the PAE PTEs. These permissions are enforced on data
+access only and have no effect on instruction fetches.
Syscalls
========
diff --git a/Documentation/core-api/rbtree.rst b/Documentation/core-api/rbtree.rst
new file mode 100644
index 000000000000..ed1a9fbc779e
--- /dev/null
+++ b/Documentation/core-api/rbtree.rst
@@ -0,0 +1,429 @@
+=================================
+Red-black Trees (rbtree) in Linux
+=================================
+
+
+:Date: January 18, 2007
+:Author: Rob Landley <rob@landley.net>
+
+What are red-black trees, and what are they for?
+------------------------------------------------
+
+Red-black trees are a type of self-balancing binary search tree, used for
+storing sortable key/value data pairs. This differs from radix trees (which
+are used to efficiently store sparse arrays and thus use long integer indexes
+to insert/access/delete nodes) and hash tables (which are not kept sorted to
+be easily traversed in order, and must be tuned for a specific size and
+hash function where rbtrees scale gracefully storing arbitrary keys).
+
+Red-black trees are similar to AVL trees, but provide faster real-time bounded
+worst case performance for insertion and deletion (at most two rotations and
+three rotations, respectively, to balance the tree), with slightly slower
+(but still O(log n)) lookup time.
+
+To quote Linux Weekly News:
+
+ There are a number of red-black trees in use in the kernel.
+ The deadline and CFQ I/O schedulers employ rbtrees to
+ track requests; the packet CD/DVD driver does the same.
+ The high-resolution timer code uses an rbtree to organize outstanding
+ timer requests. The ext3 filesystem tracks directory entries in a
+ red-black tree. Virtual memory areas (VMAs) are tracked with red-black
+ trees, as are epoll file descriptors, cryptographic keys, and network
+ packets in the "hierarchical token bucket" scheduler.
+
+This document covers use of the Linux rbtree implementation. For more
+information on the nature and implementation of Red Black Trees, see:
+
+ Linux Weekly News article on red-black trees
+ https://lwn.net/Articles/184495/
+
+ Wikipedia entry on red-black trees
+ https://en.wikipedia.org/wiki/Red-black_tree
+
+Linux implementation of red-black trees
+---------------------------------------
+
+Linux's rbtree implementation lives in the file "lib/rbtree.c". To use it,
+"#include <linux/rbtree.h>".
+
+The Linux rbtree implementation is optimized for speed, and thus has one
+less layer of indirection (and better cache locality) than more traditional
+tree implementations. Instead of using pointers to separate rb_node and data
+structures, each instance of struct rb_node is embedded in the data structure
+it organizes. And instead of using a comparison callback function pointer,
+users are expected to write their own tree search and insert functions
+which call the provided rbtree functions. Locking is also left up to the
+user of the rbtree code.
+
+Creating a new rbtree
+---------------------
+
+Data nodes in an rbtree tree are structures containing a struct rb_node member::
+
+ struct mytype {
+ struct rb_node node;
+ char *keystring;
+ };
+
+When dealing with a pointer to the embedded struct rb_node, the containing data
+structure may be accessed with the standard container_of() macro. In addition,
+individual members may be accessed directly via rb_entry(node, type, member).
+
+At the root of each rbtree is an rb_root structure, which is initialized to be
+empty via:
+
+ struct rb_root mytree = RB_ROOT;
+
+Searching for a value in an rbtree
+----------------------------------
+
+Writing a search function for your tree is fairly straightforward: start at the
+root, compare each value, and follow the left or right branch as necessary.
+
+Example::
+
+ struct mytype *my_search(struct rb_root *root, char *string)
+ {
+ struct rb_node *node = root->rb_node;
+
+ while (node) {
+ struct mytype *data = container_of(node, struct mytype, node);
+ int result;
+
+ result = strcmp(string, data->keystring);
+
+ if (result < 0)
+ node = node->rb_left;
+ else if (result > 0)
+ node = node->rb_right;
+ else
+ return data;
+ }
+ return NULL;
+ }
+
+Inserting data into an rbtree
+-----------------------------
+
+Inserting data in the tree involves first searching for the place to insert the
+new node, then inserting the node and rebalancing ("recoloring") the tree.
+
+The search for insertion differs from the previous search by finding the
+location of the pointer on which to graft the new node. The new node also
+needs a link to its parent node for rebalancing purposes.
+
+Example::
+
+ int my_insert(struct rb_root *root, struct mytype *data)
+ {
+ struct rb_node **new = &(root->rb_node), *parent = NULL;
+
+ /* Figure out where to put new node */
+ while (*new) {
+ struct mytype *this = container_of(*new, struct mytype, node);
+ int result = strcmp(data->keystring, this->keystring);
+
+ parent = *new;
+ if (result < 0)
+ new = &((*new)->rb_left);
+ else if (result > 0)
+ new = &((*new)->rb_right);
+ else
+ return FALSE;
+ }
+
+ /* Add new node and rebalance tree. */
+ rb_link_node(&data->node, parent, new);
+ rb_insert_color(&data->node, root);
+
+ return TRUE;
+ }
+
+Removing or replacing existing data in an rbtree
+------------------------------------------------
+
+To remove an existing node from a tree, call::
+
+ void rb_erase(struct rb_node *victim, struct rb_root *tree);
+
+Example::
+
+ struct mytype *data = mysearch(&mytree, "walrus");
+
+ if (data) {
+ rb_erase(&data->node, &mytree);
+ myfree(data);
+ }
+
+To replace an existing node in a tree with a new one with the same key, call::
+
+ void rb_replace_node(struct rb_node *old, struct rb_node *new,
+ struct rb_root *tree);
+
+Replacing a node this way does not re-sort the tree: If the new node doesn't
+have the same key as the old node, the rbtree will probably become corrupted.
+
+Iterating through the elements stored in an rbtree (in sort order)
+------------------------------------------------------------------
+
+Four functions are provided for iterating through an rbtree's contents in
+sorted order. These work on arbitrary trees, and should not need to be
+modified or wrapped (except for locking purposes)::
+
+ struct rb_node *rb_first(struct rb_root *tree);
+ struct rb_node *rb_last(struct rb_root *tree);
+ struct rb_node *rb_next(struct rb_node *node);
+ struct rb_node *rb_prev(struct rb_node *node);
+
+To start iterating, call rb_first() or rb_last() with a pointer to the root
+of the tree, which will return a pointer to the node structure contained in
+the first or last element in the tree. To continue, fetch the next or previous
+node by calling rb_next() or rb_prev() on the current node. This will return
+NULL when there are no more nodes left.
+
+The iterator functions return a pointer to the embedded struct rb_node, from
+which the containing data structure may be accessed with the container_of()
+macro, and individual members may be accessed directly via
+rb_entry(node, type, member).
+
+Example::
+
+ struct rb_node *node;
+ for (node = rb_first(&mytree); node; node = rb_next(node))
+ printk("key=%s\n", rb_entry(node, struct mytype, node)->keystring);
+
+Cached rbtrees
+--------------
+
+Computing the leftmost (smallest) node is quite a common task for binary
+search trees, such as for traversals or users relying on a the particular
+order for their own logic. To this end, users can use 'struct rb_root_cached'
+to optimize O(logN) rb_first() calls to a simple pointer fetch avoiding
+potentially expensive tree iterations. This is done at negligible runtime
+overhead for maintenance; albeit larger memory footprint.
+
+Similar to the rb_root structure, cached rbtrees are initialized to be
+empty via::
+
+ struct rb_root_cached mytree = RB_ROOT_CACHED;
+
+Cached rbtree is simply a regular rb_root with an extra pointer to cache the
+leftmost node. This allows rb_root_cached to exist wherever rb_root does,
+which permits augmented trees to be supported as well as only a few extra
+interfaces::
+
+ struct rb_node *rb_first_cached(struct rb_root_cached *tree);
+ void rb_insert_color_cached(struct rb_node *, struct rb_root_cached *, bool);
+ void rb_erase_cached(struct rb_node *node, struct rb_root_cached *);
+
+Both insert and erase calls have their respective counterpart of augmented
+trees::
+
+ void rb_insert_augmented_cached(struct rb_node *node, struct rb_root_cached *,
+ bool, struct rb_augment_callbacks *);
+ void rb_erase_augmented_cached(struct rb_node *, struct rb_root_cached *,
+ struct rb_augment_callbacks *);
+
+
+Support for Augmented rbtrees
+-----------------------------
+
+Augmented rbtree is an rbtree with "some" additional data stored in
+each node, where the additional data for node N must be a function of
+the contents of all nodes in the subtree rooted at N. This data can
+be used to augment some new functionality to rbtree. Augmented rbtree
+is an optional feature built on top of basic rbtree infrastructure.
+An rbtree user who wants this feature will have to call the augmentation
+functions with the user provided augmentation callback when inserting
+and erasing nodes.
+
+C files implementing augmented rbtree manipulation must include
+<linux/rbtree_augmented.h> instead of <linux/rbtree.h>. Note that
+linux/rbtree_augmented.h exposes some rbtree implementations details
+you are not expected to rely on; please stick to the documented APIs
+there and do not include <linux/rbtree_augmented.h> from header files
+either so as to minimize chances of your users accidentally relying on
+such implementation details.
+
+On insertion, the user must update the augmented information on the path
+leading to the inserted node, then call rb_link_node() as usual and
+rb_augment_inserted() instead of the usual rb_insert_color() call.
+If rb_augment_inserted() rebalances the rbtree, it will callback into
+a user provided function to update the augmented information on the
+affected subtrees.
+
+When erasing a node, the user must call rb_erase_augmented() instead of
+rb_erase(). rb_erase_augmented() calls back into user provided functions
+to updated the augmented information on affected subtrees.
+
+In both cases, the callbacks are provided through struct rb_augment_callbacks.
+3 callbacks must be defined:
+
+- A propagation callback, which updates the augmented value for a given
+ node and its ancestors, up to a given stop point (or NULL to update
+ all the way to the root).
+
+- A copy callback, which copies the augmented value for a given subtree
+ to a newly assigned subtree root.
+
+- A tree rotation callback, which copies the augmented value for a given
+ subtree to a newly assigned subtree root AND recomputes the augmented
+ information for the former subtree root.
+
+The compiled code for rb_erase_augmented() may inline the propagation and
+copy callbacks, which results in a large function, so each augmented rbtree
+user should have a single rb_erase_augmented() call site in order to limit
+compiled code size.
+
+
+Sample usage
+^^^^^^^^^^^^
+
+Interval tree is an example of augmented rb tree. Reference -
+"Introduction to Algorithms" by Cormen, Leiserson, Rivest and Stein.
+More details about interval trees:
+
+Classical rbtree has a single key and it cannot be directly used to store
+interval ranges like [lo:hi] and do a quick lookup for any overlap with a new
+lo:hi or to find whether there is an exact match for a new lo:hi.
+
+However, rbtree can be augmented to store such interval ranges in a structured
+way making it possible to do efficient lookup and exact match.
+
+This "extra information" stored in each node is the maximum hi
+(max_hi) value among all the nodes that are its descendants. This
+information can be maintained at each node just be looking at the node
+and its immediate children. And this will be used in O(log n) lookup
+for lowest match (lowest start address among all possible matches)
+with something like::
+
+ struct interval_tree_node *
+ interval_tree_first_match(struct rb_root *root,
+ unsigned long start, unsigned long last)
+ {
+ struct interval_tree_node *node;
+
+ if (!root->rb_node)
+ return NULL;
+ node = rb_entry(root->rb_node, struct interval_tree_node, rb);
+
+ while (true) {
+ if (node->rb.rb_left) {
+ struct interval_tree_node *left =
+ rb_entry(node->rb.rb_left,
+ struct interval_tree_node, rb);
+ if (left->__subtree_last >= start) {
+ /*
+ * Some nodes in left subtree satisfy Cond2.
+ * Iterate to find the leftmost such node N.
+ * If it also satisfies Cond1, that's the match
+ * we are looking for. Otherwise, there is no
+ * matching interval as nodes to the right of N
+ * can't satisfy Cond1 either.
+ */
+ node = left;
+ continue;
+ }
+ }
+ if (node->start <= last) { /* Cond1 */
+ if (node->last >= start) /* Cond2 */
+ return node; /* node is leftmost match */
+ if (node->rb.rb_right) {
+ node = rb_entry(node->rb.rb_right,
+ struct interval_tree_node, rb);
+ if (node->__subtree_last >= start)
+ continue;
+ }
+ }
+ return NULL; /* No match */
+ }
+ }
+
+Insertion/removal are defined using the following augmented callbacks::
+
+ static inline unsigned long
+ compute_subtree_last(struct interval_tree_node *node)
+ {
+ unsigned long max = node->last, subtree_last;
+ if (node->rb.rb_left) {
+ subtree_last = rb_entry(node->rb.rb_left,
+ struct interval_tree_node, rb)->__subtree_last;
+ if (max < subtree_last)
+ max = subtree_last;
+ }
+ if (node->rb.rb_right) {
+ subtree_last = rb_entry(node->rb.rb_right,
+ struct interval_tree_node, rb)->__subtree_last;
+ if (max < subtree_last)
+ max = subtree_last;
+ }
+ return max;
+ }
+
+ static void augment_propagate(struct rb_node *rb, struct rb_node *stop)
+ {
+ while (rb != stop) {
+ struct interval_tree_node *node =
+ rb_entry(rb, struct interval_tree_node, rb);
+ unsigned long subtree_last = compute_subtree_last(node);
+ if (node->__subtree_last == subtree_last)
+ break;
+ node->__subtree_last = subtree_last;
+ rb = rb_parent(&node->rb);
+ }
+ }
+
+ static void augment_copy(struct rb_node *rb_old, struct rb_node *rb_new)
+ {
+ struct interval_tree_node *old =
+ rb_entry(rb_old, struct interval_tree_node, rb);
+ struct interval_tree_node *new =
+ rb_entry(rb_new, struct interval_tree_node, rb);
+
+ new->__subtree_last = old->__subtree_last;
+ }
+
+ static void augment_rotate(struct rb_node *rb_old, struct rb_node *rb_new)
+ {
+ struct interval_tree_node *old =
+ rb_entry(rb_old, struct interval_tree_node, rb);
+ struct interval_tree_node *new =
+ rb_entry(rb_new, struct interval_tree_node, rb);
+
+ new->__subtree_last = old->__subtree_last;
+ old->__subtree_last = compute_subtree_last(old);
+ }
+
+ static const struct rb_augment_callbacks augment_callbacks = {
+ augment_propagate, augment_copy, augment_rotate
+ };
+
+ void interval_tree_insert(struct interval_tree_node *node,
+ struct rb_root *root)
+ {
+ struct rb_node **link = &root->rb_node, *rb_parent = NULL;
+ unsigned long start = node->start, last = node->last;
+ struct interval_tree_node *parent;
+
+ while (*link) {
+ rb_parent = *link;
+ parent = rb_entry(rb_parent, struct interval_tree_node, rb);
+ if (parent->__subtree_last < last)
+ parent->__subtree_last = last;
+ if (start < parent->start)
+ link = &parent->rb.rb_left;
+ else
+ link = &parent->rb.rb_right;
+ }
+
+ node->__subtree_last = last;
+ rb_link_node(&node->rb, rb_parent, link);
+ rb_insert_augmented(&node->rb, root, &augment_callbacks);
+ }
+
+ void interval_tree_remove(struct interval_tree_node *node,
+ struct rb_root *root)
+ {
+ rb_erase_augmented(&node->rb, root, &augment_callbacks);
+ }
diff --git a/Documentation/core-api/symbol-namespaces.rst b/Documentation/core-api/symbol-namespaces.rst
index 9b76337f6756..12e4aecdae94 100644
--- a/Documentation/core-api/symbol-namespaces.rst
+++ b/Documentation/core-api/symbol-namespaces.rst
@@ -43,16 +43,16 @@ exporting of kernel symbols to the kernel symbol table, variants of these are
available to export symbols into a certain namespace: EXPORT_SYMBOL_NS() and
EXPORT_SYMBOL_NS_GPL(). They take one additional argument: the namespace.
Please note that due to macro expansion that argument needs to be a
-preprocessor symbol. E.g. to export the symbol `usb_stor_suspend` into the
-namespace `USB_STORAGE`, use::
+preprocessor symbol. E.g. to export the symbol ``usb_stor_suspend`` into the
+namespace ``USB_STORAGE``, use::
EXPORT_SYMBOL_NS(usb_stor_suspend, USB_STORAGE);
-The corresponding ksymtab entry struct `kernel_symbol` will have the member
-`namespace` set accordingly. A symbol that is exported without a namespace will
-refer to `NULL`. There is no default namespace if none is defined. `modpost`
-and kernel/module.c make use the namespace at build time or module load time,
-respectively.
+The corresponding ksymtab entry struct ``kernel_symbol`` will have the member
+``namespace`` set accordingly. A symbol that is exported without a namespace will
+refer to ``NULL``. There is no default namespace if none is defined. ``modpost``
+and kernel/module/main.c make use the namespace at build time or module load
+time, respectively.
2.2 Using the DEFAULT_SYMBOL_NAMESPACE define
=============================================
@@ -64,7 +64,7 @@ and EXPORT_SYMBOL_GPL() macro expansions that do not specify a namespace.
There are multiple ways of specifying this define and it depends on the
subsystem and the maintainer's preference, which one to use. The first option
-is to define the default namespace in the `Makefile` of the subsystem. E.g. to
+is to define the default namespace in the ``Makefile`` of the subsystem. E.g. to
export all symbols defined in usb-common into the namespace USB_COMMON, add a
line like this to drivers/usb/common/Makefile::
@@ -96,7 +96,7 @@ using a statement like::
MODULE_IMPORT_NS(USB_STORAGE);
-This will create a `modinfo` tag in the module for each imported namespace.
+This will create a ``modinfo`` tag in the module for each imported namespace.
This has the side effect, that the imported namespaces of a module can be
inspected with modinfo::
@@ -113,7 +113,7 @@ metadata definitions like MODULE_AUTHOR() or MODULE_LICENSE(). Refer to section
4. Loading Modules that use namespaced Symbols
==============================================
-At module loading time (e.g. `insmod`), the kernel will check each symbol
+At module loading time (e.g. ``insmod``), the kernel will check each symbol
referenced from the module for its availability and whether the namespace it
might be exported to has been imported by the module. The default behaviour of
the kernel is to reject loading modules that don't specify sufficient imports.
@@ -138,19 +138,19 @@ missing imports. Fixing missing imports can be done with::
A typical scenario for module authors would be::
- write code that depends on a symbol from a not imported namespace
- - `make`
+ - ``make``
- notice the warning of modpost telling about a missing import
- - run `make nsdeps` to add the import to the correct code location
+ - run ``make nsdeps`` to add the import to the correct code location
For subsystem maintainers introducing a namespace, the steps are very similar.
-Again, `make nsdeps` will eventually add the missing namespace imports for
+Again, ``make nsdeps`` will eventually add the missing namespace imports for
in-tree modules::
- move or add symbols to a namespace (e.g. with EXPORT_SYMBOL_NS())
- - `make` (preferably with an allmodconfig to cover all in-kernel
+ - ``make`` (preferably with an allmodconfig to cover all in-kernel
modules)
- notice the warning of modpost telling about a missing import
- - run `make nsdeps` to add the import to the correct code location
+ - run ``make nsdeps`` to add the import to the correct code location
You can also run nsdeps for external module builds. A typical usage is::
diff --git a/Documentation/core-api/this_cpu_ops.rst b/Documentation/core-api/this_cpu_ops.rst
new file mode 100644
index 000000000000..91acbcf30e9b
--- /dev/null
+++ b/Documentation/core-api/this_cpu_ops.rst
@@ -0,0 +1,337 @@
+===================
+this_cpu operations
+===================
+
+:Author: Christoph Lameter, August 4th, 2014
+:Author: Pranith Kumar, Aug 2nd, 2014
+
+this_cpu operations are a way of optimizing access to per cpu
+variables associated with the *currently* executing processor. This is
+done through the use of segment registers (or a dedicated register where
+the cpu permanently stored the beginning of the per cpu area for a
+specific processor).
+
+this_cpu operations add a per cpu variable offset to the processor
+specific per cpu base and encode that operation in the instruction
+operating on the per cpu variable.
+
+This means that there are no atomicity issues between the calculation of
+the offset and the operation on the data. Therefore it is not
+necessary to disable preemption or interrupts to ensure that the
+processor is not changed between the calculation of the address and
+the operation on the data.
+
+Read-modify-write operations are of particular interest. Frequently
+processors have special lower latency instructions that can operate
+without the typical synchronization overhead, but still provide some
+sort of relaxed atomicity guarantees. The x86, for example, can execute
+RMW (Read Modify Write) instructions like inc/dec/cmpxchg without the
+lock prefix and the associated latency penalty.
+
+Access to the variable without the lock prefix is not synchronized but
+synchronization is not necessary since we are dealing with per cpu
+data specific to the currently executing processor. Only the current
+processor should be accessing that variable and therefore there are no
+concurrency issues with other processors in the system.
+
+Please note that accesses by remote processors to a per cpu area are
+exceptional situations and may impact performance and/or correctness
+(remote write operations) of local RMW operations via this_cpu_*.
+
+The main use of the this_cpu operations has been to optimize counter
+operations.
+
+The following this_cpu() operations with implied preemption protection
+are defined. These operations can be used without worrying about
+preemption and interrupts::
+
+ this_cpu_read(pcp)
+ this_cpu_write(pcp, val)
+ this_cpu_add(pcp, val)
+ this_cpu_and(pcp, val)
+ this_cpu_or(pcp, val)
+ this_cpu_add_return(pcp, val)
+ this_cpu_xchg(pcp, nval)
+ this_cpu_cmpxchg(pcp, oval, nval)
+ this_cpu_sub(pcp, val)
+ this_cpu_inc(pcp)
+ this_cpu_dec(pcp)
+ this_cpu_sub_return(pcp, val)
+ this_cpu_inc_return(pcp)
+ this_cpu_dec_return(pcp)
+
+
+Inner working of this_cpu operations
+------------------------------------
+
+On x86 the fs: or the gs: segment registers contain the base of the
+per cpu area. It is then possible to simply use the segment override
+to relocate a per cpu relative address to the proper per cpu area for
+the processor. So the relocation to the per cpu base is encoded in the
+instruction via a segment register prefix.
+
+For example::
+
+ DEFINE_PER_CPU(int, x);
+ int z;
+
+ z = this_cpu_read(x);
+
+results in a single instruction::
+
+ mov ax, gs:[x]
+
+instead of a sequence of calculation of the address and then a fetch
+from that address which occurs with the per cpu operations. Before
+this_cpu_ops such sequence also required preempt disable/enable to
+prevent the kernel from moving the thread to a different processor
+while the calculation is performed.
+
+Consider the following this_cpu operation::
+
+ this_cpu_inc(x)
+
+The above results in the following single instruction (no lock prefix!)::
+
+ inc gs:[x]
+
+instead of the following operations required if there is no segment
+register::
+
+ int *y;
+ int cpu;
+
+ cpu = get_cpu();
+ y = per_cpu_ptr(&x, cpu);
+ (*y)++;
+ put_cpu();
+
+Note that these operations can only be used on per cpu data that is
+reserved for a specific processor. Without disabling preemption in the
+surrounding code this_cpu_inc() will only guarantee that one of the
+per cpu counters is correctly incremented. However, there is no
+guarantee that the OS will not move the process directly before or
+after the this_cpu instruction is executed. In general this means that
+the value of the individual counters for each processor are
+meaningless. The sum of all the per cpu counters is the only value
+that is of interest.
+
+Per cpu variables are used for performance reasons. Bouncing cache
+lines can be avoided if multiple processors concurrently go through
+the same code paths. Since each processor has its own per cpu
+variables no concurrent cache line updates take place. The price that
+has to be paid for this optimization is the need to add up the per cpu
+counters when the value of a counter is needed.
+
+
+Special operations
+------------------
+
+::
+
+ y = this_cpu_ptr(&x)
+
+Takes the offset of a per cpu variable (&x !) and returns the address
+of the per cpu variable that belongs to the currently executing
+processor. this_cpu_ptr avoids multiple steps that the common
+get_cpu/put_cpu sequence requires. No processor number is
+available. Instead, the offset of the local per cpu area is simply
+added to the per cpu offset.
+
+Note that this operation is usually used in a code segment when
+preemption has been disabled. The pointer is then used to
+access local per cpu data in a critical section. When preemption
+is re-enabled this pointer is usually no longer useful since it may
+no longer point to per cpu data of the current processor.
+
+
+Per cpu variables and offsets
+-----------------------------
+
+Per cpu variables have *offsets* to the beginning of the per cpu
+area. They do not have addresses although they look like that in the
+code. Offsets cannot be directly dereferenced. The offset must be
+added to a base pointer of a per cpu area of a processor in order to
+form a valid address.
+
+Therefore the use of x or &x outside of the context of per cpu
+operations is invalid and will generally be treated like a NULL
+pointer dereference.
+
+::
+
+ DEFINE_PER_CPU(int, x);
+
+In the context of per cpu operations the above implies that x is a per
+cpu variable. Most this_cpu operations take a cpu variable.
+
+::
+
+ int __percpu *p = &x;
+
+&x and hence p is the *offset* of a per cpu variable. this_cpu_ptr()
+takes the offset of a per cpu variable which makes this look a bit
+strange.
+
+
+Operations on a field of a per cpu structure
+--------------------------------------------
+
+Let's say we have a percpu structure::
+
+ struct s {
+ int n,m;
+ };
+
+ DEFINE_PER_CPU(struct s, p);
+
+
+Operations on these fields are straightforward::
+
+ this_cpu_inc(p.m)
+
+ z = this_cpu_cmpxchg(p.m, 0, 1);
+
+
+If we have an offset to struct s::
+
+ struct s __percpu *ps = &p;
+
+ this_cpu_dec(ps->m);
+
+ z = this_cpu_inc_return(ps->n);
+
+
+The calculation of the pointer may require the use of this_cpu_ptr()
+if we do not make use of this_cpu ops later to manipulate fields::
+
+ struct s *pp;
+
+ pp = this_cpu_ptr(&p);
+
+ pp->m--;
+
+ z = pp->n++;
+
+
+Variants of this_cpu ops
+------------------------
+
+this_cpu ops are interrupt safe. Some architectures do not support
+these per cpu local operations. In that case the operation must be
+replaced by code that disables interrupts, then does the operations
+that are guaranteed to be atomic and then re-enable interrupts. Doing
+so is expensive. If there are other reasons why the scheduler cannot
+change the processor we are executing on then there is no reason to
+disable interrupts. For that purpose the following __this_cpu operations
+are provided.
+
+These operations have no guarantee against concurrent interrupts or
+preemption. If a per cpu variable is not used in an interrupt context
+and the scheduler cannot preempt, then they are safe. If any interrupts
+still occur while an operation is in progress and if the interrupt too
+modifies the variable, then RMW actions can not be guaranteed to be
+safe::
+
+ __this_cpu_read(pcp)
+ __this_cpu_write(pcp, val)
+ __this_cpu_add(pcp, val)
+ __this_cpu_and(pcp, val)
+ __this_cpu_or(pcp, val)
+ __this_cpu_add_return(pcp, val)
+ __this_cpu_xchg(pcp, nval)
+ __this_cpu_cmpxchg(pcp, oval, nval)
+ __this_cpu_sub(pcp, val)
+ __this_cpu_inc(pcp)
+ __this_cpu_dec(pcp)
+ __this_cpu_sub_return(pcp, val)
+ __this_cpu_inc_return(pcp)
+ __this_cpu_dec_return(pcp)
+
+
+Will increment x and will not fall-back to code that disables
+interrupts on platforms that cannot accomplish atomicity through
+address relocation and a Read-Modify-Write operation in the same
+instruction.
+
+
+&this_cpu_ptr(pp)->n vs this_cpu_ptr(&pp->n)
+--------------------------------------------
+
+The first operation takes the offset and forms an address and then
+adds the offset of the n field. This may result in two add
+instructions emitted by the compiler.
+
+The second one first adds the two offsets and then does the
+relocation. IMHO the second form looks cleaner and has an easier time
+with (). The second form also is consistent with the way
+this_cpu_read() and friends are used.
+
+
+Remote access to per cpu data
+------------------------------
+
+Per cpu data structures are designed to be used by one cpu exclusively.
+If you use the variables as intended, this_cpu_ops() are guaranteed to
+be "atomic" as no other CPU has access to these data structures.
+
+There are special cases where you might need to access per cpu data
+structures remotely. It is usually safe to do a remote read access
+and that is frequently done to summarize counters. Remote write access
+something which could be problematic because this_cpu ops do not
+have lock semantics. A remote write may interfere with a this_cpu
+RMW operation.
+
+Remote write accesses to percpu data structures are highly discouraged
+unless absolutely necessary. Please consider using an IPI to wake up
+the remote CPU and perform the update to its per cpu area.
+
+To access per-cpu data structure remotely, typically the per_cpu_ptr()
+function is used::
+
+
+ DEFINE_PER_CPU(struct data, datap);
+
+ struct data *p = per_cpu_ptr(&datap, cpu);
+
+This makes it explicit that we are getting ready to access a percpu
+area remotely.
+
+You can also do the following to convert the datap offset to an address::
+
+ struct data *p = this_cpu_ptr(&datap);
+
+but, passing of pointers calculated via this_cpu_ptr to other cpus is
+unusual and should be avoided.
+
+Remote access are typically only for reading the status of another cpus
+per cpu data. Write accesses can cause unique problems due to the
+relaxed synchronization requirements for this_cpu operations.
+
+One example that illustrates some concerns with write operations is
+the following scenario that occurs because two per cpu variables
+share a cache-line but the relaxed synchronization is applied to
+only one process updating the cache-line.
+
+Consider the following example::
+
+
+ struct test {
+ atomic_t a;
+ int b;
+ };
+
+ DEFINE_PER_CPU(struct test, onecacheline);
+
+There is some concern about what would happen if the field 'a' is updated
+remotely from one processor and the local processor would use this_cpu ops
+to update field b. Care should be taken that such simultaneous accesses to
+data within the same cache line are avoided. Also costly synchronization
+may be necessary. IPIs are generally recommended in such scenarios instead
+of a remote write to the per cpu area of another processor.
+
+Even in cases where the remote writes are rare, please bear in
+mind that a remote write will evict the cache line from the processor
+that most likely will access it. If the processor wakes up and finds a
+missing local cache line of a per cpu area, its performance and hence
+the wake up times will be affected.
diff --git a/Documentation/core-api/timekeeping.rst b/Documentation/core-api/timekeeping.rst
index 729e24864fe7..22ec68f24421 100644
--- a/Documentation/core-api/timekeeping.rst
+++ b/Documentation/core-api/timekeeping.rst
@@ -132,6 +132,7 @@ Some additional variants exist for more specialized cases:
.. c:function:: u64 ktime_get_mono_fast_ns( void )
u64 ktime_get_raw_fast_ns( void )
u64 ktime_get_boot_fast_ns( void )
+ u64 ktime_get_tai_fast_ns( void )
u64 ktime_get_real_fast_ns( void )
These variants are safe to call from any context, including from
diff --git a/Documentation/core-api/unaligned-memory-access.rst b/Documentation/core-api/unaligned-memory-access.rst
new file mode 100644
index 000000000000..1ee82419d8aa
--- /dev/null
+++ b/Documentation/core-api/unaligned-memory-access.rst
@@ -0,0 +1,265 @@
+=========================
+Unaligned Memory Accesses
+=========================
+
+:Author: Daniel Drake <dsd@gentoo.org>,
+:Author: Johannes Berg <johannes@sipsolutions.net>
+
+:With help from: Alan Cox, Avuton Olrich, Heikki Orsila, Jan Engelhardt,
+ Kyle McMartin, Kyle Moffett, Randy Dunlap, Robert Hancock, Uli Kunitz,
+ Vadim Lobanov
+
+
+Linux runs on a wide variety of architectures which have varying behaviour
+when it comes to memory access. This document presents some details about
+unaligned accesses, why you need to write code that doesn't cause them,
+and how to write such code!
+
+
+The definition of an unaligned access
+=====================================
+
+Unaligned memory accesses occur when you try to read N bytes of data starting
+from an address that is not evenly divisible by N (i.e. addr % N != 0).
+For example, reading 4 bytes of data from address 0x10004 is fine, but
+reading 4 bytes of data from address 0x10005 would be an unaligned memory
+access.
+
+The above may seem a little vague, as memory access can happen in different
+ways. The context here is at the machine code level: certain instructions read
+or write a number of bytes to or from memory (e.g. movb, movw, movl in x86
+assembly). As will become clear, it is relatively easy to spot C statements
+which will compile to multiple-byte memory access instructions, namely when
+dealing with types such as u16, u32 and u64.
+
+
+Natural alignment
+=================
+
+The rule mentioned above forms what we refer to as natural alignment:
+When accessing N bytes of memory, the base memory address must be evenly
+divisible by N, i.e. addr % N == 0.
+
+When writing code, assume the target architecture has natural alignment
+requirements.
+
+In reality, only a few architectures require natural alignment on all sizes
+of memory access. However, we must consider ALL supported architectures;
+writing code that satisfies natural alignment requirements is the easiest way
+to achieve full portability.
+
+
+Why unaligned access is bad
+===========================
+
+The effects of performing an unaligned memory access vary from architecture
+to architecture. It would be easy to write a whole document on the differences
+here; a summary of the common scenarios is presented below:
+
+ - Some architectures are able to perform unaligned memory accesses
+ transparently, but there is usually a significant performance cost.
+ - Some architectures raise processor exceptions when unaligned accesses
+ happen. The exception handler is able to correct the unaligned access,
+ at significant cost to performance.
+ - Some architectures raise processor exceptions when unaligned accesses
+ happen, but the exceptions do not contain enough information for the
+ unaligned access to be corrected.
+ - Some architectures are not capable of unaligned memory access, but will
+ silently perform a different memory access to the one that was requested,
+ resulting in a subtle code bug that is hard to detect!
+
+It should be obvious from the above that if your code causes unaligned
+memory accesses to happen, your code will not work correctly on certain
+platforms and will cause performance problems on others.
+
+
+Code that does not cause unaligned access
+=========================================
+
+At first, the concepts above may seem a little hard to relate to actual
+coding practice. After all, you don't have a great deal of control over
+memory addresses of certain variables, etc.
+
+Fortunately things are not too complex, as in most cases, the compiler
+ensures that things will work for you. For example, take the following
+structure::
+
+ struct foo {
+ u16 field1;
+ u32 field2;
+ u8 field3;
+ };
+
+Let us assume that an instance of the above structure resides in memory
+starting at address 0x10000. With a basic level of understanding, it would
+not be unreasonable to expect that accessing field2 would cause an unaligned
+access. You'd be expecting field2 to be located at offset 2 bytes into the
+structure, i.e. address 0x10002, but that address is not evenly divisible
+by 4 (remember, we're reading a 4 byte value here).
+
+Fortunately, the compiler understands the alignment constraints, so in the
+above case it would insert 2 bytes of padding in between field1 and field2.
+Therefore, for standard structure types you can always rely on the compiler
+to pad structures so that accesses to fields are suitably aligned (assuming
+you do not cast the field to a type of different length).
+
+Similarly, you can also rely on the compiler to align variables and function
+parameters to a naturally aligned scheme, based on the size of the type of
+the variable.
+
+At this point, it should be clear that accessing a single byte (u8 or char)
+will never cause an unaligned access, because all memory addresses are evenly
+divisible by one.
+
+On a related topic, with the above considerations in mind you may observe
+that you could reorder the fields in the structure in order to place fields
+where padding would otherwise be inserted, and hence reduce the overall
+resident memory size of structure instances. The optimal layout of the
+above example is::
+
+ struct foo {
+ u32 field2;
+ u16 field1;
+ u8 field3;
+ };
+
+For a natural alignment scheme, the compiler would only have to add a single
+byte of padding at the end of the structure. This padding is added in order
+to satisfy alignment constraints for arrays of these structures.
+
+Another point worth mentioning is the use of __attribute__((packed)) on a
+structure type. This GCC-specific attribute tells the compiler never to
+insert any padding within structures, useful when you want to use a C struct
+to represent some data that comes in a fixed arrangement 'off the wire'.
+
+You might be inclined to believe that usage of this attribute can easily
+lead to unaligned accesses when accessing fields that do not satisfy
+architectural alignment requirements. However, again, the compiler is aware
+of the alignment constraints and will generate extra instructions to perform
+the memory access in a way that does not cause unaligned access. Of course,
+the extra instructions obviously cause a loss in performance compared to the
+non-packed case, so the packed attribute should only be used when avoiding
+structure padding is of importance.
+
+
+Code that causes unaligned access
+=================================
+
+With the above in mind, let's move onto a real life example of a function
+that can cause an unaligned memory access. The following function taken
+from include/linux/etherdevice.h is an optimized routine to compare two
+ethernet MAC addresses for equality::
+
+ bool ether_addr_equal(const u8 *addr1, const u8 *addr2)
+ {
+ #ifdef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
+ u32 fold = ((*(const u32 *)addr1) ^ (*(const u32 *)addr2)) |
+ ((*(const u16 *)(addr1 + 4)) ^ (*(const u16 *)(addr2 + 4)));
+
+ return fold == 0;
+ #else
+ const u16 *a = (const u16 *)addr1;
+ const u16 *b = (const u16 *)addr2;
+ return ((a[0] ^ b[0]) | (a[1] ^ b[1]) | (a[2] ^ b[2])) == 0;
+ #endif
+ }
+
+In the above function, when the hardware has efficient unaligned access
+capability, there is no issue with this code. But when the hardware isn't
+able to access memory on arbitrary boundaries, the reference to a[0] causes
+2 bytes (16 bits) to be read from memory starting at address addr1.
+
+Think about what would happen if addr1 was an odd address such as 0x10003.
+(Hint: it'd be an unaligned access.)
+
+Despite the potential unaligned access problems with the above function, it
+is included in the kernel anyway but is understood to only work normally on
+16-bit-aligned addresses. It is up to the caller to ensure this alignment or
+not use this function at all. This alignment-unsafe function is still useful
+as it is a decent optimization for the cases when you can ensure alignment,
+which is true almost all of the time in ethernet networking context.
+
+
+Here is another example of some code that could cause unaligned accesses::
+
+ void myfunc(u8 *data, u32 value)
+ {
+ [...]
+ *((u32 *) data) = cpu_to_le32(value);
+ [...]
+ }
+
+This code will cause unaligned accesses every time the data parameter points
+to an address that is not evenly divisible by 4.
+
+In summary, the 2 main scenarios where you may run into unaligned access
+problems involve:
+
+ 1. Casting variables to types of different lengths
+ 2. Pointer arithmetic followed by access to at least 2 bytes of data
+
+
+Avoiding unaligned accesses
+===========================
+
+The easiest way to avoid unaligned access is to use the get_unaligned() and
+put_unaligned() macros provided by the <asm/unaligned.h> header file.
+
+Going back to an earlier example of code that potentially causes unaligned
+access::
+
+ void myfunc(u8 *data, u32 value)
+ {
+ [...]
+ *((u32 *) data) = cpu_to_le32(value);
+ [...]
+ }
+
+To avoid the unaligned memory access, you would rewrite it as follows::
+
+ void myfunc(u8 *data, u32 value)
+ {
+ [...]
+ value = cpu_to_le32(value);
+ put_unaligned(value, (u32 *) data);
+ [...]
+ }
+
+The get_unaligned() macro works similarly. Assuming 'data' is a pointer to
+memory and you wish to avoid unaligned access, its usage is as follows::
+
+ u32 value = get_unaligned((u32 *) data);
+
+These macros work for memory accesses of any length (not just 32 bits as
+in the examples above). Be aware that when compared to standard access of
+aligned memory, using these macros to access unaligned memory can be costly in
+terms of performance.
+
+If use of such macros is not convenient, another option is to use memcpy(),
+where the source or destination (or both) are of type u8* or unsigned char*.
+Due to the byte-wise nature of this operation, unaligned accesses are avoided.
+
+
+Alignment vs. Networking
+========================
+
+On architectures that require aligned loads, networking requires that the IP
+header is aligned on a four-byte boundary to optimise the IP stack. For
+regular ethernet hardware, the constant NET_IP_ALIGN is used. On most
+architectures this constant has the value 2 because the normal ethernet
+header is 14 bytes long, so in order to get proper alignment one needs to
+DMA to an address which can be expressed as 4*n + 2. One notable exception
+here is powerpc which defines NET_IP_ALIGN to 0 because DMA to unaligned
+addresses can be very expensive and dwarf the cost of unaligned loads.
+
+For some ethernet hardware that cannot DMA to unaligned addresses like
+4*n+2 or non-ethernet hardware, this can be a problem, and it is then
+required to copy the incoming frame into an aligned buffer. Because this is
+unnecessary on architectures that can do unaligned accesses, the code can be
+made dependent on CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS like so::
+
+ #ifdef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
+ skb = original skb
+ #else
+ skb = copy skb
+ #endif
diff --git a/Documentation/core-api/watch_queue.rst b/Documentation/core-api/watch_queue.rst
new file mode 100644
index 000000000000..54f13ad5fc17
--- /dev/null
+++ b/Documentation/core-api/watch_queue.rst
@@ -0,0 +1,343 @@
+==============================
+General notification mechanism
+==============================
+
+The general notification mechanism is built on top of the standard pipe driver
+whereby it effectively splices notification messages from the kernel into pipes
+opened by userspace. This can be used in conjunction with::
+
+ * Key/keyring notifications
+
+
+The notifications buffers can be enabled by:
+
+ "General setup"/"General notification queue"
+ (CONFIG_WATCH_QUEUE)
+
+This document has the following sections:
+
+.. contents:: :local:
+
+
+Overview
+========
+
+This facility appears as a pipe that is opened in a special mode. The pipe's
+internal ring buffer is used to hold messages that are generated by the kernel.
+These messages are then read out by read(). Splice and similar are disabled on
+such pipes due to them wanting to, under some circumstances, revert their
+additions to the ring - which might end up interleaved with notification
+messages.
+
+The owner of the pipe has to tell the kernel which sources it would like to
+watch through that pipe. Only sources that have been connected to a pipe will
+insert messages into it. Note that a source may be bound to multiple pipes and
+insert messages into all of them simultaneously.
+
+Filters may also be emplaced on a pipe so that certain source types and
+subevents can be ignored if they're not of interest.
+
+A message will be discarded if there isn't a slot available in the ring or if
+no preallocated message buffer is available. In both of these cases, read()
+will insert a WATCH_META_LOSS_NOTIFICATION message into the output buffer after
+the last message currently in the buffer has been read.
+
+Note that when producing a notification, the kernel does not wait for the
+consumers to collect it, but rather just continues on. This means that
+notifications can be generated whilst spinlocks are held and also protects the
+kernel from being held up indefinitely by a userspace malfunction.
+
+
+Message Structure
+=================
+
+Notification messages begin with a short header::
+
+ struct watch_notification {
+ __u32 type:24;
+ __u32 subtype:8;
+ __u32 info;
+ };
+
+"type" indicates the source of the notification record and "subtype" indicates
+the type of record from that source (see the Watch Sources section below). The
+type may also be "WATCH_TYPE_META". This is a special record type generated
+internally by the watch queue itself. There are two subtypes:
+
+ * WATCH_META_REMOVAL_NOTIFICATION
+ * WATCH_META_LOSS_NOTIFICATION
+
+The first indicates that an object on which a watch was installed was removed
+or destroyed and the second indicates that some messages have been lost.
+
+"info" indicates a bunch of things, including:
+
+ * The length of the message in bytes, including the header (mask with
+ WATCH_INFO_LENGTH and shift by WATCH_INFO_LENGTH__SHIFT). This indicates
+ the size of the record, which may be between 8 and 127 bytes.
+
+ * The watch ID (mask with WATCH_INFO_ID and shift by WATCH_INFO_ID__SHIFT).
+ This indicates that caller's ID of the watch, which may be between 0
+ and 255. Multiple watches may share a queue, and this provides a means to
+ distinguish them.
+
+ * A type-specific field (WATCH_INFO_TYPE_INFO). This is set by the
+ notification producer to indicate some meaning specific to the type and
+ subtype.
+
+Everything in info apart from the length can be used for filtering.
+
+The header can be followed by supplementary information. The format of this is
+at the discretion is defined by the type and subtype.
+
+
+Watch List (Notification Source) API
+====================================
+
+A "watch list" is a list of watchers that are subscribed to a source of
+notifications. A list may be attached to an object (say a key or a superblock)
+or may be global (say for device events). From a userspace perspective, a
+non-global watch list is typically referred to by reference to the object it
+belongs to (such as using KEYCTL_NOTIFY and giving it a key serial number to
+watch that specific key).
+
+To manage a watch list, the following functions are provided:
+
+ * ::
+
+ void init_watch_list(struct watch_list *wlist,
+ void (*release_watch)(struct watch *wlist));
+
+ Initialise a watch list. If ``release_watch`` is not NULL, then this
+ indicates a function that should be called when the watch_list object is
+ destroyed to discard any references the watch list holds on the watched
+ object.
+
+ * ``void remove_watch_list(struct watch_list *wlist);``
+
+ This removes all of the watches subscribed to a watch_list and frees them
+ and then destroys the watch_list object itself.
+
+
+Watch Queue (Notification Output) API
+=====================================
+
+A "watch queue" is the buffer allocated by an application that notification
+records will be written into. The workings of this are hidden entirely inside
+of the pipe device driver, but it is necessary to gain a reference to it to set
+a watch. These can be managed with:
+
+ * ``struct watch_queue *get_watch_queue(int fd);``
+
+ Since watch queues are indicated to the kernel by the fd of the pipe that
+ implements the buffer, userspace must hand that fd through a system call.
+ This can be used to look up an opaque pointer to the watch queue from the
+ system call.
+
+ * ``void put_watch_queue(struct watch_queue *wqueue);``
+
+ This discards the reference obtained from ``get_watch_queue()``.
+
+
+Watch Subscription API
+======================
+
+A "watch" is a subscription on a watch list, indicating the watch queue, and
+thus the buffer, into which notification records should be written. The watch
+queue object may also carry filtering rules for that object, as set by
+userspace. Some parts of the watch struct can be set by the driver::
+
+ struct watch {
+ union {
+ u32 info_id; /* ID to be OR'd in to info field */
+ ...
+ };
+ void *private; /* Private data for the watched object */
+ u64 id; /* Internal identifier */
+ ...
+ };
+
+The ``info_id`` value should be an 8-bit number obtained from userspace and
+shifted by WATCH_INFO_ID__SHIFT. This is OR'd into the WATCH_INFO_ID field of
+struct watch_notification::info when and if the notification is written into
+the associated watch queue buffer.
+
+The ``private`` field is the driver's data associated with the watch_list and
+is cleaned up by the ``watch_list::release_watch()`` method.
+
+The ``id`` field is the source's ID. Notifications that are posted with a
+different ID are ignored.
+
+The following functions are provided to manage watches:
+
+ * ``void init_watch(struct watch *watch, struct watch_queue *wqueue);``
+
+ Initialise a watch object, setting its pointer to the watch queue, using
+ appropriate barriering to avoid lockdep complaints.
+
+ * ``int add_watch_to_object(struct watch *watch, struct watch_list *wlist);``
+
+ Subscribe a watch to a watch list (notification source). The
+ driver-settable fields in the watch struct must have been set before this
+ is called.
+
+ * ::
+
+ int remove_watch_from_object(struct watch_list *wlist,
+ struct watch_queue *wqueue,
+ u64 id, false);
+
+ Remove a watch from a watch list, where the watch must match the specified
+ watch queue (``wqueue``) and object identifier (``id``). A notification
+ (``WATCH_META_REMOVAL_NOTIFICATION``) is sent to the watch queue to
+ indicate that the watch got removed.
+
+ * ``int remove_watch_from_object(struct watch_list *wlist, NULL, 0, true);``
+
+ Remove all the watches from a watch list. It is expected that this will be
+ called preparatory to destruction and that the watch list will be
+ inaccessible to new watches by this point. A notification
+ (``WATCH_META_REMOVAL_NOTIFICATION``) is sent to the watch queue of each
+ subscribed watch to indicate that the watch got removed.
+
+
+Notification Posting API
+========================
+
+To post a notification to watch list so that the subscribed watches can see it,
+the following function should be used::
+
+ void post_watch_notification(struct watch_list *wlist,
+ struct watch_notification *n,
+ const struct cred *cred,
+ u64 id);
+
+The notification should be preformatted and a pointer to the header (``n``)
+should be passed in. The notification may be larger than this and the size in
+units of buffer slots is noted in ``n->info & WATCH_INFO_LENGTH``.
+
+The ``cred`` struct indicates the credentials of the source (subject) and is
+passed to the LSMs, such as SELinux, to allow or suppress the recording of the
+note in each individual queue according to the credentials of that queue
+(object).
+
+The ``id`` is the ID of the source object (such as the serial number on a key).
+Only watches that have the same ID set in them will see this notification.
+
+
+Watch Sources
+=============
+
+Any particular buffer can be fed from multiple sources. Sources include:
+
+ * WATCH_TYPE_KEY_NOTIFY
+
+ Notifications of this type indicate changes to keys and keyrings, including
+ the changes of keyring contents or the attributes of keys.
+
+ See Documentation/security/keys/core.rst for more information.
+
+
+Event Filtering
+===============
+
+Once a watch queue has been created, a set of filters can be applied to limit
+the events that are received using::
+
+ struct watch_notification_filter filter = {
+ ...
+ };
+ ioctl(fd, IOC_WATCH_QUEUE_SET_FILTER, &filter)
+
+The filter description is a variable of type::
+
+ struct watch_notification_filter {
+ __u32 nr_filters;
+ __u32 __reserved;
+ struct watch_notification_type_filter filters[];
+ };
+
+Where "nr_filters" is the number of filters in filters[] and "__reserved"
+should be 0. The "filters" array has elements of the following type::
+
+ struct watch_notification_type_filter {
+ __u32 type;
+ __u32 info_filter;
+ __u32 info_mask;
+ __u32 subtype_filter[8];
+ };
+
+Where:
+
+ * ``type`` is the event type to filter for and should be something like
+ "WATCH_TYPE_KEY_NOTIFY"
+
+ * ``info_filter`` and ``info_mask`` act as a filter on the info field of the
+ notification record. The notification is only written into the buffer if::
+
+ (watch.info & info_mask) == info_filter
+
+ This could be used, for example, to ignore events that are not exactly on
+ the watched point in a mount tree.
+
+ * ``subtype_filter`` is a bitmask indicating the subtypes that are of
+ interest. Bit 0 of subtype_filter[0] corresponds to subtype 0, bit 1 to
+ subtype 1, and so on.
+
+If the argument to the ioctl() is NULL, then the filters will be removed and
+all events from the watched sources will come through.
+
+
+Userspace Code Example
+======================
+
+A buffer is created with something like the following::
+
+ pipe2(fds, O_TMPFILE);
+ ioctl(fds[1], IOC_WATCH_QUEUE_SET_SIZE, 256);
+
+It can then be set to receive keyring change notifications::
+
+ keyctl(KEYCTL_WATCH_KEY, KEY_SPEC_SESSION_KEYRING, fds[1], 0x01);
+
+The notifications can then be consumed by something like the following::
+
+ static void consumer(int rfd, struct watch_queue_buffer *buf)
+ {
+ unsigned char buffer[128];
+ ssize_t buf_len;
+
+ while (buf_len = read(rfd, buffer, sizeof(buffer)),
+ buf_len > 0
+ ) {
+ void *p = buffer;
+ void *end = buffer + buf_len;
+ while (p < end) {
+ union {
+ struct watch_notification n;
+ unsigned char buf1[128];
+ } n;
+ size_t largest, len;
+
+ largest = end - p;
+ if (largest > 128)
+ largest = 128;
+ memcpy(&n, p, largest);
+
+ len = (n->info & WATCH_INFO_LENGTH) >>
+ WATCH_INFO_LENGTH__SHIFT;
+ if (len == 0 || len > largest)
+ return;
+
+ switch (n.n.type) {
+ case WATCH_TYPE_META:
+ got_meta(&n.n);
+ case WATCH_TYPE_KEY_NOTIFY:
+ saw_key_change(&n.n);
+ break;
+ }
+
+ p += len;
+ }
+ }
+ }
diff --git a/Documentation/core-api/workqueue.rst b/Documentation/core-api/workqueue.rst
index 00a5ba51e63f..3599cf9267b4 100644
--- a/Documentation/core-api/workqueue.rst
+++ b/Documentation/core-api/workqueue.rst
@@ -1,6 +1,6 @@
-====================================
-Concurrency Managed Workqueue (cmwq)
-====================================
+=========
+Workqueue
+=========
:Date: September, 2010
:Author: Tejun Heo <tj@kernel.org>
@@ -25,8 +25,8 @@ there is no work item left on the workqueue the worker becomes idle.
When a new work item gets queued, the worker begins executing again.
-Why cmwq?
-=========
+Why Concurrency Managed Workqueue?
+==================================
In the original wq implementation, a multi threaded (MT) wq had one
worker thread per CPU and a single threaded (ST) wq had one worker
@@ -216,25 +216,20 @@ resources, scheduled and executed.
This flag is meaningless for unbound wq.
-Note that the flag ``WQ_NON_REENTRANT`` no longer exists as all
-workqueues are now non-reentrant - any work item is guaranteed to be
-executed by at most one worker system-wide at any given time.
-
``max_active``
--------------
-``@max_active`` determines the maximum number of execution contexts
-per CPU which can be assigned to the work items of a wq. For example,
-with ``@max_active`` of 16, at most 16 work items of the wq can be
-executing at the same time per CPU.
+``@max_active`` determines the maximum number of execution contexts per
+CPU which can be assigned to the work items of a wq. For example, with
+``@max_active`` of 16, at most 16 work items of the wq can be executing
+at the same time per CPU. This is always a per-CPU attribute, even for
+unbound workqueues.
-Currently, for a bound wq, the maximum limit for ``@max_active`` is
-512 and the default value used when 0 is specified is 256. For an
-unbound wq, the limit is higher of 512 and 4 *
-``num_possible_cpus()``. These values are chosen sufficiently high
-such that they are not the limiting factor while providing protection
-in runaway cases.
+The maximum limit for ``@max_active`` is 512 and the default value used
+when 0 is specified is 256. These values are chosen sufficiently high
+such that they are not the limiting factor while providing protection in
+runaway cases.
The number of active work items of a wq is usually regulated by the
users of the wq, more specifically, by how many work items the users
@@ -249,7 +244,7 @@ unbound worker-pools and only one work item could be active at any given
time thus achieving the same ordering property as ST wq.
In the current implementation the above configuration only guarantees
-ST behavior within a given NUMA node. Instead ``alloc_ordered_queue()`` should
+ST behavior within a given NUMA node. Instead ``alloc_ordered_workqueue()`` should
be used to achieve system-wide ST behavior.
@@ -352,6 +347,356 @@ Guidelines
level of locality in wq operations and work item execution.
+Affinity Scopes
+===============
+
+An unbound workqueue groups CPUs according to its affinity scope to improve
+cache locality. For example, if a workqueue is using the default affinity
+scope of "cache", it will group CPUs according to last level cache
+boundaries. A work item queued on the workqueue will be assigned to a worker
+on one of the CPUs which share the last level cache with the issuing CPU.
+Once started, the worker may or may not be allowed to move outside the scope
+depending on the ``affinity_strict`` setting of the scope.
+
+Workqueue currently supports the following affinity scopes.
+
+``default``
+ Use the scope in module parameter ``workqueue.default_affinity_scope``
+ which is always set to one of the scopes below.
+
+``cpu``
+ CPUs are not grouped. A work item issued on one CPU is processed by a
+ worker on the same CPU. This makes unbound workqueues behave as per-cpu
+ workqueues without concurrency management.
+
+``smt``
+ CPUs are grouped according to SMT boundaries. This usually means that the
+ logical threads of each physical CPU core are grouped together.
+
+``cache``
+ CPUs are grouped according to cache boundaries. Which specific cache
+ boundary is used is determined by the arch code. L3 is used in a lot of
+ cases. This is the default affinity scope.
+
+``numa``
+ CPUs are grouped according to NUMA boundaries.
+
+``system``
+ All CPUs are put in the same group. Workqueue makes no effort to process a
+ work item on a CPU close to the issuing CPU.
+
+The default affinity scope can be changed with the module parameter
+``workqueue.default_affinity_scope`` and a specific workqueue's affinity
+scope can be changed using ``apply_workqueue_attrs()``.
+
+If ``WQ_SYSFS`` is set, the workqueue will have the following affinity scope
+related interface files under its ``/sys/devices/virtual/workqueue/WQ_NAME/``
+directory.
+
+``affinity_scope``
+ Read to see the current affinity scope. Write to change.
+
+ When default is the current scope, reading this file will also show the
+ current effective scope in parentheses, for example, ``default (cache)``.
+
+``affinity_strict``
+ 0 by default indicating that affinity scopes are not strict. When a work
+ item starts execution, workqueue makes a best-effort attempt to ensure
+ that the worker is inside its affinity scope, which is called
+ repatriation. Once started, the scheduler is free to move the worker
+ anywhere in the system as it sees fit. This enables benefiting from scope
+ locality while still being able to utilize other CPUs if necessary and
+ available.
+
+ If set to 1, all workers of the scope are guaranteed always to be in the
+ scope. This may be useful when crossing affinity scopes has other
+ implications, for example, in terms of power consumption or workload
+ isolation. Strict NUMA scope can also be used to match the workqueue
+ behavior of older kernels.
+
+
+Affinity Scopes and Performance
+===============================
+
+It'd be ideal if an unbound workqueue's behavior is optimal for vast
+majority of use cases without further tuning. Unfortunately, in the current
+kernel, there exists a pronounced trade-off between locality and utilization
+necessitating explicit configurations when workqueues are heavily used.
+
+Higher locality leads to higher efficiency where more work is performed for
+the same number of consumed CPU cycles. However, higher locality may also
+cause lower overall system utilization if the work items are not spread
+enough across the affinity scopes by the issuers. The following performance
+testing with dm-crypt clearly illustrates this trade-off.
+
+The tests are run on a CPU with 12-cores/24-threads split across four L3
+caches (AMD Ryzen 9 3900x). CPU clock boost is turned off for consistency.
+``/dev/dm-0`` is a dm-crypt device created on NVME SSD (Samsung 990 PRO) and
+opened with ``cryptsetup`` with default settings.
+
+
+Scenario 1: Enough issuers and work spread across the machine
+-------------------------------------------------------------
+
+The command used: ::
+
+ $ fio --filename=/dev/dm-0 --direct=1 --rw=randrw --bs=32k --ioengine=libaio \
+ --iodepth=64 --runtime=60 --numjobs=24 --time_based --group_reporting \
+ --name=iops-test-job --verify=sha512
+
+There are 24 issuers, each issuing 64 IOs concurrently. ``--verify=sha512``
+makes ``fio`` generate and read back the content each time which makes
+execution locality matter between the issuer and ``kcryptd``. The following
+are the read bandwidths and CPU utilizations depending on different affinity
+scope settings on ``kcryptd`` measured over five runs. Bandwidths are in
+MiBps, and CPU util in percents.
+
+.. list-table::
+ :widths: 16 20 20
+ :header-rows: 1
+
+ * - Affinity
+ - Bandwidth (MiBps)
+ - CPU util (%)
+
+ * - system
+ - 1159.40 ±1.34
+ - 99.31 ±0.02
+
+ * - cache
+ - 1166.40 ±0.89
+ - 99.34 ±0.01
+
+ * - cache (strict)
+ - 1166.00 ±0.71
+ - 99.35 ±0.01
+
+With enough issuers spread across the system, there is no downside to
+"cache", strict or otherwise. All three configurations saturate the whole
+machine but the cache-affine ones outperform by 0.6% thanks to improved
+locality.
+
+
+Scenario 2: Fewer issuers, enough work for saturation
+-----------------------------------------------------
+
+The command used: ::
+
+ $ fio --filename=/dev/dm-0 --direct=1 --rw=randrw --bs=32k \
+ --ioengine=libaio --iodepth=64 --runtime=60 --numjobs=8 \
+ --time_based --group_reporting --name=iops-test-job --verify=sha512
+
+The only difference from the previous scenario is ``--numjobs=8``. There are
+a third of the issuers but is still enough total work to saturate the
+system.
+
+.. list-table::
+ :widths: 16 20 20
+ :header-rows: 1
+
+ * - Affinity
+ - Bandwidth (MiBps)
+ - CPU util (%)
+
+ * - system
+ - 1155.40 ±0.89
+ - 97.41 ±0.05
+
+ * - cache
+ - 1154.40 ±1.14
+ - 96.15 ±0.09
+
+ * - cache (strict)
+ - 1112.00 ±4.64
+ - 93.26 ±0.35
+
+This is more than enough work to saturate the system. Both "system" and
+"cache" are nearly saturating the machine but not fully. "cache" is using
+less CPU but the better efficiency puts it at the same bandwidth as
+"system".
+
+Eight issuers moving around over four L3 cache scope still allow "cache
+(strict)" to mostly saturate the machine but the loss of work conservation
+is now starting to hurt with 3.7% bandwidth loss.
+
+
+Scenario 3: Even fewer issuers, not enough work to saturate
+-----------------------------------------------------------
+
+The command used: ::
+
+ $ fio --filename=/dev/dm-0 --direct=1 --rw=randrw --bs=32k \
+ --ioengine=libaio --iodepth=64 --runtime=60 --numjobs=4 \
+ --time_based --group_reporting --name=iops-test-job --verify=sha512
+
+Again, the only difference is ``--numjobs=4``. With the number of issuers
+reduced to four, there now isn't enough work to saturate the whole system
+and the bandwidth becomes dependent on completion latencies.
+
+.. list-table::
+ :widths: 16 20 20
+ :header-rows: 1
+
+ * - Affinity
+ - Bandwidth (MiBps)
+ - CPU util (%)
+
+ * - system
+ - 993.60 ±1.82
+ - 75.49 ±0.06
+
+ * - cache
+ - 973.40 ±1.52
+ - 74.90 ±0.07
+
+ * - cache (strict)
+ - 828.20 ±4.49
+ - 66.84 ±0.29
+
+Now, the tradeoff between locality and utilization is clearer. "cache" shows
+2% bandwidth loss compared to "system" and "cache (struct)" whopping 20%.
+
+
+Conclusion and Recommendations
+------------------------------
+
+In the above experiments, the efficiency advantage of the "cache" affinity
+scope over "system" is, while consistent and noticeable, small. However, the
+impact is dependent on the distances between the scopes and may be more
+pronounced in processors with more complex topologies.
+
+While the loss of work-conservation in certain scenarios hurts, it is a lot
+better than "cache (strict)" and maximizing workqueue utilization is
+unlikely to be the common case anyway. As such, "cache" is the default
+affinity scope for unbound pools.
+
+* As there is no one option which is great for most cases, workqueue usages
+ that may consume a significant amount of CPU are recommended to configure
+ the workqueues using ``apply_workqueue_attrs()`` and/or enable
+ ``WQ_SYSFS``.
+
+* An unbound workqueue with strict "cpu" affinity scope behaves the same as
+ ``WQ_CPU_INTENSIVE`` per-cpu workqueue. There is no real advanage to the
+ latter and an unbound workqueue provides a lot more flexibility.
+
+* Affinity scopes are introduced in Linux v6.5. To emulate the previous
+ behavior, use strict "numa" affinity scope.
+
+* The loss of work-conservation in non-strict affinity scopes is likely
+ originating from the scheduler. There is no theoretical reason why the
+ kernel wouldn't be able to do the right thing and maintain
+ work-conservation in most cases. As such, it is possible that future
+ scheduler improvements may make most of these tunables unnecessary.
+
+
+Examining Configuration
+=======================
+
+Use tools/workqueue/wq_dump.py to examine unbound CPU affinity
+configuration, worker pools and how workqueues map to the pools: ::
+
+ $ tools/workqueue/wq_dump.py
+ Affinity Scopes
+ ===============
+ wq_unbound_cpumask=0000000f
+
+ CPU
+ nr_pods 4
+ pod_cpus [0]=00000001 [1]=00000002 [2]=00000004 [3]=00000008
+ pod_node [0]=0 [1]=0 [2]=1 [3]=1
+ cpu_pod [0]=0 [1]=1 [2]=2 [3]=3
+
+ SMT
+ nr_pods 4
+ pod_cpus [0]=00000001 [1]=00000002 [2]=00000004 [3]=00000008
+ pod_node [0]=0 [1]=0 [2]=1 [3]=1
+ cpu_pod [0]=0 [1]=1 [2]=2 [3]=3
+
+ CACHE (default)
+ nr_pods 2
+ pod_cpus [0]=00000003 [1]=0000000c
+ pod_node [0]=0 [1]=1
+ cpu_pod [0]=0 [1]=0 [2]=1 [3]=1
+
+ NUMA
+ nr_pods 2
+ pod_cpus [0]=00000003 [1]=0000000c
+ pod_node [0]=0 [1]=1
+ cpu_pod [0]=0 [1]=0 [2]=1 [3]=1
+
+ SYSTEM
+ nr_pods 1
+ pod_cpus [0]=0000000f
+ pod_node [0]=-1
+ cpu_pod [0]=0 [1]=0 [2]=0 [3]=0
+
+ Worker Pools
+ ============
+ pool[00] ref= 1 nice= 0 idle/workers= 4/ 4 cpu= 0
+ pool[01] ref= 1 nice=-20 idle/workers= 2/ 2 cpu= 0
+ pool[02] ref= 1 nice= 0 idle/workers= 4/ 4 cpu= 1
+ pool[03] ref= 1 nice=-20 idle/workers= 2/ 2 cpu= 1
+ pool[04] ref= 1 nice= 0 idle/workers= 4/ 4 cpu= 2
+ pool[05] ref= 1 nice=-20 idle/workers= 2/ 2 cpu= 2
+ pool[06] ref= 1 nice= 0 idle/workers= 3/ 3 cpu= 3
+ pool[07] ref= 1 nice=-20 idle/workers= 2/ 2 cpu= 3
+ pool[08] ref=42 nice= 0 idle/workers= 6/ 6 cpus=0000000f
+ pool[09] ref=28 nice= 0 idle/workers= 3/ 3 cpus=00000003
+ pool[10] ref=28 nice= 0 idle/workers= 17/ 17 cpus=0000000c
+ pool[11] ref= 1 nice=-20 idle/workers= 1/ 1 cpus=0000000f
+ pool[12] ref= 2 nice=-20 idle/workers= 1/ 1 cpus=00000003
+ pool[13] ref= 2 nice=-20 idle/workers= 1/ 1 cpus=0000000c
+
+ Workqueue CPU -> pool
+ =====================
+ [ workqueue \ CPU 0 1 2 3 dfl]
+ events percpu 0 2 4 6
+ events_highpri percpu 1 3 5 7
+ events_long percpu 0 2 4 6
+ events_unbound unbound 9 9 10 10 8
+ events_freezable percpu 0 2 4 6
+ events_power_efficient percpu 0 2 4 6
+ events_freezable_power_ percpu 0 2 4 6
+ rcu_gp percpu 0 2 4 6
+ rcu_par_gp percpu 0 2 4 6
+ slub_flushwq percpu 0 2 4 6
+ netns ordered 8 8 8 8 8
+ ...
+
+See the command's help message for more info.
+
+
+Monitoring
+==========
+
+Use tools/workqueue/wq_monitor.py to monitor workqueue operations: ::
+
+ $ tools/workqueue/wq_monitor.py events
+ total infl CPUtime CPUhog CMW/RPR mayday rescued
+ events 18545 0 6.1 0 5 - -
+ events_highpri 8 0 0.0 0 0 - -
+ events_long 3 0 0.0 0 0 - -
+ events_unbound 38306 0 0.1 - 7 - -
+ events_freezable 0 0 0.0 0 0 - -
+ events_power_efficient 29598 0 0.2 0 0 - -
+ events_freezable_power_ 10 0 0.0 0 0 - -
+ sock_diag_events 0 0 0.0 0 0 - -
+
+ total infl CPUtime CPUhog CMW/RPR mayday rescued
+ events 18548 0 6.1 0 5 - -
+ events_highpri 8 0 0.0 0 0 - -
+ events_long 3 0 0.0 0 0 - -
+ events_unbound 38322 0 0.1 - 7 - -
+ events_freezable 0 0 0.0 0 0 - -
+ events_power_efficient 29603 0 0.2 0 0 - -
+ events_freezable_power_ 10 0 0.0 0 0 - -
+ sock_diag_events 0 0 0.0 0 0 - -
+
+ ...
+
+See the command's help message for more info.
+
+
Debugging
=========
@@ -374,8 +719,8 @@ of possible problems:
The first one can be tracked using tracing: ::
- $ echo workqueue:workqueue_queue_work > /sys/kernel/debug/tracing/set_event
- $ cat /sys/kernel/debug/tracing/trace_pipe > out.txt
+ $ echo workqueue:workqueue_queue_work > /sys/kernel/tracing/set_event
+ $ cat /sys/kernel/tracing/trace_pipe > out.txt
(wait a few secs)
^C
@@ -392,7 +737,27 @@ The work item's function should be trivially visible in the stack
trace.
+Non-reentrance Conditions
+=========================
+
+Workqueue guarantees that a work item cannot be re-entrant if the following
+conditions hold after a work item gets queued:
+
+ 1. The work function hasn't been changed.
+ 2. No one queues the work item to another workqueue.
+ 3. The work item hasn't been reinitiated.
+
+In other words, if the above conditions hold, the work item is guaranteed to be
+executed by at most one worker system-wide at any given time.
+
+Note that requeuing the work item (to the same queue) in the self function
+doesn't break these conditions, so it's safe to do. Otherwise, caution is
+required when breaking the conditions inside a work function.
+
+
Kernel Inline Documentations Reference
======================================
.. kernel-doc:: include/linux/workqueue.h
+
+.. kernel-doc:: kernel/workqueue.c
diff --git a/Documentation/core-api/wrappers/atomic_bitops.rst b/Documentation/core-api/wrappers/atomic_bitops.rst
new file mode 100644
index 000000000000..bf24e4081a8f
--- /dev/null
+++ b/Documentation/core-api/wrappers/atomic_bitops.rst
@@ -0,0 +1,18 @@
+.. SPDX-License-Identifier: GPL-2.0
+ This is a simple wrapper to bring atomic_bitops.txt into the RST world
+ until such a time as that file can be converted directly.
+
+=============
+Atomic bitops
+=============
+
+.. raw:: latex
+
+ \footnotesize
+
+.. include:: ../../atomic_bitops.txt
+ :literal:
+
+.. raw:: latex
+
+ \normalsize
diff --git a/Documentation/core-api/wrappers/atomic_t.rst b/Documentation/core-api/wrappers/atomic_t.rst
new file mode 100644
index 000000000000..ed109a964c77
--- /dev/null
+++ b/Documentation/core-api/wrappers/atomic_t.rst
@@ -0,0 +1,19 @@
+.. SPDX-License-Identifier: GPL-2.0
+ This is a simple wrapper to bring atomic_t.txt into the RST world
+ until such a time as that file can be converted directly.
+
+============
+Atomic types
+============
+
+.. raw:: latex
+
+ \footnotesize
+
+.. include:: ../../atomic_t.txt
+ :literal:
+
+.. raw:: latex
+
+ \normalsize
+
diff --git a/Documentation/core-api/wrappers/memory-barriers.rst b/Documentation/core-api/wrappers/memory-barriers.rst
new file mode 100644
index 000000000000..532460b5e3eb
--- /dev/null
+++ b/Documentation/core-api/wrappers/memory-barriers.rst
@@ -0,0 +1,18 @@
+.. SPDX-License-Identifier: GPL-2.0
+ This is a simple wrapper to bring memory-barriers.txt into the RST world
+ until such a time as that file can be converted directly.
+
+============================
+Linux kernel memory barriers
+============================
+
+.. raw:: latex
+
+ \footnotesize
+
+.. include:: ../../memory-barriers.txt
+ :literal:
+
+.. raw:: latex
+
+ \normalsize
diff --git a/Documentation/core-api/xarray.rst b/Documentation/core-api/xarray.rst
index 640934b6f7b4..77e0ece2b1d6 100644
--- a/Documentation/core-api/xarray.rst
+++ b/Documentation/core-api/xarray.rst
@@ -315,11 +315,15 @@ indeed the normal API is implemented in terms of the advanced API. The
advanced API is only available to modules with a GPL-compatible license.
The advanced API is based around the xa_state. This is an opaque data
-structure which you declare on the stack using the XA_STATE()
-macro. This macro initialises the xa_state ready to start walking
-around the XArray. It is used as a cursor to maintain the position
-in the XArray and let you compose various operations together without
-having to restart from the top every time.
+structure which you declare on the stack using the XA_STATE() macro.
+This macro initialises the xa_state ready to start walking around the
+XArray. It is used as a cursor to maintain the position in the XArray
+and let you compose various operations together without having to restart
+from the top every time. The contents of the xa_state are protected by
+the rcu_read_lock() or the xas_lock(). If you need to drop whichever of
+those locks is protecting your state and tree, you must call xas_pause()
+so that future calls do not rely on the parts of the state which were
+left unprotected.
The xa_state is also used to store errors. You can call
xas_error() to retrieve the error. All operations check whether
@@ -475,13 +479,15 @@ or iterations will move the index to the first index in the range.
Each entry will only be returned once, no matter how many indices it
occupies.
-Using xas_next() or xas_prev() with a multi-index xa_state
-is not supported. Using either of these functions on a multi-index entry
-will reveal sibling entries; these should be skipped over by the caller.
+Using xas_next() or xas_prev() with a multi-index xa_state is not
+supported. Using either of these functions on a multi-index entry will
+reveal sibling entries; these should be skipped over by the caller.
-Storing ``NULL`` into any index of a multi-index entry will set the entry
-at every index to ``NULL`` and dissolve the tie. Splitting a multi-index
-entry into entries occupying smaller ranges is not yet supported.
+Storing ``NULL`` into any index of a multi-index entry will set the
+entry at every index to ``NULL`` and dissolve the tie. A multi-index
+entry can be split into entries occupying smaller ranges by calling
+xas_split_alloc() without the xa_lock held, followed by taking the lock
+and calling xas_split().
Functions and structures
========================