aboutsummaryrefslogtreecommitdiffstats
path: root/Documentation/locking
diff options
context:
space:
mode:
Diffstat (limited to 'Documentation/locking')
-rw-r--r--Documentation/locking/futex-requeue-pi.rst132
-rw-r--r--Documentation/locking/hwspinlock.rst485
-rw-r--r--Documentation/locking/index.rst10
-rw-r--r--Documentation/locking/lockdep-design.rst281
-rw-r--r--Documentation/locking/locktorture.rst10
-rw-r--r--Documentation/locking/locktypes.rst221
-rw-r--r--Documentation/locking/mutex-design.rst22
-rw-r--r--Documentation/locking/percpu-rw-semaphore.rst28
-rw-r--r--Documentation/locking/pi-futex.rst122
-rw-r--r--Documentation/locking/preempt-locking.rst144
-rw-r--r--Documentation/locking/robust-futex-ABI.rst184
-rw-r--r--Documentation/locking/robust-futexes.rst221
-rw-r--r--Documentation/locking/rt-mutex.rst2
-rw-r--r--Documentation/locking/seqlock.rst239
-rw-r--r--Documentation/locking/ww-mutex-design.rst6
15 files changed, 2072 insertions, 35 deletions
diff --git a/Documentation/locking/futex-requeue-pi.rst b/Documentation/locking/futex-requeue-pi.rst
new file mode 100644
index 000000000000..dd4ecf4528a4
--- /dev/null
+++ b/Documentation/locking/futex-requeue-pi.rst
@@ -0,0 +1,132 @@
+================
+Futex Requeue PI
+================
+
+Requeueing of tasks from a non-PI futex to a PI futex requires
+special handling in order to ensure the underlying rt_mutex is never
+left without an owner if it has waiters; doing so would break the PI
+boosting logic [see rt-mutex-design.rst] For the purposes of
+brevity, this action will be referred to as "requeue_pi" throughout
+this document. Priority inheritance is abbreviated throughout as
+"PI".
+
+Motivation
+----------
+
+Without requeue_pi, the glibc implementation of
+pthread_cond_broadcast() must resort to waking all the tasks waiting
+on a pthread_condvar and letting them try to sort out which task
+gets to run first in classic thundering-herd formation. An ideal
+implementation would wake the highest-priority waiter, and leave the
+rest to the natural wakeup inherent in unlocking the mutex
+associated with the condvar.
+
+Consider the simplified glibc calls::
+
+ /* caller must lock mutex */
+ pthread_cond_wait(cond, mutex)
+ {
+ lock(cond->__data.__lock);
+ unlock(mutex);
+ do {
+ unlock(cond->__data.__lock);
+ futex_wait(cond->__data.__futex);
+ lock(cond->__data.__lock);
+ } while(...)
+ unlock(cond->__data.__lock);
+ lock(mutex);
+ }
+
+ pthread_cond_broadcast(cond)
+ {
+ lock(cond->__data.__lock);
+ unlock(cond->__data.__lock);
+ futex_requeue(cond->data.__futex, cond->mutex);
+ }
+
+Once pthread_cond_broadcast() requeues the tasks, the cond->mutex
+has waiters. Note that pthread_cond_wait() attempts to lock the
+mutex only after it has returned to user space. This will leave the
+underlying rt_mutex with waiters, and no owner, breaking the
+previously mentioned PI-boosting algorithms.
+
+In order to support PI-aware pthread_condvar's, the kernel needs to
+be able to requeue tasks to PI futexes. This support implies that
+upon a successful futex_wait system call, the caller would return to
+user space already holding the PI futex. The glibc implementation
+would be modified as follows::
+
+
+ /* caller must lock mutex */
+ pthread_cond_wait_pi(cond, mutex)
+ {
+ lock(cond->__data.__lock);
+ unlock(mutex);
+ do {
+ unlock(cond->__data.__lock);
+ futex_wait_requeue_pi(cond->__data.__futex);
+ lock(cond->__data.__lock);
+ } while(...)
+ unlock(cond->__data.__lock);
+ /* the kernel acquired the mutex for us */
+ }
+
+ pthread_cond_broadcast_pi(cond)
+ {
+ lock(cond->__data.__lock);
+ unlock(cond->__data.__lock);
+ futex_requeue_pi(cond->data.__futex, cond->mutex);
+ }
+
+The actual glibc implementation will likely test for PI and make the
+necessary changes inside the existing calls rather than creating new
+calls for the PI cases. Similar changes are needed for
+pthread_cond_timedwait() and pthread_cond_signal().
+
+Implementation
+--------------
+
+In order to ensure the rt_mutex has an owner if it has waiters, it
+is necessary for both the requeue code, as well as the waiting code,
+to be able to acquire the rt_mutex before returning to user space.
+The requeue code cannot simply wake the waiter and leave it to
+acquire the rt_mutex as it would open a race window between the
+requeue call returning to user space and the waiter waking and
+starting to run. This is especially true in the uncontended case.
+
+The solution involves two new rt_mutex helper routines,
+rt_mutex_start_proxy_lock() and rt_mutex_finish_proxy_lock(), which
+allow the requeue code to acquire an uncontended rt_mutex on behalf
+of the waiter and to enqueue the waiter on a contended rt_mutex.
+Two new system calls provide the kernel<->user interface to
+requeue_pi: FUTEX_WAIT_REQUEUE_PI and FUTEX_CMP_REQUEUE_PI.
+
+FUTEX_WAIT_REQUEUE_PI is called by the waiter (pthread_cond_wait()
+and pthread_cond_timedwait()) to block on the initial futex and wait
+to be requeued to a PI-aware futex. The implementation is the
+result of a high-speed collision between futex_wait() and
+futex_lock_pi(), with some extra logic to check for the additional
+wake-up scenarios.
+
+FUTEX_CMP_REQUEUE_PI is called by the waker
+(pthread_cond_broadcast() and pthread_cond_signal()) to requeue and
+possibly wake the waiting tasks. Internally, this system call is
+still handled by futex_requeue (by passing requeue_pi=1). Before
+requeueing, futex_requeue() attempts to acquire the requeue target
+PI futex on behalf of the top waiter. If it can, this waiter is
+woken. futex_requeue() then proceeds to requeue the remaining
+nr_wake+nr_requeue tasks to the PI futex, calling
+rt_mutex_start_proxy_lock() prior to each requeue to prepare the
+task as a waiter on the underlying rt_mutex. It is possible that
+the lock can be acquired at this stage as well, if so, the next
+waiter is woken to finish the acquisition of the lock.
+
+FUTEX_CMP_REQUEUE_PI accepts nr_wake and nr_requeue as arguments, but
+their sum is all that really matters. futex_requeue() will wake or
+requeue up to nr_wake + nr_requeue tasks. It will wake only as many
+tasks as it can acquire the lock for, which in the majority of cases
+should be 0 as good programming practice dictates that the caller of
+either pthread_cond_broadcast() or pthread_cond_signal() acquire the
+mutex prior to making the call. FUTEX_CMP_REQUEUE_PI requires that
+nr_wake=1. nr_requeue should be INT_MAX for broadcast and 0 for
+signal.
diff --git a/Documentation/locking/hwspinlock.rst b/Documentation/locking/hwspinlock.rst
new file mode 100644
index 000000000000..6f03713b7003
--- /dev/null
+++ b/Documentation/locking/hwspinlock.rst
@@ -0,0 +1,485 @@
+===========================
+Hardware Spinlock Framework
+===========================
+
+Introduction
+============
+
+Hardware spinlock modules provide hardware assistance for synchronization
+and mutual exclusion between heterogeneous processors and those not operating
+under a single, shared operating system.
+
+For example, OMAP4 has dual Cortex-A9, dual Cortex-M3 and a C64x+ DSP,
+each of which is running a different Operating System (the master, A9,
+is usually running Linux and the slave processors, the M3 and the DSP,
+are running some flavor of RTOS).
+
+A generic hwspinlock framework allows platform-independent drivers to use
+the hwspinlock device in order to access data structures that are shared
+between remote processors, that otherwise have no alternative mechanism
+to accomplish synchronization and mutual exclusion operations.
+
+This is necessary, for example, for Inter-processor communications:
+on OMAP4, cpu-intensive multimedia tasks are offloaded by the host to the
+remote M3 and/or C64x+ slave processors (by an IPC subsystem called Syslink).
+
+To achieve fast message-based communications, a minimal kernel support
+is needed to deliver messages arriving from a remote processor to the
+appropriate user process.
+
+This communication is based on simple data structures that is shared between
+the remote processors, and access to it is synchronized using the hwspinlock
+module (remote processor directly places new messages in this shared data
+structure).
+
+A common hwspinlock interface makes it possible to have generic, platform-
+independent, drivers.
+
+User API
+========
+
+::
+
+ struct hwspinlock *hwspin_lock_request(void);
+
+Dynamically assign an hwspinlock and return its address, or NULL
+in case an unused hwspinlock isn't available. Users of this
+API will usually want to communicate the lock's id to the remote core
+before it can be used to achieve synchronization.
+
+Should be called from a process context (might sleep).
+
+::
+
+ struct hwspinlock *hwspin_lock_request_specific(unsigned int id);
+
+Assign a specific hwspinlock id and return its address, or NULL
+if that hwspinlock is already in use. Usually board code will
+be calling this function in order to reserve specific hwspinlock
+ids for predefined purposes.
+
+Should be called from a process context (might sleep).
+
+::
+
+ int of_hwspin_lock_get_id(struct device_node *np, int index);
+
+Retrieve the global lock id for an OF phandle-based specific lock.
+This function provides a means for DT users of a hwspinlock module
+to get the global lock id of a specific hwspinlock, so that it can
+be requested using the normal hwspin_lock_request_specific() API.
+
+The function returns a lock id number on success, -EPROBE_DEFER if
+the hwspinlock device is not yet registered with the core, or other
+error values.
+
+Should be called from a process context (might sleep).
+
+::
+
+ int hwspin_lock_free(struct hwspinlock *hwlock);
+
+Free a previously-assigned hwspinlock; returns 0 on success, or an
+appropriate error code on failure (e.g. -EINVAL if the hwspinlock
+is already free).
+
+Should be called from a process context (might sleep).
+
+::
+
+ int hwspin_lock_timeout(struct hwspinlock *hwlock, unsigned int timeout);
+
+Lock a previously-assigned hwspinlock with a timeout limit (specified in
+msecs). If the hwspinlock is already taken, the function will busy loop
+waiting for it to be released, but give up when the timeout elapses.
+Upon a successful return from this function, preemption is disabled so
+the caller must not sleep, and is advised to release the hwspinlock as
+soon as possible, in order to minimize remote cores polling on the
+hardware interconnect.
+
+Returns 0 when successful and an appropriate error code otherwise (most
+notably -ETIMEDOUT if the hwspinlock is still busy after timeout msecs).
+The function will never sleep.
+
+::
+
+ int hwspin_lock_timeout_irq(struct hwspinlock *hwlock, unsigned int timeout);
+
+Lock a previously-assigned hwspinlock with a timeout limit (specified in
+msecs). If the hwspinlock is already taken, the function will busy loop
+waiting for it to be released, but give up when the timeout elapses.
+Upon a successful return from this function, preemption and the local
+interrupts are disabled, so the caller must not sleep, and is advised to
+release the hwspinlock as soon as possible.
+
+Returns 0 when successful and an appropriate error code otherwise (most
+notably -ETIMEDOUT if the hwspinlock is still busy after timeout msecs).
+The function will never sleep.
+
+::
+
+ int hwspin_lock_timeout_irqsave(struct hwspinlock *hwlock, unsigned int to,
+ unsigned long *flags);
+
+Lock a previously-assigned hwspinlock with a timeout limit (specified in
+msecs). If the hwspinlock is already taken, the function will busy loop
+waiting for it to be released, but give up when the timeout elapses.
+Upon a successful return from this function, preemption is disabled,
+local interrupts are disabled and their previous state is saved at the
+given flags placeholder. The caller must not sleep, and is advised to
+release the hwspinlock as soon as possible.
+
+Returns 0 when successful and an appropriate error code otherwise (most
+notably -ETIMEDOUT if the hwspinlock is still busy after timeout msecs).
+
+The function will never sleep.
+
+::
+
+ int hwspin_lock_timeout_raw(struct hwspinlock *hwlock, unsigned int timeout);
+
+Lock a previously-assigned hwspinlock with a timeout limit (specified in
+msecs). If the hwspinlock is already taken, the function will busy loop
+waiting for it to be released, but give up when the timeout elapses.
+
+Caution: User must protect the routine of getting hardware lock with mutex
+or spinlock to avoid dead-lock, that will let user can do some time-consuming
+or sleepable operations under the hardware lock.
+
+Returns 0 when successful and an appropriate error code otherwise (most
+notably -ETIMEDOUT if the hwspinlock is still busy after timeout msecs).
+
+The function will never sleep.
+
+::
+
+ int hwspin_lock_timeout_in_atomic(struct hwspinlock *hwlock, unsigned int to);
+
+Lock a previously-assigned hwspinlock with a timeout limit (specified in
+msecs). If the hwspinlock is already taken, the function will busy loop
+waiting for it to be released, but give up when the timeout elapses.
+
+This function shall be called only from an atomic context and the timeout
+value shall not exceed a few msecs.
+
+Returns 0 when successful and an appropriate error code otherwise (most
+notably -ETIMEDOUT if the hwspinlock is still busy after timeout msecs).
+
+The function will never sleep.
+
+::
+
+ int hwspin_trylock(struct hwspinlock *hwlock);
+
+
+Attempt to lock a previously-assigned hwspinlock, but immediately fail if
+it is already taken.
+
+Upon a successful return from this function, preemption is disabled so
+caller must not sleep, and is advised to release the hwspinlock as soon as
+possible, in order to minimize remote cores polling on the hardware
+interconnect.
+
+Returns 0 on success and an appropriate error code otherwise (most
+notably -EBUSY if the hwspinlock was already taken).
+The function will never sleep.
+
+::
+
+ int hwspin_trylock_irq(struct hwspinlock *hwlock);
+
+
+Attempt to lock a previously-assigned hwspinlock, but immediately fail if
+it is already taken.
+
+Upon a successful return from this function, preemption and the local
+interrupts are disabled so caller must not sleep, and is advised to
+release the hwspinlock as soon as possible.
+
+Returns 0 on success and an appropriate error code otherwise (most
+notably -EBUSY if the hwspinlock was already taken).
+
+The function will never sleep.
+
+::
+
+ int hwspin_trylock_irqsave(struct hwspinlock *hwlock, unsigned long *flags);
+
+Attempt to lock a previously-assigned hwspinlock, but immediately fail if
+it is already taken.
+
+Upon a successful return from this function, preemption is disabled,
+the local interrupts are disabled and their previous state is saved
+at the given flags placeholder. The caller must not sleep, and is advised
+to release the hwspinlock as soon as possible.
+
+Returns 0 on success and an appropriate error code otherwise (most
+notably -EBUSY if the hwspinlock was already taken).
+The function will never sleep.
+
+::
+
+ int hwspin_trylock_raw(struct hwspinlock *hwlock);
+
+Attempt to lock a previously-assigned hwspinlock, but immediately fail if
+it is already taken.
+
+Caution: User must protect the routine of getting hardware lock with mutex
+or spinlock to avoid dead-lock, that will let user can do some time-consuming
+or sleepable operations under the hardware lock.
+
+Returns 0 on success and an appropriate error code otherwise (most
+notably -EBUSY if the hwspinlock was already taken).
+The function will never sleep.
+
+::
+
+ int hwspin_trylock_in_atomic(struct hwspinlock *hwlock);
+
+Attempt to lock a previously-assigned hwspinlock, but immediately fail if
+it is already taken.
+
+This function shall be called only from an atomic context.
+
+Returns 0 on success and an appropriate error code otherwise (most
+notably -EBUSY if the hwspinlock was already taken).
+The function will never sleep.
+
+::
+
+ void hwspin_unlock(struct hwspinlock *hwlock);
+
+Unlock a previously-locked hwspinlock. Always succeed, and can be called
+from any context (the function never sleeps).
+
+.. note::
+
+ code should **never** unlock an hwspinlock which is already unlocked
+ (there is no protection against this).
+
+::
+
+ void hwspin_unlock_irq(struct hwspinlock *hwlock);
+
+Unlock a previously-locked hwspinlock and enable local interrupts.
+The caller should **never** unlock an hwspinlock which is already unlocked.
+
+Doing so is considered a bug (there is no protection against this).
+Upon a successful return from this function, preemption and local
+interrupts are enabled. This function will never sleep.
+
+::
+
+ void
+ hwspin_unlock_irqrestore(struct hwspinlock *hwlock, unsigned long *flags);
+
+Unlock a previously-locked hwspinlock.
+
+The caller should **never** unlock an hwspinlock which is already unlocked.
+Doing so is considered a bug (there is no protection against this).
+Upon a successful return from this function, preemption is reenabled,
+and the state of the local interrupts is restored to the state saved at
+the given flags. This function will never sleep.
+
+::
+
+ void hwspin_unlock_raw(struct hwspinlock *hwlock);
+
+Unlock a previously-locked hwspinlock.
+
+The caller should **never** unlock an hwspinlock which is already unlocked.
+Doing so is considered a bug (there is no protection against this).
+This function will never sleep.
+
+::
+
+ void hwspin_unlock_in_atomic(struct hwspinlock *hwlock);
+
+Unlock a previously-locked hwspinlock.
+
+The caller should **never** unlock an hwspinlock which is already unlocked.
+Doing so is considered a bug (there is no protection against this).
+This function will never sleep.
+
+::
+
+ int hwspin_lock_get_id(struct hwspinlock *hwlock);
+
+Retrieve id number of a given hwspinlock. This is needed when an
+hwspinlock is dynamically assigned: before it can be used to achieve
+mutual exclusion with a remote cpu, the id number should be communicated
+to the remote task with which we want to synchronize.
+
+Returns the hwspinlock id number, or -EINVAL if hwlock is null.
+
+Typical usage
+=============
+
+::
+
+ #include <linux/hwspinlock.h>
+ #include <linux/err.h>
+
+ int hwspinlock_example1(void)
+ {
+ struct hwspinlock *hwlock;
+ int ret;
+
+ /* dynamically assign a hwspinlock */
+ hwlock = hwspin_lock_request();
+ if (!hwlock)
+ ...
+
+ id = hwspin_lock_get_id(hwlock);
+ /* probably need to communicate id to a remote processor now */
+
+ /* take the lock, spin for 1 sec if it's already taken */
+ ret = hwspin_lock_timeout(hwlock, 1000);
+ if (ret)
+ ...
+
+ /*
+ * we took the lock, do our thing now, but do NOT sleep
+ */
+
+ /* release the lock */
+ hwspin_unlock(hwlock);
+
+ /* free the lock */
+ ret = hwspin_lock_free(hwlock);
+ if (ret)
+ ...
+
+ return ret;
+ }
+
+ int hwspinlock_example2(void)
+ {
+ struct hwspinlock *hwlock;
+ int ret;
+
+ /*
+ * assign a specific hwspinlock id - this should be called early
+ * by board init code.
+ */
+ hwlock = hwspin_lock_request_specific(PREDEFINED_LOCK_ID);
+ if (!hwlock)
+ ...
+
+ /* try to take it, but don't spin on it */
+ ret = hwspin_trylock(hwlock);
+ if (!ret) {
+ pr_info("lock is already taken\n");
+ return -EBUSY;
+ }
+
+ /*
+ * we took the lock, do our thing now, but do NOT sleep
+ */
+
+ /* release the lock */
+ hwspin_unlock(hwlock);
+
+ /* free the lock */
+ ret = hwspin_lock_free(hwlock);
+ if (ret)
+ ...
+
+ return ret;
+ }
+
+
+API for implementors
+====================
+
+::
+
+ int hwspin_lock_register(struct hwspinlock_device *bank, struct device *dev,
+ const struct hwspinlock_ops *ops, int base_id, int num_locks);
+
+To be called from the underlying platform-specific implementation, in
+order to register a new hwspinlock device (which is usually a bank of
+numerous locks). Should be called from a process context (this function
+might sleep).
+
+Returns 0 on success, or appropriate error code on failure.
+
+::
+
+ int hwspin_lock_unregister(struct hwspinlock_device *bank);
+
+To be called from the underlying vendor-specific implementation, in order
+to unregister an hwspinlock device (which is usually a bank of numerous
+locks).
+
+Should be called from a process context (this function might sleep).
+
+Returns the address of hwspinlock on success, or NULL on error (e.g.
+if the hwspinlock is still in use).
+
+Important structs
+=================
+
+struct hwspinlock_device is a device which usually contains a bank
+of hardware locks. It is registered by the underlying hwspinlock
+implementation using the hwspin_lock_register() API.
+
+::
+
+ /**
+ * struct hwspinlock_device - a device which usually spans numerous hwspinlocks
+ * @dev: underlying device, will be used to invoke runtime PM api
+ * @ops: platform-specific hwspinlock handlers
+ * @base_id: id index of the first lock in this device
+ * @num_locks: number of locks in this device
+ * @lock: dynamically allocated array of 'struct hwspinlock'
+ */
+ struct hwspinlock_device {
+ struct device *dev;
+ const struct hwspinlock_ops *ops;
+ int base_id;
+ int num_locks;
+ struct hwspinlock lock[0];
+ };
+
+struct hwspinlock_device contains an array of hwspinlock structs, each
+of which represents a single hardware lock::
+
+ /**
+ * struct hwspinlock - this struct represents a single hwspinlock instance
+ * @bank: the hwspinlock_device structure which owns this lock
+ * @lock: initialized and used by hwspinlock core
+ * @priv: private data, owned by the underlying platform-specific hwspinlock drv
+ */
+ struct hwspinlock {
+ struct hwspinlock_device *bank;
+ spinlock_t lock;
+ void *priv;
+ };
+
+When registering a bank of locks, the hwspinlock driver only needs to
+set the priv members of the locks. The rest of the members are set and
+initialized by the hwspinlock core itself.
+
+Implementation callbacks
+========================
+
+There are three possible callbacks defined in 'struct hwspinlock_ops'::
+
+ struct hwspinlock_ops {
+ int (*trylock)(struct hwspinlock *lock);
+ void (*unlock)(struct hwspinlock *lock);
+ void (*relax)(struct hwspinlock *lock);
+ };
+
+The first two callbacks are mandatory:
+
+The ->trylock() callback should make a single attempt to take the lock, and
+return 0 on failure and 1 on success. This callback may **not** sleep.
+
+The ->unlock() callback releases the lock. It always succeed, and it, too,
+may **not** sleep.
+
+The ->relax() callback is optional. It is called by hwspinlock core while
+spinning on a lock, and can be used by the underlying implementation to force
+a delay between two successive invocations of ->trylock(). It may **not** sleep.
diff --git a/Documentation/locking/index.rst b/Documentation/locking/index.rst
index 5d6800a723dc..6a9ea96c8bcb 100644
--- a/Documentation/locking/index.rst
+++ b/Documentation/locking/index.rst
@@ -1,7 +1,7 @@
.. SPDX-License-Identifier: GPL-2.0
=======
-locking
+Locking
=======
.. toctree::
@@ -14,8 +14,16 @@ locking
mutex-design
rt-mutex-design
rt-mutex
+ seqlock
spinlocks
ww-mutex-design
+ preempt-locking
+ pi-futex
+ futex-requeue-pi
+ hwspinlock
+ percpu-rw-semaphore
+ robust-futexes
+ robust-futex-ABI
.. only:: subproject and html
diff --git a/Documentation/locking/lockdep-design.rst b/Documentation/locking/lockdep-design.rst
index 23fcbc4d3fc0..56b90eea2731 100644
--- a/Documentation/locking/lockdep-design.rst
+++ b/Documentation/locking/lockdep-design.rst
@@ -29,7 +29,7 @@ the validator will shoot a splat if incorrect.
A lock-class's behavior is constructed by its instances collectively:
when the first instance of a lock-class is used after bootup the class
gets registered, then all (subsequent) instances will be mapped to the
-class and hence their usages and dependecies will contribute to those of
+class and hence their usages and dependencies will contribute to those of
the class. A lock-class does not go away when a lock instance does, but
it can be removed if the memory space of the lock class (static or
dynamic) is reclaimed, this happens for example when a module is
@@ -42,6 +42,7 @@ The validator tracks lock-class usage history and divides the usage into
(4 usages * n STATEs + 1) categories:
where the 4 usages can be:
+
- 'ever held in STATE context'
- 'ever held as readlock in STATE context'
- 'ever held with STATE enabled'
@@ -49,10 +50,12 @@ where the 4 usages can be:
where the n STATEs are coded in kernel/locking/lockdep_states.h and as of
now they include:
+
- hardirq
- softirq
where the last 1 category is:
+
- 'ever used' [ == !unused ]
When locking rules are violated, these usage bits are presented in the
@@ -96,13 +99,13 @@ exact case is for the lock as of the reporting time.
+--------------+-------------+--------------+
| | irq enabled | irq disabled |
+--------------+-------------+--------------+
- | ever in irq | ? | - |
+ | ever in irq | '?' | '-' |
+--------------+-------------+--------------+
- | never in irq | + | . |
+ | never in irq | '+' | '.' |
+--------------+-------------+--------------+
The character '-' suggests irq is disabled because if otherwise the
-charactor '?' would have been shown instead. Similar deduction can be
+character '?' would have been shown instead. Similar deduction can be
applied for '+' too.
Unused locks (e.g., mutexes) cannot be part of the cause of an error.
@@ -216,7 +219,7 @@ looks like this::
BD_MUTEX_PARTITION
};
-mutex_lock_nested(&bdev->bd_contains->bd_mutex, BD_MUTEX_PARTITION);
+ mutex_lock_nested(&bdev->bd_contains->bd_mutex, BD_MUTEX_PARTITION);
In this case the locking is done on a bdev object that is known to be a
partition.
@@ -334,7 +337,7 @@ Troubleshooting:
----------------
The validator tracks a maximum of MAX_LOCKDEP_KEYS number of lock classes.
-Exceeding this number will trigger the following lockdep warning:
+Exceeding this number will trigger the following lockdep warning::
(DEBUG_LOCKS_WARN_ON(id >= MAX_LOCKDEP_KEYS))
@@ -392,3 +395,269 @@ Run the command and save the output, then compare against the output from
a later run of this command to identify the leakers. This same output
can also help you find situations where runtime lock initialization has
been omitted.
+
+Recursive read locks:
+---------------------
+The whole of the rest document tries to prove a certain type of cycle is equivalent
+to deadlock possibility.
+
+There are three types of lockers: writers (i.e. exclusive lockers, like
+spin_lock() or write_lock()), non-recursive readers (i.e. shared lockers, like
+down_read()) and recursive readers (recursive shared lockers, like rcu_read_lock()).
+And we use the following notations of those lockers in the rest of the document:
+
+ W or E: stands for writers (exclusive lockers).
+ r: stands for non-recursive readers.
+ R: stands for recursive readers.
+ S: stands for all readers (non-recursive + recursive), as both are shared lockers.
+ N: stands for writers and non-recursive readers, as both are not recursive.
+
+Obviously, N is "r or W" and S is "r or R".
+
+Recursive readers, as their name indicates, are the lockers allowed to acquire
+even inside the critical section of another reader of the same lock instance,
+in other words, allowing nested read-side critical sections of one lock instance.
+
+While non-recursive readers will cause a self deadlock if trying to acquire inside
+the critical section of another reader of the same lock instance.
+
+The difference between recursive readers and non-recursive readers is because:
+recursive readers get blocked only by a write lock *holder*, while non-recursive
+readers could get blocked by a write lock *waiter*. Considering the follow
+example::
+
+ TASK A: TASK B:
+
+ read_lock(X);
+ write_lock(X);
+ read_lock_2(X);
+
+Task A gets the reader (no matter whether recursive or non-recursive) on X via
+read_lock() first. And when task B tries to acquire writer on X, it will block
+and become a waiter for writer on X. Now if read_lock_2() is recursive readers,
+task A will make progress, because writer waiters don't block recursive readers,
+and there is no deadlock. However, if read_lock_2() is non-recursive readers,
+it will get blocked by writer waiter B, and cause a self deadlock.
+
+Block conditions on readers/writers of the same lock instance:
+--------------------------------------------------------------
+There are simply four block conditions:
+
+1. Writers block other writers.
+2. Readers block writers.
+3. Writers block both recursive readers and non-recursive readers.
+4. And readers (recursive or not) don't block other recursive readers but
+ may block non-recursive readers (because of the potential co-existing
+ writer waiters)
+
+Block condition matrix, Y means the row blocks the column, and N means otherwise.
+
+ +---+---+---+---+
+ | | W | r | R |
+ +---+---+---+---+
+ | W | Y | Y | Y |
+ +---+---+---+---+
+ | r | Y | Y | N |
+ +---+---+---+---+
+ | R | Y | Y | N |
+ +---+---+---+---+
+
+ (W: writers, r: non-recursive readers, R: recursive readers)
+
+
+acquired recursively. Unlike non-recursive read locks, recursive read locks
+only get blocked by current write lock *holders* other than write lock
+*waiters*, for example::
+
+ TASK A: TASK B:
+
+ read_lock(X);
+
+ write_lock(X);
+
+ read_lock(X);
+
+is not a deadlock for recursive read locks, as while the task B is waiting for
+the lock X, the second read_lock() doesn't need to wait because it's a recursive
+read lock. However if the read_lock() is non-recursive read lock, then the above
+case is a deadlock, because even if the write_lock() in TASK B cannot get the
+lock, but it can block the second read_lock() in TASK A.
+
+Note that a lock can be a write lock (exclusive lock), a non-recursive read
+lock (non-recursive shared lock) or a recursive read lock (recursive shared
+lock), depending on the lock operations used to acquire it (more specifically,
+the value of the 'read' parameter for lock_acquire()). In other words, a single
+lock instance has three types of acquisition depending on the acquisition
+functions: exclusive, non-recursive read, and recursive read.
+
+To be concise, we call that write locks and non-recursive read locks as
+"non-recursive" locks and recursive read locks as "recursive" locks.
+
+Recursive locks don't block each other, while non-recursive locks do (this is
+even true for two non-recursive read locks). A non-recursive lock can block the
+corresponding recursive lock, and vice versa.
+
+A deadlock case with recursive locks involved is as follow::
+
+ TASK A: TASK B:
+
+ read_lock(X);
+ read_lock(Y);
+ write_lock(Y);
+ write_lock(X);
+
+Task A is waiting for task B to read_unlock() Y and task B is waiting for task
+A to read_unlock() X.
+
+Dependency types and strong dependency paths:
+---------------------------------------------
+Lock dependencies record the orders of the acquisitions of a pair of locks, and
+because there are 3 types for lockers, there are, in theory, 9 types of lock
+dependencies, but we can show that 4 types of lock dependencies are enough for
+deadlock detection.
+
+For each lock dependency::
+
+ L1 -> L2
+
+, which means lockdep has seen L1 held before L2 held in the same context at runtime.
+And in deadlock detection, we care whether we could get blocked on L2 with L1 held,
+IOW, whether there is a locker L3 that L1 blocks L3 and L2 gets blocked by L3. So
+we only care about 1) what L1 blocks and 2) what blocks L2. As a result, we can combine
+recursive readers and non-recursive readers for L1 (as they block the same types) and
+we can combine writers and non-recursive readers for L2 (as they get blocked by the
+same types).
+
+With the above combination for simplification, there are 4 types of dependency edges
+in the lockdep graph:
+
+1) -(ER)->:
+ exclusive writer to recursive reader dependency, "X -(ER)-> Y" means
+ X -> Y and X is a writer and Y is a recursive reader.
+
+2) -(EN)->:
+ exclusive writer to non-recursive locker dependency, "X -(EN)-> Y" means
+ X -> Y and X is a writer and Y is either a writer or non-recursive reader.
+
+3) -(SR)->:
+ shared reader to recursive reader dependency, "X -(SR)-> Y" means
+ X -> Y and X is a reader (recursive or not) and Y is a recursive reader.
+
+4) -(SN)->:
+ shared reader to non-recursive locker dependency, "X -(SN)-> Y" means
+ X -> Y and X is a reader (recursive or not) and Y is either a writer or
+ non-recursive reader.
+
+Note that given two locks, they may have multiple dependencies between them,
+for example::
+
+ TASK A:
+
+ read_lock(X);
+ write_lock(Y);
+ ...
+
+ TASK B:
+
+ write_lock(X);
+ write_lock(Y);
+
+, we have both X -(SN)-> Y and X -(EN)-> Y in the dependency graph.
+
+We use -(xN)-> to represent edges that are either -(EN)-> or -(SN)->, the
+similar for -(Ex)->, -(xR)-> and -(Sx)->
+
+A "path" is a series of conjunct dependency edges in the graph. And we define a
+"strong" path, which indicates the strong dependency throughout each dependency
+in the path, as the path that doesn't have two conjunct edges (dependencies) as
+-(xR)-> and -(Sx)->. In other words, a "strong" path is a path from a lock
+walking to another through the lock dependencies, and if X -> Y -> Z is in the
+path (where X, Y, Z are locks), and the walk from X to Y is through a -(SR)-> or
+-(ER)-> dependency, the walk from Y to Z must not be through a -(SN)-> or
+-(SR)-> dependency.
+
+We will see why the path is called "strong" in next section.
+
+Recursive Read Deadlock Detection:
+----------------------------------
+
+We now prove two things:
+
+Lemma 1:
+
+If there is a closed strong path (i.e. a strong circle), then there is a
+combination of locking sequences that causes deadlock. I.e. a strong circle is
+sufficient for deadlock detection.
+
+Lemma 2:
+
+If there is no closed strong path (i.e. strong circle), then there is no
+combination of locking sequences that could cause deadlock. I.e. strong
+circles are necessary for deadlock detection.
+
+With these two Lemmas, we can easily say a closed strong path is both sufficient
+and necessary for deadlocks, therefore a closed strong path is equivalent to
+deadlock possibility. As a closed strong path stands for a dependency chain that
+could cause deadlocks, so we call it "strong", considering there are dependency
+circles that won't cause deadlocks.
+
+Proof for sufficiency (Lemma 1):
+
+Let's say we have a strong circle::
+
+ L1 -> L2 ... -> Ln -> L1
+
+, which means we have dependencies::
+
+ L1 -> L2
+ L2 -> L3
+ ...
+ Ln-1 -> Ln
+ Ln -> L1
+
+We now can construct a combination of locking sequences that cause deadlock:
+
+Firstly let's make one CPU/task get the L1 in L1 -> L2, and then another get
+the L2 in L2 -> L3, and so on. After this, all of the Lx in Lx -> Lx+1 are
+held by different CPU/tasks.
+
+And then because we have L1 -> L2, so the holder of L1 is going to acquire L2
+in L1 -> L2, however since L2 is already held by another CPU/task, plus L1 ->
+L2 and L2 -> L3 are not -(xR)-> and -(Sx)-> (the definition of strong), which
+means either L2 in L1 -> L2 is a non-recursive locker (blocked by anyone) or
+the L2 in L2 -> L3, is writer (blocking anyone), therefore the holder of L1
+cannot get L2, it has to wait L2's holder to release.
+
+Moreover, we can have a similar conclusion for L2's holder: it has to wait L3's
+holder to release, and so on. We now can prove that Lx's holder has to wait for
+Lx+1's holder to release, and note that Ln+1 is L1, so we have a circular
+waiting scenario and nobody can get progress, therefore a deadlock.
+
+Proof for necessary (Lemma 2):
+
+Lemma 2 is equivalent to: If there is a deadlock scenario, then there must be a
+strong circle in the dependency graph.
+
+According to Wikipedia[1], if there is a deadlock, then there must be a circular
+waiting scenario, means there are N CPU/tasks, where CPU/task P1 is waiting for
+a lock held by P2, and P2 is waiting for a lock held by P3, ... and Pn is waiting
+for a lock held by P1. Let's name the lock Px is waiting as Lx, so since P1 is waiting
+for L1 and holding Ln, so we will have Ln -> L1 in the dependency graph. Similarly,
+we have L1 -> L2, L2 -> L3, ..., Ln-1 -> Ln in the dependency graph, which means we
+have a circle::
+
+ Ln -> L1 -> L2 -> ... -> Ln
+
+, and now let's prove the circle is strong:
+
+For a lock Lx, Px contributes the dependency Lx-1 -> Lx and Px+1 contributes
+the dependency Lx -> Lx+1, and since Px is waiting for Px+1 to release Lx,
+so it's impossible that Lx on Px+1 is a reader and Lx on Px is a recursive
+reader, because readers (no matter recursive or not) don't block recursive
+readers, therefore Lx-1 -> Lx and Lx -> Lx+1 cannot be a -(xR)-> -(Sx)-> pair,
+and this is true for any lock in the circle, therefore, the circle is strong.
+
+References:
+-----------
+[1]: https://en.wikipedia.org/wiki/Deadlock
+[2]: Shibu, K. (2009). Intro To Embedded Systems (1st ed.). Tata McGraw-Hill
diff --git a/Documentation/locking/locktorture.rst b/Documentation/locking/locktorture.rst
index 5bcb99ba7bd9..3e763f77a620 100644
--- a/Documentation/locking/locktorture.rst
+++ b/Documentation/locking/locktorture.rst
@@ -5,7 +5,7 @@ Kernel Lock Torture Test Operation
CONFIG_LOCK_TORTURE_TEST
========================
-The CONFIG LOCK_TORTURE_TEST config option provides a kernel module
+The CONFIG_LOCK_TORTURE_TEST config option provides a kernel module
that runs torture tests on core kernel locking primitives. The kernel
module, 'locktorture', may be built after the fact on the running
kernel to be tested, if desired. The tests periodically output status
@@ -67,7 +67,7 @@ torture_type
- "rtmutex_lock":
rtmutex_lock() and rtmutex_unlock() pairs.
- Kernel must have CONFIG_RT_MUTEX=y.
+ Kernel must have CONFIG_RT_MUTEXES=y.
- "rwsem_lock":
read/write down() and up() semaphore pairs.
@@ -110,10 +110,10 @@ stutter
same period of time. Defaults to "stutter=5", so as
to run and pause for (roughly) five-second intervals.
Specifying "stutter=0" causes the test to run continuously
- without pausing, which is the old default behavior.
+ without pausing.
shuffle_interval
- The number of seconds to keep the test threads affinitied
+ The number of seconds to keep the test threads affinitized
to a particular subset of the CPUs, defaults to 3 seconds.
Used in conjunction with test_no_idle_hz.
@@ -166,4 +166,4 @@ checked for such errors. The "rmmod" command forces a "SUCCESS",
two are self-explanatory, while the last indicates that while there
were no locking failures, CPU-hotplug problems were detected.
-Also see: Documentation/RCU/torture.txt
+Also see: Documentation/RCU/torture.rst
diff --git a/Documentation/locking/locktypes.rst b/Documentation/locking/locktypes.rst
index 09f45ce38d26..80c914f6eae7 100644
--- a/Documentation/locking/locktypes.rst
+++ b/Documentation/locking/locktypes.rst
@@ -10,9 +10,10 @@ Introduction
============
The kernel provides a variety of locking primitives which can be divided
-into two categories:
+into three categories:
- Sleeping locks
+ - CPU local locks
- Spinning locks
This document conceptually describes these lock types and provides rules
@@ -44,9 +45,23 @@ Sleeping lock types:
On PREEMPT_RT kernels, these lock types are converted to sleeping locks:
+ - local_lock
- spinlock_t
- rwlock_t
+
+CPU local locks
+---------------
+
+ - local_lock
+
+On non-PREEMPT_RT kernels, local_lock functions are wrappers around
+preemption and interrupt disabling primitives. Contrary to other locking
+mechanisms, disabling preemption or interrupts are pure CPU local
+concurrency control mechanisms and not suited for inter-CPU concurrency
+control.
+
+
Spinning locks
--------------
@@ -67,6 +82,7 @@ can have suffixes which apply further protections:
_irqsave/restore() Save and disable / restore interrupt disabled state
=================== ====================================================
+
Owner semantics
===============
@@ -139,15 +155,62 @@ implementation, thus changing the fairness:
writer from starving readers.
+local_lock
+==========
+
+local_lock provides a named scope to critical sections which are protected
+by disabling preemption or interrupts.
+
+On non-PREEMPT_RT kernels local_lock operations map to the preemption and
+interrupt disabling and enabling primitives:
+
+ =============================== ======================
+ local_lock(&llock) preempt_disable()
+ local_unlock(&llock) preempt_enable()
+ local_lock_irq(&llock) local_irq_disable()
+ local_unlock_irq(&llock) local_irq_enable()
+ local_lock_irqsave(&llock) local_irq_save()
+ local_unlock_irqrestore(&llock) local_irq_restore()
+ =============================== ======================
+
+The named scope of local_lock has two advantages over the regular
+primitives:
+
+ - The lock name allows static analysis and is also a clear documentation
+ of the protection scope while the regular primitives are scopeless and
+ opaque.
+
+ - If lockdep is enabled the local_lock gains a lockmap which allows to
+ validate the correctness of the protection. This can detect cases where
+ e.g. a function using preempt_disable() as protection mechanism is
+ invoked from interrupt or soft-interrupt context. Aside of that
+ lockdep_assert_held(&llock) works as with any other locking primitive.
+
+local_lock and PREEMPT_RT
+-------------------------
+
+PREEMPT_RT kernels map local_lock to a per-CPU spinlock_t, thus changing
+semantics:
+
+ - All spinlock_t changes also apply to local_lock.
+
+local_lock usage
+----------------
+
+local_lock should be used in situations where disabling preemption or
+interrupts is the appropriate form of concurrency control to protect
+per-CPU data structures on a non PREEMPT_RT kernel.
+
+local_lock is not suitable to protect against preemption or interrupts on a
+PREEMPT_RT kernel due to the PREEMPT_RT specific spinlock_t semantics.
+
+
raw_spinlock_t and spinlock_t
=============================
raw_spinlock_t
--------------
-raw_spinlock_t is a strict spinning lock implementation regardless of the
-kernel configuration including PREEMPT_RT enabled kernels.
-
raw_spinlock_t is a strict spinning lock implementation in all kernels,
including PREEMPT_RT kernels. Use raw_spinlock_t only in real critical
core code, low-level interrupt handling and places where disabling
@@ -181,7 +244,7 @@ based on rt_mutex which changes the semantics:
Non-PREEMPT_RT kernels disable preemption to get this effect.
PREEMPT_RT kernels use a per-CPU lock for serialization which keeps
- preemption disabled. The lock disables softirq handlers and also
+ preemption enabled. The lock disables softirq handlers and also
prevents reentrancy due to task preemption.
PREEMPT_RT kernels preserve all other spinlock_t semantics:
@@ -258,10 +321,82 @@ implementation, thus changing semantics:
PREEMPT_RT caveats
==================
+local_lock on RT
+----------------
+
+The mapping of local_lock to spinlock_t on PREEMPT_RT kernels has a few
+implications. For example, on a non-PREEMPT_RT kernel the following code
+sequence works as expected::
+
+ local_lock_irq(&local_lock);
+ raw_spin_lock(&lock);
+
+and is fully equivalent to::
+
+ raw_spin_lock_irq(&lock);
+
+On a PREEMPT_RT kernel this code sequence breaks because local_lock_irq()
+is mapped to a per-CPU spinlock_t which neither disables interrupts nor
+preemption. The following code sequence works perfectly correct on both
+PREEMPT_RT and non-PREEMPT_RT kernels::
+
+ local_lock_irq(&local_lock);
+ spin_lock(&lock);
+
+Another caveat with local locks is that each local_lock has a specific
+protection scope. So the following substitution is wrong::
+
+ func1()
+ {
+ local_irq_save(flags); -> local_lock_irqsave(&local_lock_1, flags);
+ func3();
+ local_irq_restore(flags); -> local_unlock_irqrestore(&local_lock_1, flags);
+ }
+
+ func2()
+ {
+ local_irq_save(flags); -> local_lock_irqsave(&local_lock_2, flags);
+ func3();
+ local_irq_restore(flags); -> local_unlock_irqrestore(&local_lock_2, flags);
+ }
+
+ func3()
+ {
+ lockdep_assert_irqs_disabled();
+ access_protected_data();
+ }
+
+On a non-PREEMPT_RT kernel this works correctly, but on a PREEMPT_RT kernel
+local_lock_1 and local_lock_2 are distinct and cannot serialize the callers
+of func3(). Also the lockdep assert will trigger on a PREEMPT_RT kernel
+because local_lock_irqsave() does not disable interrupts due to the
+PREEMPT_RT-specific semantics of spinlock_t. The correct substitution is::
+
+ func1()
+ {
+ local_irq_save(flags); -> local_lock_irqsave(&local_lock, flags);
+ func3();
+ local_irq_restore(flags); -> local_unlock_irqrestore(&local_lock, flags);
+ }
+
+ func2()
+ {
+ local_irq_save(flags); -> local_lock_irqsave(&local_lock, flags);
+ func3();
+ local_irq_restore(flags); -> local_unlock_irqrestore(&local_lock, flags);
+ }
+
+ func3()
+ {
+ lockdep_assert_held(&local_lock);
+ access_protected_data();
+ }
+
+
spinlock_t and rwlock_t
-----------------------
-These changes in spinlock_t and rwlock_t semantics on PREEMPT_RT kernels
+The changes in spinlock_t and rwlock_t semantics on PREEMPT_RT kernels
have a few implications. For example, on a non-PREEMPT_RT kernel the
following code sequence works as expected::
@@ -282,9 +417,58 @@ local_lock mechanism. Acquiring the local_lock pins the task to a CPU,
allowing things like per-CPU interrupt disabled locks to be acquired.
However, this approach should be used only where absolutely necessary.
+A typical scenario is protection of per-CPU variables in thread context::
-raw_spinlock_t
---------------
+ struct foo *p = get_cpu_ptr(&var1);
+
+ spin_lock(&p->lock);
+ p->count += this_cpu_read(var2);
+
+This is correct code on a non-PREEMPT_RT kernel, but on a PREEMPT_RT kernel
+this breaks. The PREEMPT_RT-specific change of spinlock_t semantics does
+not allow to acquire p->lock because get_cpu_ptr() implicitly disables
+preemption. The following substitution works on both kernels::
+
+ struct foo *p;
+
+ migrate_disable();
+ p = this_cpu_ptr(&var1);
+ spin_lock(&p->lock);
+ p->count += this_cpu_read(var2);
+
+migrate_disable() ensures that the task is pinned on the current CPU which
+in turn guarantees that the per-CPU access to var1 and var2 are staying on
+the same CPU while the task remains preemptible.
+
+The migrate_disable() substitution is not valid for the following
+scenario::
+
+ func()
+ {
+ struct foo *p;
+
+ migrate_disable();
+ p = this_cpu_ptr(&var1);
+ p->val = func2();
+
+This breaks because migrate_disable() does not protect against reentrancy from
+a preempting task. A correct substitution for this case is::
+
+ func()
+ {
+ struct foo *p;
+
+ local_lock(&foo_lock);
+ p = this_cpu_ptr(&var1);
+ p->val = func2();
+
+On a non-PREEMPT_RT kernel this protects against reentrancy by disabling
+preemption. On a PREEMPT_RT kernel this is achieved by acquiring the
+underlying per-CPU spinlock.
+
+
+raw_spinlock_t on RT
+--------------------
Acquiring a raw_spinlock_t disables preemption and possibly also
interrupts, so the critical section must avoid acquiring a regular
@@ -316,7 +500,7 @@ caveats also apply to bit spinlocks.
Some bit spinlocks are replaced with regular spinlock_t for PREEMPT_RT
using conditional (#ifdef'ed) code changes at the usage site. In contrast,
usage-site changes are not needed for the spinlock_t substitution.
-Instead, conditionals in header files and the core locking implemementation
+Instead, conditionals in header files and the core locking implementation
enable the compiler to do the substitution transparently.
@@ -325,22 +509,25 @@ Lock type nesting rules
The most basic rules are:
- - Lock types of the same lock category (sleeping, spinning) can nest
- arbitrarily as long as they respect the general lock ordering rules to
- prevent deadlocks.
+ - Lock types of the same lock category (sleeping, CPU local, spinning)
+ can nest arbitrarily as long as they respect the general lock ordering
+ rules to prevent deadlocks.
+
+ - Sleeping lock types cannot nest inside CPU local and spinning lock types.
- - Sleeping lock types cannot nest inside spinning lock types.
+ - CPU local and spinning lock types can nest inside sleeping lock types.
- - Spinning lock types can nest inside sleeping lock types.
+ - Spinning lock types can nest inside all lock types
These constraints apply both in PREEMPT_RT and otherwise.
The fact that PREEMPT_RT changes the lock category of spinlock_t and
-rwlock_t from spinning to sleeping means that they cannot be acquired while
-holding a raw spinlock. This results in the following nesting ordering:
+rwlock_t from spinning to sleeping and substitutes local_lock with a
+per-CPU spinlock_t means that they cannot be acquired while holding a raw
+spinlock. This results in the following nesting ordering:
1) Sleeping locks
- 2) spinlock_t and rwlock_t
+ 2) spinlock_t, rwlock_t, local_lock
3) raw_spinlock_t and bit spinlocks
Lockdep will complain if these constraints are violated, both in
diff --git a/Documentation/locking/mutex-design.rst b/Documentation/locking/mutex-design.rst
index 4d8236b81fa5..7c30b4aa5e28 100644
--- a/Documentation/locking/mutex-design.rst
+++ b/Documentation/locking/mutex-design.rst
@@ -18,7 +18,7 @@ as an alternative to these. This new data structure provided a number
of advantages, including simpler interfaces, and at that time smaller
code (see Disadvantages).
-[1] http://lwn.net/Articles/164802/
+[1] https://lwn.net/Articles/164802/
Implementation
--------------
@@ -28,7 +28,7 @@ and implemented in kernel/locking/mutex.c. These locks use an atomic variable
(->owner) to keep track of the lock state during its lifetime. Field owner
actually contains `struct task_struct *` to the current lock owner and it is
therefore NULL if not currently owned. Since task_struct pointers are aligned
-at at least L1_CACHE_BYTES, low bits (3) are used to store extra state (e.g.,
+to at least L1_CACHE_BYTES, low bits (3) are used to store extra state (e.g.,
if waiter list is non-empty). In its most basic form it also includes a
wait-queue and a spinlock that serializes access to it. Furthermore,
CONFIG_MUTEX_SPIN_ON_OWNER=y systems use a spinner MCS lock (->osq), described
@@ -101,6 +101,24 @@ features that make lock debugging easier and faster:
- Detects multi-task circular deadlocks and prints out all affected
locks and tasks (and only those tasks).
+Mutexes - and most other sleeping locks like rwsems - do not provide an
+implicit reference for the memory they occupy, which reference is released
+with mutex_unlock().
+
+[ This is in contrast with spin_unlock() [or completion_done()], which
+ APIs can be used to guarantee that the memory is not touched by the
+ lock implementation after spin_unlock()/completion_done() releases
+ the lock. ]
+
+mutex_unlock() may access the mutex structure even after it has internally
+released the lock already - so it's not safe for another context to
+acquire the mutex and assume that the mutex_unlock() context is not using
+the structure anymore.
+
+The mutex user must ensure that the mutex is not destroyed while a
+release operation is still in progress - in other words, callers of
+mutex_unlock() must ensure that the mutex stays alive until mutex_unlock()
+has returned.
Interfaces
----------
diff --git a/Documentation/locking/percpu-rw-semaphore.rst b/Documentation/locking/percpu-rw-semaphore.rst
new file mode 100644
index 000000000000..247de6410855
--- /dev/null
+++ b/Documentation/locking/percpu-rw-semaphore.rst
@@ -0,0 +1,28 @@
+====================
+Percpu rw semaphores
+====================
+
+Percpu rw semaphores is a new read-write semaphore design that is
+optimized for locking for reading.
+
+The problem with traditional read-write semaphores is that when multiple
+cores take the lock for reading, the cache line containing the semaphore
+is bouncing between L1 caches of the cores, causing performance
+degradation.
+
+Locking for reading is very fast, it uses RCU and it avoids any atomic
+instruction in the lock and unlock path. On the other hand, locking for
+writing is very expensive, it calls synchronize_rcu() that can take
+hundreds of milliseconds.
+
+The lock is declared with "struct percpu_rw_semaphore" type.
+The lock is initialized percpu_init_rwsem, it returns 0 on success and
+-ENOMEM on allocation failure.
+The lock must be freed with percpu_free_rwsem to avoid memory leak.
+
+The lock is locked for read with percpu_down_read, percpu_up_read and
+for write with percpu_down_write, percpu_up_write.
+
+The idea of using RCU for optimized rw-lock was introduced by
+Eric Dumazet <eric.dumazet@gmail.com>.
+The code was written by Mikulas Patocka <mpatocka@redhat.com>
diff --git a/Documentation/locking/pi-futex.rst b/Documentation/locking/pi-futex.rst
new file mode 100644
index 000000000000..c33ba2befbf8
--- /dev/null
+++ b/Documentation/locking/pi-futex.rst
@@ -0,0 +1,122 @@
+======================
+Lightweight PI-futexes
+======================
+
+We are calling them lightweight for 3 reasons:
+
+ - in the user-space fastpath a PI-enabled futex involves no kernel work
+ (or any other PI complexity) at all. No registration, no extra kernel
+ calls - just pure fast atomic ops in userspace.
+
+ - even in the slowpath, the system call and scheduling pattern is very
+ similar to normal futexes.
+
+ - the in-kernel PI implementation is streamlined around the mutex
+ abstraction, with strict rules that keep the implementation
+ relatively simple: only a single owner may own a lock (i.e. no
+ read-write lock support), only the owner may unlock a lock, no
+ recursive locking, etc.
+
+Priority Inheritance - why?
+---------------------------
+
+The short reply: user-space PI helps achieving/improving determinism for
+user-space applications. In the best-case, it can help achieve
+determinism and well-bound latencies. Even in the worst-case, PI will
+improve the statistical distribution of locking related application
+delays.
+
+The longer reply
+----------------
+
+Firstly, sharing locks between multiple tasks is a common programming
+technique that often cannot be replaced with lockless algorithms. As we
+can see it in the kernel [which is a quite complex program in itself],
+lockless structures are rather the exception than the norm - the current
+ratio of lockless vs. locky code for shared data structures is somewhere
+between 1:10 and 1:100. Lockless is hard, and the complexity of lockless
+algorithms often endangers to ability to do robust reviews of said code.
+I.e. critical RT apps often choose lock structures to protect critical
+data structures, instead of lockless algorithms. Furthermore, there are
+cases (like shared hardware, or other resource limits) where lockless
+access is mathematically impossible.
+
+Media players (such as Jack) are an example of reasonable application
+design with multiple tasks (with multiple priority levels) sharing
+short-held locks: for example, a highprio audio playback thread is
+combined with medium-prio construct-audio-data threads and low-prio
+display-colory-stuff threads. Add video and decoding to the mix and
+we've got even more priority levels.
+
+So once we accept that synchronization objects (locks) are an
+unavoidable fact of life, and once we accept that multi-task userspace
+apps have a very fair expectation of being able to use locks, we've got
+to think about how to offer the option of a deterministic locking
+implementation to user-space.
+
+Most of the technical counter-arguments against doing priority
+inheritance only apply to kernel-space locks. But user-space locks are
+different, there we cannot disable interrupts or make the task
+non-preemptible in a critical section, so the 'use spinlocks' argument
+does not apply (user-space spinlocks have the same priority inversion
+problems as other user-space locking constructs). Fact is, pretty much
+the only technique that currently enables good determinism for userspace
+locks (such as futex-based pthread mutexes) is priority inheritance:
+
+Currently (without PI), if a high-prio and a low-prio task shares a lock
+[this is a quite common scenario for most non-trivial RT applications],
+even if all critical sections are coded carefully to be deterministic
+(i.e. all critical sections are short in duration and only execute a
+limited number of instructions), the kernel cannot guarantee any
+deterministic execution of the high-prio task: any medium-priority task
+could preempt the low-prio task while it holds the shared lock and
+executes the critical section, and could delay it indefinitely.
+
+Implementation
+--------------
+
+As mentioned before, the userspace fastpath of PI-enabled pthread
+mutexes involves no kernel work at all - they behave quite similarly to
+normal futex-based locks: a 0 value means unlocked, and a value==TID
+means locked. (This is the same method as used by list-based robust
+futexes.) Userspace uses atomic ops to lock/unlock these mutexes without
+entering the kernel.
+
+To handle the slowpath, we have added two new futex ops:
+
+ - FUTEX_LOCK_PI
+ - FUTEX_UNLOCK_PI
+
+If the lock-acquire fastpath fails, [i.e. an atomic transition from 0 to
+TID fails], then FUTEX_LOCK_PI is called. The kernel does all the
+remaining work: if there is no futex-queue attached to the futex address
+yet then the code looks up the task that owns the futex [it has put its
+own TID into the futex value], and attaches a 'PI state' structure to
+the futex-queue. The pi_state includes an rt-mutex, which is a PI-aware,
+kernel-based synchronization object. The 'other' task is made the owner
+of the rt-mutex, and the FUTEX_WAITERS bit is atomically set in the
+futex value. Then this task tries to lock the rt-mutex, on which it
+blocks. Once it returns, it has the mutex acquired, and it sets the
+futex value to its own TID and returns. Userspace has no other work to
+perform - it now owns the lock, and futex value contains
+FUTEX_WAITERS|TID.
+
+If the unlock side fastpath succeeds, [i.e. userspace manages to do a
+TID -> 0 atomic transition of the futex value], then no kernel work is
+triggered.
+
+If the unlock fastpath fails (because the FUTEX_WAITERS bit is set),
+then FUTEX_UNLOCK_PI is called, and the kernel unlocks the futex on the
+behalf of userspace - and it also unlocks the attached
+pi_state->rt_mutex and thus wakes up any potential waiters.
+
+Note that under this approach, contrary to previous PI-futex approaches,
+there is no prior 'registration' of a PI-futex. [which is not quite
+possible anyway, due to existing ABI properties of pthread mutexes.]
+
+Also, under this scheme, 'robustness' and 'PI' are two orthogonal
+properties of futexes, and all four combinations are possible: futex,
+robust-futex, PI-futex, robust+PI-futex.
+
+More details about priority inheritance can be found in
+Documentation/locking/rt-mutex.rst.
diff --git a/Documentation/locking/preempt-locking.rst b/Documentation/locking/preempt-locking.rst
new file mode 100644
index 000000000000..dce336134e54
--- /dev/null
+++ b/Documentation/locking/preempt-locking.rst
@@ -0,0 +1,144 @@
+===========================================================================
+Proper Locking Under a Preemptible Kernel: Keeping Kernel Code Preempt-Safe
+===========================================================================
+
+:Author: Robert Love <rml@tech9.net>
+
+
+Introduction
+============
+
+
+A preemptible kernel creates new locking issues. The issues are the same as
+those under SMP: concurrency and reentrancy. Thankfully, the Linux preemptible
+kernel model leverages existing SMP locking mechanisms. Thus, the kernel
+requires explicit additional locking for very few additional situations.
+
+This document is for all kernel hackers. Developing code in the kernel
+requires protecting these situations.
+
+
+RULE #1: Per-CPU data structures need explicit protection
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+
+Two similar problems arise. An example code snippet::
+
+ struct this_needs_locking tux[NR_CPUS];
+ tux[smp_processor_id()] = some_value;
+ /* task is preempted here... */
+ something = tux[smp_processor_id()];
+
+First, since the data is per-CPU, it may not have explicit SMP locking, but
+require it otherwise. Second, when a preempted task is finally rescheduled,
+the previous value of smp_processor_id may not equal the current. You must
+protect these situations by disabling preemption around them.
+
+You can also use put_cpu() and get_cpu(), which will disable preemption.
+
+
+RULE #2: CPU state must be protected.
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+
+Under preemption, the state of the CPU must be protected. This is arch-
+dependent, but includes CPU structures and state not preserved over a context
+switch. For example, on x86, entering and exiting FPU mode is now a critical
+section that must occur while preemption is disabled. Think what would happen
+if the kernel is executing a floating-point instruction and is then preempted.
+Remember, the kernel does not save FPU state except for user tasks. Therefore,
+upon preemption, the FPU registers will be sold to the lowest bidder. Thus,
+preemption must be disabled around such regions.
+
+Note, some FPU functions are already explicitly preempt safe. For example,
+kernel_fpu_begin and kernel_fpu_end will disable and enable preemption.
+
+
+RULE #3: Lock acquire and release must be performed by same task
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+
+A lock acquired in one task must be released by the same task. This
+means you can't do oddball things like acquire a lock and go off to
+play while another task releases it. If you want to do something
+like this, acquire and release the task in the same code path and
+have the caller wait on an event by the other task.
+
+
+Solution
+========
+
+
+Data protection under preemption is achieved by disabling preemption for the
+duration of the critical region.
+
+::
+
+ preempt_enable() decrement the preempt counter
+ preempt_disable() increment the preempt counter
+ preempt_enable_no_resched() decrement, but do not immediately preempt
+ preempt_check_resched() if needed, reschedule
+ preempt_count() return the preempt counter
+
+The functions are nestable. In other words, you can call preempt_disable
+n-times in a code path, and preemption will not be reenabled until the n-th
+call to preempt_enable. The preempt statements define to nothing if
+preemption is not enabled.
+
+Note that you do not need to explicitly prevent preemption if you are holding
+any locks or interrupts are disabled, since preemption is implicitly disabled
+in those cases.
+
+But keep in mind that 'irqs disabled' is a fundamentally unsafe way of
+disabling preemption - any cond_resched() or cond_resched_lock() might trigger
+a reschedule if the preempt count is 0. A simple printk() might trigger a
+reschedule. So use this implicit preemption-disabling property only if you
+know that the affected codepath does not do any of this. Best policy is to use
+this only for small, atomic code that you wrote and which calls no complex
+functions.
+
+Example::
+
+ cpucache_t *cc; /* this is per-CPU */
+ preempt_disable();
+ cc = cc_data(searchp);
+ if (cc && cc->avail) {
+ __free_block(searchp, cc_entry(cc), cc->avail);
+ cc->avail = 0;
+ }
+ preempt_enable();
+ return 0;
+
+Notice how the preemption statements must encompass every reference of the
+critical variables. Another example::
+
+ int buf[NR_CPUS];
+ set_cpu_val(buf);
+ if (buf[smp_processor_id()] == -1) printf(KERN_INFO "wee!\n");
+ spin_lock(&buf_lock);
+ /* ... */
+
+This code is not preempt-safe, but see how easily we can fix it by simply
+moving the spin_lock up two lines.
+
+
+Preventing preemption using interrupt disabling
+===============================================
+
+
+It is possible to prevent a preemption event using local_irq_disable and
+local_irq_save. Note, when doing so, you must be very careful to not cause
+an event that would set need_resched and result in a preemption check. When
+in doubt, rely on locking or explicit preemption disabling.
+
+Note in 2.5 interrupt disabling is now only per-CPU (e.g. local).
+
+An additional concern is proper usage of local_irq_disable and local_irq_save.
+These may be used to protect from preemption, however, on exit, if preemption
+may be enabled, a test to see if preemption is required should be done. If
+these are called from the spin_lock and read/write lock macros, the right thing
+is done. They may also be called within a spin-lock protected region, however,
+if they are ever called outside of this context, a test for preemption should
+be made. Do note that calls from interrupt context or bottom half/ tasklets
+are also protected by preemption locks and so may use the versions which do
+not check preemption.
diff --git a/Documentation/locking/robust-futex-ABI.rst b/Documentation/locking/robust-futex-ABI.rst
new file mode 100644
index 000000000000..f24904f1c16f
--- /dev/null
+++ b/Documentation/locking/robust-futex-ABI.rst
@@ -0,0 +1,184 @@
+====================
+The robust futex ABI
+====================
+
+:Author: Started by Paul Jackson <pj@sgi.com>
+
+
+Robust_futexes provide a mechanism that is used in addition to normal
+futexes, for kernel assist of cleanup of held locks on task exit.
+
+The interesting data as to what futexes a thread is holding is kept on a
+linked list in user space, where it can be updated efficiently as locks
+are taken and dropped, without kernel intervention. The only additional
+kernel intervention required for robust_futexes above and beyond what is
+required for futexes is:
+
+ 1) a one time call, per thread, to tell the kernel where its list of
+ held robust_futexes begins, and
+ 2) internal kernel code at exit, to handle any listed locks held
+ by the exiting thread.
+
+The existing normal futexes already provide a "Fast Userspace Locking"
+mechanism, which handles uncontested locking without needing a system
+call, and handles contested locking by maintaining a list of waiting
+threads in the kernel. Options on the sys_futex(2) system call support
+waiting on a particular futex, and waking up the next waiter on a
+particular futex.
+
+For robust_futexes to work, the user code (typically in a library such
+as glibc linked with the application) has to manage and place the
+necessary list elements exactly as the kernel expects them. If it fails
+to do so, then improperly listed locks will not be cleaned up on exit,
+probably causing deadlock or other such failure of the other threads
+waiting on the same locks.
+
+A thread that anticipates possibly using robust_futexes should first
+issue the system call::
+
+ asmlinkage long
+ sys_set_robust_list(struct robust_list_head __user *head, size_t len);
+
+The pointer 'head' points to a structure in the threads address space
+consisting of three words. Each word is 32 bits on 32 bit arch's, or 64
+bits on 64 bit arch's, and local byte order. Each thread should have
+its own thread private 'head'.
+
+If a thread is running in 32 bit compatibility mode on a 64 native arch
+kernel, then it can actually have two such structures - one using 32 bit
+words for 32 bit compatibility mode, and one using 64 bit words for 64
+bit native mode. The kernel, if it is a 64 bit kernel supporting 32 bit
+compatibility mode, will attempt to process both lists on each task
+exit, if the corresponding sys_set_robust_list() call has been made to
+setup that list.
+
+ The first word in the memory structure at 'head' contains a
+ pointer to a single linked list of 'lock entries', one per lock,
+ as described below. If the list is empty, the pointer will point
+ to itself, 'head'. The last 'lock entry' points back to the 'head'.
+
+ The second word, called 'offset', specifies the offset from the
+ address of the associated 'lock entry', plus or minus, of what will
+ be called the 'lock word', from that 'lock entry'. The 'lock word'
+ is always a 32 bit word, unlike the other words above. The 'lock
+ word' holds 2 flag bits in the upper 2 bits, and the thread id (TID)
+ of the thread holding the lock in the bottom 30 bits. See further
+ below for a description of the flag bits.
+
+ The third word, called 'list_op_pending', contains transient copy of
+ the address of the 'lock entry', during list insertion and removal,
+ and is needed to correctly resolve races should a thread exit while
+ in the middle of a locking or unlocking operation.
+
+Each 'lock entry' on the single linked list starting at 'head' consists
+of just a single word, pointing to the next 'lock entry', or back to
+'head' if there are no more entries. In addition, nearby to each 'lock
+entry', at an offset from the 'lock entry' specified by the 'offset'
+word, is one 'lock word'.
+
+The 'lock word' is always 32 bits, and is intended to be the same 32 bit
+lock variable used by the futex mechanism, in conjunction with
+robust_futexes. The kernel will only be able to wakeup the next thread
+waiting for a lock on a threads exit if that next thread used the futex
+mechanism to register the address of that 'lock word' with the kernel.
+
+For each futex lock currently held by a thread, if it wants this
+robust_futex support for exit cleanup of that lock, it should have one
+'lock entry' on this list, with its associated 'lock word' at the
+specified 'offset'. Should a thread die while holding any such locks,
+the kernel will walk this list, mark any such locks with a bit
+indicating their holder died, and wakeup the next thread waiting for
+that lock using the futex mechanism.
+
+When a thread has invoked the above system call to indicate it
+anticipates using robust_futexes, the kernel stores the passed in 'head'
+pointer for that task. The task may retrieve that value later on by
+using the system call::
+
+ asmlinkage long
+ sys_get_robust_list(int pid, struct robust_list_head __user **head_ptr,
+ size_t __user *len_ptr);
+
+It is anticipated that threads will use robust_futexes embedded in
+larger, user level locking structures, one per lock. The kernel
+robust_futex mechanism doesn't care what else is in that structure, so
+long as the 'offset' to the 'lock word' is the same for all
+robust_futexes used by that thread. The thread should link those locks
+it currently holds using the 'lock entry' pointers. It may also have
+other links between the locks, such as the reverse side of a double
+linked list, but that doesn't matter to the kernel.
+
+By keeping its locks linked this way, on a list starting with a 'head'
+pointer known to the kernel, the kernel can provide to a thread the
+essential service available for robust_futexes, which is to help clean
+up locks held at the time of (a perhaps unexpectedly) exit.
+
+Actual locking and unlocking, during normal operations, is handled
+entirely by user level code in the contending threads, and by the
+existing futex mechanism to wait for, and wakeup, locks. The kernels
+only essential involvement in robust_futexes is to remember where the
+list 'head' is, and to walk the list on thread exit, handling locks
+still held by the departing thread, as described below.
+
+There may exist thousands of futex lock structures in a threads shared
+memory, on various data structures, at a given point in time. Only those
+lock structures for locks currently held by that thread should be on
+that thread's robust_futex linked lock list a given time.
+
+A given futex lock structure in a user shared memory region may be held
+at different times by any of the threads with access to that region. The
+thread currently holding such a lock, if any, is marked with the threads
+TID in the lower 30 bits of the 'lock word'.
+
+When adding or removing a lock from its list of held locks, in order for
+the kernel to correctly handle lock cleanup regardless of when the task
+exits (perhaps it gets an unexpected signal 9 in the middle of
+manipulating this list), the user code must observe the following
+protocol on 'lock entry' insertion and removal:
+
+On insertion:
+
+ 1) set the 'list_op_pending' word to the address of the 'lock entry'
+ to be inserted,
+ 2) acquire the futex lock,
+ 3) add the lock entry, with its thread id (TID) in the bottom 30 bits
+ of the 'lock word', to the linked list starting at 'head', and
+ 4) clear the 'list_op_pending' word.
+
+On removal:
+
+ 1) set the 'list_op_pending' word to the address of the 'lock entry'
+ to be removed,
+ 2) remove the lock entry for this lock from the 'head' list,
+ 3) release the futex lock, and
+ 4) clear the 'lock_op_pending' word.
+
+On exit, the kernel will consider the address stored in
+'list_op_pending' and the address of each 'lock word' found by walking
+the list starting at 'head'. For each such address, if the bottom 30
+bits of the 'lock word' at offset 'offset' from that address equals the
+exiting threads TID, then the kernel will do two things:
+
+ 1) if bit 31 (0x80000000) is set in that word, then attempt a futex
+ wakeup on that address, which will waken the next thread that has
+ used to the futex mechanism to wait on that address, and
+ 2) atomically set bit 30 (0x40000000) in the 'lock word'.
+
+In the above, bit 31 was set by futex waiters on that lock to indicate
+they were waiting, and bit 30 is set by the kernel to indicate that the
+lock owner died holding the lock.
+
+The kernel exit code will silently stop scanning the list further if at
+any point:
+
+ 1) the 'head' pointer or an subsequent linked list pointer
+ is not a valid address of a user space word
+ 2) the calculated location of the 'lock word' (address plus
+ 'offset') is not the valid address of a 32 bit user space
+ word
+ 3) if the list contains more than 1 million (subject to
+ future kernel configuration changes) elements.
+
+When the kernel sees a list entry whose 'lock word' doesn't have the
+current threads TID in the lower 30 bits, it does nothing with that
+entry, and goes on to the next entry.
diff --git a/Documentation/locking/robust-futexes.rst b/Documentation/locking/robust-futexes.rst
new file mode 100644
index 000000000000..6361fb01c9c1
--- /dev/null
+++ b/Documentation/locking/robust-futexes.rst
@@ -0,0 +1,221 @@
+========================================
+A description of what robust futexes are
+========================================
+
+:Started by: Ingo Molnar <mingo@redhat.com>
+
+Background
+----------
+
+what are robust futexes? To answer that, we first need to understand
+what futexes are: normal futexes are special types of locks that in the
+noncontended case can be acquired/released from userspace without having
+to enter the kernel.
+
+A futex is in essence a user-space address, e.g. a 32-bit lock variable
+field. If userspace notices contention (the lock is already owned and
+someone else wants to grab it too) then the lock is marked with a value
+that says "there's a waiter pending", and the sys_futex(FUTEX_WAIT)
+syscall is used to wait for the other guy to release it. The kernel
+creates a 'futex queue' internally, so that it can later on match up the
+waiter with the waker - without them having to know about each other.
+When the owner thread releases the futex, it notices (via the variable
+value) that there were waiter(s) pending, and does the
+sys_futex(FUTEX_WAKE) syscall to wake them up. Once all waiters have
+taken and released the lock, the futex is again back to 'uncontended'
+state, and there's no in-kernel state associated with it. The kernel
+completely forgets that there ever was a futex at that address. This
+method makes futexes very lightweight and scalable.
+
+"Robustness" is about dealing with crashes while holding a lock: if a
+process exits prematurely while holding a pthread_mutex_t lock that is
+also shared with some other process (e.g. yum segfaults while holding a
+pthread_mutex_t, or yum is kill -9-ed), then waiters for that lock need
+to be notified that the last owner of the lock exited in some irregular
+way.
+
+To solve such types of problems, "robust mutex" userspace APIs were
+created: pthread_mutex_lock() returns an error value if the owner exits
+prematurely - and the new owner can decide whether the data protected by
+the lock can be recovered safely.
+
+There is a big conceptual problem with futex based mutexes though: it is
+the kernel that destroys the owner task (e.g. due to a SEGFAULT), but
+the kernel cannot help with the cleanup: if there is no 'futex queue'
+(and in most cases there is none, futexes being fast lightweight locks)
+then the kernel has no information to clean up after the held lock!
+Userspace has no chance to clean up after the lock either - userspace is
+the one that crashes, so it has no opportunity to clean up. Catch-22.
+
+In practice, when e.g. yum is kill -9-ed (or segfaults), a system reboot
+is needed to release that futex based lock. This is one of the leading
+bugreports against yum.
+
+To solve this problem, the traditional approach was to extend the vma
+(virtual memory area descriptor) concept to have a notion of 'pending
+robust futexes attached to this area'. This approach requires 3 new
+syscall variants to sys_futex(): FUTEX_REGISTER, FUTEX_DEREGISTER and
+FUTEX_RECOVER. At do_exit() time, all vmas are searched to see whether
+they have a robust_head set. This approach has two fundamental problems
+left:
+
+ - it has quite complex locking and race scenarios. The vma-based
+ approach had been pending for years, but they are still not completely
+ reliable.
+
+ - they have to scan _every_ vma at sys_exit() time, per thread!
+
+The second disadvantage is a real killer: pthread_exit() takes around 1
+microsecond on Linux, but with thousands (or tens of thousands) of vmas
+every pthread_exit() takes a millisecond or more, also totally
+destroying the CPU's L1 and L2 caches!
+
+This is very much noticeable even for normal process sys_exit_group()
+calls: the kernel has to do the vma scanning unconditionally! (this is
+because the kernel has no knowledge about how many robust futexes there
+are to be cleaned up, because a robust futex might have been registered
+in another task, and the futex variable might have been simply mmap()-ed
+into this process's address space).
+
+This huge overhead forced the creation of CONFIG_FUTEX_ROBUST so that
+normal kernels can turn it off, but worse than that: the overhead makes
+robust futexes impractical for any type of generic Linux distribution.
+
+So something had to be done.
+
+New approach to robust futexes
+------------------------------
+
+At the heart of this new approach there is a per-thread private list of
+robust locks that userspace is holding (maintained by glibc) - which
+userspace list is registered with the kernel via a new syscall [this
+registration happens at most once per thread lifetime]. At do_exit()
+time, the kernel checks this user-space list: are there any robust futex
+locks to be cleaned up?
+
+In the common case, at do_exit() time, there is no list registered, so
+the cost of robust futexes is just a simple current->robust_list != NULL
+comparison. If the thread has registered a list, then normally the list
+is empty. If the thread/process crashed or terminated in some incorrect
+way then the list might be non-empty: in this case the kernel carefully
+walks the list [not trusting it], and marks all locks that are owned by
+this thread with the FUTEX_OWNER_DIED bit, and wakes up one waiter (if
+any).
+
+The list is guaranteed to be private and per-thread at do_exit() time,
+so it can be accessed by the kernel in a lockless way.
+
+There is one race possible though: since adding to and removing from the
+list is done after the futex is acquired by glibc, there is a few
+instructions window for the thread (or process) to die there, leaving
+the futex hung. To protect against this possibility, userspace (glibc)
+also maintains a simple per-thread 'list_op_pending' field, to allow the
+kernel to clean up if the thread dies after acquiring the lock, but just
+before it could have added itself to the list. Glibc sets this
+list_op_pending field before it tries to acquire the futex, and clears
+it after the list-add (or list-remove) has finished.
+
+That's all that is needed - all the rest of robust-futex cleanup is done
+in userspace [just like with the previous patches].
+
+Ulrich Drepper has implemented the necessary glibc support for this new
+mechanism, which fully enables robust mutexes.
+
+Key differences of this userspace-list based approach, compared to the
+vma based method:
+
+ - it's much, much faster: at thread exit time, there's no need to loop
+ over every vma (!), which the VM-based method has to do. Only a very
+ simple 'is the list empty' op is done.
+
+ - no VM changes are needed - 'struct address_space' is left alone.
+
+ - no registration of individual locks is needed: robust mutexes don't
+ need any extra per-lock syscalls. Robust mutexes thus become a very
+ lightweight primitive - so they don't force the application designer
+ to do a hard choice between performance and robustness - robust
+ mutexes are just as fast.
+
+ - no per-lock kernel allocation happens.
+
+ - no resource limits are needed.
+
+ - no kernel-space recovery call (FUTEX_RECOVER) is needed.
+
+ - the implementation and the locking is "obvious", and there are no
+ interactions with the VM.
+
+Performance
+-----------
+
+I have benchmarked the time needed for the kernel to process a list of 1
+million (!) held locks, using the new method [on a 2GHz CPU]:
+
+ - with FUTEX_WAIT set [contended mutex]: 130 msecs
+ - without FUTEX_WAIT set [uncontended mutex]: 30 msecs
+
+I have also measured an approach where glibc does the lock notification
+[which it currently does for !pshared robust mutexes], and that took 256
+msecs - clearly slower, due to the 1 million FUTEX_WAKE syscalls
+userspace had to do.
+
+(1 million held locks are unheard of - we expect at most a handful of
+locks to be held at a time. Nevertheless it's nice to know that this
+approach scales nicely.)
+
+Implementation details
+----------------------
+
+The patch adds two new syscalls: one to register the userspace list, and
+one to query the registered list pointer::
+
+ asmlinkage long
+ sys_set_robust_list(struct robust_list_head __user *head,
+ size_t len);
+
+ asmlinkage long
+ sys_get_robust_list(int pid, struct robust_list_head __user **head_ptr,
+ size_t __user *len_ptr);
+
+List registration is very fast: the pointer is simply stored in
+current->robust_list. [Note that in the future, if robust futexes become
+widespread, we could extend sys_clone() to register a robust-list head
+for new threads, without the need of another syscall.]
+
+So there is virtually zero overhead for tasks not using robust futexes,
+and even for robust futex users, there is only one extra syscall per
+thread lifetime, and the cleanup operation, if it happens, is fast and
+straightforward. The kernel doesn't have any internal distinction between
+robust and normal futexes.
+
+If a futex is found to be held at exit time, the kernel sets the
+following bit of the futex word::
+
+ #define FUTEX_OWNER_DIED 0x40000000
+
+and wakes up the next futex waiter (if any). User-space does the rest of
+the cleanup.
+
+Otherwise, robust futexes are acquired by glibc by putting the TID into
+the futex field atomically. Waiters set the FUTEX_WAITERS bit::
+
+ #define FUTEX_WAITERS 0x80000000
+
+and the remaining bits are for the TID.
+
+Testing, architecture support
+-----------------------------
+
+I've tested the new syscalls on x86 and x86_64, and have made sure the
+parsing of the userspace list is robust [ ;-) ] even if the list is
+deliberately corrupted.
+
+i386 and x86_64 syscalls are wired up at the moment, and Ulrich has
+tested the new glibc code (on x86_64 and i386), and it works for his
+robust-mutex testcases.
+
+All other architectures should build just fine too - but they won't have
+the new syscalls yet.
+
+Architectures need to implement the new futex_atomic_cmpxchg_inatomic()
+inline function before writing up the syscalls.
diff --git a/Documentation/locking/rt-mutex.rst b/Documentation/locking/rt-mutex.rst
index c365dc302081..3b5097a380e6 100644
--- a/Documentation/locking/rt-mutex.rst
+++ b/Documentation/locking/rt-mutex.rst
@@ -4,7 +4,7 @@ RT-mutex subsystem with PI support
RT-mutexes with priority inheritance are used to support PI-futexes,
which enable pthread_mutex_t priority inheritance attributes
-(PTHREAD_PRIO_INHERIT). [See Documentation/pi-futex.txt for more details
+(PTHREAD_PRIO_INHERIT). [See Documentation/locking/pi-futex.rst for more details
about PI-futexes.]
This technology was developed in the -rt tree and streamlined for
diff --git a/Documentation/locking/seqlock.rst b/Documentation/locking/seqlock.rst
new file mode 100644
index 000000000000..bfda1a5fecad
--- /dev/null
+++ b/Documentation/locking/seqlock.rst
@@ -0,0 +1,239 @@
+======================================
+Sequence counters and sequential locks
+======================================
+
+Introduction
+============
+
+Sequence counters are a reader-writer consistency mechanism with
+lockless readers (read-only retry loops), and no writer starvation. They
+are used for data that's rarely written to (e.g. system time), where the
+reader wants a consistent set of information and is willing to retry if
+that information changes.
+
+A data set is consistent when the sequence count at the beginning of the
+read side critical section is even and the same sequence count value is
+read again at the end of the critical section. The data in the set must
+be copied out inside the read side critical section. If the sequence
+count has changed between the start and the end of the critical section,
+the reader must retry.
+
+Writers increment the sequence count at the start and the end of their
+critical section. After starting the critical section the sequence count
+is odd and indicates to the readers that an update is in progress. At
+the end of the write side critical section the sequence count becomes
+even again which lets readers make progress.
+
+A sequence counter write side critical section must never be preempted
+or interrupted by read side sections. Otherwise the reader will spin for
+the entire scheduler tick due to the odd sequence count value and the
+interrupted writer. If that reader belongs to a real-time scheduling
+class, it can spin forever and the kernel will livelock.
+
+This mechanism cannot be used if the protected data contains pointers,
+as the writer can invalidate a pointer that the reader is following.
+
+
+.. _seqcount_t:
+
+Sequence counters (``seqcount_t``)
+==================================
+
+This is the raw counting mechanism, which does not protect against
+multiple writers. Write side critical sections must thus be serialized
+by an external lock.
+
+If the write serialization primitive is not implicitly disabling
+preemption, preemption must be explicitly disabled before entering the
+write side section. If the read section can be invoked from hardirq or
+softirq contexts, interrupts or bottom halves must also be respectively
+disabled before entering the write section.
+
+If it's desired to automatically handle the sequence counter
+requirements of writer serialization and non-preemptibility, use
+:ref:`seqlock_t` instead.
+
+Initialization::
+
+ /* dynamic */
+ seqcount_t foo_seqcount;
+ seqcount_init(&foo_seqcount);
+
+ /* static */
+ static seqcount_t foo_seqcount = SEQCNT_ZERO(foo_seqcount);
+
+ /* C99 struct init */
+ struct {
+ .seq = SEQCNT_ZERO(foo.seq),
+ } foo;
+
+Write path::
+
+ /* Serialized context with disabled preemption */
+
+ write_seqcount_begin(&foo_seqcount);
+
+ /* ... [[write-side critical section]] ... */
+
+ write_seqcount_end(&foo_seqcount);
+
+Read path::
+
+ do {
+ seq = read_seqcount_begin(&foo_seqcount);
+
+ /* ... [[read-side critical section]] ... */
+
+ } while (read_seqcount_retry(&foo_seqcount, seq));
+
+
+.. _seqcount_locktype_t:
+
+Sequence counters with associated locks (``seqcount_LOCKNAME_t``)
+-----------------------------------------------------------------
+
+As discussed at :ref:`seqcount_t`, sequence count write side critical
+sections must be serialized and non-preemptible. This variant of
+sequence counters associate the lock used for writer serialization at
+initialization time, which enables lockdep to validate that the write
+side critical sections are properly serialized.
+
+This lock association is a NOOP if lockdep is disabled and has neither
+storage nor runtime overhead. If lockdep is enabled, the lock pointer is
+stored in struct seqcount and lockdep's "lock is held" assertions are
+injected at the beginning of the write side critical section to validate
+that it is properly protected.
+
+For lock types which do not implicitly disable preemption, preemption
+protection is enforced in the write side function.
+
+The following sequence counters with associated locks are defined:
+
+ - ``seqcount_spinlock_t``
+ - ``seqcount_raw_spinlock_t``
+ - ``seqcount_rwlock_t``
+ - ``seqcount_mutex_t``
+ - ``seqcount_ww_mutex_t``
+
+The sequence counter read and write APIs can take either a plain
+seqcount_t or any of the seqcount_LOCKNAME_t variants above.
+
+Initialization (replace "LOCKNAME" with one of the supported locks)::
+
+ /* dynamic */
+ seqcount_LOCKNAME_t foo_seqcount;
+ seqcount_LOCKNAME_init(&foo_seqcount, &lock);
+
+ /* static */
+ static seqcount_LOCKNAME_t foo_seqcount =
+ SEQCNT_LOCKNAME_ZERO(foo_seqcount, &lock);
+
+ /* C99 struct init */
+ struct {
+ .seq = SEQCNT_LOCKNAME_ZERO(foo.seq, &lock),
+ } foo;
+
+Write path: same as in :ref:`seqcount_t`, while running from a context
+with the associated write serialization lock acquired.
+
+Read path: same as in :ref:`seqcount_t`.
+
+
+.. _seqcount_latch_t:
+
+Latch sequence counters (``seqcount_latch_t``)
+----------------------------------------------
+
+Latch sequence counters are a multiversion concurrency control mechanism
+where the embedded seqcount_t counter even/odd value is used to switch
+between two copies of protected data. This allows the sequence counter
+read path to safely interrupt its own write side critical section.
+
+Use seqcount_latch_t when the write side sections cannot be protected
+from interruption by readers. This is typically the case when the read
+side can be invoked from NMI handlers.
+
+Check `raw_write_seqcount_latch()` for more information.
+
+
+.. _seqlock_t:
+
+Sequential locks (``seqlock_t``)
+================================
+
+This contains the :ref:`seqcount_t` mechanism earlier discussed, plus an
+embedded spinlock for writer serialization and non-preemptibility.
+
+If the read side section can be invoked from hardirq or softirq context,
+use the write side function variants which disable interrupts or bottom
+halves respectively.
+
+Initialization::
+
+ /* dynamic */
+ seqlock_t foo_seqlock;
+ seqlock_init(&foo_seqlock);
+
+ /* static */
+ static DEFINE_SEQLOCK(foo_seqlock);
+
+ /* C99 struct init */
+ struct {
+ .seql = __SEQLOCK_UNLOCKED(foo.seql)
+ } foo;
+
+Write path::
+
+ write_seqlock(&foo_seqlock);
+
+ /* ... [[write-side critical section]] ... */
+
+ write_sequnlock(&foo_seqlock);
+
+Read path, three categories:
+
+1. Normal Sequence readers which never block a writer but they must
+ retry if a writer is in progress by detecting change in the sequence
+ number. Writers do not wait for a sequence reader::
+
+ do {
+ seq = read_seqbegin(&foo_seqlock);
+
+ /* ... [[read-side critical section]] ... */
+
+ } while (read_seqretry(&foo_seqlock, seq));
+
+2. Locking readers which will wait if a writer or another locking reader
+ is in progress. A locking reader in progress will also block a writer
+ from entering its critical section. This read lock is
+ exclusive. Unlike rwlock_t, only one locking reader can acquire it::
+
+ read_seqlock_excl(&foo_seqlock);
+
+ /* ... [[read-side critical section]] ... */
+
+ read_sequnlock_excl(&foo_seqlock);
+
+3. Conditional lockless reader (as in 1), or locking reader (as in 2),
+ according to a passed marker. This is used to avoid lockless readers
+ starvation (too much retry loops) in case of a sharp spike in write
+ activity. First, a lockless read is tried (even marker passed). If
+ that trial fails (odd sequence counter is returned, which is used as
+ the next iteration marker), the lockless read is transformed to a
+ full locking read and no retry loop is necessary::
+
+ /* marker; even initialization */
+ int seq = 0;
+ do {
+ read_seqbegin_or_lock(&foo_seqlock, &seq);
+
+ /* ... [[read-side critical section]] ... */
+
+ } while (need_seqretry(&foo_seqlock, seq));
+ done_seqretry(&foo_seqlock, seq);
+
+
+API documentation
+=================
+
+.. kernel-doc:: include/linux/seqlock.h
diff --git a/Documentation/locking/ww-mutex-design.rst b/Documentation/locking/ww-mutex-design.rst
index 1846c199da23..6a8f8beb9ec4 100644
--- a/Documentation/locking/ww-mutex-design.rst
+++ b/Documentation/locking/ww-mutex-design.rst
@@ -2,7 +2,7 @@
Wound/Wait Deadlock-Proof Mutex Design
======================================
-Please read mutex-design.txt first, as it applies to wait/wound mutexes too.
+Please read mutex-design.rst first, as it applies to wait/wound mutexes too.
Motivation for WW-Mutexes
-------------------------
@@ -49,7 +49,7 @@ However, the Wound-Wait algorithm is typically stated to generate fewer backoffs
compared to Wait-Die, but is, on the other hand, associated with more work than
Wait-Die when recovering from a backoff. Wound-Wait is also a preemptive
algorithm in that transactions are wounded by other transactions, and that
-requires a reliable way to pick up up the wounded condition and preempt the
+requires a reliable way to pick up the wounded condition and preempt the
running transaction. Note that this is not the same as process preemption. A
Wound-Wait transaction is considered preempted when it dies (returning
-EDEADLK) following a wound.
@@ -60,7 +60,7 @@ Concepts
Compared to normal mutexes two additional concepts/objects show up in the lock
interface for w/w mutexes:
-Acquire context: To ensure eventual forward progress it is important the a task
+Acquire context: To ensure eventual forward progress it is important that a task
trying to acquire locks doesn't grab a new reservation id, but keeps the one it
acquired when starting the lock acquisition. This ticket is stored in the
acquire context. Furthermore the acquire context keeps track of debugging state