diff options
Diffstat (limited to 'Documentation/RCU')
18 files changed, 251 insertions, 305 deletions
diff --git a/Documentation/RCU/Design/Data-Structures/Data-Structures.html b/Documentation/RCU/Design/Data-Structures/Data-Structures.html index 18f179807563..c30c1957c7e6 100644 --- a/Documentation/RCU/Design/Data-Structures/Data-Structures.html +++ b/Documentation/RCU/Design/Data-Structures/Data-Structures.html @@ -155,8 +155,7 @@ keeping lock contention under control at all tree levels regardless of the level of loading on the system. </p><p>RCU updaters wait for normal grace periods by registering -RCU callbacks, either directly via <tt>call_rcu()</tt> and -friends (namely <tt>call_rcu_bh()</tt> and <tt>call_rcu_sched()</tt>), +RCU callbacks, either directly via <tt>call_rcu()</tt> or indirectly via <tt>synchronize_rcu()</tt> and friends. RCU callbacks are represented by <tt>rcu_head</tt> structures, which are queued on <tt>rcu_data</tt> structures while they are diff --git a/Documentation/RCU/Design/Expedited-Grace-Periods/ExpSchedFlow.svg b/Documentation/RCU/Design/Expedited-Grace-Periods/ExpSchedFlow.svg index e4233ac93c2b..6189ffcc6aff 100644 --- a/Documentation/RCU/Design/Expedited-Grace-Periods/ExpSchedFlow.svg +++ b/Documentation/RCU/Design/Expedited-Grace-Periods/ExpSchedFlow.svg @@ -328,13 +328,13 @@ inkscape:window-height="1148" id="namedview90" showgrid="true" - inkscape:zoom="0.80021373" - inkscape:cx="462.49289" - inkscape:cy="473.6718" + inkscape:zoom="0.69092787" + inkscape:cx="476.34085" + inkscape:cy="712.80957" inkscape:window-x="770" inkscape:window-y="24" inkscape:window-maximized="0" - inkscape:current-layer="g4114-9-3-9" + inkscape:current-layer="g4" inkscape:snap-grids="false" fit-margin-top="5" fit-margin-right="5" @@ -813,14 +813,18 @@ <text sodipodi:linespacing="125%" id="text4110-5-7-6-2-4-0" - y="841.88086" + y="670.74316" x="1460.1007" style="font-size:267.24359131px;font-style:normal;font-weight:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Sans" xml:space="preserve"><tspan - y="841.88086" + y="670.74316" + x="1460.1007" + sodipodi:role="line" + id="tspan4925-1-2-4-5">Request</tspan><tspan + y="1004.7976" x="1460.1007" sodipodi:role="line" - id="tspan4925-1-2-4-5">reched_cpu()</tspan></text> + id="tspan3100">context switch</tspan></text> </g> </g> </svg> diff --git a/Documentation/RCU/Design/Expedited-Grace-Periods/Expedited-Grace-Periods.html b/Documentation/RCU/Design/Expedited-Grace-Periods/Expedited-Grace-Periods.html index 8e4f873b979f..57300db4b5ff 100644 --- a/Documentation/RCU/Design/Expedited-Grace-Periods/Expedited-Grace-Periods.html +++ b/Documentation/RCU/Design/Expedited-Grace-Periods/Expedited-Grace-Periods.html @@ -56,6 +56,7 @@ sections. RCU-preempt Expedited Grace Periods</a></h2> <p> +<tt>CONFIG_PREEMPT=y</tt> kernels implement RCU-preempt. The overall flow of the handling of a given CPU by an RCU-preempt expedited grace period is shown in the following diagram: @@ -72,10 +73,10 @@ will ignore it because idle and offline CPUs are already residing in quiescent states. Otherwise, the expedited grace period will use <tt>smp_call_function_single()</tt> to send the CPU an IPI, which -is handled by <tt>sync_rcu_exp_handler()</tt>. +is handled by <tt>rcu_exp_handler()</tt>. <p> -However, because this is preemptible RCU, <tt>sync_rcu_exp_handler()</tt> +However, because this is preemptible RCU, <tt>rcu_exp_handler()</tt> can check to see if the CPU is currently running in an RCU read-side critical section. If not, the handler can immediately report a quiescent state. @@ -139,25 +140,25 @@ or offline, among other things. RCU-sched Expedited Grace Periods</a></h2> <p> +<tt>CONFIG_PREEMPT=n</tt> kernels implement RCU-sched. The overall flow of the handling of a given CPU by an RCU-sched expedited grace period is shown in the following diagram: <p><img src="ExpSchedFlow.svg" alt="ExpSchedFlow.svg" width="55%"> <p> -As with RCU-preempt's <tt>synchronize_rcu_expedited()</tt>, -<tt>synchronize_sched_expedited()</tt> ignores offline and +As with RCU-preempt, RCU-sched's +<tt>synchronize_rcu_expedited()</tt> ignores offline and idle CPUs, again because they are in remotely detectable quiescent states. -However, the <tt>synchronize_rcu_expedited()</tt> handler -is <tt>sync_sched_exp_handler()</tt>, and because the +However, because the <tt>rcu_read_lock_sched()</tt> and <tt>rcu_read_unlock_sched()</tt> leave no trace of their invocation, in general it is not possible to tell whether or not the current CPU is in an RCU read-side critical section. -The best that <tt>sync_sched_exp_handler()</tt> can do is to check +The best that RCU-sched's <tt>rcu_exp_handler()</tt> can do is to check for idle, on the off-chance that the CPU went idle while the IPI was in flight. -If the CPU is idle, then <tt>sync_sched_exp_handler()</tt> reports +If the CPU is idle, then <tt>rcu_exp_handler()</tt> reports the quiescent state. <p> Otherwise, the handler forces a future context switch by setting the @@ -298,19 +299,18 @@ Instead, the task pushing the grace period forward will include the idle CPUs in the mask passed to <tt>rcu_report_exp_cpu_mult()</tt>. <p> -For RCU-sched, there is an additional check for idle in the IPI -handler, <tt>sync_sched_exp_handler()</tt>. +For RCU-sched, there is an additional check: If the IPI has interrupted the idle loop, then -<tt>sync_sched_exp_handler()</tt> invokes <tt>rcu_report_exp_rdp()</tt> +<tt>rcu_exp_handler()</tt> invokes <tt>rcu_report_exp_rdp()</tt> to report the corresponding quiescent state. <p> For RCU-preempt, there is no specific check for idle in the -IPI handler (<tt>sync_rcu_exp_handler()</tt>), but because +IPI handler (<tt>rcu_exp_handler()</tt>), but because RCU read-side critical sections are not permitted within the -idle loop, if <tt>sync_rcu_exp_handler()</tt> sees that the CPU is within +idle loop, if <tt>rcu_exp_handler()</tt> sees that the CPU is within RCU read-side critical section, the CPU cannot possibly be idle. -Otherwise, <tt>sync_rcu_exp_handler()</tt> invokes +Otherwise, <tt>rcu_exp_handler()</tt> invokes <tt>rcu_report_exp_rdp()</tt> to report the corresponding quiescent state, regardless of whether or not that quiescent state was due to the CPU being idle. @@ -625,6 +625,8 @@ checks, but only during the mid-boot dead zone. <p> With this refinement, synchronous grace periods can now be used from task context pretty much any time during the life of the kernel. +That is, aside from some points in the suspend, hibernate, or shutdown +code path. <h3><a name="Summary"> Summary</a></h3> diff --git a/Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.html b/Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.html index e4d94fba6c89..c64f8d26609f 100644 --- a/Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.html +++ b/Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.html @@ -34,12 +34,11 @@ Similarly, any code that happens before the beginning of a given RCU grace period is guaranteed to see the effects of all accesses following the end of that grace period that are within RCU read-side critical sections. -<p>This guarantee is particularly pervasive for <tt>synchronize_sched()</tt>, -for which RCU-sched read-side critical sections include any region +<p>Note well that RCU-sched read-side critical sections include any region of code for which preemption is disabled. Given that each individual machine instruction can be thought of as an extremely small region of preemption-disabled code, one can think of -<tt>synchronize_sched()</tt> as <tt>smp_mb()</tt> on steroids. +<tt>synchronize_rcu()</tt> as <tt>smp_mb()</tt> on steroids. <p>RCU updaters use this guarantee by splitting their updates into two phases, one of which is executed before the grace period and @@ -485,13 +484,13 @@ section that the grace period must wait on. noted by <tt>rcu_node_context_switch()</tt> on the left. On the other hand, if the CPU takes a scheduler-clock interrupt while executing in usermode, a quiescent state will be noted by -<tt>rcu_check_callbacks()</tt> on the right. +<tt>rcu_sched_clock_irq()</tt> on the right. Either way, the passage through a quiescent state will be noted in a per-CPU variable. <p>The next time an <tt>RCU_SOFTIRQ</tt> handler executes on this CPU (for example, after the next scheduler-clock -interrupt), <tt>__rcu_process_callbacks()</tt> will invoke +interrupt), <tt>rcu_core()</tt> will invoke <tt>rcu_check_quiescent_state()</tt>, which will notice the recorded quiescent state, and invoke <tt>rcu_report_qs_rdp()</tt>. @@ -651,7 +650,7 @@ to end. These callbacks are identified by <tt>rcu_advance_cbs()</tt>, which is usually invoked by <tt>__note_gp_changes()</tt>. As shown in the diagram below, this invocation can be triggered by -the scheduling-clock interrupt (<tt>rcu_check_callbacks()</tt> on +the scheduling-clock interrupt (<tt>rcu_sched_clock_irq()</tt> on the left) or by idle entry (<tt>rcu_cleanup_after_idle()</tt> on the right, but only for kernels build with <tt>CONFIG_RCU_FAST_NO_HZ=y</tt>). diff --git a/Documentation/RCU/Design/Memory-Ordering/TreeRCU-callback-invocation.svg b/Documentation/RCU/Design/Memory-Ordering/TreeRCU-callback-invocation.svg index 832408313d93..3fcf0c17cef2 100644 --- a/Documentation/RCU/Design/Memory-Ordering/TreeRCU-callback-invocation.svg +++ b/Documentation/RCU/Design/Memory-Ordering/TreeRCU-callback-invocation.svg @@ -349,7 +349,7 @@ font-weight="bold" font-size="192" id="text202-7-5" - style="font-size:192px;font-style:normal;font-weight:bold;text-anchor:start;fill:#000000;stroke-width:0.025in;font-family:Courier">rcu_check_callbacks()</text> + style="font-size:192px;font-style:normal;font-weight:bold;text-anchor:start;fill:#000000;stroke-width:0.025in;font-family:Courier">rcu_sched_clock_irq()</text> <rect x="7069.6187" y="5087.4678" diff --git a/Documentation/RCU/Design/Memory-Ordering/TreeRCU-gp.svg b/Documentation/RCU/Design/Memory-Ordering/TreeRCU-gp.svg index acd73c7ad0f4..2bcd742d6e49 100644 --- a/Documentation/RCU/Design/Memory-Ordering/TreeRCU-gp.svg +++ b/Documentation/RCU/Design/Memory-Ordering/TreeRCU-gp.svg @@ -3902,7 +3902,7 @@ font-style="normal" y="-4418.6582" x="3745.7725" - xml:space="preserve">rcu_check_callbacks()</text> + xml:space="preserve">rcu_sched_clock_irq()</text> </g> <g transform="translate(-850.30204,55463.106)" @@ -3924,7 +3924,7 @@ font-style="normal" y="-4418.6582" x="3745.7725" - xml:space="preserve">rcu_process_callbacks()</text> + xml:space="preserve">rcu_core()</text> <text style="font-size:192px;font-style:normal;font-weight:bold;text-anchor:start;fill:#000000;stroke-width:0.025in;font-family:Courier" id="text202-7-5-3-27-0" @@ -3933,7 +3933,7 @@ font-style="normal" y="-4165.7954" x="3745.7725" - xml:space="preserve">rcu_check_quiescent_state())</text> + xml:space="preserve">rcu_check_quiescent_state()</text> <text style="font-size:192px;font-style:normal;font-weight:bold;text-anchor:start;fill:#000000;stroke-width:0.025in;font-family:Courier" id="text202-7-5-3-27-0-9" @@ -4968,7 +4968,7 @@ font-weight="bold" font-size="192" id="text202-7-5-19" - style="font-size:192px;font-style:normal;font-weight:bold;text-anchor:start;fill:#000000;stroke-width:0.025in;font-family:Courier">rcu_check_callbacks()</text> + style="font-size:192px;font-style:normal;font-weight:bold;text-anchor:start;fill:#000000;stroke-width:0.025in;font-family:Courier">rcu_sched_clock_irq()</text> <rect x="5314.2671" y="82817.688" diff --git a/Documentation/RCU/Design/Memory-Ordering/TreeRCU-qs.svg b/Documentation/RCU/Design/Memory-Ordering/TreeRCU-qs.svg index 149bec2a4493..779c9ac31a52 100644 --- a/Documentation/RCU/Design/Memory-Ordering/TreeRCU-qs.svg +++ b/Documentation/RCU/Design/Memory-Ordering/TreeRCU-qs.svg @@ -775,7 +775,7 @@ font-style="normal" y="-4418.6582" x="3745.7725" - xml:space="preserve">rcu_check_callbacks()</text> + xml:space="preserve">rcu_sched_clock_irq()</text> </g> <g transform="translate(399.7744,828.86448)" @@ -797,7 +797,7 @@ font-style="normal" y="-4418.6582" x="3745.7725" - xml:space="preserve">rcu_process_callbacks()</text> + xml:space="preserve">rcu_core()</text> <text style="font-size:192px;font-style:normal;font-weight:bold;text-anchor:start;fill:#000000;stroke-width:0.025in;font-family:Courier" id="text202-7-5-3-27-0" @@ -806,7 +806,7 @@ font-style="normal" y="-4165.7954" x="3745.7725" - xml:space="preserve">rcu_check_quiescent_state())</text> + xml:space="preserve">rcu_check_quiescent_state()</text> <text style="font-size:192px;font-style:normal;font-weight:bold;text-anchor:start;fill:#000000;stroke-width:0.025in;font-family:Courier" id="text202-7-5-3-27-0-9" diff --git a/Documentation/RCU/Design/Requirements/Requirements.html b/Documentation/RCU/Design/Requirements/Requirements.html index 9fca73e03a98..5a9238a2883c 100644 --- a/Documentation/RCU/Design/Requirements/Requirements.html +++ b/Documentation/RCU/Design/Requirements/Requirements.html @@ -3099,7 +3099,7 @@ If you block forever in one of a given domain's SRCU read-side critical sections, then that domain's grace periods will also be blocked forever. Of course, one good way to block forever is to deadlock, which can happen if any operation in a given domain's SRCU read-side critical -section can block waiting, either directly or indirectly, for that domain's +section can wait, either directly or indirectly, for that domain's grace period to elapse. For example, this results in a self-deadlock: @@ -3139,12 +3139,18 @@ API, which, in combination with <tt>srcu_read_unlock()</tt>, guarantees a full memory barrier. <p> -Also unlike other RCU flavors, SRCU's callbacks-wait function -<tt>srcu_barrier()</tt> may be invoked from CPU-hotplug notifiers, -though this is not necessarily a good idea. -The reason that this is possible is that SRCU is insensitive -to whether or not a CPU is online, which means that <tt>srcu_barrier()</tt> -need not exclude CPU-hotplug operations. +Also unlike other RCU flavors, <tt>synchronize_srcu()</tt> may <b>not</b> +be invoked from CPU-hotplug notifiers, due to the fact that SRCU grace +periods make use of timers and the possibility of timers being temporarily +“stranded” on the outgoing CPU. +This stranding of timers means that timers posted to the outgoing CPU +will not fire until late in the CPU-hotplug process. +The problem is that if a notifier is waiting on an SRCU grace period, +that grace period is waiting on a timer, and that timer is stranded on the +outgoing CPU, then the notifier will never be awakened, in other words, +deadlock has occurred. +This same situation of course also prohibits <tt>srcu_barrier()</tt> +from being invoked from CPU-hotplug notifiers. <p> SRCU also differs from other RCU flavors in that SRCU's expedited and diff --git a/Documentation/RCU/NMI-RCU.txt b/Documentation/RCU/NMI-RCU.txt index 687777f83b23..881353fd5bff 100644 --- a/Documentation/RCU/NMI-RCU.txt +++ b/Documentation/RCU/NMI-RCU.txt @@ -81,18 +81,19 @@ currently executing on some other CPU. We therefore cannot free up any data structures used by the old NMI handler until execution of it completes on all other CPUs. -One way to accomplish this is via synchronize_sched(), perhaps as +One way to accomplish this is via synchronize_rcu(), perhaps as follows: unset_nmi_callback(); - synchronize_sched(); + synchronize_rcu(); kfree(my_nmi_data); -This works because synchronize_sched() blocks until all CPUs complete -any preemption-disabled segments of code that they were executing. -Since NMI handlers disable preemption, synchronize_sched() is guaranteed +This works because (as of v4.20) synchronize_rcu() blocks until all +CPUs complete any preemption-disabled segments of code that they were +executing. +Since NMI handlers disable preemption, synchronize_rcu() is guaranteed not to return until all ongoing NMI handlers exit. It is therefore safe -to free up the handler's data as soon as synchronize_sched() returns. +to free up the handler's data as soon as synchronize_rcu() returns. Important note: for this to work, the architecture in question must invoke nmi_enter() and nmi_exit() on NMI entry and exit, respectively. diff --git a/Documentation/RCU/UP.txt b/Documentation/RCU/UP.txt index 90ec5341ee98..53bde717017b 100644 --- a/Documentation/RCU/UP.txt +++ b/Documentation/RCU/UP.txt @@ -86,10 +86,8 @@ even on a UP system. So do not do it! Even on a UP system, the RCU infrastructure -must- respect grace periods, and -must- invoke callbacks from a known environment in which no locks are held. -It -is- safe for synchronize_sched() and synchronize_rcu_bh() to return -immediately on an UP system. It is also safe for synchronize_rcu() -to return immediately on UP systems, except when running preemptable -RCU. +Note that it -is- safe for synchronize_rcu() to return immediately on +UP systems, including !PREEMPT SMP builds running on UP systems. Quick Quiz #3: Why can't synchronize_rcu() return immediately on UP systems running preemptable RCU? diff --git a/Documentation/RCU/checklist.txt b/Documentation/RCU/checklist.txt index 6f469864d9f5..e98ff261a438 100644 --- a/Documentation/RCU/checklist.txt +++ b/Documentation/RCU/checklist.txt @@ -182,16 +182,13 @@ over a rather long period of time, but improvements are always welcome! when publicizing a pointer to a structure that can be traversed by an RCU read-side critical section. -5. If call_rcu(), or a related primitive such as call_rcu_bh(), - call_rcu_sched(), or call_srcu() is used, the callback function - will be called from softirq context. In particular, it cannot - block. +5. If call_rcu() or call_srcu() is used, the callback function will + be called from softirq context. In particular, it cannot block. -6. Since synchronize_rcu() can block, it cannot be called from - any sort of irq context. The same rule applies for - synchronize_rcu_bh(), synchronize_sched(), synchronize_srcu(), - synchronize_rcu_expedited(), synchronize_rcu_bh_expedited(), - synchronize_sched_expedite(), and synchronize_srcu_expedited(). +6. Since synchronize_rcu() can block, it cannot be called + from any sort of irq context. The same rule applies + for synchronize_srcu(), synchronize_rcu_expedited(), and + synchronize_srcu_expedited(). The expedited forms of these primitives have the same semantics as the non-expedited forms, but expediting is both expensive and @@ -212,20 +209,20 @@ over a rather long period of time, but improvements are always welcome! of the system, especially to real-time workloads running on the rest of the system. -7. If the updater uses call_rcu() or synchronize_rcu(), then the - corresponding readers must use rcu_read_lock() and - rcu_read_unlock(). If the updater uses call_rcu_bh() or - synchronize_rcu_bh(), then the corresponding readers must - use rcu_read_lock_bh() and rcu_read_unlock_bh(). If the - updater uses call_rcu_sched() or synchronize_sched(), then - the corresponding readers must disable preemption, possibly - by calling rcu_read_lock_sched() and rcu_read_unlock_sched(). - If the updater uses synchronize_srcu() or call_srcu(), then - the corresponding readers must use srcu_read_lock() and +7. As of v4.20, a given kernel implements only one RCU flavor, + which is RCU-sched for PREEMPT=n and RCU-preempt for PREEMPT=y. + If the updater uses call_rcu() or synchronize_rcu(), + then the corresponding readers my use rcu_read_lock() and + rcu_read_unlock(), rcu_read_lock_bh() and rcu_read_unlock_bh(), + or any pair of primitives that disables and re-enables preemption, + for example, rcu_read_lock_sched() and rcu_read_unlock_sched(). + If the updater uses synchronize_srcu() or call_srcu(), + then the corresponding readers must use srcu_read_lock() and srcu_read_unlock(), and with the same srcu_struct. The rules for the expedited primitives are the same as for their non-expedited counterparts. Mixing things up will result in confusion and - broken kernels. + broken kernels, and has even resulted in an exploitable security + issue. One exception to this rule: rcu_read_lock() and rcu_read_unlock() may be substituted for rcu_read_lock_bh() and rcu_read_unlock_bh() @@ -288,8 +285,7 @@ over a rather long period of time, but improvements are always welcome! d. Periodically invoke synchronize_rcu(), permitting a limited number of updates per grace period. - The same cautions apply to call_rcu_bh(), call_rcu_sched(), - call_srcu(), and kfree_rcu(). + The same cautions apply to call_srcu() and kfree_rcu(). Note that although these primitives do take action to avoid memory exhaustion when any given CPU has too many callbacks, a determined @@ -322,7 +318,7 @@ over a rather long period of time, but improvements are always welcome! 11. Any lock acquired by an RCU callback must be acquired elsewhere with softirq disabled, e.g., via spin_lock_irqsave(), - spin_lock_bh(), etc. Failing to disable irq on a given + spin_lock_bh(), etc. Failing to disable softirq on a given acquisition of that lock will result in deadlock as soon as the RCU softirq handler happens to run your RCU callback while interrupting that acquisition's critical section. @@ -335,13 +331,16 @@ over a rather long period of time, but improvements are always welcome! must use whatever locking or other synchronization is required to safely access and/or modify that data structure. - RCU callbacks are -usually- executed on the same CPU that executed - the corresponding call_rcu(), call_rcu_bh(), or call_rcu_sched(), - but are by -no- means guaranteed to be. For example, if a given - CPU goes offline while having an RCU callback pending, then that - RCU callback will execute on some surviving CPU. (If this was - not the case, a self-spawning RCU callback would prevent the - victim CPU from ever going offline.) + Do not assume that RCU callbacks will be executed on the same + CPU that executed the corresponding call_rcu() or call_srcu(). + For example, if a given CPU goes offline while having an RCU + callback pending, then that RCU callback will execute on some + surviving CPU. (If this was not the case, a self-spawning RCU + callback would prevent the victim CPU from ever going offline.) + Furthermore, CPUs designated by rcu_nocbs= might well -always- + have their RCU callbacks executed on some other CPUs, in fact, + for some real-time workloads, this is the whole point of using + the rcu_nocbs= kernel boot parameter. 13. Unlike other forms of RCU, it -is- permissible to block in an SRCU read-side critical section (demarked by srcu_read_lock() @@ -381,11 +380,11 @@ over a rather long period of time, but improvements are always welcome! SRCU's expedited primitive (synchronize_srcu_expedited()) never sends IPIs to other CPUs, so it is easier on - real-time workloads than is synchronize_rcu_expedited(), - synchronize_rcu_bh_expedited() or synchronize_sched_expedited(). + real-time workloads than is synchronize_rcu_expedited(). - Note that rcu_dereference() and rcu_assign_pointer() relate to - SRCU just as they do to other forms of RCU. + Note that rcu_assign_pointer() relates to SRCU just as it does to + other forms of RCU, but instead of rcu_dereference() you should + use srcu_dereference() in order to avoid lockdep splats. 14. The whole point of call_rcu(), synchronize_rcu(), and friends is to wait until all pre-existing readers have finished before @@ -405,6 +404,9 @@ over a rather long period of time, but improvements are always welcome! read-side critical sections. It is the responsibility of the RCU update-side primitives to deal with this. + For SRCU readers, you can use smp_mb__after_srcu_read_unlock() + immediately after an srcu_read_unlock() to get a full barrier. + 16. Use CONFIG_PROVE_LOCKING, CONFIG_DEBUG_OBJECTS_RCU_HEAD, and the __rcu sparse checks to validate your RCU code. These can help find problems as follows: @@ -428,22 +430,19 @@ over a rather long period of time, but improvements are always welcome! These debugging aids can help you find problems that are otherwise extremely difficult to spot. -17. If you register a callback using call_rcu(), call_rcu_bh(), - call_rcu_sched(), or call_srcu(), and pass in a function defined - within a loadable module, then it in necessary to wait for - all pending callbacks to be invoked after the last invocation - and before unloading that module. Note that it is absolutely - -not- sufficient to wait for a grace period! The current (say) - synchronize_rcu() implementation waits only for all previous - callbacks registered on the CPU that synchronize_rcu() is running - on, but it is -not- guaranteed to wait for callbacks registered - on other CPUs. +17. If you register a callback using call_rcu() or call_srcu(), and + pass in a function defined within a loadable module, then it in + necessary to wait for all pending callbacks to be invoked after + the last invocation and before unloading that module. Note that + it is absolutely -not- sufficient to wait for a grace period! + The current (say) synchronize_rcu() implementation is -not- + guaranteed to wait for callbacks registered on other CPUs. + Or even on the current CPU if that CPU recently went offline + and came back online. You instead need to use one of the barrier functions: o call_rcu() -> rcu_barrier() - o call_rcu_bh() -> rcu_barrier() - o call_rcu_sched() -> rcu_barrier() o call_srcu() -> srcu_barrier() However, these barrier functions are absolutely -not- guaranteed diff --git a/Documentation/RCU/lockdep-splat.txt b/Documentation/RCU/lockdep-splat.txt index 238e9f61352f..9c015976b174 100644 --- a/Documentation/RCU/lockdep-splat.txt +++ b/Documentation/RCU/lockdep-splat.txt @@ -14,9 +14,9 @@ being the real world and all that. So let's look at an example RCU lockdep splat from 3.0-rc5, one that has long since been fixed: -=============================== -[ INFO: suspicious RCU usage. ] -------------------------------- +============================= +WARNING: suspicious RCU usage +----------------------------- block/cfq-iosched.c:2776 suspicious rcu_dereference_protected() usage! other info that might help us debug this: @@ -24,11 +24,11 @@ other info that might help us debug this: rcu_scheduler_active = 1, debug_locks = 0 3 locks held by scsi_scan_6/1552: - #0: (&shost->scan_mutex){+.+.+.}, at: [<ffffffff8145efca>] + #0: (&shost->scan_mutex){+.+.}, at: [<ffffffff8145efca>] scsi_scan_host_selected+0x5a/0x150 - #1: (&eq->sysfs_lock){+.+...}, at: [<ffffffff812a5032>] + #1: (&eq->sysfs_lock){+.+.}, at: [<ffffffff812a5032>] elevator_exit+0x22/0x60 - #2: (&(&q->__queue_lock)->rlock){-.-...}, at: [<ffffffff812b6233>] + #2: (&(&q->__queue_lock)->rlock){-.-.}, at: [<ffffffff812b6233>] cfq_exit_queue+0x43/0x190 stack backtrace: diff --git a/Documentation/RCU/rcu.txt b/Documentation/RCU/rcu.txt index 721b3e426515..c818cf65c5a9 100644 --- a/Documentation/RCU/rcu.txt +++ b/Documentation/RCU/rcu.txt @@ -52,10 +52,10 @@ o If I am running on a uniprocessor kernel, which can only do one o How can I see where RCU is currently used in the Linux kernel? Search for "rcu_read_lock", "rcu_read_unlock", "call_rcu", - "rcu_read_lock_bh", "rcu_read_unlock_bh", "call_rcu_bh", - "srcu_read_lock", "srcu_read_unlock", "synchronize_rcu", - "synchronize_net", "synchronize_srcu", and the other RCU - primitives. Or grab one of the cscope databases from: + "rcu_read_lock_bh", "rcu_read_unlock_bh", "srcu_read_lock", + "srcu_read_unlock", "synchronize_rcu", "synchronize_net", + "synchronize_srcu", and the other RCU primitives. Or grab one + of the cscope databases from: http://www.rdrop.com/users/paulmck/RCU/linuxusage/rculocktab.html diff --git a/Documentation/RCU/rcu_dereference.txt b/Documentation/RCU/rcu_dereference.txt index ab96227bad42..bf699e8cfc75 100644 --- a/Documentation/RCU/rcu_dereference.txt +++ b/Documentation/RCU/rcu_dereference.txt @@ -351,3 +351,106 @@ garbage values. In short, rcu_dereference() is -not- optional when you are going to dereference the resulting pointer. + + +WHICH MEMBER OF THE rcu_dereference() FAMILY SHOULD YOU USE? + +First, please avoid using rcu_dereference_raw() and also please avoid +using rcu_dereference_check() and rcu_dereference_protected() with a +second argument with a constant value of 1 (or true, for that matter). +With that caution out of the way, here is some guidance for which +member of the rcu_dereference() to use in various situations: + +1. If the access needs to be within an RCU read-side critical + section, use rcu_dereference(). With the new consolidated + RCU flavors, an RCU read-side critical section is entered + using rcu_read_lock(), anything that disables bottom halves, + anything that disables interrupts, or anything that disables + preemption. + +2. If the access might be within an RCU read-side critical section + on the one hand, or protected by (say) my_lock on the other, + use rcu_dereference_check(), for example: + + p1 = rcu_dereference_check(p->rcu_protected_pointer, + lockdep_is_held(&my_lock)); + + +3. If the access might be within an RCU read-side critical section + on the one hand, or protected by either my_lock or your_lock on + the other, again use rcu_dereference_check(), for example: + + p1 = rcu_dereference_check(p->rcu_protected_pointer, + lockdep_is_held(&my_lock) || + lockdep_is_held(&your_lock)); + +4. If the access is on the update side, so that it is always protected + by my_lock, use rcu_dereference_protected(): + + p1 = rcu_dereference_protected(p->rcu_protected_pointer, + lockdep_is_held(&my_lock)); + + This can be extended to handle multiple locks as in #3 above, + and both can be extended to check other conditions as well. + +5. If the protection is supplied by the caller, and is thus unknown + to this code, that is the rare case when rcu_dereference_raw() + is appropriate. In addition, rcu_dereference_raw() might be + appropriate when the lockdep expression would be excessively + complex, except that a better approach in that case might be to + take a long hard look at your synchronization design. Still, + there are data-locking cases where any one of a very large number + of locks or reference counters suffices to protect the pointer, + so rcu_dereference_raw() does have its place. + + However, its place is probably quite a bit smaller than one + might expect given the number of uses in the current kernel. + Ditto for its synonym, rcu_dereference_check( ... , 1), and + its close relative, rcu_dereference_protected(... , 1). + + +SPARSE CHECKING OF RCU-PROTECTED POINTERS + +The sparse static-analysis tool checks for direct access to RCU-protected +pointers, which can result in "interesting" bugs due to compiler +optimizations involving invented loads and perhaps also load tearing. +For example, suppose someone mistakenly does something like this: + + p = q->rcu_protected_pointer; + do_something_with(p->a); + do_something_else_with(p->b); + +If register pressure is high, the compiler might optimize "p" out +of existence, transforming the code to something like this: + + do_something_with(q->rcu_protected_pointer->a); + do_something_else_with(q->rcu_protected_pointer->b); + +This could fatally disappoint your code if q->rcu_protected_pointer +changed in the meantime. Nor is this a theoretical problem: Exactly +this sort of bug cost Paul E. McKenney (and several of his innocent +colleagues) a three-day weekend back in the early 1990s. + +Load tearing could of course result in dereferencing a mashup of a pair +of pointers, which also might fatally disappoint your code. + +These problems could have been avoided simply by making the code instead +read as follows: + + p = rcu_dereference(q->rcu_protected_pointer); + do_something_with(p->a); + do_something_else_with(p->b); + +Unfortunately, these sorts of bugs can be extremely hard to spot during +review. This is where the sparse tool comes into play, along with the +"__rcu" marker. If you mark a pointer declaration, whether in a structure +or as a formal parameter, with "__rcu", which tells sparse to complain if +this pointer is accessed directly. It will also cause sparse to complain +if a pointer not marked with "__rcu" is accessed using rcu_dereference() +and friends. For example, ->rcu_protected_pointer might be declared as +follows: + + struct foo __rcu *rcu_protected_pointer; + +Use of "__rcu" is opt-in. If you choose not to use it, then you should +ignore the sparse warnings. diff --git a/Documentation/RCU/rcubarrier.txt b/Documentation/RCU/rcubarrier.txt index 5d7759071a3e..a2782df69732 100644 --- a/Documentation/RCU/rcubarrier.txt +++ b/Documentation/RCU/rcubarrier.txt @@ -83,16 +83,15 @@ Pseudo-code using rcu_barrier() is as follows: 2. Execute rcu_barrier(). 3. Allow the module to be unloaded. -There are also rcu_barrier_bh(), rcu_barrier_sched(), and srcu_barrier() -functions for the other flavors of RCU, and you of course must match -the flavor of rcu_barrier() with that of call_rcu(). If your module -uses multiple flavors of call_rcu(), then it must also use multiple +There is also an srcu_barrier() function for SRCU, and you of course +must match the flavor of rcu_barrier() with that of call_rcu(). If your +module uses multiple flavors of call_rcu(), then it must also use multiple flavors of rcu_barrier() when unloading that module. For example, if -it uses call_rcu_bh(), call_srcu() on srcu_struct_1, and call_srcu() on +it uses call_rcu(), call_srcu() on srcu_struct_1, and call_srcu() on srcu_struct_2(), then the following three lines of code will be required when unloading: - 1 rcu_barrier_bh(); + 1 rcu_barrier(); 2 srcu_barrier(&srcu_struct_1); 3 srcu_barrier(&srcu_struct_2); @@ -185,12 +184,12 @@ module invokes call_rcu() from timers, you will need to first cancel all the timers, and only then invoke rcu_barrier() to wait for any remaining RCU callbacks to complete. -Of course, if you module uses call_rcu_bh(), you will need to invoke -rcu_barrier_bh() before unloading. Similarly, if your module uses -call_rcu_sched(), you will need to invoke rcu_barrier_sched() before -unloading. If your module uses call_rcu(), call_rcu_bh(), -and- -call_rcu_sched(), then you will need to invoke each of rcu_barrier(), -rcu_barrier_bh(), and rcu_barrier_sched(). +Of course, if you module uses call_rcu(), you will need to invoke +rcu_barrier() before unloading. Similarly, if your module uses +call_srcu(), you will need to invoke srcu_barrier() before unloading, +and on the same srcu_struct structure. If your module uses call_rcu() +-and- call_srcu(), then you will need to invoke rcu_barrier() -and- +srcu_barrier(). Implementing rcu_barrier() @@ -223,8 +222,8 @@ shown below. Note that the final "1" in on_each_cpu()'s argument list ensures that all the calls to rcu_barrier_func() will have completed before on_each_cpu() returns. Line 9 then waits for the completion. -This code was rewritten in 2008 to support rcu_barrier_bh() and -rcu_barrier_sched() in addition to the original rcu_barrier(). +This code was rewritten in 2008 and several times thereafter, but this +still gives the general idea. The rcu_barrier_func() runs on each CPU, where it invokes call_rcu() to post an RCU callback, as follows: diff --git a/Documentation/RCU/stallwarn.txt b/Documentation/RCU/stallwarn.txt index 073dbc12d1ea..1ab70c37921f 100644 --- a/Documentation/RCU/stallwarn.txt +++ b/Documentation/RCU/stallwarn.txt @@ -219,17 +219,18 @@ an estimate of the total number of RCU callbacks queued across all CPUs In kernels with CONFIG_RCU_FAST_NO_HZ, more information is printed for each CPU: - 0: (64628 ticks this GP) idle=dd5/3fffffffffffffff/0 softirq=82/543 last_accelerate: a345/d342 nonlazy_posted: 25 .D + 0: (64628 ticks this GP) idle=dd5/3fffffffffffffff/0 softirq=82/543 last_accelerate: a345/d342 Nonlazy posted: ..D The "last_accelerate:" prints the low-order 16 bits (in hex) of the jiffies counter when this CPU last invoked rcu_try_advance_all_cbs() from rcu_needs_cpu() or last invoked rcu_accelerate_cbs() from -rcu_prepare_for_idle(). The "nonlazy_posted:" prints the number -of non-lazy callbacks posted since the last call to rcu_needs_cpu(). -Finally, an "L" indicates that there are currently no non-lazy callbacks -("." is printed otherwise, as shown above) and "D" indicates that -dyntick-idle processing is enabled ("." is printed otherwise, for example, -if disabled via the "nohz=" kernel boot parameter). +rcu_prepare_for_idle(). The "Nonlazy posted:" indicates lazy-callback +status, so that an "l" indicates that all callbacks were lazy at the start +of the last idle period and an "L" indicates that there are currently +no non-lazy callbacks (in both cases, "." is printed otherwise, as +shown above) and "D" indicates that dyntick-idle processing is enabled +("." is printed otherwise, for example, if disabled via the "nohz=" +kernel boot parameter). If the grace period ends just as the stall warning starts printing, there will be a spurious stall-warning message, which will include diff --git a/Documentation/RCU/torture.txt b/Documentation/RCU/torture.txt index 55918b54808b..a41a0384d20c 100644 --- a/Documentation/RCU/torture.txt +++ b/Documentation/RCU/torture.txt @@ -10,173 +10,8 @@ status messages via printk(), which can be examined via the dmesg command (perhaps grepping for "torture"). The test is started when the module is loaded, and stops when the module is unloaded. - -MODULE PARAMETERS - -This module has the following parameters: - -fqs_duration Duration (in microseconds) of artificially induced bursts - of force_quiescent_state() invocations. In RCU - implementations having force_quiescent_state(), these - bursts help force races between forcing a given grace - period and that grace period ending on its own. - -fqs_holdoff Holdoff time (in microseconds) between consecutive calls - to force_quiescent_state() within a burst. - -fqs_stutter Wait time (in seconds) between consecutive bursts - of calls to force_quiescent_state(). - -gp_normal Make the fake writers use normal synchronous grace-period - primitives. - -gp_exp Make the fake writers use expedited synchronous grace-period - primitives. If both gp_normal and gp_exp are set, or - if neither gp_normal nor gp_exp are set, then randomly - choose the primitive so that about 50% are normal and - 50% expedited. By default, neither are set, which - gives best overall test coverage. - -irqreader Says to invoke RCU readers from irq level. This is currently - done via timers. Defaults to "1" for variants of RCU that - permit this. (Or, more accurately, variants of RCU that do - -not- permit this know to ignore this variable.) - -n_barrier_cbs If this is nonzero, RCU barrier testing will be conducted, - in which case n_barrier_cbs specifies the number of - RCU callbacks (and corresponding kthreads) to use for - this testing. The value cannot be negative. If you - specify this to be non-zero when torture_type indicates a - synchronous RCU implementation (one for which a member of - the synchronize_rcu() rather than the call_rcu() family is - used -- see the documentation for torture_type below), an - error will be reported and no testing will be carried out. - -nfakewriters This is the number of RCU fake writer threads to run. Fake - writer threads repeatedly use the synchronous "wait for - current readers" function of the interface selected by - torture_type, with a delay between calls to allow for various - different numbers of writers running in parallel. - nfakewriters defaults to 4, which provides enough parallelism - to trigger special cases caused by multiple writers, such as - the synchronize_srcu() early return optimization. - -nreaders This is the number of RCU reading threads supported. - The default is twice the number of CPUs. Why twice? - To properly exercise RCU implementations with preemptible - read-side critical sections. - -onoff_interval - The number of seconds between each attempt to execute a - randomly selected CPU-hotplug operation. Defaults to - zero, which disables CPU hotplugging. In HOTPLUG_CPU=n - kernels, rcutorture will silently refuse to do any - CPU-hotplug operations regardless of what value is - specified for onoff_interval. - -onoff_holdoff The number of seconds to wait until starting CPU-hotplug - operations. This would normally only be used when - rcutorture was built into the kernel and started - automatically at boot time, in which case it is useful - in order to avoid confusing boot-time code with CPUs - coming and going. - -shuffle_interval - The number of seconds to keep the test threads affinitied - to a particular subset of the CPUs, defaults to 3 seconds. - Used in conjunction with test_no_idle_hz. - -shutdown_secs The number of seconds to run the test before terminating - the test and powering off the system. The default is - zero, which disables test termination and system shutdown. - This capability is useful for automated testing. - -stall_cpu The number of seconds that a CPU should be stalled while - within both an rcu_read_lock() and a preempt_disable(). - This stall happens only once per rcutorture run. - If you need multiple stalls, use modprobe and rmmod to - repeatedly run rcutorture. The default for stall_cpu - is zero, which prevents rcutorture from stalling a CPU. - - Note that attempts to rmmod rcutorture while the stall - is ongoing will hang, so be careful what value you - choose for this module parameter! In addition, too-large - values for stall_cpu might well induce failures and - warnings in other parts of the kernel. You have been - warned! - -stall_cpu_holdoff - The number of seconds to wait after rcutorture starts - before stalling a CPU. Defaults to 10 seconds. - -stat_interval The number of seconds between output of torture - statistics (via printk()). Regardless of the interval, - statistics are printed when the module is unloaded. - Setting the interval to zero causes the statistics to - be printed -only- when the module is unloaded, and this - is the default. - -stutter The length of time to run the test before pausing for this - same period of time. Defaults to "stutter=5", so as - to run and pause for (roughly) five-second intervals. - Specifying "stutter=0" causes the test to run continuously - without pausing, which is the old default behavior. - -test_boost Whether or not to test the ability of RCU to do priority - boosting. Defaults to "test_boost=1", which performs - RCU priority-inversion testing only if the selected - RCU implementation supports priority boosting. Specifying - "test_boost=0" never performs RCU priority-inversion - testing. Specifying "test_boost=2" performs RCU - priority-inversion testing even if the selected RCU - implementation does not support RCU priority boosting, - which can be used to test rcutorture's ability to - carry out RCU priority-inversion testing. - -test_boost_interval - The number of seconds in an RCU priority-inversion test - cycle. Defaults to "test_boost_interval=7". It is - usually wise for this value to be relatively prime to - the value selected for "stutter". - -test_boost_duration - The number of seconds to do RCU priority-inversion testing - within any given "test_boost_interval". Defaults to - "test_boost_duration=4". - -test_no_idle_hz Whether or not to test the ability of RCU to operate in - a kernel that disables the scheduling-clock interrupt to - idle CPUs. Boolean parameter, "1" to test, "0" otherwise. - Defaults to omitting this test. - -torture_type The type of RCU to test, with string values as follows: - - "rcu": rcu_read_lock(), rcu_read_unlock() and call_rcu(), - along with expedited, synchronous, and polling - variants. - - "rcu_bh": rcu_read_lock_bh(), rcu_read_unlock_bh(), and - call_rcu_bh(), along with expedited and synchronous - variants. - - "rcu_busted": This tests an intentionally incorrect version - of RCU in order to help test rcutorture itself. - - "srcu": srcu_read_lock(), srcu_read_unlock() and - call_srcu(), along with expedited and - synchronous variants. - - "sched": preempt_disable(), preempt_enable(), and - call_rcu_sched(), along with expedited, - synchronous, and polling variants. - - "tasks": voluntary context switch and call_rcu_tasks(), - along with expedited and synchronous variants. - - Defaults to "rcu". - -verbose Enable debug printk()s. Default is disabled. - +Module parameters are prefixed by "rcutorture." in +Documentation/admin-guide/kernel-parameters.txt. OUTPUT diff --git a/Documentation/RCU/whatisRCU.txt b/Documentation/RCU/whatisRCU.txt index 4a6854318b17..981651a8b65d 100644 --- a/Documentation/RCU/whatisRCU.txt +++ b/Documentation/RCU/whatisRCU.txt @@ -302,7 +302,7 @@ rcu_dereference() must prohibit. The rcu_dereference_protected() variant takes a lockdep expression to indicate which locks must be acquired by the caller. If the indicated protection is not provided, - a lockdep splat is emitted. See RCU/Design/Requirements.html + a lockdep splat is emitted. See RCU/Design/Requirements/Requirements.html and the API's code comments for more details and example usage. The following diagram shows how each API communicates among the @@ -310,7 +310,7 @@ reader, updater, and reclaimer. rcu_assign_pointer() - +--------+ + +--------+ +---------------------->| reader |---------+ | +--------+ | | | | @@ -318,12 +318,12 @@ reader, updater, and reclaimer. | | | rcu_read_lock() | | | rcu_read_unlock() | rcu_dereference() | | - +---------+ | | - | updater |<---------------------+ | - +---------+ V + +---------+ | | + | updater |<----------------+ | + +---------+ V | +-----------+ +----------------------------------->| reclaimer | - +-----------+ + +-----------+ Defer: synchronize_rcu() & call_rcu() @@ -560,7 +560,7 @@ presents two such "toy" implementations of RCU, one that is implemented in terms of familiar locking primitives, and another that more closely resembles "classic" RCU. Both are way too simple for real-world use, lacking both functionality and performance. However, they are useful -in getting a feel for how RCU works. See kernel/rcupdate.c for a +in getting a feel for how RCU works. See kernel/rcu/update.c for a production-quality implementation, and see: http://www.rdrop.com/users/paulmck/RCU |