yocto-kernel-cache - Patches and configuration for the linux-yocto kernel tree

Age	Commit message	Author
2021-05-12	rt: prep for 5.13	Bruce Ashfield
	Signed-off-by: Bruce Ashfield <bruce.ashfield@gmail.com>
2021-03-19	highmem: Don't disable preemption on RT in kmap_atomic()	Bruce Ashfield
	1/191 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: highmem: Don't disable preemption on RT in kmap_atomic() Date: Fri, 30 Oct 2020 13:59:06 +0100 Disabling preemption makes it impossible to acquire sleeping locks within kmap_atomic() section. For PREEMPT_RT it is sufficient to disable migration. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 2/191 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: timers: Move clearing of base::timer_running under base::lock Date: Sun, 6 Dec 2020 22:40:07 +0100 syzbot reported KCSAN data races vs. timer_base::timer_running being set to NULL without holding base::lock in expire_timers(). This looks innocent and most reads are clearly not problematic but for a non-RT kernel it's completely irrelevant whether the store happens before or after taking the lock. For an RT kernel moving the store under the lock requires an extra unlock/lock pair in the case that there is a waiter for the timer. But that's not the end of the world and definitely not worth the trouble of adding boatloads of comments and annotations to the code. Famous last words... 
Reported-by: syzbot+aa7c2385d46c5eba0b89@syzkaller.appspotmail.com Reported-by: syzbot+abea4558531bae1ba9fe@syzkaller.appspotmail.com Link: https://lkml.kernel.org/r/87lfea7gw8.fsf@nanos.tec.linutronix.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: stable-rt@vger.kernel.org ] 3/191 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: kthread: Move prio/affinite change into the newly created thread Date: Mon, 9 Nov 2020 21:30:41 +0100 With enabled threaded interrupts the nouveau driver reported the following: \| Chain exists of: \| &mm->mmap_lock#2 --> &device->mutex --> &cpuset_rwsem \| \| Possible unsafe locking scenario: \| \| CPU0 CPU1 \| ---- ---- \| lock(&cpuset_rwsem); \| lock(&device->mutex); \| lock(&cpuset_rwsem); \| lock(&mm->mmap_lock#2); The device->mutex is nvkm_device::mutex. Unblocking the lockchain at `cpuset_rwsem' is probably the easiest thing to do. Move the priority reset to the start of the newly created thread. Fixes: 710da3c8ea7df ("sched/core: Prevent race condition between cpuset and __sched_setscheduler()") Reported-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Link: https://lkml.kernel.org/r/a23a826af7c108ea5651e73b8fbae5e653f16e86.camel@gmx.de ] 4/191 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: genirq: Move prio assignment into the newly created thread Date: Mon, 9 Nov 2020 23:32:39 +0100 With enabled threaded interrupts the nouveau driver reported the following: \| Chain exists of: \| &mm->mmap_lock#2 --> &device->mutex --> &cpuset_rwsem \| \| Possible unsafe locking scenario: \| \| CPU0 CPU1 \| ---- ---- \| lock(&cpuset_rwsem); \| lock(&device->mutex); \| lock(&cpuset_rwsem); \| lock(&mm->mmap_lock#2); The device->mutex is nvkm_device::mutex. Unblocking the lockchain at `cpuset_rwsem' is probably the easiest thing to do. 
Move the priority assignment to the start of the newly created thread. Fixes: 710da3c8ea7df ("sched/core: Prevent race condition between cpuset and __sched_setscheduler()") Reported-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> [bigeasy: Patch description] Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Link: https://lkml.kernel.org/r/a23a826af7c108ea5651e73b8fbae5e653f16e86.camel@gmx.de ] 5/191 [ Author: Valentin Schneider Email: valentin.schneider@arm.com Subject: notifier: Make atomic_notifiers use raw_spinlock Date: Sun, 22 Nov 2020 20:19:04 +0000 Booting a recent PREEMPT_RT kernel (v5.10-rc3-rt7-rebase) on my arm64 Juno leads to the idle task blocking on an RT sleeping spinlock down some notifier path: [ 1.809101] BUG: scheduling while atomic: swapper/5/0/0x00000002 [ 1.809116] Modules linked in: [ 1.809123] Preemption disabled at: [ 1.809125] secondary_start_kernel (arch/arm64/kernel/smp.c:227) [ 1.809146] CPU: 5 PID: 0 Comm: swapper/5 Tainted: G W 5.10.0-rc3-rt7 #168 [ 1.809153] Hardware name: ARM Juno development board (r0) (DT) [ 1.809158] Call trace: [ 1.809160] dump_backtrace (arch/arm64/kernel/stacktrace.c:100 (discriminator 1)) [ 1.809170] show_stack (arch/arm64/kernel/stacktrace.c:198) [ 1.809178] dump_stack (lib/dump_stack.c:122) [ 1.809188] __schedule_bug (kernel/sched/core.c:4886) [ 1.809197] __schedule (./arch/arm64/include/asm/preempt.h:18 kernel/sched/core.c:4913 kernel/sched/core.c:5040) [ 1.809204] preempt_schedule_lock (kernel/sched/core.c:5365 (discriminator 1)) [ 1.809210] rt_spin_lock_slowlock_locked (kernel/locking/rtmutex.c:1072) [ 1.809217] rt_spin_lock_slowlock (kernel/locking/rtmutex.c:1110) [ 1.809224] rt_spin_lock (./include/linux/rcupdate.h:647 kernel/locking/rtmutex.c:1139) [ 1.809231] atomic_notifier_call_chain_robust (kernel/notifier.c:71 kernel/notifier.c:118 kernel/notifier.c:186) [ 1.809240] cpu_pm_enter (kernel/cpu_pm.c:39 kernel/cpu_pm.c:93) [ 1.809249] 
psci_enter_idle_state (drivers/cpuidle/cpuidle-psci.c:52 drivers/cpuidle/cpuidle-psci.c:129) [ 1.809258] cpuidle_enter_state (drivers/cpuidle/cpuidle.c:238) [ 1.809267] cpuidle_enter (drivers/cpuidle/cpuidle.c:353) [ 1.809275] do_idle (kernel/sched/idle.c:132 kernel/sched/idle.c:213 kernel/sched/idle.c:273) [ 1.809282] cpu_startup_entry (kernel/sched/idle.c:368 (discriminator 1)) [ 1.809288] secondary_start_kernel (arch/arm64/kernel/smp.c:273) Two points worth noting: 1) This is conceptually the same issue as pointed out in: 313c8c16ee62 ("PM / CPU: replace raw_notifier with atomic_notifier") 2) Only the _robust() variant of atomic_notifier callchains suffers from this. AFAICT only the cpu_pm_notifier_chain really needs to be changed, but singling it out would mean introducing a new (truly) non-blocking API. At the same time, callers that are fine with any blocking within the call chain should use blocking notifiers, so patching up all atomic_notifier's doesn't seem too crazy to me. Fixes: 70d932985757 ("notifier: Fix broken error handling pattern") Signed-off-by: Valentin Schneider <valentin.schneider@arm.com> Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com> Link: https://lkml.kernel.org/r/20201122201904.30940-1-valentin.schneider@arm.com Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 6/191 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: powerpc/mm: Move the linear_mapping_mutex to the ifdef where it is used Date: Fri, 19 Feb 2021 17:51:07 +0100 The mutex linear_mapping_mutex is defined at the of the file while its only two users are within the CONFIG_MEMORY_HOTPLUG block. A compile without CONFIG_MEMORY_HOTPLUG set fails on PREEMPT_RT because its mutex implementation is smart enough to realize that it is unused. Move the definition of linear_mapping_mutex to the ifdef block where it is used.
Fixes: 1f73ad3e8d755 ("powerpc/mm: print warning in arch_remove_linear_mapping()") Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 7/191 [ Author: John Ogness Email: john.ogness@linutronix.de Subject: printk: limit second loop of syslog_print_all Date: Wed, 17 Feb 2021 16:15:31 +0100 The second loop of syslog_print_all() subtracts lengths that were added in the first loop. With commit b031a684bfd0 ("printk: remove logbuf_lock writer-protection of ringbuffer") it is possible that records are (over)written during syslog_print_all(). This allows the possibility of the second loop subtracting lengths that were never added in the first loop. This situation can result in syslog_print_all() filling the buffer starting from a later record, even though there may have been room to fit the earlier record(s) as well. Fixes: b031a684bfd0 ("printk: remove logbuf_lock writer-protection of ringbuffer") Signed-off-by: John Ogness <john.ogness@linutronix.de> Reviewed-by: Petr Mladek <pmladek@suse.com> ] 8/191 [ Author: John Ogness Email: john.ogness@linutronix.de Subject: printk: kmsg_dump: remove unused fields Date: Mon, 21 Dec 2020 11:19:39 +0106 struct kmsg_dumper still contains some fields that were used to iterate the old ringbuffer. They are no longer used. Remove them and update the struct documentation. Signed-off-by: John Ogness <john.ogness@linutronix.de> Reviewed-by: Petr Mladek <pmladek@suse.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 9/191 [ Author: John Ogness Email: john.ogness@linutronix.de Subject: printk: refactor kmsg_dump_get_buffer() Date: Mon, 30 Nov 2020 01:41:56 +0106 kmsg_dump_get_buffer() requires nearly the same logic as syslog_print_all(), but uses different variable names and does not make use of the ringbuffer loop macros. Modify kmsg_dump_get_buffer() so that the implementation is as similar to syslog_print_all() as possible. A follow-up commit will move this common logic into a separate helper function. 
Signed-off-by: John Ogness <john.ogness@linutronix.de> Reviewed-by: Petr Mladek <pmladek@suse.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 10/191 [ Author: John Ogness Email: john.ogness@linutronix.de Subject: printk: consolidate kmsg_dump_get_buffer/syslog_print_all code Date: Wed, 13 Jan 2021 11:29:53 +0106 The logic for finding records to fit into a buffer is the same for kmsg_dump_get_buffer() and syslog_print_all(). Introduce a helper function find_first_fitting_seq() to handle this logic. Signed-off-by: John Ogness <john.ogness@linutronix.de> ] 11/191 [ Author: John Ogness Email: john.ogness@linutronix.de Subject: printk: introduce CONSOLE_LOG_MAX for improved multi-line support Date: Thu, 10 Dec 2020 12:48:01 +0106 Instead of using "LOG_LINE_MAX + PREFIX_MAX" for temporary buffer sizes, introduce CONSOLE_LOG_MAX. This represents the maximum size that is allowed to be printed to the console for a single record. Rather than setting CONSOLE_LOG_MAX to "LOG_LINE_MAX + PREFIX_MAX" (1024), increase it to 4096. With a larger buffer size, multi-line records that are nearly LOG_LINE_MAX in length will have a better chance of being fully printed. (When formatting a record for the console, each line of a multi-line record is prepended with a copy of the prefix.) Signed-off-by: John Ogness <john.ogness@linutronix.de> ] 12/191 [ Author: John Ogness Email: john.ogness@linutronix.de Subject: printk: use seqcount_latch for clear_seq Date: Mon, 30 Nov 2020 01:41:58 +0106 kmsg_dump_rewind_nolock() locklessly reads @clear_seq. However, this is not done atomically. Since @clear_seq is 64-bit, this cannot be an atomic operation for all platforms. Therefore, use a seqcount_latch to allow readers to always read a consistent value. 
Signed-off-by: John Ogness <john.ogness@linutronix.de> Reviewed-by: Petr Mladek <pmladek@suse.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 13/191 [ Author: John Ogness Email: john.ogness@linutronix.de Subject: printk: use atomic64_t for devkmsg_user.seq Date: Thu, 10 Dec 2020 15:33:40 +0106 @user->seq is indirectly protected by @logbuf_lock. Once @logbuf_lock is removed, @user->seq will be no longer safe from an atomicity point of view. In preparation for the removal of @logbuf_lock, change it to atomic64_t to provide this safety. Signed-off-by: John Ogness <john.ogness@linutronix.de> ] 14/191 [ Author: John Ogness Email: john.ogness@linutronix.de Subject: printk: add syslog_lock Date: Thu, 10 Dec 2020 16:58:02 +0106 The global variables @syslog_seq, @syslog_partial, @syslog_time and write access to @clear_seq are protected by @logbuf_lock. Once @logbuf_lock is removed, these variables will need their own synchronization method. Introduce @syslog_lock for this purpose. @syslog_lock is a raw_spin_lock for now. This simplifies the transition to removing @logbuf_lock. Once @logbuf_lock and the safe buffers are removed, @syslog_lock can change to spin_lock. Signed-off-by: John Ogness <john.ogness@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 15/191 [ Author: John Ogness Email: john.ogness@linutronix.de Subject: printk: introduce a kmsg_dump iterator Date: Fri, 18 Dec 2020 11:40:08 +0000 Rather than store the iterator information into the registered kmsg_dump structure, create a separate iterator structure. The kmsg_dump_iter structure can reside on the stack of the caller, thus allowing lockless use of the kmsg_dump functions. This is in preparation for removal of @logbuf_lock. 
Signed-off-by: John Ogness <john.ogness@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 16/191 [ Author: John Ogness Email: john.ogness@linutronix.de Subject: um: synchronize kmsg_dumper Date: Mon, 21 Dec 2020 11:10:03 +0106 The kmsg_dumper can be called from any context and CPU, possibly from multiple CPUs simultaneously. Since a static buffer is used to retrieve the kernel logs, this buffer must be protected against simultaneous dumping. Cc: Richard Weinberger <richard@nod.at> Signed-off-by: John Ogness <john.ogness@linutronix.de> Reviewed-by: Petr Mladek <pmladek@suse.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 17/191 [ Author: John Ogness Email: john.ogness@linutronix.de Subject: printk: remove logbuf_lock Date: Tue, 26 Jan 2021 17:43:19 +0106 Since the ringbuffer is lockless, there is no need for it to be protected by @logbuf_lock. Remove @logbuf_lock. This means that printk_nmi_direct and printk_safe_flush_on_panic() no longer need to acquire any lock to run. @console_seq, @exclusive_console_stop_seq, @console_dropped are protected by @console_lock. Signed-off-by: John Ogness <john.ogness@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 18/191 [ Author: John Ogness Email: john.ogness@linutronix.de Subject: printk: kmsg_dump: remove _nolock() variants Date: Mon, 21 Dec 2020 10:27:58 +0106 kmsg_dump_rewind() and kmsg_dump_get_line() are lockless, so there is no need for _nolock() variants. Remove these functions and switch all callers of the _nolock() variants. The functions without _nolock() were chosen because they are already exported to kernel modules. Signed-off-by: John Ogness <john.ogness@linutronix.de> ] 19/191 [ Author: John Ogness Email: john.ogness@linutronix.de Subject: printk: kmsg_dump: use kmsg_dump_rewind Date: Wed, 17 Feb 2021 18:23:16 +0100 kmsg_dump() is open coding the kmsg_dump_rewind(). Call kmsg_dump_rewind() instead. 
Signed-off-by: John Ogness <john.ogness@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 20/191 [ Author: John Ogness Email: john.ogness@linutronix.de Subject: printk: console: remove unnecessary safe buffer usage Date: Wed, 17 Feb 2021 18:28:05 +0100 Upon registering a console, safe buffers are activated when setting up the sequence number to replay the log. However, these are already protected by @console_sem and @syslog_lock. Remove the unnecessary safe buffer usage. Signed-off-by: John Ogness <john.ogness@linutronix.de> ] 21/191 [ Author: John Ogness Email: john.ogness@linutronix.de Subject: printk: track/limit recursion Date: Fri, 11 Dec 2020 00:55:25 +0106 Limit printk() recursion to 1 level. This is enough to print a stacktrace for the printk call, should a WARN or BUG occur. Signed-off-by: John Ogness <john.ogness@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 22/191 [ Author: John Ogness Email: john.ogness@linutronix.de Subject: printk: remove safe buffers Date: Mon, 30 Nov 2020 01:42:00 +0106 With @logbuf_lock removed, the high level printk functions for storing messages are lockless. Messages can be stored from any context, so there is no need for the NMI and safe buffers anymore. Remove the NMI and safe buffers. In NMI or safe contexts, store the message immediately but still use irq_work to defer the console printing. 
Signed-off-by: John Ogness <john.ogness@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 23/191 [ Author: John Ogness Email: john.ogness@linutronix.de Subject: printk: convert @syslog_lock to spin_lock Date: Thu, 18 Feb 2021 17:37:41 +0100 Signed-off-by: John Ogness <john.ogness@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 24/191 [ Author: John Ogness Email: john.ogness@linutronix.de Subject: console: add write_atomic interface Date: Mon, 30 Nov 2020 01:42:01 +0106 Add a write_atomic() callback to the console. This is an optional function for console drivers. The function must be atomic (including NMI safe) for writing to the console. Console drivers must still implement the write() callback. The write_atomic() callback will only be used in special situations, such as when the kernel panics. Creating an NMI safe write_atomic() that must synchronize with write() requires a careful implementation of the console driver. To aid with the implementation, a set of console_atomic_() functions are provided: void console_atomic_lock(unsigned int flags); void console_atomic_unlock(unsigned int flags); These functions synchronize using a processor-reentrant spinlock (called a cpulock). Signed-off-by: John Ogness <john.ogness@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 25/191 [ Author: John Ogness Email: john.ogness@linutronix.de Subject: serial: 8250: implement write_atomic Date: Mon, 30 Nov 2020 01:42:02 +0106 Implement a non-sleeping NMI-safe write_atomic() console function in order to support emergency console printing. Since interrupts need to be disabled during transmit, all usage of the IER register is wrapped with access functions that use the console_atomic_lock() function to synchronize register access while tracking the state of the interrupts. This is necessary because write_atomic() can be called from an NMI context that has preempted write_atomic(). 
Signed-off-by: John Ogness <john.ogness@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 26/191 [ Author: John Ogness Email: john.ogness@linutronix.de Subject: printk: relocate printk_delay() and vprintk_default() Date: Mon, 30 Nov 2020 01:42:03 +0106 Move printk_delay() and vprintk_default() "as is" further up so that they can be used by new functions in an upcoming commit. Signed-off-by: John Ogness <john.ogness@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 27/191 [ Author: John Ogness Email: john.ogness@linutronix.de Subject: printk: combine boot_delay_msec() into printk_delay() Date: Mon, 30 Nov 2020 01:42:04 +0106 boot_delay_msec() is always called immediately before printk_delay() so just combine the two. Signed-off-by: John Ogness <john.ogness@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 28/191 [ Author: John Ogness Email: john.ogness@linutronix.de Subject: printk: change @console_seq to atomic64_t Date: Mon, 30 Nov 2020 01:42:05 +0106 In preparation for atomic printing, change @console_seq to atomic so that it can be accessed without requiring @console_sem. Signed-off-by: John Ogness <john.ogness@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 29/191 [ Author: John Ogness Email: john.ogness@linutronix.de Subject: printk: introduce kernel sync mode Date: Mon, 30 Nov 2020 01:42:06 +0106 When the kernel performs an OOPS, enter into "sync mode": - only atomic consoles (write_atomic() callback) will print - printing occurs within vprintk_store() instead of console_unlock() CONSOLE_LOG_MAX is moved to printk.h to support the per-console buffer used in sync mode. 
Signed-off-by: John Ogness <john.ogness@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 30/191 [ Author: John Ogness Email: john.ogness@linutronix.de Subject: printk: move console printing to kthreads Date: Mon, 30 Nov 2020 01:42:07 +0106 Create a kthread for each console to perform console printing. Now all console printing is fully asynchronous except for the boot console and when the kernel enters sync mode (and there are atomic consoles available). The console_lock() and console_unlock() functions now only do what their name says... locking and unlocking of the console. Signed-off-by: John Ogness <john.ogness@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 31/191 [ Author: John Ogness Email: john.ogness@linutronix.de Subject: printk: remove deferred printing Date: Mon, 30 Nov 2020 01:42:08 +0106 Since printing occurs either atomically or from the printing kthread, there is no need for any deferring or tracking possible recursion paths. Remove all printk context tracking. Signed-off-by: John Ogness <john.ogness@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 32/191 [ Author: John Ogness Email: john.ogness@linutronix.de Subject: printk: add console handover Date: Mon, 30 Nov 2020 01:42:09 +0106 If earlyprintk is used, a boot console will print directly to the console immediately. The boot console will unregister itself as soon as a non-boot console registers. However, the non-boot console does not begin printing until its kthread has started. Since this happens much later, there is a long pause in the console output. If the ringbuffer is small, messages could even be dropped during the pause. Add a new CON_HANDOVER console flag to be used internally by printk in order to track which non-boot console took over from a boot console. 
If handover consoles have implemented write_atomic(), they are allowed to print directly to the console until their kthread can take over. Signed-off-by: John Ogness <john.ogness@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 33/191 [ Author: John Ogness Email: john.ogness@linutronix.de Subject: printk: add pr_flush() Date: Mon, 30 Nov 2020 01:42:10 +0106 Provide a function to allow waiting for console printers to catch up to the latest logged message. Use pr_flush() to give console printers a chance to finish in critical situations if no atomic console is available. For now pr_flush() is only used in the most common error paths: panic(), print_oops_end_marker(), report_bug(), kmsg_dump(). Signed-off-by: John Ogness <john.ogness@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 34/191 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: kcov: Remove kcov include from sched.h and move it to its users. Date: Thu, 18 Feb 2021 18:31:24 +0100 The recent addition of in_serving_softirq() to kcov.h results in a compile failure on PREEMPT_RT because it requires task_struct::softirq_disable_cnt. This is not available if kcov.h is included from sched.h. There is no need to include kcov.h from sched.h. All users except net/ already include the kcov header file. Move the include of the kcov.h header from sched.h to its users. Additionally include sched.h from kcov.h to ensure that everything task_struct related is available.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Johannes Berg <johannes@sipsolutions.net> Acked-by: Andrey Konovalov <andreyknvl@google.com> Link: https://lkml.kernel.org/r/20210218173124.iy5iyqv3a4oia4vv@linutronix.de Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 35/191 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: cgroup: use irqsave in cgroup_rstat_flush_locked() Date: Tue, 3 Jul 2018 18:19:48 +0200 All callers of cgroup_rstat_flush_locked() acquire cgroup_rstat_lock either with spin_lock_irq() or spin_lock_irqsave(). cgroup_rstat_flush_locked() itself acquires cgroup_rstat_cpu_lock which is a raw_spin_lock. This lock is also acquired in cgroup_rstat_updated() in IRQ context and therefore requires _irqsave() locking suffix in cgroup_rstat_flush_locked(). Since there is no difference between spin_lock_t and raw_spin_lock_t on !RT lockdep does not complain here. On RT lockdep complains because the interrupts were not disabled here and a deadlock is possible. Acquire the raw_spin_lock_t with disabled interrupts. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 36/191 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: mm: workingset: replace IRQ-off check with a lockdep assert. Date: Mon, 11 Feb 2019 10:40:46 +0100 Commit 68d48e6a2df57 ("mm: workingset: add vmstat counter for shadow nodes") introduced an IRQ-off check to ensure that a lock is held which also disabled interrupts. This does not work the same way on -RT because none of the locks, that are held, disable interrupts. Replace this check with a lockdep assert which ensures that the lock is held. 
Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 37/191 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: shmem: Use raw_spinlock_t for ->stat_lock Date: Fri, 14 Aug 2020 18:53:34 +0200 Each CPU has SHMEM_INO_BATCH inodes available in `->ino_batch' which is per-CPU. Access here is serialized by disabling preemption. If the pool is empty, it gets reloaded from `->next_ino'. Access here is serialized by ->stat_lock which is a spinlock_t and cannot be acquired with disabled preemption. One way around it would be to make the per-CPU ino_batch a struct containing the inode number and a local_lock_t. Another solution is to promote ->stat_lock to a raw_spinlock_t. The critical sections are short. The mpol_put() should be moved outside of the critical section to avoid invoking the destructor with disabled preemption. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 38/191 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: net: Move lockdep where it belongs Date: Tue, 8 Sep 2020 07:32:20 +0200 Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 39/191 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: tcp: Remove superfluous BH-disable around listening_hash Date: Mon, 12 Oct 2020 17:33:54 +0200 Commit 9652dc2eb9e40 ("tcp: relax listening_hash operations") removed the need to disable bottom halves while acquiring listening_hash.lock. There are still two callers left which disable bottom halves before the lock is acquired. Drop local_bh_disable() around __inet_hash() which acquires listening_hash->lock, invoke inet_ehash_nolisten() with disabled BH. inet_unhash() conditionally acquires listening_hash->lock.
Reported-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Link: https://lore.kernel.org/linux-rt-users/12d6f9879a97cd56c09fb53dee343cbb14f7f1f7.camel@gmx.de/ Link: https://lkml.kernel.org/r/X9CheYjuXWc75Spa@hirez.programming.kicks-ass.net ] 40/191 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: smp: Wake ksoftirqd on PREEMPT_RT instead do_softirq(). Date: Mon, 15 Feb 2021 18:44:12 +0100 The softirq implementation on PREEMPT_RT does not provide do_softirq(). The other user of do_softirq() is replaced with a local_bh_disable() + enable() around the possible raise-softirq invocation. This can not be done here because migration_cpu_stop() is invoked with disabled preemption. Wake the softirq thread on PREEMPT_RT if there are any pending softirqs. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 41/191 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: tasklets: Replace barrier() with cpu_relax() in tasklet_unlock_wait() Date: Tue, 9 Mar 2021 09:42:04 +0100 A barrier() in a tight loop which waits for something to happen on a remote CPU is a pointless exercise. Replace it with cpu_relax() which allows HT siblings to make progress. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 42/191 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: tasklets: Use static inlines for stub implementations Date: Tue, 9 Mar 2021 09:42:05 +0100 Inlines exist for a reason. 
Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 43/191 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: tasklets: Provide tasklet_disable_in_atomic() Date: Tue, 9 Mar 2021 09:42:06 +0100 Replacing the spin wait loops in tasklet_unlock_wait() with wait_var_event() is not possible as a handful of tasklet_disable() invocations are happening in atomic context. All other invocations are in teardown paths which can sleep. Provide tasklet_disable_in_atomic() and tasklet_unlock_spin_wait() to convert the few atomic use cases over, which allows to change tasklet_disable() and tasklet_unlock_wait() in a later step. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 44/191 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: tasklets: Use spin wait in tasklet_disable() temporarily Date: Tue, 9 Mar 2021 09:42:07 +0100 To ease the transition use spin waiting in tasklet_disable() until all usage sites from atomic context have been cleaned up. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 45/191 [ Author: Peter Zijlstra Email: peterz@infradead.org Subject: tasklets: Replace spin wait in tasklet_unlock_wait() Date: Tue, 9 Mar 2021 09:42:08 +0100 tasklet_unlock_wait() spin waits for TASKLET_STATE_RUN to be cleared. This is wasting CPU cycles in a tight loop which is especially painful in a guest when the CPU running the tasklet is scheduled out. tasklet_unlock_wait() is invoked from tasklet_kill() which is used in teardown paths and not performance critical at all. Replace the spin wait with wait_var_event(). There are no users of tasklet_unlock_wait() which are invoked from atomic contexts. 
The usage in tasklet_disable() has been replaced temporarily with the spin waiting variant until the atomic users are fixed up and will be converted to the sleep wait variant later. Signed-off-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 46/191 [ Author: Peter Zijlstra Email: peterz@infradead.org Subject: tasklets: Replace spin wait in tasklet_kill() Date: Tue, 9 Mar 2021 09:42:09 +0100 tasklet_kill() spin waits for TASKLET_STATE_SCHED to be cleared invoking yield() from inside the loop. yield() is an ill defined mechanism and the result might still be wasting CPU cycles in a tight loop which is especially painful in a guest when the CPU running the tasklet is scheduled out. tasklet_kill() is used in teardown paths and not performance critical at all. Replace the spin wait with wait_var_event(). Signed-off-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 47/191 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: tasklets: Prevent tasklet_unlock_spin_wait() deadlock on RT Date: Tue, 9 Mar 2021 09:42:10 +0100 tasklet_unlock_spin_wait() spin waits for the TASKLET_STATE_SCHED bit in the tasklet state to be cleared. This works on !RT nicely because the corresponding execution can only happen on a different CPU. On RT softirq processing is preemptible, therefore a task preempting the softirq processing thread can spin forever. Prevent this by invoking local_bh_disable()/enable() inside the loop. In case that the softirq processing thread was preempted by the current task, current will block on the local lock which yields the CPU to the preempted softirq processing thread. If the tasklet is processed on a different CPU then the local_bh_disable()/enable() pair is just a waste of processor cycles. 
Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 48/191 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: net: jme: Replace link-change tasklet with work Date: Tue, 9 Mar 2021 09:42:11 +0100 The link change tasklet disables the tasklets for tx/rx processing while updating hw parameters and then enables the tasklets again. This update can also be pushed into a workqueue where it can be performed in preemptible context. This allows tasklet_disable() to become sleepable. Replace the linkch_task tasklet with a work item. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 49/191 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: net: sundance: Use tasklet_disable_in_atomic(). Date: Tue, 9 Mar 2021 09:42:12 +0100 tasklet_disable() is used in the timer callback. This might be disentangled, but without access to the hardware that's a bit risky. Replace it with tasklet_disable_in_atomic() so tasklet_disable() can be changed to a sleep wait once all remaining atomic users are converted. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Denis Kirjanov <kda@linux-powerpc.org> Cc: "David S.
Miller" <davem@davemloft.net> Cc: Jakub Kicinski <kuba@kernel.org> Cc: netdev@vger.kernel.org Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 50/191 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: ath9k: Use tasklet_disable_in_atomic() Date: Tue, 9 Mar 2021 09:42:13 +0100 All callers of ath9k_beacon_ensure_primary_slot() are preemptible / acquire a mutex except for this callchain: spin_lock_bh(&sc->sc_pcu_lock); ath_complete_reset() -> ath9k_calculate_summary_state() -> ath9k_beacon_ensure_primary_slot() It's unclear how that can be disentangled, so use tasklet_disable_in_atomic() for now. This allows tasklet_disable() to become sleepable once the remaining atomic users are cleaned up. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: ath9k-devel@qca.qualcomm.com Cc: Kalle Valo <kvalo@codeaurora.org> Cc: "David S. Miller" <davem@davemloft.net> Cc: Jakub Kicinski <kuba@kernel.org> Cc: linux-wireless@vger.kernel.org Cc: netdev@vger.kernel.org Acked-by: Kalle Valo <kvalo@codeaurora.org> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 51/191 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: atm: eni: Use tasklet_disable_in_atomic() in the send() callback Date: Tue, 9 Mar 2021 09:42:14 +0100 The atmdev_ops::send callback which calls tasklet_disable() is invoked with bottom halves disabled from net_device_ops::ndo_start_xmit(). All other invocations of tasklet_disable() in this driver happen in preemptible context. Change the send() call to use tasklet_disable_in_atomic() which allows tasklet_disable() to be made sleepable once the remaining atomic context usage sites are cleaned up.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Chas Williams <3chas3@gmail.com> Cc: linux-atm-general@lists.sourceforge.net Cc: netdev@vger.kernel.org Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 52/191 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: PCI: hv: Use tasklet_disable_in_atomic() Date: Tue, 9 Mar 2021 09:42:15 +0100 The hv_compose_msi_msg() callback in irq_chip::irq_compose_msi_msg is invoked via irq_chip_compose_msi_msg(), which itself is always invoked from atomic contexts from the guts of the interrupt core code. There is no way to change this w/o rewriting the whole driver, so use tasklet_disable_in_atomic() which allows to make tasklet_disable() sleepable once the remaining atomic users are addressed. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: "K. Y. Srinivasan" <kys@microsoft.com> Cc: Haiyang Zhang <haiyangz@microsoft.com> Cc: Stephen Hemminger <sthemmin@microsoft.com> Cc: Wei Liu <wei.liu@kernel.org> Cc: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com> Cc: Rob Herring <robh@kernel.org> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: linux-hyperv@vger.kernel.org Cc: linux-pci@vger.kernel.org Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 53/191 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: firewire: ohci: Use tasklet_disable_in_atomic() where required Date: Tue, 9 Mar 2021 09:42:16 +0100 tasklet_disable() is invoked in several places. Some of them are in atomic context which prevents a conversion of tasklet_disable() to a sleepable function. The atomic callchains are: ar_context_tasklet() ohci_cancel_packet() tasklet_disable() ... ohci_flush_iso_completions() tasklet_disable() The invocation of tasklet_disable() from at_context_flush() is always in preemptible context. 
Use tasklet_disable_in_atomic() for the two invocations in ohci_cancel_packet() and ohci_flush_iso_completions(). Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Stefan Richter <stefanr@s5r6.in-berlin.de> Cc: linux1394-devel@lists.sourceforge.net Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 54/191 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: tasklets: Switch tasklet_disable() to the sleep wait variant Date: Tue, 9 Mar 2021 09:42:17 +0100 -- NOT FOR IMMEDIATE MERGING -- Now that all users of tasklet_disable() are invoked from sleepable context, convert it to use tasklet_unlock_wait() which might sleep. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 55/191 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: softirq: Add RT specific softirq accounting Date: Tue, 9 Mar 2021 09:55:53 +0100 RT requires the softirq processing and local bottom half disabled regions to be preemptible. Using the normal preempt count based serialization is therefore not possible because this implicitly disables preemption. RT kernels use a per CPU local lock to serialize bottom halves. As local_bh_disable() can nest, the lock can only be acquired on the outermost invocation of local_bh_disable() and released when the nest count becomes zero. Tasks which hold the local lock can be preempted, so it's required to keep track of the nest count per task. Add a RT only counter to task struct and adjust the relevant macros in preempt.h.
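The nesting rule described for patch 55/191 can be sketched as a small userspace model. This is only an illustration under stated assumptions: `model_task`, `model_cpu`, `model_bh_disable()` and `model_bh_enable()` are hypothetical names, not kernel APIs, and the per-CPU local lock is reduced to a boolean because the model is single-threaded.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model: none of these names exist in the kernel. */
struct model_task {
    int softirq_disable_cnt;    /* per-task nest count */
};

struct model_cpu {
    bool bh_lock_held;          /* stands in for the per-CPU local lock */
};

/* Take the lock only on the outermost disable, then count the nesting. */
static void model_bh_disable(struct model_cpu *cpu, struct model_task *t)
{
    if (t->softirq_disable_cnt == 0) {
        /* A real lock would block here; the model only checks the state. */
        assert(!cpu->bh_lock_held);
        cpu->bh_lock_held = true;
    }
    t->softirq_disable_cnt++;
}

/* Drop the lock only when the nest count returns to zero. */
static void model_bh_enable(struct model_cpu *cpu, struct model_task *t)
{
    assert(t->softirq_disable_cnt > 0);
    if (--t->softirq_disable_cnt == 0)
        cpu->bh_lock_held = false;
}
```

Keeping the count in the task rather than in per-CPU state mirrors the reasoning in the commit message: a task holding the lock can be preempted, so the count has to travel with the task.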
Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 56/191 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: irqtime: Make accounting correct on RT Date: Tue, 9 Mar 2021 09:55:54 +0100 vtime_account_irq() and irqtime_account_irq() base checks on preempt_count() which fails on RT because preempt_count() does not contain the softirq accounting which is separate on RT. These checks do not need the full preempt count as they only operate on the hard and softirq sections. Use irq_count() instead which provides the correct value on both RT and non RT kernels. The compiler is clever enough to fold the masking for !RT: 99b: 65 8b 05 00 00 00 00 mov %gs:0x0(%rip),%eax - 9a2: 25 ff ff ff 7f and $0x7fffffff,%eax + 9a2: 25 00 ff ff 00 and $0xffff00,%eax Reported-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 57/191 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: softirq: Move various protections into inline helpers Date: Tue, 9 Mar 2021 09:55:55 +0100 To allow reuse of the bulk of softirq processing code for RT and to avoid #ifdeffery all over the place, split protections for various code sections out into inline helpers so the RT variant can just replace them in one go.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 58/191 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: softirq: Make softirq control and processing RT aware Date: Tue, 9 Mar 2021 09:55:56 +0100 Provide a local lock based serialization for soft interrupts on RT which allows the local_bh_disabled() sections and servicing soft interrupts to be preemptible. Provide the necessary inline helpers which allow to reuse the bulk of the softirq processing code. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 59/191 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: tick/sched: Prevent false positive softirq pending warnings on RT Date: Tue, 9 Mar 2021 09:55:57 +0100 On RT a task which has soft interrupts disabled can block on a lock and schedule out to idle while soft interrupts are pending. This triggers the warning in the NOHZ idle code which complains about going idle with pending soft interrupts. But as the task is blocked soft interrupt processing is temporarily blocked as well which means that such a warning is a false positive. To prevent that check the per CPU state which indicates that a scheduled out task has soft interrupts disabled. 
Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 60/191 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: rcu: Prevent false positive softirq warning on RT Date: Tue, 9 Mar 2021 09:55:58 +0100 Soft interrupt disabled sections can legitimately be preempted or schedule out when blocking on a lock on RT enabled kernels so the RCU preempt check warning has to be disabled for RT kernels. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 61/191 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: locking/rtmutex: Remove cruft Date: Tue, 29 Sep 2020 15:21:17 +0200 Most of this has been around since the very beginning. I'm not sure if this was used while the rtmutex-deadlock-tester was around but today it seems to only waste memory: - save_state: No users - name: Assigned and printed if a deadlock was detected. I'm keeping it but want to point out that lockdep has the same information. - file + line: Printed if ::name was NULL. This is only used for in-kernel locks so ::name shouldn't be NULL and then ::file and ::line aren't used. - magic: Assigned to NULL by rt_mutex_destroy(). Remove members of rt_mutex which are not used. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 62/191 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: locking/rtmutex: Remove output from deadlock detector. Date: Tue, 29 Sep 2020 16:05:11 +0200 In commit f5694788ad8da ("rt_mutex: Add lockdep annotations") rtmutex gained lockdep annotation for rt_mutex_lock() and related functions.
lockdep will see the locking order and may complain about a deadlock before rtmutex' own mechanism gets a chance to detect it. The rtmutex deadlock detector will only complain about locks with the RT_MUTEX_MIN_CHAINWALK and a waiter must be pending. That means it works only for in-kernel locks because the futex interface always uses RT_MUTEX_FULL_CHAINWALK. The requirement for an active waiter limits the detector to actual deadlocks and makes it impossible to report potential deadlocks like lockdep does. It looks like lockdep is better suited for reporting deadlocks. Remove rtmutex' debug print on deadlock detection. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 63/191 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: locking/rtmutex: Move rt_mutex_init() outside of CONFIG_DEBUG_RT_MUTEXES Date: Tue, 29 Sep 2020 16:32:49 +0200 rt_mutex_init() only initializes lockdep if CONFIG_DEBUG_RT_MUTEXES is enabled. The static initializer (DEFINE_RT_MUTEX) does not have such a restriction. Move rt_mutex_init() outside of CONFIG_DEBUG_RT_MUTEXES. Move the remaining functions in this CONFIG_DEBUG_RT_MUTEXES block to the upper block. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 64/191 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: locking/rtmutex: Remove rt_mutex_timed_lock() Date: Wed, 7 Oct 2020 12:11:33 +0200 rt_mutex_timed_lock() has no callers since commit c051b21f71d1f ("rtmutex: Confine deadlock logic to futex") Remove rt_mutex_timed_lock(). Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 65/191 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: locking/rtmutex: Handle the various new futex race conditions Date: Fri, 10 Jun 2011 11:04:15 +0200 RT opens a few new interesting race conditions in the rtmutex/futex combo due to futex hash bucket lock being a 'sleeping' spinlock and therefore not disabling preemption.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 66/191 [ Author: Steven Rostedt Email: rostedt@goodmis.org Subject: futex: Fix bug on when a requeued RT task times out Date: Tue, 14 Jul 2015 14:26:34 +0200 Requeue with timeout causes a bug with PREEMPT_RT. The bug comes from a timed out condition. TASK 1 TASK 2 ------ ------ futex_wait_requeue_pi() futex_wait_queue_me() <timed out> double_lock_hb(); raw_spin_lock(pi_lock); if (current->pi_blocked_on) { } else { current->pi_blocked_on = PI_WAKE_INPROGRESS; raw_spin_unlock(pi_lock); spin_lock(hb->lock); <-- blocked! plist_for_each_entry_safe(this) { rt_mutex_start_proxy_lock(); task_blocks_on_rt_mutex(); BUG_ON(task->pi_blocked_on)!!!! The BUG_ON() actually has a check for PI_WAKE_INPROGRESS, but the problem is that, after TASK 1 sets PI_WAKE_INPROGRESS, it then tries to grab the hb->lock, which it fails to do. As the hb->lock is a mutex, it will block and set the "pi_blocked_on" to the hb->lock. When TASK 2 goes to requeue it, the check for PI_WAKE_INPROGRESS fails because TASK 1's pi_blocked_on is no longer set to that, but instead, set to the hb->lock. The fix: When calling rt_mutex_start_proxy_lock() a check is made to see if the proxy task's pi_blocked_on is set. If so, exit out early. Otherwise set it to a new flag PI_REQUEUE_INPROGRESS, which notifies the proxy task that it is being requeued, and will handle things appropriately. Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 67/191 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: locking/rtmutex: Make lock_killable work Date: Sat, 1 Apr 2017 12:50:59 +0200 Locking an rt mutex killable does not work because signal handling is restricted to TASK_INTERRUPTIBLE. Use signal_pending_state() unconditionally.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 68/191 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: locking/spinlock: Split the lock types header Date: Wed, 29 Jun 2011 19:34:01 +0200 Split raw_spinlock into its own file and the remaining spinlock_t into its own non-RT header. The non-RT header will be replaced later by sleeping spinlocks. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 69/191 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: locking/rtmutex: Avoid include hell Date: Wed, 29 Jun 2011 20:06:39 +0200 Include only the required raw types. This avoids pulling in the complete spinlock header which in turn requires rtmutex.h at some point. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 70/191 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: lockdep: Reduce header files in debug_locks.h Date: Fri, 14 Aug 2020 16:55:25 +0200 The inclusion of printk.h leads to circular dependency if spinlock_t is based on rt_mutex. Include only atomic.h (xchg()) and cache.h (__read_mostly). Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 71/191 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: locking: split out the rbtree definition Date: Fri, 14 Aug 2020 17:08:41 +0200 rtmutex.h needs the definition for rb_root_cached. By including kernel.h we will get to spinlock.h which requires rtmutex.h again. Split out the required struct definition and move it into its own header file which can be included by rtmutex.h Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 72/191 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: locking/rtmutex: Provide rt_mutex_slowlock_locked() Date: Thu, 12 Oct 2017 16:14:22 +0200 This is the inner-part of rt_mutex_slowlock(), required for rwsem-rt. 
Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 73/191 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: locking/rtmutex: export lockdep-less version of rt_mutex's lock, trylock and unlock Date: Thu, 12 Oct 2017 16:36:39 +0200 Required for lock implementation on top of rtmutex. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 74/191 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: sched: Add saved_state for tasks blocked on sleeping locks Date: Sat, 25 Jun 2011 09:21:04 +0200 Spinlocks are state preserving in !RT. RT changes the state when a task gets blocked on a lock. So we need to remember the state before the lock contention. If a regular wakeup (not a RTmutex related wakeup) happens, the saved_state is updated to running. When the lock sleep is done, the saved state is restored. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 75/191 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: locking/rtmutex: add sleeping lock implementation Date: Thu, 12 Oct 2017 17:11:19 +0200 Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 76/191 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: locking/rtmutex: Allow rt_mutex_trylock() on PREEMPT_RT Date: Wed, 2 Dec 2015 11:34:07 +0100 A non-PREEMPT_RT kernel can deadlock on rt_mutex_trylock() in softirq context. On PREEMPT_RT the softirq context is handled in thread context. This avoids the deadlock in the slow path and PI-boosting will be done on the correct thread.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 77/191 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: locking/rtmutex: add mutex implementation based on rtmutex Date: Thu, 12 Oct 2017 17:17:03 +0200 Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 78/191 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: locking/rtmutex: add rwsem implementation based on rtmutex Date: Thu, 12 Oct 2017 17:28:34 +0200 The RT specific R/W semaphore implementation restricts the number of readers to one because a writer cannot block on multiple readers and inherit its priority or budget. The single reader restricting is painful in various ways: - Performance bottleneck for multi-threaded applications in the page fault path (mmap sem) - Progress blocker for drivers which are carefully crafted to avoid the potential reader/writer deadlock in mainline. The analysis of the writer code paths shows, that properly written RT tasks should not take them. Syscalls like mmap(), file access which take mmap sem write locked have unbound latencies which are completely unrelated to mmap sem. Other R/W sem users like graphics drivers are not suitable for RT tasks either. So there is little risk to hurt RT tasks when the RT rwsem implementation is changed in the following way: - Allow concurrent readers - Make writers block until the last reader left the critical section. This blocking is not subject to priority/budget inheritance. - Readers blocked on a writer inherit their priority/budget in the normal way. There is a drawback with this scheme. R/W semaphores become writer unfair though the applications which have triggered writer starvation (mostly on mmap_sem) in the past are not really the typical workloads running on a RT system. So while it's unlikely to hit writer starvation, it's possible. If there are unexpected workloads on RT systems triggering it, we need to rethink the approach. 
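The reader/writer rules listed for patch 78/191 can be sketched as a single-threaded state model. This is a minimal illustration, not the kernel implementation: `model_rwsem` and the `model_*_trylock()` helpers are hypothetical names, blocking is reduced to a failed trylock, and priority inheritance is not modelled at all.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the described semantics; not a kernel API. */
struct model_rwsem {
    int  readers;   /* readers currently inside the critical section */
    bool writer;    /* true while a writer holds the lock */
};

/* Readers may share the lock as long as no writer holds it. */
static bool model_down_read_trylock(struct model_rwsem *s)
{
    if (s->writer)
        return false;
    s->readers++;
    return true;
}

static void model_up_read(struct model_rwsem *s)
{
    assert(s->readers > 0);
    s->readers--;
}

/* A writer gets in only after the last reader has left. */
static bool model_down_write_trylock(struct model_rwsem *s)
{
    if (s->writer || s->readers > 0)
        return false;
    s->writer = true;
    return true;
}

static void model_up_write(struct model_rwsem *s)
{
    assert(s->writer);
    s->writer = false;
}
```

In this model nothing stops new readers from entering while a writer is waiting, which is the writer unfairness the commit message names as the drawback of the scheme.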
Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 79/191 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: locking/rtmutex: add rwlock implementation based on rtmutex Date: Thu, 12 Oct 2017 17:18:06 +0200 The implementation is bias-based, similar to the rwsem implementation. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 80/191 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: locking/rtmutex: wire up RT's locking Date: Thu, 12 Oct 2017 17:31:14 +0200 Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 81/191 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: locking/rtmutex: add ww_mutex addon for mutex-rt Date: Thu, 12 Oct 2017 17:34:38 +0200 Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 82/191 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: locking/rtmutex: Use custom scheduling function for spin-schedule() Date: Tue, 6 Oct 2020 13:07:17 +0200 PREEMPT_RT builds the rwsem, mutex, spinlock and rwlock typed locks on top of a rtmutex lock. While blocked task->pi_blocked_on is set (tsk_is_pi_blocked()) and task needs to schedule away while waiting. The schedule process must distinguish between blocking on a regular sleeping lock (rwsem and mutex) and a RT-only sleeping lock (spinlock and rwlock): - rwsem and mutex must flush block requests (blk_schedule_flush_plug()) even if blocked on a lock. This can not deadlock because this also happens for non-RT. There should be a warning if the scheduling point is within a RCU read section. - spinlock and rwlock must not flush block requests. This will deadlock if the callback attempts to acquire a lock which is already acquired. 
Similarly to being preempted, there should be no warning if the scheduling point is within a RCU read section. Add preempt_schedule_lock() which is invoked if scheduling is required while blocking on a PREEMPT_RT-only sleeping lock. Remove tsk_is_pi_blocked() from the scheduler path which is no longer needed with the additional scheduler entry point. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 83/191 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: signal: Revert ptrace preempt magic Date: Wed, 21 Sep 2011 19:57:12 +0200 Upstream commit '53da1d9456fe7f8 fix ptrace slowness' is nothing more than a bandaid around the ptrace design trainwreck. It's not a correctness issue, it's merely a cosmetic bandaid. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 84/191 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: preempt: Provide preempt_*_(no)rt variants Date: Fri, 24 Jul 2009 12:38:56 +0200 RT needs a few preempt_disable/enable points which are not necessary otherwise. Implement variants to avoid #ifdeffery. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 85/191 [ Author: Ingo Molnar Email: mingo@elte.hu Subject: mm/vmstat: Protect per cpu variables with preempt disable on RT Date: Fri, 3 Jul 2009 08:30:13 -0500 Disable preemption on -RT for the vmstat code. On vanilla the code runs in IRQ-off regions while on -RT it does not. "preempt_disable" ensures that the same resource is not updated in parallel due to preemption. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 86/191 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: mm/memcontrol: Disable preemption in __mod_memcg_lruvec_state() Date: Wed, 28 Oct 2020 18:15:32 +0100 The callers expect disabled preemption/interrupts while invoking __mod_memcg_lruvec_state(). This works in mainline because a lock of some kind is acquired.
Use preempt_disable_rt() where per-CPU variables are accessed and a stable pointer is expected. This is also done in __mod_zone_page_state() for the same reason. Cc: stable-rt@vger.kernel.org Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 87/191 [ Author: Ahmed S. Darwish Email: a.darwish@linutronix.de Subject: xfrm: Use sequence counter with associated spinlock Date: Wed, 10 Jun 2020 12:53:22 +0200 A sequence counter write side critical section must be protected by some form of locking to serialize writers. A plain seqcount_t does not contain the information of which lock must be held when entering a write side critical section. Use the new seqcount_spinlock_t data type, which allows to associate a spinlock with the sequence counter. This enables lockdep to verify that the spinlock used for writer serialization is held when the write side critical section is entered. If lockdep is disabled this lock association is compiled out and has neither storage size nor runtime overhead. Upstream-status: The xfrm locking used for seqcount writer serialization appears to be broken. If that's the case, a proper fix will need to be submitted upstream. (e.g. make the seqcount per network namespace?) Signed-off-by: Ahmed S. Darwish <a.darwish@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 88/191 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: u64_stats: Disable preemption on 32bit-UP/SMP with RT during updates Date: Mon, 17 Aug 2020 12:28:10 +0200 On RT the seqcount_t is required even on UP because the softirq can be preempted. The IRQ handler is threaded so it is also preemptible. Disable preemption on 32bit-RT during value updates. There is no need to disable interrupts on RT because the handler is run threaded. Therefore disabling preemption is enough to guarantee that the update is not interrupted.
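Why a 64-bit statistic needs a sequence counter on 32-bit, as patch 88/191 assumes, can be sketched with a single-threaded model. This is an illustration only: `model_stats` and the `model_*` helpers are hypothetical names rather than the u64_stats API, and the memory barriers a real implementation needs are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model: a 64-bit counter stored as two 32-bit halves. */
struct model_stats {
    unsigned int seq;   /* odd while an update is in progress */
    uint32_t lo, hi;
};

/* The writer brackets the two-word update with sequence increments. */
static void model_update_begin(struct model_stats *s) { s->seq++; }
static void model_update_end(struct model_stats *s)   { s->seq++; }

/* The reader samples the sequence before reading the value. */
static unsigned int model_fetch_begin(const struct model_stats *s)
{
    return s->seq;
}

/* Retry if an update was in flight at the start or happened meanwhile. */
static bool model_fetch_retry(const struct model_stats *s, unsigned int start)
{
    return (start & 1u) || s->seq != start;
}

static uint64_t model_read_pair(const struct model_stats *s)
{
    return ((uint64_t)s->hi << 32) | s->lo;
}
```

If the writer could be preempted between `model_update_begin()` and `model_update_end()`, a reader on the same CPU would see an odd sequence and retry indefinitely, which is the situation the commit avoids by disabling preemption during the update.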
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 89/191 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: fs/dcache: use swait_queue instead of waitqueue Date: Wed, 14 Sep 2016 14:35:49 +0200 __d_lookup_done() invokes wake_up_all() while holding a hlist_bl_lock() which disables preemption. As a workaround convert it to swait. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 90/191 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: fs/dcache: disable preemption on i_dir_seq's write side Date: Fri, 20 Oct 2017 11:29:53 +0200 i_dir_seq is an open-coded seqcounter. Based on the code it looks like we could have two writers in parallel despite the fact that the d_lock is held. The problem is that during the write process on RT the preemption is still enabled and if this process is interrupted by a reader with RT priority then we lock up. To avoid that lock up I am disabling the preemption during the update. The rename of i_dir_seq is there to ensure that new write sides are caught in the future. Cc: stable-rt@vger.kernel.org Reported-by: Oleg.Karfich@wago.com Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 91/191 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: net/Qdisc: use a seqlock instead seqcount Date: Wed, 14 Sep 2016 17:36:35 +0200 The seqcount disables preemption on -RT while it is held which we can't remove. Also we don't want the reader to spin for ages if the writer is scheduled out. The seqlock on the other hand will serialize / sleep on the lock while the writer is active.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 92/191 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: net: Properly annotate the try-lock for the seqlock Date: Tue, 8 Sep 2020 16:57:11 +0200 In patch ("net/Qdisc: use a seqlock instead seqcount") the seqcount has been replaced with a seqlock to allow the reader to boost the preempted writer. The try_write_seqlock() acquired the lock with a try-lock but the seqcount annotation was "lock". Opencode write_seqcount_t_begin() and use the try-lock annotation for lockdep. Reported-by: Mike Galbraith <efault@gmx.de> Cc: stable-rt@vger.kernel.org Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 93/191 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: kconfig: Disable config options which are not RT compatible Date: Sun, 24 Jul 2011 12:11:43 +0200 Disable stuff which is known to have issues on RT Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 94/191 [ Author: Ingo Molnar Email: mingo@elte.hu Subject: mm: Allow only SLUB on RT Date: Fri, 3 Jul 2009 08:44:03 -0500 Memory allocation disables interrupts as part of the allocation and freeing process. For -RT it is important that this section remains short and doesn't depend on the size of the request or an internal state of the memory allocator. At the beginning the SLAB memory allocator was adapted for RT's needs and it required substantial changes. Later, with the addition of the SLUB memory allocator we adapted this one as well and the changes were smaller. More importantly, due to the design of the SLUB allocator it performs better and its worst case latency was smaller. In the end only SLUB remained supported. Disable SLAB and SLOB on -RT. Only SLUB is adapted to -RT needs.
Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 95/191 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: sched: Disable CONFIG_RT_GROUP_SCHED on RT Date: Mon, 18 Jul 2011 17:03:52 +0200 Carsten reported problems when running: taskset 01 chrt -f 1 sleep 1 from within rc.local on a F15 machine. The task stays running and never gets on the run queue because some of the run queues have rt_throttled=1 which does not go away. Works nicely from an ssh login shell. Disabling CONFIG_RT_GROUP_SCHED solves that as well. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 96/191 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: net/core: disable NET_RX_BUSY_POLL on RT Date: Sat, 27 May 2017 19:02:06 +0200 napi_busy_loop() disables preemption and performs a NAPI poll. We can't acquire sleeping locks with disabled preemption so we would have to work around this and add explicit locking for synchronisation against ksoftirqd. Without explicit synchronisation a low priority process would "own" the NAPI state (by setting NAPIF_STATE_SCHED) and could be scheduled out (no preempt_disable() and BH is preemptible on RT). In case a network packet arrives then the interrupt handler would set NAPIF_STATE_MISSED and the system would wait until the task owning the NAPI would be scheduled in again. Should a task with RT priority busy poll then it would consume the CPU instead of allowing tasks with lower priority to run. The NET_RX_BUSY_POLL is disabled by default (the system wide sysctls for poll/read are set to zero) so disable NET_RX_BUSY_POLL on RT to avoid wrong locking context on RT. Should this feature be considered useful on RT systems then it could be enabled again with proper locking and synchronisation.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 97/191 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: efi: Disable runtime services on RT Date: Thu, 26 Jul 2018 15:03:16 +0200 Based on measurements the EFI functions get_variable / get_next_variable take up to 2us which looks okay. The functions get_time, set_time take around 10ms. Those 10ms are too much. Even one ms would be too much. Ard mentioned that SetVariable might even trigger larger latencies if the firmware erases flash blocks on NOR. The time-functions are used by efi-rtc and can be triggered during runtime (either via explicit read/write or ntp sync). The variable write could be used by pstore. These functions can be disabled without much of a loss. The poweroff / reboot hooks may be provided by PSCI. Disable EFI's runtime wrappers. This was observed on "EFI v2.60 by SoftIron Overdrive 1000". Acked-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 98/191 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: efi: Allow efi=runtime Date: Thu, 26 Jul 2018 15:06:10 +0200 In case the command line option "efi=noruntime" is the default at build time, the user can override it with `efi=runtime' and enable the runtime services again. Acked-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 99/191 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: rt: Add local irq locks Date: Mon, 20 Jun 2011 09:03:47 +0200 Introduce locallock. For !RT this maps to preempt_disable()/ local_irq_disable() so there is not much that changes. For RT this will map to a spinlock. This makes preemption possible and the locked "resource" gets the lockdep annotation it wouldn't have otherwise. The locks are recursive for owner == current. 
Also, all locks use migrate_disable(), which ensures that the task is not migrated to another CPU while the lock is held and the owner is preempted. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 100/191 [ Author: Oleg Nesterov Email: oleg@redhat.com Subject: signal/x86: Delay calling signals in atomic Date: Tue, 14 Jul 2015 14:26:34 +0200 On x86_64 we must disable preemption before we enable interrupts for stack faults, int3 and debugging, because the current task is using a per CPU debug stack defined by the IST. If we schedule out, another task can come in and use the same stack and cause the stack to be corrupted and crash the kernel on return. When CONFIG_PREEMPT_RT is enabled, spin_locks become mutexes, and one of these is the spin lock used in signal handling. Some of the debug code (int3) causes do_trap() to send a signal. This function calls a spin lock that has been converted to a mutex and has the possibility to sleep. If this happens, the above issue with the corrupted stack is possible. Instead of calling the signal right away, for PREEMPT_RT and x86_64, the signal information is stored in the task's task_struct and TIF_NOTIFY_RESUME is set. Then on exit of the trap, the signal resume code will send the signal when preemption is enabled. [ rostedt: Switched from #ifdef CONFIG_PREEMPT_RT to ARCH_RT_DELAYS_SIGNAL_SEND and added comments to the code. 
] Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> [bigeasy: also needed on 32bit as per Yang Shi <yang.shi@linaro.org>] Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 101/191 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: kernel/sched: add {put\|get}_cpu_light() Date: Sat, 27 May 2017 19:02:06 +0200 Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 102/191 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: trace: Add migrate-disabled counter to tracing output Date: Sun, 17 Jul 2011 21:56:42 +0200 Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 103/191 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: locking: don't check for __LINUX_SPINLOCK_TYPES_H on -RT archs Date: Fri, 4 Aug 2017 17:40:42 +0200 Upstream uses arch_spinlock_t within spinlock_t and requests that the spinlock_types.h header file is included first. On -RT we have the rt_mutex with its raw_lock wait_lock which needs the architecture's spinlock_types.h header file for its definition. However we need rt_mutex first because it is used to build the spinlock_t so that check does not work for us. Therefore I am dropping that check. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 104/191 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: mm: sl[au]b: Change list_lock to raw_spinlock_t Date: Mon, 28 May 2018 15:24:22 +0200 The list_lock is used with IRQs off on PREEMPT_RT. Make it a raw_spinlock_t otherwise the interrupts won't be disabled on PREEMPT_RT. The locking rules remain unchanged. The lock is updated for SLAB and SLUB since both share the same header file for the struct kmem_cache_node definition. 
Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 105/191 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: mm: slub: Make object_map_lock a raw_spinlock_t Date: Thu, 16 Jul 2020 18:47:50 +0200 The variable object_map is protected by object_map_lock. The lock is always acquired in debug code and within an already atomic context. Make object_map_lock a raw_spinlock_t. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 106/191 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: mm: slub: Enable irqs for __GFP_WAIT Date: Wed, 9 Jan 2013 12:08:15 +0100 SYSTEM_RUNNING might be too late for enabling interrupts. Allocations with GFP_WAIT can happen before that. So use this as an indicator. [bigeasy: Add warning on RT for allocations in atomic context. Don't enable interrupts on allocations during SYSTEM_SUSPEND. This is done during suspend by ACPI, noticed by Liwei Song <liwei.song@windriver.com> ] Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 107/191 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: mm: slub: Move discard_slab() invocations out of IRQ-off sections Date: Fri, 26 Feb 2021 15:14:15 +0100 discard_slab() gives the memory back to the page-allocator. Some of its invocations occur from IRQ-disabled sections which were disabled by SLUB. An example is the deactivate_slab() invocation from within ___slab_alloc() or put_cpu_partial(). Instead of giving the memory back directly, put the pages on a list and process it once the caller is out of the known IRQ-off region. 
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 108/191 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: mm: slub: Move flush_cpu_slab() invocations __free_slab() invocations out of IRQ context Date: Fri, 26 Feb 2021 17:11:55 +0100 flush_all() flushes a specific SLAB cache on each CPU (where the cache is present). The discard_delayed()/__free_slab() invocation happens within the IPI handler and is problematic for PREEMPT_RT. The flush operation is not a frequent operation or a hot path. The per-CPU flush operation can be moved to within a workqueue. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 109/191 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: mm: slub: Don't resize the location tracking cache on PREEMPT_RT Date: Fri, 26 Feb 2021 17:26:04 +0100 The location tracking cache has a size of a page and is resized if its current size is too small. This allocation happens with disabled interrupts and can't happen on PREEMPT_RT. Should one page be too small, then we have to allocate more at the beginning. The only downside is that fewer callers will be visible. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 110/191 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: mm: page_alloc: Use migrate_disable() in drain_local_pages_wq() Date: Thu, 2 Jul 2020 14:27:23 +0200 drain_local_pages_wq() disables preemption to avoid CPU migration during CPU hotplug and can't use cpus_read_lock(). Using migrate_disable() works here, too. The scheduler won't take the CPU offline until the task has left the migrate-disable section. Use migrate_disable() in drain_local_pages_wq(). Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 111/191 [ Author: Ingo Molnar Email: mingo@elte.hu Subject: mm: page_alloc: Use a local_lock instead of explicit local_irq_save(). 
Date: Fri, 3 Jul 2009 08:29:37 -0500 The page-allocator disables interrupts for a few reasons: - Decouple the irqsave operation from spin_lock() so it can be extended over the actual lock region and cover other areas, such as counter increments, where the preemptible version can be avoided. - Access to the per-CPU pcp from struct zone. Replace the irqsave with a local-lock. The counters are expected to be always modified with disabled preemption and no access from interrupt context. Contains fixes from: Peter Zijlstra <a.p.zijlstra@chello.nl> Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 112/191 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: mm: slub: Don't enable partial CPU caches on PREEMPT_RT by default Date: Tue, 2 Mar 2021 18:58:04 +0100 SLUB's partial CPU caches lead to higher latencies in a hackbench benchmark. Don't enable partial CPU caches by default on PREEMPT_RT. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 113/191 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: mm: memcontrol: Provide a local_lock for per-CPU memcg_stock Date: Tue, 18 Aug 2020 10:30:00 +0200 The interrupts are disabled to ensure CPU-local access to the per-CPU variable `memcg_stock'. As the code inside the interrupt disabled section acquires regular spinlocks, which are converted to 'sleeping' spinlocks on a PREEMPT_RT kernel, this conflicts with the RT semantics. Convert it to a local_lock which allows RT kernels to substitute them with a real per CPU lock. On non RT kernels this maps to local_irq_save() as before, but also provides lockdep coverage of the critical region. No functional change. 
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 114/191 [ Author: Yang Shi Email: yang.shi@windriver.com Subject: mm/memcontrol: Don't call schedule_work_on in preemption disabled context Date: Wed, 30 Oct 2013 11:48:33 -0700 The following trace is triggered when running ltp oom test cases: BUG: sleeping function called from invalid context at kernel/rtmutex.c:659 in_atomic(): 1, irqs_disabled(): 0, pid: 17188, name: oom03 Preemption disabled at:[<ffffffff8112ba70>] mem_cgroup_reclaim+0x90/0xe0 CPU: 2 PID: 17188 Comm: oom03 Not tainted 3.10.10-rt3 #2 Hardware name: Intel Corporation Calpella platform/MATXM-CORE-411-B, BIOS 4.6.3 08/18/2010 ffff88007684d730 ffff880070df9b58 ffffffff8169918d ffff880070df9b70 ffffffff8106db31 ffff88007688b4a0 ffff880070df9b88 ffffffff8169d9c0 ffff88007688b4a0 ffff880070df9bc8 ffffffff81059da1 0000000170df9bb0 Call Trace: [<ffffffff8169918d>] dump_stack+0x19/0x1b [<ffffffff8106db31>] __might_sleep+0xf1/0x170 [<ffffffff8169d9c0>] rt_spin_lock+0x20/0x50 [<ffffffff81059da1>] queue_work_on+0x61/0x100 [<ffffffff8112b361>] drain_all_stock+0xe1/0x1c0 [<ffffffff8112ba70>] mem_cgroup_reclaim+0x90/0xe0 [<ffffffff8112beda>] __mem_cgroup_try_charge+0x41a/0xc40 [<ffffffff810f1c91>] ? release_pages+0x1b1/0x1f0 [<ffffffff8106f200>] ? sched_exec+0x40/0xb0 [<ffffffff8112cc87>] mem_cgroup_charge_common+0x37/0x70 [<ffffffff8112e2c6>] mem_cgroup_newpage_charge+0x26/0x30 [<ffffffff8110af68>] handle_pte_fault+0x618/0x840 [<ffffffff8103ecf6>] ? unpin_current_cpu+0x16/0x70 [<ffffffff81070f94>] ? migrate_enable+0xd4/0x200 [<ffffffff8110cde5>] handle_mm_fault+0x145/0x1e0 [<ffffffff810301e1>] __do_page_fault+0x1a1/0x4c0 [<ffffffff8169c9eb>] ? preempt_schedule_irq+0x4b/0x70 [<ffffffff8169e3b7>] ? retint_kernel+0x37/0x40 [<ffffffff8103053e>] do_page_fault+0xe/0x10 [<ffffffff8169e4c2>] page_fault+0x22/0x30 So, to prevent schedule_work_on from being called in preempt disabled context, replace the pair of get/put_cpu() to get/put_cpu_light(). 
Signed-off-by: Yang Shi <yang.shi@windriver.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 115/191 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: mm/memcontrol: Replace local_irq_disable with local locks Date: Wed, 28 Jan 2015 17:14:16 +0100 There are a few local_irq_disable() which then take sleeping locks. This patch converts them to local locks. [bigeasy: Move unlock after memcg_check_events() in mem_cgroup_swapout(), pointed out by Matt Fleming <matt@codeblueprint.co.uk>] Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 116/191 [ Author: Mike Galbraith Email: umgwanakikbuti@gmail.com Subject: mm/zsmalloc: copy with get_cpu_var() and locking Date: Tue, 22 Mar 2016 11:16:09 +0100 get_cpu_var() disables preemption and triggers a might_sleep() splat later. This is replaced with get_locked_var(). The bit spinlocks are replaced with a proper mutex which requires a slightly larger struct to allocate. Signed-off-by: Mike Galbraith <umgwanakikbuti@gmail.com> [bigeasy: replace the bitspin_lock() with a mutex, get_locked_var(). Mike then fixed the size magic] Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 117/191 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: x86: kvm Require const tsc for RT Date: Sun, 6 Nov 2011 12:26:18 +0100 Non constant TSC is a nightmare on bare metal already, but with virtualization it becomes a complete disaster because the workarounds are horrible latency wise. That's also a prerequisite for running RT in a guest on top of a RT host. 
Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 118/191 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: wait.h: include atomic.h Date: Mon, 28 Oct 2013 12:19:57 +0100 \| CC init/main.o \|In file included from include/linux/mmzone.h:9:0, \| from include/linux/gfp.h:4, \| from include/linux/kmod.h:22, \| from include/linux/module.h:13, \| from init/main.c:15: \|include/linux/wait.h: In function ‘wait_on_atomic_t’: \|include/linux/wait.h:982:2: error: implicit declaration of function ‘atomic_read’ [-Werror=implicit-function-declaration] \| if (atomic_read(val) == 0) \| ^ This pops up on ARM. Non-RT gets its atomic.h include from spinlock.h Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 119/191 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: sched: Limit the number of task migrations per batch Date: Mon, 6 Jun 2011 12:12:51 +0200 Put an upper limit on the number of tasks which are migrated per batch to avoid large latencies. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 120/191 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: sched: Move mmdrop to RCU on RT Date: Mon, 6 Jun 2011 12:20:33 +0200 Takes sleeping locks and calls into the memory allocator, so nothing we want to do in task switch and other atomic contexts. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 121/191 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: kernel/sched: move stack + kprobe clean up to __put_task_struct() Date: Mon, 21 Nov 2016 19:31:08 +0100 There is no need to free the stack before the task struct (except for reasons mentioned in commit 68f24b08ee89 ("sched/core: Free the stack early if CONFIG_THREAD_INFO_IN_TASK")). This also comes in handy on -RT because we can't free memory in a preempt-disabled region. vfree_atomic() delays the memory cleanup to a worker. Since we move everything to the RCU callback, we can also free it immediately. 
Cc: stable-rt@vger.kernel.org #for kprobe_flush_task() Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 122/191 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: sched: Do not account rcu_preempt_depth on RT in might_sleep() Date: Tue, 7 Jun 2011 09:19:06 +0200 RT changes the rcu_preempt_depth semantics, so we cannot check for it in might_sleep(). Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 123/191 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: sched: Disable TTWU_QUEUE on RT Date: Tue, 13 Sep 2011 16:42:35 +0200 The queued remote wakeup mechanism can introduce rather large latencies if the number of migrated tasks is high. Disable it for RT. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 124/191 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: softirq: Check preemption after reenabling interrupts Date: Sun, 13 Nov 2011 17:17:09 +0100 raise_softirq_irqoff() disables interrupts and wakes the softirq daemon, but after reenabling interrupts there is no preemption check, so the execution of the softirq thread might be delayed arbitrarily. In principle we could add that check to local_irq_enable/restore, but that's overkill as the raise_softirq_irqoff() sections are the only ones which show this behaviour. Reported-by: Carsten Emde <cbe@osadl.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 125/191 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: softirq: Disable softirq stacks for RT Date: Mon, 18 Jul 2011 13:59:17 +0200 Disable extra stacks for softirqs. We want to preempt softirqs and having them on a special IRQ stack does not make this easier. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 126/191 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: net/core: use local_bh_disable() in netif_rx_ni() Date: Fri, 16 Jun 2017 19:03:16 +0200 In 2004 netif_rx_ni() gained a preempt_disable() section around netif_rx() and its do_softirq() + testing for it. 
The do_softirq() part is required because netif_rx() raises the softirq but does not invoke it. The preempt_disable() is required to remain on the same CPU which added the skb to the per-CPU list. All this can be avoided by putting this into a local_bh_disable()ed section. The local_bh_enable() part will invoke do_softirq() if required. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 127/191 [ Author: Grygorii Strashko Email: Grygorii.Strashko@linaro.org Subject: pid.h: include atomic.h Date: Tue, 21 Jul 2015 19:43:56 +0300 This patch fixes the build error: CC kernel/pid_namespace.o In file included from kernel/pid_namespace.c:11:0: include/linux/pid.h: In function 'get_pid': include/linux/pid.h:78:3: error: implicit declaration of function 'atomic_inc' [-Werror=implicit-function-declaration] atomic_inc(&pid->count); ^ which happens when CONFIG_PROVE_LOCKING=n CONFIG_DEBUG_SPINLOCK=n CONFIG_DEBUG_MUTEXES=n CONFIG_DEBUG_LOCK_ALLOC=n CONFIG_PID_NS=y Vanilla gets this via spinlock.h. Signed-off-by: Grygorii Strashko <Grygorii.Strashko@linaro.org> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 128/191 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: ptrace: fix ptrace vs tasklist_lock race Date: Thu, 29 Aug 2013 18:21:04 +0200 As explained by Alexander Fyodorov <halcy@yandex.ru>: \|read_lock(&tasklist_lock) in ptrace_stop() is converted to mutex on RT kernel, \|and it can remove __TASK_TRACED from task->state (by moving it to \|task->saved_state). 
If parent does wait() on child followed by a sys_ptrace \|call, the following race can happen: \| \|- child sets __TASK_TRACED in ptrace_stop() \|- parent does wait() which eventually calls wait_task_stopped() and returns \| child's pid \|- child blocks on read_lock(&tasklist_lock) in ptrace_stop() and moves \| __TASK_TRACED flag to saved_state \|- parent calls sys_ptrace, which calls ptrace_check_attach() and wait_task_inactive() The patch is based on his initial patch where an additional check is added in case the __TASK_TRACED moved to ->saved_state. The pi_lock is taken in case the caller is interrupted between looking into ->state and ->saved_state. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 129/191 [ Author: Oleg Nesterov Email: oleg@redhat.com Subject: ptrace: fix ptrace_unfreeze_traced() race with rt-lock Date: Tue, 3 Nov 2020 12:39:01 +0100 The patch "ptrace: fix ptrace vs tasklist_lock race" changed ptrace_freeze_traced() to take task->saved_state into account, but ptrace_unfreeze_traced() has the same problem and needs a similar fix: it should check/update both ->state and ->saved_state. Reported-by: Luis Claudio R. Goncalves <lgoncalv@redhat.com> Fixes: "ptrace: fix ptrace vs tasklist_lock race" Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: stable-rt@vger.kernel.org ] 130/191 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: rcu: Delay RCU-selftests Date: Wed, 10 Mar 2021 15:09:02 +0100 Delay RCU-selftests until ksoftirqd is up and running. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 131/191 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: locking: Make spinlock_t and rwlock_t a RCU section on RT Date: Tue, 19 Nov 2019 09:25:04 +0100 On !RT a locked spinlock_t and rwlock_t disables preemption which implies a RCU read section. There is code that relies on that behaviour. 
Add an explicit RCU read section on RT while a sleeping lock (a lock which would disable preemption on !RT) is acquired. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 132/191 [ Author: Scott Wood Email: swood@redhat.com Subject: rcutorture: Avoid problematic critical section nesting on RT Date: Wed, 11 Sep 2019 17:57:29 +0100 rcutorture was generating some nesting scenarios that are not reasonable. Constrain the state selection to avoid them. Example #1: 1. preempt_disable() 2. local_bh_disable() 3. preempt_enable() 4. local_bh_enable() On PREEMPT_RT, BH disabling takes a local lock only when called in non-atomic context. Thus, atomic context must be retained until after BH is re-enabled. Likewise, if BH is initially disabled in non-atomic context, it cannot be re-enabled in atomic context. Example #2: 1. rcu_read_lock() 2. local_irq_disable() 3. rcu_read_unlock() 4. local_irq_enable() If the thread is preempted between steps 1 and 2, rcu_read_unlock_special.b.blocked will be set, but it won't be acted on in step 3 because IRQs are disabled. Thus, reporting of the quiescent state will be delayed beyond the local_irq_enable(). For now, these scenarios will continue to be tested on non-PREEMPT_RT kernels, until debug checks are added to ensure that they are not happening elsewhere. Signed-off-by: Scott Wood <swood@redhat.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 133/191 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: mm/vmalloc: Another preempt disable region which sucks Date: Tue, 12 Jul 2011 11:39:36 +0200 Avoid the preempt disable version of get_cpu_var(). The inner-lock should provide enough serialisation. 
Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 134/191 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: block/mq: do not invoke preempt_disable() Date: Tue, 14 Jul 2015 14:26:34 +0200 preempt_disable() and get_cpu() don't play well together with the sleeping locks it tries to allocate later. It seems to be enough to replace it with get_cpu_light() and migrate_disable(). Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 135/191 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: md: raid5: Make raid5_percpu handling RT aware Date: Tue, 6 Apr 2010 16:51:31 +0200 __raid_run_ops() disables preemption with get_cpu() around the access to the raid5_percpu variables. That causes scheduling while atomic spews on RT. Serialize the access to the percpu data with a lock and keep the code preemptible. Reported-by: Udo van den Heuvel <udovdh@xs4all.nl> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Udo van den Heuvel <udovdh@xs4all.nl> ] 136/191 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: scsi/fcoe: Make RT aware. Date: Sat, 12 Nov 2011 14:00:48 +0100 Do not disable preemption while taking sleeping locks. All users look safe with migrate_disable() only. 
Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 137/191 [ Author: Mike Galbraith Email: umgwanakikbuti@gmail.com Subject: sunrpc: Make svc_xprt_do_enqueue() use get_cpu_light() Date: Wed, 18 Feb 2015 16:05:28 +0100 \|BUG: sleeping function called from invalid context at kernel/locking/rtmutex.c:915 \|in_atomic(): 1, irqs_disabled(): 0, pid: 3194, name: rpc.nfsd \|Preemption disabled at:[<ffffffffa06bf0bb>] svc_xprt_received+0x4b/0xc0 [sunrpc] \|CPU: 6 PID: 3194 Comm: rpc.nfsd Not tainted 3.18.7-rt1 #9 \|Hardware name: MEDION MS-7848/MS-7848, BIOS M7848W08.404 11/06/2014 \| ffff880409630000 ffff8800d9a33c78 ffffffff815bdeb5 0000000000000002 \| 0000000000000000 ffff8800d9a33c98 ffffffff81073c86 ffff880408dd6008 \| ffff880408dd6000 ffff8800d9a33cb8 ffffffff815c3d84 ffff88040b3ac000 \|Call Trace: \| [<ffffffff815bdeb5>] dump_stack+0x4f/0x9e \| [<ffffffff81073c86>] __might_sleep+0xe6/0x150 \| [<ffffffff815c3d84>] rt_spin_lock+0x24/0x50 \| [<ffffffffa06beec0>] svc_xprt_do_enqueue+0x80/0x230 [sunrpc] \| [<ffffffffa06bf0bb>] svc_xprt_received+0x4b/0xc0 [sunrpc] \| [<ffffffffa06c03ed>] svc_add_new_perm_xprt+0x6d/0x80 [sunrpc] \| [<ffffffffa06b2693>] svc_addsock+0x143/0x200 [sunrpc] \| [<ffffffffa072e69c>] write_ports+0x28c/0x340 [nfsd] \| [<ffffffffa072d2ac>] nfsctl_transaction_write+0x4c/0x80 [nfsd] \| [<ffffffff8117ee83>] vfs_write+0xb3/0x1d0 \| [<ffffffff8117f889>] SyS_write+0x49/0xb0 \| [<ffffffff815c4556>] system_call_fastpath+0x16/0x1b Signed-off-by: Mike Galbraith <umgwanakikbuti@gmail.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 138/191 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: rt: Introduce cpu_chill() Date: Wed, 7 Mar 2012 20:51:03 +0100 Retry loops on RT might loop forever when the modifying side was preempted. Add cpu_chill() to replace cpu_relax(). cpu_chill() defaults to cpu_relax() for non RT. On RT it puts the looping task to sleep for a tick so the preempted task can make progress. 
Steven Rostedt changed it to use a hrtimer instead of msleep(): \| \|Ulrich Obergfell pointed out that cpu_chill() calls msleep() which is woken \|up by the ksoftirqd running the TIMER softirq. But as the cpu_chill() is \|called from softirq context, it may block the ksoftirqd() from running, in \|which case, it may never wake up the msleep() causing the deadlock. + bigeasy later changed to schedule_hrtimeout() \|If a task calls cpu_chill() and gets woken up by a regular or spurious \|wakeup and has a signal pending, then it exits the sleep loop in \|do_nanosleep() and sets up the restart block. If restart->nanosleep.type is \|not TI_NONE then this results in accessing a stale user pointer from a \|previously interrupted syscall and a copy to user based on the stale \|pointer or a BUG() when 'type' is not supported in nanosleep_copyout(). + bigeasy: add PF_NOFREEZE: \| [....] Waiting for /dev to be fully populated... \| ===================================== \| [ BUG: udevd/229 still has locks held! 
] \| 3.12.11-rt17 #23 Not tainted \| ------------------------------------- \| 1 lock held by udevd/229: \| #0: (&type->i_mutex_dir_key#2){+.+.+.}, at: lookup_slow+0x28/0x98 \| \| stack backtrace: \| CPU: 0 PID: 229 Comm: udevd Not tainted 3.12.11-rt17 #23 \| (unwind_backtrace+0x0/0xf8) from (show_stack+0x10/0x14) \| (show_stack+0x10/0x14) from (dump_stack+0x74/0xbc) \| (dump_stack+0x74/0xbc) from (do_nanosleep+0x120/0x160) \| (do_nanosleep+0x120/0x160) from (hrtimer_nanosleep+0x90/0x110) \| (hrtimer_nanosleep+0x90/0x110) from (cpu_chill+0x30/0x38) \| (cpu_chill+0x30/0x38) from (dentry_kill+0x158/0x1ec) \| (dentry_kill+0x158/0x1ec) from (dput+0x74/0x15c) \| (dput+0x74/0x15c) from (lookup_real+0x4c/0x50) \| (lookup_real+0x4c/0x50) from (__lookup_hash+0x34/0x44) \| (__lookup_hash+0x34/0x44) from (lookup_slow+0x38/0x98) \| (lookup_slow+0x38/0x98) from (path_lookupat+0x208/0x7fc) \| (path_lookupat+0x208/0x7fc) from (filename_lookup+0x20/0x60) \| (filename_lookup+0x20/0x60) from (user_path_at_empty+0x50/0x7c) \| (user_path_at_empty+0x50/0x7c) from (user_path_at+0x14/0x1c) \| (user_path_at+0x14/0x1c) from (vfs_fstatat+0x48/0x94) \| (vfs_fstatat+0x48/0x94) from (SyS_stat64+0x14/0x30) \| (SyS_stat64+0x14/0x30) from (ret_fast_syscall+0x0/0x48) Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 139/191 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: fs: namespace: Use cpu_chill() in trylock loops Date: Wed, 7 Mar 2012 21:00:34 +0100 Retry loops on RT might loop forever when the modifying side was preempted. Use cpu_chill() instead of cpu_relax() to let the system make progress. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 140/191 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: debugobjects: Make RT aware Date: Sun, 17 Jul 2011 21:41:35 +0200 Avoid filling the pool / allocating memory with irqs off(). 
Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 141/191 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: net: Use skbufhead with raw lock Date: Tue, 12 Jul 2011 15:38:34 +0200 Use the rps lock as rawlock so we can keep irq-off regions. It looks low latency. However we can't kfree() from this context therefore we defer this to the softirq and use the tofree_queue list for it (similar to process_queue). Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 142/191 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: net: Dequeue in dev_cpu_dead() without the lock Date: Wed, 16 Sep 2020 16:15:39 +0200 Upstream uses skb_dequeue() to acquire the lock of `input_pkt_queue'. The reason is to synchronize against a remote CPU which still thinks that the CPU is online and enqueues packets to this CPU. There is no guarantee that the packet is enqueued before the callback is run; it is just a hope. RT however complains about an uninitialized lock because it uses another lock for `input_pkt_queue' due to the IRQ-off nature of the context. Use the unlocked dequeue version for `input_pkt_queue'. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 143/191 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: net: dev: always take qdisc's busylock in __dev_xmit_skb() Date: Wed, 30 Mar 2016 13:36:29 +0200 The root-lock is dropped before dev_hard_start_xmit() is invoked and after setting the __QDISC___STATE_RUNNING bit. If this task is now pushed away by a task with a higher priority then the task with the higher priority won't be able to submit packets to the NIC directly; instead they will be enqueued into the Qdisc. The NIC will remain idle until the task(s) with higher priority leave the CPU and the task with lower priority gets back and finishes the job. If we always take the busylock we ensure that the RT task can boost the low-prio task and submit the packet. 
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 144/191 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: irqwork: push most work into softirq context Date: Tue, 23 Jun 2015 15:32:51 +0200 Initially we deferred all irqwork into softirq because we didn't want the latency spikes if perf or another user was busy and delayed the RT task. The NOHZ trigger (nohz_full_kick_work) was the first user that did not work as expected if it did not run in the original irqwork context so we had to bring it back somehow for it. push_irq_work_func is the second one that requires this. This patch adds the IRQ_WORK_HARD_IRQ which makes sure the callback runs in raw-irq context. Everything else is deferred into softirq context. Without -RT we have the original behavior. This patch incorporates tglx's original work, reworked a little to bring back arch_irq_work_raise() if possible, and a few fixes from Steven Rostedt and Mike Galbraith. [bigeasy: melt tglx's irq_work_tick_soft() which splits irq_work_tick() into a hard and soft variant] Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 145/191 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: crypto: limit more FPU-enabled sections Date: Thu, 30 Nov 2017 13:40:10 +0100 Those crypto drivers use SSE/AVX/… for their crypto work and in order to do so in kernel they need to enable the "FPU" in kernel mode which disables preemption. There are two problems with the way they are used: - the while loop which processes X bytes may create latency spikes and should be avoided or limited. - the cipher-walk-next part may allocate/free memory and may use kmap_atomic(). The whole kernel_fpu_begin()/end() processing probably isn't that cheap. It most likely makes sense to process as much of those as possible in one go. The new _fpu_sched_rt() schedules only if a RT task is pending. 
Probably we should measure the performance of those ciphers in pure SW mode and with these optimisations to see if it makes sense to keep them for RT. This kernel_fpu_resched() makes the code more preemptible which might hurt performance. Cc: stable-rt@vger.kernel.org Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 146/191 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: crypto: cryptd - add a lock instead preempt_disable/local_bh_disable Date: Thu, 26 Jul 2018 18:52:00 +0200 cryptd has a per-CPU lock which is protected with local_bh_disable() and preempt_disable(). Add an explicit spin_lock to make the locking context more obvious and visible to lockdep. Since it is a per-CPU lock, there should be no lock contention on the actual spinlock. There is a small race-window where we could be migrated to another CPU after the cpu_queue has been obtained. This is not a problem because the actual resource is protected by the spinlock. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 147/191 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: panic: skip get_random_bytes for RT_FULL in init_oops_id Date: Tue, 14 Jul 2015 14:26:34 +0200 Disable on -RT. If this is invoked from irq-context we will have problems acquiring the sleeping lock. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 148/191 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: x86: stackprotector: Avoid random pool on rt Date: Thu, 16 Dec 2010 14:25:18 +0100 CPU bringup calls into the random pool to initialize the stack canary. During boot that works nicely even on RT as the might sleep checks are disabled. During CPU hotplug the might sleep checks trigger. Making the locks in random raw is a major PITA, so avoiding the call on RT is the only sensible solution. This is basically the same randomness which we get during boot where the random pool has no entropy and we rely on the TSC randomness. 
Reported-by: Carsten Emde <carsten.emde@osadl.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 149/191 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: random: Make it work on rt Date: Tue, 21 Aug 2012 20:38:50 +0200 Delegate the random insertion to the forced threaded interrupt handler. Store the return IP of the hard interrupt handler in the irq descriptor and feed it into the random generator as a source of entropy. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 150/191 [ Author: Priyanka Jain Email: Priyanka.Jain@freescale.com Subject: net: Remove preemption disabling in netif_rx() Date: Thu, 17 May 2012 09:35:11 +0530 1)enqueue_to_backlog() (called from netif_rx) should be bound to a particular CPU. This can be achieved by disabling migration. No need to disable preemption. 2)Fixes crash "BUG: scheduling while atomic: ksoftirqd" in case of RT. If preemption is disabled, enqueue_to_backlog() is called in atomic context. And if backlog exceeds its count, kfree_skb() is called. But in RT, kfree_skb() might get scheduled out, so it expects non atomic context. -Replace preempt_enable(), preempt_disable() with migrate_enable(), migrate_disable() respectively -Replace get_cpu(), put_cpu() with get_cpu_light(), put_cpu_light() respectively Signed-off-by: Priyanka Jain <Priyanka.Jain@freescale.com> Acked-by: Rajan Srivastava <Rajan.Srivastava@freescale.com> Cc: <rostedt@goodmis.orgn> Link: http://lkml.kernel.org/r/1337227511-2271-1-git-send-email-Priyanka.Jain@freescale.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de> [bigeasy: Remove assumption about migrate_disable() from the description.] Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 151/191 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: lockdep: Make it RT aware Date: Sun, 17 Jul 2011 18:51:23 +0200 Teach lockdep that we don't really do softirqs on -RT. 
Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 152/191 [ Author: Yong Zhang Email: yong.zhang@windriver.com Subject: lockdep: selftest: Only do hardirq context test for raw spinlock Date: Mon, 16 Apr 2012 15:01:56 +0800 On -rt there is no softirq context any more and rwlock is sleepable, disable softirq context test and rwlock+irq test. Signed-off-by: Yong Zhang <yong.zhang0@gmail.com> Cc: Yong Zhang <yong.zhang@windriver.com> Link: http://lkml.kernel.org/r/1334559716-18447-3-git-send-email-yong.zhang0@gmail.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 153/191 [ Author: Josh Cartwright Email: josh.cartwright@ni.com Subject: lockdep: selftest: fix warnings due to missing PREEMPT_RT conditionals Date: Wed, 28 Jan 2015 13:08:45 -0600 "lockdep: Selftest: Only do hardirq context test for raw spinlock" disabled the execution of certain tests with PREEMPT_RT, but did not prevent the tests from still being defined. This leads to warnings like: ./linux/lib/locking-selftest.c:574:1: warning: 'irqsafe1_hard_rlock_12' defined but not used [-Wunused-function] ./linux/lib/locking-selftest.c:574:1: warning: 'irqsafe1_hard_rlock_21' defined but not used [-Wunused-function] ./linux/lib/locking-selftest.c:577:1: warning: 'irqsafe1_hard_wlock_12' defined but not used [-Wunused-function] ./linux/lib/locking-selftest.c:577:1: warning: 'irqsafe1_hard_wlock_21' defined but not used [-Wunused-function] ./linux/lib/locking-selftest.c:580:1: warning: 'irqsafe1_soft_spin_12' defined but not used [-Wunused-function] ... Fixed by wrapping the test definitions in #ifndef CONFIG_PREEMPT_RT conditionals. 
Signed-off-by: Josh Cartwright <josh.cartwright@ni.com> Signed-off-by: Xander Huff <xander.huff@ni.com> Acked-by: Gratian Crisan <gratian.crisan@ni.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 154/191 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: lockdep: disable self-test Date: Tue, 17 Oct 2017 16:36:18 +0200 The self-test wasn't always 100% accurate for RT. We disabled a few tests which failed because they had a different semantic for RT. Some still reported false positives. Now the selftest locks up the system during boot and it needs to be investigated… Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 155/191 [ Author: Mike Galbraith Email: umgwanakikbuti@gmail.com Subject: drm,radeon,i915: Use preempt_disable/enable_rt() where recommended Date: Sat, 27 Feb 2016 08:09:11 +0100 DRM folks identified the spots, so use them. Signed-off-by: Mike Galbraith <umgwanakikbuti@gmail.com> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: linux-rt-users <linux-rt-users@vger.kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 156/191 [ Author: Mike Galbraith Email: umgwanakikbuti@gmail.com Subject: drm/i915: Don't disable interrupts on PREEMPT_RT during atomic updates Date: Sat, 27 Feb 2016 09:01:42 +0100 Commit 8d7849db3eab7 ("drm/i915: Make sprite updates atomic") started disabling interrupts across atomic updates. This breaks on PREEMPT_RT because within this section the code attempts to acquire spinlock_t locks which are sleeping locks on PREEMPT_RT. According to the comment the interrupts are disabled to avoid random delays and not required for protection or synchronisation. Don't disable interrupts on PREEMPT_RT during atomic updates. 
[bigeasy: drop local locks, commit message] Signed-off-by: Mike Galbraith <umgwanakikbuti@gmail.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 157/191 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: drm/i915: disable tracing on -RT Date: Thu, 6 Dec 2018 09:52:20 +0100 Luca Abeni reported this: | BUG: scheduling while atomic: kworker/u8:2/15203/0x00000003 | CPU: 1 PID: 15203 Comm: kworker/u8:2 Not tainted 4.19.1-rt3 #10 | Call Trace: | rt_spin_lock+0x3f/0x50 | gen6_read32+0x45/0x1d0 [i915] | g4x_get_vblank_counter+0x36/0x40 [i915] | trace_event_raw_event_i915_pipe_update_start+0x7d/0xf0 [i915] The tracing events, trace_i915_pipe_update_start() among others, use functions that acquire spin locks. A few trace points use intel_get_crtc_scanline(), others use ->get_vblank_counter() which also might acquire a sleeping lock. Based on this I don't see any other way than to disable trace points on RT. Cc: stable-rt@vger.kernel.org Reported-by: Luca Abeni <lucabe72@gmail.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 158/191 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: drm/i915: skip DRM_I915_LOW_LEVEL_TRACEPOINTS with NOTRACE Date: Wed, 19 Dec 2018 10:47:02 +0100 The order of the header files is important. If this header file is included after tracepoint.h was included then the NOTRACE here becomes a nop. Currently this happens for two .c files which use the tracepoints behind DRM_I915_LOW_LEVEL_TRACEPOINTS. 
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 159/191 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: drm/i915/gt: Only disable interrupts for the timeline lock on !force-threaded Date: Tue, 7 Jul 2020 12:25:11 +0200 According to commit d67739268cf0e ("drm/i915/gt: Mark up the nested engine-pm timeline lock as irqsafe") the interrupts are disabled because the code may be called from an interrupt handler and from preemptible context. With `force_irqthreads' set the timeline mutex is never observed in IRQ context so it is not needed to disable interrupts. Only disable interrupts if not in `force_irqthreads' mode. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 160/191 [ Author: Mike Galbraith Email: efault@gmx.de Subject: cpuset: Convert callback_lock to raw_spinlock_t Date: Sun, 8 Jan 2017 09:32:25 +0100 The two commits below add up to a cpuset might_sleep() splat for RT: 8447a0fee974 cpuset: convert callback_mutex to a spinlock 344736f29b35 cpuset: simplify cpuset_node_allowed API BUG: sleeping function called from invalid context at kernel/locking/rtmutex.c:995 in_atomic(): 0, irqs_disabled(): 1, pid: 11718, name: cset CPU: 135 PID: 11718 Comm: cset Tainted: G E 4.10.0-rt1-rt #4 Hardware name: Intel Corporation BRICKLAND/BRICKLAND, BIOS BRHSXSD1.86B.0056.R01.1409242327 09/24/2014 Call Trace: ? dump_stack+0x5c/0x81 ? ___might_sleep+0xf4/0x170 ? rt_spin_lock+0x1c/0x50 ? __cpuset_node_allowed+0x66/0xc0 ? ___slab_alloc+0x390/0x570 <disables IRQs> ? anon_vma_fork+0x8f/0x140 ? copy_page_range+0x6cf/0xb00 ? anon_vma_fork+0x8f/0x140 ? __slab_alloc.isra.74+0x5a/0x81 ? anon_vma_fork+0x8f/0x140 ? kmem_cache_alloc+0x1b5/0x1f0 ? anon_vma_fork+0x8f/0x140 ? copy_process.part.35+0x1670/0x1ee0 ? _do_fork+0xdd/0x3f0 ? _do_fork+0xdd/0x3f0 ? do_syscall_64+0x61/0x170 ? 
entry_SYSCALL64_slow_path+0x25/0x25 The latter ensured that a NUMA box WILL take callback_lock in atomic context by removing the allocator and reclaim path __GFP_HARDWALL usage which prevented such contexts from taking callback_mutex. One option would be to reinstate __GFP_HARDWALL protections for RT, however, as the 8447a0fee974 changelog states: The callback_mutex is only used to synchronize reads/updates of cpusets' flags and cpu/node masks. These operations should always proceed fast so there's no reason why we can't use a spinlock instead of the mutex. Cc: stable-rt@vger.kernel.org Signed-off-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 161/191 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: x86: Allow to enable RT Date: Wed, 7 Aug 2019 18:15:38 +0200 Allow to select RT. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 162/191 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: mm/scatterlist: Do not disable irqs on RT Date: Fri, 3 Jul 2009 08:44:34 -0500 For -RT it is enough to keep pagefault disabled (which is currently handled by kmap_atomic()). Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 163/191 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: sched: Add support for lazy preemption Date: Fri, 26 Oct 2012 18:50:54 +0100 It has become an obsession to mitigate the determinism vs. throughput loss of RT. Looking at the mainline semantics of preemption points gives a hint why RT sucks throughput wise for ordinary SCHED_OTHER tasks. One major issue is the wakeup of tasks which are right away preempting the waking task while the waking task holds a lock on which the woken task will block right after having preempted the waker. In mainline this is prevented due to the implicit preemption disable of spin/rw_lock held regions. On RT this is not possible due to the fully preemptible nature of sleeping spinlocks. 
Though for a SCHED_OTHER task preempting another SCHED_OTHER task this is really not a correctness issue. RT folks are concerned about SCHED_FIFO/RR tasks preemption and not about the purely fairness driven SCHED_OTHER preemption latencies. So I introduced a lazy preemption mechanism which only applies to SCHED_OTHER tasks preempting another SCHED_OTHER task. Aside from the existing preempt_count each task now sports a preempt_lazy_count which is manipulated on lock acquisition and release. This is slightly incorrect as for laziness reasons I coupled this on migrate_disable/enable so some other mechanisms get the same treatment (e.g. get_cpu_light). Now on the scheduler side instead of setting NEED_RESCHED this sets NEED_RESCHED_LAZY in case of a SCHED_OTHER/SCHED_OTHER preemption and therefore allows the waking task to exit the lock held region before the woken task preempts. That also works better for cross CPU wakeups as the other side can stay in the adaptive spinning loop. For RT class preemption there is no change. This simply sets NEED_RESCHED and forgoes the lazy preemption counter. Initial tests do not expose any observable latency increase, but history shows that I've been proven wrong before :) The lazy preemption mode is on by default, but with CONFIG_SCHED_DEBUG enabled it can be disabled via: # echo NO_PREEMPT_LAZY >/sys/kernel/debug/sched_features and reenabled via # echo PREEMPT_LAZY >/sys/kernel/debug/sched_features The test results so far are very machine and workload dependent, but there is a clear trend that it enhances the non RT workload performance. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 164/191 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: x86/entry: Use should_resched() in idtentry_exit_cond_resched() Date: Tue, 30 Jun 2020 11:45:14 +0200 The TIF_NEED_RESCHED bit is inlined on x86 into the preemption counter. 
By using should_resched(0) instead of need_resched() the same check can be performed which uses the same variable as `preempt_count()' which was issued before. Use should_resched(0) instead of need_resched(). Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 165/191 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: x86: Support for lazy preemption Date: Thu, 1 Nov 2012 11:03:47 +0100 Implement the x86 pieces for lazy preempt. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 166/191 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: arm: Add support for lazy preemption Date: Wed, 31 Oct 2012 12:04:11 +0100 Implement the arm pieces for lazy preempt. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 167/191 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: powerpc: Add support for lazy preemption Date: Thu, 1 Nov 2012 10:14:11 +0100 Implement the powerpc pieces for lazy preempt. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 168/191 [ Author: Anders Roxell Email: anders.roxell@linaro.org Subject: arch/arm64: Add lazy preempt support Date: Thu, 14 May 2015 17:52:17 +0200 arm64 is missing support for PREEMPT_RT. The main feature which is lacking is support for lazy preemption. The arch-specific entry code, thread information structure definitions, and associated data tables have to be extended to provide this support. Then the Kconfig file has to be extended to indicate the support is available, and also to indicate that support for full RT preemption is now available. Signed-off-by: Anders Roxell <anders.roxell@linaro.org> ] 169/191 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: jump-label: disable if stop_machine() is used Date: Wed, 8 Jul 2015 17:14:48 +0200 Some architectures are using stop_machine() while switching the opcode which leads to latency spikes. 
The architectures which use stop_machine() atm: - ARM stop machine - s390 stop machine The architectures which use other sorcery: - MIPS - X86 - powerpc - sparc - arm64 Signed-off-by: Thomas Gleixner <tglx@linutronix.de> [bigeasy: only ARM for now] Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 170/191 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: leds: trigger: disable CPU trigger on -RT Date: Thu, 23 Jan 2014 14:45:59 +0100 as it triggers: |CPU: 0 PID: 0 Comm: swapper Not tainted 3.12.8-rt10 #141 |[<c0014aa4>] (unwind_backtrace+0x0/0xf8) from [<c0012788>] (show_stack+0x1c/0x20) |[<c0012788>] (show_stack+0x1c/0x20) from [<c043c8dc>] (dump_stack+0x20/0x2c) |[<c043c8dc>] (dump_stack+0x20/0x2c) from [<c004c5e8>] (__might_sleep+0x13c/0x170) |[<c004c5e8>] (__might_sleep+0x13c/0x170) from [<c043f270>] (__rt_spin_lock+0x28/0x38) |[<c043f270>] (__rt_spin_lock+0x28/0x38) from [<c043fa00>] (rt_read_lock+0x68/0x7c) |[<c043fa00>] (rt_read_lock+0x68/0x7c) from [<c036cf74>] (led_trigger_event+0x2c/0x5c) |[<c036cf74>] (led_trigger_event+0x2c/0x5c) from [<c036e0bc>] (ledtrig_cpu+0x54/0x5c) |[<c036e0bc>] (ledtrig_cpu+0x54/0x5c) from [<c000ffd8>] (arch_cpu_idle_exit+0x18/0x1c) |[<c000ffd8>] (arch_cpu_idle_exit+0x18/0x1c) from [<c00590b8>] (cpu_startup_entry+0xa8/0x234) |[<c00590b8>] (cpu_startup_entry+0xa8/0x234) from [<c043b2cc>] (rest_init+0xb8/0xe0) |[<c043b2cc>] (rest_init+0xb8/0xe0) from [<c061ebe0>] (start_kernel+0x2c4/0x380) Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 171/191 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: tty/serial/omap: Make the locking RT aware Date: Thu, 28 Jul 2011 13:32:57 +0200 The lock is a sleeping lock and local_irq_save() is not the optimisation we are looking for. Redo it to make it work on -RT and non-RT. 
Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 172/191 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: tty/serial/pl011: Make the locking work on RT Date: Tue, 8 Jan 2013 21:36:51 +0100 The lock is a sleeping lock and local_irq_save() is not the optimisation we are looking for. Redo it to make it work on -RT and non-RT. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 173/191 [ Author: Yadi.hu Email: yadi.hu@windriver.com Subject: ARM: enable irq in translation/section permission fault handlers Date: Wed, 10 Dec 2014 10:32:09 +0800 Probably happens on all ARM, with CONFIG_PREEMPT_RT CONFIG_DEBUG_ATOMIC_SLEEP This simple program.... int main() { *((char *)0xc0001000) = 0; }; [ 512.742724] BUG: sleeping function called from invalid context at kernel/rtmutex.c:658 [ 512.743000] in_atomic(): 0, irqs_disabled(): 128, pid: 994, name: a [ 512.743217] INFO: lockdep is turned off. [ 512.743360] irq event stamp: 0 [ 512.743482] hardirqs last enabled at (0): [< (null)>] (null) [ 512.743714] hardirqs last disabled at (0): [<c0426370>] copy_process+0x3b0/0x11c0 [ 512.744013] softirqs last enabled at (0): [<c0426370>] copy_process+0x3b0/0x11c0 [ 512.744303] softirqs last disabled at (0): [< (null)>] (null) [ 512.744631] [<c041872c>] (unwind_backtrace+0x0/0x104) [ 512.745001] [<c09af0c4>] (dump_stack+0x20/0x24) [ 512.745355] [<c0462490>] (__might_sleep+0x1dc/0x1e0) [ 512.745717] [<c09b6770>] (rt_spin_lock+0x34/0x6c) [ 512.746073] [<c0441bf0>] (do_force_sig_info+0x34/0xf0) [ 512.746457] [<c0442668>] (force_sig_info+0x18/0x1c) [ 512.746829] [<c041d880>] (__do_user_fault+0x9c/0xd8) [ 512.747185] [<c041d938>] (do_bad_area+0x7c/0x94) [ 512.747536] [<c041d990>] (do_sect_fault+0x40/0x48) [ 512.747898] [<c040841c>] (do_DataAbort+0x40/0xa0) [ 512.748181] Exception stack(0xecaa1fb0 to 0xecaa1ff8) 0xc0000000 belongs to kernel address space, user task can not be allowed to access it. 
For the above condition, the correct result is that the test case should receive a “segmentation fault” and exit, not produce a stack dump. The root cause is commit 02fe2845d6a8 ("avoid enabling interrupts in prefetch/data abort handlers"), it deletes the irq enable block in the Data abort assembly code and moves it into the page/breakpoint/alignment fault handlers instead. But the author does not enable irq in the translation/section permission fault handlers. ARM disables irq when it enters exception/interrupt mode, if the kernel doesn't enable irq, it would be still disabled during translation/section permission fault. We see the above splat because do_force_sig_info is still called with IRQs off, and that code eventually does a: spin_lock_irqsave(&t->sighand->siglock, flags); As this is architecture independent code, and we've not seen any other need for other arch to have the siglock converted to raw lock, we can conclude that we should enable irq for ARM translation/section permission exception. Signed-off-by: Yadi.hu <yadi.hu@windriver.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 174/191 [ Author: Josh Cartwright Email: joshc@ni.com Subject: genirq: update irq_set_irqchip_state documentation Date: Thu, 11 Feb 2016 11:54:00 -0600 On -rt kernels, the use of migrate_disable()/migrate_enable() is sufficient to guarantee a task isn't moved to another CPU. Update the irq_set_irqchip_state() documentation to reflect this. Signed-off-by: Josh Cartwright <joshc@ni.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 175/191 [ Author: Josh Cartwright Email: joshc@ni.com Subject: KVM: arm/arm64: downgrade preempt_disable()d region to migrate_disable() Date: Thu, 11 Feb 2016 11:54:01 -0600 kvm_arch_vcpu_ioctl_run() disables the use of preemption when updating the vgic and timer states to prevent the calling task from migrating to another CPU. It does so to prevent the task from writing to the incorrect per-CPU GIC distributor registers. 
On -rt kernels, it's possible to maintain the same guarantee with the use of migrate_{disable,enable}(), with the added benefit that the migrate-disabled region is preemptible. Update kvm_arch_vcpu_ioctl_run() to do so. Cc: Christoffer Dall <christoffer.dall@linaro.org> Reported-by: Manish Jaggi <Manish.Jaggi@caviumnetworks.com> Signed-off-by: Josh Cartwright <joshc@ni.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 176/191 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: arm64: fpsimd: Delay freeing memory in fpsimd_flush_thread() Date: Wed, 25 Jul 2018 14:02:38 +0200 fpsimd_flush_thread() invokes kfree() via sve_free() within a preempt disabled section which is not working on -RT. Delay freeing of memory until preemption is enabled again. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 177/191 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: x86: Enable RT also on 32bit Date: Thu, 7 Nov 2019 17:49:20 +0100 Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 178/191 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: ARM: Allow to enable RT Date: Fri, 11 Oct 2019 13:14:29 +0200 Allow to select RT. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 179/191 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: ARM64: Allow to enable RT Date: Fri, 11 Oct 2019 13:14:35 +0200 Allow to select RT. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 180/191 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: powerpc: traps: Use PREEMPT_RT Date: Fri, 26 Jul 2019 11:30:49 +0200 Add PREEMPT_RT to the backtrace if enabled. 
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 181/191 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: powerpc/pseries/iommu: Use a locallock instead local_irq_save() Date: Tue, 26 Mar 2019 18:31:54 +0100 The locallock protects the per-CPU variable tce_page. The function attempts to allocate memory while tce_page is protected (by disabling interrupts). Use local_irq_save() instead of local_irq_disable(). Cc: stable-rt@vger.kernel.org Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 182/191 [ Author: Bogdan Purcareata Email: bogdan.purcareata@freescale.com Subject: powerpc/kvm: Disable in-kernel MPIC emulation for PREEMPT_RT Date: Fri, 24 Apr 2015 15:53:13 +0000 While converting the openpic emulation code to use a raw_spinlock_t enables guests to run on RT, there's still a performance issue. For interrupts sent in directed delivery mode with a multiple CPU mask, the emulated openpic will loop through all of the VCPUs, and for each VCPU, it calls IRQ_check, which will loop through all the pending interrupts for that VCPU. This is done while holding the raw_lock, meaning that in all this time the interrupts and preemption are disabled on the host Linux. A malicious user app can max both these numbers and cause a DoS. This temporary fix is sent for two reasons. First is so that users who want to use the in-kernel MPIC emulation are aware of the potential latencies, thus making sure that the hardware MPIC and their usage scenario does not involve interrupts sent in directed delivery mode, and the number of possible pending interrupts is kept small. Secondly, this should incentivize the development of a proper openpic emulation that would be better suited for RT. 
Acked-by: Scott Wood <scottwood@freescale.com> Signed-off-by: Bogdan Purcareata <bogdan.purcareata@freescale.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 183/191 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: powerpc/stackprotector: work around stack-guard init from atomic Date: Tue, 26 Mar 2019 18:31:29 +0100 This is invoked from the secondary CPU in atomic context. On x86 we use tsc instead. On Power we XOR it against mftb() so let's use the stack address as the initial value. Cc: stable-rt@vger.kernel.org Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 184/191 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: powerpc: Avoid recursive header includes Date: Fri, 8 Jan 2021 19:48:21 +0100 - The include of bug.h leads to an include of printk.h which gets back to spinlock.h and complains then about missing xchg(). Remove bug.h and add bits.h which is needed for BITS_PER_BYTE. - Avoid the "please don't include this file directly" error from rwlock-rt. Allow an include from/with rtmutex.h. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 185/191 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: POWERPC: Allow to enable RT Date: Fri, 11 Oct 2019 13:14:41 +0200 Allow to select RT. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 186/191 [ Author: Mike Galbraith Email: umgwanakikbuti@gmail.com Subject: drivers/block/zram: Replace bit spinlocks with rtmutex for -rt Date: Thu, 31 Mar 2016 04:08:28 +0200 They're nondeterministic, and lead to ___might_sleep() splats in -rt. OTOH, they're a lot less wasteful than an rtmutex per page. 
Signed-off-by: Mike Galbraith <umgwanakikbuti@gmail.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 187/191 [ Author: Haris Okanovic Email: haris.okanovic@ni.com Subject: tpm_tis: fix stall after iowrite()s Date: Tue, 15 Aug 2017 15:13:08 -0500 ioread8() operations to TPM MMIO addresses can stall the cpu when immediately following a sequence of iowrite()'s to the same region. For example, cyclictest measures ~400us latency spikes when a non-RT usermode application communicates with an SPI-based TPM chip (Intel Atom E3940 system, PREEMPT_RT kernel). The spikes are caused by a stalling ioread8() operation following a sequence of 30+ iowrite8()s to the same address. I believe this happens because the write sequence is buffered (in cpu or somewhere along the bus), and gets flushed on the first LOAD instruction (ioread()) that follows. The enclosed change appears to fix this issue: read the TPM chip's access register (status code) after every iowrite() operation to amortize the cost of flushing data to chip across multiple instructions. Signed-off-by: Haris Okanovic <haris.okanovic@ni.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 188/191 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: signals: Allow RT tasks to cache one sigqueue struct Date: Fri, 3 Jul 2009 08:44:56 -0500 Allow realtime tasks to cache one sigqueue in task struct. This avoids an allocation which can cause latencies or fail. Ideally the sigqueue is cached after first successful delivery and will be available for next signal delivery. This works under the assumption that the RT task never has an unprocessed signal while one is about to be queued. The caching is not used for SIGQUEUE_PREALLOC because this kind of sigqueue is handled differently (and not used for regular signal delivery). 
[bigeasy: With a fix from Matt Fleming <matt@codeblueprint.co.uk>] Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 189/191 [ Author: Ingo Molnar Email: mingo@elte.hu Subject: genirq: Disable irqpoll on -rt Date: Fri, 3 Jul 2009 08:29:57 -0500 Creates long latencies for no value. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 190/191 [ Author: Clark Williams Email: williams@redhat.com Subject: sysfs: Add /sys/kernel/realtime entry Date: Sat, 30 Jul 2011 21:55:53 -0500 Add a /sys/kernel entry to indicate that the kernel is a realtime kernel. Clark says that he needs this for udev rules, udev needs to evaluate if it's a PREEMPT_RT kernel a few thousand times and parsing uname output is too slow or so. Are there better solutions? Should it exist and return 0 on !-rt? Signed-off-by: Clark Williams <williams@redhat.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> ] 191/191 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: Add localversion for -RT release Date: Fri, 8 Jul 2011 20:25:16 +0200 Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] Signed-off-by: Bruce Ashfield <bruce.ashfield@gmail.com>
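Patch 190/191 in the series above adds a /sys/kernel/realtime entry so that userspace can cheaply tell whether it runs on a PREEMPT_RT kernel. A minimal userspace sketch of such a check follows. It assumes the entry holds a single integer, non-zero on an RT kernel, and that the file is simply absent otherwise; the log itself leaves open whether the file exists on !-rt. The helper name is_realtime_kernel() is hypothetical and not part of any patch.

```python
from pathlib import Path


def is_realtime_kernel(path="/sys/kernel/realtime"):
    """Best-effort check for a PREEMPT_RT kernel.

    Assumption (not confirmed by the log): the sysfs entry contains one
    integer, non-zero on an RT kernel. A missing or unreadable file is
    treated as "not RT".
    """
    try:
        text = Path(path).read_text().strip()
    except OSError:
        # Entry missing or unreadable: most likely a non-RT kernel.
        return False
    try:
        return int(text) != 0
    except ValueError:
        # Unexpected content: do not claim RT.
        return False


if __name__ == "__main__":
    if is_realtime_kernel():
        print("PREEMPT_RT kernel")
    else:
        print("not a PREEMPT_RT kernel (or entry missing)")
```

Reading one small sysfs file avoids spawning and parsing uname, which is the cost the patch description complains about.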
2020-12-27	rt: remove patches for v5.11 initial	Bruce Ashfield
	Signed-off-by: Bruce Ashfield <bruce.ashfield@gmail.com>
2020-12-17	5.10/rt: initial import	Bruce Ashfield
	Signed-off-by: Bruce Ashfield <bruce.ashfield@gmail.com>
2020-12-17	rt: prep for v5.10	Bruce Ashfield
	Signed-off-by: Bruce Ashfield <bruce.ashfield@gmail.com>
2020-11-03	stop_machine: Add function and caller debug info	Bruce Ashfield
	1/194 [ Author: Peter Zijlstra Email: peterz@infradead.org Subject: stop_machine: Add function and caller debug info Date: Fri, 23 Oct 2020 12:11:59 +0200 Crashes in stop-machine are hard to connect to the calling code, add a little something to help with that. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 2/194 [ Author: Peter Zijlstra Email: peterz@infradead.org Subject: sched: Fix balance_callback() Date: Fri, 23 Oct 2020 12:12:00 +0200 The intent of balance_callback() has always been to delay executing balancing operations until the end of the current rq->lock section. This is because balance operations must often drop rq->lock, and that isn't safe in general. However, as noted by Scott, there were a few holes in that scheme; balance_callback() was called after rq->lock was dropped, which means another CPU can interleave and touch the callback list. Rework code to call the balance callbacks before dropping rq->lock where possible, and otherwise splice the balance list onto a local stack. This guarantees that the balance list must be empty when we take rq->lock. IOW, we'll only ever run our own balance callbacks. Reported-by: Scott Wood <swood@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 3/194 [ Author: Peter Zijlstra Email: peterz@infradead.org Subject: sched/hotplug: Ensure only per-cpu kthreads run during hotplug Date: Fri, 23 Oct 2020 12:12:01 +0200 In preparation for migrate_disable(), make sure only per-cpu kthreads are allowed to run on !active CPUs. This is run (as one of the very first steps) from the cpu-hotplug task which is a per-cpu kthread and completion of the hotplug operation only requires such tasks. 
This constraint enables the migrate_disable() implementation to wait for completion of all migrate_disable regions on this CPU at hotplug time without fear of any new ones starting. This replaces the unlikely(rq->balance_callbacks) test at the tail of context_switch with an unlikely(rq->balance_work), the fast path is not affected. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 4/194 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: sched/core: Wait for tasks being pushed away on hotplug Date: Fri, 23 Oct 2020 12:12:02 +0200 RT kernels need to ensure that all tasks which are not per CPU kthreads have left the outgoing CPU to guarantee that no tasks are force migrated within a migrate disabled section. There is also some desire to (ab)use fine grained CPU hotplug control to clear a CPU from active state to force migrate tasks which are not per CPU kthreads away for power control purposes. Add a mechanism which waits until all tasks which should leave the CPU after the CPU active flag is cleared have moved to a different online CPU. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 5/194 [ Author: Peter Zijlstra Email: peterz@infradead.org Subject: workqueue: Manually break affinity on hotplug Date: Fri, 23 Oct 2020 12:12:03 +0200 Don't rely on the scheduler to force break affinity for us -- it will stop doing that for per-cpu-kthreads. 
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 6/194 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: sched/hotplug: Consolidate task migration on CPU unplug Date: Fri, 23 Oct 2020 12:12:04 +0200 With the new mechanism which kicks tasks off the outgoing CPU at the end of schedule() the situation on an outgoing CPU right before the stopper thread brings it down completely is: - All user tasks and all unbound kernel threads have either been migrated away or are not running and the next wakeup will move them to an online CPU. - All per CPU kernel threads, except the cpu hotplug thread and the stopper thread, have either been unbound or parked by the responsible CPU hotplug callback. That means that at the last step before the stopper thread is invoked the cpu hotplug thread is the last legitimate running task on the outgoing CPU. Add a final wait step right before the stopper thread is kicked which ensures that any still running tasks on the way to park or on the way to kick themselves off the CPU are either sleeping or gone. This allows removing the migrate_tasks() crutch in sched_cpu_dying(). If sched_cpu_dying() detects that there is still another running task aside from the stopper thread then it will explode with the appropriate fireworks. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 7/194 [ Author: Peter Zijlstra Email: peterz@infradead.org Subject: sched: Fix hotplug vs CPU bandwidth control Date: Fri, 23 Oct 2020 12:12:05 +0200 Since we now migrate tasks away before DYING, we should also move bandwidth unthrottle, otherwise we can gain tasks from unthrottle after we expect all tasks to be gone already.
Also; it looks like the RT balancers don't respect cpu_active() and instead rely on rq->online in part, complete this. This too requires we do set_rq_offline() earlier to match the cpu_active() semantics. (The bigger patch is to convert RT to cpu_active() entirely) Since set_rq_online() is called from sched_cpu_activate(), place set_rq_offline() in sched_cpu_deactivate(). Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 8/194 [ Author: Peter Zijlstra Email: peterz@infradead.org Subject: sched: Massage set_cpus_allowed() Date: Fri, 23 Oct 2020 12:12:06 +0200 Thread a u32 flags word through the set_cpus_allowed() callchain. This will allow adding behavioural tweaks for future users. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 9/194 [ Author: Peter Zijlstra Email: peterz@infradead.org Subject: sched: Add migrate_disable() Date: Fri, 23 Oct 2020 12:12:07 +0200 Add the base migrate_disable() support (under protest). While migrate_disable() is (currently) required for PREEMPT_RT, it is also one of the biggest flaws in the system. Notably this is just the base implementation, it is broken vs sched_setaffinity() and hotplug, both solved in additional patches for ease of review. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 10/194 [ Author: Peter Zijlstra Email: peterz@infradead.org Subject: sched: Fix migrate_disable() vs set_cpus_allowed_ptr() Date: Fri, 23 Oct 2020 12:12:08 +0200 Concurrent migrate_disable() and set_cpus_allowed_ptr() has interesting features. We rely on set_cpus_allowed_ptr() to not return until the task runs inside the provided mask. This expectation is exported to userspace. This means that any set_cpus_allowed_ptr() caller must wait until migrate_enable() allows migrations. 
At the same time, we don't want migrate_enable() to schedule, due to patterns like: preempt_disable(); migrate_disable(); ... migrate_enable(); preempt_enable(); And: raw_spin_lock(&B); spin_unlock(&A); this means that when migrate_enable() must restore the affinity mask, it cannot wait for completion thereof. Luck will have it that that is exactly the case where there is a pending set_cpus_allowed_ptr(), so let that provide storage for the async stop machine. Much thanks to Valentin who used TLA+ most effective and found lots of 'interesting' cases. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 11/194 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: sched/core: Make migrate disable and CPU hotplug cooperative Date: Fri, 23 Oct 2020 12:12:09 +0200 On CPU unplug tasks which are in a migrate disabled region cannot be pushed to a different CPU until they returned to migrateable state. Account the number of tasks on a runqueue which are in a migrate disabled section and make the hotplug wait mechanism respect that. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 12/194 [ Author: Peter Zijlstra Email: peterz@infradead.org Subject: sched,rt: Use cpumask_any_distribute() Date: Fri, 23 Oct 2020 12:12:10 +0200 Replace a bunch of cpumask_any() instances with cpumask_any_distribute(), by injecting this little bit of random in cpu selection, we reduce the chance two competing balance operations working off the same lowest_mask pick the same CPU. 
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 13/194 [ Author: Peter Zijlstra Email: peterz@infradead.org Subject: sched,rt: Use the full cpumask for balancing Date: Fri, 23 Oct 2020 12:12:11 +0200 We want migrate_disable() tasks to get PULLs in order for them to PUSH away the higher priority task. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 14/194 [ Author: Peter Zijlstra Email: peterz@infradead.org Subject: sched, lockdep: Annotate ->pi_lock recursion Date: Fri, 23 Oct 2020 12:12:12 +0200 There's a valid ->pi_lock recursion issue where the actual PI code tries to wake up the stop task. Make lockdep aware so it doesn't complain about this. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 15/194 [ Author: Peter Zijlstra Email: peterz@infradead.org Subject: sched: Fix migrate_disable() vs rt/dl balancing Date: Fri, 23 Oct 2020 12:12:13 +0200 In order to minimize the interference of migrate_disable() on lower priority tasks, which can be deprived of runtime due to being stuck below a higher priority task. Teach the RT/DL balancers to push away these higher priority tasks when a lower priority task gets selected to run on a freshly demoted CPU (pull). This adds migration interference to the higher priority task, but restores bandwidth to system that would otherwise be irrevocably lost. Without this it would be possible to have all tasks on the system stuck on a single CPU, each task preempted in a migrate_disable() section with a single high priority task running. This way we can still approximate running the M highest priority tasks on the system. 
Migrating the top task away is (of course) still subject to migrate_disable() too, which means the lower task is subject to an interference equivalent to the worst case migrate_disable() section. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 16/194 [ Author: Peter Zijlstra Email: peterz@infradead.org Subject: sched/proc: Print accurate cpumask vs migrate_disable() Date: Fri, 23 Oct 2020 12:12:14 +0200 Ensure /proc//status doesn't print 'random' cpumasks due to migrate_disable(). Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 17/194 [ Author: Peter Zijlstra Email: peterz@infradead.org Subject: sched: Add migrate_disable() tracepoints Date: Fri, 23 Oct 2020 12:12:15 +0200 XXX write a tracer: - 'migrate_disable() -> migrate_enable()' time in task_sched_runtime() - 'migrate_pull -> sched-in' time in task_sched_runtime() The first will give worst case for the second, which is the actual interference experienced by the task due to the migration constraints of migrate_disable().
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 18/194 [ Author: Valentin Schneider Email: valentin.schneider@arm.com Subject: sched: Deny self-issued __set_cpus_allowed_ptr() when migrate_disable() Date: Fri, 23 Oct 2020 12:12:16 +0200 migrate_disable(); set_cpus_allowed_ptr(current, {something excluding task_cpu(current)}); affine_move_task(); <-- never returns Signed-off-by: Valentin Schneider <valentin.schneider@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20201013140116.26651-1-valentin.schneider@arm.com Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 19/194 [ Author: Valentin Schneider Email: valentin.schneider@arm.com Subject: sched: Comment affine_move_task() Date: Fri, 23 Oct 2020 12:12:17 +0200 Signed-off-by: Valentin Schneider <valentin.schneider@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20201013140116.26651-2-valentin.schneider@arm.com Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 20/194 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: Use CONFIG_PREEMPTION Date: Fri, 26 Jul 2019 11:30:49 +0200 This is an all-in-one patch of the current `PREEMPTION' branch. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 21/194 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: blk-mq: Don't complete on a remote CPU in force threaded mode Date: Wed, 28 Oct 2020 11:07:44 +0100 With force threaded interrupts enabled, raising softirq from an SMP function call will always result in waking the ksoftirqd thread. This is not optimal given that the thread runs at SCHED_OTHER priority. Completing the request in hard IRQ-context on PREEMPT_RT (which enforces the force threaded mode) is bad because the completion handler may acquire sleeping locks which violate the locking context.
Disable request completing on a remote CPU in force threaded mode. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 22/194 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: blk-mq: Always complete remote completions requests in softirq Date: Wed, 28 Oct 2020 11:07:09 +0100 Controllers with multiple queues have their IRQ handlers pinned to a CPU. The core shouldn't need to complete the request on a remote CPU. Remove this case and always raise the softirq to complete the request. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 23/194 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: blk-mq: Use llist_head for blk_cpu_done Date: Wed, 28 Oct 2020 11:08:21 +0100 With llist_head it is possible to avoid the locking (the irq-off region) when items are added. This makes it possible to add items on a remote CPU. llist_add() returns true if the list was previously empty. This can be used to invoke the SMP function call / raise softirq only if the first item was added (otherwise it is already pending). This simplifies the code a little and reduces the IRQ-off regions. With this change it is possible to reduce the SMP function call to a simple __raise_softirq_irqoff(). Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 24/194 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: lib/test_lockup: Minimum fix to get it compiled on PREEMPT_RT Date: Wed, 28 Oct 2020 18:55:27 +0100 On PREEMPT_RT the locks are quite different so they can't be tested as it is done below. The alternative is to test for the waitlock within rtmutex. This is the bare minimum to get it compiled. Problems which exist on PREEMPT_RT: - none of the locks (spinlock_t, rwlock_t, mutex_t, rw_semaphore) may be acquired with disabled preemption or interrupts. If I read the code correctly, it is possible to acquire a mutex with disabled interrupts. I don't know how to obtain a lock pointer.
Technically they are not exported to userland. - memory can not be allocated with disabled preemption or interrupts even with GFP_ATOMIC. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 25/194 [ Author: John Ogness Email: john.ogness@linutronix.de Subject: printk: refactor kmsg_dump_get_buffer() Date: Wed, 14 Oct 2020 19:09:15 +0200 kmsg_dump_get_buffer() requires nearly the same logic as syslog_print_all(), but uses different variable names and does not make use of the ringbuffer loop macros. Modify kmsg_dump_get_buffer() so that the implementation is as similar to syslog_print_all() as possible. At some point it would be nice to have this code factored into a helper function. But until then, the code should at least look similar enough so that it is obvious there is logic duplication implemented. Signed-off-by: John Ogness <john.ogness@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 26/194 [ Author: John Ogness Email: john.ogness@linutronix.de Subject: printk: use buffer pools for sprint buffers Date: Tue, 13 Oct 2020 22:57:55 +0200 vprintk_store() is using a single static buffer as a temporary sprint buffer for the message text. This will not work once @logbuf_lock is removed. Replace the single static buffer with per-cpu and global pools. Each per-cpu pool is large enough to support a worst case of 2 contexts (non-NMI and NMI). To support printk() recursion and printk() calls before per-cpu variables are ready, an extra/fallback global pool of 2 contexts is available. Signed-off-by: John Ogness <john.ogness@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 27/194 [ Author: John Ogness Email: john.ogness@linutronix.de Subject: printk: change @clear_seq to atomic64_t Date: Tue, 13 Oct 2020 23:19:35 +0200 Currently @clear_seq access is protected by @logbuf_lock. Once @logbuf_lock is removed some other form of synchronization will be required.
Change the type of @clear_seq to atomic64_t to provide the synchronization. Signed-off-by: John Ogness <john.ogness@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 28/194 [ Author: John Ogness Email: john.ogness@linutronix.de Subject: printk: remove logbuf_lock, add syslog_lock Date: Wed, 14 Oct 2020 19:06:12 +0200 Since the ringbuffer is lockless, there is no need for it to be protected by @logbuf_lock. Remove @logbuf_lock. This means that printk_nmi_direct and printk_safe_flush_on_panic() no longer need to acquire any lock to run. The global variables @syslog_seq, @syslog_partial, @syslog_time were also protected by @logbuf_lock. Introduce @syslog_lock to protect these. @console_seq, @exclusive_console_stop_seq, @console_dropped are protected by @console_lock. Signed-off-by: John Ogness <john.ogness@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 29/194 [ Author: John Ogness Email: john.ogness@linutronix.de Subject: printk: remove safe buffers Date: Wed, 14 Oct 2020 20:00:11 +0200 With @logbuf_lock removed, the high level printk functions for storing messages are lockless. Messages can be stored from any context, so there is no need for the NMI and safe buffers anymore. Remove the NMI and safe buffers. In NMI or safe contexts, store the message immediately but still use irq_work to defer the console printing. Signed-off-by: John Ogness <john.ogness@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 30/194 [ Author: John Ogness Email: john.ogness@linutronix.de Subject: console: add write_atomic interface Date: Wed, 14 Oct 2020 20:26:35 +0200 Add a write_atomic() callback to the console. This is an optional function for console drivers. The function must be atomic (including NMI safe) for writing to the console. Console drivers must still implement the write() callback. 
The write_atomic() callback will only be used in special situations, such as when the kernel panics. Creating an NMI safe write_atomic() that must synchronize with write() requires a careful implementation of the console driver. To aid with the implementation, a set of console_atomic_() functions are provided: void console_atomic_lock(unsigned int flags); void console_atomic_unlock(unsigned int flags); These functions synchronize using a processor-reentrant spinlock (called a cpulock). Signed-off-by: John Ogness <john.ogness@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 31/194 [ Author: John Ogness Email: john.ogness@linutronix.de Subject: serial: 8250: implement write_atomic Date: Wed, 14 Oct 2020 20:31:46 +0200 Implement a non-sleeping NMI-safe write_atomic() console function in order to support emergency console printing. Since interrupts need to be disabled during transmit, all usage of the IER register is wrapped with access functions that use the console_atomic_lock() function to synchronize register access while tracking the state of the interrupts. This is necessary because write_atomic() can be called from an NMI context that has preempted write_atomic(). Signed-off-by: John Ogness <john.ogness@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 32/194 [ Author: John Ogness Email: john.ogness@linutronix.de Subject: printk: inline log_output(),log_store() in vprintk_store() Date: Mon, 19 Oct 2020 16:40:26 +0206 In preparation for supporting atomic printing, inline log_output() and log_store() into vprintk_store(). This allows these sub-functions to more easily communicate if they have performed a finalized commit as well as the sequence number of that commit. 
Signed-off-by: John Ogness <john.ogness@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 33/194 [ Author: John Ogness Email: john.ogness@linutronix.de Subject: printk: relocate printk_delay() and vprintk_default() Date: Mon, 19 Oct 2020 21:02:40 +0206 Move printk_delay() and vprintk_default() "as is" further up so that they can be used by new functions in an upcoming commit. Signed-off-by: John Ogness <john.ogness@linutornix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 34/194 [ Author: John Ogness Email: john.ogness@linutronix.de Subject: printk: combine boot_delay_msec() into printk_delay() Date: Mon, 19 Oct 2020 22:11:31 +0206 boot_delay_msec() is always called immediately before printk_delay() so just combine the two. Signed-off-by: John Ogness <john.ogness@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 35/194 [ Author: John Ogness Email: john.ogness@linutronix.de Subject: printk: introduce kernel sync mode Date: Wed, 14 Oct 2020 20:40:05 +0200 When the kernel performs an OOPS, enter into "sync mode": - only atomic consoles (write_atomic() callback) will print - printing occurs within vprintk_store() instead of console_unlock() Change @console_seq to atomic64_t for atomic access. Signed-off-by: John Ogness <john.ogness@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 36/194 [ Author: John Ogness Email: john.ogness@linutronix.de Subject: printk: move console printing to kthreads Date: Mon, 19 Oct 2020 22:30:38 +0206 Create a kthread for each console to perform console printing. Now all console printing is fully asynchronous except for the boot console and when the kernel enters sync mode (and there are atomic consoles available). The console_lock() and console_unlock() functions now only do what their name says... locking and unlocking of the console. 
Signed-off-by: John Ogness <john.ogness@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 37/194 [ Author: John Ogness Email: john.ogness@linutronix.de Subject: printk: remove deferred printing Date: Mon, 19 Oct 2020 22:53:30 +0206 Since printing occurs either atomically or from the printing kthread, there is no need for any deferring or tracking possible recursion paths. Remove all printk context tracking. Signed-off-by: John Ogness <john.ogness@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 38/194 [ Author: John Ogness Email: john.ogness@linutronix.de Subject: printk: add console handover Date: Mon, 19 Oct 2020 23:03:44 +0206 If earlyprintk is used, a boot console will print directly to the console immediately. The boot console will unregister itself as soon as a non-boot console registers. However, the non-boot console does not begin printing until its kthread has started. Since this happens much later, there is a long pause in the console output. If the ringbuffer is small, messages could even be dropped during the pause. Add a new CON_HANDOVER console flag to be used internally by printk in order to track which non-boot console took over from a boot console. If handover consoles have implemented write_atomic(), they are allowed to print directly to the console until their kthread can take over. Signed-off-by: John Ogness <john.ogness@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 39/194 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: printk: Tiny cleanup Date: Tue, 20 Oct 2020 18:48:16 +0200 - mark functions and variables static which are used only in this file. - add printf annotation where appropriate - remove static functions without caller - add kdb header file for kgdb builds. 
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 40/194 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: cgroup: use irqsave in cgroup_rstat_flush_locked() Date: Tue, 3 Jul 2018 18:19:48 +0200 All callers of cgroup_rstat_flush_locked() acquire cgroup_rstat_lock either with spin_lock_irq() or spin_lock_irqsave(). cgroup_rstat_flush_locked() itself acquires cgroup_rstat_cpu_lock which is a raw_spin_lock. This lock is also acquired in cgroup_rstat_updated() in IRQ context and therefore requires _irqsave() locking suffix in cgroup_rstat_flush_locked(). Since there is no difference between spin_lock_t and raw_spin_lock_t on !RT lockdep does not complain here. On RT lockdep complains because the interrupts were not disabled here and a deadlock is possible. Acquire the raw_spin_lock_t with disabled interrupts. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 41/194 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: mm: workingset: replace IRQ-off check with a lockdep assert. Date: Mon, 11 Feb 2019 10:40:46 +0100 Commit 68d48e6a2df57 ("mm: workingset: add vmstat counter for shadow nodes") introduced an IRQ-off check to ensure that a lock is held which also disabled interrupts. This does not work the same way on -RT because none of the locks, that are held, disable interrupts. Replace this check with a lockdep assert which ensures that the lock is held. Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 42/194 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: tpm: remove tpm_dev_wq_lock Date: Mon, 11 Feb 2019 11:33:11 +0100 Added in commit 9e1b74a63f776 ("tpm: add support for nonblocking operation") but never actually used it. 
Cc: Philip Tricca <philip.b.tricca@intel.com> Cc: Tadeusz Struk <tadeusz.struk@intel.com> Cc: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 43/194 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: shmem: Use raw_spinlock_t for ->stat_lock Date: Fri, 14 Aug 2020 18:53:34 +0200 Each CPU has SHMEM_INO_BATCH inodes available in `->ino_batch' which is per-CPU. Access here is serialized by disabling preemption. If the pool is empty, it gets reloaded from `->next_ino'. Access here is serialized by ->stat_lock which is a spinlock_t and can not be acquired with disabled preemption. One way around it would be to make the per-CPU ino_batch struct containing the inode number a local_lock_t. Another solution is to promote ->stat_lock to a raw_spinlock_t. The critical sections are short. The mpol_put() should be moved outside of the critical section to avoid invoking the destructor with disabled preemption. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 44/194 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: net: Move lockdep where it belongs Date: Tue, 8 Sep 2020 07:32:20 +0200 Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 45/194 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: tcp: Remove superfluous BH-disable around listening_hash Date: Mon, 12 Oct 2020 17:33:54 +0200 Commit 9652dc2eb9e40 ("tcp: relax listening_hash operations") removed the need to disable bottom half while acquiring listening_hash.lock. There are still two callers left which disable bottom half before the lock is acquired. Drop local_bh_disable() around __inet_hash() which acquires listening_hash->lock, invoke inet_ehash_nolisten() with disabled BH. inet_unhash() conditionally acquires listening_hash->lock.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 46/194 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: x86/fpu: Do not disable BH on RT Date: Mon, 21 Sep 2020 20:15:50 +0200 Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 47/194 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: softirq: Add RT variant Date: Mon, 21 Sep 2020 17:26:19 +0200 Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 48/194 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: tick/sched: Prevent false positive softirq pending warnings on RT Date: Mon, 31 Aug 2020 17:02:36 +0200 On RT a task which has soft interrupts disabled can block on a lock and schedule out to idle while soft interrupts are pending. This triggers the warning in the NOHZ idle code which complains about going idle with pending soft interrupts. But as the task is blocked soft interrupt processing is temporarily blocked as well which means that such a warning is a false positive. To prevent that check the per CPU state which indicates that a scheduled out task has soft interrupts disabled. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 49/194 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: rcu: Prevent false positive softirq warning on RT Date: Mon, 31 Aug 2020 17:26:08 +0200 Soft interrupt disabled sections can legitimately be preempted or schedule out when blocking on a lock on RT enabled kernels so the RCU preempt check warning has to be disabled for RT kernels. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 50/194 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: softirq: Replace barrier() with cpu_relax() in tasklet_unlock_wait() Date: Mon, 31 Aug 2020 15:12:38 +0200 A barrier() in a tight loop which waits for something to happen on a remote CPU is a pointless exercise. Replace it with cpu_relax() which allows HT siblings to make progress. 
Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 51/194 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: tasklets: Avoid cancel/kill deadlock on RT Date: Mon, 21 Sep 2020 17:47:34 +0200 Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 52/194 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: tasklets: Use static line for functions Date: Mon, 7 Sep 2020 22:57:32 +0200 Inlines exist for a reason. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 53/194 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: locking/rtmutex: Remove cruft Date: Tue, 29 Sep 2020 15:21:17 +0200 Most of this has been around since the very beginning. I'm not sure if this was used while the rtmutex-deadlock-tester was around but today it seems to only waste memory: - save_state: No users - name: Assigned and printed if a deadlock was detected. I'm keeping it but want to point out that lockdep has the same information. - file + line: Printed if ::name was NULL. This is only used for in-kernel locks so ::name shouldn't be NULL and then ::file and ::line aren't used. - magic: Assigned to NULL by rt_mutex_destroy(). Remove members of rt_mutex which are not used. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 54/194 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: locking/rtmutex: Remove output from deadlock detector. Date: Tue, 29 Sep 2020 16:05:11 +0200 In commit f5694788ad8da ("rt_mutex: Add lockdep annotations") rtmutex gained lockdep annotation for rt_mutex_lock() and related functions. lockdep will see the locking order and may complain about a deadlock before rtmutex' own mechanism gets a chance to detect it. The rtmutex deadlock detector will only complain about locks with the RT_MUTEX_MIN_CHAINWALK and a waiter must be pending. That means it works only for in-kernel locks because the futex interface always uses RT_MUTEX_FULL_CHAINWALK.
The requirement for an active waiter limits the detector to actual deadlocks and makes it possible to report potential deadlocks like lockdep does. It looks like lockdep is better suited for reporting deadlocks. Remove rtmutex' debug print on deadlock detection. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 55/194 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: locking/rtmutex: Move rt_mutex_init() outside of CONFIG_DEBUG_RT_MUTEXES Date: Tue, 29 Sep 2020 16:32:49 +0200 rt_mutex_init() only initializes lockdep if CONFIG_DEBUG_RT_MUTEXES is enabled. The static initializer (DEFINE_RT_MUTEX) does not have such a restriction. Move rt_mutex_init() outside of CONFIG_DEBUG_RT_MUTEXES. Move the remaining functions in this CONFIG_DEBUG_RT_MUTEXES block to the upper block. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 56/194 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: locking/rtmutex: Remove rt_mutex_timed_lock() Date: Wed, 7 Oct 2020 12:11:33 +0200 rt_mutex_timed_lock() has no callers since commit c051b21f71d1f ("rtmutex: Confine deadlock logic to futex") Remove rt_mutex_timed_lock(). Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 57/194 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: locking/rtmutex: Handle the various new futex race conditions Date: Fri, 10 Jun 2011 11:04:15 +0200 RT opens a few new interesting race conditions in the rtmutex/futex combo due to futex hash bucket lock being a 'sleeping' spinlock and therefor not disabling preemption. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 58/194 [ Author: Steven Rostedt Email: rostedt@goodmis.org Subject: futex: Fix bug on when a requeued RT task times out Date: Tue, 14 Jul 2015 14:26:34 +0200 Requeue with timeout causes a bug with PREEMPT_RT. The bug comes from a timed out condition. 
TASK 1                                  TASK 2
------                                  ------
futex_wait_requeue_pi()
  futex_wait_queue_me()
  <timed out>
                                        double_lock_hb();
  raw_spin_lock(pi_lock);
  if (current->pi_blocked_on) {
  } else {
    current->pi_blocked_on = PI_WAKE_INPROGRESS;
    raw_spin_unlock(pi_lock);
    spin_lock(hb->lock); <-- blocked!
                                        plist_for_each_entry_safe(this) {
                                          rt_mutex_start_proxy_lock();
                                            task_blocks_on_rt_mutex();
                                              BUG_ON(task->pi_blocked_on)!!!!
The BUG_ON() actually has a check for PI_WAKE_INPROGRESS, but the problem is that, after TASK 1 sets PI_WAKE_INPROGRESS, it then tries to grab the hb->lock, which it fails to do. As the hb->lock is a mutex, it will block and set the "pi_blocked_on" to the hb->lock. When TASK 2 goes to requeue it, the check for PI_WAKE_INPROGRESS fails because TASK 1's pi_blocked_on is no longer set to that, but instead set to the hb->lock. The fix: When calling rt_mutex_start_proxy_lock() a check is made to see if the proxy task's pi_blocked_on is set. If so, exit out early. Otherwise set it to a new flag PI_REQUEUE_INPROGRESS, which notifies the proxy task that it is being requeued, and will handle things appropriately. Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 59/194 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: locking/rtmutex: Make lock_killable work Date: Sat, 1 Apr 2017 12:50:59 +0200 Locking an rt mutex killable does not work because signal handling is restricted to TASK_INTERRUPTIBLE. Use signal_pending_state() unconditionally. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 60/194 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: locking/spinlock: Split the lock types header Date: Wed, 29 Jun 2011 19:34:01 +0200 Split raw_spinlock into its own file and the remaining spinlock_t into its own non-RT header. The non-RT header will be replaced later by sleeping spinlocks.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 61/194 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: locking/rtmutex: Avoid include hell Date: Wed, 29 Jun 2011 20:06:39 +0200 Include only the required raw types. This avoids pulling in the complete spinlock header which in turn requires rtmutex.h at some point. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 62/194 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: lockdep: Reduce header files in debug_locks.h Date: Fri, 14 Aug 2020 16:55:25 +0200 The inclusion of printk.h leads to a circular dependency if spinlock_t is based on rt_mutex. Include only atomic.h (xchg()) and cache.h (__read_mostly). Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 63/194 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: locking: split out the rbtree definition Date: Fri, 14 Aug 2020 17:08:41 +0200 rtmutex.h needs the definition for rb_root_cached. By including kernel.h we will get to spinlock.h which requires rtmutex.h again. Split out the required struct definition and move it into its own header file which can be included by rtmutex.h. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 64/194 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: locking/rtmutex: Provide rt_mutex_slowlock_locked() Date: Thu, 12 Oct 2017 16:14:22 +0200 This is the inner part of rt_mutex_slowlock(), required for rwsem-rt. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 65/194 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: locking/rtmutex: export lockdep-less version of rt_mutex's lock, trylock and unlock Date: Thu, 12 Oct 2017 16:36:39 +0200 Required for lock implementation on top of rtmutex.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 66/194 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: sched: Add saved_state for tasks blocked on sleeping locks Date: Sat, 25 Jun 2011 09:21:04 +0200 Spinlocks are state preserving in !RT. RT changes the state when a task gets blocked on a lock. So we need to remember the state before the lock contention. If a regular wakeup (not a RTmutex related wakeup) happens, the saved_state is updated to running. When the lock sleep is done, the saved state is restored. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 67/194 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: locking/rtmutex: add sleeping lock implementation Date: Thu, 12 Oct 2017 17:11:19 +0200 Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 68/194 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: locking/rtmutex: Allow rt_mutex_trylock() on PREEMPT_RT Date: Wed, 2 Dec 2015 11:34:07 +0100 Non PREEMPT_RT kernel can deadlock on rt_mutex_trylock() in softirq context. On PREEMPT_RT the softirq context is handled in thread context. This avoids the deadlock in the slow path and PI-boosting will be done on the correct thread. 
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 69/194 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: locking/rtmutex: add mutex implementation based on rtmutex Date: Thu, 12 Oct 2017 17:17:03 +0200 Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 70/194 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: locking/rtmutex: add rwsem implementation based on rtmutex Date: Thu, 12 Oct 2017 17:28:34 +0200 The RT specific R/W semaphore implementation restricts the number of readers to one because a writer cannot block on multiple readers and inherit its priority or budget. The single reader restricting is painful in various ways: - Performance bottleneck for multi-threaded applications in the page fault path (mmap sem) - Progress blocker for drivers which are carefully crafted to avoid the potential reader/writer deadlock in mainline. The analysis of the writer code paths shows, that properly written RT tasks should not take them. Syscalls like mmap(), file access which take mmap sem write locked have unbound latencies which are completely unrelated to mmap sem. Other R/W sem users like graphics drivers are not suitable for RT tasks either. So there is little risk to hurt RT tasks when the RT rwsem implementation is changed in the following way: - Allow concurrent readers - Make writers block until the last reader left the critical section. This blocking is not subject to priority/budget inheritance. - Readers blocked on a writer inherit their priority/budget in the normal way. There is a drawback with this scheme. R/W semaphores become writer unfair though the applications which have triggered writer starvation (mostly on mmap_sem) in the past are not really the typical workloads running on a RT system. So while it's unlikely to hit writer starvation, it's possible. If there are unexpected workloads on RT systems triggering it, we need to rethink the approach. 
Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 71/194 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: locking/rtmutex: add rwlock implementation based on rtmutex Date: Thu, 12 Oct 2017 17:18:06 +0200 The implementation is bias-based, similar to the rwsem implementation. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 72/194 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: locking/rtmutex: wire up RT's locking Date: Thu, 12 Oct 2017 17:31:14 +0200 Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 73/194 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: locking/rtmutex: add ww_mutex addon for mutex-rt Date: Thu, 12 Oct 2017 17:34:38 +0200 Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 74/194 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: locking/rtmutex: Use custom scheduling function for spin-schedule() Date: Tue, 6 Oct 2020 13:07:17 +0200 PREEMPT_RT builds the rwsem, mutex, spinlock and rwlock typed locks on top of a rtmutex lock. While blocked task->pi_blocked_on is set (tsk_is_pi_blocked()) and task needs to schedule away while waiting. The schedule process must distinguish between blocking on a regular sleeping lock (rwsem and mutex) and a RT-only sleeping lock (spinlock and rwlock): - rwsem and mutex must flush block requests (blk_schedule_flush_plug()) even if blocked on a lock. This can not deadlock because this also happens for non-RT. There should be a warning if the scheduling point is within a RCU read section. - spinlock and rwlock must not flush block requests. This will deadlock if the callback attempts to acquire a lock which is already acquired. 
Similarly to being preempted, there should be no warning if the scheduling point is within a RCU read section. Add preempt_schedule_lock() which is invoked if scheduling is required while blocking on a PREEMPT_RT-only sleeping lock. Remove tsk_is_pi_blocked() from the scheduler path which is no longer needed with the additional scheduler entry point. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 75/194 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: signal: Revert ptrace preempt magic Date: Wed, 21 Sep 2011 19:57:12 +0200 Upstream commit '53da1d9456fe7f8 fix ptrace slowness' is nothing more than a bandaid around the ptrace design trainwreck. It's not a correctness issue, it's merely a cosmetic bandaid. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 76/194 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: preempt: Provide preempt_*_(no)rt variants Date: Fri, 24 Jul 2009 12:38:56 +0200 RT needs a few preempt_disable/enable points which are not necessary otherwise. Implement variants to avoid #ifdeffery. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 77/194 [ Author: Ingo Molnar Email: mingo@elte.hu Subject: mm/vmstat: Protect per cpu variables with preempt disable on RT Date: Fri, 3 Jul 2009 08:30:13 -0500 Disable preemption on -RT for the vmstat code. On vanilla the code runs in IRQ-off regions while on -RT it does not. "preempt_disable" ensures that the same resources are not updated in parallel due to preemption. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 78/194 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: mm/memcontrol: Disable preemption in __mod_memcg_lruvec_state() Date: Wed, 28 Oct 2020 18:15:32 +0100 The callers expect disabled preemption/interrupts while invoking __mod_memcg_lruvec_state(). This works in mainline because a lock of some kind is acquired.
Use preempt_disable_rt() where per-CPU variables are accessed and a stable pointer is expected. This is also done in __mod_zone_page_state() for the same reason. Cc: stable-rt@vger.kernel.org Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 79/194 [ Author: Ahmed S. Darwish Email: a.darwish@linutronix.de Subject: xfrm: Use sequence counter with associated spinlock Date: Wed, 10 Jun 2020 12:53:22 +0200 A sequence counter write side critical section must be protected by some form of locking to serialize writers. A plain seqcount_t does not contain the information of which lock must be held when entering a write side critical section. Use the new seqcount_spinlock_t data type, which allows a spinlock to be associated with the sequence counter. This enables lockdep to verify that the spinlock used for writer serialization is held when the write side critical section is entered. If lockdep is disabled this lock association is compiled out and has neither storage size nor runtime overhead. Upstream-status: The xfrm locking used for seqcount writer serialization appears to be broken. If that's the case, a proper fix will need to be submitted upstream. (e.g. make the seqcount per network namespace?) Signed-off-by: Ahmed S. Darwish <a.darwish@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 80/194 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: u64_stats: Disable preemption on 32bit-UP/SMP with RT during updates Date: Mon, 17 Aug 2020 12:28:10 +0200 On RT the seqcount_t is required even on UP because the softirq can be preempted. The IRQ handler is threaded so it is also preemptible. Disable preemption on 32bit-RT during value updates. There is no need to disable interrupts on RT because the handler is run threaded. Therefore disabling preemption is enough to guarantee that the update is not interrupted.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 81/194 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: fs/dcache: use swait_queue instead of waitqueue Date: Wed, 14 Sep 2016 14:35:49 +0200 __d_lookup_done() invokes wake_up_all() while holding a hlist_bl_lock() which disables preemption. As a workaround convert it to swait. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 82/194 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: fs/dcache: disable preemption on i_dir_seq's write side Date: Fri, 20 Oct 2017 11:29:53 +0200 i_dir_seq is an open-coded seqcounter. Based on the code it looks like we could have two writers in parallel despite the fact that the d_lock is held. The problem is that during the write process on RT the preemption is still enabled and if this process is interrupted by a reader with RT priority then we lock up. To avoid that lock up I am disabling the preemption during the update. The rename of i_dir_seq is here to ensure that new write sides are caught in the future. Cc: stable-rt@vger.kernel.org Reported-by: Oleg.Karfich@wago.com Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 83/194 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: net/Qdisc: use a seqlock instead seqcount Date: Wed, 14 Sep 2016 17:36:35 +0200 The seqcount disables preemption on -RT while it is held, which we can't remove. Also we don't want the reader to spin for ages if the writer is scheduled out. The seqlock on the other hand will serialize / sleep on the lock while the writer is active.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 84/194 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: net: Properly annotate the try-lock for the seqlock Date: Tue, 8 Sep 2020 16:57:11 +0200 In patch ("net/Qdisc: use a seqlock instead seqcount") the seqcount has been replaced with a seqlock to allow the reader to boost the preempted writer. The try_write_seqlock() acquired the lock with a try-lock but the seqcount annotation was "lock". Opencode write_seqcount_t_begin() and use the try-lock annotation for lockdep. Reported-by: Mike Galbraith <efault@gmx.de> Cc: stable-rt@vger.kernel.org Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 85/194 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: kconfig: Disable config options which are not RT compatible Date: Sun, 24 Jul 2011 12:11:43 +0200 Disable stuff which is known to have issues on RT. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 86/194 [ Author: Ingo Molnar Email: mingo@elte.hu Subject: mm: Allow only SLUB on RT Date: Fri, 3 Jul 2009 08:44:03 -0500 Memory allocation disables interrupts as part of the allocation and freeing process. For -RT it is important that this section remains short and doesn't depend on the size of the request or an internal state of the memory allocator. At the beginning the SLAB memory allocator was adapted to RT's needs and it required substantial changes. Later, with the addition of the SLUB memory allocator we adapted this one as well and the changes were smaller. More importantly, due to the design of the SLUB allocator it performs better and its worst case latency was smaller. In the end only SLUB remained supported. Disable SLAB and SLOB on -RT. Only SLUB is adapted to -RT needs.
Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 87/194 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: rcu: make RCU_BOOST default on RT Date: Fri, 21 Mar 2014 20:19:05 +0100 Since it is no longer invoked from the softirq, people run into OOM more often if the priority of the RCU thread is too low. Making boosting default on RT should help in those cases and it can be switched off if someone knows better. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 88/194 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: sched: Disable CONFIG_RT_GROUP_SCHED on RT Date: Mon, 18 Jul 2011 17:03:52 +0200 Carsten reported problems when running: taskset 01 chrt -f 1 sleep 1 from within rc.local on a F15 machine. The task stays running and never gets on the run queue because some of the run queues have rt_throttled=1 which does not go away. Works nicely from an ssh login shell. Disabling CONFIG_RT_GROUP_SCHED solves that as well. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 89/194 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: net/core: disable NET_RX_BUSY_POLL on RT Date: Sat, 27 May 2017 19:02:06 +0200 napi_busy_loop() disables preemption and performs a NAPI poll. We can't acquire sleeping locks with disabled preemption so we would have to work around this and add explicit locking for synchronisation against ksoftirqd. Without explicit synchronisation a low priority process would "own" the NAPI state (by setting NAPIF_STATE_SCHED) and could be scheduled out (no preempt_disable() and BH is preemptible on RT). If a network packet arrives then the interrupt handler would set NAPIF_STATE_MISSED and the system would wait until the task owning the NAPI would be scheduled in again.
Should a task with RT priority busy poll then it would consume the CPU instead of allowing tasks with lower priority to run. The NET_RX_BUSY_POLL is disabled by default (the system wide sysctls for poll/read are set to zero) so disable NET_RX_BUSY_POLL on RT to avoid wrong locking context on RT. Should this feature be considered useful on RT systems then it could be enabled again with proper locking and synchronisation. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 90/194 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: efi: Disable runtime services on RT Date: Thu, 26 Jul 2018 15:03:16 +0200 Based on measurements the EFI functions get_variable / get_next_variable take up to 2us which looks okay. The functions get_time, set_time take around 10ms. Those 10ms are too much. Even one ms would be too much. Ard mentioned that SetVariable might even trigger larger latencies if the firmware erases flash blocks on NOR. The time-functions are used by efi-rtc and can be triggered at runtime (either via explicit read/write or ntp sync). The variable write could be used by pstore. These functions can be disabled without much of a loss. The poweroff / reboot hooks may be provided by PSCI. Disable EFI's runtime wrappers. This was observed on "EFI v2.60 by SoftIron Overdrive 1000". Acked-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 91/194 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: efi: Allow efi=runtime Date: Thu, 26 Jul 2018 15:06:10 +0200 In case the command line option "efi=noruntime" is the default at build time, the user can override it with `efi=runtime' and enable the runtime services again.
Acked-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 92/194 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: rt: Add local irq locks Date: Mon, 20 Jun 2011 09:03:47 +0200 Introduce locallock. For !RT this maps to preempt_disable()/ local_irq_disable() so there is not much that changes. For RT this will map to a spinlock. This makes preemption possible and the locked "resource" gets the lockdep annotation it wouldn't have otherwise. The locks are recursive for owner == current. Also, all locks use migrate_disable() which ensures that the task is not migrated to another CPU while the lock is held and the owner is preempted. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 93/194 [ Author: Oleg Nesterov Email: oleg@redhat.com Subject: signal/x86: Delay calling signals in atomic Date: Tue, 14 Jul 2015 14:26:34 +0200 On x86_64 we must disable preemption before we enable interrupts for stack faults, int3 and debugging, because the current task is using a per CPU debug stack defined by the IST. If we schedule out, another task can come in and use the same stack and cause the stack to be corrupted and crash the kernel on return. When CONFIG_PREEMPT_RT is enabled, spin_locks become mutexes, and one of these is the spin lock used in signal handling. Some of the debug code (int3) causes do_trap() to send a signal. This function takes a spin lock that has been converted to a mutex and has the possibility to sleep. If this happens, the above issue with the corrupted stack is possible. Instead of calling the signal right away, for PREEMPT_RT and x86_64, the signal information is stored on the task's task_struct and TIF_NOTIFY_RESUME is set. Then on exit of the trap, the signal resume code will send the signal when preemption is enabled. [ rostedt: Switched from #ifdef CONFIG_PREEMPT_RT to ARCH_RT_DELAYS_SIGNAL_SEND and added comments to the code.
] Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> [bigeasy: also needed on 32bit as per Yang Shi <yang.shi@linaro.org>] Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 94/194 [ Author: Peter Zijlstra Email: peterz@infradead.org Subject: Split IRQ-off and zone->lock while freeing pages from PCP list #2 Date: Mon, 28 May 2018 15:24:21 +0200 Split the IRQ-off section while accessing the PCP list from zone->lock while freeing pages. Introduce isolate_pcp_pages() which separates the pages from the PCP list onto a temporary list and then free the temporary list via free_pcppages_bulk(). Signed-off-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 95/194 [ Author: Peter Zijlstra Email: peterz@infradead.org Subject: Split IRQ-off and zone->lock while freeing pages from PCP list #2 Date: Mon, 28 May 2018 15:24:21 +0200 Split the IRQ-off section while accessing the PCP list from zone->lock while freeing pages. Introduce isolate_pcp_pages() which separates the pages from the PCP list onto a temporary list and then free the temporary list via free_pcppages_bulk(). Signed-off-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 96/194 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: mm/SLxB: change list_lock to raw_spinlock_t Date: Mon, 28 May 2018 15:24:22 +0200 The list_lock is used with IRQs off on RT. Make it a raw_spinlock_t otherwise the interrupts won't be disabled on -RT. The locking rules remain the same on !RT. This patch changes it for SLAB and SLUB since both share the same header file for the struct kmem_cache_node definition.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 97/194 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: mm/SLUB: delay giving back empty slubs to IRQ enabled regions Date: Thu, 21 Jun 2018 17:29:19 +0200 __free_slab() is invoked with disabled interrupts which increases the irq-off time while __free_pages() is doing the work. Allow __free_slab() to be invoked with enabled interrupts and move everything from interrupts-off invocations to a temporary per-CPU list so it can be processed later. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 98/194 [ Author: Kevin Hao Email: haokexin@gmail.com Subject: mm: slub: Always flush the delayed empty slubs in flush_all() Date: Mon, 4 May 2020 11:34:07 +0800 After commit f0b231101c94 ("mm/SLUB: delay giving back empty slubs to IRQ enabled regions"), when the free_slab() is invoked with the IRQ disabled, the empty slubs are moved to a per-CPU list and will be freed after IRQ enabled later. 
But in the current code, there is a check to see whether a specific CPU really has a cpu slub before flushing the delayed empty slubs. This may cause a reference to an already released kmem_cache in a scenario like the one below:
cpu 0                                   cpu 1
kmem_cache_destroy()
  flush_all()
                      --->IPI           flush_cpu_slab()
                                          flush_slab()
                                            deactivate_slab()
                                              discard_slab()
                                                free_slab()
                                            c->page = NULL;
    for_each_online_cpu(cpu)
      if (!has_cpu_slab(1, s))
        continue
      this skips flushing the delayed empty
      slub released by cpu1
  kmem_cache_free(kmem_cache, s)
                                        kmalloc()
                                          __slab_alloc()
                                            free_delayed()
                                              __free_slab()
                                              reference to released kmem_cache
Fixes: f0b231101c94 ("mm/SLUB: delay giving back empty slubs to IRQ enabled regions") Signed-off-by: Kevin Hao <haokexin@gmail.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: stable-rt@vger.kernel.org ] 99/194 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: mm/page_alloc: Use migrate_disable() in drain_local_pages_wq() Date: Thu, 2 Jul 2020 14:27:23 +0200 drain_local_pages_wq() disables preemption to avoid CPU migration during CPU hotplug. Using migrate_disable() makes the function preemptible on PREEMPT_RT but still avoids CPU migrations during CPU-hotplug. On !PREEMPT_RT it behaves like preempt_disable(). Use migrate_disable() in drain_local_pages_wq(). Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 100/194 [ Author: Ingo Molnar Email: mingo@elte.hu Subject: mm: page_alloc: rt-friendly per-cpu pages Date: Fri, 3 Jul 2009 08:29:37 -0500 rt-friendly per-cpu pages: convert the irqs-off per-cpu locking method into a preemptible, explicit-per-cpu-locks method.
Contains fixes from: Peter Zijlstra <a.p.zijlstra@chello.nl> Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 101/194 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: mm/slub: Make object_map_lock a raw_spinlock_t Date: Thu, 16 Jul 2020 18:47:50 +0200 The variable object_map is protected by object_map_lock. The lock is always acquired in debug code and within already atomic context Make object_map_lock a raw_spinlock_t. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 102/194 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: slub: Enable irqs for __GFP_WAIT Date: Wed, 9 Jan 2013 12:08:15 +0100 SYSTEM_RUNNING might be too late for enabling interrupts. Allocations with GFP_WAIT can happen before that. So use this as an indicator. [bigeasy: Add warning on RT for allocations in atomic context. Don't enable interrupts on allocations during SYSTEM_SUSPEND. 
This is done during suspend by ACPI, noticed by Liwei Song <liwei.song@windriver.com> ] Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 103/194 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: slub: Disable SLUB_CPU_PARTIAL Date: Wed, 15 Apr 2015 19:00:47 +0200 \|BUG: sleeping function called from invalid context at kernel/locking/rtmutex.c:915 \|in_atomic(): 1, irqs_disabled(): 0, pid: 87, name: rcuop/7 \|1 lock held by rcuop/7/87: \| #0: (rcu_callback){......}, at: [<ffffffff8112c76a>] rcu_nocb_kthread+0x1ca/0x5d0 \|Preemption disabled at:[<ffffffff811eebd9>] put_cpu_partial+0x29/0x220 \| \|CPU: 0 PID: 87 Comm: rcuop/7 Tainted: G W 4.0.0-rt0+ #477 \|Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.7.5-20140531_083030-gandalf 04/01/2014 \| 000000000007a9fc ffff88013987baf8 ffffffff817441c7 0000000000000007 \| 0000000000000000 ffff88013987bb18 ffffffff810eee51 0000000000000000 \| ffff88013fc10200 ffff88013987bb48 ffffffff8174a1c4 000000000007a9fc \|Call Trace: \| [<ffffffff817441c7>] dump_stack+0x4f/0x90 \| [<ffffffff810eee51>] ___might_sleep+0x121/0x1b0 \| [<ffffffff8174a1c4>] rt_spin_lock+0x24/0x60 \| [<ffffffff811a689a>] __free_pages_ok+0xaa/0x540 \| [<ffffffff811a729d>] __free_pages+0x1d/0x30 \| [<ffffffff811eddd5>] __free_slab+0xc5/0x1e0 \| [<ffffffff811edf46>] free_delayed+0x56/0x70 \| [<ffffffff811eecfd>] put_cpu_partial+0x14d/0x220 \| [<ffffffff811efc98>] __slab_free+0x158/0x2c0 \| [<ffffffff811f0021>] kmem_cache_free+0x221/0x2d0 \| [<ffffffff81204d0c>] file_free_rcu+0x2c/0x40 \| [<ffffffff8112c7e3>] rcu_nocb_kthread+0x243/0x5d0 \| [<ffffffff810e951c>] kthread+0xfc/0x120 \| [<ffffffff8174abc8>] ret_from_fork+0x58/0x90 Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 104/194 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: mm: memcontrol: Provide a local_lock for per-CPU memcg_stock Date: Tue, 18 Aug 2020 10:30:00 +0200 The interrupts are disabled to ensure CPU-local access 
to the per-CPU variable `memcg_stock'. As the code inside the interrupt disabled section acquires regular spinlocks, which are converted to 'sleeping' spinlocks on a PREEMPT_RT kernel, this conflicts with the RT semantics. Convert it to a local_lock which allows RT kernels to substitute them with a real per CPU lock. On non RT kernels this maps to local_irq_save() as before, but provides also lockdep coverage of the critical region. No functional change. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 105/194 [ Author: Yang Shi Email: yang.shi@windriver.com Subject: mm/memcontrol: Don't call schedule_work_on in preemption disabled context Date: Wed, 30 Oct 2013 11:48:33 -0700 The following trace is triggered when running ltp oom test cases: BUG: sleeping function called from invalid context at kernel/rtmutex.c:659 in_atomic(): 1, irqs_disabled(): 0, pid: 17188, name: oom03 Preemption disabled at:[<ffffffff8112ba70>] mem_cgroup_reclaim+0x90/0xe0 CPU: 2 PID: 17188 Comm: oom03 Not tainted 3.10.10-rt3 #2 Hardware name: Intel Corporation Calpella platform/MATXM-CORE-411-B, BIOS 4.6.3 08/18/2010 ffff88007684d730 ffff880070df9b58 ffffffff8169918d ffff880070df9b70 ffffffff8106db31 ffff88007688b4a0 ffff880070df9b88 ffffffff8169d9c0 ffff88007688b4a0 ffff880070df9bc8 ffffffff81059da1 0000000170df9bb0 Call Trace: [<ffffffff8169918d>] dump_stack+0x19/0x1b [<ffffffff8106db31>] __might_sleep+0xf1/0x170 [<ffffffff8169d9c0>] rt_spin_lock+0x20/0x50 [<ffffffff81059da1>] queue_work_on+0x61/0x100 [<ffffffff8112b361>] drain_all_stock+0xe1/0x1c0 [<ffffffff8112ba70>] mem_cgroup_reclaim+0x90/0xe0 [<ffffffff8112beda>] __mem_cgroup_try_charge+0x41a/0xc40 [<ffffffff810f1c91>] ? release_pages+0x1b1/0x1f0 [<ffffffff8106f200>] ? sched_exec+0x40/0xb0 [<ffffffff8112cc87>] mem_cgroup_charge_common+0x37/0x70 [<ffffffff8112e2c6>] mem_cgroup_newpage_charge+0x26/0x30 [<ffffffff8110af68>] handle_pte_fault+0x618/0x840 [<ffffffff8103ecf6>] ? 
unpin_current_cpu+0x16/0x70 [<ffffffff81070f94>] ? migrate_enable+0xd4/0x200 [<ffffffff8110cde5>] handle_mm_fault+0x145/0x1e0 [<ffffffff810301e1>] __do_page_fault+0x1a1/0x4c0 [<ffffffff8169c9eb>] ? preempt_schedule_irq+0x4b/0x70 [<ffffffff8169e3b7>] ? retint_kernel+0x37/0x40 [<ffffffff8103053e>] do_page_fault+0xe/0x10 [<ffffffff8169e4c2>] page_fault+0x22/0x30 So, to prevent schedule_work_on from being called in preempt disabled context, replace the pair of get/put_cpu() with get/put_cpu_light(). Signed-off-by: Yang Shi <yang.shi@windriver.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 106/194 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: mm/memcontrol: Replace local_irq_disable with local locks Date: Wed, 28 Jan 2015 17:14:16 +0100 There are a few local_irq_disable() which then take sleeping locks. This patch converts them to local locks. [bigeasy: Move unlock after memcg_check_events() in mem_cgroup_swapout(), pointed out by Matt Fleming <matt@codeblueprint.co.uk>] Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 107/194 [ Author: Mike Galbraith Email: umgwanakikbuti@gmail.com Subject: mm/zsmalloc: copy with get_cpu_var() and locking Date: Tue, 22 Mar 2016 11:16:09 +0100 get_cpu_var() disables preemption and triggers a might_sleep() splat later. This is replaced with get_locked_var(). The bit spinlocks are replaced with a proper mutex which requires a slightly larger struct to allocate. Signed-off-by: Mike Galbraith <umgwanakikbuti@gmail.com> [bigeasy: replace the bitspin_lock() with a mutex, get_locked_var(). Mike then fixed the size magic] Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 108/194 [ Author: Luis Claudio R. Goncalves Email: lgoncalv@redhat.com Subject: mm/zswap: Use local lock to protect per-CPU data Date: Tue, 25 Jun 2019 11:28:04 -0300 zswap uses per-CPU compression. The per-CPU data pointer is acquired with get_cpu_ptr() which implicitly disables preemption.
It allocates memory inside the preempt disabled region which conflicts with the PREEMPT_RT semantics. Replace the implicit preemption control with an explicit local lock. This allows RT kernels to substitute it with a real per CPU lock, which serializes the access but keeps the code section preemptible. On non RT kernels this maps to preempt_disable() as before, i.e. no functional change. [bigeasy: Use local_lock(), additional hunks, patch description] Cc: Seth Jennings <sjenning@redhat.com> Cc: Dan Streetman <ddstreet@ieee.org> Cc: Vitaly Wool <vitaly.wool@konsulko.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: linux-mm@kvack.org Signed-off-by: Luis Claudio R. Goncalves <lgoncalv@redhat.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 109/194 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: x86: kvm Require const tsc for RT Date: Sun, 6 Nov 2011 12:26:18 +0100 Non constant TSC is a nightmare on bare metal already, but with virtualization it becomes a complete disaster because the workarounds are horrible latency wise. That's also a preliminary for running RT in a guest on top of a RT host. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 110/194 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: wait.h: include atomic.h Date: Mon, 28 Oct 2013 12:19:57 +0100 \| CC init/main.o \|In file included from include/linux/mmzone.h:9:0, \| from include/linux/gfp.h:4, \| from include/linux/kmod.h:22, \| from include/linux/module.h:13, \| from init/main.c:15: \|include/linux/wait.h: In function ‘wait_on_atomic_t’: \|include/linux/wait.h:982:2: error: implicit declaration of function ‘atomic_read’ [-Werror=implicit-function-declaration] \| if (atomic_read(val) == 0) \| ^ This pops up on ARM. 
Non-RT gets its atomic.h include from spinlock.h. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 111/194 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: hrtimer: Allow raw wakeups during boot Date: Fri, 9 Aug 2019 15:25:21 +0200 There are a few wake-up timers during the early boot which are essential for the system to make progress. At this stage no softirq threads have been spawned for softirq processing, so there is no timer processing in softirq. The wakeup in question: smpboot_create_thread() -> kthread_create_on_cpu() -> kthread_bind() -> wait_task_inactive() -> schedule_hrtimeout() Let the timer fire in hardirq context during the system boot. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 112/194 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: sched: Limit the number of task migrations per batch Date: Mon, 6 Jun 2011 12:12:51 +0200 Put an upper limit on the number of tasks which are migrated per batch to avoid large latencies. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 113/194 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: sched: Move mmdrop to RCU on RT Date: Mon, 6 Jun 2011 12:20:33 +0200 Takes sleeping locks and calls into the memory allocator, so nothing we want to do in task switch and other atomic contexts. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 114/194 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: kernel/sched: move stack + kprobe clean up to __put_task_struct() Date: Mon, 21 Nov 2016 19:31:08 +0100 There is no need to free the stack before the task struct (except for reasons mentioned in commit 68f24b08ee89 ("sched/core: Free the stack early if CONFIG_THREAD_INFO_IN_TASK")). This also comes in handy on -RT because we can't free memory in a preempt disabled region. vfree_atomic() delays the memory cleanup to a worker. Since we move everything to the RCU callback, we can also free it immediately. 
Cc: stable-rt@vger.kernel.org #for kprobe_flush_task() Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 115/194 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: sched: Do not account rcu_preempt_depth on RT in might_sleep() Date: Tue, 7 Jun 2011 09:19:06 +0200 RT changes the rcu_preempt_depth semantics, so we cannot check for it in might_sleep(). Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 116/194 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: sched: Disable TTWU_QUEUE on RT Date: Tue, 13 Sep 2011 16:42:35 +0200 The queued remote wakeup mechanism can introduce rather large latencies if the number of migrated tasks is high. Disable it for RT. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 117/194 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: softirq: Check preemption after reenabling interrupts Date: Sun, 13 Nov 2011 17:17:09 +0100 raise_softirq_irqoff() disables interrupts and wakes the softirq daemon, but after reenabling interrupts there is no preemption check, so the execution of the softirq thread might be delayed arbitrarily. In principle we could add that check to local_irq_enable/restore, but that's overkill as the raise_softirq_irqoff() sections are the only ones which show this behaviour. Reported-by: Carsten Emde <cbe@osadl.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 118/194 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: softirq: Disable softirq stacks for RT Date: Mon, 18 Jul 2011 13:59:17 +0200 Disable extra stacks for softirqs. We want to preempt softirqs and having them on a special IRQ stack does not make this easier. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 119/194 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: net/core: use local_bh_disable() in netif_rx_ni() Date: Fri, 16 Jun 2017 19:03:16 +0200 In 2004 netif_rx_ni() gained a preempt_disable() section around netif_rx() and its do_softirq() + testing for it. 
The do_softirq() part is required because netif_rx() raises the softirq but does not invoke it. The preempt_disable() is required to remain on the same CPU which added the skb to the per-CPU list. All this can be avoided by putting this into a local_bh_disable()ed section. The local_bh_enable() part will invoke do_softirq() if required. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 120/194 [ Author: Grygorii Strashko Email: Grygorii.Strashko@linaro.org Subject: pid.h: include atomic.h Date: Tue, 21 Jul 2015 19:43:56 +0300 This patch fixes build error: CC kernel/pid_namespace.o In file included from kernel/pid_namespace.c:11:0: include/linux/pid.h: In function 'get_pid': include/linux/pid.h:78:3: error: implicit declaration of function 'atomic_inc' [-Werror=implicit-function-declaration] atomic_inc(&pid->count); ^ which happens when CONFIG_PROVE_LOCKING=n CONFIG_DEBUG_SPINLOCK=n CONFIG_DEBUG_MUTEXES=n CONFIG_DEBUG_LOCK_ALLOC=n CONFIG_PID_NS=y Vanilla gets this via spinlock.h. Signed-off-by: Grygorii Strashko <Grygorii.Strashko@linaro.org> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 121/194 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: ptrace: fix ptrace vs tasklist_lock race Date: Thu, 29 Aug 2013 18:21:04 +0200 As explained by Alexander Fyodorov <halcy@yandex.ru>: \|read_lock(&tasklist_lock) in ptrace_stop() is converted to mutex on RT kernel, \|and it can remove __TASK_TRACED from task->state (by moving it to \|task->saved_state). 
If parent does wait() on child followed by a sys_ptrace \|call, the following race can happen: \| \|- child sets __TASK_TRACED in ptrace_stop() \|- parent does wait() which eventually calls wait_task_stopped() and returns \| child's pid \|- child blocks on read_lock(&tasklist_lock) in ptrace_stop() and moves \| __TASK_TRACED flag to saved_state \|- parent calls sys_ptrace, which calls ptrace_check_attach() and wait_task_inactive() The patch is based on his initial patch where an additional check is added in case the __TASK_TRACED moved to ->saved_state. The pi_lock is taken in case the caller is interrupted between looking into ->state and ->saved_state. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 122/194 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: kernel/sched: add {put\|get}_cpu_light() Date: Sat, 27 May 2017 19:02:06 +0200 Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 123/194 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: trace: Add migrate-disabled counter to tracing output Date: Sun, 17 Jul 2011 21:56:42 +0200 Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 124/194 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: locking: don't check for __LINUX_SPINLOCK_TYPES_H on -RT archs Date: Fri, 4 Aug 2017 17:40:42 +0200 Upstream uses arch_spinlock_t within spinlock_t and requests that spinlock_types.h header file is included first. On -RT we have the rt_mutex with its raw_lock wait_lock which needs architectures' spinlock_types.h header file for its definition. However we need rt_mutex first because it is used to build the spinlock_t so that check does not work for us. Therefore I am dropping that check. 
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 125/194 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: locking: Make spinlock_t and rwlock_t a RCU section on RT Date: Tue, 19 Nov 2019 09:25:04 +0100 On !RT a locked spinlock_t and rwlock_t disable preemption which implies a RCU read section. There is code that relies on that behaviour. Add an explicit RCU read section on RT while a sleeping lock (a lock which would disable preemption on !RT) is acquired. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 126/194 [ Author: Scott Wood Email: swood@redhat.com Subject: rcu: Use rcuc threads on PREEMPT_RT as we did Date: Wed, 11 Sep 2019 17:57:28 +0100 While switching to the reworked RCU-thread code, it has been forgotten to enable the thread processing on -RT. Besides restoring behavior that used to be default on RT, this avoids a deadlock on scheduler locks. Signed-off-by: Scott Wood <swood@redhat.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 127/194 [ Author: Julia Cartwright Email: julia@ni.com Subject: rcu: enable rcu_normal_after_boot by default for RT Date: Wed, 12 Oct 2016 11:21:14 -0500 The forcing of an expedited grace period is an expensive and very RT-application unfriendly operation, as it forcibly preempts all running tasks on CPUs which are preventing the gp from expiring. By default, as a policy decision, disable the expediting of grace periods (after boot) on configurations which enable PREEMPT_RT. Suggested-by: Luiz Capitulino <lcapitulino@redhat.com> Acked-by: Paul E. McKenney <paulmck@linux.ibm.com> Signed-off-by: Julia Cartwright <julia@ni.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 128/194 [ Author: Scott Wood Email: swood@redhat.com Subject: rcutorture: Avoid problematic critical section nesting on RT Date: Wed, 11 Sep 2019 17:57:29 +0100 rcutorture was generating some nesting scenarios that are not reasonable. 
Constrain the state selection to avoid them. Example #1: 1. preempt_disable() 2. local_bh_disable() 3. preempt_enable() 4. local_bh_enable() On PREEMPT_RT, BH disabling takes a local lock only when called in non-atomic context. Thus, atomic context must be retained until after BH is re-enabled. Likewise, if BH is initially disabled in non-atomic context, it cannot be re-enabled in atomic context. Example #2: 1. rcu_read_lock() 2. local_irq_disable() 3. rcu_read_unlock() 4. local_irq_enable() If the thread is preempted between steps 1 and 2, rcu_read_unlock_special.b.blocked will be set, but it won't be acted on in step 3 because IRQs are disabled. Thus, reporting of the quiescent state will be delayed beyond the local_irq_enable(). For now, these scenarios will continue to be tested on non-PREEMPT_RT kernels, until debug checks are added to ensure that they are not happening elsewhere. Signed-off-by: Scott Wood <swood@redhat.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 129/194 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: mm/vmalloc: Another preempt disable region which sucks Date: Tue, 12 Jul 2011 11:39:36 +0200 Avoid the preempt disable version of get_cpu_var(). The inner-lock should provide enough serialisation. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 130/194 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: block/mq: do not invoke preempt_disable() Date: Tue, 14 Jul 2015 14:26:34 +0200 preempt_disable() and get_cpu() don't play well together with the sleeping locks it tries to allocate later. It seems to be enough to replace it with get_cpu_light() and migrate_disable(). 
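The get_cpu() to get_cpu_light() substitution described in this entry can be modelled in userspace. What follows is a minimal sketch under stated assumptions, not kernel code: the counters `preempt_depth` and `migrate_depth`, the fixed `current_cpu`, and the helpers `sleeping_lock_allowed()` and `pinned_section_may_sleep()` are hypothetical stand-ins, and the bodies of get_cpu_light()/put_cpu_light() only mirror what the series describes (pin the task by disabling migration instead of preemption).

```c
/* Userspace model only; none of this is the kernel implementation. */
#include <assert.h>

static int preempt_depth;    /* > 0 models atomic context: sleeping is a bug */
static int migrate_depth;    /* > 0 models "task is pinned to its CPU" */
static int current_cpu = 2;  /* hypothetical CPU number for the model */

static void preempt_disable(void) { preempt_depth++; }
static void preempt_enable(void)  { assert(preempt_depth > 0); preempt_depth--; }
static void migrate_disable(void) { migrate_depth++; }
static void migrate_enable(void)  { assert(migrate_depth > 0); migrate_depth--; }

/* get_cpu()/put_cpu(): stay on this CPU by disabling preemption. */
static int  get_cpu(void) { preempt_disable(); return current_cpu; }
static void put_cpu(void) { preempt_enable(); }

/* get_cpu_light()/put_cpu_light(): stay on this CPU by disabling migration only. */
static int  get_cpu_light(void) { migrate_disable(); return current_cpu; }
static void put_cpu_light(void) { migrate_enable(); }

/* Stand-in for the might_sleep() check a sleeping lock performs on RT. */
static int sleeping_lock_allowed(void) { return preempt_depth == 0; }

/* Returns 1 if a sleeping lock could be taken inside the pinned section. */
int pinned_section_may_sleep(int use_light)
{
    int ok;

    if (use_light) {
        (void)get_cpu_light();
        ok = sleeping_lock_allowed();
        put_cpu_light();
    } else {
        (void)get_cpu();
        ok = sleeping_lock_allowed();
        put_cpu();
    }
    return ok;
}
```

In this model pinned_section_may_sleep(0) reports that the get_cpu() section is atomic, while pinned_section_may_sleep(1) reports that the get_cpu_light() section stays preemptible, which is the property the patch relies on.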
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 131/194 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: md: raid5: Make raid5_percpu handling RT aware Date: Tue, 6 Apr 2010 16:51:31 +0200 __raid_run_ops() disables preemption with get_cpu() around the access to the raid5_percpu variables. That causes scheduling while atomic spews on RT. Serialize the access to the percpu data with a lock and keep the code preemptible. Reported-by: Udo van den Heuvel <udovdh@xs4all.nl> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Udo van den Heuvel <udovdh@xs4all.nl> ] 132/194 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: scsi/fcoe: Make RT aware. Date: Sat, 12 Nov 2011 14:00:48 +0100 Do not disable preemption while taking sleeping locks. All users look safe for migrate_disable() only. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 133/194 [ Author: Mike Galbraith Email: umgwanakikbuti@gmail.com Subject: sunrpc: Make svc_xprt_do_enqueue() use get_cpu_light() Date: Wed, 18 Feb 2015 16:05:28 +0100 \|BUG: sleeping function called from invalid context at kernel/locking/rtmutex.c:915 \|in_atomic(): 1, irqs_disabled(): 0, pid: 3194, name: rpc.nfsd \|Preemption disabled at:[<ffffffffa06bf0bb>] svc_xprt_received+0x4b/0xc0 [sunrpc] \|CPU: 6 PID: 3194 Comm: rpc.nfsd Not tainted 3.18.7-rt1 #9 \|Hardware name: MEDION MS-7848/MS-7848, BIOS M7848W08.404 11/06/2014 \| ffff880409630000 ffff8800d9a33c78 ffffffff815bdeb5 0000000000000002 \| 0000000000000000 ffff8800d9a33c98 ffffffff81073c86 ffff880408dd6008 \| ffff880408dd6000 ffff8800d9a33cb8 ffffffff815c3d84 ffff88040b3ac000 \|Call Trace: \| [<ffffffff815bdeb5>] dump_stack+0x4f/0x9e \| [<ffffffff81073c86>] __might_sleep+0xe6/0x150 \| [<ffffffff815c3d84>] rt_spin_lock+0x24/0x50 \| [<ffffffffa06beec0>] svc_xprt_do_enqueue+0x80/0x230 [sunrpc] \| [<ffffffffa06bf0bb>] svc_xprt_received+0x4b/0xc0 [sunrpc] \| [<ffffffffa06c03ed>] svc_add_new_perm_xprt+0x6d/0x80 [sunrpc] \| 
[<ffffffffa06b2693>] svc_addsock+0x143/0x200 [sunrpc] \| [<ffffffffa072e69c>] write_ports+0x28c/0x340 [nfsd] \| [<ffffffffa072d2ac>] nfsctl_transaction_write+0x4c/0x80 [nfsd] \| [<ffffffff8117ee83>] vfs_write+0xb3/0x1d0 \| [<ffffffff8117f889>] SyS_write+0x49/0xb0 \| [<ffffffff815c4556>] system_call_fastpath+0x16/0x1b Signed-off-by: Mike Galbraith <umgwanakikbuti@gmail.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 134/194 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: rt: Introduce cpu_chill() Date: Wed, 7 Mar 2012 20:51:03 +0100 Retry loops on RT might loop forever when the modifying side was preempted. Add cpu_chill() to replace cpu_relax(). cpu_chill() defaults to cpu_relax() for non RT. On RT it puts the looping task to sleep for a tick so the preempted task can make progress. Steven Rostedt changed it to use a hrtimer instead of msleep(): \| \|Ulrich Obergfell pointed out that cpu_chill() calls msleep() which is woken \|up by the ksoftirqd running the TIMER softirq. But as the cpu_chill() is \|called from softirq context, it may block the ksoftirqd() from running, in \|which case, it may never wake up the msleep() causing the deadlock. + bigeasy later changed to schedule_hrtimeout() \|If a task calls cpu_chill() and gets woken up by a regular or spurious \|wakeup and has a signal pending, then it exits the sleep loop in \|do_nanosleep() and sets up the restart block. If restart->nanosleep.type is \|not TI_NONE then this results in accessing a stale user pointer from a \|previously interrupted syscall and a copy to user based on the stale \|pointer or a BUG() when 'type' is not supported in nanosleep_copyout(). + bigeasy: add PF_NOFREEZE: \| [....] Waiting for /dev to be fully populated... \| ===================================== \| [ BUG: udevd/229 still has locks held! 
] \| 3.12.11-rt17 #23 Not tainted \| ------------------------------------- \| 1 lock held by udevd/229: \| #0: (&type->i_mutex_dir_key#2){+.+.+.}, at: lookup_slow+0x28/0x98 \| \| stack backtrace: \| CPU: 0 PID: 229 Comm: udevd Not tainted 3.12.11-rt17 #23 \| (unwind_backtrace+0x0/0xf8) from (show_stack+0x10/0x14) \| (show_stack+0x10/0x14) from (dump_stack+0x74/0xbc) \| (dump_stack+0x74/0xbc) from (do_nanosleep+0x120/0x160) \| (do_nanosleep+0x120/0x160) from (hrtimer_nanosleep+0x90/0x110) \| (hrtimer_nanosleep+0x90/0x110) from (cpu_chill+0x30/0x38) \| (cpu_chill+0x30/0x38) from (dentry_kill+0x158/0x1ec) \| (dentry_kill+0x158/0x1ec) from (dput+0x74/0x15c) \| (dput+0x74/0x15c) from (lookup_real+0x4c/0x50) \| (lookup_real+0x4c/0x50) from (__lookup_hash+0x34/0x44) \| (__lookup_hash+0x34/0x44) from (lookup_slow+0x38/0x98) \| (lookup_slow+0x38/0x98) from (path_lookupat+0x208/0x7fc) \| (path_lookupat+0x208/0x7fc) from (filename_lookup+0x20/0x60) \| (filename_lookup+0x20/0x60) from (user_path_at_empty+0x50/0x7c) \| (user_path_at_empty+0x50/0x7c) from (user_path_at+0x14/0x1c) \| (user_path_at+0x14/0x1c) from (vfs_fstatat+0x48/0x94) \| (vfs_fstatat+0x48/0x94) from (SyS_stat64+0x14/0x30) \| (SyS_stat64+0x14/0x30) from (ret_fast_syscall+0x0/0x48) Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 135/194 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: fs: namespace: Use cpu_chill() in trylock loops Date: Wed, 7 Mar 2012 21:00:34 +0100 Retry loops on RT might loop forever when the modifying side was preempted. Use cpu_chill() instead of cpu_relax() to let the system make progress. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 136/194 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: debugobjects: Make RT aware Date: Sun, 17 Jul 2011 21:41:35 +0200 Avoid filling the pool / allocating memory with irqs off(). 
Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 137/194 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: net: Use skbufhead with raw lock Date: Tue, 12 Jul 2011 15:38:34 +0200 Use the rps lock as rawlock so we can keep irq-off regions. It looks low latency. However we can't kfree() from this context therefore we defer this to the softirq and use the tofree_queue list for it (similar to process_queue). Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 138/194 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: net: Dequeue in dev_cpu_dead() without the lock Date: Wed, 16 Sep 2020 16:15:39 +0200 Upstream uses skb_dequeue() to acquire lock of `input_pkt_queue'. The reason is to synchronize against a remote CPU which still thinks that the CPU is online and enqueues packets to this CPU. There are no guarantees that the packet is enqueued before the callback is run, it is just a hope. RT however complains about an uninitialized lock because it uses another lock for `input_pkt_queue' due to the IRQ-off nature of the context. Use the unlocked dequeue version for `input_pkt_queue'. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 139/194 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: net: dev: always take qdisc's busylock in __dev_xmit_skb() Date: Wed, 30 Mar 2016 13:36:29 +0200 The root-lock is dropped before dev_hard_start_xmit() is invoked and after setting the __QDISC___STATE_RUNNING bit. If this task is now pushed away by a task with a higher priority then the task with the higher priority won't be able to submit packets to the NIC directly; instead they will be enqueued into the Qdisc. The NIC will remain idle until the task(s) with higher priority leave the CPU and the task with lower priority gets back and finishes the job. If we always take the busylock we ensure that the RT task can boost the low-prio task and submit the packet. 
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 140/194 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: irqwork: push most work into softirq context Date: Tue, 23 Jun 2015 15:32:51 +0200 Initially we deferred all irqwork into softirq because we didn't want the latency spikes if perf or another user was busy and delayed the RT task. The NOHZ trigger (nohz_full_kick_work) was the first user that did not work as expected if it did not run in the original irqwork context so we had to bring it back somehow for it. push_irq_work_func is the second one that requires this. This patch adds the IRQ_WORK_HARD_IRQ which makes sure the callback runs in raw-irq context. Everything else is deferred into softirq context. Without -RT we have the original behavior. This patch incorporates tglx's original work which revoked a little bringing back the arch_irq_work_raise() if possible and a few fixes from Steven Rostedt and Mike Galbraith, [bigeasy: melt tglx's irq_work_tick_soft() which splits irq_work_tick() into a hard and soft variant] Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 141/194 [ Author: Peter Zijlstra Email: peterz@infradead.org Subject: x86: crypto: Reduce preempt disabled regions Date: Mon, 14 Nov 2011 18:19:27 +0100 Restrict the preempt disabled regions to the actual floating point operations and enable preemption for the administrative actions. This is necessary on RT to avoid that kfree and other operations are called with preemption disabled. 
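The idea of restricting the preempt disabled region to the floating point work can be illustrated with a small userspace model. This is a sketch under stated assumptions, not the patch itself: fpu_begin()/fpu_end() are hypothetical stand-ins for kernel_fpu_begin()/kernel_fpu_end(), fpu_work() and admin_step() stand in for the cipher work and for bookkeeping such as freeing memory, and the counter only records how many administrative steps ran while the model was "atomic".

```c
/* Userspace model only; fpu_begin()/fpu_end() are hypothetical stand-ins
 * for kernel_fpu_begin()/kernel_fpu_end(). */
#include <assert.h>
#include <stddef.h>

static int fpu_depth;           /* > 0 models "preemption disabled" */
static int atomic_admin_steps;  /* administrative steps that ran while atomic */

static void fpu_begin(void) { fpu_depth++; }
static void fpu_end(void)   { assert(fpu_depth > 0); fpu_depth--; }

/* Stand-in for the SIMD work on one chunk; must run inside the FPU section. */
static void fpu_work(unsigned char *buf, size_t len)
{
    assert(fpu_depth > 0);
    for (size_t i = 0; i < len; i++)
        buf[i] ^= 0x5a;
}

/* Stand-in for administrative work (for example freeing memory). */
static void admin_step(void)
{
    if (fpu_depth > 0)
        atomic_admin_steps++;
}

/* Wide region: bookkeeping runs while the FPU section is still open.
 * chunk must be non-zero. Returns the number of atomic admin steps. */
int process_wide(unsigned char *buf, size_t len, size_t chunk)
{
    atomic_admin_steps = 0;
    fpu_begin();
    for (size_t off = 0; off < len; off += chunk) {
        size_t n = (len - off < chunk) ? len - off : chunk;
        fpu_work(buf + off, n);
        admin_step();
    }
    fpu_end();
    return atomic_admin_steps;
}

/* Narrow region: the FPU section covers only the floating point work.
 * chunk must be non-zero. Returns the number of atomic admin steps. */
int process_narrow(unsigned char *buf, size_t len, size_t chunk)
{
    atomic_admin_steps = 0;
    for (size_t off = 0; off < len; off += chunk) {
        size_t n = (len - off < chunk) ? len - off : chunk;
        fpu_begin();
        fpu_work(buf + off, n);
        fpu_end();
        admin_step();
    }
    return atomic_admin_steps;
}
```

The narrow variant opens and closes the FPU section once per chunk, which costs more state switching but keeps every administrative step preemptible; that is the trade-off the crypto entries in this series describe.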
Reported-and-tested-by: Carsten Emde <cbe@osadl.org> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 142/194 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: crypto: Reduce preempt disabled regions, more algos Date: Fri, 21 Feb 2014 17:24:04 +0100 Don Estabrook reported \| kernel: WARNING: CPU: 2 PID: 858 at kernel/sched/core.c:2428 migrate_disable+0xed/0x100() \| kernel: WARNING: CPU: 2 PID: 858 at kernel/sched/core.c:2462 migrate_enable+0x17b/0x200() \| kernel: WARNING: CPU: 3 PID: 865 at kernel/sched/core.c:2428 migrate_disable+0xed/0x100() and his backtrace showed some crypto functions which looked fine. The problem is the following sequence: glue_xts_crypt_128bit() { blkcipher_walk_virt(); /* normal migrate_disable() */ glue_fpu_begin(); /* get atomic */ while (nbytes) { __glue_xts_crypt_128bit(); blkcipher_walk_done(); /* with nbytes = 0, migrate_enable() while we are atomic */ }; glue_fpu_end() /* no longer atomic */ } and this is why the counter gets out of sync and the warning is printed. The other problem is that we are non-preemptible between glue_fpu_begin() and glue_fpu_end() and the latency grows. To fix this, I shorten the FPU off region and ensure blkcipher_walk_done() is called with preemption enabled. This might hurt the performance because we now enable/disable the FPU state more often but we gain lower latency and the bug is gone. Reported-by: Don Estabrook <don.estabrook@gmail.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 143/194 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: crypto: limit more FPU-enabled sections Date: Thu, 30 Nov 2017 13:40:10 +0100 Those crypto drivers use SSE/AVX/… for their crypto work and in order to do so in kernel they need to enable the "FPU" in kernel mode which disables preemption. 
There are two problems with the way they are used: - the while loop which processes X bytes may create latency spikes and should be avoided or limited. - the cipher-walk-next part may allocate/free memory and may use kmap_atomic(). The whole kernel_fpu_begin()/end() processing probably isn't that cheap. It most likely makes sense to process as much of those as possible in one go. The new _fpu_sched_rt() schedules only if a RT task is pending. Probably we should measure the performance of those ciphers in pure SW mode and with these optimisations to see if it makes sense to keep them for RT. This kernel_fpu_resched() makes the code more preemptible which might hurt performance. Cc: stable-rt@vger.kernel.org Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 144/194 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: crypto: cryptd - add a lock instead preempt_disable/local_bh_disable Date: Thu, 26 Jul 2018 18:52:00 +0200 cryptd has a per-CPU lock which is protected with local_bh_disable() and preempt_disable(). Add an explicit spin_lock to make the locking context more obvious and visible to lockdep. Since it is a per-CPU lock, there should be no lock contention on the actual spinlock. There is a small race-window where we could be migrated to another CPU after the cpu_queue has been obtained. This is not a problem because the actual resource is protected by the spinlock. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 145/194 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: panic: skip get_random_bytes for RT_FULL in init_oops_id Date: Tue, 14 Jul 2015 14:26:34 +0200 Disable on -RT. If this is invoked from irq-context we will have problems acquiring the sleeping lock. 
Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 146/194 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: x86: stackprotector: Avoid random pool on rt Date: Thu, 16 Dec 2010 14:25:18 +0100 CPU bringup calls into the random pool to initialize the stack canary. During boot that works nicely even on RT as the might sleep checks are disabled. During CPU hotplug the might sleep checks trigger. Making the locks in random raw is a major PITA, so avoiding the call on RT is the only sensible solution. This is basically the same randomness which we get during boot where the random pool has no entropy and we rely on the TSC randomness. Reported-by: Carsten Emde <carsten.emde@osadl.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 147/194 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: random: Make it work on rt Date: Tue, 21 Aug 2012 20:38:50 +0200 Delegate the random insertion to the forced threaded interrupt handler. Store the return IP of the hard interrupt handler in the irq descriptor and feed it into the random generator as a source of entropy. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 148/194 [ Author: Priyanka Jain Email: Priyanka.Jain@freescale.com Subject: net: Remove preemption disabling in netif_rx() Date: Thu, 17 May 2012 09:35:11 +0530 1)enqueue_to_backlog() (called from netif_rx) should be bound to a particular CPU. This can be achieved by disabling migration. No need to disable preemption 2)Fixes crash "BUG: scheduling while atomic: ksoftirqd" in case of RT. If preemption is disabled, enqueue_to_backlog() is called in atomic context. And if backlog exceeds its count, kfree_skb() is called. But in RT, kfree_skb() might get scheduled out, so it expects non atomic context. 3)When CONFIG_PREEMPT_RT is not defined, migrate_enable(), migrate_disable() map to preempt_enable() and preempt_disable(), so no change in functionality in case of non-RT. 
-Replace preempt_enable(), preempt_disable() with migrate_enable(), migrate_disable() respectively -Replace get_cpu(), put_cpu() with get_cpu_light(), put_cpu_light() respectively Signed-off-by: Priyanka Jain <Priyanka.Jain@freescale.com> Acked-by: Rajan Srivastava <Rajan.Srivastava@freescale.com> Cc: <rostedt@goodmis.orgn> Link: http://lkml.kernel.org/r/1337227511-2271-1-git-send-email-Priyanka.Jain@freescale.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 149/194 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: lockdep: Make it RT aware Date: Sun, 17 Jul 2011 18:51:23 +0200 teach lockdep that we don't really do softirqs on -RT. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 150/194 [ Author: Yong Zhang Email: yong.zhang@windriver.com Subject: lockdep: selftest: Only do hardirq context test for raw spinlock Date: Mon, 16 Apr 2012 15:01:56 +0800 On -rt there is no softirq context any more and rwlock is sleepable, disable softirq context test and rwlock+irq test. Signed-off-by: Yong Zhang <yong.zhang0@gmail.com> Cc: Yong Zhang <yong.zhang@windriver.com> Link: http://lkml.kernel.org/r/1334559716-18447-3-git-send-email-yong.zhang0@gmail.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 151/194 [ Author: Josh Cartwright Email: josh.cartwright@ni.com Subject: lockdep: selftest: fix warnings due to missing PREEMPT_RT conditionals Date: Wed, 28 Jan 2015 13:08:45 -0600 "lockdep: Selftest: Only do hardirq context test for raw spinlock" disabled the execution of certain tests with PREEMPT_RT, but did not prevent the tests from still being defined. 
This leads to warnings like: ./linux/lib/locking-selftest.c:574:1: warning: 'irqsafe1_hard_rlock_12' defined but not used [-Wunused-function] ./linux/lib/locking-selftest.c:574:1: warning: 'irqsafe1_hard_rlock_21' defined but not used [-Wunused-function] ./linux/lib/locking-selftest.c:577:1: warning: 'irqsafe1_hard_wlock_12' defined but not used [-Wunused-function] ./linux/lib/locking-selftest.c:577:1: warning: 'irqsafe1_hard_wlock_21' defined but not used [-Wunused-function] ./linux/lib/locking-selftest.c:580:1: warning: 'irqsafe1_soft_spin_12' defined but not used [-Wunused-function] ... Fixed by wrapping the test definitions in #ifndef CONFIG_PREEMPT_RT conditionals. Signed-off-by: Josh Cartwright <josh.cartwright@ni.com> Signed-off-by: Xander Huff <xander.huff@ni.com> Acked-by: Gratian Crisan <gratian.crisan@ni.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 152/194 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: lockdep: disable self-test Date: Tue, 17 Oct 2017 16:36:18 +0200 The self-test wasn't always 100% accurate for RT. We disabled a few tests which failed because they had a different semantic for RT. Some still reported false positives. Now the selftest locks up the system during boot and it needs to be investigated… Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 153/194 [ Author: Mike Galbraith Email: umgwanakikbuti@gmail.com Subject: drm,radeon,i915: Use preempt_disable/enable_rt() where recommended Date: Sat, 27 Feb 2016 08:09:11 +0100 DRM folks identified the spots, so use them. 
Signed-off-by: Mike Galbraith <umgwanakikbuti@gmail.com> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: linux-rt-users <linux-rt-users@vger.kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 154/194 [ Author: Mike Galbraith Email: umgwanakikbuti@gmail.com Subject: drm/i915: Don't disable interrupts on PREEMPT_RT during atomic updates Date: Sat, 27 Feb 2016 09:01:42 +0100 Commit 8d7849db3eab7 ("drm/i915: Make sprite updates atomic") started disabling interrupts across atomic updates. This breaks on PREEMPT_RT because within this section the code attempts to acquire spinlock_t locks which are sleeping locks on PREEMPT_RT. According to the comment the interrupts are disabled to avoid random delays and not required for protection or synchronisation. Don't disable interrupts on PREEMPT_RT during atomic updates. [bigeasy: drop local locks, commit message] Signed-off-by: Mike Galbraith <umgwanakikbuti@gmail.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 155/194 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: drm/i915: disable tracing on -RT Date: Thu, 6 Dec 2018 09:52:20 +0100 Luca Abeni reported this: \| BUG: scheduling while atomic: kworker/u8:2/15203/0x00000003 \| CPU: 1 PID: 15203 Comm: kworker/u8:2 Not tainted 4.19.1-rt3 #10 \| Call Trace: \| rt_spin_lock+0x3f/0x50 \| gen6_read32+0x45/0x1d0 [i915] \| g4x_get_vblank_counter+0x36/0x40 [i915] \| trace_event_raw_event_i915_pipe_update_start+0x7d/0xf0 [i915] The tracing events, trace_i915_pipe_update_start() among others, use functions which acquire spin locks. A few trace points use intel_get_crtc_scanline(), others use ->get_vblank_counter() which also might acquire a sleeping lock. Based on this I don't see any other way than to disable trace points on RT. 
Cc: stable-rt@vger.kernel.org Reported-by: Luca Abeni <lucabe72@gmail.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 156/194 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: drm/i915: skip DRM_I915_LOW_LEVEL_TRACEPOINTS with NOTRACE Date: Wed, 19 Dec 2018 10:47:02 +0100 The order of the header files is important. If this header file is included after tracepoint.h was included then the NOTRACE here becomes a nop. Currently this happens for two .c files which use the tracepoints behind DRM_I915_LOW_LEVEL_TRACEPOINTS. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 157/194 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: drm/i915/gt: Only disable interrupts for the timeline lock on !force-threaded Date: Tue, 7 Jul 2020 12:25:11 +0200 According to commit d67739268cf0e ("drm/i915/gt: Mark up the nested engine-pm timeline lock as irqsafe") the interrupts are disabled because the code may be called from an interrupt handler and from preemptible context. With `force_irqthreads' set the timeline mutex is never observed in IRQ context so it is not needed to disable interrupts. Only disable interrupts if not in `force_irqthreads' mode. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 158/194 [ Author: Mike Galbraith Email: efault@gmx.de Subject: cpuset: Convert callback_lock to raw_spinlock_t Date: Sun, 8 Jan 2017 09:32:25 +0100 The two commits below add up to a cpuset might_sleep() splat for RT: 8447a0fee974 cpuset: convert callback_mutex to a spinlock 344736f29b35 cpuset: simplify cpuset_node_allowed API BUG: sleeping function called from invalid context at kernel/locking/rtmutex.c:995 in_atomic(): 0, irqs_disabled(): 1, pid: 11718, name: cset CPU: 135 PID: 11718 Comm: cset Tainted: G E 4.10.0-rt1-rt #4 Hardware name: Intel Corporation BRICKLAND/BRICKLAND, BIOS BRHSXSD1.86B.0056.R01.1409242327 09/24/2014 Call Trace: ? dump_stack+0x5c/0x81 ? ___might_sleep+0xf4/0x170 ?
rt_spin_lock+0x1c/0x50 ? __cpuset_node_allowed+0x66/0xc0 ? ___slab_alloc+0x390/0x570 <disables IRQs> ? anon_vma_fork+0x8f/0x140 ? copy_page_range+0x6cf/0xb00 ? anon_vma_fork+0x8f/0x140 ? __slab_alloc.isra.74+0x5a/0x81 ? anon_vma_fork+0x8f/0x140 ? kmem_cache_alloc+0x1b5/0x1f0 ? anon_vma_fork+0x8f/0x140 ? copy_process.part.35+0x1670/0x1ee0 ? _do_fork+0xdd/0x3f0 ? _do_fork+0xdd/0x3f0 ? do_syscall_64+0x61/0x170 ? entry_SYSCALL64_slow_path+0x25/0x25 The later ensured that a NUMA box WILL take callback_lock in atomic context by removing the allocator and reclaim path __GFP_HARDWALL usage which prevented such contexts from taking callback_mutex. One option would be to reinstate __GFP_HARDWALL protections for RT, however, as the 8447a0fee974 changelog states: The callback_mutex is only used to synchronize reads/updates of cpusets' flags and cpu/node masks. These operations should always proceed fast so there's no reason why we can't use a spinlock instead of the mutex. Cc: stable-rt@vger.kernel.org Signed-off-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 159/194 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: x86: Allow to enable RT Date: Wed, 7 Aug 2019 18:15:38 +0200 Allow to select RT. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 160/194 [ Author: Peter Zijlstra Email: peterz@infradead.org Subject: mm, rt: kmap_atomic scheduling Date: Thu, 28 Jul 2011 10:43:51 +0200 In fact, with migrate_disable() existing one could play games with kmap_atomic. You could save/restore the kmap_atomic slots on context switch (if there are any in use of course), this should be esp easy now that we have a kmap_atomic stack. Something like the below.. it wants replacing all the preempt_disable() stuff with pagefault_disable() && migrate_disable() of course, but then you can flip kmaps around like below. 
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> [dvhart@linux.intel.com: build fix] Link: http://lkml.kernel.org/r/1311842631.5890.208.camel@twins [tglx@linutronix.de: Get rid of the per cpu variable and store the idx and the pte content right away in the task struct. Shortens the context switch code. ] ] 161/194 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: x86/highmem: Add a "already used pte" check Date: Mon, 11 Mar 2013 17:09:55 +0100 This is a copy from kmap_atomic_prot(). Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 162/194 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: arm/highmem: Flush tlb on unmap Date: Mon, 11 Mar 2013 21:37:27 +0100 The tlb should be flushed on unmap and thus make the mapping entry invalid. This is only done in the non-debug case which does not look right. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 163/194 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: arm: Enable highmem for rt Date: Wed, 13 Feb 2013 11:03:11 +0100 fixup highmem for ARM. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 164/194 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: mm/scatterlist: Do not disable irqs on RT Date: Fri, 3 Jul 2009 08:44:34 -0500 For -RT it is enough to keep pagefault disabled (which is currently handled by kmap_atomic()). Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 165/194 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: sched: Add support for lazy preemption Date: Fri, 26 Oct 2012 18:50:54 +0100 It has become an obsession to mitigate the determinism vs. throughput loss of RT. Looking at the mainline semantics of preemption points gives a hint why RT sucks throughput wise for ordinary SCHED_OTHER tasks. 
One major issue is the wakeup of tasks which are right away preempting the waking task while the waking task holds a lock on which the woken task will block right after having preempted the wakee. In mainline this is prevented due to the implicit preemption disable of spin/rw_lock held regions. On RT this is not possible due to the fully preemptible nature of sleeping spinlocks. Though for a SCHED_OTHER task preempting another SCHED_OTHER task this is really not a correctness issue. RT folks are concerned about SCHED_FIFO/RR tasks preemption and not about the purely fairness driven SCHED_OTHER preemption latencies. So I introduced a lazy preemption mechanism which only applies to SCHED_OTHER tasks preempting another SCHED_OTHER task. Aside of the existing preempt_count each tasks sports now a preempt_lazy_count which is manipulated on lock acquiry and release. This is slightly incorrect as for lazyness reasons I coupled this on migrate_disable/enable so some other mechanisms get the same treatment (e.g. get_cpu_light). Now on the scheduler side instead of setting NEED_RESCHED this sets NEED_RESCHED_LAZY in case of a SCHED_OTHER/SCHED_OTHER preemption and therefor allows to exit the waking task the lock held region before the woken task preempts. That also works better for cross CPU wakeups as the other side can stay in the adaptive spinning loop. For RT class preemption there is no change. This simply sets NEED_RESCHED and forgoes the lazy preemption counter. Initial test do not expose any observable latency increasement, but history shows that I've been proven wrong before :) The lazy preemption mode is per default on, but with CONFIG_SCHED_DEBUG enabled it can be disabled via: # echo NO_PREEMPT_LAZY >/sys/kernel/debug/sched_features and reenabled via # echo PREEMPT_LAZY >/sys/kernel/debug/sched_features The test results so far are very machine and workload dependent, but there is a clear trend that it enhances the non RT workload performance. 
Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 166/194 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: x86/entry: Use should_resched() in idtentry_exit_cond_resched() Date: Tue, 30 Jun 2020 11:45:14 +0200 The TIF_NEED_RESCHED bit is inlined on x86 into the preemption counter. By using should_resched(0) instead of need_resched() the same check can be performed which uses the same variable as 'preempt_count()' which was issued before. Use should_resched(0) instead of need_resched(). Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 167/194 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: x86: Support for lazy preemption Date: Thu, 1 Nov 2012 11:03:47 +0100 Implement the x86 pieces for lazy preempt. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 168/194 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: arm: Add support for lazy preemption Date: Wed, 31 Oct 2012 12:04:11 +0100 Implement the arm pieces for lazy preempt. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 169/194 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: powerpc: Add support for lazy preemption Date: Thu, 1 Nov 2012 10:14:11 +0100 Implement the powerpc pieces for lazy preempt. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 170/194 [ Author: Anders Roxell Email: anders.roxell@linaro.org Subject: arch/arm64: Add lazy preempt support Date: Thu, 14 May 2015 17:52:17 +0200 arm64 is missing support for PREEMPT_RT. The main feature which is lacking is support for lazy preemption. The arch-specific entry code, thread information structure definitions, and associated data tables have to be extended to provide this support. Then the Kconfig file has to be extended to indicate the support is available, and also to indicate that support for full RT preemption is now available.
Signed-off-by: Anders Roxell <anders.roxell@linaro.org> ] 171/194 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: jump-label: disable if stop_machine() is used Date: Wed, 8 Jul 2015 17:14:48 +0200 Some architectures are using stop_machine() while switching the opcode which leads to latency spikes. The architectures which use stop_machine() atm: - ARM stop machine - s390 stop machine The architectures which use other sorcery: - MIPS - X86 - powerpc - sparc - arm64 Signed-off-by: Thomas Gleixner <tglx@linutronix.de> [bigeasy: only ARM for now] Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 172/194 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: leds: trigger: disable CPU trigger on -RT Date: Thu, 23 Jan 2014 14:45:59 +0100 as it triggers: \|CPU: 0 PID: 0 Comm: swapper Not tainted 3.12.8-rt10 #141 \|[<c0014aa4>] (unwind_backtrace+0x0/0xf8) from [<c0012788>] (show_stack+0x1c/0x20) \|[<c0012788>] (show_stack+0x1c/0x20) from [<c043c8dc>] (dump_stack+0x20/0x2c) \|[<c043c8dc>] (dump_stack+0x20/0x2c) from [<c004c5e8>] (__might_sleep+0x13c/0x170) \|[<c004c5e8>] (__might_sleep+0x13c/0x170) from [<c043f270>] (__rt_spin_lock+0x28/0x38) \|[<c043f270>] (__rt_spin_lock+0x28/0x38) from [<c043fa00>] (rt_read_lock+0x68/0x7c) \|[<c043fa00>] (rt_read_lock+0x68/0x7c) from [<c036cf74>] (led_trigger_event+0x2c/0x5c) \|[<c036cf74>] (led_trigger_event+0x2c/0x5c) from [<c036e0bc>] (ledtrig_cpu+0x54/0x5c) \|[<c036e0bc>] (ledtrig_cpu+0x54/0x5c) from [<c000ffd8>] (arch_cpu_idle_exit+0x18/0x1c) \|[<c000ffd8>] (arch_cpu_idle_exit+0x18/0x1c) from [<c00590b8>] (cpu_startup_entry+0xa8/0x234) \|[<c00590b8>] (cpu_startup_entry+0xa8/0x234) from [<c043b2cc>] (rest_init+0xb8/0xe0) \|[<c043b2cc>] (rest_init+0xb8/0xe0) from [<c061ebe0>] (start_kernel+0x2c4/0x380) Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 173/194 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: tty/serial/omap: Make the locking RT aware Date: Thu,
28 Jul 2011 13:32:57 +0200 The lock is a sleeping lock and local_irq_save() is not the optimisation we are looking for. Redo it to make it work on -RT and non-RT. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 174/194 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: tty/serial/pl011: Make the locking work on RT Date: Tue, 8 Jan 2013 21:36:51 +0100 The lock is a sleeping lock and local_irq_save() is not the optimisation we are looking for. Redo it to make it work on -RT and non-RT. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 175/194 [ Author: Yadi.hu Email: yadi.hu@windriver.com Subject: ARM: enable irq in translation/section permission fault handlers Date: Wed, 10 Dec 2014 10:32:09 +0800 Probably happens on all ARM, with CONFIG_PREEMPT_RT CONFIG_DEBUG_ATOMIC_SLEEP This simple program.... int main() { *((char *)0xc0001000) = 0; }; [ 512.742724] BUG: sleeping function called from invalid context at kernel/rtmutex.c:658 [ 512.743000] in_atomic(): 0, irqs_disabled(): 128, pid: 994, name: a [ 512.743217] INFO: lockdep is turned off.
[ 512.743360] irq event stamp: 0 [ 512.743482] hardirqs last enabled at (0): [< (null)>] (null) [ 512.743714] hardirqs last disabled at (0): [<c0426370>] copy_process+0x3b0/0x11c0 [ 512.744013] softirqs last enabled at (0): [<c0426370>] copy_process+0x3b0/0x11c0 [ 512.744303] softirqs last disabled at (0): [< (null)>] (null) [ 512.744631] [<c041872c>] (unwind_backtrace+0x0/0x104) [ 512.745001] [<c09af0c4>] (dump_stack+0x20/0x24) [ 512.745355] [<c0462490>] (__might_sleep+0x1dc/0x1e0) [ 512.745717] [<c09b6770>] (rt_spin_lock+0x34/0x6c) [ 512.746073] [<c0441bf0>] (do_force_sig_info+0x34/0xf0) [ 512.746457] [<c0442668>] (force_sig_info+0x18/0x1c) [ 512.746829] [<c041d880>] (__do_user_fault+0x9c/0xd8) [ 512.747185] [<c041d938>] (do_bad_area+0x7c/0x94) [ 512.747536] [<c041d990>] (do_sect_fault+0x40/0x48) [ 512.747898] [<c040841c>] (do_DataAbort+0x40/0xa0) [ 512.748181] Exception stack(0xecaa1fb0 to 0xecaa1ff8) 0xc0000000 belongs to the kernel address space, so a user task must not be allowed to access it. In this situation the correct result is that the test case receives a segmentation fault and exits, not that it produces the stack trace above. The root cause is commit 02fe2845d6a8 ("avoid enabling interrupts in prefetch/data abort handlers"), which deletes the irq enable block in the Data abort assembly code and moves it into the page/breakpoint/alignment fault handlers instead. But the author does not enable irq in the translation/section permission fault handlers. ARM disables irq when it enters exception/interrupt mode; if the kernel doesn't enable irq, it stays disabled during a translation/section permission fault. We see the above splat because do_force_sig_info is still called with IRQs off, and that code eventually does a: spin_lock_irqsave(&t->sighand->siglock, flags); As this is architecture independent code, and we've not seen any other need for other arch to have the siglock converted to raw lock, we can conclude that we should enable irq for ARM translation/section permission exception.
Signed-off-by: Yadi.hu <yadi.hu@windriver.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 176/194 [ Author: Josh Cartwright Email: joshc@ni.com Subject: genirq: update irq_set_irqchip_state documentation Date: Thu, 11 Feb 2016 11:54:00 -0600 On -rt kernels, the use of migrate_disable()/migrate_enable() is sufficient to guarantee a task isn't moved to another CPU. Update the irq_set_irqchip_state() documentation to reflect this. Signed-off-by: Josh Cartwright <joshc@ni.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 177/194 [ Author: Josh Cartwright Email: joshc@ni.com Subject: KVM: arm/arm64: downgrade preempt_disable()d region to migrate_disable() Date: Thu, 11 Feb 2016 11:54:01 -0600 kvm_arch_vcpu_ioctl_run() disables the use of preemption when updating the vgic and timer states to prevent the calling task from migrating to another CPU. It does so to prevent the task from writing to the incorrect per-CPU GIC distributor registers. On -rt kernels, it's possible to maintain the same guarantee with the use of migrate_{disable,enable}(), with the added benefit that the migrate-disabled region is preemptible. Update kvm_arch_vcpu_ioctl_run() to do so. Cc: Christoffer Dall <christoffer.dall@linaro.org> Reported-by: Manish Jaggi <Manish.Jaggi@caviumnetworks.com> Signed-off-by: Josh Cartwright <joshc@ni.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 178/194 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: arm64: fpsimd: Delay freeing memory in fpsimd_flush_thread() Date: Wed, 25 Jul 2018 14:02:38 +0200 fpsimd_flush_thread() invokes kfree() via sve_free() within a preempt disabled section which is not working on -RT. Delay freeing of memory until preemption is enabled again. 
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 179/194 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: x86: Enable RT also on 32bit Date: Thu, 7 Nov 2019 17:49:20 +0100 Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 180/194 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: ARM: Allow to enable RT Date: Fri, 11 Oct 2019 13:14:29 +0200 Allow to select RT. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 181/194 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: ARM64: Allow to enable RT Date: Fri, 11 Oct 2019 13:14:35 +0200 Allow to select RT. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 182/194 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: powerpc/pseries/iommu: Use a locallock instead local_irq_save() Date: Tue, 26 Mar 2019 18:31:54 +0100 The locallock protects the per-CPU variable tce_page. The function attempts to allocate memory while tce_page is protected (by disabling interrupts). Use local_irq_save() instead of local_irq_disable(). Cc: stable-rt@vger.kernel.org Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 183/194 [ Author: Bogdan Purcareata Email: bogdan.purcareata@freescale.com Subject: powerpc/kvm: Disable in-kernel MPIC emulation for PREEMPT_RT Date: Fri, 24 Apr 2015 15:53:13 +0000 While converting the openpic emulation code to use a raw_spinlock_t enables guests to run on RT, there's still a performance issue. For interrupts sent in directed delivery mode with a multiple CPU mask, the emulated openpic will loop through all of the VCPUs, and for each VCPUs, it call IRQ_check, which will loop through all the pending interrupts for that VCPU. This is done while holding the raw_lock, meaning that in all this time the interrupts and preemption are disabled on the host Linux. A malicious user app can max both these number and cause a DoS. 
This temporary fix is sent for two reasons. First is so that users who want to use the in-kernel MPIC emulation are aware of the potential latencies, thus making sure that the hardware MPIC and their usage scenario does not involve interrupts sent in directed delivery mode, and the number of possible pending interrupts is kept small. Secondly, this should incentivize the development of a proper openpic emulation that would be better suited for RT. Acked-by: Scott Wood <scottwood@freescale.com> Signed-off-by: Bogdan Purcareata <bogdan.purcareata@freescale.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 184/194 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: powerpc: Disable highmem on RT Date: Mon, 18 Jul 2011 17:08:34 +0200 The current highmem handling on -RT is not compatible and needs fixups. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 185/194 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: powerpc/stackprotector: work around stack-guard init from atomic Date: Tue, 26 Mar 2019 18:31:29 +0100 This is invoked from the secondary CPU in atomic context. On x86 we use tsc instead. On Power we XOR it against mftb() so lets use stack address as the initial value. Cc: stable-rt@vger.kernel.org Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 186/194 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: POWERPC: Allow to enable RT Date: Fri, 11 Oct 2019 13:14:41 +0200 Allow to select RT. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 187/194 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: mips: Disable highmem on RT Date: Mon, 18 Jul 2011 17:10:12 +0200 The current highmem handling on -RT is not compatible and needs fixups. 
Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 188/194 [ Author: Mike Galbraith Email: umgwanakikbuti@gmail.com Subject: drivers/block/zram: Replace bit spinlocks with rtmutex for -rt Date: Thu, 31 Mar 2016 04:08:28 +0200 They're nondeterministic, and lead to ___might_sleep() splats in -rt. OTOH, they're a lot less wasteful than an rtmutex per page. Signed-off-by: Mike Galbraith <umgwanakikbuti@gmail.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 189/194 [ Author: Haris Okanovic Email: haris.okanovic@ni.com Subject: tpm_tis: fix stall after iowrite()s Date: Tue, 15 Aug 2017 15:13:08 -0500 ioread8() operations to TPM MMIO addresses can stall the cpu when immediately following a sequence of iowrite()s to the same region. For example, cyclictest measures ~400us latency spikes when a non-RT usermode application communicates with an SPI-based TPM chip (Intel Atom E3940 system, PREEMPT_RT kernel). The spikes are caused by a stalling ioread8() operation following a sequence of 30+ iowrite8()s to the same address. I believe this happens because the write sequence is buffered (in cpu or somewhere along the bus), and gets flushed on the first LOAD instruction (ioread()) that follows. The enclosed change appears to fix this issue: read the TPM chip's access register (status code) after every iowrite() operation to amortize the cost of flushing data to chip across multiple instructions. Signed-off-by: Haris Okanovic <haris.okanovic@ni.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 190/194 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: signals: Allow rt tasks to cache one sigqueue struct Date: Fri, 3 Jul 2009 08:44:56 -0500 To avoid allocation allow rt tasks to cache one sigqueue struct in task struct.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 191/194 [ Author: Matt Fleming Email: matt@codeblueprint.co.uk Subject: signal: Prevent double-free of user struct Date: Tue, 7 Apr 2020 10:54:13 +0100 The way user struct reference counting works changed significantly with, fda31c50292a ("signal: avoid double atomic counter increments for user accounting") Now user structs are only freed once the last pending signal is dequeued. Make sigqueue_free_current() follow this new convention to avoid freeing the user struct multiple times and triggering this warning: refcount_t: underflow; use-after-free. WARNING: CPU: 0 PID: 6794 at lib/refcount.c:288 refcount_dec_not_one+0x45/0x50 Call Trace: refcount_dec_and_lock_irqsave+0x16/0x60 free_uid+0x31/0xa0 __dequeue_signal+0x17c/0x190 dequeue_signal+0x5a/0x1b0 do_sigtimedwait+0x208/0x250 __x64_sys_rt_sigtimedwait+0x6f/0xd0 do_syscall_64+0x72/0x200 entry_SYSCALL_64_after_hwframe+0x49/0xbe Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk> Reported-by: Daniel Wagner <wagi@monom.org> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 192/194 [ Author: Ingo Molnar Email: mingo@elte.hu Subject: genirq: Disable irqpoll on -rt Date: Fri, 3 Jul 2009 08:29:57 -0500 Creates long latencies for no value Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 193/194 [ Author: Clark Williams Email: williams@redhat.com Subject: sysfs: Add /sys/kernel/realtime entry Date: Sat, 30 Jul 2011 21:55:53 -0500 Add a /sys/kernel entry to indicate that the kernel is a realtime kernel. Clark says that he needs this for udev rules, udev needs to evaluate if its a PREEMPT_RT kernel a few thousand times and parsing uname output is too slow or so. Are there better solutions? Should it exist and return 0 on !-rt? 
Signed-off-by: Clark Williams <williams@redhat.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> ] 194/194 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: Add localversion for -RT release Date: Fri, 8 Jul 2011 20:25:16 +0200 Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] Signed-off-by: Bruce Ashfield <bruce.ashfield@gmail.com>
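Patch 193/194 above adds a /sys/kernel/realtime entry so that user space (udev rules, in Clark's case) can detect an RT kernel without parsing uname output. A minimal user-space sketch of such a check follows, written in Python for brevity. It assumes the entry holds a non-zero integer on a PREEMPT_RT kernel; since the commit message leaves open whether the entry should exist and read 0 on non-RT kernels, the sketch treats a missing file and a 0 value the same way. The function name is_realtime_kernel is hypothetical and not part of any patch.

```python
from pathlib import Path


def is_realtime_kernel(path="/sys/kernel/realtime"):
    """Return True if the sysfs entry exists and reads as a non-zero integer.

    Assumption: a PREEMPT_RT kernel exposes a non-zero integer here.
    A missing, unreadable, or non-numeric entry is treated as "not RT".
    """
    try:
        text = Path(path).read_text().strip()
    except OSError:
        # Entry absent (non-RT kernel or sysfs not mounted) or unreadable.
        return False
    try:
        return int(text) != 0
    except ValueError:
        # Unexpected content; do not guess.
        return False


if __name__ == "__main__":
    if is_realtime_kernel():
        print("PREEMPT_RT kernel")
    else:
        print("not a PREEMPT_RT kernel (or entry missing)")
```

Passing the path as a parameter keeps the check testable against a temporary file and leaves room for a different location should the entry ever move.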
2020-01-12	v5.5: patch prep	Bruce Ashfield
	Signed-off-by: Bruce Ashfield <bruce.ashfield@gmail.com>
2019-12-20	lib/smp_processor_id: Don't use cpumask_equal()	Bruce Ashfield
	1/252 [ Author: Waiman Long Email: longman@redhat.com Subject: lib/smp_processor_id: Don't use cpumask_equal() Date: Thu, 3 Oct 2019 16:36:08 -0400 The check_preemption_disabled() function uses cpumask_equal() to see if the task is bounded to the current CPU only. cpumask_equal() calls memcmp() to do the comparison. As x86 doesn't have __HAVE_ARCH_MEMCMP, the slow memcmp() function in lib/string.c is used. On a RT kernel that call check_preemption_disabled() very frequently, below is the perf-record output of a certain microbenchmark: 42.75% 2.45% testpmd [kernel.kallsyms] [k] check_preemption_disabled 40.01% 39.97% testpmd [kernel.kallsyms] [k] memcmp We should avoid calling memcmp() in performance critical path. So the cpumask_equal() call is now replaced with an equivalent simpler check. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 2/252 [ Author: Julien Grall Email: julien.grall@arm.com Subject: lib/ubsan: Don't seralize UBSAN report Date: Fri, 20 Sep 2019 11:08:35 +0100 At the moment, UBSAN report will be serialized using a spin_lock(). On RT-systems, spinlocks are turned to rt_spin_lock and may sleep. 
This will result in the following splat if the undefined behavior is in a context that can sleep: \| BUG: sleeping function called from invalid context at /src/linux/kernel/locking/rtmutex.c:968 \| in_atomic(): 1, irqs_disabled(): 128, pid: 3447, name: make \| 1 lock held by make/3447: \| #0: 000000009a966332 (&mm->mmap_sem){++++}, at: do_page_fault+0x140/0x4f8 \| Preemption disabled at: \| [<ffff000011324a4c>] rt_mutex_futex_unlock+0x4c/0xb0 \| CPU: 3 PID: 3447 Comm: make Tainted: G W 5.2.14-rt7-01890-ge6e057589653 #911 \| Call trace: \| dump_backtrace+0x0/0x148 \| show_stack+0x14/0x20 \| dump_stack+0xbc/0x104 \| ___might_sleep+0x154/0x210 \| rt_spin_lock+0x68/0xa0 \| ubsan_prologue+0x30/0x68 \| handle_overflow+0x64/0xe0 \| __ubsan_handle_add_overflow+0x10/0x18 \| __lock_acquire+0x1c28/0x2a28 \| lock_acquire+0xf0/0x370 \| _raw_spin_lock_irqsave+0x58/0x78 \| rt_mutex_futex_unlock+0x4c/0xb0 \| rt_spin_unlock+0x28/0x70 \| get_page_from_freelist+0x428/0x2b60 \| __alloc_pages_nodemask+0x174/0x1708 \| alloc_pages_vma+0x1ac/0x238 \| __handle_mm_fault+0x4ac/0x10b0 \| handle_mm_fault+0x1d8/0x3b0 \| do_page_fault+0x1c8/0x4f8 \| do_translation_fault+0xb8/0xe0 \| do_mem_abort+0x3c/0x98 \| el0_da+0x20/0x24 The spin_lock() protects against multiple CPUs outputting a report at the same time, I guess to prevent the reports from being interleaved. However, they can still interleave with other messages (and even splat from __might_sleep). So the lock usefulness seems pretty limited. Rather than trying to accommodate RT systems by switching to a raw_spin_lock(), the lock is now completely dropped.
Link: https://lkml.kernel.org/r/20190920100835.14999-1-julien.grall@arm.com Reported-by: Andre Przywara <andre.przywara@arm.com> Signed-off-by: Julien Grall <julien.grall@arm.com> Acked-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 3/252 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: jbd2: Simplify journal_unmap_buffer() Date: Fri, 9 Aug 2019 14:42:27 +0200 journal_unmap_buffer() checks first whether the buffer head is a journal. If so it takes locks and then invokes jbd2_journal_grab_journal_head() followed by another check whether this is journal head buffer. The double checking is pointless. Replace the initial check with jbd2_journal_grab_journal_head() which already checks whether the buffer head is actually a journal. Allows also early access to the journal head pointer for the upcoming conversion of state lock to a regular spinlock. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Jan Kara <jack@suse.cz> Cc: linux-ext4@vger.kernel.org Cc: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 4/252 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: jbd2: Remove jbd_trylock_bh_state() Date: Fri, 9 Aug 2019 14:42:28 +0200 No users. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Jan Kara <jack@suse.cz> Cc: linux-ext4@vger.kernel.org Cc: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 5/252 [ Author: Jan Kara Email: jack@suse.cz Subject: jbd2: Move dropping of jh reference out of un/re-filing functions Date: Fri, 9 Aug 2019 14:42:29 +0200 __jbd2_journal_unfile_buffer() and __jbd2_journal_refile_buffer() drop transaction's jh reference when they remove jh from a transaction.
This will be however inconvenient once we move state lock into journal_head itself as we still need to unlock it and we'd need to grab jh reference just for that. Move dropping of jh reference out of these functions into the few callers. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 6/252 [ Author: Jan Kara Email: jack@suse.cz Subject: jbd2: Drop unnecessary branch from jbd2_journal_forget() Date: Fri, 9 Aug 2019 14:42:30 +0200 We have cleared both dirty & jbddirty bits from the bh. So there's no difference between bforget() and brelse(). Thus there's no point jumping to no_jbd branch. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 7/252 [ Author: Jan Kara Email: jack@suse.cz Subject: jbd2: Don't call __bforget() unnecessarily Date: Fri, 9 Aug 2019 14:42:31 +0200 jbd2_journal_forget() jumps to 'not_jbd' branch which calls __bforget() in cases where the buffer is clean which is pointless. In case of failed assertion, it can be even argued that it is safer not to touch buffer's dirty bits. Also logically it makes more sense to just jump to 'drop' and that will make logic also simpler when we switch bh_state_lock to a spinlock. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 8/252 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: jbd2: Make state lock a spinlock Date: Fri, 9 Aug 2019 14:42:32 +0200 Bit-spinlocks are problematic on PREEMPT_RT if functions which might sleep on RT, e.g. spin_lock(), alloc/free(), are invoked inside the lock held region because bit spinlocks disable preemption even on RT. A first attempt was to replace state lock with a spinlock placed in struct buffer_head and make the locking conditional on PREEMPT_RT and DEBUG_BIT_SPINLOCKS. 
Jan pointed out that there is a 4 byte hole in struct journal_head where a regular spinlock fits in and he would not object to convert the state lock to a spinlock unconditionally. Aside of solving the RT problem, this also gains lockdep coverage for the journal head state lock (bit-spinlocks are not covered by lockdep as it's hard to fit a lockdep map into a single bit). The trivial change would have been to convert the jbd_lock_bh_state() inlines, but that comes with the downside that these functions take a buffer head pointer which needs to be converted to a journal head pointer which adds another level of indirection. As almost all functions which use this lock have a journal head pointer readily available, it makes more sense to remove the lock helper inlines and write out spin_lock() at all call sites. Fixup all locking comments as well. Suggested-by: Jan Kara <jack@suse.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Jan Kara <jack@suse.cz> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Jan Kara <jack@suse.com> Cc: linux-ext4@vger.kernel.org Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 9/252 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: jbd2: Free journal head outside of locked region Date: Fri, 9 Aug 2019 14:42:33 +0200 On PREEMPT_RT bit-spinlocks have the same semantics as on PREEMPT_RT=n, i.e. they disable preemption. That means functions which are not safe to be called in preempt disabled context on RT trigger a might_sleep() assert. The journal head bit spinlock is mostly held for short code sequences with trivial RT safe functionality, except for one place: jbd2_journal_put_journal_head() invokes __journal_remove_journal_head() with the journal head bit spinlock held. __journal_remove_journal_head() invokes kmem_cache_free() which must not be called with preemption disabled on RT. 
Jan suggested to rework the removal function so the actual free happens outside the bit-spinlocked region. Split it into two parts: - Do the sanity checks and the buffer head detach under the lock - Do the actual free after dropping the lock There is error case handling in the free part which needs to dereference the b_size field of the now detached buffer head. Due to paranoia (caused by ignorance) the size is retrieved in the detach function and handed into the free function. Might be over-engineered, but better safe than sorry. This makes the journal head bit-spinlock usage RT compliant and also avoids nested locking which is not covered by lockdep. Suggested-by: Jan Kara <jack@suse.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: linux-ext4@vger.kernel.org Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: Jan Kara <jack@suse.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 10/252 [ Author: Joel Fernandes (Google) Email: joel@joelfernandes.org Subject: workqueue: Convert for_each_wq to use built-in list check Date: Thu, 15 Aug 2019 10:18:42 -0400 Because list_for_each_entry_rcu() can now check for holding a lock as well as for being in an RCU read-side critical section, this commit replaces the workqueue_sysfs_unregister() function's use of assert_rcu_or_wq_mutex() and list_for_each_entry_rcu() with list_for_each_entry_rcu() augmented with a lockdep_is_held() optional argument. Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 11/252 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: x86/ioapic: Prevent inconsistent state when moving an interrupt Date: Thu, 17 Oct 2019 12:19:01 +0200 There is an issue with threaded interrupts which are marked ONESHOT and using the fasteoi handler: if (IS_ONESHOT()) mask_irq(); .... 
cond_unmask_eoi_irq() chip->irq_eoi(); if (setaffinity_pending) { mask_ioapic(); ... move_affinity(); unmask_ioapic(); } So if setaffinity is pending the interrupt will be moved and then unconditionally unmasked at the ioapic level, which is wrong in two aspects: 1) It should be kept masked up to the point where the threaded handler finished. 2) The physical chip state and the software masked state are inconsistent. Guard both the mask and the unmask with a check for the software masked state. If the line is marked masked then the ioapic line is also masked, so both mask_ioapic() and unmask_ioapic() can be skipped safely. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Andy Shevchenko <andy.shevchenko@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Sebastian Siewior <bigeasy@linutronix.de> Fixes: 3aa551c9b4c4 ("genirq: add threaded interrupt handler support") Link: https://lkml.kernel.org/r/20191017101938.321393687@linutronix.de Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 12/252 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: x86/ioapic: Rename misnamed functions Date: Thu, 17 Oct 2019 12:19:02 +0200 ioapic_irqd_[un]mask() are misnomers as both functions do way more than masking and unmasking the interrupt line. Both deal with moving the affinity of the interrupt within interrupt context. The mask/unmask is just a tiny part of the functionality. Rename them to ioapic_prepare/finish_move(), fixup the call sites and rename the related variables in the code to reflect what this is about. No functional change.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Andy Shevchenko <andy.shevchenko@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Sebastian Siewior <bigeasy@linutronix.de> Link: https://lkml.kernel.org/r/20191017101938.412489856@linutronix.de Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 13/252 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: percpu-refcount: use normal instead of RCU-sched Date: Wed, 4 Sep 2019 17:59:36 +0200 This is a revert of commit a4244454df129 ("percpu-refcount: use RCU-sched insted of normal RCU") which claims the only reason for using RCU-sched is "rcu_read_[un]lock() … are slightly more expensive than preempt_disable/enable()" and "As the RCU critical sections are extremely short, using sched-RCU shouldn't have any latency implications." The problem with RCU-sched is that it disables preemption and the callback must not acquire any sleeping locks like spinlock_t on PREEMPT_RT which is the case. Convert back to normal RCU. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 14/252 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: drm/i915: Don't disable interrupts independently of the lock Date: Wed, 10 Apr 2019 11:01:37 +0200 The locks (active.lock and rq->lock) need to be taken with disabled interrupts. This is done in i915_request_retire() by disabling the interrupts independently of the locks themselves. While local_irq_disable()+spin_lock() equals spin_lock_irq() on vanilla it does not on PREEMPT_RT. Chris Wilson confirmed that local_irq_disable() was just introduced as an optimisation to avoid enabling/disabling interrupts during lock/unlock combo. Enable/disable interrupts as part of the locking instruction.
Cc: Chris Wilson <chris@chris-wilson.co.uk> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 15/252 [ Author: Julia Cartwright Email: julia@ni.com Subject: watchdog: prevent deferral of watchdogd wakeup on RT Date: Fri, 28 Sep 2018 21:03:51 +0000 When PREEMPT_RT is enabled, all hrtimer expiry functions are deferred for execution into the context of ksoftirqd unless otherwise annotated. Deferring the expiry of the hrtimer used by the watchdog core, however, is a waste, as the callback does nothing but queue a kthread work item and wakeup watchdogd. It's worse than that, too: the deferral through ksoftirqd also means that for correct behavior a user must adjust the scheduling parameters of both watchdogd _and_ ksoftirqd, which is unnecessary and has other side effects (like causing unrelated expiry functions to execute at potentially elevated priority). Instead, mark the hrtimer used by the watchdog core as being _HARD to allow its execution directly from hardirq context. The work done in this expiry function is well-bounded and minimal. A user still must adjust the scheduling parameters of the watchdogd to be correct w.r.t. their application needs. Cc: Guenter Roeck <linux@roeck-us.net> Reported-and-tested-by: Steffen Trumtrar <s.trumtrar@pengutronix.de> Reported-by: Tim Sander <tim@krieglstein.org> Signed-off-by: Julia Cartwright <julia@ni.com> Acked-by: Guenter Roeck <linux@roeck-us.net> [bigeasy: use only HRTIMER_MODE_REL_HARD] Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 16/252 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: block: Don't disable interrupts in trigger_softirq() Date: Fri, 15 Nov 2019 21:37:22 +0100 trigger_softirq() is always invoked as a SMP-function call which is always invoked with disabled interrupts. Don't disable interrupts in trigger_softirq() because interrupts are already disabled.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 17/252 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: arm64: KVM: Invoke compute_layout() before alternatives are applied Date: Thu, 26 Jul 2018 09:13:42 +0200 compute_layout() is invoked as part of an alternative fixup under stop_machine(). This function invokes get_random_long() which acquires a sleeping lock on -RT which cannot be acquired in this context. Rename compute_layout() to kvm_compute_layout() and invoke it before stop_machine() applies the alternatives. Add a __init prefix to kvm_compute_layout() because the caller has it, too (and so the code can be discarded after boot). Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 18/252 [ Author: Marc Kleine-Budde Email: mkl@pengutronix.de Subject: net: sched: Use msleep() instead of yield() Date: Wed, 5 Mar 2014 00:49:47 +0100 On PREEMPT_RT enabled systems the interrupt handlers run as threads at prio 50 (by default). If a high priority userspace process tries to shut down a busy network interface it might spin in a yield loop waiting for the device to become idle. With the interrupt thread having a lower priority than the looping process it might never be scheduled and so result in a deadlock on UP systems.
With Magic SysRq the following backtrace can be produced: > test_app R running 0 174 168 0x00000000 > [<c02c7070>] (__schedule+0x220/0x3fc) from [<c02c7870>] (preempt_schedule_irq+0x48/0x80) > [<c02c7870>] (preempt_schedule_irq+0x48/0x80) from [<c0008fa8>] (svc_preempt+0x8/0x20) > [<c0008fa8>] (svc_preempt+0x8/0x20) from [<c001a984>] (local_bh_enable+0x18/0x88) > [<c001a984>] (local_bh_enable+0x18/0x88) from [<c025316c>] (dev_deactivate_many+0x220/0x264) > [<c025316c>] (dev_deactivate_many+0x220/0x264) from [<c023be04>] (__dev_close_many+0x64/0xd4) > [<c023be04>] (__dev_close_many+0x64/0xd4) from [<c023be9c>] (__dev_close+0x28/0x3c) > [<c023be9c>] (__dev_close+0x28/0x3c) from [<c023f7f0>] (__dev_change_flags+0x88/0x130) > [<c023f7f0>] (__dev_change_flags+0x88/0x130) from [<c023f904>] (dev_change_flags+0x10/0x48) > [<c023f904>] (dev_change_flags+0x10/0x48) from [<c024c140>] (do_setlink+0x370/0x7ec) > [<c024c140>] (do_setlink+0x370/0x7ec) from [<c024d2f0>] (rtnl_newlink+0x2b4/0x450) > [<c024d2f0>] (rtnl_newlink+0x2b4/0x450) from [<c024cfa0>] (rtnetlink_rcv_msg+0x158/0x1f4) > [<c024cfa0>] (rtnetlink_rcv_msg+0x158/0x1f4) from [<c0256740>] (netlink_rcv_skb+0xac/0xc0) > [<c0256740>] (netlink_rcv_skb+0xac/0xc0) from [<c024bbd8>] (rtnetlink_rcv+0x18/0x24) > [<c024bbd8>] (rtnetlink_rcv+0x18/0x24) from [<c02561b8>] (netlink_unicast+0x13c/0x198) > [<c02561b8>] (netlink_unicast+0x13c/0x198) from [<c025651c>] (netlink_sendmsg+0x264/0x2e0) > [<c025651c>] (netlink_sendmsg+0x264/0x2e0) from [<c022af98>] (sock_sendmsg+0x78/0x98) > [<c022af98>] (sock_sendmsg+0x78/0x98) from [<c022bb50>] (___sys_sendmsg.part.25+0x268/0x278) > [<c022bb50>] (___sys_sendmsg.part.25+0x268/0x278) from [<c022cf08>] (__sys_sendmsg+0x48/0x78) > [<c022cf08>] (__sys_sendmsg+0x48/0x78) from [<c0009320>] (ret_fast_syscall+0x0/0x2c) This patch works around the problem by replacing yield() by msleep(1), giving the interrupt thread time to finish, similar to other changes contained in the rt patch set. 
Using wait_for_completion() instead would probably be a better solution. Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 19/252 [ Author: Uladzislau Rezki (Sony) Email: urezki@gmail.com Subject: mm/vmalloc: remove preempt_disable/enable when doing preloading Date: Sat, 30 Nov 2019 17:54:33 -0800 Some background. The preemption was disabled before to guarantee that a preloaded object is available for the CPU it was stored for. That was achieved by combining the disabling of preemption and taking the spin lock while the ne_fit_preload_node is checked. The aim was to not allocate in atomic context when spinlock is taken later, for regular vmap allocations. But that approach conflicts with CONFIG_PREEMPT_RT philosophy. It means that calling spin_lock() with disabled preemption is forbidden in the CONFIG_PREEMPT_RT kernel. Therefore, get rid of preempt_disable() and preempt_enable() when the preload is done for splitting purpose. As a result we do not guarantee now that a CPU is preloaded, instead we minimize the case when it is not, with this change, by populating the per cpu preload pointer under the vmap_area_lock. This implies that at least each caller that has done the preallocation will not fallback to an atomic allocation later. It is possible that the preallocation would be pointless or that no preallocation is done because of the race but the data shows that this is really rare. For example I ran the special test case that follows the preload pattern and path. 20 "unbind" threads run it and each does 1000000 allocations. Only 3.5 times among 1000000 a CPU was not preloaded. So it can happen but the number is negligible.
[mhocko@suse.com: changelog additions] Link: http://lkml.kernel.org/r/20191016095438.12391-1-urezki@gmail.com Fixes: 82dd23e84be3 ("mm/vmalloc.c: preload a CPU with one object for split purpose") Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Acked-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Acked-by: Daniel Wagner <dwagner@suse.de> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 20/252 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: KVM: arm/arm64: Let the timer expire in hardirq context on RT Date: Tue, 13 Aug 2019 14:29:41 +0200 The timers are canceled from a preempt-notifier which is invoked with disabled preemption which is not allowed on PREEMPT_RT. The timer callback is short so it could be invoked in hard-IRQ context on -RT. Let the timer expire in hard-IRQ context even on -RT. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Marc Zyngier <maz@kernel.org> Tested-by: Julien Grall <julien.grall@arm.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 21/252 [ Author: John Ogness Email: john.ogness@linutronix.de Subject: printk-rb: add printk ring buffer documentation Date: Tue, 12 Feb 2019 15:29:39 +0100 The full documentation file for the printk ring buffer.
Signed-off-by: John Ogness <john.ogness@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 22/252 [ Author: John Ogness Email: john.ogness@linutronix.de Subject: printk-rb: add prb locking functions Date: Tue, 12 Feb 2019 15:29:40 +0100 Add processor-reentrant spin locking functions. These allow restricting the number of possible contexts to 2, which can simplify implementing code that also supports NMI interruptions. prb_lock(); /* * This code is synchronized with all contexts * except an NMI on the same processor. */ prb_unlock(); In order to support printk's emergency messages, a processor-reentrant spin lock will be used to control raw access to the emergency console. However, it must be the same processor-reentrant spin lock as the one used by the ring buffer, otherwise a deadlock can occur: CPU1: printk lock -> emergency -> serial lock CPU2: serial lock -> printk lock By making the processor-reentrant implementation available externally, printk can use the same atomic_t for the ring buffer as for the emergency console and thus avoid the above deadlock. Signed-off-by: John Ogness <john.ogness@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 23/252 [ Author: John Ogness Email: john.ogness@linutronix.de Subject: printk-rb: define ring buffer struct and initializer Date: Tue, 12 Feb 2019 15:29:41 +0100 See Documentation/printk-ringbuffer.txt for details about the initializer arguments. Signed-off-by: John Ogness <john.ogness@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 24/252 [ Author: John Ogness Email: john.ogness@linutronix.de Subject: printk-rb: add writer interface Date: Tue, 12 Feb 2019 15:29:42 +0100 Add the writer functions prb_reserve() and prb_commit(). These make use of processor-reentrant spin locks to limit the number of possible interruption scenarios for the writers.
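The processor-reentrant spin lock that the writer functions rely on can be sketched in user space with C11 atomics. This is only an illustration of the idea, not the code from the patch: the names cpu_lock, cpu_lock_try(), cpu_lock_acquire() and cpu_lock_release() are hypothetical, the CPU number is passed in by hand (the kernel would obtain it itself and keep the task on that CPU), and interrupt handling is left out entirely. The owner field holds the CPU that currently owns the lock. A context that finds its own CPU there may enter again, and each caller remembers the previous owner value so that release restores it.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define NO_OWNER (-1)

/* Hypothetical user-space model of a processor-reentrant lock. */
struct cpu_lock {
	atomic_int owner;	/* CPU that owns the lock, or NO_OWNER */
};

static void cpu_lock_init(struct cpu_lock *l)
{
	atomic_init(&l->owner, NO_OWNER);
}

/*
 * Try to take the lock for @cpu. On success *prev holds the owner value
 * seen before entry: NO_OWNER for the outermost acquisition, @cpu for a
 * nested one. It must be handed back to cpu_lock_release().
 */
static bool cpu_lock_try(struct cpu_lock *l, int cpu, int *prev)
{
	int seen = atomic_load_explicit(&l->owner, memory_order_acquire);

	if (seen == cpu) {
		/* Same processor (e.g. an interrupting context): re-enter. */
		*prev = cpu;
		return true;
	}

	seen = NO_OWNER;
	if (atomic_compare_exchange_strong_explicit(&l->owner, &seen, cpu,
						    memory_order_acquire,
						    memory_order_relaxed)) {
		*prev = NO_OWNER;
		return true;
	}
	return false;	/* owned by another processor */
}

/* Spin until the lock is owned by @cpu. */
static void cpu_lock_acquire(struct cpu_lock *l, int cpu, int *prev)
{
	while (!cpu_lock_try(l, cpu, prev))
		;
}

/* Restore the owner value that was seen on entry. */
static void cpu_lock_release(struct cpu_lock *l, int prev)
{
	assert(atomic_load_explicit(&l->owner, memory_order_relaxed) != NO_OWNER);
	atomic_store_explicit(&l->owner, prev, memory_order_release);
}
```

Because a nested release writes back the owning CPU rather than NO_OWNER, the lock stays held until the outermost context releases it, which is what keeps other processors out while an interrupting context on the same processor runs.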
Signed-off-by: John Ogness <john.ogness@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 25/252 [ Author: John Ogness Email: john.ogness@linutronix.de Subject: printk-rb: add basic non-blocking reading interface Date: Tue, 12 Feb 2019 15:29:43 +0100 Add reader iterator static declaration/initializer, dynamic initializer, and functions to iterate and retrieve ring buffer data. Signed-off-by: John Ogness <john.ogness@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 26/252 [ Author: John Ogness Email: john.ogness@linutronix.de Subject: printk-rb: add blocking reader support Date: Tue, 12 Feb 2019 15:29:44 +0100 Add a blocking read function for readers. An irq_work function is used to signal the wait queue so that write notification can be triggered from any context. Signed-off-by: John Ogness <john.ogness@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 27/252 [ Author: John Ogness Email: john.ogness@linutronix.de Subject: printk-rb: add functionality required by printk Date: Tue, 12 Feb 2019 15:29:45 +0100 The printk subsystem needs to be able to query the size of the ring buffer, seek to specific entries within the ring buffer, and track if records could not be stored in the ring buffer. Signed-off-by: John Ogness <john.ogness@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 28/252 [ Author: John Ogness Email: john.ogness@linutronix.de Subject: printk: add ring buffer and kthread Date: Tue, 12 Feb 2019 15:29:46 +0100 The printk ring buffer provides an NMI-safe interface for writing messages to a ring buffer. Using such a buffer relieves printk callers of the current burdens of disabled preemption while calling the console drivers (and possibly printing out many messages that another task put into the log buffer). Create a ring buffer to be used for storing messages to be printed to the consoles.
Create a dedicated printk kthread to block on the ring buffer and call the console drivers for the read messages. NOTE: The printk_delay is relocated to _after_ the message is printed, where it makes more sense. Signed-off-by: John Ogness <john.ogness@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 29/252 [ Author: John Ogness Email: john.ogness@linutronix.de Subject: printk: remove exclusive console hack Date: Tue, 12 Feb 2019 15:29:47 +0100 In order to support printing the printk log history when new consoles are registered, a global exclusive_console variable is temporarily set. This only works because printk runs with preemption disabled. When console printing is moved to a fully preemptible dedicated kthread, this hack no longer works. Remove exclusive_console usage. Signed-off-by: John Ogness <john.ogness@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 30/252 [ Author: John Ogness Email: john.ogness@linutronix.de Subject: printk: redirect emit/store to new ringbuffer Date: Tue, 12 Feb 2019 15:29:48 +0100 vprintk_emit and vprintk_store are the main functions that all printk variants eventually go through. Change these to store the message in the new printk ring buffer that the printk kthread is reading. Remove functions no longer in use because of the changes to vprintk_emit and vprintk_store. In order to handle interrupts and NMIs, a second per-cpu ring buffer (sprint_rb) is added. This ring buffer is used for NMI-safe memory allocation in order to format the printk messages. NOTE: LOG_CONT is ignored for now and handled as individual messages. 
LOG_CONT functions are masked behind "#if 0" blocks until their functionality can be restored Signed-off-by: John Ogness <john.ogness@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 31/252 [ Author: John Ogness Email: john.ogness@linutronix.de Subject: printk_safe: remove printk safe code Date: Tue, 12 Feb 2019 15:29:49 +0100 vprintk variants are now NMI-safe so there is no longer a need for the "safe" calls. NOTE: This also removes printk flushing functionality. Signed-off-by: John Ogness <john.ogness@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 32/252 [ Author: John Ogness Email: john.ogness@linutronix.de Subject: printk: minimize console locking implementation Date: Tue, 12 Feb 2019 15:29:50 +0100 Since printing of the printk buffer is now handled by the printk kthread, minimize the console locking functions to just handle locking of the console. NOTE: With this console_flush_on_panic will no longer flush. Signed-off-by: John Ogness <john.ogness@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 33/252 [ Author: John Ogness Email: john.ogness@linutronix.de Subject: printk: track seq per console Date: Tue, 12 Feb 2019 15:29:51 +0100 Allow each console to track which seq record was last printed. This simplifies identifying dropped records. Signed-off-by: John Ogness <john.ogness@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 34/252 [ Author: John Ogness Email: john.ogness@linutronix.de Subject: printk: do boot_delay_msec inside printk_delay Date: Tue, 12 Feb 2019 15:29:52 +0100 Both functions needed to be called one after the other, so just integrate boot_delay_msec into printk_delay for simplification. 
Signed-off-by: John Ogness <john.ogness@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 35/252 [ Author: John Ogness Email: john.ogness@linutronix.de Subject: printk: print history for new consoles Date: Tue, 12 Feb 2019 15:29:53 +0100 When new consoles register, they currently print how many messages they have missed. However, many (or all) of those messages may still be in the ring buffer. Add functionality to print as much of the history as available. This is a clean replacement of the old exclusive console hack. Signed-off-by: John Ogness <john.ogness@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 36/252 [ Author: John Ogness Email: john.ogness@linutronix.de Subject: printk: implement CON_PRINTBUFFER Date: Tue, 12 Feb 2019 15:29:54 +0100 If the CON_PRINTBUFFER flag is not set, do not replay the history for that console. Signed-off-by: John Ogness <john.ogness@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 37/252 [ Author: John Ogness Email: john.ogness@linutronix.de Subject: printk: add processor number to output Date: Tue, 12 Feb 2019 15:29:55 +0100 It can be difficult to sort printk out if multiple processors are printing simultaneously. Add the processor number to the printk output to allow the messages to be sorted. Signed-off-by: John Ogness <john.ogness@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 38/252 [ Author: John Ogness Email: john.ogness@linutronix.de Subject: console: add write_atomic interface Date: Tue, 12 Feb 2019 15:29:56 +0100 Add a write_atomic callback to the console. This is an optional function for console drivers. The function must be atomic (including NMI safe) for writing to the console. Console drivers must still implement the write callback. The write_atomic callback will only be used for emergency messages. 
Creating an NMI safe write_atomic that must synchronize with write requires a careful implementation of the console driver. To aid with the implementation, a set of console_atomic_*() functions is provided: void console_atomic_lock(unsigned int flags); void console_atomic_unlock(unsigned int flags); These functions synchronize using the processor-reentrant cpu lock of the printk buffer. Signed-off-by: John Ogness <john.ogness@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 39/252 [ Author: John Ogness Email: john.ogness@linutronix.de Subject: printk: introduce emergency messages Date: Tue, 12 Feb 2019 15:29:57 +0100 Console messages are generally either critical or non-critical. Critical messages are messages such as crashes or sysrq output. Critical messages should never be lost because generally they provide important debugging information. Since all console messages are output via a fully preemptible printk kernel thread, it is possible that messages are not output because that thread cannot be scheduled (BUG in scheduler, run-away RT task, etc). To allow critical messages to be output independent of the schedulability of the printk task, introduce an emergency mechanism that _immediately_ outputs the message to the consoles. To avoid possible unbounded latency issues, the emergency mechanism only outputs the printk line provided by the caller and ignores any pending messages in the log buffer. Critical messages are identified as messages (by default) with log level LOGLEVEL_WARNING or more critical. This is configurable via the kernel option CONSOLE_LOGLEVEL_EMERGENCY. Any messages output as emergency messages are skipped by the printk thread on those consoles that output the emergency message. In order for a console driver to support emergency messages, the write_atomic function must be implemented by the driver. If not implemented, the emergency messages are handled like all other messages and are printed by the printk thread.
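The routing rule described above can be written down as a small decision function. This is a sketch under assumptions rather than the patch's code: route_message(), enum msg_route and the has_write_atomic flag are hypothetical names introduced for illustration. Only the LOGLEVEL_WARNING value (4) and the convention that a numerically lower level is more critical are taken from the kernel's log level definitions.

```c
#include <stdbool.h>

/* Kernel log levels: a lower number means a more critical message. */
#define LOGLEVEL_WARNING 4

/* Hypothetical result type: which path prints the message. */
enum msg_route {
	ROUTE_PRINTK_THREAD,	/* queued for the preemptible printk kthread */
	ROUTE_EMERGENCY,	/* written immediately via write_atomic */
};

/*
 * A message takes the emergency path only if it is at least as critical
 * as the configured emergency level AND the console driver provides a
 * write_atomic callback. Everything else goes to the printk thread.
 */
static enum msg_route route_message(int level, int emergency_level,
				    bool has_write_atomic)
{
	if (level <= emergency_level && has_write_atomic)
		return ROUTE_EMERGENCY;
	return ROUTE_PRINTK_THREAD;
}
```

With the default threshold, route_message(3, LOGLEVEL_WARNING, true) selects the emergency path, while the same message on a console without write_atomic falls back to the printk thread.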
Signed-off-by: John Ogness <john.ogness@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 40/252 [ Author: John Ogness Email: john.ogness@linutronix.de Subject: serial: 8250: implement write_atomic Date: Tue, 12 Feb 2019 15:29:58 +0100 Implement a non-sleeping NMI-safe write_atomic console function in order to support emergency printk messages. Since interrupts need to be disabled during transmit, all usage of the IER register was wrapped with access functions that use the console_atomic_lock function to synchronize register access while tracking the state of the interrupts. This was necessary because write_atomic can be called from an NMI context that has preempted write_atomic. Signed-off-by: John Ogness <john.ogness@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 41/252 [ Author: John Ogness Email: john.ogness@linutronix.de Subject: printk: implement KERN_CONT Date: Tue, 12 Feb 2019 15:29:59 +0100 Implement KERN_CONT based on the printing CPU rather than on the printing task. As long as the KERN_CONT messages are coming from the same CPU and no non-KERN_CONT messages come, the messages are assumed to belong to each other. Signed-off-by: John Ogness <john.ogness@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 42/252 [ Author: John Ogness Email: john.ogness@linutronix.de Subject: printk: implement /dev/kmsg Date: Tue, 12 Feb 2019 15:30:00 +0100 Since printk messages are now logged to a new ring buffer, update the /dev/kmsg functions to pull the messages from there. Signed-off-by: John Ogness <john.ogness@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 43/252 [ Author: John Ogness Email: john.ogness@linutronix.de Subject: printk: implement syslog Date: Tue, 12 Feb 2019 15:30:01 +0100 Since printk messages are now logged to a new ring buffer, update the syslog functions to pull the messages from there.
Signed-off-by: John Ogness <john.ogness@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 44/252 [ Author: John Ogness Email: john.ogness@linutronix.de Subject: printk: implement kmsg_dump Date: Tue, 12 Feb 2019 15:30:02 +0100 Since printk messages are now logged to a new ring buffer, update the kmsg_dump functions to pull the messages from there. Signed-off-by: John Ogness <john.ogness@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 45/252 [ Author: John Ogness Email: john.ogness@linutronix.de Subject: printk: remove unused code Date: Tue, 12 Feb 2019 15:30:03 +0100 Code relating to the safe context and anything dealing with the previous log buffer implementation is no longer in use. Remove it. Signed-off-by: John Ogness <john.ogness@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 46/252 [ Author: John Ogness Email: john.ogness@linutronix.de Subject: printk: set deferred to default loglevel, enforce mask Date: Thu, 14 Feb 2019 23:13:30 +0100 All messages printed via vprintk_deferred() were being automatically treated as emergency messages. Messages printed via vprintk_deferred() should be set to the default loglevel. LOGLEVEL_SCHED is no longer relevant. Also, enforce the loglevel mask for emergency messages. Signed-off-by: John Ogness <john.ogness@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 47/252 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: serial: 8250: remove that trylock in serial8250_console_write_atomic() Date: Thu, 14 Feb 2019 17:38:24 +0100 This does not work as rtmutex in NMI context. As per John, it is not needed.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 48/252 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: serial: 8250: export symbols which are used by symbols Date: Sat, 16 Feb 2019 09:02:00 +0100 Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 49/252 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: arm: remove printk_nmi_.() Date: Fri, 15 Feb 2019 14:34:20 +0100 It is no longer provided by the printk core code. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 50/252 [ Author: John Ogness Email: john.ogness@linutronix.de Subject: printk: only allow kernel to emergency message Date: Sun, 17 Feb 2019 03:11:20 +0100 Emergency messages exist as a mechanism for the kernel to communicate critical information to users. It is not meant for use by userspace. Only allow facility=0 messages to be processed by the emergency message code. Signed-off-by: John Ogness <john.ogness@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 51/252 [ Author: John Ogness Email: john.ogness@linutronix.de Subject: printk: devkmsg: llseek: reset clear if it is lost Date: Fri, 22 Feb 2019 23:02:44 +0100 SEEK_DATA will seek to the last clear record. If this clear record is no longer in the ring buffer, devkmsg_llseek() will go into an infinite loop. Fix that by resetting the clear sequence if the old clear record is no longer in the ring buffer. Signed-off-by: John Ogness <john.ogness@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 52/252 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: printk: print "rate-limitted" message as info Date: Fri, 22 Feb 2019 12:47:13 +0100 If messages which are injected via kmsg are dropped then they don't need to be printed as warnings. This is to avoid latency spikes if the interface decides to print a lot of important messages. 
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 53/252 [ Author: John Ogness Email: john.ogness@linutronix.de Subject: printk: kmsg_dump: remove mutex usage Date: Wed, 24 Apr 2019 16:36:04 +0200 The kmsg dumper can be called from any context, but the dumping helpers were using a mutex to synchronize the iterator against concurrent dumps. Rather than trying to synchronize the iterator, use a local copy of the iterator during the dump. Then no synchronization is required. Reported-by: Scott Wood <swood@redhat.com> Signed-off-by: John Ogness <john.ogness@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 54/252 [ Author: He Zhe Email: zhe.he@windriver.com Subject: printk: devkmsg: read: Return EPIPE when the first message user-space wants has gone Date: Tue, 24 Sep 2019 15:26:39 +0800 When user-space wants to read the first message, that is when user->seq is 0, and that message has gone, it currently automatically resets user->seq to current first seq. This mis-aligns with mainline kernel. https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/ABI/testing/dev-kmsg#n39 https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/printk/printk.c#n899 We should inform user-space that what it wants has gone by returning EPIPE in such scenario. Link: https://lore.kernel.org/r/20190924072639.25986-1-zhe.he@windriver.com Signed-off-by: He Zhe <zhe.he@windriver.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 55/252 [ Author: John Ogness Email: john.ogness@linutronix.de Subject: printk: handle iterating while buffer changing Date: Mon, 7 Oct 2019 16:20:39 +0200 The syslog and kmsg_dump readers are provided buffers to fill. Both try to maximize the provided buffer usage by calculating the maximum number of messages that can fit. However, if after the calculation, messages are dropped and new messages added, the calculation will no longer match. 
For syslog, add a check to make sure the provided buffer is not overfilled. For kmsg_dump, start over by recalculating the messages available. Signed-off-by: John Ogness <john.ogness@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 56/252 [ Author: John Ogness Email: john.ogness@linutronix.de Subject: printk: hack out emergency loglevel usage Date: Tue, 3 Dec 2019 09:14:57 +0100 Instead of using an emergency loglevel to determine if atomic messages should be printed, use oops_in_progress. This conforms to the decision that latency-causing atomic messages never be generated during normal operation. Signed-off-by: John Ogness <john.ogness@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 57/252 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: fs/buffer: Make BH_Uptodate_Lock bit_spin_lock a regular spinlock_t Date: Fri, 15 Nov 2019 18:54:20 +0100 Bit spinlocks are problematic if PREEMPT_RT is enabled, because they disable preemption, which is undesired for latency reasons and breaks when regular spinlocks are taken within the bit_spinlock locked region because regular spinlocks are converted to 'sleeping spinlocks' on RT. So RT replaces the bit spinlocks with regular spinlocks to avoid this problem. Bit spinlocks are also not covered by lock debugging, e.g. lockdep. Substitute the BH_Uptodate_Lock bit spinlock with a regular spinlock. Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> [bigeasy: remove the wrapper and use always spinlock_t and move it into the padding hole] Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 58/252 [ Author: Clark Williams Email: williams@redhat.com Subject: thermal/x86_pkg_temp: Make pkg_temp_lock a raw_spinlock_t Date: Mon, 15 Jul 2019 15:25:00 -0500 The spinlock pkg_temp_lock has the potential of being taken in atomic context because it can be acquired from the thermal IRQ vector. 
It's static and limited scope so go ahead and make it a raw spinlock. Signed-off-by: Clark Williams <williams@redhat.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 59/252 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: perf/core: Add SRCU annotation for pmus list walk Date: Fri, 15 Nov 2019 18:04:07 +0100 Since commit 28875945ba98d ("rcu: Add support for consolidated-RCU reader checking") there is an additional check to ensure that a RCU related lock is held while the RCU list is iterated. This section holds the SRCU reader lock instead. Add annotation to list_for_each_entry_rcu() that pmus_srcu must be acquired during the list traversal. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 60/252 [ Author: He Zhe Email: zhe.he@windriver.com Subject: kmemleak: Turn kmemleak_lock and object->lock to raw_spinlock_t Date: Wed, 19 Dec 2018 16:30:57 +0100 kmemleak_lock as a rwlock on RT can possibly be acquired in atomic context which does not work on RT. Since the kmemleak operation is performed in atomic context make it a raw_spinlock_t so it can also be acquired on RT. This is used for debugging and is not enabled by default in a production-like environment (where performance/latency matters) so it makes sense to make it a raw_spinlock_t instead of trying to get rid of the atomic context. Turn also the kmemleak_object->lock into raw_spinlock_t which is acquired (nested) while the kmemleak_lock is held. The time spent in "echo scan > kmemleak" slightly improved on 64core box with this patch applied after boot.
Acked-by: Catalin Marinas <catalin.marinas@arm.com> Link: https://lkml.kernel.org/r/20181218150744.GB20197@arrakis.emea.arm.com Link: https://lkml.kernel.org/r/1542877459-144382-1-git-send-email-zhe.he@windriver.com Link: https://lkml.kernel.org/r/20190927082230.34152-1-yongxin.liu@windriver.com Signed-off-by: He Zhe <zhe.he@windriver.com> Signed-off-by: Liu Haitao <haitao.liu@windriver.com> Signed-off-by: Yongxin Liu <yongxin.liu@windriver.com> [bigeasy: Redo the description. Merge the individual bits: He Zhe did the kmemleak_lock, Liu Haitao the ->lock and Yongxin Liu forwarded the patch.] Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 61/252 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: Use CONFIG_PREEMPTION Date: Fri, 26 Jul 2019 11:30:49 +0200 This is an all-in-one patch of the current `PREEMPTION' branch. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 62/252 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: BPF: Disable on PREEMPT_RT Date: Thu, 10 Oct 2019 16:54:45 +0200 Disable BPF on PREEMPT_RT because - it allocates and frees memory in atomic context - it uses up_read_non_owner() - BPF_PROG_RUN() expects to be invoked in non-preemptible context Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 63/252 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: workqueue: Don't assume that the callback has interrupts disabled Date: Tue, 11 Jun 2019 11:21:02 +0200 Due to the TIMER_IRQSAFE flag, the timer callback is invoked with disabled interrupts. On -RT the callback is invoked in softirq context with enabled interrupts. Since the interrupts are threaded, there are no in_irq() users. The local_bh_disable() around the threaded handler ensures that there is either a timer or a threaded handler active on the CPU. Disable interrupts before __queue_work() is invoked from the timer callback.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 64/252 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: sched/swait: Add swait_event_lock_irq() Date: Wed, 22 May 2019 12:42:26 +0200 The swait_event_lock_irq() is inspired by wait_event_lock_irq(). This is required by the workqueue code once it switches to swait. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 65/252 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: workqueue: Use swait for wq_manager_wait Date: Tue, 11 Jun 2019 11:21:09 +0200 In order for the workqueue code to use raw_spinlock_t typed locking, no spinlock_t typed lock may be acquired while such a lock is held. A wait_queue_head uses a spinlock_t lock for its list protection. Use a swait based queue head to avoid raw_spinlock_t -> spinlock_t locking. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 66/252 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: workqueue: Convert the locks to raw type Date: Wed, 22 May 2019 12:43:56 +0200 After all the workqueue and the timer rework, we can finally make the worker_pool lock raw. The lock is not held over an unbounded period of time/iterations. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 67/252 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: mm/compaction: Disable compact_unevictable_allowed on RT Date: Fri, 8 Nov 2019 12:55:47 +0100 Since commit 5bbe3547aa3ba ("mm: allow compaction of unevictable pages") it is allowed to examine mlocked pages for pages to compact by default. On -RT even minor pagefaults are problematic because it may take a few 100us to resolve them and until then the task is blocked. Make compact_unevictable_allowed = 0 default and remove it from /proc on RT.
Link: https://lore.kernel.org/linux-mm/20190710144138.qyn4tuttdq6h7kqx@linutronix.de/ Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 68/252 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: cgroup: Remove ->css_rstat_flush() Date: Thu, 15 Aug 2019 18:14:16 +0200 I was looking at the lifetime of the ->css_rstat_flush() to see if cgroup_rstat_cpu_lock should remain a raw_spinlock_t. I didn't find any users; it has been unused since it was introduced in commit 8f53470bab042 ("cgroup: Add cgroup_subsys->css_rstat_flush()") Remove the css_rstat_flush callback because it has no users. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 69/252 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: cgroup: Consolidate users of cgroup_rstat_lock. Date: Fri, 16 Aug 2019 12:20:42 +0200 cgroup_rstat_flush_irqsafe() has no users, remove it. cgroup_rstat_flush_hold() and cgroup_rstat_flush_release() are only used within this file. Make them static. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 70/252 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: cgroup: Remove `may_sleep' from cgroup_rstat_flush_locked() Date: Fri, 16 Aug 2019 12:25:35 +0200 cgroup_rstat_flush_locked() is always invoked with `may_sleep' set to true so that this case can be made default and the parameter removed. Remove the `may_sleep' parameter. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 71/252 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: cgroup: Acquire cgroup_rstat_lock with enabled interrupts Date: Fri, 16 Aug 2019 12:49:36 +0200 There is no need to disable interrupts while cgroup_rstat_lock is acquired. The lock is never used in-IRQ context so a simple spin_lock() is enough for synchronisation purpose.
Acquire cgroup_rstat_lock without disabling interrupts and ensure that cgroup_rstat_cpu_lock is acquired with disabled interrupts (this one is acquired in-IRQ context). Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 72/252 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: mm: workingset: replace IRQ-off check with a lockdep assert. Date: Mon, 11 Feb 2019 10:40:46 +0100 Commit 68d48e6a2df57 ("mm: workingset: add vmstat counter for shadow nodes") introduced an IRQ-off check to ensure that a lock is held which also disabled interrupts. This does not work the same way on -RT because none of the locks, that are held, disable interrupts. Replace this check with a lockdep assert which ensures that the lock is held. Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 73/252 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: tpm: remove tpm_dev_wq_lock Date: Mon, 11 Feb 2019 11:33:11 +0100 Added in commit 9e1b74a63f776 ("tpm: add support for nonblocking operation") but never actually used it. Cc: Philip Tricca <philip.b.tricca@intel.com> Cc: Tadeusz Struk <tadeusz.struk@intel.com> Cc: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 74/252 [ Author: Rob Herring Email: robh@kernel.org Subject: of: Rework and simplify phandle cache to use a fixed size Date: Wed, 11 Dec 2019 17:23:45 -0600 The phandle cache was added to speed up of_find_node_by_phandle() by avoiding walking the whole DT to find a matching phandle. The implementation has several shortcomings: - The cache is designed to work on a linear set of phandle values. This is true for dtc generated DTs, but not for other cases such as Power. - The cache isn't enabled until of_core_init() and a typical system may see hundreds of calls to of_find_node_by_phandle() before that point. 
- The cache is freed and re-allocated when the number of phandles changes. - It takes a raw spinlock around a memory allocation which breaks on RT. Change the implementation to a fixed size and use hash_32() as the cache index. This greatly simplifies the implementation. It avoids the need for any re-alloc of the cache and taking a reference on nodes in the cache. We only have a single source of removing cache entries which is of_detach_node(). Using hash_32() removes any assumption on phandle values improving the hit rate for non-linear phandle values. The effect on linear values using hash_32() is about a 10% collision. The chances of thrashing on colliding values seems to be low. To compare performance, I used a RK3399 board which is a pretty typical system. I found that just measuring boot time as done previously is noisy and may be impacted by other things. Also bringing up secondary cores causes some issues with measuring, so I booted with 'nr_cpus=1'. With no caching, calls to of_find_node_by_phandle() take about 20124 us for 1248 calls. There's an additional 288 calls before time keeping is up. Using the average time per hit/miss with the cache, we can calculate these calls to take 690 us (277 hit / 11 miss) with a 128 entry cache and 13319 us with no cache or an uninitialized cache. Comparing the 3 implementations the time spent in of_find_node_by_phandle() is: no cache: 20124 us (+ 13319 us) 128 entry cache: 5134 us (+ 690 us) current cache: 819 us (+ 13319 us) We could move the allocation of the cache earlier to improve the current cache, but that just further complicates the situation as it needs to be after slab is up, so we can't do it when unflattening (which uses memblock). 
Reported-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Segher Boessenkool <segher@kernel.crashing.org> Cc: Frank Rowand <frowand.list@gmail.com> Signed-off-by: Rob Herring <robh@kernel.org> Link: https://lkml.kernel.org/r/20191211232345.24810-1-robh@kernel.org Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 75/252 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: timekeeping: Split jiffies seqlock Date: Thu, 14 Feb 2013 22:36:59 +0100 Replace jiffies_lock seqlock with a simple seqcounter and a rawlock so it can be taken in atomic context on RT. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 76/252 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: signal: Revert ptrace preempt magic Date: Wed, 21 Sep 2011 19:57:12 +0200 Upstream commit '53da1d9456fe7f8 fix ptrace slowness' is nothing more than a bandaid around the ptrace design trainwreck. It's not a correctness issue, it's merely a cosmetic bandaid. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 77/252 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: dma-buf: Use seqlock_t instread disabling preemption Date: Wed, 14 Aug 2019 16:38:43 +0200 "dma reservation" disables preemption while acquiring the write access for "seqcount". Replace the seqcount with a seqlock_t which provides seqcount-like semantics and a lock for the writer. Link: https://lkml.kernel.org/r/f410b429-db86-f81c-7c67-f563fa808b62@free.fr Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 78/252 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: seqlock: Prevent rt starvation Date: Wed, 22 Feb 2012 12:03:30 +0100 If a low prio writer gets preempted while holding the seqlock write locked, a high prio reader spins forever on RT. To prevent this let the reader grab the spinlock, so it blocks and eventually boosts the writer. This way the writer can proceed and endless spinning is prevented.
For seqcount writers we disable preemption over the update code path. Thanks to Al Viro for disentangling some VFS code to make that possible. Nicholas Mc Guire: - spin_lock+unlock => spin_unlock_wait - __write_seqcount_begin => __raw_write_seqcount_begin Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 79/252 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: NFSv4: replace seqcount_t with a seqlock_t Date: Fri, 28 Oct 2016 23:05:11 +0200 The raw_write_seqcount_begin() in nfs4_reclaim_open_state() causes a preempt_disable() on -RT. The spin_lock()/spin_unlock() in that section does not work. The lockdep part was removed in commit abbec2da13f0 ("NFS: Use raw_write_seqcount_begin/end int nfs4_reclaim_open_state") because lockdep complained. The whole seqcount thing was introduced in commit c137afabe330 ("NFSv4: Allow the state manager to mark an open_owner as being recovered"). The recovery thread runs only once. write_seqlock() does not work on !RT because it disables preemption and the writer side is preemptible (has to remain so despite the fact that it will block readers). Reported-by: kernel test robot <xiaolong.ye@intel.com> Link: https://lkml.kernel.org/r/20161021164727.24485-1-bigeasy@linutronix.de Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 80/252 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: net/Qdisc: use a seqlock instead seqcount Date: Wed, 14 Sep 2016 17:36:35 +0200 The seqcount disables preemption on -RT while it is held, which we can't remove. Also we don't want the reader to spin for ages if the writer is scheduled out. The seqlock on the other hand will serialize / sleep on the lock while writer is active.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 81/252 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: net: Add a mutex around devnet_rename_seq Date: Wed, 20 Mar 2013 18:06:20 +0100 On RT write_seqcount_begin() disables preemption and device_rename() allocates memory with GFP_KERNEL and grabs later the sysfs_mutex mutex. Serialize with a mutex and use the non-preemption-disabling __write_seqcount_begin(). To avoid writer starvation, let the reader grab the mutex and release it when it detects a writer in progress. This keeps the normal case (no reader on the fly) fast. [ tglx: Instead of replacing the seqcount by a mutex, add the mutex ] Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 82/252 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: userfaultfd: Use a seqlock instead of seqcount Date: Wed, 18 Dec 2019 12:25:09 +0100 On RT write_seqcount_begin() disables preemption which leads to a warning in add_wait_queue() while the spinlock_t is acquired. The waitqueue can't be converted to swait_queue because userfaultfd_wake_function() is used as a custom wake function. Use seqlock instead of seqcount to avoid the preempt_disable() section during add_wait_queue(). Cc: stable-rt@vger.kernel.org Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 83/252 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: fs/nfs: turn rmdir_sem into a semaphore Date: Thu, 15 Sep 2016 10:51:27 +0200 The RW semaphore had a reader side which used the _non_owner version because it most likely took the reader lock in one thread and released it in another which would cause lockdep to complain if the "regular" version was used. On -RT we need the owner because the rw lock is turned into a rtmutex. The semaphores on the other hand are "plain simple" and should work as expected.
We can't have multiple readers but on -RT we don't allow multiple readers anyway so that is not a loss. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 84/252 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: fs/dcache: disable preemption on i_dir_seq's write side Date: Fri, 20 Oct 2017 11:29:53 +0200 i_dir_seq is an opencoded seqcounter. Based on the code it looks like we could have two writers in parallel despite the fact that the d_lock is held. The problem is that during the write process on RT the preemption is still enabled and if this process is interrupted by a reader with RT priority then we lock up. To avoid that lock up I am disabling the preemption during the update. The rename of i_dir_seq is here to ensure to catch new write sides in future. Cc: stable-rt@vger.kernel.org Reported-by: Oleg.Karfich@wago.com Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 85/252 [ Author: Paul Gortmaker Email: paul.gortmaker@windriver.com Subject: list_bl: Make list head locking RT safe Date: Fri, 21 Jun 2013 15:07:25 -0400 As per changes in include/linux/jbd_common.h for avoiding the bit_spin_locks on RT ("fs: jbd/jbd2: Make state lock and journal head lock rt safe") we do the same thing here. We use the non atomic __set_bit and __clear_bit inside the scope of the lock to preserve the ability of the existing LIST_DEBUG code to use the zero'th bit in the sanity checks. As a bit spinlock, we had no lockdep visibility into the usage of the list head locking. 
Now, if we were to implement it as a standard non-raw spinlock, we would see: BUG: sleeping function called from invalid context at kernel/rtmutex.c:658 in_atomic(): 1, irqs_disabled(): 0, pid: 122, name: udevd 5 locks held by udevd/122: #0: (&sb->s_type->i_mutex_key#7/1){+.+.+.}, at: [<ffffffff811967e8>] lock_rename+0xe8/0xf0 #1: (rename_lock){+.+...}, at: [<ffffffff811a277c>] d_move+0x2c/0x60 #2: (&dentry->d_lock){+.+...}, at: [<ffffffff811a0763>] dentry_lock_for_move+0xf3/0x130 #3: (&dentry->d_lock/2){+.+...}, at: [<ffffffff811a0734>] dentry_lock_for_move+0xc4/0x130 #4: (&dentry->d_lock/3){+.+...}, at: [<ffffffff811a0747>] dentry_lock_for_move+0xd7/0x130 Pid: 122, comm: udevd Not tainted 3.4.47-rt62 #7 Call Trace: [<ffffffff810b9624>] __might_sleep+0x134/0x1f0 [<ffffffff817a24d4>] rt_spin_lock+0x24/0x60 [<ffffffff811a0c4c>] __d_shrink+0x5c/0xa0 [<ffffffff811a1b2d>] __d_drop+0x1d/0x40 [<ffffffff811a24be>] __d_move+0x8e/0x320 [<ffffffff811a278e>] d_move+0x3e/0x60 [<ffffffff81199598>] vfs_rename+0x198/0x4c0 [<ffffffff8119b093>] sys_renameat+0x213/0x240 [<ffffffff817a2de5>] ? _raw_spin_unlock+0x35/0x60 [<ffffffff8107781c>] ? do_page_fault+0x1ec/0x4b0 [<ffffffff817a32ca>] ? retint_swapgs+0xe/0x13 [<ffffffff813eb0e6>] ? trace_hardirqs_on_thunk+0x3a/0x3f [<ffffffff8119b0db>] sys_rename+0x1b/0x20 [<ffffffff817a3b96>] system_call_fastpath+0x1a/0x1f Since we are only taking the lock during short lived list operations, lets assume for now that it being raw won't be a significant latency concern. Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com> [julia@ni.com: Use #define instead static inline to avoid false positive from lockdep] Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 86/252 [ Author: Clark Williams Email: williams@redhat.com Subject: fscache: initialize cookie hash table raw spinlocks Date: Tue, 3 Jul 2018 13:34:30 -0500 The fscache cookie mechanism uses a hash table of hlist_bl_head structures. 
The PREEMPT_RT patchset adds a raw spinlock to this structure and so on PREEMPT_RT the structures get used uninitialized, causing warnings about bad magic numbers when spinlock debugging is turned on. Use the init function for fscache cookies. Signed-off-by: Clark Williams <williams@redhat.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 87/252 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: fs/dcache: bring back explicit INIT_HLIST_BL_HEAD init Date: Wed, 13 Sep 2017 12:32:34 +0200 Commit 3d375d78593c ("mm: update callers to use HASH_ZERO flag") removed INIT_HLIST_BL_HEAD and uses the ZERO flag instead for the init. However on RT we have also a spinlock which needs an init call so we can't use that. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 88/252 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: fs/dcache: use swait_queue instead of waitqueue Date: Wed, 14 Sep 2016 14:35:49 +0200 __d_lookup_done() invokes wake_up_all() while holding a hlist_bl_lock() which disables preemption. As a workaround convert it to swait. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 89/252 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: kconfig: Disable config options which are not RT compatible Date: Sun, 24 Jul 2011 12:11:43 +0200 Disable stuff which is known to have issues on RT Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 90/252 [ Author: Ingo Molnar Email: mingo@elte.hu Subject: mm: Allow only SLUB on RT Date: Fri, 3 Jul 2009 08:44:03 -0500 Memory allocation disables interrupts as part of the allocation and freeing process. For -RT it is important that this section remains short and doesn't depend on the size of the request or an internal state of the memory allocator. At the beginning the SLAB memory allocator was adapted for RT's needs and it required substantial changes.
Later, with the addition of the SLUB memory allocator we adapted this one as well and the changes were smaller. More important, due to the design of the SLUB allocator it performs better and its worst case latency was smaller. In the end only SLUB remained supported. Disable SLAB and SLOB on -RT. Only SLUB is adapted to -RT needs. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 91/252 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: rcu: make RCU_BOOST default on RT Date: Fri, 21 Mar 2014 20:19:05 +0100 Since it is no longer invoked from the softirq people run into OOM more often if the priority of the RCU thread is too low. Making boosting default on RT should help in those cases and it can be switched off if someone knows better. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 92/252 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: sched: Disable CONFIG_RT_GROUP_SCHED on RT Date: Mon, 18 Jul 2011 17:03:52 +0200 Carsten reported problems when running: taskset 01 chrt -f 1 sleep 1 from within rc.local on a F15 machine. The task stays running and never gets on the run queue because some of the run queues have rt_throttled=1 which does not go away. Works nice from a ssh login shell. Disabling CONFIG_RT_GROUP_SCHED solves that as well. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 93/252 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: net/core: disable NET_RX_BUSY_POLL on RT Date: Sat, 27 May 2017 19:02:06 +0200 napi_busy_loop() disables preemption and performs a NAPI poll. We can't acquire sleeping locks with disabled preemption so we would have to work around this and add explicit locking for synchronisation against ksoftirqd.
Without explicit synchronisation a low priority process would "own" the NAPI state (by setting NAPIF_STATE_SCHED) and could be scheduled out (no preempt_disable() and BH is preemptible on RT). In case a network packet arrives then the interrupt handler would set NAPIF_STATE_MISSED and the system would wait until the task owning the NAPI would be scheduled in again. Should a task with RT priority busy poll then it would consume the CPU instead of allowing tasks with lower priority to run. The NET_RX_BUSY_POLL is disabled by default (the system wide sysctls for poll/read are set to zero) so disable NET_RX_BUSY_POLL on RT to avoid wrong locking context on RT. Should this feature be considered useful on RT systems then it could be enabled again with proper locking and synchronisation. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 94/252 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: md: disable bcache Date: Thu, 29 Aug 2013 11:48:57 +0200 It uses anon semaphores \|drivers/md/bcache/request.c: In function ‘cached_dev_write_complete’: \|drivers/md/bcache/request.c:1007:2: error: implicit declaration of function ‘up_read_non_owner’ [-Werror=implicit-function-declaration] \| up_read_non_owner(&dc->writeback_lock); \| ^ \|drivers/md/bcache/request.c: In function ‘request_write’: \|drivers/md/bcache/request.c:1033:2: error: implicit declaration of function ‘down_read_non_owner’ [-Werror=implicit-function-declaration] \| down_read_non_owner(&dc->writeback_lock); \| ^ either we get rid of those or we have to introduce them… Link: http://lkml.kernel.org/r/20130820111602.3cea203c@gandalf.local.home Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 95/252 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: efi: Disable runtime services on RT Date: Thu, 26 Jul 2018 15:03:16 +0200 Based on measurements the EFI functions get_variable / get_next_variable take up to 2us which looks okay.
The functions get_time, set_time take around 10ms. Those 10ms are too much. Even one ms would be too much. Ard mentioned that SetVariable might even trigger larger latencies if the firmware will erase flash blocks on NOR. The time-functions are used by efi-rtc and can be triggered during runtime (either via explicit read/write or ntp sync). The variable write could be used by pstore. These functions can be disabled without much of a loss. The poweroff / reboot hooks may be provided by PSCI. Disable EFI's runtime wrappers. This was observed on "EFI v2.60 by SoftIron Overdrive 1000". Acked-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 96/252 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: efi: Allow efi=runtime Date: Thu, 26 Jul 2018 15:06:10 +0200 In case the command line option "efi=noruntime" is default at build-time, the user could override its state by `efi=runtime' and allow it again. Acked-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 97/252 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: x86: Disable HAVE_ARCH_JUMP_LABEL Date: Mon, 1 Jul 2019 17:39:28 +0200 __text_poke() does: \| local_irq_save(flags); … \| ptep = get_locked_pte(poking_mm, poking_addr, &ptl); which does not work on -RT because the PTE-lock is a spinlock_t typed lock. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 98/252 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: rt: Add local irq locks Date: Mon, 20 Jun 2011 09:03:47 +0200 Introduce locallock. For !RT this maps to preempt_disable()/ local_irq_disable() so there is not much that changes. For RT this will map to a spinlock. This makes preemption possible and locked "resource" gets the lockdep annotation it wouldn't have otherwise. The locks are recursive for owner == current.
Also, all locks use migrate_disable() which ensures that the task is not migrated to another CPU while the lock is held and the owner is preempted. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 99/252 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: softirq: Add preemptible softirq Date: Mon, 20 May 2019 13:09:08 +0200 Add preemptible softirq for RT's needs. By removing the softirq count from the preempt counter, the softirq becomes preemptible. A per-CPU lock ensures that there is no parallel softirq processing or that per-CPU variables are not accessed in parallel by multiple threads. local_bh_enable() will process all softirq work that has been raised in its BH-disabled section once the BH counter gets to 0. [+ rcu_read_lock() as part of local_bh_disable() by Scott Wood] Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 100/252 [ Author: Oleg Nesterov Email: oleg@redhat.com Subject: signal/x86: Delay calling signals in atomic Date: Tue, 14 Jul 2015 14:26:34 +0200 On x86_64 we must disable preemption before we enable interrupts for stack faults, int3 and debugging, because the current task is using a per CPU debug stack defined by the IST. If we schedule out, another task can come in and use the same stack and cause the stack to be corrupted and crash the kernel on return. When CONFIG_PREEMPT_RT is enabled, spin_locks become mutexes, and one of these is the spin lock used in signal handling. Some of the debug code (int3) causes do_trap() to send a signal. This function calls a spin lock that has been converted to a mutex and has the possibility to sleep. If this happens, the above issue with the corrupted stack is possible. Instead of calling the signal right away, for PREEMPT_RT and x86_64, the signal information is stored on the stacks task_struct and TIF_NOTIFY_RESUME is set. Then on exit of the trap, the signal resume code will send the signal when preemption is enabled.
[ rostedt: Switched from #ifdef CONFIG_PREEMPT_RT to ARCH_RT_DELAYS_SIGNAL_SEND and added comments to the code. ] Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> [bigeasy: also needed on 32bit as per Yang Shi <yang.shi@linaro.org>] Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 101/252 [ Author: Peter Zijlstra Email: peterz@infradead.org Subject: Split IRQ-off and zone->lock while freeing pages from PCP list #2 Date: Mon, 28 May 2018 15:24:21 +0200 Split the IRQ-off section while accessing the PCP list from zone->lock while freeing pages. Introduce isolate_pcp_pages() which separates the pages from the PCP list onto a temporary list and then free the temporary list via free_pcppages_bulk(). Signed-off-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 102/252 [ Author: Peter Zijlstra Email: peterz@infradead.org Subject: Split IRQ-off and zone->lock while freeing pages from PCP list #2 Date: Mon, 28 May 2018 15:24:21 +0200 Split the IRQ-off section while accessing the PCP list from zone->lock while freeing pages. Introduce isolate_pcp_pages() which separates the pages from the PCP list onto a temporary list and then free the temporary list via free_pcppages_bulk(). Signed-off-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 103/252 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: mm/SLxB: change list_lock to raw_spinlock_t Date: Mon, 28 May 2018 15:24:22 +0200 The list_lock is used with IRQs off on RT. Make it a raw_spinlock_t otherwise the interrupts won't be disabled on -RT. The locking rules remain the same on !RT. This patch changes it for SLAB and SLUB since both share the same header file for struct kmem_cache_node definition.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 104/252 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: mm/SLUB: delay giving back empty slubs to IRQ enabled regions Date: Thu, 21 Jun 2018 17:29:19 +0200 __free_slab() is invoked with disabled interrupts which increases the irq-off time while __free_pages() is doing the work. Allow __free_slab() to be invoked with enabled interrupts and move everything from interrupts-off invocations to a temporary per-CPU list so it can be processed later. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 105/252 [ Author: Ingo Molnar Email: mingo@elte.hu Subject: mm: page_alloc: rt-friendly per-cpu pages Date: Fri, 3 Jul 2009 08:29:37 -0500 rt-friendly per-cpu pages: convert the irqs-off per-cpu locking method into a preemptible, explicit-per-cpu-locks method. Contains fixes from: Peter Zijlstra <a.p.zijlstra@chello.nl> Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 106/252 [ Author: Anna-Maria Gleixner Email: anna-maria@linutronix.de Subject: mm/page_alloc: Split drain_local_pages() Date: Thu, 18 Apr 2019 11:09:04 +0200 Split the functionality of drain_local_pages() into a separate function. This is preparatory work for introducing the static key dependent locking mechanism. No functional change. Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 107/252 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: mm/swap: Add static key dependent pagevec locking Date: Thu, 18 Apr 2019 11:09:05 +0200 The locking of struct pagevec is done by disabling preemption. In case the struct has to be accessed from interrupt context then interrupts are disabled. This means the struct can only be accessed locally from the CPU.
There is also no lockdep coverage which would scream if it is accessed from the wrong context. Create struct swap_pagevec which consists of a pagevec member and a spinlock_t. Introduce a static key, which changes the locking behavior only if the key is set in the following way: Before the struct is accessed the spin_lock has to be acquired instead of using preempt_disable(). Since the struct is used CPU-locally there is no spinning on the lock but the lock is acquired immediately. If the struct is accessed from interrupt context, spin_lock_irqsave() is used. No functional change yet because static key is not enabled. [anna-maria: introduce static key] Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 108/252 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: mm/swap: Access struct pagevec remotely Date: Thu, 18 Apr 2019 11:09:06 +0200 When the newly introduced static key is enabled, struct pagevec is locked during access. So it is possible to access it from a remote CPU. The advantage is that the work can be done from the "requesting" CPU without firing a worker on a remote CPU and waiting for it to complete the work. No functional change because static key is not enabled. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 109/252 [ Author: Anna-Maria Gleixner Email: anna-maria@linutronix.de Subject: mm/swap: Enable "use_pvec_lock" nohz_full dependent Date: Thu, 18 Apr 2019 11:09:07 +0200 When a system runs with CONFIG_NO_HZ_FULL enabled, the tick of CPUs listed in the 'nohz_full=' kernel command line parameter should be stopped whenever possible. The tick stays stopped longer when work for this CPU is handled by another CPU.
With the already introduced static key 'use_pvec_lock' there is the possibility to prevent firing a worker for mm/swap work on a remote CPU with a stopped tick. Therefore enable the static key in case the setup of the kernel command line parameter 'nohz_full=' was successful, which implies that CONFIG_NO_HZ_FULL is set. Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 110/252 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: mm/swap: Enable use pvec lock on RT Date: Mon, 12 Aug 2019 11:20:44 +0200 On RT we also need to avoid preempt disable/IRQ-off regions so we have to enable the locking while accessing pvecs. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 111/252 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: preempt: Provide preempt__(no)rt variants Date: Fri, 24 Jul 2009 12:38:56 +0200 RT needs a few preempt_disable/enable points which are not necessary otherwise. Implement variants to avoid #ifdeffery. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 112/252 [ Author: Ingo Molnar Email: mingo@elte.hu Subject: mm/vmstat: Protect per cpu variables with preempt disable on RT Date: Fri, 3 Jul 2009 08:30:13 -0500 Disable preemption on -RT for the vmstat code. On vanilla the code runs in IRQ-off regions while on -RT it does not. "preempt_disable" ensures that the same resource is not updated in parallel due to preemption.
Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 113/252 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: mm: Enable SLUB for RT Date: Thu, 25 Oct 2012 10:32:35 +0100 Avoid the memory allocation in IRQ section Signed-off-by: Thomas Gleixner <tglx@linutronix.de> [bigeasy: factor out everything except the kcalloc() workaorund ] Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 114/252 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: slub: Enable irqs for __GFP_WAIT Date: Wed, 9 Jan 2013 12:08:15 +0100 SYSTEM_RUNNING might be too late for enabling interrupts. Allocations with GFP_WAIT can happen before that. So use this as an indicator. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 115/252 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: slub: Disable SLUB_CPU_PARTIAL Date: Wed, 15 Apr 2015 19:00:47 +0200 \|BUG: sleeping function called from invalid context at kernel/locking/rtmutex.c:915 \|in_atomic(): 1, irqs_disabled(): 0, pid: 87, name: rcuop/7 \|1 lock held by rcuop/7/87: \| #0: (rcu_callback){......}, at: [<ffffffff8112c76a>] rcu_nocb_kthread+0x1ca/0x5d0 \|Preemption disabled at:[<ffffffff811eebd9>] put_cpu_partial+0x29/0x220 \| \|CPU: 0 PID: 87 Comm: rcuop/7 Tainted: G W 4.0.0-rt0+ #477 \|Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.7.5-20140531_083030-gandalf 04/01/2014 \| 000000000007a9fc ffff88013987baf8 ffffffff817441c7 0000000000000007 \| 0000000000000000 ffff88013987bb18 ffffffff810eee51 0000000000000000 \| ffff88013fc10200 ffff88013987bb48 ffffffff8174a1c4 000000000007a9fc \|Call Trace: \| [<ffffffff817441c7>] dump_stack+0x4f/0x90 \| [<ffffffff810eee51>] ___might_sleep+0x121/0x1b0 \| [<ffffffff8174a1c4>] rt_spin_lock+0x24/0x60 \| [<ffffffff811a689a>] __free_pages_ok+0xaa/0x540 \| [<ffffffff811a729d>] __free_pages+0x1d/0x30 \| [<ffffffff811eddd5>] __free_slab+0xc5/0x1e0 \| [<ffffffff811edf46>] free_delayed+0x56/0x70 \| 
[<ffffffff811eecfd>] put_cpu_partial+0x14d/0x220 \| [<ffffffff811efc98>] __slab_free+0x158/0x2c0 \| [<ffffffff811f0021>] kmem_cache_free+0x221/0x2d0 \| [<ffffffff81204d0c>] file_free_rcu+0x2c/0x40 \| [<ffffffff8112c7e3>] rcu_nocb_kthread+0x243/0x5d0 \| [<ffffffff810e951c>] kthread+0xfc/0x120 \| [<ffffffff8174abc8>] ret_from_fork+0x58/0x90 Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 116/252 [ Author: Yang Shi Email: yang.shi@windriver.com Subject: mm/memcontrol: Don't call schedule_work_on in preemption disabled context Date: Wed, 30 Oct 2013 11:48:33 -0700 The following trace is triggered when running ltp oom test cases: BUG: sleeping function called from invalid context at kernel/rtmutex.c:659 in_atomic(): 1, irqs_disabled(): 0, pid: 17188, name: oom03 Preemption disabled at:[<ffffffff8112ba70>] mem_cgroup_reclaim+0x90/0xe0 CPU: 2 PID: 17188 Comm: oom03 Not tainted 3.10.10-rt3 #2 Hardware name: Intel Corporation Calpella platform/MATXM-CORE-411-B, BIOS 4.6.3 08/18/2010 ffff88007684d730 ffff880070df9b58 ffffffff8169918d ffff880070df9b70 ffffffff8106db31 ffff88007688b4a0 ffff880070df9b88 ffffffff8169d9c0 ffff88007688b4a0 ffff880070df9bc8 ffffffff81059da1 0000000170df9bb0 Call Trace: [<ffffffff8169918d>] dump_stack+0x19/0x1b [<ffffffff8106db31>] __might_sleep+0xf1/0x170 [<ffffffff8169d9c0>] rt_spin_lock+0x20/0x50 [<ffffffff81059da1>] queue_work_on+0x61/0x100 [<ffffffff8112b361>] drain_all_stock+0xe1/0x1c0 [<ffffffff8112ba70>] mem_cgroup_reclaim+0x90/0xe0 [<ffffffff8112beda>] __mem_cgroup_try_charge+0x41a/0xc40 [<ffffffff810f1c91>] ? release_pages+0x1b1/0x1f0 [<ffffffff8106f200>] ? sched_exec+0x40/0xb0 [<ffffffff8112cc87>] mem_cgroup_charge_common+0x37/0x70 [<ffffffff8112e2c6>] mem_cgroup_newpage_charge+0x26/0x30 [<ffffffff8110af68>] handle_pte_fault+0x618/0x840 [<ffffffff8103ecf6>] ? unpin_current_cpu+0x16/0x70 [<ffffffff81070f94>] ? 
migrate_enable+0xd4/0x200 [<ffffffff8110cde5>] handle_mm_fault+0x145/0x1e0 [<ffffffff810301e1>] __do_page_fault+0x1a1/0x4c0 [<ffffffff8169c9eb>] ? preempt_schedule_irq+0x4b/0x70 [<ffffffff8169e3b7>] ? retint_kernel+0x37/0x40 [<ffffffff8103053e>] do_page_fault+0xe/0x10 [<ffffffff8169e4c2>] page_fault+0x22/0x30 So, to prevent schedule_work_on from being called in preempt disabled context, replace the pair of get/put_cpu() to get/put_cpu_light(). Signed-off-by: Yang Shi <yang.shi@windriver.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 117/252 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: mm/memcontrol: Replace local_irq_disable with local locks Date: Wed, 28 Jan 2015 17:14:16 +0100 There are a few local_irq_disable() which then take sleeping locks. This patch converts them local locks. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 118/252 [ Author: Mike Galbraith Email: umgwanakikbuti@gmail.com Subject: mm/zsmalloc: copy with get_cpu_var() and locking Date: Tue, 22 Mar 2016 11:16:09 +0100 get_cpu_var() disables preemption and triggers a might_sleep() splat later. This is replaced with get_locked_var(). This bitspinlocks are replaced with a proper mutex which requires a slightly larger struct to allocate. Signed-off-by: Mike Galbraith <umgwanakikbuti@gmail.com> [bigeasy: replace the bitspin_lock() with a mutex, get_locked_var(). Mike then fixed the size magic] Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 119/252 [ Author: Luis Claudio R. Goncalves Email: lclaudio@uudg.org Subject: mm/zswap: Do not disable preemption in zswap_frontswap_store() Date: Tue, 25 Jun 2019 11:28:04 -0300 Zswap causes "BUG: scheduling while atomic" by blocking on a rt_spin_lock() with preemption disabled. The preemption is disabled by get_cpu_var() in zswap_frontswap_store() to protect the access of the zswap_dstmem percpu variable. 
Use get_locked_var() to protect the percpu zswap_dstmem variable, making the code preemptive. As get_cpu_ptr() also disables preemption, replace it by this_cpu_ptr() and remove the counterpart put_cpu_ptr(). Steps to Reproduce: 1. # grubby --args "zswap.enabled=1" --update-kernel DEFAULT 2. # reboot 3. Calculate the amount o memory to be used by the test: ---> grep MemAvailable /proc/meminfo ---> Add 25% ~ 50% to that value 4. # stress --vm 1 --vm-bytes ${MemAvailable+25%} --timeout 240s Usually, in less than 5 minutes the backtrace listed below appears, followed by a kernel panic: \| BUG: scheduling while atomic: kswapd1/181/0x00000002 \| \| Preemption disabled at: \| [<ffffffff8b2a6cda>] zswap_frontswap_store+0x21a/0x6e1 \| \| Kernel panic - not syncing: scheduling while atomic \| CPU: 14 PID: 181 Comm: kswapd1 Kdump: loaded Not tainted 5.0.14-rt9 #1 \| Hardware name: AMD Pence/Pence, BIOS WPN2321X_Weekly_12_03_21 03/19/2012 \| Call Trace: \| panic+0x106/0x2a7 \| __schedule_bug.cold+0x3f/0x51 \| __schedule+0x5cb/0x6f0 \| schedule+0x43/0xd0 \| rt_spin_lock_slowlock_locked+0x114/0x2b0 \| rt_spin_lock_slowlock+0x51/0x80 \| zbud_alloc+0x1da/0x2d0 \| zswap_frontswap_store+0x31a/0x6e1 \| __frontswap_store+0xab/0x130 \| swap_writepage+0x39/0x70 \| pageout.isra.0+0xe3/0x320 \| shrink_page_list+0xa8e/0xd10 \| shrink_inactive_list+0x251/0x840 \| shrink_node_memcg+0x213/0x770 \| shrink_node+0xd9/0x450 \| balance_pgdat+0x2d5/0x510 \| kswapd+0x218/0x470 \| kthread+0xfb/0x130 \| ret_from_fork+0x27/0x50 Cc: stable-rt@vger.kernel.org Reported-by: Ping Fang <pifang@redhat.com> Signed-off-by: Luis Claudio R. 
Goncalves <lgoncalv@redhat.com> Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 120/252 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: radix-tree: use local locks Date: Wed, 25 Jan 2017 16:34:27 +0100 The preload functionality uses per-CPU variables and preempt-disable to ensure that it does not switch CPUs during its usage. This patch adds local_locks() instead of preempt_disable() for the same purpose and to remain preemptible on -RT. Cc: stable-rt@vger.kernel.org Reported-and-debugged-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 121/252 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: x86: kvm Require const tsc for RT Date: Sun, 6 Nov 2011 12:26:18 +0100 Non constant TSC is a nightmare on bare metal already, but with virtualization it becomes a complete disaster because the workarounds are horrible latency wise. That's also a prerequisite for running RT in a guest on top of a RT host. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 122/252 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: pci/switchtec: Don't use completion's wait queue Date: Wed, 4 Oct 2017 10:24:23 +0200 The poll callback is using completion's wait_queue_head_t member and puts it in poll_wait() so the poll() caller gets a wakeup after command completed. This does not work on RT because we don't have a wait_queue_head_t in our completion implementation. Nobody else in tree does it like that, so this is the only driver that breaks. Instead of using the completion, use a waitqueue with a status flag as suggested by Logan. I don't have the HW so I have no idea if it works as expected, so please test it.
Cc: Kurt Schwemmer <kurt.schwemmer@microsemi.com> Cc: Logan Gunthorpe <logang@deltatee.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 123/252 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: wait.h: include atomic.h Date: Mon, 28 Oct 2013 12:19:57 +0100 \| CC init/main.o \|In file included from include/linux/mmzone.h:9:0, \| from include/linux/gfp.h:4, \| from include/linux/kmod.h:22, \| from include/linux/module.h:13, \| from init/main.c:15: \|include/linux/wait.h: In function ‘wait_on_atomic_t’: \|include/linux/wait.h:982:2: error: implicit declaration of function ‘atomic_read’ [-Werror=implicit-function-declaration] \| if (atomic_read(val) == 0) \| ^ This pops up on ARM. Non-RT gets its atomic.h include from spinlock.h Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 124/252 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: completion: Use simple wait queues Date: Fri, 11 Jan 2013 11:23:51 +0100 Completions have no long lasting callbacks and therefore do not need the complex waitqueue variant. Use simple waitqueues which reduces the contention on the waitqueue lock. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> [cminyard@mvista.com: Move __prepare_to_swait() into the do loop because swake_up_locked() removes the waiter on wake from the queue while in the original code it is not the case] Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 125/252 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: hrtimer: Allow raw wakeups during boot Date: Fri, 9 Aug 2019 15:25:21 +0200 There are a few wake-up timers during the early boot which are essential for the system to make progress. At this stage no softirq thread has been spawned for the softirq processing, so there is no timer processing in softirq.
The wakeup in question: smpboot_create_thread() -> kthread_create_on_cpu() -> kthread_bind() -> wait_task_inactive() -> schedule_hrtimeout() Let the timer fire in hardirq context during the system boot. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 126/252 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: hrtimer: move state change before hrtimer_cancel in do_nanosleep() Date: Thu, 6 Dec 2018 10:15:13 +0100 There is a small window between setting t->task to NULL and waking the task up (which would set TASK_RUNNING). So the timer would fire, run and set ->task to NULL while the other side/do_nanosleep() wouldn't enter freezable_schedule(). After all we are preemptible here (in do_nanosleep() and on the timer wake up path) and on KVM/virt the virt-CPU might get preempted. So do_nanosleep() wouldn't enter freezable_schedule() but cancel the timer which is still running and wait for it via hrtimer_wait_for_timer(). Then wait_event()/might_sleep() would complain that it is invoked with state != TASK_RUNNING. This isn't a problem since it would be reset to TASK_RUNNING later anyway and we don't rely on the previous state. Move the state update to TASK_RUNNING before hrtimer_cancel() so there are no complaints from might_sleep() about the wrong state. Cc: stable-rt@vger.kernel.org Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 127/252 [ Author: John Stultz Email: johnstul@us.ibm.com Subject: posix-timers: Thread posix-cpu-timers on -rt Date: Fri, 3 Jul 2009 08:29:58 -0500 posix-cpu-timer code takes non -rt safe locks in hard irq context. Move it to a thread.
[ 3.0 fixes from Peter Zijlstra <peterz@infradead.org> ] Signed-off-by: John Stultz <johnstul@us.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 128/252 [ Author: Anna-Maria Gleixner Email: anna-maria@linutronix.de Subject: posix-timers: Add expiry lock Date: Mon, 27 May 2019 16:54:06 +0200 If a posix timer that is about to be removed is active then the code will retry the delete operation until it succeeds / the timer callback completes. Use hrtimer_grab_expiry_lock() for posix timers which use a hrtimer underneath to spin on a lock until the callback finished. Introduce cpu_timers_grab_expiry_lock() for the posix-cpu-timer. This will acquire the proper per-CPU spin_lock which is acquired by the CPU which is expiring the timer. Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de> [bigeasy: keep the posix-cpu timer bits, everything else got applied] Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 129/252 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: sched: Limit the number of task migrations per batch Date: Mon, 6 Jun 2011 12:12:51 +0200 Put an upper limit on the number of tasks which are migrated per batch to avoid large latencies. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 130/252 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: sched: Move mmdrop to RCU on RT Date: Mon, 6 Jun 2011 12:20:33 +0200 Takes sleeping locks and calls into the memory allocator, so nothing we want to do in task switch and other atomic contexts. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 131/252 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: kernel/sched: move stack + kprobe clean up to __put_task_struct() Date: Mon, 21 Nov 2016 19:31:08 +0100 There is no need to free the stack before the task struct (except for reasons mentioned in commit 68f24b08ee89 ("sched/core: Free the stack early if CONFIG_THREAD_INFO_IN_TASK")).
This also comes in handy on -RT because we can't free memory in a preempt disabled region. vfree_atomic() delays the memory cleanup to a worker. Since we move everything to the RCU callback, we can also free it immediately. Cc: stable-rt@vger.kernel.org #for kprobe_flush_task() Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 132/252 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: sched: Add saved_state for tasks blocked on sleeping locks Date: Sat, 25 Jun 2011 09:21:04 +0200 Spinlocks are state preserving in !RT. RT changes the state when a task gets blocked on a lock. So we need to remember the state before the lock contention. If a regular wakeup (not an RTmutex related wakeup) happens, the saved_state is updated to running. When the lock sleep is done, the saved state is restored. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 133/252 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: sched: Do not account rcu_preempt_depth on RT in might_sleep() Date: Tue, 7 Jun 2011 09:19:06 +0200 RT changes the rcu_preempt_depth semantics, so we cannot check for it in might_sleep(). Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 134/252 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: sched: Disable TTWU_QUEUE on RT Date: Tue, 13 Sep 2011 16:42:35 +0200 The queued remote wakeup mechanism can introduce rather large latencies if the number of migrated tasks is high. Disable it for RT. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 135/252 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: softirq: Avoid a cancel dead-lock in tasklet handling due to preemptible-softirq Date: Sat, 22 Jun 2019 00:09:22 +0200 A pending / active tasklet which is preempted by a task on the same CPU will spin indefinitely because the tasklet makes no progress. To avoid this deadlock we can disable BH which will acquire the softirq-lock which will force the completion of the softirq and so the tasklet.
The BH off/on in tasklet_kill() will force tasklets which are not yet running but scheduled (because ksoftirqd was preempted before it could start the tasklet) to run. The BH off/on in tasklet_unlock_wait() will force tasklets which got preempted while running to complete. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 136/252 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: softirq: Check preemption after reenabling interrupts Date: Sun, 13 Nov 2011 17:17:09 +0100 raise_softirq_irqoff() disables interrupts and wakes the softirq daemon, but after reenabling interrupts there is no preemption check, so the execution of the softirq thread might be delayed arbitrarily. In principle we could add that check to local_irq_enable/restore, but that's overkill as the raise_softirq_irqoff() sections are the only ones which show this behaviour. Reported-by: Carsten Emde <cbe@osadl.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 137/252 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: softirq: Disable softirq stacks for RT Date: Mon, 18 Jul 2011 13:59:17 +0200 Disable extra stacks for softirqs. We want to preempt softirqs and having them on a special IRQ-stack does not make this easier. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 138/252 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: net/core: use local_bh_disable() in netif_rx_ni() Date: Fri, 16 Jun 2017 19:03:16 +0200 In 2004 netif_rx_ni() gained a preempt_disable() section around netif_rx() and its do_softirq() + testing for it. The do_softirq() part is required because netif_rx() raises the softirq but does not invoke it. The preempt_disable() is required to remain on the same CPU which added the skb to the per-CPU list. All this can be avoided by putting this into a local_bh_disable()ed section. The local_bh_enable() part will invoke do_softirq() if required.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 139/252 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: rtmutex: Handle the various new futex race conditions Date: Fri, 10 Jun 2011 11:04:15 +0200 RT opens a few new interesting race conditions in the rtmutex/futex combo due to futex hash bucket lock being a 'sleeping' spinlock and therefore not disabling preemption. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 140/252 [ Author: Steven Rostedt Email: rostedt@goodmis.org Subject: futex: Fix bug on when a requeued RT task times out Date: Tue, 14 Jul 2015 14:26:34 +0200 Requeue with timeout causes a bug with PREEMPT_RT. The bug comes from a timed out condition. TASK 1 TASK 2 ------ ------ futex_wait_requeue_pi() futex_wait_queue_me() <timed out> double_lock_hb(); raw_spin_lock(pi_lock); if (current->pi_blocked_on) { } else { current->pi_blocked_on = PI_WAKE_INPROGRESS; raw_spin_unlock(pi_lock); spin_lock(hb->lock); <-- blocked! plist_for_each_entry_safe(this) { rt_mutex_start_proxy_lock(); task_blocks_on_rt_mutex(); BUG_ON(task->pi_blocked_on)!!!! The BUG_ON() actually has a check for PI_WAKE_INPROGRESS, but the problem is that, after TASK 1 sets PI_WAKE_INPROGRESS, it then tries to grab the hb->lock, which it fails to do. As the hb->lock is a mutex, it will block and set the "pi_blocked_on" to the hb->lock. When TASK 2 goes to requeue it, the check for PI_WAKE_INPROGRESS fails because TASK 1's pi_blocked_on is no longer set to that, but instead, set to the hb->lock. The fix: When calling rt_mutex_start_proxy_lock() a check is made to see if the proxy task's pi_blocked_on is set. If so, exit out early. Otherwise set it to a new flag PI_REQUEUE_INPROGRESS, which notifies the proxy task that it is being requeued, and will handle things appropriately.
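As a rough illustration of the fix described above, the following user-space C sketch models the early-exit check. It is not the kernel code: struct task_model, struct waiter_model, start_proxy_lock_model(), the sentinel pointer values and the -EAGAIN return value are assumptions made for this example; only the names pi_blocked_on, PI_WAKE_INPROGRESS and PI_REQUEUE_INPROGRESS are taken from the commit message.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's waiter structure. */
struct waiter_model {
	int unused;
};

/* Sentinel pointer values (assumed) modelled on the flags named above. */
#define PI_WAKE_INPROGRESS    ((struct waiter_model *)1)
#define PI_REQUEUE_INPROGRESS ((struct waiter_model *)2)

/* Hypothetical, minimal model of a task. */
struct task_model {
	struct waiter_model *pi_blocked_on; /* NULL when not blocked */
};

/*
 * Model of the check: if the proxy task is already blocked on anything
 * (a wake-in-progress marker or a real lock such as hb->lock), exit
 * early. Otherwise mark the task as being requeued.
 */
static int start_proxy_lock_model(struct task_model *task)
{
	if (task->pi_blocked_on != NULL)
		return -EAGAIN; /* assumed error code for "try again" */

	task->pi_blocked_on = PI_REQUEUE_INPROGRESS;
	return 0;
}
```

The point of the sketch is that the check tests for any non-NULL pi_blocked_on rather than only for PI_WAKE_INPROGRESS, so a task that has meanwhile blocked on hb->lock is also caught.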
Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 141/252 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: futex: Ensure lock/unlock symetry versus pi_lock and hash bucket lock Date: Fri, 1 Mar 2013 11:17:42 +0100 In exit_pi_state_list() we have the following locking construct: spin_lock(&hb->lock); raw_spin_lock_irq(&curr->pi_lock); ... spin_unlock(&hb->lock); In !RT this works, but on RT the migrate_enable() function which is called from spin_unlock() sees atomic context due to the held pi_lock and just decrements the migrate_disable_atomic counter of the task. Now the next call to migrate_disable() sees the counter being negative and issues a warning. That check should be in migrate_enable() already. Fix this by dropping pi_lock before unlocking hb->lock and reacquiring pi_lock afterwards. This is safe as the loop code reevaluates head again under the pi_lock. Reported-by: Yong Zhang <yong.zhang@windriver.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 142/252 [ Author: Grygorii Strashko Email: Grygorii.Strashko@linaro.org Subject: pid.h: include atomic.h Date: Tue, 21 Jul 2015 19:43:56 +0300 This patch fixes a build error: CC kernel/pid_namespace.o In file included from kernel/pid_namespace.c:11:0: include/linux/pid.h: In function 'get_pid': include/linux/pid.h:78:3: error: implicit declaration of function 'atomic_inc' [-Werror=implicit-function-declaration] atomic_inc(&pid->count); ^ which happens when CONFIG_PROVE_LOCKING=n CONFIG_DEBUG_SPINLOCK=n CONFIG_DEBUG_MUTEXES=n CONFIG_DEBUG_LOCK_ALLOC=n CONFIG_PID_NS=y Vanilla gets this via spinlock.h. Signed-off-by: Grygorii Strashko <Grygorii.Strashko@linaro.org> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 143/252 [ Author: Wolfgang M.
Reimer Email: linuxball@gmail.com Subject: locking: locktorture: Do NOT include rwlock.h directly Date: Tue, 21 Jul 2015 16:20:07 +0200 Including rwlock.h directly will cause kernel builds to fail if CONFIG_PREEMPT_RT is defined. The correct header file (rwlock_rt.h OR rwlock.h) will be included by spinlock.h which is included by locktorture.c anyway. Cc: stable-rt@vger.kernel.org Signed-off-by: Wolfgang M. Reimer <linuxball@gmail.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 144/252 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: rtmutex: Add rtmutex_lock_killable() Date: Thu, 9 Jun 2011 11:43:52 +0200 Add "killable" type to rtmutex. We need this since rtmutexes are used as "normal" mutexes which do use this type. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 145/252 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: rtmutex: Make lock_killable work Date: Sat, 1 Apr 2017 12:50:59 +0200 Locking an rt mutex killable does not work because signal handling is restricted to TASK_INTERRUPTIBLE. Use signal_pending_state() unconditionally. Cc: stable-rt@vger.kernel.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 146/252 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: spinlock: Split the lock types header Date: Wed, 29 Jun 2011 19:34:01 +0200 Split raw_spinlock into its own file and the remaining spinlock_t into its own non-RT header. The non-RT header will be replaced later by sleeping spinlocks. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 147/252 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: rtmutex: Avoid include hell Date: Wed, 29 Jun 2011 20:06:39 +0200 Include only the required raw types. This avoids pulling in the complete spinlock header which in turn requires rtmutex.h at some point.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 148/252 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: rbtree: don't include the rcu header Date: Fri, 20 Dec 2019 11:38:21 -0500 The RCU header pulls in spinlock.h and fails due to not yet defined types: \|In file included from include/linux/spinlock.h:275:0, \| from include/linux/rcupdate.h:38, \| from include/linux/rbtree.h:34, \| from include/linux/rtmutex.h:17, \| from include/linux/spinlock_types.h:18, \| from kernel/bounds.c:13: \|include/linux/rwlock_rt.h:16:38: error: unknown type name ‘rwlock_t’ \| extern void __lockfunc rt_write_lock(rwlock_t rwlock); \| ^ This patch moves the required RCU function from the rcupdate.h header file into a new header file which can be included by both users. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 149/252 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: rtmutex: Provide rt_mutex_slowlock_locked() Date: Thu, 12 Oct 2017 16:14:22 +0200 This is the inner-part of rt_mutex_slowlock(), required for rwsem-rt. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 150/252 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: rtmutex: export lockdep-less version of rt_mutex's lock, trylock and unlock Date: Thu, 12 Oct 2017 16:36:39 +0200 Required for lock implementation on top of rtmutex.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 151/252 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: rtmutex: add sleeping lock implementation Date: Thu, 12 Oct 2017 17:11:19 +0200 Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 152/252 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: sched: Use the proper LOCK_OFFSET for cond_resched() Date: Sun, 17 Jul 2011 22:51:33 +0200 RT does not increment preempt count when a 'sleeping' spinlock is locked. Update PREEMPT_LOCK_OFFSET for that case. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 153/252 [ Author: Peter Zijlstra Email: peterz@infradead.org Subject: locking/rtmutex: Clean ->pi_blocked_on in the error case Date: Mon, 30 Sep 2019 18:15:44 +0200 The function rt_mutex_wait_proxy_lock() cleans ->pi_blocked_on in case of failure (timeout, signal). The same cleanup is required in __rt_mutex_start_proxy_lock(). In both cases the task was interrupted by a signal or timeout while acquiring the lock and after the interruption it no longer blocks on the lock. Fixes: 1a1fb985f2e2b ("futex: Handle early deadlock return correctly") Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 154/252 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: rtmutex: trylock is okay on -RT Date: Wed, 2 Dec 2015 11:34:07 +0100 A non-RT kernel could deadlock on rt_mutex_trylock() in softirq context. On -RT we don't run softirqs in IRQ context but in thread context so it is not an issue here.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 155/252 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: rtmutex: add mutex implementation based on rtmutex Date: Thu, 12 Oct 2017 17:17:03 +0200 Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 156/252 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: rtmutex: add rwsem implementation based on rtmutex Date: Thu, 12 Oct 2017 17:28:34 +0200 The RT specific R/W semaphore implementation restricts the number of readers to one because a writer cannot block on multiple readers and inherit its priority or budget. The single reader restriction is painful in various ways: - Performance bottleneck for multi-threaded applications in the page fault path (mmap sem) - Progress blocker for drivers which are carefully crafted to avoid the potential reader/writer deadlock in mainline. The analysis of the writer code paths shows that properly written RT tasks should not take them. Syscalls like mmap(), file access which take mmap sem write locked have unbound latencies which are completely unrelated to mmap sem. Other R/W sem users like graphics drivers are not suitable for RT tasks either. So there is little risk to hurt RT tasks when the RT rwsem implementation is changed in the following way: - Allow concurrent readers - Make writers block until the last reader left the critical section. This blocking is not subject to priority/budget inheritance. - Readers blocked on a writer inherit their priority/budget in the normal way. There is a drawback with this scheme. R/W semaphores become writer unfair though the applications which have triggered writer starvation (mostly on mmap_sem) in the past are not really the typical workloads running on a RT system. So while it's unlikely to hit writer starvation, it's possible. If there are unexpected workloads on RT systems triggering it, we need to rethink the approach. 
Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 157/252 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: rtmutex: add rwlock implementation based on rtmutex Date: Thu, 12 Oct 2017 17:18:06 +0200 The implementation is bias-based, similar to the rwsem implementation. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 158/252 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: rtmutex: wire up RT's locking Date: Thu, 12 Oct 2017 17:31:14 +0200 Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 159/252 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: rtmutex: add ww_mutex addon for mutex-rt Date: Thu, 12 Oct 2017 17:34:38 +0200 Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 160/252 [ Author: Mikulas Patocka Email: mpatocka@redhat.com Subject: locking/rt-mutex: fix deadlock in device mapper / block-IO Date: Mon, 13 Nov 2017 12:56:53 -0500 When some block device driver creates a bio and submits it to another block device driver, the bio is added to current->bio_list (in order to avoid unbounded recursion). However, this queuing of bios can cause deadlocks, in order to avoid them, device mapper registers a function flush_current_bio_list. This function is called when device mapper driver blocks. It redirects bios queued on current->bio_list to helper workqueues, so that these bios can proceed even if the driver is blocked. The problem with CONFIG_PREEMPT_RT is that when the device mapper driver blocks, it won't call flush_current_bio_list (because tsk_is_pi_blocked returns true in sched_submit_work), so deadlocks in block device stack can happen. 
Note that we can't call blk_schedule_flush_plug if tsk_is_pi_blocked returns true - that would cause BUG_ON(rt_mutex_real_waiter(task->pi_blocked_on)) in task_blocks_on_rt_mutex when flush_current_bio_list attempts to take a spinlock. So the proper fix is to call blk_schedule_flush_plug in rt_mutex_fastlock, when fast acquire failed and when the task is about to block. CC: stable-rt@vger.kernel.org [bigeasy: The deadlock is not device-mapper specific, it can also occur in plain EXT4] Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 161/252 [ Author: Scott Wood Email: swood@redhat.com Subject: locking/rt-mutex: Flush block plug on __down_read() Date: Fri, 4 Jan 2019 15:33:21 -0500 __down_read() bypasses the rtmutex frontend to call rt_mutex_slowlock_locked() directly, and thus it needs to call blk_schedule_flush_plug() itself. Cc: stable-rt@vger.kernel.org Signed-off-by: Scott Wood <swood@redhat.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 162/252 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: locking/rtmutex: re-init the wait_lock in rt_mutex_init_proxy_locked() Date: Thu, 16 Nov 2017 16:48:48 +0100 We could provide a key-class for the lockdep (and fixup all callers) or move the init to all callers (like it was) in order to avoid lockdep seeing a double-lock of the wait_lock. Reported-by: Fernando Lopez-Lezcano <nando@ccrma.Stanford.EDU> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 163/252 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: ptrace: fix ptrace vs tasklist_lock race Date: Thu, 29 Aug 2013 18:21:04 +0200 As explained by Alexander Fyodorov <halcy@yandex.ru>: \|read_lock(&tasklist_lock) in ptrace_stop() is converted to mutex on RT kernel, \|and it can remove __TASK_TRACED from task->state (by moving it to \|task->saved_state). 
If parent does wait() on child followed by a sys_ptrace \|call, the following race can happen: \| \|- child sets __TASK_TRACED in ptrace_stop() \|- parent does wait() which eventually calls wait_task_stopped() and returns \| child's pid \|- child blocks on read_lock(&tasklist_lock) in ptrace_stop() and moves \| __TASK_TRACED flag to saved_state \|- parent calls sys_ptrace, which calls ptrace_check_attach() and wait_task_inactive() The patch is based on his initial patch where an additional check is added in case the __TASK_TRACED moved to ->saved_state. The pi_lock is taken in case the caller is interrupted between looking into ->state and ->saved_state. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 164/252 [ Author: Scott Wood Email: swood@redhat.com Subject: sched: __set_cpus_allowed_ptr(): Check cpus_mask, not cpus_ptr Date: Sat, 27 Jul 2019 00:56:32 -0500 This function is concerned with the long-term cpu mask, not the transitory mask the task might have while migrate disabled. Before this patch, if a task was migrate disabled at the time __set_cpus_allowed_ptr() was called, and the new mask happened to be equal to the cpu that the task was running on, then the mask update would be lost. 
Signed-off-by: Scott Wood <swood@redhat.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 165/252 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: kernel/sched/core: add migrate_disable() Date: Sat, 27 May 2017 19:02:06 +0200 [bristot@redhat.com: rt: Increase/decrease the nr of migratory tasks when enabling/disabling migration Link: https://lkml.kernel.org/r/e981d271cbeca975bca710e2fbcc6078c09741b0.1498482127.git.bristot@redhat.com ] [swood@redhat.com: fixups and optimisations Link:https://lkml.kernel.org/r/20190727055638.20443-1-swood@redhat.com Link:https://lkml.kernel.org/r/20191012065214.28109-1-swood@redhat.com ] Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 166/252 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: sched/core: migrate_enable() must access takedown_cpu_task on !HOTPLUG_CPU Date: Fri, 29 Nov 2019 17:24:55 +0100 The variable takedown_cpu_task is never declared/used on !HOTPLUG_CPU except for migrate_enable(). This leads to a link error. Don't use takedown_cpu_task in !HOTPLUG_CPU. Reported-by: Dick Hollenbeck <dick@softplc.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 167/252 [ Author: Scott Wood Email: swood@redhat.com Subject: sched: migrate_enable: Use stop_one_cpu_nowait() Date: Sat, 12 Oct 2019 01:52:14 -0500 migrate_enable() can be called with current->state != TASK_RUNNING. Avoid clobbering the existing state by using stop_one_cpu_nowait(). Since we're stopping the current cpu, we know that we won't get past __schedule() until migration_cpu_stop() has run (at least up to the point of migrating us to another cpu). 
Signed-off-by: Scott Wood <swood@redhat.com> [bigeasy: spin until the request has been processed] Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 168/252 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: trace: Add migrate-disabled counter to tracing output Date: Sun, 17 Jul 2011 21:56:42 +0200 Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 169/252 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: futex: workaround migrate_disable/enable in different context Date: Wed, 8 Mar 2017 14:23:35 +0100 migrate_enable() invokes __schedule() and it expects a preempt count of one. Holding a raw_spinlock_t with disabled interrupts should not allow scheduling. These little hacks ensure that we don't schedule while we lock the hb lock with interrupts enabled and unlock it with interrupts disabled. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> [XXX: As per PeterZ's suggestion set_thread_flag(TIF_NEED_RESCHED); preempt_fold_need_resched() would trigger a scheduler invocation on the last preempt_enable() which in turn would allow to drop this. ] Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 170/252 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: locking: don't check for __LINUX_SPINLOCK_TYPES_H on -RT archs Date: Fri, 4 Aug 2017 17:40:42 +0200 Upstream uses arch_spinlock_t within spinlock_t and requests that spinlock_types.h header file is included first. On -RT we have the rt_mutex with its raw_lock wait_lock which needs architectures' spinlock_types.h header file for its definition. However we need rt_mutex first because it is used to build the spinlock_t so that check does not work for us. Therefore I am dropping that check. 
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 171/252 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: locking: Make spinlock_t and rwlock_t a RCU section on RT Date: Tue, 19 Nov 2019 09:25:04 +0100 On !RT a locked spinlock_t and rwlock_t disables preemption which implies a RCU read section. There is code that relies on that behaviour. Add an explicit RCU read section on RT while a sleeping lock (a lock which would disable preemption on !RT) is acquired. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 172/252 [ Author: Scott Wood Email: swood@redhat.com Subject: rcu: Use rcuc threads on PREEMPT_RT as we did Date: Wed, 11 Sep 2019 17:57:28 +0100 While switching to the reworked RCU-thread code, it has been forgotten to enable the thread processing on -RT. Besides restoring behavior that used to be default on RT, this avoids a deadlock on scheduler locks. Signed-off-by: Scott Wood <swood@redhat.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 173/252 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: srcu: replace local_irqsave() with a locallock Date: Thu, 12 Oct 2017 18:37:12 +0200 There are two instances which disable interrupts in order to get a stable this_cpu_ptr() pointer. The restore part is coupled with spin_unlock_irqrestore() which does not work on RT. Replace the local_irq_save() call with the appropriate local_lock() version of it. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 174/252 [ Author: Julia Cartwright Email: julia@ni.com Subject: rcu: enable rcu_normal_after_boot by default for RT Date: Wed, 12 Oct 2016 11:21:14 -0500 The forcing of an expedited grace period is an expensive and very RT-application unfriendly operation, as it forcibly preempts all running tasks on CPUs which are preventing the gp from expiring. 
By default, as a policy decision, disable the expediting of grace periods (after boot) on configurations which enable PREEMPT_RT. Suggested-by: Luiz Capitulino <lcapitulino@redhat.com> Acked-by: Paul E. McKenney <paulmck@linux.ibm.com> Signed-off-by: Julia Cartwright <julia@ni.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 175/252 [ Author: Scott Wood Email: swood@redhat.com Subject: rcutorture: Avoid problematic critical section nesting on RT Date: Wed, 11 Sep 2019 17:57:29 +0100 rcutorture was generating some nesting scenarios that are not reasonable. Constrain the state selection to avoid them. Example #1: 1. preempt_disable() 2. local_bh_disable() 3. preempt_enable() 4. local_bh_enable() On PREEMPT_RT, BH disabling takes a local lock only when called in non-atomic context. Thus, atomic context must be retained until after BH is re-enabled. Likewise, if BH is initially disabled in non-atomic context, it cannot be re-enabled in atomic context. Example #2: 1. rcu_read_lock() 2. local_irq_disable() 3. rcu_read_unlock() 4. local_irq_enable() If the thread is preempted between steps 1 and 2, rcu_read_unlock_special.b.blocked will be set, but it won't be acted on in step 3 because IRQs are disabled. Thus, reporting of the quiescent state will be delayed beyond the local_irq_enable(). For now, these scenarios will continue to be tested on non-PREEMPT_RT kernels, until debug checks are added to ensure that they are not happening elsewhere. Signed-off-by: Scott Wood <swood@redhat.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 176/252 [ Author: Ingo Molnar Email: mingo@elte.hu Subject: rt: Improve the serial console PASS_LIMIT Date: Wed, 14 Dec 2011 13:05:54 +0100 Beyond the warning: drivers/tty/serial/8250/8250.c:1613:6: warning: unused variable ‘pass_counter’ [-Wunused-variable] the solution of just looping infinitely was ugly - up it to 1 million to give it a chance to continue in some really ugly situation. 
Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 177/252 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: fs/epoll: Do not disable preemption on RT Date: Fri, 8 Jul 2011 16:35:35 +0200 ep_call_nested() takes a sleeping lock so we can't disable preemption. The light version is enough since ep_call_nested() doesn't mind being invoked twice on the same CPU. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 178/252 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: mm/vmalloc: Another preempt disable region which sucks Date: Tue, 12 Jul 2011 11:39:36 +0200 Avoid the preempt disable version of get_cpu_var(). The inner-lock should provide enough serialisation. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 179/252 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: block/mq: do not invoke preempt_disable() Date: Tue, 14 Jul 2015 14:26:34 +0200 preempt_disable() and get_cpu() don't play well together with the sleeping locks it tries to allocate later. It seems to be enough to replace it with get_cpu_light() and migrate_disable(). Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 180/252 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: block/mq: don't complete requests via IPI Date: Thu, 29 Jan 2015 15:10:08 +0100 The IPI runs in hardirq context and there are sleeping locks. Assume caches are shared and complete them on the local CPU. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 181/252 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: md: raid5: Make raid5_percpu handling RT aware Date: Tue, 6 Apr 2010 16:51:31 +0200 __raid_run_ops() disables preemption with get_cpu() around the access to the raid5_percpu variables. That causes scheduling while atomic spews on RT. Serialize the access to the percpu data with a lock and keep the code preemptible. 
Reported-by: Udo van den Heuvel <udovdh@xs4all.nl> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Udo van den Heuvel <udovdh@xs4all.nl> ] 182/252 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: scsi/fcoe: Make RT aware. Date: Sat, 12 Nov 2011 14:00:48 +0100 Do not disable preemption while taking sleeping locks. All users look safe with migrate_disable() only. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 183/252 [ Author: Mike Galbraith Email: umgwanakikbuti@gmail.com Subject: sunrpc: Make svc_xprt_do_enqueue() use get_cpu_light() Date: Wed, 18 Feb 2015 16:05:28 +0100 \|BUG: sleeping function called from invalid context at kernel/locking/rtmutex.c:915 \|in_atomic(): 1, irqs_disabled(): 0, pid: 3194, name: rpc.nfsd \|Preemption disabled at:[<ffffffffa06bf0bb>] svc_xprt_received+0x4b/0xc0 [sunrpc] \|CPU: 6 PID: 3194 Comm: rpc.nfsd Not tainted 3.18.7-rt1 #9 \|Hardware name: MEDION MS-7848/MS-7848, BIOS M7848W08.404 11/06/2014 \| ffff880409630000 ffff8800d9a33c78 ffffffff815bdeb5 0000000000000002 \| 0000000000000000 ffff8800d9a33c98 ffffffff81073c86 ffff880408dd6008 \| ffff880408dd6000 ffff8800d9a33cb8 ffffffff815c3d84 ffff88040b3ac000 \|Call Trace: \| [<ffffffff815bdeb5>] dump_stack+0x4f/0x9e \| [<ffffffff81073c86>] __might_sleep+0xe6/0x150 \| [<ffffffff815c3d84>] rt_spin_lock+0x24/0x50 \| [<ffffffffa06beec0>] svc_xprt_do_enqueue+0x80/0x230 [sunrpc] \| [<ffffffffa06bf0bb>] svc_xprt_received+0x4b/0xc0 [sunrpc] \| [<ffffffffa06c03ed>] svc_add_new_perm_xprt+0x6d/0x80 [sunrpc] \| [<ffffffffa06b2693>] svc_addsock+0x143/0x200 [sunrpc] \| [<ffffffffa072e69c>] write_ports+0x28c/0x340 [nfsd] \| [<ffffffffa072d2ac>] nfsctl_transaction_write+0x4c/0x80 [nfsd] \| [<ffffffff8117ee83>] vfs_write+0xb3/0x1d0 \| [<ffffffff8117f889>] SyS_write+0x49/0xb0 \| [<ffffffff815c4556>] system_call_fastpath+0x16/0x1b Signed-off-by: Mike Galbraith <umgwanakikbuti@gmail.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 184/252 [ Author: Thomas 
Gleixner Email: tglx@linutronix.de Subject: rt: Introduce cpu_chill() Date: Wed, 7 Mar 2012 20:51:03 +0100 Retry loops on RT might loop forever when the modifying side was preempted. Add cpu_chill() to replace cpu_relax(). cpu_chill() defaults to cpu_relax() for non RT. On RT it puts the looping task to sleep for a tick so the preempted task can make progress. Steven Rostedt changed it to use a hrtimer instead of msleep(): \| \|Ulrich Obergfell pointed out that cpu_chill() calls msleep() which is woken \|up by the ksoftirqd running the TIMER softirq. But as the cpu_chill() is \|called from softirq context, it may block the ksoftirqd() from running, in \|which case, it may never wake up the msleep() causing the deadlock. + bigeasy later changed to schedule_hrtimeout() \|If a task calls cpu_chill() and gets woken up by a regular or spurious \|wakeup and has a signal pending, then it exits the sleep loop in \|do_nanosleep() and sets up the restart block. If restart->nanosleep.type is \|not TI_NONE then this results in accessing a stale user pointer from a \|previously interrupted syscall and a copy to user based on the stale \|pointer or a BUG() when 'type' is not supported in nanosleep_copyout(). + bigeasy: add PF_NOFREEZE: \| [....] Waiting for /dev to be fully populated... \| ===================================== \| [ BUG: udevd/229 still has locks held! 
] \| 3.12.11-rt17 #23 Not tainted \| ------------------------------------- \| 1 lock held by udevd/229: \| #0: (&type->i_mutex_dir_key#2){+.+.+.}, at: lookup_slow+0x28/0x98 \| \| stack backtrace: \| CPU: 0 PID: 229 Comm: udevd Not tainted 3.12.11-rt17 #23 \| (unwind_backtrace+0x0/0xf8) from (show_stack+0x10/0x14) \| (show_stack+0x10/0x14) from (dump_stack+0x74/0xbc) \| (dump_stack+0x74/0xbc) from (do_nanosleep+0x120/0x160) \| (do_nanosleep+0x120/0x160) from (hrtimer_nanosleep+0x90/0x110) \| (hrtimer_nanosleep+0x90/0x110) from (cpu_chill+0x30/0x38) \| (cpu_chill+0x30/0x38) from (dentry_kill+0x158/0x1ec) \| (dentry_kill+0x158/0x1ec) from (dput+0x74/0x15c) \| (dput+0x74/0x15c) from (lookup_real+0x4c/0x50) \| (lookup_real+0x4c/0x50) from (__lookup_hash+0x34/0x44) \| (__lookup_hash+0x34/0x44) from (lookup_slow+0x38/0x98) \| (lookup_slow+0x38/0x98) from (path_lookupat+0x208/0x7fc) \| (path_lookupat+0x208/0x7fc) from (filename_lookup+0x20/0x60) \| (filename_lookup+0x20/0x60) from (user_path_at_empty+0x50/0x7c) \| (user_path_at_empty+0x50/0x7c) from (user_path_at+0x14/0x1c) \| (user_path_at+0x14/0x1c) from (vfs_fstatat+0x48/0x94) \| (vfs_fstatat+0x48/0x94) from (SyS_stat64+0x14/0x30) \| (SyS_stat64+0x14/0x30) from (ret_fast_syscall+0x0/0x48) Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 185/252 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: block: Use cpu_chill() for retry loops Date: Thu, 20 Dec 2012 18:28:26 +0100 Retry loops on RT might loop forever when the modifying side was preempted. Steven also observed a live lock when there was a concurrent priority boosting going on. Use cpu_chill() instead of cpu_relax() to let the system make progress. 
[bigeasy: After all those changes that occurred over the years, this one hunk is left and should not cause any starvation on -RT anymore] Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 186/252 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: fs: namespace: Use cpu_chill() in trylock loops Date: Wed, 7 Mar 2012 21:00:34 +0100 Retry loops on RT might loop forever when the modifying side was preempted. Use cpu_chill() instead of cpu_relax() to let the system make progress. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 187/252 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: net: Use cpu_chill() instead of cpu_relax() Date: Wed, 7 Mar 2012 21:10:04 +0100 Retry loops on RT might loop forever when the modifying side was preempted. Use cpu_chill() instead of cpu_relax() to let the system make progress. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 188/252 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: debugobjects: Make RT aware Date: Sun, 17 Jul 2011 21:41:35 +0200 Avoid filling the pool / allocating memory with irqs off(). Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 189/252 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: net: Use skbufhead with raw lock Date: Tue, 12 Jul 2011 15:38:34 +0200 Use the rps lock as rawlock so we can keep irq-off regions. It looks low latency. However we can't kfree() from this context therefore we defer this to the softirq and use the tofree_queue list for it (similar to process_queue). Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 190/252 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: net: dev: always take qdisc's busylock in __dev_xmit_skb() Date: Wed, 30 Mar 2016 13:36:29 +0200 The root-lock is dropped before dev_hard_start_xmit() is invoked and after setting the __QDISC___STATE_RUNNING bit. 
If this task is now pushed away by a task with a higher priority then the task with the higher priority won't be able to submit packets to the NIC directly; instead they will be enqueued into the Qdisc. The NIC will remain idle until the task(s) with higher priority leave the CPU and the task with lower priority gets back and finishes the job. If we always take the busylock we ensure that the RT task can boost the low-prio task and submit the packet. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 191/252 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: irqwork: push most work into softirq context Date: Tue, 23 Jun 2015 15:32:51 +0200 Initially we deferred all irqwork into softirq because we didn't want the latency spikes if perf or another user was busy and delayed the RT task. The NOHZ trigger (nohz_full_kick_work) was the first user that did not work as expected if it did not run in the original irqwork context so we had to bring it back somehow for it. push_irq_work_func is the second one that requires this. This patch adds the IRQ_WORK_HARD_IRQ which makes sure the callback runs in raw-irq context. Everything else is deferred into softirq context. Without -RT we have the original behavior. This patch incorporates tglx's original work which revoked a little bringing back the arch_irq_work_raise() if possible and a few fixes from Steven Rostedt and Mike Galbraith, [bigeasy: melt tglx's irq_work_tick_soft() which splits irq_work_tick() into a hard and soft variant] Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 192/252 [ Author: Peter Zijlstra Email: peterz@infradead.org Subject: x86: crypto: Reduce preempt disabled regions Date: Mon, 14 Nov 2011 18:19:27 +0100 Restrict the preempt disabled regions to the actual floating point operations and enable preemption for the administrative actions. This is necessary on RT to avoid that kfree and other operations are called with preemption disabled. 
Reported-and-tested-by: Carsten Emde <cbe@osadl.org> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 193/252 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: crypto: Reduce preempt disabled regions, more algos Date: Fri, 21 Feb 2014 17:24:04 +0100 Don Estabrook reported \| kernel: WARNING: CPU: 2 PID: 858 at kernel/sched/core.c:2428 migrate_disable+0xed/0x100() \| kernel: WARNING: CPU: 2 PID: 858 at kernel/sched/core.c:2462 migrate_enable+0x17b/0x200() \| kernel: WARNING: CPU: 3 PID: 865 at kernel/sched/core.c:2428 migrate_disable+0xed/0x100() and his backtrace showed some crypto functions which looked fine. The problem is the following sequence: glue_xts_crypt_128bit() { blkcipher_walk_virt(); /* normal migrate_disable() */ glue_fpu_begin(); /* get atomic */ while (nbytes) { __glue_xts_crypt_128bit(); blkcipher_walk_done(); /* with nbytes = 0, migrate_enable() while we are atomic */ }; glue_fpu_end() /* no longer atomic */ } and this is why the counter gets out of sync and the warning is printed. The other problem is that we are non-preemptible between glue_fpu_begin() and glue_fpu_end() and the latency grows. To fix this, I shorten the FPU off region and ensure blkcipher_walk_done() is called with preemption enabled. This might hurt the performance because we now enable/disable the FPU state more often but we gain lower latency and the bug is gone. Reported-by: Don Estabrook <don.estabrook@gmail.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 194/252 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: crypto: limit more FPU-enabled sections Date: Thu, 30 Nov 2017 13:40:10 +0100 Those crypto drivers use SSE/AVX/… for their crypto work and in order to do so in kernel they need to enable the "FPU" in kernel mode which disables preemption. 
There are two problems with the way they are used: - the while loop which processes X bytes may create latency spikes and should be avoided or limited. - the cipher-walk-next part may allocate/free memory and may use kmap_atomic(). The whole kernel_fpu_begin()/end() processing probably isn't that cheap. It most likely makes sense to process as much of those as possible in one go. The new _fpu_sched_rt() schedules only if a RT task is pending. Probably we should measure the performance of those ciphers in pure SW mode and with these optimisations to see if it makes sense to keep them for RT. This kernel_fpu_resched() makes the code more preemptible which might hurt performance. Cc: stable-rt@vger.kernel.org Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 195/252 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: crypto: cryptd - add a lock instead preempt_disable/local_bh_disable Date: Thu, 26 Jul 2018 18:52:00 +0200 cryptd has a per-CPU lock which is protected with local_bh_disable() and preempt_disable(). Add an explicit spin_lock to make the locking context more obvious and visible to lockdep. Since it is a per-CPU lock, there should be no lock contention on the actual spinlock. There is a small race-window where we could be migrated to another CPU after the cpu_queue has been obtained. This is not a problem because the actual resource is protected by the spinlock. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 196/252 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: panic: skip get_random_bytes for RT_FULL in init_oops_id Date: Tue, 14 Jul 2015 14:26:34 +0200 Disable on -RT. If this is invoked from irq-context we will have problems to acquire the sleeping lock. 
Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 197/252 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: x86: stackprotector: Avoid random pool on rt Date: Thu, 16 Dec 2010 14:25:18 +0100 CPU bringup calls into the random pool to initialize the stack canary. During boot that works nicely even on RT as the might sleep checks are disabled. During CPU hotplug the might sleep checks trigger. Making the locks in random raw is a major PITA, so avoiding the call on RT is the only sensible solution. This is basically the same randomness which we get during boot where the random pool has no entropy and we rely on the TSC randomness. Reported-by: Carsten Emde <carsten.emde@osadl.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 198/252 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: random: Make it work on rt Date: Tue, 21 Aug 2012 20:38:50 +0200 Delegate the random insertion to the forced threaded interrupt handler. Store the return IP of the hard interrupt handler in the irq descriptor and feed it into the random generator as a source of entropy. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 199/252 [ Author: Priyanka Jain Email: Priyanka.Jain@freescale.com Subject: net: Remove preemption disabling in netif_rx() Date: Thu, 17 May 2012 09:35:11 +0530 1) enqueue_to_backlog() (called from netif_rx) should be bound to a particular CPU. This can be achieved by disabling migration. No need to disable preemption 2) Fixes crash "BUG: scheduling while atomic: ksoftirqd" in case of RT. If preemption is disabled, enqueue_to_backlog() is called in atomic context. And if backlog exceeds its count, kfree_skb() is called. But in RT, kfree_skb() might get scheduled out, so it expects non atomic context. 3) When CONFIG_PREEMPT_RT is not defined, migrate_enable(), migrate_disable() map to preempt_enable() and preempt_disable(), so no change in functionality in case of non-RT. 
-Replace preempt_enable(), preempt_disable() with migrate_enable(), migrate_disable() respectively -Replace get_cpu(), put_cpu() with get_cpu_light(), put_cpu_light() respectively Signed-off-by: Priyanka Jain <Priyanka.Jain@freescale.com> Acked-by: Rajan Srivastava <Rajan.Srivastava@freescale.com> Cc: <rostedt@goodmis.orgn> Link: http://lkml.kernel.org/r/1337227511-2271-1-git-send-email-Priyanka.Jain@freescale.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 200/252 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: lockdep: Make it RT aware Date: Sun, 17 Jul 2011 18:51:23 +0200 teach lockdep that we don't really do softirqs on -RT. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 201/252 [ Author: Yong Zhang Email: yong.zhang@windriver.com Subject: lockdep: selftest: Only do hardirq context test for raw spinlock Date: Mon, 16 Apr 2012 15:01:56 +0800 On -rt there is no softirq context any more and rwlock is sleepable, disable softirq context test and rwlock+irq test. Signed-off-by: Yong Zhang <yong.zhang0@gmail.com> Cc: Yong Zhang <yong.zhang@windriver.com> Link: http://lkml.kernel.org/r/1334559716-18447-3-git-send-email-yong.zhang0@gmail.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 202/252 [ Author: Josh Cartwright Email: josh.cartwright@ni.com Subject: lockdep: selftest: fix warnings due to missing PREEMPT_RT conditionals Date: Wed, 28 Jan 2015 13:08:45 -0600 "lockdep: Selftest: Only do hardirq context test for raw spinlock" disabled the execution of certain tests with PREEMPT_RT, but did not prevent the tests from still being defined. 
This leads to warnings like: ./linux/lib/locking-selftest.c:574:1: warning: 'irqsafe1_hard_rlock_12' defined but not used [-Wunused-function] ./linux/lib/locking-selftest.c:574:1: warning: 'irqsafe1_hard_rlock_21' defined but not used [-Wunused-function] ./linux/lib/locking-selftest.c:577:1: warning: 'irqsafe1_hard_wlock_12' defined but not used [-Wunused-function] ./linux/lib/locking-selftest.c:577:1: warning: 'irqsafe1_hard_wlock_21' defined but not used [-Wunused-function] ./linux/lib/locking-selftest.c:580:1: warning: 'irqsafe1_soft_spin_12' defined but not used [-Wunused-function] ... Fixed by wrapping the test definitions in #ifndef CONFIG_PREEMPT_RT conditionals. Signed-off-by: Josh Cartwright <josh.cartwright@ni.com> Signed-off-by: Xander Huff <xander.huff@ni.com> Acked-by: Gratian Crisan <gratian.crisan@ni.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 203/252 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: lockdep: disable self-test Date: Tue, 17 Oct 2017 16:36:18 +0200 The self-test wasn't always 100% accurate for RT. We disabled a few tests which failed because they had a different semantic for RT. Some still reported false positives. Now the selftest locks up the system during boot and it needs to be investigated… Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 204/252 [ Author: Mike Galbraith Email: umgwanakikbuti@gmail.com Subject: drm,radeon,i915: Use preempt_disable/enable_rt() where recommended Date: Sat, 27 Feb 2016 08:09:11 +0100 DRM folks identified the spots, so use them. 
Signed-off-by: Mike Galbraith <umgwanakikbuti@gmail.com> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: linux-rt-users <linux-rt-users@vger.kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 205/252 [ Author: Mike Galbraith Email: umgwanakikbuti@gmail.com Subject: drm,i915: Use local_lock/unlock_irq() in intel_pipe_update_start/end() Date: Sat, 27 Feb 2016 09:01:42 +0100 [ 8.014039] BUG: sleeping function called from invalid context at kernel/locking/rtmutex.c:918 [ 8.014041] in_atomic(): 0, irqs_disabled(): 1, pid: 78, name: kworker/u4:4 [ 8.014045] CPU: 1 PID: 78 Comm: kworker/u4:4 Not tainted 4.1.7-rt7 #5 [ 8.014055] Workqueue: events_unbound async_run_entry_fn [ 8.014059] 0000000000000000 ffff880037153748 ffffffff815f32c9 0000000000000002 [ 8.014063] ffff88013a50e380 ffff880037153768 ffffffff815ef075 ffff8800372c06c8 [ 8.014066] ffff8800372c06c8 ffff880037153778 ffffffff8107c0b3 ffff880037153798 [ 8.014067] Call Trace: [ 8.014074] [<ffffffff815f32c9>] dump_stack+0x4a/0x61 [ 8.014078] [<ffffffff815ef075>] ___might_sleep.part.93+0xe9/0xee [ 8.014082] [<ffffffff8107c0b3>] ___might_sleep+0x53/0x80 [ 8.014086] [<ffffffff815f9064>] rt_spin_lock+0x24/0x50 [ 8.014090] [<ffffffff8109368b>] prepare_to_wait+0x2b/0xa0 [ 8.014152] [<ffffffffa016c04c>] intel_pipe_update_start+0x17c/0x300 [i915] [ 8.014156] [<ffffffff81093b40>] ? prepare_to_wait_event+0x120/0x120 [ 8.014201] [<ffffffffa0158f36>] intel_begin_crtc_commit+0x166/0x1e0 [i915] [ 8.014215] [<ffffffffa00c806d>] drm_atomic_helper_commit_planes+0x5d/0x1a0 [drm_kms_helper] [ 8.014260] [<ffffffffa0171e9b>] intel_atomic_commit+0xab/0xf0 [i915] [ 8.014288] [<ffffffffa00654c7>] drm_atomic_commit+0x37/0x60 [drm] [ 8.014298] [<ffffffffa00c6fcd>] drm_atomic_helper_plane_set_property+0x8d/0xd0 [drm_kms_helper] [ 8.014301] [<ffffffff815f77d9>] ? 
__ww_mutex_lock+0x39/0x40 [ 8.014319] [<ffffffffa0053b3d>] drm_mode_plane_set_obj_prop+0x2d/0x90 [drm] [ 8.014328] [<ffffffffa00c8edb>] restore_fbdev_mode+0x6b/0xf0 [drm_kms_helper] [ 8.014337] [<ffffffffa00cae49>] drm_fb_helper_restore_fbdev_mode_unlocked+0x29/0x80 [drm_kms_helper] [ 8.014346] [<ffffffffa00caec2>] drm_fb_helper_set_par+0x22/0x50 [drm_kms_helper] [ 8.014390] [<ffffffffa016dfba>] intel_fbdev_set_par+0x1a/0x60 [i915] [ 8.014394] [<ffffffff81327dc4>] fbcon_init+0x4f4/0x580 [ 8.014398] [<ffffffff8139ef4c>] visual_init+0xbc/0x120 [ 8.014401] [<ffffffff813a1623>] do_bind_con_driver+0x163/0x330 [ 8.014405] [<ffffffff813a1b2c>] do_take_over_console+0x11c/0x1c0 [ 8.014408] [<ffffffff813236e3>] do_fbcon_takeover+0x63/0xd0 [ 8.014410] [<ffffffff81328965>] fbcon_event_notify+0x785/0x8d0 [ 8.014413] [<ffffffff8107c12d>] ? __might_sleep+0x4d/0x90 [ 8.014416] [<ffffffff810775fe>] notifier_call_chain+0x4e/0x80 [ 8.014419] [<ffffffff810779cd>] __blocking_notifier_call_chain+0x4d/0x70 [ 8.014422] [<ffffffff81077a06>] blocking_notifier_call_chain+0x16/0x20 [ 8.014425] [<ffffffff8132b48b>] fb_notifier_call_chain+0x1b/0x20 [ 8.014428] [<ffffffff8132d8fa>] register_framebuffer+0x21a/0x350 [ 8.014439] [<ffffffffa00cb164>] drm_fb_helper_initial_config+0x274/0x3e0 [drm_kms_helper] [ 8.014483] [<ffffffffa016f1cb>] intel_fbdev_initial_config+0x1b/0x20 [i915] [ 8.014486] [<ffffffff8107912c>] async_run_entry_fn+0x4c/0x160 [ 8.014490] [<ffffffff81070ffa>] process_one_work+0x14a/0x470 [ 8.014493] [<ffffffff81071489>] worker_thread+0x169/0x4c0 [ 8.014496] [<ffffffff81071320>] ? process_one_work+0x470/0x470 [ 8.014499] [<ffffffff81076606>] kthread+0xc6/0xe0 [ 8.014502] [<ffffffff81070000>] ? queue_work_on+0x80/0x110 [ 8.014506] [<ffffffff81076540>] ? 
kthread_worker_fn+0x1c0/0x1c0 Signed-off-by: Mike Galbraith <umgwanakikbuti@gmail.com> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: linux-rt-users <linux-rt-users@vger.kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 206/252 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: drm/i915: disable tracing on -RT Date: Thu, 6 Dec 2018 09:52:20 +0100 Luca Abeni reported this: | BUG: scheduling while atomic: kworker/u8:2/15203/0x00000003 | CPU: 1 PID: 15203 Comm: kworker/u8:2 Not tainted 4.19.1-rt3 #10 | Call Trace: | rt_spin_lock+0x3f/0x50 | gen6_read32+0x45/0x1d0 [i915] | g4x_get_vblank_counter+0x36/0x40 [i915] | trace_event_raw_event_i915_pipe_update_start+0x7d/0xf0 [i915] Tracing events such as trace_i915_pipe_update_start(), among others, use functions that acquire spin locks. A few trace points use intel_get_crtc_scanline(), others use ->get_vblank_counter() which also might acquire a sleeping lock. Based on this I don't see any other way than to disable trace points on RT. Cc: stable-rt@vger.kernel.org Reported-by: Luca Abeni <lucabe72@gmail.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 207/252 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: drm/i915: skip DRM_I915_LOW_LEVEL_TRACEPOINTS with NOTRACE Date: Wed, 19 Dec 2018 10:47:02 +0100 The order of the header files is important. If this header file is included after tracepoint.h was included then the NOTRACE here becomes a nop. Currently this happens for two .c files which use the tracepoints behind DRM_I915_LOW_LEVEL_TRACEPOINTS. 
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 208/252 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: drm/i915: Don't disable interrupts for intel_engine_breadcrumbs_irq() Date: Thu, 26 Sep 2019 12:29:05 +0200 The function intel_engine_breadcrumbs_irq() is always invoked from an interrupt handler and for that reason it invokes (as an optimisation) only spin_lock() for locking assuming that the interrupts are already disabled. The function intel_engine_signal_breadcrumbs() is provided to disable interrupts while the former function is invoked so that assumption is also true for callers from preemptible context. On PREEMPT_RT local_irq_disable() really disables interrupts and this forbids to invoke spin_lock() which becomes a sleeping spinlock. This is also problematic with `threadirqs' in conjunction with irq_work. With force threading the interrupt handler, the handler is invoked with disabled BH but with interrupts enabled. This is okay and the lock itself is never acquired in IRQ context. This changes with irq_work (signal_irq_work()) which _still_ invokes intel_engine_breadcrumbs_irq() from IRQ context. Lockdep should see this and complain. Acquire the locks in intel_engine_breadcrumbs_irq() with _irqsave() suffix and let all callers invoke intel_engine_breadcrumbs_irq() directly instead using intel_engine_signal_breadcrumbs(). Reported-by: Clark Williams <williams@redhat.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 209/252 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: drm/i915: Drop the IRQ-off asserts Date: Thu, 26 Sep 2019 12:30:21 +0200 The lockdep_assert_irqs_disabled() check is needless. The previous lockdep_assert_held() check ensures that the lock is acquired and while the lock is acquired lockdep also prints a warning if the interrupts are not disabled if they have to be. 
These IRQ-off asserts trigger on PREEMPT_RT because the locks become sleeping locks and do not really disable interrupts. Remove lockdep_assert_irqs_disabled(). Reported-by: Clark Williams <williams@redhat.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 210/252 [ Author: Mike Galbraith Email: efault@gmx.de Subject: cpuset: Convert callback_lock to raw_spinlock_t Date: Sun, 8 Jan 2017 09:32:25 +0100 The two commits below add up to a cpuset might_sleep() splat for RT: 8447a0fee974 cpuset: convert callback_mutex to a spinlock 344736f29b35 cpuset: simplify cpuset_node_allowed API BUG: sleeping function called from invalid context at kernel/locking/rtmutex.c:995 in_atomic(): 0, irqs_disabled(): 1, pid: 11718, name: cset CPU: 135 PID: 11718 Comm: cset Tainted: G E 4.10.0-rt1-rt #4 Hardware name: Intel Corporation BRICKLAND/BRICKLAND, BIOS BRHSXSD1.86B.0056.R01.1409242327 09/24/2014 Call Trace: ? dump_stack+0x5c/0x81 ? ___might_sleep+0xf4/0x170 ? rt_spin_lock+0x1c/0x50 ? __cpuset_node_allowed+0x66/0xc0 ? ___slab_alloc+0x390/0x570 <disables IRQs> ? anon_vma_fork+0x8f/0x140 ? copy_page_range+0x6cf/0xb00 ? anon_vma_fork+0x8f/0x140 ? __slab_alloc.isra.74+0x5a/0x81 ? anon_vma_fork+0x8f/0x140 ? kmem_cache_alloc+0x1b5/0x1f0 ? anon_vma_fork+0x8f/0x140 ? copy_process.part.35+0x1670/0x1ee0 ? _do_fork+0xdd/0x3f0 ? _do_fork+0xdd/0x3f0 ? do_syscall_64+0x61/0x170 ? entry_SYSCALL64_slow_path+0x25/0x25 The later ensured that a NUMA box WILL take callback_lock in atomic context by removing the allocator and reclaim path __GFP_HARDWALL usage which prevented such contexts from taking callback_mutex. One option would be to reinstate __GFP_HARDWALL protections for RT, however, as the 8447a0fee974 changelog states: The callback_mutex is only used to synchronize reads/updates of cpusets' flags and cpu/node masks. These operations should always proceed fast so there's no reason why we can't use a spinlock instead of the mutex. 
Cc: stable-rt@vger.kernel.org Signed-off-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 211/252 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: apparmor: use a locallock instead preempt_disable() Date: Wed, 11 Oct 2017 17:43:49 +0200 get_buffers() disables preemption which acts as a lock for the per-CPU variable. Since we can't disable preemption here on RT, a local_lock is used in order to remain on the same CPU and not to have more than one user within the critical section. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 212/252 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: x86: Allow to enable RT Date: Wed, 7 Aug 2019 18:15:38 +0200 Allow to select RT. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 213/252 [ Author: Peter Zijlstra Email: peterz@infradead.org Subject: mm, rt: kmap_atomic scheduling Date: Thu, 28 Jul 2011 10:43:51 +0200 In fact, with migrate_disable() existing one could play games with kmap_atomic. You could save/restore the kmap_atomic slots on context switch (if there are any in use of course), this should be esp easy now that we have a kmap_atomic stack. Something like the below.. it wants replacing all the preempt_disable() stuff with pagefault_disable() && migrate_disable() of course, but then you can flip kmaps around like below. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> [dvhart@linux.intel.com: build fix] Link: http://lkml.kernel.org/r/1311842631.5890.208.camel@twins [tglx@linutronix.de: Get rid of the per cpu variable and store the idx and the pte content right away in the task struct. Shortens the context switch code. ] ] 214/252 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: x86/highmem: Add a "already used pte" check Date: Mon, 11 Mar 2013 17:09:55 +0100 This is a copy from kmap_atomic_prot(). 
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 215/252 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: arm/highmem: Flush tlb on unmap Date: Mon, 11 Mar 2013 21:37:27 +0100 The tlb should be flushed on unmap and thus make the mapping entry invalid. This is only done in the non-debug case which does not look right. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 216/252 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: arm: Enable highmem for rt Date: Wed, 13 Feb 2013 11:03:11 +0100 fixup highmem for ARM. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 217/252 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: mm/scatterlist: Do not disable irqs on RT Date: Fri, 3 Jul 2009 08:44:34 -0500 For -RT it is enough to keep pagefault disabled (which is currently handled by kmap_atomic()). Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 218/252 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: sched: Add support for lazy preemption Date: Fri, 26 Oct 2012 18:50:54 +0100 It has become an obsession to mitigate the determinism vs. throughput loss of RT. Looking at the mainline semantics of preemption points gives a hint why RT sucks throughput wise for ordinary SCHED_OTHER tasks. One major issue is the wakeup of tasks which are right away preempting the waking task while the waking task holds a lock on which the woken task will block right after having preempted the wakee. In mainline this is prevented due to the implicit preemption disable of spin/rw_lock held regions. On RT this is not possible due to the fully preemptible nature of sleeping spinlocks. Though for a SCHED_OTHER task preempting another SCHED_OTHER task this is really not a correctness issue. RT folks are concerned about SCHED_FIFO/RR tasks preemption and not about the purely fairness driven SCHED_OTHER preemption latencies. 
So I introduced a lazy preemption mechanism which only applies to SCHED_OTHER tasks preempting another SCHED_OTHER task. Aside from the existing preempt_count, each task now sports a preempt_lazy_count which is manipulated on lock acquisition and release. This is slightly incorrect as for laziness reasons I coupled this on migrate_disable/enable so some other mechanisms get the same treatment (e.g. get_cpu_light). Now on the scheduler side instead of setting NEED_RESCHED this sets NEED_RESCHED_LAZY in case of a SCHED_OTHER/SCHED_OTHER preemption and therefore allows the waking task to exit the lock held region before the woken task preempts. That also works better for cross CPU wakeups as the other side can stay in the adaptive spinning loop. For RT class preemption there is no change. This simply sets NEED_RESCHED and forgoes the lazy preemption counter. Initial tests do not expose any observable latency increase, but history shows that I've been proven wrong before :) The lazy preemption mode is on by default, but with CONFIG_SCHED_DEBUG enabled it can be disabled via: # echo NO_PREEMPT_LAZY >/sys/kernel/debug/sched_features and reenabled via: # echo PREEMPT_LAZY >/sys/kernel/debug/sched_features The test results so far are very machine and workload dependent, but there is a clear trend that it enhances the non RT workload performance. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 219/252 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: x86: Support for lazy preemption Date: Thu, 1 Nov 2012 11:03:47 +0100 Implement the x86 pieces for lazy preempt. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 220/252 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: arm: Add support for lazy preemption Date: Wed, 31 Oct 2012 12:04:11 +0100 Implement the arm pieces for lazy preempt. 
Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 221/252 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: powerpc: Add support for lazy preemption Date: Thu, 1 Nov 2012 10:14:11 +0100 Implement the powerpc pieces for lazy preempt. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 222/252 [ Author: Anders Roxell Email: anders.roxell@linaro.org Subject: arch/arm64: Add lazy preempt support Date: Thu, 14 May 2015 17:52:17 +0200 arm64 is missing support for PREEMPT_RT. The main feature which is lacking is support for lazy preemption. The arch-specific entry code, thread information structure definitions, and associated data tables have to be extended to provide this support. Then the Kconfig file has to be extended to indicate the support is available, and also to indicate that support for full RT preemption is now available. Signed-off-by: Anders Roxell <anders.roxell@linaro.org> ] 223/252 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: jump-label: disable if stop_machine() is used Date: Wed, 8 Jul 2015 17:14:48 +0200 Some architectures are using stop_machine() while switching the opcode which leads to latency spikes. 
The architectures which use stop_machine() atm: - ARM stop machine - s390 stop machine The architectures which use other sorcery: - MIPS - X86 - powerpc - sparc - arm64 Signed-off-by: Thomas Gleixner <tglx@linutronix.de> [bigeasy: only ARM for now] Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 224/252 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: leds: trigger: disable CPU trigger on -RT Date: Thu, 23 Jan 2014 14:45:59 +0100 as it triggers: |CPU: 0 PID: 0 Comm: swapper Not tainted 3.12.8-rt10 #141 |[<c0014aa4>] (unwind_backtrace+0x0/0xf8) from [<c0012788>] (show_stack+0x1c/0x20) |[<c0012788>] (show_stack+0x1c/0x20) from [<c043c8dc>] (dump_stack+0x20/0x2c) |[<c043c8dc>] (dump_stack+0x20/0x2c) from [<c004c5e8>] (__might_sleep+0x13c/0x170) |[<c004c5e8>] (__might_sleep+0x13c/0x170) from [<c043f270>] (__rt_spin_lock+0x28/0x38) |[<c043f270>] (__rt_spin_lock+0x28/0x38) from [<c043fa00>] (rt_read_lock+0x68/0x7c) |[<c043fa00>] (rt_read_lock+0x68/0x7c) from [<c036cf74>] (led_trigger_event+0x2c/0x5c) |[<c036cf74>] (led_trigger_event+0x2c/0x5c) from [<c036e0bc>] (ledtrig_cpu+0x54/0x5c) |[<c036e0bc>] (ledtrig_cpu+0x54/0x5c) from [<c000ffd8>] (arch_cpu_idle_exit+0x18/0x1c) |[<c000ffd8>] (arch_cpu_idle_exit+0x18/0x1c) from [<c00590b8>] (cpu_startup_entry+0xa8/0x234) |[<c00590b8>] (cpu_startup_entry+0xa8/0x234) from [<c043b2cc>] (rest_init+0xb8/0xe0) |[<c043b2cc>] (rest_init+0xb8/0xe0) from [<c061ebe0>] (start_kernel+0x2c4/0x380) Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 225/252 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: tty/serial/omap: Make the locking RT aware Date: Thu, 28 Jul 2011 13:32:57 +0200 The lock is a sleeping lock and local_irq_save() is not the optimisation we are looking for. Redo it to make it work on -RT and non-RT. 
Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 226/252 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: tty/serial/pl011: Make the locking work on RT Date: Tue, 8 Jan 2013 21:36:51 +0100 The lock is a sleeping lock and local_irq_save() is not the optimisation we are looking for. Redo it to make it work on -RT and non-RT. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 227/252 [ Author: Kurt Kanzenbach Email: kurt@linutronix.de Subject: tty: serial: pl011: explicitly initialize the flags variable Date: Mon, 24 Sep 2018 10:29:01 +0200 Silence the following gcc warning: drivers/tty/serial/amba-pl011.c: In function ‘pl011_console_write’: ./include/linux/spinlock.h:260:3: warning: ‘flags’ may be used uninitialized in this function [-Wmaybe-uninitialized] _raw_spin_unlock_irqrestore(lock, flags); \ ^~~~~~~~~~~~~~~~~~~~~~~~~~~ drivers/tty/serial/amba-pl011.c:2214:16: note: ‘flags’ was declared here unsigned long flags; ^~~~~ The code is correct. Thus, initializing flags to zero doesn't change the behavior and resolves the warning. Signed-off-by: Kurt Kanzenbach <kurt@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 228/252 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: arm: include definition for cpumask_t Date: Thu, 22 Dec 2016 17:28:33 +0100 This definition gets pulled in by other files. With the (later) split of RCU and spinlock.h it won't compile anymore. The split is done in ("rbtree: don't include the rcu header"). Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 229/252 [ Author: Yadi.hu Email: yadi.hu@windriver.com Subject: ARM: enable irq in translation/section permission fault handlers Date: Wed, 10 Dec 2014 10:32:09 +0800 Probably happens on all ARM, with CONFIG_PREEMPT_RT and CONFIG_DEBUG_ATOMIC_SLEEP. This simple program.... 
int main() { *((char*)0xc0001000) = 0; }; [ 512.742724] BUG: sleeping function called from invalid context at kernel/rtmutex.c:658 [ 512.743000] in_atomic(): 0, irqs_disabled(): 128, pid: 994, name: a [ 512.743217] INFO: lockdep is turned off. [ 512.743360] irq event stamp: 0 [ 512.743482] hardirqs last enabled at (0): [< (null)>] (null) [ 512.743714] hardirqs last disabled at (0): [<c0426370>] copy_process+0x3b0/0x11c0 [ 512.744013] softirqs last enabled at (0): [<c0426370>] copy_process+0x3b0/0x11c0 [ 512.744303] softirqs last disabled at (0): [< (null)>] (null) [ 512.744631] [<c041872c>] (unwind_backtrace+0x0/0x104) [ 512.745001] [<c09af0c4>] (dump_stack+0x20/0x24) [ 512.745355] [<c0462490>] (__might_sleep+0x1dc/0x1e0) [ 512.745717] [<c09b6770>] (rt_spin_lock+0x34/0x6c) [ 512.746073] [<c0441bf0>] (do_force_sig_info+0x34/0xf0) [ 512.746457] [<c0442668>] (force_sig_info+0x18/0x1c) [ 512.746829] [<c041d880>] (__do_user_fault+0x9c/0xd8) [ 512.747185] [<c041d938>] (do_bad_area+0x7c/0x94) [ 512.747536] [<c041d990>] (do_sect_fault+0x40/0x48) [ 512.747898] [<c040841c>] (do_DataAbort+0x40/0xa0) [ 512.748181] Exception stack(0xecaa1fb0 to 0xecaa1ff8) 0xc0000000 belongs to kernel address space, so a user task cannot be allowed to access it. In this situation, the correct result is that the test case receives a “segment fault” and exits instead of dumping a stack trace. The root cause is commit 02fe2845d6a8 ("avoid enabling interrupts in prefetch/data abort handlers"), which deletes the irq enable block in the Data abort assembly code and moves it into the page/breakpoint/alignment fault handlers instead. But the author does not enable irq in the translation/section permission fault handlers. ARM disables irq when it enters exception/interrupt mode; if the kernel doesn't enable irq, it is still disabled during a translation/section permission fault. 
We see the above splat because do_force_sig_info is still called with IRQs off, and that code eventually does a: spin_lock_irqsave(&t->sighand->siglock, flags); As this is architecture independent code, and we've not seen any other need for other arch to have the siglock converted to raw lock, we can conclude that we should enable irq for ARM translation/section permission exception. Signed-off-by: Yadi.hu <yadi.hu@windriver.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 230/252 [ Author: Josh Cartwright Email: joshc@ni.com Subject: genirq: update irq_set_irqchip_state documentation Date: Thu, 11 Feb 2016 11:54:00 -0600 On -rt kernels, the use of migrate_disable()/migrate_enable() is sufficient to guarantee a task isn't moved to another CPU. Update the irq_set_irqchip_state() documentation to reflect this. Signed-off-by: Josh Cartwright <joshc@ni.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 231/252 [ Author: Josh Cartwright Email: joshc@ni.com Subject: KVM: arm/arm64: downgrade preempt_disable()d region to migrate_disable() Date: Thu, 11 Feb 2016 11:54:01 -0600 kvm_arch_vcpu_ioctl_run() disables the use of preemption when updating the vgic and timer states to prevent the calling task from migrating to another CPU. It does so to prevent the task from writing to the incorrect per-CPU GIC distributor registers. On -rt kernels, it's possible to maintain the same guarantee with the use of migrate_{disable,enable}(), with the added benefit that the migrate-disabled region is preemptible. Update kvm_arch_vcpu_ioctl_run() to do so. 
Cc: Christoffer Dall <christoffer.dall@linaro.org> Reported-by: Manish Jaggi <Manish.Jaggi@caviumnetworks.com> Signed-off-by: Josh Cartwright <joshc@ni.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 232/252 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: arm64: fpsimd: Delay freeing memory in fpsimd_flush_thread() Date: Wed, 25 Jul 2018 14:02:38 +0200 fpsimd_flush_thread() invokes kfree() via sve_free() within a preempt disabled section which is not working on -RT. Delay freeing of memory until preemption is enabled again. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 233/252 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: arm: at91: do not disable/enable clocks in a row Date: Wed, 9 Mar 2016 10:51:06 +0100 Currently the driver will disable the clock and enable it one line later if it is switching from periodic mode into one shot. This can be avoided and causes a needless warning on -RT. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 234/252 [ Author: Benedikt Spranger Email: b.spranger@linutronix.de Subject: clocksource: TCLIB: Allow higher clock rates for clock events Date: Mon, 8 Mar 2010 18:57:04 +0100 By default the TCLIB uses the 32KiHz base clock rate for clock events. Add a compile time selection to allow higher clock resolution. (fixed up by Sami Pietikäinen <Sami.Pietikainen@wapice.com>) Signed-off-by: Benedikt Spranger <b.spranger@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 235/252 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: x86: Enable RT also on 32bit Date: Thu, 7 Nov 2019 17:49:20 +0100 Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 236/252 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: ARM: Allow to enable RT Date: Fri, 11 Oct 2019 13:14:29 +0200 Allow to select RT. 
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 237/252 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: ARM64: Allow to enable RT Date: Fri, 11 Oct 2019 13:14:35 +0200 Allow to select RT. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 238/252 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: powerpc/pseries/iommu: Use a locallock instead local_irq_save() Date: Tue, 26 Mar 2019 18:31:54 +0100 The locallock protects the per-CPU variable tce_page. The function attempts to allocate memory while tce_page is protected (by disabling interrupts). Use local_irq_save() instead of local_irq_disable(). Cc: stable-rt@vger.kernel.org Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 239/252 [ Author: Bogdan Purcareata Email: bogdan.purcareata@freescale.com Subject: powerpc/kvm: Disable in-kernel MPIC emulation for PREEMPT_RT Date: Fri, 24 Apr 2015 15:53:13 +0000 While converting the openpic emulation code to use a raw_spinlock_t enables guests to run on RT, there's still a performance issue. For interrupts sent in directed delivery mode with a multiple CPU mask, the emulated openpic will loop through all of the VCPUs, and for each VCPU, it calls IRQ_check, which will loop through all the pending interrupts for that VCPU. This is done while holding the raw_lock, meaning that in all this time the interrupts and preemption are disabled on the host Linux. A malicious user app can max out both of these numbers and cause a DoS. This temporary fix is sent for two reasons. First is so that users who want to use the in-kernel MPIC emulation are aware of the potential latencies, thus making sure that the hardware MPIC and their usage scenario does not involve interrupts sent in directed delivery mode, and the number of possible pending interrupts is kept small. Secondly, this should incentivize the development of a proper openpic emulation that would be better suited for RT. 
Acked-by: Scott Wood <scottwood@freescale.com> Signed-off-by: Bogdan Purcareata <bogdan.purcareata@freescale.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 240/252 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: powerpc: Disable highmem on RT Date: Mon, 18 Jul 2011 17:08:34 +0200 The current highmem handling on -RT is not compatible and needs fixups. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 241/252 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: powerpc/stackprotector: work around stack-guard init from atomic Date: Tue, 26 Mar 2019 18:31:29 +0100 This is invoked from the secondary CPU in atomic context. On x86 we use tsc instead. On Power we XOR it against mftb() so lets use stack address as the initial value. Cc: stable-rt@vger.kernel.org Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 242/252 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: POWERPC: Allow to enable RT Date: Fri, 11 Oct 2019 13:14:41 +0200 Allow to select RT. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 243/252 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: mips: Disable highmem on RT Date: Mon, 18 Jul 2011 17:10:12 +0200 The current highmem handling on -RT is not compatible and needs fixups. 
Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 244/252 [ Author: Mike Galbraith Email: umgwanakikbuti@gmail.com Subject: connector/cn_proc: Protect send_msg() with a local lock on RT Date: Sun, 16 Oct 2016 05:11:54 +0200 |BUG: sleeping function called from invalid context at kernel/locking/rtmutex.c:931 |in_atomic(): 1, irqs_disabled(): 0, pid: 31807, name: sleep |Preemption disabled at:[<ffffffff8148019b>] proc_exit_connector+0xbb/0x140 | |CPU: 4 PID: 31807 Comm: sleep Tainted: G W E 4.8.0-rt11-rt #106 |Call Trace: | [<ffffffff813436cd>] dump_stack+0x65/0x88 | [<ffffffff8109c425>] ___might_sleep+0xf5/0x180 | [<ffffffff816406b0>] __rt_spin_lock+0x20/0x50 | [<ffffffff81640978>] rt_read_lock+0x28/0x30 | [<ffffffff8156e209>] netlink_broadcast_filtered+0x49/0x3f0 | [<ffffffff81522621>] ? __kmalloc_reserve.isra.33+0x31/0x90 | [<ffffffff8156e5cd>] netlink_broadcast+0x1d/0x20 | [<ffffffff8147f57a>] cn_netlink_send_mult+0x19a/0x1f0 | [<ffffffff8147f5eb>] cn_netlink_send+0x1b/0x20 | [<ffffffff814801d8>] proc_exit_connector+0xf8/0x140 | [<ffffffff81077f71>] do_exit+0x5d1/0xba0 | [<ffffffff810785cc>] do_group_exit+0x4c/0xc0 | [<ffffffff81078654>] SyS_exit_group+0x14/0x20 | [<ffffffff81640a72>] entry_SYSCALL_64_fastpath+0x1a/0xa4 Since ab8ed951080e ("connector: fix out-of-order cn_proc netlink message delivery") which is v4.7-rc6. Signed-off-by: Mike Galbraith <umgwanakikbuti@gmail.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 245/252 [ Author: Mike Galbraith Email: umgwanakikbuti@gmail.com Subject: drivers/block/zram: Replace bit spinlocks with rtmutex for -rt Date: Thu, 31 Mar 2016 04:08:28 +0200 They're nondeterministic, and lead to ___might_sleep() splats in -rt. OTOH, they're a lot less wasteful than an rtmutex per page. 
Signed-off-by: Mike Galbraith <umgwanakikbuti@gmail.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 246/252 [ Author: Mike Galbraith Email: umgwanakikbuti@gmail.com Subject: drivers/zram: Don't disable preemption in zcomp_stream_get/put() Date: Thu, 20 Oct 2016 11:15:22 +0200 In v4.7, the driver switched to percpu compression streams, disabling preemption via get/put_cpu_ptr(). Use a per-zcomp_strm lock here. We also have to fix a lock order issue in zram_decompress_page() such that zs_map_object() nests inside of zcomp_stream_put() as it does in zram_bvec_write(). Signed-off-by: Mike Galbraith <umgwanakikbuti@gmail.com> [bigeasy: get_locked_var() -> per zcomp_strm lock] Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 247/252 [ Author: Julia Cartwright Email: julia@ni.com Subject: squashfs: make use of local lock in multi_cpu decompressor Date: Mon, 7 May 2018 08:58:57 -0500 Currently, the squashfs multi_cpu decompressor makes use of get_cpu_ptr()/put_cpu_ptr(), which unconditionally disable preemption during decompression. Because the workload is distributed across CPUs, all CPUs can observe a very high wakeup latency, which has been seen to be as much as 8000us. Convert this decompressor to make use of a local lock, which will allow execution of the decompressor with preemption-enabled, but also ensure concurrent accesses to the percpu compressor data on the local CPU will be serialized. 
Cc: stable-rt@vger.kernel.org
Reported-by: Alexander Stein <alexander.stein@systec-electronic.com>
Tested-by: Alexander Stein <alexander.stein@systec-electronic.com>
Signed-off-by: Julia Cartwright <julia@ni.com>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ]

248/252 [
Author: Haris Okanovic
Email: haris.okanovic@ni.com
Subject: tpm_tis: fix stall after iowrite()s
Date: Tue, 15 Aug 2017 15:13:08 -0500

ioread8() operations to TPM MMIO addresses can stall the CPU when immediately following a sequence of iowrite()s to the same region. For example, cyclictest measures ~400us latency spikes when a non-RT usermode application communicates with an SPI-based TPM chip (Intel Atom E3940 system, PREEMPT_RT kernel). The spikes are caused by a stalling ioread8() operation following a sequence of 30+ iowrite8()s to the same address. I believe this happens because the write sequence is buffered (in the CPU or somewhere along the bus), and gets flushed on the first LOAD instruction (ioread()) that follows.

The enclosed change appears to fix this issue: read the TPM chip's access register (status code) after every iowrite() operation to amortize the cost of flushing data to the chip across multiple instructions.

Signed-off-by: Haris Okanovic <haris.okanovic@ni.com>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ]

249/252 [
Author: Thomas Gleixner
Email: tglx@linutronix.de
Subject: signals: Allow rt tasks to cache one sigqueue struct
Date: Fri, 3 Jul 2009 08:44:56 -0500

To avoid allocation, allow rt tasks to cache one sigqueue struct in the task struct.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ]

250/252 [
Author: Ingo Molnar
Email: mingo@elte.hu
Subject: genirq: Disable irqpoll on -rt
Date: Fri, 3 Jul 2009 08:29:57 -0500

Creates long latencies for no value.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ]

251/252 [
Author: Clark Williams
Email: williams@redhat.com
Subject: sysfs: Add /sys/kernel/realtime entry
Date: Sat, 30 Jul 2011 21:55:53 -0500

Add a /sys/kernel entry to indicate that the kernel is a realtime kernel. Clark says that he needs this for udev rules: udev needs to evaluate whether it's a PREEMPT_RT kernel a few thousand times, and parsing uname output is too slow or so. Are there better solutions? Should it exist and return 0 on !-rt?

Signed-off-by: Clark Williams <williams@redhat.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> ]

252/252 [
Author: Thomas Gleixner
Email: tglx@linutronix.de
Subject: Add localversion for -RT release
Date: Fri, 8 Jul 2011 20:25:16 +0200

Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ]

Signed-off-by: Bruce Ashfield <bruce.ashfield@gmail.com>
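
As an illustration of how userspace might consume the /sys/kernel/realtime entry described in patch 251/252, here is a minimal Python sketch. It is not part of the patch series. The helper name is_realtime_kernel is hypothetical, and the sketch assumes that the entry is absent on non-RT kernels and reads as "1" on an RT kernel; the commit message itself leaves the !-rt behaviour as an open question.

```python
from pathlib import Path


def is_realtime_kernel(entry: Path = Path("/sys/kernel/realtime")) -> bool:
    """Return True if the sysfs entry exists and reads as "1".

    Assumption (not confirmed by the commit message): the entry is
    missing on non-RT kernels and contains "1" on PREEMPT_RT kernels.
    """
    try:
        return entry.read_text().strip() == "1"
    except OSError:
        # Missing or unreadable entry: treat as a non-RT kernel.
        return False


if __name__ == "__main__":
    print("PREEMPT_RT kernel:", is_realtime_kernel())
```

Reading one small file avoids spawning uname and parsing its output, which is the cost the commit message mentions for udev rules.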