summaryrefslogtreecommitdiffstats
path: root/kernel
AgeCommit message (Collapse)Author
2021-03-20bpf: Add sanity check for upper ptr_limitPiotr Krysiuk
commit 1b1597e64e1a610c7a96710fc4717158e98a08b3 upstream. Given we know the max possible value of ptr_limit at the time of retrieving the latter, add basic assertions, so that the verifier can bail out if anything looks odd and reject the program. Nothing triggered this so far, but it also does not hurt to have these. Signed-off-by: Piotr Krysiuk <piotras@gmail.com> Co-developed-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-03-20bpf: Simplify alu_limit masking for pointer arithmeticPiotr Krysiuk
commit b5871dca250cd391885218b99cc015aca1a51aea upstream. Instead of having the mov32 with aux->alu_limit - 1 immediate, move this operation to retrieve_ptr_limit() instead to simplify the logic and to allow for subsequent sanity boundary checks inside retrieve_ptr_limit(). This avoids in future that at the time of the verifier masking rewrite we'd run into an underflow which would not sign extend due to the nature of mov32 instruction. Signed-off-by: Piotr Krysiuk <piotras@gmail.com> Co-developed-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-03-20bpf: Fix off-by-one for area size in creating mask to leftPiotr Krysiuk
commit 10d2bb2e6b1d8c4576c56a748f697dbeb8388899 upstream. retrieve_ptr_limit() computes the ptr_limit for registers with stack and map_value type. ptr_limit is the size of the memory area that is still valid / in-bounds from the point of the current position and direction of the operation (add / sub). This size will later be used for masking the operation such that attempting out-of-bounds access in the speculative domain is redirected to remain within the bounds of the current map value. When masking to the right the size is correct, however, when masking to the left, the size is off-by-one which would lead to an incorrect mask and thus incorrect arithmetic operation in the non-speculative domain. Piotr found that if the resulting alu_limit value is zero, then the BPF_MOV32_IMM() from the fixup_bpf_calls() rewrite will end up loading 0xffffffff into AX instead of sign-extending to the full 64 bit range, and as a result, this allows abuse for executing speculatively out-of- bounds loads against 4GB window of address space and thus extracting the contents of kernel memory via side-channel. Fixes: 979d63d50c0c ("bpf: prevent out of bounds speculation on pointer arithmetic") Signed-off-by: Piotr Krysiuk <piotras@gmail.com> Co-developed-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-03-20bpf: Prohibit alu ops for pointer types not defining ptr_limitPiotr Krysiuk
commit f232326f6966cf2a1d1db7bc917a4ce5f9f55f76 upstream. The purpose of this patch is to streamline error propagation and in particular to propagate retrieve_ptr_limit() errors for pointer types that are not defining a ptr_limit such that register-based alu ops against these types can be rejected. The main rationale is that a gap has been identified by Piotr in the existing protection against speculatively out-of-bounds loads, for example, in case of ctx pointers, unprivileged programs can still perform pointer arithmetic. This can be abused to execute speculatively out-of-bounds loads without restrictions and thus extract contents of kernel memory. Fix this by rejecting unprivileged programs that attempt any pointer arithmetic on unprotected pointer types. The two affected ones are pointer to ctx as well as pointer to map. Field access to a modified ctx' pointer is rejected at a later point in time in the verifier, and 7c6967326267 ("bpf: Permit map_ptr arithmetic with opcode add and offset 0") only relevant for root-only use cases. Risk of unprivileged program breakage is considered very low. Fixes: 7c6967326267 ("bpf: Permit map_ptr arithmetic with opcode add and offset 0") Fixes: b2157399cc98 ("bpf: prevent out-of-bounds speculation") Signed-off-by: Piotr Krysiuk <piotras@gmail.com> Co-developed-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-03-17hrtimer: Update softirq_expires_next correctly after __hrtimer_get_next_event()Anna-Maria Behnsen
[ Upstream commit 46eb1701c046cc18c032fa68f3c8ccbf24483ee4 ] hrtimer_force_reprogram() and hrtimer_interrupt() invokes __hrtimer_get_next_event() to find the earliest expiry time of hrtimer bases. __hrtimer_get_next_event() does not update cpu_base::[softirq_]_expires_next to preserve reprogramming logic. That needs to be done at the callsites. hrtimer_force_reprogram() updates cpu_base::softirq_expires_next only when the first expiring timer is a softirq timer and the soft interrupt is not activated. That's wrong because cpu_base::softirq_expires_next is left stale when the first expiring timer of all bases is a timer which expires in hard interrupt context. hrtimer_interrupt() does never update cpu_base::softirq_expires_next which is wrong too. That becomes a problem when clock_settime() sets CLOCK_REALTIME forward and the first soft expiring timer is in the CLOCK_REALTIME_SOFT base. Setting CLOCK_REALTIME forward moves the clock MONOTONIC based expiry time of that timer before the stale cpu_base::softirq_expires_next. cpu_base::softirq_expires_next is cached to make the check for raising the soft interrupt fast. In the above case the soft interrupt won't be raised until clock monotonic reaches the stale cpu_base::softirq_expires_next value. That's incorrect, but what's worse it that if the softirq timer becomes the first expiring timer of all clock bases after the hard expiry timer has been handled the reprogramming of the clockevent from hrtimer_interrupt() will result in an interrupt storm. That happens because the reprogramming does not use cpu_base::softirq_expires_next, it uses __hrtimer_get_next_event() which returns the actual expiry time. Once clock MONOTONIC reaches cpu_base::softirq_expires_next the soft interrupt is raised and the storm subsides. Change the logic in hrtimer_force_reprogram() to evaluate the soft and hard bases seperately, update softirq_expires_next and handle the case when a soft expiring timer is the first of all bases by comparing the expiry times and updating the required cpu base fields. Split this functionality into a separate function to be able to use it in hrtimer_interrupt() as well without copy paste. Fixes: 5da70160462e ("hrtimer: Implement support for softirq based hrtimers") Reported-by: Mikael Beckius <mikael.beckius@windriver.com> Suggested-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Mikael Beckius <mikael.beckius@windriver.com> Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20210223160240.27518-1-anna-maria@linutronix.de Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-03-04printk: fix deadlock when kernel panicMuchun Song
commit 8a8109f303e25a27f92c1d8edd67d7cbbc60a4eb upstream. printk_safe_flush_on_panic() caused the following deadlock on our server: CPU0: CPU1: panic rcu_dump_cpu_stacks kdump_nmi_shootdown_cpus nmi_trigger_cpumask_backtrace register_nmi_handler(crash_nmi_callback) printk_safe_flush __printk_safe_flush raw_spin_lock_irqsave(&read_lock) // send NMI to other processors apic_send_IPI_allbutself(NMI_VECTOR) // NMI interrupt, dead loop crash_nmi_callback printk_safe_flush_on_panic printk_safe_flush __printk_safe_flush // deadlock raw_spin_lock_irqsave(&read_lock) DEADLOCK: read_lock is taken on CPU1 and will never get released. It happens when panic() stops a CPU by NMI while it has been in the middle of printk_safe_flush(). Handle the lock the same way as logbuf_lock. The printk_safe buffers are flushed only when both locks can be safely taken. It can avoid the deadlock _in this particular case_ at expense of losing contents of printk_safe buffers. Note: It would actually be safe to re-init the locks when all CPUs were stopped by NMI. But it would require passing this information from arch-specific code. It is not worth the complexity. Especially because logbuf_lock and printk_safe buffers have been obsoleted by the lockless ring buffer. Fixes: cf9b1106c81c ("printk/nmi: flush NMI messages on the system panic") Signed-off-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Petr Mladek <pmladek@suse.com> Cc: <stable@vger.kernel.org> Acked-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Signed-off-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20210210034823.64867-1-songmuchun@bytedance.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-03-04module: Ignore _GLOBAL_OFFSET_TABLE_ when warning for undefined symbolsFangrui Song
commit ebfac7b778fac8b0e8e92ec91d0b055f046b4604 upstream. clang-12 -fno-pic (since https://github.com/llvm/llvm-project/commit/a084c0388e2a59b9556f2de0083333232da3f1d6) can emit `call __stack_chk_fail@PLT` instead of `call __stack_chk_fail` on x86. The two forms should have identical behaviors on x86-64 but the former causes GNU as<2.37 to produce an unreferenced undefined symbol _GLOBAL_OFFSET_TABLE_. (On x86-32, there is an R_386_PC32 vs R_386_PLT32 difference but the linker behavior is identical as far as Linux kernel is concerned.) Simply ignore _GLOBAL_OFFSET_TABLE_ for now, like what scripts/mod/modpost.c:ignore_undef_symbol does. This also fixes the problem for gcc/clang -fpie and -fpic, which may emit `call foo@PLT` for external function calls on x86. Note: ld -z defs and dynamic loaders do not error for unreferenced undefined symbols so the module loader is reading too much. If we ever need to ignore more symbols, the code should be refactored to ignore unreferenced symbols. Cc: <stable@vger.kernel.org> Link: https://github.com/ClangBuiltLinux/linux/issues/1250 Link: https://sourceware.org/bugzilla/show_bug.cgi?id=27178 Reported-by: Marco Elver <elver@google.com> Reviewed-by: Nick Desaulniers <ndesaulniers@google.com> Reviewed-by: Nathan Chancellor <natechancellor@gmail.com> Tested-by: Marco Elver <elver@google.com> Signed-off-by: Fangrui Song <maskray@google.com> Signed-off-by: Jessica Yu <jeyu@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-03-04seccomp: Add missing return in non-void functionPaul Cercueil
commit 04b38d012556199ba4c31195940160e0c44c64f0 upstream. We don't actually care about the value, since the kernel will panic before that; but a value should nonetheless be returned, otherwise the compiler will complain. Fixes: 8112c4f140fa ("seccomp: remove 2-phase API") Cc: stable@vger.kernel.org # 4.7+ Signed-off-by: Paul Cercueil <paul@crapouillou.net> Signed-off-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20210111172839.640914-1-paul@crapouillou.net Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-03-04tracepoint: Do not fail unregistering a probe due to memory failureSteven Rostedt (VMware)
[ Upstream commit befe6d946551d65cddbd32b9cb0170b0249fd5ed ] The list of tracepoint callbacks is managed by an array that is protected by RCU. To update this array, a new array is allocated, the updates are copied over to the new array, and then the list of functions for the tracepoint is switched over to the new array. After a completion of an RCU grace period, the old array is freed. This process happens for both adding a callback as well as removing one. But on removing a callback, if the new array fails to be allocated, the callback is not removed, and may be used after it is freed by the clients of the tracepoint. There's really no reason to fail if the allocation for a new array fails when removing a function. Instead, the function can simply be replaced by a stub function that could be cleaned up on the next modification of the array. That is, instead of calling the function registered to the tracepoint, it would call a stub function in its place. Link: https://lore.kernel.org/r/20201115055256.65625-1-mmullins@mmlx.us Link: https://lore.kernel.org/r/20201116175107.02db396d@gandalf.local.home Link: https://lore.kernel.org/r/20201117211836.54acaef2@oasis.local.home Link: https://lkml.kernel.org/r/20201118093405.7a6d2290@gandalf.local.home [ Note, this version does use undefined compiler behavior (assuming that a stub function with no parameters or return, can be called by a location that thinks it has parameters but still no return value. Static calls do the same thing, so this trick is not without precedent. There's another solution that uses RCU tricks and is more complex, but can be an alternative if this solution becomes an issue. Link: https://lore.kernel.org/lkml/20210127170721.58bce7cc@gandalf.local.home/ ] Cc: Peter Zijlstra <peterz@infradead.org> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Martin KaFai Lau <kafai@fb.com> Cc: Song Liu <songliubraving@fb.com> Cc: Yonghong Song <yhs@fb.com> Cc: Andrii Nakryiko <andriin@fb.com> Cc: John Fastabend <john.fastabend@gmail.com> Cc: KP Singh <kpsingh@chromium.org> Cc: netdev <netdev@vger.kernel.org> Cc: bpf <bpf@vger.kernel.org> Cc: Kees Cook <keescook@chromium.org> Cc: Florian Weimer <fw@deneb.enyo.de> Fixes: 97e1c18e8d17b ("tracing: Kernel Tracepoints") Reported-by: syzbot+83aa762ef23b6f0d1991@syzkaller.appspotmail.com Reported-by: syzbot+d29e58bb557324e55e5e@syzkaller.appspotmail.com Reported-by: Matt Mullins <mmullins@mmlx.us> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Tested-by: Matt Mullins <mmullins@mmlx.us> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-03-04ima: Free IMA measurement buffer after kexec syscallLakshmi Ramasubramanian
[ Upstream commit f31e3386a4e92ba6eda7328cb508462956c94c64 ] IMA allocates kernel virtual memory to carry forward the measurement list, from the current kernel to the next kernel on kexec system call, in ima_add_kexec_buffer() function. This buffer is not freed before completing the kexec system call resulting in memory leak. Add ima_buffer field in "struct kimage" to store the virtual address of the buffer allocated for the IMA measurement list. Free the memory allocated for the IMA measurement list in kimage_file_post_load_cleanup() function. Signed-off-by: Lakshmi Ramasubramanian <nramas@linux.microsoft.com> Suggested-by: Tyler Hicks <tyhicks@linux.microsoft.com> Reviewed-by: Thiago Jung Bauermann <bauerman@linux.ibm.com> Reviewed-by: Tyler Hicks <tyhicks@linux.microsoft.com> Fixes: 7b8589cc29e7 ("ima: on soft reboot, save the measurement list") Signed-off-by: Mimi Zohar <zohar@linux.ibm.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-03-04bpf_lru_list: Read double-checked variable once without lockMarco Elver
[ Upstream commit 6df8fb83301d68ea0a0c0e1cbcc790fcc333ed12 ] For double-checked locking in bpf_common_lru_push_free(), node->type is read outside the critical section and then re-checked under the lock. However, concurrent writes to node->type result in data races. For example, the following concurrent access was observed by KCSAN: write to 0xffff88801521bc22 of 1 bytes by task 10038 on cpu 1: __bpf_lru_node_move_in kernel/bpf/bpf_lru_list.c:91 __local_list_flush kernel/bpf/bpf_lru_list.c:298 ... read to 0xffff88801521bc22 of 1 bytes by task 10043 on cpu 0: bpf_common_lru_push_free kernel/bpf/bpf_lru_list.c:507 bpf_lru_push_free kernel/bpf/bpf_lru_list.c:555 ... Fix the data races where node->type is read outside the critical section (for double-checked locking) by marking the access with READ_ONCE() as well as ensuring the variable is only accessed once. Fixes: 3a08c2fd7634 ("bpf: LRU List") Reported-by: syzbot+3536db46dfa58c573458@syzkaller.appspotmail.com Reported-by: syzbot+516acdb03d3e27d91bcd@syzkaller.appspotmail.com Signed-off-by: Marco Elver <elver@google.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20210209112701.3341724-1-elver@google.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-03-04kdb: Make memory allocations more robustSumit Garg
commit 93f7a6d818deef69d0ba652d46bae6fbabbf365c upstream. Currently kdb uses in_interrupt() to determine whether its library code has been called from the kgdb trap handler or from a saner calling context such as driver init. This approach is broken because in_interrupt() alone isn't able to determine kgdb trap handler entry from normal task context. This can happen during normal use of basic features such as breakpoints and can also be trivially reproduced using: echo g > /proc/sysrq-trigger We can improve this by adding check for in_dbg_master() instead which explicitly determines if we are running in debugger context. Cc: stable@vger.kernel.org Signed-off-by: Sumit Garg <sumit.garg@linaro.org> Link: https://lore.kernel.org/r/1611313556-4004-1-git-send-email-sumit.garg@linaro.org Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-03-04locking/static_key: Fix false positive warnings on concurrent dec/incPeter Zijlstra
commit a1247d06d01045d7ab2882a9c074fbf21137c690 upstream. Even though the atomic_dec_and_mutex_lock() in __static_key_slow_dec_cpuslocked() can never see a negative value in key->enabled the subsequent sanity check is re-reading key->enabled, which may have been set to -1 in the meantime by static_key_slow_inc_cpuslocked(). CPU A CPU B __static_key_slow_dec_cpuslocked(): static_key_slow_inc_cpuslocked(): # enabled = 1 atomic_dec_and_mutex_lock() # enabled = 0 atomic_read() == 0 atomic_set(-1) # enabled = -1 val = atomic_read() # Oops - val == -1! The test case is TCP's clean_acked_data_enable() / clean_acked_data_disable() as tickled by KTLS (net/ktls). Suggested-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reported-by: Jakub Kicinski <jakub.kicinski@netronome.com> Tested-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will.deacon@arm.com> Cc: ard.biesheuvel@linaro.org Cc: oss-drivers@netronome.com Cc: pbonzini@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Will McVicker <willmcvicker@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-03-04jump_label/lockdep: Assert we hold the hotplug lock for _cpuslocked() operationsPeter Zijlstra
commit cb538267ea1e9e025ec692577c9ae75797261889 upstream. Weirdly we seem to have forgotten this... Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Will McVicker <willmcvicker@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-02-23bpf: Check for integer overflow when using roundup_pow_of_two()Bui Quang Minh
[ Upstream commit 6183f4d3a0a2ad230511987c6c362ca43ec0055f ] On 32-bit architecture, roundup_pow_of_two() can return 0 when the argument has upper most bit set due to resulting 1UL << 32. Add a check for this case. Fixes: d5a3b1f69186 ("bpf: introduce BPF_MAP_TYPE_STACK_TRACE") Signed-off-by: Bui Quang Minh <minhquangbui99@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20210127063653.3576-1-minhquangbui99@gmail.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-02-23tracing: Check length before giving out the filter bufferSteven Rostedt (VMware)
commit b220c049d5196dd94d992dd2dc8cba1a5e6123bf upstream. When filters are used by trace events, a page is allocated on each CPU and used to copy the trace event fields to this page before writing to the ring buffer. The reason to use the filter and not write directly into the ring buffer is because a filter may discard the event and there's more overhead on discarding from the ring buffer than the extra copy. The problem here is that there is no check against the size being allocated when using this page. If an event asks for more than a page size while being filtered, it will get only a page, leading to the caller writing more that what was allocated. Check the length of the request, and if it is more than PAGE_SIZE minus the header default back to allocating from the ring buffer directly. The ring buffer may reject the event if its too big anyway, but it wont overflow. Link: https://lore.kernel.org/ath10k/1612839593-2308-1-git-send-email-wgong@codeaurora.org/ Cc: stable@vger.kernel.org Fixes: 0fc1b09ff1ff4 ("tracing: Use temp buffer when filtering events") Reported-by: Wen Gong <wgong@codeaurora.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-02-23tracing: Do not count ftrace events in top level enable outputSteven Rostedt (VMware)
commit 256cfdd6fdf70c6fcf0f7c8ddb0ebd73ce8f3bc9 upstream. The file /sys/kernel/tracing/events/enable is used to enable all events by echoing in "1", or disabling all events when echoing in "0". To know if all events are enabled, disabled, or some are enabled but not all of them, cating the file should show either "1" (all enabled), "0" (all disabled), or "X" (some enabled but not all of them). This works the same as the "enable" files in the individule system directories (like tracing/events/sched/enable). But when all events are enabled, the top level "enable" file shows "X". The reason is that its checking the "ftrace" events, which are special events that only exist for their format files. These include the format for the function tracer events, that are enabled when the function tracer is enabled, but not by the "enable" file. The check includes these events, which will always be disabled, and even though all true events are enabled, the top level "enable" file will show "X" instead of "1". To fix this, have the check test the event's flags to see if it has the "IGNORE_ENABLE" flag set, and if so, not test it. Cc: stable@vger.kernel.org Fixes: 553552ce1796c ("tracing: Combine event filter_active and enable into single flags field") Reported-by: "Yordan Karadzhov (VMware)" <y.karadz@gmail.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-02-13fgraph: Initialize tracing_graph_pause at task creationSteven Rostedt (VMware)
commit 7e0a9220467dbcfdc5bc62825724f3e52e50ab31 upstream. On some archs, the idle task can call into cpu_suspend(). The cpu_suspend() will disable or pause function graph tracing, as there's some paths in bringing down the CPU that can have issues with its return address being modified. The task_struct structure has a "tracing_graph_pause" atomic counter, that when set to something other than zero, the function graph tracer will not modify the return address. The problem is that the tracing_graph_pause counter is initialized when the function graph tracer is enabled. This can corrupt the counter for the idle task if it is suspended in these architectures. CPU 1 CPU 2 ----- ----- do_idle() cpu_suspend() pause_graph_tracing() task_struct->tracing_graph_pause++ (0 -> 1) start_graph_tracing() for_each_online_cpu(cpu) { ftrace_graph_init_idle_task(cpu) task-struct->tracing_graph_pause = 0 (1 -> 0) unpause_graph_tracing() task_struct->tracing_graph_pause-- (0 -> -1) The above should have gone from 1 to zero, and enabled function graph tracing again. But instead, it is set to -1, which keeps it disabled. There's no reason that the field tracing_graph_pause on the task_struct can not be initialized at boot up. Cc: stable@vger.kernel.org Fixes: 380c4b1411ccd ("tracing/function-graph-tracer: append the tracing_graph_flag") Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=211339 Reported-by: pierre.gondois@arm.com Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-02-13tracing/kprobe: Fix to support kretprobe events on unloaded modulesMasami Hiramatsu
commit 97c753e62e6c31a404183898d950d8c08d752dbd upstream. Fix kprobe_on_func_entry() returns error code instead of false so that register_kretprobe() can return an appropriate error code. append_trace_kprobe() expects the kprobe registration returns -ENOENT when the target symbol is not found, and it checks whether the target module is unloaded or not. If the target module doesn't exist, it defers to probe the target symbol until the module is loaded. However, since register_kretprobe() returns -EINVAL instead of -ENOENT in that case, it always fail on putting the kretprobe event on unloaded modules. e.g. Kprobe event: /sys/kernel/debug/tracing # echo p xfs:xfs_end_io >> kprobe_events [ 16.515574] trace_kprobe: This probe might be able to register after target module is loaded. Continue. Kretprobe event: (p -> r) /sys/kernel/debug/tracing # echo r xfs:xfs_end_io >> kprobe_events sh: write error: Invalid argument /sys/kernel/debug/tracing # cat error_log [ 41.122514] trace_kprobe: error: Failed to register probe event Command: r xfs:xfs_end_io ^ To fix this bug, change kprobe_on_func_entry() to detect symbol lookup failure and return -ENOENT in that case. Otherwise it returns -EINVAL or 0 (succeeded, given address is on the entry). Link: https://lkml.kernel.org/r/161176187132.1067016.8118042342894378981.stgit@devnote2 Cc: stable@vger.kernel.org Fixes: 59158ec4aef7 ("tracing/kprobes: Check the probe on unloaded module correctly") Reported-by: Jianlin Lv <Jianlin.Lv@arm.com> Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-02-10genirq/msi: Activate Multi-MSI early when MSI_FLAG_ACTIVATE_EARLY is setMarc Zyngier
commit 4c457e8cb75eda91906a4f89fc39bde3f9a43922 upstream. When MSI_FLAG_ACTIVATE_EARLY is set (which is the case for PCI), __msi_domain_alloc_irqs() performs the activation of the interrupt (which in the case of PCI results in the endpoint being programmed) as soon as the interrupt is allocated. But it appears that this is only done for the first vector, introducing an inconsistent behaviour for PCI Multi-MSI. Fix it by iterating over the number of vectors allocated to each MSI descriptor. This is easily achieved by introducing a new "for_each_msi_vector" iterator, together with a tiny bit of refactoring. Fixes: f3b0946d629c ("genirq/msi: Make sure PCI MSIs are activated early") Reported-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20210123122759.1781359-1-maz@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-02-10kretprobe: Avoid re-registration of the same kretprobe earlierWang ShaoBo
commit 0188b87899ffc4a1d36a0badbe77d56c92fd91dc upstream. Our system encountered a re-init error when re-registering same kretprobe, where the kretprobe_instance in rp->free_instances is illegally accessed after re-init. Implementation to avoid re-registration has been introduced for kprobe before, but lags for register_kretprobe(). We must check if kprobe has been re-registered before re-initializing kretprobe, otherwise it will destroy the data struct of kretprobe registered, which can lead to memory leak, system crash, also some unexpected behaviors. We use check_kprobe_rereg() to check if kprobe has been re-registered before running register_kretprobe()'s body, for giving a warning message and terminate registration process. Link: https://lkml.kernel.org/r/20210128124427.2031088-1-bobo.shaobowang@huawei.com Cc: stable@vger.kernel.org Fixes: 1f0ab40976460 ("kprobes: Prevent re-registration of the same kprobe") [ The above commit should have been done for kretprobes too ] Acked-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com> Acked-by: Ananth N Mavinakayanahalli <ananth@linux.ibm.com> Acked-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Wang ShaoBo <bobo.shaobowang@huawei.com> Signed-off-by: Cheng Jian <cj.chengjian@huawei.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-02-10elfcore: fix building with clangArnd Bergmann
commit 6e7b64b9dd6d96537d816ea07ec26b7dedd397b9 upstream. kernel/elfcore.c only contains weak symbols, which triggers a bug with clang in combination with recordmcount: Cannot find symbol for section 2: .text. kernel/elfcore.o: failed Move the empty stubs into linux/elfcore.h as inline functions. As only two architectures use these, just use the architecture specific Kconfig symbols to key off the declaration. Link: https://lkml.kernel.org/r/20201204165742.3815221-2-arnd@kernel.org Signed-off-by: Arnd Bergmann <arnd@arndb.de> Cc: Nathan Chancellor <natechancellor@gmail.com> Cc: Nick Desaulniers <ndesaulniers@google.com> Cc: Barret Rhoden <brho@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-02-07workqueue: Restrict affinity change to rescuerPeter Zijlstra
[ Upstream commit 640f17c82460e9724fd256f0a1f5d99e7ff0bda4 ] create_worker() will already set the right affinity using kthread_bind_mask(), this means only the rescuer will need to change it's affinity. Howveer, while in cpu-hot-unplug a regular task is not allowed to run on online&&!active as it would be pushed away quite agressively. We need KTHREAD_IS_PER_CPU to survive in that environment. Therefore set the affinity after getting that magic flag. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <valentin.schneider@arm.com> Tested-by: Valentin Schneider <valentin.schneider@arm.com> Link: https://lkml.kernel.org/r/20210121103506.826629830@infradead.org Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-02-07kthread: Extract KTHREAD_IS_PER_CPUPeter Zijlstra
[ Upstream commit ac687e6e8c26181a33270efd1a2e2241377924b0 ] There is a need to distinguish geniune per-cpu kthreads from kthreads that happen to have a single CPU affinity. Geniune per-cpu kthreads are kthreads that are CPU affine for correctness, these will obviously have PF_KTHREAD set, but must also have PF_NO_SETAFFINITY set, lest userspace modify their affinity and ruins things. However, these two things are not sufficient, PF_NO_SETAFFINITY is also set on other tasks that have their affinities controlled through other means, like for instance workqueues. Therefore another bit is needed; it turns out kthread_create_per_cpu() already has such a bit: KTHREAD_IS_PER_CPU, which is used to make kthread_park()/kthread_unpark() work correctly. Expose this flag and remove the implicit setting of it from kthread_create_on_cpu(); the io_uring usage of it seems dubious at best. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <valentin.schneider@arm.com> Tested-by: Valentin Schneider <valentin.schneider@arm.com> Link: https://lkml.kernel.org/r/20210121103506.557620262@infradead.org Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-02-07sysctl: handle overflow in proc_get_longChristian Brauner
commit 7f2923c4f73f21cfd714d12a2d48de8c21f11cfe upstream. proc_get_long() is a funny function. It uses simple_strtoul() and for a good reason. proc_get_long() wants to always succeed the parse and return the maybe incorrect value and the trailing characters to check against a pre-defined list of acceptable trailing values. However, simple_strtoul() explicitly ignores overflows which can cause funny things like the following to happen: echo 18446744073709551616 > /proc/sys/fs/file-max cat /proc/sys/fs/file-max 0 (Which will cause your system to silently die behind your back.) On the other hand kstrtoul() does do overflow detection but does not return the trailing characters, and also fails the parse when anything other than '\n' is a trailing character whereas proc_get_long() wants to be more lenient. Now, before adding another kstrtoul() function let's simply add a static parse strtoul_lenient() which: - fails on overflow with -ERANGE - returns the trailing characters to the caller The reason why we should fail on ERANGE is that we already do a partial fail on overflow right now. Namely, when the TMPBUFLEN is exceeded. So we already reject values such as 184467440737095516160 (21 chars) but accept values such as 18446744073709551616 (20 chars) but both are overflows. So we should just always reject 64bit overflows and not special-case this based on the number of chars. Link: http://lkml.kernel.org/r/20190107222700.15954-2-christian@brauner.io Signed-off-by: Christian Brauner <christian@brauner.io> Acked-by: Kees Cook <keescook@chromium.org> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: Joe Lawrence <joe.lawrence@redhat.com> Cc: Waiman Long <longman@redhat.com> Cc: Dominik Brodowski <linux@dominikbrodowski.net> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Joerg Vehlow <lkml@jv-coder.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-02-03PM: hibernate: flush swap writer after markingLaurent Badel
commit fef9c8d28e28a808274a18fbd8cc2685817fd62a upstream. Flush the swap writer after, not before, marking the files, to ensure the signature is properly written. Fixes: 6f612af57821 ("PM / Hibernate: Group swap ops") Signed-off-by: Laurent Badel <laurentbadel@eaton.com> Cc: All applicable <stable@vger.kernel.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-02-03kernel: kexec: remove the lock operation of system_transition_mutexBaoquan He
commit 56c91a18432b631ca18438841fd1831ef756cabf upstream. Function kernel_kexec() is called with lock system_transition_mutex held in reboot system call. While inside kernel_kexec(), it will acquire system_transition_mutex agin. This will lead to dead lock. The dead lock should be easily triggered, it hasn't caused any failure report just because the feature 'kexec jump' is almost not used by anyone as far as I know. An inquiry can be made about who is using 'kexec jump' and where it's used. Before that, let's simply remove the lock operation inside CONFIG_KEXEC_JUMP ifdeffery scope. Fixes: 55f2503c3b69 ("PM / reboot: Eliminate race between reboot and suspend") Signed-off-by: Baoquan He <bhe@redhat.com> Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Pingfan Liu <kernelfans@gmail.com> Cc: 4.19+ <stable@vger.kernel.org> # 4.19+ Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-30tracing: Fix race in trace_open and buffer resize callGaurav Kohli
commit bbeb97464eefc65f506084fd9f18f21653e01137 upstream. Below race can come, if trace_open and resize of cpu buffer is running parallely on different cpus CPUX CPUY ring_buffer_resize atomic_read(&buffer->resize_disabled) tracing_open tracing_reset_online_cpus ring_buffer_reset_cpu rb_reset_cpu rb_update_pages remove/insert pages resetting pointer This race can cause data abort or some times infinte loop in rb_remove_pages and rb_insert_pages while checking pages for sanity. Take buffer lock to fix this. Link: https://lkml.kernel.org/r/1601976833-24377-1-git-send-email-gkohli@codeaurora.org Cc: stable@vger.kernel.org Fixes: 83f40318dab00 ("ring-buffer: Make removal of ring buffer pages atomic") Reported-by: Denis Efremov <efremov@linux.com> Signed-off-by: Gaurav Kohli <gkohli@codeaurora.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-30futex: Handle faults correctly for PI futexesThomas Gleixner
commit 34b1a1ce1458f50ef27c54e28eb9b1947012907a upstream fixup_pi_state_owner() tries to ensure that the state of the rtmutex, pi_state and the user space value related to the PI futex are consistent before returning to user space. In case that the user space value update faults and the fault cannot be resolved by faulting the page in via fault_in_user_writeable() the function returns with -EFAULT and leaves the rtmutex and pi_state owner state inconsistent. A subsequent futex_unlock_pi() operates on the inconsistent pi_state and releases the rtmutex despite not owning it which can corrupt the RB tree of the rtmutex and cause a subsequent kernel stack use after free. It was suggested to loop forever in fixup_pi_state_owner() if the fault cannot be resolved, but that results in runaway tasks which is especially undesired when the problem happens due to a programming error and not due to malice. As the user space value cannot be fixed up, the proper solution is to make the rtmutex and the pi_state consistent so both have the same owner. This leaves the user space value out of sync. Any subsequent operation on the futex will fail because the 10th rule of PI futexes (pi_state owner and user space value are consistent) has been violated. As a consequence this removes the inept attempts of 'fixing' the situation in case that the current task owns the rtmutex when returning with an unresolvable fault by unlocking the rtmutex which left pi_state::owner and rtmutex::owner out of sync in a different and only slightly less dangerous way. Fixes: 1b7558e457ed ("futexes: fix fault handling in futex_lock_pi") Reported-by: gzobqq@gmail.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: stable@vger.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-30futex: Simplify fixup_pi_state_owner()Thomas Gleixner
commit f2dac39d93987f7de1e20b3988c8685523247ae2 upstream Too many gotos already and an upcoming fix would make it even more unreadable. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: stable@vger.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-30futex: Use pi_state_update_owner() in put_pi_state()Thomas Gleixner
commit 6ccc84f917d33312eb2846bd7b567639f585ad6d upstream No point in open coding it. This way it gains the extra sanity checks. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: stable@vger.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-30rtmutex: Remove unused argument from rt_mutex_proxy_unlock()Thomas Gleixner
commit 2156ac1934166d6deb6cd0f6ffc4c1076ec63697 upstream Nothing uses the argument. Remove it as preparation to use pi_state_update_owner(). Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: stable@vger.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-30futex: Provide and use pi_state_update_owner()Thomas Gleixner
commit c5cade200ab9a2a3be9e7f32a752c8d86b502ec7 upstream Updating pi_state::owner is done at several places with the same code. Provide a function for it and use that at the obvious places. This is also a preparation for a bug fix to avoid yet another copy of the same code or alternatively introducing a completely unpenetratable mess of gotos. Originally-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: stable@vger.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-30futex: Replace pointless printk in fixup_owner()Thomas Gleixner
commit 04b79c55201f02ffd675e1231d731365e335c307 upstream If that unexpected case of inconsistent arguments ever happens then the futex state is left completely inconsistent and the printk is not really helpful. Replace it with a warning and make the state consistent. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: stable@vger.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-30futex: Ensure the correct return value from futex_lock_pi()Thomas Gleixner
commit 12bb3f7f1b03d5913b3f9d4236a488aa7774dfe9 upstream In case that futex_lock_pi() was aborted by a signal or a timeout and the task returned without acquiring the rtmutex, but is the designated owner of the futex due to a concurrent futex_unlock_pi() fixup_owner() is invoked to establish consistent state. In that case it invokes fixup_pi_state_owner() which in turn tries to acquire the rtmutex again. If that succeeds then it does not propagate this success to fixup_owner() and futex_lock_pi() returns -EINTR or -ETIMEOUT despite having the futex locked. Return success from fixup_pi_state_owner() in all cases where the current task owns the rtmutex and therefore the futex and propagate it correctly through fixup_owner(). Fixup the other callsite which does not expect a positive return value. Fixes: c1e2f0eaf015 ("futex: Avoid violating the 10th rule of futex") Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: stable@vger.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-30futex: Prevent exit livelockThomas Gleixner
commit 3ef240eaff36b8119ac9e2ea17cbf41179c930ba upstream Oleg provided the following test case: int main(void) { struct sched_param sp = {}; sp.sched_priority = 2; assert(sched_setscheduler(0, SCHED_FIFO, &sp) == 0); int lock = vfork(); if (!lock) { sp.sched_priority = 1; assert(sched_setscheduler(0, SCHED_FIFO, &sp) == 0); _exit(0); } syscall(__NR_futex, &lock, FUTEX_LOCK_PI, 0,0,0); return 0; } This creates an unkillable RT process spinning in futex_lock_pi() on a UP machine or if the process is affine to a single CPU. The reason is: parent child set FIFO prio 2 vfork() -> set FIFO prio 1 implies wait_for_child() sched_setscheduler(...) exit() do_exit() .... mm_release() tsk->futex_state = FUTEX_STATE_EXITING; exit_futex(); (NOOP in this case) complete() --> wakes parent sys_futex() loop infinite because tsk->futex_state == FUTEX_STATE_EXITING The same problem can happen just by regular preemption as well: task holds futex ... do_exit() tsk->futex_state = FUTEX_STATE_EXITING; --> preemption (unrelated wakeup of some other higher prio task, e.g. timer) switch_to(other_task) return to user sys_futex() loop infinite as above Just for the fun of it the futex exit cleanup could trigger the wakeup itself before the task sets its futex state to DEAD. To cure this, the handling of the exiting owner is changed so: - A refcount is held on the task - The task pointer is stored in a caller visible location - The caller drops all locks (hash bucket, mmap_sem) and blocks on task::futex_exit_mutex. When the mutex is acquired then the exiting task has completed the cleanup and the state is consistent and can be reevaluated. This is not a pretty solution, but there is no choice other than returning an error code to user space, which would break the state consistency guarantee and open another can of problems including regressions. For stable backports the preparatory commits ac31c7ff8624 .. ba31c1a48538 are required as well, but for anything older than 5.3.y the backports are going to be provided when this hits mainline as the other dependencies for those kernels are definitely not stable material. Fixes: 778e9a9c3e71 ("pi-futex: fix exit races and locking problems") Reported-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Stable Team <stable@vger.kernel.org> Link: https://lkml.kernel.org/r/20191106224557.041676471@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-30futex: Provide distinct return value when owner is exitingThomas Gleixner
commit ac31c7ff8624409ba3c4901df9237a616c187a5d upstream attach_to_pi_owner() returns -EAGAIN for various cases: - Owner task is exiting - Futex value has changed The caller drops the held locks (hash bucket, mmap_sem) and retries the operation. In case of the owner task exiting this can result in a live lock. As a preparatory step for seperating those cases, provide a distinct return value (EBUSY) for the owner exiting case. No functional change. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20191106224556.935606117@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-30futex: Add mutex around futex exitThomas Gleixner
commit 3f186d974826847a07bc7964d79ec4eded475ad9 upstream The mutex will be used in subsequent changes to replace the busy looping of a waiter when the futex owner is currently executing the exit cleanup to prevent a potential live lock. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20191106224556.845798895@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-30futex: Provide state handling for exec() as wellThomas Gleixner
commit af8cbda2cfcaa5515d61ec500498d46e9a8247e2 upstream exec() attempts to handle potentially held futexes gracefully by running the futex exit handling code like exit() does. The current implementation has no protection against concurrent incoming waiters. The reason is that the futex state cannot be set to FUTEX_STATE_DEAD after the cleanup because the task struct is still active and just about to execute the new binary. While its arguably buggy when a task holds a futex over exec(), for consistency sake the state handling can at least cover the actual futex exit cleanup section. This provides state consistency protection accross the cleanup. As the futex state of the task becomes FUTEX_STATE_OK after the cleanup has been finished, this cannot prevent subsequent attempts to attach to the task in case that the cleanup was not successfull in mopping up all leftovers. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20191106224556.753355618@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-30futex: Sanitize exit state handlingThomas Gleixner
commit 4a8e991b91aca9e20705d434677ac013974e0e30 upstream Instead of having a smp_mb() and an empty lock/unlock of task::pi_lock move the state setting into to the lock section. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20191106224556.645603214@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-30futex: Mark the begin of futex exit explicitlyThomas Gleixner
commit 18f694385c4fd77a09851fd301236746ca83f3cb upstream Instead of relying on PF_EXITING use an explicit state for the futex exit and set it in the futex exit function. This moves the smp barrier and the lock/unlock serialization into the futex code. As with the DEAD state this is restricted to the exit path as exec continues to use the same task struct. This allows to simplify that logic in a next step. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20191106224556.539409004@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-30futex: Set task::futex_state to DEAD right after handling futex exitThomas Gleixner
commit f24f22435dcc11389acc87e5586239c1819d217c upstream Setting task::futex_state in do_exit() is rather arbitrarily placed for no reason. Move it into the futex code. Note, this is only done for the exit cleanup as the exec cleanup cannot set the state to FUTEX_STATE_DEAD because the task struct is still in active use. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20191106224556.439511191@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-30futex: Split futex_mm_release() for exit/execThomas Gleixner
commit 150d71584b12809144b8145b817e83b81158ae5f upstream To allow separate handling of the futex exit state in the futex exit code for exit and exec, split futex_mm_release() into two functions and invoke them from the corresponding exit/exec_mm_release() callsites. Preparatory only, no functional change. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20191106224556.332094221@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-30exit/exec: Seperate mm_release()Thomas Gleixner
commit 4610ba7ad877fafc0a25a30c6c82015304120426 upstream mm_release() contains the futex exit handling. mm_release() is called from do_exit()->exit_mm() and from exec()->exec_mm(). In the exit_mm() case PF_EXITING and the futex state is updated. In the exec_mm() case these states are not touched. As the futex exit code needs further protections against exit races, this needs to be split into two functions. Preparatory only, no functional change. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20191106224556.240518241@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-30futex: Replace PF_EXITPIDONE with a stateThomas Gleixner
commit 3d4775df0a89240f671861c6ab6e8d59af8e9e41 upstream The futex exit handling relies on PF_ flags. That's suboptimal as it requires a smp_mb() and an ugly lock/unlock of the exiting tasks pi_lock in the middle of do_exit() to enforce the observability of PF_EXITING in the futex code. Add a futex_state member to task_struct and convert the PF_EXITPIDONE logic over to the new state. The PF_EXITING dependency will be cleaned up in a later step. This prepares for handling various futex exit issues later. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20191106224556.149449274@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-30futex: Move futex exit handling into futex codeThomas Gleixner
commit ba31c1a48538992316cc71ce94fa9cd3e7b427c0 upstream The futex exit handling is #ifdeffed into mm_release() which is not pretty to begin with. But upcoming changes to address futex exit races need to add more functionality to this exit code. Split it out into a function, move it into futex code and make the various futex exit functions static. Preparatory only and no functional change. Folded build fix from Borislav. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20191106224556.049705556@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-19tracing/kprobes: Do the notrace functions check without kprobes on ftraceMasami Hiramatsu
commit 7bb83f6fc4ee84e95d0ac0d14452c2619fb3fe70 upstream. Enable the notrace function check on the architecture which doesn't support kprobes on ftrace but support dynamic ftrace. This notrace function check is not only for the kprobes on ftrace but also sw-breakpoint based kprobes. Thus there is no reason to limit this check for the arch which supports kprobes on ftrace. This also changes the dependency of Kconfig. Because kprobe event uses the function tracer's address list for identifying notrace function, if the CONFIG_DYNAMIC_FTRACE=n, it can not check whether the target function is notrace or not. Link: https://lkml.kernel.org/r/20210105065730.2634785-1-naveen.n.rao@linux.vnet.ibm.com Link: https://lkml.kernel.org/r/161007957862.114704.4512260007555399463.stgit@devnote2 Cc: stable@vger.kernel.org Fixes: 45408c4f92506 ("tracing: kprobes: Prohibit probing on notrace function") Acked-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com> Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-12workqueue: Kick a worker based on the actual activation of delayed worksYunfeng Ye
[ Upstream commit 01341fbd0d8d4e717fc1231cdffe00343088ce0b ] In realtime scenario, We do not want to have interference on the isolated cpu cores. but when invoking alloc_workqueue() for percpu wq on the housekeeping cpu, it kick a kworker on the isolated cpu. alloc_workqueue pwq_adjust_max_active wake_up_worker The comment in pwq_adjust_max_active() said: "Need to kick a worker after thawed or an unbound wq's max_active is bumped" So it is unnecessary to kick a kworker for percpu's wq when invoking alloc_workqueue(). this patch only kick a worker based on the actual activation of delayed works. Signed-off-by: Yunfeng Ye <yeyunfeng@huawei.com> Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-01-06module: delay kobject uevent until after module init callJessica Yu
[ Upstream commit 38dc717e97153e46375ee21797aa54777e5498f3 ] Apparently there has been a longstanding race between udev/systemd and the module loader. Currently, the module loader sends a uevent right after sysfs initialization, but before the module calls its init function. However, some udev rules expect that the module has initialized already upon receiving the uevent. This race has been triggered recently (see link in references) in some systemd mount unit files. For instance, the configfs module creates the /sys/kernel/config mount point in its init function, however the module loader issues the uevent before this happens. sys-kernel-config.mount expects to be able to mount /sys/kernel/config upon receipt of the module loading uevent, but if the configfs module has not called its init function yet, then this directory will not exist and the mount unit fails. A similar situation exists for sys-fs-fuse-connections.mount, as the fuse sysfs mount point is created during the fuse module's init function. If udev is faster than module initialization then the mount unit would fail in a similar fashion. To fix this race, delay the module KOBJ_ADD uevent until after the module has finished calling its init routine. References: https://github.com/systemd/systemd/issues/17586 Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Tested-By: Nicolas Morey-Chaisemartin <nmoreychaisemartin@suse.com> Signed-off-by: Jessica Yu <jeyu@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-01-06module: set MODULE_STATE_GOING state when a module fails to loadMiroslav Benes
[ Upstream commit 5e8ed280dab9eeabc1ba0b2db5dbe9fe6debb6b5 ] If a module fails to load due to an error in prepare_coming_module(), the following error handling in load_module() runs with MODULE_STATE_COMING in module's state. Fix it by correctly setting MODULE_STATE_GOING under "bug_cleanup" label. Signed-off-by: Miroslav Benes <mbenes@suse.cz> Signed-off-by: Jessica Yu <jeyu@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>