path: root/kernel
2019-02-12  perf/core: Don't WARN() for impossible ring-buffer sizes  (Mark Rutland)

commit 9dff0aa95a324e262ffb03f425d00e4751f3294e upstream.

The perf tool uses /proc/sys/kernel/perf_event_mlock_kb to determine how large its ringbuffer mmap should be. This can be configured to arbitrary values, which can be larger than the maximum possible allocation from kmalloc.

When this is configured to a suitably large value (e.g. thanks to the perf fuzzer), attempting to use perf record triggers a WARN_ON_ONCE() in __alloc_pages_nodemask():

    WARNING: CPU: 2 PID: 5666 at mm/page_alloc.c:4511 __alloc_pages_nodemask+0x3f8/0xbc8

Let's avoid this by checking that the requested allocation is possible before calling kzalloc.

Reported-by: Julien Thierry <julien.thierry@arm.com>
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Julien Thierry <julien.thierry@arm.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: <stable@vger.kernel.org>
Link: https://lkml.kernel.org/r/20190110142745.25495-1-mark.rutland@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
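A minimal sketch of the check described above, as it would sit in rb_alloc() in kernel/events/ring_buffer.c; the bound uses the page allocator's public limits, and the surrounding fail label is assumed from context:

    struct ring_buffer *rb;
    unsigned long size;

    size = sizeof(struct ring_buffer);
    size += nr_pages * sizeof(void *);

    /* refuse sizes the page allocator can never satisfy */
    if (order_base_2(size) >= PAGE_SHIFT + MAX_ORDER)
        goto fail;

    rb = kzalloc(size, GFP_KERNEL);
    if (!rb)
        goto fail;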
2019-02-12  cpu/hotplug: Fix "SMT disabled by BIOS" detection for KVM  (Josh Poimboeuf)

commit b284909abad48b07d3071a9fc9b5692b3e64914b upstream.

With the following commit:

    73d5e2b47264 ("cpu/hotplug: detect SMT disabled by BIOS")

... the hotplug code attempted to detect when SMT was disabled by BIOS, in which case it reported SMT as permanently disabled. However, that code broke a virt hotplug scenario, where the guest is booted with only primary CPU threads, and a sibling is brought online later. The problem is that there doesn't seem to be a way to reliably distinguish between the HW "SMT disabled by BIOS" case and the virt "sibling not yet brought online" case. So the above-mentioned commit was a bit misguided, as it permanently disabled SMT for both cases, preventing future virt sibling hotplugs.

Going back and reviewing the original problems which were attempted to be solved by that commit, when SMT was disabled in BIOS:

  1) /sys/devices/system/cpu/smt/control showed "on" instead of "notsupported"; and

  2) vmx_vm_init() was incorrectly showing the L1TF_MSG_SMT warning.

I'd propose that we instead consider #1 above to not actually be a problem. Because, at least in the virt case, it's possible that SMT wasn't disabled by BIOS and a sibling thread could be brought online later. So it makes sense to just always default the smt control to "on" to allow for that possibility (assuming cpuid indicates that the CPU supports SMT).

The real problem is #2, which has a simple fix: change vmx_vm_init() to query the actual current SMT state -- i.e., whether any siblings are currently online -- instead of looking at the SMT "control" sysfs value.

So fix it by:

  a) reverting the original "fix" and its followup fix:

     73d5e2b47264 ("cpu/hotplug: detect SMT disabled by BIOS")
     bc2d8d262cba ("cpu/hotplug: Fix SMT supported evaluation")

     and

  b) changing vmx_vm_init() to query the actual current SMT state -- instead of the sysfs control value -- to determine whether the L1TF warning is needed. This also requires the 'sched_smt_present' variable to be exported, instead of 'cpu_smt_control'.

Fixes: 73d5e2b47264 ("cpu/hotplug: detect SMT disabled by BIOS")
Reported-by: Igor Mammedov <imammedo@redhat.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Joe Mario <jmario@redhat.com>
Cc: Jiri Kosina <jikos@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: kvm@vger.kernel.org
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/e3a85d585da28cc333ecbc1e78ee9216e6da9396.1548794349.git.jpoimboe@redhat.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
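A simplified sketch of the vmx_vm_init() side of the change, assuming the surrounding L1TF mitigation logic; sched_smt_active() reports whether any sibling thread is currently online:

    static int vmx_vm_init(struct kvm *kvm)
    {
        if (boot_cpu_has(X86_BUG_L1TF) && enable_ept) {
            /* query the live topology, not the sysfs control value */
            if (sched_smt_active())
                pr_warn_once(L1TF_MSG_SMT);
        }
        return 0;
    }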
2019-02-12  futex: Handle early deadlock return correctly  (Thomas Gleixner)

commit 1a1fb985f2e2b85ec0d3dc2e519ee48389ec2434 upstream.

commit 56222b212e8e ("futex: Drop hb->lock before enqueueing on the rtmutex") changed the locking rules in the futex code so that the hash bucket lock is no longer held while the waiter is enqueued into the rtmutex wait list.

This made the lock and the unlock path symmetric, but unfortunately the possible early exit from __rt_mutex_proxy_start() due to a detected deadlock was not updated accordingly. That allows a concurrent unlocker to observe inconsistent state which triggers the warning in the unlock path.

    futex_lock_pi()                 futex_unlock_pi()
      lock(hb->lock)
      queue(hb_waiter)              lock(hb->lock)
      lock(rtmutex->wait_lock)
      unlock(hb->lock)
                                    // acquired hb->lock
                                    hb_waiter = futex_top_waiter()
                                    lock(rtmutex->wait_lock)
      __rt_mutex_proxy_start()
         ---> fail
              remove(rtmutex_waiter);
         ---> returns -EDEADLOCK
      unlock(rtmutex->wait_lock)
                                    // acquired wait_lock
                                    wake_futex_pi()
                                    rt_mutex_next_owner()
                                      --> returns NULL
                                      --> WARN

      lock(hb->lock)
      unqueue(hb_waiter)

The problem is caused by the remove(rtmutex_waiter) in the failure case of __rt_mutex_proxy_start() as this lets the unlocker observe a waiter in the hash bucket but no waiter on the rtmutex, i.e. inconsistent state.

The original commit handles this correctly for the other early return cases (timeout, signal) by delaying the removal of the rtmutex waiter until the returning task has reacquired the hash bucket lock. Treat the failure case of __rt_mutex_proxy_start() in the same way and let the existing cleanup code handle the eventual handover of the rtmutex gracefully.

The regular rt_mutex_proxy_start() gains the rtmutex waiter removal for the failure case, so that the other callsites are still operating correctly.

Add proper comments to the code so all these details are fully documented.

Thanks to Peter for helping with the analysis and writing the really valuable code comments.

Fixes: 56222b212e8e ("futex: Drop hb->lock before enqueueing on the rtmutex")
Reported-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Co-developed-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: linux-s390@vger.kernel.org
Cc: Stefan Liebler <stli@linux.ibm.com>
Cc: Sebastian Sewior <bigeasy@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/alpine.DEB.2.21.1901292311410.1950@nanos.tec.linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-02-12  kernel/hung_task.c: force console verbose before panic  (Liu, Chuansheng)

[ Upstream commit 168e06f7937d96c7222037d8a05565e8a6eb00fe ]

Based on commit 401c636a0eeb ("kernel/hung_task.c: show all hung tasks before panic"), we could get the call stack of hung task. However, if the console loglevel is not high, we still can not see the useful panic information in practice, and in most cases users don't set console loglevel to high level.

This patch is to force console verbose before system panic, so that the real useful information can be seen in the console, instead of being like the following, which doesn't have hung task information.

    INFO: task init:1 blocked for more than 120 seconds.
          Tainted: G     U  W    4.19.0-quilt-2e5dc0ac-g51b6c21d76cc #1
    "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    Kernel panic - not syncing: hung_task: blocked tasks
    CPU: 2 PID: 479 Comm: khungtaskd Tainted: G     U  W    4.19.0-quilt-2e5dc0ac-g51b6c21d76cc #1
    Call Trace:
     dump_stack+0x4f/0x65
     panic+0xde/0x231
     watchdog+0x290/0x410
     kthread+0x12c/0x150
     ret_from_fork+0x35/0x40
    reboot: panic mode set: p,w
    Kernel Offset: 0x34000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)

Link: http://lkml.kernel.org/r/27240C0AC20F114CBF8149A2696CBE4A6015B675@SHSMSX101.ccr.corp.intel.com
Signed-off-by: Chuansheng Liu <chuansheng.liu@intel.com>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
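A sketch of the fix in check_hung_task(), reusing the flag names introduced by commit 401c636a0eeb; the only new step is raising the console loglevel before the panic machinery runs:

    if (sysctl_hung_task_panic) {
        console_verbose();
        hung_task_show_lock = true;
        hung_task_call_panic = true;
    }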
2019-02-12  proc/sysctl: fix return error for proc_doulongvec_minmax()  (Cheng Lin)

[ Upstream commit 09be178400829dddc1189b50a7888495dd26aa84 ]

If the number of input parameters is less than the total parameters, an EINVAL error will be returned.

For example, we use proc_doulongvec_minmax to pass up to two parameters with kern_table:

    {
        .procname     = "monitor_signals",
        .data         = &monitor_sigs,
        .maxlen       = 2*sizeof(unsigned long),
        .mode         = 0644,
        .proc_handler = proc_doulongvec_minmax,
    },

Reproduce:

When passing two parameters, it works normally. But when passing only one parameter, an error "Invalid argument" (EINVAL) is returned.

    [root@cl150 ~]# echo 1 2 > /proc/sys/kernel/monitor_signals
    [root@cl150 ~]# cat /proc/sys/kernel/monitor_signals
    1       2
    [root@cl150 ~]# echo 3 > /proc/sys/kernel/monitor_signals
    -bash: echo: write error: Invalid argument
    [root@cl150 ~]# echo $?
    1
    [root@cl150 ~]# cat /proc/sys/kernel/monitor_signals
    3       2
    [root@cl150 ~]#

The following is the result after applying this patch. No error is returned when the number of input parameters is less than the total parameters.

    [root@cl150 ~]# echo 1 2 > /proc/sys/kernel/monitor_signals
    [root@cl150 ~]# cat /proc/sys/kernel/monitor_signals
    1       2
    [root@cl150 ~]# echo 3 > /proc/sys/kernel/monitor_signals
    [root@cl150 ~]# echo $?
    0
    [root@cl150 ~]# cat /proc/sys/kernel/monitor_signals
    3       2
    [root@cl150 ~]#

There are three processing functions dealing with digital parameters, __do_proc_dointvec/__do_proc_douintvec/__do_proc_doulongvec_minmax.

This patch deals with __do_proc_doulongvec_minmax, just as __do_proc_dointvec does, adding a check for parameter 'left'.

In __do_proc_douintvec, its code implementation explicitly does not support multiple inputs:

    static int __do_proc_douintvec(...)
    {
        ...
        /*
         * Arrays are not supported, keep this simple. *Do not* add
         * support for them.
         */
        if (vleft != 1) {
            *lenp = 0;
            return -EINVAL;
        }
        ...
    }

So, just __do_proc_doulongvec_minmax has the problem. And most uses of proc_doulongvec_minmax/proc_doulongvec_ms_jiffies_minmax just have one parameter.

Link: http://lkml.kernel.org/r/1544081775-15720-1-git-send-email-cheng.lin130@zte.com.cn
Signed-off-by: Cheng Lin <cheng.lin130@zte.com.cn>
Acked-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
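A sketch of the corresponding check in __do_proc_doulongvec_minmax()'s write loop, mirroring how __do_proc_dointvec bails out; its exact placement relative to proc_skip_spaces() follows from the description, not from the diff itself:

    if (write) {
        bool neg;

        left -= proc_skip_spaces(&p);
        if (!left)
            break;  /* fewer values than fields: accept what we got */

        err = proc_get_long(&p, &left, &val, &neg,
                            proc_wspace_sep,
                            sizeof(proc_wspace_sep), NULL);
        if (err)
            break;
    }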
2019-02-12  kernel/hung_task.c: break RCU locks based on jiffies  (Tetsuo Handa)

[ Upstream commit 304ae42739b108305f8d7b3eb3c1aec7c2b643a9 ]

check_hung_uninterruptible_tasks() is currently calling rcu_lock_break() for every 1024 threads. But check_hung_task() is very slow if printk() was called, and is very fast otherwise. If many threads within some 1024 threads called printk(), the RCU grace period might be extended enough to trigger RCU stall warnings. Therefore, calling rcu_lock_break() after some fixed number of jiffies will be safer.

Link: http://lkml.kernel.org/r/1544800658-11423-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Acked-by: Paul E. McKenney <paulmck@linux.ibm.com>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
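A sketch of the time-based trigger, with an assumed HUNG_TASK_LOCK_BREAK interval; the fixed per-1024-threads counter becomes a jiffies comparison:

    /* drop RCU roughly every 100ms instead of every 1024 threads */
    #define HUNG_TASK_LOCK_BREAK (HZ / 10)

    static void check_hung_uninterruptible_tasks(unsigned long timeout)
    {
        unsigned long last_break = jiffies;
        struct task_struct *g, *t;

        rcu_read_lock();
        for_each_process_thread(g, t) {
            if (time_after(jiffies, last_break + HUNG_TASK_LOCK_BREAK)) {
                if (!rcu_lock_break(g, t))
                    goto unlock;
                last_break = jiffies;
            }
            if (t->state == TASK_UNINTERRUPTIBLE)
                check_hung_task(t, timeout);
        }
     unlock:
        rcu_read_unlock();
    }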
2019-02-12  kdb: Don't back trace on a cpu that didn't round up  (Douglas Anderson)

[ Upstream commit 162bc7f5afd75b72acbe3c5f3488ef7e64a3fe36 ]

If you have a CPU that fails to round up and then run 'btc' you'll end up crashing in kdb because we dereferenced NULL. Let's add a check. It's wise to also set the task to NULL when leaving the debugger so that if we fail to round up on a later entry into the debugger we won't backtrace a stale task.

Signed-off-by: Douglas Anderson <dianders@chromium.org>
Acked-by: Daniel Thompson <daniel.thompson@linaro.org>
Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
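A sketch of the NULL check in kdb_bt()'s 'btc' loop, assuming kdb's KDB_TSK() per-cpu task accessor and the existing 'btt' dispatch:

    for_each_online_cpu(cpu) {
        /* a CPU that failed to round up has no saved task */
        if (!KDB_TSK(cpu)) {
            kdb_printf("WARNING: no task for cpu %ld\n", cpu);
            continue;
        }
        sprintf(buf, "btt 0x%px\n", KDB_TSK(cpu));
        kdb_parse(buf);
        touch_nmi_watchdog();
    }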
2019-02-12  cgroup: fix parsing empty mount option string  (Ondrej Mosnacek)

[ Upstream commit e250d91d65750a0c0c62483ac4f9f357e7317617 ]

This fixes the case where all mount options specified are consumed by an LSM and all that's left is an empty string. In this case cgroupfs should accept the string and not fail.

How to reproduce (with SELinux enabled):

    # umount /sys/fs/cgroup/unified
    # mount -o context=system_u:object_r:cgroup_t:s0 -t cgroup2 cgroup2 /sys/fs/cgroup/unified
    mount: /sys/fs/cgroup/unified: wrong fs type, bad option, bad superblock on cgroup2, missing codepage or helper program, or other error.
    # dmesg | tail -n 1
    [   31.575952] cgroup: cgroup2: unknown option ""

Fixes: 67e9c74b8a87 ("cgroup: replace __DEVEL__sane_behavior with cgroup2 fs type")
[NOTE: should apply on top of commit 5136f6365ce3 ("cgroup: implement "nsdelegate" mount option"), older versions need manual rebase]
Suggested-by: Stephen Smalley <sds@tycho.nsa.gov>
Signed-off-by: Ondrej Mosnacek <omosnace@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
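A sketch of the parser after the change, assuming the v2 option loop introduced by commit 5136f6365ce3; the new part is the empty-token continue:

    static int parse_cgroup_root_flags(char *data, unsigned int *root_flags)
    {
        char *token;

        *root_flags = 0;

        if (!data)
            return 0;

        while ((token = strsep(&data, ","))) {
            /* an LSM may have consumed every option; accept "" */
            if (!*token)
                continue;
            if (!strcmp(token, "nsdelegate")) {
                *root_flags |= CGRP_ROOT_NS_DELEGATE;
                continue;
            }
            pr_err("cgroup2: unknown option \"%s\"\n", token);
            return -EINVAL;
        }

        return 0;
    }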
2019-02-12  kobject: return error code if writing /sys/.../uevent fails  (Peter Rajnoha)

[ Upstream commit df44b479654f62b478c18ee4d8bc4e9f897a9844 ]

Propagate error code back to userspace if writing the /sys/.../uevent file fails. Before, the write operation always returned with success, even if we failed to recognize the input string or if we failed to generate the uevent itself.

With the error codes properly propagated back to userspace, we are able to react in userspace accordingly by not assuming and awaiting a uevent that is not delivered.

Signed-off-by: Peter Rajnoha <prajnoha@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
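A sketch of the device-sysfs side of the idea, assuming kobject_synth_uevent() as the string parser/emitter; the kobject and bus store callbacks follow the same pattern:

    static ssize_t uevent_store(struct device *dev, struct device_attribute *attr,
                                const char *buf, size_t count)
    {
        int rc;

        rc = kobject_synth_uevent(&dev->kobj, buf, count);
        if (rc) {
            dev_err(dev, "uevent: failed to send synthetic uevent\n");
            return rc;  /* propagate instead of claiming success */
        }

        return count;
    }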
2019-02-12  timekeeping: Use proper seqcount initializer  (Bart Van Assche)

[ Upstream commit ce10a5b3954f2514af726beb78ed8d7350c5e41c ]

tk_core.seq is initialized open-coded, which misses initializing the lockdep map when lockdep is enabled. Lockdep splats involving tk_core.seq consequently lack a name and are hard to read.

Use the proper initializer which takes care of the lockdep map initialization.

[ tglx: Massaged changelog ]

Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: peterz@infradead.org
Cc: tj@kernel.org
Cc: johannes.berg@intel.com
Link: https://lkml.kernel.org/r/20181128234325.110011-12-bvanassche@acm.org
Signed-off-by: Sasha Levin <sashal@kernel.org>
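A sketch of what the proper initializer looks like on the tk_core definition; SEQCNT_ZERO() also fills in the lockdep map with the variable's name:

    static struct {
        seqcount_t        seq;
        struct timekeeper timekeeper;
    } tk_core ____cacheline_aligned = {
        .seq = SEQCNT_ZERO(tk_core.seq), /* names the lockdep map */
    };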
2019-02-06  kernel/exit.c: release ptraced tasks before zap_pid_ns_processes  (Andrei Vagin)

commit 8fb335e078378c8426fabeed1ebee1fbf915690c upstream.

Currently, exit_ptrace() adds all ptraced tasks in a dead list, then zap_pid_ns_processes() waits on all tasks in a current pidns, and only then are tasks from the dead list released.

zap_pid_ns_processes() can get stuck on waiting tasks from the dead list. In this case, we will have one unkillable process with one or more dead children.

Thanks to Oleg for the advice to release tasks in find_child_reaper().

Link: http://lkml.kernel.org/r/20190110175200.12442-1-avagin@gmail.com
Fixes: 7c8bd2322c7f ("exit: ptrace: shift "reap dead" code from exit_ptrace() to forget_original_parent()")
Signed-off-by: Andrei Vagin <avagin@gmail.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
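A sketch of the release loop implied by the description, placed in find_child_reaper() before the pid namespace is zapped; passing the dead list down to that point is an assumption based on the changelog:

    /* inside find_child_reaper(), before waiting on the namespace */
    list_for_each_entry_safe(p, n, dead, ptrace_entry) {
        list_del_init(&p->ptrace_entry);
        release_task(p);
    }

    zap_pid_ns_processes(pid_ns);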
2019-01-31  posix-cpu-timers: Unbreak timer rearming  (Thomas Gleixner)

commit 93ad0fc088c5b4631f796c995bdd27a082ef33a6 upstream.

The recent commit which prevented a division by 0 issue in the alarm timer code broke posix CPU timers as an unwanted side effect. The reason is that the common rearm code checks for timer->it_interval being 0 now. What went unnoticed is that the posix cpu timer setup does not initialize timer->it_interval as it stores the interval in CPU timer specific storage. The reason for the separate storage is historical as the posix CPU timers always had a 64bit nanoseconds representation internally while timer->it_interval is type ktime_t which used to be a modified timespec representation on 32bit machines.

Instead of reverting the offending commit and fixing the alarmtimer issue in the alarmtimer code, store the interval in timer->it_interval at CPU timer setup time so the common code check works. This also repairs the existing inconsistency of the posix CPU timer code which kept a single shot timer armed despite the interval being 0.

The separate storage can be removed in mainline, but that needs to be a separate commit as the current one has to be backported to stable kernels.

Fixes: 0e334db6bb4b ("posix-timers: Fix division by zero bug")
Reported-by: H.J. Lu <hjl.tools@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20190111133500.840117406@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
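A sketch of the setup-time assignment in posix_cpu_timer_set(); the CPU-timer-private copy stays, but the common field the rearm check looks at is now filled too:

    /* keep the CPU-timer-specific storage, and also populate the
     * common field that the rearm code checks for a zero interval */
    timer->it_interval = timespec64_to_ktime(new->it_interval);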
2019-01-23  sched/fair: Fix bandwidth timer clock drift condition  (Xunlei Pang)

commit 512ac999d2755d2b7109e996a76b6fb8b888631d upstream.

I noticed that cgroup task groups constantly get throttled even if they have low CPU usage, this causes some jitters on the response time to some of our business containers when enabling CPU quotas.

It's very simple to reproduce:

    mkdir /sys/fs/cgroup/cpu/test
    cd /sys/fs/cgroup/cpu/test
    echo 100000 > cpu.cfs_quota_us
    echo $$ > tasks

then repeat:

    cat cpu.stat | grep nr_throttled  # nr_throttled will increase steadily

After some analysis, we found that cfs_rq::runtime_remaining will be cleared by expire_cfs_rq_runtime() due to two equal but stale "cfs_{b|q}->runtime_expires" after period timer is re-armed.

The current condition to judge clock drift in expire_cfs_rq_runtime() is wrong, the two runtime_expires are actually the same when clock drift happens, so this condition can never hit. The original design was correctly done by this commit:

    a9cf55b28610 ("sched: Expire invalid runtime")

... but was changed to be the current implementation due to its locking bug.

This patch introduces another way, it adds a new field in both structures cfs_rq and cfs_bandwidth to record the expiration update sequence, and uses them to figure out if clock drift happens (true if they are equal).

Signed-off-by: Xunlei Pang <xlpang@linux.alibaba.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
[alakeshh: backport: Fixed merge conflicts:
 - sched.h: Fix the indentation and order in which the variables are declared to match with the coding style of the existing code in 4.14. Struct members of same type were declared in separate lines in the upstream patch, which has been changed back to having multiple members of same type in the same line, e.g. int a; int b; -> int a, b;]
Signed-off-by: Alakesh Haloi <alakeshh@amazon.com>
Reviewed-by: Ben Segall <bsegall@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: <stable@vger.kernel.org> # 4.14.x
Fixes: 51f2176d74ac ("sched/fair: Fix unlocked reads of some cfs_b->quota/period")
Link: http://lkml.kernel.org/r/20180620101834.24455-1-xlpang@linux.alibaba.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
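A sketch of the corrected drift test in expire_cfs_rq_runtime(), using the new sequence fields; equal sequence numbers now mean the deadline was not re-armed, i.e. the observed expiry really is rq-clock drift:

    if (cfs_rq->expires_seq == cfs_b->expires_seq) {
        /* extend local deadline, drift is small */
        cfs_rq->runtime_expires += TICK_NSEC;
    } else {
        /* global deadline is ahead, expiration has passed */
        cfs_rq->runtime_remaining = 0;
    }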
2019-01-13  sched/fair: Fix infinite loop in update_blocked_averages() by reverting a9e7f6544b9c  (Linus Torvalds)

commit c40f7d74c741a907cfaeb73a7697081881c497d0 upstream.

Zhipeng Xie, Xie XiuQi and Sargun Dhillon reported lockups in the scheduler under high loads, starting at around the v4.18 time frame, and Zhipeng Xie tracked it down to bugs in the rq->leaf_cfs_rq_list manipulation.

Do a (manual) revert of:

    a9e7f6544b9c ("sched/fair: Fix O(nr_cgroups) in load balance path")

It turns out that the list_del_leaf_cfs_rq() introduced by this commit has a surprising property that was not considered in followup commits such as:

    9c2791f936ef ("sched/fair: Fix hierarchical order in rq->leaf_cfs_rq_list")

As Vincent Guittot explains:

    "I think that there is a bigger problem with commit a9e7f6544b9c and cfs_rq throttling:

    Let take the example of the following topology TG2 --> TG1 --> root:

    1) The 1st time a task is enqueued, we will add TG2 cfs_rq then TG1 cfs_rq to leaf_cfs_rq_list and we are sure to do the whole branch in one path because it has never been used and can't be throttled so tmp_alone_branch will point to leaf_cfs_rq_list at the end.

    2) Then TG1 is throttled

    3) and we add TG3 as a new child of TG1.

    4) The 1st enqueue of a task on TG3 will add TG3 cfs_rq just before TG1 cfs_rq and tmp_alone_branch will stay on rq->leaf_cfs_rq_list.

    With commit a9e7f6544b9c, we can del a cfs_rq from rq->leaf_cfs_rq_list. So if the load of TG1 cfs_rq becomes NULL before step 2) above, TG1 cfs_rq is removed from the list. Then at step 4), TG3 cfs_rq is added at the beginning of rq->leaf_cfs_rq_list but tmp_alone_branch still points to TG3 cfs_rq because its throttled parent can't be enqueued when the lock is released. tmp_alone_branch doesn't point to rq->leaf_cfs_rq_list whereas it should.

    So if TG3 cfs_rq is removed or destroyed before tmp_alone_branch points on another TG cfs_rq, the next TG cfs_rq that will be added, will be linked outside rq->leaf_cfs_rq_list - which is bad.

    In addition, we can break the ordering of the cfs_rq in rq->leaf_cfs_rq_list but this ordering is used to update and propagate the update from leaf down to root."

Instead of trying to work through all these cases and trying to reproduce the very high loads that produced the lockup to begin with, simplify the code temporarily by reverting a9e7f6544b9c - which change was clearly not thought through completely.

This (hopefully) gives us a kernel that doesn't lock up so people can continue to enjoy their holidays without worrying about regressions. ;-)

[ mingo: Wrote changelog, fixed weird spelling in code comment while at it. ]

Analyzed-by: Xie XiuQi <xiexiuqi@huawei.com>
Analyzed-by: Vincent Guittot <vincent.guittot@linaro.org>
Reported-by: Zhipeng Xie <xiezhipeng1@huawei.com>
Reported-by: Sargun Dhillon <sargun@sargun.me>
Reported-by: Xie XiuQi <xiexiuqi@huawei.com>
Tested-by: Zhipeng Xie <xiezhipeng1@huawei.com>
Tested-by: Sargun Dhillon <sargun@sargun.me>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
Cc: <stable@vger.kernel.org> # v4.13+
Cc: Bin Li <huawei.libin@huawei.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: a9e7f6544b9c ("sched/fair: Fix O(nr_cgroups) in load balance path")
Link: http://lkml.kernel.org/r/1545879866-27809-1-git-send-email-xiexiuqi@huawei.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-01-13  mm, devm_memremap_pages: kill mapping "System RAM" support  (Dan Williams)

commit 06489cfbd915ff36c8e36df27f1c2dc60f97ca56 upstream.

Given the fact that devm_memremap_pages() requires a percpu_ref that is torn down by devm_memremap_pages_release() the current support for mapping RAM is broken.

Support for remapping "System RAM" has been broken since the beginning and there is no existing user of this code path, so just kill the support and make it an explicit error.

This cleanup also simplifies a follow-on patch to fix the error path when setting a devm release action for devm_memremap_pages_release() fails.

Link: http://lkml.kernel.org/r/154275557997.76910.14689813630968180480.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: "Jérôme Glisse" <jglisse@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
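A sketch of the explicit error, assuming the region_intersects() probe at the top of devm_memremap_pages() in kernel/memremap.c:

    int is_ram = region_intersects(align_start, align_size,
                                   IORESOURCE_SYSTEM_RAM, IORES_DESC_NONE);

    if (is_ram != REGION_DISJOINT) {
        WARN_ONCE(1, "%s attempted on %s region %pr\n", __func__,
                  is_ram == REGION_MIXED ? "mixed" : "ram", res);
        return ERR_PTR(-ENXIO);
    }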
2019-01-13  mm, devm_memremap_pages: mark devm_memremap_pages() EXPORT_SYMBOL_GPL  (Dan Williams)

commit 808153e1187fa77ac7d7dad261ff476888dcf398 upstream.

devm_memremap_pages() is a facility that can create struct page entries for any arbitrary range and give drivers the ability to subvert core aspects of page management.

Specifically the facility is tightly integrated with the kernel's memory hotplug functionality. It injects an altmap argument deep into the architecture specific vmemmap implementation to allow allocating from specific reserved pages, and it has Linux specific assumptions about page structure reference counting relative to get_user_pages() and get_user_pages_fast(). It was an oversight and a mistake that this was not marked EXPORT_SYMBOL_GPL from the outset.

Again, devm_memremap_pages() exposes and relies upon core kernel internal assumptions and will continue to evolve along with 'struct page', memory hotplug, and support for new memory types / topologies. Only an in-kernel GPL-only driver is expected to keep up with this ongoing evolution. This interface, and functionality derived from this interface, is not suitable for kernel-external drivers.

Link: http://lkml.kernel.org/r/154275557457.76910.16923571232582744134.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-01-13  fork: record start_time late  (David Herrmann)

commit 7b55851367136b1efd84d98fea81ba57a98304cf upstream.

This changes the fork(2) syscall to record the process start_time after initializing the basic task structure but still before making the new process visible to user-space.

Technically, we could record the start_time anytime during fork(2). But this might lead to scenarios where a start_time is recorded long before a process becomes visible to user-space. For instance, with userfaultfd(2) and TLS, user-space can delay the execution of fork(2) for an indefinite amount of time (and will, if this causes network access, or similar).

By recording the start_time late, it much more closely reflects the point in time where the process becomes live and can be observed by other processes.

Lastly, this makes it much harder for user-space to predict and control the start_time they get assigned. Previously, user-space could fork a process and stall it in copy_thread_tls() before its pid is allocated, but after its start_time is recorded. This can be misused to later-on cycle through PIDs and resume the stalled fork(2) yielding a process that has the same pid and start_time as a process that existed before. This can be used to circumvent security systems that identify processes by their pid+start_time combination.

Even though user-space was always aware that start_time recording is flaky (but several projects are known to still rely on start_time-based identification), changing the start_time to be recorded late will help mitigate existing attacks and make it much harder for user-space to control the start_time a process gets assigned.

Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Tom Gundersen <teg@jklm.no>
Signed-off-by: David Herrmann <dh.herrmann@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
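A sketch of where the assignment moves inside copy_process(); the field names match that era's task_struct, the exact surrounding lines are assumed:

    /* recorded late: after copy_thread_tls() and pid allocation,
     * just before the task is linked in and becomes visible */
    p->start_time = ktime_get_ns();
    p->real_start_time = ktime_get_boot_ns();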
2019-01-13  genirq/affinity: Don't return with empty affinity masks on error  (Thomas Gleixner)

commit 0211e12dd0a5385ecffd3557bc570dbad7fcf245 upstream.

When the allocation of node_to_possible_cpumask fails, then irq_create_affinity_masks() returns with a pointer to the empty affinity masks array, which will cause malfunction.

Reorder the allocations so the masks array allocation comes last and every failure path returns NULL.

Fixes: 9a0ef98e186d ("genirq/affinity: Assign vectors to all present CPUs")
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Mihai Carabas <mihai.carabas@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-01-09  cgroup: fix CSS_TASK_ITER_PROCS  (Tejun Heo)

commit e9d81a1bc2c48ea9782e3e8b53875f419766ef47 upstream.

CSS_TASK_ITER_PROCS implements process-only iteration by making css_task_iter_advance() skip tasks which aren't threadgroup leaders; however, when an iteration is started css_task_iter_start() calls the inner helper function css_task_iter_advance_css_set() instead of css_task_iter_advance(). As the helper doesn't have the skip logic, when the first task to visit is a non-leader thread, it doesn't get skipped correctly as shown in the following example.

    # ps -L 2030
      PID   LWP TTY      STAT   TIME COMMAND
     2030  2030 pts/0    Sl+    0:00 ./test-thread
     2030  2031 pts/0    Sl+    0:00 ./test-thread
    # mkdir -p /sys/fs/cgroup/x/a/b
    # echo threaded > /sys/fs/cgroup/x/a/cgroup.type
    # echo threaded > /sys/fs/cgroup/x/a/b/cgroup.type
    # echo 2030 > /sys/fs/cgroup/x/a/cgroup.procs
    # cat /sys/fs/cgroup/x/a/cgroup.threads
    2030
    2031
    # cat /sys/fs/cgroup/x/cgroup.procs
    2030
    # echo 2030 > /sys/fs/cgroup/x/a/b/cgroup.threads
    # cat /sys/fs/cgroup/x/cgroup.procs
    2031
    2030

The last read of cgroup.procs is incorrectly showing non-leader 2031 in cgroup.procs output.

This can be fixed by updating css_task_iter_advance() to handle the first advance and css_task_iter_start() to call css_task_iter_advance() instead of the inner helper. After the fix, the same commands result in the following (correct) result:

    # ps -L 2062
      PID   LWP TTY      STAT   TIME COMMAND
     2062  2062 pts/0    Sl+    0:00 ./test-thread
     2062  2063 pts/0    Sl+    0:00 ./test-thread
    # mkdir -p /sys/fs/cgroup/x/a/b
    # echo threaded > /sys/fs/cgroup/x/a/cgroup.type
    # echo threaded > /sys/fs/cgroup/x/a/b/cgroup.type
    # echo 2062 > /sys/fs/cgroup/x/a/cgroup.procs
    # cat /sys/fs/cgroup/x/a/cgroup.threads
    2062
    2063
    # cat /sys/fs/cgroup/x/cgroup.procs
    2062
    # echo 2062 > /sys/fs/cgroup/x/a/b/cgroup.threads
    # cat /sys/fs/cgroup/x/cgroup.procs
    2062

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: "Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com>
Fixes: 8cfd8147df67 ("cgroup: implement cgroup v2 thread support")
Cc: stable@vger.kernel.org # v4.14+
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-12-29  panic: avoid deadlocks in re-entrant console drivers  (Sergey Senozhatsky)

commit c7c3f05e341a9a2bd1a92993d4f996cfd6e7348e upstream.

From printk()/serial console point of view panic() is special, because it may force CPU to re-enter printk() or/and serial console driver. Therefore, some of serial consoles drivers are re-entrant. E.g. 8250:

    serial8250_console_write()
    {
        if (port->sysrq)
            locked = 0;
        else if (oops_in_progress)
            locked = spin_trylock_irqsave(&port->lock, flags);
        else
            spin_lock_irqsave(&port->lock, flags);
        ...
    }

panic() does set oops_in_progress via bust_spinlocks(1), so in theory we should be able to re-enter serial console driver from panic():

    CPU0
    <NMI>
    uart_console_write()
    serial8250_console_write()      // if (oops_in_progress)
                                    //    spin_trylock_irqsave()
    call_console_drivers()
    console_unlock()
    console_flush_on_panic()
    bust_spinlocks(1)               // oops_in_progress++
    panic()
    <NMI/>
    spin_lock_irqsave(&port->lock, flags)   // spin_lock_irqsave()
    serial8250_console_write()
    call_console_drivers()
    console_unlock()
    printk()
    ...

However, this does not happen and we deadlock in serial console on port->lock spinlock. And the problem is that console_flush_on_panic() is called after bust_spinlocks(0):

    void panic(const char *fmt, ...)
    {
        bust_spinlocks(1);
        ...
        bust_spinlocks(0);
        console_flush_on_panic();
        ...
    }

bust_spinlocks(0) decrements oops_in_progress, so oops_in_progress can go back to zero. Thus even re-entrant console drivers will simply spin on port->lock spinlock. Given that port->lock may already be locked either by a stopped CPU, or by the very same CPU we execute panic() on (for instance, NMI panic() on printing CPU) the system deadlocks and does not reboot.

Fix this by removing bust_spinlocks(0), so oops_in_progress is always set in panic() now and, thus, re-entrant console drivers will trylock the port->lock instead of spinning on it forever, when we call them from console_flush_on_panic().

Link: http://lkml.kernel.org/r/20181025101036.6823-1-sergey.senozhatsky@gmail.com
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Daniel Wang <wonderfly@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Alan Cox <gnomes@lxorguk.ukuu.org.uk>
Cc: Jiri Slaby <jslaby@suse.com>
Cc: Peter Feiner <pfeiner@google.com>
Cc: linux-serial@vger.kernel.org
Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
Cc: stable@vger.kernel.org
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
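A sketch of the tail of panic() after the fix; the only delta against the quoted snippet is that the oops_in_progress decrement is gone, so it stays elevated when the consoles are flushed:

    /* end of panic(), after the fix */
    debug_locks_off();
    /* bust_spinlocks(0) removed: oops_in_progress stays non-zero,
     * so re-entrant console drivers use spin_trylock_irqsave() */
    console_flush_on_panic();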
2018-12-29  posix-timers: Fix division by zero bug  (Thomas Gleixner)

commit 0e334db6bb4b1fd1e2d72c1f3d8f004313cd9f94 upstream.

The signal delivery path of posix-timers can try to rearm the timer even if the interval is zero. That's handled for the common case (hrtimer) but not for alarm timers. In that case the forwarding function raises a division by zero exception.

The handling for hrtimer based posix timers is wrong because it marks the timer as active despite the fact that it is stopped.

Move the check from common_hrtimer_rearm() to posixtimer_rearm() to cure both issues.

Reported-by: syzbot+9d38bedac9cc77b8ad5e@syzkaller.appspotmail.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: sboyd@kernel.org
Cc: stable@vger.kernel.org
Cc: syzkaller-bugs@googlegroups.com
Link: http://lkml.kernel.org/r/alpine.DEB.2.21.1812171328050.1880@nanos.tec.linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
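A condensed sketch of the moved check in posixtimer_rearm(); the rearm and the bookkeeping now only run for a non-zero interval, for every clock type:

    timr = lock_timer(info->si_tid, &flags);
    if (!timr)
        return;

    /* the zero-interval check moved here from common_hrtimer_rearm() */
    if (timr->it_requeue_pending == info->si_sys_private &&
        timr->it_interval) {
        timr->kclock->timer_rearm(timr);
        timr->it_active = 1;
        ++timr->it_requeue_pending;
    }

    unlock_timer(timr, flags);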
2018-12-21  locking/qspinlock, x86: Provide liveness guarantee  (Peter Zijlstra)

commit 7aa54be2976550f17c11a1c3e3630002dea39303 upstream.

On x86 we cannot do fetch_or() with a single instruction and thus end up using a cmpxchg loop, this reduces determinism. Replace the fetch_or() with a composite operation: tas-pending + load.

Using two instructions of course opens a window we previously did not have. Consider the scenario:

    CPU0        CPU1        CPU2

    1) lock
         trylock -> (0,0,1)

    2)          lock
                  trylock /* fail */

    3) unlock -> (0,0,0)

    4)                      lock
                              trylock -> (0,0,1)

    5)            tas-pending -> (0,1,1)
                  load-val <- (0,1,0) from 3

    6)            clear-pending-set-locked -> (0,0,1)

                  FAIL: _2_ owners

where 5) is our new composite operation. When we consider each part of the qspinlock state as a separate variable (as we can when _Q_PENDING_BITS == 8) then the above is entirely possible, because tas-pending will only RmW the pending byte, so the later load is able to observe prior tail and lock state (but not earlier than its own trylock, which operates on the whole word, due to coherence).

To avoid this we need 2 things:

  - the load must come after the tas-pending (obviously, otherwise it can trivially observe prior state).

  - the tas-pending must be a full word RmW instruction, it cannot be an XCHGB for example, such that we cannot observe other state prior to setting pending.

On x86 we can realize this by using "LOCK BTS m32, r32" for tas-pending followed by a regular load.

Note that observing later state is not a problem:

  - if we fail to observe a later unlock, we'll simply spin-wait for that store to become visible.

  - if we observe a later xchg_tail(), there is no difference from that xchg_tail() having taken place before the tas-pending.

Suggested-by: Will Deacon <will.deacon@arm.com>
Reported-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Will Deacon <will.deacon@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: andrea.parri@amarulasolutions.com
Cc: longman@redhat.com
Fixes: 59fb586b4a07 ("locking/qspinlock: Remove unbounded cmpxchg() loop from locking slowpath")
Link: https://lkml.kernel.org/r/20181003130957.183726335@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
[bigeasy: GEN_BINARY_RMWcc macro redo]
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Sasha Levin <sashal@kernel.org>
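A sketch of the x86 override this describes; the GEN_BINARY_RMWcc spelling varies between stable trees (the backport note above says the macro was redone), so treat the asm plumbing as illustrative:

    static __always_inline u32 queued_fetch_set_pending_acquire(struct qspinlock *lock)
    {
        u32 val;

        /*
         * "LOCK BTS" is a full-word RmW on the lockword, so the later
         * plain load cannot observe state older than our own trylock.
         */
        val = GEN_BINARY_RMWcc(LOCK_PREFIX "btsl", lock->val.counter, c,
                               "I", _Q_PENDING_OFFSET) * _Q_PENDING_VAL;
        val |= atomic_read(&lock->val) & ~_Q_PENDING_MASK;

        return val;
    }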
2018-12-21  locking/qspinlock: Re-order code  (Peter Zijlstra)

commit 53bf57fab7321fb42b703056a4c80fc9d986d170 upstream.

Flip the branch condition after atomic_fetch_or_acquire(_Q_PENDING_VAL) such that we lose the indent. This also results in a more natural code flow IMO.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Will Deacon <will.deacon@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: andrea.parri@amarulasolutions.com
Cc: longman@redhat.com
Link: https://lkml.kernel.org/r/20181003130257.156322446@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2018-12-21  locking/qspinlock: Kill cmpxchg() loop when claiming lock from head of queue  (Will Deacon)

commit c61da58d8a9ba9238250a548f00826eaf44af0f7 upstream.

When a queued locker reaches the head of the queue, it claims the lock by setting _Q_LOCKED_VAL in the lockword. If there isn't contention, it must also clear the tail as part of this operation so that subsequent lockers can avoid taking the slowpath altogether.

Currently this is expressed as a cmpxchg() loop that practically only runs up to two iterations. This is confusing to the reader and unhelpful to the compiler. Rewrite the cmpxchg() loop without the loop, so that a failed cmpxchg() implies that there is contention and we just need to write to _Q_LOCKED_VAL without considering the rest of the lockword.

Signed-off-by: Will Deacon <will.deacon@arm.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Waiman Long <longman@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: boqun.feng@gmail.com
Cc: linux-arm-kernel@lists.infradead.org
Cc: paulmck@linux.vnet.ibm.com
Link: http://lkml.kernel.org/r/1524738868-31318-7-git-send-email-will.deacon@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Sasha Levin <sashal@kernel.org>
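A sketch of the claim at the head of the queue after this change; a failed try_cmpxchg means a new waiter appeared behind us, so only the locked byte needs writing:

    /*
     * If the tail still points at us there is no later waiter:
     * try to swing (n,0,0) -> (0,0,1) in one shot.
     */
    if ((val & _Q_TAIL_MASK) == tail) {
        if (atomic_try_cmpxchg_relaxed(&lock->val, &val, _Q_LOCKED_VAL))
            goto release;   /* no contention */
    }

    /* either somebody queued behind us or pending is set */
    set_locked(lock);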
2018-12-21  locking/qspinlock: Remove duplicate clear_pending() function from PV code  (Will Deacon)

commit 3bea9adc96842b8a7345c7fb202c16ae9c8d5b25 upstream.

The native clear_pending() function is identical to the PV version, so the latter can simply be removed. This fixes the build for systems with >= 16K CPUs using the PV lock implementation.

Reported-by: Waiman Long <longman@redhat.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: boqun.feng@gmail.com
Cc: linux-arm-kernel@lists.infradead.org
Cc: paulmck@linux.vnet.ibm.com
Link: http://lkml.kernel.org/r/20180427101619.GB21705@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2018-12-21  locking/qspinlock: Remove unbounded cmpxchg() loop from locking slowpath  (Will Deacon)

commit 59fb586b4a07b4e1a0ee577140ab4842ba451acd upstream.

The qspinlock locking slowpath utilises a "pending" bit as a simple form of an embedded test-and-set lock that can avoid the overhead of explicit queuing in cases where the lock is held but uncontended. This bit is managed using a cmpxchg() loop which tries to transition the uncontended lock word from (0,0,0) -> (0,0,1) or (0,0,1) -> (0,1,1).

Unfortunately, the cmpxchg() loop is unbounded and lockers can be starved indefinitely if the lock word is seen to oscillate between unlocked (0,0,0) and locked (0,0,1). This could happen if concurrent lockers are able to take the lock in the cmpxchg() loop without queuing and pass it around amongst themselves.

This patch fixes the problem by unconditionally setting _Q_PENDING_VAL using atomic_fetch_or, and then inspecting the old value to see whether we need to spin on the current lock owner, or whether we now effectively hold the lock. The tricky scenario is when concurrent lockers end up queuing on the lock and the lock becomes available, causing us to see a lockword of (n,0,0). With pending now set, simply queuing could lead to deadlock as the head of the queue may not have observed the pending flag being cleared. Conversely, if the head of the queue did observe pending being cleared, then it could transition the lock from (n,0,0) -> (0,0,1) meaning that any attempt to "undo" our setting of the pending bit could race with a concurrent locker trying to set it.

We handle this race by preserving the pending bit when taking the lock after reaching the head of the queue and leaving the tail entry intact if we saw pending set, because we know that the tail is going to be updated shortly.

Signed-off-by: Will Deacon <will.deacon@arm.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Waiman Long <longman@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: boqun.feng@gmail.com
Cc: linux-arm-kernel@lists.infradead.org
Cc: paulmck@linux.vnet.ibm.com
Link: http://lkml.kernel.org/r/1524738868-31318-6-git-send-email-will.deacon@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Sasha Levin <sashal@kernel.org>
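A sketch of the reworked pending acquisition in queued_spin_lock_slowpath(); labels and stat hooks from the surrounding function are elided:

    /* set pending unconditionally, then look at what was there before */
    val = atomic_fetch_or_acquire(_Q_PENDING_VAL, &lock->val);

    /* any contention (queue tail or pending already set): undo and queue */
    if (unlikely(val & ~_Q_LOCKED_MASK)) {
        if (!(val & _Q_PENDING_MASK))
            clear_pending(lock);    /* we set it over (n,0,0) */
        goto queue;
    }

    /* only the owner, if any, is ahead of us: wait it out... */
    if (val & _Q_LOCKED_MASK)
        smp_cond_load_acquire(&lock->val.counter,
                              !(VAL & _Q_LOCKED_MASK));

    /* ...then take 0,1,0 -> 0,0,1 */
    clear_pending_set_locked(lock);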
2018-12-21  locking/qspinlock: Merge 'struct __qspinlock' into 'struct qspinlock'  (Will Deacon)

commit 625e88be1f41b53cec55827c984e4a89ea8ee9f9 upstream.

'struct __qspinlock' provides a handy union of fields so that subcomponents of the lockword can be accessed by name, without having to manage shifts and masks explicitly and take endianness into account.

This is useful in qspinlock.h and also potentially in arch headers, so move the 'struct __qspinlock' into 'struct qspinlock' and kill the extra definition.

Signed-off-by: Will Deacon <will.deacon@arm.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Waiman Long <longman@redhat.com>
Acked-by: Boqun Feng <boqun.feng@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-arm-kernel@lists.infradead.org
Cc: paulmck@linux.vnet.ibm.com
Link: http://lkml.kernel.org/r/1524738868-31318-3-git-send-email-will.deacon@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Sasha Levin <sashal@kernel.org>
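The merged layout looks roughly as follows (little-endian variant shown; the big-endian one reverses the byte and halfword order):

    typedef struct qspinlock {
        union {
            atomic_t val;

            /* by-name access to subcomponents of the lockword */
            struct {
                u8 locked;
                u8 pending;
            };
            struct {
                u16 locked_pending;
                u16 tail;
            };
        };
    } arch_spinlock_t;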
2018-12-21  locking/qspinlock: Bound spinning on pending->locked transition in slowpath  (Will Deacon)

commit 6512276d97b160d90b53285bd06f7f201459a7e3 upstream.

If a locker taking the qspinlock slowpath reads a lock value indicating that only the pending bit is set, then it will spin whilst the concurrent pending->locked transition takes effect.

Unfortunately, there is no guarantee that such a transition will ever be observed since concurrent lockers could continuously set pending and hand over the lock amongst themselves, leading to starvation. Whilst this would probably resolve in practice, it means that it is not possible to prove liveness properties about the lock and means that lock acquisition time is unbounded.

Rather than removing the pending->locked spinning from the slowpath altogether (which has been shown to heavily penalise a 2-threaded locking stress test on x86), this patch replaces the explicit spinning with a call to atomic_cond_read_relaxed and allows the architecture to provide a bound on the number of spins. For architectures that can respond to changes in cacheline state in their smp_cond_load implementation, it should be sufficient to use the default bound of 1.

Suggested-by: Waiman Long <longman@redhat.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Waiman Long <longman@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: boqun.feng@gmail.com
Cc: linux-arm-kernel@lists.infradead.org
Cc: paulmck@linux.vnet.ibm.com
Link: http://lkml.kernel.org/r/1524738868-31318-4-git-send-email-will.deacon@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Sasha Levin <sashal@kernel.org>
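A sketch of the bounded wait that replaces the open-coded spin; _Q_PENDING_LOOPS is the per-architecture bound, defaulting to 1:

    /*
     * Wait for an in-progress pending->locked hand-over,
     * but only for a bounded number of spins.
     */
    if (val == _Q_PENDING_VAL) {
        int cnt = _Q_PENDING_LOOPS;
        val = atomic_cond_read_relaxed(&lock->val,
                                       (VAL != _Q_PENDING_VAL) || !cnt--);
    }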
2018-12-21  locking/qspinlock: Ensure node is initialised before updating prev->next  (Will Deacon)

commit 95bcade33a8af38755c9b0636e36a36ad3789fe6 upstream.

When a locker ends up queuing on the qspinlock locking slowpath, we initialise the relevant mcs node and publish it indirectly by updating the tail portion of the lock word using xchg_tail. If we find that there was a pre-existing locker in the queue, we subsequently update their ->next field to point at our node so that we are notified when it's our turn to take the lock. This can be roughly illustrated as follows:

    /* Initialise the fields in node and encode a pointer to node in tail */
    tail = initialise_node(node);

    /*
     * Exchange tail into the lockword using an atomic read-modify-write
     * operation with release semantics
     */
    old = xchg_tail(lock, tail);

    /* If there was a pre-existing waiter ... */
    if (old & _Q_TAIL_MASK) {
        prev = decode_tail(old);
        smp_read_barrier_depends();

        /* ... then update their ->next field to point to node. */
        WRITE_ONCE(prev->next, node);
    }

The conditional update of prev->next therefore relies on the address dependency from the result of xchg_tail ensuring order against the prior initialisation of node. However, since the release semantics of the xchg_tail operation apply only to the write portion of the RmW, then this ordering is not guaranteed and it is possible for the CPU to return old before the writes to node have been published, consequently allowing us to point prev->next to an uninitialised node.

This patch fixes the problem by making the update of prev->next a RELEASE operation, which also removes the reliance on dependency ordering.

Signed-off-by: Will Deacon <will.deacon@arm.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1518528177-19169-2-git-send-email-will.deacon@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Sasha Levin <sashal@kernel.org>
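The fix, sketched against the snippet above: publish the node with a RELEASE store so that ordering no longer hinges on the RmW's release semantics or on dependency ordering:

    if (old & _Q_TAIL_MASK) {
        prev = decode_tail(old);

        /*
         * RELEASE orders the initialisation of node before the
         * store that makes it visible to the previous waiter.
         */
        smp_store_release(&prev->next, node);
    }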
2018-12-21  locking: Remove smp_read_barrier_depends() from queued_spin_lock_slowpath()  (Paul E. McKenney)

commit 548095dea63ffc016d39c35b32c628d033638aca upstream.

Queued spinlocks are not used by DEC Alpha, and furthermore operations such as READ_ONCE() and release/relaxed RMW atomics are being changed to imply smp_read_barrier_depends(). This commit therefore removes the now-redundant smp_read_barrier_depends() from queued_spin_lock_slowpath(), and adjusts the comments accordingly.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2018-12-21  tracing: Fix memory leak of instance function hash filters  (Steven Rostedt (VMware))

commit 2840f84f74035e5a535959d5f17269c69fa6edc5 upstream.

The following commands will cause a memory leak:

    # cd /sys/kernel/tracing
    # mkdir instances/foo
    # echo schedule > instances/foo/set_ftrace_filter
    # rmdir instances/foo

The reason is that the hashes that hold the filters to set_ftrace_filter and set_ftrace_notrace are not freed if they contain any data on the instance and the instance is removed.

Found by kmemleak detector.

Cc: stable@vger.kernel.org
Fixes: 591dffdade9f ("ftrace: Allow for function tracing instance to filter functions")
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
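A sketch of the kind of teardown the fix adds on instance removal; ftrace_free_filter() is the existing helper that drops both hashes, and calling it from the instance teardown path is an assumption based on the changelog:

    void ftrace_free_filter(struct ftrace_ops *ops)
    {
        ftrace_ops_init(ops);
        free_ftrace_hash(ops->func_hash->filter_hash);
        free_ftrace_hash(ops->func_hash->notrace_hash);
    }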
2018-12-21  tracing: Fix memory leak in set_trigger_filter()  (Steven Rostedt (VMware))

commit 3cec638b3d793b7cacdec5b8072364b41caeb0e1 upstream.

When create_event_filter() fails in set_trigger_filter(), the filter may still be allocated and needs to be freed. The caller expects the data->filter to be updated with the new filter, even if the new filter failed (we could add an error message by setting set_str parameter of create_event_filter(), but that's another update).

But because the error would just exit, filter was left hanging and nothing could free it.

Found by kmemleak detector.

Cc: stable@vger.kernel.org
Fixes: bac5fb97a173a ("tracing: Add and use generic set_trigger_filter() implementation")
Reviewed-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-12-21  timer/debug: Change /proc/timer_list from 0444 to 0400  (Ingo Molnar)

[ Upstream commit 8e7df2b5b7f245c9bd11064712db5cb69044a362 ]

While it uses %pK, there are still few reasons to read this file as non-root.

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
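The whole change amounts to one mode-bit flip in the procfs registration, sketched here against the usual init helper in kernel/time/timer_list.c:

    static int __init init_timer_list_procfs(void)
    {
        struct proc_dir_entry *pe;

        /* 0400: root-only read; %pK hashing alone is not enough */
        pe = proc_create("timer_list", 0400, NULL, &timer_list_fops);
        if (!pe)
            return -ENOMEM;
        return 0;
    }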
2018-12-17  printk: Wake klogd when passing console_lock owner  (Petr Mladek)

[ Upstream commit c14376de3a1befa70d9811ca2872d47367b48767 ]

wake_klogd is a local variable in console_unlock(). The information is lost when console_lock is passed to another owner via the busy wait added by the commit dbdda842fe96f8932 ("printk: Add console owner and waiter logic to load balance console writes").

The following race is possible:

    CPU0                            CPU1

    console_unlock()

      for (;;)
         /* calling console for last message */

                                    printk()
                                      log_store()
                                        log_next_seq++;

         /* see new message */
         if (seen_seq != log_next_seq) {
            wake_klogd = true;
            seen_seq = log_next_seq;
         }

         console_lock_spinning_enable();

                                      if (console_trylock_spinning())
                                         /* spinning */

         if (console_lock_spinning_disable_and_check()) {
            printk_safe_exit_irqrestore(flags);
            return;

                                      console_unlock()
                                        if (seen_seq != log_next_seq) {
                                        /* already seen */
                                        /* nothing to do */

Result: Nobody would wake up klogd.

One solution would be to make a global variable from wake_klogd. But then we would need to manipulate it under a lock or so.

This patch wakes klogd also when console_lock is passed to the spinning waiter. It looks like the right way to go. Also userspace should have a chance to see and store any "flood" of messages.

Note that the very late klogd wake up was a historic solution. It made sense on single CPU systems or when sys_syslog() operations were synchronized using the big kernel lock like in v2.1.113. But it is questionable these days.

Fixes: dbdda842fe96f8932 ("printk: Add console owner and waiter logic to load balance console writes")
Link: http://lkml.kernel.org/r/20180226155734.dzwg3aovqnwtvkoy@pathway.suse.cz
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: linux-kernel@vger.kernel.org
Cc: Tejun Heo <tj@kernel.org>
Suggested-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2018-12-17  printk: Never set console_may_schedule in console_trylock()  (Sergey Senozhatsky)

[ Upstream commit fd5f7cde1b85d4c8e09ca46ce948e008a2377f64 ]

This patch, basically, reverts commit 6b97a20d3a79 ("printk: set may_schedule for some of console_trylock() callers"). That commit was a mistake, it introduced a big dependency on the scheduler, by enabling preemption under console_sem in printk()->console_unlock() path, which is rather too critical. The patch did not significantly reduce the possibilities of printk() lockups, but made it possible to stall printk(), as has been reported by Tetsuo Handa [1].

Another issue is that preemption under console_sem also messes up with Steven Rostedt's hand off scheme, by making it possible to sleep with console_sem both in console_unlock() and in vprintk_emit(), after acquiring the console_sem ownership (anywhere between printk_safe_exit_irqrestore() in console_trylock_spinning() and printk_safe_enter_irqsave() in console_unlock()). This makes hand off less likely and, at the same time, may result in a significant amount of pending logbuf messages. Preempted console_sem owner makes it impossible for other CPUs to emit logbuf messages, but does not make it impossible for other CPUs to append new messages to the logbuf.

Reinstate the old behavior and make printk() non-preemptible. Should any printk() lockup reports arrive they must be handled in a different way.

[1] http://lkml.kernel.org/r/201603022101.CAH73907.OVOOMFHFFtQJSL%20()%20I-love%20!%20SAKURA%20!%20ne%20!%20jp

Fixes: 6b97a20d3a79 ("printk: set may_schedule for some of console_trylock() callers")
Link: http://lkml.kernel.org/r/20180116044716.GE6607@jagdpanzerIV
To: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: akpm@linux-foundation.org
Cc: linux-mm@kvack.org
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Byungchul Park <byungchul.park@lge.com>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
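A sketch of console_trylock() after the revert; the trylock path is now always treated as atomic:

    int console_trylock(void)
    {
        if (down_trylock_console_sem())
            return 0;
        if (console_suspended) {
            up_console_sem();
            return 0;
        }
        console_locked = 1;
        /*
         * Was: console_may_schedule = !oops_in_progress &&
         *                             preemptible() && !rcu_preempt_depth();
         * printk() must stay non-preemptible under console_sem.
         */
        console_may_schedule = 0;
        return 1;
    }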
2018-12-17  printk: Hide console waiter logic into helpers  (Petr Mladek)

[ Upstream commit c162d5b4338d72deed61aa65ed0f2f4ba2bbc8ab ]

The commit ("printk: Add console owner and waiter logic to load balance console writes") made vprintk_emit() and console_unlock() even more complicated.

This patch extracts the new code into 3 helper functions. They should help to keep it rather self-contained. It will be easier to use and maintain.

This patch just shuffles the existing code. It does not change the functionality.

Link: http://lkml.kernel.org/r/20180112160837.GD24497@linux.suse
Cc: akpm@linux-foundation.org
Cc: linux-mm@kvack.org
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: rostedt@home.goodmis.org
Cc: Byungchul Park <byungchul.park@lge.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: linux-kernel@vger.kernel.org
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Acked-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2018-12-17printk: Add console owner and waiter logic to load balance console writesSteven Rostedt (VMware)
[ Upstream commit dbdda842fe96f8932bae554f0adf463c27c42bc7 ] This patch implements what I discussed in Kernel Summit. I added lockdep annotation (hopefully correctly), and it hasn't had any splats (since I fixed some bugs in the first iterations). It did catch problems when I had the owner covering too much. But now that the owner is only set when actively calling the consoles, lockdep has stayed quiet.

Here's the design again:

I added a "console_owner" which is set to a task that is actively writing to the consoles. It is *not* the same as the owner of the console_lock. It is only set when doing the calls to the console functions. It is protected by a console_owner_lock which is a raw spin lock.

There is a console_waiter. This is set when there is an active console owner that is not current, and waiter is not set. This too is protected by console_owner_lock.

In printk() when it tries to write to the consoles, we have:

	if (console_trylock())
		console_unlock();

Now I added an else, which will check if there is an active owner, and no current waiter. If that is the case, then console_waiter is set, and the task goes into a spin until it is no longer set.

When the active console owner finishes writing the current message to the consoles, it grabs the console_owner_lock and sees if there is a waiter, and clears console_owner. If there is a waiter, then it breaks out of the loop, clears the waiter flag (because that will release the waiter from its spin), and exits. Note, it does *not* release the console semaphore. Because it is a semaphore, there is no owner. Another task may release it. This means that the waiter is guaranteed to be the new console owner! Which it becomes.

Then the waiter calls console_unlock() and continues to write to the consoles. If another task comes along and does a printk() it too can become the new waiter, and we wash, rinse, and repeat!

By Petr Mladek about possible new deadlocks:

The thing is that we move console_sem only to a printk() call that normally calls console_unlock() as well. It means that the transferred owner should not bring new types of dependencies. As Steven said somewhere: "If there is a deadlock, it was there even before."

We could look at it from this side. The possible deadlock would look like:

	CPU0				CPU1

	console_unlock()

	  console_owner = current;

					spin_lockA()
					printk()
					  spin = true;

	  while (...)
	    call_console_drivers()
	      spin_lockA()

This would be a deadlock. CPU0 would wait for lock A, while CPU1 would own lock A and wait for CPU0 to finish calling the console drivers and pass on console_sem ownership.

But if the above is true, then the following scenario was already possible before:

	CPU0

	spin_lockA()
	  printk()
	    console_unlock()
	      call_console_drivers()
		spin_lockA()

In other words, this deadlock was there even before. Such deadlocks are prevented by using printk_deferred() in the sections guarded by the lock A.

By Steven Rostedt:

To demonstrate the issue, this module has been shown to lock up a system with 4 CPUs and a slow console (like a serial console). It is also able to lock up an 8 CPU system with only a fast (VGA) console, by passing in "loops=100". The changes in this commit prevent this module from locking up the system.
	#include <linux/module.h>
	#include <linux/delay.h>
	#include <linux/sched.h>
	#include <linux/mutex.h>
	#include <linux/workqueue.h>
	#include <linux/hrtimer.h>

	static bool stop_testing;
	static unsigned int loops = 1;

	static void preempt_printk_workfn(struct work_struct *work)
	{
		int i;

		while (!READ_ONCE(stop_testing)) {
			for (i = 0; i < loops && !READ_ONCE(stop_testing); i++) {
				preempt_disable();
				pr_emerg("%5d%-75s\n", smp_processor_id(),
					 " XXX NOPREEMPT");
				preempt_enable();
			}
			msleep(1);
		}
	}

	static struct work_struct __percpu *works;

	static void finish(void)
	{
		int cpu;

		WRITE_ONCE(stop_testing, true);
		for_each_online_cpu(cpu)
			flush_work(per_cpu_ptr(works, cpu));
		free_percpu(works);
	}

	static int __init test_init(void)
	{
		int cpu;

		works = alloc_percpu(struct work_struct);
		if (!works)
			return -ENOMEM;

		/*
		 * This is just a test module. This will break if you
		 * do any CPU hot plugging between loading and
		 * unloading the module.
		 */
		for_each_online_cpu(cpu) {
			struct work_struct *work = per_cpu_ptr(works, cpu);

			INIT_WORK(work, &preempt_printk_workfn);
			schedule_work_on(cpu, work);
		}

		return 0;
	}

	static void __exit test_exit(void)
	{
		finish();
	}

	module_param(loops, uint, 0);
	module_init(test_init);
	module_exit(test_exit);
	MODULE_LICENSE("GPL");

Link: http://lkml.kernel.org/r/20180110132418.7080-2-pmladek@suse.com
Cc: akpm@linux-foundation.org
Cc: linux-mm@kvack.org
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Byungchul Park <byungchul.park@lge.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
[pmladek@suse.com: Commit message about possible deadlocks]
Acked-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
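Stripped of the lockdep and printk_safe details, the hand-off described above reduces to two small critical sections; a simplified sketch, assuming the variable names from the description (console_owner, console_waiter, console_owner_lock):

	/* waiter side: a printk() that failed console_trylock() */
	raw_spin_lock(&console_owner_lock);
	owner = READ_ONCE(console_owner);
	if (owner && owner != current && !READ_ONCE(console_waiter)) {
		WRITE_ONCE(console_waiter, true);
		spin = true;
	}
	raw_spin_unlock(&console_owner_lock);

	if (spin) {
		while (READ_ONCE(console_waiter))
			cpu_relax();	/* owner clears the flag on hand-off */
		console_unlock();	/* we are now the console_sem owner */
	}

	/* owner side: console_unlock(), after writing each record */
	raw_spin_lock(&console_owner_lock);
	waiter = READ_ONCE(console_waiter);
	console_owner = NULL;
	raw_spin_unlock(&console_owner_lock);

	if (waiter) {
		WRITE_ONCE(console_waiter, false);	/* release the spinner */
		return;		/* without up(&console_sem): ownership moved */
	}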
2018-12-17Revert "printk: Never set console_may_schedule in console_trylock()"Sasha Levin
This reverts commit c9b8d580b3fb0ab65d37c372aef19a318fda3199. This is just a technical revert to make the printk fix apply cleanly, this patch will be re-picked in about 3 commits.
2018-12-17bpf: fix check of allowed specifiers in bpf_trace_printkMartynas Pumputis
[ Upstream commit 1efb6ee3edea57f57f9fb05dba8dcb3f7333f61f ] A format string consisting of "%p" or "%s" followed by an invalid specifier (e.g. "%p%\n" or "%s%") could pass the check, which would make format_decode() (lib/vsprintf.c) warn.

Fixes: 9c959c863f82 ("tracing: Allow BPF programs to call bpf_trace_printk()")
Reported-by: syzbot+1ec5c5ec949c4adaa0c4@syzkaller.appspotmail.com
Signed-off-by: Martynas Pumputis <m@lambda.lt>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
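The point of the fix is that the character after "%p"/"%s" may only terminate the specifier and must not be consumed, so a stray '%' there is re-examined as the start of a new conversion and rejected. An illustrative, deliberately simplified validator (plain C; the kernel's actual whitelist in bpf_trace_printk() is larger and differs in detail):

	#include <ctype.h>
	#include <stdbool.h>

	static bool fmt_ok(const char *fmt)
	{
		int i;

		for (i = 0; fmt[i]; i++) {
			if (fmt[i] != '%')
				continue;
			i++;			/* the specifier character */
			if (fmt[i] == 'p' || fmt[i] == 's') {
				/* only a terminating char may follow %p/%s... */
				if (fmt[i + 1] != '\0' && !isspace(fmt[i + 1]) &&
				    !ispunct(fmt[i + 1]))
					return false;
				/* ...and it is NOT skipped: '%' is re-checked */
				continue;
			}
			if (fmt[i] != 'd' && fmt[i] != 'i' &&
			    fmt[i] != 'u' && fmt[i] != 'x' && fmt[i] != 'c')
				return false;	/* rejects "%p%\n" and "%s%" */
		}
		return true;
	}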
2018-12-08uprobes: Fix handle_swbp() vs. unregister() + register() race once moreAndrea Parri
commit 09d3f015d1e1b4fee7e9bbdcf54201d239393391 upstream. Commit: 142b18ddc8143 ("uprobes: Fix handle_swbp() vs unregister() + register() race") added the UPROBE_COPY_INSN flag, and corresponding smp_wmb() and smp_rmb() memory barriers, to ensure that handle_swbp() uses fully-initialized uprobes only.

However, the smp_rmb() is mis-placed: this barrier should be placed after handle_swbp() has tested for the flag, thus guaranteeing that (program-order) subsequent loads from the uprobe can see the initial stores performed by prepare_uprobe().

Move the smp_rmb() accordingly. Also amend the comments associated with the two memory barriers to indicate their actual locations.

Signed-off-by: Andrea Parri <andrea.parri@amarulasolutions.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: stable@kernel.org
Fixes: 142b18ddc8143 ("uprobes: Fix handle_swbp() vs unregister() + register() race")
Link: http://lkml.kernel.org/r/20181122161031.15179-1-andrea.parri@amarulasolutions.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
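The resulting pairing, in sketch form (the flag name is from the commit; the surrounding code in kernel/events/uprobes.c is trimmed):

	/* prepare_uprobe() -- publish the copied instruction, then the flag */
	smp_wmb();		/* pairs with the smp_rmb() in handle_swbp() */
	set_bit(UPROBE_COPY_INSN, &uprobe->flags);

	/* handle_swbp() -- test the flag first, THEN order later loads */
	if (!test_bit(UPROBE_COPY_INSN, &uprobe->flags))
		goto out;	/* uprobe not fully initialized yet */
	smp_rmb();		/* subsequent loads see prepare_uprobe()'s stores */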
2018-12-08kdb: use memmove instead of overlapping memcpyArnd Bergmann
commit 2cf2f0d5b91fd1b06a6ae260462fc7945ea84add upstream. gcc discovered that the memcpy() arguments in kdbnearsym() overlap, so we should really use memmove(), which is defined to handle that correctly:

	In function 'memcpy',
	    inlined from 'kdbnearsym' at /git/arm-soc/kernel/debug/kdb/kdb_support.c:132:4:
	/git/arm-soc/include/linux/string.h:353:9: error: '__builtin_memcpy' accessing 792 bytes at offsets 0 and 8 overlaps 784 bytes at offset 8 [-Werror=restrict]
	  return __builtin_memcpy(p, q, size);

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
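A generic illustration of the difference (ordinary userspace C, not the kdb code itself): with overlapping buffers, memcpy() is undefined behavior, while memmove() is specified to copy as if through a temporary buffer.

	#include <stdio.h>
	#include <string.h>

	int main(void)
	{
		char buf[16] = "abcdefgh";

		/* destination overlaps source by 7 bytes: memcpy() would be UB */
		memmove(buf + 1, buf, 8);
		buf[9] = '\0';

		printf("%s\n", buf);	/* prints "aabcdefgh" */
		return 0;
	}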
2018-12-05ptrace: Remove unused ptrace_may_access_sched() and MODE_IBRSThomas Gleixner
commit 46f7ecb1e7359f183f5bbd1e08b90e10e52164f9 upstream The IBPB control code in x86 removed the usage. Remove the functionality which was introduced for this. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Tom Lendacky <thomas.lendacky@amd.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: David Woodhouse <dwmw@amazon.co.uk> Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Casey Schaufler <casey.schaufler@intel.com> Cc: Asit Mallick <asit.k.mallick@intel.com> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Jon Masters <jcm@redhat.com> Cc: Waiman Long <longman9394@gmail.com> Cc: Greg KH <gregkh@linuxfoundation.org> Cc: Dave Stewart <david.c.stewart@intel.com> Cc: Kees Cook <keescook@chromium.org> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20181125185005.559149393@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-12-05x86/speculation: Rework SMT state changeThomas Gleixner
commit a74cfffb03b73d41e08f84c2e5c87dec0ce3db9f upstream arch_smt_update() is only called when the sysfs SMT control knob is changed. This means that when SMT is enabled in the sysfs control knob, the system is considered to have SMT active even if all siblings are offline.

To allow fine-grained control of the speculation mitigations, the actual SMT state is more interesting than the fact that siblings could be enabled.

Rework the code so that arch_smt_update() is invoked from each individual CPU hotplug function, and simplify the update function while at it.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: David Woodhouse <dwmw@amazon.co.uk>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Casey Schaufler <casey.schaufler@intel.com>
Cc: Asit Mallick <asit.k.mallick@intel.com>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Jon Masters <jcm@redhat.com>
Cc: Waiman Long <longman9394@gmail.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Dave Stewart <david.c.stewart@intel.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20181125185004.521974984@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
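The shape of the rework, sketched (assuming the weak-default pattern in kernel/cpu.c; the exact hotplug call sites are summarized in the comment rather than reproduced):

	/* kernel/cpu.c: the default is a no-op, x86 overrides it in bugs.c */
	void __weak arch_smt_update(void) { }

	/*
	 * Per this rework, arch_smt_update() is invoked from the individual
	 * CPU online/offline paths -- not just from the sysfs SMT knob --
	 * so the mitigations track the real sibling state.
	 */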
2018-12-05sched/smt: Expose sched_smt_present static keyThomas Gleixner
commit 321a874a7ef85655e93b3206d0f36b4a6097f948 upstream Make the scheduler's 'sched_smt_present' static key globally available, so it can be used in the x86 speculation control code. Provide a query function and a stub for the CONFIG_SMP=n case.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: David Woodhouse <dwmw@amazon.co.uk>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Casey Schaufler <casey.schaufler@intel.com>
Cc: Asit Mallick <asit.k.mallick@intel.com>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Jon Masters <jcm@redhat.com>
Cc: Waiman Long <longman9394@gmail.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Dave Stewart <david.c.stewart@intel.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20181125185004.430168326@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
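The resulting interface, roughly as it ended up in include/linux/sched/smt.h (a sketch; the config guards are simplified here):

	#ifdef CONFIG_SCHED_SMT
	extern struct static_key_false sched_smt_present;

	static __always_inline bool sched_smt_active(void)
	{
		return static_branch_likely(&sched_smt_present);
	}
	#else
	static inline bool sched_smt_active(void)
	{
		return false;
	}
	#endif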
2018-12-05sched/smt: Make sched_smt_present track topologyPeter Zijlstra (Intel)
commit c5511d03ec090980732e929c318a7a6374b5550e upstream Currently the 'sched_smt_present' static key is enabled when SMT topology is observed at CPU bringup, but it is never disabled. However, there is demand to also disable the key when the topology changes such that there is no SMT present anymore.

Implement this by making the key count the number of cores that have SMT enabled. In particular, the SMT topology bits are set before interrupts are enabled and, similarly, are cleared after interrupts are disabled for the last time and the CPU dies.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: David Woodhouse <dwmw@amazon.co.uk>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Casey Schaufler <casey.schaufler@intel.com>
Cc: Asit Mallick <asit.k.mallick@intel.com>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Jon Masters <jcm@redhat.com>
Cc: Waiman Long <longman9394@gmail.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Dave Stewart <david.c.stewart@intel.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20181125185004.246110444@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
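A sketch of the counting described above, in the scheduler's CPU activate/deactivate paths (simplified; the _cpuslocked key operations are assumed here because these paths already run under the hotplug lock):

	/* sched_cpu_activate(): */
	#ifdef CONFIG_SCHED_SMT
		/* when a core's second sibling comes up, SMT is present */
		if (cpumask_weight(cpu_smt_mask(cpu)) == 2)
			static_branch_inc_cpuslocked(&sched_smt_present);
	#endif

	/* sched_cpu_deactivate(): */
	#ifdef CONFIG_SCHED_SMT
		/* when a core drops to one online sibling, SMT goes away */
		if (cpumask_weight(cpu_smt_mask(cpu)) == 2)
			static_branch_dec_cpuslocked(&sched_smt_present);
	#endif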
2018-12-05x86/speculation: Apply IBPB more strictly to avoid cross-process data leakJiri Kosina
commit dbfe2953f63c640463c630746cd5d9de8b2f63ae upstream Currently, IBPB is only issued in cases when switching into a non-dumpable process, the rationale being to protect such 'important and security sensitive' processes (such as GPG) from data leaking into a different userspace process via spectre v2.

This is however completely insufficient to provide proper userspace-to-userspace spectrev2 protection, as any process can poison branch buffers before being scheduled out, and the newly scheduled process immediately becomes a spectrev2 victim.

In order to minimize the performance impact (for use cases that do require spectrev2 protection), issue the barrier only in cases when switching between processes where the victim can't be ptraced by the potential attacker (as in such cases, the attacker doesn't have to bother with branch buffers at all).

[ tglx: Split up PTRACE_MODE_NOACCESS_CHK into PTRACE_MODE_SCHED and PTRACE_MODE_IBPB to be able to do ptrace() context tracking reasonably fine-grained ]

Fixes: 18bf3c3ea8 ("x86/speculation: Use Indirect Branch Prediction Barrier in context switch")
Originally-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "WoodhouseDavid" <dwmw@amazon.co.uk>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: "SchauflerCasey" <casey.schaufler@intel.com>
Link: https://lkml.kernel.org/r/nycvar.YFH.7.76.1809251437340.15880@cbobk.fhfr.pm
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
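Illustrative-only pseudocode for the switch-time decision (ptrace_access_allowed() is a placeholder name, not the kernel's actual helper; indirect_branch_prediction_barrier() is real):

	/*
	 * On mm switch: if the outgoing task could ptrace the incoming one
	 * anyway, an IBPB buys nothing; otherwise flush the predictors.
	 */
	if (next->mm && next->mm != prev->mm &&
	    !ptrace_access_allowed(/* tracer */ prev, /* tracee */ next))
		indirect_branch_prediction_barrier();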
2018-12-05x86/speculation: Enable cross-hyperthread spectre v2 STIBP mitigationJiri Kosina
commit 53c613fe6349994f023245519265999eed75957f upstream STIBP is a feature provided by certain Intel ucodes / CPUs. This feature (once enabled) prevents cross-hyperthread control of decisions made by indirect branch predictors.

Enable this feature if

- the CPU is vulnerable to spectre v2
- the CPU supports SMT and has SMT siblings online
- spectre_v2 mitigation autoselection is enabled (default)

After some previous discussion, this leaves STIBP on all the time, as wrmsr on crossing kernel boundary is a no-no. This could perhaps later be a bit more optimized (like disabling it in NOHZ, experimenting with disabling it in idle, etc) if needed.

Note that the synchronization of the mask manipulation via the newly added spec_ctrl_mutex is currently not strictly needed, as the only updater is already being serialized by cpu_add_remove_lock, but let's make this a little bit more future-proof.

Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "WoodhouseDavid" <dwmw@amazon.co.uk>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: "SchauflerCasey" <casey.schaufler@intel.com>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/nycvar.YFH.7.76.1809251438240.15880@cbobk.fhfr.pm
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
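A sketch of the update path (heavily abbreviated; spec_ctrl_mutex is named in the message above, while stibp_needed() and update_stibp_msr() are assumed helper names for the checks and the per-CPU MSR rewrite):

	static void arch_smt_update(void)
	{
		mutex_lock(&spec_ctrl_mutex);	/* serialize mask updates */
		if (stibp_needed())		/* vulnerable + SMT online */
			x86_spec_ctrl_base |= SPEC_CTRL_STIBP;
		else
			x86_spec_ctrl_base &= ~SPEC_CTRL_STIBP;
		/* propagate: each CPU rewrites MSR_IA32_SPEC_CTRL */
		on_each_cpu(update_stibp_msr, NULL, 1);
		mutex_unlock(&spec_ctrl_mutex);
	}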
2018-12-05sched/core: Fix cpu.max vs. cpuhotplug deadlockPeter Zijlstra
commit ce48c146495a1a50e48cdbfbfaba3e708be7c07c upstream Tejun reported the following cpu-hotplug lock (percpu-rwsem) read recursion:

	tg_set_cfs_bandwidth()
	  get_online_cpus()
	    cpus_read_lock()

	  cfs_bandwidth_usage_inc()
	    static_key_slow_inc()
	      cpus_read_lock()

Reported-by: Tejun Heo <tj@kernel.org>
Tested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20180122215328.GP3397@worktop
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
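The fix, sketched: in paths where the caller already holds the hotplug lock, use a _cpuslocked static key variant so the lock is not re-acquired (assuming the key name __cfs_bandwidth_used from kernel/sched/fair.c):

	/* tg_set_cfs_bandwidth() has already done cpus_read_lock() */
	void cfs_bandwidth_usage_inc(void)
	{
		static_key_slow_inc_cpuslocked(&__cfs_bandwidth_used);
	}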
2018-12-05bpf: Prevent memory disambiguation attackAlexei Starovoitov
commit af86ca4e3088fe5eacf2f7e58c01fa68ca067672 upstream. Detect code patterns where malicious 'speculative store bypass' can be used and sanitize such patterns.

	39: (bf) r3 = r10
	40: (07) r3 += -216
	41: (79) r8 = *(u64 *)(r7 +0)   // slow read
	42: (7a) *(u64 *)(r10 -72) = 0  // verifier inserts this instruction
	43: (7b) *(u64 *)(r8 +0) = r3   // this store becomes slow due to r8
	44: (79) r1 = *(u64 *)(r6 +0)   // cpu speculatively executes this load
	45: (71) r2 = *(u8 *)(r1 +0)    // speculatively arbitrary 'load byte'
	                                // is now sanitized

Above code after x86 JIT becomes:

	 e5: mov    %rbp,%rdx
	 e8: add    $0xffffffffffffff28,%rdx
	 ef: mov    0x0(%r13),%r14
	 f3: movq   $0x0,-0x48(%rbp)
	 fb: mov    %rdx,0x0(%r14)
	 ff: mov    0x0(%rbx),%rdi
	103: movzbq 0x0(%rdi),%rsi

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
[bwh: Backported to 4.14:
 - Add bpf_verifier_env parameter to check_stack_write()
 - Look up stack slot_types with state->stack_slot_type[] rather than state->stack[].slot_type[]
 - Drop bpf_verifier_env argument to verbose()
 - Adjust context]
Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2018-12-01rcu: Make need_resched() respond to urgent RCU-QS needsPaul E. McKenney
commit 92aa39e9dc77481b90cbef25e547d66cab901496 upstream. The per-CPU rcu_dynticks.rcu_urgent_qs variable communicates an urgent need for an RCU quiescent state from the force-quiescent-state processing within the grace-period kthread to context switches and to cond_resched(). Unfortunately, such urgent needs are not communicated to need_resched(), which is sometimes used to decide when to invoke cond_resched(), for but one example, within the KVM vcpu_run() function. As of v4.15, this can result in synchronize_sched() being delayed by up to ten seconds, which can be problematic, to say nothing of annoying. This commit therefore checks rcu_dynticks.rcu_urgent_qs from within rcu_check_callbacks(), which is invoked from the scheduling-clock interrupt handler. If the current task is not an idle task and is not executing in usermode, a context switch is forced, and either way, the rcu_dynticks.rcu_urgent_qs variable is set to false. If the current task is an idle task, then RCU's dyntick-idle code will detect the quiescent state, so no further action is required. Similarly, if the task is executing in usermode, other code in rcu_check_callbacks() and its called functions will report the corresponding quiescent state. Reported-by: Marius Hillenbrand <mhillenb@amazon.de> Reported-by: David Woodhouse <dwmw2@infradead.org> Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> [ paulmck: Backported to make patch apply cleanly on older versions. ] Tested-by: Marius Hillenbrand <mhillenb@amazon.de> Cc: <stable@vger.kernel.org> # 4.12.x - 4.19.x Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
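A sketch of the added check (simplified; it runs from the scheduling-clock interrupt via rcu_check_callbacks(), whose 'user' argument indicates whether usermode was interrupted):

	if (smp_load_acquire(this_cpu_ptr(&rcu_dynticks.rcu_urgent_qs))) {
		/* idle and usermode already imply a quiescent state */
		if (!is_idle_task(current) && !user) {
			set_tsk_need_resched(current);	/* need_resched() -> true */
			set_preempt_need_resched();
		}
		__this_cpu_write(rcu_dynticks.rcu_urgent_qs, false);
	}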