aboutsummaryrefslogtreecommitdiffstats
path: root/kernel/sched/core.c
AgeCommit message (Collapse)Author
2023-08-30cgroup/cpuset: Free DL BW in case can_attach() failsDietmar Eggemann
commit 2ef269ef1ac006acf974793d975539244d77b28f upstream. cpuset_can_attach() can fail. Postpone DL BW allocation until all tasks have been checked. DL BW is not allocated per-task but as a sum over all DL tasks migrating. If multiple controllers are attached to the cgroup next to the cpuset controller a non-cpuset can_attach() can fail. In this case free DL BW in cpuset_cancel_attach(). Finally, update cpuset DL task count (nr_deadline_tasks) only in cpuset_attach(). Suggested-by: Waiman Long <longman@redhat.com> Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Signed-off-by: Juri Lelli <juri.lelli@redhat.com> Reviewed-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org> [ Fix conflicts in kernel/cgroup/cpuset.c due to new code being applied that is not applicable on this branch. Reject new code. ] Signed-off-by: Qais Yousef (Google) <qyousef@layalina.io> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-08-30sched/deadline: Create DL BW alloc, free & check overflow interfaceDietmar Eggemann
commit 85989106feb734437e2d598b639991b9185a43a6 upstream. While moving a set of tasks between exclusive cpusets, cpuset_can_attach() -> task_can_attach() calls dl_cpu_busy(..., p) for DL BW overflow checking and per-task DL BW allocation on the destination root_domain for the DL tasks in this set. This approach has the issue of not freeing already allocated DL BW in the following error cases: (1) The set of tasks includes multiple DL tasks and DL BW overflow checking fails for one of the subsequent DL tasks. (2) Another controller next to the cpuset controller which is attached to the same cgroup fails in its can_attach(). To address this problem rework dl_cpu_busy(): (1) Split it into dl_bw_check_overflow() & dl_bw_alloc() and add a dedicated dl_bw_free(). (2) dl_bw_alloc() & dl_bw_free() take a `u64 dl_bw` parameter instead of a `struct task_struct *p` used in dl_cpu_busy(). This allows to allocate DL BW for a set of tasks too rather than only for a single task. Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Signed-off-by: Juri Lelli <juri.lelli@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Qais Yousef (Google) <qyousef@layalina.io> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-08-30sched/cpuset: Bring back cpuset_mutexJuri Lelli
commit 111cd11bbc54850f24191c52ff217da88a5e639b upstream. Turns out percpu_cpuset_rwsem - commit 1243dc518c9d ("cgroup/cpuset: Convert cpuset_mutex to percpu_rwsem") - wasn't such a brilliant idea, as it has been reported to cause slowdowns in workloads that need to change cpuset configuration frequently and it is also not implementing priority inheritance (which causes troubles with realtime workloads). Convert percpu_cpuset_rwsem back to regular cpuset_mutex. Also grab it only for SCHED_DEADLINE tasks (other policies don't care about stable cpusets anyway). Signed-off-by: Juri Lelli <juri.lelli@redhat.com> Reviewed-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org> [ Fix conflict in kernel/cgroup/cpuset.c due to pulling new functions or comment that don't exist on 5.10 or the usage of different cpu hotplug lock whenever replacing the rwsem with mutex. Remove BUG_ON() for rwsem that doesn't exist on mainline. ] Signed-off-by: Qais Yousef (Google) <qyousef@layalina.io> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-04-26sched/uclamp: Fix fits_capacity() check in feec()Qais Yousef
commit 244226035a1f9b2b6c326e55ae5188fab4f428cb upstream. As reported by Yun Hsiang [1], if a task has its uclamp_min >= 0.8 * 1024, it'll always pick the previous CPU because fits_capacity() will always return false in this case. The new util_fits_cpu() logic should handle this correctly for us beside more corner cases where similar failures could occur, like when using UCLAMP_MAX. We open code uclamp_rq_util_with() except for the clamp() part, util_fits_cpu() needs the 'raw' values to be passed to it. Also introduce uclamp_rq_{set, get}() shorthand accessors to get uclamp value for the rq. Makes the code more readable and ensures the right rules (use READ_ONCE/WRITE_ONCE) are respected transparently. [1] https://lists.linaro.org/pipermail/eas-dev/2020-July/001488.html Fixes: 1d42509e475c ("sched/fair: Make EAS wakeup placement consider uclamp restrictions") Reported-by: Yun Hsiang <hsiang023167@gmail.com> Signed-off-by: Qais Yousef <qais.yousef@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20220804143609.515789-4-qais.yousef@arm.com (cherry picked from commit 244226035a1f9b2b6c326e55ae5188fab4f428cb) [Fix trivial conflict in kernel/sched/fair.c due to new automatic variables in master vs 5.10] Signed-off-by: Qais Yousef (Google) <qyousef@layalina.io> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-04-05sched_getaffinity: don't assume 'cpumask_size()' is fully initializedLinus Torvalds
[ Upstream commit 6015b1aca1a233379625385feb01dd014aca60b5 ] The getaffinity() system call uses 'cpumask_size()' to decide how big the CPU mask is - so far so good. It is indeed the allocation size of a cpumask. But the code also assumes that the whole allocation is initialized without actually doing so itself. That's wrong, because we might have fixed-size allocations (making copying and clearing more efficient), but not all of it is then necessarily used if 'nr_cpu_ids' is smaller. Having checked other users of 'cpumask_size()', they all seem to be ok, either using it purely for the allocation size, or explicitly zeroing the cpumask before using the size in bytes to copy it. See for example the ublk_ctrl_get_queue_affinity() function that uses the proper 'zalloc_cpumask_var()' to make sure that the whole mask is cleared, whether the storage is on the stack or if it was an external allocation. Fix this by just zeroing the allocation before using it. Do the same for the compat version of sched_getaffinity(), which had the same logic. Also, for consistency, make sched_getaffinity() use 'cpumask_bits()' to access the bits. For a cpumask_var_t, it ends up being a pointer to the same data either way, but it's just a good idea to treat it like you would a 'cpumask_t'. The compat case already did that. Reported-by: Ryan Roberts <ryan.roberts@arm.com> Link: https://lore.kernel.org/lkml/7d026744-6bd6-6827-0471-b5e8eae0be3f@arm.com/ Cc: Yury Norov <yury.norov@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-04-05sched/fair: Sanitize vruntime of entity being migratedVincent Guittot
commit a53ce18cacb477dd0513c607f187d16f0fa96f71 upstream. Commit 829c1651e9c4 ("sched/fair: sanitize vruntime of entity being placed") fixes an overflowing bug, but ignore a case that se->exec_start is reset after a migration. For fixing this case, we delay the reset of se->exec_start after placing the entity which se->exec_start to detect long sleeping task. In order to take into account a possible divergence between the clock_task of 2 rqs, we increase the threshold to around 104 days. Fixes: 829c1651e9c4 ("sched/fair: sanitize vruntime of entity being placed") Originally-by: Zhang Qiao <zhangqiao22@huawei.com> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Zhang Qiao <zhangqiao22@huawei.com> Link: https://lore.kernel.org/r/20230317160810.107988-1-vincent.guittot@linaro.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-02-01panic: Consolidate open-coded panic_on_warn checksKees Cook
commit 79cc1ba7badf9e7a12af99695a557e9ce27ee967 upstream. Several run-time checkers (KASAN, UBSAN, KFENCE, KCSAN, sched) roll their own warnings, and each check "panic_on_warn". Consolidate this into a single function so that future instrumentation can be added in a single location. Cc: Marco Elver <elver@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Ben Segall <bsegall@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Daniel Bristot de Oliveira <bristot@redhat.com> Cc: Valentin Schneider <vschneid@redhat.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: David Gow <davidgow@google.com> Cc: tangmeng <tangmeng@uniontech.com> Cc: Jann Horn <jannh@google.com> Cc: Shuah Khan <skhan@linuxfoundation.org> Cc: Petr Mladek <pmladek@suse.com> Cc: "Paul E. McKenney" <paulmck@kernel.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: "Guilherme G. Piccoli" <gpiccoli@igalia.com> Cc: Tiezhu Yang <yangtiezhu@loongson.cn> Cc: kasan-dev@googlegroups.com Cc: linux-mm@kvack.org Reviewed-by: Luis Chamberlain <mcgrof@kernel.org> Signed-off-by: Kees Cook <keescook@chromium.org> Reviewed-by: Marco Elver <elver@google.com> Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com> Link: https://lore.kernel.org/r/20221117234328.594699-4-keescook@chromium.org Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-01-04io_uring: import 5.15-stable io_uringJens Axboe
No upstream commit exists. This imports the io_uring codebase from 5.15.85, wholesale. Changes from that code base: - Drop IOCB_ALLOC_CACHE, we don't have that in 5.10. - Drop MKDIRAT/SYMLINKAT/LINKAT. Would require further VFS backports, and we don't support these in 5.10 to begin with. - sock_from_file() old style calling convention. - Use compat_get_bitmap() only for CONFIG_COMPAT=y Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-08-21sched/fair: Fix fault in reweight_entityTadeusz Struk
commit 13765de8148f71fa795e0a6607de37c49ea5915a upstream. Syzbot found a GPF in reweight_entity. This has been bisected to commit 4ef0c5c6b5ba ("kernel/sched: Fix sched_fork() access an invalid sched_task_group") There is a race between sched_post_fork() and setpriority(PRIO_PGRP) within a thread group that causes a null-ptr-deref in reweight_entity() in CFS. The scenario is that the main process spawns number of new threads, which then call setpriority(PRIO_PGRP, 0, -20), wait, and exit. For each of the new threads the copy_process() gets invoked, which adds the new task_struct and calls sched_post_fork() for it. In the above scenario there is a possibility that setpriority(PRIO_PGRP) and set_one_prio() will be called for a thread in the group that is just being created by copy_process(), and for which the sched_post_fork() has not been executed yet. This will trigger a null pointer dereference in reweight_entity(), as it will try to access the run queue pointer, which hasn't been set. Before the mentioned change the cfs_rq pointer for the task has been set in sched_fork(), which is called much earlier in copy_process(), before the new task is added to the thread_group. Now it is done in the sched_post_fork(), which is called after that. To fix the issue the remove the update_load param from the update_load param() function and call reweight_task() only if the task flag doesn't have the TASK_NEW flag set. Fixes: 4ef0c5c6b5ba ("kernel/sched: Fix sched_fork() access an invalid sched_task_group") Reported-by: syzbot+af7a719bc92395ee41b3@syzkaller.appspotmail.com Signed-off-by: Tadeusz Struk <tadeusz.struk@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20220203161846.1160750-1-tadeusz.struk@linaro.org Signed-off-by: Fedor Pchelkin <pchelkin@ispras.ru> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-08-21sched: Fix the check of nr_running at queue wakelistTianchen Ding
[ Upstream commit 28156108fecb1f808b21d216e8ea8f0d205a530c ] The commit 2ebb17717550 ("sched/core: Offload wakee task activation if it the wakee is descheduling") checked rq->nr_running <= 1 to avoid task stacking when WF_ON_CPU. Per the ordering of writes to p->on_rq and p->on_cpu, observing p->on_cpu (WF_ON_CPU) in ttwu_queue_cond() implies !p->on_rq, IOW p has gone through the deactivate_task() in __schedule(), thus p has been accounted out of rq->nr_running. As such, the task being the only runnable task on the rq implies reading rq->nr_running == 0 at that point. The benchmark result is in [1]. [1] https://lore.kernel.org/all/e34de686-4e85-bde1-9f3c-9bbc86b38627@linux.alibaba.com/ Suggested-by: Valentin Schneider <vschneid@redhat.com> Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Link: https://lore.kernel.org/r/20220608233412.327341-2-dtcccc@linux.alibaba.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-08-21sched, cpuset: Fix dl_cpu_busy() panic due to empty cs->cpus_allowedWaiman Long
[ Upstream commit b6e8d40d43ae4dec00c8fea2593eeea3114b8f44 ] With cgroup v2, the cpuset's cpus_allowed mask can be empty indicating that the cpuset will just use the effective CPUs of its parent. So cpuset_can_attach() can call task_can_attach() with an empty mask. This can lead to cpumask_any_and() returns nr_cpu_ids causing the call to dl_bw_of() to crash due to percpu value access of an out of bound CPU value. For example: [80468.182258] BUG: unable to handle page fault for address: ffffffff8b6648b0 : [80468.191019] RIP: 0010:dl_cpu_busy+0x30/0x2b0 : [80468.207946] Call Trace: [80468.208947] cpuset_can_attach+0xa0/0x140 [80468.209953] cgroup_migrate_execute+0x8c/0x490 [80468.210931] cgroup_update_dfl_csses+0x254/0x270 [80468.211898] cgroup_subtree_control_write+0x322/0x400 [80468.212854] kernfs_fop_write_iter+0x11c/0x1b0 [80468.213777] new_sync_write+0x11f/0x1b0 [80468.214689] vfs_write+0x1eb/0x280 [80468.215592] ksys_write+0x5f/0xe0 [80468.216463] do_syscall_64+0x5c/0x80 [80468.224287] entry_SYSCALL_64_after_hwframe+0x44/0xae Fix that by using effective_cpus instead. For cgroup v1, effective_cpus is the same as cpus_allowed. For v2, effective_cpus is the real cpumask to be used by tasks within the cpuset anyway. Also update task_can_attach()'s 2nd argument name to cs_effective_cpus to reflect the change. In addition, a check is added to task_can_attach() to guard against the possibility that cpumask_any_and() may return a value >= nr_cpu_ids. Fixes: 7f51412a415d ("sched/deadline: Fix bandwidth check/update when migrating tasks between exclusive cpusets") Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Juri Lelli <juri.lelli@redhat.com> Link: https://lore.kernel.org/r/20220803015451.2219567-1-longman@redhat.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-08-21sched/deadline: Merge dl_task_can_attach() and dl_cpu_busy()Dietmar Eggemann
[ Upstream commit 772b6539fdda31462cc08368e78df60b31a58bab ] Both functions are doing almost the same, that is checking if admission control is still respected. With exclusive cpusets, dl_task_can_attach() checks if the destination cpuset (i.e. its root domain) has enough CPU capacity to accommodate the task. dl_cpu_busy() checks if there is enough CPU capacity in the cpuset in case the CPU is hot-plugged out. dl_task_can_attach() is used to check if a task can be admitted while dl_cpu_busy() is used to check if a CPU can be hotplugged out. Make dl_cpu_busy() able to deal with a task and use it instead of dl_task_can_attach() in task_can_attach(). Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Juri Lelli <juri.lelli@redhat.com> Link: https://lore.kernel.org/r/20220302183433.333029-4-dietmar.eggemann@arm.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-04-08sched/core: Export pelt_thermal_tpQais Yousef
[ Upstream commit 77cf151b7bbdfa3577b3c3f3a5e267a6c60a263b ] We can't use this tracepoint in modules without having the symbol exported first, fix that. Fixes: 765047932f15 ("sched/pelt: Add support to track thermal pressure") Signed-off-by: Qais Yousef <qais.yousef@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20211028115005.873539-1-qais.yousef@arm.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-12-08sched/uclamp: Fix rq->uclamp_max not set on first enqueueQais Yousef
[ Upstream commit 315c4f884800c45cb6bd8c90422fad554a8b9588 ] Commit d81ae8aac85c ("sched/uclamp: Fix initialization of struct uclamp_rq") introduced a bug where uclamp_max of the rq is not reset to match the woken up task's uclamp_max when the rq is idle. The code was relying on rq->uclamp_max initialized to zero, so on first enqueue static inline void uclamp_rq_inc_id(struct rq *rq, struct task_struct *p, enum uclamp_id clamp_id) { ... if (uc_se->value > READ_ONCE(uc_rq->value)) WRITE_ONCE(uc_rq->value, uc_se->value); } was actually resetting it. But since commit d81ae8aac85c changed the default to 1024, this no longer works. And since rq->uclamp_flags is also initialized to 0, neither above code path nor uclamp_idle_reset() update the rq->uclamp_max on first wake up from idle. This is only visible from first wake up(s) until the first dequeue to idle after enabling the static key. And it only matters if the uclamp_max of this task is < 1024 since only then its uclamp_max will be effectively ignored. Fix it by properly initializing rq->uclamp_flags = UCLAMP_FLAG_IDLE to ensure uclamp_idle_reset() is called which then will update the rq uclamp_max value as expected. Fixes: d81ae8aac85c ("sched/uclamp: Fix initialization of struct uclamp_rq") Signed-off-by: Qais Yousef <qais.yousef@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <Valentin.Schneider@arm.com> Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Link: https://lkml.kernel.org/r/20211202112033.1705279-1-qais.yousef@arm.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-12-01sched/scs: Reset task stack state in bringup_cpu()Mark Rutland
[ Upstream commit dce1ca0525bfdc8a69a9343bc714fbc19a2f04b3 ] To hot unplug a CPU, the idle task on that CPU calls a few layers of C code before finally leaving the kernel. When KASAN is in use, poisoned shadow is left around for each of the active stack frames, and when shadow call stacks are in use. When shadow call stacks (SCS) are in use the task's saved SCS SP is left pointing at an arbitrary point within the task's shadow call stack. When a CPU is offlined than onlined back into the kernel, this stale state can adversely affect execution. Stale KASAN shadow can alias new stackframes and result in bogus KASAN warnings. A stale SCS SP is effectively a memory leak, and prevents a portion of the shadow call stack being used. Across a number of hotplug cycles the idle task's entire shadow call stack can become unusable. We previously fixed the KASAN issue in commit: e1b77c92981a5222 ("sched/kasan: remove stale KASAN poison after hotplug") ... by removing any stale KASAN stack poison immediately prior to onlining a CPU. Subsequently in commit: f1a0a376ca0c4ef1 ("sched/core: Initialize the idle task with preemption disabled") ... the refactoring left the KASAN and SCS cleanup in one-time idle thread initialization code rather than something invoked prior to each CPU being onlined, breaking both as above. We fixed SCS (but not KASAN) in commit: 63acd42c0d4942f7 ("sched/scs: Reset the shadow stack when idle_task_exit") ... but as this runs in the context of the idle task being offlined it's potentially fragile. To fix these consistently and more robustly, reset the SCS SP and KASAN shadow of a CPU's idle task immediately before we online that CPU in bringup_cpu(). This ensures the idle task always has a consistent state when it is running, and removes the need to so so when exiting an idle task. Whenever any thread is created, dup_task_struct() will give the task a stack which is free of KASAN shadow, and initialize the task's SCS SP, so there's no need to specially initialize either for idle thread within init_idle(), as this was only necessary to handle hotplug cycles. I've tested this on arm64 with: * gcc 11.1.0, defconfig +KASAN_INLINE, KASAN_STACK * clang 12.0.0, defconfig +KASAN_INLINE, KASAN_STACK, SHADOW_CALL_STACK ... offlining and onlining CPUS with: | while true; do | for C in /sys/devices/system/cpu/cpu*/online; do | echo 0 > $C; | echo 1 > $C; | done | done Fixes: f1a0a376ca0c4ef1 ("sched/core: Initialize the idle task with preemption disabled") Reported-by: Qian Cai <quic_qiancai@quicinc.com> Signed-off-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <valentin.schneider@arm.com> Tested-by: Qian Cai <quic_qiancai@quicinc.com> Link: https://lore.kernel.org/lkml/20211115113310.35693-1-mark.rutland@arm.com/ Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-11-26sched/core: Mitigate race cpus_share_cache()/update_top_cache_domain()Vincent Donnefort
[ Upstream commit 42dc938a590c96eeb429e1830123fef2366d9c80 ] Nothing protects the access to the per_cpu variable sd_llc_id. When testing the same CPU (i.e. this_cpu == that_cpu), a race condition exists with update_top_cache_domain(). One scenario being: CPU1 CPU2 ================================================================== per_cpu(sd_llc_id, CPUX) => 0 partition_sched_domains_locked() detach_destroy_domains() cpus_share_cache(CPUX, CPUX) update_top_cache_domain(CPUX) per_cpu(sd_llc_id, CPUX) => 0 per_cpu(sd_llc_id, CPUX) = CPUX per_cpu(sd_llc_id, CPUX) => CPUX return false ttwu_queue_cond() wouldn't catch smp_processor_id() == cpu and the result is a warning triggered from ttwu_queue_wakelist(). Avoid a such race in cpus_share_cache() by always returning true when this_cpu == that_cpu. Fixes: 518cd6234178 ("sched: Only queue remote wakeups when crossing cache boundaries") Reported-by: Jing-Ting Wu <jing-ting.wu@mediatek.com> Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <valentin.schneider@arm.com> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20211104175120.857087-1-vincent.donnefort@arm.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-11-18kernel/sched: Fix sched_fork() access an invalid sched_task_groupZhang Qiao
[ Upstream commit 4ef0c5c6b5ba1f38f0ea1cedad0cad722f00c14a ] There is a small race between copy_process() and sched_fork() where child->sched_task_group point to an already freed pointer. parent doing fork() | someone moving the parent | to another cgroup -------------------------------+------------------------------- copy_process() + dup_task_struct()<1> parent move to another cgroup, and free the old cgroup. <2> + sched_fork() + __set_task_cpu()<3> + task_fork_fair() + sched_slice()<4> In the worst case, this bug can lead to "use-after-free" and cause panic as shown above: (1) parent copy its sched_task_group to child at <1>; (2) someone move the parent to another cgroup and free the old cgroup at <2>; (3) the sched_task_group and cfs_rq that belong to the old cgroup will be accessed at <3> and <4>, which cause a panic: [] BUG: unable to handle kernel NULL pointer dereference at 0000000000000000 [] PGD 8000001fa0a86067 P4D 8000001fa0a86067 PUD 2029955067 PMD 0 [] Oops: 0000 [#1] SMP PTI [] CPU: 7 PID: 648398 Comm: ebizzy Kdump: loaded Tainted: G OE --------- - - 4.18.0.x86_64+ #1 [] RIP: 0010:sched_slice+0x84/0xc0 [] Call Trace: [] task_fork_fair+0x81/0x120 [] sched_fork+0x132/0x240 [] copy_process.part.5+0x675/0x20e0 [] ? __handle_mm_fault+0x63f/0x690 [] _do_fork+0xcd/0x3b0 [] do_syscall_64+0x5d/0x1d0 [] entry_SYSCALL_64_after_hwframe+0x65/0xca [] RIP: 0033:0x7f04418cd7e1 Between cgroup_can_fork() and cgroup_post_fork(), the cgroup membership and thus sched_task_group can't change. So update child's sched_task_group at sched_post_fork() and move task_fork() and __set_task_cpu() (where accees the sched_task_group) from sched_fork() to sched_post_fork(). Fixes: 8323f26ce342 ("sched: Fix race in task_group") Signed-off-by: Zhang Qiao <zhangqiao22@huawei.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lkml.kernel.org/r/20210915064030.2231-1-zhangqiao22@huawei.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-10-27sched/scs: Reset the shadow stack when idle_task_exitWoody Lin
[ Upstream commit 63acd42c0d4942f74710b11c38602fb14dea7320 ] Commit f1a0a376ca0c ("sched/core: Initialize the idle task with preemption disabled") removed the init_idle() call from idle_thread_get(). This was the sole call-path on hotplug that resets the Shadow Call Stack (scs) Stack Pointer (sp). Not resetting the scs-sp leads to scs overflow after enough hotplug cycles. Therefore add an explicit scs_task_reset() to the hotplug code to make sure the scs-sp does get reset on hotplug. Fixes: f1a0a376ca0c ("sched/core: Initialize the idle task with preemption disabled") Signed-off-by: Woody Lin <woodylin@google.com> [peterz: Changelog] Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <valentin.schneider@arm.com> Link: https://lore.kernel.org/r/20211012083521.973587-1-woodylin@google.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-09-15sched: Fix UCLAMP_FLAG_IDLE settingQuentin Perret
[ Upstream commit ca4984a7dd863f3e1c0df775ae3e744bff24c303 ] The UCLAMP_FLAG_IDLE flag is set on a runqueue when dequeueing the last uclamp active task (that is, when buckets.tasks reaches 0 for all buckets) to maintain the last uclamp.max and prevent blocked util from suddenly becoming visible. However, there is an asymmetry in how the flag is set and cleared which can lead to having the flag set whilst there are active tasks on the rq. Specifically, the flag is cleared in the uclamp_rq_inc() path, which is called at enqueue time, but set in uclamp_rq_dec_id() which is called both when dequeueing a task _and_ in the update_uclamp_active() path. As a result, when both uclamp_rq_{dec,ind}_id() are called from update_uclamp_active(), the flag ends up being set but not cleared, hence leaving the runqueue in a broken state. Fix this by clearing the flag in update_uclamp_active() as well. Fixes: e496187da710 ("sched/uclamp: Enforce last task's UCLAMP_MAX") Reported-by: Rick Yiu <rickyiu@google.com> Signed-off-by: Quentin Perret <qperret@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Qais Yousef <qais.yousef@arm.com> Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Link: https://lore.kernel.org/r/20210805102154.590709-2-qperret@google.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-08-12sched/rt: Fix double enqueue caused by rt_effective_prioPeter Zijlstra
commit f558c2b834ec27e75d37b1c860c139e7b7c3a8e4 upstream. Double enqueues in rt runqueues (list) have been reported while running a simple test that spawns a number of threads doing a short sleep/run pattern while being concurrently setscheduled between rt and fair class. WARNING: CPU: 3 PID: 2825 at kernel/sched/rt.c:1294 enqueue_task_rt+0x355/0x360 CPU: 3 PID: 2825 Comm: setsched__13 RIP: 0010:enqueue_task_rt+0x355/0x360 Call Trace: __sched_setscheduler+0x581/0x9d0 _sched_setscheduler+0x63/0xa0 do_sched_setscheduler+0xa0/0x150 __x64_sys_sched_setscheduler+0x1a/0x30 do_syscall_64+0x33/0x40 entry_SYSCALL_64_after_hwframe+0x44/0xae list_add double add: new=ffff9867cb629b40, prev=ffff9867cb629b40, next=ffff98679fc67ca0. kernel BUG at lib/list_debug.c:31! invalid opcode: 0000 [#1] PREEMPT_RT SMP PTI CPU: 3 PID: 2825 Comm: setsched__13 RIP: 0010:__list_add_valid+0x41/0x50 Call Trace: enqueue_task_rt+0x291/0x360 __sched_setscheduler+0x581/0x9d0 _sched_setscheduler+0x63/0xa0 do_sched_setscheduler+0xa0/0x150 __x64_sys_sched_setscheduler+0x1a/0x30 do_syscall_64+0x33/0x40 entry_SYSCALL_64_after_hwframe+0x44/0xae __sched_setscheduler() uses rt_effective_prio() to handle proper queuing of priority boosted tasks that are setscheduled while being boosted. rt_effective_prio() is however called twice per each __sched_setscheduler() call: first directly by __sched_setscheduler() before dequeuing the task and then by __setscheduler() to actually do the priority change. If the priority of the pi_top_task is concurrently being changed however, it might happen that the two calls return different results. If, for example, the first call returned the same rt priority the task was running at and the second one a fair priority, the task won't be removed by the rt list (on_list still set) and then enqueued in the fair runqueue. When eventually setscheduled back to rt it will be seen as enqueued already and the WARNING/BUG be issued. Fix this by calling rt_effective_prio() only once and then reusing the return value. While at it refactor code as well for clarity. Concurrent priority inheritance handling is still safe and will eventually converge to a new state by following the inheritance chain(s). Fixes: 0782e63bc6fe ("sched: Handle priority boosted tasks proper in setscheduler()") [squashed Peterz changes; added changelog] Reported-by: Mark Simmons <msimmons@redhat.com> Signed-off-by: Juri Lelli <juri.lelli@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20210803104501.38333-1-juri.lelli@redhat.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-07-14sched/uclamp: Fix uclamp_tg_restrict()Qais Yousef
[ Upstream commit 0213b7083e81f4acd69db32cb72eb4e5f220329a ] Now cpu.uclamp.min acts as a protection, we need to make sure that the uclamp request of the task is within the allowed range of the cgroup, that is it is clamp()'ed correctly by tg->uclamp[UCLAMP_MIN] and tg->uclamp[UCLAMP_MAX]. As reported by Xuewen [1] we can have some corner cases where there's inversion between uclamp requested by task (p) and the uclamp values of the taskgroup it's attached to (tg). Following table demonstrates 2 corner cases: | p | tg | effective -----------+-----+------+----------- CASE 1 -----------+-----+------+----------- uclamp_min | 60% | 0% | 60% -----------+-----+------+----------- uclamp_max | 80% | 50% | 50% -----------+-----+------+----------- CASE 2 -----------+-----+------+----------- uclamp_min | 0% | 30% | 30% -----------+-----+------+----------- uclamp_max | 20% | 50% | 20% -----------+-----+------+----------- With this fix we get: | p | tg | effective -----------+-----+------+----------- CASE 1 -----------+-----+------+----------- uclamp_min | 60% | 0% | 50% -----------+-----+------+----------- uclamp_max | 80% | 50% | 50% -----------+-----+------+----------- CASE 2 -----------+-----+------+----------- uclamp_min | 0% | 30% | 30% -----------+-----+------+----------- uclamp_max | 20% | 50% | 30% -----------+-----+------+----------- Additionally uclamp_update_active_tasks() must now unconditionally update both UCLAMP_MIN/MAX because changing the tg's UCLAMP_MAX for instance could have an impact on the effective UCLAMP_MIN of the tasks. | p | tg | effective -----------+-----+------+----------- old -----------+-----+------+----------- uclamp_min | 60% | 0% | 50% -----------+-----+------+----------- uclamp_max | 80% | 50% | 50% -----------+-----+------+----------- *new* -----------+-----+------+----------- uclamp_min | 60% | 0% | *60%* -----------+-----+------+----------- uclamp_max | 80% |*70%* | *70%* -----------+-----+------+----------- [1] https://lore.kernel.org/lkml/CAB8ipk_a6VFNjiEnHRHkUMBKbA+qzPQvhtNjJ_YNzQhqV_o8Zw@mail.gmail.com/ Fixes: 0c18f2ecfcc2 ("sched/uclamp: Fix wrong implementation of cpu.uclamp.min") Reported-by: Xuewen Yan <xuewen.yan94@gmail.com> Signed-off-by: Qais Yousef <qais.yousef@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20210617165155.3774110-1-qais.yousef@arm.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-07-14sched/uclamp: Fix locking around cpu_util_update_eff()Qais Yousef
[ Upstream commit 93b73858701fd01de26a4a874eb95f9b7156fd4b ] cpu_cgroup_css_online() calls cpu_util_update_eff() without holding the uclamp_mutex or rcu_read_lock() like other call sites, which is a mistake. The uclamp_mutex is required to protect against concurrent reads and writes that could update the cgroup hierarchy. The rcu_read_lock() is required to traverse the cgroup data structures in cpu_util_update_eff(). Surround the caller with the required locks and add some asserts to better document the dependency in cpu_util_update_eff(). Fixes: 7226017ad37a ("sched/uclamp: Fix a bug in propagating uclamp value in new cgroups") Reported-by: Quentin Perret <qperret@google.com> Signed-off-by: Qais Yousef <qais.yousef@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20210510145032.1934078-3-qais.yousef@arm.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-07-14sched/uclamp: Fix wrong implementation of cpu.uclamp.minQais Yousef
[ Upstream commit 0c18f2ecfcc274a4bcc1d122f79ebd4001c3b445 ] cpu.uclamp.min is a protection as described in cgroup-v2 Resource Distribution Model Documentation/admin-guide/cgroup-v2.rst which means we try our best to preserve the minimum performance point of tasks in this group. See full description of cpu.uclamp.min in the cgroup-v2.rst. But the current implementation makes it a limit, which is not what was intended. For example: tg->cpu.uclamp.min = 20% p0->uclamp[UCLAMP_MIN] = 0 p1->uclamp[UCLAMP_MIN] = 50% Previous Behavior (limit): p0->effective_uclamp = 0 p1->effective_uclamp = 20% New Behavior (Protection): p0->effective_uclamp = 20% p1->effective_uclamp = 50% Which is inline with how protections should work. With this change the cgroup and per-task behaviors are the same, as expected. Additionally, we remove the confusing relationship between cgroup and !user_defined flag. We don't want for example RT tasks that are boosted by default to max to change their boost value when they attach to a cgroup. If a cgroup wants to limit the max performance point of tasks attached to it, then cpu.uclamp.max must be set accordingly. Or if they want to set different boost value based on cgroup, then sysctl_sched_util_clamp_min_rt_default must be used to NOT boost to max and set the right cpu.uclamp.min for each group to let the RT tasks obtain the desired boost value when attached to that group. As it stands the dependency on !user_defined flag adds an extra layer of complexity that is not required now cpu.uclamp.min behaves properly as a protection. The propagation model of effective cpu.uclamp.min in child cgroups as implemented by cpu_util_update_eff() is still correct. The parent protection sets an upper limit of what the child cgroups will effectively get. Fixes: 3eac870a3247 (sched/uclamp: Use TG's clamps to restrict TASK's clamps) Signed-off-by: Qais Yousef <qais.yousef@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20210510145032.1934078-2-qais.yousef@arm.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-07-14sched/core: Initialize the idle task with preemption disabledValentin Schneider
[ Upstream commit f1a0a376ca0c4ef1fc3d24e3e502acbb5b795674 ] As pointed out by commit de9b8f5dcbd9 ("sched: Fix crash trying to dequeue/enqueue the idle thread") init_idle() can and will be invoked more than once on the same idle task. At boot time, it is invoked for the boot CPU thread by sched_init(). Then smp_init() creates the threads for all the secondary CPUs and invokes init_idle() on them. As the hotplug machinery brings the secondaries to life, it will issue calls to idle_thread_get(), which itself invokes init_idle() yet again. In this case it's invoked twice more per secondary: at _cpu_up(), and at bringup_cpu(). Given smp_init() already initializes the idle tasks for all *possible* CPUs, no further initialization should be required. Now, removing init_idle() from idle_thread_get() exposes some interesting expectations with regards to the idle task's preempt_count: the secondary startup always issues a preempt_disable(), requiring some reset of the preempt count to 0 between hot-unplug and hotplug, which is currently served by idle_thread_get() -> idle_init(). Given the idle task is supposed to have preemption disabled once and never see it re-enabled, it seems that what we actually want is to initialize its preempt_count to PREEMPT_DISABLED and leave it there. Do that, and remove init_idle() from idle_thread_get(). Secondary startups were patched via coccinelle: @begone@ @@ -preempt_disable(); ... cpu_startup_entry(CPUHP_AP_ONLINE_IDLE); Signed-off-by: Valentin Schneider <valentin.schneider@arm.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/r/20210512094636.2958515-1-valentin.schneider@arm.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-05-19sched: Fix out-of-bound access in uclampQuentin Perret
[ Upstream commit 6d2f8909a5fabb73fe2a63918117943986c39b6c ] Util-clamp places tasks in different buckets based on their clamp values for performance reasons. However, the size of buckets is currently computed using a rounding division, which can lead to an off-by-one error in some configurations. For instance, with 20 buckets, the bucket size will be 1024/20=51. A task with a clamp of 1024 will be mapped to bucket id 1024/51=20. Sadly, correct indexes are in range [0,19], hence leading to an out of bound memory access. Clamp the bucket id to fix the issue. Fixes: 69842cba9ace ("sched/uclamp: Add CPU's clamp buckets refcounting") Suggested-by: Qais Yousef <qais.yousef@arm.com> Signed-off-by: Quentin Perret <qperret@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Link: https://lkml.kernel.org/r/20210430151412.160913-1-qperret@google.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-05-14smp: Fix smp_call_function_single_async prototypeArnd Bergmann
commit 1139aeb1c521eb4a050920ce6c64c36c4f2a3ab7 upstream. As of commit 966a967116e6 ("smp: Avoid using two cache lines for struct call_single_data"), the smp code prefers 32-byte aligned call_single_data objects for performance reasons, but the block layer includes an instance of this structure in the main 'struct request' that is more senstive to size than to performance here, see 4ccafe032005 ("block: unalign call_single_data in struct request"). The result is a violation of the calling conventions that clang correctly points out: block/blk-mq.c:630:39: warning: passing 8-byte aligned argument to 32-byte aligned parameter 2 of 'smp_call_function_single_async' may result in an unaligned pointer access [-Walign-mismatch] smp_call_function_single_async(cpu, &rq->csd); It does seem that the usage of the call_single_data without cache line alignment should still be allowed by the smp code, so just change the function prototype so it accepts both, but leave the default alignment unchanged for the other users. This seems better to me than adding a local hack to shut up an otherwise correct warning in the caller. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Jens Axboe <axboe@kernel.dk> Link: https://lkml.kernel.org/r/20210505211300.3174456-1-arnd@kernel.org [nc: Fix conflicts, modify rq_csd_init] Signed-off-by: Nathan Chancellor <nathan@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-03-07sched/features: Fix hrtick reprogrammingJuri Lelli
[ Upstream commit 156ec6f42b8d300dbbf382738ff35c8bad8f4c3a ] Hung tasks and RCU stall cases were reported on systems which were not 100% busy. Investigation of such unexpected cases (no sign of potential starvation caused by tasks hogging the system) pointed out that the periodic sched tick timer wasn't serviced anymore after a certain point and that caused all machinery that depends on it (timers, RCU, etc.) to stop working as well. This issues was however only reproducible if HRTICK was enabled. Looking at core dumps it was found that the rbtree of the hrtimer base used also for the hrtick was corrupted (i.e. next as seen from the base root and actual leftmost obtained by traversing the tree are different). Same base is also used for periodic tick hrtimer, which might get "lost" if the rbtree gets corrupted. Much alike what described in commit 1f71addd34f4c ("tick/sched: Do not mess with an enqueued hrtimer") there is a race window between hrtimer_set_expires() in hrtick_start and hrtimer_start_expires() in __hrtick_restart() in which the former might be operating on an already queued hrtick hrtimer, which might lead to corruption of the base. Use hrtick_start() (which removes the timer before enqueuing it back) to ensure hrtick hrtimer reprogramming is entirely guarded by the base lock, so that no race conditions can occur. Signed-off-by: Juri Lelli <juri.lelli@redhat.com> Signed-off-by: Luis Claudio R. Goncalves <lgoncalv@redhat.com> Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lkml.kernel.org/r/20210208073554.14629-2-juri.lelli@redhat.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-03-07sched/core: Allow try_invoke_on_locked_down_task() with irqs disabledPeter Zijlstra
commit 1b7af295541d75535374325fd617944534853919 upstream. The try_invoke_on_locked_down_task() function currently requires that interrupts be enabled, but it is called with interrupts disabled from rcu_print_task_stall(), resulting in an "IRQs not enabled as expected" diagnostic. This commit therefore updates try_invoke_on_locked_down_task() to use raw_spin_lock_irqsave() instead of raw_spin_lock_irq(), thus allowing use from either context. Link: https://lore.kernel.org/lkml/000000000000903d5805ab908fc4@google.com/ Link: https://lore.kernel.org/lkml/20200928075729.GC2611@hirez.programming.kicks-ass.net/ Reported-by: syzbot+cb3b69ae80afd6535b0e@syzkaller.appspotmail.com Signed-off-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-12-30sched: Reenable interrupts in do_sched_yield()Thomas Gleixner
[ Upstream commit 345a957fcc95630bf5535d7668a59ed983eb49a7 ] do_sched_yield() invokes schedule() with interrupts disabled which is not allowed. This goes back to the pre git era to commit a6efb709806c ("[PATCH] irqlock patch 2.5.27-H6") in the history tree. Reenable interrupts and remove the misleading comment which "explains" it. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/87r1pt7y5c.fsf@nanos.tec.linutronix.de Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-11-17sched/deadline: Fix priority inheritance with multiple scheduling classesJuri Lelli
Glenn reported that "an application [he developed produces] a BUG in deadline.c when a SCHED_DEADLINE task contends with CFS tasks on nested PTHREAD_PRIO_INHERIT mutexes. I believe the bug is triggered when a CFS task that was boosted by a SCHED_DEADLINE task boosts another CFS task (nested priority inheritance). ------------[ cut here ]------------ kernel BUG at kernel/sched/deadline.c:1462! invalid opcode: 0000 [#1] PREEMPT SMP CPU: 12 PID: 19171 Comm: dl_boost_bug Tainted: ... Hardware name: ... RIP: 0010:enqueue_task_dl+0x335/0x910 Code: ... RSP: 0018:ffffc9000c2bbc68 EFLAGS: 00010002 RAX: 0000000000000009 RBX: ffff888c0af94c00 RCX: ffffffff81e12500 RDX: 000000000000002e RSI: ffff888c0af94c00 RDI: ffff888c10b22600 RBP: ffffc9000c2bbd08 R08: 0000000000000009 R09: 0000000000000078 R10: ffffffff81e12440 R11: ffffffff81e1236c R12: ffff888bc8932600 R13: ffff888c0af94eb8 R14: ffff888c10b22600 R15: ffff888bc8932600 FS: 00007fa58ac55700(0000) GS:ffff888c10b00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fa58b523230 CR3: 0000000bf44ab003 CR4: 00000000007606e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 PKRU: 55555554 Call Trace: ? intel_pstate_update_util_hwp+0x13/0x170 rt_mutex_setprio+0x1cc/0x4b0 task_blocks_on_rt_mutex+0x225/0x260 rt_spin_lock_slowlock_locked+0xab/0x2d0 rt_spin_lock_slowlock+0x50/0x80 hrtimer_grab_expiry_lock+0x20/0x30 hrtimer_cancel+0x13/0x30 do_nanosleep+0xa0/0x150 hrtimer_nanosleep+0xe1/0x230 ? __hrtimer_init_sleeper+0x60/0x60 __x64_sys_nanosleep+0x8d/0xa0 do_syscall_64+0x4a/0x100 entry_SYSCALL_64_after_hwframe+0x49/0xbe RIP: 0033:0x7fa58b52330d ... ---[ end trace 0000000000000002 ]— He also provided a simple reproducer creating the situation below: So the execution order of locking steps are the following (N1 and N2 are non-deadline tasks. D1 is a deadline task. M1 and M2 are mutexes that are enabled * with priority inheritance.) Time moves forward as this timeline goes down: N1 N2 D1 | | | | | | Lock(M1) | | | | | | Lock(M2) | | | | | | Lock(M2) | | | | Lock(M1) | | (!!bug triggered!) | Daniel reported a similar situation as well, by just letting ksoftirqd run with DEADLINE (and eventually block on a mutex). Problem is that boosted entities (Priority Inheritance) use static DEADLINE parameters of the top priority waiter. However, there might be cases where top waiter could be a non-DEADLINE entity that is currently boosted by a DEADLINE entity from a different lock chain (i.e., nested priority chains involving entities of non-DEADLINE classes). In this case, top waiter static DEADLINE parameters could be null (initialized to 0 at fork()) and replenish_dl_entity() would hit a BUG(). Fix this by keeping track of the original donor and using its parameters when a task is boosted. Reported-by: Glenn Elliott <glenn@aurora.tech> Reported-by: Daniel Bristot de Oliveira <bristot@redhat.com> Signed-off-by: Juri Lelli <juri.lelli@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Daniel Bristot de Oliveira <bristot@redhat.com> Link: https://lkml.kernel.org/r/20201117061432.517340-1-juri.lelli@redhat.com
2020-11-17sched: Fix rq->nr_iowait orderingPeter Zijlstra
schedule() ttwu() deactivate_task(); if (p->on_rq && ...) // false atomic_dec(&task_rq(p)->nr_iowait); if (prev->in_iowait) atomic_inc(&rq->nr_iowait); Allows nr_iowait to be decremented before it gets incremented, resulting in more dodgy IO-wait numbers than usual. Note that because we can now do ttwu_queue_wakelist() before p->on_cpu==0, we lose the natural ordering and have to further delay the decrement. Fixes: c6e7bd7afaeb ("sched/core: Optimize ttwu() spinning on p->on_cpu") Reported-by: Tejun Heo <tj@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Mel Gorman <mgorman@techsingularity.net> Link: https://lkml.kernel.org/r/20201117093829.GD3121429@hirez.programming.kicks-ass.net
2020-10-14sched/features: Fix !CONFIG_JUMP_LABEL caseJuri Lelli
Commit: 765cc3a4b224e ("sched/core: Optimize sched_feat() for !CONFIG_SCHED_DEBUG builds") made sched features static for !CONFIG_SCHED_DEBUG configurations, but overlooked the CONFIG_SCHED_DEBUG=y and !CONFIG_JUMP_LABEL cases. For the latter echoing changes to /sys/kernel/debug/sched_features has the nasty effect of effectively changing what sched_features reports, but without actually changing the scheduler behaviour (since different translation units get different sysctl_sched_features). Fix CONFIG_SCHED_DEBUG=y and !CONFIG_JUMP_LABEL configurations by properly restructuring ifdefs. Fixes: 765cc3a4b224e ("sched/core: Optimize sched_feat() for !CONFIG_SCHED_DEBUG builds") Co-developed-by: Daniel Bristot de Oliveira <bristot@redhat.com> Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com> Signed-off-by: Juri Lelli <juri.lelli@redhat.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Patrick Bellasi <patrick.bellasi@matbug.net> Reviewed-by: Valentin Schneider <valentin.schneider@arm.com> Link: https://lore.kernel.org/r/20201013053114.160628-1-juri.lelli@redhat.com
2020-10-12Merge tag 'sched-core-2020-10-12' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler updates from Ingo Molnar: - reorganize & clean up the SD* flags definitions and add a bunch of sanity checks. These new checks caught quite a few bugs or at least inconsistencies, resulting in another set of patches. - rseq updates, add MEMBARRIER_CMD_PRIVATE_EXPEDITED_RSEQ - add a new tracepoint to improve CPU capacity tracking - improve overloaded SMP system load-balancing behavior - tweak SMT balancing - energy-aware scheduling updates - NUMA balancing improvements - deadline scheduler fixes and improvements - CPU isolation fixes - misc cleanups, simplifications and smaller optimizations * tag 'sched-core-2020-10-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (42 commits) sched/deadline: Unthrottle PI boosted threads while enqueuing sched/debug: Add new tracepoint to track cpu_capacity sched/fair: Tweak pick_next_entity() rseq/selftests: Test MEMBARRIER_CMD_PRIVATE_EXPEDITED_RSEQ rseq/selftests,x86_64: Add rseq_offset_deref_addv() rseq/membarrier: Add MEMBARRIER_CMD_PRIVATE_EXPEDITED_RSEQ sched/fair: Use dst group while checking imbalance for NUMA balancer sched/fair: Reduce busy load balance interval sched/fair: Minimize concurrent LBs between domain level sched/fair: Reduce minimal imbalance threshold sched/fair: Relax constraint on task's load during load balance sched/fair: Remove the force parameter of update_tg_load_avg() sched/fair: Fix wrong cpu selecting from isolated domain sched: Remove unused inline function uclamp_bucket_base_value() sched/rt: Disable RT_RUNTIME_SHARE by default sched/deadline: Fix stale throttling on de-/boosted tasks sched/numa: Use runnable_avg to classify node sched/topology: Move sd_flag_debug out of #ifdef CONFIG_SYSCTL MAINTAINERS: Add myself as SCHED_DEADLINE reviewer sched/topology: Move SD_DEGENERATE_GROUPS_MASK out of linux/sched/topology.h ...
2020-10-03sched/debug: Add new tracepoint to track cpu_capacityVincent Donnefort
rq->cpu_capacity is a key element in several scheduler parts, such as EAS task placement and load balancing. Tracking this value enables testing and/or debugging by a toolkit. Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/1598605249-72651-1-git-send-email-vincent.donnefort@arm.com
2020-09-25sched: Remove unused inline function uclamp_bucket_base_value()YueHaibing
There is no caller in tree, so can remove it. Signed-off-by: YueHaibing <yuehaibing@huawei.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Link: https://lkml.kernel.org/r/20200922132410.48440-1-yuehaibing@huawei.com
2020-08-26sched: Cache task_struct::flags in sched_submit_work()Sebastian Andrzej Siewior
sched_submit_work() is considered to be a hot path. The preempt_disable() instruction is a compiler barrier and forces the compiler to load task_struct::flags for the second comparison. By using a local variable, the compiler can load the value once and keep it in a register for the second comparison. Verified on x86-64 with gcc-10. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200819200025.lqvmyefqnbok5i4f@linutronix.de
2020-08-23treewide: Use fallthrough pseudo-keywordGustavo A. R. Silva
Replace the existing /* fall through */ comments and its variants with the new pseudo-keyword macro fallthrough[1]. Also, remove unnecessary fall-through markings when it is the case. [1] https://www.kernel.org/doc/html/v5.7/process/deprecated.html?highlight=fallthrough#implicit-switch-case-fall-through Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
2020-08-15Merge tag 'sched-urgent-2020-08-15' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fixes from Ingo Molnar: "Two fixes: fix a new tracepoint's output value, and fix the formatting of show-state syslog printouts" * tag 'sched-urgent-2020-08-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/debug: Fix the alignment of the show-state debug output sched: Fix use of count for nr_running tracepoint
2020-08-14sched/debug: Fix the alignment of the show-state debug outputLibing Zhou
Current sysrq(t) output task fields name are not aligned with actual task fields value, e.g.: kernel: sysrq: Show State kernel: task PC stack pid father kernel: systemd S12456 1 0 0x00000000 kernel: Call Trace: kernel: ? __schedule+0x240/0x740 To make it more readable, print fields name together with task fields value in the same line, with fixed width: kernel: sysrq: Show State kernel: task:systemd state:S stack:12920 pid: 1 ppid: 0 flags:0x00000000 kernel: Call Trace: kernel: __schedule+0x282/0x620 Signed-off-by: Libing Zhou <libing.zhou@nokia-sbell.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20200814030236.37835-1-libing.zhou@nokia-sbell.com
2020-08-06Merge tag 'sched-fifo-2020-08-04' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull sched/fifo updates from Ingo Molnar: "This adds the sched_set_fifo*() encapsulation APIs to remove static priority level knowledge from non-scheduler code. The three APIs for non-scheduler code to set SCHED_FIFO are: - sched_set_fifo() - sched_set_fifo_low() - sched_set_normal() These are two FIFO priority levels: default (high), and a 'low' priority level, plus sched_set_normal() to set the policy back to non-SCHED_FIFO. Since the changes affect a lot of non-scheduler code, we kept this in a separate tree" * tag 'sched-fifo-2020-08-04' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (24 commits) sched,tracing: Convert to sched_set_fifo() sched: Remove sched_set_*() return value sched: Remove sched_setscheduler*() EXPORTs sched,psi: Convert to sched_set_fifo_low() sched,rcutorture: Convert to sched_set_fifo_low() sched,rcuperf: Convert to sched_set_fifo_low() sched,locktorture: Convert to sched_set_fifo() sched,irq: Convert to sched_set_fifo() sched,watchdog: Convert to sched_set_fifo() sched,serial: Convert to sched_set_fifo() sched,powerclamp: Convert to sched_set_fifo() sched,ion: Convert to sched_set_normal() sched,powercap: Convert to sched_set_fifo*() sched,spi: Convert to sched_set_fifo*() sched,mmc: Convert to sched_set_fifo*() sched,ivtv: Convert to sched_set_fifo*() sched,drm/scheduler: Convert to sched_set_fifo*() sched,msm: Convert to sched_set_fifo*() sched,psci: Convert to sched_set_fifo*() sched,drbd: Convert to sched_set_fifo*() ...
2020-08-03Merge tag 'sched-core-2020-08-03' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler updates from Ingo Molnar: - Improve uclamp performance by using a static key for the fast path - Add the "sched_util_clamp_min_rt_default" sysctl, to optimize for better power efficiency of RT tasks on battery powered devices. (The default is to maximize performance & reduce RT latencies.) - Improve utime and stime tracking accuracy, which had a fixed boundary of error, which created larger and larger relative errors as the values become larger. This is now replaced with more precise arithmetics, using the new mul_u64_u64_div_u64() helper in math64.h. - Improve the deadline scheduler, such as making it capacity aware - Improve frequency-invariant scheduling - Misc cleanups in energy/power aware scheduling - Add sched_update_nr_running tracepoint to track changes to nr_running - Documentation additions and updates - Misc cleanups and smaller fixes * tag 'sched-core-2020-08-03' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (54 commits) sched/doc: Factorize bits between sched-energy.rst & sched-capacity.rst sched/doc: Document capacity aware scheduling sched: Document arch_scale_*_capacity() arm, arm64: Fix selection of CONFIG_SCHED_THERMAL_PRESSURE Documentation/sysctl: Document uclamp sysctl knobs sched/uclamp: Add a new sysctl to control RT default boost value sched/uclamp: Fix a deadlock when enabling uclamp static key sched: Remove duplicated tick_nohz_full_enabled() check sched: Fix a typo in a comment sched/uclamp: Remove unnecessary mutex_init() arm, arm64: Select CONFIG_SCHED_THERMAL_PRESSURE sched: Cleanup SCHED_THERMAL_PRESSURE kconfig entry arch_topology, sched/core: Cleanup thermal pressure definition trace/events/sched.h: fix duplicated word linux/sched/mm.h: drop duplicated words in comments smp: Fix a potential usage of stale nr_cpus sched/fair: update_pick_idlest() Select group with lowest group_util when idle_cpus are equal sched: nohz: stop passing around unused "ticks" parameter. sched: Better document ttwu() sched: Add a tracepoint to track rq->nr_running ...
2020-07-29sched/uclamp: Add a new sysctl to control RT default boost valueQais Yousef
RT tasks by default run at the highest capacity/performance level. When uclamp is selected this default behavior is retained by enforcing the requested uclamp.min (p->uclamp_req[UCLAMP_MIN]) of the RT tasks to be uclamp_none(UCLAMP_MAX), which is SCHED_CAPACITY_SCALE; the maximum value. This is also referred to as 'the default boost value of RT tasks'. See commit 1a00d999971c ("sched/uclamp: Set default clamps for RT tasks"). On battery powered devices, it is desired to control this default (currently hardcoded) behavior at runtime to reduce energy consumed by RT tasks. For example, a mobile device manufacturer where big.LITTLE architecture is dominant, the performance of the little cores varies across SoCs, and on high end ones the big cores could be too power hungry. Given the diversity of SoCs, the new knob allows manufactures to tune the best performance/power for RT tasks for the particular hardware they run on. They could opt to further tune the value when the user selects a different power saving mode or when the device is actively charging. The runtime aspect of it further helps in creating a single kernel image that can be run on multiple devices that require different tuning. Keep in mind that a lot of RT tasks in the system are created by the kernel. On Android for instance I can see over 50 RT tasks, only a handful of which created by the Android framework. To control the default behavior globally by system admins and device integrator, introduce the new sysctl_sched_uclamp_util_min_rt_default to change the default boost value of the RT tasks. I anticipate this to be mostly in the form of modifying the init script of a particular device. To avoid polluting the fast path with unnecessary code, the approach taken is to synchronously do the update by traversing all the existing tasks in the system. This could race with a concurrent fork(), which is dealt with by introducing sched_post_fork() function which will ensure the racy fork will get the right update applied. Tested on Juno-r2 in combination with the RT capacity awareness [1]. By default an RT task will go to the highest capacity CPU and run at the maximum frequency, which is particularly energy inefficient on high end mobile devices because the biggest core[s] are 'huge' and power hungry. With this patch the RT task can be controlled to run anywhere by default, and doesn't cause the frequency to be maximum all the time. Yet any task that really needs to be boosted can easily escape this default behavior by modifying its requested uclamp.min value (p->uclamp_req[UCLAMP_MIN]) via sched_setattr() syscall. [1] 804d402fb6f6: ("sched/rt: Make RT capacity-aware") Signed-off-by: Qais Yousef <qais.yousef@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200716110347.19553-2-qais.yousef@arm.com
2020-07-29sched/uclamp: Fix a deadlock when enabling uclamp static keyQais Yousef
The following splat was caught when setting uclamp value of a task: BUG: sleeping function called from invalid context at ./include/linux/percpu-rwsem.h:49 cpus_read_lock+0x68/0x130 static_key_enable+0x1c/0x38 __sched_setscheduler+0x900/0xad8 Fix by ensuring we enable the key outside of the critical section in __sched_setscheduler() Fixes: 46609ce22703 ("sched/uclamp: Protect uclamp fast path code with static key") Signed-off-by: Qais Yousef <qais.yousef@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200716110347.19553-4-qais.yousef@arm.com
2020-07-25sched/uclamp: Remove unnecessary mutex_init()Qinglang Miao
The uclamp_mutex lock is initialized statically via DEFINE_MUTEX(), it is unnecessary to initialize it runtime via mutex_init(). Signed-off-by: Qinglang Miao <miaoqinglang@huawei.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Patrick Bellasi <patrick.bellasi@arm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Link: https://lore.kernel.org/r/20200725085629.98292-1-miaoqinglang@huawei.com
2020-07-24sched: Warn if garbage is passed to default_wake_function()Chris Wilson
Since the default_wake_function() passes its flags onto try_to_wake_up(), warn if those flags collide with internal values. Given that the supplied flags are garbage, no repair can be done but at least alert the user to the damage they are causing. In the belief that these errors should be picked up during testing, the warning is only compiled in under CONFIG_SCHED_DEBUG. Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: https://lore.kernel.org/r/20200723201042.18861-1-chris@chris-wilson.co.uk
2020-07-22arch_topology, sched/core: Cleanup thermal pressure definitionValentin Schneider
The following commit: 14533a16c46d ("thermal/cpu-cooling, sched/core: Move the arch_set_thermal_pressure() API to generic scheduler code") moved the definition of arch_set_thermal_pressure() to sched/core.c, but kept its declaration in linux/arch_topology.h. When building e.g. an x86 kernel with CONFIG_SCHED_THERMAL_PRESSURE=y, cpufreq_cooling.c ends up getting the declaration of arch_set_thermal_pressure() from include/linux/arch_topology.h, which is somewhat awkward. On top of this, sched/core.c unconditionally defines o The thermal_pressure percpu variable o arch_set_thermal_pressure() while arch_scale_thermal_pressure() does nothing unless redefined by the architecture. arch_*() functions are meant to be defined by architectures, so revert the aforementioned commit and re-implement it in a way that keeps arch_set_thermal_pressure() architecture-definable, and doesn't define the thermal pressure percpu variable for kernels that don't need it (CONFIG_SCHED_THERMAL_PRESSURE=n). Signed-off-by: Valentin Schneider <valentin.schneider@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200712165917.9168-2-valentin.schneider@arm.com
2020-07-22sched: Better document ttwu()Peter Zijlstra
Dave hit the problem fixed by commit: b6e13e85829f ("sched/core: Fix ttwu() race") and failed to understand much of the code involved. Per his request a few comments to (hopefully) clarify things. Requested-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200702125211.GQ4800@hirez.programming.kicks-ass.net
2020-07-22Merge branch 'sched/urgent'Peter Zijlstra
2020-07-22sched: Fix race against ptrace_freeze_trace()Peter Zijlstra
There is apparently one site that violates the rule that only current and ttwu() will modify task->state, namely ptrace_{,un}freeze_traced() will change task->state for a remote task. Oleg explains: "TASK_TRACED/TASK_STOPPED was always protected by siglock. In particular, ttwu(__TASK_TRACED) must be always called with siglock held. That is why ptrace_freeze_traced() assumes it can safely do s/TASK_TRACED/__TASK_TRACED/ under spin_lock(siglock)." This breaks the ordering scheme introduced by commit: dbfb089d360b ("sched: Fix loadavg accounting race") Specifically, the reload not matching no longer implies we don't have to block. Simply things by noting that what we need is a LOAD->STORE ordering and this can be provided by a control dependency. So replace: prev_state = prev->state; raw_spin_lock(&rq->lock); smp_mb__after_spinlock(); /* SMP-MB */ if (... && prev_state && prev_state == prev->state) deactivate_task(); with: prev_state = prev->state; if (... && prev_state) /* CTRL-DEP */ deactivate_task(); Since that already implies the 'prev->state' load must be complete before allowing the 'prev->on_rq = 0' store to become visible. Fixes: dbfb089d360b ("sched: Fix loadavg accounting race") Reported-by: Jiri Slaby <jirislaby@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Oleg Nesterov <oleg@redhat.com> Tested-by: Paul Gortmaker <paul.gortmaker@windriver.com> Tested-by: Christian Brauner <christian.brauner@ubuntu.com>
2020-07-08sched: Add a tracepoint to track rq->nr_runningPhil Auld
Add a bare tracepoint trace_sched_update_nr_running_tp which tracks ->nr_running CPU's rq. This is used to accurately trace this data and provide a visualization of scheduler imbalances in, for example, the form of a heat map. The tracepoint is accessed by loading an external kernel module. An example module (forked from Qais' module and including the pelt related tracepoints) can be found at: https://github.com/auldp/tracepoints-helpers.git A script to turn the trace-cmd report output into a heatmap plot can be found at: https://github.com/jirvoz/plot-nr-running The tracepoints are added to add_nr_running() and sub_nr_running() which are in kernel/sched/sched.h. In order to avoid CREATE_TRACE_POINTS in the header a wrapper call is used and the trace/events/sched.h include is moved before sched.h in kernel/sched/core. Signed-off-by: Phil Auld <pauld@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200629192303.GC120228@lorien.usersys.redhat.com