aboutsummaryrefslogtreecommitdiffstats
path: root/kernel/sched
AgeCommit message (Collapse)Author
2022-09-05sched/deadline: Fix priority inheritance with multiple scheduling classesJuri Lelli
commit 2279f540ea7d05f22d2f0c4224319330228586bc upstream. Glenn reported that "an application [he developed produces] a BUG in deadline.c when a SCHED_DEADLINE task contends with CFS tasks on nested PTHREAD_PRIO_INHERIT mutexes. I believe the bug is triggered when a CFS task that was boosted by a SCHED_DEADLINE task boosts another CFS task (nested priority inheritance). ------------[ cut here ]------------ kernel BUG at kernel/sched/deadline.c:1462! invalid opcode: 0000 [#1] PREEMPT SMP CPU: 12 PID: 19171 Comm: dl_boost_bug Tainted: ... Hardware name: ... RIP: 0010:enqueue_task_dl+0x335/0x910 Code: ... RSP: 0018:ffffc9000c2bbc68 EFLAGS: 00010002 RAX: 0000000000000009 RBX: ffff888c0af94c00 RCX: ffffffff81e12500 RDX: 000000000000002e RSI: ffff888c0af94c00 RDI: ffff888c10b22600 RBP: ffffc9000c2bbd08 R08: 0000000000000009 R09: 0000000000000078 R10: ffffffff81e12440 R11: ffffffff81e1236c R12: ffff888bc8932600 R13: ffff888c0af94eb8 R14: ffff888c10b22600 R15: ffff888bc8932600 FS: 00007fa58ac55700(0000) GS:ffff888c10b00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fa58b523230 CR3: 0000000bf44ab003 CR4: 00000000007606e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 PKRU: 55555554 Call Trace: ? intel_pstate_update_util_hwp+0x13/0x170 rt_mutex_setprio+0x1cc/0x4b0 task_blocks_on_rt_mutex+0x225/0x260 rt_spin_lock_slowlock_locked+0xab/0x2d0 rt_spin_lock_slowlock+0x50/0x80 hrtimer_grab_expiry_lock+0x20/0x30 hrtimer_cancel+0x13/0x30 do_nanosleep+0xa0/0x150 hrtimer_nanosleep+0xe1/0x230 ? __hrtimer_init_sleeper+0x60/0x60 __x64_sys_nanosleep+0x8d/0xa0 do_syscall_64+0x4a/0x100 entry_SYSCALL_64_after_hwframe+0x49/0xbe RIP: 0033:0x7fa58b52330d ... ---[ end trace 0000000000000002 ]— He also provided a simple reproducer creating the situation below: So the execution order of locking steps are the following (N1 and N2 are non-deadline tasks. D1 is a deadline task. M1 and M2 are mutexes that are enabled * with priority inheritance.) Time moves forward as this timeline goes down: N1 N2 D1 | | | | | | Lock(M1) | | | | | | Lock(M2) | | | | | | Lock(M2) | | | | Lock(M1) | | (!!bug triggered!) | Daniel reported a similar situation as well, by just letting ksoftirqd run with DEADLINE (and eventually block on a mutex). Problem is that boosted entities (Priority Inheritance) use static DEADLINE parameters of the top priority waiter. However, there might be cases where top waiter could be a non-DEADLINE entity that is currently boosted by a DEADLINE entity from a different lock chain (i.e., nested priority chains involving entities of non-DEADLINE classes). In this case, top waiter static DEADLINE parameters could be null (initialized to 0 at fork()) and replenish_dl_entity() would hit a BUG(). Fix this by keeping track of the original donor and using its parameters when a task is boosted. Reported-by: Glenn Elliott <glenn@aurora.tech> Reported-by: Daniel Bristot de Oliveira <bristot@redhat.com> Signed-off-by: Juri Lelli <juri.lelli@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Daniel Bristot de Oliveira <bristot@redhat.com> Link: https://lkml.kernel.org/r/20201117061432.517340-1-juri.lelli@redhat.com [Ankit: Regenerated the patch for v4.19.y] Signed-off-by: Ankit Jain <ankitja@vmware.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-09-05sched/deadline: Fix stale throttling on de-/boosted tasksLucas Stach
commit 46fcc4b00c3cca8adb9b7c9afdd499f64e427135 upstream. When a boosted task gets throttled, what normally happens is that it's immediately enqueued again with ENQUEUE_REPLENISH, which replenishes the runtime and clears the dl_throttled flag. There is a special case however: if the throttling happened on sched-out and the task has been deboosted in the meantime, the replenish is skipped as the task will return to its normal scheduling class. This leaves the task with the dl_throttled flag set. Now if the task gets boosted up to the deadline scheduling class again while it is sleeping, it's still in the throttled state. The normal wakeup however will enqueue the task with ENQUEUE_REPLENISH not set, so we don't actually place it on the rq. Thus we end up with a task that is runnable, but not actually on the rq and neither a immediate replenishment happens, nor is the replenishment timer set up, so the task is stuck in forever-throttled limbo. Clear the dl_throttled flag before dropping back to the normal scheduling class to fix this issue. Signed-off-by: Lucas Stach <l.stach@pengutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Juri Lelli <juri.lelli@redhat.com> Link: https://lkml.kernel.org/r/20200831110719.2126930-1-l.stach@pengutronix.de [Ankit: Regenerated the patch for v4.19.y] Signed-off-by: Ankit Jain <ankitja@vmware.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-09-05sched/deadline: Unthrottle PI boosted threads while enqueuingDaniel Bristot de Oliveira
commit feff2e65efd8d84cf831668e182b2ce73c604bbb upstream. stress-ng has a test (stress-ng --cyclic) that creates a set of threads under SCHED_DEADLINE with the following parameters: dl_runtime = 10000 (10 us) dl_deadline = 100000 (100 us) dl_period = 100000 (100 us) These parameters are very aggressive. When using a system without HRTICK set, these threads can easily execute longer than the dl_runtime because the throttling happens with 1/HZ resolution. During the main part of the test, the system works just fine because the workload does not try to run over the 10 us. The problem happens at the end of the test, on the exit() path. During exit(), the threads need to do some cleanups that require real-time mutex locks, mainly those related to memory management, resulting in this scenario: Note: locks are rt_mutexes... ------------------------------------------------------------------------ TASK A: TASK B: TASK C: activation activation activation lock(a): OK! lock(b): OK! <overrun runtime> lock(a) -> block (task A owns it) -> self notice/set throttled +--< -> arm replenished timer | switch-out | lock(b) | -> <C prio > B prio> | -> boost TASK B | unlock(a) switch-out | -> handle lock a to B | -> wakeup(B) | -> B is throttled: | -> do not enqueue | switch-out | | +---------------------> replenishment timer -> TASK B is boosted: -> do not enqueue ------------------------------------------------------------------------ BOOM: TASK B is runnable but !enqueued, holding TASK C: the system crashes with hung task C. This problem is avoided by removing the throttle state from the boosted thread while boosting it (by TASK A in the example above), allowing it to be queued and run boosted. The next replenishment will take care of the runtime overrun, pushing the deadline further away. See the "while (dl_se->runtime <= 0)" on replenish_dl_entity() for more information. Reported-by: Mark Simmons <msimmons@redhat.com> Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Juri Lelli <juri.lelli@redhat.com> Tested-by: Mark Simmons <msimmons@redhat.com> Link: https://lkml.kernel.org/r/5076e003450835ec74e6fa5917d02c4fa41687e6.1600170294.git.bristot@redhat.com [Ankit: Regenerated the patch for v4.19.y] Signed-off-by: Ankit Jain <ankitja@vmware.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-08-25nohz/full, sched/rt: Fix missed tick-reenabling bug in dequeue_task_rt()Nicolas Saenz Julienne
[ Upstream commit 5c66d1b9b30f737fcef85a0b75bfe0590e16b62a ] dequeue_task_rt() only decrements 'rt_rq->rt_nr_running' after having called sched_update_tick_dependency() preventing it from re-enabling the tick on systems that no longer have pending SCHED_RT tasks but have multiple runnable SCHED_OTHER tasks: dequeue_task_rt() dequeue_rt_entity() dequeue_rt_stack() dequeue_top_rt_rq() sub_nr_running() // decrements rq->nr_running sched_update_tick_dependency() sched_can_stop_tick() // checks rq->rt.rt_nr_running, ... __dequeue_rt_entity() dec_rt_tasks() // decrements rq->rt.rt_nr_running ... Every other scheduler class performs the operation in the opposite order, and sched_update_tick_dependency() expects the values to be updated as such. So avoid the misbehaviour by inverting the order in which the above operations are performed in the RT scheduler. Fixes: 76d92ac305f2 ("sched: Migrate sched to use new tick dependency mask model") Signed-off-by: Nicolas Saenz Julienne <nsaenzju@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Reviewed-by: Phil Auld <pauld@redhat.com> Link: https://lore.kernel.org/r/20220628092259.330171-1-nsaenzju@redhat.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-04-15sched/debug: Remove mpol_get/put and task_lock/unlock from sched_show_numaBharata B Rao
[ Upstream commit 28c988c3ec29db74a1dda631b18785958d57df4f ] The older format of /proc/pid/sched printed home node info which required the mempolicy and task lock around mpol_get(). However the format has changed since then and there is no need for sched_show_numa() any more to have mempolicy argument, asssociated mpol_get/put and task_lock/unlock. Remove them. Fixes: 397f2378f1361 ("sched/numa: Fix numa balancing stats in /proc/pid/sched") Signed-off-by: Bharata B Rao <bharata@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Acked-by: Mel Gorman <mgorman@suse.de> Link: https://lore.kernel.org/r/20220118050515.2973-1-bharata@amd.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-03-23sched/topology: Fix sched_domain_topology_level alloc in sched_init_numa()Dietmar Eggemann
commit 71e5f6644fb2f3304fcb310145ded234a37e7cc1 upstream. Commit "sched/topology: Make sched_init_numa() use a set for the deduplicating sort" allocates 'i + nr_levels (level)' instead of 'i + nr_levels + 1' sched_domain_topology_level. This led to an Oops (on Arm64 juno with CONFIG_SCHED_DEBUG): sched_init_domains build_sched_domains() __free_domain_allocs() __sdt_free() { ... for_each_sd_topology(tl) ... sd = *per_cpu_ptr(sdd->sd, j); <-- ... } Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Tested-by: Vincent Guittot <vincent.guittot@linaro.org> Tested-by: Barry Song <song.bao.hua@hisilicon.com> Link: https://lkml.kernel.org/r/6000e39e-7d28-c360-9cd6-8798fd22a9bf@arm.com Signed-off-by: dann frazier <dann.frazier@canonical.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-03-23sched/topology: Make sched_init_numa() use a set for the deduplicating sortValentin Schneider
commit 620a6dc40754dc218f5b6389b5d335e9a107fd29 upstream. The deduplicating sort in sched_init_numa() assumes that the first line in the distance table contains all unique values in the entire table. I've been trying to pen what this exactly means for the topology, but it's not straightforward. For instance, topology.c uses this example: node 0 1 2 3 0: 10 20 20 30 1: 20 10 20 20 2: 20 20 10 20 3: 30 20 20 10 0 ----- 1 | / | | / | | / | 2 ----- 3 Which works out just fine. However, if we swap nodes 0 and 1: 1 ----- 0 | / | | / | | / | 2 ----- 3 we get this distance table: node 0 1 2 3 0: 10 20 20 20 1: 20 10 20 30 2: 20 20 10 20 3: 20 30 20 10 Which breaks the deduplicating sort (non-representative first line). In this case this would just be a renumbering exercise, but it so happens that we can have a deduplicating sort that goes through the whole table in O(n²) at the extra cost of a temporary memory allocation (i.e. any form of set). The ACPI spec (SLIT) mentions distances are encoded on 8 bits. Following this, implement the set as a 256-bits bitmap. Should this not be satisfactory (i.e. we want to support 32-bit values), then we'll have to go for some other sparse set implementation. This has the added benefit of letting us allocate just the right amount of memory for sched_domains_numa_distance[], rather than an arbitrary (nr_node_ids + 1). Note: DT binding equivalent (distance-map) decodes distances as 32-bit values. Signed-off-by: Valentin Schneider <valentin.schneider@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20210122123943.1217-2-valentin.schneider@arm.com Signed-off-by: dann frazier <dann.frazier@canonical.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-01-27cputime, cpuacct: Include guest time in user time in cpuacct.statAndrey Ryabinin
commit 9731698ecb9c851f353ce2496292ff9fcea39dff upstream. cpuacct.stat in no-root cgroups shows user time without guest time included int it. This doesn't match with user time shown in root cpuacct.stat and /proc/<pid>/stat. This also affects cgroup2's cpu.stat in the same way. Make account_guest_time() to add user time to cgroup's cpustat to fix this. Fixes: ef12fefabf94 ("cpuacct: add per-cgroup utime/stime statistics") Signed-off-by: Andrey Ryabinin <arbn@yandex-team.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com> Acked-by: Tejun Heo <tj@kernel.org> Cc: <stable@vger.kernel.org> Link: https://lore.kernel.org/r/20211115164607.23784-1-arbn@yandex-team.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-01-27sched/rt: Try to restart rt period timer when rt runtime exceededLi Hua
[ Upstream commit 9b58e976b3b391c0cf02e038d53dd0478ed3013c ] When rt_runtime is modified from -1 to a valid control value, it may cause the task to be throttled all the time. Operations like the following will trigger the bug. E.g: 1. echo -1 > /proc/sys/kernel/sched_rt_runtime_us 2. Run a FIFO task named A that executes while(1) 3. echo 950000 > /proc/sys/kernel/sched_rt_runtime_us When rt_runtime is -1, The rt period timer will not be activated when task A enqueued. And then the task will be throttled after setting rt_runtime to 950,000. The task will always be throttled because the rt period timer is not activated. Fixes: d0b27fa77854 ("sched: rt-group: synchonised bandwidth period") Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Li Hua <hucool.lihua@huawei.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20211203033618.11895-1-hucool.lihua@huawei.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-12-14wait: add wake_up_pollfree()Eric Biggers
commit 42288cb44c4b5fff7653bc392b583a2b8bd6a8c0 upstream. Several ->poll() implementations are special in that they use a waitqueue whose lifetime is the current task, rather than the struct file as is normally the case. This is okay for blocking polls, since a blocking poll occurs within one task; however, non-blocking polls require another solution. This solution is for the queue to be cleared before it is freed, using 'wake_up_poll(wq, EPOLLHUP | POLLFREE);'. However, that has a bug: wake_up_poll() calls __wake_up() with nr_exclusive=1. Therefore, if there are multiple "exclusive" waiters, and the wakeup function for the first one returns a positive value, only that one will be called. That's *not* what's needed for POLLFREE; POLLFREE is special in that it really needs to wake up everyone. Considering the three non-blocking poll systems: - io_uring poll doesn't handle POLLFREE at all, so it is broken anyway. - aio poll is unaffected, since it doesn't support exclusive waits. However, that's fragile, as someone could add this feature later. - epoll doesn't appear to be broken by this, since its wakeup function returns 0 when it sees POLLFREE. But this is fragile. Although there is a workaround (see epoll), it's better to define a function which always sends POLLFREE to all waiters. Add such a function. Also make it verify that the queue really becomes empty after all waiters have been woken up. Reported-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20211209010455.42744-2-ebiggers@kernel.org Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-11-26sched/core: Mitigate race cpus_share_cache()/update_top_cache_domain()Vincent Donnefort
[ Upstream commit 42dc938a590c96eeb429e1830123fef2366d9c80 ] Nothing protects the access to the per_cpu variable sd_llc_id. When testing the same CPU (i.e. this_cpu == that_cpu), a race condition exists with update_top_cache_domain(). One scenario being: CPU1 CPU2 ================================================================== per_cpu(sd_llc_id, CPUX) => 0 partition_sched_domains_locked() detach_destroy_domains() cpus_share_cache(CPUX, CPUX) update_top_cache_domain(CPUX) per_cpu(sd_llc_id, CPUX) => 0 per_cpu(sd_llc_id, CPUX) = CPUX per_cpu(sd_llc_id, CPUX) => CPUX return false ttwu_queue_cond() wouldn't catch smp_processor_id() == cpu and the result is a warning triggered from ttwu_queue_wakelist(). Avoid a such race in cpus_share_cache() by always returning true when this_cpu == that_cpu. Fixes: 518cd6234178 ("sched: Only queue remote wakeups when crossing cache boundaries") Reported-by: Jing-Ting Wu <jing-ting.wu@mediatek.com> Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <valentin.schneider@arm.com> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20211104175120.857087-1-vincent.donnefort@arm.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-10-06cpufreq: schedutil: Use kobject release() method to free sugov_tunablesKevin Hao
[ Upstream commit e5c6b312ce3cc97e90ea159446e6bfa06645364d ] The struct sugov_tunables is protected by the kobject, so we can't free it directly. Otherwise we would get a call trace like this: ODEBUG: free active (active state 0) object type: timer_list hint: delayed_work_timer_fn+0x0/0x30 WARNING: CPU: 3 PID: 720 at lib/debugobjects.c:505 debug_print_object+0xb8/0x100 Modules linked in: CPU: 3 PID: 720 Comm: a.sh Tainted: G W 5.14.0-rc1-next-20210715-yocto-standard+ #507 Hardware name: Marvell OcteonTX CN96XX board (DT) pstate: 40400009 (nZcv daif +PAN -UAO -TCO BTYPE=--) pc : debug_print_object+0xb8/0x100 lr : debug_print_object+0xb8/0x100 sp : ffff80001ecaf910 x29: ffff80001ecaf910 x28: ffff00011b10b8d0 x27: ffff800011043d80 x26: ffff00011a8f0000 x25: ffff800013cb3ff0 x24: 0000000000000000 x23: ffff80001142aa68 x22: ffff800011043d80 x21: ffff00010de46f20 x20: ffff800013c0c520 x19: ffff800011d8f5b0 x18: 0000000000000010 x17: 6e6968207473696c x16: 5f72656d6974203a x15: 6570797420746365 x14: 6a626f2029302065 x13: 303378302f307830 x12: 2b6e665f72656d69 x11: ffff8000124b1560 x10: ffff800012331520 x9 : ffff8000100ca6b0 x8 : 000000000017ffe8 x7 : c0000000fffeffff x6 : 0000000000000001 x5 : ffff800011d8c000 x4 : ffff800011d8c740 x3 : 0000000000000000 x2 : ffff0001108301c0 x1 : ab3c90eedf9c0f00 x0 : 0000000000000000 Call trace: debug_print_object+0xb8/0x100 __debug_check_no_obj_freed+0x1c0/0x230 debug_check_no_obj_freed+0x20/0x88 slab_free_freelist_hook+0x154/0x1c8 kfree+0x114/0x5d0 sugov_exit+0xbc/0xc0 cpufreq_exit_governor+0x44/0x90 cpufreq_set_policy+0x268/0x4a8 store_scaling_governor+0xe0/0x128 store+0xc0/0xf0 sysfs_kf_write+0x54/0x80 kernfs_fop_write_iter+0x128/0x1c0 new_sync_write+0xf0/0x190 vfs_write+0x2d4/0x478 ksys_write+0x74/0x100 __arm64_sys_write+0x24/0x30 invoke_syscall.constprop.0+0x54/0xe0 do_el0_svc+0x64/0x158 el0_svc+0x2c/0xb0 el0t_64_sync_handler+0xb0/0xb8 el0t_64_sync+0x198/0x19c irq event stamp: 5518 hardirqs last enabled at (5517): [<ffff8000100cbd7c>] console_unlock+0x554/0x6c8 hardirqs last disabled at (5518): [<ffff800010fc0638>] el1_dbg+0x28/0xa0 softirqs last enabled at (5504): [<ffff8000100106e0>] __do_softirq+0x4d0/0x6c0 softirqs last disabled at (5483): [<ffff800010049548>] irq_exit+0x1b0/0x1b8 So split the original sugov_tunables_free() into two functions, sugov_clear_global_tunables() is just used to clear the global_tunables and the new sugov_tunables_free() is used as kobj_type::release to release the sugov_tunables safely. Fixes: 9bdcb44e391d ("cpufreq: schedutil: New governor based on scheduler utilization data") Cc: 4.7+ <stable@vger.kernel.org> # 4.7+ Signed-off-by: Kevin Hao <haokexin@gmail.com> Acked-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-09-22sched/deadline: Fix missing clock update in migrate_task_rq_dl()Dietmar Eggemann
[ Upstream commit b4da13aa28d4fd0071247b7b41c579ee8a86c81a ] A missing clock update is causing the following warning: rq->clock_update_flags < RQCF_ACT_SKIP WARNING: CPU: 112 PID: 2041 at kernel/sched/sched.h:1453 sub_running_bw.isra.0+0x190/0x1a0 ... CPU: 112 PID: 2041 Comm: sugov:112 Tainted: G W 5.14.0-rc1 #1 Hardware name: WIWYNN Mt.Jade Server System B81.030Z1.0007/Mt.Jade Motherboard, BIOS 1.6.20210526 (SCP: 1.06.20210526) 2021/05/26 ... Call trace: sub_running_bw.isra.0+0x190/0x1a0 migrate_task_rq_dl+0xf8/0x1e0 set_task_cpu+0xa8/0x1f0 try_to_wake_up+0x150/0x3d4 wake_up_q+0x64/0xc0 __up_write+0xd0/0x1c0 up_write+0x4c/0x2b0 cppc_set_perf+0x120/0x2d0 cppc_cpufreq_set_target+0xe0/0x1a4 [cppc_cpufreq] __cpufreq_driver_target+0x74/0x140 sugov_work+0x64/0x80 kthread_worker_fn+0xe0/0x230 kthread+0x138/0x140 ret_from_fork+0x10/0x18 The task causing this is the `cppc_fie` DL task introduced by commit 1eb5dde674f5 ("cpufreq: CPPC: Add support for frequency invariance"). With CONFIG_ACPI_CPPC_CPUFREQ_FIE=y and schedutil cpufreq governor on slow-switching system (like on this Ampere Altra WIWYNN Mt. Jade Arm Server): DL task `curr=sugov:112` lets `p=cppc_fie` migrate and since the latter is in `non_contending` state, migrate_task_rq_dl() calls sub_running_bw()->__sub_running_bw()->cpufreq_update_util()-> rq_clock()->assert_clock_updated() on p. Fix this by updating the clock for a non_contending task in migrate_task_rq_dl() before calling sub_running_bw(). Reported-by: Bruno Goncalves <bgoncalv@redhat.com> Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Daniel Bristot de Oliveira <bristot@kernel.org> Acked-by: Juri Lelli <juri.lelli@redhat.com> Link: https://lore.kernel.org/r/20210804135925.3734605-1-dietmar.eggemann@arm.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-09-22sched/deadline: Fix reset_on_fork reporting of DL tasksQuentin Perret
[ Upstream commit f95091536f78971b269ec321b057b8d630b0ad8a ] It is possible for sched_getattr() to incorrectly report the state of the reset_on_fork flag when called on a deadline task. Indeed, if the flag was set on a deadline task using sched_setattr() with flags (SCHED_FLAG_RESET_ON_FORK | SCHED_FLAG_KEEP_PARAMS), then p->sched_reset_on_fork will be set, but __setscheduler() will bail out early, which means that the dl_se->flags will not get updated by __setscheduler_params()->__setparam_dl(). Consequently, if sched_getattr() is then called on the task, __getparam_dl() will override kattr.sched_flags with the now out-of-date copy in dl_se->flags and report the stale value to userspace. To fix this, make sure to only copy the flags that are relevant to sched_deadline to and from the dl_se->flags field. Signed-off-by: Quentin Perret <qperret@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20210727101103.2729607-2-qperret@google.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-07-28sched/fair: Fix CFS bandwidth hrtimer expiry typeOdin Ugedal
[ Upstream commit 72d0ad7cb5bad265adb2014dbe46c4ccb11afaba ] The time remaining until expiry of the refresh_timer can be negative. Casting the type to an unsigned 64-bit value will cause integer underflow, making the runtime_refresh_within return false instead of true. These situations are rare, but they do happen. This does not cause user-facing issues or errors; other than possibly unthrottling cfs_rq's using runtime from the previous period(s), making the CFS bandwidth enforcement less strict in those (special) situations. Signed-off-by: Odin Ugedal <odin@uged.al> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Ben Segall <bsegall@google.com> Link: https://lore.kernel.org/r/20210629121452.18429-1-odin@uged.al Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-07-20sched/fair: Fix ascii art by relpacing tabsOdin Ugedal
[ Upstream commit 08f7c2f4d0e9f4283f5796b8168044c034a1bfcb ] When using something other than 8 spaces per tab, this ascii art makes not sense, and the reader might end up wondering what this advanced equation "is". Signed-off-by: Odin Ugedal <odin@uged.al> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lkml.kernel.org/r/20210518125202.78658-4-odin@uged.al Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-06-16sched/fair: Make sure to update tg contrib for blocked loadVincent Guittot
commit 02da26ad5ed6ea8680e5d01f20661439611ed776 upstream. During the update of fair blocked load (__update_blocked_fair()), we update the contribution of the cfs in tg->load_avg if cfs_rq's pelt has decayed. Nevertheless, the pelt values of a cfs_rq could have been recently updated while propagating the change of a child. In this case, cfs_rq's pelt will not decayed because it has already been updated and we don't update tg->load_avg. __update_blocked_fair ... for_each_leaf_cfs_rq_safe: child cfs_rq update cfs_rq_load_avg() for child cfs_rq ... update_load_avg(cfs_rq_of(se), se, 0) ... update cfs_rq_load_avg() for parent cfs_rq -propagation of child's load makes parent cfs_rq->load_sum becoming null -UPDATE_TG is not set so it doesn't update parent cfs_rq->tg_load_avg_contrib .. for_each_leaf_cfs_rq_safe: parent cfs_rq update cfs_rq_load_avg() for parent cfs_rq - nothing to do because parent cfs_rq has already been updated recently so cfs_rq->tg_load_avg_contrib is not updated ... parent cfs_rq is decayed list_del_leaf_cfs_rq parent cfs_rq - but it still contibutes to tg->load_avg we must set UPDATE_TG flags when propagting pending load to the parent Fixes: 039ae8bcf7a5 ("sched/fair: Fix O(nr_cgroups) in the load balancing path") Reported-by: Odin Ugedal <odin@uged.al> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Odin Ugedal <odin@uged.al> Link: https://lkml.kernel.org/r/20210527122916.27683-3-vincent.guittot@linaro.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-06-10sched/fair: Optimize select_idle_cpuCheng Jian
commit 60588bfa223ff675b95f866249f90616613fbe31 upstream. select_idle_cpu() will scan the LLC domain for idle CPUs, it's always expensive. so the next commit : 1ad3aaf3fcd2 ("sched/core: Implement new approach to scale select_idle_cpu()") introduces a way to limit how many CPUs we scan. But it consume some CPUs out of 'nr' that are not allowed for the task and thus waste our attempts. The function always return nr_cpumask_bits, and we can't find a CPU which our task is allowed to run. Cpumask may be too big, similar to select_idle_core(), use per_cpu_ptr 'select_idle_mask' to prevent stack overflow. Fixes: 1ad3aaf3fcd2 ("sched/core: Implement new approach to scale select_idle_cpu()") Signed-off-by: Cheng Jian <cj.chengjian@huawei.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Reviewed-by: Valentin Schneider <valentin.schneider@arm.com> Link: https://lkml.kernel.org/r/20191213024530.28052-1-cj.chengjian@huawei.com Signed-off-by: Yang Wei <yang.wei@linux.alibaba.com> Tested-by: Yang Wei <yang.wei@linux.alibaba.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-05-22sched/fair: Fix unfairness caused by missing load decayOdin Ugedal
[ Upstream commit 0258bdfaff5bd13c4d2383150b7097aecd6b6d82 ] This fixes an issue where old load on a cfs_rq is not properly decayed, resulting in strange behavior where fairness can decrease drastically. Real workloads with equally weighted control groups have ended up getting a respective 99% and 1%(!!) of cpu time. When an idle task is attached to a cfs_rq by attaching a pid to a cgroup, the old load of the task is attached to the new cfs_rq and sched_entity by attach_entity_cfs_rq. If the task is then moved to another cpu (and therefore cfs_rq) before being enqueued/woken up, the load will be moved to cfs_rq->removed from the sched_entity. Such a move will happen when enforcing a cpuset on the task (eg. via a cgroup) that force it to move. The load will however not be removed from the task_group itself, making it look like there is a constant load on that cfs_rq. This causes the vruntime of tasks on other sibling cfs_rq's to increase faster than they are supposed to; causing severe fairness issues. If no other task is started on the given cfs_rq, and due to the cpuset it would not happen, this load would never be properly unloaded. With this patch the load will be properly removed inside update_blocked_averages. This also applies to tasks moved to the fair scheduling class and moved to another cpu, and this path will also fix that. For fork, the entity is queued right away, so this problem does not affect that. This applies to cases where the new process is the first in the cfs_rq, issue introduced 3d30544f0212 ("sched/fair: Apply more PELT fixes"), and when there has previously been load on the cgroup but the cgroup was removed from the leaflist due to having null PELT load, indroduced in 039ae8bcf7a5 ("sched/fair: Fix O(nr_cgroups) in the load balancing path"). For a simple cgroup hierarchy (as seen below) with two equally weighted groups, that in theory should get 50/50 of cpu time each, it often leads to a load of 60/40 or 70/30. parent/ cg-1/ cpu.weight: 100 cpuset.cpus: 1 cg-2/ cpu.weight: 100 cpuset.cpus: 1 If the hierarchy is deeper (as seen below), while keeping cg-1 and cg-2 equally weighted, they should still get a 50/50 balance of cpu time. This however sometimes results in a balance of 10/90 or 1/99(!!) between the task groups. $ ps u -C stress USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND root 18568 1.1 0.0 3684 100 pts/12 R+ 13:36 0:00 stress --cpu 1 root 18580 99.3 0.0 3684 100 pts/12 R+ 13:36 0:09 stress --cpu 1 parent/ cg-1/ cpu.weight: 100 sub-group/ cpu.weight: 1 cpuset.cpus: 1 cg-2/ cpu.weight: 100 sub-group/ cpu.weight: 10000 cpuset.cpus: 1 This can be reproduced by attaching an idle process to a cgroup and moving it to a given cpuset before it wakes up. The issue is evident in many (if not most) container runtimes, and has been reproduced with both crun and runc (and therefore docker and all its "derivatives"), and with both cgroup v1 and v2. Fixes: 3d30544f0212 ("sched/fair: Apply more PELT fixes") Fixes: 039ae8bcf7a5 ("sched/fair: Fix O(nr_cgroups) in the load balancing path") Signed-off-by: Odin Ugedal <odin@uged.al> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lkml.kernel.org/r/20210501141950.23622-2-odin@uged.al Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-05-22sched/debug: Fix cgroup_path[] serializationWaiman Long
[ Upstream commit ad789f84c9a145f8a18744c0387cec22ec51651e ] The handling of sysrq key can be activated by echoing the key to /proc/sysrq-trigger or via the magic key sequence typed into a terminal that is connected to the system in some way (serial, USB or other mean). In the former case, the handling is done in a user context. In the latter case, it is likely to be in an interrupt context. Currently in print_cpu() of kernel/sched/debug.c, sched_debug_lock is taken with interrupt disabled for the whole duration of the calls to print_*_stats() and print_rq() which could last for the quite some time if the information dump happens on the serial console. If the system has many cpus and the sched_debug_lock is somehow busy (e.g. parallel sysrq-t), the system may hit a hard lockup panic depending on the actually serial console implementation of the system. The purpose of sched_debug_lock is to serialize the use of the global cgroup_path[] buffer in print_cpu(). The rests of the printk calls don't need serialization from sched_debug_lock. Calling printk() with interrupt disabled can still be problematic if multiple instances are running. Allocating a stack buffer of PATH_MAX bytes is not feasible because of the limited size of the kernel stack. The solution implemented in this patch is to allow only one caller at a time to use the full size group_path[], while other simultaneous callers will have to use shorter stack buffers with the possibility of path name truncation. A "..." suffix will be printed if truncation may have happened. The cgroup path name is provided for informational purpose only, so occasional path name truncation should not be a big problem. Fixes: efe25c2c7b3a ("sched: Reinstate group names in /proc/sched_debug") Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20210415195426.6677-1-longman@redhat.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-12-30sched: Reenable interrupts in do_sched_yield()Thomas Gleixner
[ Upstream commit 345a957fcc95630bf5535d7668a59ed983eb49a7 ] do_sched_yield() invokes schedule() with interrupts disabled which is not allowed. This goes back to the pre git era to commit a6efb709806c ("[PATCH] irqlock patch 2.5.27-H6") in the history tree. Reenable interrupts and remove the misleading comment which "explains" it. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/87r1pt7y5c.fsf@nanos.tec.linutronix.de Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-12-30sched/deadline: Fix sched_dl_global_validate()Peng Liu
[ Upstream commit a57415f5d1e43c3a5c5d412cd85e2792d7ed9b11 ] When change sched_rt_{runtime, period}_us, we validate that the new settings should at least accommodate the currently allocated -dl bandwidth: sched_rt_handler() --> sched_dl_bandwidth_validate() { new_bw = global_rt_runtime()/global_rt_period(); for_each_possible_cpu(cpu) { dl_b = dl_bw_of(cpu); if (new_bw < dl_b->total_bw) <------- ret = -EBUSY; } } But under CONFIG_SMP, dl_bw is per root domain , but not per CPU, dl_b->total_bw is the allocated bandwidth of the whole root domain. Instead, we should compare dl_b->total_bw against "cpus*new_bw", where 'cpus' is the number of CPUs of the root domain. Also, below annotation(in kernel/sched/sched.h) implied implementation only appeared in SCHED_DEADLINE v2[1], then deadline scheduler kept evolving till got merged(v9), but the annotation remains unchanged, meaningless and misleading, update it. * With respect to SMP, the bandwidth is given on a per-CPU basis, * meaning that: * - dl_bw (< 100%) is the bandwidth of the system (group) on each CPU; * - dl_total_bw array contains, in the i-eth element, the currently * allocated bandwidth on the i-eth CPU. [1]: https://lore.kernel.org/lkml/1267385230.13676.101.camel@Palantir/ Fixes: 332ac17ef5bf ("sched/deadline: Add bandwidth management for SCHED_DEADLINE tasks") Signed-off-by: Peng Liu <iwtbavbm@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com> Acked-by: Juri Lelli <juri.lelli@redhat.com> Link: https://lkml.kernel.org/r/db6bbda316048cda7a1bbc9571defde193a8d67e.1602171061.git.iwtbavbm@gmail.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-10-30sched/features: Fix !CONFIG_JUMP_LABEL caseJuri Lelli
[ Upstream commit a73f863af4ce9730795eab7097fb2102e6854365 ] Commit: 765cc3a4b224e ("sched/core: Optimize sched_feat() for !CONFIG_SCHED_DEBUG builds") made sched features static for !CONFIG_SCHED_DEBUG configurations, but overlooked the CONFIG_SCHED_DEBUG=y and !CONFIG_JUMP_LABEL cases. For the latter echoing changes to /sys/kernel/debug/sched_features has the nasty effect of effectively changing what sched_features reports, but without actually changing the scheduler behaviour (since different translation units get different sysctl_sched_features). Fix CONFIG_SCHED_DEBUG=y and !CONFIG_JUMP_LABEL configurations by properly restructuring ifdefs. Fixes: 765cc3a4b224e ("sched/core: Optimize sched_feat() for !CONFIG_SCHED_DEBUG builds") Co-developed-by: Daniel Bristot de Oliveira <bristot@redhat.com> Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com> Signed-off-by: Juri Lelli <juri.lelli@redhat.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Patrick Bellasi <patrick.bellasi@matbug.net> Reviewed-by: Valentin Schneider <valentin.schneider@arm.com> Link: https://lore.kernel.org/r/20201013053114.160628-1-juri.lelli@redhat.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-08-19sched: correct SD_flags returned by tl->sd_flags()Peng Liu
[ Upstream commit 9b1b234bb86bcdcdb142e900d39b599185465dbb ] During sched domain init, we check whether non-topological SD_flags are returned by tl->sd_flags(), if found, fire a waning and correct the violation, but the code failed to correct the violation. Correct this. Fixes: 143e1e28cb40 ("sched: Rework sched_domain topology definition") Signed-off-by: Peng Liu <iwtbavbm@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Reviewed-by: Valentin Schneider <valentin.schneider@arm.com> Link: https://lkml.kernel.org/r/20200609150936.GA13060@iZj6chx1xj0e0buvshuecpZ Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-08-19sched/fair: Fix NOHZ next idle balanceVincent Guittot
[ Upstream commit 3ea2f097b17e13a8280f1f9386c331b326a3dbef ] With commit: 'b7031a02ec75 ("sched/fair: Add NOHZ_STATS_KICK")' rebalance_domains of the local cfs_rq happens before others idle cpus have updated nohz.next_balance and its value is overwritten. Move the update of nohz.next_balance for other idles cpus before balancing and updating the next_balance of local cfs_rq. Also, the nohz.next_balance is now updated only if all idle cpus got a chance to rebalance their domains and the idle balance has not been aborted because of new activities on the CPU. In case of need_resched, the idle load balance will be kick the next jiffie in order to address remaining ilb. Fixes: b7031a02ec75 ("sched/fair: Add NOHZ_STATS_KICK") Reported-by: Peng Liu <iwtbavbm@gmail.com> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <valentin.schneider@arm.com> Acked-by: Mel Gorman <mgorman@suse.de> Link: https://lkml.kernel.org/r/20200609123748.18636-1-vincent.guittot@linaro.org Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-07-22sched/fair: handle case of task_h_load() returning 0Vincent Guittot
commit 01cfcde9c26d8555f0e6e9aea9d6049f87683998 upstream. task_h_load() can return 0 in some situations like running stress-ng mmapfork, which forks thousands of threads, in a sched group on a 224 cores system. The load balance doesn't handle this correctly because env->imbalance never decreases and it will stop pulling tasks only after reaching loop_max, which can be equal to the number of running tasks of the cfs. Make sure that imbalance will be decreased by at least 1. misfit task is the other feature that doesn't handle correctly such situation although it's probably more difficult to face the problem because of the smaller number of CPUs and running tasks on heterogenous system. We can't simply ensure that task_h_load() returns at least one because it would imply to handle underflow in other places. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <valentin.schneider@arm.com> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: <stable@vger.kernel.org> # v4.4+ Link: https://lkml.kernel.org/r/20200710152426.16981-1-vincent.guittot@linaro.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-07-22sched: Fix unreliable rseq cpu_id for new tasksMathieu Desnoyers
commit ce3614daabea8a2d01c1dd17ae41d1ec5e5ae7db upstream. While integrating rseq into glibc and replacing glibc's sched_getcpu implementation with rseq, glibc's tests discovered an issue with incorrect __rseq_abi.cpu_id field value right after the first time a newly created process issues sched_setaffinity. For the records, it triggers after building glibc and running tests, and then issuing: for x in {1..2000} ; do posix/tst-affinity-static & done and shows up as: error: Unexpected CPU 2, expected 0 error: Unexpected CPU 2, expected 0 error: Unexpected CPU 2, expected 0 error: Unexpected CPU 2, expected 0 error: Unexpected CPU 138, expected 0 error: Unexpected CPU 138, expected 0 error: Unexpected CPU 138, expected 0 error: Unexpected CPU 138, expected 0 This is caused by the scheduler invoking __set_task_cpu() directly from sched_fork() and wake_up_new_task(), thus bypassing rseq_migrate() which is done by set_task_cpu(). Add the missing rseq_migrate() to both functions. The only other direct use of __set_task_cpu() is done by init_idle(), which does not involve a user-space task. Based on my testing with the glibc test-case, just adding rseq_migrate() to wake_up_new_task() is sufficient to fix the observed issue. Also add it to sched_fork() to keep things consistent. The reason why this never triggered so far with the rseq/basic_test selftest is unclear. The current use of sched_getcpu(3) does not typically require it to be always accurate. However, use of the __rseq_abi.cpu_id field within rseq critical sections requires it to be accurate. If it is not accurate, it can cause corruption in the per-cpu data targeted by rseq critical sections in user-space. Reported-By: Florian Weimer <fweimer@redhat.com> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-By: Florian Weimer <fweimer@redhat.com> Cc: stable@vger.kernel.org # v4.18+ Link: https://lkml.kernel.org/r/20200707201505.2632-1-mathieu.desnoyers@efficios.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-06-30sched/core: Fix PI boosting between RT and DEADLINE tasksJuri Lelli
[ Upstream commit 740797ce3a124b7dd22b7fb832d87bc8fba1cf6f ] syzbot reported the following warning: WARNING: CPU: 1 PID: 6351 at kernel/sched/deadline.c:628 enqueue_task_dl+0x22da/0x38a0 kernel/sched/deadline.c:1504 At deadline.c:628 we have: 623 static inline void setup_new_dl_entity(struct sched_dl_entity *dl_se) 624 { 625 struct dl_rq *dl_rq = dl_rq_of_se(dl_se); 626 struct rq *rq = rq_of_dl_rq(dl_rq); 627 628 WARN_ON(dl_se->dl_boosted); 629 WARN_ON(dl_time_before(rq_clock(rq), dl_se->deadline)); [...] } Which means that setup_new_dl_entity() has been called on a task currently boosted. This shouldn't happen though, as setup_new_dl_entity() is only called when the 'dynamic' deadline of the new entity is in the past w.r.t. rq_clock and boosted tasks shouldn't verify this condition. Digging through the PI code I noticed that what above might in fact happen if an RT tasks blocks on an rt_mutex hold by a DEADLINE task. In the first branch of boosting conditions we check only if a pi_task 'dynamic' deadline is earlier than mutex holder's and in this case we set mutex holder to be dl_boosted. However, since RT 'dynamic' deadlines are only initialized if such tasks get boosted at some point (or if they become DEADLINE of course), in general RT 'dynamic' deadlines are usually equal to 0 and this verifies the aforementioned condition. Fix it by checking that the potential donor task is actually (even if temporary because in turn boosted) running at DEADLINE priority before using its 'dynamic' deadline value. Fixes: 2d3d891d3344 ("sched/deadline: Add SCHED_DEADLINE inheritance logic") Reported-by: syzbot+119ba87189432ead09b4@syzkaller.appspotmail.com Signed-off-by: Juri Lelli <juri.lelli@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com> Tested-by: Daniel Wagner <dwagner@suse.de> Link: https://lkml.kernel.org/r/20181119153201.GB2119@localhost.localdomain Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-06-30sched/deadline: Initialize ->dl_boostedJuri Lelli
[ Upstream commit ce9bc3b27f2a21a7969b41ffb04df8cf61bd1592 ] syzbot reported the following warning triggered via SYSC_sched_setattr(): WARNING: CPU: 0 PID: 6973 at kernel/sched/deadline.c:593 setup_new_dl_entity /kernel/sched/deadline.c:594 [inline] WARNING: CPU: 0 PID: 6973 at kernel/sched/deadline.c:593 enqueue_dl_entity /kernel/sched/deadline.c:1370 [inline] WARNING: CPU: 0 PID: 6973 at kernel/sched/deadline.c:593 enqueue_task_dl+0x1c17/0x2ba0 /kernel/sched/deadline.c:1441 This happens because the ->dl_boosted flag is currently not initialized by __dl_clear_params() (unlike the other flags) and setup_new_dl_entity() rightfully complains about it. Initialize dl_boosted to 0. Fixes: 2d3d891d3344 ("sched/deadline: Add SCHED_DEADLINE inheritance logic") Reported-by: syzbot+5ac8bac25f95e8b221e7@syzkaller.appspotmail.com Signed-off-by: Juri Lelli <juri.lelli@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Tested-by: Daniel Wagner <dwagner@suse.de> Link: https://lkml.kernel.org/r/20200617072919.818409-1-juri.lelli@redhat.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-06-22sched/core: Fix illegal RCU from offline CPUsPeter Zijlstra
[ Upstream commit bf2c59fce4074e55d622089b34be3a6bc95484fb ] In the CPU-offline process, it calls mmdrop() after idle entry and the subsequent call to cpuhp_report_idle_dead(). Once execution passes the call to rcu_report_dead(), RCU is ignoring the CPU, which results in lockdep complaining when mmdrop() uses RCU from either memcg or debugobjects below. Fix it by cleaning up the active_mm state from BP instead. Every arch which has CONFIG_HOTPLUG_CPU should have already called idle_task_exit() from AP. The only exception is parisc because it switches them to &init_mm unconditionally (see smp_boot_one_cpu() and smp_cpu_init()), but the patch will still work there because it calls mmgrab(&init_mm) in smp_cpu_init() and then should call mmdrop(&init_mm) in finish_cpu(). WARNING: suspicious RCU usage ----------------------------- kernel/workqueue.c:710 RCU or wq_pool_mutex should be held! other info that might help us debug this: RCU used illegally from offline CPU! Call Trace: dump_stack+0xf4/0x164 (unreliable) lockdep_rcu_suspicious+0x140/0x164 get_work_pool+0x110/0x150 __queue_work+0x1bc/0xca0 queue_work_on+0x114/0x120 css_release+0x9c/0xc0 percpu_ref_put_many+0x204/0x230 free_pcp_prepare+0x264/0x570 free_unref_page+0x38/0xf0 __mmdrop+0x21c/0x2c0 idle_task_exit+0x170/0x1b0 pnv_smp_cpu_kill_self+0x38/0x2e0 cpu_die+0x48/0x64 arch_cpu_idle_dead+0x30/0x50 do_idle+0x2f4/0x470 cpu_startup_entry+0x38/0x40 start_secondary+0x7a8/0xa80 start_secondary_resume+0x10/0x14 Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Qian Cai <cai@lca.pw> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc) Link: https://lkml.kernel.org/r/20200401214033.8448-1-cai@lca.pw Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-06-22sched/fair: Don't NUMA balance for kthreadsJens Axboe
[ Upstream commit 18f855e574d9799a0e7489f8ae6fd8447d0dd74a ] Stefano reported a crash with using SQPOLL with io_uring: BUG: kernel NULL pointer dereference, address: 00000000000003b0 CPU: 2 PID: 1307 Comm: io_uring-sq Not tainted 5.7.0-rc7 #11 RIP: 0010:task_numa_work+0x4f/0x2c0 Call Trace: task_work_run+0x68/0xa0 io_sq_thread+0x252/0x3d0 kthread+0xf9/0x130 ret_from_fork+0x35/0x40 which is task_numa_work() oopsing on current->mm being NULL. The task work is queued by task_tick_numa(), which checks if current->mm is NULL at the time of the call. But this state isn't necessarily persistent, if the kthread is using use_mm() to temporarily adopt the mm of a task. Change the task_tick_numa() check to exclude kernel threads in general, as it doesn't make sense to attempt ot balance for kthreads anyway. Reported-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/r/865de121-8190-5d30-ece5-3b097dc74431@kernel.dk Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-04-17sched: Avoid scale real weight down to zeroMichael Wang
[ Upstream commit 26cf52229efc87e2effa9d788f9b33c40fb3358a ] During our testing, we found a case that shares no longer working correctly, the cgroup topology is like: /sys/fs/cgroup/cpu/A (shares=102400) /sys/fs/cgroup/cpu/A/B (shares=2) /sys/fs/cgroup/cpu/A/B/C (shares=1024) /sys/fs/cgroup/cpu/D (shares=1024) /sys/fs/cgroup/cpu/D/E (shares=1024) /sys/fs/cgroup/cpu/D/E/F (shares=1024) The same benchmark is running in group C & F, no other tasks are running, the benchmark is capable to consumed all the CPUs. We suppose the group C will win more CPU resources since it could enjoy all the shares of group A, but it's F who wins much more. The reason is because we have group B with shares as 2, since A->cfs_rq.load.weight == B->se.load.weight == B->shares/nr_cpus, so A->cfs_rq.load.weight become very small. And in calc_group_shares() we calculate shares as: load = max(scale_load_down(cfs_rq->load.weight), cfs_rq->avg.load_avg); shares = (tg_shares * load) / tg_weight; Since the 'cfs_rq->load.weight' is too small, the load become 0 after scale down, although 'tg_shares' is 102400, shares of the se which stand for group A on root cfs_rq become 2. While the se of D on root cfs_rq is far more bigger than 2, so it wins the battle. Thus when scale_load_down() scale real weight down to 0, it's no longer telling the real story, the caller will have the wrong information and the calculation will be buggy. This patch add check in scale_load_down(), so the real weight will be >= MIN_SHARES after scale, after applied the group C wins as expected. Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Michael Wang <yun.wang@linux.alibaba.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lkml.kernel.org/r/38e8e212-59a1-64b2-b247-b6d0b52d8dc1@linux.alibaba.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-03-05sched/fair: Fix O(nr_cgroups) in the load balancing pathVincent Guittot
commit 039ae8bcf7a5f4476f4487e6bf816885fb3fb617 upstream. This re-applies the commit reverted here: commit c40f7d74c741 ("sched/fair: Fix infinite loop in update_blocked_averages() by reverting a9e7f6544b9c") I.e. now that cfs_rq can be safely removed/added in the list, we can re-apply: commit a9e7f6544b9c ("sched/fair: Fix O(nr_cgroups) in load balance path") Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: sargun@sargun.me Cc: tj@kernel.org Cc: xiexiuqi@huawei.com Cc: xiezhipeng1@huawei.com Link: https://lkml.kernel.org/r/1549469662-13614-3-git-send-email-vincent.guittot@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Vishnu Rangayyan <vishnu.rangayyan@apple.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-03-05sched/fair: Optimize update_blocked_averages()Vincent Guittot
commit 31bc6aeaab1d1de8959b67edbed5c7a4b3cdbe7c upstream. Removing a cfs_rq from rq->leaf_cfs_rq_list can break the parent/child ordering of the list when it will be added back. In order to remove an empty and fully decayed cfs_rq, we must remove its children too, so they will be added back in the right order next time. With a normal decay of PELT, a parent will be empty and fully decayed if all children are empty and fully decayed too. In such a case, we just have to ensure that the whole branch will be added when a new task is enqueued. This is default behavior since : commit f6783319737f ("sched/fair: Fix insertion in rq->leaf_cfs_rq_list") In case of throttling, the PELT of throttled cfs_rq will not be updated whereas the parent will. This breaks the assumption made above unless we remove the children of a cfs_rq that is throttled. Then, they will be added back when unthrottled and a sched_entity will be enqueued. As throttled cfs_rq are now removed from the list, we can remove the associated test in update_blocked_averages(). Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: sargun@sargun.me Cc: tj@kernel.org Cc: xiexiuqi@huawei.com Cc: xiezhipeng1@huawei.com Link: https://lkml.kernel.org/r/1549469662-13614-2-git-send-email-vincent.guittot@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Vishnu Rangayyan <vishnu.rangayyan@apple.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-02-01sched/fair: Fix insertion in rq->leaf_cfs_rq_listVincent Guittot
commit f6783319737f28e4436a69611853a5a098cbe974 upstream. Sargun reported a crash: "I picked up c40f7d74c741a907cfaeb73a7697081881c497d0 sched/fair: Fix infinite loop in update_blocked_averages() by reverting a9e7f6544b9c and put it on top of 4.19.13. In addition to this, I uninlined list_add_leaf_cfs_rq for debugging. This revealed a new bug that we didn't get to because we kept getting crashes from the previous issue. When we are running with cgroups that are rapidly changing, with CFS bandwidth control, and in addition using the cpusets cgroup, we see this crash. Specifically, it seems to occur with cgroups that are throttled and we change the allowed cpuset." The algorithm used to order cfs_rq in rq->leaf_cfs_rq_list assumes that it will walk down to root the 1st time a cfs_rq is used and we will finish to add either a cfs_rq without parent or a cfs_rq with a parent that is already on the list. But this is not always true in presence of throttling. Because a cfs_rq can be throttled even if it has never been used but other CPUs of the cgroup have already used all the bandwdith, we are not sure to go down to the root and add all cfs_rq in the list. Ensure that all cfs_rq will be added in the list even if they are throttled. [ mingo: Fix !CGROUPS build. ] Reported-by: Sargun Dhillon <sargun@sargun.me> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: tj@kernel.org Fixes: 9c2791f936ef ("Fix hierarchical order in rq->leaf_cfs_rq_list") Link: https://lkml.kernel.org/r/1548825767-10799-1-git-send-email-vincent.guittot@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Janne Huttunen <janne.huttunen@nokia.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-02-01sched/fair: Add tmp_alone_branch assertionPeter Zijlstra
commit 5d299eabea5a251fbf66e8277704b874bbba92dc upstream. The magic in list_add_leaf_cfs_rq() requires that at the end of enqueue_task_fair(): rq->tmp_alone_branch == &rq->lead_cfs_rq_list If this is violated, list integrity is compromised for list entries and the tmp_alone_branch pointer might dangle. Also, reflow list_add_leaf_cfs_rq() while there. This looses one indentation level and generates a form that's convenient for the next patch. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Janne Huttunen <janne.huttunen@nokia.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-12-31cpufreq: Avoid leaving stale IRQ work items during CPU offlineRafael J. Wysocki
commit 85572c2c4a45a541e880e087b5b17a48198b2416 upstream. The scheduler code calling cpufreq_update_util() may run during CPU offline on the target CPU after the IRQ work lists have been flushed for it, so the target CPU should be prevented from running code that may queue up an IRQ work item on it at that point. Unfortunately, that may not be the case if dvfs_possible_from_any_cpu is set for at least one cpufreq policy in the system, because that allows the CPU going offline to run the utilization update callback of the cpufreq governor on behalf of another (online) CPU in some cases. If that happens, the cpufreq governor callback may queue up an IRQ work on the CPU running it, which is going offline, and the IRQ work may not be flushed after that point. Moreover, that IRQ work cannot be flushed until the "offlining" CPU goes back online, so if any other CPU calls irq_work_sync() to wait for the completion of that IRQ work, it will have to wait until the "offlining" CPU is back online and that may not happen forever. In particular, a system-wide deadlock may occur during CPU online as a result of that. The failing scenario is as follows. CPU0 is the boot CPU, so it creates a cpufreq policy and becomes the "leader" of it (policy->cpu). It cannot go offline, because it is the boot CPU. Next, other CPUs join the cpufreq policy as they go online and they leave it when they go offline. The last CPU to go offline, say CPU3, may queue up an IRQ work while running the governor callback on behalf of CPU0 after leaving the cpufreq policy because of the dvfs_possible_from_any_cpu effect described above. Then, CPU0 is the only online CPU in the system and the stale IRQ work is still queued on CPU3. When, say, CPU1 goes back online, it will run irq_work_sync() to wait for that IRQ work to complete and so it will wait for CPU3 to go back online (which may never happen even in principle), but (worse yet) CPU0 is waiting for CPU1 at that point too and a system-wide deadlock occurs. To address this problem notice that CPUs which cannot run cpufreq utilization update code for themselves (for example, because they have left the cpufreq policies that they belonged to), should also be prevented from running that code on behalf of the other CPUs that belong to a cpufreq policy with dvfs_possible_from_any_cpu set and so in that case the cpufreq_update_util_data pointer of the CPU running the code must not be NULL as well as for the CPU which is the target of the cpufreq utilization update in progress. Accordingly, change cpufreq_this_cpu_can_update() into a regular function in kernel/sched/cpufreq.c (instead of a static inline in a header file) and make it check the cpufreq_update_util_data pointer of the local CPU if dvfs_possible_from_any_cpu is set for the target cpufreq policy. Also update the schedutil governor to do the cpufreq_this_cpu_can_update() check in the non-fast-switch case too to avoid the stale IRQ work issues. Fixes: 99d14d0e16fa ("cpufreq: Process remote callbacks from any CPU if the platform permits") Link: https://lore.kernel.org/linux-pm/20191121093557.bycvdo4xyinbc5cb@vireshk-i7/ Reported-by: Anson Huang <anson.huang@nxp.com> Tested-by: Anson Huang <anson.huang@nxp.com> Cc: 4.14+ <stable@vger.kernel.org> # 4.14+ Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Viresh Kumar <viresh.kumar@linaro.org> Tested-by: Peng Fan <peng.fan@nxp.com> (i.MX8QXP-MEK) Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-12-13sched/fair: Scale bandwidth quota and period without losing quota/period ↵Xuewei Zhang
ratio precision commit 4929a4e6faa0f13289a67cae98139e727f0d4a97 upstream. The quota/period ratio is used to ensure a child task group won't get more bandwidth than the parent task group, and is calculated as: normalized_cfs_quota() = [(quota_us << 20) / period_us] If the quota/period ratio was changed during this scaling due to precision loss, it will cause inconsistency between parent and child task groups. See below example: A userspace container manager (kubelet) does three operations: 1) Create a parent cgroup, set quota to 1,000us and period to 10,000us. 2) Create a few children cgroups. 3) Set quota to 1,000us and period to 10,000us on a child cgroup. These operations are expected to succeed. However, if the scaling of 147/128 happens before step 3, quota and period of the parent cgroup will be changed: new_quota: 1148437ns, 1148us new_period: 11484375ns, 11484us And when step 3 comes in, the ratio of the child cgroup will be 104857, which will be larger than the parent cgroup ratio (104821), and will fail. Scaling them by a factor of 2 will fix the problem. Tested-by: Phil Auld <pauld@redhat.com> Signed-off-by: Xuewei Zhang <xueweiz@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Phil Auld <pauld@redhat.com> Cc: Anton Blanchard <anton@ozlabs.org> Cc: Ben Segall <bsegall@google.com> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vincent Guittot <vincent.guittot@linaro.org> Fixes: 2e8e19226398 ("sched/fair: Limit sched_cfs_period_timer() loop to avoid hard lockup") Link: https://lkml.kernel.org/r/20191004001243.140897-1-xueweiz@google.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-12-13sched/core: Avoid spurious lock dependenciesPeter Zijlstra
[ Upstream commit ff51ff84d82aea5a889b85f2b9fb3aa2b8691668 ] While seemingly harmless, __sched_fork() does hrtimer_init(), which, when DEBUG_OBJETS, can end up doing allocations. This then results in the following lock order: rq->lock zone->lock.rlock batched_entropy_u64.lock Which in turn causes deadlocks when we do wakeups while holding that batched_entropy lock -- as the random code does. Solve this by moving __sched_fork() out from under rq->lock. This is safe because nothing there relies on rq->lock, as also evident from the other __sched_fork() callsite. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Qian Cai <cai@lca.pw> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: akpm@linux-foundation.org Cc: bigeasy@linutronix.de Cc: cl@linux.com Cc: keescook@chromium.org Cc: penberg@kernel.org Cc: rientjes@google.com Cc: thgarnie@google.com Cc: tytso@mit.edu Cc: will@kernel.org Fixes: b7d5dc21072c ("random: add a spinlock_t to struct batched_entropy") Link: https://lkml.kernel.org/r/20191001091837.GK4536@hirez.programming.kicks-ass.net Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-12-01sched/fair: Don't increase sd->balance_interval on newidle balanceValentin Schneider
[ Upstream commit 3f130a37c442d5c4d66531b240ebe9abfef426b5 ] When load_balance() fails to move some load because of task affinity, we end up increasing sd->balance_interval to delay the next periodic balance in the hopes that next time we look, that annoying pinned task(s) will be gone. However, idle_balance() pays no attention to sd->balance_interval, yet it will still lead to an increase in balance_interval in case of pinned tasks. If we're going through several newidle balances (e.g. we have a periodic task), this can lead to a huge increase of the balance_interval in a very small amount of time. To prevent that, don't increase the balance interval when going through a newidle balance. This is a similar approach to what is done in commit 58b26c4c0257 ("sched: Increment cache_nice_tries only on periodic lb"), where we disregard newidle balance and rely on periodic balance for more stable results. Signed-off-by: Valentin Schneider <valentin.schneider@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Dietmar.Eggemann@arm.com Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: patrick.bellasi@arm.com Cc: vincent.guittot@linaro.org Link: http://lkml.kernel.org/r/1537974727-30788-2-git-send-email-valentin.schneider@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-12-01sched/topology: Fix off by one bugPeter Zijlstra
[ Upstream commit 993f0b0510dad98b4e6e39506834dab0d13fd539 ] With the addition of the NUMA identity level, we increased @level by one and will run off the end of the array in the distance sort loop. Fixed: 051f3ca02e46 ("sched/topology: Introduce NUMA identity node sched domain") Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-11-20sched/debug: Explicitly cast sched_feat() to boolPeter Zijlstra
[ Upstream commit 7e6f4c5d600c1c8e2a1d900e65cab319d9b6782e ] LLVM has a warning that tags expressions like: if (foo && non-bool-const) This pattern triggers for CONFIG_SCHED_DEBUG=n where sched_feat() ends up being whatever bit we select. Avoid the warning with an explicit cast to bool. Reported-by: Philipp Klocke <philipp97kl@gmail.com> Tested-by: Nick Desaulniers <ndesaulniers@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-11-12sched/fair: Fix -Wunused-but-set-variable warningsQian Cai
commit 763a9ec06c409dcde2a761aac4bb83ff3938e0b3 upstream. Commit: de53fd7aedb1 ("sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices") introduced a few compilation warnings: kernel/sched/fair.c: In function '__refill_cfs_bandwidth_runtime': kernel/sched/fair.c:4365:6: warning: variable 'now' set but not used [-Wunused-but-set-variable] kernel/sched/fair.c: In function 'start_cfs_bandwidth': kernel/sched/fair.c:4992:6: warning: variable 'overrun' set but not used [-Wunused-but-set-variable] Also, __refill_cfs_bandwidth_runtime() does no longer update the expiration time, so fix the comments accordingly. Signed-off-by: Qian Cai <cai@lca.pw> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Ben Segall <bsegall@google.com> Reviewed-by: Dave Chiluk <chiluk+linux@indeed.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: pauld@redhat.com Fixes: de53fd7aedb1 ("sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices") Link: https://lkml.kernel.org/r/1566326455-8038-1-git-send-email-cai@lca.pw Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-11-12sched/fair: Fix low cpu usage with high throttling by removing expiration of ↵Dave Chiluk
cpu-local slices commit de53fd7aedb100f03e5d2231cfce0e4993282425 upstream. It has been observed, that highly-threaded, non-cpu-bound applications running under cpu.cfs_quota_us constraints can hit a high percentage of periods throttled while simultaneously not consuming the allocated amount of quota. This use case is typical of user-interactive non-cpu bound applications, such as those running in kubernetes or mesos when run on multiple cpu cores. This has been root caused to cpu-local run queue being allocated per cpu bandwidth slices, and then not fully using that slice within the period. At which point the slice and quota expires. This expiration of unused slice results in applications not being able to utilize the quota for which they are allocated. The non-expiration of per-cpu slices was recently fixed by 'commit 512ac999d275 ("sched/fair: Fix bandwidth timer clock drift condition")'. Prior to that it appears that this had been broken since at least 'commit 51f2176d74ac ("sched/fair: Fix unlocked reads of some cfs_b->quota/period")' which was introduced in v3.16-rc1 in 2014. That added the following conditional which resulted in slices never being expired. if (cfs_rq->runtime_expires != cfs_b->runtime_expires) { /* extend local deadline, drift is bounded above by 2 ticks */ cfs_rq->runtime_expires += TICK_NSEC; Because this was broken for nearly 5 years, and has recently been fixed and is now being noticed by many users running kubernetes (https://github.com/kubernetes/kubernetes/issues/67577) it is my opinion that the mechanisms around expiring runtime should be removed altogether. This allows quota already allocated to per-cpu run-queues to live longer than the period boundary. This allows threads on runqueues that do not use much CPU to continue to use their remaining slice over a longer period of time than cpu.cfs_period_us. However, this helps prevent the above condition of hitting throttling while also not fully utilizing your cpu quota. This theoretically allows a machine to use slightly more than its allotted quota in some periods. This overflow would be bounded by the remaining quota left on each per-cpu runqueueu. This is typically no more than min_cfs_rq_runtime=1ms per cpu. For CPU bound tasks this will change nothing, as they should theoretically fully utilize all of their quota in each period. For user-interactive tasks as described above this provides a much better user/application experience as their cpu utilization will more closely match the amount they requested when they hit throttling. This means that cpu limits no longer strictly apply per period for non-cpu bound applications, but that they are still accurate over longer timeframes. This greatly improves performance of high-thread-count, non-cpu bound applications with low cfs_quota_us allocation on high-core-count machines. In the case of an artificial testcase (10ms/100ms of quota on 80 CPU machine), this commit resulted in almost 30x performance improvement, while still maintaining correct cpu quota restrictions. That testcase is available at https://github.com/indeedeng/fibtest. Fixes: 512ac999d275 ("sched/fair: Fix bandwidth timer clock drift condition") Signed-off-by: Dave Chiluk <chiluk+linux@indeed.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Phil Auld <pauld@redhat.com> Reviewed-by: Ben Segall <bsegall@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: John Hammond <jhammond@indeed.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Kyle Anderson <kwa@yelp.com> Cc: Gabriel Munos <gmunoz@netflix.com> Cc: Peter Oskolkov <posk@posk.io> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: Brendan Gregg <bgregg@netflix.com> Link: https://lkml.kernel.org/r/1563900266-19734-2-git-send-email-chiluk+linux@indeed.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-11-06sched/vtime: Fix guest/system mis-accounting on task switchFrederic Weisbecker
[ Upstream commit 68e7a4d66b0ce04bf18ff2ffded5596ab3618585 ] vtime_account_system() assumes that the target task to account cputime to is always the current task. This is most often true indeed except on task switch where we call: vtime_common_task_switch(prev) vtime_account_system(prev) Here prev is the scheduling-out task where we account the cputime to. It doesn't match current that is already the scheduling-in task at this stage of the context switch. So we end up checking the wrong task flags to determine if we are accounting guest or system time to the previous task. As a result the wrong task is used to check if the target is running in guest mode. We may then spuriously account or leak either system or guest time on task switch. Fix this assumption and also turn vtime_guest_enter/exit() to use the task passed in parameter as well to avoid future similar issues. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Wanpeng Li <wanpengli@tencent.com> Fixes: 2a42eb9594a1 ("sched/cputime: Accumulate vtime on top of nsec clocksource") Link: https://lkml.kernel.org/r/20190925214242.21873-1-frederic@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-10-11sched/core: Fix migration to invalid CPU in __set_cpus_allowed_ptr()KeMeng Shi
[ Upstream commit 714e501e16cd473538b609b3e351b2cc9f7f09ed ] An oops can be triggered in the scheduler when running qemu on arm64: Unable to handle kernel paging request at virtual address ffff000008effe40 Internal error: Oops: 96000007 [#1] SMP Process migration/0 (pid: 12, stack limit = 0x00000000084e3736) pstate: 20000085 (nzCv daIf -PAN -UAO) pc : __ll_sc___cmpxchg_case_acq_4+0x4/0x20 lr : move_queued_task.isra.21+0x124/0x298 ... Call trace: __ll_sc___cmpxchg_case_acq_4+0x4/0x20 __migrate_task+0xc8/0xe0 migration_cpu_stop+0x170/0x180 cpu_stopper_thread+0xec/0x178 smpboot_thread_fn+0x1ac/0x1e8 kthread+0x134/0x138 ret_from_fork+0x10/0x18 __set_cpus_allowed_ptr() will choose an active dest_cpu in affinity mask to migrage the process if process is not currently running on any one of the CPUs specified in affinity mask. __set_cpus_allowed_ptr() will choose an invalid dest_cpu (dest_cpu >= nr_cpu_ids, 1024 in my virtual machine) if CPUS in an affinity mask are deactived by cpu_down after cpumask_intersects check. cpumask_test_cpu() of dest_cpu afterwards is overflown and may pass if corresponding bit is coincidentally set. As a consequence, kernel will access an invalid rq address associate with the invalid CPU in migration_cpu_stop->__migrate_task->move_queued_task and the Oops occurs. The reproduce the crash: 1) A process repeatedly binds itself to cpu0 and cpu1 in turn by calling sched_setaffinity. 2) A shell script repeatedly does "echo 0 > /sys/devices/system/cpu/cpu1/online" and "echo 1 > /sys/devices/system/cpu/cpu1/online" in turn. 3) Oops appears if the invalid CPU is set in memory after tested cpumask. Signed-off-by: KeMeng Shi <shikemeng@huawei.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <valentin.schneider@arm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/1568616808-16808-1-git-send-email-shikemeng@huawei.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-10-11sched/membarrier: Fix private expedited registration checkMathieu Desnoyers
[ Upstream commit fc0d77387cb5ae883fd774fc559e056a8dde024c ] Fix a logic flaw in the way membarrier_register_private_expedited() handles ready state checks for private expedited sync core and private expedited registrations. If a private expedited membarrier registration is first performed, and then a private expedited sync_core registration is performed, the ready state check will skip the second registration when it really should not. Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Chris Metcalf <cmetcalf@ezchip.com> Cc: Christoph Lameter <cl@linux.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Kirill Tkhai <tkhai@yandex.ru> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Paul E. McKenney <paulmck@linux.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Russell King - ARM Linux admin <linux@armlinux.org.uk> Cc: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20190919173705.2181-2-mathieu.desnoyers@efficios.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-10-05sched/cpufreq: Align trace event behavior of fast switchingDouglas RAILLARD
[ Upstream commit 77c84dd1881d0f0176cb678d770bfbda26c54390 ] Fast switching path only emits an event for the CPU of interest, whereas the regular path emits an event for all the CPUs that had their frequency changed, i.e. all the CPUs sharing the same policy. With the current behavior, looking at cpu_frequency event for a given CPU that is using the fast switching path will not give the correct frequency signal. Signed-off-by: Douglas RAILLARD <douglas.raillard@arm.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-10-05idle: Prevent late-arriving interrupts from disrupting offlinePeter Zijlstra
[ Upstream commit e78a7614f3876ac649b3df608789cb6ef74d0480 ] Scheduling-clock interrupts can arrive late in the CPU-offline process, after idle entry and the subsequent call to cpuhp_report_idle_dead(). Once execution passes the call to rcu_report_dead(), RCU is ignoring the CPU, which results in lockdep complaints when the interrupt handler uses RCU: ------------------------------------------------------------------------ ============================= WARNING: suspicious RCU usage 5.2.0-rc1+ #681 Not tainted ----------------------------- kernel/sched/fair.c:9542 suspicious rcu_dereference_check() usage! other info that might help us debug this: RCU used illegally from offline CPU! rcu_scheduler_active = 2, debug_locks = 1 no locks held by swapper/5/0. stack backtrace: CPU: 5 PID: 0 Comm: swapper/5 Not tainted 5.2.0-rc1+ #681 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS Bochs 01/01/2011 Call Trace: <IRQ> dump_stack+0x5e/0x8b trigger_load_balance+0xa8/0x390 ? tick_sched_do_timer+0x60/0x60 update_process_times+0x3b/0x50 tick_sched_handle+0x2f/0x40 tick_sched_timer+0x32/0x70 __hrtimer_run_queues+0xd3/0x3b0 hrtimer_interrupt+0x11d/0x270 ? sched_clock_local+0xc/0x74 smp_apic_timer_interrupt+0x79/0x200 apic_timer_interrupt+0xf/0x20 </IRQ> RIP: 0010:delay_tsc+0x22/0x50 Code: ff 0f 1f 80 00 00 00 00 65 44 8b 05 18 a7 11 48 0f ae e8 0f 31 48 89 d6 48 c1 e6 20 48 09 c6 eb 0e f3 90 65 8b 05 fe a6 11 48 <41> 39 c0 75 18 0f ae e8 0f 31 48 c1 e2 20 48 09 c2 48 89 d0 48 29 RSP: 0000:ffff8f92c0157ed0 EFLAGS: 00000212 ORIG_RAX: ffffffffffffff13 RAX: 0000000000000005 RBX: ffff8c861f356400 RCX: ffff8f92c0157e64 RDX: 000000321214c8cc RSI: 00000032120daa7f RDI: 0000000000260f15 RBP: 0000000000000005 R08: 0000000000000005 R09: 0000000000000000 R10: 0000000000000001 R11: 0000000000000001 R12: 0000000000000000 R13: 0000000000000000 R14: ffff8c861ee18000 R15: ffff8c861ee18000 cpuhp_report_idle_dead+0x31/0x60 do_idle+0x1d5/0x200 ? _raw_spin_unlock_irqrestore+0x2d/0x40 cpu_startup_entry+0x14/0x20 start_secondary+0x151/0x170 secondary_startup_64+0xa4/0xb0 ------------------------------------------------------------------------ This happens rarely, but can be forced by happen more often by placing delays in cpuhp_report_idle_dead() following the call to rcu_report_dead(). With this in place, the following rcutorture scenario reproduces the problem within a few minutes: tools/testing/selftests/rcutorture/bin/kvm.sh --cpus 8 --duration 5 --kconfig "CONFIG_DEBUG_LOCK_ALLOC=y CONFIG_PROVE_LOCKING=y" --configs "TREE04" This commit uses the crude but effective expedient of moving the disabling of interrupts within the idle loop to precede the cpu_is_offline() check. It also invokes tick_nohz_idle_stop_tick() instead of tick_nohz_idle_stop_tick_protected() to shut off the scheduling-clock interrupt. Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@kernel.org> [ paulmck: Revert tick_nohz_idle_stop_tick_protected() removal, new callers. ] Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-10-05sched/fair: Use rq_lock/unlock in online_fair_sched_groupPhil Auld
[ Upstream commit a46d14eca7b75fffe35603aa8b81df654353d80f ] Enabling WARN_DOUBLE_CLOCK in /sys/kernel/debug/sched_features causes warning to fire in update_rq_clock. This seems to be caused by onlining a new fair sched group not using the rq lock wrappers. [] rq->clock_update_flags & RQCF_UPDATED [] WARNING: CPU: 5 PID: 54385 at kernel/sched/core.c:210 update_rq_clock+0xec/0x150 [] Call Trace: [] online_fair_sched_group+0x53/0x100 [] cpu_cgroup_css_online+0x16/0x20 [] online_css+0x1c/0x60 [] cgroup_apply_control_enable+0x231/0x3b0 [] cgroup_mkdir+0x41b/0x530 [] kernfs_iop_mkdir+0x61/0xa0 [] vfs_mkdir+0x108/0x1a0 [] do_mkdirat+0x77/0xe0 [] do_syscall_64+0x55/0x1d0 [] entry_SYSCALL_64_after_hwframe+0x44/0xa9 Using the wrappers in online_fair_sched_group instead of the raw locking removes this warning. [ tglx: Use rq_*lock_irq() ] Signed-off-by: Phil Auld <pauld@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Ingo Molnar <mingo@kernel.org> Link: https://lkml.kernel.org/r/20190801133749.11033-1-pauld@redhat.com Signed-off-by: Sasha Levin <sashal@kernel.org>