aboutsummaryrefslogtreecommitdiffstats
path: root/kernel/sched/rt.c
AgeCommit message (Collapse)Author
2022-08-17nohz/full, sched/rt: Fix missed tick-reenabling bug in dequeue_task_rt()Nicolas Saenz Julienne
[ Upstream commit 5c66d1b9b30f737fcef85a0b75bfe0590e16b62a ] dequeue_task_rt() only decrements 'rt_rq->rt_nr_running' after having called sched_update_tick_dependency() preventing it from re-enabling the tick on systems that no longer have pending SCHED_RT tasks but have multiple runnable SCHED_OTHER tasks: dequeue_task_rt() dequeue_rt_entity() dequeue_rt_stack() dequeue_top_rt_rq() sub_nr_running() // decrements rq->nr_running sched_update_tick_dependency() sched_can_stop_tick() // checks rq->rt.rt_nr_running, ... __dequeue_rt_entity() dec_rt_tasks() // decrements rq->rt.rt_nr_running ... Every other scheduler class performs the operation in the opposite order, and sched_update_tick_dependency() expects the values to be updated as such. So avoid the misbehaviour by inverting the order in which the above operations are performed in the RT scheduler. Fixes: 76d92ac305f2 ("sched: Migrate sched to use new tick dependency mask model") Signed-off-by: Nicolas Saenz Julienne <nsaenzju@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Reviewed-by: Phil Auld <pauld@redhat.com> Link: https://lore.kernel.org/r/20220628092259.330171-1-nsaenzju@redhat.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-05-26Merge tag 'sysctl-5.19-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux Pull sysctl updates from Luis Chamberlain: "For two kernel releases now kernel/sysctl.c has been being cleaned up slowly, since the tables were grossly long, sprinkled with tons of #ifdefs and all this caused merge conflicts with one susbystem or another. This tree was put together to help try to avoid conflicts with these cleanups going on different trees at time. So nothing exciting on this pull request, just cleanups. Thanks a lot to the Uniontech and Huawei folks for doing some of this nasty work" * tag 'sysctl-5.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux: (28 commits) sched: Fix build warning without CONFIG_SYSCTL reboot: Fix build warning without CONFIG_SYSCTL kernel/kexec_core: move kexec_core sysctls into its own file sysctl: minor cleanup in new_dir() ftrace: fix building with SYSCTL=y but DYNAMIC_FTRACE=n fs/proc: Introduce list_for_each_table_entry for proc sysctl mm: fix unused variable kernel warning when SYSCTL=n latencytop: move sysctl to its own file ftrace: fix building with SYSCTL=n but DYNAMIC_FTRACE=y ftrace: Fix build warning ftrace: move sysctl_ftrace_enabled to ftrace.c kernel/do_mount_initrd: move real_root_dev sysctls to its own file kernel/delayacct: move delayacct sysctls to its own file kernel/acct: move acct sysctls to its own file kernel/panic: move panic sysctls to its own file kernel/lockdep: move lockdep sysctls to its own file mm: move page-writeback sysctls to their own file mm: move oom_kill sysctls to their own file kernel/reboot: move reboot sysctls to its own file sched: Move energy_aware sysctls to topology.c ...
2022-05-11sched/core: Avoid obvious double update_rq_clock warningHao Jia
When we use raw_spin_rq_lock() to acquire the rq lock and have to update the rq clock while holding the lock, the kernel may issue a WARN_DOUBLE_CLOCK warning. Since we directly use raw_spin_rq_lock() to acquire rq lock instead of rq_lock(), there is no corresponding change to rq->clock_update_flags. In particular, we have obtained the rq lock of other CPUs, the rq->clock_update_flags of this CPU may be RQCF_UPDATED at this time, and then calling update_rq_clock() will trigger the WARN_DOUBLE_CLOCK warning. So we need to clear RQCF_UPDATED of rq->clock_update_flags to avoid the WARN_DOUBLE_CLOCK warning. For the sched_rt_period_timer() and migrate_task_rq_dl() cases we simply replace raw_spin_rq_lock()/raw_spin_rq_unlock() with rq_lock()/rq_unlock(). For the {pull,push}_{rt,dl}_task() cases, we add the double_rq_clock_clear_update() function to clear RQCF_UPDATED of rq->clock_update_flags, and call double_rq_clock_clear_update() before double_lock_balance()/double_rq_lock() returns to avoid the WARN_DOUBLE_CLOCK warning. Some call trace reports: Call Trace 1: <IRQ> sched_rt_period_timer+0x10f/0x3a0 ? enqueue_top_rt_rq+0x110/0x110 __hrtimer_run_queues+0x1a9/0x490 hrtimer_interrupt+0x10b/0x240 __sysvec_apic_timer_interrupt+0x8a/0x250 sysvec_apic_timer_interrupt+0x9a/0xd0 </IRQ> <TASK> asm_sysvec_apic_timer_interrupt+0x12/0x20 Call Trace 2: <TASK> activate_task+0x8b/0x110 push_rt_task.part.108+0x241/0x2c0 push_rt_tasks+0x15/0x30 finish_task_switch+0xaa/0x2e0 ? __switch_to+0x134/0x420 __schedule+0x343/0x8e0 ? hrtimer_start_range_ns+0x101/0x340 schedule+0x4e/0xb0 do_nanosleep+0x8e/0x160 hrtimer_nanosleep+0x89/0x120 ? hrtimer_init_sleeper+0x90/0x90 __x64_sys_nanosleep+0x96/0xd0 do_syscall_64+0x34/0x90 entry_SYSCALL_64_after_hwframe+0x44/0xae Call Trace 3: <TASK> deactivate_task+0x93/0xe0 pull_rt_task+0x33e/0x400 balance_rt+0x7e/0x90 __schedule+0x62f/0x8e0 do_task_dead+0x3f/0x50 do_exit+0x7b8/0xbb0 do_group_exit+0x2d/0x90 get_signal+0x9df/0x9e0 ? preempt_count_add+0x56/0xa0 ? __remove_hrtimer+0x35/0x70 arch_do_signal_or_restart+0x36/0x720 ? nanosleep_copyout+0x39/0x50 ? do_nanosleep+0x131/0x160 ? audit_filter_inodes+0xf5/0x120 exit_to_user_mode_prepare+0x10f/0x1e0 syscall_exit_to_user_mode+0x17/0x30 do_syscall_64+0x40/0x90 entry_SYSCALL_64_after_hwframe+0x44/0xae Call Trace 4: update_rq_clock+0x128/0x1a0 migrate_task_rq_dl+0xec/0x310 set_task_cpu+0x84/0x1e4 try_to_wake_up+0x1d8/0x5c0 wake_up_process+0x1c/0x30 hrtimer_wakeup+0x24/0x3c __hrtimer_run_queues+0x114/0x270 hrtimer_interrupt+0xe8/0x244 arch_timer_handler_phys+0x30/0x50 handle_percpu_devid_irq+0x88/0x140 generic_handle_domain_irq+0x40/0x60 gic_handle_irq+0x48/0xe0 call_on_irq_stack+0x2c/0x60 do_interrupt_handler+0x80/0x84 Steps to reproduce: 1. Enable CONFIG_SCHED_DEBUG when compiling the kernel 2. echo 1 > /sys/kernel/debug/clear_warn_once echo "WARN_DOUBLE_CLOCK" > /sys/kernel/debug/sched/features echo "NO_RT_PUSH_IPI" > /sys/kernel/debug/sched/features 3. Run some rt/dl tasks that periodically work and sleep, e.g. Create 2*n rt or dl (90% running) tasks via rt-app (on a system with n CPUs), and Dietmar Eggemann reports Call Trace 4 when running on PREEMPT_RT kernel. Signed-off-by: Hao Jia <jiahao.os@bytedance.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Link: https://lore.kernel.org/r/20220430085843.62939-2-jiahao.os@bytedance.com
2022-04-06sched/rt: fix build error when CONFIG_SYSCTL is disableBaisong Zhong
Avoid random build errors which do not select CONFIG_SYSCTL by depending on it in Kconfig. This fixes the following warning: In file included from kernel/sched/build_policy.c:43: At top level: kernel/sched/rt.c:3017:12: error: ‘sched_rr_handler’ defined but not used [-Werror=unused-function] 3017 | static int sched_rr_handler(struct ctl_table *table, int write, void *buffer, | ^~~~~~~~~~~~~~~~ kernel/sched/rt.c:2978:12: error: ‘sched_rt_handler’ defined but not used [-Werror=unused-function] 2978 | static int sched_rt_handler(struct ctl_table *table, int write, void *buffer, | ^~~~~~~~~~~~~~~~ cc1: all warnings being treated as errors make[2]: *** [scripts/Makefile.build:310: kernel/sched/build_policy.o] Error 1 make[1]: *** [scripts/Makefile.build:638: kernel/sched] Error 2 make[1]: *** Waiting for unfinished jobs.... Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Baisong Zhong <zhongbaisong@huawei.com> [mcgrof: small build fix, we need sched_rt_can_attach() even when CONFIG_SYSCTL is disabled] Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2022-04-06sched: Move rr_timeslice sysctls to rt.cZhen Ni
move rr_timeslice sysctls to rt.c and use the new register_sysctl_init() to register the sysctl interface. Signed-off-by: Zhen Ni <nizhen@uniontech.com> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2022-04-06sched: Move rt_period/runtime sysctls to rt.cZhen Ni
move rt_period/runtime sysctls to rt.c and use the new register_sysctl_init() to register the sysctl interface. Signed-off-by: Zhen Ni <nizhen@uniontech.com> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2022-03-15Merge branch 'sched/fast-headers' into sched/coreIngo Molnar
Merge the scheduler build speedup of the fast-headers tree. Cumulative scheduler (kernel/sched/) build time speedup on a Linux distribution's config, which enables all scheduler features, compared to the vanilla kernel: _____________________________________________________________________________ | | Vanilla kernel (v5.13-rc7): |_____________________________________________________________________________ | | Performance counter stats for 'make -j96 kernel/sched/' (3 runs): | | 126,975,564,374 instructions # 1.45 insn per cycle ( +- 0.00% ) | 87,637,847,671 cycles # 3.959 GHz ( +- 0.30% ) | 22,136.96 msec cpu-clock # 7.499 CPUs utilized ( +- 0.29% ) | | 2.9520 +- 0.0169 seconds time elapsed ( +- 0.57% ) |_____________________________________________________________________________ | | Patched kernel: |_____________________________________________________________________________ | | Performance counter stats for 'make -j96 kernel/sched/' (3 runs): | | 50,420,496,914 instructions # 1.47 insn per cycle ( +- 0.00% ) | 34,234,322,038 cycles # 3.946 GHz ( +- 0.31% ) | 8,675.81 msec cpu-clock # 3.053 CPUs utilized ( +- 0.45% ) | | 2.8420 +- 0.0181 seconds time elapsed ( +- 0.64% ) |_____________________________________________________________________________ Summary: - CPU time used to build the scheduler dropped by -60.9%, a reduction from 22.1 clock-seconds to 8.7 clock-seconds. - Wall-clock time to build the scheduler dropped by -3.9%, a reduction from 2.95 seconds to 2.84 seconds. Signed-off-by: Ingo Molnar <mingo@kernel.org>
2022-03-08sched/deadline,rt: Remove unused parameter from pick_next_[rt|dl]_entity()Dietmar Eggemann
The `struct rq *rq` parameter isn't used. Remove it. Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Juri Lelli <juri.lelli@redhat.com> Link: https://lore.kernel.org/r/20220302183433.333029-7-dietmar.eggemann@arm.com
2022-03-08sched/deadline,rt: Remove unused functions for !CONFIG_SMPDietmar Eggemann
The need_pull_[rt|dl]_task() and pull_[rt|dl]_task() functions are not used on a !CONFIG_SMP system. Remove them. Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Juri Lelli <juri.lelli@redhat.com> Link: https://lore.kernel.org/r/20220302183433.333029-6-dietmar.eggemann@arm.com
2022-03-01sched/rt: Plug rt_mutex_setprio() vs push_rt_task() raceValentin Schneider
John reported that push_rt_task() can end up invoking find_lowest_rq(rq->curr) when curr is not an RT task (in this case a CFS one), which causes mayhem down convert_prio(). This can happen when current gets demoted to e.g. CFS when releasing an rt_mutex, and the local CPU gets hit with an rto_push_work irqwork before getting the chance to reschedule. Exactly who triggers this work isn't entirely clear to me - switched_from_rt() only invokes rt_queue_pull_task() if there are no RT tasks on the local RQ, which means the local CPU can't be in the rto_mask. My current suspected sequence is something along the lines of the below, with the demoted task being current. mark_wakeup_next_waiter() rt_mutex_adjust_prio() rt_mutex_setprio() // deboost originally-CFS task check_class_changed() switched_from_rt() // Only rt_queue_pull_task() if !rq->rt.rt_nr_running switched_to_fair() // Sets need_resched __balance_callbacks() // if pull_rt_task(), tell_cpu_to_push() can't select local CPU per the above raw_spin_rq_unlock(rq) // need_resched is set, so task_woken_rt() can't // invoke push_rt_tasks(). Best I can come up with is // local CPU has rt_nr_migratory >= 2 after the demotion, so stays // in the rto_mask, and then: <some other CPU running rto_push_irq_work_func() queues rto_push_work on this CPU> push_rt_task() // breakage follows here as rq->curr is CFS Move an existing check to check rq->curr vs the next pushable task's priority before getting anywhere near find_lowest_rq(). While at it, add an explicit sched_class of rq->curr check prior to invoking find_lowest_rq(rq->curr). Align the DL logic to also reschedule regardless of next_task's migratability. Fixes: a7c81556ec4d ("sched: Fix migrate_disable() vs rt/dl balancing") Reported-by: John Keeping <john@metanate.com> Signed-off-by: Valentin Schneider <valentin.schneider@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Tested-by: John Keeping <john@metanate.com> Link: https://lore.kernel.org/r/20220127154059.974729-1-valentin.schneider@arm.com
2022-02-23sched/headers: Introduce kernel/sched/build_policy.c and build multiple .c ↵Ingo Molnar
files there Similarly to kernel/sched/build_utility.c, collect all 'scheduling policy' related source code files into kernel/sched/build_policy.c: kernel/sched/idle.c kernel/sched/rt.c kernel/sched/cpudeadline.c kernel/sched/pelt.c kernel/sched/cputime.c kernel/sched/deadline.c With the exception of fair.c, which we continue to build as a separate file for build efficiency and parallelism reasons. Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Peter Zijlstra <peterz@infradead.org>
2021-12-07sched/rt: Try to restart rt period timer when rt runtime exceededLi Hua
When rt_runtime is modified from -1 to a valid control value, it may cause the task to be throttled all the time. Operations like the following will trigger the bug. E.g: 1. echo -1 > /proc/sys/kernel/sched_rt_runtime_us 2. Run a FIFO task named A that executes while(1) 3. echo 950000 > /proc/sys/kernel/sched_rt_runtime_us When rt_runtime is -1, The rt period timer will not be activated when task A enqueued. And then the task will be throttled after setting rt_runtime to 950,000. The task will always be throttled because the rt period timer is not activated. Fixes: d0b27fa77854 ("sched: rt-group: synchonised bandwidth period") Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Li Hua <hucool.lihua@huawei.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20211203033618.11895-1-hucool.lihua@huawei.com
2021-11-11sched/fair: Prevent dead task groups from regaining cfs_rq'sMathias Krause
Kevin is reporting crashes which point to a use-after-free of a cfs_rq in update_blocked_averages(). Initial debugging revealed that we've live cfs_rq's (on_list=1) in an about to be kfree()'d task group in free_fair_sched_group(). However, it was unclear how that can happen. His kernel config happened to lead to a layout of struct sched_entity that put the 'my_q' member directly into the middle of the object which makes it incidentally overlap with SLUB's freelist pointer. That, in combination with SLAB_FREELIST_HARDENED's freelist pointer mangling, leads to a reliable access violation in form of a #GP which made the UAF fail fast. Michal seems to have run into the same issue[1]. He already correctly diagnosed that commit a7b359fc6a37 ("sched/fair: Correctly insert cfs_rq's to list on unthrottle") is causing the preconditions for the UAF to happen by re-adding cfs_rq's also to task groups that have no more running tasks, i.e. also to dead ones. His analysis, however, misses the real root cause and it cannot be seen from the crash backtrace only, as the real offender is tg_unthrottle_up() getting called via sched_cfs_period_timer() via the timer interrupt at an inconvenient time. When unregister_fair_sched_group() unlinks all cfs_rq's from the dying task group, it doesn't protect itself from getting interrupted. If the timer interrupt triggers while we iterate over all CPUs or after unregister_fair_sched_group() has finished but prior to unlinking the task group, sched_cfs_period_timer() will execute and walk the list of task groups, trying to unthrottle cfs_rq's, i.e. re-add them to the dying task group. These will later -- in free_fair_sched_group() -- be kfree()'ed while still being linked, leading to the fireworks Kevin and Michal are seeing. To fix this race, ensure the dying task group gets unlinked first. However, simply switching the order of unregistering and unlinking the task group isn't sufficient, as concurrent RCU walkers might still see it, as can be seen below: CPU1: CPU2: : timer IRQ: : do_sched_cfs_period_timer(): : : : distribute_cfs_runtime(): : rcu_read_lock(); : : : unthrottle_cfs_rq(): sched_offline_group(): : : walk_tg_tree_from(…,tg_unthrottle_up,…): list_del_rcu(&tg->list); : (1) : list_for_each_entry_rcu(child, &parent->children, siblings) : : (2) list_del_rcu(&tg->siblings); : : tg_unthrottle_up(): unregister_fair_sched_group(): struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)]; : : list_del_leaf_cfs_rq(tg->cfs_rq[cpu]); : : : : if (!cfs_rq_is_decayed(cfs_rq) || cfs_rq->nr_running) (3) : list_add_leaf_cfs_rq(cfs_rq); : : : : : : : : : : (4) : rcu_read_unlock(); CPU 2 walks the task group list in parallel to sched_offline_group(), specifically, it'll read the soon to be unlinked task group entry at (1). Unlinking it on CPU 1 at (2) therefore won't prevent CPU 2 from still passing it on to tg_unthrottle_up(). CPU 1 now tries to unlink all cfs_rq's via list_del_leaf_cfs_rq() in unregister_fair_sched_group(). Meanwhile CPU 2 will re-add some of these at (3), which is the cause of the UAF later on. To prevent this additional race from happening, we need to wait until walk_tg_tree_from() has finished traversing the task groups, i.e. after the RCU read critical section ends in (4). Afterwards we're safe to call unregister_fair_sched_group(), as each new walk won't see the dying task group any more. On top of that, we need to wait yet another RCU grace period after unregister_fair_sched_group() to ensure print_cfs_stats(), which might run concurrently, always sees valid objects, i.e. not already free'd ones. This patch survives Michal's reproducer[2] for 8h+ now, which used to trigger within minutes before. [1] https://lore.kernel.org/lkml/20211011172236.11223-1-mkoutny@suse.com/ [2] https://lore.kernel.org/lkml/20211102160228.GA57072@blackbody.suse.cz/ Fixes: a7b359fc6a37 ("sched/fair: Correctly insert cfs_rq's to list on unthrottle") [peterz: shuffle code around a bit] Reported-by: Kevin Tanguy <kevin.tanguy@corp.ovh.com> Signed-off-by: Mathias Krause <minipli@grsecurity.net> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2021-10-05sched/rt: Support schedstats for RT sched classYafang Shao
We want to measure the latency of RT tasks in our production environment with schedstats facility, but currently schedstats is only supported for fair sched class. This patch enable it for RT sched class as well. After we make the struct sched_statistics and the helpers of it independent of fair sched class, we can easily use the schedstats facility for RT sched class. The schedstat usage in RT sched class is similar with fair sched class, for example, fair RT enqueue update_stats_enqueue_fair update_stats_enqueue_rt dequeue update_stats_dequeue_fair update_stats_dequeue_rt put_prev_task update_stats_wait_start update_stats_wait_start_rt set_next_task update_stats_wait_end update_stats_wait_end_rt The user can get the schedstats information in the same way in fair sched class. For example, fair RT /proc/[pid]/sched /proc/[pid]/sched schedstats is not supported for RT group. The output of a RT task's schedstats as follows, $ cat /proc/10349/sched ... sum_sleep_runtime : 972.434535 sum_block_runtime : 960.433522 wait_start : 188510.871584 sleep_start : 0.000000 block_start : 0.000000 sleep_max : 12.001013 block_max : 952.660622 exec_max : 0.049629 slice_max : 0.000000 wait_max : 0.018538 wait_sum : 0.424340 wait_count : 49 iowait_sum : 956.495640 iowait_count : 24 nr_migrations_cold : 0 nr_failed_migrations_affine : 0 nr_failed_migrations_running : 0 nr_failed_migrations_hot : 0 nr_forced_migrations : 0 nr_wakeups : 49 nr_wakeups_sync : 0 nr_wakeups_migrate : 0 nr_wakeups_local : 49 nr_wakeups_remote : 0 nr_wakeups_affine : 0 nr_wakeups_affine_attempts : 0 nr_wakeups_passive : 0 nr_wakeups_idle : 0 ... The sched:sched_stat_{wait, sleep, iowait, blocked} tracepoints can be used to trace RT tasks as well. The output of these tracepoints for a RT tasks as follows, - runtime stress-10352 [004] d.h. 1035.382286: sched_stat_runtime: comm=stress pid=10352 runtime=995769 [ns] vruntime=0 [ns] [vruntime=0 means it is a RT task] - wait <idle>-0 [004] dN.. 1227.688544: sched_stat_wait: comm=stress pid=10352 delay=46849882 [ns] - blocked kworker/4:1-465 [004] dN.. 1585.676371: sched_stat_blocked: comm=stress pid=17194 delay=189963 [ns] - iowait kworker/4:1-465 [004] dN.. 1585.675330: sched_stat_iowait: comm=stress pid=17189 delay=182848 [ns] - sleep sleep-18194 [023] dN.. 1780.891840: sched_stat_sleep: comm=sleep.sh pid=17767 delay=1001160770 [ns] sleep-18196 [023] dN.. 1781.893208: sched_stat_sleep: comm=sleep.sh pid=17767 delay=1001161970 [ns] sleep-18197 [023] dN.. 1782.894544: sched_stat_sleep: comm=sleep.sh pid=17767 delay=1001128840 [ns] [ In sleep.sh, it sleeps 1 sec each time. ] [lkp@intel.com: reported build failure in earlier version] Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20210905143547.4668-7-laoar.shao@gmail.com
2021-10-05sched/rt: Support sched_stat_runtime tracepoint for RT sched classYafang Shao
The runtime of a RT task has already been there, so we only need to add a tracepoint. One difference between fair task and RT task is that there is no vruntime in RT task. To reuse the sched_stat_runtime tracepoint, '0' is passed as vruntime for RT task. The output of this tracepoint for RT task as follows, stress-9748 [039] d.h. 113.519352: sched_stat_runtime: comm=stress pid=9748 runtime=997573 [ns] vruntime=0 [ns] stress-9748 [039] d.h. 113.520352: sched_stat_runtime: comm=stress pid=9748 runtime=997627 [ns] vruntime=0 [ns] stress-9748 [039] d.h. 113.521352: sched_stat_runtime: comm=stress pid=9748 runtime=998203 [ns] vruntime=0 [ns] Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20210905143547.4668-6-laoar.shao@gmail.com
2021-10-05sched: Make struct sched_statistics independent of fair sched classYafang Shao
If we want to use the schedstats facility to trace other sched classes, we should make it independent of fair sched class. The struct sched_statistics is the schedular statistics of a task_struct or a task_group. So we can move it into struct task_struct and struct task_group to achieve the goal. After the patch, schestats are orgnized as follows, struct task_struct { ... struct sched_entity se; struct sched_rt_entity rt; struct sched_dl_entity dl; ... struct sched_statistics stats; ... }; Regarding the task group, schedstats is only supported for fair group sched, and a new struct sched_entity_stats is introduced, suggested by Peter - struct sched_entity_stats { struct sched_entity se; struct sched_statistics stats; } __no_randomize_layout; Then with the se in a task_group, we can easily get the stats. The sched_statistics members may be frequently modified when schedstats is enabled, in order to avoid impacting on random data which may in the same cacheline with them, the struct sched_statistics is defined as cacheline aligned. As this patch changes the core struct of scheduler, so I verified the performance it may impact on the scheduler with 'perf bench sched pipe', suggested by Mel. Below is the result, in which all the values are in usecs/op. Before After kernel.sched_schedstats=0 5.2~5.4 5.2~5.4 kernel.sched_schedstats=1 5.3~5.5 5.3~5.5 [These data is a little difference with the earlier version, that is because my old test machine is destroyed so I have to use a new different test machine.] Almost no impact on the sched performance. No functional change. [lkp@intel.com: reported build failure in earlier version] Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Mel Gorman <mgorman@suse.de> Link: https://lore.kernel.org/r/20210905143547.4668-3-laoar.shao@gmail.com
2021-06-22sched/rt: Fix RT utilization tracking during policy changeVincent Donnefort
RT keeps track of the utilization on a per-rq basis with the structure avg_rt. This utilization is updated during task_tick_rt(), put_prev_task_rt() and set_next_task_rt(). However, when the current running task changes its policy, set_next_task_rt() which would usually take care of updating the utilization when the rq starts running RT tasks, will not see a such change, leaving the avg_rt structure outdated. When that very same task will be dequeued later, put_prev_task_rt() will then update the utilization, based on a wrong last_update_time, leading to a huge spike in the RT utilization signal. The signal would eventually recover from this issue after few ms. Even if no RT tasks are run, avg_rt is also updated in __update_blocked_others(). But as the CPU capacity depends partly on the avg_rt, this issue has nonetheless a significant impact on the scheduler. Fix this issue by ensuring a load update when a running task changes its policy to RT. Fixes: 371bf427 ("sched/rt: Add rt_rq utilization tracking") Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/1624271872-211872-2-git-send-email-vincent.donnefort@arm.com
2021-05-12sched: Introduce sched_class::pick_task()Peter Zijlstra
Because sched_class::pick_next_task() also implies sched_class::set_next_task() (and possibly put_prev_task() and newidle_balance) it is not state invariant. This makes it unsuitable for remote task selection. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> [Vineeth: folded fixes] Signed-off-by: Vineeth Remanan Pillai <viremana@linux.microsoft.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Don Hiatt <dhiatt@digitalocean.com> Tested-by: Hongyu Ning <hongyu.ning@linux.intel.com> Tested-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lkml.kernel.org/r/20210422123308.437092775@infradead.org
2021-05-12sched: Wrap rq::lock accessPeter Zijlstra
In preparation of playing games with rq->lock, abstract the thing using an accessor. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Don Hiatt <dhiatt@digitalocean.com> Tested-by: Hongyu Ning <hongyu.ning@linux.intel.com> Tested-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lkml.kernel.org/r/20210422123308.136465446@infradead.org
2021-03-22sched: Fix various typosIngo Molnar
Fix ~42 single-word typos in scheduler code comments. We have accumulated a few fun ones over the years. :-) Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Ben Segall <bsegall@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: linux-kernel@vger.kernel.org
2021-01-14sched: Use task_current() instead of 'rq->curr == p'Hui Su
Use the task_current() function where appropriate. No functional change. Signed-off-by: Hui Su <sh_def@163.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Link: https://lkml.kernel.org/r/20201030173223.GA52339@rlk
2020-11-10sched: Remove select_task_rq()'s sd_flag parameterValentin Schneider
Only select_task_rq_fair() uses that parameter to do an actual domain search, other classes only care about what kind of wakeup is happening (fork, exec, or "regular") and thus just translate the flag into a wakeup type. WF_TTWU and WF_EXEC have just been added, use these along with WF_FORK to encode the wakeup types we care about. For select_task_rq_fair(), we can simply use the shiny new WF_flag : SD_flag mapping. Signed-off-by: Valentin Schneider <valentin.schneider@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20201102184514.2733-3-valentin.schneider@arm.com
2020-11-10Merge branch 'sched/migrate-disable'Peter Zijlstra
2020-11-10sched: Fix migrate_disable() vs rt/dl balancingPeter Zijlstra
In order to minimize the interference of migrate_disable() on lower priority tasks, which can be deprived of runtime due to being stuck below a higher priority task. Teach the RT/DL balancers to push away these higher priority tasks when a lower priority task gets selected to run on a freshly demoted CPU (pull). This adds migration interference to the higher priority task, but restores bandwidth to system that would otherwise be irrevocably lost. Without this it would be possible to have all tasks on the system stuck on a single CPU, each task preempted in a migrate_disable() section with a single high priority task running. This way we can still approximate running the M highest priority tasks on the system. Migrating the top task away is (ofcourse) still subject to migrate_disable() too, which means the lower task is subject to an interference equivalent to the worst case migrate_disable() section. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com> Link: https://lkml.kernel.org/r/20201023102347.499155098@infradead.org
2020-11-10sched,rt: Use the full cpumask for balancingPeter Zijlstra
We want migrate_disable() tasks to get PULLs in order for them to PUSH away the higher priority task. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <valentin.schneider@arm.com> Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com> Link: https://lkml.kernel.org/r/20201023102347.310519774@infradead.org
2020-11-10sched,rt: Use cpumask_any*_distribute()Peter Zijlstra
Replace a bunch of cpumask_any*() instances with cpumask_any*_distribute(), by injecting this little bit of random in cpu selection, we reduce the chance two competing balance operations working off the same lowest_mask pick the same CPU. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <valentin.schneider@arm.com> Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com> Link: https://lkml.kernel.org/r/20201023102347.190759694@infradead.org
2020-11-10sched: Fix hotplug vs CPU bandwidth controlPeter Zijlstra
Since we now migrate tasks away before DYING, we should also move bandwidth unthrottle, otherwise we can gain tasks from unthrottle after we expect all tasks to be gone already. Also; it looks like the RT balancers don't respect cpu_active() and instead rely on rq->online in part, complete this. This too requires we do set_rq_offline() earlier to match the cpu_active() semantics. (The bigger patch is to convert RT to cpu_active() entirely) Since set_rq_online() is called from sched_cpu_activate(), place set_rq_offline() in sched_cpu_deactivate(). Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <valentin.schneider@arm.com> Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com> Link: https://lkml.kernel.org/r/20201023102346.639538965@infradead.org
2020-10-29sched: Remove relyance on STRUCT_ALIGNMENTPeter Zijlstra
Florian reported that all of kernel/sched/ is rebuild when CONFIG_BLK_DEV_INITRD is changed, which, while not a bug is unexpected. This is due to us including vmlinux.lds.h. Jakub explained that the problem is that we put the alignment requirement on the type instead of on a variable. Type alignment is a minimum, the compiler is free to pick any larger alignment for a specific instance of the type (eg. the variable). So force the type alignment on all individual variable definitions and remove the undesired dependency on vmlinux.lds.h. Fixes: 85c2ce9104eb ("sched, vmlinux.lds: Increase STRUCT_ALIGNMENT to 64 bytes for GCC-4.9") Reported-by: Florian Fainelli <f.fainelli@gmail.com> Suggested-by: Jakub Jelinek <jakub@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2020-10-29sched/cpupri: Remap CPUPRI_NORMAL to MAX_RT_PRIO-1Peter Zijlstra
This makes the mapping continuous and frees up 100 for other usage. Prev mapping: p->rt_priority p->prio newpri cpupri -1 -1 (CPUPRI_INVALID) 100 0 (CPUPRI_NORMAL) 1 98 98 1 ... 49 50 50 49 50 49 49 50 ... 99 0 0 99 New mapping: p->rt_priority p->prio newpri cpupri -1 -1 (CPUPRI_INVALID) 99 0 (CPUPRI_NORMAL) 1 98 98 1 ... 49 50 50 49 50 49 49 50 ... 99 0 0 99 Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
2020-10-25treewide: Convert macro and uses of __section(foo) to __section("foo")Joe Perches
Use a more generic form for __section that requires quotes to avoid complications with clang and gcc differences. Remove the quote operator # from compiler_attributes.h __section macro. Convert all unquoted __section(foo) uses to quoted __section("foo"). Also convert __attribute__((section("foo"))) uses to __section("foo") even if the __attribute__ has multiple list entry forms. Conversion done using the script at: https://lore.kernel.org/lkml/75393e5ddc272dc7403de74d645e6c6e0f4e70eb.camel@perches.com/2-convert_section.pl Signed-off-by: Joe Perches <joe@perches.com> Reviewed-by: Nick Desaulniers <ndesaulniers@gooogle.com> Reviewed-by: Miguel Ojeda <ojeda@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-25sched: Remove struct sched_class::next fieldSteven Rostedt (VMware)
Now that the sched_class descriptors are defined in order via the linker script vmlinux.lds.h, there's no reason to have a "next" pointer to the previous priroity structure. The order of the sturctures can be aligned as an array, and used to index and find the next sched_class descriptor. Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20191219214558.845353593@goodmis.org
2020-06-25sched: Force the address order of each sched class descriptorSteven Rostedt (VMware)
In order to make a micro optimization in pick_next_task(), the order of the sched class descriptor address must be in the same order as their priority to each other. That is: &idle_sched_class < &fair_sched_class < &rt_sched_class < &dl_sched_class < &stop_sched_class In order to guarantee this order of the sched class descriptors, add each one into their own data section and force the order in the linker script. Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/157675913272.349305.8936736338884044103.stgit@localhost.localdomain
2020-06-03Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-nextLinus Torvalds
Pull networking updates from David Miller: 1) Allow setting bluetooth L2CAP modes via socket option, from Luiz Augusto von Dentz. 2) Add GSO partial support to igc, from Sasha Neftin. 3) Several cleanups and improvements to r8169 from Heiner Kallweit. 4) Add IF_OPER_TESTING link state and use it when ethtool triggers a device self-test. From Andrew Lunn. 5) Start moving away from custom driver versions, use the globally defined kernel version instead, from Leon Romanovsky. 6) Support GRO vis gro_cells in DSA layer, from Alexander Lobakin. 7) Allow hard IRQ deferral during NAPI, from Eric Dumazet. 8) Add sriov and vf support to hinic, from Luo bin. 9) Support Media Redundancy Protocol (MRP) in the bridging code, from Horatiu Vultur. 10) Support netmap in the nft_nat code, from Pablo Neira Ayuso. 11) Allow UDPv6 encapsulation of ESP in the ipsec code, from Sabrina Dubroca. Also add ipv6 support for espintcp. 12) Lots of ReST conversions of the networking documentation, from Mauro Carvalho Chehab. 13) Support configuration of ethtool rxnfc flows in bcmgenet driver, from Doug Berger. 14) Allow to dump cgroup id and filter by it in inet_diag code, from Dmitry Yakunin. 15) Add infrastructure to export netlink attribute policies to userspace, from Johannes Berg. 16) Several optimizations to sch_fq scheduler, from Eric Dumazet. 17) Fallback to the default qdisc if qdisc init fails because otherwise a packet scheduler init failure will make a device inoperative. From Jesper Dangaard Brouer. 18) Several RISCV bpf jit optimizations, from Luke Nelson. 19) Correct the return type of the ->ndo_start_xmit() method in several drivers, it's netdev_tx_t but many drivers were using 'int'. From Yunjian Wang. 20) Add an ethtool interface for PHY master/slave config, from Oleksij Rempel. 21) Add BPF iterators, from Yonghang Song. 22) Add cable test infrastructure, including ethool interfaces, from Andrew Lunn. Marvell PHY driver is the first to support this facility. 23) Remove zero-length arrays all over, from Gustavo A. R. Silva. 24) Calculate and maintain an explicit frame size in XDP, from Jesper Dangaard Brouer. 25) Add CAP_BPF, from Alexei Starovoitov. 26) Support terse dumps in the packet scheduler, from Vlad Buslov. 27) Support XDP_TX bulking in dpaa2 driver, from Ioana Ciornei. 28) Add devm_register_netdev(), from Bartosz Golaszewski. 29) Minimize qdisc resets, from Cong Wang. 30) Get rid of kernel_getsockopt and kernel_setsockopt in order to eliminate set_fs/get_fs calls. From Christoph Hellwig. * git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2517 commits) selftests: net: ip_defrag: ignore EPERM net_failover: fixed rollback in net_failover_open() Revert "tipc: Fix potential tipc_aead refcnt leak in tipc_crypto_rcv" Revert "tipc: Fix potential tipc_node refcnt leak in tipc_rcv" vmxnet3: allow rx flow hash ops only when rss is enabled hinic: add set_channels ethtool_ops support selftests/bpf: Add a default $(CXX) value tools/bpf: Don't use $(COMPILE.c) bpf, selftests: Use bpf_probe_read_kernel s390/bpf: Use bcr 0,%0 as tail call nop filler s390/bpf: Maintain 8-byte stack alignment selftests/bpf: Fix verifier test selftests/bpf: Fix sample_cnt shared between two threads bpf, selftests: Adapt cls_redirect to call csum_level helper bpf: Add csum_level helper for fixing up csum levels bpf: Fix up bpf_skb_adjust_room helper's skb csum setting sfc: add missing annotation for efx_ef10_try_update_nic_stats_vf() crypto/chtls: IPv6 support for inline TLS Crypto/chcr: Fixes a coccinile check error Crypto/chcr: Fixes compilations warnings ...
2020-05-19sched: Defend cfs and rt bandwidth quota against overflowHuaixin Chang
When users write some huge number into cpu.cfs_quota_us or cpu.rt_runtime_us, overflow might happen during to_ratio() shifts of schedulable checks. to_ratio() could be altered to avoid unnecessary internal overflow, but min_cfs_quota_period is less than 1 << BW_SHIFT, so a cutoff would still be needed. Set a cap MAX_BW for cfs_quota_us and rt_runtime_us to prevent overflow. Signed-off-by: Huaixin Chang <changhuaixin@linux.alibaba.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Ben Segall <bsegall@google.com> Link: https://lkml.kernel.org/r/20200425105248.60093-1-changhuaixin@linux.alibaba.com
2020-04-27sysctl: pass kernel pointers to ->proc_handlerChristoph Hellwig
Instead of having all the sysctl handlers deal with user pointers, which is rather hairy in terms of the BPF interaction, copy the input to and from userspace in common code. This also means that the strings are always NUL-terminated by the common code, making the API a little bit safer. As most handler just pass through the data to one of the common handlers a lot of the changes are mechnical. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Andrey Ignatov <rdna@fb.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-03-06sched/rt: Remove unnecessary push for unfit tasksQais Yousef
In task_woken_rt() and switched_to_rto() we try trigger push-pull if the task is unfit. But the logic is found lacking because if the task was the only one running on the CPU, then rt_rq is not in overloaded state and won't trigger a push. The necessity of this logic was under a debate as well, a summary of the discussion can be found in the following thread: https://lore.kernel.org/lkml/20200226160247.iqvdakiqbakk2llz@e107158-lin.cambridge.arm.com/ Remove the logic for now until a better approach is agreed upon. Signed-off-by: Qais Yousef <qais.yousef@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Fixes: 804d402fb6f6 ("sched/rt: Make RT capacity-aware") Link: https://lkml.kernel.org/r/20200302132721.8353-6-qais.yousef@arm.com
2020-03-06sched/rt: Allow pulling unfitting taskQais Yousef
When implemented RT Capacity Awareness; the logic was done such that if a task was running on a fitting CPU, then it was sticky and we would try our best to keep it there. But as Steve suggested, to adhere to the strict priority rules of RT class; allow pulling an RT task to unfitting CPU to ensure it gets a chance to run ASAP. LINK: https://lore.kernel.org/lkml/20200203111451.0d1da58f@oasis.local.home/ Suggested-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Qais Yousef <qais.yousef@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Fixes: 804d402fb6f6 ("sched/rt: Make RT capacity-aware") Link: https://lkml.kernel.org/r/20200302132721.8353-5-qais.yousef@arm.com
2020-03-06sched/rt: Optimize cpupri_find() on non-heterogenous systemsQais Yousef
By introducing a new cpupri_find_fitness() function that takes the fitness_fn as an argument and only called when asym_system static key is enabled. cpupri_find() is now a wrapper function that calls cpupri_find_fitness() passing NULL as a fitness_fn, hence disabling the logic that handles fitness by default. LINK: https://lore.kernel.org/lkml/c0772fca-0a4b-c88d-fdf2-5715fcf8447b@arm.com/ Reported-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Signed-off-by: Qais Yousef <qais.yousef@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Fixes: 804d402fb6f6 ("sched/rt: Make RT capacity-aware") Link: https://lkml.kernel.org/r/20200302132721.8353-4-qais.yousef@arm.com
2020-03-06sched/rt: Re-instate old behavior in select_task_rq_rt()Qais Yousef
When RT Capacity Aware support was added, the logic in select_task_rq_rt was modified to force a search for a fitting CPU if the task currently doesn't run on one. But if the search failed, and the search was only triggered to fulfill the fitness request; we could end up selecting a new CPU unnecessarily. Fix this and re-instate the original behavior by ensuring we bail out in that case. This behavior change only affected asymmetric systems that are using util_clamp to implement capacity aware. None asymmetric systems weren't affected. LINK: https://lore.kernel.org/lkml/20200218041620.GD28029@codeaurora.org/ Reported-by: Pavan Kondeti <pkondeti@codeaurora.org> Signed-off-by: Qais Yousef <qais.yousef@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Fixes: 804d402fb6f6 ("sched/rt: Make RT capacity-aware") Link: https://lkml.kernel.org/r/20200302132721.8353-3-qais.yousef@arm.com
2020-01-28sched/rt: Optimize checking group RT scheduler constraintsKonstantin Khlebnikov
Group RT scheduler contains protection against setting zero runtime for cgroup with RT tasks. Right now function tg_set_rt_bandwidth() iterates over all CPU cgroups and calls tg_has_rt_tasks() for any cgroup which runtime is zero (not only for changed one). Default RT runtime is zero, thus tg_has_rt_tasks() will is called for almost at CPU cgroups. This protection already is slightly racy: runtime limit could be changed between cpu_cgroup_can_attach() and cpu_cgroup_attach() because changing cgroup attribute does not lock cgroup_mutex while attach does not lock rt_constraints_mutex. Changing task scheduler class also races with changing rt runtime: check in __sched_setscheduler() isn't protected. Function tg_has_rt_tasks() iterates over all threads in the system. This gives NR_CGROUPS * NR_TASKS operations under single tasklist_lock locked for read tg_set_rt_bandwidth(). Any concurrent attempt of locking tasklist_lock for write (for example fork) will stuck with disabled irqs. This patch makes two optimizations: 1) Remove locking tasklist_lock and iterate only tasks in cgroup 2) Call tg_has_rt_tasks() iff rt runtime changes from non-zero to zero All changed code is under CONFIG_RT_GROUP_SCHED. Testcase: # mkdir /sys/fs/cgroup/cpu/test{1..10000} # echo 0 | tee /sys/fs/cgroup/cpu/test*/cpu.rt_runtime_us At the same time without patch fork time will be >100ms: # perf trace -e clone --duration 100 stress-ng --fork 1 Also remote ping will show timings >100ms caused by irq latency. Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lkml.kernel.org/r/157996383820.4651.11292439232549211693.stgit@buzz
2019-12-25sched/rt: Make RT capacity-awareQais Yousef
Capacity Awareness refers to the fact that on heterogeneous systems (like Arm big.LITTLE), the capacity of the CPUs is not uniform, hence when placing tasks we need to be aware of this difference of CPU capacities. In such scenarios we want to ensure that the selected CPU has enough capacity to meet the requirement of the running task. Enough capacity means here that capacity_orig_of(cpu) >= task.requirement. The definition of task.requirement is dependent on the scheduling class. For CFS, utilization is used to select a CPU that has >= capacity value than the cfs_task.util. capacity_orig_of(cpu) >= cfs_task.util DL isn't capacity aware at the moment but can make use of the bandwidth reservation to implement that in a similar manner CFS uses utilization. The following patchset implements that: https://lore.kernel.org/lkml/20190506044836.2914-1-luca.abeni@santannapisa.it/ capacity_orig_of(cpu)/SCHED_CAPACITY >= dl_deadline/dl_runtime For RT we don't have a per task utilization signal and we lack any information in general about what performance requirement the RT task needs. But with the introduction of uclamp, RT tasks can now control that by setting uclamp_min to guarantee a minimum performance point. ATM the uclamp value are only used for frequency selection; but on heterogeneous systems this is not enough and we need to ensure that the capacity of the CPU is >= uclamp_min. Which is what implemented here. capacity_orig_of(cpu) >= rt_task.uclamp_min Note that by default uclamp.min is 1024, which means that RT tasks will always be biased towards the big CPUs, which make for a better more predictable behavior for the default case. Must stress that the bias acts as a hint rather than a definite placement strategy. For example, if all big cores are busy executing other RT tasks we can't guarantee that a new RT task will be placed there. On non-heterogeneous systems the original behavior of RT should be retained. Similarly if uclamp is not selected in the config. [ mingo: Minor edits to comments. ] Signed-off-by: Qais Yousef <qais.yousef@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20191009104611.15363-1-qais.yousef@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-11-11sched/core: Further clarify sched_class::set_next_task()Peter Zijlstra
It turns out there really is something special to the first set_next_task() invocation. In specific the 'change' pattern really should not cause balance callbacks. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: bsegall@google.com Cc: dietmar.eggemann@arm.com Cc: juri.lelli@redhat.com Cc: ktkhai@virtuozzo.com Cc: mgorman@suse.de Cc: qais.yousef@arm.com Cc: qperret@google.com Cc: rostedt@goodmis.org Cc: valentin.schneider@arm.com Cc: vincent.guittot@linaro.org Fixes: f95d4eaee6d0 ("sched/{rt,deadline}: Fix set_next_task vs pick_next_task") Link: https://lkml.kernel.org/r/20191108131909.775434698@infradead.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-11-11sched/core: Simplify sched_class::pick_next_task()Peter Zijlstra
Now that the indirect class call never uses the last two arguments of pick_next_task(), remove them. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: bsegall@google.com Cc: dietmar.eggemann@arm.com Cc: juri.lelli@redhat.com Cc: ktkhai@virtuozzo.com Cc: mgorman@suse.de Cc: qais.yousef@arm.com Cc: qperret@google.com Cc: rostedt@goodmis.org Cc: valentin.schneider@arm.com Cc: vincent.guittot@linaro.org Link: https://lkml.kernel.org/r/20191108131909.660595546@infradead.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-11-08sched: Fix pick_next_task() vs 'change' pattern racePeter Zijlstra
Commit 67692435c411 ("sched: Rework pick_next_task() slow-path") inadvertly introduced a race because it changed a previously unexplored dependency between dropping the rq->lock and sched_class::put_prev_task(). The comments about dropping rq->lock, in for example newidle_balance(), only mentions the task being current and ->on_cpu being set. But when we look at the 'change' pattern (in for example sched_setnuma()): queued = task_on_rq_queued(p); /* p->on_rq == TASK_ON_RQ_QUEUED */ running = task_current(rq, p); /* rq->curr == p */ if (queued) dequeue_task(...); if (running) put_prev_task(...); /* change task properties */ if (queued) enqueue_task(...); if (running) set_next_task(...); It becomes obvious that if we do this after put_prev_task() has already been called on @p, things go sideways. This is exactly what the commit in question allows to happen when it does: prev->sched_class->put_prev_task(rq, prev, rf); if (!rq->nr_running) newidle_balance(rq, rf); The newidle_balance() call will drop rq->lock after we've called put_prev_task() and that allows the above 'change' pattern to interleave and mess up the state. Furthermore, it turns out we lost the RT-pull when we put the last DL task. Fix both problems by extracting the balancing from put_prev_task() and doing a multi-class balance() pass before put_prev_task(). Fixes: 67692435c411 ("sched: Rework pick_next_task() slow-path") Reported-by: Quentin Perret <qperret@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Quentin Perret <qperret@google.com> Tested-by: Valentin Schneider <valentin.schneider@arm.com>
2019-09-17Merge branch 'timers-core-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull core timer updates from Thomas Gleixner: "Timers and timekeeping updates: - A large overhaul of the posix CPU timer code which is a preparation for moving the CPU timer expiry out into task work so it can be properly accounted on the task/process. An update to the bogus permission checks will come later during the merge window as feedback was not complete before heading of for travel. - Switch the timerqueue code to use cached rbtrees and get rid of the homebrewn caching of the leftmost node. - Consolidate hrtimer_init() + hrtimer_init_sleeper() calls into a single function - Implement the separation of hrtimers to be forced to expire in hard interrupt context even when PREEMPT_RT is enabled and mark the affected timers accordingly. - Implement a mechanism for hrtimers and the timer wheel to protect RT against priority inversion and live lock issues when a (hr)timer which should be canceled is currently executing the callback. Instead of infinitely spinning, the task which tries to cancel the timer blocks on a per cpu base expiry lock which is held and released by the (hr)timer expiry code. - Enable the Hyper-V TSC page based sched_clock for Hyper-V guests resulting in faster access to timekeeping functions. - Updates to various clocksource/clockevent drivers and their device tree bindings. - The usual small improvements all over the place" * 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (101 commits) posix-cpu-timers: Fix permission check regression posix-cpu-timers: Always clear head pointer on dequeue hrtimer: Add a missing bracket and hide `migration_base' on !SMP posix-cpu-timers: Make expiry_active check actually work correctly posix-timers: Unbreak CONFIG_POSIX_TIMERS=n build tick: Mark sched_timer to expire in hard interrupt context hrtimer: Add kernel doc annotation for HRTIMER_MODE_HARD x86/hyperv: Hide pv_ops access for CONFIG_PARAVIRT=n posix-cpu-timers: Utilize timerqueue for storage posix-cpu-timers: Move state tracking to struct posix_cputimers posix-cpu-timers: Deduplicate rlimit handling posix-cpu-timers: Remove pointless comparisons posix-cpu-timers: Get rid of 64bit divisions posix-cpu-timers: Consolidate timer expiry further posix-cpu-timers: Get rid of zero checks rlimit: Rewrite non-sensical RLIMIT_CPU comment posix-cpu-timers: Respect INFINITY for hard RTTIME limit posix-cpu-timers: Switch thread group sampling to array posix-cpu-timers: Restructure expiry array posix-cpu-timers: Remove cputime_expires ...
2019-08-28posix-cpu-timers: Move expiry cache into struct posix_cputimersThomas Gleixner
The expiry cache belongs into the posix_cputimers container where the other cpu timers information is. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lkml.kernel.org/r/20190821192921.014444012@linutronix.de
2019-08-08sched: Rework pick_next_task() slow-pathPeter Zijlstra
Avoid the RETRY_TASK case in the pick_next_task() slow path. By doing the put_prev_task() early, we get the rt/deadline pull done, and by testing rq->nr_running we know if we need newidle_balance(). This then gives a stable state to pick a task from. Since the fast-path is fair only; it means the other classes will always have pick_next_task(.prev=NULL, .rf=NULL) and we can simplify. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Aaron Lu <aaron.lwe@gmail.com> Cc: Valentin Schneider <valentin.schneider@arm.com> Cc: mingo@kernel.org Cc: Phil Auld <pauld@redhat.com> Cc: Julien Desfossez <jdesfossez@digitalocean.com> Cc: Nishanth Aravamudan <naravamudan@digitalocean.com> Link: https://lkml.kernel.org/r/aa34d24b36547139248f32a30138791ac6c02bd6.1559129225.git.vpillai@digitalocean.com
2019-08-08sched: Allow put_prev_task() to drop rq->lockPeter Zijlstra
Currently the pick_next_task() loop is convoluted and ugly because of how it can drop the rq->lock and needs to restart the picking. For the RT/Deadline classes, it is put_prev_task() where we do balancing, and we could do this before the picking loop. Make this possible. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Valentin Schneider <valentin.schneider@arm.com> Cc: Aaron Lu <aaron.lwe@gmail.com> Cc: mingo@kernel.org Cc: Phil Auld <pauld@redhat.com> Cc: Julien Desfossez <jdesfossez@digitalocean.com> Cc: Nishanth Aravamudan <naravamudan@digitalocean.com> Link: https://lkml.kernel.org/r/e4519f6850477ab7f3d257062796e6425ee4ba7c.1559129225.git.vpillai@digitalocean.com
2019-08-08sched: Add task_struct pointer to sched_class::set_curr_taskPeter Zijlstra
In preparation of further separating pick_next_task() and set_curr_task() we have to pass the actual task into it, while there, rename the thing to better pair with put_prev_task(). Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Aaron Lu <aaron.lwe@gmail.com> Cc: Valentin Schneider <valentin.schneider@arm.com> Cc: mingo@kernel.org Cc: Phil Auld <pauld@redhat.com> Cc: Julien Desfossez <jdesfossez@digitalocean.com> Cc: Nishanth Aravamudan <naravamudan@digitalocean.com> Link: https://lkml.kernel.org/r/a96d1bcdd716db4a4c5da2fece647a1456c0ed78.1559129225.git.vpillai@digitalocean.com
2019-08-08sched/{rt,deadline}: Fix set_next_task vs pick_next_taskPeter Zijlstra
Because pick_next_task() implies set_curr_task() and some of the details haven't mattered too much, some of what _should_ be in set_curr_task() ended up in pick_next_task, correct this. This prepares the way for a pick_next_task() variant that does not affect the current state; allowing remote picking. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Aaron Lu <aaron.lwe@gmail.com> Cc: Valentin Schneider <valentin.schneider@arm.com> Cc: mingo@kernel.org Cc: Phil Auld <pauld@redhat.com> Cc: Julien Desfossez <jdesfossez@digitalocean.com> Cc: Nishanth Aravamudan <naravamudan@digitalocean.com> Link: https://lkml.kernel.org/r/38c61d5240553e043c27c5e00b9dd0d184dd6081.1559129225.git.vpillai@digitalocean.com