summaryrefslogtreecommitdiffstats
path: root/kernel/sched
AgeCommit message (Collapse)Author
2014-10-28sched/dl: Fix preemption checksKirill Tkhai
1) switched_to_dl() check is wrong. We reschedule only if rq->curr is deadline task, and we do not reschedule if it's a lower priority task. But we must always preempt a task of other classes. 2) dl_task_timer(): Policy does not change in case of priority inheritance. rt_mutex_setprio() changes prio, while policy remains old. So we lose some balancing logic in dl_task_timer() and switched_to_dl() when we check policy instead of priority. Boosted task may be rq->curr. (I didn't change switched_from_dl() because no check is necessary there at all). I've looked at this place(switched_to_dl) several times and even fixed this function, but found just now... I suppose some performance tests may work better after this. Signed-off-by: Kirill Tkhai <ktkhai@parallels.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Juri Lelli <juri.lelli@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/1413909356.19914.128.camel@tkhai Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-10-28sched: stop the unbound recursion in preempt_schedule_context()Oleg Nesterov
preempt_schedule_context() does preempt_enable_notrace() at the end and this can call the same function again; exception_exit() is heavy and it is quite possible that need-resched is true again. 1. Change this code to dec preempt_count() and check need_resched() by hand. 2. As Linus suggested, we can use the PREEMPT_ACTIVE bit and avoid the enable/disable dance around __schedule(). But in this case we need to move into sched/core.c. 3. Cosmetic, but x86 forgets to declare this function. This doesn't really matter because it is only called by asm helpers, still it make sense to add the declaration into asm/preempt.h to match preempt_schedule(). Reported-by: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Alexander Graf <agraf@suse.de> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Christoph Lameter <cl@linux.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Peter Anvin <hpa@zytor.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: Chuck Ebbert <cebbert.lkml@gmail.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Link: http://lkml.kernel.org/r/20141005202322.GB27962@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-10-28sched/fair: Fix division by zero sysctl_numa_balancing_scan_sizeKirill Tkhai
File /proc/sys/kernel/numa_balancing_scan_size_mb allows writing of zero. This bash command reproduces problem: $ while :; do echo 0 > /proc/sys/kernel/numa_balancing_scan_size_mb; \ echo 256 > /proc/sys/kernel/numa_balancing_scan_size_mb; done divide error: 0000 [#1] SMP Modules linked in: CPU: 0 PID: 24112 Comm: bash Not tainted 3.17.0+ #8 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011 task: ffff88013c852600 ti: ffff880037a68000 task.ti: ffff880037a68000 RIP: 0010:[<ffffffff81074191>] [<ffffffff81074191>] task_scan_min+0x21/0x50 RSP: 0000:ffff880037a6bce0 EFLAGS: 00010246 RAX: 0000000000000a00 RBX: 00000000000003e8 RCX: 0000000000000000 RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88013c852600 RBP: ffff880037a6bcf0 R08: 0000000000000001 R09: 0000000000015c90 R10: ffff880239bf6c00 R11: 0000000000000016 R12: 0000000000003fff R13: ffff88013c852600 R14: ffffea0008d1b000 R15: 0000000000000003 FS: 00007f12bb048700(0000) GS:ffff88007da00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 0000000001505678 CR3: 0000000234770000 CR4: 00000000000006f0 Stack: ffff88013c852600 0000000000003fff ffff880037a6bd18 ffffffff810741d1 ffff88013c852600 0000000000003fff 000000000002bfff ffff880037a6bda8 ffffffff81077ef7 ffffea0008a56d40 0000000000000001 0000000000000001 Call Trace: [<ffffffff810741d1>] task_scan_max+0x11/0x40 [<ffffffff81077ef7>] task_numa_fault+0x1f7/0xae0 [<ffffffff8115a896>] ? migrate_misplaced_page+0x276/0x300 [<ffffffff81134a4d>] handle_mm_fault+0x62d/0xba0 [<ffffffff8103e2f1>] __do_page_fault+0x191/0x510 [<ffffffff81030122>] ? native_smp_send_reschedule+0x42/0x60 [<ffffffff8106dc00>] ? check_preempt_curr+0x80/0xa0 [<ffffffff8107092c>] ? wake_up_new_task+0x11c/0x1a0 [<ffffffff8104887d>] ? do_fork+0x14d/0x340 [<ffffffff811799bb>] ? get_unused_fd_flags+0x2b/0x30 [<ffffffff811799df>] ? __fd_install+0x1f/0x60 [<ffffffff8103e67c>] do_page_fault+0xc/0x10 [<ffffffff8150d322>] page_fault+0x22/0x30 RIP [<ffffffff81074191>] task_scan_min+0x21/0x50 RSP <ffff880037a6bce0> ---[ end trace 9a826d16936c04de ]--- Also fix race in task_scan_min (it depends on compiler behaviour). Signed-off-by: Kirill Tkhai <ktkhai@parallels.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Aaron Tomlin <atomlin@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Dario Faggioli <raistlin@linux.it> Cc: David Rientjes <rientjes@google.com> Cc: Jens Axboe <axboe@fb.com> Cc: Kees Cook <keescook@chromium.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Rik van Riel <riel@redhat.com> Link: http://lkml.kernel.org/r/1413455977.24793.78.camel@tkhai Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-10-28sched/fair: Care divide error in update_task_scan_period()Yasuaki Ishimatsu
While offling node by hot removing memory, the following divide error occurs: divide error: 0000 [#1] SMP [...] Call Trace: [...] handle_mm_fault [...] ? try_to_wake_up [...] ? wake_up_state [...] __do_page_fault [...] ? do_futex [...] ? put_prev_entity [...] ? __switch_to [...] do_page_fault [...] page_fault [...] RIP [<ffffffff810a7081>] task_numa_fault RSP <ffff88084eb2bcb0> The issue occurs as follows: 1. When page fault occurs and page is allocated from node 1, task_struct->numa_faults_buffer_memory[] of node 1 is incremented and p->numa_faults_locality[] is also incremented as follows: o numa_faults_buffer_memory[] o numa_faults_locality[] NR_NUMA_HINT_FAULT_TYPES | 0 | 1 | ---------------------------------- ---------------------- node 0 | 0 | 0 | remote | 0 | node 1 | 0 | 1 | locale | 1 | ---------------------------------- ---------------------- 2. node 1 is offlined by hot removing memory. 3. When page fault occurs, fault_types[] is calculated by using p->numa_faults_buffer_memory[] of all online nodes in task_numa_placement(). But node 1 was offline by step 2. So the fault_types[] is calculated by using only p->numa_faults_buffer_memory[] of node 0. So both of fault_types[] are set to 0. 4. The values(0) of fault_types[] pass to update_task_scan_period(). 5. numa_faults_locality[1] is set to 1. So the following division is calculated. static void update_task_scan_period(struct task_struct *p, unsigned long shared, unsigned long private){ ... ratio = DIV_ROUND_UP(private * NUMA_PERIOD_SLOTS, (private + shared)); } 6. But both of private and shared are set to 0. So divide error occurs here. The divide error is rare case because the trigger is node offline. This patch always increments denominator for avoiding divide error. Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/54475703.8000505@jp.fujitsu.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-10-28sched/numa: Fix unsafe get_task_struct() in task_numa_assign()Kirill Tkhai
Unlocked access to dst_rq->curr in task_numa_compare() is racy. If curr task is exiting this may be a reason of use-after-free: task_numa_compare() do_exit() ... current->flags |= PF_EXITING; ... release_task() ... ~~delayed_put_task_struct()~~ ... schedule() rcu_read_lock() ... cur = ACCESS_ONCE(dst_rq->curr) ... ... rq->curr = next; ... context_switch() ... finish_task_switch() ... put_task_struct() ... __put_task_struct() ... free_task_struct() task_numa_assign() ... get_task_struct() ... As noted by Oleg: <<The lockless get_task_struct(tsk) is only safe if tsk == current and didn't pass exit_notify(), or if this tsk was found on a rcu protected list (say, for_each_process() or find_task_by_vpid()). IOW, it is only safe if release_task() was not called before we take rcu_read_lock(), in this case we can rely on the fact that delayed_put_pid() can not drop the (potentially) last reference until rcu_read_unlock(). And as Kirill pointed out task_numa_compare()->task_numa_assign() path does get_task_struct(dst_rq->curr) and this is not safe. The task_struct itself can't go away, but rcu_read_lock() can't save us from the final put_task_struct() in finish_task_switch(); this reference goes away without rcu gp>> The patch provides simple check of PF_EXITING flag. If it's not set, this guarantees that call_rcu() of delayed_put_task_struct() callback hasn't happened yet, so we can safely do get_task_struct() in task_numa_assign(). Locked dst_rq->lock protects from concurrency with the last schedule(). Reusing or unmapping of cur's memory may happen without it. Suggested-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Kirill Tkhai <ktkhai@parallels.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/1413962231.19914.130.camel@tkhai Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-10-28sched/deadline: Fix races between rt_mutex_setprio() and dl_task_timer()Juri Lelli
dl_task_timer() is racy against several paths. Daniel noticed that the replenishment timer may experience a race condition against an enqueue_dl_entity() called from rt_mutex_setprio(). With his own words: rt_mutex_setprio() resets p->dl.dl_throttled. So the pattern is: start_dl_timer() throttled = 1, rt_mutex_setprio() throlled = 0, sched_switch() -> enqueue_task(), dl_task_timer-> enqueue_task() throttled is 0 => BUG_ON(on_dl_rq(dl_se)) fires as the scheduling entity is already enqueued on the -deadline runqueue. As we do for the other races, we just bail out in the replenishment timer code. Reported-by: Daniel Wagner <daniel.wagner@bmw-carit.de> Tested-by: Daniel Wagner <daniel.wagner@bmw-carit.de> Signed-off-by: Juri Lelli <juri.lelli@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: vincent@legout.info Cc: Dario Faggioli <raistlin@linux.it> Cc: Michael Trimarchi <michael@amarulasolutions.com> Cc: Fabio Checconi <fchecconi@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/1414142198-18552-5-git-send-email-juri.lelli@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-10-28sched/deadline: Don't replenish from a !SCHED_DEADLINE entityJuri Lelli
In the deboost path, right after the dl_boosted flag has been reset, we can currently end up replenishing using -deadline parameters of a !SCHED_DEADLINE entity. This of course causes a bug, as those parameters are empty. In the case depicted above it is safe to simply bail out, as the deboosted task is going to be back to its original scheduling class anyway. Reported-by: Daniel Wagner <daniel.wagner@bmw-carit.de> Tested-by: Daniel Wagner <daniel.wagner@bmw-carit.de> Signed-off-by: Juri Lelli <juri.lelli@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: vincent@legout.info Cc: Dario Faggioli <raistlin@linux.it> Cc: Michael Trimarchi <michael@amarulasolutions.com> Cc: Fabio Checconi <fchecconi@gmail.com> Link: http://lkml.kernel.org/r/1414142198-18552-4-git-send-email-juri.lelli@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-10-28sched: Fix race between task_group and sched_task_groupKirill Tkhai
The race may happen when somebody is changing task_group of a forking task. Child's cgroup is the same as parent's after dup_task_struct() (there just memory copying). Also, cfs_rq and rt_rq are the same as parent's. But if parent changes its task_group before it's called cgroup_post_fork(), we do not reflect this situation on child. Child's cfs_rq and rt_rq remain the same, while child's task_group changes in cgroup_post_fork(). To fix this we introduce fork() method, which calls sched_move_task() directly. This function changes sched_task_group on appropriate (also its logic has no problem with freshly created tasks, so we shouldn't introduce something special; we are able just to use it). Possibly, this decides the Burke Libbey's problem: https://lkml.org/lkml/2014/10/24/456 Signed-off-by: Kirill Tkhai <ktkhai@parallels.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/1414405105.19914.169.camel@tkhai Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-10-15Merge branch 'for-3.18-consistent-ops' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu Pull percpu consistent-ops changes from Tejun Heo: "Way back, before the current percpu allocator was implemented, static and dynamic percpu memory areas were allocated and handled separately and had their own accessors. The distinction has been gone for many years now; however, the now duplicate two sets of accessors remained with the pointer based ones - this_cpu_*() - evolving various other operations over time. During the process, we also accumulated other inconsistent operations. This pull request contains Christoph's patches to clean up the duplicate accessor situation. __get_cpu_var() uses are replaced with with this_cpu_ptr() and __this_cpu_ptr() with raw_cpu_ptr(). Unfortunately, the former sometimes is tricky thanks to C being a bit messy with the distinction between lvalues and pointers, which led to a rather ugly solution for cpumask_var_t involving the introduction of this_cpu_cpumask_var_ptr(). This converts most of the uses but not all. Christoph will follow up with the remaining conversions in this merge window and hopefully remove the obsolete accessors" * 'for-3.18-consistent-ops' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: (38 commits) irqchip: Properly fetch the per cpu offset percpu: Resolve ambiguities in __get_cpu_var/cpumask_var_t -fix ia64: sn_nodepda cannot be assigned to after this_cpu conversion. Use __this_cpu_write. percpu: Resolve ambiguities in __get_cpu_var/cpumask_var_t Revert "powerpc: Replace __get_cpu_var uses" percpu: Remove __this_cpu_ptr clocksource: Replace __this_cpu_ptr with raw_cpu_ptr sparc: Replace __get_cpu_var uses avr32: Replace __get_cpu_var with __this_cpu_write blackfin: Replace __get_cpu_var uses tile: Use this_cpu_ptr() for hardware counters tile: Replace __get_cpu_var uses powerpc: Replace __get_cpu_var uses alpha: Replace __get_cpu_var ia64: Replace __get_cpu_var uses s390: cio driver &__get_cpu_var replacements s390: Replace __get_cpu_var uses mips: Replace __get_cpu_var uses MIPS: Replace __get_cpu_var uses in FPU emulator. arm: Replace __this_cpu_ptr with raw_cpu_ptr ...
2014-10-13Merge branch 'sched-core-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler updates from Ingo Molnar: "The main changes in this cycle were: - Optimized support for Intel "Cluster-on-Die" (CoD) topologies (Dave Hansen) - Various sched/idle refinements for better idle handling (Nicolas Pitre, Daniel Lezcano, Chuansheng Liu, Vincent Guittot) - sched/numa updates and optimizations (Rik van Riel) - sysbench speedup (Vincent Guittot) - capacity calculation cleanups/refactoring (Vincent Guittot) - Various cleanups to thread group iteration (Oleg Nesterov) - Double-rq-lock removal optimization and various refactorings (Kirill Tkhai) - various sched/deadline fixes ... and lots of other changes" * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (72 commits) sched/dl: Use dl_bw_of() under rcu_read_lock_sched() sched/fair: Delete resched_cpu() from idle_balance() sched, time: Fix build error with 64 bit cputime_t on 32 bit systems sched: Improve sysbench performance by fixing spurious active migration sched/x86: Fix up typo in topology detection x86, sched: Add new topology for multi-NUMA-node CPUs sched/rt: Use resched_curr() in task_tick_rt() sched: Use rq->rd in sched_setaffinity() under RCU read lock sched: cleanup: Rename 'out_unlock' to 'out_free_new_mask' sched: Use dl_bw_of() under RCU read lock sched/fair: Remove duplicate code from can_migrate_task() sched, mips, ia64: Remove __ARCH_WANT_UNLOCKED_CTXSW sched: print_rq(): Don't use tasklist_lock sched: normalize_rt_tasks(): Don't use _irqsave for tasklist_lock, use task_rq_lock() sched: Fix the task-group check in tg_has_rt_tasks() sched/fair: Leverage the idle state info when choosing the "idlest" cpu sched: Let the scheduler see CPU idle states sched/deadline: Fix inter- exclusive cpusets migrations sched/deadline: Clear dl_entity params when setscheduling to different class sched/numa: Kill the wrong/dead TASK_DEAD check in task_numa_fault() ...
2014-10-13Merge branch 'locking-core-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull core locking updates from Ingo Molnar: "The main updates in this cycle were: - mutex MCS refactoring finishing touches: improve comments, refactor and clean up code, reduce debug data structure footprint, etc. - qrwlock finishing touches: remove old code, self-test updates. - small rwsem optimization - various smaller fixes/cleanups" * 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: locking/lockdep: Revert qrwlock recusive stuff locking/rwsem: Avoid double checking before try acquiring write lock locking/rwsem: Move EXPORT_SYMBOL() lines to follow function definition locking/rwlock, x86: Delete unused asm/rwlock.h and rwlock.S locking/rwlock, x86: Clean up asm/spinlock*.h to remove old rwlock code locking/semaphore: Resolve some shadow warnings locking/selftest: Support queued rwlock locking/lockdep: Restrict the use of recursive read_lock() with qrwlock locking/spinlocks: Always evaluate the second argument of spin_lock_nested() locking/Documentation: Update locking/mutex-design.txt disadvantages locking/Documentation: Move locking related docs into Documentation/locking/ locking/mutexes: Use MUTEX_SPIN_ON_OWNER when appropriate locking/mutexes: Refactor optimistic spinning code locking/mcs: Remove obsolete comment locking/mutexes: Document quick lock release when unlocking locking/mutexes: Standardize arguments in lock/unlock slowpaths locking: Remove deprecated smp_mb__() barriers
2014-10-09mempolicy: remove the "task" arg of vma_policy_mof() and simplify itOleg Nesterov
1. vma_policy_mof(task) is simply not safe unless task == current, it can race with do_exit()->mpol_put(). Remove this arg and update its single caller. 2. vma can not be NULL, remove this check and simplify the code. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: David Rientjes <rientjes@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Hugh Dickins <hughd@google.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-08Merge tag 'nfs-for-3.18-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfsLinus Torvalds
Pull NFS client updates from Trond Myklebust: "Highlights include: Stable fixes: - fix an NFSv4.1 state renewal regression - fix open/lock state recovery error handling - fix lock recovery when CREATE_SESSION/SETCLIENTID_CONFIRM fails - fix statd when reconnection fails - don't wake tasks during connection abort - don't start reboot recovery if lease check fails - fix duplicate proc entries Features: - pNFS block driver fixes and clean ups from Christoph - More code cleanups from Anna - Improve mmap() writeback performance - Replace use of PF_TRANS with a more generic mechanism for avoiding deadlocks in nfs_release_page" * tag 'nfs-for-3.18-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (66 commits) NFSv4.1: Fix an NFSv4.1 state renewal regression NFSv4: fix open/lock state recovery error handling NFSv4: Fix lock recovery when CREATE_SESSION/SETCLIENTID_CONFIRM fails NFS: Fabricate fscache server index key correctly SUNRPC: Add missing support for RPC_CLNT_CREATE_NO_RETRANS_TIMEOUT NFSv3: Fix missing includes of nfs3_fs.h NFS/SUNRPC: Remove other deadlock-avoidance mechanisms in nfs_release_page() NFS: avoid waiting at all in nfs_release_page when congested. NFS: avoid deadlocks with loop-back mounted NFS filesystems. MM: export page_wakeup functions SCHED: add some "wait..on_bit...timeout()" interfaces. NFS: don't use STABLE writes during writeback. NFSv4: use exponential retry on NFS4ERR_DELAY for async requests. rpc: Add -EPERM processing for xs_udp_send_request() rpc: return sent and err from xs_sendpages() lockd: Try to reconnect if statd has moved SUNRPC: Don't wake tasks during connection abort Fixing lease renewal nfs: fix duplicate proc entries pnfs/blocklayout: Fix a 64-bit division/remainder issue in bl_map_stripe ...
2014-10-03sched/dl: Use dl_bw_of() under rcu_read_lock_sched()Kirill Tkhai
rq->rd is freed using call_rcu_sched(), so rcu_read_lock() to access it is not enough. We should use either rcu_read_lock_sched() or preempt_disable(). Reported-by: Sasha Levin <sasha.levin@oracle.com> Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Kirill Tkhai <ktkhai@parallels.com Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Fixes: 66339c31bc39 "sched: Use dl_bw_of() under RCU read lock" Link: http://lkml.kernel.org/r/1412065417.20287.24.camel@tkhai Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-10-03sched/fair: Delete resched_cpu() from idle_balance()Kirill Tkhai
We already reschedule env.dst_cpu in attach_tasks()->check_preempt_curr() if this is necessary. Furthermore, a higher priority class task may be current on dest rq, we shouldn't disturb it. Signed-off-by: Kirill Tkhai <ktkhai@parallels.com> Cc: Juri Lelli <juri.lelli@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: http://lkml.kernel.org/r/20140930210441.5258.55054.stgit@localhost Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-10-03sched, time: Fix build error with 64 bit cputime_t on 32 bit systemsRik van Riel
On 32 bit systems cmpxchg cannot handle 64 bit values, so some additional magic is required to allow a 32 bit system with CONFIG_VIRT_CPU_ACCOUNTING_GEN=y enabled to build. Make sure the correct cmpxchg function is used when doing an atomic swap of a cputime_t. Reported-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Rik van Riel <riel@redhat.com> Acked-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: umgwanakikbuti@gmail.com Cc: fweisbec@gmail.com Cc: srao@redhat.com Cc: lwoodman@redhat.com Cc: atheurer@redhat.com Cc: oleg@redhat.com Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Paul Mackerras <paulus@samba.org> Cc: linux390@de.ibm.com Cc: linux-arch@vger.kernel.org Cc: linuxppc-dev@lists.ozlabs.org Cc: linux-s390@vger.kernel.org Link: http://lkml.kernel.org/r/20140930155947.070cdb1f@annuminas.surriel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-10-03sched: Improve sysbench performance by fixing spurious active migrationVincent Guittot
Since commit caeb178c60f4 ("sched/fair: Make update_sd_pick_busiest() ...") sd_pick_busiest returns a group that can be neither imbalanced nor overloaded but is only more loaded than others. This change has been introduced to ensure a better load balance in system that are not overloaded but as a side effect, it can also generate useless active migration between groups. Let take the example of 3 tasks on a quad cores system. We will always have an idle core so the load balance will find a busiest group (core) whenever an ILB is triggered and it will force an active migration (once above nr_balance_failed threshold) so the idle core becomes busy but another core will become idle. With the next ILB, the freshly idle core will try to pull the task of a busy CPU. The number of spurious active migration is not so huge in quad core system because the ILB is not triggered so much. But it becomes significant as soon as you have more than one sched_domain level like on a dual cluster of quad cores where the ILB is triggered every tick when you have more than 1 busy_cpu We need to ensure that the migration generate a real improveùent and will not only move the avg_load imbalance on another CPU. Before caeb178c60f4f93f1b45c0bc056b5cf6d217b67f, the filtering of such use case was ensured by the following test in f_b_g: if ((local->idle_cpus < busiest->idle_cpus) && busiest->sum_nr_running <= busiest->group_weight) This patch modified the condition to take into account situation where busiest group is not overloaded: If the diff between the number of idle cpus in 2 groups is less than or equal to 1 and the busiest group is not overloaded, moving a task will not improve the load balance but just move it. A test with sysbench on a dual clusters of quad cores gives the following results: command: sysbench --test=cpu --num-threads=5 --max-time=5 run The HZ is 200 which means that 1000 ticks has fired during the test. With Mainline, perf gives the following figures: Samples: 727 of event 'sched:sched_migrate_task' Event count (approx.): 727 Overhead Command Shared Object Symbol ........ ............... ............. .............. 12.52% migration/1 [unknown] [.] 00000000 12.52% migration/5 [unknown] [.] 00000000 12.52% migration/7 [unknown] [.] 00000000 12.10% migration/6 [unknown] [.] 00000000 11.83% migration/0 [unknown] [.] 00000000 11.83% migration/3 [unknown] [.] 00000000 11.14% migration/4 [unknown] [.] 00000000 10.87% migration/2 [unknown] [.] 00000000 2.75% sysbench [unknown] [.] 00000000 0.83% swapper [unknown] [.] 00000000 0.55% ktps65090charge [unknown] [.] 00000000 0.41% mmcqd/1 [unknown] [.] 00000000 0.14% perf [unknown] [.] 00000000 With this patch, perf gives the following figures Samples: 20 of event 'sched:sched_migrate_task' Event count (approx.): 20 Overhead Command Shared Object Symbol ........ ............... ............. .............. 80.00% sysbench [unknown] [.] 00000000 10.00% swapper [unknown] [.] 00000000 5.00% ktps65090charge [unknown] [.] 00000000 5.00% migration/1 [unknown] [.] 00000000 Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Reviewed-by: Rik van Riel <riel@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/1412170735-5356-1-git-send-email-vincent.guittot@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-25SCHED: add some "wait..on_bit...timeout()" interfaces.NeilBrown
In commit c1221321b7c25b53204447cff9949a6d5a7ddddc sched: Allow wait_on_bit_action() functions to support a timeout I suggested that a "wait_on_bit_timeout()" interface would not meet my need. This isn't true - I was just over-engineering. Including a 'private' field in wait_bit_key instead of a focused "timeout" field was just premature generalization. If some other use is ever found, it can be generalized or added later. So this patch renames "private" to "timeout" with a meaning "stop waiting when "jiffies" reaches or passes "timeout", and adds two of the many possible wait..bit..timeout() interfaces: wait_on_page_bit_killable_timeout(), which is the one I want to use, and out_of_line_wait_on_bit_timeout() which is a reasonably general example. Others can be added as needed. Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: NeilBrown <neilb@suse.de> Acked-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-24sched/rt: Use resched_curr() in task_tick_rt()Kirill Tkhai
Some time ago PREEMPT_NEED_RESCHED was implemented, so reschedule technics is a little more difficult now. Signed-off-by: Kirill Tkhai <ktkhai@parallels.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/20140922183642.11015.66039.stgit@localhost Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-24sched: Use rq->rd in sched_setaffinity() under RCU read lockKirill Tkhai
Probability of use-after-free isn't zero in this place. Signed-off-by: Kirill Tkhai <ktkhai@parallels.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: <stable@vger.kernel.org> # v3.14+ Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/20140922183636.11015.83611.stgit@localhost Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-24sched: cleanup: Rename 'out_unlock' to 'out_free_new_mask'Kirill Tkhai
Nothing is locked there, so label's name only confuses a reader. Signed-off-by: Kirill Tkhai <ktkhai@parallels.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: http://lkml.kernel.org/r/20140922183630.11015.59500.stgit@localhost Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-24sched: Use dl_bw_of() under RCU read lockKirill Tkhai
dl_bw_of() dereferences rq->rd which has to have RCU read lock held. Probability of use-after-free isn't zero here. Also add lockdep assert into dl_bw_cpus(). Signed-off-by: Kirill Tkhai <ktkhai@parallels.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: <stable@vger.kernel.org> # v3.14+ Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/20140922183624.11015.71558.stgit@localhost Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-24sched/fair: Remove duplicate code from can_migrate_task()Kirill Tkhai
Combine two branches which do the same. Signed-off-by: Kirill Tkhai <ktkhai@parallels.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/20140922183612.11015.64200.stgit@localhost Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-24sched, mips, ia64: Remove __ARCH_WANT_UNLOCKED_CTXSWPeter Zijlstra
Kirill found that there's a subtle race in the __ARCH_WANT_UNLOCKED_CTXSW code, and instead of fixing it, remove the entire exception because neither arch that uses it seems to actually still require it. Boot tested on mips64el (qemu) only. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Kirill Tkhai <tkhai@yandex.ru> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Davidlohr Bueso <davidlohr@hp.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: James Hogan <james.hogan@imgtec.com> Cc: Kees Cook <keescook@chromium.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul Burton <paul.burton@imgtec.com> Cc: Qais Yousef <qais.yousef@imgtec.com> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Tony Luck <tony.luck@intel.com> Cc: oleg@redhat.com Cc: linux@roeck-us.net Cc: linux-ia64@vger.kernel.org Cc: linux-kernel@vger.kernel.org Cc: linux-mips@linux-mips.org Link: http://lkml.kernel.org/r/20140923150641.GH3312@worktop.programming.kicks-ass.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-24sched: print_rq(): Don't use tasklist_lockOleg Nesterov
read_lock_irqsave(tasklist_lock) in print_rq() looks strange. We do not need to disable irqs, and they are already disabled by the caller. And afaics this lock buys nothing, we can rely on rcu_read_lock(). In this case it makes sense to also move rcu_read_lock/unlock from the caller to print_rq(). Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Kirill Tkhai <tkhai@yandex.ru> Cc: Mike Galbraith <umgwanakikbuti@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/20140921193341.GA28628@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-24sched: normalize_rt_tasks(): Don't use _irqsave for tasklist_lock, use ↵Oleg Nesterov
task_rq_lock() 1. read_lock(tasklist_lock) does not need to disable irqs. 2. ->mm != NULL is a common mistake, use PF_KTHREAD. 3. The second ->mm check can be simply removed. 4. task_rq_lock() looks better than raw_spin_lock(&p->pi_lock) + __task_rq_lock(). Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Kirill Tkhai <tkhai@yandex.ru> Cc: Mike Galbraith <umgwanakikbuti@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/20140921193338.GA28621@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-24sched: Fix the task-group check in tg_has_rt_tasks()Oleg Nesterov
tg_has_rt_tasks() wants to find an RT task in this task_group, but task_rq(p)->rt.tg wrongly checks the root rt_rq. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Reviewed-by: Kirill Tkhai <ktkhai@parallels.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Mike Galbraith <umgwanakikbuti@gmail.com> Link: http://lkml.kernel.org/r/20140921193336.GA28618@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-24sched/fair: Leverage the idle state info when choosing the "idlest" cpuNicolas Pitre
The code in find_idlest_cpu() looks for the CPU with the smallest load. However, if multiple CPUs are idle, the first idle CPU is selected irrespective of the depth of its idle state. Among the idle CPUs we should pick the one with with the shallowest idle state, or the latest to have gone idle if all idle CPUs are in the same state. The later applies even when cpuidle is configured out. This patch doesn't cover the following issues: - The idle exit latency of a CPU might be larger than the time needed to migrate the waking task to an already running CPU with sufficient capacity, and therefore performance would benefit from task packing in such case (in most cases task packing is about power saving). - Some idle states have a non negligible and non abortable entry latency which needs to run to completion before the exit latency can start. A concurrent patch series is making this info available to the cpuidle core. Once available, the entry latency with the idle timestamp could determine when the exit latency may be effective. Those issues will be handled in due course. In the mean time, what is implemented here should improve things already compared to the current state of affairs. Based on an initial patch from Daniel Lezcano. Signed-off-by: Nicolas Pitre <nico@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Daniel Lezcano <daniel.lezcano@linaro.org> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: linux-pm@vger.kernel.org Cc: linaro-kernel@lists.linaro.org Link: http://lkml.kernel.org/n/tip-@git.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-24sched: Let the scheduler see CPU idle statesDaniel Lezcano
When the cpu enters idle, it stores the cpuidle state pointer in its struct rq instance which in turn could be used to make a better decision when balancing tasks. As soon as the cpu exits its idle state, the struct rq reference is cleared. There are a couple of situations where the idle state pointer could be changed while it is being consulted: 1. For x86/acpi with dynamic c-states, when a laptop switches from battery to AC that could result on removing the deeper idle state. The acpi driver triggers: 'acpi_processor_cst_has_changed' 'cpuidle_pause_and_lock' 'cpuidle_uninstall_idle_handler' 'kick_all_cpus_sync'. All cpus will exit their idle state and the pointed object will be set to NULL. 2. The cpuidle driver is unloaded. Logically that could happen but not in practice because the drivers are always compiled in and 95% of them are not coded to unregister themselves. In any case, the unloading code must call 'cpuidle_unregister_device', that calls 'cpuidle_pause_and_lock' leading to 'kick_all_cpus_sync' as mentioned above. A race can happen if we use the pointer and then one of these two scenarios occurs at the same moment. In order to be safe, the idle state pointer stored in the rq must be used inside a rcu_read_lock section where we are protected with the 'rcu_barrier' in the 'cpuidle_uninstall_idle_handler' function. The idle_get_state() and idle_put_state() accessors should be used to that effect. Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org> Signed-off-by: Nicolas Pitre <nico@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: linux-pm@vger.kernel.org Cc: linaro-kernel@lists.linaro.org Cc: Daniel Lezcano <daniel.lezcano@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/n/tip-@git.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-24sched/deadline: Fix inter- exclusive cpusets migrationsJuri Lelli
Users can perform clustered scheduling using the cpuset facility. After an exclusive cpuset is created, task migrations happen only between CPUs belonging to the same cpuset. Inter- cpuset migrations can only happen when the user requires so, moving a task between different cpusets. This behaviour is broken in SCHED_DEADLINE, as currently spurious inter- cpuset migration may happen without user intervention. This patch fix the problem (and shuffles the code a bit to improve clarity). Signed-off-by: Juri Lelli <juri.lelli@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: raistlin@linux.it Cc: michael@amarulasolutions.com Cc: fchecconi@gmail.com Cc: daniel.wagner@bmw-carit.de Cc: vincent@legout.info Cc: luca.abeni@unitn.it Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/1411118561-26323-4-git-send-email-juri.lelli@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-24sched/deadline: Clear dl_entity params when setscheduling to different classJuri Lelli
When a task is using SCHED_DEADLINE and the user setschedules it to a different class its sched_dl_entity static parameters are not cleaned up. This causes a bug if the user sets it back to SCHED_DEADLINE with the same parameters again. The problem resides in the check we perform at the very beginning of dl_overflow(): if (new_bw == p->dl.dl_bw) return 0; This condition is met in the case depicted above, so the function returns and dl_b->total_bw is not updated (the p->dl.dl_bw is not added to it). After this, admission control is broken. This patch fixes the thing, properly clearing static parameters for a task that ceases to use SCHED_DEADLINE. Reported-by: Daniele Alessandrelli <daniele.alessandrelli@gmail.com> Reported-by: Daniel Wagner <daniel.wagner@bmw-carit.de> Reported-by: Vincent Legout <vincent@legout.info> Tested-by: Luca Abeni <luca.abeni@unitn.it> Tested-by: Daniel Wagner <daniel.wagner@bmw-carit.de> Tested-by: Vincent Legout <vincent@legout.info> Signed-off-by: Juri Lelli <juri.lelli@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Fabio Checconi <fchecconi@gmail.com> Cc: Dario Faggioli <raistlin@linux.it> Cc: Michael Trimarchi <michael@amarulasolutions.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/1411118561-26323-2-git-send-email-juri.lelli@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-24sched/numa: Kill the wrong/dead TASK_DEAD check in task_numa_fault()Oleg Nesterov
current->state == TASK_DEAD means that the task is doing its last schedule(), page fault is obviously impossible at this stage. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Mel Gorman <mgorman@suse.de> Acked-by: Rik van Riel <riel@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/20140921194743.GA30114@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-21sched: Clean up some typos and grammatical errors in code/commentsZhihui Zhang
Signed-off-by: Zhihui Zhang <zzhsuny@gmail.com> Cc: peterz@infradead.org Link: http://lkml.kernel.org/r/1411262676-19928-1-git-send-email-zzhsuny@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-19sched: Test the CPU's capacity in wake_affine()Vincent Guittot
Currently the task always wakes affine on this_cpu if the latter is idle. Before waking up the task on this_cpu, we check that this_cpu capacity is not significantly reduced because of RT tasks or irq activity. Use case where the number of irq and/or the time spent under irq is important will take benefit of this because the task that is woken up by irq or softirq will not use the same CPU than irq (and softirq) but a idle one. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: preeti@linux.vnet.ibm.com Cc: riel@redhat.com Cc: Morten.Rasmussen@arm.com Cc: efault@gmx.de Cc: nicolas.pitre@linaro.org Cc: daniel.lezcano@linaro.org Cc: dietmar.eggemann@arm.com Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/1409051215-16788-8-git-send-email-vincent.guittot@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-19sched: Allow all architectures to set 'capacity_orig'Vincent Guittot
'capacity_orig' is only changed for systems with an SMT sched_domain level in order to reflect the lower capacity of CPUs. Heterogenous systems also have to reflect an original capacity that is different from the default value. Create a more generic function arch_scale_cpu_capacity that can be also used by non SMT platforms to set capacity_orig. The __weak implementation of arch_scale_cpu_capacity() is the previous SMT variant, in order to keep backward compatibility with the use of capacity_orig. arch_scale_smt_capacity() and default_scale_smt_capacity() have been removed as they were not used elsewhere than in arch_scale_cpu_capacity(). Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Reviewed-by: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com> Reviewed-by: Preeti U. Murthy <preeti@linux.vnet.ibm.com> [ Added default_scale_cpu_capacity() back. ] Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: riel@redhat.com Cc: Morten.Rasmussen@arm.com Cc: efault@gmx.de Cc: nicolas.pitre@linaro.org Cc: daniel.lezcano@linaro.org Cc: dietmar.eggemann@arm.com Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/1409051215-16788-5-git-send-email-vincent.guittot@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-19sched: Fix avg_load computationVincent Guittot
The computation of avg_load and avg_load_per_task should only take into account the number of CFS tasks. The non-CFS tasks are already taken into account by decreasing the CPU's capacity and they will be tracked in the CPU's utilization (group_utilization) of the next patches. Reviewed-by: Preeti U Murthy <preeti@linux.vnet.ibm.com> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: riel@redhat.com Cc: Morten.Rasmussen@arm.com Cc: efault@gmx.de Cc: nicolas.pitre@linaro.org Cc: daniel.lezcano@linaro.org Cc: dietmar.eggemann@arm.com Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/1409051215-16788-4-git-send-email-vincent.guittot@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-19sched: Remove a wake_affine() conditionVincent Guittot
In wake_affine() I have tried to understand the meaning of the condition: (this_load <= load && this_load + target_load(prev_cpu, idx) <= tl_per_task) but I failed to find a use case that can take advantage of it and I haven't found clear description in the previous commit's log. Futhermore, the comment of the condition refers to the task_hot function that was used before being replaced by the current condition: /* * This domain has SD_WAKE_AFFINE and * p is cache cold in this domain, and * there is no bad imbalance. */ If we look more deeply the below condition: this_load + target_load(prev_cpu, idx) <= tl_per_task When sync is clear, we have: tl_per_task = runnable_load_avg / nr_running this_load = max(runnable_load_avg, cpuload[idx]) target_load = max(runnable_load_avg', cpuload'[idx]) It implies that runnable_load_avg == 0 and nr_running <= 1 in order to match the condition. This implies that runnable_load_avg == 0 too because of the condition: this_load <= load. but if this _load is null, 'balanced' is already set and the test is redundant. If sync is set, it's not as straight forward as above (especially if cgroup are involved) but the policy should be similar as we have removed a task that's going to sleep in order to get a more accurate load and this_load values. The current conclusion is that these additional condition don't give any benefit so we can remove them. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: preeti@linux.vnet.ibm.com Cc: riel@redhat.com Cc: Morten.Rasmussen@arm.com Cc: efault@gmx.de Cc: nicolas.pitre@linaro.org Cc: daniel.lezcano@linaro.org Cc: dietmar.eggemann@arm.com Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/1409051215-16788-3-git-send-email-vincent.guittot@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-19sched: Fix imbalance flag resetVincent Guittot
The imbalance flag can stay set whereas there is no imbalance. Let assume that we have 3 tasks that run on a dual cores /dual cluster system. We will have some idle load balance which are triggered during tick. Unfortunately, the tick is also used to queue background work so we can reach the situation where short work has been queued on a CPU which already runs a task. The load balance will detect this imbalance (2 tasks on 1 CPU and an idle CPU) and will try to pull the waiting task on the idle CPU. The waiting task is a worker thread that is pinned on a CPU so an imbalance due to pinned task is detected and the imbalance flag is set. Then, we will not be able to clear the flag because we have at most 1 task on each CPU but the imbalance flag will trig to useless active load balance between the idle CPU and the busy CPU. We need to reset of the imbalance flag as soon as we have reached a balanced state. If all tasks are pinned, we don't consider that as a balanced state and let the imbalance flag set. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Reviewed-by: Preeti U Murthy <preeti@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: riel@redhat.com Cc: Morten.Rasmussen@arm.com Cc: efault@gmx.de Cc: nicolas.pitre@linaro.org Cc: daniel.lezcano@linaro.org Cc: dietmar.eggemann@arm.com Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/1409051215-16788-2-git-send-email-vincent.guittot@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-19sched: Add default-disabled option to BUG() when stack end location is ↵Aaron Tomlin
overwritten Currently in the event of a stack overrun a call to schedule() does not check for this type of corruption. This corruption is often silent and can go unnoticed. However once the corrupted region is examined at a later stage, the outcome is undefined and often results in a sporadic page fault which cannot be handled. This patch checks for a stack overrun and takes appropriate action since the damage is already done, there is no point in continuing. Signed-off-by: Aaron Tomlin <atomlin@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: aneesh.kumar@linux.vnet.ibm.com Cc: dzickus@redhat.com Cc: bmr@redhat.com Cc: jcastillo@redhat.com Cc: oleg@redhat.com Cc: riel@redhat.com Cc: prarit@redhat.com Cc: jgh@redhat.com Cc: minchan@kernel.org Cc: mpe@ellerman.id.au Cc: tglx@linutronix.de Cc: rostedt@goodmis.org Cc: hannes@cmpxchg.org Cc: Alexei Starovoitov <ast@plumgrid.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andi Kleen <ak@linux.intel.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Dan Streetman <ddstreet@ieee.org> Cc: Davidlohr Bueso <davidlohr@hp.com> Cc: David S. Miller <davem@davemloft.net> Cc: Kees Cook <keescook@chromium.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Lubomir Rintel <lkundrak@v3.sk> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Link: http://lkml.kernel.org/r/1410527779-8133-4-git-send-email-atomlin@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-19sched: Do not stop cpu in set_cpus_allowed_ptr() if task is not runningKirill Tkhai
If a task is queued but not running on it rq, we can simply migrate it without migration thread and switching of context. Signed-off-by: Kirill Tkhai <ktkhai@parallels.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/1410519814.3569.7.camel@tkhai Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-19sched/dl: Simplify pick_dl_task()Kirill Tkhai
1) Nobody calls pick_dl_task() with negative cpu, it's old RT leftover. 2) If p->nr_cpus_allowed is 1, than the affinity has just been changed in set_cpus_allowed_ptr(); we'll pick it just earlier than migration thread. Signed-off-by: Kirill Tkhai <ktkhai@parallels.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: http://lkml.kernel.org/r/1410529340.3569.27.camel@tkhai Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-19sched/rt: Remove useless if from cleanup pick_next_task_rt()Kirill Tkhai
_pick_next_task_rt() never returns NULL. Signed-off-by: Kirill Tkhai <ktkhai@parallels.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: http://lkml.kernel.org/r/1410529321.3569.26.camel@tkhai Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-19sched/core: Use put_prev_task() accessor where possibleKirill Tkhai
Signed-off-by: Kirill Tkhai <ktkhai@parallels.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: http://lkml.kernel.org/r/1410529300.3569.25.camel@tkhai Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-19sched/fair: cleanup: Remove useless assignment in select_task_rq_fair()Kirill Tkhai
new_cpu is reassigned below, so we do not need this here. Signed-off-by: Kirill Tkhai <ktkhai@parallels.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: http://lkml.kernel.org/r/1410529276.3569.24.camel@tkhai Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-19sched, time: Fix lock inversion in thread_group_cputime()Rik van Riel
The sig->stats_lock nests inside the tasklist_lock and the sighand->siglock in __exit_signal and wait_task_zombie. However, both of those locks can be taken from irq context, which means we need to use the interrupt safe variant of read_seqbegin_or_lock. This blocks interrupts when the "lock" branch is taken (seq is odd), preventing the lock inversion. On the first (lockless) pass through the loop, irqs are not blocked. Reported-by: Stanislaw Gruszka <sgruszka@redhat.com> Signed-off-by: Rik van Riel <riel@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: prarit@redhat.com Cc: oleg@redhat.com Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/1410527535-9814-3-git-send-email-riel@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-19sched: Add new API wake_up_if_idle() to wake up the idle cpuChuansheng Liu
Implementing one new API wake_up_if_idle(), which is used to wake up the idle CPU. Suggested-by: Andy Lutomirski <luto@amacapital.net> Signed-off-by: Chuansheng Liu <chuansheng.liu@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: daniel.lezcano@linaro.org Cc: rjw@rjwysocki.net Cc: linux-pm@vger.kernel.org Cc: changcheng.liu@intel.com Cc: xiaoming.wang@intel.com Cc: souvik.k.chakravarty@intel.com Cc: chuansheng.liu@intel.com Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/1409815075-4180-1-git-send-email-chuansheng.liu@intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-19sched/numa: Use select_idle_sibling() to select a destination for ↵Rik van Riel
task_numa_move() The code in task_numa_compare() will only examine at most one idle CPU per node, because they all have the same score. However, some idle CPUs are better candidates than others, due to busy or idle SMT siblings, etc... The scheduler has logic to find the best CPU within an LLC to place a task. The NUMA code should probably use it. This seems to reduce the standard deviation for single instance SPECjbb2005 with a low warehouse count on my 4 node test system. Signed-off-by: Rik van Riel <riel@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: mgorman@suse.de Cc: Mike Galbraith <umgwanakikbuti@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/20140904163530.189d410a@cuia.bos.redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-09sched: Reduce contention in update_cfs_rq_blocked_load()Jason Low
When running workloads on 2+ socket systems, based on perf profiles, the update_cfs_rq_blocked_load() function often shows up as taking up a noticeable % of run time. Much of the contention is in __update_cfs_rq_tg_load_contrib() when we update the tg load contribution stats. However, it turns out that in many cases, they don't need to be updated and "tg_contrib" is 0. This patch adds a check in __update_cfs_rq_tg_load_contrib() to skip updating tg load contribution stats when nothing needs to be updated. This reduces the cacheline contention that would be unnecessary. Reviewed-by: Ben Segall <bsegall@google.com> Reviewed-by: Waiman Long <Waiman.Long@hp.com> Signed-off-by: Jason Low <jason.low2@hp.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Paul Turner <pjt@google.com> Cc: jason.low2@hp.com Cc: Yuyang Du <yuyang.du@intel.com> Cc: Aswin Chandramouleeswaran <aswin@hp.com> Cc: Chegu Vinod <chegu_vinod@hp.com> Cc: Scott J Norton <scott.norton@hp.com> Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/1409643684.19197.15.camel@j-VirtualBox Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-09sched: Migrate waking tasksLai Jiangshan
Current code can fail to migrate a waking task (silently) when TTWU_QUEUE is enabled. When a task is waking, it is pending on the wake_list of the rq, but it is not queued (task->on_rq == 0). In this case, set_cpus_allowed_ptr() and __migrate_task() will not migrate it because its invisible to them. This behavior is incorrect, because the task has been already woken, it will be running on the wrong CPU without correct placement until the next wake-up or update for cpus_allowed. To fix this problem, we need to finish the wakeup (so they appear on the runqueue) before we migrate them. Reported-by: Sasha Levin <sasha.levin@oracle.com> Reported-by: Jason J. Herne <jjherne@linux.vnet.ibm.com> Tested-by: Jason J. Herne <jjherne@linux.vnet.ibm.com> Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/538ED7EB.5050303@cn.fujitsu.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-08sched, time: Atomically increment stime & utimeRik van Riel
The functions task_cputime_adjusted and thread_group_cputime_adjusted() can be called locklessly, as well as concurrently on many different CPUs. This can occasionally lead to the utime and stime reported by times(), and other syscalls like it, going backward. The cause for this appears to be multiple threads racing in cputime_adjust(), both with values for utime or stime that is larger than the original, but each with a different value. Sometimes the larger value gets saved first, only to be immediately overwritten with a smaller value by another thread. Using atomic exchange prevents that problem, and ensures time progresses monotonically. Signed-off-by: Rik van Riel <riel@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: umgwanakikbuti@gmail.com Cc: fweisbec@gmail.com Cc: akpm@linux-foundation.org Cc: srao@redhat.com Cc: lwoodman@redhat.com Cc: atheurer@redhat.com Cc: oleg@redhat.com Link: http://lkml.kernel.org/r/1408133138-22048-4-git-send-email-riel@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>