summaryrefslogtreecommitdiffstats
path: root/kernel
AgeCommit message (Collapse)Author
2018-10-06Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netGreg Kroah-Hartman
Dave writes: "Networking fixes: 1) Fix truncation of 32-bit right shift in bpf, from Jann Horn. 2) Fix memory leak in wireless wext compat, from Stefan Seyfried. 3) Use after free in cfg80211's reg_process_hint(), from Yu Zhao. 4) Need to cancel pending work when unbinding in smsc75xx otherwise we oops, also from Yu Zhao. 5) Don't allow enslaving a team device to itself, from Ido Schimmel. 6) Fix backwards compat with older userspace for rtnetlink FDB dumps. From Mauricio Faria. 7) Add validation of tc policy netlink attributes, from David Ahern. 8) Fix RCU locking in rawv6_send_hdrinc(), from Wei Wang." * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (26 commits) net: mvpp2: Extract the correct ethtype from the skb for tx csum offload ipv6: take rcu lock in rawv6_send_hdrinc() net: sched: Add policy validation for tc attributes rtnetlink: fix rtnl_fdb_dump() for ndmsg header yam: fix a missing-check bug net: bpfilter: Fix type cast and pointer warnings net: cxgb3_main: fix a missing-check bug bpf: 32-bit RSH verification must truncate input before the ALU op net: phy: phylink: fix SFP interface autodetection be2net: don't flip hw_features when VXLANs are added/deleted net/packet: fix packet drop as of virtio gso net: dsa: b53: Keep CPU port as tagged in all VLANs openvswitch: load NAT helper bnxt_en: get the reduced max_irqs by the ones used by RDMA bnxt_en: free hwrm resources, if driver probe fails. bnxt_en: Fix enables field in HWRM_QUEUE_COS2BW_CFG request bnxt_en: Fix VNIC reservations on the PF. team: Forbid enslaving team device to itself net/usb: cancel pending work when unbinding smsc75xx mlxsw: spectrum: Delete RIF when VLAN device is removed ...
2018-10-05Merge branch 'perf-urgent-for-linus' of ↵Greg Kroah-Hartman
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Ingo writes: "perf fixes: - fix a CPU#0 hot unplug bug and a PCI enumeration bug in the x86 Intel uncore PMU driver - fix a CPU event enumeration bug in the x86 AMD PMU driver - fix a perf ring-buffer corruption bug when using tracepoints - fix a PMU unregister locking bug" * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf/x86/amd/uncore: Set ThreadMask and SliceMask for L3 Cache perf events perf/x86/intel/uncore: Fix PCI BDF address of M3UPI on SKX perf/ring_buffer: Prevent concurent ring buffer access perf/x86/intel/uncore: Use boot_cpu_data.phys_proc_id instead of hardcorded physical package ID 0 perf/core: Fix perf_pmu_unregister() locking
2018-10-05Merge branch 'sched-urgent-for-linus' of ↵Greg Kroah-Hartman
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Ingo writes: "scheduler fixes: These fixes address a rather involved performance regression between v4.17->v4.19 in the sched/numa auto-balancing code. Since distros really need this fix we accelerated it to sched/urgent for a faster upstream merge. NUMA scheduling and balancing performance is now largely back to v4.17 levels, without reintroducing the NUMA placement bugs that v4.18 and v4.19 fixed. Many thanks to Srikar Dronamraju, Mel Gorman and Jirka Hladky, for reporting, testing, re-testing and solving this rather complex set of bugs." * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/numa: Migrate pages to local nodes quicker early in the lifetime of a task mm, sched/numa: Remove rate-limiting of automatic NUMA balancing migration sched/numa: Avoid task migration for small NUMA improvement mm/migrate: Use spin_trylock() while resetting rate limit sched/numa: Limit the conditions where scan period is reset sched/numa: Reset scan rate whenever task moves across nodes sched/numa: Pass destination CPU as a parameter to migrate_task_rq sched/numa: Stop multiple tasks from moving to the CPU at the same time
2018-10-05Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfDavid S. Miller
Daniel Borkmann says: ==================== pull-request: bpf 2018-10-05 The following pull-request contains BPF updates for your *net* tree. The main changes are: 1) Fix to truncate input on ALU operations in 32 bit mode, from Jann. 2) Fixes for cgroup local storage to reject reserved flags on element update and rejection of map allocation with zero-sized value, from Roman. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-05bpf: 32-bit RSH verification must truncate input before the ALU opJann Horn
When I wrote commit 468f6eafa6c4 ("bpf: fix 32-bit ALU op verification"), I assumed that, in order to emulate 64-bit arithmetic with 32-bit logic, it is sufficient to just truncate the output to 32 bits; and so I just moved the register size coercion that used to be at the start of the function to the end of the function. That assumption is true for almost every op, but not for 32-bit right shifts, because those can propagate information towards the least significant bit. Fix it by always truncating inputs for 32-bit ops to 32 bits. Also get rid of the coerce_reg_to_size() after the ALU op, since that has no effect. Fixes: 468f6eafa6c4 ("bpf: fix 32-bit ALU op verification") Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Jann Horn <jannh@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-10-03locking/ww_mutex: Fix runtime warning in the WW mutex selftestGuenter Roeck
If CONFIG_WW_MUTEX_SELFTEST=y is enabled, booting an image in an arm64 virtual machine results in the following traceback if 8 CPUs are enabled: DEBUG_LOCKS_WARN_ON(__owner_task(owner) != current) WARNING: CPU: 2 PID: 537 at kernel/locking/mutex.c:1033 __mutex_unlock_slowpath+0x1a8/0x2e0 ... Call trace: __mutex_unlock_slowpath() ww_mutex_unlock() test_cycle_work() process_one_work() worker_thread() kthread() ret_from_fork() If requesting b_mutex fails with -EDEADLK, the error variable is reassigned to the return value from calling ww_mutex_lock on a_mutex again. If this call fails, a_mutex is not locked. It is, however, unconditionally unlocked subsequently, causing the reported warning. Fix the problem by using two error variables. With this change, the selftest still fails as follows: cyclic deadlock not resolved, ret[7/8] = -35 However, the traceback is gone. Signed-off-by: Guenter Roeck <linux@roeck-us.net> Cc: Chris Wilson <chris@chris-wilson.co.uk> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will.deacon@arm.com> Fixes: d1b42b800e5d0 ("locking/ww_mutex: Add kselftests for resolving ww_mutex cyclic deadlocks") Link: http://lkml.kernel.org/r/1538516929-9734-1-git-send-email-linux@roeck-us.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-10-02bpf: don't accept cgroup local storage with zero value sizeRoman Gushchin
Explicitly forbid creating cgroup local storage maps with zero value size, as it makes no sense and might even cause a panic. Reported-by: syzbot+18628320d3b14a5c459c@syzkaller.appspotmail.com Signed-off-by: Roman Gushchin <guro@fb.com> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-10-02sched/numa: Migrate pages to local nodes quicker early in the lifetime of a taskMel Gorman
Automatic NUMA Balancing uses a multi-stage pass to decide whether a page should migrate to a local node. This filter avoids excessive ping-ponging if a page is shared or used by threads that migrate cross-node frequently. Threads inherit both page tables and the preferred node ID from the parent. This means that threads can trigger hinting faults earlier than a new task which delays scanning for a number of seconds. As it can be load balanced very early in its lifetime there can be an unnecessary delay before it starts migrating thread-local data. This patch migrates private pages faster early in the lifetime of a thread using the sequence counter as an identifier of new tasks. With this patch applied, STREAM performance is the same as 4.17 even though processes are not spread cross-node prematurely. Other workloads showed a mix of minor gains and losses. This is somewhat expected most workloads are not very sensitive to the starting conditions of a process. 4.19.0-rc5 4.19.0-rc5 4.17.0 numab-v1r1 fastmigrate-v1r1 vanilla MB/sec copy 43298.52 ( 0.00%) 47335.46 ( 9.32%) 47219.24 ( 9.06%) MB/sec scale 30115.06 ( 0.00%) 32568.12 ( 8.15%) 32527.56 ( 8.01%) MB/sec add 32825.12 ( 0.00%) 36078.94 ( 9.91%) 35928.02 ( 9.45%) MB/sec triad 32549.52 ( 0.00%) 35935.94 ( 10.40%) 35969.88 ( 10.51%) Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Reviewed-by: Rik van Riel <riel@surriel.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Jirka Hladky <jhladky@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Linux-MM <linux-mm@kvack.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20181001100525.29789-3-mgorman@techsingularity.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-10-02sched/numa: Avoid task migration for small NUMA improvementSrikar Dronamraju
If NUMA improvement from the task migration is going to be very minimal, then avoid task migration. Specjbb2005 results (8 warehouses) Higher bops are better 2 Socket - 2 Node Haswell - X86 JVMS Prev Current %Change 4 198512 205910 3.72673 1 313559 318491 1.57291 2 Socket - 4 Node Power8 - PowerNV JVMS Prev Current %Change 8 74761.9 74935.9 0.232739 1 214874 226796 5.54837 2 Socket - 2 Node Power9 - PowerNV JVMS Prev Current %Change 4 180536 189780 5.12031 1 210281 205695 -2.18089 4 Socket - 4 Node Power7 - PowerVM JVMS Prev Current %Change 8 56511.4 60370 6.828 1 104899 108100 3.05151 1/7 cases is regressing, if we look at events migrate_pages seem to vary the most especially in the regressing case. Also some amount of variance is expected between different runs of Specjbb2005. Some events stats before and after applying the patch. perf stats 8th warehouse Multi JVM 2 Socket - 2 Node Haswell - X86 Event Before After cs 13,818,546 13,801,554 migrations 1,149,960 1,151,541 faults 385,583 433,246 cache-misses 55,259,546,768 55,168,691,835 sched:sched_move_numa 2,257 2,551 sched:sched_stick_numa 9 24 sched:sched_swap_numa 512 904 migrate:mm_migrate_pages 2,225 1,571 vmstat 8th warehouse Multi JVM 2 Socket - 2 Node Haswell - X86 Event Before After numa_hint_faults 72692 113682 numa_hint_faults_local 62270 102163 numa_hit 238762 240181 numa_huge_pte_updates 48 36 numa_interleave 75 64 numa_local 238676 240103 numa_other 86 78 numa_pages_migrated 2225 1564 numa_pte_updates 98557 134080 perf stats 8th warehouse Single JVM 2 Socket - 2 Node Haswell - X86 Event Before After cs 3,173,490 3,079,150 migrations 36,966 31,455 faults 108,776 99,081 cache-misses 12,200,075,320 11,588,126,740 sched:sched_move_numa 1,264 1 sched:sched_stick_numa 0 0 sched:sched_swap_numa 0 0 migrate:mm_migrate_pages 899 36 vmstat 8th warehouse Single JVM 2 Socket - 2 Node Haswell - X86 Event Before After numa_hint_faults 21109 430 numa_hint_faults_local 17120 77 numa_hit 72934 71277 numa_huge_pte_updates 42 0 numa_interleave 33 22 numa_local 72866 71218 numa_other 68 59 numa_pages_migrated 915 23 numa_pte_updates 42326 0 perf stats 8th warehouse Multi JVM 2 Socket - 2 Node Power9 - PowerNV Event Before After cs 8,312,022 8,707,565 migrations 231,705 171,342 faults 310,242 310,820 cache-misses 402,324,573 136,115,400 sched:sched_move_numa 193 215 sched:sched_stick_numa 0 6 sched:sched_swap_numa 3 24 migrate:mm_migrate_pages 93 162 vmstat 8th warehouse Multi JVM 2 Socket - 2 Node Power9 - PowerNV Event Before After numa_hint_faults 11838 8985 numa_hint_faults_local 11216 8154 numa_hit 90689 93819 numa_huge_pte_updates 0 0 numa_interleave 1579 882 numa_local 89634 93496 numa_other 1055 323 numa_pages_migrated 92 169 numa_pte_updates 12109 9217 perf stats 8th warehouse Single JVM 2 Socket - 2 Node Power9 - PowerNV Event Before After cs 2,170,481 2,152,072 migrations 10,126 10,704 faults 160,962 164,376 cache-misses 10,834,845 3,818,437 sched:sched_move_numa 10 16 sched:sched_stick_numa 0 0 sched:sched_swap_numa 0 7 migrate:mm_migrate_pages 2 199 vmstat 8th warehouse Single JVM 2 Socket - 2 Node Power9 - PowerNV Event Before After numa_hint_faults 403 2248 numa_hint_faults_local 358 1666 numa_hit 25898 25704 numa_huge_pte_updates 0 0 numa_interleave 207 200 numa_local 25860 25679 numa_other 38 25 numa_pages_migrated 2 197 numa_pte_updates 400 2234 perf stats 8th warehouse Multi JVM 4 Socket - 4 Node Power7 - PowerVM Event Before After cs 110,339,633 93,330,595 migrations 4,139,812 4,122,061 faults 863,622 865,979 cache-misses 231,838,045,660 225,395,083,479 sched:sched_move_numa 2,196 2,372 sched:sched_stick_numa 33 24 sched:sched_swap_numa 544 769 migrate:mm_migrate_pages 2,469 1,677 vmstat 8th warehouse Multi JVM 4 Socket - 4 Node Power7 - PowerVM Event Before After numa_hint_faults 85748 91638 numa_hint_faults_local 66831 78096 numa_hit 242213 242225 numa_huge_pte_updates 0 0 numa_interleave 0 2 numa_local 242211 242219 numa_other 2 6 numa_pages_migrated 2376 1515 numa_pte_updates 86233 92274 perf stats 8th warehouse Single JVM 4 Socket - 4 Node Power7 - PowerVM Event Before After cs 59,331,057 51,487,271 migrations 552,019 537,170 faults 266,586 256,921 cache-misses 73,796,312,990 70,073,831,187 sched:sched_move_numa 981 576 sched:sched_stick_numa 54 24 sched:sched_swap_numa 286 327 migrate:mm_migrate_pages 713 726 vmstat 8th warehouse Single JVM 4 Socket - 4 Node Power7 - PowerVM Event Before After numa_hint_faults 14807 12000 numa_hint_faults_local 5738 5024 numa_hit 36230 36470 numa_huge_pte_updates 0 0 numa_interleave 0 0 numa_local 36228 36465 numa_other 2 5 numa_pages_migrated 703 726 numa_pte_updates 14742 11930 Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Jirka Hladky <jhladky@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@surriel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1537552141-27815-7-git-send-email-srikar@linux.vnet.ibm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-10-02sched/numa: Limit the conditions where scan period is resetMel Gorman
migrate_task_rq_fair() resets the scan rate for NUMA balancing on every cross-node migration. In the event of excessive load balancing due to saturation, this may result in the scan rate being pegged at maximum and further overloading the machine. This patch only resets the scan if NUMA balancing is active, a preferred node has been selected and the task is being migrated from the preferred node as these are the most harmful. For example, a migration to the preferred node does not justify a faster scan rate. Similarly, a migration between two nodes that are not preferred is probably bouncing due to over-saturation of the machine. In that case, scanning faster and trapping more NUMA faults will further overload the machine. Specjbb2005 results (8 warehouses) Higher bops are better 2 Socket - 2 Node Haswell - X86 JVMS Prev Current %Change 4 203370 205332 0.964744 1 328431 319785 -2.63252 2 Socket - 4 Node Power8 - PowerNV JVMS Prev Current %Change 1 206070 206585 0.249915 2 Socket - 2 Node Power9 - PowerNV JVMS Prev Current %Change 4 188386 189162 0.41192 1 201566 213760 6.04963 4 Socket - 4 Node Power7 - PowerVM JVMS Prev Current %Change 8 59157.4 58736.8 -0.710985 1 105495 105419 -0.0720413 Some events stats before and after applying the patch. perf stats 8th warehouse Multi JVM 2 Socket - 2 Node Haswell - X86 Event Before After cs 13,825,492 14,285,708 migrations 1,152,509 1,180,621 faults 371,948 339,114 cache-misses 55,654,206,041 55,205,631,894 sched:sched_move_numa 1,856 843 sched:sched_stick_numa 4 6 sched:sched_swap_numa 428 219 migrate:mm_migrate_pages 898 365 vmstat 8th warehouse Multi JVM 2 Socket - 2 Node Haswell - X86 Event Before After numa_hint_faults 57146 26907 numa_hint_faults_local 51612 24279 numa_hit 238164 239771 numa_huge_pte_updates 16 0 numa_interleave 63 68 numa_local 238085 239688 numa_other 79 83 numa_pages_migrated 883 363 numa_pte_updates 67540 27415 perf stats 8th warehouse Single JVM 2 Socket - 2 Node Haswell - X86 Event Before After cs 3,288,525 3,202,779 migrations 38,652 37,186 faults 111,678 106,076 cache-misses 12,111,197,376 12,024,873,744 sched:sched_move_numa 900 931 sched:sched_stick_numa 0 0 sched:sched_swap_numa 5 1 migrate:mm_migrate_pages 714 637 vmstat 8th warehouse Single JVM 2 Socket - 2 Node Haswell - X86 Event Before After numa_hint_faults 18572 17409 numa_hint_faults_local 14850 14367 numa_hit 73197 73953 numa_huge_pte_updates 11 20 numa_interleave 25 25 numa_local 73138 73892 numa_other 59 61 numa_pages_migrated 712 668 numa_pte_updates 24021 27276 perf stats 8th warehouse Multi JVM 2 Socket - 2 Node Power9 - PowerNV Event Before After cs 8,451,543 8,474,013 migrations 202,804 254,934 faults 310,024 320,506 cache-misses 253,522,507 110,580,458 sched:sched_move_numa 213 725 sched:sched_stick_numa 0 0 sched:sched_swap_numa 2 7 migrate:mm_migrate_pages 88 145 vmstat 8th warehouse Multi JVM 2 Socket - 2 Node Power9 - PowerNV Event Before After numa_hint_faults 11830 22797 numa_hint_faults_local 11301 21539 numa_hit 90038 89308 numa_huge_pte_updates 0 0 numa_interleave 855 865 numa_local 89796 88955 numa_other 242 353 numa_pages_migrated 88 149 numa_pte_updates 12039 22930 perf stats 8th warehouse Single JVM 2 Socket - 2 Node Power9 - PowerNV Event Before After cs 2,049,153 2,195,628 migrations 11,405 11,179 faults 162,309 149,656 cache-misses 7,203,343 8,117,515 sched:sched_move_numa 22 49 sched:sched_stick_numa 0 0 sched:sched_swap_numa 0 0 migrate:mm_migrate_pages 1 5 vmstat 8th warehouse Single JVM 2 Socket - 2 Node Power9 - PowerNV Event Before After numa_hint_faults 1693 3577 numa_hint_faults_local 1669 3476 numa_hit 25177 26142 numa_huge_pte_updates 0 0 numa_interleave 194 358 numa_local 24993 26042 numa_other 184 100 numa_pages_migrated 1 5 numa_pte_updates 1577 3587 perf stats 8th warehouse Multi JVM 4 Socket - 4 Node Power7 - PowerVM Event Before After cs 94,515,937 100,602,296 migrations 4,203,554 4,135,630 faults 832,697 789,256 cache-misses 226,248,698,331 226,160,621,058 sched:sched_move_numa 1,730 1,366 sched:sched_stick_numa 14 16 sched:sched_swap_numa 432 374 migrate:mm_migrate_pages 1,398 1,350 vmstat 8th warehouse Multi JVM 4 Socket - 4 Node Power7 - PowerVM Event Before After numa_hint_faults 80079 47857 numa_hint_faults_local 68620 39768 numa_hit 241187 240165 numa_huge_pte_updates 0 0 numa_interleave 0 0 numa_local 241186 240165 numa_other 1 0 numa_pages_migrated 1347 1224 numa_pte_updates 80729 48354 perf stats 8th warehouse Single JVM 4 Socket - 4 Node Power7 - PowerVM Event Before After cs 63,704,961 58,515,496 migrations 573,404 564,845 faults 230,878 245,807 cache-misses 76,568,222,781 73,603,757,976 sched:sched_move_numa 509 996 sched:sched_stick_numa 31 10 sched:sched_swap_numa 182 193 migrate:mm_migrate_pages 541 646 vmstat 8th warehouse Single JVM 4 Socket - 4 Node Power7 - PowerVM Event Before After numa_hint_faults 8501 13422 numa_hint_faults_local 2960 5619 numa_hit 35526 36118 numa_huge_pte_updates 0 0 numa_interleave 0 0 numa_local 35526 36116 numa_other 0 2 numa_pages_migrated 539 616 numa_pte_updates 8433 13374 Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Jirka Hladky <jhladky@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@surriel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1537552141-27815-5-git-send-email-srikar@linux.vnet.ibm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-10-02sched/numa: Reset scan rate whenever task moves across nodesSrikar Dronamraju
Currently task scan rate is reset when NUMA balancer migrates the task to a different node. If NUMA balancer initiates a swap, reset is only applicable to the task that initiates the swap. Similarly no scan rate reset is done if the task is migrated across nodes by traditional load balancer. Instead move the scan reset to the migrate_task_rq. This ensures the task moved out of its preferred node, either gets back to its preferred node quickly or finds a new preferred node. Doing so, would be fair to all tasks migrating across nodes. Specjbb2005 results (8 warehouses) Higher bops are better 2 Socket - 2 Node Haswell - X86 JVMS Prev Current %Change 4 200668 203370 1.3465 1 321791 328431 2.06345 2 Socket - 4 Node Power8 - PowerNV JVMS Prev Current %Change 1 204848 206070 0.59654 2 Socket - 2 Node Power9 - PowerNV JVMS Prev Current %Change 4 188098 188386 0.153112 1 200351 201566 0.606436 4 Socket - 4 Node Power7 - PowerVM JVMS Prev Current %Change 8 58145.9 59157.4 1.73959 1 103798 105495 1.63491 Some events stats before and after applying the patch. perf stats 8th warehouse Multi JVM 2 Socket - 2 Node Haswell - X86 Event Before After cs 13,912,183 13,825,492 migrations 1,155,931 1,152,509 faults 367,139 371,948 cache-misses 54,240,196,814 55,654,206,041 sched:sched_move_numa 1,571 1,856 sched:sched_stick_numa 9 4 sched:sched_swap_numa 463 428 migrate:mm_migrate_pages 703 898 vmstat 8th warehouse Multi JVM 2 Socket - 2 Node Haswell - X86 Event Before After numa_hint_faults 50155 57146 numa_hint_faults_local 45264 51612 numa_hit 239652 238164 numa_huge_pte_updates 36 16 numa_interleave 68 63 numa_local 239576 238085 numa_other 76 79 numa_pages_migrated 680 883 numa_pte_updates 71146 67540 perf stats 8th warehouse Single JVM 2 Socket - 2 Node Haswell - X86 Event Before After cs 3,156,720 3,288,525 migrations 30,354 38,652 faults 97,261 111,678 cache-misses 12,400,026,826 12,111,197,376 sched:sched_move_numa 4 900 sched:sched_stick_numa 0 0 sched:sched_swap_numa 1 5 migrate:mm_migrate_pages 20 714 vmstat 8th warehouse Single JVM 2 Socket - 2 Node Haswell - X86 Event Before After numa_hint_faults 272 18572 numa_hint_faults_local 186 14850 numa_hit 71362 73197 numa_huge_pte_updates 0 11 numa_interleave 23 25 numa_local 71299 73138 numa_other 63 59 numa_pages_migrated 2 712 numa_pte_updates 0 24021 perf stats 8th warehouse Multi JVM 2 Socket - 2 Node Power9 - PowerNV Event Before After cs 8,606,824 8,451,543 migrations 155,352 202,804 faults 301,409 310,024 cache-misses 157,759,224 253,522,507 sched:sched_move_numa 168 213 sched:sched_stick_numa 0 0 sched:sched_swap_numa 3 2 migrate:mm_migrate_pages 125 88 vmstat 8th warehouse Multi JVM 2 Socket - 2 Node Power9 - PowerNV Event Before After numa_hint_faults 4650 11830 numa_hint_faults_local 3946 11301 numa_hit 90489 90038 numa_huge_pte_updates 0 0 numa_interleave 892 855 numa_local 90034 89796 numa_other 455 242 numa_pages_migrated 124 88 numa_pte_updates 4818 12039 perf stats 8th warehouse Single JVM 2 Socket - 2 Node Power9 - PowerNV Event Before After cs 2,113,167 2,049,153 migrations 10,533 11,405 faults 142,727 162,309 cache-misses 5,594,192 7,203,343 sched:sched_move_numa 10 22 sched:sched_stick_numa 0 0 sched:sched_swap_numa 0 0 migrate:mm_migrate_pages 6 1 vmstat 8th warehouse Single JVM 2 Socket - 2 Node Power9 - PowerNV Event Before After numa_hint_faults 744 1693 numa_hint_faults_local 584 1669 numa_hit 25551 25177 numa_huge_pte_updates 0 0 numa_interleave 263 194 numa_local 25302 24993 numa_other 249 184 numa_pages_migrated 6 1 numa_pte_updates 744 1577 perf stats 8th warehouse Multi JVM 4 Socket - 4 Node Power7 - PowerVM Event Before After cs 101,227,352 94,515,937 migrations 4,151,829 4,203,554 faults 745,233 832,697 cache-misses 224,669,561,766 226,248,698,331 sched:sched_move_numa 617 1,730 sched:sched_stick_numa 2 14 sched:sched_swap_numa 187 432 migrate:mm_migrate_pages 316 1,398 vmstat 8th warehouse Multi JVM 4 Socket - 4 Node Power7 - PowerVM Event Before After numa_hint_faults 24195 80079 numa_hint_faults_local 21639 68620 numa_hit 238331 241187 numa_huge_pte_updates 0 0 numa_interleave 0 0 numa_local 238331 241186 numa_other 0 1 numa_pages_migrated 204 1347 numa_pte_updates 24561 80729 perf stats 8th warehouse Single JVM 4 Socket - 4 Node Power7 - PowerVM Event Before After cs 62,738,978 63,704,961 migrations 562,702 573,404 faults 228,465 230,878 cache-misses 75,778,067,952 76,568,222,781 sched:sched_move_numa 648 509 sched:sched_stick_numa 13 31 sched:sched_swap_numa 137 182 migrate:mm_migrate_pages 733 541 vmstat 8th warehouse Single JVM 4 Socket - 4 Node Power7 - PowerVM Event Before After numa_hint_faults 10281 8501 numa_hint_faults_local 3242 2960 numa_hit 36338 35526 numa_huge_pte_updates 0 0 numa_interleave 0 0 numa_local 36338 35526 numa_other 0 0 numa_pages_migrated 706 539 numa_pte_updates 10176 8433 Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Jirka Hladky <jhladky@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@surriel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1537552141-27815-4-git-send-email-srikar@linux.vnet.ibm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-10-02sched/numa: Pass destination CPU as a parameter to migrate_task_rqSrikar Dronamraju
This additional parameter (new_cpu) is used later for identifying if task migration is across nodes. No functional change. Specjbb2005 results (8 warehouses) Higher bops are better 2 Socket - 2 Node Haswell - X86 JVMS Prev Current %Change 4 203353 200668 -1.32036 1 328205 321791 -1.95427 2 Socket - 4 Node Power8 - PowerNV JVMS Prev Current %Change 1 214384 204848 -4.44809 2 Socket - 2 Node Power9 - PowerNV JVMS Prev Current %Change 4 188553 188098 -0.241311 1 196273 200351 2.07772 4 Socket - 4 Node Power7 - PowerVM JVMS Prev Current %Change 8 57581.2 58145.9 0.980702 1 103468 103798 0.318939 Brings out the variance between different specjbb2005 runs. Some events stats before and after applying the patch. perf stats 8th warehouse Multi JVM 2 Socket - 2 Node Haswell - X86 Event Before After cs 13,941,377 13,912,183 migrations 1,157,323 1,155,931 faults 382,175 367,139 cache-misses 54,993,823,500 54,240,196,814 sched:sched_move_numa 2,005 1,571 sched:sched_stick_numa 14 9 sched:sched_swap_numa 529 463 migrate:mm_migrate_pages 1,573 703 vmstat 8th warehouse Multi JVM 2 Socket - 2 Node Haswell - X86 Event Before After numa_hint_faults 67099 50155 numa_hint_faults_local 58456 45264 numa_hit 240416 239652 numa_huge_pte_updates 18 36 numa_interleave 65 68 numa_local 240339 239576 numa_other 77 76 numa_pages_migrated 1574 680 numa_pte_updates 77182 71146 perf stats 8th warehouse Single JVM 2 Socket - 2 Node Haswell - X86 Event Before After cs 3,176,453 3,156,720 migrations 30,238 30,354 faults 87,869 97,261 cache-misses 12,544,479,391 12,400,026,826 sched:sched_move_numa 23 4 sched:sched_stick_numa 0 0 sched:sched_swap_numa 6 1 migrate:mm_migrate_pages 10 20 vmstat 8th warehouse Single JVM 2 Socket - 2 Node Haswell - X86 Event Before After numa_hint_faults 236 272 numa_hint_faults_local 201 186 numa_hit 72293 71362 numa_huge_pte_updates 0 0 numa_interleave 26 23 numa_local 72233 71299 numa_other 60 63 numa_pages_migrated 8 2 numa_pte_updates 0 0 perf stats 8th warehouse Multi JVM 2 Socket - 2 Node Power9 - PowerNV Event Before After cs 8,478,820 8,606,824 migrations 171,323 155,352 faults 307,499 301,409 cache-misses 240,353,599 157,759,224 sched:sched_move_numa 214 168 sched:sched_stick_numa 0 0 sched:sched_swap_numa 4 3 migrate:mm_migrate_pages 89 125 vmstat 8th warehouse Multi JVM 2 Socket - 2 Node Power9 - PowerNV Event Before After numa_hint_faults 5301 4650 numa_hint_faults_local 4745 3946 numa_hit 92943 90489 numa_huge_pte_updates 0 0 numa_interleave 899 892 numa_local 92345 90034 numa_other 598 455 numa_pages_migrated 88 124 numa_pte_updates 5505 4818 perf stats 8th warehouse Single JVM 2 Socket - 2 Node Power9 - PowerNV Event Before After cs 2,066,172 2,113,167 migrations 11,076 10,533 faults 149,544 142,727 cache-misses 10,398,067 5,594,192 sched:sched_move_numa 43 10 sched:sched_stick_numa 0 0 sched:sched_swap_numa 0 0 migrate:mm_migrate_pages 6 6 vmstat 8th warehouse Single JVM 2 Socket - 2 Node Power9 - PowerNV Event Before After numa_hint_faults 3552 744 numa_hint_faults_local 3347 584 numa_hit 25611 25551 numa_huge_pte_updates 0 0 numa_interleave 213 263 numa_local 25583 25302 numa_other 28 249 numa_pages_migrated 6 6 numa_pte_updates 3535 744 perf stats 8th warehouse Multi JVM 4 Socket - 4 Node Power7 - PowerVM Event Before After cs 99,358,136 101,227,352 migrations 4,041,607 4,151,829 faults 749,653 745,233 cache-misses 225,562,543,251 224,669,561,766 sched:sched_move_numa 771 617 sched:sched_stick_numa 14 2 sched:sched_swap_numa 204 187 migrate:mm_migrate_pages 1,180 316 vmstat 8th warehouse Multi JVM 4 Socket - 4 Node Power7 - PowerVM Event Before After numa_hint_faults 27409 24195 numa_hint_faults_local 20677 21639 numa_hit 239988 238331 numa_huge_pte_updates 0 0 numa_interleave 0 0 numa_local 239983 238331 numa_other 5 0 numa_pages_migrated 1016 204 numa_pte_updates 27916 24561 perf stats 8th warehouse Single JVM 4 Socket - 4 Node Power7 - PowerVM Event Before After cs 60,899,307 62,738,978 migrations 544,668 562,702 faults 270,834 228,465 cache-misses 74,543,455,635 75,778,067,952 sched:sched_move_numa 735 648 sched:sched_stick_numa 25 13 sched:sched_swap_numa 174 137 migrate:mm_migrate_pages 816 733 vmstat 8th warehouse Single JVM 4 Socket - 4 Node Power7 - PowerVM Event Before After numa_hint_faults 11059 10281 numa_hint_faults_local 4733 3242 numa_hit 41384 36338 numa_huge_pte_updates 0 0 numa_interleave 0 0 numa_local 41383 36338 numa_other 1 0 numa_pages_migrated 815 706 numa_pte_updates 11323 10176 Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Jirka Hladky <jhladky@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@surriel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1537552141-27815-3-git-send-email-srikar@linux.vnet.ibm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-10-02sched/numa: Stop multiple tasks from moving to the CPU at the same timeSrikar Dronamraju
Task migration under NUMA balancing can happen in parallel. More than one task might choose to migrate to the same CPU at the same time. This can result in: - During task swap, choosing a task that was not part of the evaluation. - During task swap, task which just got moved into its preferred node, moving to a completely different node. - During task swap, task failing to move to the preferred node, will have to wait an extra interval for the next migrate opportunity. - During task movement, multiple task movements can cause load imbalance. This problem is more likely if there are more cores per node or more nodes in the system. Use a per run-queue variable to check if NUMA-balance is active on the run-queue. Specjbb2005 results (8 warehouses) Higher bops are better 2 Socket - 2 Node Haswell - X86 JVMS Prev Current %Change 4 200194 203353 1.57797 1 311331 328205 5.41995 2 Socket - 4 Node Power8 - PowerNV JVMS Prev Current %Change 1 197654 214384 8.46429 2 Socket - 2 Node Power9 - PowerNV JVMS Prev Current %Change 4 192605 188553 -2.10379 1 213402 196273 -8.02664 4 Socket - 4 Node Power7 - PowerVM JVMS Prev Current %Change 8 52227.1 57581.2 10.2516 1 102529 103468 0.915838 There is a regression on power 9 box. If we look at the details, that box has a sudden jump in cache-misses with this patch. All other parameters seem to be pointing towards NUMA consolidation. perf stats 8th warehouse Multi JVM 2 Socket - 2 Node Haswell - X86 Event Before After cs 13,345,784 13,941,377 migrations 1,127,820 1,157,323 faults 374,736 382,175 cache-misses 55,132,054,603 54,993,823,500 sched:sched_move_numa 1,923 2,005 sched:sched_stick_numa 52 14 sched:sched_swap_numa 595 529 migrate:mm_migrate_pages 1,932 1,573 vmstat 8th warehouse Multi JVM 2 Socket - 2 Node Haswell - X86 Event Before After numa_hint_faults 60605 67099 numa_hint_faults_local 51804 58456 numa_hit 239945 240416 numa_huge_pte_updates 14 18 numa_interleave 60 65 numa_local 239865 240339 numa_other 80 77 numa_pages_migrated 1931 1574 numa_pte_updates 67823 77182 perf stats 8th warehouse Single JVM 2 Socket - 2 Node Haswell - X86 Event Before After cs 3,016,467 3,176,453 migrations 37,326 30,238 faults 115,342 87,869 cache-misses 11,692,155,554 12,544,479,391 sched:sched_move_numa 965 23 sched:sched_stick_numa 8 0 sched:sched_swap_numa 35 6 migrate:mm_migrate_pages 1,168 10 vmstat 8th warehouse Single JVM 2 Socket - 2 Node Haswell - X86 Event Before After numa_hint_faults 16286 236 numa_hint_faults_local 11863 201 numa_hit 112482 72293 numa_huge_pte_updates 33 0 numa_interleave 20 26 numa_local 112419 72233 numa_other 63 60 numa_pages_migrated 1144 8 numa_pte_updates 32859 0 perf stats 8th warehouse Multi JVM 2 Socket - 2 Node Power9 - PowerNV Event Before After cs 8,629,724 8,478,820 migrations 221,052 171,323 faults 308,661 307,499 cache-misses 135,574,913 240,353,599 sched:sched_move_numa 147 214 sched:sched_stick_numa 0 0 sched:sched_swap_numa 2 4 migrate:mm_migrate_pages 64 89 vmstat 8th warehouse Multi JVM 2 Socket - 2 Node Power9 - PowerNV Event Before After numa_hint_faults 11481 5301 numa_hint_faults_local 10968 4745 numa_hit 89773 92943 numa_huge_pte_updates 0 0 numa_interleave 1116 899 numa_local 89220 92345 numa_other 553 598 numa_pages_migrated 62 88 numa_pte_updates 11694 5505 perf stats 8th warehouse Single JVM 2 Socket - 2 Node Power9 - PowerNV Event Before After cs 2,272,887 2,066,172 migrations 12,206 11,076 faults 163,704 149,544 cache-misses 4,801,186 10,398,067 sched:sched_move_numa 44 43 sched:sched_stick_numa 0 0 sched:sched_swap_numa 0 0 migrate:mm_migrate_pages 17 6 vmstat 8th warehouse Single JVM 2 Socket - 2 Node Power9 - PowerNV Event Before After numa_hint_faults 2261 3552 numa_hint_faults_local 1993 3347 numa_hit 25726 25611 numa_huge_pte_updates 0 0 numa_interleave 239 213 numa_local 25498 25583 numa_other 228 28 numa_pages_migrated 17 6 numa_pte_updates 2266 3535 perf stats 8th warehouse Multi JVM 4 Socket - 4 Node Power7 - PowerVM Event Before After cs 117,980,962 99,358,136 migrations 3,950,220 4,041,607 faults 736,979 749,653 cache-misses 224,976,072,879 225,562,543,251 sched:sched_move_numa 504 771 sched:sched_stick_numa 50 14 sched:sched_swap_numa 239 204 migrate:mm_migrate_pages 1,260 1,180 vmstat 8th warehouse Multi JVM 4 Socket - 4 Node Power7 - PowerVM Event Before After numa_hint_faults 18293 27409 numa_hint_faults_local 11969 20677 numa_hit 240854 239988 numa_huge_pte_updates 0 0 numa_interleave 0 0 numa_local 240851 239983 numa_other 3 5 numa_pages_migrated 1190 1016 numa_pte_updates 18106 27916 perf stats 8th warehouse Single JVM 4 Socket - 4 Node Power7 - PowerVM Event Before After cs 61,053,158 60,899,307 migrations 551,586 544,668 faults 244,174 270,834 cache-misses 74,326,766,973 74,543,455,635 sched:sched_move_numa 344 735 sched:sched_stick_numa 24 25 sched:sched_swap_numa 140 174 migrate:mm_migrate_pages 568 816 vmstat 8th warehouse Single JVM 4 Socket - 4 Node Power7 - PowerVM Event Before After numa_hint_faults 6461 11059 numa_hint_faults_local 2283 4733 numa_hit 35661 41384 numa_huge_pte_updates 0 0 numa_interleave 0 0 numa_local 35661 41383 numa_other 0 1 numa_pages_migrated 568 815 numa_pte_updates 6518 11323 Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Rik van Riel <riel@surriel.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: Jirka Hladky <jhladky@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1537552141-27815-2-git-send-email-srikar@linux.vnet.ibm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-10-02perf/ring_buffer: Prevent concurent ring buffer accessJiri Olsa
Some of the scheduling tracepoints allow the perf_tp_event code to write to ring buffer under different cpu than the code is running on. This results in corrupted ring buffer data demonstrated in following perf commands: # perf record -e 'sched:sched_switch,sched:sched_wakeup' perf bench sched messaging # Running 'sched/messaging' benchmark: # 20 sender and receiver processes per group # 10 groups == 400 processes run Total time: 0.383 [sec] [ perf record: Woken up 8 times to write data ] 0x42b890 [0]: failed to process type: -1765585640 [ perf record: Captured and wrote 4.825 MB perf.data (29669 samples) ] # perf report --stdio 0x42b890 [0]: failed to process type: -1765585640 The reason for the corruption are some of the scheduling tracepoints, that have __perf_task dfined and thus allow to store data to another cpu ring buffer: sched_waking sched_wakeup sched_wakeup_new sched_stat_wait sched_stat_sleep sched_stat_iowait sched_stat_blocked The perf_tp_event function first store samples for current cpu related events defined for tracepoint: hlist_for_each_entry_rcu(event, head, hlist_entry) perf_swevent_event(event, count, &data, regs); And then iterates events of the 'task' and store the sample for any task's event that passes tracepoint checks: ctx = rcu_dereference(task->perf_event_ctxp[perf_sw_context]); list_for_each_entry_rcu(event, &ctx->event_list, event_entry) { if (event->attr.type != PERF_TYPE_TRACEPOINT) continue; if (event->attr.config != entry->type) continue; perf_swevent_event(event, count, &data, regs); } Above code can race with same code running on another cpu, ending up with 2 cpus trying to store under the same ring buffer, which is specifically not allowed. This patch prevents the problem, by allowing only events with the same current cpu to receive the event. NOTE: this requires the use of (per-task-)per-cpu buffers for this feature to work; perf-record does this. Signed-off-by: Jiri Olsa <jolsa@kernel.org> [peterz: small edits to Changelog] Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Andrew Vagin <avagin@openvz.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Fixes: e6dab5ffab59 ("perf/trace: Add ability to set a target task for events") Link: http://lkml.kernel.org/r/20180923161343.GB15054@krava Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-10-02perf/core: Fix perf_pmu_unregister() lockingPeter Zijlstra
When we unregister a PMU, we fail to serialize the @pmu_idr properly. Fix that by doing the entire thing under pmu_lock. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Fixes: 2e80a82a49c4 ("perf: Dynamic pmu types") Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-09-29Merge branch 'perf-urgent-for-linus' of ↵Greg Kroah-Hartman
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Thomas writes: "A single fix for a missing sanity check when a pinned event is tried to be read on the wrong CPU due to a legit event scheduling failure." * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf/core: Add sanity check to deal with pinned event failure
2018-09-28perf/core: Add sanity check to deal with pinned event failureReinette Chatre
It is possible that a failure can occur during the scheduling of a pinned event. The initial portion of perf_event_read_local() contains the various error checks an event should pass before it can be considered valid. Ensure that the potential scheduling failure of a pinned event is checked for and have a credible error. Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Reinette Chatre <reinette.chatre@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: fenghua.yu@intel.com Cc: tony.luck@intel.com Cc: acme@kernel.org Cc: gavin.hindman@intel.com Cc: jithu.joseph@intel.com Cc: dave.hansen@intel.com Cc: hpa@zytor.com Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/6486385d1f30336e9973b24c8c65f5079543d3d3.1537377064.git.reinette.chatre@intel.com
2018-09-28bpf: harden flags check in cgroup_storage_update_elem()Roman Gushchin
cgroup_storage_update_elem() shouldn't accept any flags argument values except BPF_ANY and BPF_EXIST to guarantee the backward compatibility, had a new flag value been added. Fixes: de9cbbaadba5 ("bpf: introduce cgroup storage maps") Signed-off-by: Roman Gushchin <guro@fb.com> Reported-by: Daniel Borkmann <daniel@iogearbox.net> Cc: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-09-25dma-mapping: add the missing ARCH_HAS_SYNC_DMA_FOR_CPU_ALL declarationChristoph Hellwig
The patch adding the infrastructure failed to actually add the symbol declaration, oops.. Fixes: faef87723a ("dma-noncoherent: add a arch_sync_dma_for_cpu_all hook") Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Paul Burton <paul.burton@mips.com> Acked-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-09-25Merge gitolite.kernel.org:/pub/scm/linux/kernel/git/davem/netGreg Kroah-Hartman
Dave writes: "Networking fixes: 1) Fix multiqueue handling of coalesce timer in stmmac, from Jose Abreu. 2) Fix memory corruption in NFC, from Suren Baghdasaryan. 3) Don't write reserved bits in ravb driver, from Kazuya Mizuguchi. 4) SMC bug fixes from Karsten Graul, YueHaibing, and Ursula Braun. 5) Fix TX done race in mvpp2, from Antoine Tenart. 6) ipv6 metrics leak, from Wei Wang. 7) Adjust firmware version requirements in mlxsw, from Petr Machata. 8) Fix autonegotiation on resume in r8169, from Heiner Kallweit. 9) Fixed missing entries when dumping /proc/net/if_inet6, from Jeff Barnhill. 10) Fix double free in devlink, from Dan Carpenter. 11) Fix ethtool regression from UFO feature removal, from Maciej Żenczykowski. 12) Fix drivers that have a ndo_poll_controller() that captures the cpu entirely on loaded hosts by trying to drain all rx and tx queues, from Eric Dumazet. 13) Fix memory corruption with jumbo frames in aquantia driver, from Friedemann Gerold." * gitolite.kernel.org:/pub/scm/linux/kernel/git/davem/net: (79 commits) net: mvneta: fix the remaining Rx descriptor unmapping issues ip_tunnel: be careful when accessing the inner header mpls: allow routes on ip6gre devices net: aquantia: memory corruption on jumbo frames tun: remove ndo_poll_controller nfp: remove ndo_poll_controller bnxt: remove ndo_poll_controller bnx2x: remove ndo_poll_controller mlx5: remove ndo_poll_controller mlx4: remove ndo_poll_controller i40evf: remove ndo_poll_controller ice: remove ndo_poll_controller igb: remove ndo_poll_controller ixgb: remove ndo_poll_controller fm10k: remove ndo_poll_controller ixgbevf: remove ndo_poll_controller ixgbe: remove ndo_poll_controller bonding: use netpoll_poll_dev() helper netpoll: make ndo_poll_controller() optional rds: Fix build regression. ...
2018-09-22bpf: sockmap, fix transition through disconnect without closeJohn Fastabend
It is possible (via shutdown()) for TCP socks to go trough TCP_CLOSE state via tcp_disconnect() without actually calling tcp_close which would then call our bpf_tcp_close() callback. Because of this a user could disconnect a socket then put it in a LISTEN state which would break our assumptions about sockets always being ESTABLISHED state. To resolve this rely on the unhash hook, which is called in the disconnect case, to remove the sock from the sockmap. Reported-by: Eric Dumazet <edumazet@google.com> Fixes: 1aa12bdf1bfb ("bpf: sockmap, add sock close() hook to remove socks") Signed-off-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Yonghong Song <yhs@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-09-22bpf: sockmap only allow ESTABLISHED sock stateJohn Fastabend
After this patch we only allow socks that are in ESTABLISHED state or are being added via a sock_ops event that is transitioning into an ESTABLISHED state. By allowing sock_ops events we allow users to manage sockmaps directly from sock ops programs. The two supported sock_ops ops are BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB and BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB. Similar to TLS ULP this ensures sk_user_data is correct. Reported-by: Eric Dumazet <edumazet@google.com> Fixes: 1aa12bdf1bfb ("bpf: sockmap, add sock close() hook to remove socks") Signed-off-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Yonghong Song <yhs@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-09-20kernel/sys.c: remove duplicated includeYueHaibing
Link: http://lkml.kernel.org/r/20180821133424.18716-1-yuehaibing@huawei.com Signed-off-by: YueHaibing <yuehaibing@huawei.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-09-20fork: report pid exhaustion correctlyKJ Tsanaktsidis
Make the clone and fork syscalls return EAGAIN when the limit on the number of pids /proc/sys/kernel/pid_max is exceeded. Currently, when the pid_max limit is exceeded, the kernel will return ENOSPC from the fork and clone syscalls. This is contrary to the documented behaviour, which explicitly calls out the pid_max case as one where EAGAIN should be returned. It also leads to really confusing error messages in userspace programs which will complain about a lack of disk space when they fail to create processes/threads for this reason. This error is being returned because alloc_pid() uses the idr api to find a new pid; when there are none available, idr_alloc_cyclic() returns -ENOSPC, and this is being propagated back to userspace. This behaviour has been broken before, and was explicitly fixed in commit 35f71bc0a09a ("fork: report pid reservation failure properly"), so I think -EAGAIN is definitely the right thing to return in this case. The current behaviour change dates from commit 95846ecf9dac ("pid: replace pid bitmap implementation with IDR AIP") and was I believe unintentional. This patch has no impact on the case where allocating a pid fails because the child reaper for the namespace is dead; that case will still return -ENOMEM. Link: http://lkml.kernel.org/r/20180903111016.46461-1-ktsanaktsidis@zendesk.com Fixes: 95846ecf9dac ("pid: replace pid bitmap implementation with IDR AIP") Signed-off-by: KJ Tsanaktsidis <ktsanaktsidis@zendesk.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Gargi Sharma <gs051095@gmail.com> Cc: Rik van Riel <riel@redhat.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-09-19Merge tag 'trace-v4.19-rc4' of ↵Greg Kroah-Hartman
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Steven writes: "Vaibhav Nagarnaik found that modifying the ring buffer size could cause a huge latency in the system because it does a while loop to free pages without releasing the CPU (on non preempt kernels). In a case where there are hundreds of thousands of pages to free it could actually cause a system stall. A properly place cond_resched() solves this issue."
2018-09-18Merge gitolite.kernel.org:/pub/scm/linux/kernel/git/davem/netGreg Kroah-Hartman
Dave writes: "Various fixes, all over the place: 1) OOB data generation fix in bluetooth, from Matias Karhumaa. 2) BPF BTF boundary calculation fix, from Martin KaFai Lau. 3) Don't bug on excessive frags, to be compatible in situations mixing older and newer kernels on each end. From Juergen Gross. 4) Scheduling in RCU fix in hv_netvsc, from Stephen Hemminger. 5) Zero keying information in TLS layer before freeing copies of them, from Sabrina Dubroca. 6) Fix NULL deref in act_sample, from Davide Caratti. 7) Orphan SKB before GRO in veth to prevent crashes with XDP, from Toshiaki Makita. 8) Fix use after free in ip6_xmit, from Eric Dumazet. 9) Fix VF mac address regression in bnxt_en, from Micahel Chan. 10) Fix MSG_PEEK behavior in TLS layer, from Daniel Borkmann. 11) Programming adjustments to r8169 which fix not being to enter deep sleep states on some machines, from Kai-Heng Feng and Hans de Goede. 12) Fix DST_NOCOUNT flag handling for ipv6 routes, from Peter Oskolkov." * gitolite.kernel.org:/pub/scm/linux/kernel/git/davem/net: (45 commits) net/ipv6: do not copy dst flags on rt init qmi_wwan: set DTR for modems in forced USB2 mode clk: x86: Stop marking clocks as CLK_IS_CRITICAL r8169: Get and enable optional ether_clk clock clk: x86: add "ether_clk" alias for Bay Trail / Cherry Trail r8169: enable ASPM on RTL8106E r8169: Align ASPM/CLKREQ setting function with vendor driver Revert "kcm: remove any offset before parsing messages" kcm: remove any offset before parsing messages net: ethernet: Fix a unused function warning. net: dsa: mv88e6xxx: Fix ATU Miss Violation tls: fix currently broken MSG_PEEK behavior hv_netvsc: pair VF based on serial number PCI: hv: support reporting serial number as slot information bnxt_en: Fix VF mac address regression. ipv6: fix possible use-after-free in ip6_xmit() net: hp100: fix always-true check for link up state ARM: dts: at91: add new compatibility string for macb on sama5d3 net: macb: disable scatter-gather for macb on sama5d3 net: mvpp2: let phylink manage the carrier state ...
2018-09-17ring-buffer: Allow for rescheduling when removing pagesVaibhav Nagarnaik
When reducing ring buffer size, pages are removed by scheduling a work item on each CPU for the corresponding CPU ring buffer. After the pages are removed from ring buffer linked list, the pages are free()d in a tight loop. The loop does not give up CPU until all pages are removed. In a worst case behavior, when lot of pages are to be freed, it can cause system stall. After the pages are removed from the list, the free() can happen while the work is rescheduled. Call cond_resched() in the loop to prevent the system hangup. Link: http://lkml.kernel.org/r/20180907223129.71994-1-vnagarnaik@google.com Cc: stable@vger.kernel.org Fixes: 83f40318dab00 ("ring-buffer: Make removal of ring buffer pages atomic") Reported-by: Jason Behmer <jbehmer@google.com> Signed-off-by: Vaibhav Nagarnaik <vnagarnaik@google.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2018-09-16Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfDavid S. Miller
Daniel Borkmann says: ==================== pull-request: bpf 2018-09-16 The following pull-request contains BPF updates for your *net* tree. The main changes are: 1) Fix end boundary calculation in BTF for the type section, from Martin. 2) Fix and revert subtraction of pointers that was accidentally allowed for unprivileged programs, from Alexei. 3) Fix bpf_msg_pull_data() helper by using __GFP_COMP in order to avoid a warning in linearizing sg pages into a single one for large allocs, from Tushar. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-15Merge branch 'sched-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fixes from Ingo Molnar: "Misc fixes: various scheduler metrics corner case fixes, a sched_features deadlock fix, and a topology fix for certain NUMA systems" * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/fair: Fix kernel-doc notation warning sched/fair: Fix load_balance redo for !imbalance sched/fair: Fix scale_rt_capacity() for SMT sched/fair: Fix vruntime_normalized() for remote non-migration wakeup sched/pelt: Fix update_blocked_averages() for RT and DL classes sched/topology: Set correct NUMA topology type sched/debug: Fix potential deadlock when writing to sched_features
2018-09-15Merge branch 'perf-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf fixes from Ingo Molnar: "Mostly tooling fixes, but also breakpoint and x86 PMU driver fixes" * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (23 commits) perf tools: Fix maps__find_symbol_by_name() tools headers uapi: Update tools's copy of linux/if_link.h tools headers uapi: Update tools's copy of linux/vhost.h tools headers uapi: Update tools's copies of kvm headers tools headers uapi: Update tools's copy of drm/drm.h tools headers uapi: Update tools's copy of asm-generic/unistd.h tools headers uapi: Update tools's copy of linux/perf_event.h perf/core: Force USER_DS when recording user stack data perf/UAPI: Clearly mark __PERF_SAMPLE_CALLCHAIN_EARLY as internal use perf/x86/intel: Add support/quirk for the MISPREDICT bit on Knights Landing CPUs perf annotate: Fix parsing aarch64 branch instructions after objdump update perf probe powerpc: Ignore SyS symbols irrespective of endianness perf event-parse: Use fixed size string for comms perf util: Fix bad memory access in trace info. perf tools: Streamline bpf examples and headers installation perf evsel: Fix potential null pointer dereference in perf_evsel__new_idx() perf arm64: Fix include path for asm-generic/unistd.h perf/hw_breakpoint: Simplify breakpoint enable in perf_event_modify_breakpoint perf/hw_breakpoint: Enable breakpoint in modify_user_hw_breakpoint perf/hw_breakpoint: Remove superfluous bp->attr.disabled = 0 ...
2018-09-15Merge branch 'locking-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull locking fixes from Ingo Molnar: "Misc fixes: liblockdep fixes and ww_mutex fixes" * 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: locking/ww_mutex: Fix spelling mistake "cylic" -> "cyclic" locking/lockdep: Delete unnecessary #include tools/lib/lockdep: Add dummy task_struct state member tools/lib/lockdep: Add empty nmi.h tools/lib/lockdep: Update Sasha Levin email to MSFT jump_label: Fix typo in warning message locking/mutex: Fix mutex debug call and ww_mutex documentation
2018-09-13Merge tag 'printk-for-4.19-rc4' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/pmladek/printk Pull printk fix from Petr Mladek: "Revert a commit that caused "quiet", "debug", and "loglevel" early parameters to be ignored for early boot messages" * tag 'printk-for-4.19-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/pmladek/printk: Revert "printk: make sure to print log on console."
2018-09-12bpf/verifier: disallow pointer subtractionAlexei Starovoitov
Subtraction of pointers was accidentally allowed for unpriv programs by commit 82abbf8d2fc4. Revert that part of commit. Fixes: 82abbf8d2fc4 ("bpf: do not allow root to mangle valid pointers") Reported-by: Jann Horn <jannh@google.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-09-12bpf: btf: Fix end boundary calculation for type sectionMartin KaFai Lau
The end boundary math for type section is incorrect in btf_check_all_metas(). It just happens that hdr->type_off is always 0 for now because there are only two sections (type and string) and string section must be at the end (ensured in btf_parse_str_sec). However, type_off may not be 0 if a new section would be added later. This patch fixes it. Fixes: f80442a4cd18 ("bpf: btf: Change how section is supported in btf_header") Reported-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Martin KaFai Lau <kafai@fb.com> Acked-by: Yonghong Song <yhs@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-09-11Revert "printk: make sure to print log on console."Petr Mladek
This reverts commit 375899cddcbb26881b03cb3fbdcfd600e4e67f4a. The visibility of early messages did not longer take into account "quiet", "debug", and "loglevel" early parameters. It would be possible to invalidate and recompute LOG_NOCONS flag for the affected messages. But it would be hairy. Instead this patch just reverts the problematic commit. We could come up with a better solution for the original problem. For example, we could simplify the logic and just mark messages that should always be visible or always invisible on the console. Also this patch reverts the related build fix commit ffaa619af1b06 ("printk: Fix warning about unused suppress_message_printing"). Finally, this patch does not put back the unused LOG_NOCONS flag. Link: http://lkml.kernel.org/r/20180910145747.emvfzv4mzlk5dfqk@pathway.suse.cz Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: "H . Peter Anvin" <hpa@zytor.com> Cc: x86@kernel.org Cc: linux-kernel@vger.kernel.org Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Maninder Singh <maninder1.s@samsung.com> Reported-by: Hans de Goede <hdegoede@redhat.com> Acked-by: Hans de Goede <hdegoede@redhat.com> Acked-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Signed-off-by: Petr Mladek <pmladek@suse.com>
2018-09-10perf/core: Force USER_DS when recording user stack dataYabin Cui
Perf can record user stack data in response to a synchronous request, such as a tracepoint firing. If this happens under set_fs(KERNEL_DS), then we end up reading user stack data using __copy_from_user_inatomic() under set_fs(KERNEL_DS). I think this conflicts with the intention of using set_fs(KERNEL_DS). And it is explicitly forbidden by hardware on ARM64 when both CONFIG_ARM64_UAO and CONFIG_ARM64_PAN are used. So fix this by forcing USER_DS when recording user stack data. Signed-off-by: Yabin Cui <yabinc@google.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: <stable@vger.kernel.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Fixes: 88b0193d9418 ("perf/callchain: Force USER_DS when invoking perf_callchain_user()") Link: http://lkml.kernel.org/r/20180823225935.27035-1-yabinc@google.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-09-10locking/ww_mutex: Fix spelling mistake "cylic" -> "cyclic"Colin Ian King
Trivial fix to spelling mistake in pr_err() error message Signed-off-by: Colin Ian King <colin.king@canonical.com> Acked-by: Will Deacon <will.deacon@arm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: kernel-janitors@vger.kernel.org Link: http://lkml.kernel.org/r/20180824112235.8842-1-colin.king@canonical.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-09-10locking/lockdep: Delete unnecessary #includeBen Hutchings
Commit: c3bc8fd637a9 ("tracing: Centralize preemptirq tracepoints and unify their usage") added the inclusion of <trace/events/preemptirq.h>. liblockdep doesn't have a stub version of that header so now fails to build. However, commit: bff1b208a5d1 ("tracing: Partial revert of "tracing: Centralize preemptirq tracepoints and unify their usage"") removed the use of functions declared in that header. So delete the #include. Signed-off-by: Ben Hutchings <ben@decadent.org.uk> Cc: Joel Fernandes <joel@joelfernandes.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Sasha Levin <alexander.levin@verizon.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will.deacon@arm.com> Fixes: bff1b208a5d1 ("tracing: Partial revert of "tracing: Centralize ...") Fixes: c3bc8fd637a9 ("tracing: Centralize preemptirq tracepoints ...") Link: http://lkml.kernel.org/r/20180828203315.GD18030@decadent.org.uk Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-09-10sched/fair: Fix kernel-doc notation warningRandy Dunlap
Fix kernel-doc warning for missing 'flags' parameter description: ../kernel/sched/fair.c:3371: warning: Function parameter or member 'flags' not described in 'attach_entity_load_avg' Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Fixes: ea14b57e8a18 ("sched/cpufreq: Provide migration hint") Link: http://lkml.kernel.org/r/cdda0d42-880d-4229-a9f7-5899c977a063@infradead.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-09-10jump_label: Fix typo in warning messageBorislav Petkov
There's no 'allocatote' - use the next best thing: 'allocate' :-) Signed-off-by: Borislav Petkov <bp@suse.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Jason Baron <jbaron@akamai.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt (VMware) <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20180907103521.31344-1-bp@alien8.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-09-10sched/fair: Fix load_balance redo for !imbalanceVincent Guittot
It can happen that load_balance() finds a busiest group and then a busiest rq but the calculated imbalance is in fact 0. In such situation, detach_tasks() returns immediately and lets the flag LBF_ALL_PINNED set. The busiest CPU is then wrongly assumed to have pinned tasks and removed from the load balance mask. then, we redo a load balance without the busiest CPU. This creates wrong load balance situation and generates wrong task migration. If the calculated imbalance is 0, it's useless to try to find a busiest rq as no task will be migrated and we can return immediately. This situation can happen with heterogeneous system or smp system when RT tasks are decreasing the capacity of some CPUs. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: dietmar.eggemann@arm.com Cc: jhugo@codeaurora.org Link: http://lkml.kernel.org/r/1536306664-29827-1-git-send-email-vincent.guittot@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-09-10sched/fair: Fix scale_rt_capacity() for SMTVincent Guittot
Since commit: 523e979d3164 ("sched/core: Use PELT for scale_rt_capacity()") scale_rt_capacity() returns the remaining capacity and not a scale factor to apply on cpu_capacity_orig. arch_scale_cpu() is directly called by scale_rt_capacity() so we must take the sched_domain argument. Reported-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Fixes: 523e979d3164 ("sched/core: Use PELT for scale_rt_capacity()") Link: http://lkml.kernel.org/r/20180904093626.GA23936@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-09-10sched/fair: Fix vruntime_normalized() for remote non-migration wakeupSteve Muckle
When a task which previously ran on a given CPU is remotely queued to wake up on that same CPU, there is a period where the task's state is TASK_WAKING and its vruntime is not normalized. This is not accounted for in vruntime_normalized() which will cause an error in the task's vruntime if it is switched from the fair class during this time. For example if it is boosted to RT priority via rt_mutex_setprio(), rq->min_vruntime will not be subtracted from the task's vruntime but it will be added again when the task returns to the fair class. The task's vruntime will have been erroneously doubled and the effective priority of the task will be reduced. Note this will also lead to inflation of all vruntimes since the doubled vruntime value will become the rq's min_vruntime when other tasks leave the rq. This leads to repeated doubling of the vruntime and priority penalty. Fix this by recognizing a WAKING task's vruntime as normalized only if sched_remote_wakeup is true. This indicates a migration, in which case the vruntime would have been normalized in migrate_task_rq_fair(). Based on a similar patch from John Dias <joaodias@google.com>. Suggested-by: Peter Zijlstra <peterz@infradead.org> Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Signed-off-by: Steve Muckle <smuckle@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Chris Redpath <Chris.Redpath@arm.com> Cc: John Dias <joaodias@google.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Miguel de Dios <migueldedios@google.com> Cc: Morten Rasmussen <Morten.Rasmussen@arm.com> Cc: Patrick Bellasi <Patrick.Bellasi@arm.com> Cc: Paul Turner <pjt@google.com> Cc: Quentin Perret <quentin.perret@arm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Todd Kjos <tkjos@google.com> Cc: kernel-team@android.com Fixes: b5179ac70de8 ("sched/fair: Prepare to fix fairness problems on migration") Link: http://lkml.kernel.org/r/20180831224217.169476-1-smuckle@google.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-09-10sched/pelt: Fix update_blocked_averages() for RT and DL classesVincent Guittot
update_blocked_averages() is called to periodiccally decay the stalled load of idle CPUs and to sync all loads before running load balance. When cfs rq is idle, it trigs a load balance during pick_next_task_fair() in order to potentially pull tasks and to use this newly idle CPU. This load balance happens whereas prev task from another class has not been put and its utilization updated yet. This may lead to wrongly account running time as idle time for RT or DL classes. Test that no RT or DL task is running when updating their utilization in update_blocked_averages(). We still update RT and DL utilization instead of simply skipping them to make sure that all metrics are synced when used during load balance. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Fixes: 371bf4273269 ("sched/rt: Add rt_rq utilization tracking") Fixes: 3727e0e16340 ("sched/dl: Add dl_rq utilization tracking") Link: http://lkml.kernel.org/r/1535728975-22799-1-git-send-email-vincent.guittot@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-09-10sched/topology: Set correct NUMA topology typeSrikar Dronamraju
With the following commit: 051f3ca02e46 ("sched/topology: Introduce NUMA identity node sched domain") the scheduler introduced a new NUMA level. However this leads to the NUMA topology on 2 node systems to not be marked as NUMA_DIRECT anymore. After this commit, it gets reported as NUMA_BACKPLANE, because sched_domains_numa_level is now 2 on 2 node systems. Fix this by allowing setting systems that have up to 2 NUMA levels as NUMA_DIRECT. While here remove code that assumes that level can be 0. Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andre Wild <wild@linux.vnet.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@surriel.com> Cc: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linuxppc-dev <linuxppc-dev@lists.ozlabs.org> Fixes: 051f3ca02e46 "Introduce NUMA identity node sched domain" Link: http://lkml.kernel.org/r/1533920419-17410-1-git-send-email-srikar@linux.vnet.ibm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-09-10sched/debug: Fix potential deadlock when writing to sched_featuresJiada Wang
The following lockdep report can be triggered by writing to /sys/kernel/debug/sched_features: ====================================================== WARNING: possible circular locking dependency detected 4.18.0-rc6-00152-gcd3f77d74ac3-dirty #18 Not tainted ------------------------------------------------------ sh/3358 is trying to acquire lock: 000000004ad3989d (cpu_hotplug_lock.rw_sem){++++}, at: static_key_enable+0x14/0x30 but task is already holding lock: 00000000c1b31a88 (&sb->s_type->i_mutex_key#3){+.+.}, at: sched_feat_write+0x160/0x428 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #3 (&sb->s_type->i_mutex_key#3){+.+.}: lock_acquire+0xb8/0x148 down_write+0xac/0x140 start_creating+0x5c/0x168 debugfs_create_dir+0x18/0x220 opp_debug_register+0x8c/0x120 _add_opp_dev+0x104/0x1f8 dev_pm_opp_get_opp_table+0x174/0x340 _of_add_opp_table_v2+0x110/0x760 dev_pm_opp_of_add_table+0x5c/0x240 dev_pm_opp_of_cpumask_add_table+0x5c/0x100 cpufreq_init+0x160/0x430 cpufreq_online+0x1cc/0xe30 cpufreq_add_dev+0x78/0x198 subsys_interface_register+0x168/0x270 cpufreq_register_driver+0x1c8/0x278 dt_cpufreq_probe+0xdc/0x1b8 platform_drv_probe+0xb4/0x168 driver_probe_device+0x318/0x4b0 __device_attach_driver+0xfc/0x1f0 bus_for_each_drv+0xf8/0x180 __device_attach+0x164/0x200 device_initial_probe+0x10/0x18 bus_probe_device+0x110/0x178 device_add+0x6d8/0x908 platform_device_add+0x138/0x3d8 platform_device_register_full+0x1cc/0x1f8 cpufreq_dt_platdev_init+0x174/0x1bc do_one_initcall+0xb8/0x310 kernel_init_freeable+0x4b8/0x56c kernel_init+0x10/0x138 ret_from_fork+0x10/0x18 -> #2 (opp_table_lock){+.+.}: lock_acquire+0xb8/0x148 __mutex_lock+0x104/0xf50 mutex_lock_nested+0x1c/0x28 _of_add_opp_table_v2+0xb4/0x760 dev_pm_opp_of_add_table+0x5c/0x240 dev_pm_opp_of_cpumask_add_table+0x5c/0x100 cpufreq_init+0x160/0x430 cpufreq_online+0x1cc/0xe30 cpufreq_add_dev+0x78/0x198 subsys_interface_register+0x168/0x270 cpufreq_register_driver+0x1c8/0x278 dt_cpufreq_probe+0xdc/0x1b8 platform_drv_probe+0xb4/0x168 driver_probe_device+0x318/0x4b0 __device_attach_driver+0xfc/0x1f0 bus_for_each_drv+0xf8/0x180 __device_attach+0x164/0x200 device_initial_probe+0x10/0x18 bus_probe_device+0x110/0x178 device_add+0x6d8/0x908 platform_device_add+0x138/0x3d8 platform_device_register_full+0x1cc/0x1f8 cpufreq_dt_platdev_init+0x174/0x1bc do_one_initcall+0xb8/0x310 kernel_init_freeable+0x4b8/0x56c kernel_init+0x10/0x138 ret_from_fork+0x10/0x18 -> #1 (subsys mutex#6){+.+.}: lock_acquire+0xb8/0x148 __mutex_lock+0x104/0xf50 mutex_lock_nested+0x1c/0x28 subsys_interface_register+0xd8/0x270 cpufreq_register_driver+0x1c8/0x278 dt_cpufreq_probe+0xdc/0x1b8 platform_drv_probe+0xb4/0x168 driver_probe_device+0x318/0x4b0 __device_attach_driver+0xfc/0x1f0 bus_for_each_drv+0xf8/0x180 __device_attach+0x164/0x200 device_initial_probe+0x10/0x18 bus_probe_device+0x110/0x178 device_add+0x6d8/0x908 platform_device_add+0x138/0x3d8 platform_device_register_full+0x1cc/0x1f8 cpufreq_dt_platdev_init+0x174/0x1bc do_one_initcall+0xb8/0x310 kernel_init_freeable+0x4b8/0x56c kernel_init+0x10/0x138 ret_from_fork+0x10/0x18 -> #0 (cpu_hotplug_lock.rw_sem){++++}: __lock_acquire+0x203c/0x21d0 lock_acquire+0xb8/0x148 cpus_read_lock+0x58/0x1c8 static_key_enable+0x14/0x30 sched_feat_write+0x314/0x428 full_proxy_write+0xa0/0x138 __vfs_write+0xd8/0x388 vfs_write+0xdc/0x318 ksys_write+0xb4/0x138 sys_write+0xc/0x18 __sys_trace_return+0x0/0x4 other info that might help us debug this: Chain exists of: cpu_hotplug_lock.rw_sem --> opp_table_lock --> &sb->s_type->i_mutex_key#3 Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(&sb->s_type->i_mutex_key#3); lock(opp_table_lock); lock(&sb->s_type->i_mutex_key#3); lock(cpu_hotplug_lock.rw_sem); *** DEADLOCK *** 2 locks held by sh/3358: #0: 00000000a8c4b363 (sb_writers#10){.+.+}, at: vfs_write+0x238/0x318 #1: 00000000c1b31a88 (&sb->s_type->i_mutex_key#3){+.+.}, at: sched_feat_write+0x160/0x428 stack backtrace: CPU: 5 PID: 3358 Comm: sh Not tainted 4.18.0-rc6-00152-gcd3f77d74ac3-dirty #18 Hardware name: Renesas H3ULCB Kingfisher board based on r8a7795 ES2.0+ (DT) Call trace: dump_backtrace+0x0/0x288 show_stack+0x14/0x20 dump_stack+0x13c/0x1ac print_circular_bug.isra.10+0x270/0x438 check_prev_add.constprop.16+0x4dc/0xb98 __lock_acquire+0x203c/0x21d0 lock_acquire+0xb8/0x148 cpus_read_lock+0x58/0x1c8 static_key_enable+0x14/0x30 sched_feat_write+0x314/0x428 full_proxy_write+0xa0/0x138 __vfs_write+0xd8/0x388 vfs_write+0xdc/0x318 ksys_write+0xb4/0x138 sys_write+0xc/0x18 __sys_trace_return+0x0/0x4 This is because when loading the cpufreq_dt module we first acquire cpu_hotplug_lock.rw_sem lock, then in cpufreq_init(), we are taking the &sb->s_type->i_mutex_key lock. But when writing to /sys/kernel/debug/sched_features, the cpu_hotplug_lock.rw_sem lock depends on the &sb->s_type->i_mutex_key lock. To fix this bug, reverse the lock acquisition order when writing to sched_features, this way cpu_hotplug_lock.rw_sem no longer depends on &sb->s_type->i_mutex_key. Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Signed-off-by: Jiada Wang <jiada_wang@mentor.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Eugeniu Rosca <erosca@de.adit-jv.com> Cc: George G. Davis <george_davis@mentor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20180731121222.26195-1-jiada_wang@mentor.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-09-10locking/mutex: Fix mutex debug call and ww_mutex documentationThomas Hellstrom
The following commit: 08295b3b5bee ("Implement an algorithm choice for Wound-Wait mutexes") introduced a reference in the documentation to a function that was removed in an earlier commit. It also forgot to remove a call to debug_mutex_add_waiter() which is now unconditionally called by __mutex_add_waiter(). Fix those bugs. Signed-off-by: Thomas Hellstrom <thellstrom@vmware.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: dri-devel@lists.freedesktop.org Fixes: 08295b3b5bee ("Implement an algorithm choice for Wound-Wait mutexes") Link: http://lkml.kernel.org/r/20180903140708.2401-1-thellstrom@vmware.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-09-09Merge tag 'perf-urgent-for-mingo-4.19-20180903' of ↵Ingo Molnar
git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux into perf/urgent Pull perf/urgent fixes from Arnaldo Carvalho de Melo: Kernel: - Modify breakpoint fixes (Jiri Olsa) perf annotate: - Fix parsing aarch64 branch instructions after objdump update (Kim Phillips) - Fix parsing indirect calls in 'perf annotate' (Martin Liška) perf probe: - Ignore SyS symbols irrespective of endianness on PowerPC (Sandipan Das) perf trace: - Fix include path for asm-generic/unistd.h on arm64 (Kim Phillips) Core libraries: - Fix potential null pointer dereference in perf_evsel__new_idx() (Hisao Tanabe) - Use fixed size string for comms instead of scanf("%m"), that is not present in the bionic libc and leads to a crash (Chris Phlipot) - Fix bad memory access in trace info on 32-bit systems, we were reading 8 bytes from a 4-byte long variable when saving the command line in the perf.data file. (Chris Phlipot) Build system: - Streamline bpf examples and headers installation, clarifying some install messages. (Arnaldo Carvalho de Melo) Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-09-09Merge branch 'timers-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timekeeping fixes from Thomas Gleixner: "Two fixes for timekeeping: - Revert to the previous kthread based update, which is unfortunately required due to lock ordering issues. The removal caused boot failures on old Core2 machines. Add a proper comment why the thread needs to stay to prevent accidental removal in the future. - Fix a silly typo in a function declaration" * 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: clocksource: Revert "Remove kthread" timekeeping: Fix declaration of read_persistent_wall_and_boot_offset()
2018-09-09Merge branch 'smp-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull cpu hotplug fixes from Thomas Gleixner: "Two fixes for the hotplug state machine code: - Move the misplaces smb() in the hotplug thread function to the proper place, otherwise a half update control struct could be observed - Prevent state corruption on error rollback, which causes the state to advance by one and as a consequence skip it in the bringup sequence" * 'smp-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: cpu/hotplug: Prevent state corruption on error rollback cpu/hotplug: Adjust misplaced smb() in cpuhp_thread_fun()