summaryrefslogtreecommitdiffstats
path: root/kernel
AgeCommit message (Collapse)Author
2017-08-22bpf: fix map value attribute for hash of mapsDaniel Borkmann
Currently, iproute2's BPF ELF loader works fine with array of maps when retrieving the fd from a pinned node and doing a selfcheck against the provided map attributes from the object file, but we fail to do the same for hash of maps and thus refuse to get the map from pinned node. Reason is that when allocating hash of maps, fd_htab_map_alloc() will set the value size to sizeof(void *), and any user space map creation requests are forced to set 4 bytes as value size. Thus, selfcheck will complain about exposed 8 bytes on 64 bit archs vs. 4 bytes from object file as value size. Contract is that fdinfo or BPF_MAP_GET_FD_BY_ID returns the value size used to create the map. Fix it by handling it the same way as we do for array of maps, which means that we leave value size at 4 bytes and in the allocation phase round up value size to 8 bytes. alloc_htab_elem() needs an adjustment in order to copy rounded up 8 bytes due to bpf_fd_htab_map_update_elem() calling into htab_map_update_elem() with the pointer of the map pointer as value. Unlike array of maps where we just xchg(), we're using the generic htab_map_update_elem() callback also used from helper calls, which published the key/value already on return, so we need to ensure to memcpy() the right size. Fixes: bcc6b1b7ebf8 ("bpf: Add hash of maps support") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-21pids: make task_tgid_nr_ns() safeOleg Nesterov
This was reported many times, and this was even mentioned in commit 52ee2dfdd4f5 ("pids: refactor vnr/nr_ns helpers to make them safe") but somehow nobody bothered to fix the obvious problem: task_tgid_nr_ns() is not safe because task->group_leader points to nowhere after the exiting task passes exit_notify(), rcu_read_lock() can not help. We really need to change __unhash_process() to nullify group_leader, parent, and real_parent, but this needs some cleanups. Until then we can turn task_tgid_nr_ns() into another user of __task_pid_nr_ns() and fix the problem. Reported-by: Troy Kensinger <tkensinger@google.com> Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-08-20Merge branch 'perf-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf fixes from Thomas Gleixner: "Two fixes for the perf subsystem: - Fix an inconsistency of RDPMC mm struct tagging across exec() which causes RDPMC to fault. - Correct the timestamp mechanics across IOC_DISABLE/ENABLE which causes incorrect timestamps and total time calculations" * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf/core: Fix time on IOC_ENABLE perf/x86: Fix RDPMC vs. mm_struct tracking
2017-08-20Merge branch 'irq-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull irq fixes from Thomas Gleixner: "A pile of smallish changes all over the place: - Add a missing ISB in the GIC V1 driver - Remove an ACPI version check in the GIC V3 ITS driver - Add the missing irq_pm_shutdown function for BRCMSTB-L2 to avoid spurious wakeups - Remove the artifical limitation of ITS instances to the number of NUMA nodes which prevents utilizing the ITS hardware correctly - Prevent a infinite parsing loop in the GIC-V3 ITS/MSI code - Honour the force affinity argument in the GIC-V3 driver which is required to make perf work correctly - Correctly report allocation failures in GIC-V2/V3 to avoid using half allocated and initialized interrupts. - Fixup checks against nr_cpu_ids in the generic IPI code" * 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: genirq/ipi: Fixup checks against nr_cpu_ids genirq: Restore trigger settings in irq_modify_status() MAINTAINERS: Remove Jason Cooper's irqchip git tree irqchip/gic-v3-its-platform-msi: Fix msi-parent parsing loop irqchip/gic-v3-its: Allow GIC ITS number more than MAX_NUMNODES irqchip: brcmstb-l2: Define an irq_pm_shutdown function irqchip/gic: Ensure we have an ISB between ack and ->handle_irq irqchip/gic-v3-its: Remove ACPICA version check for ACPI NUMA irqchip/gic-v3: Honor forced affinity setting irqchip/gic-v3: Report failures in gic_irq_domain_alloc irqchip/gic-v2: Report failures in gic_irq_domain_alloc irqchip/atmel-aic: Remove root argument from ->fixup() prototype irqchip/atmel-aic: Fix unbalanced refcount in aic_common_rtc_irq_fixup() irqchip/atmel-aic: Fix unbalanced of_node_put() in aic_common_irq_fixup()
2017-08-20Merge branch 'core-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull watchdog fix from Thomas Gleixner: "A fix for the hardlockup watchdog to prevent false positives with extreme Turbo-Modes which make the perf/NMI watchdog fire faster than the hrtimer which is used to verify. Slightly larger than the minimal fix, which just would increase the hrtimer frequency, but comes with extra overhead of more watchdog timer interrupts and thread wakeups for all users. With this change we restrict the overhead to the extreme Turbo-Mode systems" * 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: kernel/watchdog: Prevent false positives with turbo modes
2017-08-20genirq/ipi: Fixup checks against nr_cpu_idsAlexey Dobriyan
Valid CPU ids are [0, nr_cpu_ids-1] inclusive. Fixes: 3b8e29a82dd1 ("genirq: Implement ipi_send_mask/single()") Fixes: f9bce791ae2a ("genirq: Add a new function to get IPI reverse mapping") Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Link: http://lkml.kernel.org/r/20170819095751.GB27864@avx2
2017-08-18signal: don't remove SIGNAL_UNKILLABLE for traced tasks.Jamie Iles
When forcing a signal, SIGNAL_UNKILLABLE is removed to prevent recursive faults, but this is undesirable when tracing. For example, debugging an init process (whether global or namespace), hitting a breakpoint and SIGTRAP will force SIGTRAP and then remove SIGNAL_UNKILLABLE. Everything continues fine, but then once debugging has finished, the init process is left killable which is unlikely what the user expects, resulting in either an accidentally killed init or an init that stops reaping zombies. Link: http://lkml.kernel.org/r/20170815112806.10728-1-jamie.iles@oracle.com Signed-off-by: Jamie Iles <jamie.iles@oracle.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-08-18kmod: fix wait on recursive loopLuis R. Rodriguez
Recursive loops with module loading were previously handled in kmod by restricting the number of modprobe calls to 50 and if that limit was breached request_module() would return an error and a user would see the following on their kernel dmesg: request_module: runaway loop modprobe binfmt-464c Starting init:/sbin/init exists but couldn't execute it (error -8) This issue could happen for instance when a 64-bit kernel boots a 32-bit userspace on some architectures and has no 32-bit binary format hanlders. This is visible, for instance, when a CONFIG_MODULES enabled 64-bit MIPS kernel boots a into o32 root filesystem and the binfmt handler for o32 binaries is not built-in. After commit 6d7964a722af ("kmod: throttle kmod thread limit") we now don't have any visible signs of an error and the kernel just waits for the loop to end somehow. Although this *particular* recursive loop could also be addressed by doing a sanity check on search_binary_handler() and disallowing a modular binfmt to be required for modprobe, a generic solution for any recursive kernel kmod issues is still needed. This should catch these loops. We can investigate each loop and address each one separately as they come in, this however puts a stop gap for them as before. Link: http://lkml.kernel.org/r/20170809234635.13443-3-mcgrof@kernel.org Fixes: 6d7964a722af ("kmod: throttle kmod thread limit") Signed-off-by: Luis R. Rodriguez <mcgrof@kernel.org> Reported-by: Matt Redfearn <matt.redfearn@imgtec.com> Tested-by: Matt Redfearn <matt.redfearn@imgetc.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Colin Ian King <colin.king@canonical.com> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: Daniel Mentz <danielmentz@google.com> Cc: David Binderman <dcb314@hotmail.com> Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jessica Yu <jeyu@redhat.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Kees Cook <keescook@chromium.org> Cc: Michal Marek <mmarek@suse.com> Cc: Miroslav Benes <mbenes@suse.cz> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Petr Mladek <pmladek@suse.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-08-18kernel/watchdog: Prevent false positives with turbo modesThomas Gleixner
The hardlockup detector on x86 uses a performance counter based on unhalted CPU cycles and a periodic hrtimer. The hrtimer period is about 2/5 of the performance counter period, so the hrtimer should fire 2-3 times before the performance counter NMI fires. The NMI code checks whether the hrtimer fired since the last invocation. If not, it assumess a hard lockup. The calculation of those periods is based on the nominal CPU frequency. Turbo modes increase the CPU clock frequency and therefore shorten the period of the perf/NMI watchdog. With extreme Turbo-modes (3x nominal frequency) the perf/NMI period is shorter than the hrtimer period which leads to false positives. A simple fix would be to shorten the hrtimer period, but that comes with the side effect of more frequent hrtimer and softlockup thread wakeups, which is not desired. Implement a low pass filter, which checks the perf/NMI period against kernel time. If the perf/NMI fires before 4/5 of the watchdog period has elapsed then the event is ignored and postponed to the next perf/NMI. That solves the problem and avoids the overhead of shorter hrtimer periods and more frequent softlockup thread wakeups. Fixes: 58687acba592 ("lockup_detector: Combine nmi_watchdog and softlockup detector") Reported-and-tested-by: Kan Liang <Kan.liang@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: dzickus@redhat.com Cc: prarit@redhat.com Cc: ak@linux.intel.com Cc: babu.moger@oracle.com Cc: peterz@infradead.org Cc: eranian@google.com Cc: acme@redhat.com Cc: stable@vger.kernel.org Cc: atomlin@redhat.com Cc: akpm@linux-foundation.org Cc: torvalds@linux-foundation.org Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1708150931310.1886@nanos
2017-08-18genirq: Restore trigger settings in irq_modify_status()Marc Zyngier
irq_modify_status starts by clearing the trigger settings from irq_data before applying the new settings, but doesn't restore them, leaving them to IRQ_TYPE_NONE. That's pretty confusing to the potential request_irq() that could follow. Instead, snapshot the settings before clearing them, and restore them if the irq_modify_status() invocation was not changing the trigger. Fixes: 1e2a7d78499e ("irqdomain: Don't set type when mapping an IRQ") Reported-and-tested-by: jeffy <jeffy.chen@rock-chips.com> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Jon Hunter <jonathanh@nvidia.com> Cc: stable@vger.kernel.org Link: http://lkml.kernel.org/r/20170818095345.12378-1-marc.zyngier@arm.com
2017-08-16Merge tag 'audit-pr-20170816' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit Pull audit fixes from Paul Moore: "Two small fixes to the audit code, both explained well in the respective patch descriptions, but the quick summary is one use-after-free fix, and one silly fanotify notification flag fix" * tag 'audit-pr-20170816' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit: audit: Receive unmount event audit: Fix use after free in audit_remove_watch_rule()
2017-08-15Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds
Pull networking fixes from David Miller: 1) Fix TCP checksum offload handling in iwlwifi driver, from Emmanuel Grumbach. 2) In ksz DSA tagging code, free SKB if skb_put_padto() fails. From Vivien Didelot. 3) Fix two regressions with bonding on wireless, from Andreas Born. 4) Fix build when busypoll is disabled, from Daniel Borkmann. 5) Fix copy_linear_skb() wrt. SO_PEEK_OFF, from Eric Dumazet. 6) Set SKB cached route properly in inet_rtm_getroute(), from Florian Westphal. 7) Fix PCI-E relaxed ordering handling in cxgb4 driver, from Ding Tianhong. 8) Fix module refcnt leak in ULP code, from Sabrina Dubroca. 9) Fix use of GFP_KERNEL in atomic contexts in AF_KEY code, from Eric Dumazet. 10) Need to purge socket write queue in dccp_destroy_sock(), also from Eric Dumazet. 11) Make bpf_trace_printk() work properly on 32-bit architectures, from Daniel Borkmann. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (47 commits) bpf: fix bpf_trace_printk on 32 bit archs PCI: fix oops when try to find Root Port for a PCI device sfc: don't try and read ef10 data on non-ef10 NIC net_sched: remove warning from qdisc_hash_add net_sched/sfq: update hierarchical backlog when drop packet net_sched: reset pointers to tcf blocks in classful qdiscs' destructors ipv4: fix NULL dereference in free_fib_info_rcu() net: Fix a typo in comment about sock flags. ipv6: fix NULL dereference in ip6_route_dev_notify() tcp: fix possible deadlock in TCP stack vs BPF filter dccp: purge write queue in dccp_destroy_sock() udp: fix linear skb reception with PEEK_OFF ipv6: release rt6->rt6i_idev properly during ifdown af_key: do not use GFP_KERNEL in atomic contexts tcp: ulp: avoid module refcnt leak in tcp_set_ulp net/cxgb4vf: Use new PCI_DEV_FLAGS_NO_RELAXED_ORDERING flag net/cxgb4: Use new PCI_DEV_FLAGS_NO_RELAXED_ORDERING flag PCI: Disable Relaxed Ordering Attributes for AMD A1100 PCI: Disable Relaxed Ordering for some Intel processors PCI: Disable PCIe Relaxed Ordering if unsupported ...
2017-08-15bpf: fix bpf_trace_printk on 32 bit archsDaniel Borkmann
James reported that on MIPS32 bpf_trace_printk() is currently broken while MIPS64 works fine: bpf_trace_printk() uses conditional operators to attempt to pass different types to __trace_printk() depending on the format operators. This doesn't work as intended on 32-bit architectures where u32 and long are passed differently to u64, since the result of C conditional operators follows the "usual arithmetic conversions" rules, such that the values passed to __trace_printk() will always be u64 [causing issues later in the va_list handling for vscnprintf()]. For example the samples/bpf/tracex5 test printed lines like below on MIPS32, where the fd and buf have come from the u64 fd argument, and the size from the buf argument: [...] 1180.941542: 0x00000001: write(fd=1, buf= (null), size=6258688) Instead of this: [...] 1625.616026: 0x00000001: write(fd=1, buf=009e4000, size=512) One way to get it working is to expand various combinations of argument types into 8 different combinations for 32 bit and 64 bit kernels. Fix tested by James on MIPS32 and MIPS64 as well that it resolves the issue. Fixes: 9c959c863f82 ("tracing: Allow BPF programs to call bpf_trace_printk()") Reported-by: James Hogan <james.hogan@imgtec.com> Tested-by: James Hogan <james.hogan@imgtec.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-15audit: Receive unmount eventJan Kara
Although audit_watch_handle_event() can handle FS_UNMOUNT event, it is not part of AUDIT_FS_WATCH mask and thus such event never gets to audit_watch_handle_event(). Thus fsnotify marks are deleted by fsnotify subsystem on unmount without audit being notified about that which leads to a strange state of existing audit rules with dead fsnotify marks. Add FS_UNMOUNT to the mask of events to be received so that audit can clean up its state accordingly. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Paul Moore <paul@paul-moore.com>
2017-08-15audit: Fix use after free in audit_remove_watch_rule()Jan Kara
audit_remove_watch_rule() drops watch's reference to parent but then continues to work with it. That is not safe as parent can get freed once we drop our reference. The following is a trivial reproducer: mount -o loop image /mnt touch /mnt/file auditctl -w /mnt/file -p wax umount /mnt auditctl -D <crash in fsnotify_destroy_mark()> Grab our own reference in audit_remove_watch_rule() earlier to make sure mark does not get freed under us. CC: stable@vger.kernel.org Reported-by: Tony Jones <tonyj@suse.de> Signed-off-by: Jan Kara <jack@suse.cz> Tested-by: Tony Jones <tonyj@suse.de> Signed-off-by: Paul Moore <paul@paul-moore.com>
2017-08-10mm: migrate: prevent racy access to tlb_flush_pendingNadav Amit
Patch series "fixes of TLB batching races", v6. It turns out that Linux TLB batching mechanism suffers from various races. Races that are caused due to batching during reclamation were recently handled by Mel and this patch-set deals with others. The more fundamental issue is that concurrent updates of the page-tables allow for TLB flushes to be batched on one core, while another core changes the page-tables. This other core may assume a PTE change does not require a flush based on the updated PTE value, while it is unaware that TLB flushes are still pending. This behavior affects KSM (which may result in memory corruption) and MADV_FREE and MADV_DONTNEED (which may result in incorrect behavior). A proof-of-concept can easily produce the wrong behavior of MADV_DONTNEED. Memory corruption in KSM is harder to produce in practice, but was observed by hacking the kernel and adding a delay before flushing and replacing the KSM page. Finally, there is also one memory barrier missing, which may affect architectures with weak memory model. This patch (of 7): Setting and clearing mm->tlb_flush_pending can be performed by multiple threads, since mmap_sem may only be acquired for read in task_numa_work(). If this happens, tlb_flush_pending might be cleared while one of the threads still changes PTEs and batches TLB flushes. This can lead to the same race between migration and change_protection_range() that led to the introduction of tlb_flush_pending. The result of this race was data corruption, which means that this patch also addresses a theoretically possible data corruption. An actual data corruption was not observed, yet the race was was confirmed by adding assertion to check tlb_flush_pending is not set by two threads, adding artificial latency in change_protection_range() and using sysctl to reduce kernel.numa_balancing_scan_delay_ms. Link: http://lkml.kernel.org/r/20170802000818.4760-2-namit@vmware.com Fixes: 20841405940e ("mm: fix TLB flush race between migration, and change_protection_range") Signed-off-by: Nadav Amit <namit@vmware.com> Acked-by: Mel Gorman <mgorman@suse.de> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: Minchan Kim <minchan@kernel.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jeff Dike <jdike@addtoit.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Russell King <linux@armlinux.org.uk> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-08-10mm: fix global NR_SLAB_.*CLAIMABLE counter readsJohannes Weiner
As Tetsuo points out: "Commit 385386cff4c6 ("mm: vmstat: move slab statistics from zone to node counters") broke "Slab:" field of /proc/meminfo . It shows nearly 0kB" In addition to /proc/meminfo, this problem also affects the slab counters OOM/allocation failure info dumps, can cause early -ENOMEM from overcommit protection, and miscalculate image size requirements during suspend-to-disk. This is because the patch in question switched the slab counters from the zone level to the node level, but forgot to update the global accessor functions to read the aggregate node data instead of the aggregate zone data. Use global_node_page_state() to access the global slab counters. Fixes: 385386cff4c6 ("mm: vmstat: move slab statistics from zone to node counters") Link: http://lkml.kernel.org/r/20170801134256.5400-1-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reported-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Josef Bacik <josef@toxicpanda.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Stefan Agner <stefan@agner.ch> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-08-10perf/core: Fix time on IOC_ENABLEPeter Zijlstra
Vince reported that when we do IOC_ENABLE/IOC_DISABLE while the task is SIGSTOP'ed state the timestamps go wobbly. It turns out we indeed fail to correctly account time while in 'OFF' state and doing IOC_ENABLE without getting scheduled in exposes the problem. Further thinking about this problem, it occurred to me that we can suffer a similar fate when we migrate an uncore event between CPUs. The perf_event_install() on the 'new' CPU will do add_event_to_ctx() which will reset all the time stamp, resulting in a subsequent update_event_times() to overwrite the total_time_* fields with smaller values. Reported-by: Vince Weaver <vincent.weaver@maine.edu> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-10perf/x86: Fix RDPMC vs. mm_struct trackingPeter Zijlstra
Vince reported the following rdpmc() testcase failure: > Failing test case: > > fd=perf_event_open(); > addr=mmap(fd); > exec() // without closing or unmapping the event > fd=perf_event_open(); > addr=mmap(fd); > rdpmc() // GPFs due to rdpmc being disabled The problem is of course that exec() plays tricks with what is current->mm, only destroying the old mappings after having installed the new mm. Fix this confusion by passing along vma->vm_mm instead of relying on current->mm. Reported-by: Vince Weaver <vincent.weaver@maine.edu> Tested-by: Vince Weaver <vincent.weaver@maine.edu> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Andy Lutomirski <luto@kernel.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Fixes: 1e0fb9ec679c ("perf: Add pmu callbacks to track event mapping and unmapping") Link: http://lkml.kernel.org/r/20170802173930.cstykcqefmqt7jau@hirez.programming.kicks-ass.net [ Minor cleanups. ] Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-09futex: Remove unnecessary warning from get_futex_keyMel Gorman
Commit 65d8fc777f6d ("futex: Remove requirement for lock_page() in get_futex_key()") removed an unnecessary lock_page() with the side-effect that page->mapping needed to be treated very carefully. Two defensive warnings were added in case any assumption was missed and the first warning assumed a correct application would not alter a mapping backing a futex key. Since merging, it has not triggered for any unexpected case but Mark Rutland reported the following bug triggering due to the first warning. kernel BUG at kernel/futex.c:679! Internal error: Oops - BUG: 0 [#1] PREEMPT SMP Modules linked in: CPU: 0 PID: 3695 Comm: syz-executor1 Not tainted 4.13.0-rc3-00020-g307fec773ba3 #3 Hardware name: linux,dummy-virt (DT) task: ffff80001e271780 task.stack: ffff000010908000 PC is at get_futex_key+0x6a4/0xcf0 kernel/futex.c:679 LR is at get_futex_key+0x6a4/0xcf0 kernel/futex.c:679 pc : [<ffff00000821ac14>] lr : [<ffff00000821ac14>] pstate: 80000145 The fact that it's a bug instead of a warning was due to an unrelated arm64 problem, but the warning itself triggered because the underlying mapping changed. This is an application issue but from a kernel perspective it's a recoverable situation and the warning is unnecessary so this patch removes the warning. The warning may potentially be triggered with the following test program from Mark although it may be necessary to adjust NR_FUTEX_THREADS to be a value smaller than the number of CPUs in the system. #include <linux/futex.h> #include <pthread.h> #include <stdio.h> #include <stdlib.h> #include <sys/mman.h> #include <sys/syscall.h> #include <sys/time.h> #include <unistd.h> #define NR_FUTEX_THREADS 16 pthread_t threads[NR_FUTEX_THREADS]; void *mem; #define MEM_PROT (PROT_READ | PROT_WRITE) #define MEM_SIZE 65536 static int futex_wrapper(int *uaddr, int op, int val, const struct timespec *timeout, int *uaddr2, int val3) { syscall(SYS_futex, uaddr, op, val, timeout, uaddr2, val3); } void *poll_futex(void *unused) { for (;;) { futex_wrapper(mem, FUTEX_CMP_REQUEUE_PI, 1, NULL, mem + 4, 1); } } int main(int argc, char *argv[]) { int i; mem = mmap(NULL, MEM_SIZE, MEM_PROT, MAP_SHARED | MAP_ANONYMOUS, -1, 0); printf("Mapping @ %p\n", mem); printf("Creating futex threads...\n"); for (i = 0; i < NR_FUTEX_THREADS; i++) pthread_create(&threads[i], NULL, poll_futex, NULL); printf("Flipping mapping...\n"); for (;;) { mmap(mem, MEM_SIZE, MEM_PROT, MAP_FIXED | MAP_SHARED | MAP_ANONYMOUS, -1, 0); } return 0; } Reported-and-tested-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: stable@vger.kernel.org # 4.7+ Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-08-06Fix compat_sys_sigpending breakageDmitry V. Levin
The latest change of compat_sys_sigpending in commit 8f13621abced ("sigpending(): move compat to native") has broken it in two ways. First, it tries to write 4 bytes more than userspace expects: sizeof(old_sigset_t) == sizeof(long) == 8 instead of sizeof(compat_old_sigset_t) == sizeof(u32) == 4. Second, on big endian architectures these bytes are being written in the wrong order. This bug was found by strace test suite. Reported-by: Anatoly Pugachev <matorola@gmail.com> Inspired-by: Eugene Syromyatnikov <evgsyr@gmail.com> Fixes: 8f13621abced ("sigpending(): move compat to native") Signed-off-by: Dmitry V. Levin <ldv@altlinux.org> Acked-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-08-04Merge branch 'timers-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer fix from Thomas Gleixner: "A single fix for a multiplication overflow in the timer code on 32bit systems" * 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: timers: Fix overflow in get_next_timer_interrupt
2017-08-02cpuset: fix a deadlock due to incomplete patching of cpusets_enabled()Dima Zavin
In codepaths that use the begin/retry interface for reading mems_allowed_seq with irqs disabled, there exists a race condition that stalls the patch process after only modifying a subset of the static_branch call sites. This problem manifested itself as a deadlock in the slub allocator, inside get_any_partial. The loop reads mems_allowed_seq value (via read_mems_allowed_begin), performs the defrag operation, and then verifies the consistency of mem_allowed via the read_mems_allowed_retry and the cookie returned by xxx_begin. The issue here is that both begin and retry first check if cpusets are enabled via cpusets_enabled() static branch. This branch can be rewritted dynamically (via cpuset_inc) if a new cpuset is created. The x86 jump label code fully synchronizes across all CPUs for every entry it rewrites. If it rewrites only one of the callsites (specifically the one in read_mems_allowed_retry) and then waits for the smp_call_function(do_sync_core) to complete while a CPU is inside the begin/retry section with IRQs off and the mems_allowed value is changed, we can hang. This is because begin() will always return 0 (since it wasn't patched yet) while retry() will test the 0 against the actual value of the seq counter. The fix is to use two different static keys: one for begin (pre_enable_key) and one for retry (enable_key). In cpuset_inc(), we first bump the pre_enable key to ensure that cpuset_mems_allowed_begin() always return a valid seqcount if are enabling cpusets. Similarly, when disabling cpusets via cpuset_dec(), we first ensure that callers of cpuset_mems_allowed_retry() will start ignoring the seqcount value before we let cpuset_mems_allowed_begin() return 0. The relevant stack traces of the two stuck threads: CPU: 1 PID: 1415 Comm: mkdir Tainted: G L 4.9.36-00104-g540c51286237 #4 Hardware name: Default string Default string/Hardware, BIOS 4.29.1-20170526215256 05/26/2017 task: ffff8817f9c28000 task.stack: ffffc9000ffa4000 RIP: smp_call_function_many+0x1f9/0x260 Call Trace: smp_call_function+0x3b/0x70 on_each_cpu+0x2f/0x90 text_poke_bp+0x87/0xd0 arch_jump_label_transform+0x93/0x100 __jump_label_update+0x77/0x90 jump_label_update+0xaa/0xc0 static_key_slow_inc+0x9e/0xb0 cpuset_css_online+0x70/0x2e0 online_css+0x2c/0xa0 cgroup_apply_control_enable+0x27f/0x3d0 cgroup_mkdir+0x2b7/0x420 kernfs_iop_mkdir+0x5a/0x80 vfs_mkdir+0xf6/0x1a0 SyS_mkdir+0xb7/0xe0 entry_SYSCALL_64_fastpath+0x18/0xad ... CPU: 2 PID: 1 Comm: init Tainted: G L 4.9.36-00104-g540c51286237 #4 Hardware name: Default string Default string/Hardware, BIOS 4.29.1-20170526215256 05/26/2017 task: ffff8818087c0000 task.stack: ffffc90000030000 RIP: int3+0x39/0x70 Call Trace: <#DB> ? ___slab_alloc+0x28b/0x5a0 <EOE> ? copy_process.part.40+0xf7/0x1de0 __slab_alloc.isra.80+0x54/0x90 copy_process.part.40+0xf7/0x1de0 copy_process.part.40+0xf7/0x1de0 kmem_cache_alloc_node+0x8a/0x280 copy_process.part.40+0xf7/0x1de0 _do_fork+0xe7/0x6c0 _raw_spin_unlock_irq+0x2d/0x60 trace_hardirqs_on_caller+0x136/0x1d0 entry_SYSCALL_64_fastpath+0x5/0xad do_syscall_64+0x27/0x350 SyS_clone+0x19/0x20 do_syscall_64+0x60/0x350 entry_SYSCALL64_slow_path+0x25/0x25 Link: http://lkml.kernel.org/r/20170731040113.14197-1-dmitriyz@waymo.com Fixes: 46e700abc44c ("mm, page_alloc: remove unnecessary taking of a seqlock when cpusets are disabled") Signed-off-by: Dima Zavin <dmitriyz@waymo.com> Reported-by: Cliff Spradlin <cspradlin@waymo.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Christopher Lameter <cl@linux.com> Cc: Li Zefan <lizefan@huawei.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-08-02pid: kill pidhash_size in pidhash_init()Kefeng Wang
After commit 3d375d78593c ("mm: update callers to use HASH_ZERO flag"), drop unused pidhash_size in pidhash_init(). Link: http://lkml.kernel.org/r/1500389267-49222-1-git-send-email-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Reviewed-by: Pavel Tatashin <Pasha.Tatashin@Oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-08-02ring-buffer: Have ring_buffer_alloc_read_page() return error on offline CPUSteven Rostedt (VMware)
Chunyu Hu reported: "per_cpu trace directories and files are created for all possible cpus, but only the cpus which have ever been on-lined have their own per cpu ring buffer (allocated by cpuhp threads). While trace_buffers_open, the open handler for trace file 'trace_pipe_raw' is always trying to access field of ring_buffer_per_cpu, and would panic with the NULL pointer. Align the behavior of trace_pipe_raw with trace_pipe, that returns -NODEV when openning it if that cpu does not have trace ring buffer. Reproduce: cat /sys/kernel/debug/tracing/per_cpu/cpu31/trace_pipe_raw (cpu31 is never on-lined, this is a 16 cores x86_64 box) Tested with: 1) boot with maxcpus=14, read trace_pipe_raw of cpu15. Got -NODEV. 2) oneline cpu15, read trace_pipe_raw of cpu15. Get the raw trace data. Call trace: [ 5760.950995] RIP: 0010:ring_buffer_alloc_read_page+0x32/0xe0 [ 5760.961678] tracing_buffers_read+0x1f6/0x230 [ 5760.962695] __vfs_read+0x37/0x160 [ 5760.963498] ? __vfs_read+0x5/0x160 [ 5760.964339] ? security_file_permission+0x9d/0xc0 [ 5760.965451] ? __vfs_read+0x5/0x160 [ 5760.966280] vfs_read+0x8c/0x130 [ 5760.967070] SyS_read+0x55/0xc0 [ 5760.967779] do_syscall_64+0x67/0x150 [ 5760.968687] entry_SYSCALL64_slow_path+0x25/0x25" This was introduced by the addition of the feature to reuse reader pages instead of re-allocating them. The problem is that the allocation of a reader page (which is per cpu) does not check if the cpu is online and set up for the ring buffer. Link: http://lkml.kernel.org/r/1500880866-1177-1-git-send-email-chuhu@redhat.com Cc: stable@vger.kernel.org Fixes: 73a757e63114 ("ring-buffer: Return reader page back into existing ring buffer") Reported-by: Chunyu Hu <chuhu@redhat.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-08-02tracing: Missing error code in tracer_alloc_buffers()Dan Carpenter
If ring_buffer_alloc() or one of the next couple function calls fail then we should return -ENOMEM but the current code returns success. Link: http://lkml.kernel.org/r/20170801110201.ajdkct7vwzixahvx@mwanda Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: stable@vger.kernel.org Fixes: b32614c03413 ('tracing/rb: Convert to hotplug state machine') Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-08-02tracing: Call clear_boot_tracer() at lateinit_syncSteven Rostedt (VMware)
The clear_boot_tracer function is used to reset the default_bootup_tracer string to prevent it from being accessed after boot, as it originally points to init data. But since clear_boot_tracer() is called via the init_lateinit() call, it races with the initcall for registering the hwlat tracer. If someone adds "ftrace=hwlat" to the kernel command line, depending on how the linker sets up the text, the saved command line may be cleared, and the hwlat tracer never is initialized. Simply have the clear_boot_tracer() be called by initcall_lateinit_sync() as that's for tasks to be called after lateinit. Link: https://bugzilla.kernel.org/show_bug.cgi?id=196551 Cc: stable@vger.kernel.org Fixes: e7c15cd8a ("tracing: Added hardware latency tracer") Reported-by: Zamir SUN <sztsian@gmail.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-08-01timers: Fix overflow in get_next_timer_interruptMatija Glavinic Pecotic
For e.g. HZ=100, timer being 430 jiffies in the future, and 32 bit unsigned int, there is an overflow on unsigned int right-hand side of the expression which results with wrong values being returned. Type cast the multiplier to 64bit to avoid that issue. Fixes: 46c8f0b077a8 ("timers: Fix get_next_timer_interrupt() computation") Signed-off-by: Matija Glavinic Pecotic <matija.glavinic-pecotic.ext@nokia.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Alexander Sverdlin <alexander.sverdlin@nokia.com> Cc: khilman@baylibre.com Cc: akpm@linux-foundation.org Cc: stable@vger.kernel.org Link: http://lkml.kernel.org/r/a7900f04-2a21-c9fd-67be-ab334d459ee5@nokia.com
2017-07-31Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds
Pull networking fixes from David Miller: 1) Handle notifier registry failures properly in tun/tap driver, from Tonghao Zhang. 2) Fix bpf verifier handling of subtraction bounds and add a testcase for this, from Edward Cree. 3) Increase reset timeout in ftgmac100 driver, from Ben Herrenschmidt. 4) Fix use after free in prd_retire_rx_blk_timer_exired() in AF_PACKET, from Cong Wang. 5) Fix SElinux regression due to recent UDP optimizations, from Paolo Abeni. 6) We accidently increment IPSTATS_MIB_FRAGFAILS in the ipv6 code paths, fix from Stefano Brivio. 7) Fix some mem leaks in dccp, from Xin Long. 8) Adjust MDIO_BUS kconfig deps to avoid build errors, from Arnd Bergmann. 9) Mac address length check and buffer size fixes from Cong Wang. 10) Don't leak sockets in ipv6 udp early demux, from Paolo Abeni. 11) Fix return value when copy_from_user() fails in bpf_prog_get_info_by_fd(), from Daniel Borkmann. 12) Handle PHY_HALTED properly in phy library state machine, from Florian Fainelli. 13) Fix OOPS in fib_sync_down_dev(), from Ido Schimmel. 14) Fix truesize calculation in virtio_net which led to performance regressions, from Michael S Tsirkin. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (76 commits) samples/bpf: fix bpf tunnel cleanup udp6: fix jumbogram reception ppp: Fix a scheduling-while-atomic bug in del_chan Revert "net: bcmgenet: Remove init parameter from bcmgenet_mii_config" virtio_net: fix truesize for mergeable buffers mv643xx_eth: fix of_irq_to_resource() error check MAINTAINERS: Add more files to the PHY LIBRARY section ipv4: fib: Fix NULL pointer deref during fib_sync_down_dev() net: phy: Correctly process PHY_HALTED in phy_stop_machine() sunhme: fix up GREG_STAT and GREG_IMASK register offsets bpf: fix bpf_prog_get_info_by_fd to dump correct xlated_prog_len tcp: avoid bogus gcc-7 array-bounds warning net: tc35815: fix spelling mistake: "Intterrupt" -> "Interrupt" bpf: don't indicate success when copy_from_user fails udp6: fix socket leak on early demux net: thunderx: Fix BGX transmit stall due to underflow Revert "vhost: cache used event for better performance" team: use a larger struct for mac address net: check dev->addr_len for dev_set_mac_address() phy: bcm-ns-usb3: fix MDIO_BUS dependency ...
2017-07-31Merge branch 'for-4.13-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup fixes from Tejun Heo: "Several cgroup bug fixes. - cgroup core was calling a migration callback on empty migrations, which could make cpuset crash. - There was a very subtle bug where the controller interface files aren't created directly when cgroup2 is mounted. Because later operations create them, this bug didn't get noticed earlier. - Failed writes to cgroup.subtree_control were incorrectly returning zero" * 'for-4.13-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cgroup: fix error return value from cgroup_subtree_control() cgroup: create dfl_root files on subsys registration cgroup: don't call migration methods if there are no tasks to migrate
2017-07-31Merge branch 'for-4.13-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq Pull workqueue fixes from Tejun Heo: "Two notable fixes. - While adding NUMA affinity support to unbound workqueues, the assumption that an unbound workqueue with max_active == 1 is ordered was broken. The plan was to use explicit alloc_ordered_workqueue() for those cases. Unfortunately, I forgot to update the documentation properly and we grew a handful of use cases which depend on that assumption. While we want to convert them to alloc_ordered_workqueue(), we don't really lose anything by enforcing ordered execution on unbound max_active == 1 workqueues and it doesn't make sense to risk subtle bugs. Restore the assumption. - Workqueue assumes that CPU <-> NUMA node mapping remains static. This is a general assumption - we don't have any synchronization mechanism around CPU <-> node mapping. Unfortunately, powerpc may change the mapping dynamically leading to crashes. Michael added a workaround so that we at least don't crash while powerpc hotplug code gets updated" * 'for-4.13-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: workqueue: Work around edge cases for calc of pool's cpumask workqueue: implicit ordered attribute should be overridable workqueue: restore WQ_UNBOUND/max_active==1 to be ordered
2017-07-30Merge branch 'sched-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fixes from Thomas Gleixner: "Two patches addressing build warnings caused by inconsistent kernel doc comments" * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/wait: Clean up some documentation warnings sched/core: Fix some documentation build warnings
2017-07-29bpf: fix bpf_prog_get_info_by_fd to dump correct xlated_prog_lenDaniel Borkmann
bpf_prog_size(prog->len) is not the correct length we want to dump back to user space. The code in bpf_prog_get_info_by_fd() uses this to copy prog->insnsi to user space, but bpf_prog_size(prog->len) also includes the size of struct bpf_prog itself plus program instructions and is usually used either in context of accounting or for bpf_prog_alloc() et al, thus we copy out of bounds in bpf_prog_get_info_by_fd() potentially. Use the correct bpf_prog_insn_size() instead. Fixes: 1e2709769086 ("bpf: Add BPF_OBJ_GET_INFO_BY_FD") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-29bpf: don't indicate success when copy_from_user failsDaniel Borkmann
err in bpf_prog_get_info_by_fd() still holds 0 at that time from prior check_uarg_tail_zero() check. Explicitly return -EFAULT instead, so user space can be notified of buggy behavior. Fixes: 1e2709769086 ("bpf: Add BPF_OBJ_GET_INFO_BY_FD") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-28workqueue: Work around edge cases for calc of pool's cpumaskMichael Bringmann
There is an underlying assumption/trade-off in many layers of the Linux system that CPU <-> node mapping is static. This is despite the presence of features like NUMA and 'hotplug' that support the dynamic addition/ removal of fundamental system resources like CPUs and memory. PowerPC systems, however, do provide extensive features for the dynamic change of resources available to a system. Currently, there is little or no synchronization protection around the updating of the CPU <-> node mapping, and the export/update of this information for other layers / modules. In systems which can change this mapping during 'hotplug', like PowerPC, the information is changing underneath all layers that might reference it. This patch attempts to ensure that a valid, usable cpumask attribute is used by the workqueue infrastructure when setting up new resource pools. It prevents a crash that has been observed when an 'empty' cpumask is passed along to the worker/task scheduling code. It is intended as a temporary workaround until a more fundamental review and correction of the issue can be done. [With additions to the patch provided by Tejun Hao <tj@kernel.org>] Signed-off-by: Michael Bringmann <mwb@linux.vnet.ibm.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2017-07-27genirq/cpuhotplug: Revert "Set force affinity flag on hotplug migration"Thomas Gleixner
That commit was part of the changes moving x86 to the generic CPU hotplug interrupt migration code. The force flag was required on x86 before the hierarchical irqdomain rework, but invoking set_affinity() with force=true stayed and had no side effects. At some point in the past, the force flag got repurposed to support the exynos timer interrupt affinity setting to a not yet online CPU, so the interrupt controller callback does not verify the supplied affinity mask against cpu_online_mask. Setting the flag in the CPU hotplug code causes the cpu online masking to be blocked on these irq controllers and results in potentially affining an interrupt to the CPU which is unplugged, i.e. instead of moving it away, it's just reassigned to it. As the force flags is not longer needed on x86, it's safe to revert that patch so the ARM irqchips which use the force flag work again. Add comments to that effect, so this won't happen again. Note: The online mask handling should be done in the generic code and the force flag and the masking in the irq chips removed all together, but that's not a change possible for 4.13. Fixes: 77f85e66aa8b ("genirq/cpuhotplug: Set force affinity flag on hotplug migration") Reported-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Will Deacon <will.deacon@arm.com> Cc: Marc Zyngier <marc.zyngier@arm.com> Cc: Russell King <linux@arm.linux.org.uk> Cc: LAK <linux-arm-kernel@lists.infradead.org> Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1707271217590.3109@nanos Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-07-25workqueue: implicit ordered attribute should be overridableTejun Heo
5c0338c68706 ("workqueue: restore WQ_UNBOUND/max_active==1 to be ordered") automatically enabled ordered attribute for unbound workqueues w/ max_active == 1. Because ordered workqueues reject max_active and some attribute changes, this implicit ordered mode broke cases where the user creates an unbound workqueue w/ max_active == 1 and later explicitly changes the related attributes. This patch distinguishes explicit and implicit ordered setting and overrides from attribute changes if implict. Signed-off-by: Tejun Heo <tj@kernel.org> Fixes: 5c0338c68706 ("workqueue: restore WQ_UNBOUND/max_active==1 to be ordered")
2017-07-25sched/core: Fix some documentation build warningsJonathan Corbet
The kerneldoc comments for try_to_wake_up_local() were out of date, leading to these documentation build warnings: ./kernel/sched/core.c:2080: warning: No description found for parameter 'rf' ./kernel/sched/core.c:2080: warning: Excess function parameter 'cookie' description in 'try_to_wake_up_local' Update the comment to reflect current reality and give us some peace and quiet. Signed-off-by: Jonathan Corbet <corbet@lwn.net> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-doc@vger.kernel.org Link: http://lkml.kernel.org/r/20170724135628.695cecfc@lwn.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-07-24bpf/verifier: fix min/max handling in BPF_SUBEdward Cree
We have to subtract the src max from the dst min, and vice-versa, since (e.g.) the smallest result comes from the largest subtrahend. Fixes: 484611357c19 ("bpf: allow access into map value arrays") Signed-off-by: Edward Cree <ecree@solarflare.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-23cgroup: fix error return value from cgroup_subtree_control()Tejun Heo
While refactoring, f7b2814bb9b6 ("cgroup: factor out cgroup_{apply|finalize}_control() from cgroup_subtree_control_write()") broke error return value from the function. The return value from the last operation is always overridden to zero. Fix it. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: stable@vger.kernel.org # v4.6+ Signed-off-by: Tejun Heo <tj@kernel.org>
2017-07-21Merge tag 'trace-v4.13-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull tracing fixes from Steven Rostedt: "Three minor updates - Use the new GFP_RETRY_MAYFAIL to be more aggressive in allocating memory for the ring buffer without causing OOMs - Fix a memory leak in adding and removing instances - Add __rcu annotation to be able to debug RCU usage of function tracing a bit better" * tag 'trace-v4.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: trace: fix the errors caused by incompatible type of RCU variables tracing: Fix kmemleak in instance_rmdir tracing/ring_buffer: Try harder to allocate
2017-07-21Merge branch 'sched-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fixes from Ingo Molnar: "A cputime fix and code comments/organization fix to the deadline scheduler" * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/deadline: Fix confusing comments about selection of top pi-waiter sched/cputime: Don't use smp_processor_id() in preemptible context
2017-07-21Merge branch 'perf-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf fixes from Ingo Molnar: "Two hw-enablement patches, two race fixes, three fixes for regressions of semantics, plus a number of tooling fixes" * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf/x86/intel: Add proper condition to run sched_task callbacks perf/core: Fix locking for children siblings group read perf/core: Fix scheduling regression of pinned groups perf/x86/intel: Fix debug_store reset field for freq events perf/x86/intel: Add Goldmont Plus CPU PMU support perf/x86/intel: Enable C-state residency events for Apollo Lake perf symbols: Accept zero as the kernel base address Revert "perf/core: Drop kernel samples even though :u is specified" perf annotate: Fix broken arrow at row 0 connecting jmp instruction to its target perf evsel: State in the default event name if attr.exclude_kernel is set perf evsel: Fix attr.exclude_kernel setting for default cycles:p
2017-07-21Merge branch 'locking-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull locking fixlet from Ingo Molnar: "Remove an unnecessary priority adjustment in the rtmutex code" * 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: locking/rtmutex: Remove unnecessary priority adjustment
2017-07-21Merge branch 'irq-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull irq fixes from Ingo Molnar: "A resume_irq() fix, plus a number of static declaration fixes" * 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: irqchip/digicolor: Drop unnecessary static irqchip/mips-cpu: Drop unnecessary static irqchip/gic/realview: Drop unnecessary static irqchip/mips-gic: Remove population of irq domain names genirq/PM: Properly pretend disabled state when force resuming interrupts
2017-07-21perf/core: Fix locking for children siblings group readJiri Olsa
We're missing ctx lock when iterating children siblings within the perf_read path for group reading. Following race and crash can happen: User space doing read syscall on event group leader: T1: perf_read lock event->ctx->mutex perf_read_group lock leader->child_mutex __perf_read_group_add(child) list_for_each_entry(sub, &leader->sibling_list, group_entry) ----> sub might be invalid at this point, because it could get removed via perf_event_exit_task_context in T2 Child exiting and cleaning up its events: T2: perf_event_exit_task_context lock ctx->mutex list_for_each_entry_safe(child_event, next, &child_ctx->event_list,... perf_event_exit_event(child) lock ctx->lock perf_group_detach(child) unlock ctx->lock ----> child is removed from sibling_list without any sync with T1 path above ... free_event(child) Before the child is removed from the leader's child_list, (and thus is omitted from perf_read_group processing), we need to ensure that perf_read_group touches child's siblings under its ctx->lock. Peter further notes: | One additional note; this bug got exposed by commit: | | ba5213ae6b88 ("perf/core: Correct event creation with PERF_FORMAT_GROUP") | | which made it possible to actually trigger this code-path. Tested-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Jiri Olsa <jolsa@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Fixes: ba5213ae6b88 ("perf/core: Correct event creation with PERF_FORMAT_GROUP") Link: http://lkml.kernel.org/r/20170720141455.2106-1-jolsa@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-07-20Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds
Pull networking fixes from David Miller: 1) BPF verifier signed/unsigned value tracking fix, from Daniel Borkmann, Edward Cree, and Josef Bacik. 2) Fix memory allocation length when setting up calls to ->ndo_set_mac_address, from Cong Wang. 3) Add a new cxgb4 device ID, from Ganesh Goudar. 4) Fix FIB refcount handling, we have to set it's initial value before the configure callback (which can bump it). From David Ahern. 5) Fix double-free in qcom/emac driver, from Timur Tabi. 6) A bunch of gcc-7 string format overflow warning fixes from Arnd Bergmann. 7) Fix link level headroom tests in ip_do_fragment(), from Vasily Averin. 8) Fix chunk walking in SCTP when iterating over error and parameter headers. From Alexander Potapenko. 9) TCP BBR congestion control fixes from Neal Cardwell. 10) Fix SKB fragment handling in bcmgenet driver, from Doug Berger. 11) BPF_CGROUP_RUN_PROG_SOCK_OPS needs to check for null __sk, from Cong Wang. 12) xmit_recursion in ppp driver needs to be per-device not per-cpu, from Gao Feng. 13) Cannot release skb->dst in UDP if IP options processing needs it. From Paolo Abeni. 14) Some netdev ioctl ifr_name[] NULL termination fixes. From Alexander Levin and myself. 15) Revert some rtnetlink notification changes that are causing regressions, from David Ahern. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (83 commits) net: bonding: Fix transmit load balancing in balance-alb mode rds: Make sure updates to cp_send_gen can be observed net: ethernet: ti: cpsw: Push the request_irq function to the end of probe ipv4: initialize fib_trie prior to register_netdev_notifier call. rtnetlink: allocate more memory for dev_set_mac_address() net: dsa: b53: Add missing ARL entries for BCM53125 bpf: more tests for mixed signed and unsigned bounds checks bpf: add test for mixed signed and unsigned bounds checks bpf: fix up test cases with mixed signed/unsigned bounds bpf: allow to specify log level and reduce it for test_verifier bpf: fix mixed signed/unsigned derived min/max value bounds ipv6: avoid overflow of offset in ip6_find_1stfragopt net: tehuti: don't process data if it has not been copied from userspace Revert "rtnetlink: Do not generate notifications for CHANGEADDR event" net: dsa: mv88e6xxx: Enable CMODE config support for 6390X dt-binding: ptp: Add SoC compatibility strings for dte ptp clock NET: dwmac: Make dwmac reset unconditional net: Zero terminate ifr_name in dev_ifname(). wireless: wext: terminate ifr name coming from userspace netfilter: fix netfilter_net_init() return ...
2017-07-20bpf: fix mixed signed/unsigned derived min/max value boundsDaniel Borkmann
Edward reported that there's an issue in min/max value bounds tracking when signed and unsigned compares both provide hints on limits when having unknown variables. E.g. a program such as the following should have been rejected: 0: (7a) *(u64 *)(r10 -8) = 0 1: (bf) r2 = r10 2: (07) r2 += -8 3: (18) r1 = 0xffff8a94cda93400 5: (85) call bpf_map_lookup_elem#1 6: (15) if r0 == 0x0 goto pc+7 R0=map_value(ks=8,vs=8,id=0),min_value=0,max_value=0 R10=fp 7: (7a) *(u64 *)(r10 -16) = -8 8: (79) r1 = *(u64 *)(r10 -16) 9: (b7) r2 = -1 10: (2d) if r1 > r2 goto pc+3 R0=map_value(ks=8,vs=8,id=0),min_value=0,max_value=0 R1=inv,min_value=0 R2=imm-1,max_value=18446744073709551615,min_align=1 R10=fp 11: (65) if r1 s> 0x1 goto pc+2 R0=map_value(ks=8,vs=8,id=0),min_value=0,max_value=0 R1=inv,min_value=0,max_value=1 R2=imm-1,max_value=18446744073709551615,min_align=1 R10=fp 12: (0f) r0 += r1 13: (72) *(u8 *)(r0 +0) = 0 R0=map_value_adj(ks=8,vs=8,id=0),min_value=0,max_value=1 R1=inv,min_value=0,max_value=1 R2=imm-1,max_value=18446744073709551615,min_align=1 R10=fp 14: (b7) r0 = 0 15: (95) exit What happens is that in the first part ... 8: (79) r1 = *(u64 *)(r10 -16) 9: (b7) r2 = -1 10: (2d) if r1 > r2 goto pc+3 ... r1 carries an unsigned value, and is compared as unsigned against a register carrying an immediate. Verifier deduces in reg_set_min_max() that since the compare is unsigned and operation is greater than (>), that in the fall-through/false case, r1's minimum bound must be 0 and maximum bound must be r2. Latter is larger than the bound and thus max value is reset back to being 'invalid' aka BPF_REGISTER_MAX_RANGE. Thus, r1 state is now 'R1=inv,min_value=0'. The subsequent test ... 11: (65) if r1 s> 0x1 goto pc+2 ... is a signed compare of r1 with immediate value 1. Here, verifier deduces in reg_set_min_max() that since the compare is signed this time and operation is greater than (>), that in the fall-through/false case, we can deduce that r1's maximum bound must be 1, meaning with prior test, we result in r1 having the following state: R1=inv,min_value=0,max_value=1. Given that the actual value this holds is -8, the bounds are wrongly deduced. When this is being added to r0 which holds the map_value(_adj) type, then subsequent store access in above case will go through check_mem_access() which invokes check_map_access_adj(), that will then probe whether the map memory is in bounds based on the min_value and max_value as well as access size since the actual unknown value is min_value <= x <= max_value; commit fce366a9dd0d ("bpf, verifier: fix alu ops against map_value{, _adj} register types") provides some more explanation on the semantics. It's worth to note in this context that in the current code, min_value and max_value tracking are used for two things, i) dynamic map value access via check_map_access_adj() and since commit 06c1c049721a ("bpf: allow helpers access to variable memory") ii) also enforced at check_helper_mem_access() when passing a memory address (pointer to packet, map value, stack) and length pair to a helper and the length in this case is an unknown value defining an access range through min_value/max_value in that case. The min_value/max_value tracking is /not/ used in the direct packet access case to track ranges. However, the issue also affects case ii), for example, the following crafted program based on the same principle must be rejected as well: 0: (b7) r2 = 0 1: (bf) r3 = r10 2: (07) r3 += -512 3: (7a) *(u64 *)(r10 -16) = -8 4: (79) r4 = *(u64 *)(r10 -16) 5: (b7) r6 = -1 6: (2d) if r4 > r6 goto pc+5 R1=ctx R2=imm0,min_value=0,max_value=0,min_align=2147483648 R3=fp-512 R4=inv,min_value=0 R6=imm-1,max_value=18446744073709551615,min_align=1 R10=fp 7: (65) if r4 s> 0x1 goto pc+4 R1=ctx R2=imm0,min_value=0,max_value=0,min_align=2147483648 R3=fp-512 R4=inv,min_value=0,max_value=1 R6=imm-1,max_value=18446744073709551615,min_align=1 R10=fp 8: (07) r4 += 1 9: (b7) r5 = 0 10: (6a) *(u16 *)(r10 -512) = 0 11: (85) call bpf_skb_load_bytes#26 12: (b7) r0 = 0 13: (95) exit Meaning, while we initialize the max_value stack slot that the verifier thinks we access in the [1,2] range, in reality we pass -7 as length which is interpreted as u32 in the helper. Thus, this issue is relevant also for the case of helper ranges. Resetting both bounds in check_reg_overflow() in case only one of them exceeds limits is also not enough as similar test can be created that uses values which are within range, thus also here learned min value in r1 is incorrect when mixed with later signed test to create a range: 0: (7a) *(u64 *)(r10 -8) = 0 1: (bf) r2 = r10 2: (07) r2 += -8 3: (18) r1 = 0xffff880ad081fa00 5: (85) call bpf_map_lookup_elem#1 6: (15) if r0 == 0x0 goto pc+7 R0=map_value(ks=8,vs=8,id=0),min_value=0,max_value=0 R10=fp 7: (7a) *(u64 *)(r10 -16) = -8 8: (79) r1 = *(u64 *)(r10 -16) 9: (b7) r2 = 2 10: (3d) if r2 >= r1 goto pc+3 R0=map_value(ks=8,vs=8,id=0),min_value=0,max_value=0 R1=inv,min_value=3 R2=imm2,min_value=2,max_value=2,min_align=2 R10=fp 11: (65) if r1 s> 0x4 goto pc+2 R0=map_value(ks=8,vs=8,id=0),min_value=0,max_value=0 R1=inv,min_value=3,max_value=4 R2=imm2,min_value=2,max_value=2,min_align=2 R10=fp 12: (0f) r0 += r1 13: (72) *(u8 *)(r0 +0) = 0 R0=map_value_adj(ks=8,vs=8,id=0),min_value=3,max_value=4 R1=inv,min_value=3,max_value=4 R2=imm2,min_value=2,max_value=2,min_align=2 R10=fp 14: (b7) r0 = 0 15: (95) exit This leaves us with two options for fixing this: i) to invalidate all prior learned information once we switch signed context, ii) to track min/max signed and unsigned boundaries separately as done in [0]. (Given latter introduces major changes throughout the whole verifier, it's rather net-next material, thus this patch follows option i), meaning we can derive bounds either from only signed tests or only unsigned tests.) There is still the case of adjust_reg_min_max_vals(), where we adjust bounds on ALU operations, meaning programs like the following where boundaries on the reg get mixed in context later on when bounds are merged on the dst reg must get rejected, too: 0: (7a) *(u64 *)(r10 -8) = 0 1: (bf) r2 = r10 2: (07) r2 += -8 3: (18) r1 = 0xffff89b2bf87ce00 5: (85) call bpf_map_lookup_elem#1 6: (15) if r0 == 0x0 goto pc+6 R0=map_value(ks=8,vs=8,id=0),min_value=0,max_value=0 R10=fp 7: (7a) *(u64 *)(r10 -16) = -8 8: (79) r1 = *(u64 *)(r10 -16) 9: (b7) r2 = 2 10: (3d) if r2 >= r1 goto pc+2 R0=map_value(ks=8,vs=8,id=0),min_value=0,max_value=0 R1=inv,min_value=3 R2=imm2,min_value=2,max_value=2,min_align=2 R10=fp 11: (b7) r7 = 1 12: (65) if r7 s> 0x0 goto pc+2 R0=map_value(ks=8,vs=8,id=0),min_value=0,max_value=0 R1=inv,min_value=3 R2=imm2,min_value=2,max_value=2,min_align=2 R7=imm1,max_value=0 R10=fp 13: (b7) r0 = 0 14: (95) exit from 12 to 15: R0=map_value(ks=8,vs=8,id=0),min_value=0,max_value=0 R1=inv,min_value=3 R2=imm2,min_value=2,max_value=2,min_align=2 R7=imm1,min_value=1 R10=fp 15: (0f) r7 += r1 16: (65) if r7 s> 0x4 goto pc+2 R0=map_value(ks=8,vs=8,id=0),min_value=0,max_value=0 R1=inv,min_value=3 R2=imm2,min_value=2,max_value=2,min_align=2 R7=inv,min_value=4,max_value=4 R10=fp 17: (0f) r0 += r7 18: (72) *(u8 *)(r0 +0) = 0 R0=map_value_adj(ks=8,vs=8,id=0),min_value=4,max_value=4 R1=inv,min_value=3 R2=imm2,min_value=2,max_value=2,min_align=2 R7=inv,min_value=4,max_value=4 R10=fp 19: (b7) r0 = 0 20: (95) exit Meaning, in adjust_reg_min_max_vals() we must also reset range values on the dst when src/dst registers have mixed signed/ unsigned derived min/max value bounds with one unbounded value as otherwise they can be added together deducing false boundaries. Once both boundaries are established from either ALU ops or compare operations w/o mixing signed/unsigned insns, then they can safely be added to other regs also having both boundaries established. Adding regs with one unbounded side to a map value where the bounded side has been learned w/o mixing ops is possible, but the resulting map value won't recover from that, meaning such op is considered invalid on the time of actual access. Invalid bounds are set on the dst reg in case i) src reg, or ii) in case dst reg already had them. The only way to recover would be to perform i) ALU ops but only 'add' is allowed on map value types or ii) comparisons, but these are disallowed on pointers in case they span a range. This is fine as only BPF_JEQ and BPF_JNE may be performed on PTR_TO_MAP_VALUE_OR_NULL registers which potentially turn them into PTR_TO_MAP_VALUE type depending on the branch, so only here min/max value cannot be invalidated for them. In terms of state pruning, value_from_signed is considered as well in states_equal() when dealing with adjusted map values. With regards to breaking existing programs, there is a small risk, but use-cases are rather quite narrow where this could occur and mixing compares probably unlikely. Joint work with Josef and Edward. [0] https://lists.iovisor.org/pipermail/iovisor-dev/2017-June/000822.html Fixes: 484611357c19 ("bpf: allow access into map value arrays") Reported-by: Edward Cree <ecree@solarflare.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Edward Cree <ecree@solarflare.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-20Merge branch 'stable-4.13' of git://git.infradead.org/users/pcmoore/auditLinus Torvalds
Pull audit fix from Paul Moore: "A small audit fix, just a single line, to plug a memory leak in some audit error handling code" * 'stable-4.13' of git://git.infradead.org/users/pcmoore/audit: audit: fix memleak in auditd_send_unicast_skb.
2017-07-20trace: fix the errors caused by incompatible type of RCU variablesChunyan Zhang
The variables which are processed by RCU functions should be annotated as RCU, otherwise sparse will report the errors like below: "error: incompatible types in comparison expression (different address spaces)" Link: http://lkml.kernel.org/r/1496823171-7758-1-git-send-email-zhang.chunyan@linaro.org Signed-off-by: Chunyan Zhang <zhang.chunyan@linaro.org> [ Updated to not be 100% 80 column strict ] Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>