summaryrefslogtreecommitdiffstats
path: root/kernel
AgeCommit message (Collapse)Author
2018-05-30kdb: make "mdr" command repeatRandy Dunlap
[ Upstream commit 1e0ce03bf142454f38a5fc050bf4fd698d2d36d8 ] The "mdr" command should repeat (continue) when only Enter/Return is pressed, so make it do so. Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Cc: Daniel Thompson <daniel.thompson@linaro.org> Cc: Jason Wessel <jason.wessel@windriver.com> Cc: kgdb-bugreport@lists.sourceforge.net Signed-off-by: Jason Wessel <jason.wessel@windriver.com> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-05-30rcu: Call touch_nmi_watchdog() while printing stall warningsTejun Heo
[ Upstream commit 3caa973b7a260e7a2a69edc94c300ab9c65148c3 ] When RCU stall warning triggers, it can print out a lot of messages while holding spinlocks. If the console device is slow (e.g. an actual or IPMI serial console), it may end up triggering NMI hard lockup watchdog like the following.
2018-05-30audit: return on memory error to avoid null pointer dereferenceRichard Guy Briggs
[ Upstream commit 23138ead270045f1b3e912e667967b6094244999 ] If there is a memory allocation error when trying to change an audit kernel feature value, the ignored allocation error will trigger a NULL pointer dereference oops on subsequent use of that pointer. Return instead. Passes audit-testsuite. See: https://github.com/linux-audit/audit-kernel/issues/76 Signed-off-by: Richard Guy Briggs <rgb@redhat.com> [PM: not necessary (other funcs check for NULL), but a good practice] Signed-off-by: Paul Moore <paul@paul-moore.com> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-05-30perf/core: Fix perf_output_read_group()Peter Zijlstra
[ Upstream commit 9e5b127d6f33468143d90c8a45ca12410e4c3fa7 ] Mark reported his arm64 perf fuzzer runs sometimes splat like: armv8pmu_read_counter+0x1e8/0x2d8 armpmu_event_update+0x8c/0x188 armpmu_read+0xc/0x18 perf_output_read+0x550/0x11e8 perf_event_read_event+0x1d0/0x248 perf_event_exit_task+0x468/0xbb8 do_exit+0x690/0x1310 do_group_exit+0xd0/0x2b0 get_signal+0x2e8/0x17a8 do_signal+0x144/0x4f8 do_notify_resume+0x148/0x1e8 work_pending+0x8/0x14 which asserts that we only call pmu::read() on ACTIVE events. The above callchain does: perf_event_exit_task() perf_event_exit_task_context() task_ctx_sched_out() // INACTIVE perf_event_exit_event() perf_event_set_state(EXIT) // EXIT sync_child_event() perf_event_read_event() perf_output_read() perf_output_read_group() leader->pmu->read() Which results in doing a pmu::read() on an !ACTIVE event. I _think_ this is 'new' since we added attr.inherit_stat, which added the perf_event_read_event() to the exit path, without that perf_event_read_output() would only trigger from samples and for @event to trigger a sample, it's leader _must_ be ACTIVE too. Still, adding this check makes it consistent with the @sub case for the siblings. Reported-and-Tested-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-05-30perf/core: Fix installing cgroup events on CPUleilei.lin
[ Upstream commit 33801b94741d6c3be9713c10aa627477216c21e2 ] There's two problems when installing cgroup events on CPUs: firstly list_update_cgroup_event() only tries to set cpuctx->cgrp for the first event, if that mismatches on @cgrp we'll not try again for later additions. Secondly, when we install a cgroup event into an active context, only issue an event reprogram when the event matches the current cgroup context. This avoids a pointless event reprogramming. Signed-off-by: leilei.lin <leilei.lin@alibaba-inc.com> [ Improved the changelog and comments. ] Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Cc: brendan.d.gregg@gmail.com Cc: eranian@gmail.com Cc: linux-kernel@vger.kernel.org Cc: yang_oliver@hotmail.com Link: http://lkml.kernel.org/r/20180306093637.28247-1-linxiulei@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-05-30sched/rt: Fix rq->clock_update_flags < RQCF_ACT_SKIP warningDavidlohr Bueso
[ Upstream commit d29a20645d5e929aa7e8616f28e5d8e1c49263ec ] While running rt-tests' pi_stress program I got the following splat: rq->clock_update_flags < RQCF_ACT_SKIP WARNING: CPU: 27 PID: 0 at kernel/sched/sched.h:960 assert_clock_updated.isra.38.part.39+0x13/0x20 [...] <IRQ> enqueue_top_rt_rq+0xf4/0x150 ? cpufreq_dbs_governor_start+0x170/0x170 sched_rt_rq_enqueue+0x65/0x80 sched_rt_period_timer+0x156/0x360 ? sched_rt_rq_enqueue+0x80/0x80 __hrtimer_run_queues+0xfa/0x260 hrtimer_interrupt+0xcb/0x220 smp_apic_timer_interrupt+0x62/0x120 apic_timer_interrupt+0xf/0x20 </IRQ> [...] do_idle+0x183/0x1e0 cpu_startup_entry+0x5f/0x70 start_secondary+0x192/0x1d0 secondary_startup_64+0xa5/0xb0 We can get rid of it be the "traditional" means of adding an update_rq_clock() call after acquiring the rq->lock in do_sched_rt_period_timer(). The case for the RT task throttling (which this workload also hits) can be ignored in that the skip_update call is actually bogus and quite the contrary (the request bits are removed/reverted). By setting RQCF_UPDATED we really don't care if the skip is happening or not and will therefore make the assert_clock_updated() check happy. Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Reviewed-by: Matt Fleming <matt@codeblueprint.co.uk> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: dave@stgolabs.net Cc: linux-kernel@vger.kernel.org Cc: rostedt@goodmis.org Link: http://lkml.kernel.org/r/20180402164954.16255-1-dave@stgolabs.net Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-05-30perf/cgroup: Fix child event counting bugSong Liu
[ Upstream commit c917e0f259908e75bd2a65877e25f9d90c22c848 ] When a perf_event is attached to parent cgroup, it should count events for all children cgroups: parent_group <---- perf_event \ - child_group <---- process(es) However, in our tests, we found this perf_event cannot report reliable results. Here is an example case: # create cgroups mkdir -p /sys/fs/cgroup/p/c # start perf for parent group perf stat -e instructions -G "p" # on another console, run test process in child cgroup: stressapptest -s 2 -M 1000 & echo $! > /sys/fs/cgroup/p/c/cgroup.procs # after the test process is done, stop perf in the first console shows <not counted> instructions p The instruction should not be "not counted" as the process runs in the child cgroup. We found this is because perf_event->cgrp and cpuctx->cgrp are not identical, thus perf_event->cgrp are not updated properly. This patch fixes this by updating perf_cgroup properly for ancestor cgroup(s). Reported-by: Ephraim Park <ephiepark@fb.com> Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: <jolsa@redhat.com> Cc: <kernel-team@fb.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Link: http://lkml.kernel.org/r/20180312165943.1057894-1-songliubraving@fb.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-05-30workqueue: use put_device() instead of kfree()Arvind Yadav
[ Upstream commit 537f4146c53c95aac977852b371bafb9c6755ee1 ] Never directly free @dev after calling device_register(), even if it returned an error! Always use put_device() to give up the reference initialized in this function instead. Signed-off-by: Arvind Yadav <arvind.yadav.cs@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-05-30kernel/relay.c: limit kmalloc size to KMALLOC_MAX_SIZEDavid Rientjes
[ Upstream commit 88913bd8ea2a75d7e460a4bed5f75e1c32660d7e ] chan->n_subbufs is set by the user and relay_create_buf() does a kmalloc() of chan->n_subbufs * sizeof(size_t *). kmalloc_slab() will generate a warning when this fails if chan->subbufs * sizeof(size_t *) > KMALLOC_MAX_SIZE. Limit chan->n_subbufs to the maximum allowed kmalloc() size. Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1802061216100.122576@chino.kir.corp.google.com Fixes: f6302f1bcd75 ("relay: prevent integer overflow in relay_open()") Signed-off-by: David Rientjes <rientjes@google.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Dave Jiang <dave.jiang@intel.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-05-30kernel/sys.c: fix potential Spectre v1 issueGustavo A. R. Silva
commit 23d6aef74da86a33fa6bb75f79565e0a16ee97c2 upstream. `resource' can be controlled by user-space, hence leading to a potential exploitation of the Spectre variant 1 vulnerability. This issue was detected with the help of Smatch: kernel/sys.c:1474 __do_compat_sys_old_getrlimit() warn: potential spectre issue 'get_current()->signal->rlim' (local cap) kernel/sys.c:1455 __do_sys_old_getrlimit() warn: potential spectre issue 'get_current()->signal->rlim' (local cap) Fix this by sanitizing *resource* before using it to index current->signal->rlim Notice that given that speculation windows are large, the policy is to kill the speculation on the first load and not worry if it can be completed with a dependent load/store [1]. [1] https://marc.info/?l=linux-kernel&m=152449131114778&w=2 Link: http://lkml.kernel.org/r/20180515030038.GA11822@embeddedor.com Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-05-22seccomp: Move speculation migitation control to arch codeThomas Gleixner
commit 8bf37d8c067bb7eb8e7c381bdadf9bd89182b6bc upstream The migitation control is simpler to implement in architecture code as it avoids the extra function call to check the mode. Aside of that having an explicit seccomp enabled mode in the architecture mitigations would require even more workarounds. Move it into architecture code and provide a weak function in the seccomp code. Remove the 'which' argument as this allows the architecture to decide which mitigations are relevant for seccomp. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-05-22seccomp: Add filter flag to opt-out of SSB mitigationKees Cook
commit 00a02d0c502a06d15e07b857f8ff921e3e402675 upstream If a seccomp user is not interested in Speculative Store Bypass mitigation by default, it can set the new SECCOMP_FILTER_FLAG_SPEC_ALLOW flag when adding filters. Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-05-22seccomp: Use PR_SPEC_FORCE_DISABLEThomas Gleixner
commit b849a812f7eb92e96d1c8239b06581b2cfd8b275 upstream Use PR_SPEC_FORCE_DISABLE in seccomp() because seccomp does not allow to widen restrictions. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-05-22seccomp: Enable speculation flaw mitigationsKees Cook
commit 5c3070890d06ff82eecb808d02d2ca39169533ef upstream When speculation flaw mitigations are opt-in (via prctl), using seccomp will automatically opt-in to these protections, since using seccomp indicates at least some level of sandboxing is desired. Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-05-22nospec: Allow getting/setting on non-current taskKees Cook
commit 7bbf1373e228840bb0295a2ca26d548ef37f448e upstream Adjust arch_prctl_get/set_spec_ctrl() to operate on tasks other than current. This is needed both for /proc/$pid/status queries and for seccomp (since thread-syncing can trigger seccomp in non-current threads). Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-05-22prctl: Add speculation control prctlsThomas Gleixner
commit b617cfc858161140d69cc0b5cc211996b557a1c7 upstream Add two new prctls to control aspects of speculation related vulnerabilites and their mitigations to provide finer grained control over performance impacting mitigations. PR_GET_SPECULATION_CTRL returns the state of the speculation misfeature which is selected with arg2 of prctl(2). The return value uses bit 0-2 with the following meaning: Bit Define Description 0 PR_SPEC_PRCTL Mitigation can be controlled per task by PR_SET_SPECULATION_CTRL 1 PR_SPEC_ENABLE The speculation feature is enabled, mitigation is disabled 2 PR_SPEC_DISABLE The speculation feature is disabled, mitigation is enabled If all bits are 0 the CPU is not affected by the speculation misfeature. If PR_SPEC_PRCTL is set, then the per task control of the mitigation is available. If not set, prctl(PR_SET_SPECULATION_CTRL) for the speculation misfeature will fail. PR_SET_SPECULATION_CTRL allows to control the speculation misfeature, which is selected by arg2 of prctl(2) per task. arg3 is used to hand in the control value, i.e. either PR_SPEC_ENABLE or PR_SPEC_DISABLE. The common return values are: EINVAL prctl is not implemented by the architecture or the unused prctl() arguments are not 0 ENODEV arg2 is selecting a not supported speculation misfeature PR_SET_SPECULATION_CTRL has these additional return values: ERANGE arg3 is incorrect, i.e. it's not either PR_SPEC_ENABLE or PR_SPEC_DISABLE ENXIO prctl control of the selected speculation misfeature is disabled The first supported controlable speculation misfeature is PR_SPEC_STORE_BYPASS. Add the define so this can be shared between architectures. Based on an initial patch from Tim Chen and mostly rewritten. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-05-22tick/broadcast: Use for_each_cpu() specially on UP kernelsDexuan Cui
commit 5596fe34495cf0f645f417eb928ef224df3e3cb4 upstream. for_each_cpu() unintuitively reports CPU0 as set independent of the actual cpumask content on UP kernels. This causes an unexpected PIT interrupt storm on a UP kernel running in an SMP virtual machine on Hyper-V, and as a result, the virtual machine can suffer from a strange random delay of 1~20 minutes during boot-up, and sometimes it can hang forever. Protect if by checking whether the cpumask is empty before entering the for_each_cpu() loop. [ tglx: Use !IS_ENABLED(CONFIG_SMP) instead of #ifdeffery ] Signed-off-by: Dexuan Cui <decui@microsoft.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Josh Poulson <jopoulso@microsoft.com> Cc: "Michael Kelley (EOSG)" <Michael.H.Kelley@microsoft.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: stable@vger.kernel.org Cc: Rakib Mullick <rakib.mullick@gmail.com> Cc: Jork Loeser <Jork.Loeser@microsoft.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: KY Srinivasan <kys@microsoft.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Link: https://lkml.kernel.org/r/KL1P15301MB000678289FE55BA365B3279ABF990@KL1P15301MB0006.APCP153.PROD.OUTLOOK.COM Link: https://lkml.kernel.org/r/KL1P15301MB0006FA63BC22BEB64902EAA0BF930@KL1P15301MB0006.APCP153.PROD.OUTLOOK.COM Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-05-16perf/core: Fix possible Spectre-v1 indexing for ->aux_pages[]Peter Zijlstra
commit 4411ec1d1993e8dbff2898390e3fed280d88e446 upstream. > kernel/events/ring_buffer.c:871 perf_mmap_to_page() warn: potential spectre issue 'rb->aux_pages' Userspace controls @pgoff through the fault address. Sanitize the array index before doing the array dereference. Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: <stable@kernel.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-05-16tracing/uprobe_event: Fix strncpy corner caseMasami Hiramatsu
commit 50268a3d266ecfdd6c5873d62b2758d9732fc598 upstream. Fix string fetch function to terminate with NUL. It is OK to drop the rest of string. Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Song Liu <songliubraving@fb.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: security@kernel.org Cc: 范龙飞 <long7573@126.com> Fixes: 5baaa59ef09e ("tracing/probes: Implement 'memory' fetch method for uprobes") Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-05-16sched/autogroup: Fix possible Spectre-v1 indexing for sched_prio_to_weight[]Peter Zijlstra
commit 354d7793070611b4df5a79fbb0f12752d0ed0cc5 upstream. > kernel/sched/autogroup.c:230 proc_sched_autogroup_set_nice() warn: potential spectre issue 'sched_prio_to_weight' Userspace controls @nice, sanitize the array index. Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: <stable@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-05-16cpufreq: schedutil: Avoid using invalid next_freqRafael J. Wysocki
commit 97739501f207efe33145b918817f305b822987f8 upstream. If the next_freq field of struct sugov_policy is set to UINT_MAX, it shouldn't be used for updating the CPU frequency (this is a special "invalid" value), but after commit b7eaf1aab9f8 (cpufreq: schedutil: Avoid reducing frequency of busy CPUs prematurely) it may be passed as the new frequency to sugov_update_commit() in sugov_update_single(). Fix that by adding an extra check for the special UINT_MAX value of next_freq to sugov_update_single(). Fixes: b7eaf1aab9f8 (cpufreq: schedutil: Avoid reducing frequency of busy CPUs prematurely) Reported-by: Viresh Kumar <viresh.kumar@linaro.org> Cc: 4.12+ <stable@vger.kernel.org> # 4.12+ Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-05-16tracing: Fix regex_match_front() to not over compare the test stringSteven Rostedt (VMware)
commit dc432c3d7f9bceb3de6f5b44fb9c657c9810ed6d upstream. The regex match function regex_match_front() in the tracing filter logic, was fixed to test just the pattern length from testing the entire test string. That is, it went from strncmp(str, r->pattern, len) to strcmp(str, r->pattern, r->len). The issue is that str is not guaranteed to be nul terminated, and if r->len is greater than the length of str, it can access more memory than is allocated. The solution is to add a simple test if (len < r->len) return 0. Cc: stable@vger.kernel.org Fixes: 285caad415f45 ("tracing/filters: Fix MATCH_FRONT_ONLY filter matching") Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-05-16compat: fix 4-byte infoleak via uninitialized struct fieldJann Horn
commit 0a0b98734479aa5b3c671d5190e86273372cab95 upstream. Commit 3a4d44b61625 ("ntp: Move adjtimex related compat syscalls to native counterparts") removed the memset() in compat_get_timex(). Since then, the compat adjtimex syscall can invoke do_adjtimex() with an uninitialized ->tai. If do_adjtimex() doesn't write to ->tai (e.g. because the arguments are invalid), compat_put_timex() then copies the uninitialized ->tai field to userspace. Fix it by adding the memset() back. Fixes: 3a4d44b61625 ("ntp: Move adjtimex related compat syscalls to native counterparts") Signed-off-by: Jann Horn <jannh@google.com> Acked-by: Kees Cook <keescook@chromium.org> Acked-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-05-16perf: Remove superfluous allocation error checkJiri Olsa
commit bfb3d7b8b906b66551424d7636182126e1d134c8 upstream. If the get_callchain_buffers fails to allocate the buffer it will decrease the nr_callchain_events right away. There's no point of checking the allocation error for nr_callchain_events > 1. Removing that check. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: syzkaller-bugs@googlegroups.com Cc: x86@kernel.org Link: http://lkml.kernel.org/r/20180415092352.12403-3-jolsa@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-05-09tracing: Fix bad use of igrab in trace_uprobe.cSong Liu
commit 0c92c7a3c5d416f47b32c5f20a611dfeca5d5f2e upstream. As Miklos reported and suggested: This pattern repeats two times in trace_uprobe.c and in kernel/events/core.c as well: ret = kern_path(filename, LOOKUP_FOLLOW, &path); if (ret) goto fail_address_parse; inode = igrab(d_inode(path.dentry)); path_put(&path); And it's wrong. You can only hold a reference to the inode if you have an active ref to the superblock as well (which is normally through path.mnt) or holding s_umount. This way unmounting the containing filesystem while the tracepoint is active will give you the "VFS: Busy inodes after unmount..." message and a crash when the inode is finally put. Solution: store path instead of inode. This patch fixes two instances in trace_uprobe.c. struct path is added to struct trace_uprobe to keep the inode and containing mount point referenced. Link: http://lkml.kernel.org/r/20180423172135.4050588-1-songliubraving@fb.com Fixes: f3f096cfedf8 ("tracing: Provide trace events interface for uprobes") Fixes: 33ea4b24277b ("perf/core: Implement the 'perf_uprobe' PMU") Cc: stable@vger.kernel.org Cc: Ingo Molnar <mingo@redhat.com> Cc: Howard McLauchlan <hmclauchlan@fb.com> Cc: Josef Bacik <jbacik@fb.com> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Acked-by: Miklos Szeredi <mszeredi@redhat.com> Reported-by: Miklos Szeredi <miklos@szeredi.hu> Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-05-09tracepoint: Do not warn on ENOMEMMathieu Desnoyers
commit d66a270be3310d7aa132fec0cea77d3d32a0ff75 upstream. Tracepoint should only warn when a kernel API user does not respect the required preconditions (e.g. same tracepoint enabled twice, or called to remove a tracepoint that does not exist). Silence warning in out-of-memory conditions, given that the error is returned to the caller. This ensures that out-of-memory error-injection testing does not trigger warnings in tracepoint.c, which were seen by syzbot. Link: https://lkml.kernel.org/r/001a114465e241a8720567419a72@google.com Link: https://lkml.kernel.org/r/001a1140e0de15fc910567464190@google.com Link: http://lkml.kernel.org/r/20180315124424.32319-1-mathieu.desnoyers@efficios.com CC: Peter Zijlstra <peterz@infradead.org> CC: Jiri Olsa <jolsa@redhat.com> CC: Arnaldo Carvalho de Melo <acme@kernel.org> CC: Alexander Shishkin <alexander.shishkin@linux.intel.com> CC: Namhyung Kim <namhyung@kernel.org> CC: stable@vger.kernel.org Fixes: de7b2973903c6 ("tracepoint: Use struct pointer instead of name hash for reg/unreg tracepoints") Reported-by: syzbot+9c0d616860575a73166a@syzkaller.appspotmail.com Reported-by: syzbot+4e9ae7fa46233396f64d@syzkaller.appspotmail.com Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-05-01tick/sched: Do not mess with an enqueued hrtimerThomas Gleixner
commit 1f71addd34f4c442bec7d7c749acc1beb58126f2 upstream. Kaike reported that in tests rdma hrtimers occasionaly stopped working. He did great debugging, which provided enough context to decode the problem. CPU 3 CPU 2 idle start sched_timer expires = 712171000000 queue->next = sched_timer start rdmavt timer. expires = 712172915662 lock(baseof(CPU3)) tick_nohz_stop_tick() tick = 716767000000 timerqueue_add(tmr) hrtimer_set_expires(sched_timer, tick); sched_timer->expires = 716767000000 <---- FAIL if (tmr->expires < queue->next->expires) hrtimer_start(sched_timer) queue->next = tmr; lock(baseof(CPU3)) unlock(baseof(CPU3)) timerqueue_remove() timerqueue_add() ts->sched_timer is queued and queue->next is pointing to it, but then ts->sched_timer.expires is modified. This not only corrupts the ordering of the timerqueue RB tree, it also makes CPU2 see the new expiry time of timerqueue->next->expires when checking whether timerqueue->next needs to be updated. So CPU2 sees that the rdma timer is earlier than timerqueue->next and sets the rdma timer as new next. Depending on whether it had also seen the new time at RB tree enqueue, it might have queued the rdma timer at the wrong place and then after removing the sched_timer the RB tree is completely hosed. The problem was introduced with a commit which tried to solve inconsistency between the hrtimer in the tick_sched data and the underlying hardware clockevent. It split out hrtimer_set_expires() to store the new tick time in both the NOHZ and the NOHZ + HIGHRES case, but missed the fact that in the NOHZ + HIGHRES case the hrtimer might still be queued. Use hrtimer_start(timer, tick...) for the NOHZ + HIGHRES case which sets timer->expires after canceling the timer and move the hrtimer_set_expires() invocation into the NOHZ only code path which is not affected as it merily uses the hrtimer as next event storage so code pathes can be shared with the NOHZ + HIGHRES case. Fixes: d4af6d933ccf ("nohz: Fix spurious warning when hrtimer and clockevent get out of sync") Reported-by: "Wan Kaike" <kaike.wan@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Frederic Weisbecker <frederic@kernel.org> Cc: "Marciniszyn Mike" <mike.marciniszyn@intel.com> Cc: Anna-Maria Gleixner <anna-maria@linutronix.de> Cc: linux-rdma@vger.kernel.org Cc: "Dalessandro Dennis" <dennis.dalessandro@intel.com> Cc: "Fleck John" <john.fleck@intel.com> Cc: stable@vger.kernel.org Cc: Peter Zijlstra <peterz@infradead.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: "Weiny Ira" <ira.weiny@intel.com> Cc: "linux-rdma@vger.kernel.org" Link: https://lkml.kernel.org/r/alpine.DEB.2.21.1804241637390.1679@nanos.tec.linutronix.de Link: https://lkml.kernel.org/r/alpine.DEB.2.21.1804242119210.1597@nanos.tec.linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-04-29tracing: Fix missing tab for hwlat_detector print formatPeter Xu
commit 9a0fd675304d410f3a9586e1b333e16f4658d56c upstream. It's been missing for a while but no one is touching that up. Fix it. Link: http://lkml.kernel.org/r/20180315060639.9578-1-peterx@redhat.com CC: Ingo Molnar <mingo@kernel.org> Cc:stable@vger.kernel.org Fixes: 7b2c86250122d ("tracing: Add NMI tracing in hwlat detector") Signed-off-by: Peter Xu <peterx@redhat.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-04-26perf: Return proper values for user stack errorsJiri Olsa
commit 78b562fbfa2cf0a9fcb23c3154756b690f4905c1 upstream. Return immediately when we find issue in the user stack checks. The error value could get overwritten by following check for PERF_SAMPLE_REGS_INTR. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: syzkaller-bugs@googlegroups.com Cc: x86@kernel.org Fixes: 60e2364e60e8 ("perf: Add ability to sample machine state on interrupt") Link: http://lkml.kernel.org/r/20180415092352.12403-1-jolsa@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-04-26perf: Fix sample_max_stack maximum checkJiri Olsa
commit 5af44ca53d019de47efe6dbc4003dd518e5197ed upstream. The syzbot hit KASAN bug in perf_callchain_store having the entry stored behind the allocated bounds [1]. We miss the sample_max_stack check for the initial event that allocates callchain buffers. This missing check allows to create an event with sample_max_stack value bigger than the global sysctl maximum: # sysctl -a | grep perf_event_max_stack kernel.perf_event_max_stack = 127 # perf record -vv -C 1 -e cycles/max-stack=256/ kill ... perf_event_attr: size 112 ... sample_max_stack 256 ------------------------------------------------------------ sys_perf_event_open: pid -1 cpu 1 group_fd -1 flags 0x8 = 4 Note the '-C 1', which forces perf record to create just single event. Otherwise it opens event for every cpu, then the sample_max_stack check fails on the second event and all's fine. The fix is to run the sample_max_stack check also for the first event with callchains. [1] https://marc.info/?l=linux-kernel&m=152352732920874&w=2 Reported-by: syzbot+7c449856228b63ac951e@syzkaller.appspotmail.com Signed-off-by: Jiri Olsa <jolsa@kernel.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: syzkaller-bugs@googlegroups.com Cc: x86@kernel.org Fixes: 97c79a38cd45 ("perf core: Per event callchain limit") Link: http://lkml.kernel.org/r/20180415092352.12403-2-jolsa@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-04-26alarmtimer: Init nanosleep alarm timer on stackThomas Gleixner
commit bd03143007eb9b03a7f2316c677780561b68ba2a upstream. syszbot reported the following debugobjects splat: ODEBUG: object is on stack, but not annotated WARNING: CPU: 0 PID: 4185 at lib/debugobjects.c:328 RIP: 0010:debug_object_is_on_stack lib/debugobjects.c:327 [inline] debug_object_init+0x17/0x20 lib/debugobjects.c:391 debug_hrtimer_init kernel/time/hrtimer.c:410 [inline] debug_init kernel/time/hrtimer.c:458 [inline] hrtimer_init+0x8c/0x410 kernel/time/hrtimer.c:1259 alarm_init kernel/time/alarmtimer.c:339 [inline] alarm_timer_nsleep+0x164/0x4d0 kernel/time/alarmtimer.c:787 SYSC_clock_nanosleep kernel/time/posix-timers.c:1226 [inline] SyS_clock_nanosleep+0x235/0x330 kernel/time/posix-timers.c:1204 do_syscall_64+0x281/0x940 arch/x86/entry/common.c:287 entry_SYSCALL_64_after_hwframe+0x42/0xb7 This happens because the hrtimer for the alarm nanosleep is on stack, but the code does not use the proper debug objects initialization. Split out the code for the allocated use cases and invoke hrtimer_init_on_stack() for the nanosleep related functions. Reported-by: syzbot+a3e0726462b2e346a31d@syzkaller.appspotmail.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: John Stultz <john.stultz@linaro.org> Cc: syzkaller-bugs@googlegroups.com Link: https://lkml.kernel.org/r/alpine.DEB.2.21.1803261528270.1585@nanos.tec.linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-04-26locking/qspinlock: Ensure node->count is updated before initialising nodeWill Deacon
[ Upstream commit 11dc13224c975efcec96647a4768a6f1bb7a19a8 ] When queuing on the qspinlock, the count field for the current CPU's head node is incremented. This needn't be atomic because locking in e.g. IRQ context is balanced and so an IRQ will return with node->count as it found it. However, the compiler could in theory reorder the initialisation of node[idx] before the increment of the head node->count, causing an IRQ to overwrite the initialised node and potentially corrupt the lock state. Avoid the potential for this harmful compiler reordering by placing a barrier() between the increment of the head node->count and the subsequent node initialisation. Signed-off-by: Will Deacon <will.deacon@arm.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1518528177-19169-3-git-send-email-will.deacon@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-04-26bpf: sockmap, fix leaking maps with attached but not detached progsJohn Fastabend
[ Upstream commit 3d9e952697de89b53227f06d4241f275eb99cfc4 ] When a program is attached to a map we increment the program refcnt to ensure that the program is not removed while it is potentially being referenced from sockmap side. However, if this same program also references the map (this is a reasonably common pattern in my programs) then the verifier will also increment the maps refcnt from the verifier. This is to ensure the map doesn't get garbage collected while the program has a reference to it. So we are left in a state where the map holds the refcnt on the program stopping it from being removed and releasing the map refcnt. And vice versa the program holds a refcnt on the map stopping it from releasing the refcnt on the prog. All this is fine as long as users detach the program while the map fd is still around. But, if the user omits this detach command we are left with a dangling map we can no longer release. To resolve this when the map fd is released decrement the program references and remove any reference from the map to the program. This fixes the issue with possibly dangling map and creates a user side API constraint. That is, the map fd must be held open for programs to be attached to a map. Fixes: 174a79ff9515 ("bpf: sockmap with sk redirect support") Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-04-26x86/power: Fix swsusp_arch_resume prototypeArnd Bergmann
[ Upstream commit 328008a72d38b5bde6491e463405c34a81a65d3e ] The declaration for swsusp_arch_resume marks it as 'asmlinkage', but the definition in x86-32 does not, and it fails to include the header with the declaration. This leads to a warning when building with link-time-optimizations: kernel/power/power.h:108:23: error: type of 'swsusp_arch_resume' does not match original declaration [-Werror=lto-type-mismatch] extern asmlinkage int swsusp_arch_resume(void); ^ arch/x86/power/hibernate_32.c:148:0: note: 'swsusp_arch_resume' was previously declared here int swsusp_arch_resume(void) This moves the declaration into a globally visible header file and fixes up both x86 definitions to match it. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Len Brown <len.brown@intel.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Nicolas Pitre <nico@linaro.org> Cc: linux-pm@vger.kernel.org Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: Pavel Machek <pavel@ucw.cz> Cc: Bart Van Assche <bart.vanassche@wdc.com> Link: https://lkml.kernel.org/r/20180202145634.200291-2-arnd@arndb.de Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-04-24ring-buffer: Check if memory is available before allocationSteven Rostedt (VMware)
commit 2a872fa4e9c8adc79c830e4009e1cc0c013a9d8a upstream. The ring buffer is made up of a link list of pages. When making the ring buffer bigger, it will allocate all the pages it needs before adding to the ring buffer, and if it fails, it frees them and returns an error. This makes increasing the ring buffer size an all or nothing action. When this was first created, the pages were allocated with "NORETRY". This was to not cause any Out-Of-Memory (OOM) actions from allocating the ring buffer. But NORETRY was too strict, as the ring buffer would fail to expand even when there's memory available, but was taken up in the page cache. Commit 848618857d253 ("tracing/ring_buffer: Try harder to allocate") changed the allocating from NORETRY to RETRY_MAYFAIL. The RETRY_MAYFAIL would allocate from the page cache, but if there was no memory available, it would simple fail the allocation and not trigger an OOM. This worked fine, but had one problem. As the ring buffer would allocate one page at a time, it could take up all memory in the system before it failed to allocate and free that memory. If the allocation is happening and the ring buffer allocates all memory and then tries to take more than available, its allocation will not trigger an OOM, but if there's any allocation that happens someplace else, that could trigger an OOM, even though once the ring buffer's allocation fails, it would free up all the previous memory it tried to allocate, and allow other memory allocations to succeed. Commit d02bd27bd33dd ("mm/page_alloc.c: calculate 'available' memory in a separate function") separated out si_mem_availble() as a separate function that could be used to see how much memory is available in the system. Using this function to make sure that the ring buffer could be allocated before it tries to allocate pages we can avoid allocating all memory in the system and making it vulnerable to OOMs if other allocations are taking place. Link: http://lkml.kernel.org/r/1522320104-6573-1-git-send-email-zhaoyang.huang@spreadtrum.com CC: stable@vger.kernel.org Cc: linux-mm@kvack.org Fixes: 848618857d253 ("tracing/ring_buffer: Try harder to allocate") Requires: d02bd27bd33dd ("mm/page_alloc.c: calculate 'available' memory in a separate function") Reported-by: Zhaoyang Huang <huangzhaoyang@gmail.com> Tested-by: Joel Fernandes <joelaf@google.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-04-24resource: fix integer overflow at reallocationTakashi Iwai
commit 60bb83b81169820c691fbfa33a6a4aef32aa4b0b upstream. We've got a bug report indicating a kernel panic at booting on an x86-32 system, and it turned out to be the invalid PCI resource assigned after reallocation. __find_resource() first aligns the resource start address and resets the end address with start+size-1 accordingly, then checks whether it's contained. Here the end address may overflow the integer, although resource_contains() still returns true because the function validates only start and end address. So this ends up with returning an invalid resource (start > end). There was already an attempt to cover such a problem in the commit 47ea91b4052d ("Resource: fix wrong resource window calculation"), but this case is an overseen one. This patch adds the validity check of the newly calculated resource for avoiding the integer overflow problem. Bugzilla: http://bugzilla.opensuse.org/show_bug.cgi?id=1086739 Link: http://lkml.kernel.org/r/s5hpo37d5l8.wl-tiwai@suse.de Fixes: 23c570a67448 ("resource: ability to resize an allocated resource") Signed-off-by: Takashi Iwai <tiwai@suse.de> Reported-by: Michael Henders <hendersm@shaw.ca> Tested-by: Michael Henders <hendersm@shaw.ca> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Ram Pai <linuxram@us.ibm.com> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-04-19perf/core: Fix use-after-free in uprobe_perf_close()Prashant Bhole
commit 621b6d2ea297d0fb6030452c5bcd221f12165fcf upstream. A use-after-free bug was caught by KASAN while running usdt related code (BCC project. bcc/tests/python/test_usdt2.py): ================================================================== BUG: KASAN: use-after-free in uprobe_perf_close+0x222/0x3b0 Read of size 4 at addr ffff880384f9b4a4 by task test_usdt2.py/870 CPU: 4 PID: 870 Comm: test_usdt2.py Tainted: G W 4.16.0-next-20180409 #215 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014 Call Trace: dump_stack+0xc7/0x15b ? show_regs_print_info+0x5/0x5 ? printk+0x9c/0xc3 ? kmsg_dump_rewind_nolock+0x6e/0x6e ? uprobe_perf_close+0x222/0x3b0 print_address_description+0x83/0x3a0 ? uprobe_perf_close+0x222/0x3b0 kasan_report+0x1dd/0x460 ? uprobe_perf_close+0x222/0x3b0 uprobe_perf_close+0x222/0x3b0 ? probes_open+0x180/0x180 ? free_filters_list+0x290/0x290 trace_uprobe_register+0x1bb/0x500 ? perf_event_attach_bpf_prog+0x310/0x310 ? probe_event_disable+0x4e0/0x4e0 perf_uprobe_destroy+0x63/0xd0 _free_event+0x2bc/0xbd0 ? lockdep_rcu_suspicious+0x100/0x100 ? ring_buffer_attach+0x550/0x550 ? kvm_sched_clock_read+0x1a/0x30 ? perf_event_release_kernel+0x3e4/0xc00 ? __mutex_unlock_slowpath+0x12e/0x540 ? wait_for_completion+0x430/0x430 ? lock_downgrade+0x3c0/0x3c0 ? lock_release+0x980/0x980 ? do_raw_spin_trylock+0x118/0x150 ? do_raw_spin_unlock+0x121/0x210 ? do_raw_spin_trylock+0x150/0x150 perf_event_release_kernel+0x5d4/0xc00 ? put_event+0x30/0x30 ? fsnotify+0xd2d/0xea0 ? sched_clock_cpu+0x18/0x1a0 ? __fsnotify_update_child_dentry_flags.part.0+0x1b0/0x1b0 ? pvclock_clocksource_read+0x152/0x2b0 ? pvclock_read_flags+0x80/0x80 ? kvm_sched_clock_read+0x1a/0x30 ? sched_clock_cpu+0x18/0x1a0 ? pvclock_clocksource_read+0x152/0x2b0 ? locks_remove_file+0xec/0x470 ? pvclock_read_flags+0x80/0x80 ? fcntl_setlk+0x880/0x880 ? ima_file_free+0x8d/0x390 ? lockdep_rcu_suspicious+0x100/0x100 ? ima_file_check+0x110/0x110 ? fsnotify+0xea0/0xea0 ? kvm_sched_clock_read+0x1a/0x30 ? rcu_note_context_switch+0x600/0x600 perf_release+0x21/0x40 __fput+0x264/0x620 ? fput+0xf0/0xf0 ? do_raw_spin_unlock+0x121/0x210 ? do_raw_spin_trylock+0x150/0x150 ? SyS_fchdir+0x100/0x100 ? fsnotify+0xea0/0xea0 task_work_run+0x14b/0x1e0 ? task_work_cancel+0x1c0/0x1c0 ? copy_fd_bitmaps+0x150/0x150 ? vfs_read+0xe5/0x260 exit_to_usermode_loop+0x17b/0x1b0 ? trace_event_raw_event_sys_exit+0x1a0/0x1a0 do_syscall_64+0x3f6/0x490 ? syscall_return_slowpath+0x2c0/0x2c0 ? lockdep_sys_exit+0x1f/0xaa ? syscall_return_slowpath+0x1a3/0x2c0 ? lockdep_sys_exit+0x1f/0xaa ? prepare_exit_to_usermode+0x11c/0x1e0 ? enter_from_user_mode+0x30/0x30 random: crng init done ? __put_user_4+0x1c/0x30 entry_SYSCALL_64_after_hwframe+0x3d/0xa2 RIP: 0033:0x7f41d95f9340 RSP: 002b:00007fffe71e4268 EFLAGS: 00000246 ORIG_RAX: 0000000000000003 RAX: 0000000000000000 RBX: 000000000000000d RCX: 00007f41d95f9340 RDX: 0000000000000000 RSI: 0000000000002401 RDI: 000000000000000d RBP: 0000000000000000 R08: 00007f41ca8ff700 R09: 00007f41d996dd1f R10: 00007fffe71e41e0 R11: 0000000000000246 R12: 00007fffe71e4330 R13: 0000000000000000 R14: fffffffffffffffc R15: 00007fffe71e4290 Allocated by task 870: kasan_kmalloc+0xa0/0xd0 kmem_cache_alloc_node+0x11a/0x430 copy_process.part.19+0x11a0/0x41c0 _do_fork+0x1be/0xa20 do_syscall_64+0x198/0x490 entry_SYSCALL_64_after_hwframe+0x3d/0xa2 Freed by task 0: __kasan_slab_free+0x12e/0x180 kmem_cache_free+0x102/0x4d0 free_task+0xfe/0x160 __put_task_struct+0x189/0x290 delayed_put_task_struct+0x119/0x250 rcu_process_callbacks+0xa6c/0x1b60 __do_softirq+0x238/0x7ae The buggy address belongs to the object at ffff880384f9b480 which belongs to the cache task_struct of size 12928 It occurs because task_struct is freed before perf_event which refers to the task and task flags are checked while teardown of the event. perf_event_alloc() assigns task_struct to hw.target of perf_event, but there is no reference counting for it. As a fix we get_task_struct() in perf_event_alloc() at above mentioned assignment and put_task_struct() in _free_event(). Signed-off-by: Prashant Bhole <bhole_prashant_q7@lab.ntt.co.jp> Reviewed-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: <stable@kernel.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Fixes: 63b6da39bb38e8f1a1ef3180d32a39d6 ("perf: Fix perf_event_exit_task() race") Link: http://lkml.kernel.org/r/20180409100346.6416-1-bhole_prashant_q7@lab.ntt.co.jp Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-04-08perf/hwbp: Simplify the perf-hwbp code, fix documentationLinus Torvalds
commit f67b15037a7a50c57f72e69a6d59941ad90a0f0f upstream. Annoyingly, modify_user_hw_breakpoint() unnecessarily complicates the modification of a breakpoint - simplify it and remove the pointless local variables. Also update the stale Docbook while at it. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Acked-by: Thomas Gleixner <tglx@linutronix.de> Cc: <stable@vger.kernel.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Vince Weaver <vincent.weaver@maine.edu> Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-28bpf: skip unnecessary capability checkChenbo Feng
commit 0fa4fe85f4724fff89b09741c437cbee9cf8b008 upstream. The current check statement in BPF syscall will do a capability check for CAP_SYS_ADMIN before checking sysctl_unprivileged_bpf_disabled. This code path will trigger unnecessary security hooks on capability checking and cause false alarms on unprivileged process trying to get CAP_SYS_ADMIN access. This can be resolved by simply switch the order of the statement and CAP_SYS_ADMIN is not required anyway if unprivileged bpf syscall is allowed. Signed-off-by: Chenbo Feng <fengc@google.com> Acked-by: Lorenzo Colitti <lorenzo@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-28perf/core: Fix ctx_event_type in ctx_resched()Song Liu
commit bd903afeb504db5655a45bb4cf86f38be5b1bf62 upstream. In ctx_resched(), EVENT_FLEXIBLE should be sched_out when EVENT_PINNED is added. However, ctx_resched() calculates ctx_event_type before checking this condition. As a result, pinned events will NOT get higher priority than flexible events. The following shows this issue on an Intel CPU (where ref-cycles can only use one hardware counter). 1. First start: perf stat -C 0 -e ref-cycles -I 1000 2. Then, in the second console, run: perf stat -C 0 -e ref-cycles:D -I 1000 The second perf uses pinned events, which is expected to have higher priority. However, because it failed in ctx_resched(). It is never run. This patch fixes this by calculating ctx_event_type after re-evaluating event_type. Reported-by: Ephraim Park <ephiepark@fb.com> Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: <jolsa@redhat.com> Cc: <kernel-team@fb.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Fixes: 487f05e18aa4 ("perf/core: Optimize event rescheduling on active contexts") Link: http://lkml.kernel.org/r/20180306055504.3283731-1-songliubraving@fb.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-28posix-timers: Protect posix clock array access against speculationThomas Gleixner
commit 19b558db12f9f4e45a22012bae7b4783e62224da upstream. The clockid argument of clockid_to_kclock() comes straight from user space via various syscalls and is used as index into the posix_clocks array. Protect it against spectre v1 array out of bounds speculation. Remove the redundant check for !posix_clock[id] as this is another source for speculation and does not provide any advantage over the return posix_clock[id] path which returns NULL in that case anyway. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Dan Williams <dan.j.williams@intel.com> Cc: Rasmus Villemoes <rasmus.villemoes@prevas.dk> Cc: Greg KH <gregkh@linuxfoundation.org> Cc: stable@vger.kernel.org Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: David Woodhouse <dwmw@amazon.co.uk> Link: https://lkml.kernel.org/r/alpine.DEB.2.21.1802151718320.1296@nanos.tec.linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-28tracing: probeevent: Fix to support minus offset from symbolMasami Hiramatsu
commit c5d343b6b7badd1f5fe0873eff2e8d63a193e732 upstream. In Documentation/trace/kprobetrace.txt, it says @SYM[+|-offs] : Fetch memory at SYM +|- offs (SYM should be a data symbol) However, the parser doesn't parse minus offset correctly, since commit 2fba0c8867af ("tracing/kprobes: Fix probe offset to be unsigned") drops minus ("-") offset support for kprobe probe address usage. This fixes the traceprobe_split_symbol_offset() to parse minus offset again with checking the offset range, and add a minus offset check in kprobe probe address usage. Link: http://lkml.kernel.org/r/152129028983.31874.13419301530285775521.stgit@devbox Cc: Ingo Molnar <mingo@redhat.com> Cc: Tom Zanussi <tom.zanussi@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Ravi Bangoria <ravi.bangoria@linux.vnet.ibm.com> Cc: stable@vger.kernel.org Fixes: 2fba0c8867af ("tracing/kprobes: Fix probe offset to be unsigned") Acked-by: Namhyung Kim <namhyung@kernel.org> Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-28cgroup: fix rule checking for threaded mode switchingTejun Heo
commit d1897c9538edafd4ae6bbd03cc075962ddde2c21 upstream. A domain cgroup isn't allowed to be turned threaded if its subtree is populated or domain controllers are enabled. cgroup_enable_threaded() depended on cgroup_can_be_thread_root() test to enforce this rule. A parent which has populated domain descendants or have domain controllers enabled can't become a thread root, so the above rules are enforced automatically. However, for the root cgroup which can host mixed domain and threaded children, cgroup_can_be_thread_root() doesn't check any of those conditions and thus first level cgroups ends up escaping those rules. This patch fixes the bug by adding explicit checks for those rules in cgroup_enable_threaded(). Reported-by: Michael Kerrisk (man-pages) <mtk.manpages@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org> Fixes: 8cfd8147df67 ("cgroup: implement cgroup v2 thread support") Cc: stable@vger.kernel.org # v4.14+ Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-19locking/locktorture: Fix num reader/writer corner casesDavidlohr Bueso
[ Upstream commit 2ce77d16db4240dd2e422fc0a5c26d3e2ec03446 ] Things can explode for locktorture if the user does combinations of nwriters_stress=0 nreaders_stress=0. Fix this by not assuming we always want to torture writer threads. Reported-by: Jeremy Linton <jeremy.linton@arm.com> Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Jeremy Linton <jeremy.linton@arm.com> Tested-by: Jeremy Linton <jeremy.linton@arm.com> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-19sched: Stop resched_cpu() from sending IPIs to offline CPUsPaul E. McKenney
[ Upstream commit a0982dfa03efca6c239c52cabebcea4afb93ea6b ] The rcutorture test suite occasionally provokes a splat due to invoking resched_cpu() on an offline CPU: WARNING: CPU: 2 PID: 8 at /home/paulmck/public_git/linux-rcu/arch/x86/kernel/smp.c:128 native_smp_send_reschedule+0x37/0x40 Modules linked in: CPU: 2 PID: 8 Comm: rcu_preempt Not tainted 4.14.0-rc4+ #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014 task: ffff902ede9daf00 task.stack: ffff96c50010c000 RIP: 0010:native_smp_send_reschedule+0x37/0x40 RSP: 0018:ffff96c50010fdb8 EFLAGS: 00010096 RAX: 000000000000002e RBX: ffff902edaab4680 RCX: 0000000000000003 RDX: 0000000080000003 RSI: 0000000000000000 RDI: 00000000ffffffff RBP: ffff96c50010fdb8 R08: 0000000000000000 R09: 0000000000000001 R10: 0000000000000000 R11: 00000000299f36ae R12: 0000000000000001 R13: ffffffff9de64240 R14: 0000000000000001 R15: ffffffff9de64240 FS: 0000000000000000(0000) GS:ffff902edfc80000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00000000f7d4c642 CR3: 000000001e0e2000 CR4: 00000000000006e0 Call Trace: resched_curr+0x8f/0x1c0 resched_cpu+0x2c/0x40 rcu_implicit_dynticks_qs+0x152/0x220 force_qs_rnp+0x147/0x1d0 ? sync_rcu_exp_select_cpus+0x450/0x450 rcu_gp_kthread+0x5a9/0x950 kthread+0x142/0x180 ? force_qs_rnp+0x1d0/0x1d0 ? kthread_create_on_node+0x40/0x40 ret_from_fork+0x27/0x40 Code: 14 01 0f 92 c0 84 c0 74 14 48 8b 05 14 4f f4 00 be fd 00 00 00 ff 90 a0 00 00 00 5d c3 89 fe 48 c7 c7 38 89 ca 9d e8 e5 56 08 00 <0f> ff 5d c3 0f 1f 44 00 00 8b 05 52 9e 37 02 85 c0 75 38 55 48 ---[ end trace 26df9e5df4bba4ac ]--- This splat cannot be generated by expedited grace periods because they always invoke resched_cpu() on the current CPU, which is good because expedited grace periods require that resched_cpu() unconditionally succeed. However, other parts of RCU can tolerate resched_cpu() acting as a no-op, at least as long as it doesn't happen too often. This commit therefore makes resched_cpu() invoke resched_curr() only if the CPU is either online or is the current CPU. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-19sched: Stop switched_to_rt() from sending IPIs to offline CPUsPaul E. McKenney
[ Upstream commit 2fe2582649aa2355f79acddb86bd4d6c5363eb63 ] The rcutorture test suite occasionally provokes a splat due to invoking rt_mutex_lock() which needs to boost the priority of a task currently sitting on a runqueue that belongs to an offline CPU: WARNING: CPU: 0 PID: 12 at /home/paulmck/public_git/linux-rcu/arch/x86/kernel/smp.c:128 native_smp_send_reschedule+0x37/0x40 Modules linked in: CPU: 0 PID: 12 Comm: rcub/7 Not tainted 4.14.0-rc4+ #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014 task: ffff9ed3de5f8cc0 task.stack: ffffbbf80012c000 RIP: 0010:native_smp_send_reschedule+0x37/0x40 RSP: 0018:ffffbbf80012fd10 EFLAGS: 00010082 RAX: 000000000000002f RBX: ffff9ed3dd9cb300 RCX: 0000000000000004 RDX: 0000000080000004 RSI: 0000000000000086 RDI: 00000000ffffffff RBP: ffffbbf80012fd10 R08: 000000000009da7a R09: 0000000000007b9d R10: 0000000000000001 R11: ffffffffbb57c2cd R12: 000000000000000d R13: ffff9ed3de5f8cc0 R14: 0000000000000061 R15: ffff9ed3ded59200 FS: 0000000000000000(0000) GS:ffff9ed3dea00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00000000080686f0 CR3: 000000001b9e0000 CR4: 00000000000006f0 Call Trace: resched_curr+0x61/0xd0 switched_to_rt+0x8f/0xa0 rt_mutex_setprio+0x25c/0x410 task_blocks_on_rt_mutex+0x1b3/0x1f0 rt_mutex_slowlock+0xa9/0x1e0 rt_mutex_lock+0x29/0x30 rcu_boost_kthread+0x127/0x3c0 kthread+0x104/0x140 ? rcu_report_unblock_qs_rnp+0x90/0x90 ? kthread_create_on_node+0x40/0x40 ret_from_fork+0x22/0x30 Code: f0 00 0f 92 c0 84 c0 74 14 48 8b 05 34 74 c5 00 be fd 00 00 00 ff 90 a0 00 00 00 5d c3 89 fe 48 c7 c7 a0 c6 fc b9 e8 d5 b5 06 00 <0f> ff 5d c3 0f 1f 44 00 00 8b 05 a2 d1 13 02 85 c0 75 38 55 48 But the target task's priority has already been adjusted, so the only purpose of switched_to_rt() invoking resched_curr() is to wake up the CPU running some task that needs to be preempted by the boosted task. But the CPU is offline, which presumably means that the task must be migrated to some other CPU, and that this other CPU will undertake any needed preemption at the time of migration. Because the runqueue lock is held when resched_curr() is invoked, we know that the boosted task cannot go anywhere, so it is not necessary to invoke resched_curr() in this particular case. This commit therefore makes switched_to_rt() refrain from invoking resched_curr() when the target CPU is offline. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-15workqueue: Allow retrieval of current task's work structLukas Wunner
commit 27d4ee03078aba88c5e07dcc4917e8d01d046f38 upstream. Introduce a helper to retrieve the current task's work struct if it is a workqueue worker. This allows us to fix a long-standing deadlock in several DRM drivers wherein the ->runtime_suspend callback waits for a specific worker to finish and that worker in turn calls a function which waits for runtime suspend to finish. That function is invoked from multiple call sites and waiting for runtime suspend to finish is the correct thing to do except if it's executing in the context of the worker. Cc: Lai Jiangshan <jiangshanlai@gmail.com> Cc: Dave Airlie <airlied@redhat.com> Cc: Ben Skeggs <bskeggs@redhat.com> Cc: Alex Deucher <alexander.deucher@amd.com> Acked-by: Tejun Heo <tj@kernel.org> Reviewed-by: Lyude Paul <lyude@redhat.com> Signed-off-by: Lukas Wunner <lukas@wunner.de> Link: https://patchwork.freedesktop.org/patch/msgid/2d8f603074131eb87e588d2b803a71765bd3a2fd.1518338788.git.lukas@wunner.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-11bpf: allow xadd only on aligned memoryDaniel Borkmann
[ upstream commit ca36960211eb228bcbc7aaebfa0d027368a94c60 ] The requirements around atomic_add() / atomic64_add() resp. their JIT implementations differ across architectures. E.g. while x86_64 seems just fine with BPF's xadd on unaligned memory, on arm64 it triggers via interpreter but also JIT the following crash: [ 830.864985] Unable to handle kernel paging request at virtual address ffff8097d7ed6703 [...] [ 830.916161] Internal error: Oops: 96000021 [#1] SMP [ 830.984755] CPU: 37 PID: 2788 Comm: test_verifier Not tainted 4.16.0-rc2+ #8 [ 830.991790] Hardware name: Huawei TaiShan 2280 /BC11SPCD, BIOS 1.29 07/17/2017 [ 830.998998] pstate: 80400005 (Nzcv daif +PAN -UAO) [ 831.003793] pc : __ll_sc_atomic_add+0x4/0x18 [ 831.008055] lr : ___bpf_prog_run+0x1198/0x1588 [ 831.012485] sp : ffff00001ccabc20 [ 831.015786] x29: ffff00001ccabc20 x28: ffff8017d56a0f00 [ 831.021087] x27: 0000000000000001 x26: 0000000000000000 [ 831.026387] x25: 000000c168d9db98 x24: 0000000000000000 [ 831.031686] x23: ffff000008203878 x22: ffff000009488000 [ 831.036986] x21: ffff000008b14e28 x20: ffff00001ccabcb0 [ 831.042286] x19: ffff0000097b5080 x18: 0000000000000a03 [ 831.047585] x17: 0000000000000000 x16: 0000000000000000 [ 831.052885] x15: 0000ffffaeca8000 x14: 0000000000000000 [ 831.058184] x13: 0000000000000000 x12: 0000000000000000 [ 831.063484] x11: 0000000000000001 x10: 0000000000000000 [ 831.068783] x9 : 0000000000000000 x8 : 0000000000000000 [ 831.074083] x7 : 0000000000000000 x6 : 000580d428000000 [ 831.079383] x5 : 0000000000000018 x4 : 0000000000000000 [ 831.084682] x3 : ffff00001ccabcb0 x2 : 0000000000000001 [ 831.089982] x1 : ffff8097d7ed6703 x0 : 0000000000000001 [ 831.095282] Process test_verifier (pid: 2788, stack limit = 0x0000000018370044) [ 831.102577] Call trace: [ 831.105012] __ll_sc_atomic_add+0x4/0x18 [ 831.108923] __bpf_prog_run32+0x4c/0x70 [ 831.112748] bpf_test_run+0x78/0xf8 [ 831.116224] bpf_prog_test_run_xdp+0xb4/0x120 [ 831.120567] SyS_bpf+0x77c/0x1110 [ 831.123873] el0_svc_naked+0x30/0x34 [ 831.127437] Code: 97fffe97 17ffffec 00000000 f9800031 (885f7c31) Reason for this is because memory is required to be aligned. In case of BPF, we always enforce alignment in terms of stack access, but not when accessing map values or packet data when the underlying arch (e.g. arm64) has CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS set. xadd on packet data that is local to us anyway is just wrong, so forbid this case entirely. The only place where xadd makes sense in fact are map values; xadd on stack is wrong as well, but it's been around for much longer. Specifically enforce strict alignment in case of xadd, so that we handle this case generically and avoid such crashes in the first place. Fixes: 17a5267067f3 ("bpf: verifier (add verifier core)") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-11bpf: add schedule points in percpu arrays managementEric Dumazet
[ upstream commit 32fff239de37ef226d5b66329dd133f64d63b22d ] syszbot managed to trigger RCU detected stalls in bpf_array_free_percpu() It takes time to allocate a huge percpu map, but even more time to free it. Since we run in process context, use cond_resched() to yield cpu if needed. Fixes: a10423b87a7e ("bpf: introduce BPF_MAP_TYPE_PERCPU_ARRAY map") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-11bpf: fix rcu lockdep warning for lpm_trie map_free callbackYonghong Song
[ upstream commit 6c5f61023c5b0edb0c8a64c902fe97c6453b1852 ] Commit 9a3efb6b661f ("bpf: fix memory leak in lpm_trie map_free callback function") fixed a memory leak and removed unnecessary locks in map_free callback function. Unfortrunately, it introduced a lockdep warning. When lockdep checking is turned on, running tools/testing/selftests/bpf/test_lpm_map will have: [ 98.294321] ============================= [ 98.294807] WARNING: suspicious RCU usage [ 98.295359] 4.16.0-rc2+ #193 Not tainted [ 98.295907] ----------------------------- [ 98.296486] /home/yhs/work/bpf/kernel/bpf/lpm_trie.c:572 suspicious rcu_dereference_check() usage! [ 98.297657] [ 98.297657] other info that might help us debug this: [ 98.297657] [ 98.298663] [ 98.298663] rcu_scheduler_active = 2, debug_locks = 1 [ 98.299536] 2 locks held by kworker/2:1/54: [ 98.300152] #0: ((wq_completion)"events"){+.+.}, at: [<00000000196bc1f0>] process_one_work+0x157/0x5c0 [ 98.301381] #1: ((work_completion)(&map->work)){+.+.}, at: [<00000000196bc1f0>] process_one_work+0x157/0x5c0 Since actual trie tree removal happens only after no other accesses to the tree are possible, replacing rcu_dereference_protected(*slot, lockdep_is_held(&trie->lock)) with rcu_dereference_protected(*slot, 1) fixed the issue. Fixes: 9a3efb6b661f ("bpf: fix memory leak in lpm_trie map_free callback function") Reported-by: Eric Dumazet <edumazet@google.com> Suggested-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Yonghong Song <yhs@fb.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>