summaryrefslogtreecommitdiffstats
path: root/kernel
AgeCommit message (Collapse)Author
2021-09-16Revert "posix-cpu-timers: Force next expiration recalc after itimer reset"Greg Kroah-Hartman
This reverts commit 564005805aadec9cb7e5dc4e14071b8f87cd6b58 which is commit 406dd42bd1ba0c01babf9cde169bb319e52f6147 upstream. It is reported to cause regressions. A proposed fix has been posted, but it is not in a released kernel yet. So just revert this from the stable release so that the bug is fixed. If it's really needed we can add it back in in a future release. Link: https://lore.kernel.org/r/87ilz1pwaq.fsf@wylie.me.uk Reported-by: "Alan J. Wylie" <alan@wylie.me.uk> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Frederic Weisbecker <frederic@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-09-15bpf: Fix possible out of bound write in narrow load handlingAndrey Ignatov
[ Upstream commit d7af7e497f0308bc97809cc48b58e8e0f13887e1 ] Fix a verifier bug found by smatch static checker in [0]. This problem has never been seen in prod to my best knowledge. Fixing it still seems to be a good idea since it's hard to say for sure whether it's possible or not to have a scenario where a combination of convert_ctx_access() and a narrow load would lead to an out of bound write. When narrow load is handled, one or two new instructions are added to insn_buf array, but before it was only checked that cnt >= ARRAY_SIZE(insn_buf) And it's safe to add a new instruction to insn_buf[cnt++] only once. The second try will lead to out of bound write. And this is what can happen if `shift` is set. Fix it by making sure that if the BPF_RSH instruction has to be added in addition to BPF_AND then there is enough space for two more instructions in insn_buf. The full report [0] is below: kernel/bpf/verifier.c:12304 convert_ctx_accesses() warn: offset 'cnt' incremented past end of array kernel/bpf/verifier.c:12311 convert_ctx_accesses() warn: offset 'cnt' incremented past end of array kernel/bpf/verifier.c 12282 12283 insn->off = off & ~(size_default - 1); 12284 insn->code = BPF_LDX | BPF_MEM | size_code; 12285 } 12286 12287 target_size = 0; 12288 cnt = convert_ctx_access(type, insn, insn_buf, env->prog, 12289 &target_size); 12290 if (cnt == 0 || cnt >= ARRAY_SIZE(insn_buf) || ^^^^^^^^^^^^^^^^^^^^^^^^^^^ Bounds check. 12291 (ctx_field_size && !target_size)) { 12292 verbose(env, "bpf verifier is misconfigured\n"); 12293 return -EINVAL; 12294 } 12295 12296 if (is_narrower_load && size < target_size) { 12297 u8 shift = bpf_ctx_narrow_access_offset( 12298 off, size, size_default) * 8; 12299 if (ctx_field_size <= 4) { 12300 if (shift) 12301 insn_buf[cnt++] = BPF_ALU32_IMM(BPF_RSH, ^^^^^ increment beyond end of array 12302 insn->dst_reg, 12303 shift); --> 12304 insn_buf[cnt++] = BPF_ALU32_IMM(BPF_AND, insn->dst_reg, ^^^^^ out of bounds write 12305 (1 << size * 8) - 1); 12306 } else { 12307 if (shift) 12308 insn_buf[cnt++] = BPF_ALU64_IMM(BPF_RSH, 12309 insn->dst_reg, 12310 shift); 12311 insn_buf[cnt++] = BPF_ALU64_IMM(BPF_AND, insn->dst_reg, ^^^^^^^^^^^^^^^ Same. 12312 (1ULL << size * 8) - 1); 12313 } 12314 } 12315 12316 new_prog = bpf_patch_insn_data(env, i + delta, insn_buf, cnt); 12317 if (!new_prog) 12318 return -ENOMEM; 12319 12320 delta += cnt - 1; 12321 12322 /* keep walking new program and skip insns we just inserted */ 12323 env->prog = new_prog; 12324 insn = new_prog->insnsi + i + delta; 12325 } 12326 12327 return 0; 12328 } [0] https://lore.kernel.org/bpf/20210817050843.GA21456@kili/ v1->v2: - clarify that problem was only seen by static checker but not in prod; Fixes: 46f53a65d2de ("bpf: Allow narrow loads with offset > 0") Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Andrey Ignatov <rdna@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20210820163935.1902398-1-rdna@fb.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-09-15PM: cpu: Make notifier chain use a raw_spinlock_tValentin Schneider
[ Upstream commit b2f6662ac08d0e7c25574ce53623c71bdae9dd78 ] Invoking atomic_notifier_chain_notify() requires acquiring a spinlock_t, which can block under CONFIG_PREEMPT_RT. Notifications for members of the cpu_pm notification chain will be issued by the idle task, which can never block. Making *all* atomic_notifiers use a raw_spinlock is too big of a hammer, as only notifications issued by the idle task are problematic. Special-case cpu_pm_notifier_chain by kludging a raw_notifier and raw_spinlock_t together, matching the atomic_notifier behavior with a raw_spinlock_t. Fixes: 70d932985757 ("notifier: Fix broken error handling pattern") Signed-off-by: Valentin Schneider <valentin.schneider@arm.com> Acked-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-09-15cgroup/cpuset: Fix violation of cpuset locking ruleWaiman Long
[ Upstream commit 6ba34d3c73674e46d9e126e4f0cee79e5ef2481c ] The cpuset fields that manage partition root state do not strictly follow the cpuset locking rule that update to cpuset has to be done with both the callback_lock and cpuset_mutex held. This is now fixed by making sure that the locking rule is upheld. Fixes: 3881b86128d0 ("cpuset: Add an error state to cpuset.sched.partition") Fixes: 4b842da276a8 ("cpuset: Make CPU hotplug work with partition") Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-09-15cgroup/cpuset: Miscellaneous code cleanupWaiman Long
[ Upstream commit 0f3adb8a1e5f36e792598c1d77a2cfac9c90a4f9 ] Use more descriptive variable names for update_prstate(), remove unnecessary code and fix some typos. There is no functional change. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-09-15PM: EM: Increase energy calculation precisionLukasz Luba
[ Upstream commit 7fcc17d0cb12938d2b3507973a6f93fc9ed2c7a1 ] The Energy Model (EM) provides useful information about device power in each performance state to other subsystems like: Energy Aware Scheduler (EAS). The energy calculation in EAS does arithmetic operation based on the EM em_cpu_energy(). Current implementation of that function uses em_perf_state::cost as a pre-computed cost coefficient equal to: cost = power * max_frequency / frequency. The 'power' is expressed in milli-Watts (or in abstract scale). There are corner cases when the EAS energy calculation for two Performance Domains (PDs) return the same value. The EAS compares these values to choose smaller one. It might happen that this values are equal due to rounding error. In such scenario, we need better resolution, e.g. 1000 times better. To provide this possibility increase the resolution in the em_perf_state::cost for 64-bit architectures. The cost of increasing resolution on 32-bit is pretty high (64-bit division) and is not justified since there are no new 32bit big.LITTLE EAS systems expected which would benefit from this higher resolution. This patch allows to avoid the rounding to milli-Watt errors, which might occur in EAS energy estimation for each PD. The rounding error is common for small tasks which have small utilization value. There are two places in the code where it makes a difference: 1. In the find_energy_efficient_cpu() where we are searching for best_delta. We might suffer there when two PDs return the same result, like in the example below. Scenario: Low utilized system e.g. ~200 sum_util for PD0 and ~220 for PD1. There are quite a few small tasks ~10-15 util. These tasks would suffer for the rounding error. These utilization values are typical when running games on Android. One of our partners has reported 5..10mA less battery drain when running with increased resolution. Some details: We have two PDs: PD0 (big) and PD1 (little) Let's compare w/o patch set ('old') and w/ patch set ('new') We are comparing energy w/ task and w/o task placed in the PDs a) 'old' w/o patch set, PD0 task_util = 13 cost = 480 sum_util_w/o_task = 215 sum_util_w_task = 228 scale_cpu = 1024 energy_w/o_task = 480 * 215 / 1024 = 100.78 => 100 energy_w_task = 480 * 228 / 1024 = 106.87 => 106 energy_diff = 106 - 100 = 6 (this is equal to 'old' PD1's energy_diff in 'c)') b) 'new' w/ patch set, PD0 task_util = 13 cost = 480 * 1000 = 480000 sum_util_w/o_task = 215 sum_util_w_task = 228 energy_w/o_task = 480000 * 215 / 1024 = 100781 energy_w_task = 480000 * 228 / 1024 = 106875 energy_diff = 106875 - 100781 = 6094 (this is not equal to 'new' PD1's energy_diff in 'd)') c) 'old' w/o patch set, PD1 task_util = 13 cost = 160 sum_util_w/o_task = 283 sum_util_w_task = 293 scale_cpu = 355 energy_w/o_task = 160 * 283 / 355 = 127.55 => 127 energy_w_task = 160 * 296 / 355 = 133.41 => 133 energy_diff = 133 - 127 = 6 (this is equal to 'old' PD0's energy_diff in 'a)') d) 'new' w/ patch set, PD1 task_util = 13 cost = 160 * 1000 = 160000 sum_util_w/o_task = 283 sum_util_w_task = 293 scale_cpu = 355 energy_w/o_task = 160000 * 283 / 355 = 127549 energy_w_task = 160000 * 296 / 355 = 133408 energy_diff = 133408 - 127549 = 5859 (this is not equal to 'new' PD0's energy_diff in 'b)') 2. Difference in the 6% energy margin filter at the end of find_energy_efficient_cpu(). With this patch the margin comparison also has better resolution, so it's possible to have better task placement thanks to that. Fixes: 27871f7a8a341ef ("PM: Introduce an Energy Model management framework") Reported-by: CCJ Yeh <CCj.Yeh@mediatek.com> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Signed-off-by: Lukasz Luba <lukasz.luba@arm.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-09-15cgroup/cpuset: Fix a partition bug with hotplugWaiman Long
[ Upstream commit 15d428e6fe77fffc3f4fff923336036f5496ef17 ] In cpuset_hotplug_workfn(), the detection of whether the cpu list has been changed is done by comparing the effective cpus of the top cpuset with the cpu_active_mask. However, in the rare case that just all the CPUs in the subparts_cpus are offlined, the detection fails and the partition states are not updated correctly. Fix it by forcing the cpus_updated flag to true in this particular case. Fixes: 4b842da276a8 ("cpuset: Make CPU hotplug work with partition") Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-09-15bpf: Fix potential memleak and UAF in the verifier.He Fengqing
[ Upstream commit 75f0fc7b48ad45a2e5736bcf8de26c8872fe8695 ] In bpf_patch_insn_data(), we first use the bpf_patch_insn_single() to insert new instructions, then use adjust_insn_aux_data() to adjust insn_aux_data. If the old env->prog have no enough room for new inserted instructions, we use bpf_prog_realloc to construct new_prog and free the old env->prog. There have two errors here. First, if adjust_insn_aux_data() return ENOMEM, we should free the new_prog. Second, if adjust_insn_aux_data() return ENOMEM, bpf_patch_insn_data() will return NULL, and env->prog has been freed in bpf_prog_realloc, but we will use it in bpf_check(). So in this patch, we make the adjust_insn_aux_data() never fails. In bpf_patch_insn_data(), we first pre-malloc memory for the new insn_aux_data, then call bpf_patch_insn_single() to insert new instructions, at last call adjust_insn_aux_data() to adjust insn_aux_data. Fixes: 8041902dae52 ("bpf: adjust insn_aux_data when patching insns") Signed-off-by: He Fengqing <hefengqing@huawei.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Song Liu <songliubraving@fb.com> Link: https://lore.kernel.org/bpf/20210714101815.164322-1-hefengqing@huawei.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-09-15genirq/timings: Fix error return code in irq_timings_test_irqs()Zhen Lei
[ Upstream commit 290fdc4b7ef14e33d0e30058042b0e9bfd02b89b ] Return a negative error code from the error handling case instead of 0, as done elsewhere in this function. Fixes: f52da98d900e ("genirq/timings: Add selftest for irqs circular buffer") Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20210811093333.2376-1-thunder.leizhen@huawei.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-09-15rcu: Fix stall-warning deadlock due to non-release of rcu_node ->lockYanfei Xu
[ Upstream commit dc87740c8a6806bd2162bfb441770e4e53be5601 ] If rcu_print_task_stall() is invoked on an rcu_node structure that does not contain any tasks blocking the current grace period, it takes an early exit that fails to release that rcu_node structure's lock. This results in a self-deadlock, which is detected by lockdep. To reproduce this bug: tools/testing/selftests/rcutorture/bin/kvm.sh --allcpus --duration 3 --trust-make --configs "TREE03" --kconfig "CONFIG_PROVE_LOCKING=y" --bootargs "rcutorture.stall_cpu=30 rcutorture.stall_cpu_block=1 rcutorture.fwd_progress=0 rcutorture.test_boost=0" This will also result in other complaints, including RCU's scheduler hook complaining about blocking rather than preemption and an rcutorture writer stall. Only a partial RCU CPU stall warning message will be printed because of the self-deadlock. This commit therefore releases the lock on the rcu_print_task_stall() function's early exit path. Fixes: c583bcb8f5ed ("rcu: Don't invoke try_invoke_on_locked_down_task() with irqs disabled") Tested-by: Qais Yousef <qais.yousef@arm.com> Signed-off-by: Yanfei Xu <yanfei.xu@windriver.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-09-15rcu: Fix to include first blocked task in stall warningYanfei Xu
[ Upstream commit e6a901a44f76878ed1653626c9ff4cfc5a3f58f8 ] The for loop in rcu_print_task_stall() always omits ts[0], which points to the first task blocking the stalled grace period. This in turn fails to count this first task, which means that ndetected will be equal to zero when all CPUs have passed through their quiescent states and only one task is blocking the stalled grace period. This zero value for ndetected will in turn result in an incorrect "All QSes seen" message: rcu: INFO: rcu_preempt detected stalls on CPUs/tasks: rcu: Tasks blocked on level-1 rcu_node (CPUs 12-23): (detected by 15, t=6504 jiffies, g=164777, q=9011209) rcu: All QSes seen, last rcu_preempt kthread activity 1 (4295252379-4295252378), jiffies_till_next_fqs=1, root ->qsmask 0x2 BUG: sleeping function called from invalid context at include/linux/uaccess.h:156 in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 70613, name: msgstress04 INFO: lockdep is turned off. Preemption disabled at: [<ffff8000104031a4>] create_object.isra.0+0x204/0x4b0 CPU: 15 PID: 70613 Comm: msgstress04 Kdump: loaded Not tainted 5.12.2-yoctodev-standard #1 Hardware name: Marvell OcteonTX CN96XX board (DT) Call trace: dump_backtrace+0x0/0x2cc show_stack+0x24/0x30 dump_stack+0x110/0x188 ___might_sleep+0x214/0x2d0 __might_sleep+0x7c/0xe0 This commit therefore fixes the loop to include ts[0]. Fixes: c583bcb8f5ed ("rcu: Don't invoke try_invoke_on_locked_down_task() with irqs disabled") Tested-by: Qais Yousef <qais.yousef@arm.com> Signed-off-by: Yanfei Xu <yanfei.xu@windriver.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-09-15sched: Fix UCLAMP_FLAG_IDLE settingQuentin Perret
[ Upstream commit ca4984a7dd863f3e1c0df775ae3e744bff24c303 ] The UCLAMP_FLAG_IDLE flag is set on a runqueue when dequeueing the last uclamp active task (that is, when buckets.tasks reaches 0 for all buckets) to maintain the last uclamp.max and prevent blocked util from suddenly becoming visible. However, there is an asymmetry in how the flag is set and cleared which can lead to having the flag set whilst there are active tasks on the rq. Specifically, the flag is cleared in the uclamp_rq_inc() path, which is called at enqueue time, but set in uclamp_rq_dec_id() which is called both when dequeueing a task _and_ in the update_uclamp_active() path. As a result, when both uclamp_rq_{dec,ind}_id() are called from update_uclamp_active(), the flag ends up being set but not cleared, hence leaving the runqueue in a broken state. Fix this by clearing the flag in update_uclamp_active() as well. Fixes: e496187da710 ("sched/uclamp: Enforce last task's UCLAMP_MAX") Reported-by: Rick Yiu <rickyiu@google.com> Signed-off-by: Quentin Perret <qperret@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Qais Yousef <qais.yousef@arm.com> Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Link: https://lore.kernel.org/r/20210805102154.590709-2-qperret@google.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-09-15sched/numa: Fix is_core_idle()Mika Penttilä
[ Upstream commit 1c6829cfd3d5124b125e6df41158665aea413b35 ] Use the loop variable instead of the function argument to test the other SMT siblings for idle. Fixes: ff7db0bf24db ("sched/numa: Prefer using an idle CPU as a migration target instead of comparing tasks") Signed-off-by: Mika Penttilä <mika.penttila@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Pankaj Gupta <pankaj.gupta@ionos.com> Link: https://lkml.kernel.org/r/20210722063946.28951-1-mika.penttila@gmail.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-09-15sched/debug: Don't update sched_domain debug directories before ↵Valentin Schneider
sched_debug_init() [ Upstream commit 459b09b5a3254008b63382bf41a9b36d0b590f57 ] Since CPU capacity asymmetry can stem purely from maximum frequency differences (e.g. Pixel 1), a rebuild of the scheduler topology can be issued upon loading cpufreq, see: arch_topology.c::init_cpu_capacity_callback() Turns out that if this rebuild happens *before* sched_debug_init() is run (which is a late initcall), we end up messing up the sched_domain debug directory: passing a NULL parent to debugfs_create_dir() ends up creating the directory at the debugfs root, which in this case creates /sys/kernel/debug/domains (instead of /sys/kernel/debug/sched/domains). This currently doesn't happen on asymmetric systems which use cpufreq-scpi or cpufreq-dt drivers, as those are loaded via deferred_probe_initcall() (it is also a late initcall, but appears to be ordered *after* sched_debug_init()). Ionela has been working on detecting maximum frequency asymmetry via ACPI, and that actually happens via a *device* initcall, thus before sched_debug_init(), and causes the aforementionned debugfs mayhem. One option would be to punt sched_debug_init() down to fs_initcall_sync(). Preventing update_sched_domain_debugfs() from running before sched_debug_init() appears to be the safer option. Fixes: 3b87f136f8fc ("sched,debug: Convert sysctl sched_domains to debugfs") Signed-off-by: Valentin Schneider <valentin.schneider@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: http://lore.kernel.org/r/20210514095339.12979-1-ionela.voinescu@arm.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-09-15sched/topology: Skip updating masks for non-online nodesValentin Schneider
[ Upstream commit 0083242c93759dde353a963a90cb351c5c283379 ] The scheduler currently expects NUMA node distances to be stable from init onwards, and as a consequence builds the related data structures once-and-for-all at init (see sched_init_numa()). Unfortunately, on some architectures node distance is unreliable for offline nodes and may very well change upon onlining. Skip over offline nodes during sched_init_numa(). Track nodes that have been onlined at least once, and trigger a build of a node's NUMA masks when it is first onlined post-init. Reported-by: Geetika Moolchandani <Geetika.Moolchandani1@ibm.com> Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Valentin Schneider <valentin.schneider@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20210818074333.48645-1-srikar@linux.vnet.ibm.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-09-15hrtimer: Ensure timerfd notification for HIGHRES=nThomas Gleixner
[ Upstream commit 8c3b5e6ec0fee18bc2ce38d1dfe913413205f908 ] If high resolution timers are disabled the timerfd notification about a clock was set event is not happening for all cases which use clock_was_set_delayed() because that's a NOP for HIGHRES=n, which is wrong. Make clock_was_set_delayed() unconditially available to fix that. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20210713135158.196661266@linutronix.de Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-09-15hrtimer: Avoid double reprogramming in __hrtimer_start_range_ns()Thomas Gleixner
[ Upstream commit 627ef5ae2df8eeccb20d5af0e4cfa4df9e61ed28 ] If __hrtimer_start_range_ns() is invoked with an already armed hrtimer then the timer has to be canceled first and then added back. If the timer is the first expiring timer then on removal the clockevent device is reprogrammed to the next expiring timer to avoid that the pending expiry fires needlessly. If the new expiry time ends up to be the first expiry again then the clock event device has to reprogrammed again. Avoid this by checking whether the timer is the first to expire and in that case, keep the timer on the current CPU and delay the reprogramming up to the point where the timer has been enqueued again. Reported-by: Lorenzo Colitti <lorenzo@google.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20210713135157.873137732@linutronix.de Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-09-15posix-cpu-timers: Force next expiration recalc after itimer resetFrederic Weisbecker
[ Upstream commit 406dd42bd1ba0c01babf9cde169bb319e52f6147 ] When an itimer deactivates a previously armed expiration, it simply doesn't do anything. As a result the process wide cputime counter keeps running and the tick dependency stays set until it reaches the old ghost expiration value. This can be reproduced with the following snippet: void trigger_process_counter(void) { struct itimerval n = {}; n.it_value.tv_sec = 100; setitimer(ITIMER_VIRTUAL, &n, NULL); n.it_value.tv_sec = 0; setitimer(ITIMER_VIRTUAL, &n, NULL); } Fix this with resetting the relevant base expiration. This is similar to disarming a timer. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20210726125513.271824-4-frederic@kernel.org Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-09-15rcu/tree: Handle VM stoppage in stall detectionSergey Senozhatsky
[ Upstream commit ccfc9dd6914feaa9a81f10f9cce56eb0f7712264 ] The soft watchdog timer function checks if a virtual machine was suspended and hence what looks like a lockup in fact is a false positive. This is what kvm_check_and_clear_guest_paused() does: it tests guest PVCLOCK_GUEST_STOPPED (which is set by the host) and if it's set then we need to touch all watchdogs and bail out. Watchdog timer function runs from IRQ, so PVCLOCK_GUEST_STOPPED check works fine. There is, however, one more watchdog that runs from IRQ, so watchdog timer fn races with it, and that watchdog is not aware of PVCLOCK_GUEST_STOPPED - RCU stall detector. apic_timer_interrupt() smp_apic_timer_interrupt() hrtimer_interrupt() __hrtimer_run_queues() tick_sched_timer() tick_sched_handle() update_process_times() rcu_sched_clock_irq() This triggers RCU stalls on our devices during VM resume. If tick_sched_handle()->rcu_sched_clock_irq() runs on a VCPU before watchdog_timer_fn()->kvm_check_and_clear_guest_paused() then there is nothing on this VCPU that touches watchdogs and RCU reads stale gp stall timestamp and new jiffies value, which makes it think that RCU has stalled. Make RCU stall watchdog aware of PVCLOCK_GUEST_STOPPED and don't report RCU stalls when we resume the VM. Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Signed-off-by: Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-09-15sched/deadline: Fix missing clock update in migrate_task_rq_dl()Dietmar Eggemann
[ Upstream commit b4da13aa28d4fd0071247b7b41c579ee8a86c81a ] A missing clock update is causing the following warning: rq->clock_update_flags < RQCF_ACT_SKIP WARNING: CPU: 112 PID: 2041 at kernel/sched/sched.h:1453 sub_running_bw.isra.0+0x190/0x1a0 ... CPU: 112 PID: 2041 Comm: sugov:112 Tainted: G W 5.14.0-rc1 #1 Hardware name: WIWYNN Mt.Jade Server System B81.030Z1.0007/Mt.Jade Motherboard, BIOS 1.6.20210526 (SCP: 1.06.20210526) 2021/05/26 ... Call trace: sub_running_bw.isra.0+0x190/0x1a0 migrate_task_rq_dl+0xf8/0x1e0 set_task_cpu+0xa8/0x1f0 try_to_wake_up+0x150/0x3d4 wake_up_q+0x64/0xc0 __up_write+0xd0/0x1c0 up_write+0x4c/0x2b0 cppc_set_perf+0x120/0x2d0 cppc_cpufreq_set_target+0xe0/0x1a4 [cppc_cpufreq] __cpufreq_driver_target+0x74/0x140 sugov_work+0x64/0x80 kthread_worker_fn+0xe0/0x230 kthread+0x138/0x140 ret_from_fork+0x10/0x18 The task causing this is the `cppc_fie` DL task introduced by commit 1eb5dde674f5 ("cpufreq: CPPC: Add support for frequency invariance"). With CONFIG_ACPI_CPPC_CPUFREQ_FIE=y and schedutil cpufreq governor on slow-switching system (like on this Ampere Altra WIWYNN Mt. Jade Arm Server): DL task `curr=sugov:112` lets `p=cppc_fie` migrate and since the latter is in `non_contending` state, migrate_task_rq_dl() calls sub_running_bw()->__sub_running_bw()->cpufreq_update_util()-> rq_clock()->assert_clock_updated() on p. Fix this by updating the clock for a non_contending task in migrate_task_rq_dl() before calling sub_running_bw(). Reported-by: Bruno Goncalves <bgoncalv@redhat.com> Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Daniel Bristot de Oliveira <bristot@kernel.org> Acked-by: Juri Lelli <juri.lelli@redhat.com> Link: https://lore.kernel.org/r/20210804135925.3734605-1-dietmar.eggemann@arm.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-09-15sched/deadline: Fix reset_on_fork reporting of DL tasksQuentin Perret
[ Upstream commit f95091536f78971b269ec321b057b8d630b0ad8a ] It is possible for sched_getattr() to incorrectly report the state of the reset_on_fork flag when called on a deadline task. Indeed, if the flag was set on a deadline task using sched_setattr() with flags (SCHED_FLAG_RESET_ON_FORK | SCHED_FLAG_KEEP_PARAMS), then p->sched_reset_on_fork will be set, but __setscheduler() will bail out early, which means that the dl_se->flags will not get updated by __setscheduler_params()->__setparam_dl(). Consequently, if sched_getattr() is then called on the task, __getparam_dl() will override kattr.sched_flags with the now out-of-date copy in dl_se->flags and report the stale value to userspace. To fix this, make sure to only copy the flags that are relevant to sched_deadline to and from the dl_se->flags field. Signed-off-by: Quentin Perret <qperret@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20210727101103.2729607-2-qperret@google.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-09-15locking/mutex: Fix HANDOFF conditionPeter Zijlstra
[ Upstream commit 048661a1f963e9517630f080687d48af79ed784c ] Yanfei reported that setting HANDOFF should not depend on recomputing @first, only on @first state. Which would then give: if (ww_ctx || !first) first = __mutex_waiter_is_first(lock, &waiter); if (first) __mutex_set_flag(lock, MUTEX_FLAG_HANDOFF); But because 'ww_ctx || !first' is basically 'always' and the test for first is relatively cheap, omit that first branch entirely. Reported-by: Yanfei Xu <yanfei.xu@windriver.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Waiman Long <longman@redhat.com> Reviewed-by: Yanfei Xu <yanfei.xu@windriver.com> Link: https://lore.kernel.org/r/20210630154114.896786297@infradead.org Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-09-03audit: move put_tree() to avoid trim_trees refcount underflow and UAFRichard Guy Briggs
commit 67d69e9d1a6c889d98951c1d74b19332ce0565af upstream. AUDIT_TRIM is expected to be idempotent, but multiple executions resulted in a refcount underflow and use-after-free. git bisect fingered commit fb041bb7c0a9 ("locking/refcount: Consolidate implementations of refcount_t") but this patch with its more thorough checking that wasn't in the x86 assembly code merely exposed a previously existing tree refcount imbalance in the case of tree trimming code that was refactored with prune_one() to remove a tree introduced in commit 8432c7006297 ("audit: Simplify locking around untag_chunk()") Move the put_tree() to cover only the prune_one() case. Passes audit-testsuite and 3 passes of "auditctl -t" with at least one directory watch. Cc: Jan Kara <jack@suse.cz> Cc: Will Deacon <will@kernel.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Seiji Nishikawa <snishika@redhat.com> Cc: stable@vger.kernel.org Fixes: 8432c7006297 ("audit: Simplify locking around untag_chunk()") Signed-off-by: Richard Guy Briggs <rgb@redhat.com> Reviewed-by: Jan Kara <jack@suse.cz> [PM: reformatted/cleaned-up the commit description] Signed-off-by: Paul Moore <paul@paul-moore.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-08-29Merge tag 'sched_urgent_for_v5.14' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fixes from Borislav Petkov: - Have get_push_task() check whether current has migration disabled and thus avoid useless invocations of the migration thread - Rework initialization flow so that all rq->core's are initialized, even of CPUs which have not been onlined yet, so that iterating over them all works as expected * tag 'sched_urgent_for_v5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched: Fix get_push_task() vs migrate_disable() sched: Fix Core-wide rq->lock for uninitialized CPUs
2021-08-26Merge tag 'net-5.14-rc8' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Jakub Kicinski: "Networking fixes, including fixes from can and bpf. Closing three hw-dependent regressions. Any fixes of note are in the 'old code' category. Nothing blocking release from our perspective. Current release - regressions: - stmmac: revert "stmmac: align RX buffers" - usb: asix: ax88772: move embedded PHY detection as early as possible - usb: asix: do not call phy_disconnect() for ax88178 - Revert "net: really fix the build...", from Kalle to fix QCA6390 Current release - new code bugs: - phy: mediatek: add the missing suspend/resume callbacks Previous releases - regressions: - qrtr: fix another OOB Read in qrtr_endpoint_post - stmmac: dwmac-rk: fix unbalanced pm_runtime_enable warnings Previous releases - always broken: - inet: use siphash in exception handling - ip_gre: add validation for csum_start - bpf: fix ringbuf helper function compatibility - rtnetlink: return correct error on changing device netns - e1000e: do not try to recover the NVM checksum on Tiger Lake" * tag 'net-5.14-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (43 commits) Revert "net: really fix the build..." net: hns3: fix get wrong pfc_en when query PFC configuration net: hns3: fix GRO configuration error after reset net: hns3: change the method of getting cmd index in debugfs net: hns3: fix duplicate node in VLAN list net: hns3: fix speed unknown issue in bond 4 net: hns3: add waiting time before cmdq memory is released net: hns3: clear hardware resource when loading driver net: fix NULL pointer reference in cipso_v4_doi_free rtnetlink: Return correct error on changing device netns net: dsa: hellcreek: Adjust schedule look ahead window net: dsa: hellcreek: Fix incorrect setting of GCL cxgb4: dont touch blocked freelist bitmap after free ipv4: use siphash instead of Jenkins in fnhe_hashfun() ipv6: use siphash in rt6_exception_hash() can: usb: esd_usb2: esd_usb2_rx_event(): fix the interchange of the CAN RX and TX error counters net: usb: asix: ax88772: fix boolconv.cocci warnings net/sched: ets: fix crash when flipping from 'strict' to 'quantum' qede: Fix memset corruption net: stmmac: fix kernel panic due to NULL pointer dereference of buf->xdp ...
2021-08-26sched: Fix get_push_task() vs migrate_disable()Sebastian Andrzej Siewior
push_rt_task() attempts to move the currently running task away if the next runnable task has migration disabled and therefore is pinned on the current CPU. The current task is retrieved via get_push_task() which only checks for nr_cpus_allowed == 1, but does not check whether the task has migration disabled and therefore cannot be moved either. The consequence is a pointless invocation of the migration thread which correctly observes that the task cannot be moved. Return NULL if the task has migration disabled and cannot be moved to another CPU. Fixes: a7c81556ec4d3 ("sched: Fix migrate_disable() vs rt/dl balancing") Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20210826133738.yiotqbtdaxzjsnfj@linutronix.de
2021-08-25Merge branch 'for-v5.14' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace Pull ucount fixes from Eric Biederman: "This branch fixes a regression that made it impossible to increase rlimits that had been converted to the ucount infrastructure, and also fixes a reference counting bug where the reference was not incremented soon enough. The fixes are trivial and the bugs have been encountered in the wild, and the fixes have been tested" * 'for-v5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: ucounts: Increase ucounts reference counter before the security hook ucounts: Fix regression preventing increasing of rlimits in init_user_ns
2021-08-23ucounts: Increase ucounts reference counter before the security hookAlexey Gladkov
We need to increment the ucounts reference counter befor security_prepare_creds() because this function may fail and abort_creds() will try to decrement this reference. [ 96.465056][ T8641] FAULT_INJECTION: forcing a failure. [ 96.465056][ T8641] name fail_page_alloc, interval 1, probability 0, space 0, times 0 [ 96.478453][ T8641] CPU: 1 PID: 8641 Comm: syz-executor668 Not tainted 5.14.0-rc6-syzkaller #0 [ 96.487215][ T8641] Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 [ 96.497254][ T8641] Call Trace: [ 96.500517][ T8641] dump_stack_lvl+0x1d3/0x29f [ 96.505758][ T8641] ? show_regs_print_info+0x12/0x12 [ 96.510944][ T8641] ? log_buf_vmcoreinfo_setup+0x498/0x498 [ 96.516652][ T8641] should_fail+0x384/0x4b0 [ 96.521141][ T8641] prepare_alloc_pages+0x1d1/0x5a0 [ 96.526236][ T8641] __alloc_pages+0x14d/0x5f0 [ 96.530808][ T8641] ? __rmqueue_pcplist+0x2030/0x2030 [ 96.536073][ T8641] ? lockdep_hardirqs_on_prepare+0x3e2/0x750 [ 96.542056][ T8641] ? alloc_pages+0x3f3/0x500 [ 96.546635][ T8641] allocate_slab+0xf1/0x540 [ 96.551120][ T8641] ___slab_alloc+0x1cf/0x350 [ 96.555689][ T8641] ? kzalloc+0x1d/0x30 [ 96.559740][ T8641] __kmalloc+0x2e7/0x390 [ 96.563980][ T8641] ? kzalloc+0x1d/0x30 [ 96.568029][ T8641] kzalloc+0x1d/0x30 [ 96.571903][ T8641] security_prepare_creds+0x46/0x220 [ 96.577174][ T8641] prepare_creds+0x411/0x640 [ 96.581747][ T8641] __sys_setfsuid+0xe2/0x3a0 [ 96.586333][ T8641] do_syscall_64+0x3d/0xb0 [ 96.590739][ T8641] entry_SYSCALL_64_after_hwframe+0x44/0xae [ 96.596611][ T8641] RIP: 0033:0x445a69 [ 96.600483][ T8641] Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 11 15 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48 [ 96.620152][ T8641] RSP: 002b:00007f1054173318 EFLAGS: 00000246 ORIG_RAX: 000000000000007a [ 96.628543][ T8641] RAX: ffffffffffffffda RBX: 00000000004ca4c8 RCX: 0000000000445a69 [ 96.636600][ T8641] RDX: 0000000000000010 RSI: 00007f10541732f0 RDI: 0000000000000000 [ 96.644550][ T8641] RBP: 00000000004ca4c0 R08: 0000000000000001 R09: 0000000000000000 [ 96.652500][ T8641] R10: 0000000000000000 R11: 0000000000000246 R12: 00000000004ca4cc [ 96.660631][ T8641] R13: 00007fffffe0b62f R14: 00007f1054173400 R15: 0000000000022000 Fixes: 905ae01c4ae2 ("Add a reference to ucounts for each cred") Reported-by: syzbot+01985d7909f9468f013c@syzkaller.appspotmail.com Signed-off-by: Alexey Gladkov <legion@kernel.org> Link: https://lkml.kernel.org/r/97433b1742c3331f02ad92de5a4f07d673c90613.1629735352.git.legion@kernel.org Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2021-08-23ucounts: Fix regression preventing increasing of rlimits in init_user_nsEric W. Biederman
"Ma, XinjianX" <xinjianx.ma@intel.com> reported: > When lkp team run kernel selftests, we found after these series of patches, testcase mqueue: mq_perf_tests > in kselftest failed with following message. > > # selftests: mqueue: mq_perf_tests > # > # Initial system state: > # Using queue path: /mq_perf_tests > # RLIMIT_MSGQUEUE(soft): 819200 > # RLIMIT_MSGQUEUE(hard): 819200 > # Maximum Message Size: 8192 > # Maximum Queue Size: 10 > # Nice value: 0 > # > # Adjusted system state for testing: > # RLIMIT_MSGQUEUE(soft): (unlimited) > # RLIMIT_MSGQUEUE(hard): (unlimited) > # Maximum Message Size: 16777216 > # Maximum Queue Size: 65530 > # Nice value: -20 > # Continuous mode: (disabled) > # CPUs to pin: 3 > # ./mq_perf_tests: mq_open() at 296: Too many open files > not ok 2 selftests: mqueue: mq_perf_tests # exit=1 > ``` > > Test env: > rootfs: debian-10 > gcc version: 9 After investigation the problem turned out to be that ucount_max for the rlimits in init_user_ns was being set to the initial rlimit value. The practical problem is that ucount_max provides a limit that applications inside the user namespace can not exceed. Which means in practice that rlimits that have been converted to use the ucount infrastructure were not able to exceend their initial rlimits. Solve this by setting the relevant values of ucount_max to RLIM_INIFINITY. A limit in init_user_ns is pointless so the code should allow the values to grow as large as possible without riscking an underflow or an overflow. As the ltp test case was a bit of a pain I have reproduced the rlimit failure and tested the fix with the following little C program: > #include <stdio.h> > #include <fcntl.h> > #include <sys/stat.h> > #include <mqueue.h> > #include <sys/time.h> > #include <sys/resource.h> > #include <errno.h> > #include <string.h> > #include <stdlib.h> > #include <limits.h> > #include <unistd.h> > > int main(int argc, char **argv) > { > struct mq_attr mq_attr; > struct rlimit rlim; > mqd_t mqd; > int ret; > > ret = getrlimit(RLIMIT_MSGQUEUE, &rlim); > if (ret != 0) { > fprintf(stderr, "getrlimit(RLIMIT_MSGQUEUE) failed: %s\n", strerror(errno)); > exit(EXIT_FAILURE); > } > printf("RLIMIT_MSGQUEUE %lu %lu\n", > rlim.rlim_cur, rlim.rlim_max); > rlim.rlim_cur = RLIM_INFINITY; > rlim.rlim_max = RLIM_INFINITY; > ret = setrlimit(RLIMIT_MSGQUEUE, &rlim); > if (ret != 0) { > fprintf(stderr, "setrlimit(RLIMIT_MSGQUEUE, RLIM_INFINITY) failed: %s\n", strerror(errno)); > exit(EXIT_FAILURE); > } > > memset(&mq_attr, 0, sizeof(struct mq_attr)); > mq_attr.mq_maxmsg = 65536 - 1; > mq_attr.mq_msgsize = 16*1024*1024 - 1; > > mqd = mq_open("/mq_rlimit_test", O_RDONLY|O_CREAT, 0600, &mq_attr); > if (mqd == (mqd_t)-1) { > fprintf(stderr, "mq_open failed: %s\n", strerror(errno)); > exit(EXIT_FAILURE); > } > ret = mq_close(mqd); > if (ret) { > fprintf(stderr, "mq_close failed; %s\n", strerror(errno)); > exit(EXIT_FAILURE); > } > > return EXIT_SUCCESS; > } Fixes: 6e52a9f0532f ("Reimplement RLIMIT_MSGQUEUE on top of ucounts") Fixes: d7c9e99aee48 ("Reimplement RLIMIT_MEMLOCK on top of ucounts") Fixes: d64696905554 ("Reimplement RLIMIT_SIGPENDING on top of ucounts") Fixes: 21d1c5e386bc ("Reimplement RLIMIT_NPROC on top of ucounts") Reported-by: kernel test robot lkp@intel.com Acked-by: Alexey Gladkov <legion@kernel.org> Link: https://lkml.kernel.org/r/87eeajswfc.fsf_-_@disp2133 Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2021-08-23bpf: Fix ringbuf helper function compatibilityDaniel Borkmann
Commit 457f44363a88 ("bpf: Implement BPF ring buffer and verifier support for it") extended check_map_func_compatibility() by enforcing map -> helper function match, but not helper -> map type match. Due to this all of the bpf_ringbuf_*() helper functions could be used with a wrong map type such as array or hash map, leading to invalid access due to type confusion. Also, both BPF_FUNC_ringbuf_{submit,discard} have ARG_PTR_TO_ALLOC_MEM as argument and not a BPF map. Therefore, their check_map_func_compatibility() presence is incorrect since it's only for map type checking. Fixes: 457f44363a88 ("bpf: Implement BPF ring buffer and verifier support for it") Reported-by: Ryota Shiga (Flatt Security) Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org>
2021-08-20sched: Fix Core-wide rq->lock for uninitialized CPUsPeter Zijlstra
Eugene tripped over the case where rq_lock(), as called in a for_each_possible_cpu() loop came apart because rq->core hadn't been setup yet. This is a somewhat unusual, but valid case. Rework things such that rq->core is initialized to point at itself. IOW initialize each CPU as a single threaded Core. CPU online will then join the new CPU (thread) to an existing Core where needed. For completeness sake, have CPU offline fully undo the state so as to not presume the topology will match the next time it comes online. Fixes: 9edeaea1bc45 ("sched: Core-wide rq->lock") Reported-by: Eugene Syromiatnikov <esyr@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Josh Don <joshdon@google.com> Tested-by: Eugene Syromiatnikov <esyr@redhat.com> Link: https://lkml.kernel.org/r/YR473ZGeKqMs6kw+@hirez.programming.kicks-ass.net
2021-08-19Merge tag 'net-5.14-rc7' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Jakub Kicinski: "Networking fixes, including fixes from bpf, wireless and mac80211 trees. Current release - regressions: - tipc: call tipc_wait_for_connect only when dlen is not 0 - mac80211: fix locking in ieee80211_restart_work() Current release - new code bugs: - bpf: add rcu_read_lock in bpf_get_current_[ancestor_]cgroup_id() - ethernet: ice: fix perout start time rounding - wwan: iosm: prevent underflow in ipc_chnl_cfg_get() Previous releases - regressions: - bpf: clear zext_dst of dead insns - sch_cake: fix srchost/dsthost hashing mode - vrf: reset skb conntrack connection on VRF rcv - net/rds: dma_map_sg is entitled to merge entries Previous releases - always broken: - ethernet: bnxt: fix Tx path locking and races, add Rx path barriers" * tag 'net-5.14-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (42 commits) net: dpaa2-switch: disable the control interface on error path Revert "flow_offload: action should not be NULL when it is referenced" iavf: Fix ping is lost after untrusted VF had tried to change MAC i40e: Fix ATR queue selection r8152: fix the maximum number of PLA bp for RTL8153C r8152: fix writing USB_BP2_EN mptcp: full fully established support after ADD_ADDR mptcp: fix memory leak on address flush net/rds: dma_map_sg is entitled to merge entries net: mscc: ocelot: allow forwarding from bridge ports to the tag_8021q CPU port net: asix: fix uninit value bugs ovs: clear skb->tstamp in forwarding path net: mdio-mux: Handle -EPROBE_DEFER correctly net: mdio-mux: Don't ignore memory allocation errors net: mdio-mux: Delete unnecessary devm_kfree net: dsa: sja1105: fix use-after-free after calling of_find_compatible_node, or worse sch_cake: fix srchost/dsthost hashing mode ixgbe, xsk: clean up the resources in ixgbe_xsk_pool_enable error path net: qlcnic: add missed unlock in qlcnic_83xx_flash_read32 mac80211: fix locking in ieee80211_restart_work() ...
2021-08-19Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfJakub Kicinski
Daniel Borkmann says: ==================== pull-request: bpf 2021-08-19 We've added 3 non-merge commits during the last 3 day(s) which contain a total of 3 files changed, 29 insertions(+), 6 deletions(-). The main changes are: 1) Fix to clear zext_dst for dead instructions which was causing invalid program rejections on JITs with bpf_jit_needs_zext such as s390x, from Ilya Leoshkevich. 2) Fix RCU splat in bpf_get_current_{ancestor_,}cgroup_id() helpers when they are invoked from sleepable programs, from Yonghong Song. * https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: selftests, bpf: Test that dead ldx_w insns are accepted bpf: Clear zext_dst of dead insns bpf: Add rcu_read_lock in bpf_get_current_[ancestor_]cgroup_id() helpers ==================== Link: https://lore.kernel.org/r/20210819144904.20069-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-08-18Merge tag 'cfi-v5.14-rc7' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull clang cfi fix from Kees Cook: - Use rcu_read_{un}lock_sched_notrace to avoid recursion (Elliot Berman) * tag 'cfi-v5.14-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: cfi: Use rcu_read_{un}lock_sched_notrace
2021-08-17Merge tag 'trace-v5.14-rc6' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull tracing fix from Steven Rostedt: "Limit the shooting in the foot of tp_printk The "tp_printk" option redirects the trace event output to printk at boot up. This is useful when a machine crashes before boot where the trace events can not be retrieved by the in kernel ring buffer. But it can be "dangerous" because trace events can be located in high frequency locations such as interrupts and the scheduler, where a printk can slow it down that it live locks the machine (because by the time the printk finishes, the next event is triggered). Thus tp_printk must be used with care. It was discovered that the filter logic to trace events does not apply to the tp_printk events. This can cause a surprise and live lock when the user expects it to be filtered to limit the amount of events printed to the console when in fact it still prints everything" * tag 'trace-v5.14-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: tracing: Apply trace filters on all output channels
2021-08-16Merge tag 'trace-v5.14-rc5-2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull tracing fixes from Steven Rostedt: "Fixes and clean ups to tracing: - Fix header alignment when PREEMPT_RT is enabled for osnoise tracer - Inject "stop" event to see where osnoise stopped the trace - Define DYNAMIC_FTRACE_WITH_ARGS as some code had an #ifdef for it - Fix erroneous message for bootconfig cmdline parameter - Fix crash caused by not found variable in histograms" * tag 'trace-v5.14-rc5-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: tracing / histogram: Fix NULL pointer dereference on strcmp() on NULL event name init: Suppress wrong warning for bootconfig cmdline parameter tracing: define needed config DYNAMIC_FTRACE_WITH_ARGS trace/osnoise: Print a stop tracing message trace/timerlat: Add a header with PREEMPT_RT additional fields trace/osnoise: Add a header with PREEMPT_RT additional fields
2021-08-16tracing: Apply trace filters on all output channelsPingfan Liu
The event filters are not applied on all of the output, which results in the flood of printk when using tp_printk. Unfolding event_trigger_unlock_commit_regs() into trace_event_buffer_commit(), so the filters can be applied on every output. Link: https://lkml.kernel.org/r/20210814034538.8428-1-kernelfans@gmail.com Cc: stable@vger.kernel.org Fixes: 0daa2302968c1 ("tracing: Add tp_printk cmdline to have tracepoints go to printk()") Signed-off-by: Pingfan Liu <kernelfans@gmail.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-08-15Merge tag 'irq-urgent-2021-08-15' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull irq fixes from Thomas Gleixner: "A set of fixes for PCI/MSI and x86 interrupt startup: - Mask all MSI-X entries when enabling MSI-X otherwise stale unmasked entries stay around e.g. when a crashkernel is booted. - Enforce masking of a MSI-X table entry when updating it, which mandatory according to speification - Ensure that writes to MSI[-X} tables are flushed. - Prevent invalid bits being set in the MSI mask register - Properly serialize modifications to the mask cache and the mask register for multi-MSI. - Cure the violation of the affinity setting rules on X86 during interrupt startup which can cause lost and stale interrupts. Move the initial affinity setting ahead of actualy enabling the interrupt. - Ensure that MSI interrupts are completely torn down before freeing them in the error handling case. - Prevent an array out of bounds access in the irq timings code" * tag 'irq-urgent-2021-08-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: driver core: Add missing kernel doc for device::msi_lock genirq/msi: Ensure deactivation on teardown genirq/timings: Prevent potential array overflow in __irq_timings_store() x86/msi: Force affinity setup before startup x86/ioapic: Force affinity setup before startup genirq: Provide IRQCHIP_AFFINITY_PRE_STARTUP PCI/MSI: Protect msi_desc::masked for multi-MSI PCI/MSI: Use msi_mask_irq() in pci_msi_shutdown() PCI/MSI: Correct misleading comments PCI/MSI: Do not set invalid bits in MSI mask PCI/MSI: Enforce MSI[X] entry updates to be visible PCI/MSI: Enforce that MSI-X table entry is masked for update PCI/MSI: Mask all unused MSI-X entries PCI/MSI: Enable and mask MSI-X early
2021-08-15Merge tag 'locking_urgent_for_v5.14_rc6' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull locking fix from Borislav Petkov: - Fix a CONFIG symbol's spelling * tag 'locking_urgent_for_v5.14_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: locking/rtmutex: Use the correct rtmutex debugging config option
2021-08-13bpf: Clear zext_dst of dead insnsIlya Leoshkevich
"access skb fields ok" verifier test fails on s390 with the "verifier bug. zext_dst is set, but no reg is defined" message. The first insns of the test prog are ... 0: 61 01 00 00 00 00 00 00 ldxw %r0,[%r1+0] 8: 35 00 00 01 00 00 00 00 jge %r0,0,1 10: 61 01 00 08 00 00 00 00 ldxw %r0,[%r1+8] ... and the 3rd one is dead (this does not look intentional to me, but this is a separate topic). sanitize_dead_code() converts dead insns into "ja -1", but keeps zext_dst. When opt_subreg_zext_lo32_rnd_hi32() tries to parse such an insn, it sees this discrepancy and bails. This problem can be seen only with JITs whose bpf_jit_needs_zext() returns true. Fix by clearning dead insns' zext_dst. The commits that contributed to this problem are: 1. 5aa5bd14c5f8 ("bpf: add initial suite for selftests"), which introduced the test with the dead code. 2. 5327ed3d44b7 ("bpf: verifier: mark verified-insn with sub-register zext flag"), which introduced the zext_dst flag. 3. 83a2881903f3 ("bpf: Account for BPF_FETCH in insn_has_def32()"), which introduced the sanity check. 4. 9183671af6db ("bpf: Fix leakage under speculation on mispredicted branches"), which bisect points to. It's best to fix this on stable branches that contain the second one, since that's the point where the inconsistency was introduced. Fixes: 5327ed3d44b7 ("bpf: verifier: mark verified-insn with sub-register zext flag") Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20210812151811.184086-2-iii@linux.ibm.com
2021-08-12Merge tag 'net-5.14-rc6' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Jakub Kicinski: "Networking fixes, including fixes from netfilter, bpf, can and ieee802154. The size of this is pretty normal, but we got more fixes for 5.14 changes this week than last week. Nothing major but the trend is the opposite of what we like. We'll see how the next week goes.. Current release - regressions: - r8169: fix ASPM-related link-up regressions - bridge: fix flags interpretation for extern learn fdb entries - phy: micrel: fix link detection on ksz87xx switch - Revert "tipc: Return the correct errno code" - ptp: fix possible memory leak caused by invalid cast Current release - new code bugs: - bpf: add missing bpf_read_[un]lock_trace() for syscall program - bpf: fix potentially incorrect results with bpf_get_local_storage() - page_pool: mask the page->signature before the checking, avoid dma mapping leaks - netfilter: nfnetlink_hook: 5 fixes to information in netlink dumps - bnxt_en: fix firmware interface issues with PTP - mlx5: Bridge, fix ageing time Previous releases - regressions: - linkwatch: fix failure to restore device state across suspend/resume - bareudp: fix invalid read beyond skb's linear data Previous releases - always broken: - bpf: fix integer overflow involving bucket_size - ppp: fix issues when desired interface name is specified via netlink - wwan: mhi_wwan_ctrl: fix possible deadlock - dsa: microchip: ksz8795: fix number of VLAN related bugs - dsa: drivers: fix broken backpressure in .port_fdb_dump - dsa: qca: ar9331: make proper initial port defaults Misc: - bpf: add lockdown check for probe_write_user helper - netfilter: conntrack: remove offload_pickup sysctl before 5.14 is out - netfilter: conntrack: collect all entries in one cycle, heuristically slow down garbage collection scans on idle systems to prevent frequent wake ups" * tag 'net-5.14-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (87 commits) vsock/virtio: avoid potential deadlock when vsock device remove wwan: core: Avoid returning NULL from wwan_create_dev() net: dsa: sja1105: unregister the MDIO buses during teardown Revert "tipc: Return the correct errno code" net: mscc: Fix non-GPL export of regmap APIs net: igmp: increase size of mr_ifc_count MAINTAINERS: switch to my OMP email for Renesas Ethernet drivers tcp_bbr: fix u32 wrap bug in round logic if bbr_init() called after 2B packets net: pcs: xpcs: fix error handling on failed to allocate memory net: linkwatch: fix failure to restore device state across suspend/resume net: bridge: fix memleak in br_add_if() net: switchdev: zero-initialize struct switchdev_notifier_fdb_info emitted by drivers towards the bridge net: bridge: fix flags interpretation for extern learn fdb entries net: dsa: sja1105: fix broken backpressure in .port_fdb_dump net: dsa: lantiq: fix broken backpressure in .port_fdb_dump net: dsa: lan9303: fix broken backpressure in .port_fdb_dump net: dsa: hellcreek: fix broken backpressure in .port_fdb_dump bpf, core: Fix kernel-doc notation net: igmp: fix data-race in igmp_ifc_timer_expire() net: Fix memory leak in ieee802154_raw_deliver ...
2021-08-12tracing / histogram: Fix NULL pointer dereference on strcmp() on NULL event nameSteven Rostedt (VMware)
The following commands: # echo 'read_max u64 size;' > synthetic_events # echo 'hist:keys=common_pid:count=count:onmax($count).trace(read_max,count)' > events/syscalls/sys_enter_read/trigger Causes: BUG: kernel NULL pointer dereference, address: 0000000000000000 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 0 P4D 0 Oops: 0000 [#1] PREEMPT SMP CPU: 4 PID: 1763 Comm: bash Not tainted 5.14.0-rc2-test+ #155 Hardware name: Hewlett-Packard HP Compaq Pro 6300 SFF/339A, BIOS K01 v03.03 07/14/2016 RIP: 0010:strcmp+0xc/0x20 Code: 75 f7 31 c0 0f b6 0c 06 88 0c 02 48 83 c0 01 84 c9 75 f1 4c 89 c0 c3 0f 1f 80 00 00 00 00 31 c0 eb 08 48 83 c0 01 84 d2 74 0f <0f> b6 14 07 3a 14 06 74 ef 19 c0 83 c8 01 c3 31 c0 c3 66 90 48 89 RSP: 0018:ffffb5fdc0963ca8 EFLAGS: 00010246 RAX: 0000000000000000 RBX: ffffffffb3a4e040 RCX: 0000000000000000 RDX: 0000000000000000 RSI: ffff9714c0d0b640 RDI: 0000000000000000 RBP: 0000000000000000 R08: 00000022986b7cde R09: ffffffffb3a4dff8 R10: 0000000000000000 R11: 0000000000000000 R12: ffff9714c50603c8 R13: 0000000000000000 R14: ffff97143fdf9e48 R15: ffff9714c01a2210 FS: 00007f1fa6785740(0000) GS:ffff9714da400000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000000 CR3: 000000002d863004 CR4: 00000000001706e0 Call Trace: __find_event_file+0x4e/0x80 action_create+0x6b7/0xeb0 ? kstrdup+0x44/0x60 event_hist_trigger_func+0x1a07/0x2130 trigger_process_regex+0xbd/0x110 event_trigger_write+0x71/0xd0 vfs_write+0xe9/0x310 ksys_write+0x68/0xe0 do_syscall_64+0x3b/0x90 entry_SYSCALL_64_after_hwframe+0x44/0xae RIP: 0033:0x7f1fa6879e87 The problem was the "trace(read_max,count)" where the "count" should be "$count" as "onmax()" only handles variables (although it really should be able to figure out that "count" is a field of sys_enter_read). But there's a path that does not find the variable and ends up passing a NULL for the event, which ends up getting passed to "strcmp()". Add a check for NULL to return and error on the command with: # cat error_log hist:syscalls:sys_enter_read: error: Couldn't create or find variable Command: hist:keys=common_pid:count=count:onmax($count).trace(read_max,count) ^ Link: https://lkml.kernel.org/r/20210808003011.4037f8d0@oasis.local.home Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: stable@vger.kernel.org Fixes: 50450603ec9cb tracing: Add 'onmax' hist trigger action support Reviewed-by: Tom Zanussi <zanussi@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-08-12tracing: define needed config DYNAMIC_FTRACE_WITH_ARGSLukas Bulwahn
Commit 2860cd8a2353 ("livepatch: Use the default ftrace_ops instead of REGS when ARGS is available") intends to enable config LIVEPATCH when ftrace with ARGS is available. However, the chain of configs to enable LIVEPATCH is incomplete, as HAVE_DYNAMIC_FTRACE_WITH_ARGS is available, but the definition of DYNAMIC_FTRACE_WITH_ARGS, combining DYNAMIC_FTRACE and HAVE_DYNAMIC_FTRACE_WITH_ARGS, needed to enable LIVEPATCH, is missing in the commit. Fortunately, ./scripts/checkkconfigsymbols.py detects this and warns: DYNAMIC_FTRACE_WITH_ARGS Referencing files: kernel/livepatch/Kconfig So, define the config DYNAMIC_FTRACE_WITH_ARGS analogously to the already existing similar configs, DYNAMIC_FTRACE_WITH_REGS and DYNAMIC_FTRACE_WITH_DIRECT_CALLS, in ./kernel/trace/Kconfig to connect the chain of configs. Link: https://lore.kernel.org/kernel-janitors/CAKXUXMwT2zS9fgyQHKUUiqo8ynZBdx2UEUu1WnV_q0OCmknqhw@mail.gmail.com/ Link: https://lkml.kernel.org/r/20210806195027.16808-1-lukas.bulwahn@gmail.com Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Jiri Kosina <jikos@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Miroslav Benes <mbenes@suse.cz> Cc: stable@vger.kernel.org Fixes: 2860cd8a2353 ("livepatch: Use the default ftrace_ops instead of REGS when ARGS is available") Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-08-12trace/osnoise: Print a stop tracing messageDaniel Bristot de Oliveira
When using osnoise/timerlat with stop tracing, sometimes it is not clear in which CPU the stop condition was hit, mainly when using some extra events. Print a message informing in which CPU the trace stopped, like in the example below: <idle>-0 [006] d.h. 2932.676616: #1672599 context irq timer_latency 34689 ns <idle>-0 [006] dNh. 2932.676618: irq_noise: local_timer:236 start 2932.676615639 duration 2391 ns <idle>-0 [006] dNh. 2932.676620: irq_noise: virtio0-output.0:47 start 2932.676620180 duration 86 ns <idle>-0 [003] d.h. 2932.676621: #1673374 context irq timer_latency 1200 ns <idle>-0 [006] d... 2932.676623: thread_noise: swapper/6:0 start 2932.676615964 duration 4339 ns <idle>-0 [003] dNh. 2932.676623: irq_noise: local_timer:236 start 2932.676620597 duration 1881 ns <idle>-0 [006] d... 2932.676623: sched_switch: prev_comm=swapper/6 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=timerlat/6 next_pid=852 next_prio=4 timerlat/6-852 [006] .... 2932.676623: #1672599 context thread timer_latency 41931 ns <idle>-0 [003] d... 2932.676623: thread_noise: swapper/3:0 start 2932.676620854 duration 880 ns <idle>-0 [003] d... 2932.676624: sched_switch: prev_comm=swapper/3 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=timerlat/3 next_pid=849 next_prio=4 timerlat/6-852 [006] .... 2932.676624: timerlat_main: stop tracing hit on cpu 6 timerlat/3-849 [003] .... 2932.676624: #1673374 context thread timer_latency 4310 ns Link: https://lkml.kernel.org/r/b30a0d7542adba019185f44ee648e60e14923b11.1626598844.git.bristot@kernel.org Cc: Tom Zanussi <zanussi@kernel.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-08-12trace/timerlat: Add a header with PREEMPT_RT additional fieldsDaniel Bristot de Oliveira
Some extra flags are printed to the trace header when using the PREEMPT_RT config. The extra flags are: need-resched-lazy, preempt-lazy-depth, and migrate-disable. Without printing these fields, the timerlat specific fields are shifted by three positions, for example: # tracer: timerlat # # _-----=> irqs-off # / _----=> need-resched # | / _---=> hardirq/softirq # || / _--=> preempt-depth # || / # |||| ACTIVATION # TASK-PID CPU# |||| TIMESTAMP ID CONTEXT LATENCY # | | | |||| | | | | <idle>-0 [000] d..h... 3279.798871: #1 context irq timer_latency 830 ns <...>-807 [000] ....... 3279.798881: #1 context thread timer_latency 11301 ns Add a new header for timerlat with the missing fields, to be used when the PREEMPT_RT is enabled. Link: https://lkml.kernel.org/r/babb83529a3211bd0805be0b8c21608230202c55.1626598844.git.bristot@kernel.org Cc: Tom Zanussi <zanussi@kernel.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-08-12trace/osnoise: Add a header with PREEMPT_RT additional fieldsDaniel Bristot de Oliveira
Some extra flags are printed to the trace header when using the PREEMPT_RT config. The extra flags are: need-resched-lazy, preempt-lazy-depth, and migrate-disable. Without printing these fields, the osnoise specific fields are shifted by three positions, for example: # tracer: osnoise # # _-----=> irqs-off # / _----=> need-resched # | / _---=> hardirq/softirq # || / _--=> preempt-depth MAX # || / SINGLE Interference counters: # |||| RUNTIME NOISE %% OF CPU NOISE +-----------------------------+ # TASK-PID CPU# |||| TIMESTAMP IN US IN US AVAILABLE IN US HW NMI IRQ SIRQ THREAD # | | | |||| | | | | | | | | | | <...>-741 [000] ....... 1105.690909: 1000000 234 99.97660 36 21 0 1001 22 3 <...>-742 [001] ....... 1105.691923: 1000000 281 99.97190 197 7 0 1012 35 14 <...>-743 [002] ....... 1105.691958: 1000000 1324 99.86760 118 11 0 1016 155 143 <...>-744 [003] ....... 1105.691998: 1000000 109 99.98910 21 4 0 1004 33 7 <...>-745 [004] ....... 1105.692015: 1000000 2023 99.79770 97 37 0 1023 52 18 Add a new header for osnoise with the missing fields, to be used when the PREEMPT_RT is enabled. Link: https://lkml.kernel.org/r/1f03289d2a51fde5a58c2e7def063dc630820ad1.1626598844.git.bristot@kernel.org Cc: Tom Zanussi <zanussi@kernel.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-08-12Merge branch 'for-v5.14' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace Pull ucounts fix from Eric Biederman: "This fixes the ucount sysctls on big endian architectures. The counts were expanded to be longs instead of ints, and the sysctl code was overlooked, so only the low 32bit were being processed. On litte endian just processing the low 32bits is fine, but on 64bit big endian processing just the low 32bits results in the high order bits instead of the low order bits being processed and nothing works proper. This change took a little bit to mature as we have the SYSCTL_ZERO, and SYSCTL_INT_MAX macros that are only usable for sysctls operating on ints, but unfortunately are not obviously broken. Which resulted in the versions of this change working on big endian and not on little endian, because the int SYSCTL_ZERO when extended 64bit wound up being 0x100000000. So we only allowed values greater than 0x100000000 and less than 0faff. Which unfortunately broken everything that tried to set the sysctls. (First reported with the windows subsystem for linux). I have tested this on x86_64 64bit after first reproducing the problems with the earlier version of this change, and then verifying the problems do not exist when we use appropriate long min and max values for extra1 and extra2" * 'for-v5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: ucounts: add missing data type changes
2021-08-11Merge tag 'seccomp-v5.14-rc6' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull seccomp fixes from Kees Cook: - Fix typo in user notification documentation (Rodrigo Campos) - Fix userspace counter report when using TSYNC (Hsuan-Chi Kuo, Wiktor Garbacz) * tag 'seccomp-v5.14-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: seccomp: Fix setting loaded filter count during TSYNC Documentation: seccomp: Fix typo in user notification
2021-08-11cfi: Use rcu_read_{un}lock_sched_notraceElliot Berman
If rcu_read_lock_sched tracing is enabled, the tracing subsystem can perform a jump which needs to be checked by CFI. For example, stm_ftrace source is enabled as a module and hooks into enabled ftrace events. This can cause an recursive loop where find_shadow_check_fn -> rcu_read_lock_sched -> (call to stm_ftrace generates cfi slowpath) -> find_shadow_check_fn -> rcu_read_lock_sched -> ... To avoid the recursion, either the ftrace codes needs to be marked with __no_cfi or CFI should not trace. Use the "_notrace" in CFI to avoid tracing so that CFI can guard ftrace. Signed-off-by: Elliot Berman <quic_eberman@quicinc.com> Reviewed-by: Sami Tolvanen <samitolvanen@google.com> Cc: stable@vger.kernel.org Fixes: cf68fffb66d6 ("add support for Clang CFI") Signed-off-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20210811155914.19550-1-quic_eberman@quicinc.com
2021-08-11seccomp: Fix setting loaded filter count during TSYNCHsuan-Chi Kuo
The desired behavior is to set the caller's filter count to thread's. This value is reported via /proc, so this fixes the inaccurate count exposed to userspace; it is not used for reference counting, etc. Signed-off-by: Hsuan-Chi Kuo <hsuanchikuo@gmail.com> Link: https://lore.kernel.org/r/20210304233708.420597-1-hsuanchikuo@gmail.com Co-developed-by: Wiktor Garbacz <wiktorg@google.com> Signed-off-by: Wiktor Garbacz <wiktorg@google.com> Link: https://lore.kernel.org/lkml/20210810125158.329849-1-wiktorg@google.com Signed-off-by: Kees Cook <keescook@chromium.org> Cc: stable@vger.kernel.org Fixes: c818c03b661c ("seccomp: Report number of loaded filters in /proc/$pid/status")