aboutsummaryrefslogtreecommitdiffstats
path: root/kernel/cgroup
AgeCommit message (Collapse)Author
2017-08-11cpuset: fix a deadlock due to incomplete patching of cpusets_enabled()Dima Zavin
commit 89affbf5d9ebb15c6460596822e8857ea2f9e735 upstream. In codepaths that use the begin/retry interface for reading mems_allowed_seq with irqs disabled, there exists a race condition that stalls the patch process after only modifying a subset of the static_branch call sites. This problem manifested itself as a deadlock in the slub allocator, inside get_any_partial. The loop reads mems_allowed_seq value (via read_mems_allowed_begin), performs the defrag operation, and then verifies the consistency of mem_allowed via the read_mems_allowed_retry and the cookie returned by xxx_begin. The issue here is that both begin and retry first check if cpusets are enabled via cpusets_enabled() static branch. This branch can be rewritted dynamically (via cpuset_inc) if a new cpuset is created. The x86 jump label code fully synchronizes across all CPUs for every entry it rewrites. If it rewrites only one of the callsites (specifically the one in read_mems_allowed_retry) and then waits for the smp_call_function(do_sync_core) to complete while a CPU is inside the begin/retry section with IRQs off and the mems_allowed value is changed, we can hang. This is because begin() will always return 0 (since it wasn't patched yet) while retry() will test the 0 against the actual value of the seq counter. The fix is to use two different static keys: one for begin (pre_enable_key) and one for retry (enable_key). In cpuset_inc(), we first bump the pre_enable key to ensure that cpuset_mems_allowed_begin() always return a valid seqcount if are enabling cpusets. Similarly, when disabling cpusets via cpuset_dec(), we first ensure that callers of cpuset_mems_allowed_retry() will start ignoring the seqcount value before we let cpuset_mems_allowed_begin() return 0. The relevant stack traces of the two stuck threads: CPU: 1 PID: 1415 Comm: mkdir Tainted: G L 4.9.36-00104-g540c51286237 #4 Hardware name: Default string Default string/Hardware, BIOS 4.29.1-20170526215256 05/26/2017 task: ffff8817f9c28000 task.stack: ffffc9000ffa4000 RIP: smp_call_function_many+0x1f9/0x260 Call Trace: smp_call_function+0x3b/0x70 on_each_cpu+0x2f/0x90 text_poke_bp+0x87/0xd0 arch_jump_label_transform+0x93/0x100 __jump_label_update+0x77/0x90 jump_label_update+0xaa/0xc0 static_key_slow_inc+0x9e/0xb0 cpuset_css_online+0x70/0x2e0 online_css+0x2c/0xa0 cgroup_apply_control_enable+0x27f/0x3d0 cgroup_mkdir+0x2b7/0x420 kernfs_iop_mkdir+0x5a/0x80 vfs_mkdir+0xf6/0x1a0 SyS_mkdir+0xb7/0xe0 entry_SYSCALL_64_fastpath+0x18/0xad ... CPU: 2 PID: 1 Comm: init Tainted: G L 4.9.36-00104-g540c51286237 #4 Hardware name: Default string Default string/Hardware, BIOS 4.29.1-20170526215256 05/26/2017 task: ffff8818087c0000 task.stack: ffffc90000030000 RIP: int3+0x39/0x70 Call Trace: <#DB> ? ___slab_alloc+0x28b/0x5a0 <EOE> ? copy_process.part.40+0xf7/0x1de0 __slab_alloc.isra.80+0x54/0x90 copy_process.part.40+0xf7/0x1de0 copy_process.part.40+0xf7/0x1de0 kmem_cache_alloc_node+0x8a/0x280 copy_process.part.40+0xf7/0x1de0 _do_fork+0xe7/0x6c0 _raw_spin_unlock_irq+0x2d/0x60 trace_hardirqs_on_caller+0x136/0x1d0 entry_SYSCALL_64_fastpath+0x5/0xad do_syscall_64+0x27/0x350 SyS_clone+0x19/0x20 do_syscall_64+0x60/0x350 entry_SYSCALL64_slow_path+0x25/0x25 Link: http://lkml.kernel.org/r/20170731040113.14197-1-dmitriyz@waymo.com Fixes: 46e700abc44c ("mm, page_alloc: remove unnecessary taking of a seqlock when cpusets are disabled") Signed-off-by: Dima Zavin <dmitriyz@waymo.com> Reported-by: Cliff Spradlin <cspradlin@waymo.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Christopher Lameter <cl@linux.com> Cc: Li Zefan <lizefan@huawei.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-08-11cgroup: fix error return value from cgroup_subtree_control()Tejun Heo
commit 3c74541777302eec43a0d1327c4d58b8659a776b upstream. While refactoring, f7b2814bb9b6 ("cgroup: factor out cgroup_{apply|finalize}_control() from cgroup_subtree_control_write()") broke error return value from the function. The return value from the last operation is always overridden to zero. Fix it. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-08-11cgroup: create dfl_root files on subsys registrationTejun Heo
commit 7af608e4f9530372aec6e940552bf76595f2e265 upstream. On subsystem registration, css_populate_dir() is not called on the new root css, so the interface files for the subsystem on cgrp_dfl_root aren't created on registration. This is a residue from the days when cgrp_dfl_root was used only as the parking spot for unused subsystems, which no longer is true as it's used as the root for cgroup2. This is often fine as later operations tend to create them as a part of mount (cgroup1) or subtree_control operations (cgroup2); however, it's not difficult to mount cgroup2 with the controller interface files missing as Waiman found out. Fix it by invoking css_populate_dir() on the root css on subsys registration. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-and-tested-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-08-11cgroup: don't call migration methods if there are no tasks to migrateTejun Heo
commit 610467270fb368584b74567edd21c8cc5104490f upstream. Subsystem migration methods shouldn't be called for empty migrations. cgroup_migrate_execute() implements this guarantee by bailing early if there are no source css_sets. This used to be correct before a79a908fd2b0 ("cgroup: introduce cgroup namespaces"), but no longer since the commit because css_sets can stay pinned without tasks in them. This caused cgroup_migrate_execute() call into cpuset migration methods with an empty cgroup_taskset. cpuset migration methods correctly assume that cgroup_taskset_first() never returns NULL; however, due to the bug, it can, leading to the following oops. Unable to handle kernel paging request for data at address 0x00000960 Faulting instruction address: 0xc0000000001d6868 Oops: Kernel access of bad area, sig: 11 [#1] ... CPU: 14 PID: 16947 Comm: kworker/14:0 Tainted: G W 4.12.0-rc4-next-20170609 #2 Workqueue: events cpuset_hotplug_workfn task: c00000000ca60580 task.stack: c00000000c728000 NIP: c0000000001d6868 LR: c0000000001d6858 CTR: c0000000001d6810 REGS: c00000000c72b720 TRAP: 0300 Tainted: GW (4.12.0-rc4-next-20170609) MSR: 8000000000009033 <SF,EE,ME,IR,DR,RI,LE> CR: 44722422 XER: 20000000 CFAR: c000000000008710 DAR: 0000000000000960 DSISR: 40000000 SOFTE: 1 GPR00: c0000000001d6858 c00000000c72b9a0 c000000001536e00 0000000000000000 GPR04: c00000000c72b9c0 0000000000000000 c00000000c72bad0 c000000766367678 GPR08: c000000766366d10 c00000000c72b958 c000000001736e00 0000000000000000 GPR12: c0000000001d6810 c00000000e749300 c000000000123ef8 c000000775af4180 GPR16: 0000000000000000 0000000000000000 c00000075480e9c0 c00000075480e9e0 GPR20: c00000075480e8c0 0000000000000001 0000000000000000 c00000000c72ba20 GPR24: c00000000c72baa0 c00000000c72bac0 c000000001407248 c00000000c72ba20 GPR28: c00000000141fc80 c00000000c72bac0 c00000000c6bc790 0000000000000000 NIP [c0000000001d6868] cpuset_can_attach+0x58/0x1b0 LR [c0000000001d6858] cpuset_can_attach+0x48/0x1b0 Call Trace: [c00000000c72b9a0] [c0000000001d6858] cpuset_can_attach+0x48/0x1b0 (unreliable) [c00000000c72ba00] [c0000000001cbe80] cgroup_migrate_execute+0xb0/0x450 [c00000000c72ba80] [c0000000001d3754] cgroup_transfer_tasks+0x1c4/0x360 [c00000000c72bba0] [c0000000001d923c] cpuset_hotplug_workfn+0x86c/0xa20 [c00000000c72bca0] [c00000000011aa44] process_one_work+0x1e4/0x580 [c00000000c72bd30] [c00000000011ae78] worker_thread+0x98/0x5c0 [c00000000c72bdc0] [c000000000124058] kthread+0x168/0x1b0 [c00000000c72be30] [c00000000000b2e8] ret_from_kernel_thread+0x5c/0x74 Instruction dump: f821ffa1 7c7d1b78 60000000 60000000 38810020 7fa3eb78 3f42ffed 4bff4c25 60000000 3b5a0448 3d420020 eb610020 <e9230960> 7f43d378 e9290000 f92af200 ---[ end trace dcaaf98fb36d9e64 ]--- This patch fixes the bug by adding an explicit nr_tasks counter to cgroup_taskset and skipping calling the migration methods if the counter is zero. While at it, remove the now spurious check on no source css_sets. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-and-tested-by: Abdul Haleem <abdhalee@linux.vnet.ibm.com> Cc: Roman Gushchin <guro@fb.com> Fixes: a79a908fd2b0 ("cgroup: introduce cgroup namespaces") Link: http://lkml.kernel.org/r/1497266622.15415.39.camel@abdul.in.ibm.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-05-24cpuset: consider dying css as offlineTejun Heo
In most cases, a cgroup controller don't care about the liftimes of cgroups. For the controller, a css becomes online when ->css_online() is called on it and offline when ->css_offline() is called. However, cpuset is special in that the user interface it exposes cares whether certain cgroups exist or not. Combined with the RCU delay between cgroup removal and css offlining, this can lead to user visible behavior oddities where operations which should succeed after cgroup removals fail for some time period. The effects of cgroup removals are delayed when seen from userland. This patch adds css_is_dying() which tests whether offline is pending and updates is_cpuset_online() so that the function returns false also while offline is pending. This gets rid of the userland visible delays. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Daniel Jordan <daniel.m.jordan@oracle.com> Link: http://lkml.kernel.org/r/327ca1f5-7957-fbb9-9e5f-9ba149d40ba2@oracle.com Cc: stable@vger.kernel.org Signed-off-by: Tejun Heo <tj@kernel.org>
2017-05-17cgroup: Prevent kill_css() from being called more than onceWaiman Long
The kill_css() function may be called more than once under the condition that the css was killed but not physically removed yet followed by the removal of the cgroup that is hosting the css. This patch prevents any harmm from being done when that happens. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org> Cc: stable@vger.kernel.org # v4.5+
2017-05-01Merge branch 'for-4.12' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup updates from Tejun Heo: "Nothing major. Two notable fixes are Li's second stab at fixing the long-standing race condition in the mount path and suppression of spurious warning from cgroup_get(). All other changes are trivial" * 'for-4.12' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cgroup: mark cgroup_get() with __maybe_unused cgroup: avoid attaching a cgroup root to two different superblocks, take 2 cgroup: fix spurious warnings on cgroup_is_dead() from cgroup_sk_alloc() cgroup: move cgroup_subsys_state parent field for cache locality cpuset: Remove cpuset_update_active_cpus()'s parameter. cgroup: switch to BUG_ON() cgroup: drop duplicate header nsproxy.h kernel: convert css_set.refcount from atomic_t to refcount_t kernel: convert cgroup_namespace.count from atomic_t to refcount_t
2017-05-01cgroup: mark cgroup_get() with __maybe_unusedTejun Heo
a590b90d472f ("cgroup: fix spurious warnings on cgroup_is_dead() from cgroup_sk_alloc()") converted most cgroup_get() usages to cgroup_get_live() leaving cgroup_sk_alloc() the sole user of cgroup_get(). When !CONFIG_SOCK_CGROUP_DATA, this ends up triggering unused warning for cgroup_get(). Silence the warning by adding __maybe_unused to cgroup_get(). Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Link: http://lkml.kernel.org/r/20170501145340.17e8ef86@canb.auug.org.au Signed-off-by: Tejun Heo <tj@kernel.org>
2017-04-28cgroup: avoid attaching a cgroup root to two different superblocks, take 2Zefan Li
Commit bfb0b80db5f9 ("cgroup: avoid attaching a cgroup root to two different superblocks") is broken. Now we try to fix the race by delaying the initialization of cgroup root refcnt until a superblock has been allocated. Reported-by: Dmitry Vyukov <dvyukov@google.com> Reported-by: Andrei Vagin <avagin@virtuozzo.com> Tested-by: Andrei Vagin <avagin@virtuozzo.com> Signed-off-by: Zefan Li <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2017-04-28cgroup: fix spurious warnings on cgroup_is_dead() from cgroup_sk_alloc()Tejun Heo
cgroup_get() expected to be called only on live cgroups and triggers warning on a dead cgroup; however, cgroup_sk_alloc() may be called while cloning a socket which is left in an empty and removed cgroup and thus may legitimately duplicate its reference on a dead cgroup. This currently triggers the following warning spuriously. WARNING: CPU: 14 PID: 0 at kernel/cgroup.c:490 cgroup_get+0x55/0x60 ... [<ffffffff8107e123>] __warn+0xd3/0xf0 [<ffffffff8107e20e>] warn_slowpath_null+0x1e/0x20 [<ffffffff810ff465>] cgroup_get+0x55/0x60 [<ffffffff81106061>] cgroup_sk_alloc+0x51/0xe0 [<ffffffff81761beb>] sk_clone_lock+0x2db/0x390 [<ffffffff817cce06>] inet_csk_clone_lock+0x16/0xc0 [<ffffffff817e8173>] tcp_create_openreq_child+0x23/0x4b0 [<ffffffff818601a1>] tcp_v6_syn_recv_sock+0x91/0x670 [<ffffffff817e8b16>] tcp_check_req+0x3a6/0x4e0 [<ffffffff81861ba3>] tcp_v6_rcv+0x693/0xa00 [<ffffffff81837429>] ip6_input_finish+0x59/0x3e0 [<ffffffff81837cb2>] ip6_input+0x32/0xb0 [<ffffffff81837387>] ip6_rcv_finish+0x57/0xa0 [<ffffffff81837ac8>] ipv6_rcv+0x318/0x4d0 [<ffffffff817778c7>] __netif_receive_skb_core+0x2d7/0x9a0 [<ffffffff81777fa6>] __netif_receive_skb+0x16/0x70 [<ffffffff81778023>] netif_receive_skb_internal+0x23/0x80 [<ffffffff817787d8>] napi_gro_frags+0x208/0x270 [<ffffffff8168a9ec>] mlx4_en_process_rx_cq+0x74c/0xf40 [<ffffffff8168b270>] mlx4_en_poll_rx_cq+0x30/0x90 [<ffffffff81778b30>] net_rx_action+0x210/0x350 [<ffffffff8188c426>] __do_softirq+0x106/0x2c7 [<ffffffff81082bad>] irq_exit+0x9d/0xa0 [<ffffffff8188c0e4>] do_IRQ+0x54/0xd0 [<ffffffff8188a63f>] common_interrupt+0x7f/0x7f <EOI> [<ffffffff8173d7e7>] cpuidle_enter+0x17/0x20 [<ffffffff810bdfd9>] cpu_startup_entry+0x2a9/0x2f0 [<ffffffff8103edd1>] start_secondary+0xf1/0x100 This patch renames the existing cgroup_get() with the dead cgroup warning to cgroup_get_live() after cgroup_kn_lock_live() and introduces the new cgroup_get() which doesn't check whether the cgroup is live or dead. All existing cgroup_get() users except for cgroup_sk_alloc() are converted to use cgroup_get_live(). Fixes: d979a39d7242 ("cgroup: duplicate cgroup reference when cloning sockets") Cc: stable@vger.kernel.org # v4.5+ Cc: Johannes Weiner <hannes@cmpxchg.org> Reported-by: Chris Mason <clm@fb.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2017-04-16Merge branch 'for-4.11-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup fix from Tejun Heo: "Unfortunately, the commit to fix the cgroup mount race in the previous pull request can lead to hangs. The original bug has been around for a while and isn't too likely to be triggered in usual use cases. Revert the commit for now" * 'for-4.11-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: Revert "cgroup: avoid attaching a cgroup root to two different superblocks"
2017-04-16Revert "cgroup: avoid attaching a cgroup root to two different superblocks"Tejun Heo
This reverts commit bfb0b80db5f9dca5ac0a5fd0edb765ee555e5a8e. Andrei reports CRIU test hangs with the patch applied. The bug fixed by the patch isn't too likely to trigger in actual uses. Revert the patch for now. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Andrei Vagin <avagin@virtuozzo.com> Link: http://lkml.kernel.org/r/20170414232737.GC20350@outlook.office365.com
2017-04-11Merge branch 'for-4.11-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup fixes from Tejun Heo: "This contains fixes for two long standing subtle bugs: - kthread_bind() on a new kthread binds it to specific CPUs and prevents userland from messing with the affinity or cgroup membership. Unfortunately, for cgroup membership, there's a window between kthread creation and kthread_bind*() invocation where the kthread can be moved into a non-root cgroup by userland. Depending on what controllers are in effect, this can assign the kthread unexpected attributes. For example, in the reported case, workqueue workers ended up in a non-root cpuset cgroups and had their CPU affinities overridden. This broke workqueue invariants and led to workqueue stalls. Fixed by closing the window between kthread creation and kthread_bind() as suggested by Oleg. - There was a bug in cgroup mount path which could allow two competing mount attempts to attach the same cgroup_root to two different superblocks. This was caused by mishandling return value from kernfs_pin_sb(). Fixed" * 'for-4.11-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cgroup: avoid attaching a cgroup root to two different superblocks cgroup, kthread: close race window where new kthreads can be migrated to non-root cgroups
2017-04-11cgroup: avoid attaching a cgroup root to two different superblocksZefan Li
Run this: touch file0 for ((; ;)) { mount -t cpuset xxx file0 } And this concurrently: touch file1 for ((; ;)) { mount -t cpuset xxx file1 } We'll trigger a warning like this: ------------[ cut here ]------------ WARNING: CPU: 1 PID: 4675 at lib/percpu-refcount.c:317 percpu_ref_kill_and_confirm+0x92/0xb0 percpu_ref_kill_and_confirm called more than once on css_release! CPU: 1 PID: 4675 Comm: mount Not tainted 4.11.0-rc5+ #5 Hardware name: Bochs Bochs, BIOS Bochs 01/01/2007 Call Trace: dump_stack+0x63/0x84 __warn+0xd1/0xf0 warn_slowpath_fmt+0x5f/0x80 percpu_ref_kill_and_confirm+0x92/0xb0 cgroup_kill_sb+0x95/0xb0 deactivate_locked_super+0x43/0x70 deactivate_super+0x46/0x60 ... ---[ end trace a79f61c2a2633700 ]--- Here's a race: Thread A Thread B cgroup1_mount() # alloc a new cgroup root cgroup_setup_root() cgroup1_mount() # no sb yet, returns NULL kernfs_pin_sb() # but succeeds in getting the refcnt, # so re-use cgroup root percpu_ref_tryget_live() # alloc sb with cgroup root cgroup_do_mount() cgroup_kill_sb() # alloc another sb with same root cgroup_do_mount() cgroup_kill_sb() We end up using the same cgroup root for two different superblocks, so percpu_ref_kill() will be called twice on the same root when the two superblocks are destroyed. We should fix to make sure the superblock pinning is really successful. Cc: stable@vger.kernel.org # 3.16+ Reported-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Zefan Li <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2017-04-11cpuset: Remove cpuset_update_active_cpus()'s parameter.Rakib Mullick
In cpuset_update_active_cpus(), cpu_online isn't used anymore. Remove it. Signed-off-by: Rakib Mullick<rakib.mullick@gmail.com> Acked-by: Zefan Li <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2017-03-27cgroup: switch to BUG_ON()Nicholas Mc Guire
Use BUG_ON() rather than an explicit if followed by BUG() for improved readability and also consistency. Signed-off-by: Nicholas Mc Guire <der.herr@hofr.at> Signed-off-by: Tejun Heo <tj@kernel.org>
2017-03-17cgroup, kthread: close race window where new kthreads can be migrated to ↵Tejun Heo
non-root cgroups Creation of a kthread goes through a couple interlocked stages between the kthread itself and its creator. Once the new kthread starts running, it initializes itself and wakes up the creator. The creator then can further configure the kthread and then let it start doing its job by waking it up. In this configuration-by-creator stage, the creator is the only one that can wake it up but the kthread is visible to userland. When altering the kthread's attributes from userland is allowed, this is fine; however, for cases where CPU affinity is critical, kthread_bind() is used to first disable affinity changes from userland and then set the affinity. This also prevents the kthread from being migrated into non-root cgroups as that can affect the CPU affinity and many other things. Unfortunately, the cgroup side of protection is racy. While the PF_NO_SETAFFINITY flag prevents further migrations, userland can win the race before the creator sets the flag with kthread_bind() and put the kthread in a non-root cgroup, which can lead to all sorts of problems including incorrect CPU affinity and starvation. This bug got triggered by userland which periodically tries to migrate all processes in the root cpuset cgroup to a non-root one. Per-cpu workqueue workers got caught while being created and ended up with incorrected CPU affinity breaking concurrency management and sometimes stalling workqueue execution. This patch adds task->no_cgroup_migration which disallows the task to be migrated by userland. kthreadd starts with the flag set making every child kthread start in the root cgroup with migration disallowed. The flag is cleared after the kthread finishes initialization by which time PF_NO_SETAFFINITY is set if the kthread should stay in the root cgroup. It'd be better to wait for the initialization instead of failing but I couldn't think of a way of implementing that without adding either a new PF flag, or sleeping and retrying from waiting side. Even if userland depends on changing cgroup membership of a kthread, it either has to be synchronized with kthread_create() or periodically repeat, so it's unlikely that this would break anything. v2: Switch to a simpler implementation using a new task_struct bit field suggested by Oleg. Signed-off-by: Tejun Heo <tj@kernel.org> Suggested-by: Oleg Nesterov <oleg@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Reported-and-debugged-by: Chris Mason <clm@fb.com> Cc: stable@vger.kernel.org # v4.3+ (we can't close the race on < v4.3) Signed-off-by: Tejun Heo <tj@kernel.org>
2017-03-14Merge branch 'for-4.11-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup fixes from Tejun Heo: "Three cgroup fixes. Nothing critical: - the pids controller could trigger suspicious RCU warning spuriously. Fixed. - in the debug controller, %p -> %pK to protect kernel pointer from getting exposed. - documentation formatting fix" * 'for-4.11-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cgroups: censor kernel pointer in debug files cgroup/pids: remove spurious suspicious RCU usage warning cgroup: Fix indenting in PID controller documentation
2017-03-09scripts/spelling.txt: add "disble(d)" pattern and fix typo instancesMasahiro Yamada
Fix typos and add the following to the scripts/spelling.txt: disble||disable disbled||disabled I kept the TSL2563_INT_DISBLED in /drivers/iio/light/tsl2563.c untouched. The macro is not referenced at all, but this commit is touching only comment blocks just in case. Link: http://lkml.kernel.org/r/1481573103-11329-20-git-send-email-yamada.masahiro@socionext.com Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-03-08kernel: convert css_set.refcount from atomic_t to refcount_tElena Reshetova
refcount_t type and corresponding API should be used instead of atomic_t when the variable is used as a reference counter. This allows to avoid accidental refcounter overflows that might lead to use-after-free situations. Signed-off-by: Elena Reshetova <elena.reshetova@intel.com> Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com> Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: David Windsor <dwindsor@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2017-03-06cgroups: censor kernel pointer in debug filesKees Cook
As found in grsecurity, this avoids exposing a kernel pointer through the cgroup debug entries. Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Tejun Heo <tj@kernel.org>
2017-03-06cgroup/pids: remove spurious suspicious RCU usage warningTejun Heo
pids_can_fork() is special in that the css association is guaranteed to be stable throughout the function and thus doesn't need RCU protection around task_css access. When determining the css to charge the pid, task_css_check() is used to override the RCU sanity check. While adding a warning message on fork rejection from pids limit, 135b8b37bd91 ("cgroup: Add pids controller event when fork fails because of pid limit") incorrectly added a task_css access which is neither RCU protected or explicitly annotated. This triggers the following suspicious RCU usage warning when RCU debugging is enabled. cgroup: fork rejected by pids controller in =============================== [ ERR: suspicious RCU usage. ] 4.10.0-work+ #1 Not tainted ------------------------------- ./include/linux/cgroup.h:435 suspicious rcu_dereference_check() usage! other info that might help us debug this: rcu_scheduler_active = 2, debug_locks = 0 1 lock held by bash/1748: #0: (&cgroup_threadgroup_rwsem){+++++.}, at: [<ffffffff81052c96>] _do_fork+0xe6/0x6e0 stack backtrace: CPU: 3 PID: 1748 Comm: bash Not tainted 4.10.0-work+ #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.3-1.fc25 04/01/2014 Call Trace: dump_stack+0x68/0x93 lockdep_rcu_suspicious+0xd7/0x110 pids_can_fork+0x1c7/0x1d0 cgroup_can_fork+0x67/0xc0 copy_process.part.58+0x1709/0x1e90 _do_fork+0xe6/0x6e0 SyS_clone+0x19/0x20 do_syscall_64+0x5c/0x140 entry_SYSCALL64_slow_path+0x25/0x25 RIP: 0033:0x7f7853fab93a RSP: 002b:00007ffc12d05c90 EFLAGS: 00000246 ORIG_RAX: 0000000000000038 RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f7853fab93a RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000001200011 RBP: 00007ffc12d05cc0 R08: 0000000000000000 R09: 00007f78548db700 R10: 00007f78548db9d0 R11: 0000000000000246 R12: 00000000000006d4 R13: 0000000000000001 R14: 0000000000000000 R15: 000055e3ebe2c04d /asdf There's no reason to dereference task_css again here when the associated css is already available. Fix it by replacing the task_cgroup() call with css->cgroup. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Mike Galbraith <efault@gmx.de> Fixes: 135b8b37bd91 ("cgroup: Add pids controller event when fork fails because of pid limit") Cc: Kenny Yu <kennyyu@fb.com> Cc: stable@vger.kernel.org # v4.8+ Signed-off-by: Tejun Heo <tj@kernel.org>
2017-03-06kernel: convert cgroup_namespace.count from atomic_t to refcount_tElena Reshetova
refcount_t type and corresponding API should be used instead of atomic_t when the variable is used as a reference counter. This allows to avoid accidental refcounter overflows that might lead to use-after-free situations. Signed-off-by: Elena Reshetova <elena.reshetova@intel.com> Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com> Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: David Windsor <dwindsor@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2017-03-03sched/headers: Remove <linux/magic.h> from <linux/sched/task_stack.h>Ingo Molnar
It's not used by any of the scheduler methods, but <linux/sched/task_stack.h> needs it to pick up STACK_END_MAGIC. Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-03sched/headers: Move the task_lock()/unlock() APIs to <linux/sched/task.h>Ingo Molnar
The task_lock()/task_unlock() APIs are not realated to core scheduling, they are task lifetime APIs, i.e. they belong into <linux/sched/task.h>. Move them. Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-03sched/headers: Move task_struct::signal and task_struct::sighand types and ↵Ingo Molnar
accessors into <linux/sched/signal.h> task_struct::signal and task_struct::sighand are pointers, which would normally make it straightforward to not define those types in sched.h. That is not so, because the types are accompanied by a myriad of APIs (macros and inline functions) that dereference them. Split the types and the APIs out of sched.h and move them into a new header, <linux/sched/signal.h>. With this change sched.h does not know about 'struct signal' and 'struct sighand' anymore, trying to put accessors into sched.h as a test fails the following way: ./include/linux/sched.h: In function ‘test_signal_types’: ./include/linux/sched.h:2461:18: error: dereferencing pointer to incomplete type ‘struct signal_struct’ ^ This reduces the size and complexity of sched.h significantly. Update all headers and .c code that relied on getting the signal handling functionality from <linux/sched.h> to include <linux/sched/signal.h>. The list of affected files in the preparatory patch was partly generated by grepping for the APIs, and partly by doing coverage build testing, both all[yes|mod|def|no]config builds on 64-bit and 32-bit x86, and an array of cross-architecture builds. Nevertheless some (trivial) build breakage is still expected related to rare Kconfig combinations and in-flight patches to various kernel code, but most of it should be handled by this patch. Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02sched/headers: Prepare to move the task_lock()/unlock() APIs to ↵Ingo Molnar
<linux/sched/task.h> But first update the code that uses these facilities with the new header. Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02sched/headers: Prepare for new header dependencies before moving code to ↵Ingo Molnar
<linux/sched/task.h> We are going to split <linux/sched/task.h> out of <linux/sched.h>, which will have to be picked up from other headers and a couple of .c files. Create a trivial placeholder <linux/sched/task.h> file that just maps to <linux/sched.h> to make this patch obviously correct and bisectable. Include the new header in the files that are going to need it. Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02sched/headers: Prepare for new header dependencies before moving code to ↵Ingo Molnar
<linux/sched/mm.h> We are going to split <linux/sched/mm.h> out of <linux/sched.h>, which will have to be picked up from other headers and a couple of .c files. Create a trivial placeholder <linux/sched/mm.h> file that just maps to <linux/sched.h> to make this patch obviously correct and bisectable. The APIs that are going to be moved first are: mm_alloc() __mmdrop() mmdrop() mmdrop_async_fn() mmdrop_async() mmget_not_zero() mmput() mmput_async() get_task_mm() mm_access() mm_release() Include the new header in the files that are going to need it. Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02sched/headers, cgroups: Remove the threadgroup_change_*() wrapperyIngo Molnar
threadgroup_change_begin()/end() is a pointless wrapper around cgroup_threadgroup_change_begin()/end(), minus a might_sleep() in the !CONFIG_CGROUPS=y case. Remove the wrappery, move the might_sleep() (the down_read() already has a might_sleep() check). This debloats <linux/sched.h> a bit and simplifies this API. Update all call sites. No change in functionality. Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-02-27Merge branch 'for-4.11' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup updates from Tejun Heo: "Several noteworthy changes. - Parav's rdma controller is finally merged. It is very straight forward and can limit the abosolute numbers of common rdma constructs used by different cgroups. - kernel/cgroup.c got too chubby and disorganized. Created kernel/cgroup/ subdirectory and moved all cgroup related files under kernel/ there and reorganized the core code. This hurts for backporting patches but was long overdue. - cgroup v2 process listing reimplemented so that it no longer depends on allocating a buffer large enough to cache the entire result to sort and uniq the output. v2 has always mangled the sort order to ensure that users don't depend on the sorted output, so this shouldn't surprise anybody. This makes the pid listing functions use the same iterators that are used internally, which have to have the same iterating capabilities anyway. - perf cgroup filtering now works automatically on cgroup v2. This patch was posted a long time ago but somehow fell through the cracks. - misc fixes asnd documentation updates" * 'for-4.11' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (27 commits) kernfs: fix locking around kernfs_ops->release() callback cgroup: drop the matching uid requirement on migration for cgroup v2 cgroup, perf_event: make perf_event controller work on cgroup2 hierarchy cgroup: misc cleanups cgroup: call subsys->*attach() only for subsystems which are actually affected by migration cgroup: track migration context in cgroup_mgctx cgroup: cosmetic update to cgroup_taskset_add() rdmacg: Fixed uninitialized current resource usage cgroup: Add missing cgroup-v2 PID controller documentation. rdmacg: Added documentation for rdmacg IB/core: added support to use rdma cgroup controller rdmacg: Added rdma cgroup controller cgroup: fix a comment typo cgroup: fix RCU related sparse warnings cgroup: move namespace code to kernel/cgroup/namespace.c cgroup: rename functions for consistency cgroup: move v1 mount functions to kernel/cgroup/cgroup-v1.c cgroup: separate out cgroup1_kf_syscall_ops cgroup: refactor mount path and clearly distinguish v1 and v2 paths cgroup: move cgroup v1 specific code to kernel/cgroup/cgroup-v1.c ...
2017-02-02Merge branch 'cgroup/for-4.11-rdmacg' into cgroup/for-4.11Tejun Heo
Merge in to resolve conflicts in Documentation/cgroup-v2.txt. The conflicts are from multiple section additions and trivial to resolve. Signed-off-by: Tejun Heo <tj@kernel.org>
2017-02-02cgroup: drop the matching uid requirement on migration for cgroup v2Tejun Heo
Along with the write access to the cgroup.procs or tasks file, cgroup has required the writer's euid, unless root, to match [s]uid of the target process or task. On cgroup v1, this is necessary because there's nothing preventing a delegatee from pulling in tasks or processes from all over the system. If a user has a cgroup subdirectory delegated to it, the user would have write access to the cgroup.procs or tasks file. If there are no further checks than file write access check, the user would be able to pull processes from all over the system into its subhierarchy which is clearly not the intended behavior. The matching [s]uid requirement partially prevents this problem by allowing a delegatee to pull in the processes that belongs to it. This isn't a sufficient protection however, because a user would still be able to jump processes across two disjoint sub-hierarchies that has been delegated to them. cgroup v2 resolves the issue by requiring the writer to have access to the common ancestor of the cgroup.procs file of the source and target cgroups. This confines each delegatee to their own sub-hierarchy proper and bases all permission decisions on the cgroup filesystem rather than having to pull in explicit uid matching. cgroup v2 has still been applying the matching [s]uid requirement just for historical reasons. On cgroup2, the requirement doesn't serve any purpose while unnecessarily complicating the permission model. Let's drop it. Signed-off-by: Tejun Heo <tj@kernel.org>
2017-01-30cgroup: misc cleanupsTejun Heo
* cgrp_dfl_implicit_ss_mask is ulong instead of u16 unlike other ss_masks. Make it a u16. * Move have_canfork_callback together with other callback ss_masks. Signed-off-by: Tejun Heo <tj@kernel.org>
2017-01-26Merge branch 'for-4.10-fixes' into for-4.11Tejun Heo
2017-01-15cgroup: call subsys->*attach() only for subsystems which are actually ↵Tejun Heo
affected by migration Currently, subsys->*attach() callbacks are called for all subsystems which are attached to the hierarchy on which the migration is taking place. With cgroup_migrate_prepare_dst() filtering out identity migrations, v1 hierarchies can avoid spurious ->*attach() callback invocations where the source and destination csses are identical; however, this isn't enough on v2 as only a subset of the attached controllers can be affected on controller enable/disable. While spurious ->*attach() invocations aren't critically broken, they're unnecessary overhead and can lead to temporary overcharges on certain controllers. Fix it by tracking which subsystems are affected by a migration and invoking ->*attach() callbacks only on those subsystems. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Zefan Li <lizefan@huawei.com>
2017-01-15cgroup: track migration context in cgroup_mgctxTejun Heo
cgroup migration is performed in four steps - css_set preloading, addition of target tasks, actual migration, and clean up. A list named preloaded_csets is used to track the preloading. This is a bit too restricted and the code is already depending on the subtlety that all source css_sets appear before destination ones. Let's create struct cgroup_mgctx which keeps track of everything during migration. Currently, it has separate preload lists for source and destination csets and also embeds cgroup_taskset which is used during the actual migration. This moves struct cgroup_taskset definition to cgroup-internal.h. This patch doesn't cause any functional changes. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Zefan Li <lizefan@huawei.com>
2017-01-15cgroup: cosmetic update to cgroup_taskset_add()Tejun Heo
cgroup_taskset_add() was using list_add_tail() when for source csets but list_move_tail() for destination. As the operations are gated by list_empty() test, list_move_tail() is equivalent to list_add_tail() here. Use list_add_tail() too for destination csets too. This doesn't cause any functional changes. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Zefan Li <lizefan@huawei.com>
2017-01-10rdmacg: Fixed uninitialized current resource usageParav Pandit
Fixed warning reported by kbuild test robot. When reading current resource usage value, when no resources are allocated, its possible that it can report a uninitialized value for current resource usage. This fix avoids it by initializing it to zero as no resource is allocated. Signed-off-by: Parav Pandit <pandit.parav@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2017-01-10rdmacg: Added rdma cgroup controllerParav Pandit
Added rdma cgroup controller that does accounting, limit enforcement on rdma/IB resources. Added rdma cgroup header file which defines its APIs to perform charging/uncharging functionality. It also defined APIs for RDMA/IB stack for device registration. Devices which are registered will participate in controller functions of accounting and limit enforcements. It define rdmacg_device structure to bind IB stack and RDMA cgroup controller. RDMA resources are tracked using resource pool. Resource pool is per device, per cgroup entity which allows setting up accounting limits on per device basis. Currently resources are defined by the RDMA cgroup. Resource pool is created/destroyed dynamically whenever charging/uncharging occurs respectively and whenever user configuration is done. Its a tradeoff of memory vs little more code space that creates resource pool object whenever necessary, instead of creating them during cgroup creation and device registration time. Signed-off-by: Parav Pandit <pandit.parav@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2016-12-27cgroup: fix RCU related sparse warningsTejun Heo
kn->priv which is a void * is used as a RCU pointer by cgroup. When dereferencing it, it was passing kn->priv to rcu_derefreence() without casting it into a RCU pointer triggering address space mismatch warning from sparse. Fix them. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Fengguang Wu <fengguang.wu@intel.com> Acked-by: Acked-by: Zefan Li <lizefan@huawei.com>
2016-12-27cgroup: move namespace code to kernel/cgroup/namespace.cTejun Heo
get/put_css_set() get exposed in cgroup-internal.h in the process. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Acked-by: Zefan Li <lizefan@huawei.com>
2016-12-27cgroup: rename functions for consistencyTejun Heo
Now that v1 functions are separated out, rename some functions for consistency. cgroup_dfl_base_files -> cgroup_base_files cgroup_legacy_base_files -> cgroup1_base_files cgroup_ssid_no_v1() -> cgroup1_ssid_disabled() cgroup_pidlist_destroy_all -> cgroup1_pidlist_destroy_all() cgroup_release_agent() -> cgroup1_release_agent() check_for_release() -> cgroup1_check_for_release() Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Acked-by: Zefan Li <lizefan@huawei.com>
2016-12-27cgroup: move v1 mount functions to kernel/cgroup/cgroup-v1.cTejun Heo
Now that the v1 mount code is split into separate functions, move them to kernel/cgroup/cgroup-v1.c along with the mount option handling code. As this puts all v1-only kernfs_syscall_ops in cgroup-v1.c, move cgroup1_kf_syscall_ops to cgroup-v1.c too. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Acked-by: Zefan Li <lizefan@huawei.com>
2016-12-27cgroup: separate out cgroup1_kf_syscall_opsTejun Heo
Currently, cgroup_kf_syscall_ops is shared by v1 and v2 and the specific methods test the version and take different actions. Split out v1 functions and put them in cgroup1_kf_syscall_ops and remove the now unnecessary explicit branches in specific methods. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Acked-by: Zefan Li <lizefan@huawei.com>
2016-12-27cgroup: refactor mount path and clearly distinguish v1 and v2 pathsTejun Heo
While sharing some mechanisms, the mount paths of v1 and v2 are substantially different. Their implementations were mixed in cgroup_mount(). This patch splits them out so that they're easier to follow and organize. This patch causes one functional change - the WARN_ON(new_sb) gets lost. This is because the actual mounting gets moved to cgroup_do_mount() and thus @new_sb is no longer accessible by default to cgroup1_mount(). While we can add it as an explicit out parameter to cgroup_do_mount(), this part of code hasn't changed and the warning hasn't triggered for quite a while. Dropping it should be fine. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Acked-by: Zefan Li <lizefan@huawei.com>
2016-12-27cgroup: move cgroup v1 specific code to kernel/cgroup/cgroup-v1.cTejun Heo
cgroup.c is getting too unwieldy. Let's move out cgroup v1 specific code along with the debug controller into kernel/cgroup/cgroup-v1.c. v2: cgroup_mutex and css_set_lock made available in cgroup-internal.h regardless of CONFIG_PROVE_RCU. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Acked-by: Zefan Li <lizefan@huawei.com>
2016-12-27cgroup: move cgroup files under kernel/cgroup/Tejun Heo
They're growing to be too many and planned to get split further. Move them under their own directory. kernel/cgroup.c -> kernel/cgroup/cgroup.c kernel/cgroup_freezer.c -> kernel/cgroup/freezer.c kernel/cgroup_pids.c -> kernel/cgroup/pids.c kernel/cpuset.c -> kernel/cgroup/cpuset.c Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Acked-by: Zefan Li <lizefan@huawei.com>