aboutsummaryrefslogtreecommitdiffstats
path: root/fs/proc/base.c
AgeCommit message (Collapse)Author
2019-03-08proc, oom: do not report alien mms when setting oom_score_adjMichal Hocko
commit b2b469939e93458753cfbf8282ad52636495965e upstream. Tetsuo has reported that creating a thousands of processes sharing MM without SIGHAND (aka alien threads) and setting /proc/<pid>/oom_score_adj will swamp the kernel log and takes ages [1] to finish. This is especially worrisome that all that printing is done under RCU lock and this can potentially trigger RCU stall or softlockup detector. The primary reason for the printk was to catch potential users who might depend on the behavior prior to 44a70adec910 ("mm, oom_adj: make sure processes sharing mm have same view of oom_score_adj") but after more than 2 years without a single report I guess it is safe to simply remove the printk altogether. The next step should be moving oom_score_adj over to the mm struct and remove all the tasks crawling as suggested by [2] [1] http://lkml.kernel.org/r/97fce864-6f75-bca5-14bc-12c9f890e740@i-love.sakura.ne.jp [2] http://lkml.kernel.org/r/20190117155159.GA4087@dhcp22.suse.cz Link: http://lkml.kernel.org/r/20190212102129.26288-1-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: David Rientjes <rientjes@google.com> Cc: Yong-Taek Lee <ytk.lee@samsung.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2018-10-10proc: restrict kernel stack dumps to rootJann Horn
commit f8a00cef17206ecd1b30d3d9f99e10d9fa707aa7 upstream. Currently, you can use /proc/self/task/*/stack to cause a stack walk on a task you control while it is running on another CPU. That means that the stack can change under the stack walker. The stack walker does have guards against going completely off the rails and into random kernel memory, but it can interpret random data from your kernel stack as instruction pointers and stack pointers. This can cause exposure of kernel stack contents to userspace. Restrict the ability to inspect kernel stacks of arbitrary tasks to root in order to prevent a local attacker from exploiting racy stack unwinding to leak kernel task stack contents. See the added comment for a longer rationale. There don't seem to be any users of this userspace API that can't gracefully bail out if reading from the file fails. Therefore, I believe that this change is unlikely to break things. In the case that this patch does end up needing a revert, the next-best solution might be to fake a single-entry stack based on wchan. Link: http://lkml.kernel.org/r/20180927153316.200286-1-jannh@google.com Fixes: 2ec220e27f50 ("proc: add /proc/*/stack") Signed-off-by: Jann Horn <jannh@google.com> Acked-by: Kees Cook <keescook@chromium.org> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Ken Chen <kenchen@google.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Laura Abbott <labbott@redhat.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: "H . Peter Anvin" <hpa@zytor.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-06-20proc: fix missing final NUL in get_mm_cmdline() rewriteLinus Torvalds
The rewrite of the cmdline fetching missed the fact that we used to also return the final terminating NUL character of the last argument. I hadn't noticed, and none of the tools I tested cared, but something obviously must care, because Michal Kubecek noticed the change in behavior. Tweak the "find the end" logic to actually include the NUL character, and once past the eend of argv, always start the strnlen() at the expected (original) argument end. This whole "allow people to rewrite their arguments in place" is a nasty hack and requires that odd slop handling at the end of the argv array, but it's our traditional model, so we continue to support it. Repored-and-bisected-by: Michal Kubecek <mkubecek@suse.cz> Reviewed-and-tested-by: Michal Kubecek <mkubecek@suse.cz> Cc: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-06-15proc: skip branch in /proc/*/* lookupAlexey Dobriyan
Code is structured like this: for ( ... p < last; p++) { if (memcmp == 0) break; } if (p >= last) ERROR OK gcc doesn't see that if if lookup succeeds than post loop branch will never be taken and skip it. [akpm@linux-foundation.org: proc_pident_instantiate() no longer takes an inode*] Link: http://lkml.kernel.org/r/20180423213954.GD9043@avx2 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-06-12treewide: kmalloc() -> kmalloc_array()Kees Cook
The kmalloc() function has a 2-factor argument form, kmalloc_array(). This patch replaces cases of: kmalloc(a * b, gfp) with: kmalloc_array(a * b, gfp) as well as handling cases of: kmalloc(a * b * c, gfp) with: kmalloc(array3_size(a, b, c), gfp) as it's slightly less ugly than: kmalloc_array(array_size(a, b), c, gfp) This does, however, attempt to ignore constant size factors like: kmalloc(4 * 1024, gfp) though any constants defined via macros get caught up in the conversion. Any factors with a sizeof() of "unsigned char", "char", and "u8" were dropped, since they're redundant. The tools/ directory was manually excluded, since it has its own implementation of kmalloc(). The Coccinelle script used for this was: // Fix redundant parens around sizeof(). @@ type TYPE; expression THING, E; @@ ( kmalloc( - (sizeof(TYPE)) * E + sizeof(TYPE) * E , ...) | kmalloc( - (sizeof(THING)) * E + sizeof(THING) * E , ...) ) // Drop single-byte sizes and redundant parens. @@ expression COUNT; typedef u8; typedef __u8; @@ ( kmalloc( - sizeof(u8) * (COUNT) + COUNT , ...) | kmalloc( - sizeof(__u8) * (COUNT) + COUNT , ...) | kmalloc( - sizeof(char) * (COUNT) + COUNT , ...) | kmalloc( - sizeof(unsigned char) * (COUNT) + COUNT , ...) | kmalloc( - sizeof(u8) * COUNT + COUNT , ...) | kmalloc( - sizeof(__u8) * COUNT + COUNT , ...) | kmalloc( - sizeof(char) * COUNT + COUNT , ...) | kmalloc( - sizeof(unsigned char) * COUNT + COUNT , ...) ) // 2-factor product with sizeof(type/expression) and identifier or constant. @@ type TYPE; expression THING; identifier COUNT_ID; constant COUNT_CONST; @@ ( - kmalloc + kmalloc_array ( - sizeof(TYPE) * (COUNT_ID) + COUNT_ID, sizeof(TYPE) , ...) | - kmalloc + kmalloc_array ( - sizeof(TYPE) * COUNT_ID + COUNT_ID, sizeof(TYPE) , ...) | - kmalloc + kmalloc_array ( - sizeof(TYPE) * (COUNT_CONST) + COUNT_CONST, sizeof(TYPE) , ...) | - kmalloc + kmalloc_array ( - sizeof(TYPE) * COUNT_CONST + COUNT_CONST, sizeof(TYPE) , ...) | - kmalloc + kmalloc_array ( - sizeof(THING) * (COUNT_ID) + COUNT_ID, sizeof(THING) , ...) | - kmalloc + kmalloc_array ( - sizeof(THING) * COUNT_ID + COUNT_ID, sizeof(THING) , ...) | - kmalloc + kmalloc_array ( - sizeof(THING) * (COUNT_CONST) + COUNT_CONST, sizeof(THING) , ...) | - kmalloc + kmalloc_array ( - sizeof(THING) * COUNT_CONST + COUNT_CONST, sizeof(THING) , ...) ) // 2-factor product, only identifiers. @@ identifier SIZE, COUNT; @@ - kmalloc + kmalloc_array ( - SIZE * COUNT + COUNT, SIZE , ...) // 3-factor product with 1 sizeof(type) or sizeof(expression), with // redundant parens removed. @@ expression THING; identifier STRIDE, COUNT; type TYPE; @@ ( kmalloc( - sizeof(TYPE) * (COUNT) * (STRIDE) + array3_size(COUNT, STRIDE, sizeof(TYPE)) , ...) | kmalloc( - sizeof(TYPE) * (COUNT) * STRIDE + array3_size(COUNT, STRIDE, sizeof(TYPE)) , ...) | kmalloc( - sizeof(TYPE) * COUNT * (STRIDE) + array3_size(COUNT, STRIDE, sizeof(TYPE)) , ...) | kmalloc( - sizeof(TYPE) * COUNT * STRIDE + array3_size(COUNT, STRIDE, sizeof(TYPE)) , ...) | kmalloc( - sizeof(THING) * (COUNT) * (STRIDE) + array3_size(COUNT, STRIDE, sizeof(THING)) , ...) | kmalloc( - sizeof(THING) * (COUNT) * STRIDE + array3_size(COUNT, STRIDE, sizeof(THING)) , ...) | kmalloc( - sizeof(THING) * COUNT * (STRIDE) + array3_size(COUNT, STRIDE, sizeof(THING)) , ...) | kmalloc( - sizeof(THING) * COUNT * STRIDE + array3_size(COUNT, STRIDE, sizeof(THING)) , ...) ) // 3-factor product with 2 sizeof(variable), with redundant parens removed. @@ expression THING1, THING2; identifier COUNT; type TYPE1, TYPE2; @@ ( kmalloc( - sizeof(TYPE1) * sizeof(TYPE2) * COUNT + array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2)) , ...) | kmalloc( - sizeof(TYPE1) * sizeof(THING2) * (COUNT) + array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2)) , ...) | kmalloc( - sizeof(THING1) * sizeof(THING2) * COUNT + array3_size(COUNT, sizeof(THING1), sizeof(THING2)) , ...) | kmalloc( - sizeof(THING1) * sizeof(THING2) * (COUNT) + array3_size(COUNT, sizeof(THING1), sizeof(THING2)) , ...) | kmalloc( - sizeof(TYPE1) * sizeof(THING2) * COUNT + array3_size(COUNT, sizeof(TYPE1), sizeof(THING2)) , ...) | kmalloc( - sizeof(TYPE1) * sizeof(THING2) * (COUNT) + array3_size(COUNT, sizeof(TYPE1), sizeof(THING2)) , ...) ) // 3-factor product, only identifiers, with redundant parens removed. @@ identifier STRIDE, SIZE, COUNT; @@ ( kmalloc( - (COUNT) * STRIDE * SIZE + array3_size(COUNT, STRIDE, SIZE) , ...) | kmalloc( - COUNT * (STRIDE) * SIZE + array3_size(COUNT, STRIDE, SIZE) , ...) | kmalloc( - COUNT * STRIDE * (SIZE) + array3_size(COUNT, STRIDE, SIZE) , ...) | kmalloc( - (COUNT) * (STRIDE) * SIZE + array3_size(COUNT, STRIDE, SIZE) , ...) | kmalloc( - COUNT * (STRIDE) * (SIZE) + array3_size(COUNT, STRIDE, SIZE) , ...) | kmalloc( - (COUNT) * STRIDE * (SIZE) + array3_size(COUNT, STRIDE, SIZE) , ...) | kmalloc( - (COUNT) * (STRIDE) * (SIZE) + array3_size(COUNT, STRIDE, SIZE) , ...) | kmalloc( - COUNT * STRIDE * SIZE + array3_size(COUNT, STRIDE, SIZE) , ...) ) // Any remaining multi-factor products, first at least 3-factor products, // when they're not all constants... @@ expression E1, E2, E3; constant C1, C2, C3; @@ ( kmalloc(C1 * C2 * C3, ...) | kmalloc( - (E1) * E2 * E3 + array3_size(E1, E2, E3) , ...) | kmalloc( - (E1) * (E2) * E3 + array3_size(E1, E2, E3) , ...) | kmalloc( - (E1) * (E2) * (E3) + array3_size(E1, E2, E3) , ...) | kmalloc( - E1 * E2 * E3 + array3_size(E1, E2, E3) , ...) ) // And then all remaining 2 factors products when they're not all constants, // keeping sizeof() as the second factor argument. @@ expression THING, E1, E2; type TYPE; constant C1, C2, C3; @@ ( kmalloc(sizeof(THING) * C2, ...) | kmalloc(sizeof(TYPE) * C2, ...) | kmalloc(C1 * C2 * C3, ...) | kmalloc(C1 * C2, ...) | - kmalloc + kmalloc_array ( - sizeof(TYPE) * (E2) + E2, sizeof(TYPE) , ...) | - kmalloc + kmalloc_array ( - sizeof(TYPE) * E2 + E2, sizeof(TYPE) , ...) | - kmalloc + kmalloc_array ( - sizeof(THING) * (E2) + E2, sizeof(THING) , ...) | - kmalloc + kmalloc_array ( - sizeof(THING) * E2 + E2, sizeof(THING) , ...) | - kmalloc + kmalloc_array ( - (E1) * E2 + E1, E2 , ...) | - kmalloc + kmalloc_array ( - (E1) * (E2) + E1, E2 , ...) | - kmalloc + kmalloc_array ( - E1 * E2 + E1, E2 , ...) ) Signed-off-by: Kees Cook <keescook@chromium.org>
2018-06-09Merge branch 'proc-cmdline'Linus Torvalds
Merge proc_cmdline simplifications. This re-writes the get_mm_cmdline() logic to be rather simpler than it used to be, and makes the semantics for "cmdline goes past the end of the original area" more natural. You _can_ use prctl(PR_SET_MM) to just point your command line somewhere else entirely, but the traditional model is to just edit things in place and that still needs to continue to work. At least this way the code makes some sense. * proc-cmdline: fs/proc: simplify and clarify get_mm_cmdline() function fs/proc: re-factor proc_pid_cmdline_read() a bit
2018-06-08Merge branch 'work.lookup' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull proc_fill_cache regression fix from Al Viro: "Regression fix for proc_fill_cache() braino introduced when switching instantiate() callback to d_splice_alias()" * 'work.lookup' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: fix proc_fill_cache() in case of d_alloc_parallel() failure
2018-06-08fix proc_fill_cache() in case of d_alloc_parallel() failureAl Viro
If d_alloc_parallel() returns ERR_PTR(...), we don't want to dput() that. Small reorganization allows to have all error-in-lookup cases rejoin the main codepath after dput(child), avoiding the entire problem. Spotted-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Fixes: 0168b9e38c42 "procfs: switch instantiate_t to d_splice_alias()" Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-06-07proc: use "unsigned int" for /proc/*/stackAlexey Dobriyan
struct stack_trace::nr_entries is defined as "unsigned int" (YAY!) so the iterator should be unsigned as well. It saves 1 byte of code or something like that. Link: http://lkml.kernel.org/r/20180423215248.GG9043@avx2 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-06-07proc: use "unsigned int" in proc_fill_cache()Alexey Dobriyan
All those lengths are unsigned as they should be. Link: http://lkml.kernel.org/r/20180423213751.GC9043@avx2 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-06-07proc: smaller RCU section in ->getattr()Alexey Dobriyan
struct kstat is thread local. Link: http://lkml.kernel.org/r/20180423213626.GB9043@avx2 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-06-07proc: deduplicate /proc/*/cmdline implementationAlexey Dobriyan
Code can be sonsolidated if a dummy region of 0 length is used in normal case of \0-separated command line: 1) [arg_start, arg_end) + [dummy len=0] 2) [arg_start, arg_end) + [env_start, env_end) Link: http://lkml.kernel.org/r/20180221193335.GB28678@avx2 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Shevchenko <andy.shevchenko@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-06-07proc: simpler iterations for /proc/*/cmdlineAlexey Dobriyan
"rv" variable is used both as a counter of bytes transferred and an error value holder but it can be reduced solely to error values if original start of userspace buffer is stashed and used at the very end. [akpm@linux-foundation.org: simplify cleanup code] Link: http://lkml.kernel.org/r/20180221193009.GA28678@avx2 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Shevchenko <andy.shevchenko@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-06-07proc: somewhat simpler code for /proc/*/cmdlineAlexey Dobriyan
"final" variable is OK but we can get away with less lines. Link: http://lkml.kernel.org/r/20180221192751.GC28548@avx2 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Shevchenko <andy.shevchenko@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-06-07proc: more "unsigned int" in /proc/*/cmdlineAlexey Dobriyan
access_remote_vm() doesn't return negative errors, it returns number of bytes read/written (0 if error occurs). This allows to delete some comparisons which never trigger. Reuse "nr_read" variable while I'm at it. Link: http://lkml.kernel.org/r/20180221192605.GB28548@avx2 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Shevchenko <andy.shevchenko@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-06-07mm: introduce arg_lock to protect arg_start|end and env_start|end in mm_structYang Shi
mmap_sem is on the hot path of kernel, and it very contended, but it is abused too. It is used to protect arg_start|end and evn_start|end when reading /proc/$PID/cmdline and /proc/$PID/environ, but it doesn't make sense since those proc files just expect to read 4 values atomically and not related to VM, they could be set to arbitrary values by C/R. And, the mmap_sem contention may cause unexpected issue like below: INFO: task ps:14018 blocked for more than 120 seconds. Tainted: G E 4.9.79-009.ali3000.alios7.x86_64 #1 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. ps D 0 14018 1 0x00000004 Call Trace: schedule+0x36/0x80 rwsem_down_read_failed+0xf0/0x150 call_rwsem_down_read_failed+0x18/0x30 down_read+0x20/0x40 proc_pid_cmdline_read+0xd9/0x4e0 __vfs_read+0x37/0x150 vfs_read+0x96/0x130 SyS_read+0x55/0xc0 entry_SYSCALL_64_fastpath+0x1a/0xc5 Both Alexey Dobriyan and Michal Hocko suggested to use dedicated lock for them to mitigate the abuse of mmap_sem. So, introduce a new spinlock in mm_struct to protect the concurrent access to arg_start|end, env_start|end and others, as well as replace write map_sem to read to protect the race condition between prctl and sys_brk which might break check_data_rlimit(), and makes prctl more friendly to other VM operations. This patch just eliminates the abuse of mmap_sem, but it can't resolve the above hung task warning completely since the later access_remote_vm() call needs acquire mmap_sem. The mmap_sem scalability issue will be solved in the future. [yang.shi@linux.alibaba.com: add comment about mmap_sem and arg_lock] Link: http://lkml.kernel.org/r/1524077799-80690-1-git-send-email-yang.shi@linux.alibaba.com Link: http://lkml.kernel.org/r/1523730291-109696-1-git-send-email-yang.shi@linux.alibaba.com Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com> Reviewed-by: Cyrill Gorcunov <gorcunov@openvz.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mateusz Guzik <mguzik@redhat.com> Cc: Kirill Tkhai <ktkhai@virtuozzo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-06-05Merge branch 'for-4.18' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wqLinus Torvalds
Pull workqueue updates from Tejun Heo: - make kworkers report the workqueue it is executing or has executed most recently in /proc/PID/comm (so they show up in ps/top) - CONFIG_SMP shuffle to move stuff which isn't necessary for UP builds inside CONFIG_SMP. * 'for-4.18' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: workqueue: move function definitions within CONFIG_SMP block workqueue: Make sure struct worker is accessible for wq_worker_comm() workqueue: Show the latest workqueue name in /proc/PID/{comm,stat,status} proc: Consolidate task->comm formatting into proc_task_name() workqueue: Set worker->desc to workqueue name by default workqueue: Make worker_attach/detach_pool() update worker->pool workqueue: Replace pool->attach_mutex with global wq_pool_attach_mutex
2018-06-04Merge branch 'work.lookup' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull dcache lookup cleanups from Al Viro: "Cleaning ->lookup() instances up - mostly d_splice_alias() conversions" * 'work.lookup' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (29 commits) switch the rest of procfs lookups to d_splice_alias() procfs: switch instantiate_t to d_splice_alias() don't bother with tid_fd_revalidate() in lookups proc_lookupfd_common(): don't bother with instantiate unless the file is open procfs: get rid of ancient BS in pid_revalidate() uses cifs_lookup(): switch to d_splice_alias() cifs_lookup(): cifs_get_inode_...() never returns 0 with *inode left NULL 9p: unify paths in v9fs_vfs_lookup() ncp_lookup(): use d_splice_alias() hfsplus: switch to d_splice_alias() hfs: don't allow mounting over .../rsrc hfs: use d_splice_alias() omfs_lookup(): report IO errors, use d_splice_alias() orangefs_lookup: simplify openpromfs: switch to d_splice_alias() xfs_vn_lookup: simplify a bit adfs_lookup: do not fail with ENOENT on negatives, use d_splice_alias() adfs_lookup_byname: .. *is* taken care of in fs/namei.c romfs_lookup: switch to d_splice_alias() qnx6_lookup: switch to d_splice_alias() ...
2018-06-04Merge branch 'hch.procfs' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull procfs updates from Al Viro: "Christoph's proc_create_... cleanups series" * 'hch.procfs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (44 commits) xfs, proc: hide unused xfs procfs helpers isdn/gigaset: add back gigaset_procinfo assignment proc: update SIZEOF_PDE_INLINE_NAME for the new pde fields tty: replace ->proc_fops with ->proc_show ide: replace ->proc_fops with ->proc_show ide: remove ide_driver_proc_write isdn: replace ->proc_fops with ->proc_show atm: switch to proc_create_seq_private atm: simplify procfs code bluetooth: switch to proc_create_seq_data netfilter/x_tables: switch to proc_create_seq_private netfilter/xt_hashlimit: switch to proc_create_{seq,single}_data neigh: switch to proc_create_seq_data hostap: switch to proc_create_{seq,single}_data bonding: switch to proc_create_seq_data rtc/proc: switch to proc_create_single_data drbd: switch to proc_create_single resource: switch to proc_create_seq_data staging/rtl8192u: simplify procfs code jfs: simplify procfs code ...
2018-05-26procfs: switch instantiate_t to d_splice_alias()Al Viro
... and get rid of pointless struct inode *dir argument of those, while we are at it. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-05-22procfs: get rid of ancient BS in pid_revalidate() usesAl Viro
First of all, calling pid_revalidate() in the end of <pid>/* lookups is *not* about closing any kind of races; that used to be true once upon a time, but these days those comments are actively misleading. Especially since pid_revalidate() doesn't even do d_drop() on failure anymore. It doesn't matter, anyway, since once pid_revalidate() starts returning false, ->d_delete() of those dentries starts saying "don't keep"; they won't get stuck in dcache any longer than they are pinned. These calls cannot be just removed, though - the side effect of pid_revalidate() (updating i_uid/i_gid/etc.) is what we are calling it for here. Let's separate the "update ownership" into a new helper (pid_update_inode()) and use it, both in lookups and in pid_revalidate() itself. The comments in pid_revalidate() are also out of date - they refer to the time when pid_revalidate() used to call d_drop() directly... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-05-18proc: Consolidate task->comm formatting into proc_task_name()Tejun Heo
proc shows task->comm in three places - comm, stat, status - and each is fetching and formatting task->comm slighly differently. This patch renames task_name() to proc_task_name(), makes it more generic, and updates all three paths to use it. This will enable expanding comm reporting for workqueue workers. Signed-off-by: Tejun Heo <tj@kernel.org>
2018-05-17fs/proc: simplify and clarify get_mm_cmdline() functionLinus Torvalds
We have some very odd semantics for reading the command line through /proc, because we allow people to rewrite their own command line pretty much at will, and things get positively funky when you extend your command line past the point that used to be the end of the command line, and is now in the environment variable area. But our weird semantics doesn't mean that we should write weird and complex code to handle them. So re-write get_mm_cmdline() to be much simpler, and much more explicit about what it is actually doing and why. And avoid the extra check for "is there a NUL character at the end of the command line where I expect one to be", by simply making the NUL character handling be part of the normal "once you hit the end of the command line, stop at the first NUL character" logic. It's quite possible that we should stop the crazy "walk into environment" entirely, but happily it's not really the usual case. NOTE! We tried to really simplify and limit our odd cmdline parsing some time ago, but people complained. See commit c2c0bb44620d ("proc: fix PAGE_SIZE limit of /proc/$PID/cmdline") for details about why we have this complexity. Cc: Tejun Heo <tj@kernel.org> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Jarod Wilson <jarod@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-05-17fs/proc: re-factor proc_pid_cmdline_read() a bitLinus Torvalds
This is a pure refactoring of the function, preparing for some further cleanups. The thing was pretty illegible, and the core functionality still is, but now the core loop is a bit more isolated from the thing that goes on around it. This was "inspired" by the confluence of kworker workqueue name cleanups by Tejun, currently scheduled for 4.18, and commit 7f7ccc2ccc2e ("proc: do not access cmdline nor environ from file-backed areas"). Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-05-17proc: do not access cmdline nor environ from file-backed areasWilly Tarreau
proc_pid_cmdline_read() and environ_read() directly access the target process' VM to retrieve the command line and environment. If this process remaps these areas onto a file via mmap(), the requesting process may experience various issues such as extra delays if the underlying device is slow to respond. Let's simply refuse to access file-backed areas in these functions. For this we add a new FOLL_ANON gup flag that is passed to all calls to access_remote_vm(). The code already takes care of such failures (including unmapped areas). Accesses via /proc/pid/mem were not changed though. This was assigned CVE-2018-1120. Note for stable backports: the patch may apply to kernels prior to 4.11 but silently miss one location; it must be checked that no call to access_remote_vm() keeps zero as the last argument. Reported-by: Qualys Security Advisory <qsa@qualys.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Oleg Nesterov <oleg@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Willy Tarreau <w@1wt.eu> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-05-16proc: introduce a proc_pid_ns helperChristoph Hellwig
Factor out retrieving the per-sb pid namespaces from the sb private data into an easier to understand helper. Suggested-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-04-20proc: revalidate kernel thread inodes to root:rootAlexey Dobriyan
task_dump_owner() has the following code: mm = task->mm; if (mm) { if (get_dumpable(mm) != SUID_DUMP_USER) { uid = ... } } Check for ->mm is buggy -- kernel thread might be borrowing mm and inode will go to some random uid:gid pair. Link: http://lkml.kernel.org/r/20180412220109.GA20978@avx2 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11proc: do mmput ASAP for /proc/*/map_filesAlexey Dobriyan
mm_struct is not needed while printing as all the data was already extracted. Link: http://lkml.kernel.org/r/20180309223120.GC3843@avx2 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11proc: fix /proc/*/map_files lookup some moreAlexey Dobriyan
I totally forgot that _parse_integer() accepts arbitrary amount of leading zeroes leading to the following lookups: OK # readlink /proc/1/map_files/56427ecba000-56427eddc000 /lib/systemd/systemd bogus # readlink /proc/1/map_files/00000000000056427ecba000-56427eddc000 /lib/systemd/systemd # readlink /proc/1/map_files/56427ecba000-00000000000056427eddc000 /lib/systemd/systemd Link: http://lkml.kernel.org/r/20180303215130.GA23480@avx2 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Reviewed-by: Cyrill Gorcunov <gorcunov@gmail.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Pavel Emelyanov <xemul@virtuozzo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11proc: use set_puts() at /proc/*/wchanAlexey Dobriyan
Link: http://lkml.kernel.org/r/20180217072011.GB16074@avx2 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Shevchenko <andy.shevchenko@gmail.com> Cc: Rasmus Villemoes <rasmus.villemoes@prevas.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11proc: check permissions earlier for /proc/*/wchanAlexey Dobriyan
get_wchan() accesses stack page before permissions are checked, let's not play this game. Link: http://lkml.kernel.org/r/20180217071923.GA16074@avx2 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Shevchenko <andy.shevchenko@gmail.com> Cc: Rasmus Villemoes <rasmus.villemoes@prevas.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-03-16procfs: remove CONFIG_HARDWALL dependencyArnd Bergmann
Hardwall is a tile specific feature, and with the removal of the tile architecture, this has become dead code, so let's remove it. Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2018-02-06fs/proc: use __ro_after_initAlexey Dobriyan
/proc/self inode numbers, value of proc_inode_cache and st_nlink of /proc/$TGID are fixed constants. Link: http://lkml.kernel.org/r/20180103184707.GA31849@avx2 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-02-06proc: less memory for /proc/*/map_files readdirAlexey Dobriyan
dentry name can be evaluated later, right before calling into VFS. Also, spend less time under ->mmap_sem. Link: http://lkml.kernel.org/r/20171110163034.GA2534@avx2 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-02-06proc: fix /proc/*/map_files lookupAlexey Dobriyan
Current code does: if (sscanf(dentry->d_name.name, "%lx-%lx", start, end) != 2) However sscanf() is broken garbage. It silently accepts whitespace between format specifiers (did you know that?). It silently accepts valid strings which result in integer overflow. Do not use sscanf() for any even remotely reliable parsing code. OK # readlink '/proc/1/map_files/55a23af39000-55a23b05b000' /lib/systemd/systemd broken # readlink '/proc/1/map_files/ 55a23af39000-55a23b05b000' /lib/systemd/systemd broken # readlink '/proc/1/map_files/55a23af39000-55a23b05b000 ' /lib/systemd/systemd very broken # readlink '/proc/1/map_files/1000000000000000055a23af39000-55a23b05b000' /lib/systemd/systemd Andrei said: : This patch breaks criu. It was a bug in criu. And this bug is on a minor : path, which works when memfd_create() isn't available. It is a reason why : I ask to not backport this patch to stable kernels. : : In CRIU this bug can be triggered, only if this patch will be backported : to a kernel which version is lower than v3.16. Link: http://lkml.kernel.org/r/20171120212706.GA14325@avx2 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: Andrei Vagin <avagin@virtuozzo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-02-06proc: don't use READ_ONCE/WRITE_ONCE for /proc/*/fail-nthAlexey Dobriyan
READ_ONCE and WRITE_ONCE are useless when there is only one read/write is being made. Link: http://lkml.kernel.org/r/20171120204033.GA9446@avx2 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Cc: Akinobu Mita <akinobu.mita@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-02-06proc: use %u for pid printing and slightly less stackAlexey Dobriyan
PROC_NUMBUF is 13 which is enough for "negative int + \n + \0". However PIDs and TGIDs are never negative and newline is not a concern, so use just 10 per integer. Link: http://lkml.kernel.org/r/20171120203005.GA27743@avx2 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Cc: Alexander Viro <viro@ftp.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-12-06proc: show si_ptr in /proc/<pid>/timers without hashingLinus Torvalds
It's a user pointer, and while the permissions of the file are pretty questionable (should it really be readable to everybody), hashing the pointer isn't going to be the solution. We should take a closer look at more of the /proc/<pid> file permissions in general. Sure, we do want many of them to often be readable (for 'ps' and friends), but I think we should probably do a few conversions from S_IRUGO to S_IRUSR. Reported-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-27proc: don't report kernel addresses in /proc/<pid>/stackLinus Torvalds
This just changes the file to report them as zero, although maybe even that could be removed. I checked, and at least procps doesn't actually seem to parse the 'stack' file at all. And since the file doesn't necessarily even exist (it requires CONFIG_STACKTRACE), possibly other tools don't really use it either. That said, in case somebody parses it with tools, just having that zero there should keep such tools happy. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-22Merge branch 'work.whack-a-mole' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull mode_t whack-a-mole from Al Viro: "For all internal uses we want umode_t, which is arch-independent; mode_t (or __kernel_mode_t, for that matter) is wrong outside of userland ABI. Unfortunately, that crap keeps coming back and needs to be put down from time to time..." * 'work.whack-a-mole' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: mode_t whack-a-mole: task_dump_owner()
2017-11-02License cleanup: add SPDX GPL-2.0 license identifier to files with no licenseGreg Kroah-Hartman
Many source files in the tree are missing licensing information, which makes it harder for compliance tools to determine the correct license. By default all files without license information are under the default license of the kernel, which is GPL version 2. Update the files which contain no license information with the 'GPL-2.0' SPDX license identifier. The SPDX identifier is a legally binding shorthand, which can be used instead of the full boiler plate text. This patch is based on work done by Thomas Gleixner and Kate Stewart and Philippe Ombredanne. How this work was done: Patches were generated and checked against linux-4.14-rc6 for a subset of the use cases: - file had no licensing information it it. - file was a */uapi/* one with no licensing information in it, - file was a */uapi/* one with existing licensing information, Further patches will be generated in subsequent months to fix up cases where non-standard license headers were used, and references to license had to be inferred by heuristics based on keywords. The analysis to determine which SPDX License Identifier to be applied to a file was done in a spreadsheet of side by side results from of the output of two independent scanners (ScanCode & Windriver) producing SPDX tag:value files created by Philippe Ombredanne. Philippe prepared the base worksheet, and did an initial spot review of a few 1000 files. The 4.13 kernel was the starting point of the analysis with 60,537 files assessed. Kate Stewart did a file by file comparison of the scanner results in the spreadsheet to determine which SPDX license identifier(s) to be applied to the file. She confirmed any determination that was not immediately clear with lawyers working with the Linux Foundation. Criteria used to select files for SPDX license identifier tagging was: - Files considered eligible had to be source code files. - Make and config files were included as candidates if they contained >5 lines of source - File already had some variant of a license header in it (even if <5 lines). All documentation files were explicitly excluded. The following heuristics were used to determine which SPDX license identifiers to apply. - when both scanners couldn't find any license traces, file was considered to have no license information in it, and the top level COPYING file license applied. For non */uapi/* files that summary was: SPDX license identifier # files ---------------------------------------------------|------- GPL-2.0 11139 and resulted in the first patch in this series. If that file was a */uapi/* path one, it was "GPL-2.0 WITH Linux-syscall-note" otherwise it was "GPL-2.0". Results of that was: SPDX license identifier # files ---------------------------------------------------|------- GPL-2.0 WITH Linux-syscall-note 930 and resulted in the second patch in this series. - if a file had some form of licensing information in it, and was one of the */uapi/* ones, it was denoted with the Linux-syscall-note if any GPL family license was found in the file or had no licensing in it (per prior point). Results summary: SPDX license identifier # files ---------------------------------------------------|------ GPL-2.0 WITH Linux-syscall-note 270 GPL-2.0+ WITH Linux-syscall-note 169 ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) 21 ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) 17 LGPL-2.1+ WITH Linux-syscall-note 15 GPL-1.0+ WITH Linux-syscall-note 14 ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause) 5 LGPL-2.0+ WITH Linux-syscall-note 4 LGPL-2.1 WITH Linux-syscall-note 3 ((GPL-2.0 WITH Linux-syscall-note) OR MIT) 3 ((GPL-2.0 WITH Linux-syscall-note) AND MIT) 1 and that resulted in the third patch in this series. - when the two scanners agreed on the detected license(s), that became the concluded license(s). - when there was disagreement between the two scanners (one detected a license but the other didn't, or they both detected different licenses) a manual inspection of the file occurred. - In most cases a manual inspection of the information in the file resulted in a clear resolution of the license that should apply (and which scanner probably needed to revisit its heuristics). - When it was not immediately clear, the license identifier was confirmed with lawyers working with the Linux Foundation. - If there was any question as to the appropriate license identifier, the file was flagged for further research and to be revisited later in time. In total, over 70 hours of logged manual review was done on the spreadsheet to determine the SPDX license identifiers to apply to the source files by Kate, Philippe, Thomas and, in some cases, confirmation by lawyers working with the Linux Foundation. Kate also obtained a third independent scan of the 4.13 code base from FOSSology, and compared selected files where the other two scanners disagreed against that SPDX file, to see if there was new insights. The Windriver scanner is based on an older version of FOSSology in part, so they are related. Thomas did random spot checks in about 500 files from the spreadsheets for the uapi headers and agreed with SPDX license identifier in the files he inspected. For the non-uapi files Thomas did random spot checks in about 15000 files. In initial set of patches against 4.14-rc6, 3 files were found to have copy/paste license identifier errors, and have been fixed to reflect the correct identifier. Additionally Philippe spent 10 hours this week doing a detailed manual inspection and review of the 12,461 patched files from the initial patch version early this week with: - a full scancode scan run, collecting the matched texts, detected license ids and scores - reviewing anything where there was a license detected (about 500+ files) to ensure that the applied SPDX license was correct - reviewing anything where there was no detection but the patch license was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied SPDX license was correct This produced a worksheet with 20 files needing minor correction. This worksheet was then exported into 3 different .csv files for the different types of files to be modified. These .csv files were then reviewed by Greg. Thomas wrote a script to parse the csv files and add the proper SPDX tag to the file, in the format that the file expected. This script was further refined by Greg based on the output to detect more types of files automatically and to distinguish between header and source .c files (which need different comment types.) Finally Greg ran the script using the .csv files to generate the patches. Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org> Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-09-30mode_t whack-a-mole: task_dump_owner()Al Viro
should be umode_t... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-09-13mm: treewide: remove GFP_TEMPORARY allocation flagMichal Hocko
GFP_TEMPORARY was introduced by commit e12ba74d8ff3 ("Group short-lived and reclaimable kernel allocations") along with __GFP_RECLAIMABLE. It's primary motivation was to allow users to tell that an allocation is short lived and so the allocator can try to place such allocations close together and prevent long term fragmentation. As much as this sounds like a reasonable semantic it becomes much less clear when to use the highlevel GFP_TEMPORARY allocation flag. How long is temporary? Can the context holding that memory sleep? Can it take locks? It seems there is no good answer for those questions. The current implementation of GFP_TEMPORARY is basically GFP_KERNEL | __GFP_RECLAIMABLE which in itself is tricky because basically none of the existing caller provide a way to reclaim the allocated memory. So this is rather misleading and hard to evaluate for any benefits. I have checked some random users and none of them has added the flag with a specific justification. I suspect most of them just copied from other existing users and others just thought it might be a good idea to use without any measuring. This suggests that GFP_TEMPORARY just motivates for cargo cult usage without any reasoning. I believe that our gfp flags are quite complex already and especially those with highlevel semantic should be clearly defined to prevent from confusion and abuse. Therefore I propose dropping GFP_TEMPORARY and replace all existing users to simply use GFP_KERNEL. Please note that SLAB users with shrinkers will still get __GFP_RECLAIMABLE heuristic and so they will be placed properly for memory fragmentation prevention. I can see reasons we might want some gfp flag to reflect shorterm allocations but I propose starting from a clear semantic definition and only then add users with proper justification. This was been brought up before LSF this year by Matthew [1] and it turned out that GFP_TEMPORARY really doesn't have a clear semantic. It seems to be a heuristic without any measured advantage for most (if not all) its current users. The follow up discussion has revealed that opinions on what might be temporary allocation differ a lot between developers. So rather than trying to tweak existing users into a semantic which they haven't expected I propose to simply remove the flag and start from scratch if we really need a semantic for short term allocations. [1] http://lkml.kernel.org/r/20170118054945.GD18349@bombadil.infradead.org [akpm@linux-foundation.org: fix typo] [akpm@linux-foundation.org: coding-style fixes] [sfr@canb.auug.org.au: drm/i915: fix up] Link: http://lkml.kernel.org/r/20170816144703.378d4f4d@canb.auug.org.au Link: http://lkml.kernel.org/r/20170728091904.14627-1-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Acked-by: Mel Gorman <mgorman@suse.de> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Matthew Wilcox <willy@infradead.org> Cc: Neil Brown <neilb@suse.de> Cc: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-06mm: add /proc/pid/smaps_rollupDaniel Colascione
/proc/pid/smaps_rollup is a new proc file that improves the performance of user programs that determine aggregate memory statistics (e.g., total PSS) of a process. Android regularly "samples" the memory usage of various processes in order to balance its memory pool sizes. This sampling process involves opening /proc/pid/smaps and summing certain fields. For very large processes, sampling memory use this way can take several hundred milliseconds, due mostly to the overhead of the seq_printf calls in task_mmu.c. smaps_rollup improves the situation. It contains most of the fields of /proc/pid/smaps, but instead of a set of fields for each VMA, smaps_rollup instead contains one synthetic smaps-format entry representing the whole process. In the single smaps_rollup synthetic entry, each field is the summation of the corresponding field in all of the real-smaps VMAs. Using a common format for smaps_rollup and smaps allows userspace parsers to repurpose parsers meant for use with non-rollup smaps for smaps_rollup, and it allows userspace to switch between smaps_rollup and smaps at runtime (say, based on the availability of smaps_rollup in a given kernel) with minimal fuss. By using smaps_rollup instead of smaps, a caller can avoid the significant overhead of formatting, reading, and parsing each of a large process's potentially very numerous memory mappings. For sampling system_server's PSS in Android, we measured a 12x speedup, representing a savings of several hundred milliseconds. One alternative to a new per-process proc file would have been including PSS information in /proc/pid/status. We considered this option but thought that PSS would be too expensive (by a few orders of magnitude) to collect relative to what's already emitted as part of /proc/pid/status, and slowing every user of /proc/pid/status for the sake of readers that happen to want PSS feels wrong. The code itself works by reusing the existing VMA-walking framework we use for regular smaps generation and keeping the mem_size_stats structure around between VMA walks instead of using a fresh one for each VMA. In this way, summation happens automatically. We let seq_file walk over the VMAs just as it does for regular smaps and just emit nothing to the seq_file until we hit the last VMA. Benchmarks: using smaps: iterations:1000 pid:1163 pss:220023808 0m29.46s real 0m08.28s user 0m20.98s system using smaps_rollup: iterations:1000 pid:1163 pss:220702720 0m04.39s real 0m00.03s user 0m04.31s system We're using the PSS samples we collect asynchronously for system-management tasks like fine-tuning oom_adj_score, memory use tracking for debugging, application-level memory-use attribution, and deciding whether we want to kill large processes during system idle maintenance windows. Android has been using PSS for these purposes for a long time; as the average process VMA count has increased and and devices become more efficiency-conscious, PSS-collection inefficiency has started to matter more. IMHO, it'd be a lot safer to optimize the existing PSS-collection model, which has been fine-tuned over the years, instead of changing the memory tracking approach entirely to work around smaps-generation inefficiency. Tim said: : There are two main reasons why Android gathers PSS information: : : 1. Android devices can show the user the amount of memory used per : application via the settings app. This is a less important use case. : : 2. We log PSS to help identify leaks in applications. We have found : an enormous number of bugs (in the Android platform, in Google's own : apps, and in third-party applications) using this data. : : To do this, system_server (the main process in Android userspace) will : sample the PSS of a process three seconds after it changes state (for : example, app is launched and becomes the foreground application) and about : every ten minutes after that. The net result is that PSS collection is : regularly running on at least one process in the system (usually a few : times a minute while the screen is on, less when screen is off due to : suspend). PSS of a process is an incredibly useful stat to track, and we : aren't going to get rid of it. We've looked at some very hacky approaches : using RSS ("take the RSS of the target process, subtract the RSS of the : zygote process that is the parent of all Android apps") to reduce the : accounting time, but it regularly overestimated the memory used by 20+ : percent. Accordingly, I don't think that there's a good alternative to : using PSS. : : We started looking into PSS collection performance after we noticed random : frequency spikes while a phone's screen was off; occasionally, one of the : CPU clusters would ramp to a high frequency because there was 200-300ms of : constant CPU work from a single thread in the main Android userspace : process. The work causing the spike (which is reasonable governor : behavior given the amount of CPU time needed) was always PSS collection. : As a result, Android is burning more power than we should be on PSS : collection. : : The other issue (and why I'm less sure about improving smaps as a : long-term solution) is that the number of VMAs per process has increased : significantly from release to release. After trying to figure out why we : were seeing these 200-300ms PSS collection times on Android O but had not : noticed it in previous versions, we found that the number of VMAs in the : main system process increased by 50% from Android N to Android O (from : ~1800 to ~2700) and varying increases in every userspace process. Android : M to N also had an increase in the number of VMAs, although not as much. : I'm not sure why this is increasing so much over time, but thinking about : ASLR and ways to make ASLR better, I expect that this will continue to : increase going forward. I would not be surprised if we hit 5000 VMAs on : the main Android process (system_server) by 2020. : : If we assume that the number of VMAs is going to increase over time, then : doing anything we can do to reduce the overhead of each VMA during PSS : collection seems like the right way to go, and that means outputting an : aggregate statistic (to avoid whatever overhead there is per line in : writing smaps and in reading each line from userspace). Link: http://lkml.kernel.org/r/20170812022148.178293-1-dancol@google.com Signed-off-by: Daniel Colascione <dancol@google.com> Cc: Tim Murray <timmurray@google.com> Cc: Joel Fernandes <joelaf@google.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Sonny Rao <sonnyrao@chromium.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-08-10sched/debug: Use task_pid_nr_ns in /proc/$pid/schedAleksa Sarai
It appears as though the addition of the PID namespace did not update the output code for /proc/*/sched, which resulted in it providing PIDs that were not self-consistent with the /proc mount. This additionally made it trivial to detect whether a process was inside &init_pid_ns from userspace, making container detection trivial: https://github.com/jessfraz/amicontained This leads to situations such as: % unshare -pmf % mount -t proc proc /proc % head -n1 /proc/1/sched head (10047, #threads: 1) Fix this by just using task_pid_nr_ns for the output of /proc/*/sched. All of the other uses of task_pid_nr in kernel/sched/debug.c are from a sysctl context and thus don't need to be namespaced. Signed-off-by: Aleksa Sarai <asarai@suse.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Eric W. Biederman <ebiederm@xmission.com> Cc: Jess Frazelle <acidburn@google.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: cyphar@cyphar.com Link: http://lkml.kernel.org/r/20170806044141.5093-1-asarai@suse.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-07-14fault-inject: add /proc/<pid>/fail-nthAkinobu Mita
fail-nth interface is only created in /proc/self/task/<current-tid>/. This change also adds it in /proc/<pid>/. This makes shell based tool a bit simpler. $ bash -c "builtin echo 100 > /proc/self/fail-nth && exec ls /" Link: http://lkml.kernel.org/r/1491490561-10485-6-git-send-email-akinobu.mita@gmail.com Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-14fault-inject: simplify access check for fail-nthAkinobu Mita
The fail-nth file is created with 0666 and the access is permitted if and only if the task is current. This file is owned by the currnet user. So we can create it with 0644 and allow the owner to write it. This enables to watch the status of task->fail_nth from another processes. [akinobu.mita@gmail.com: don't convert unsigned type value as signed int] Link: http://lkml.kernel.org/r/1492444483-9239-1-git-send-email-akinobu.mita@gmail.com [akinobu.mita@gmail.com: avoid unwanted data race to task->fail_nth] Link: http://lkml.kernel.org/r/1499962492-8931-1-git-send-email-akinobu.mita@gmail.com Link: http://lkml.kernel.org/r/1491490561-10485-5-git-send-email-akinobu.mita@gmail.com Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Acked-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-14fault-inject: make fail-nth read/write interface symmetricAkinobu Mita
The read interface for fail-nth looks a bit odd. Read from this file returns "NYYYY..." or "YYYYY..." (this makes me surprise when cat this file). Because there is no EOF condition. The first character indicates current->fail_nth is zero or not, and then current->fail_nth is reset to zero. Just returning task->fail_nth value is more natural to understand. Link: http://lkml.kernel.org/r/1491490561-10485-4-git-send-email-akinobu.mita@gmail.com Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-14fault-inject: parse as natural 1-based value for fail-nth write interfaceAkinobu Mita
The value written to fail-nth file is parsed as 0-based. Parsing as one-based is more natural to understand and it enables to cancel the previous setup by simply writing '0'. This change also converts task->fail_nth from signed to unsigned int. Link: http://lkml.kernel.org/r/1491490561-10485-3-git-send-email-akinobu.mita@gmail.com Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-14fault-inject: automatically detect the number base for fail-nth write interfaceAkinobu Mita
Automatically detect the number base to use when writing to fail-nth file instead of always parsing as a decimal number. Link: http://lkml.kernel.org/r/1491490561-10485-2-git-send-email-akinobu.mita@gmail.com Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>