summaryrefslogtreecommitdiffstats
path: root/mm/shmem.c
AgeCommit message (Collapse)Author
2014-06-26tmpfs: ZERO_RANGE and COLLAPSE_RANGE not currently supportedHugh Dickins
commit 13ace4d0d9db40e10ecd66dfda14e297571be813 upstream. I was well aware of FALLOC_FL_ZERO_RANGE and FALLOC_FL_COLLAPSE_RANGE support being added to fallocate(); but didn't realize until now that I had been too stupid to future-proof shmem_fallocate() against new additions. -EOPNOTSUPP instead of going on to ordinary fallocation. Signed-off-by: Hugh Dickins <hughd@google.com> Reviewed-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-04-13mm: Initialize error in shmem_file_aio_read()Geert Uytterhoeven
Some versions of gcc even warn about it: mm/shmem.c: In function ‘shmem_file_aio_read’: mm/shmem.c:1414: warning: ‘error’ may be used uninitialized in this function If the loop is aborted during the first iteration by one of the two first break statements, error will be uninitialized. Introduced by commit 6e58e79db8a1 ("introduce copy_page_to_iter, kill loop over iovec in generic_file_aio_read()"). Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org> Acked-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-12Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs updates from Al Viro: "The first vfs pile, with deep apologies for being very late in this window. Assorted cleanups and fixes, plus a large preparatory part of iov_iter work. There's a lot more of that, but it'll probably go into the next merge window - it *does* shape up nicely, removes a lot of boilerplate, gets rid of locking inconsistencie between aio_write and splice_write and I hope to get Kent's direct-io rewrite merged into the same queue, but some of the stuff after this point is having (mostly trivial) conflicts with the things already merged into mainline and with some I want more testing. This one passes LTP and xfstests without regressions, in addition to usual beating. BTW, readahead02 in ltp syscalls testsuite has started giving failures since "mm/readahead.c: fix readahead failure for memoryless NUMA nodes and limit readahead pages" - might be a false positive, might be a real regression..." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (63 commits) missing bits of "splice: fix racy pipe->buffers uses" cifs: fix the race in cifs_writev() ceph_sync_{,direct_}write: fix an oops on ceph_osdc_new_request() failure kill generic_file_buffered_write() ocfs2_file_aio_write(): switch to generic_perform_write() ceph_aio_write(): switch to generic_perform_write() xfs_file_buffered_aio_write(): switch to generic_perform_write() export generic_perform_write(), start getting rid of generic_file_buffer_write() generic_file_direct_write(): get rid of ppos argument btrfs_file_aio_write(): get rid of ppos kill the 5th argument of generic_file_buffered_write() kill the 4th argument of __generic_file_aio_write() lustre: don't open-code kernel_recvmsg() ocfs2: don't open-code kernel_recvmsg() drbd: don't open-code kernel_recvmsg() constify blk_rq_map_user_iov() and friends lustre: switch to kernel_sendmsg() ocfs2: don't open-code kernel_sendmsg() take iov_iter stuff to mm/iov_iter.c process_vm_access: tidy up a bit ...
2014-04-12missing bits of "splice: fix racy pipe->buffers uses"Al Viro
that commit has fixed only the parts of that mess in fs/splice.c itself; there had been more in several other ->splice_read() instances... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-04-07memcg: rename high level charging functionsMichal Hocko
mem_cgroup_newpage_charge is used only for charging anonymous memory so it is better to rename it to mem_cgroup_charge_anon. mem_cgroup_cache_charge is used for file backed memory so rename it to mem_cgroup_charge_file. Signed-off-by: Michal Hocko <mhocko@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-07mm: implement ->map_pages for shmem/tmpfsNing Qu
In shmem/tmpfs, we also use the generic filemap_map_pages, seems the additional checking is not worth a separate version of map_pages for it. Signed-off-by: Ning Qu <quning@google.com> Acked-by: Hugh Dickins <hughd@google.com> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Dave Chinner <david@fromorbit.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-03mm + fs: prepare for non-page entries in page cache radix treesJohannes Weiner
shmem mappings already contain exceptional entries where swap slot information is remembered. To be able to store eviction information for regular page cache, prepare every site dealing with the radix trees directly to handle entries other than pages. The common lookup functions will filter out non-page entries and return NULL for page cache holes, just as before. But provide a raw version of the API which returns non-page entries as well, and switch shmem over to use it. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Rik van Riel <riel@redhat.com> Reviewed-by: Minchan Kim <minchan@kernel.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Bob Liu <bob.liu@oracle.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Greg Thelen <gthelen@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Luigi Semenzato <semenzato@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Metin Doslu <metin@citusdata.com> Cc: Michel Lespinasse <walken@google.com> Cc: Ozgun Erdogan <ozgun@citusdata.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin <klamm@yandex-team.ru> Cc: Ryan Mallon <rmallon@gmail.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-03mm: shmem: save one radix tree lookup when truncating swapped pagesJohannes Weiner
Page cache radix tree slots are usually stabilized by the page lock, but shmem's swap cookies have no such thing. Because the overall truncation loop is lockless, the swap entry is currently confirmed by a tree lookup and then deleted by another tree lookup under the same tree lock region. Use radix_tree_delete_item() instead, which does the verification and deletion with only one lookup. This also allows removing the delete-only special case from shmem_radix_tree_replace(). Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Minchan Kim <minchan@kernel.org> Reviewed-by: Rik van Riel <riel@redhat.com> Acked-by: Mel Gorman <mgorman@suse.de> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Bob Liu <bob.liu@oracle.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Greg Thelen <gthelen@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Luigi Semenzato <semenzato@google.com> Cc: Metin Doslu <metin@citusdata.com> Cc: Michel Lespinasse <walken@google.com> Cc: Ozgun Erdogan <ozgun@citusdata.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin <klamm@yandex-team.ru> Cc: Ryan Mallon <rmallon@gmail.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-01introduce copy_page_to_iter, kill loop over iovec in generic_file_aio_read()Al Viro
generic_file_aio_read() was looping over the target iovec, with loop over (source) pages nested inside that. Just set an iov_iter up and pass *that* to do_generic_file_aio_read(). With copy_page_to_iter() doing all work of mapping and copying a page to iovec and advancing iov_iter. Switch shmem_file_aio_read() to the same and kill file_read_actor(), while we are at it. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-04-01do_shmem_file_read(): call file_read_actor() directlyAl Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-01-28Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs updates from Al Viro: "Assorted stuff; the biggest pile here is Christoph's ACL series. Plus assorted cleanups and fixes all over the place... There will be another pile later this week" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (43 commits) __dentry_path() fixes vfs: Remove second variable named error in __dentry_path vfs: Is mounted should be testing mnt_ns for NULL or error. Fix race when checking i_size on direct i/o read hfsplus: remove can_set_xattr nfsd: use get_acl and ->set_acl fs: remove generic_acl nfs: use generic posix ACL infrastructure for v3 Posix ACLs gfs2: use generic posix ACL infrastructure jfs: use generic posix ACL infrastructure xfs: use generic posix ACL infrastructure reiserfs: use generic posix ACL infrastructure ocfs2: use generic posix ACL infrastructure jffs2: use generic posix ACL infrastructure hfsplus: use generic posix ACL infrastructure f2fs: use generic posix ACL infrastructure ext2/3/4: use generic posix ACL infrastructure btrfs: use generic posix ACL infrastructure fs: make posix_acl_create more useful fs: make posix_acl_chmod more useful ...
2014-01-26fs: remove generic_aclChristoph Hellwig
And instead convert tmpfs to use the new generic ACL code, with two stub methods provided for in-memory filesystems. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-01-23mm: dump page when hitting a VM_BUG_ON using VM_BUG_ON_PAGESasha Levin
Most of the VM_BUG_ON assertions are performed on a page. Usually, when one of these assertions fails we'll get a BUG_ON with a call stack and the registers. I've recently noticed based on the requests to add a small piece of code that dumps the page to various VM_BUG_ON sites that the page dump is quite useful to people debugging issues in mm. This patch adds a VM_BUG_ON_PAGE(cond, page) which beyond doing what VM_BUG_ON() does, also dumps the page before executing the actual BUG_ON. [akpm@linux-foundation.org: fix up includes] Signed-off-by: Sasha Levin <sasha.levin@oracle.com> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-12-02security: shmem: implement kernel private shmem inodesEric Paris
We have a problem where the big_key key storage implementation uses a shmem backed inode to hold the key contents. Because of this detail of implementation LSM checks are being done between processes trying to read the keys and the tmpfs backed inode. The LSM checks are already being handled on the key interface level and should not be enforced at the inode level (since the inode is an implementation detail, not a part of the security model) This patch implements a new function shmem_kernel_file_setup() which returns the equivalent to shmem_file_setup() only the underlying inode has S_PRIVATE set. This means that all LSM checks for the inode in question are skipped. It should only be used for kernel internal operations where the inode is not exposed to userspace without proper LSM checking. It is possible that some other users of shmem_file_setup() should use the new interface, but this has not been explored. Reproducing this bug is a little bit difficult. The steps I used on Fedora are: (1) Turn off selinux enforcing: setenforce 0 (2) Create a huge key k=`dd if=/dev/zero bs=8192 count=1 | keyctl padd big_key test-key @s` (3) Access the key in another context: runcon system_u:system_r:httpd_t:s0-s0:c0.c1023 keyctl print $k >/dev/null (4) Examine the audit logs: ausearch -m AVC -i --subject httpd_t | audit2allow If the last command's output includes a line that looks like: allow httpd_t user_tmpfs_t:file { open read }; There was an inode check between httpd and the tmpfs filesystem. With this patch no such denial will be seen. (NOTE! you should clear your audit log if you have tested for this previously) (Please return you box to enforcing) Signed-off-by: Eric Paris <eparis@redhat.com> Signed-off-by: David Howells <dhowells@redhat.com> cc: Hugh Dickins <hughd@google.com> cc: linux-mm@kvack.org
2013-09-11initmpfs: make rootfs use tmpfs when CONFIG_TMPFS enabledRob Landley
Conditionally call the appropriate fs_init function and fill_super functions. Add a use once guard to shmem_init() to simply succeed on a second call. (Note that IS_ENABLED() is a compile time constant so dead code elimination removes unused function calls when CONFIG_TMPFS is disabled.) Signed-off-by: Rob Landley <rob@landley.net> Cc: Jeff Layton <jlayton@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Stephen Warren <swarren@nvidia.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Jim Cromie <jim.cromie@gmail.com> Cc: Sam Ravnborg <sam@ravnborg.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: "H. Peter Anvin" <hpa@zytor.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11lib/radix-tree.c: make radix_tree_node_alloc() work correctly within interruptJan Kara
With users of radix_tree_preload() run from interrupt (block/blk-ioc.c is one such possible user), the following race can happen: radix_tree_preload() ... radix_tree_insert() radix_tree_node_alloc() if (rtp->nr) { ret = rtp->nodes[rtp->nr - 1]; <interrupt> ... radix_tree_preload() ... radix_tree_insert() radix_tree_node_alloc() if (rtp->nr) { ret = rtp->nodes[rtp->nr - 1]; And we give out one radix tree node twice. That clearly results in radix tree corruption with different results (usually OOPS) depending on which two users of radix tree race. We fix the problem by making radix_tree_node_alloc() always allocate fresh radix tree nodes when in interrupt. Using preloading when in interrupt doesn't make sense since all the allocations have to be atomic anyway and we cannot steal nodes from process-context users because some users rely on radix_tree_insert() succeeding after radix_tree_preload(). in_interrupt() check is somewhat ugly but we cannot simply key off passed gfp_mask as that is acquired from root_gfp_mask() and thus the same for all preload users. Another part of the fix is to avoid node preallocation in radix_tree_preload() when passed gfp_mask doesn't allow waiting. Again, preallocation in such case doesn't make sense and when preallocation would happen in interrupt we could possibly leak some allocated nodes. However, some users of radix_tree_preload() require following radix_tree_insert() to succeed. To avoid unexpected effects for these users, radix_tree_preload() only warns if passed gfp mask doesn't allow waiting and we provide a new function radix_tree_maybe_preload() for those users which get different gfp mask from different call sites and which are prepared to handle radix_tree_insert() failure. Signed-off-by: Jan Kara <jack@suse.cz> Cc: Jens Axboe <jaxboe@fusionio.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-03shm_mnt is as longterm as it gets, TYVM...Al Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-08-24cope with potentially long ->d_dname() output for shmem/hugetlbAl Viro
dynamic_dname() is both too much and too little for those - the output may be well in excess of 64 bytes dynamic_dname() assumes to be enough (thanks to ashmem feeding really long names to shmem_file_setup()) and vsnprintf() is an overkill for those guys. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-08-04tmpfs: fix SEEK_DATA/SEEK_HOLE regressionHugh Dickins
Commit 46a1c2c7ae53 ("vfs: export lseek_execute() to modules") broke the tmpfs SEEK_DATA/SEEK_HOLE implementation, because vfs_setpos() converts the carefully prepared -ENXIO to -EINVAL. Other filesystems avoid it in error cases: do the same in tmpfs. Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Jie Liu <jeff.liu@oracle.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-03Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security Pull security subsystem updates from James Morris: "In this update, Smack learns to love IPv6 and to mount a filesystem with a transmutable hierarchy (i.e. security labels are inherited from parent directory upon creation rather than creating process). The rest of the changes are maintenance" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (37 commits) tpm/tpm_i2c_infineon: Remove unused header file tpm: tpm_i2c_infinion: Don't modify i2c_client->driver evm: audit integrity metadata failures integrity: move integrity_audit_msg() evm: calculate HMAC after initializing posix acl on tmpfs maintainers: add Dmitry Kasatkin Smack: Fix the bug smackcipso can't set CIPSO correctly Smack: Fix possible NULL pointer dereference at smk_netlbl_mls() Smack: Add smkfstransmute mount option Smack: Improve access check performance Smack: Local IPv6 port based controls tpm: fix regression caused by section type conflict of tpm_dev_release() in ppc builds maintainers: Remove Kent from maintainers tpm: move TPM_DIGEST_SIZE defintion tpm_tis: missing platform_driver_unregister() on error in init_tis() security: clarify cap_inode_getsecctx description apparmor: no need to delay vfree() apparmor: fix fully qualified name parsing apparmor: fix setprocattr arg processing for onexec apparmor: localize getting the security context to a few macros ...
2013-07-03vfs: export lseek_execute() to modulesJie Liu
For those file systems(btrfs/ext4/ocfs2/tmpfs) that support SEEK_DATA/SEEK_HOLE functions, we end up handling the similar matter in lseek_execute() to update the current file offset to the desired offset if it is valid, ceph also does the simliar things at ceph_llseek(). To reduce the duplications, this patch make lseek_execute() public accessible so that we can call it directly from the underlying file systems. Thanks Dave Chinner for this suggestion. [AV: call it vfs_setpos(), don't bring the removed 'inode' argument back] v2->v1: - Add kernel-doc comments for lseek_execute() - Call lseek_execute() in ceph->llseek() Signed-off-by: Jie Liu <jeff.liu@oracle.com> Cc: Dave Chinner <dchinner@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andi Kleen <andi@firstfloor.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Christoph Hellwig <hch@lst.de> Cc: Chris Mason <chris.mason@fusionio.com> Cc: Josef Bacik <jbacik@fusionio.com> Cc: Ben Myers <bpm@sgi.com> Cc: Ted Tso <tytso@mit.edu> Cc: Hugh Dickins <hughd@google.com> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Sage Weil <sage@inktank.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-29[O_TMPFILE] it's still short a few helpers, but infrastructure should be OK ↵Al Viro
now... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-20evm: calculate HMAC after initializing posix acl on tmpfsMimi Zohar
Included in the EVM hmac calculation is the i_mode. Any changes to the i_mode need to be reflected in the hmac. shmem_mknod() currently calls generic_acl_init(), which modifies the i_mode, after calling security_inode_init_security(). This patch reverses the order in which they are called. Reported-by: Sven Vermeulen <sven.vermeulen@siphos.be> Signed-off-by: Mimi Zohar <zohar@us.ibm.com> Acked-by: Hugh Dickins <hughd@google.com>
2013-05-07aio: don't include aio.h in sched.hKent Overstreet
Faster kernel compiles by way of fewer unnecessary includes. [akpm@linux-foundation.org: fix fallout] [akpm@linux-foundation.org: fix build] Signed-off-by: Kent Overstreet <koverstreet@google.com> Cc: Zach Brown <zab@redhat.com> Cc: Felipe Balbi <balbi@ti.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Jens Axboe <axboe@kernel.dk> Cc: Asai Thambi S P <asamymuthupa@micron.com> Cc: Selvan Mani <smani@micron.com> Cc: Sam Bradshaw <sbradshaw@micron.com> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Benjamin LaHaise <bcrl@kvack.org> Reviewed-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-29mm/shmem.c: remove an ifdefAndrew Morton
Create a CONFIG_MMU=y stub for ramfs_nommu_expand_for_mapping() in the usual fashion. Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Wolfram Sang <wsa@the-dreams.de> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-03-01fix nommu breakage in shmem.cAl Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-02-26Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs pile (part one) from Al Viro: "Assorted stuff - cleaning namei.c up a bit, fixing ->d_name/->d_parent locking violations, etc. The most visible changes here are death of FS_REVAL_DOT (replaced with "has ->d_weak_revalidate()") and a new helper getting from struct file to inode. Some bits of preparation to xattr method interface changes. Misc patches by various people sent this cycle *and* ocfs2 fixes from several cycles ago that should've been upstream right then. PS: the next vfs pile will be xattr stuff." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (46 commits) saner proc_get_inode() calling conventions proc: avoid extra pde_put() in proc_fill_super() fs: change return values from -EACCES to -EPERM fs/exec.c: make bprm_mm_init() static ocfs2/dlm: use GFP_ATOMIC inside a spin_lock ocfs2: fix possible use-after-free with AIO ocfs2: Fix oops in ocfs2_fast_symlink_readpage() code path get_empty_filp()/alloc_file() leave both ->f_pos and ->f_version zero target: writev() on single-element vector is pointless export kernel_write(), convert open-coded instances fs: encode_fh: return FILEID_INVALID if invalid fid_type kill f_vfsmnt vfs: kill FS_REVAL_DOT by adding a d_weak_revalidate dentry op nfsd: handle vfs_getattr errors in acl protocol switch vfs_getattr() to struct path default SET_PERSONALITY() in linux/elf.h ceph: prepopulate inodes only when request is aborted d_hash_and_lookup(): export, switch open-coded instances 9p: switch v9fs_set_create_acl() to inode+fid, do it before d_instantiate() 9p: split dropping the acls from v9fs_set_create_acl() ...
2013-02-26fs: encode_fh: return FILEID_INVALID if invalid fid_typeNamjae Jeon
This patch is a follow up on below patch: [PATCH] exportfs: add FILEID_INVALID to indicate invalid fid_type commit: 216b6cbdcbd86b1db0754d58886b466ae31f5a63 Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com> Signed-off-by: Vivek Trivedi <t.vivek@samsung.com> Acked-by: Steven Whitehouse <swhiteho@redhat.com> Acked-by: Sage Weil <sage@inktank.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-02-26shmem_setup_file(): use d_alloc_pseudo() instead of d_alloc()Al Viro
Note that provided ->d_dname() reproduces what we used to get for those guys in e.g. /proc/self/maps; it might be a good idea to change that to something less ugly, but for now let's keep the existing user-visible behaviour Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-02-25Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace Pull user namespace and namespace infrastructure changes from Eric W Biederman: "This set of changes starts with a few small enhnacements to the user namespace. reboot support, allowing more arbitrary mappings, and support for mounting devpts, ramfs, tmpfs, and mqueuefs as just the user namespace root. I do my best to document that if you care about limiting your unprivileged users that when you have the user namespace support enabled you will need to enable memory control groups. There is a minor bug fix to prevent overflowing the stack if someone creates way too many user namespaces. The bulk of the changes are a continuation of the kuid/kgid push down work through the filesystems. These changes make using uids and gids typesafe which ensures that these filesystems are safe to use when multiple user namespaces are in use. The filesystems converted for 3.9 are ceph, 9p, afs, ocfs2, gfs2, ncpfs, nfs, nfsd, and cifs. The changes for these filesystems were a little more involved so I split the changes into smaller hopefully obviously correct changes. XFS is the only filesystem that remains. I was hoping I could get that in this release so that user namespace support would be enabled with an allyesconfig or an allmodconfig but it looks like the xfs changes need another couple of days before it they are ready." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (93 commits) cifs: Enable building with user namespaces enabled. cifs: Convert struct cifs_ses to use a kuid_t and a kgid_t cifs: Convert struct cifs_sb_info to use kuids and kgids cifs: Modify struct smb_vol to use kuids and kgids cifs: Convert struct cifsFileInfo to use a kuid cifs: Convert struct cifs_fattr to use kuid and kgids cifs: Convert struct tcon_link to use a kuid. cifs: Modify struct cifs_unix_set_info_args to hold a kuid_t and a kgid_t cifs: Convert from a kuid before printing current_fsuid cifs: Use kuids and kgids SID to uid/gid mapping cifs: Pass GLOBAL_ROOT_UID and GLOBAL_ROOT_GID to keyring_alloc cifs: Use BUILD_BUG_ON to validate uids and gids are the same size cifs: Override unmappable incoming uids and gids nfsd: Enable building with user namespaces enabled. nfsd: Properly compare and initialize kuids and kgids nfsd: Store ex_anon_uid and ex_anon_gid as kuids and kgids nfsd: Modify nfsd4_cb_sec to use kuids and kgids nfsd: Handle kuids and kgids in the nfs4acl to posix_acl conversion nfsd: Convert nfsxdr to use kuids and kgids nfsd: Convert nfs3xdr to use kuids and kgids ...
2013-02-23tmpfs: fix mempolicy object leaksGreg Thelen
Fix several mempolicy leaks in the tmpfs mount logic. These leaks are slow - on the order of one object leaked per mount attempt. Leak 1 (umount doesn't free mpol allocated in mount): while true; do mount -t tmpfs -o mpol=interleave,size=100M nodev /mnt umount /mnt done Leak 2 (errors parsing remount options will leak mpol): mount -t tmpfs -o size=100M nodev /mnt while true; do mount -o remount,mpol=interleave,size=x /mnt 2> /dev/null done umount /mnt Leak 3 (multiple mpol per mount leak mpol): while true; do mount -t tmpfs -o mpol=interleave,mpol=interleave,size=100M nodev /mnt umount /mnt done This patch fixes all of the above. I could have broken the patch into three pieces but is seemed easier to review as one. [akpm@linux-foundation.org: fix handling of mpol_parse_str() errors, per Hugh] Signed-off-by: Greg Thelen <gthelen@google.com> Acked-by: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-23tmpfs: fix use-after-free of mempolicy objectGreg Thelen
The tmpfs remount logic preserves filesystem mempolicy if the mpol=M option is not specified in the remount request. A new policy can be specified if mpol=M is given. Before this patch remounting an mpol bound tmpfs without specifying mpol= mount option in the remount request would set the filesystem's mempolicy object to a freed mempolicy object. To reproduce the problem boot a DEBUG_PAGEALLOC kernel and run: # mkdir /tmp/x # mount -t tmpfs -o size=100M,mpol=interleave nodev /tmp/x # grep /tmp/x /proc/mounts nodev /tmp/x tmpfs rw,relatime,size=102400k,mpol=interleave:0-3 0 0 # mount -o remount,size=200M nodev /tmp/x # grep /tmp/x /proc/mounts nodev /tmp/x tmpfs rw,relatime,size=204800k,mpol=??? 0 0 # note ? garbage in mpol=... output above # dd if=/dev/zero of=/tmp/x/f count=1 # panic here Panic: BUG: unable to handle kernel NULL pointer dereference at (null) IP: [< (null)>] (null) [...] Oops: 0010 [#1] SMP DEBUG_PAGEALLOC Call Trace: mpol_shared_policy_init+0xa5/0x160 shmem_get_inode+0x209/0x270 shmem_mknod+0x3e/0xf0 shmem_create+0x18/0x20 vfs_create+0xb5/0x130 do_last+0x9a1/0xea0 path_openat+0xb3/0x4d0 do_filp_open+0x42/0xa0 do_sys_open+0xfe/0x1e0 compat_sys_open+0x1b/0x20 cstar_dispatch+0x7/0x1f Non-debug kernels will not crash immediately because referencing the dangling mpol will not cause a fault. Instead the filesystem will reference a freed mempolicy object, which will cause unpredictable behavior. The problem boils down to a dropped mpol reference below if shmem_parse_options() does not allocate a new mpol: config = *sbinfo shmem_parse_options(data, &config, true) mpol_put(sbinfo->mpol) sbinfo->mpol = config.mpol /* BUG: saves unreferenced mpol */ This patch avoids the crash by not releasing the mempolicy if shmem_parse_options() doesn't create a new mpol. How far back does this issue go? I see it in both 2.6.36 and 3.3. I did not look back further. Signed-off-by: Greg Thelen <gthelen@google.com> Acked-by: Hugh Dickins <hughd@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-23mm: shmem: use new radix tree iteratorJohannes Weiner
In shmem_find_get_pages_and_swap(), use the faster radix tree iterator construct from commit 78c1d78488a3 ("radix-tree: introduce bit-optimized iterator"). Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Hugh Dickins <hughd@google.com> Cc: Konstantin Khlebnikov <khlebnikov@openvz.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-22clean shmem_file_setup() a bitAl Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-02-22fs: Preserve error code in get_empty_filp(), part 2Anatol Pomozov
Allocating a file structure in function get_empty_filp() might fail because of several reasons: - not enough memory for file structures - operation is not allowed - user is over its limit Currently the function returns NULL in all cases and we loose the exact reason of the error. All callers of get_empty_filp() assume that the function can fail with ENFILE only. Return error through pointer. Change all callers to preserve this error code. [AV: cleaned up a bit, carved the get_empty_filp() part out into a separate commit (things remaining here deal with alloc_file()), removed pipe(2) behaviour change] Signed-off-by: Anatol Pomozov <anatol.pomozov@gmail.com> Reviewed-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-02-22new helper: file_inode(file)Al Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-01-26userns: Allow the userns root to mount tmpfs.Eric W. Biederman
There is no backing store to tmpfs and file creation rules are the same as for any other filesystem so it is semantically safe to allow unprivileged users to mount it. ramfs is safe for the same reasons so allow either flavor of tmpfs to be mounted by a user namespace root user. The memory control group successfully limits how much memory tmpfs can consume on any system that cares about a user namespace root using tmpfs to exhaust memory the memory control group can be deployed. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2013-01-02mempolicy: remove arg from mpol_parse_str, mpol_to_strHugh Dickins
Remove the unused argument (formerly no_context) from mpol_parse_str() and from mpol_to_str(). Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-17lseek: the "whence" argument is called "whence"Andrew Morton
But the kernel decided to call it "origin" instead. Fix most of the sites. Acked-by: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-12tmpfs: support SEEK_DATA and SEEK_HOLE (reprise)Hugh Dickins
Revert 3.5's commit f21f8062201f ("tmpfs: revert SEEK_DATA and SEEK_HOLE") to reinstate 4fb5ef089b28 ("tmpfs: support SEEK_DATA and SEEK_HOLE"), with the intervening additional arg to generic_file_llseek_size(). In 3.8, ext4 is expected to join btrfs, ocfs2 and xfs with proper SEEK_DATA and SEEK_HOLE support; and a good case has now been made for it on tmpfs, so let's join the party. It's quite easy for tmpfs to scan the radix_tree to support llseek's new SEEK_DATA and SEEK_HOLE options: so add them while the minutiae are still on my mind (in particular, the !PageUptodate-ness of pages fallocated but still unwritten). [akpm@linux-foundation.org: fix warning with CONFIG_TMPFS=n] Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Jaegeuk Hanse <jaegeuk.hanse@gmail.com> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: Zheng Liu <wenqing.lz@taobao.com> Cc: Jeff liu <jeff.liu@oracle.com> Cc: Paul Eggert <eggert@cs.ucla.edu> Cc: Christoph Hellwig <hch@infradead.org> Cc: Josef Bacik <josef@redhat.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Andreas Dilger <adilger@dilger.ca> Cc: Marco Stornelli <marco.stornelli@gmail.com> Cc: Chris Mason <chris.mason@fusionio.com> Cc: Sunil Mushran <sunil.mushran@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-06tmpfs: fix shared mempolicy leakMel Gorman
This fixes a regression in 3.7-rc, which has since gone into stable. Commit 00442ad04a5e ("mempolicy: fix a memory corruption by refcount imbalance in alloc_pages_vma()") changed get_vma_policy() to raise the refcount on a shmem shared mempolicy; whereas shmem_alloc_page() went on expecting alloc_page_vma() to drop the refcount it had acquired. This deserves a rework: but for now fix the leak in shmem_alloc_page(). Hugh: shmem_swapin() did not need a fix, but surely it's clearer to use the same refcounting there as in shmem_alloc_page(), delete its onstack mempolicy, and the strange mpol_cond_copy() and __mpol_cond_copy() - those were invented to let swapin_readahead() make an unknown number of calls to alloc_pages_vma() with one mempolicy; but since 00442ad04a5e, alloc_pages_vma() has kept refcount in balance, so now no problem. Reported-and-tested-by: Tommi Rantala <tt.rantala@gmail.com> Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Hugh Dickins <hughd@google.com> Cc: stable@vger.kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-11-16tmpfs: change final i_blocks BUG to WARNINGHugh Dickins
Under a particular load on one machine, I have hit shmem_evict_inode()'s BUG_ON(inode->i_blocks), enough times to narrow it down to a particular race between swapout and eviction. It comes from the "if (freed > 0)" asymmetry in shmem_recalc_inode(), and the lack of coherent locking between mapping's nrpages and shmem's swapped count. There's a window in shmem_writepage(), between lowering nrpages in shmem_delete_from_page_cache() and then raising swapped count, when the freed count appears to be +1 when it should be 0, and then the asymmetry stops it from being corrected with -1 before hitting the BUG. One answer is coherent locking: using tree_lock throughout, without info->lock; reasonable, but the raw_spin_lock in percpu_counter_add() on used_blocks makes that messier than expected. Another answer may be a further effort to eliminate the weird shmem_recalc_inode() altogether, but previous attempts at that failed. So far undecided, but for now change the BUG_ON to WARN_ON: in usual circumstances it remains a useful consistency check. Signed-off-by: Hugh Dickins <hughd@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-11-16tmpfs: fix shmem_getpage_gfp() VM_BUG_ONHugh Dickins
Fuzzing with trinity hit the "impossible" VM_BUG_ON(error) (which Fedora has converted to WARNING) in shmem_getpage_gfp(): WARNING: at mm/shmem.c:1151 shmem_getpage_gfp+0xa5c/0xa70() Pid: 29795, comm: trinity-child4 Not tainted 3.7.0-rc2+ #49 Call Trace: warn_slowpath_common+0x7f/0xc0 warn_slowpath_null+0x1a/0x20 shmem_getpage_gfp+0xa5c/0xa70 shmem_fault+0x4f/0xa0 __do_fault+0x71/0x5c0 handle_pte_fault+0x97/0xae0 handle_mm_fault+0x289/0x350 __do_page_fault+0x18e/0x530 do_page_fault+0x2b/0x50 page_fault+0x28/0x30 tracesys+0xe1/0xe6 Thanks to Johannes for pointing to truncation: free_swap_and_cache() only does a trylock on the page, so the page lock we've held since before confirming swap is not enough to protect against truncation. What cleanup is needed in this case? Just delete_from_swap_cache(), which takes care of the memcg uncharge. Signed-off-by: Hugh Dickins <hughd@google.com> Reported-by: Dave Jones <davej@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-09tmpfs,ceph,gfs2,isofs,reiserfs,xfs: fix fh_len checkingHugh Dickins
Fuzzing with trinity oopsed on the 1st instruction of shmem_fh_to_dentry(), u64 inum = fid->raw[2]; which is unhelpfully reported as at the end of shmem_alloc_inode(): BUG: unable to handle kernel paging request at ffff880061cd3000 IP: [<ffffffff812190d0>] shmem_alloc_inode+0x40/0x40 Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC Call Trace: [<ffffffff81488649>] ? exportfs_decode_fh+0x79/0x2d0 [<ffffffff812d77c3>] do_handle_open+0x163/0x2c0 [<ffffffff812d792c>] sys_open_by_handle_at+0xc/0x10 [<ffffffff83a5f3f8>] tracesys+0xe1/0xe6 Right, tmpfs is being stupid to access fid->raw[2] before validating that fh_len includes it: the buffer kmalloc'ed by do_sys_name_to_handle() may fall at the end of a page, and the next page not be present. But some other filesystems (ceph, gfs2, isofs, reiserfs, xfs) are being careless about fh_len too, in fh_to_dentry() and/or fh_to_parent(), and could oops in the same way: add the missing fh_len checks to those. Reported-by: Sasha Levin <levinsasha928@gmail.com> Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Sage Weil <sage@inktank.com> Cc: Steven Whitehouse <swhiteho@redhat.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: stable@vger.kernel.org Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-10-09mm: kill vma flag VM_CAN_NONLINEARKonstantin Khlebnikov
Move actual pte filling for non-linear file mappings into the new special vma operation: ->remap_pages(). Filesystems must implement this method to get non-linear mapping support, if it uses filemap_fault() then generic_file_remap_pages() can be used. Now device drivers can implement this method and obtain nonlinear vma support. Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Carsten Otte <cotte@de.ibm.com> Cc: Chris Metcalf <cmetcalf@tilera.com> #arch/tile Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Eric Paris <eparis@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Hugh Dickins <hughd@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: James Morris <james.l.morris@oracle.com> Cc: Jason Baron <jbaron@redhat.com> Cc: Kentaro Takeda <takedakn@nttdata.co.jp> Cc: Matt Helsley <matthltc@us.ibm.com> Cc: Nick Piggin <npiggin@kernel.dk> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Robert Richter <robert.richter@amd.com> Cc: Suresh Siddha <suresh.b.siddha@intel.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Venkatesh Pallipadi <venki@google.com> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-08-24xattr: extract simple_xattr code from tmpfsAristeu Rozanski
Extract in-memory xattr APIs from tmpfs. Will be used by cgroup. $ size vmlinux.o text data bss dec hex filename 4658782 880729 5195032 10734543 a3cbcf vmlinux.o $ size vmlinux.o text data bss dec hex filename 4658957 880729 5195032 10734718 a3cc7e vmlinux.o v7: - checkpatch warnings fixed - Implement the changes requested by Hugh Dickins: - make simple_xattrs_init and simple_xattrs_free inline - get rid of locking and list reinitialization in simple_xattrs_free, they're not needed v6: - no changes v5: - no changes v4: - move simple_xattrs_free() to fs/xattr.c v3: - in kmem_xattrs_free(), reinitialize the list - use simple_xattr_* prefix - introduce simple_xattr_add() to prevent direct list usage Original-patch-by: Li Zefan <lizefan@huawei.com> Cc: Li Zefan <lizefan@huawei.com> Cc: Hillf Danton <dhillf@gmail.com> Cc: Lennart Poettering <lpoetter@redhat.com> Acked-by: Hugh Dickins <hughd@google.com> Signed-off-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Aristeu Rozanski <aris@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2012-07-31tmpfs: distribute interleave better across nodesNathan Zimmer
When tmpfs has the interleave memory policy, it always starts allocating for each file from node 0 at offset 0. When there are many small files, the lower nodes fill up disproportionately. This patch spreads out node usage by starting files at nodes other than 0, by using the inode number to bias the starting node for interleave. Signed-off-by: Nathan Zimmer <nzimmer@sgi.com> Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Nick Piggin <npiggin@gmail.com> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Rik van Riel <riel@redhat.com> Cc: Andi Kleen <andi@firstfloor.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-14don't pass nameidata to ->create()Al Viro
boolean "does it have to be exclusive?" flag is passed instead; Local filesystem should just ignore it - the object is guaranteed not to be there yet. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-11shmem: cleanup shmem_add_to_page_cacheHugh Dickins
shmem_add_to_page_cache() has three callsites, but only one of them wants the radix_tree_preload() (an exceptional entry guarantees that the radix tree node is present in the other cases), and only that site can achieve mem_cgroup_uncharge_cache_page() (PageSwapCache makes it a no-op in the other cases). We did it this way originally to reflect add_to_page_cache_locked(); but it's confusing now, so move the radix_tree preloading and mem_cgroup uncharging to that one caller. Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-11shmem: fix negative rss in memcg memory.statHugh Dickins
When adding the page_private checks before calling shmem_replace_page(), I did realize that there is a further race, but thought it too unlikely to need a hurried fix. But independently I've been chasing why a mem cgroup's memory.stat sometimes shows negative rss after all tasks have gone: I expected it to be a stats gathering bug, but actually it's shmem swapping's fault. It's an old surprise, that when you lock_page(lookup_swap_cache(swap)), the page may have been removed from swapcache before getting the lock; or it may have been freed and reused and be back in swapcache; and it can even be using the same swap location as before (page_private same). The swapoff case is already secure against this (swap cannot be reused until the whole area has been swapped off, and a new swapped on); and shmem_getpage_gfp() is protected by shmem_add_to_page_cache()'s check for the expected radix_tree entry - but a little too late. By that time, we might have already decided to shmem_replace_page(): I don't know of a problem from that, but I'd feel more at ease not to do so spuriously. And we have already done mem_cgroup_cache_charge(), on perhaps the wrong mem cgroup: and this charge is not then undone on the error path, because PageSwapCache ends up preventing that. It's this last case which causes the occasional negative rss in memory.stat: the page is charged here as cache, but (sometimes) found to be anon when eventually it's uncharged - and in between, it's an undeserved charge on the wrong memcg. Fix this by adding an earlier check on the radix_tree entry: it's inelegant to descend the tree twice, but swapping is not the fast path, and a better solution would need a pair (try+commit) of memcg calls, and a rework of shmem_replace_page() to keep out of the swapcache. We can use the added shmem_confirm_swap() function to replace the find_get_page+page_cache_release we were already doing on the error path. And add a comment on that -EEXIST: it seems a peculiar errno to be using, but originates from its use in radix_tree_insert(). [It can be surprising to see positive rss left in a memcg's memory.stat after all tasks have gone, since it is supposed to count anonymous but not shmem. Aside from sharing anon pages via fork with a task in some other memcg, it often happens after swapping: because a swap page can't be freed while under writeback, nor while locked. So it's not an error, and these residual pages are easily freed once pressure demands.] Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>