summaryrefslogtreecommitdiffstats
path: root/fs
AgeCommit message (Collapse)Author
2012-05-21wake up s_wait_unfrozen when ->freeze_fs failsKazuya Mio
commit e1616300a20c80396109c1cf013ba9a36055a3da upstream. dd slept infinitely when fsfeeze failed because of EIO. To fix this problem, if ->freeze_fs fails, freeze_super() wakes up the tasks waiting for the filesystem to become unfrozen. When s_frozen isn't SB_UNFROZEN in __generic_file_aio_write(), the function sleeps until FITHAW ioctl wakes up s_wait_unfrozen. However, if ->freeze_fs fails, s_frozen is set to SB_UNFROZEN and then freeze_super() returns an error number. In this case, FITHAW ioctl returns EINVAL because s_frozen is already SB_UNFROZEN. There is no way to wake up s_wait_unfrozen, so __generic_file_aio_write() sleeps infinitely. Signed-off-by: Kazuya Mio <k-mio@sx.jp.nec.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-05-21ext4: fix error handling on inode bitmap corruptionJan Kara
commit acd6ad83517639e8f09a8c5525b1dccd81cd2a10 upstream. When insert_inode_locked() fails in ext4_new_inode() it most likely means inode bitmap got corrupted and we allocated again inode which is already in use. Also doing unlock_new_inode() during error recovery is wrong since the inode does not have I_NEW set. Fix the problem by jumping to fail: (instead of fail_drop:) which declares filesystem error and does not call unlock_new_inode(). Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-05-21ext3: Fix error handling on inode bitmap corruptionJan Kara
commit 1415dd8705394399d59a3df1ab48d149e1e41e77 upstream. When insert_inode_locked() fails in ext3_new_inode() it most likely means inode bitmap got corrupted and we allocated again inode which is already in use. Also doing unlock_new_inode() during error recovery is wrong since inode does not have I_NEW set. Fix the problem by jumping to fail: (instead of fail_drop:) which declares filesystem error and does not call unlock_new_inode(). Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-05-21NFSv4: Revalidate uid/gid after openJonathan Nieder
This is a shorter (and more appropriate for stable kernels) analog to the following upstream commit: commit 6926afd1925a54a13684ebe05987868890665e2b Author: Trond Myklebust <Trond.Myklebust@netapp.com> Date: Sat Jan 7 13:22:46 2012 -0500 NFSv4: Save the owner/group name string when doing open ...so that we can do the uid/gid mapping outside the asynchronous RPC context. This fixes a bug in the current NFSv4 atomic open code where the client isn't able to determine what the true uid/gid fields of the file are, (because the asynchronous nature of the OPEN call denies it the ability to do an upcall) and so fills them with default values, marking the inode as needing revalidation. Unfortunately, in some cases, the VFS will do some additional sanity checks on the file, and may override the server's decision to allow the open because it sees the wrong owner/group fields. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Without this patch, logging into two different machines with home directories mounted over NFS4 and then running "vim" and typing ":q" in each reliably produces the following error on the second machine: E137: Viminfo file is not writable: /users/system/rtheys/.viminfo This regression was introduced by 80e52aced138 ("NFSv4: Don't do idmapper upcalls for asynchronous RPC calls", merged during the 2.6.32 cycle) --- after the OPEN call, .viminfo has the default values for st_uid and st_gid (0xfffffffe) cached because we do not want to let rpciod wait for an idmapper upcall to fill them in. The fix used in mainline is to save the owner and group as strings and perform the upcall in _nfs4_proc_open outside the rpciod context, which takes about 600 lines. For stable, we can do something similar with a one-liner: make open check for the stale fields and make a (synchronous) GETATTR call to fill them when needed. Trond dictated the patch, I typed it in, and Rik tested it. Addresses http://bugs.debian.org/659111 and https://bugzilla.redhat.com/789298 Reported-by: Rik Theys <Rik.Theys@esat.kuleuven.be> Explained-by: David Flyn <davidf@rd.bbc.co.uk> Signed-off-by: Jonathan Nieder <jrnieder@gmail.com> Tested-by: Rik Theys <Rik.Theys@esat.kuleuven.be> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-05-21ext4: avoid deadlock on sync-mounted FS w/o journalEric Sandeen
commit c1bb05a657fb3d8c6179a4ef7980261fae4521d7 upstream. Processes hang forever on a sync-mounted ext2 file system that is mounted with the ext4 module (default in Fedora 16). I can reproduce this reliably by mounting an ext2 partition with "-o sync" and opening a new file an that partition with vim. vim will hang in "D" state forever. The same happens on ext4 without a journal. I am attaching a small patch here that solves this issue for me. In the sync mounted case without a journal, ext4_handle_dirty_metadata() may call sync_dirty_buffer(), which can't be called with buffer lock held. Also move mb_cache_entry_release inside lock to avoid race fixed previously by 8a2bfdcb ext[34]: EA block reference count racing fix Note too that ext2 fixed this same problem in 2006 with b2f49033 [PATCH] fix deadlock in ext2 Signed-off-by: Martin.Wilck@ts.fujitsu.com [sandeen@redhat.com: move mb_cache_entry_release before unlock, edit commit msg] Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-05-21jffs2: Fix lock acquisition order bug in gc pathJosh Cartwright
commit 226bb7df3d22bcf4a1c0fe8206c80cc427498eae upstream. The locking policy is such that the erase_complete_block spinlock is nested within the alloc_sem mutex. This fixes a case in which the acquisition order was erroneously reversed. This issue was caught by the following lockdep splat: ======================================================= [ INFO: possible circular locking dependency detected ] 3.0.5 #1 ------------------------------------------------------- jffs2_gcd_mtd6/299 is trying to acquire lock: (&c->alloc_sem){+.+.+.}, at: [<c01f7714>] jffs2_garbage_collect_pass+0x314/0x890 but task is already holding lock: (&(&c->erase_completion_lock)->rlock){+.+...}, at: [<c01f7708>] jffs2_garbage_collect_pass+0x308/0x890 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #1 (&(&c->erase_completion_lock)->rlock){+.+...}: [<c008bec4>] validate_chain+0xe6c/0x10bc [<c008c660>] __lock_acquire+0x54c/0xba4 [<c008d240>] lock_acquire+0xa4/0x114 [<c046780c>] _raw_spin_lock+0x3c/0x4c [<c01f744c>] jffs2_garbage_collect_pass+0x4c/0x890 [<c01f937c>] jffs2_garbage_collect_thread+0x1b4/0x1cc [<c0071a68>] kthread+0x98/0xa0 [<c000f264>] kernel_thread_exit+0x0/0x8 -> #0 (&c->alloc_sem){+.+.+.}: [<c008ad2c>] print_circular_bug+0x70/0x2c4 [<c008c08c>] validate_chain+0x1034/0x10bc [<c008c660>] __lock_acquire+0x54c/0xba4 [<c008d240>] lock_acquire+0xa4/0x114 [<c0466628>] mutex_lock_nested+0x74/0x33c [<c01f7714>] jffs2_garbage_collect_pass+0x314/0x890 [<c01f937c>] jffs2_garbage_collect_thread+0x1b4/0x1cc [<c0071a68>] kthread+0x98/0xa0 [<c000f264>] kernel_thread_exit+0x0/0x8 other info that might help us debug this: Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(&(&c->erase_completion_lock)->rlock); lock(&c->alloc_sem); lock(&(&c->erase_completion_lock)->rlock); lock(&c->alloc_sem); *** DEADLOCK *** 1 lock held by jffs2_gcd_mtd6/299: #0: (&(&c->erase_completion_lock)->rlock){+.+...}, at: [<c01f7708>] jffs2_garbage_collect_pass+0x308/0x890 stack backtrace: [<c00155dc>] (unwind_backtrace+0x0/0x100) from [<c0463dc0>] (dump_stack+0x20/0x24) [<c0463dc0>] (dump_stack+0x20/0x24) from [<c008ae84>] (print_circular_bug+0x1c8/0x2c4) [<c008ae84>] (print_circular_bug+0x1c8/0x2c4) from [<c008c08c>] (validate_chain+0x1034/0x10bc) [<c008c08c>] (validate_chain+0x1034/0x10bc) from [<c008c660>] (__lock_acquire+0x54c/0xba4) [<c008c660>] (__lock_acquire+0x54c/0xba4) from [<c008d240>] (lock_acquire+0xa4/0x114) [<c008d240>] (lock_acquire+0xa4/0x114) from [<c0466628>] (mutex_lock_nested+0x74/0x33c) [<c0466628>] (mutex_lock_nested+0x74/0x33c) from [<c01f7714>] (jffs2_garbage_collect_pass+0x314/0x890) [<c01f7714>] (jffs2_garbage_collect_pass+0x314/0x890) from [<c01f937c>] (jffs2_garbage_collect_thread+0x1b4/0x1cc) [<c01f937c>] (jffs2_garbage_collect_thread+0x1b4/0x1cc) from [<c0071a68>] (kthread+0x98/0xa0) [<c0071a68>] (kthread+0x98/0xa0) from [<c000f264>] (kernel_thread_exit+0x0/0x8) This was introduce in '81cfc9f jffs2: Fix serious write stall due to erase'. Signed-off-by: Josh Cartwright <joshc@linux.com> Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com> Signed-off-by: David Woodhouse <David.Woodhouse@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-05-07hfsplus: Fix potential buffer overflowsGreg Kroah-Hartman
commit 6f24f892871acc47b40dd594c63606a17c714f77 upstream. Commit ec81aecb2966 ("hfs: fix a potential buffer overflow") fixed a few potential buffer overflows in the hfs filesystem. But as Timo Warns pointed out, these changes also need to be made on the hfsplus filesystem as well. Reported-by: Timo Warns <warns@pre-sense.de> Acked-by: WANG Cong <amwang@redhat.com> Cc: Alexey Khoroshilov <khoroshilov@ispras.ru> Cc: Miklos Szeredi <mszeredi@suse.cz> Cc: Sage Weil <sage@newdream.net> Cc: Eugene Teo <eteo@redhat.com> Cc: Roman Zippel <zippel@linux-m68k.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christoph Hellwig <hch@lst.de> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Dave Anderson <anderson@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-07autofs: make the autofsv5 packet file descriptor use a packetized pipeLinus Torvalds
commit 64f371bc3107e69efce563a3d0f0e6880de0d537 upstream. The autofs packet size has had a very unfortunate size problem on x86: because the alignment of 'u64' differs in 32-bit and 64-bit modes, and because the packet data was not 8-byte aligned, the size of the autofsv5 packet structure differed between 32-bit and 64-bit modes despite looking otherwise identical (300 vs 304 bytes respectively). We first fixed that up by making the 64-bit compat mode know about this problem in commit a32744d4abae ("autofs: work around unhappy compat problem on x86-64"), and that made a 32-bit 'systemd' work happily on a 64-bit kernel because everything then worked the same way as on a 32-bit kernel. But it turned out that 'automount' had actually known and worked around this problem in user space, so fixing the kernel to do the proper 32-bit compatibility handling actually *broke* 32-bit automount on a 64-bit kernel, because it knew that the packet sizes were wrong and expected those incorrect sizes. As a result, we ended up reverting that compatibility mode fix, and thus breaking systemd again, in commit fcbf94b9dedd. With both automount and systemd doing a single read() system call, and verifying that they get *exactly* the size they expect but using different sizes, it seemed that fixing one of them inevitably seemed to break the other. At one point, a patch I seriously considered applying from Michael Tokarev did a "strcmp()" to see if it was automount that was doing the operation. Ugly, ugly. However, a prettier solution exists now thanks to the packetized pipe mode. By marking the communication pipe as being packetized (by simply setting the O_DIRECT flag), we can always just write the bigger packet size, and if user-space does a smaller read, it will just get that partial end result and the extra alignment padding will simply be thrown away. This makes both automount and systemd happy, since they now get the size they asked for, and the kernel side of autofs simply no longer needs to care - it could pad out the packet arbitrarily. Of course, if there is some *other* user of autofs (please, please, please tell me it ain't so - and we haven't heard of any) that tries to read the packets with multiple writes, that other user will now be broken - the whole point of the packetized mode is that one system call gets exactly one packet, and you cannot read a packet in pieces. Tested-by: Michael Tokarev <mjt@tls.msk.ru> Cc: Alan Cox <alan@lxorguk.ukuu.org.uk> Cc: David Miller <davem@davemloft.net> Cc: Ian Kent <raven@themaw.net> Cc: Thomas Meyer <thomas@m3y3r.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-05-07pipes: add a "packetized pipe" mode for writingLinus Torvalds
commit 9883035ae7edef3ec62ad215611cb8e17d6a1a5d upstream. The actual internal pipe implementation is already really about individual packets (called "pipe buffers"), and this simply exposes that as a special packetized mode. When we are in the packetized mode (marked by O_DIRECT as suggested by Alan Cox), a write() on a pipe will not merge the new data with previous writes, so each write will get a pipe buffer of its own. The pipe buffer is then marked with the PIPE_BUF_FLAG_PACKET flag, which in turn will tell the reader side to break the read at that boundary (and throw away any partial packet contents that do not fit in the read buffer). End result: as long as you do writes less than PIPE_BUF in size (so that the pipe doesn't have to split them up), you can now treat the pipe as a packet interface, where each read() system call will read one packet at a time. You can just use a sufficiently big read buffer (PIPE_BUF is sufficient, since bigger than that doesn't guarantee atomicity anyway), and the return value of the read() will naturally give you the size of the packet. NOTE! We do not support zero-sized packets, and zero-sized reads and writes to a pipe continue to be no-ops. Also note that big packets will currently be split at write time, but that the size at which that happens is not really specified (except that it's bigger than PIPE_BUF). Currently that limit is the system page size, but we might want to explicitly support bigger packets some day. The main user for this is going to be the autofs packet interface, allowing us to stop having to care so deeply about exact packet sizes (which have had bugs with 32/64-bit compatibility modes). But user space can create packetized pipes with "pipe2(fd, O_DIRECT)", which will fail with an EINVAL on kernels that do not support this interface. Tested-by: Michael Tokarev <mjt@tls.msk.ru> Cc: Alan Cox <alan@lxorguk.ukuu.org.uk> Cc: David Miller <davem@davemloft.net> Cc: Ian Kent <raven@themaw.net> Cc: Thomas Meyer <thomas@m3y3r.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-05-07nfsd: fix error values returned by nfsd4_lockt() when nfsd_open() failsAl Viro
commit 04da6e9d63427b2d0fd04766712200c250b3278f upstream. nfsd_open() already returns an NFS error value; only vfs_test_lock() result needs to be fed through nfserrno(). Broken by commit 55ef12 (nfsd: Ensure nfsv4 calls the underlying filesystem on LOCKT) three years ago... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Jonathan Nieder <jrnieder@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-05-07nfsd: fix b0rken error value for setattr on read-only mountAl Viro
commit 96f6f98501196d46ce52c2697dd758d9300c63f5 upstream. ..._want_write() returns -EROFS on failure, _not_ an NFS error value. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Jonathan Nieder <jrnieder@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-05-07Revert "autofs: work around unhappy compat problem on x86-64"Linus Torvalds
commit fcbf94b9dedd2ce08e798a99aafc94fec8668161 upstream. This reverts commit a32744d4abae24572eff7269bc17895c41bd0085. While that commit was technically the right thing to do, and made the x86-64 compat mode work identically to native 32-bit mode (and thus fixing the problem with a 32-bit systemd install on a 64-bit kernel), it turns out that the automount binaries had workarounds for this compat problem. Now, the workarounds are disgusting: doing an "uname()" to find out the architecture of the kernel, and then comparing it for the 64-bit cases and fixing up the size of the read() in automount for those. And they were confused: it's not actually a generic 64-bit issue at all, it's very much tied to just x86-64, which has different alignment for an 'u64' in 64-bit mode than in 32-bit mode. But the end result is that fixing the compat layer actually breaks the case of a 32-bit automount on a x86-64 kernel. There are various approaches to fix this (including just doing a "strcmp()" on current->comm and comparing it to "automount"), but I think that I will do the one that teaches pipes about a special "packet mode", which will allow user space to not have to care too deeply about the padding at the end of the autofs packet. That change will make the compat workaround unnecessary, so let's revert it first, and get automount working again in compat mode. The packetized pipes will then fix autofs for systemd. Reported-and-requested-by: Michael Tokarev <mjt@tls.msk.ru> Cc: Ian Kent <raven@themaw.net> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-05-07NFSv4: Ensure that we check lock exclusive/shared type against open modesTrond Myklebust
commit 55725513b5ef9d462aa3e18527658a0362aaae83 upstream. Since we may be simulating flock() locks using NFS byte range locks, we can't rely on the VFS having checked the file open mode for us. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-05-07NFSv4: Ensure that the LOCK code sets exception->inodeTrond Myklebust
commit 05ffe24f5290dc095f98fbaf84afe51ef404ccc5 upstream. All callers of nfs4_handle_exception() that need to handle NFS4ERR_OPENMODE correctly should set exception->inode Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-05-07nfs: Enclose hostname in brackets when needed in nfs_do_root_mountJan Kara
commit 98a2139f4f4d7b5fcc3a54c7fddbe88612abed20 upstream. When hostname contains colon (e.g. when it is an IPv6 address) it needs to be enclosed in brackets to make parsing of NFS device string possible. Fix nfs_do_root_mount() to enclose hostname properly when needed. NFS code actually does not need this as it does not parse the string passed by nfs_do_root_mount() but the device string is exposed to userspace in /proc/mounts. CC: Josh Boyer <jwboyer@redhat.com> CC: Trond Myklebust <Trond.Myklebust@netapp.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-04-27tcp: allow splice() to build full TSO packetsEric Dumazet
[ This combines upstream commit 2f53384424251c06038ae612e56231b96ab610ee and the follow-on bug fix commit 35f9c09fe9c72eb8ca2b8e89a593e1c151f28fc2 ] vmsplice()/splice(pipe, socket) call do_tcp_sendpages() one page at a time, adding at most 4096 bytes to an skb. (assuming PAGE_SIZE=4096) The call to tcp_push() at the end of do_tcp_sendpages() forces an immediate xmit when pipe is not already filled, and tso_fragment() try to split these skb to MSS multiples. 4096 bytes are usually split in a skb with 2 MSS, and a remaining sub-mss skb (assuming MTU=1500) This makes slow start suboptimal because many small frames are sent to qdisc/driver layers instead of big ones (constrained by cwnd and packets in flight of course) In fact, applications using sendmsg() (adding an additional memory copy) instead of vmsplice()/splice()/sendfile() are a bit faster because of this anomaly, especially if serving small files in environments with large initial [c]wnd. Call tcp_push() only if MSG_MORE is not set in the flags parameter. This bit is automatically provided by splice() internals but for the last page, or on all pages if user specified SPLICE_F_MORE splice() flag. In some workloads, this can reduce number of sent logical packets by an order of magnitude, making zero-copy TCP actually faster than one-copy :) Reported-by: Tom Herbert <therbert@google.com> Cc: Nandita Dukkipati <nanditad@google.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Tom Herbert <therbert@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: H.K. Jerry Chu <hkchu@google.com> Cc: Maciej Żenczykowski <maze@google.com> Cc: Mahesh Bandewar <maheshb@google.com> Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-04-27lockd: fix the endianness bugAl Viro
commit e847469bf77a1d339274074ed068d461f0c872bc upstream. comparing be32 values for < is not doing the right thing... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Cc: "J. Bruce Fields" <bfields@fieldses.org> Cc: Trond Myklebust <Trond.Myklebust@netapp.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-04-27ocfs2: ->e_leaf_clusters endianness breakageAl Viro
commit 72094e43e3af5020510f920321d71f1798fa896d upstream. le16, not le32... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-04-27ocfs2: ->rl_count endianness breakageAl Viro
commit 28748b325dc2d730ccc312830a91c4ae0c0d9379 upstream. le16, not le32... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-04-27ocfs: ->rl_used breakage on big-endianAl Viro
commit e1bf4cc620fd143766ddfcee3b004a1d1bb34fd0 upstream. it's le16, not le32 or le64... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-04-27ocfs2: ->l_next_free_req breakage on big-endianAl Viro
commit 3a251f04fe97c3d335b745c98e4b377e3c3899f2 upstream. It's le16, not le32... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-04-27btrfs: btrfs_root_readonly() broken on big-endianAl Viro
commit 6ed3cf2cdfce4c9f1d73171bd3f27d9cb77b734e upstream. ->root_flags is __le64 and all accesses to it go through the helpers that do proper conversions. Except for btrfs_root_readonly(), which checks bit 0 as in host-endian... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Cc: Chris Mason <chris.mason@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-04-27nfsd: fix compose_entry_fh() failure exitsAl Viro
commit efe39651f08813180f37dc508d950fc7d92b29a8 upstream. Restore the original logics ("fail on mountpoints, negatives and in case of fh_compose() failures"). Since commit 8177e (nfsd: clean up readdirplus encoding) that got broken - rv = fh_compose(fhp, exp, dchild, &cd->fh); if (rv) goto out; if (!dchild->d_inode) goto out; rv = 0; out: is equivalent to rv = fh_compose(fhp, exp, dchild, &cd->fh); out: and the second check has no effect whatsoever... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Cc: "J. Bruce Fields" <bfields@fieldses.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-04-27Don't limit non-nested epoll pathsJason Baron
commit 93dc6107a76daed81c07f50215fa6ae77691634f upstream. Commit 28d82dc1c4ed ("epoll: limit paths") that I did to limit the number of possible wakeup paths in epoll is causing a few applications to longer work (dovecot for one). The original patch is really about limiting the amount of epoll nesting (since epoll fds can be attached to other fds). Thus, we probably can allow an unlimited number of paths of depth 1. My current patch limits it at 1000. And enforce the limits on paths that have a greater depth. This is captured in: https://bugzilla.redhat.com/show_bug.cgi?id=681578 Signed-off-by: Jason Baron <jbaron@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-04-27ext4: fix endianness breakage in ext4_split_extent_at()Al Viro
commit af1584f570b19b0285e4402a0b54731495d31784 upstream. ->ee_len is __le16, so assigning cpu_to_le32() to it is going to do Bad Things(tm) on big-endian hosts... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Cc: Ted Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-04-27jbd2: use GFP_NOFS for blkdev_issue_flushShaohua Li
commit 99aa78466777083255b876293e9e83dec7cd809a upstream. flush request is issued in transaction commit code path, so looks using GFP_KERNEL to allocate memory for flush request bio falls into the classic deadlock issue. I saw btrfs and dm get it right, but ext4, xfs and md are using GFP. Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-04-02lockd: fix arg parsing for grace_period and timeout.NeilBrown
commit de5b8e8e047534aac6bc9803f96e7257436aef9c upstream. If you try to set grace_period or timeout via a module parameter to lockd, and do this on a big-endian machine where sizeof(int) != sizeof(unsigned long) it won't work. This number given will be effectively shifted right by the difference in those two sizes. So cast kp->arg properly to get correct result. Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-04-02xfs: Fix oops on IO error during xlog_recover_process_iunlinks()Jan Kara
commit d97d32edcd732110758799ae60af725e5110b3dc upstream. When an IO error happens during inode deletion run from xlog_recover_process_iunlinks() filesystem gets shutdown. Thus any subsequent attempt to read buffers fails. Code in xlog_recover_process_iunlinks() does not count with the fact that read of a buffer which was read a while ago can really fail which results in the oops on agi = XFS_BUF_TO_AGI(agibp); Fix the problem by cleaning up the buffer handling in xlog_recover_process_iunlinks() as suggested by Dave Chinner. We release buffer lock but keep buffer reference to AG buffer. That is enough for buffer to stay pinned in memory and we don't have to call xfs_read_agi() all the time. Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-04-02udf: Fix deadlock in udf_release_file()Jan Kara
commit a0391a3ae91d301c0e59368531a4de5f0b122bcf upstream. udf_release_file() can be called from munmap() path with mmap_sem held. Thus we cannot take i_mutex there because that ranks above mmap_sem. Luckily, i_mutex is not needed in udf_release_file() anymore since protection by i_data_sem is enough to protect from races with write and truncate. Reported-by: Al Viro <viro@ZenIV.linux.org.uk> Reviewed-by: Namjae Jeon <linkinjeon@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-04-02vfs: fix d_ancestor() case in d_materialize_uniqueMichel Lespinasse
commit b18dafc86bb879d2f38a1743985d7ceb283c2f4d upstream. In d_materialise_unique() there are 3 subcases to the 'aliased dentry' case; in two subcases the inode i_lock is properly released but this does not occur in the -ELOOP subcase. This seems to have been introduced by commit 1836750115f2 ("fix loop checks in d_materialise_unique()"). Signed-off-by: Michel Lespinasse <walken@google.com> [ Added a comment, and moved the unlock to where we generate the -ELOOP, which seems to be more natural. You probably can't actually trigger this without a buggy network file server - d_materialize_unique() is for finding aliases on non-local filesystems, and the d_ancestor() case is for a hardlinked directory loop. But we should be robust in the case of such buggy servers anyway. ] Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-04-02ext4: check for zero length extentTheodore Ts'o
commit 31d4f3a2f3c73f279ff96a7135d7202ef6833f12 upstream. Explicitly test for an extent whose length is zero, and flag that as a corrupted extent. This avoids a kernel BUG_ON assertion failure. Tested: Without this patch, the file system image found in tests/f_ext_zero_len/image.gz in the latest e2fsprogs sources causes a kernel panic. With this patch, an ext4 file system error is noted instead, and the file system is marked as being corrupted. https://bugzilla.kernel.org/show_bug.cgi?id=42859 Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-04-02ext4: ignore EXT4_INODE_JOURNAL_DATA flag with delallocLukas Czerner
commit 3d2b158262826e8b75bbbfb7b97010838dd92ac7 upstream. Ext4 does not support data journalling with delayed allocation enabled. We even do not allow to mount the file system with delayed allocation and data journalling enabled, however it can be set via FS_IOC_SETFLAGS so we can hit the inode with EXT4_INODE_JOURNAL_DATA set even on file system mounted with delayed allocation (default) and that's where problem arises. The easies way to reproduce this problem is with the following set of commands: mkfs.ext4 /dev/sdd mount /dev/sdd /mnt/test1 dd if=/dev/zero of=/mnt/test1/file bs=1M count=4 chattr +j /mnt/test1/file dd if=/dev/zero of=/mnt/test1/file bs=1M count=4 conv=notrunc chattr -j /mnt/test1/file Additionally it can be reproduced quite reliably with xfstests 272 and 269. In fact the above reproducer is a part of test 272. To fix this we should ignore the EXT4_INODE_JOURNAL_DATA inode flag if the file system is mounted with delayed allocation. This can be easily done by fixing ext4_should_*_data() functions do ignore data journal flag when delalloc is set (suggested by Ted). We also have to set the appropriate address space operations for the inode (again, ignoring data journal flag if delalloc enabled). Additionally this commit introduces ext4_inode_journal_mode() function because ext4_should_*_data() has already had a lot of common code and this change is putting it all into one function so it is easier to read. Successfully tested with xfstests in following configurations: delalloc + data=ordered delalloc + data=writeback data=journal nodelalloc + data=ordered nodelalloc + data=writeback nodelalloc + data=journal Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-04-02jbd2: clear BH_Delay & BH_Unwritten in journal_unmap_bufferEric Sandeen
commit 15291164b22a357cb211b618adfef4fa82fc0de3 upstream. journal_unmap_buffer()'s zap_buffer: code clears a lot of buffer head state ala discard_buffer(), but does not touch _Delay or _Unwritten as discard_buffer() does. This can be problematic in some areas of the ext4 code which assume that if they have found a buffer marked unwritten or delay, then it's a live one. Perhaps those spots should check whether it is mapped as well, but if jbd2 is going to tear down a buffer, let's really tear it down completely. Without this I get some fsx failures on sub-page-block filesystems up until v3.2, at which point 4e96b2dbbf1d7e81f22047a50f862555a6cb87cb and 189e868fa8fdca702eb9db9d8afc46b5cb9144c9 make the failures go away, because buried within that large change is some more flag clearing. I still think it's worth doing in jbd2, since ->invalidatepage leads here directly, and it's the right place to clear away these flags. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-04-02ext4: flush any pending end_io requests before DIO reads w/dioread_nolockJiaying Zhang
commit dccaf33fa37a1bc5d651baeb3bfeb6becb86597b upstream. (backported to 3.0 by mjt) There is a race between ext4 buffer write and direct_IO read with dioread_nolock mount option enabled. The problem is that we clear PageWriteback flag during end_io time but will do uninitialized-to-initialized extent conversion later with dioread_nolock. If an O_direct read request comes in during this period, ext4 will return zero instead of the recently written data. This patch checks whether there are any pending uninitialized-to-initialized extent conversion requests before doing O_direct read to close the race. Note that this is just a bandaid fix. The fundamental issue is that we clear PageWriteback flag before we really complete an IO, which is problem-prone. To fix the fundamental issue, we may need to implement an extent tree cache that we can use to look up pending to-be-converted extents. Signed-off-by: Jiaying Zhang <jiayingz@google.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Michael Tokarev <mjt@tls.msk.ru> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-04-02proc-ns: use d_set_d_op() API to set dentry ops in proc_ns_instantiate().Pravin B Shelar
commit 1b26c9b334044cff6d1d2698f2be41bc7d9a0864 upstream. The namespace cleanup path leaks a dentry which holds a reference count on a network namespace. Keeping that network namespace from being freed when the last user goes away. Leaving things like vlan devices in the leaked network namespace. If you use ip netns add for much real work this problem becomes apparent pretty quickly. It light testing the problem hides because frequently you simply don't notice the leak. Use d_set_d_op() so that DCACHE_OP_* flags are set correctly. This issue exists back to 3.0. Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> Reported-by: Justin Pettit <jpettit@nicira.com> Signed-off-by: Pravin B Shelar <pshelar@nicira.com> Signed-off-by: Jesse Gross <jesse@nicira.com> Cc: David Miller <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-04-02cifs: fix issue mounting of DFS ROOT when redirecting from one domain ↵Jeff Layton
controller to the next commit 1daaae8fa4afe3df78ca34e724ed7e8187e4eb32 upstream. This patch fixes an issue when cifs_mount receives a STATUS_BAD_NETWORK_NAME error during cifs_get_tcon but is able to continue after an DFS ROOT referral. In this case, the return code variable is not reset prior to trying to mount from the system referred to. Thus, is_path_accessible is not executed and the final DFS referral is not performed causing a mount error. Use case: In DNS, example.com resolves to the secondary AD server ad2.example.com Our primary domain controller is ad1.example.com and has a DFS redirection set up from \\ad1\share\Users to \\files\share\Users. Mounting \\example.com\share\Users fails. Regression introduced by commit 724d9f1. Reviewed-by: Pavel Shilovsky <piastry@etersoft.ru Signed-off-by: Thomas Hadig <thomas@intapp.com> Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-04-02xfs: fix inode lookup raceDave Chinner
commit f30d500f809eca67a21704347ab14bb35877b5ee upstream. When we get concurrent lookups of the same inode that is not in the per-AG inode cache, there is a race condition that triggers warnings in unlock_new_inode() indicating that we are initialising an inode that isn't in a the correct state for a new inode. When we do an inode lookup via a file handle or a bulkstat, we don't serialise lookups at a higher level through the dentry cache (i.e. pathless lookup), and so we can get concurrent lookups of the same inode. The race condition is between the insertion of the inode into the cache in the case of a cache miss and a concurrently lookup: Thread 1 Thread 2 xfs_iget() xfs_iget_cache_miss() xfs_iread() lock radix tree radix_tree_insert() rcu_read_lock radix_tree_lookup lock inode flags XFS_INEW not set igrab() unlock inode flags rcu_read_unlock use uninitialised inode ..... lock inode flags set XFS_INEW unlock inode flags unlock radix tree xfs_setup_inode() inode flags = I_NEW unlock_new_inode() WARNING as inode flags != I_NEW This can lead to inode corruption, inode list corruption, etc, and is generally a bad thing to occur. Fix this by setting XFS_INEW before inserting the inode into the radix tree. This will ensure any concurrent lookup will find the new inode with XFS_INEW set and that forces the lookup to wait until the XFS_INEW flag is removed before allowing the lookup to succeed. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ben Myers <bpm@sgi.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-04-02NFSv4: Return the delegation if the server returns NFS4ERR_OPENMODETrond Myklebust
commit 3114ea7a24d3264c090556a2444fc6d2c06176d4 upstream. If a setattr() fails because of an NFS4ERR_OPENMODE error, it is probably due to us holding a read delegation. Ensure that the recovery routines return that delegation in this case. Reported-by: Miklos Szeredi <miklos@szeredi.hu> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-04-02NFS: Properly handle the case where the delegation is revokedTrond Myklebust
commit a1d0b5eebc4fd6e0edb02688b35f17f67f42aea5 upstream. If we know that the delegation stateid is bad or revoked, we need to remove that delegation as soon as possible, and then mark all the stateids that relied on that delegation for recovery. We cannot use the delegation as part of the recovery process. Also note that NFSv4.1 uses a different error code (NFS4ERR_DELEG_REVOKED) to indicate that the delegation was revoked. Finally, ensure that setlk() and setattr() can both recover safely from a revoked delegation. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-04-02hugetlbfs: avoid taking i_mutex from hugetlbfs_read()Aneesh Kumar K.V
commit a05b0855fd15504972dba2358e5faa172a1e50ba upstream. Taking i_mutex in hugetlbfs_read() can result in deadlock with mmap as explained below Thread A: read() on hugetlbfs hugetlbfs_read() called i_mutex grabbed hugetlbfs_read_actor() called __copy_to_user() called page fault is triggered Thread B, sharing address space with A: mmap() the same file ->mmap_sem is grabbed on task_B->mm->mmap_sem hugetlbfs_file_mmap() is called attempt to grab ->i_mutex and block waiting for A to give it up Thread A: pagefault handled blocked on attempt to grab task_A->mm->mmap_sem, which happens to be the same thing as task_B->mm->mmap_sem. Block waiting for B to give it up. AFAIU the i_mutex locking was added to hugetlbfs_read() as per http://lkml.indiana.edu/hypermail/linux/kernel/0707.2/3066.html to take care of the race between truncate and read. This patch fixes this by looking at page->mapping under lock_page() (find_lock_page()) to ensure that the inode didn't get truncated in the range during a parallel read. Ideally we can extend the patch to make sure we don't increase i_size in mmap. But that will break userspace, because applications will now have to use truncate(2) to increase i_size in hugetlbfs. Based on the original patch from Hillf Danton. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Hillf Danton <dhillf@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-04-02mm: thp: fix pmd_bad() triggering in code paths holding mmap_sem read modeAndrea Arcangeli
commit 1a5a9906d4e8d1976b701f889d8f35d54b928f25 upstream. In some cases it may happen that pmd_none_or_clear_bad() is called with the mmap_sem hold in read mode. In those cases the huge page faults can allocate hugepmds under pmd_none_or_clear_bad() and that can trigger a false positive from pmd_bad() that will not like to see a pmd materializing as trans huge. It's not khugepaged causing the problem, khugepaged holds the mmap_sem in write mode (and all those sites must hold the mmap_sem in read mode to prevent pagetables to go away from under them, during code review it seems vm86 mode on 32bit kernels requires that too unless it's restricted to 1 thread per process or UP builds). The race is only with the huge pagefaults that can convert a pmd_none() into a pmd_trans_huge(). Effectively all these pmd_none_or_clear_bad() sites running with mmap_sem in read mode are somewhat speculative with the page faults, and the result is always undefined when they run simultaneously. This is probably why it wasn't common to run into this. For example if the madvise(MADV_DONTNEED) runs zap_page_range() shortly before the page fault, the hugepage will not be zapped, if the page fault runs first it will be zapped. Altering pmd_bad() not to error out if it finds hugepmds won't be enough to fix this, because zap_pmd_range would then proceed to call zap_pte_range (which would be incorrect if the pmd become a pmd_trans_huge()). The simplest way to fix this is to read the pmd in the local stack (regardless of what we read, no need of actual CPU barriers, only compiler barrier needed), and be sure it is not changing under the code that computes its value. Even if the real pmd is changing under the value we hold on the stack, we don't care. If we actually end up in zap_pte_range it means the pmd was not none already and it was not huge, and it can't become huge from under us (khugepaged locking explained above). All we need is to enforce that there is no way anymore that in a code path like below, pmd_trans_huge can be false, but pmd_none_or_clear_bad can run into a hugepmd. The overhead of a barrier() is just a compiler tweak and should not be measurable (I only added it for THP builds). I don't exclude different compiler versions may have prevented the race too by caching the value of *pmd on the stack (that hasn't been verified, but it wouldn't be impossible considering pmd_none_or_clear_bad, pmd_bad, pmd_trans_huge, pmd_none are all inlines and there's no external function called in between pmd_trans_huge and pmd_none_or_clear_bad). if (pmd_trans_huge(*pmd)) { if (next-addr != HPAGE_PMD_SIZE) { VM_BUG_ON(!rwsem_is_locked(&tlb->mm->mmap_sem)); split_huge_page_pmd(vma->vm_mm, pmd); } else if (zap_huge_pmd(tlb, vma, pmd, addr)) continue; /* fall through */ } if (pmd_none_or_clear_bad(pmd)) Because this race condition could be exercised without special privileges this was reported in CVE-2012-1179. The race was identified and fully explained by Ulrich who debugged it. I'm quoting his accurate explanation below, for reference. ====== start quote ======= mapcount 0 page_mapcount 1 kernel BUG at mm/huge_memory.c:1384! At some point prior to the panic, a "bad pmd ..." message similar to the following is logged on the console: mm/memory.c:145: bad pmd ffff8800376e1f98(80000000314000e7). The "bad pmd ..." message is logged by pmd_clear_bad() before it clears the page's PMD table entry. 143 void pmd_clear_bad(pmd_t *pmd) 144 { -> 145 pmd_ERROR(*pmd); 146 pmd_clear(pmd); 147 } After the PMD table entry has been cleared, there is an inconsistency between the actual number of PMD table entries that are mapping the page and the page's map count (_mapcount field in struct page). When the page is subsequently reclaimed, __split_huge_page() detects this inconsistency. 1381 if (mapcount != page_mapcount(page)) 1382 printk(KERN_ERR "mapcount %d page_mapcount %d\n", 1383 mapcount, page_mapcount(page)); -> 1384 BUG_ON(mapcount != page_mapcount(page)); The root cause of the problem is a race of two threads in a multithreaded process. Thread B incurs a page fault on a virtual address that has never been accessed (PMD entry is zero) while Thread A is executing an madvise() system call on a virtual address within the same 2 MB (huge page) range. virtual address space .---------------------. | | | | .-|---------------------| | | | | | |<-- B(fault) | | | 2 MB | |/////////////////////|-. huge < |/////////////////////| > A(range) page | |/////////////////////|-' | | | | | | '-|---------------------| | | | | '---------------------' - Thread A is executing an madvise(..., MADV_DONTNEED) system call on the virtual address range "A(range)" shown in the picture. sys_madvise // Acquire the semaphore in shared mode. down_read(&current->mm->mmap_sem) ... madvise_vma switch (behavior) case MADV_DONTNEED: madvise_dontneed zap_page_range unmap_vmas unmap_page_range zap_pud_range zap_pmd_range // // Assume that this huge page has never been accessed. // I.e. content of the PMD entry is zero (not mapped). // if (pmd_trans_huge(*pmd)) { // We don't get here due to the above assumption. } // // Assume that Thread B incurred a page fault and .---------> // sneaks in here as shown below. | // | if (pmd_none_or_clear_bad(pmd)) | { | if (unlikely(pmd_bad(*pmd))) | pmd_clear_bad | { | pmd_ERROR | // Log "bad pmd ..." message here. | pmd_clear | // Clear the page's PMD entry. | // Thread B incremented the map count | // in page_add_new_anon_rmap(), but | // now the page is no longer mapped | // by a PMD entry (-> inconsistency). | } | } | v - Thread B is handling a page fault on virtual address "B(fault)" shown in the picture. ... do_page_fault __do_page_fault // Acquire the semaphore in shared mode. down_read_trylock(&mm->mmap_sem) ... handle_mm_fault if (pmd_none(*pmd) && transparent_hugepage_enabled(vma)) // We get here due to the above assumption (PMD entry is zero). do_huge_pmd_anonymous_page alloc_hugepage_vma // Allocate a new transparent huge page here. ... __do_huge_pmd_anonymous_page ... spin_lock(&mm->page_table_lock) ... page_add_new_anon_rmap // Here we increment the page's map count (starts at -1). atomic_set(&page->_mapcount, 0) set_pmd_at // Here we set the page's PMD entry which will be cleared // when Thread A calls pmd_clear_bad(). ... spin_unlock(&mm->page_table_lock) The mmap_sem does not prevent the race because both threads are acquiring it in shared mode (down_read). Thread B holds the page_table_lock while the page's map count and PMD table entry are updated. However, Thread A does not synchronize on that lock. ====== end quote ======= [akpm@linux-foundation.org: checkpatch fixes] Reported-by: Ulrich Obergfell <uobergfe@redhat.com> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Hugh Dickins <hughd@google.com> Cc: Dave Jones <davej@redhat.com> Acked-by: Larry Woodman <lwoodman@redhat.com> Acked-by: Rik van Riel <riel@redhat.com> Cc: Mark Salter <msalter@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-04-02sysfs: Fix memory leak in sysfs_sd_setsecdata().Masami Ichikawa
commit 93518dd2ebafcc761a8637b2877008cfd748c202 upstream. This patch fixies follwing two memory leak patterns that reported by kmemleak. sysfs_sd_setsecdata() is called during sys_lsetxattr() operation. It checks sd->s_iattr is NULL or not. Then if it is NULL, it calls sysfs_init_inode_attrs() to allocate memory. That code is this. iattrs = sd->s_iattr; if (!iattrs) iattrs = sysfs_init_inode_attrs(sd); The iattrs recieves sysfs_init_inode_attrs()'s result, but sd->s_iattr doesn't know the address. so it needs to set correct address to sd->s_iattr to free memory in other function. unreferenced object 0xffff880250b73e60 (size 32): comm "systemd", pid 1, jiffies 4294683888 (age 94.553s) hex dump (first 32 bytes): 73 79 73 74 65 6d 5f 75 3a 6f 62 6a 65 63 74 5f system_u:object_ 72 3a 73 79 73 66 73 5f 74 3a 73 30 00 00 00 00 r:sysfs_t:s0.... backtrace: [<ffffffff814cb1d0>] kmemleak_alloc+0x73/0x98 [<ffffffff811270ab>] __kmalloc+0x100/0x12c [<ffffffff8120775a>] context_struct_to_string+0x106/0x210 [<ffffffff81207cc1>] security_sid_to_context_core+0x10b/0x129 [<ffffffff812090ef>] security_sid_to_context+0x10/0x12 [<ffffffff811fb0da>] selinux_inode_getsecurity+0x7d/0xa8 [<ffffffff811fb127>] selinux_inode_getsecctx+0x22/0x2e [<ffffffff811f4d62>] security_inode_getsecctx+0x16/0x18 [<ffffffff81191dad>] sysfs_setxattr+0x96/0x117 [<ffffffff811542f0>] __vfs_setxattr_noperm+0x73/0xd9 [<ffffffff811543d9>] vfs_setxattr+0x83/0xa1 [<ffffffff811544c6>] setxattr+0xcf/0x101 [<ffffffff81154745>] sys_lsetxattr+0x6a/0x8f [<ffffffff814efda9>] system_call_fastpath+0x16/0x1b [<ffffffffffffffff>] 0xffffffffffffffff unreferenced object 0xffff88024163c5a0 (size 96): comm "systemd", pid 1, jiffies 4294683888 (age 94.553s) hex dump (first 32 bytes): 00 00 00 00 ed 41 00 00 00 00 00 00 00 00 00 00 .....A.......... 00 00 00 00 00 00 00 00 0c 64 42 4f 00 00 00 00 .........dBO.... backtrace: [<ffffffff814cb1d0>] kmemleak_alloc+0x73/0x98 [<ffffffff81127402>] kmem_cache_alloc_trace+0xc4/0xee [<ffffffff81191cbe>] sysfs_init_inode_attrs+0x2a/0x83 [<ffffffff81191dd6>] sysfs_setxattr+0xbf/0x117 [<ffffffff811542f0>] __vfs_setxattr_noperm+0x73/0xd9 [<ffffffff811543d9>] vfs_setxattr+0x83/0xa1 [<ffffffff811544c6>] setxattr+0xcf/0x101 [<ffffffff81154745>] sys_lsetxattr+0x6a/0x8f [<ffffffff814efda9>] system_call_fastpath+0x16/0x1b [<ffffffffffffffff>] 0xffffffffffffffff ` Signed-off-by: Masami Ichikawa <masami256@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-03-23afs: Remote abort can cause BUG in rxrpc codeAnton Blanchard
commit c0173863528a8c9212c53e080d63a1aaae5ef4f4 upstream. When writing files to afs I sometimes hit a BUG: kernel BUG at fs/afs/rxrpc.c:179! With a backtrace of: afs_free_call afs_make_call afs_fs_store_data afs_vnode_store_data afs_write_back_from_locked_page afs_writepages_region afs_writepages The cause is: ASSERT(skb_queue_empty(&call->rx_queue)); Looking at a tcpdump of the session the abort happens because we are exceeding our disk quota: rx abort fs reply store-data error diskquota exceeded (32) So the abort error is valid. We hit the BUG because we haven't freed all the resources for the call. By freeing any skbs in call->rx_queue before calling afs_free_call we avoid hitting leaking memory and avoid hitting the BUG. Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-03-23afs: Read of file returns EBADMSGAnton Blanchard
commit 2c724fb92732c0b2a5629eb8af74e82eb62ac947 upstream. A read of a large file on an afs mount failed: # cat junk.file > /dev/null cat: junk.file: Bad message Looking at the trace, call->offset wrapped since it is only an unsigned short. In afs_extract_data: _enter("{%u},{%zu},%d,,%zu", call->offset, len, last, count); ... if (call->offset < count) { if (last) { _leave(" = -EBADMSG [%d < %zu]", call->offset, count); return -EBADMSG; } Which matches the trace: [cat ] ==> afs_extract_data({65132},{524},1,,65536) [cat ] <== afs_extract_data() = -EBADMSG [0 < 65536] call->offset went from 65132 to 0. Fix this by making call->offset an unsigned int. Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-03-23nilfs2: fix NULL pointer dereference in nilfs_load_super_block()Ryusuke Konishi
commit d7178c79d9b7c5518f9943188091a75fc6ce0675 upstream. According to the report from Slicky Devil, nilfs caused kernel oops at nilfs_load_super_block function during mount after he shrank the partition without resizing the filesystem: BUG: unable to handle kernel NULL pointer dereference at 00000048 IP: [<d0d7a08e>] nilfs_load_super_block+0x17e/0x280 [nilfs2] *pde = 00000000 Oops: 0000 [#1] PREEMPT SMP ... Call Trace: [<d0d7a87b>] init_nilfs+0x4b/0x2e0 [nilfs2] [<d0d6f707>] nilfs_mount+0x447/0x5b0 [nilfs2] [<c0226636>] mount_fs+0x36/0x180 [<c023d961>] vfs_kern_mount+0x51/0xa0 [<c023ddae>] do_kern_mount+0x3e/0xe0 [<c023f189>] do_mount+0x169/0x700 [<c023fa9b>] sys_mount+0x6b/0xa0 [<c04abd1f>] sysenter_do_call+0x12/0x28 Code: 53 18 8b 43 20 89 4b 18 8b 4b 24 89 53 1c 89 43 24 89 4b 20 8b 43 20 c7 43 2c 00 00 00 00 23 75 e8 8b 50 68 89 53 28 8b 54 b3 20 <8b> 72 48 8b 7a 4c 8b 55 08 89 b3 84 00 00 00 89 bb 88 00 00 00 EIP: [<d0d7a08e>] nilfs_load_super_block+0x17e/0x280 [nilfs2] SS:ESP 0068:ca9bbdcc CR2: 0000000000000048 This turned out due to a defect in an error path which runs if the calculated location of the secondary super block was invalid. This patch fixes it and eliminates the reported oops. Reported-by: Slicky Devil <slicky.dvl@gmail.com> Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> Tested-by: Slicky Devil <slicky.dvl@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-03-19block: Fix NULL pointer dereference in sd_revalidate_diskJun'ichi Nomura
commit fe316bf2d5847bc5dd975668671a7b1067603bc7 upstream. Since 2.6.39 (1196f8b), when a driver returns -ENOMEDIUM for open(), __blkdev_get() calls rescan_partitions() to remove in-kernel partition structures and raise KOBJ_CHANGE uevent. However it ends up calling driver's revalidate_disk without open and could cause oops. In the case of SCSI: process A process B ---------------------------------------------- sys_open __blkdev_get sd_open returns -ENOMEDIUM scsi_remove_device <scsi_device torn down> rescan_partitions sd_revalidate_disk <oops> Oopses are reported here: http://marc.info/?l=linux-scsi&m=132388619710052 This patch separates the partition invalidation from rescan_partitions() and use it for -ENOMEDIUM case. Reported-by: Huajun Li <huajun.li.lee@gmail.com> Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-03-19vfs: fix double put after complete_walk()Miklos Szeredi
commit 097b180ca09b581ef0dc24fbcfc1b227de3875df upstream. complete_walk() already puts nd->path, no need to do it again at cleanup time. This would result in Oopses if triggered, apparently the codepath is not too well exercised. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-03-19vfs: fix return value from do_last()Miklos Szeredi
commit 7f6c7e62fcc123e6bd9206da99a2163fe3facc31 upstream. complete_walk() returns either ECHILD or ESTALE. do_last() turns this into ECHILD unconditionally. If not in RCU mode, this error will reach userspace which is complete nonsense. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-03-19aio: fix the "too late munmap()" raceAl Viro
commit c7b285550544c22bc005ec20978472c9ac7138c6 upstream. Current code has put_ioctx() called asynchronously from aio_fput_routine(); that's done *after* we have killed the request that used to pin ioctx, so there's nothing to stop io_destroy() waiting in wait_for_all_aios() from progressing. As the result, we can end up with async call of put_ioctx() being the last one and possibly happening during exit_mmap() or elf_core_dump(), neither of which expects stray munmap() being done to them... We do need to prevent _freeing_ ioctx until aio_fput_routine() is done with that, but that's all we care about - neither io_destroy() nor exit_aio() will progress past wait_for_all_aios() until aio_fput_routine() does really_put_req(), so the ioctx teardown won't be done until then and we don't care about the contents of ioctx past that point. Since actual freeing of these suckers is RCU-delayed, we don't need to bump ioctx refcount when request goes into list for async removal. All we need is rcu_read_lock held just over the ->ctx_lock-protected area in aio_fput_routine(). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Acked-by: Benjamin LaHaise <bcrl@kvack.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-03-19aio: fix io_setup/io_destroy raceAl Viro
commit 86b62a2cb4fc09037bbce2959d2992962396fd7f upstream. Have ioctx_alloc() return an extra reference, so that caller would drop it on success and not bother with re-grabbing it on failure exit. The current code is obviously broken - io_destroy() from another thread that managed to guess the address io_setup() would've returned would free ioctx right under us; gets especially interesting if aio_context_t * we pass to io_setup() points to PROT_READ mapping, so put_user() fails and we end up doing io_destroy() on kioctx another thread has just got freed... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Acked-by: Benjamin LaHaise <bcrl@kvack.org> Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>