summaryrefslogtreecommitdiffstats
path: root/fs
AgeCommit message (Collapse)Author
2016-09-11fix d_walk()/non-delayed __d_free() raceWilly Tarreau
I checked Jari's explanation below and found that v3.14.77 and v3.12.62 are missing the same fix as 3.10. In fact Al's original commit 3d56c25 ("fix d_walk()/non-delayed __d_free() race") used to mention to check this __d_materialise_dentry() function in the Cc: stable line, but this got lost during the backports. Normally all of our 3 kernels need to apply the following patch that Ben correctly put in 3.16 and 3.2. I'm fixing the backport in 3.10.103 right now. On Mon, Aug 22, 2016 at 04:56:57PM +0300, Jari Ruusu wrote: > This patch for 3.10 branch appears to be missing one important > > + dentry->d_flags |= DCACHE_RCUACCESS; > > in fs/dcache.c __d_materialise_dentry() function. When Ben Hutchings > backported Al Viro's original fix to stable branches that he maintains, > he added that one additional line to both 3.2 and 3.16 branches. Please > consider including that additional one line fix for 3.10 stable branch > also. > > > Ben Hutchings said this on his 3.2.82-rc1 patch: > [bwh: Backported to 3.2: > - Adjust context > - Also set the flag in __d_materialise_dentry())] > > http://marc.info/?l=linux-kernel&m=147117565612275&w=2 > > > Ben Hutchings said this on his 3.16.37-rc1 patch: > [bwh: Backported to 3.16: > - Adjust context > - Also set the flag in __d_materialise_dentry())] > > http://marc.info/?l=linux-kernel&m=147117433412006&w=2 > > > Also mentioned by Sasha Levin on 3.18 and 4.1 commits: > Cc: stable@vger.kernel.org # v3.2+ (and watch out for __d_materialise_dentry()) > > http://marc.info/?l=linux-stable-commits&m=146648034410827&w=2 > http://marc.info/?l=linux-stable-commits&m=146647471009771&w=2 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-09-11ext4: validate that metadata blocks do not overlap superblockTheodore Ts'o
commit 829fa70dddadf9dd041d62b82cd7cea63943899d upstream. A number of fuzzing failures seem to be caused by allocation bitmaps or other metadata blocks being pointed at the superblock. This can cause kernel BUG or WARNings once the superblock is overwritten, so validate the group descriptor blocks to make sure this doesn't happen. Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-09-07fs/seq_file: fix out-of-bounds readVegard Nossum
commit 088bf2ff5d12e2e32ee52a4024fec26e582f44d3 upstream. seq_read() is a nasty piece of work, not to mention buggy. It has (I think) an old bug which allows unprivileged userspace to read beyond the end of m->buf. I was getting these: BUG: KASAN: slab-out-of-bounds in seq_read+0xcd2/0x1480 at addr ffff880116889880 Read of size 2713 by task trinity-c2/1329 CPU: 2 PID: 1329 Comm: trinity-c2 Not tainted 4.8.0-rc1+ #96 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.3-0-ge2fc41e-prebuilt.qemu-project.org 04/01/2014 Call Trace: kasan_object_err+0x1c/0x80 kasan_report_error+0x2cb/0x7e0 kasan_report+0x4e/0x80 check_memory_region+0x13e/0x1a0 kasan_check_read+0x11/0x20 seq_read+0xcd2/0x1480 proc_reg_read+0x10b/0x260 do_loop_readv_writev.part.5+0x140/0x2c0 do_readv_writev+0x589/0x860 vfs_readv+0x7b/0xd0 do_readv+0xd8/0x2c0 SyS_readv+0xb/0x10 do_syscall_64+0x1b3/0x4b0 entry_SYSCALL64_slow_path+0x25/0x25 Object at ffff880116889100, in cache kmalloc-4096 size: 4096 Allocated: PID = 1329 save_stack_trace+0x26/0x80 save_stack+0x46/0xd0 kasan_kmalloc+0xad/0xe0 __kmalloc+0x1aa/0x4a0 seq_buf_alloc+0x35/0x40 seq_read+0x7d8/0x1480 proc_reg_read+0x10b/0x260 do_loop_readv_writev.part.5+0x140/0x2c0 do_readv_writev+0x589/0x860 vfs_readv+0x7b/0xd0 do_readv+0xd8/0x2c0 SyS_readv+0xb/0x10 do_syscall_64+0x1b3/0x4b0 return_from_SYSCALL_64+0x0/0x6a Freed: PID = 0 (stack is not available) Memory state around the buggy address: ffff88011688a000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ffff88011688a080: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >ffff88011688a100: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc ^ ffff88011688a180: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc ffff88011688a200: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ================================================================== Disabling lock debugging due to kernel taint This seems to be the same thing that Dave Jones was seeing here: https://lkml.org/lkml/2016/8/12/334 There are multiple issues here: 1) If we enter the function with a non-empty buffer, there is an attempt to flush it. But it was not clearing m->from after doing so, which means that if we try to do this flush twice in a row without any call to traverse() in between, we are going to be reading from the wrong place -- the splat above, fixed by this patch. 2) If there's a short write to userspace because of page faults, the buffer may already contain multiple lines (i.e. pos has advanced by more than 1), but we don't save the progress that was made so the next call will output what we've already returned previously. Since that is a much less serious issue (and I have a headache after staring at seq_read() for the past 8 hours), I'll leave that for now. Link: http://lkml.kernel.org/r/1471447270-32093-1-git-send-email-vegard.nossum@oracle.com Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com> Reported-by: Dave Jones <davej@codemonkey.org.uk> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-08-20nfs: don't create zero-length requestsBenjamin Coddington
commit 149a4fddd0a72d526abbeac0c8deaab03559836a upstream. NFS doesn't expect requests with wb_bytes set to zero and may make unexpected decisions about how to handle that request at the page IO layer. Skip request creation if we won't have any wb_bytes in the request. Signed-off-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Reviewed-by: Weston Andros Adamson <dros@primarydata.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-08-20CIFS: Fix a possible invalid memory access in smb2_query_symlink()Pavel Shilovsky
commit 7893242e2465aea6f2cbc2639da8fa5ce96e8cc2 upstream. During following a symbolic link we received err_buf from SMB2_open(). While the validity of SMB2 error response is checked previously in smb2_check_message() a symbolic link payload is not checked at all. Fix it by adding such checks. Cc: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Pavel Shilovsky <pshilovsky@samba.org> Signed-off-by: Steve French <smfrench@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-08-20cifs: fix crash due to race in hmac(md5) handlingRabin Vincent
commit bd975d1eead2558b76e1079e861eacf1f678b73b upstream. The secmech hmac(md5) structures are present in the TCP_Server_Info struct and can be shared among multiple CIFS sessions. However, the server mutex is not currently held when these structures are allocated and used, which can lead to a kernel crashes, as in the scenario below: mount.cifs(8) #1 mount.cifs(8) #2 Is secmech.sdeschmaccmd5 allocated? // false Is secmech.sdeschmaccmd5 allocated? // false secmech.hmacmd = crypto_alloc_shash.. secmech.sdeschmaccmd5 = kzalloc.. sdeschmaccmd5->shash.tfm = &secmec.hmacmd; secmech.sdeschmaccmd5 = kzalloc // sdeschmaccmd5->shash.tfm // not yet assigned crypto_shash_update() deref NULL sdeschmaccmd5->shash.tfm Unable to handle kernel paging request at virtual address 00000030 epc : 8027ba34 crypto_shash_update+0x38/0x158 ra : 8020f2e8 setup_ntlmv2_rsp+0x4bc/0xa84 Call Trace: crypto_shash_update+0x38/0x158 setup_ntlmv2_rsp+0x4bc/0xa84 build_ntlmssp_auth_blob+0xbc/0x34c sess_auth_rawntlmssp_authenticate+0xac/0x248 CIFS_SessSetup+0xf0/0x178 cifs_setup_session+0x4c/0x84 cifs_get_smb_ses+0x2c8/0x314 cifs_mount+0x38c/0x76c cifs_do_mount+0x98/0x440 mount_fs+0x20/0xc0 vfs_kern_mount+0x58/0x138 do_mount+0x1e8/0xccc SyS_mount+0x88/0xd4 syscall_common+0x30/0x54 Fix this by locking the srv_mutex around the code which uses these hmac(md5) structures. All the other secmech algos already have similar locking. Fixes: 95dc8dd14e2e84cc ("Limit allocation of crypto mechanisms to dialect which requires") Signed-off-by: Rabin Vincent <rabinv@axis.com> Acked-by: Sachin Prabhu <sprabhu@redhat.com> Signed-off-by: Steve French <smfrench@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-08-20cifs: Check for existing directory when opening file with O_CREATSachin Prabhu
commit 8d9535b6efd86e6c07da59f97e68f44efb7fe080 upstream. When opening a file with O_CREAT flag, check to see if the file opened is an existing directory. This prevents the directory from being opened which subsequently causes a crash when the close function for directories cifs_closedir() is called which frees up the file->private_data memory while the file is still listed on the open file list for the tcon. Signed-off-by: Sachin Prabhu <sprabhu@redhat.com> Signed-off-by: Steve French <smfrench@gmail.com> Reported-by: Xiaoli Feng <xifeng@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-08-16ext4: fix reference counting bug on block allocation errorVegard Nossum
commit 554a5ccc4e4a20c5f3ec859de0842db4b4b9c77e upstream. If we hit this error when mounted with errors=continue or errors=remount-ro: EXT4-fs error (device loop0): ext4_mb_mark_diskspace_used:2940: comm ext4.exe: Allocating blocks 5090-6081 which overlap fs metadata then ext4_mb_new_blocks() will call ext4_mb_release_context() and try to continue. However, ext4_mb_release_context() is the wrong thing to call here since we are still actually using the allocation context. Instead, just error out. We could retry the allocation, but there is a possibility of getting stuck in an infinite loop instead, so this seems safer. [ Fixed up so we don't return EAGAIN to userspace. --tytso ] Fixes: 8556e8f3b6 ("ext4: Don't allow new groups to be added during block allocation") Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-08-16ext4: short-cut orphan cleanup on errorVegard Nossum
commit c65d5c6c81a1f27dec5f627f67840726fcd146de upstream. If we encounter a filesystem error during orphan cleanup, we should stop. Otherwise, we may end up in an infinite loop where the same inode is processed again and again. EXT4-fs (loop0): warning: checktime reached, running e2fsck is recommended EXT4-fs error (device loop0): ext4_mb_generate_buddy:758: group 2, block bitmap and bg descriptor inconsistent: 6117 vs 0 free clusters Aborting journal on device loop0-8. EXT4-fs (loop0): Remounting filesystem read-only EXT4-fs error (device loop0) in ext4_free_blocks:4895: Journal has aborted EXT4-fs error (device loop0) in ext4_do_update_inode:4893: Journal has aborted EXT4-fs error (device loop0) in ext4_do_update_inode:4893: Journal has aborted EXT4-fs error (device loop0) in ext4_ext_remove_space:3068: IO failure EXT4-fs error (device loop0) in ext4_ext_truncate:4667: Journal has aborted EXT4-fs error (device loop0) in ext4_orphan_del:2927: Journal has aborted EXT4-fs error (device loop0) in ext4_do_update_inode:4893: Journal has aborted EXT4-fs (loop0): Inode 16 (00000000618192a0): orphan list check failed! [...] EXT4-fs (loop0): Inode 16 (0000000061819748): orphan list check failed! [...] EXT4-fs (loop0): Inode 16 (0000000061819bf0): orphan list check failed! [...] See-also: c9eb13a9105 ("ext4: fix hang when processing corrupted orphaned inode list") Cc: Jan Kara <jack@suse.cz> Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-08-16ext4: don't call ext4_should_journal_data() on the journal inodeVegard Nossum
commit 6a7fd522a7c94cdef0a3b08acf8e6702056e635c upstream. If ext4_fill_super() fails early, it's possible for ext4_evict_inode() to call ext4_should_journal_data() before superblock options and flags are fully set up. In that case, the iput() on the journal inode can end up causing a BUG(). Work around this problem by reordering the tests so we only call ext4_should_journal_data() after we know it's not the journal inode. Fixes: 2d859db3e4 ("ext4: fix data corruption in inodes with journalled data") Fixes: 2b405bfa84 ("ext4: fix data=journal fast mount/umount hang") Cc: Jan Kara <jack@suse.cz> Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-08-16ext4: fix deadlock during page writebackJan Kara
commit 646caa9c8e196880b41cd3e3d33a2ebc752bdb85 upstream. Commit 06bd3c36a733 (ext4: fix data exposure after a crash) uncovered a deadlock in ext4_writepages() which was previously much harder to hit. After this commit xfstest generic/130 reproduces the deadlock on small filesystems. The problem happens when ext4_do_update_inode() sets LARGE_FILE feature and marks current inode handle as synchronous. That subsequently results in ext4_journal_stop() called from ext4_writepages() to block waiting for transaction commit while still holding page locks, reference to io_end, and some prepared bio in mpd structure each of which can possibly block transaction commit from completing and thus results in deadlock. Fix the problem by releasing page locks, io_end reference, and submitting prepared bio before calling ext4_journal_stop(). [ Changed to defer the call to ext4_journal_stop() only if the handle is synchronous. --tytso ] Reported-and-tested-by: Eryu Guan <eguan@redhat.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-08-16ext4: check for extents that wrap aroundVegard Nossum
commit f70749ca42943faa4d4dcce46dfdcaadb1d0c4b6 upstream. An extent with lblock = 4294967295 and len = 1 will pass the ext4_valid_extent() test: ext4_lblk_t last = lblock + len - 1; if (len == 0 || lblock > last) return 0; since last = 4294967295 + 1 - 1 = 4294967295. This would later trigger the BUG_ON(es->es_lblk + es->es_len < es->es_lblk) in ext4_es_end(). We can simplify it by removing the - 1 altogether and changing the test to use lblock + len <= lblock, since now if len = 0, then lblock + 0 == lblock and it fails, and if len > 0 then lblock + len > lblock in order to pass (i.e. it doesn't overflow). Fixes: 5946d0893 ("ext4: check for overlapping extents in ext4_valid_extent_entries()") Fixes: 2f974865f ("ext4: check for zero length extent explicitly") Cc: Eryu Guan <guaneryu@gmail.com> Signed-off-by: Phil Turnbull <phil.turnbull@oracle.com> Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-08-16fuse: fix wrong assignment of ->flags in fuse_send_init()Wei Fang
commit 9446385f05c9af25fed53dbed3cc75763730be52 upstream. FUSE_HAS_IOCTL_DIR should be assigned to ->flags, it may be a typo. Signed-off-by: Wei Fang <fangwei1@huawei.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Fixes: 69fe05c90ed5 ("fuse: add missing INIT flags") Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-08-10ext4: verify extent header depthVegard Nossum
commit 7bc9491645118c9461bd21099c31755ff6783593 upstream. Although the extent tree depth of 5 should enough be for the worst case of 2*32 extents of length 1, the extent tree code does not currently to merge nodes which are less than half-full with a sibling node, or to shrink the tree depth if possible. So it's possible, at least in theory, for the tree depth to be greater than 5. However, even in the worst case, a tree depth of 32 is highly unlikely, and if the file system is maliciously corrupted, an insanely large eh_depth can cause memory allocation failures that will trigger kernel warnings (here, eh_depth = 65280): JBD2: ext4.exe wants too many credits credits:195849 rsv_credits:0 max:256 ------------[ cut here ]------------ WARNING: CPU: 0 PID: 50 at fs/jbd2/transaction.c:293 start_this_handle+0x569/0x580 CPU: 0 PID: 50 Comm: ext4.exe Not tainted 4.7.0-rc5+ #508 Stack: 604a8947 625badd8 0002fd09 00000000 60078643 00000000 62623910 601bf9bc 62623970 6002fc84 626239b0 900000125 Call Trace: [<6001c2dc>] show_stack+0xdc/0x1a0 [<601bf9bc>] dump_stack+0x2a/0x2e [<6002fc84>] __warn+0x114/0x140 [<6002fdff>] warn_slowpath_null+0x1f/0x30 [<60165829>] start_this_handle+0x569/0x580 [<60165d4e>] jbd2__journal_start+0x11e/0x220 [<60146690>] __ext4_journal_start_sb+0x60/0xa0 [<60120a81>] ext4_truncate+0x131/0x3a0 [<60123677>] ext4_setattr+0x757/0x840 [<600d5d0f>] notify_change+0x16f/0x2a0 [<600b2b16>] do_truncate+0x76/0xc0 [<600c3e56>] path_openat+0x806/0x1300 [<600c55c9>] do_filp_open+0x89/0xf0 [<600b4074>] do_sys_open+0x134/0x1e0 [<600b4140>] SyS_open+0x20/0x30 [<6001ea68>] handle_syscall+0x88/0x90 [<600295fd>] userspace+0x3fd/0x500 [<6001ac55>] fork_handler+0x85/0x90 ---[ end trace 08b0b88b6387a244 ]--- [ Commit message modified and the extent tree depath check changed from 5 to 32 -- tytso ] Cc: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-08-10ecryptfs: don't allow mmap when the lower fs doesn't support itJeff Mahoney
commit f0fe970df3838c202ef6c07a4c2b36838ef0a88b upstream. There are legitimate reasons to disallow mmap on certain files, notably in sysfs or procfs. We shouldn't emulate mmap support on file systems that don't offer support natively. CVE-2016-1583 Signed-off-by: Jeff Mahoney <jeffm@suse.com> [tyhicks: clean up f_op check by using ecryptfs_file_to_lower()] Signed-off-by: Tyler Hicks <tyhicks@canonical.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-08-10Revert "ecryptfs: forbid opening files without mmap handler"Jeff Mahoney
commit 78c4e172412de5d0456dc00d2b34050aa0b683b5 upstream. This reverts commit 2f36db71009304b3f0b95afacd8eba1f9f046b87. It fixed a local root exploit but also introduced a dependency on the lower file system implementing an mmap operation just to open a file, which is a bit of a heavy hammer. The right fix is to have mmap depend on the existence of the mmap handler instead. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: Tyler Hicks <tyhicks@canonical.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-08-10fs/nilfs2: fix potential underflow in call to crc32_leTorsten Hilbrich
commit 63d2f95d63396059200c391ca87161897b99e74a upstream. The value `bytes' comes from the filesystem which is about to be mounted. We cannot trust that the value is always in the range we expect it to be. Check its value before using it to calculate the length for the crc32_le call. It value must be larger (or equal) sumoff + 4. This fixes a kernel bug when accidentially mounting an image file which had the nilfs2 magic value 0x3434 at the right offset 0x406 by chance. The bytes 0x01 0x00 were stored at 0x408 and were interpreted as a s_bytes value of 1. This caused an underflow when substracting sumoff + 4 (20) in the call to crc32_le. BUG: unable to handle kernel paging request at ffff88021e600000 IP: crc32_le+0x36/0x100 ... Call Trace: nilfs_valid_sb.part.5+0x52/0x60 [nilfs2] nilfs_load_super_block+0x142/0x300 [nilfs2] init_nilfs+0x60/0x390 [nilfs2] nilfs_mount+0x302/0x520 [nilfs2] mount_fs+0x38/0x160 vfs_kern_mount+0x67/0x110 do_mount+0x269/0xe00 SyS_mount+0x9f/0x100 entry_SYSCALL_64_fastpath+0x16/0x71 Link: http://lkml.kernel.org/r/1466778587-5184-2-git-send-email-konishi.ryusuke@lab.ntt.co.jp Signed-off-by: Torsten Hilbrich <torsten.hilbrich@secunet.com> Tested-by: Torsten Hilbrich <torsten.hilbrich@secunet.com> Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-07-27Fix reconnect to not defer smb3 session reconnect long after socket reconnectSteve French
commit 4fcd1813e6404dd4420c7d12fb483f9320f0bf93 upstream. Azure server blocks clients that open a socket and don't do anything on it. In our reconnect scenarios, we can reconnect the tcp session and detect the socket is available but we defer the negprot and SMB3 session setup and tree connect reconnection until the next i/o is requested, but this looks suspicous to some servers who expect SMB3 negprog and session setup soon after a socket is created. In the echo thread, reconnect SMB3 sessions and tree connections that are disconnected. A later patch will replay persistent (and resilient) handle opens. Signed-off-by: Steve French <steve.french@primarydata.com> Acked-by: Pavel Shilovsky <pshilovsky@samba.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-07-27nfsd: check permissions when setting ACLsBen Hutchings
commit 999653786df6954a31044528ac3f7a5dadca08f4 upstream. Use set_posix_acl, which includes proper permission checks, instead of calling ->set_acl directly. Without this anyone may be able to grant themselves permissions to a file by setting the ACL. Lock the inode to make the new checks atomic with respect to set_acl. (Also, nfsd was the only caller of set_acl not locking the inode, so I suspect this may fix other races.) This also simplifies the code, and ensures our ACLs are checked by posix_acl_valid. The permission checks and the inode locking were lost with commit 4ac7249e, which changed nfsd to use the set_acl inode operation directly instead of going through xattr handlers. Reported-by: David Sinquin <david@sinquin.eu> [agreunba@redhat.com: use set_posix_acl] Fixes: 4ac7249e Cc: Christoph Hellwig <hch@infradead.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-07-27posix_acl: Add set_posix_aclAndreas Gruenbacher
commit 485e71e8fb6356c08c7fc6bcce4bf02c9a9a663f upstream. Factor out part of posix_acl_xattr_set into a common function that takes a posix_acl, which nfsd can also call. The prototype already exists in include/linux/posix_acl.h. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-07-27UBIFS: Implement ->migratepage()Kirill A. Shutemov
commit 4ac1c17b2044a1b4b2fbed74451947e905fc2992 upstream. During page migrations UBIFS might get confused and the following assert triggers: [ 213.480000] UBIFS assert failed in ubifs_set_page_dirty at 1451 (pid 436) [ 213.490000] CPU: 0 PID: 436 Comm: drm-stress-test Not tainted 4.4.4-00176-geaa802524636-dirty #1008 [ 213.490000] Hardware name: Allwinner sun4i/sun5i Families [ 213.490000] [<c0015e70>] (unwind_backtrace) from [<c0012cdc>] (show_stack+0x10/0x14) [ 213.490000] [<c0012cdc>] (show_stack) from [<c02ad834>] (dump_stack+0x8c/0xa0) [ 213.490000] [<c02ad834>] (dump_stack) from [<c0236ee8>] (ubifs_set_page_dirty+0x44/0x50) [ 213.490000] [<c0236ee8>] (ubifs_set_page_dirty) from [<c00fa0bc>] (try_to_unmap_one+0x10c/0x3a8) [ 213.490000] [<c00fa0bc>] (try_to_unmap_one) from [<c00fadb4>] (rmap_walk+0xb4/0x290) [ 213.490000] [<c00fadb4>] (rmap_walk) from [<c00fb1bc>] (try_to_unmap+0x64/0x80) [ 213.490000] [<c00fb1bc>] (try_to_unmap) from [<c010dc28>] (migrate_pages+0x328/0x7a0) [ 213.490000] [<c010dc28>] (migrate_pages) from [<c00d0cb0>] (alloc_contig_range+0x168/0x2f4) [ 213.490000] [<c00d0cb0>] (alloc_contig_range) from [<c010ec00>] (cma_alloc+0x170/0x2c0) [ 213.490000] [<c010ec00>] (cma_alloc) from [<c001a958>] (__alloc_from_contiguous+0x38/0xd8) [ 213.490000] [<c001a958>] (__alloc_from_contiguous) from [<c001ad44>] (__dma_alloc+0x23c/0x274) [ 213.490000] [<c001ad44>] (__dma_alloc) from [<c001ae08>] (arm_dma_alloc+0x54/0x5c) [ 213.490000] [<c001ae08>] (arm_dma_alloc) from [<c035cecc>] (drm_gem_cma_create+0xb8/0xf0) [ 213.490000] [<c035cecc>] (drm_gem_cma_create) from [<c035cf20>] (drm_gem_cma_create_with_handle+0x1c/0xe8) [ 213.490000] [<c035cf20>] (drm_gem_cma_create_with_handle) from [<c035d088>] (drm_gem_cma_dumb_create+0x3c/0x48) [ 213.490000] [<c035d088>] (drm_gem_cma_dumb_create) from [<c0341ed8>] (drm_ioctl+0x12c/0x444) [ 213.490000] [<c0341ed8>] (drm_ioctl) from [<c0121adc>] (do_vfs_ioctl+0x3f4/0x614) [ 213.490000] [<c0121adc>] (do_vfs_ioctl) from [<c0121d30>] (SyS_ioctl+0x34/0x5c) [ 213.490000] [<c0121d30>] (SyS_ioctl) from [<c000f2c0>] (ret_fast_syscall+0x0/0x34) UBIFS is using PagePrivate() which can have different meanings across filesystems. Therefore the generic page migration code cannot handle this case correctly. We have to implement our own migration function which basically does a plain copy but also duplicates the page private flag. UBIFS is not a block device filesystem and cannot use buffer_migrate_page(). Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> [rw: Massaged changelog, build fixes, etc...] Signed-off-by: Richard Weinberger <richard@nod.at> Acked-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-07-27NFS: Fix another OPEN_DOWNGRADE bugTrond Myklebust
commit e547f2628327fec6afd2e03b46f113f614cca05b upstream. Olga Kornievskaia reports that the following test fails to trigger an OPEN_DOWNGRADE on the wire, and only triggers the final CLOSE. fd0 = open(foo, RDRW) -- should be open on the wire for "both" fd1 = open(foo, RDONLY) -- should be open on the wire for "read" close(fd0) -- should trigger an open_downgrade read(fd1) close(fd1) The issue is that we're missing a check for whether or not the current state transitioned from an O_RDWR state as opposed to having transitioned from a combination of O_RDONLY and O_WRONLY. Reported-by: Olga Kornievskaia <aglo@umich.edu> Fixes: cd9288ffaea4 ("NFSv4: Fix another bug in the close/open_downgrade code") Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-07-27make nfs_atomic_open() call d_drop() on all ->open_context() errors.Al Viro
commit d20cb71dbf3487f24549ede1a8e2d67579b4632e upstream. In "NFSv4: Move dentry instantiation into the NFSv4-specific atomic open code" unconditional d_drop() after the ->open_context() had been removed. It had been correct for success cases (there ->open_context() itself had been doing dcache manipulations), but not for error ones. Only one of those (ENOENT) got a compensatory d_drop() added in that commit, but in fact it should've been done for all errors. As it is, the case of O_CREAT non-exclusive open on a hashed negative dentry racing with e.g. symlink creation from another client ended up with ->open_context() getting an error and proceeding to call nfs_lookup(). On a hashed dentry, which would've instantly triggered BUG_ON() in d_materialise_unique() (or, these days, its equivalent in d_splice_alias()). Tested-by: Oleg Drokin <green@linuxhacker.ru> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-06-24pipe: limit the per-user amount of pages allocated in pipesWilly Tarreau
commit 759c01142a5d0f364a462346168a56de28a80f52 upstream. On no-so-small systems, it is possible for a single process to cause an OOM condition by filling large pipes with data that are never read. A typical process filling 4000 pipes with 1 MB of data will use 4 GB of memory. On small systems it may be tricky to set the pipe max size to prevent this from happening. This patch makes it possible to enforce a per-user soft limit above which new pipes will be limited to a single page, effectively limiting them to 4 kB each, as well as a hard limit above which no new pipes may be created for this user. This has the effect of protecting the system against memory abuse without hurting other users, and still allowing pipes to work correctly though with less data at once. The limit are controlled by two new sysctls : pipe-user-pages-soft, and pipe-user-pages-hard. Both may be disabled by setting them to zero. The default soft limit allows the default number of FDs per process (1024) to create pipes of the default size (64kB), thus reaching a limit of 64MB before starting to create only smaller pipes. With 256 processes limited to 1024 FDs each, this results in 1024*64kB + (256*1024 - 1024) * 4kB = 1084 MB of memory allocated for a user. The hard limit is disabled by default to avoid breaking existing applications that make intensive use of pipes (eg: for splicing). Reported-by: socketpair@gmail.com Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Mitigates: CVE-2013-4312 (Linux 2.0+) Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Willy Tarreau <w@1wt.eu> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Luis Henriques <luis.henriques@canonical.com> Signed-off-by: Chas Williams <3chas3@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-06-24xfs: fix up backport error in fs/xfs/xfs_inode.cGreg Kroah-Hartman
Commit c66edeaf79bb6f0ca688ffec9ca50a61b7569984, which was a backport of commit b1438f477934f5a4d5a44df26f3079a7575d5946 upstream, needed to have the error value be positive, not negative, in order to work properly. Reported-by: "Thomas D." <whissi@whissi.de> Reported-by: Brad Spender <spender@grsecurity.net> Cc: Dave Chinner <david@fromorbit.com> Cc: Willy Tarreau <w@1wt.eu> Cc: Jiri Slaby <jslaby@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-06-24fix d_walk()/non-delayed __d_free() raceAl Viro
commit 3d56c25e3bb0726a5c5e16fc2d9e38f8ed763085 upstream. Ascend-to-parent logics in d_walk() depends on all encountered child dentries not getting freed without an RCU delay. Unfortunately, in quite a few cases it is not true, with hard-to-hit oopsable race as the result. Fortunately, the fix is simiple; right now the rule is "if it ever been hashed, freeing must be delayed" and changing it to "if it ever had a parent, freeing must be delayed" closes that hole and covers all cases the old rule used to cover. Moreover, pipes and sockets remain _not_ covered, so we do not introduce RCU delay in the cases which are the reason for having that delay conditional in the first place. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-06-24ecryptfs: forbid opening files without mmap handlerJann Horn
commit 2f36db71009304b3f0b95afacd8eba1f9f046b87 upstream. This prevents users from triggering a stack overflow through a recursive invocation of pagefault handling that involves mapping procfs files into virtual memory. Signed-off-by: Jann Horn <jannh@google.com> Acked-by: Tyler Hicks <tyhicks@canonical.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-06-07xfs: skip stale inodes in xfs_iflush_clusterDave Chinner
commit 7d3aa7fe970791f1a674b14572a411accf2f4d4e upstream. We don't write back stale inodes so we should skip them in xfs_iflush_cluster, too. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-06-07xfs: fix inode validity check in xfs_iflush_clusterDave Chinner
commit 51b07f30a71c27405259a0248206ed4e22adbee2 upstream. Some careless idiot(*) wrote crap code in commit 1a3e8f3 ("xfs: convert inode cache lookups to use RCU locking") back in late 2010, and so xfs_iflush_cluster checks the wrong inode for whether it is still valid under RCU protection. Fix it to lock and check the correct inode. (*) Careless-idiot: Dave Chinner <dchinner@redhat.com> Discovered-by: Brain Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-06-07xfs: xfs_iflush_cluster fails to abort on errorDave Chinner
commit b1438f477934f5a4d5a44df26f3079a7575d5946 upstream. When a failure due to an inode buffer occurs, the error handling fails to abort the inode writeback correctly. This can result in the inode being reclaimed whilst still in the AIL, leading to use-after-free situations as well as filesystems that cannot be unmounted as the inode log items left in the AIL never get removed. Fix this by ensuring fatal errors from xfs_imap_to_bp() result in the inode flush being aborted correctly. Reported-by: Shyam Kaushik <shyam@zadarastorage.com> Diagnosed-by: Shyam Kaushik <shyam@zadarastorage.com> Tested-by: Shyam Kaushik <shyam@zadarastorage.com> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-06-07ext4: silence UBSAN in ext4_mb_init()Nicolai Stange
commit 935244cd54b86ca46e69bc6604d2adfb1aec2d42 upstream. Currently, in ext4_mb_init(), there's a loop like the following: do { ... offset += 1 << (sb->s_blocksize_bits - i); i++; } while (i <= sb->s_blocksize_bits + 1); Note that the updated offset is used in the loop's next iteration only. However, at the last iteration, that is at i == sb->s_blocksize_bits + 1, the shift count becomes equal to (unsigned)-1 > 31 (c.f. C99 6.5.7(3)) and UBSAN reports UBSAN: Undefined behaviour in fs/ext4/mballoc.c:2621:15 shift exponent 4294967295 is too large for 32-bit type 'int' [...] Call Trace: [<ffffffff818c4d25>] dump_stack+0xbc/0x117 [<ffffffff818c4c69>] ? _atomic_dec_and_lock+0x169/0x169 [<ffffffff819411ab>] ubsan_epilogue+0xd/0x4e [<ffffffff81941cac>] __ubsan_handle_shift_out_of_bounds+0x1fb/0x254 [<ffffffff81941ab1>] ? __ubsan_handle_load_invalid_value+0x158/0x158 [<ffffffff814b6dc1>] ? kmem_cache_alloc+0x101/0x390 [<ffffffff816fc13b>] ? ext4_mb_init+0x13b/0xfd0 [<ffffffff814293c7>] ? create_cache+0x57/0x1f0 [<ffffffff8142948a>] ? create_cache+0x11a/0x1f0 [<ffffffff821c2168>] ? mutex_lock+0x38/0x60 [<ffffffff821c23ab>] ? mutex_unlock+0x1b/0x50 [<ffffffff814c26ab>] ? put_online_mems+0x5b/0xc0 [<ffffffff81429677>] ? kmem_cache_create+0x117/0x2c0 [<ffffffff816fcc49>] ext4_mb_init+0xc49/0xfd0 [...] Observe that the mentioned shift exponent, 4294967295, equals (unsigned)-1. Unless compilers start to do some fancy transformations (which at least GCC 6.0.0 doesn't currently do), the issue is of cosmetic nature only: the such calculated value of offset is never used again. Silence UBSAN by introducing another variable, offset_incr, holding the next increment to apply to offset and adjust that one by right shifting it by one position per loop iteration. Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=114701 Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=112161 Signed-off-by: Nicolai Stange <nicstange@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-06-07ext4: address UBSAN warning in mb_find_order_for_block()Nicolai Stange
commit b5cb316cdf3a3f5f6125412b0f6065185240cfdc upstream. Currently, in mb_find_order_for_block(), there's a loop like the following: while (order <= e4b->bd_blkbits + 1) { ... bb += 1 << (e4b->bd_blkbits - order); } Note that the updated bb is used in the loop's next iteration only. However, at the last iteration, that is at order == e4b->bd_blkbits + 1, the shift count becomes negative (c.f. C99 6.5.7(3)) and UBSAN reports UBSAN: Undefined behaviour in fs/ext4/mballoc.c:1281:11 shift exponent -1 is negative [...] Call Trace: [<ffffffff818c4d35>] dump_stack+0xbc/0x117 [<ffffffff818c4c79>] ? _atomic_dec_and_lock+0x169/0x169 [<ffffffff819411bb>] ubsan_epilogue+0xd/0x4e [<ffffffff81941cbc>] __ubsan_handle_shift_out_of_bounds+0x1fb/0x254 [<ffffffff81941ac1>] ? __ubsan_handle_load_invalid_value+0x158/0x158 [<ffffffff816e93a0>] ? ext4_mb_generate_from_pa+0x590/0x590 [<ffffffff816502c8>] ? ext4_read_block_bitmap_nowait+0x598/0xe80 [<ffffffff816e7b7e>] mb_find_order_for_block+0x1ce/0x240 [...] Unless compilers start to do some fancy transformations (which at least GCC 6.0.0 doesn't currently do), the issue is of cosmetic nature only: the such calculated value of bb is never used again. Silence UBSAN by introducing another variable, bb_incr, holding the next increment to apply to bb and adjust that one by right shifting it by one position per loop iteration. Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=114701 Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=112161 Signed-off-by: Nicolai Stange <nicstange@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-06-07ext4: fix hang when processing corrupted orphaned inode listTheodore Ts'o
commit c9eb13a9105e2e418f72e46a2b6da3f49e696902 upstream. If the orphaned inode list contains inode #5, ext4_iget() returns a bad inode (since the bootloader inode should never be referenced directly). Because of the bad inode, we end up processing the inode repeatedly and this hangs the machine. This can be reproduced via: mke2fs -t ext4 /tmp/foo.img 100 debugfs -w -R "ssv last_orphan 5" /tmp/foo.img mount -o loop /tmp/foo.img /mnt (But don't do this if you are using an unpatched kernel if you care about the system staying functional. :-) This bug was found by the port of American Fuzzy Lop into the kernel to find file system problems[1]. (Since it *only* happens if inode #5 shows up on the orphan list --- 3, 7, 8, etc. won't do it, it's not surprising that AFL needed two hours before it found it.) [1] http://events.linuxfoundation.org/sites/events/files/slides/AFL%20filesystem%20fuzzing%2C%20Vault%202016_0.pdf Reported by: Vegard Nossum <vegard.nossum@oracle.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-06-07pipe: Fix buffer offset after partially failed readBen Hutchings
commit feae3ca2e5e1a8f44aa6290255d3d9709985d0b2 upstream. Quoting the RHEL advisory: > It was found that the fix for CVE-2015-1805 incorrectly kept buffer > offset and buffer length in sync on a failed atomic read, potentially > resulting in a pipe buffer state corruption. A local, unprivileged user > could use this flaw to crash the system or leak kernel memory to user > space. (CVE-2016-0774, Moderate) The same flawed fix was applied to stable branches from 2.6.32.y to 3.14.y inclusive, and I was able to reproduce the issue on 3.2.y. We need to give pipe_iov_copy_to_user() a separate offset variable and only update the buffer offset if it succeeds. References: https://rhn.redhat.com/errata/RHSA-2016-0103.html Signed-off-by: Ben Hutchings <ben@decadent.org.uk> Cc: Willy Tarreau <w@1wt.eu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-06-01fs/cifs: correctly to anonymous authentication via NTLMSSPStefan Metzmacher
commit cfda35d98298131bf38fbad3ce4cd5ecb3cf18db upstream. See [MS-NLMP] 3.2.5.1.2 Server Receives an AUTHENTICATE_MESSAGE from the Client: ... Set NullSession to FALSE If (AUTHENTICATE_MESSAGE.UserNameLen == 0 AND AUTHENTICATE_MESSAGE.NtChallengeResponse.Length == 0 AND (AUTHENTICATE_MESSAGE.LmChallengeResponse == Z(1) OR AUTHENTICATE_MESSAGE.LmChallengeResponse.Length == 0)) -- Special case: client requested anonymous authentication Set NullSession to TRUE ... Only server which map unknown users to guest will allow access using a non-null NTChallengeResponse. For Samba it's the "map to guest = bad user" option. BUG: https://bugzilla.samba.org/show_bug.cgi?id=11913 Signed-off-by: Stefan Metzmacher <metze@samba.org> Signed-off-by: Steve French <smfrench@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-06-01remove directory incorrectly tries to set delete on close on non-empty ↵Steve French
directories commit 897fba1172d637d344f009d700f7eb8a1fa262f1 upstream. Wrong return code was being returned on SMB3 rmdir of non-empty directory. For SMB3 (unlike for cifs), we attempt to delete a directory by set of delete on close flag on the open. Windows clients set this flag via a set info (SET_FILE_DISPOSITION to set this flag) which properly checks if the directory is empty. With this patch on smb3 mounts we correctly return "DIRECTORY NOT EMPTY" on attempts to remove a non-empty directory. Signed-off-by: Steve French <steve.french@primarydata.com> Acked-by: Sachin Prabhu <sprabhu@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-06-01Btrfs: don't use src fd for printkJosef Bacik
commit c79b4713304f812d3d6c95826fc3e5fc2c0b0c14 upstream. The fd we pass in may not be on a btrfs file system, so don't try to do BTRFS_I() on it. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Cc: Jeff Mahoney <jeffm@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-05-18get_rock_ridge_filename(): handle malformed NM entriesAl Viro
commit 99d825822eade8d827a1817357cbf3f889a552d6 upstream. Payloads of NM entries are not supposed to contain NUL. When we run into such, only the part prior to the first NUL goes into the concatenation (i.e. the directory entry name being encoded by a bunch of NM entries). We do stop when the amount collected so far + the claimed amount in the current NM entry exceed 254. So far, so good, but what we return as the total length is the sum of *claimed* sizes, not the actual amount collected. And that can grow pretty large - not unlimited, since you'd need to put CE entries in between to be able to get more than the maximum that could be contained in one isofs directory entry / continuation chunk and we are stop once we'd encountered 32 CEs, but you can get about 8Kb easily. And that's what will be passed to readdir callback as the name length. 8Kb __copy_to_user() from a buffer allocated by __get_free_page() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-05-11proc: prevent accessing /proc/<PID>/environ until it's readyMathias Krause
commit 8148a73c9901a8794a50f950083c00ccf97d43b3 upstream. If /proc/<PID>/environ gets read before the envp[] array is fully set up in create_{aout,elf,elf_fdpic,flat}_tables(), we might end up trying to read more bytes than are actually written, as env_start will already be set but env_end will still be zero, making the range calculation underflow, allowing to read beyond the end of what has been written. Fix this as it is done for /proc/<PID>/cmdline by testing env_end for zero. It is, apparently, intentionally set last in create_*_tables(). This bug was found by the PaX size_overflow plugin that detected the arithmetic underflow of 'this_len = env_end - (env_start + src)' when env_end is still zero. The expected consequence is that userland trying to access /proc/<PID>/environ of a not yet fully set up process may get inconsistent data as we're in the middle of copying in the environment variables. Fixes: https://forums.grsecurity.net/viewtopic.php?f=3&t=4363 Fixes: https://bugzilla.kernel.org/show_bug.cgi?id=116461 Signed-off-by: Mathias Krause <minipli@googlemail.com> Cc: Emese Revfy <re.emese@gmail.com> Cc: Pax Team <pageexec@freemail.hu> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Mateusz Guzik <mguzik@redhat.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Jarod Wilson <jarod@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-05-04ext4: fix NULL pointer dereference in ext4_mark_inode_dirty()Eryu Guan
commit 5e1021f2b6dff1a86a468a1424d59faae2bc63c1 upstream. ext4_reserve_inode_write() in ext4_mark_inode_dirty() could fail on error (e.g. EIO) and iloc.bh can be NULL in this case. But the error is ignored in the following "if" condition and ext4_expand_extra_isize() might be called with NULL iloc.bh set, which triggers NULL pointer dereference. This is uncovered by commit 8b4953e13f4c ("ext4: reserve code points for the project quota feature"), which enlarges the ext4_inode size, and run the following script on new kernel but with old mke2fs: #/bin/bash mnt=/mnt/ext4 devname=ext4-error dev=/dev/mapper/$devname fsimg=/home/fs.img trap cleanup 0 1 2 3 9 15 cleanup() { umount $mnt >/dev/null 2>&1 dmsetup remove $devname losetup -d $backend_dev rm -f $fsimg exit 0 } rm -f $fsimg fallocate -l 1g $fsimg backend_dev=`losetup -f --show $fsimg` devsize=`blockdev --getsz $backend_dev` good_tab="0 $devsize linear $backend_dev 0" error_tab="0 $devsize error $backend_dev 0" dmsetup create $devname --table "$good_tab" mkfs -t ext4 $dev mount -t ext4 -o errors=continue,strictatime $dev $mnt dmsetup load $devname --table "$error_tab" && dmsetup resume $devname echo 3 > /proc/sys/vm/drop_caches ls -l $mnt exit 0 [ Patch changed to simplify the function a tiny bit. -- Ted ] Signed-off-by: Eryu Guan <guaneryu@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-04-20ext4: add lockdep annotations for i_data_semTheodore Ts'o
commit daf647d2dd58cec59570d7698a45b98e580f2076 upstream. With the internal Quota feature, mke2fs creates empty quota inodes and quota usage tracking is enabled as soon as the file system is mounted. Since quotacheck is no longer preallocating all of the blocks in the quota inode that are likely needed to be written to, we are now seeing a lockdep false positive caused by needing to allocate a quota block from inside ext4_map_blocks(), while holding i_data_sem for a data inode. This results in this complaint: Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(&ei->i_data_sem); lock(&s->s_dquot.dqio_mutex); lock(&ei->i_data_sem); lock(&s->s_dquot.dqio_mutex); Google-Bug-Id: 27907753 Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-04-12ocfs2/dlm: fix BUG in dlm_move_lockres_to_recovery_listJoseph Qi
commit be12b299a83fc807bbaccd2bcb8ec50cbb0cb55c upstream. When master handles convert request, it queues ast first and then returns status. This may happen that the ast is sent before the request status because the above two messages are sent by two threads. And right after the ast is sent, if master down, it may trigger BUG in dlm_move_lockres_to_recovery_list in the requested node because ast handler moves it to grant list without clear lock->convert_pending. So remove BUG_ON statement and check if the ast is processed in dlmconvert_remote. Signed-off-by: Joseph Qi <joseph.qi@huawei.com> Reported-by: Yiwen Jiang <jiangyiwen@huawei.com> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Cc: Tariq Saeed <tariq.x.saeed@oracle.com> Cc: Junxiao Bi <junxiao.bi@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-04-12ocfs2/dlm: fix race between convert and recoveryJoseph Qi
commit ac7cf246dfdbec3d8fed296c7bf30e16f5099dac upstream. There is a race window between dlmconvert_remote and dlm_move_lockres_to_recovery_list, which will cause a lock with OCFS2_LOCK_BUSY in grant list, thus system hangs. dlmconvert_remote { spin_lock(&res->spinlock); list_move_tail(&lock->list, &res->converting); lock->convert_pending = 1; spin_unlock(&res->spinlock); status = dlm_send_remote_convert_request(); >>>>>> race window, master has queued ast and return DLM_NORMAL, and then down before sending ast. this node detects master down and calls dlm_move_lockres_to_recovery_list, which will revert the lock to grant list. Then OCFS2_LOCK_BUSY won't be cleared as new master won't send ast any more because it thinks already be authorized. spin_lock(&res->spinlock); lock->convert_pending = 0; if (status != DLM_NORMAL) dlm_revert_pending_convert(res, lock); spin_unlock(&res->spinlock); } In this case, check if res->state has DLM_LOCK_RES_RECOVERING bit set (res is still in recovering) or res master changed (new master has finished recovery), reset the status to DLM_RECOVERING, then it will retry convert. Signed-off-by: Joseph Qi <joseph.qi@huawei.com> Reported-by: Yiwen Jiang <jiangyiwen@huawei.com> Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Cc: Tariq Saeed <tariq.x.saeed@oracle.com> Cc: Junxiao Bi <junxiao.bi@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-04-12fs/coredump: prevent fsuid=0 dumps into user-controlled directoriesJann Horn
commit 378c6520e7d29280f400ef2ceaf155c86f05a71a upstream. This commit fixes the following security hole affecting systems where all of the following conditions are fulfilled: - The fs.suid_dumpable sysctl is set to 2. - The kernel.core_pattern sysctl's value starts with "/". (Systems where kernel.core_pattern starts with "|/" are not affected.) - Unprivileged user namespace creation is permitted. (This is true on Linux >=3.8, but some distributions disallow it by default using a distro patch.) Under these conditions, if a program executes under secure exec rules, causing it to run with the SUID_DUMP_ROOT flag, then unshares its user namespace, changes its root directory and crashes, the coredump will be written using fsuid=0 and a path derived from kernel.core_pattern - but this path is interpreted relative to the root directory of the process, allowing the attacker to control where a coredump will be written with root privileges. To fix the security issue, always interpret core_pattern for dumps that are written under SUID_DUMP_ROOT relative to the root directory of init. Signed-off-by: Jann Horn <jann@thejh.net> Acked-by: Kees Cook <keescook@chromium.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-04-12xfs: fix two memory leaks in xfs_attr_list.c error pathsMateusz Guzik
commit 2e83b79b2d6c78bf1b4aa227938a214dcbddc83f upstream. This plugs 2 trivial leaks in xfs_attr_shortform_list and xfs_attr3_leaf_list_int. Signed-off-by: Mateusz Guzik <mguzik@redhat.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-04-12splice: handle zero nr_pages in splice_to_pipe()Rabin Vincent
commit d6785d9152147596f60234157da2b02540c3e60f upstream. Running the following command: busybox cat /sys/kernel/debug/tracing/trace_pipe > /dev/null with any tracing enabled pretty very quickly leads to various NULL pointer dereferences and VM BUG_ON()s, such as these: BUG: unable to handle kernel NULL pointer dereference at 0000000000000020 IP: [<ffffffff8119df6c>] generic_pipe_buf_release+0xc/0x40 Call Trace: [<ffffffff811c48a3>] splice_direct_to_actor+0x143/0x1e0 [<ffffffff811c42e0>] ? generic_pipe_buf_nosteal+0x10/0x10 [<ffffffff811c49cf>] do_splice_direct+0x8f/0xb0 [<ffffffff81196869>] do_sendfile+0x199/0x380 [<ffffffff81197600>] SyS_sendfile64+0x90/0xa0 [<ffffffff8192cbee>] entry_SYSCALL_64_fastpath+0x12/0x6d page dumped because: VM_BUG_ON_PAGE(atomic_read(&page->_count) == 0) kernel BUG at include/linux/mm.h:367! invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC RIP: [<ffffffff8119df9c>] generic_pipe_buf_release+0x3c/0x40 Call Trace: [<ffffffff811c48a3>] splice_direct_to_actor+0x143/0x1e0 [<ffffffff811c42e0>] ? generic_pipe_buf_nosteal+0x10/0x10 [<ffffffff811c49cf>] do_splice_direct+0x8f/0xb0 [<ffffffff81196869>] do_sendfile+0x199/0x380 [<ffffffff81197600>] SyS_sendfile64+0x90/0xa0 [<ffffffff8192cd1e>] tracesys_phase2+0x84/0x89 (busybox's cat uses sendfile(2), unlike the coreutils version) This is because tracing_splice_read_pipe() can call splice_to_pipe() with spd->nr_pages == 0. spd_pages underflows in splice_to_pipe() and we fill the page pointers and the other fields of the pipe_buffers with garbage. All other callers of splice_to_pipe() avoid calling it when nr_pages == 0, and we could make tracing_splice_read_pipe() do that too, but it seems reasonable to have splice_to_page() handle this condition gracefully. Signed-off-by: Rabin Vincent <rabin@rab.in> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-04-12jbd2: fix FS corruption possibility in jbd2_journal_destroy() on umount pathOGAWA Hirofumi
commit c0a2ad9b50dd80eeccd73d9ff962234590d5ec93 upstream. On umount path, jbd2_journal_destroy() writes latest transaction ID (->j_tail_sequence) to be used at next mount. The bug is that ->j_tail_sequence is not holding latest transaction ID in some cases. So, at next mount, there is chance to conflict with remaining (not overwritten yet) transactions. mount (id=10) write transaction (id=11) write transaction (id=12) umount (id=10) <= the bug doesn't write latest ID mount (id=10) write transaction (id=11) crash mount [recovery process] transaction (id=11) transaction (id=12) <= valid transaction ID, but old commit must not replay Like above, this bug become the cause of recovery failure, or FS corruption. So why ->j_tail_sequence doesn't point latest ID? Because if checkpoint transactions was reclaimed by memory pressure (i.e. bdev_try_to_free_page()), then ->j_tail_sequence is not updated. (And another case is, __jbd2_journal_clean_checkpoint_list() is called with empty transaction.) So in above cases, ->j_tail_sequence is not pointing latest transaction ID at umount path. Plus, REQ_FLUSH for checkpoint is not done too. So, to fix this problem with minimum changes, this patch updates ->j_tail_sequence, and issue REQ_FLUSH. (With more complex changes, some optimizations would be possible to avoid unnecessary REQ_FLUSH for example though.) BTW, journal->j_tail_sequence = ++journal->j_transaction_sequence; Increment of ->j_transaction_sequence seems to be unnecessary, but ext3 does this. Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-03-16efi: Make efivarfs entries immutable by defaultPeter Jones
commit ed8b0de5a33d2a2557dce7f9429dca8cb5bc5879 upstream. "rm -rf" is bricking some peoples' laptops because of variables being used to store non-reinitializable firmware driver data that's required to POST the hardware. These are 100% bugs, and they need to be fixed, but in the mean time it shouldn't be easy to *accidentally* brick machines. We have to have delete working, and picking which variables do and don't work for deletion is quite intractable, so instead make everything immutable by default (except for a whitelist), and make tools that aren't quite so broad-spectrum unset the immutable flag. Signed-off-by: Peter Jones <pjones@redhat.com> Tested-by: Lee, Chun-Yi <jlee@suse.com> Acked-by: Matthew Garrett <mjg59@coreos.com> Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-03-16efi: Use ucs2_as_utf8 in efivarfs instead of open coding a bad versionPeter Jones
commit e0d64e6a880e64545ad7d55786aa84ab76bac475 upstream. Translate EFI's UCS-2 variable names to UTF-8 instead of just assuming all variable names fit in ASCII. Signed-off-by: Peter Jones <pjones@redhat.com> Acked-by: Matthew Garrett <mjg59@coreos.com> Tested-by: Lee, Chun-Yi <jlee@suse.com> Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-03-09Fix directory hardlinks from deleted directoriesDavid Woodhouse
commit be629c62a603e5935f8177fd8a19e014100a259e upstream. When a directory is deleted, we don't take too much care about killing off all the dirents that belong to it — on the basis that on remount, the scan will conclude that the directory is dead anyway. This doesn't work though, when the deleted directory contained a child directory which was moved *out*. In the early stages of the fs build we can then end up with an apparent hard link, with the child directory appearing both in its true location, and as a child of the original directory which are this stage of the mount process we don't *yet* know is defunct. To resolve this, take out the early special-casing of the "directories shall not have hard links" rule in jffs2_build_inode_pass1(), and let the normal nlink processing happen for directories as well as other inodes. Then later in the build process we can set ic->pino_nlink to the parent inode#, as is required for directories during normal operaton, instead of the nlink. And complain only *then* about hard links which are still in evidence even after killing off all the unreachable paths. Reported-by: Liu Song <liu.song11@zte.com.cn> Signed-off-by: David Woodhouse <David.Woodhouse@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>