aboutsummaryrefslogtreecommitdiffstats
path: root/fs
AgeCommit message (Collapse)AuthorFilesLines
2019-09-23Merge tag 'v4.18.44' into v4.18/standard/baseBruce Ashfield17-38/+117
This is the 4.18.44 stable release # gpg: Signature made Tue 17 Sep 2019 10:35:59 AM EDT # gpg: using RSA key EBCE84042C07D1D6 # gpg: Can't check signature: No public key
2019-09-16Abort file_remove_privs() for non-reg. filesAlexander Lochmann1-2/+7
commit f69e749a49353d96af1a293f56b5b56de59c668a upstream. file_remove_privs() might be called for non-regular files, e.g. blkdev inode. There is no reason to do its job on things like blkdev inodes, pipes, or cdevs. Hence, abort if file does not refer to a regular inode. AV: more to the point, for devices there might be any number of inodes refering to given device. Which one to strip the permissions from, even if that made any sense in the first place? All of them will be observed with contents modified, after all. Found by LockDoc (Alexander Lochmann, Horst Schirmeier and Olaf Spinczyk) Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Alexander Lochmann <alexander.lochmann@tu-dortmund.de> Signed-off-by: Horst Schirmeier <horst.schirmeier@tu-dortmund.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2019-09-16ocfs2: fix error path kobject memory leakTobin C. Harding1-0/+1
commit b9fba67b3806e21b98bd5a98dc3921a8e9b42d61 upstream. If a call to kobject_init_and_add() fails we should call kobject_put() otherwise we leak memory. Add call to kobject_put() in the error path of call to kobject_init_and_add(). Please note, this has the side effect that the release method is called if kobject_init_and_add() fails. Link: http://lkml.kernel.org/r/20190513033458.2824-1-tobin@kernel.org Signed-off-by: Tobin C. Harding <tobin@kernel.org> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Gang He <ghe@suse.com> Cc: Jun Piao <piaojun@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2019-09-16configfs: Fix use-after-free when accessing sd->s_dentrySahitya Tummala1-8/+6
commit f6122ed2a4f9c9c1c073ddf6308d1b2ac10e0781 upstream. In the vfs_statx() context, during path lookup, the dentry gets added to sd->s_dentry via configfs_attach_attr(). In the end, vfs_statx() kills the dentry by calling path_put(), which invokes configfs_d_iput(). Ideally, this dentry must be removed from sd->s_dentry but it doesn't if the sd->s_count >= 3. As a result, sd->s_dentry is holding reference to a stale dentry pointer whose memory is already freed up. This results in use-after-free issue, when this stale sd->s_dentry is accessed later in configfs_readdir() path. This issue can be easily reproduced, by running the LTP test case - sh fs_racer_file_list.sh /config (https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/fs/racer/fs_racer_file_list.sh) Fixes: 76ae281f6307 ('configfs: fix race between dentry put and lookup') Signed-off-by: Sahitya Tummala <stummala@codeaurora.org> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2019-09-16f2fs: fix to avoid accessing xattr across the boundaryRandall Huang2-9/+29
commit 2777e654371dd4207a3a7f4fb5fa39550053a080 upstream. When we traverse xattr entries via __find_xattr(), if the raw filesystem content is faked or any hardware failure occurs, out-of-bound error can be detected by KASAN. Fix the issue by introducing boundary check. [ 38.402878] c7 1827 BUG: KASAN: slab-out-of-bounds in f2fs_getxattr+0x518/0x68c [ 38.402891] c7 1827 Read of size 4 at addr ffffffc0b6fb35dc by task [ 38.402935] c7 1827 Call trace: [ 38.402952] c7 1827 [<ffffff900809003c>] dump_backtrace+0x0/0x6bc [ 38.402966] c7 1827 [<ffffff9008090030>] show_stack+0x20/0x2c [ 38.402981] c7 1827 [<ffffff900871ab10>] dump_stack+0xfc/0x140 [ 38.402995] c7 1827 [<ffffff9008325c40>] print_address_description+0x80/0x2d8 [ 38.403009] c7 1827 [<ffffff900832629c>] kasan_report_error+0x198/0x1fc [ 38.403022] c7 1827 [<ffffff9008326104>] kasan_report_error+0x0/0x1fc [ 38.403037] c7 1827 [<ffffff9008325000>] __asan_load4+0x1b0/0x1b8 [ 38.403051] c7 1827 [<ffffff90085fcc44>] f2fs_getxattr+0x518/0x68c [ 38.403066] c7 1827 [<ffffff90085fc508>] f2fs_xattr_generic_get+0xb0/0xd0 [ 38.403080] c7 1827 [<ffffff9008395708>] __vfs_getxattr+0x1f4/0x1fc [ 38.403096] c7 1827 [<ffffff9008621bd0>] inode_doinit_with_dentry+0x360/0x938 [ 38.403109] c7 1827 [<ffffff900862d6cc>] selinux_d_instantiate+0x2c/0x38 [ 38.403123] c7 1827 [<ffffff900861b018>] security_d_instantiate+0x68/0x98 [ 38.403136] c7 1827 [<ffffff9008377db8>] d_splice_alias+0x58/0x348 [ 38.403149] c7 1827 [<ffffff900858d16c>] f2fs_lookup+0x608/0x774 [ 38.403163] c7 1827 [<ffffff900835eacc>] lookup_slow+0x1e0/0x2cc [ 38.403177] c7 1827 [<ffffff9008367fe0>] walk_component+0x160/0x520 [ 38.403190] c7 1827 [<ffffff9008369ef4>] path_lookupat+0x110/0x2b4 [ 38.403203] c7 1827 [<ffffff900835dd38>] filename_lookup+0x1d8/0x3a8 [ 38.403216] c7 1827 [<ffffff900835eeb0>] user_path_at_empty+0x54/0x68 [ 38.403229] c7 1827 [<ffffff9008395f44>] SyS_getxattr+0xb4/0x18c [ 38.403241] c7 1827 [<ffffff9008084200>] el0_svc_naked+0x34/0x38 Signed-off-by: Randall Huang <huangrandall@google.com> [Jaegeuk Kim: Fix wrong ending boundary] Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2019-09-16fs/ocfs2: fix race in ocfs2_dentry_attach_lock()Wengang Wang1-0/+12
commit be99ca2716972a712cde46092c54dee5e6192bf8 upstream. ocfs2_dentry_attach_lock() can be executed in parallel threads against the same dentry. Make that race safe. The race is like this: thread A thread B (A1) enter ocfs2_dentry_attach_lock, seeing dentry->d_fsdata is NULL, and no alias found by ocfs2_find_local_alias, so kmalloc a new ocfs2_dentry_lock structure to local variable "dl", dl1 ..... (B1) enter ocfs2_dentry_attach_lock, seeing dentry->d_fsdata is NULL, and no alias found by ocfs2_find_local_alias so kmalloc a new ocfs2_dentry_lock structure to local variable "dl", dl2. ...... (A2) set dentry->d_fsdata with dl1, call ocfs2_dentry_lock() and increase dl1->dl_lockres.l_ro_holders to 1 on success. ...... (B2) set dentry->d_fsdata with dl2 call ocfs2_dentry_lock() and increase dl2->dl_lockres.l_ro_holders to 1 on success. ...... (A3) call ocfs2_dentry_unlock() and decrease dl2->dl_lockres.l_ro_holders to 0 on success. .... (B3) call ocfs2_dentry_unlock(), decreasing dl2->dl_lockres.l_ro_holders, but see it's zero now, panic Link: http://lkml.kernel.org/r/20190529174636.22364-1-wen.gang.wang@oracle.com Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com> Reported-by: Daniel Sobe <daniel.sobe@nxp.com> Tested-by: Daniel Sobe <daniel.sobe@nxp.com> Reviewed-by: Changwei Ge <gechangwei@live.cn> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Gang He <ghe@suse.com> Cc: Jun Piao <piaojun@huawei.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2019-09-16nfsd: avoid uninitialized variable warningArnd Bergmann1-0/+4
commit 0ab88ca4bcf18ba21058d8f19220f60afe0d34d8 upstream. clang warns that 'contextlen' may be accessed without an initialization: fs/nfsd/nfs4xdr.c:2911:9: error: variable 'contextlen' is uninitialized when used here [-Werror,-Wuninitialized] contextlen); ^~~~~~~~~~ fs/nfsd/nfs4xdr.c:2424:16: note: initialize the variable 'contextlen' to silence this warning int contextlen; ^ = 0 Presumably this cannot happen, as FATTR4_WORD2_SECURITY_LABEL is set if CONFIG_NFSD_V4_SECURITY_LABEL is enabled. Adding another #ifdef like the other two in this function avoids the warning. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2019-09-16nfsd: allow fh_want_write to be called twiceJ. Bruce Fields1-1/+4
commit 0b8f62625dc309651d0efcb6a6247c933acd8b45 upstream. A fuzzer recently triggered lockdep warnings about potential sb_writers deadlocks caused by fh_want_write(). Looks like we aren't careful to pair each fh_want_write() with an fh_drop_write(). It's not normally a problem since fh_put() will call fh_drop_write() for us. And was OK for NFSv3 where we'd do one operation that might call fh_want_write(), and then put the filehandle. But an NFSv4 protocol fuzzer can do weird things like call unlink twice in a compound, and then we get into trouble. I'm a little worried about this approach of just leaving everything to fh_put(). But I think there are probably a lot of fh_want_write()/fh_drop_write() imbalances so for now I think we need it to be more forgiving. Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2019-09-16fuse: retrieve: cap requested size to negotiated max_writeKirill Smelkov1-1/+1
commit 7640682e67b33cab8628729afec8ca92b851394f upstream. FUSE filesystem server and kernel client negotiate during initialization phase, what should be the maximum write size the client will ever issue. Correspondingly the filesystem server then queues sys_read calls to read requests with buffer capacity large enough to carry request header + that max_write bytes. A filesystem server is free to set its max_write in anywhere in the range between [1*page, fc->max_pages*page]. In particular go-fuse[2] sets max_write by default as 64K, wheres default fc->max_pages corresponds to 128K. Libfuse also allows users to configure max_write, but by default presets it to possible maximum. If max_write is < fc->max_pages*page, and in NOTIFY_RETRIEVE handler we allow to retrieve more than max_write bytes, corresponding prepared NOTIFY_REPLY will be thrown away by fuse_dev_do_read, because the filesystem server, in full correspondence with server/client contract, will be only queuing sys_read with ~max_write buffer capacity, and fuse_dev_do_read throws away requests that cannot fit into server request buffer. In turn the filesystem server could get stuck waiting indefinitely for NOTIFY_REPLY since NOTIFY_RETRIEVE handler returned OK which is understood by clients as that NOTIFY_REPLY was queued and will be sent back. Cap requested size to negotiate max_write to avoid the problem. This aligns with the way NOTIFY_RETRIEVE handler works, which already unconditionally caps requested retrieve size to fuse_conn->max_pages. This way it should not hurt NOTIFY_RETRIEVE semantic if we return less data than was originally requested. Please see [1] for context where the problem of stuck filesystem was hit for real, how the situation was traced and for more involving patch that did not make it into the tree. [1] https://marc.info/?l=linux-fsdevel&m=155057023600853&w=2 [2] https://github.com/hanwen/go-fuse Signed-off-by: Kirill Smelkov <kirr@nexedi.com> Cc: Han-Wen Nienhuys <hanwen@google.com> Cc: Jakob Unterwurzacher <jakobunt@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2019-09-16configfs: fix possible use-after-free in configfs_register_groupYueHaibing1-5/+12
commit 35399f87e271f7cf3048eab00a421a6519ac8441 upstream. In configfs_register_group(), if create_default_group() failed, we forget to unlink the group. It will left a invalid item in the parent list, which may trigger the use-after-free issue seen below: BUG: KASAN: use-after-free in __list_add_valid+0xd4/0xe0 lib/list_debug.c:26 Read of size 8 at addr ffff8881ef61ae20 by task syz-executor.0/5996 CPU: 1 PID: 5996 Comm: syz-executor.0 Tainted: G C 5.0.0+ #5 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014 Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0xa9/0x10e lib/dump_stack.c:113 print_address_description+0x65/0x270 mm/kasan/report.c:187 kasan_report+0x149/0x18d mm/kasan/report.c:317 __list_add_valid+0xd4/0xe0 lib/list_debug.c:26 __list_add include/linux/list.h:60 [inline] list_add_tail include/linux/list.h:93 [inline] link_obj+0xb0/0x190 fs/configfs/dir.c:759 link_group+0x1c/0x130 fs/configfs/dir.c:784 configfs_register_group+0x56/0x1e0 fs/configfs/dir.c:1751 configfs_register_default_group+0x72/0xc0 fs/configfs/dir.c:1834 ? 0xffffffffc1be0000 iio_sw_trigger_init+0x23/0x1000 [industrialio_sw_trigger] do_one_initcall+0xbc/0x47d init/main.c:887 do_init_module+0x1b5/0x547 kernel/module.c:3456 load_module+0x6405/0x8c10 kernel/module.c:3804 __do_sys_finit_module+0x162/0x190 kernel/module.c:3898 do_syscall_64+0x9f/0x450 arch/x86/entry/common.c:290 entry_SYSCALL_64_after_hwframe+0x49/0xbe RIP: 0033:0x462e99 Code: f7 d8 64 89 02 b8 ff ff ff ff c3 66 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 bc ff ff ff f7 d8 64 89 01 48 RSP: 002b:00007f494ecbcc58 EFLAGS: 00000246 ORIG_RAX: 0000000000000139 RAX: ffffffffffffffda RBX: 000000000073bf00 RCX: 0000000000462e99 RDX: 0000000000000000 RSI: 0000000020000180 RDI: 0000000000000003 RBP: 00007f494ecbcc70 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 00007f494ecbd6bc R13: 00000000004bcefa R14: 00000000006f6fb0 R15: 0000000000000004 Allocated by task 5987: set_track mm/kasan/common.c:87 [inline] __kasan_kmalloc.constprop.3+0xa0/0xd0 mm/kasan/common.c:497 kmalloc include/linux/slab.h:545 [inline] kzalloc include/linux/slab.h:740 [inline] configfs_register_default_group+0x4c/0xc0 fs/configfs/dir.c:1829 0xffffffffc1bd0023 do_one_initcall+0xbc/0x47d init/main.c:887 do_init_module+0x1b5/0x547 kernel/module.c:3456 load_module+0x6405/0x8c10 kernel/module.c:3804 __do_sys_finit_module+0x162/0x190 kernel/module.c:3898 do_syscall_64+0x9f/0x450 arch/x86/entry/common.c:290 entry_SYSCALL_64_after_hwframe+0x49/0xbe Freed by task 5987: set_track mm/kasan/common.c:87 [inline] __kasan_slab_free+0x130/0x180 mm/kasan/common.c:459 slab_free_hook mm/slub.c:1429 [inline] slab_free_freelist_hook mm/slub.c:1456 [inline] slab_free mm/slub.c:3003 [inline] kfree+0xe1/0x270 mm/slub.c:3955 configfs_register_default_group+0x9a/0xc0 fs/configfs/dir.c:1836 0xffffffffc1bd0023 do_one_initcall+0xbc/0x47d init/main.c:887 do_init_module+0x1b5/0x547 kernel/module.c:3456 load_module+0x6405/0x8c10 kernel/module.c:3804 __do_sys_finit_module+0x162/0x190 kernel/module.c:3898 do_syscall_64+0x9f/0x450 arch/x86/entry/common.c:290 entry_SYSCALL_64_after_hwframe+0x49/0xbe The buggy address belongs to the object at ffff8881ef61ae00 which belongs to the cache kmalloc-192 of size 192 The buggy address is located 32 bytes inside of 192-byte region [ffff8881ef61ae00, ffff8881ef61aec0) The buggy address belongs to the page: page:ffffea0007bd8680 count:1 mapcount:0 mapping:ffff8881f6c03000 index:0xffff8881ef61a700 flags: 0x2fffc0000000200(slab) raw: 02fffc0000000200 ffffea0007ca4740 0000000500000005 ffff8881f6c03000 raw: ffff8881ef61a700 000000008010000c 00000001ffffffff 0000000000000000 page dumped because: kasan: bad access detected Memory state around the buggy address: ffff8881ef61ad00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ffff8881ef61ad80: 00 00 00 00 00 00 00 00 fc fc fc fc fc fc fc fc >ffff8881ef61ae00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ^ ffff8881ef61ae80: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc ffff8881ef61af00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb Fixes: 5cf6a51e6062 ("configfs: allow dynamic group creation") Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: YueHaibing <yuehaibing@huawei.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2019-09-16f2fs: fix to do sanity check on valid block count of segmentChao Yu1-2/+1
commit e95bcdb2fefa129f37bd9035af1d234ca92ee4ef upstream. As Jungyeon reported in bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=203233 - Overview When mounting the attached crafted image and running program, following errors are reported. Additionally, it hangs on sync after running program. The image is intentionally fuzzed from a normal f2fs image for testing. Compile options for F2FS are as follows. CONFIG_F2FS_FS=y CONFIG_F2FS_STAT_FS=y CONFIG_F2FS_FS_XATTR=y CONFIG_F2FS_FS_POSIX_ACL=y CONFIG_F2FS_CHECK_FS=y - Reproduces cc poc_13.c mkdir test mount -t f2fs tmp.img test cp a.out test cd test sudo ./a.out sync - Kernel messages F2FS-fs (sdb): Bitmap was wrongly set, blk:4608 kernel BUG at fs/f2fs/segment.c:2102! RIP: 0010:update_sit_entry+0x394/0x410 Call Trace: f2fs_allocate_data_block+0x16f/0x660 do_write_page+0x62/0x170 f2fs_do_write_node_page+0x33/0xa0 __write_node_page+0x270/0x4e0 f2fs_sync_node_pages+0x5df/0x670 f2fs_write_checkpoint+0x372/0x1400 f2fs_sync_fs+0xa3/0x130 f2fs_do_sync_file+0x1a6/0x810 do_fsync+0x33/0x60 __x64_sys_fsync+0xb/0x10 do_syscall_64+0x43/0xf0 entry_SYSCALL_64_after_hwframe+0x44/0xa9 sit.vblocks and sum valid block count in sit.valid_map may be inconsistent, segment w/ zero vblocks will be treated as free segment, while allocating in free segment, we may allocate a free block, if its bitmap is valid previously, it can cause kernel crash due to bitmap verification failure. Anyway, to avoid further serious metadata inconsistence and corruption, it is necessary and worth to detect SIT inconsistence. So let's enable check_block_count() to verify vblocks and valid_map all the time rather than do it only CONFIG_F2FS_CHECK_FS is enabled. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2019-09-16f2fs: fix to use inline space only if inline_xattr is enableChao Yu1-1/+3
commit 622927f3b8809206f6da54a6a7ed4df1a7770fce upstream. With below mkfs and mount option: MKFS_OPTIONS -- -O extra_attr -O project_quota -O inode_checksum -O flexible_inline_xattr -O inode_crtime -f MOUNT_OPTIONS -- -o noinline_xattr We may miss xattr data with below testcase: - mkdir dir - setfattr -n "user.name" -v 0 dir - for ((i = 0; i < 190; i++)) do touch dir/$i; done - umount - mount - getfattr -n "user.name" dir user.name: No such attribute The root cause is that we persist xattr data into reserved inline xattr space, even if inline_xattr is not enable in inline directory inode, after inline dentry conversion, reserved space no longer exists, so that xattr data missed. Let's use inline xattr space only if inline_xattr flag is set on inode to fix this iusse. Fixes: 6afc662e68b5 ("f2fs: support flexible inline xattr size") Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2019-09-16f2fs: fix to clear dirty inode in error path of f2fs_iget()Chao Yu1-0/+1
commit 546d22f070d64a7b96f57c93333772085d3a5e6d upstream. As Jungyeon reported in bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=203217 - Overview When mounting the attached crafted image and running program, I got this error. Additionally, it hangs on sync after running the program. The image is intentionally fuzzed from a normal f2fs image for testing and I enabled option CONFIG_F2FS_CHECK_FS on. - Reproduces cc poc_test_05.c mkdir test mount -t f2fs tmp.img test sudo ./a.out sync - Messages kernel BUG at fs/f2fs/inode.c:707! RIP: 0010:f2fs_evict_inode+0x33f/0x3a0 Call Trace: evict+0xba/0x180 f2fs_iget+0x598/0xdf0 f2fs_lookup+0x136/0x320 __lookup_slow+0x92/0x140 lookup_slow+0x30/0x50 walk_component+0x1c1/0x350 path_lookupat+0x62/0x200 filename_lookup+0xb3/0x1a0 do_readlinkat+0x56/0x110 __x64_sys_readlink+0x16/0x20 do_syscall_64+0x43/0xf0 entry_SYSCALL_64_after_hwframe+0x44/0xa9 During inode loading, __recover_inline_status() can recovery inode status and set inode dirty, once we failed in following process, it will fail the check in f2fs_evict_inode, result in trigger BUG_ON(). Let's clear dirty inode in error path of f2fs_iget() to avoid panic. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2019-09-16f2fs: fix to do sanity check on free nidChao Yu1-0/+3
commit 626bcf2b7ce87211dba565f2bfa7842ba5be5c1b upstream. As Jungyeon reported in bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=203225 - Overview When mounting the attached crafted image and unmounting it, following errors are reported. Additionally, it hangs on sync after unmounting. The image is intentionally fuzzed from a normal f2fs image for testing. Compile options for F2FS are as follows. CONFIG_F2FS_FS=y CONFIG_F2FS_STAT_FS=y CONFIG_F2FS_FS_XATTR=y CONFIG_F2FS_FS_POSIX_ACL=y CONFIG_F2FS_CHECK_FS=y - Reproduces mkdir test mount -t f2fs tmp.img test touch test/t umount test sync - Messages kernel BUG at fs/f2fs/node.c:3073! RIP: 0010:f2fs_destroy_node_manager+0x2f0/0x300 Call Trace: f2fs_put_super+0xf4/0x270 generic_shutdown_super+0x62/0x110 kill_block_super+0x1c/0x50 kill_f2fs_super+0xad/0xd0 deactivate_locked_super+0x35/0x60 cleanup_mnt+0x36/0x70 task_work_run+0x75/0x90 exit_to_usermode_loop+0x93/0xa0 do_syscall_64+0xba/0xf0 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0010:f2fs_destroy_node_manager+0x2f0/0x300 NAT table is corrupted, so reserved meta/node inode ids were added into free list incorrectly, during file creation, since reserved id has cached in inode hash, so it fails the creation and preallocated nid can not be released later, result in kernel panic. To fix this issue, let's do nid boundary check during free nid loading. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2019-09-16f2fs: fix to avoid panic in f2fs_remove_inode_page()Chao Yu1-2/+8
commit 8b6810f8acfe429fde7c7dad4714692cc5f75651 upstream. As Jungyeon reported in bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=203219 - Overview When mounting the attached crafted image and running program, I got this error. Additionally, it hangs on sync after running the program. The image is intentionally fuzzed from a normal f2fs image for testing and I enabled option CONFIG_F2FS_CHECK_FS on. - Reproduces cc poc_06.c mkdir test mount -t f2fs tmp.img test cp a.out test cd test sudo ./a.out sync - Messages kernel BUG at fs/f2fs/node.c:1183! RIP: 0010:f2fs_remove_inode_page+0x294/0x2d0 Call Trace: f2fs_evict_inode+0x2a3/0x3a0 evict+0xba/0x180 __dentry_kill+0xbe/0x160 dentry_kill+0x46/0x180 dput+0xbb/0x100 do_renameat2+0x3c9/0x550 __x64_sys_rename+0x17/0x20 do_syscall_64+0x43/0xf0 entry_SYSCALL_64_after_hwframe+0x44/0xa9 The reason is f2fs_remove_inode_page() will trigger kernel panic due to inconsistent i_blocks value of inode. To avoid panic, let's just print debug message and set SBI_NEED_FSCK to give a hint to fsck for latter repairing of potential image corruption. Signed-off-by: Chao Yu <yuchao0@huawei.com> [Jaegeuk Kim: fix build warning and add unlikely] Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2019-09-16f2fs: fix to avoid panic in f2fs_inplace_write_data()Chao Yu1-2/+7
commit 05573d6ccf702df549a7bdeabef31e4753df1a90 upstream. As Jungyeon reported in bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=203239 - Overview When mounting the attached crafted image and running program, following errors are reported. Additionally, it hangs on sync after running program. The image is intentionally fuzzed from a normal f2fs image for testing. Compile options for F2FS are as follows. CONFIG_F2FS_FS=y CONFIG_F2FS_STAT_FS=y CONFIG_F2FS_FS_XATTR=y CONFIG_F2FS_FS_POSIX_ACL=y CONFIG_F2FS_CHECK_FS=y - Reproduces cc poc_15.c ./run.sh f2fs sync - Kernel messages ------------[ cut here ]------------ kernel BUG at fs/f2fs/segment.c:3162! RIP: 0010:f2fs_inplace_write_data+0x12d/0x160 Call Trace: f2fs_do_write_data_page+0x3c1/0x820 __write_data_page+0x156/0x720 f2fs_write_cache_pages+0x20d/0x460 f2fs_write_data_pages+0x1b4/0x300 do_writepages+0x15/0x60 __filemap_fdatawrite_range+0x7c/0xb0 file_write_and_wait_range+0x2c/0x80 f2fs_do_sync_file+0x102/0x810 do_fsync+0x33/0x60 __x64_sys_fsync+0xb/0x10 do_syscall_64+0x43/0xf0 entry_SYSCALL_64_after_hwframe+0x44/0xa9 The reason is f2fs_inplace_write_data() will trigger kernel panic due to data block locates in node type segment. To avoid panic, let's just return error code and set SBI_NEED_FSCK to give a hint to fsck for latter repairing. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2019-09-16f2fs: fix to avoid panic in do_recover_data()Chao Yu1-1/+9
commit 22d61e286e2d9097dae36f75ed48801056b77cac upstream. As Jungyeon reported in bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=203227 - Overview When mounting the attached crafted image, following errors are reported. Additionally, it hangs on sync after trying to mount it. The image is intentionally fuzzed from a normal f2fs image for testing. Compile options for F2FS are as follows. CONFIG_F2FS_FS=y CONFIG_F2FS_STAT_FS=y CONFIG_F2FS_FS_XATTR=y CONFIG_F2FS_FS_POSIX_ACL=y CONFIG_F2FS_CHECK_FS=y - Reproduces mkdir test mount -t f2fs tmp.img test sync - Messages kernel BUG at fs/f2fs/recovery.c:549! RIP: 0010:recover_data+0x167a/0x1780 Call Trace: f2fs_recover_fsync_data+0x613/0x710 f2fs_fill_super+0x1043/0x1aa0 mount_bdev+0x16d/0x1a0 mount_fs+0x4a/0x170 vfs_kern_mount+0x5d/0x100 do_mount+0x200/0xcf0 ksys_mount+0x79/0xc0 __x64_sys_mount+0x1c/0x20 do_syscall_64+0x43/0xf0 entry_SYSCALL_64_after_hwframe+0x44/0xa9 During recovery, if ofs_of_node is inconsistent in between recovered node page and original checkpointed node page, let's just fail recovery instead of making kernel panic. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2019-09-16mm: page_mkclean vs MADV_DONTNEED raceAneesh Kumar K.V1-1/+1
commit 024eee0e83f0df52317be607ca521e0fc572aa07 upstream. MADV_DONTNEED is handled with mmap_sem taken in read mode. We call page_mkclean without holding mmap_sem. MADV_DONTNEED implies that pages in the region are unmapped and subsequent access to the pages in that range is handled as a new page fault. This implies that if we don't have parallel access to the region when MADV_DONTNEED is run we expect those range to be unallocated. w.r.t page_mkclean() we need to make sure that we don't break the MADV_DONTNEED semantics. MADV_DONTNEED check for pmd_none without holding pmd_lock. This implies we skip the pmd if we temporarily mark pmd none. Avoid doing that while marking the page clean. Keep the sequence same for dax too even though we don't support MADV_DONTNEED for dax mapping The bug was noticed by code review and I didn't observe any failures w.r.t test run. This is similar to commit 58ceeb6bec86d9140f9d91d71a710e963523d063 Author: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Date: Thu Apr 13 14:56:26 2017 -0700 thp: fix MADV_DONTNEED vs. MADV_FREE race commit ced108037c2aa542b3ed8b7afd1576064ad1362a Author: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Date: Thu Apr 13 14:56:20 2017 -0700 thp: fix MADV_DONTNEED vs. numa balancing race Link: http://lkml.kernel.org/r/20190321040610.14226-1-aneesh.kumar@linux.ibm.com Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc:"Kirill A . Shutemov" <kirill@shutemov.name> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2019-09-16fs/fat/file.c: issue flush after the writeback of FATHou Tao1-3/+8
commit bd8309de0d60838eef6fb575b0c4c7e95841cf73 upstream. fsync() needs to make sure the data & meta-data of file are persistent after the return of fsync(), even when a power-failure occurs later. In the case of fat-fs, the FAT belongs to the meta-data of file, so we need to issue a flush after the writeback of FAT instead before. Also bail out early when any stage of fsync fails. Link: http://lkml.kernel.org/r/20190409030158.136316-1-houtao1@huawei.com Signed-off-by: Hou Tao <houtao1@huawei.com> Acked-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2019-09-15Merge tag 'v4.18.43' into v4.18/standard/baseBruce Ashfield10-57/+73
This is the 4.18.43 stable release
2019-09-12pstore/ram: Run without kernel crash dump regionKees Cook2-14/+25
commit 8880fa32c557600f5f624084152668ed3c2ea51e upstream. The ram pstore backend has always had the crash dumper frontend enabled unconditionally. However, it was possible to effectively disable it by setting a record_size=0. All the machinery would run (storing dumps to the temporary crash buffer), but 0 bytes would ultimately get stored due to there being no przs allocated for dumps. Commit 89d328f637b9 ("pstore/ram: Correctly calculate usable PRZ bytes"), however, assumed that there would always be at least one allocated dprz for calculating the size of the temporary crash buffer. This was, of course, not the case when record_size=0, and would lead to a NULL deref trying to find the dprz buffer size: BUG: unable to handle kernel NULL pointer dereference at (null) ... IP: ramoops_probe+0x285/0x37e (fs/pstore/ram.c:808) cxt->pstore.bufsize = cxt->dprzs[0]->buffer_size; Instead, we need to only enable the frontends based on the success of the prz initialization and only take the needed actions when those zones are available. (This also fixes a possible error in detecting if the ftrace frontend should be enabled.) Reported-and-tested-by: Yaro Slav <yaro330@gmail.com> Fixes: 89d328f637b9 ("pstore/ram: Correctly calculate usable PRZ bytes") Cc: stable@vger.kernel.org Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2019-09-12pstore: Set tfm to NULL on free_buf_for_compressionPi-Hsun Shih1-1/+3
commit a9fb94a99bb515d8720ba8440ce3aba84aec80f8 upstream. Set tfm to NULL on free_buf_for_compression() after crypto_free_comp(). This avoid a use-after-free when allocate_buf_for_compression() and free_buf_for_compression() are called twice. Although free_buf_for_compression() freed the tfm, allocate_buf_for_compression() won't reinitialize the tfm since the tfm pointer is not NULL. Fixes: 95047b0519c1 ("pstore: Refactor compression initialization") Signed-off-by: Pi-Hsun Shih <pihsun@chromium.org> Cc: stable@vger.kernel.org Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2019-09-12NFSv4.1: Fix bug only first CB_NOTIFY_LOCK is handledYihao Wu1-3/+5
commit ba851a39c9703f09684a541885ed176f8fb7c868 upstream. When a waiter is waked by CB_NOTIFY_LOCK, it will retry nfs4_proc_setlk(). The waiter may fail to nfs4_proc_setlk() and sleep again. However, the waiter is already removed from clp->cl_lock_waitq when handling CB_NOTIFY_LOCK in nfs4_wake_lock_waiter(). So any subsequent CB_NOTIFY_LOCK won't wake this waiter anymore. We should put the waiter back to clp->cl_lock_waitq before retrying. Cc: stable@vger.kernel.org #4.9+ Signed-off-by: Yihao Wu <wuyihao@linux.alibaba.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2019-09-12NFSv4.1: Again fix a race where CB_NOTIFY_LOCK fails to wake a waiterYihao Wu1-17/+7
commit 52b042ab9948cc367b61f9ca9c18603aa7813c3a upstream. Commit b7dbcc0e433f "NFSv4.1: Fix a race where CB_NOTIFY_LOCK fails to wake a waiter" found this bug. However it didn't fix it. This commit replaces schedule_timeout() with wait_woken() and default_wake_function() with woken_wake_function() in function nfs4_retry_setlk() and nfs4_wake_lock_waiter(). wait_woken() uses memory barriers in its implementation to avoid potential race condition when putting a process into sleeping state and then waking it up. Fixes: a1d617d8f134 ("nfs: allow blocking locks to be awoken by lock callbacks") Cc: stable@vger.kernel.org #4.9+ Signed-off-by: Yihao Wu <wuyihao@linux.alibaba.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2019-09-12Revert "lockd: Show pid of lockd for remote locks"Benjamin Coddington2-4/+4
commit 141731d15d6eb2fd9aaefbf9b935ce86ae243074 upstream. This reverts most of commit b8eee0e90f97 ("lockd: Show pid of lockd for remote locks"), which caused remote locks to not be differentiated between remote processes for NLM. We retain the fixup for setting the client's fl_pid to a negative value. Fixes: b8eee0e90f97 ("lockd: Show pid of lockd for remote locks") Cc: stable@vger.kernel.org Signed-off-by: Benjamin Coddington <bcodding@redhat.com> Reviewed-by: XueWei Zhang <xueweiz@google.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2019-09-12CIFS: cifs_read_allocate_pages: don't iterate through whole page array on ENOMEMRoberto Bergantinos Corpas1-1/+3
commit 31fad7d41e73731f05b8053d17078638cf850fa6 upstream. In cifs_read_allocate_pages, in case of ENOMEM, we go through whole rdata->pages array but we have failed the allocation before nr_pages, therefore we may end up calling put_page with NULL pointer, causing oops Signed-off-by: Roberto Bergantinos Corpas <rbergant@redhat.com> Acked-by: Pavel Shilovsky <pshilov@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com> CC: Stable <stable@vger.kernel.org> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2019-09-12cifs: fix memory leak of pneg_inbuf on -EOPNOTSUPP ioctl caseColin Ian King1-1/+2
commit 210782038b54ec8e9059a3c12d6f6ae173efa3a9 upstream. Currently in the case where SMB2_ioctl returns the -EOPNOTSUPP error there is a memory leak of pneg_inbuf. Fix this by returning via the out_free_inbuf exit path that will perform the relevant kfree. Addresses-Coverity: ("Resource leak") Fixes: 969ae8e8d4ee ("cifs: Accept validate negotiate if server return NT_STATUS_NOT_SUPPORTED") CC: Stable <stable@vger.kernel.org> # v5.1+ Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Steve French <stfrench@microsoft.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2019-09-12Btrfs: incremental send, fix file corruption when no-holes feature is enabledFilipe Manana1-0/+6
commit 6b1f72e5b82a5c2a4da4d1ebb8cc01913ddbea21 upstream. When using the no-holes feature, if we have a file with prealloc extents with a start offset beyond the file's eof, doing an incremental send can cause corruption of the file due to incorrect hole detection. Such case requires that the prealloc extent(s) exist in both the parent and send snapshots, and that a hole is punched into the file that covers all its extents that do not cross the eof boundary. Example reproducer: $ mkfs.btrfs -f -O no-holes /dev/sdb $ mount /dev/sdb /mnt/sdb $ xfs_io -f -c "pwrite -S 0xab 0 500K" /mnt/sdb/foobar $ xfs_io -c "falloc -k 1200K 800K" /mnt/sdb/foobar $ btrfs subvolume snapshot -r /mnt/sdb /mnt/sdb/base $ btrfs send -f /tmp/base.snap /mnt/sdb/base $ xfs_io -c "fpunch 0 500K" /mnt/sdb/foobar $ btrfs subvolume snapshot -r /mnt/sdb /mnt/sdb/incr $ btrfs send -p /mnt/sdb/base -f /tmp/incr.snap /mnt/sdb/incr $ md5sum /mnt/sdb/incr/foobar 816df6f64deba63b029ca19d880ee10a /mnt/sdb/incr/foobar $ mkfs.btrfs -f /dev/sdc $ mount /dev/sdc /mnt/sdc $ btrfs receive -f /tmp/base.snap /mnt/sdc $ btrfs receive -f /tmp/incr.snap /mnt/sdc $ md5sum /mnt/sdc/incr/foobar cf2ef71f4a9e90c2f6013ba3b2257ed2 /mnt/sdc/incr/foobar --> Different checksum, because the prealloc extent beyond the file's eof confused the hole detection code and it assumed a hole starting at offset 0 and ending at the offset of the prealloc extent (1200Kb) instead of ending at the offset 500Kb (the file's size). Fix this by ensuring we never cross the file's size when issuing the write operations for a hole. Fixes: 16e7549f045d33 ("Btrfs: incompatible format change to remove hole extents") CC: stable@vger.kernel.org # 3.14+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2019-09-12Btrfs: fix fsync not persisting changed attributes of a directoryFilipe Manana1-12/+0
commit 60d9f50308e5df19bc18c2fefab0eba4a843900a upstream. While logging an inode we follow its ancestors and for each one we mark it as logged in the current transaction, even if we have not logged it. As a consequence if we change an attribute of an ancestor, such as the UID or GID for example, and then explicitly fsync it, we end up not logging the inode at all despite returning success to user space, which results in the attribute being lost if a power failure happens after the fsync. Sample reproducer: $ mkfs.btrfs -f /dev/sdb $ mount /dev/sdb /mnt $ mkdir /mnt/dir $ chown 6007:6007 /mnt/dir $ sync $ chown 9003:9003 /mnt/dir $ touch /mnt/dir/file $ xfs_io -c fsync /mnt/dir/file # fsync our directory after fsync'ing the new file, should persist the # new values for the uid and gid. $ xfs_io -c fsync /mnt/dir <power failure> $ mount /dev/sdb /mnt $ stat -c %u:%g /mnt/dir 6007:6007 --> should be 9003:9003, the uid and gid were not persisted, despite the explicit fsync on the directory prior to the power failure Fix this by not updating the logged_trans field of ancestor inodes when logging an inode, since we have not logged them. Let only future calls to btrfs_log_inode() to mark inodes as logged. This could be triggered by my recent fsync fuzz tester for fstests, for which an fstests patch exists titled "fstests: generic, fsync fuzz tester with fsstress". Fixes: 12fcfd22fe5b ("Btrfs: tree logging unlink/rename fixes") CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2019-09-12Btrfs: fix race updating log root item during fsyncFilipe Manana1-2/+6
commit 06989c799f04810f6876900d4760c0edda369cf7 upstream. When syncing the log, the final phase of a fsync operation, we need to either create a log root's item or update the existing item in the log tree of log roots, and that depends on the current value of the log root's log_transid - if it's 1 we need to create the log root item, otherwise it must exist already and we update it. Since there is no synchronization between updating the log_transid and checking it for deciding whether the log root's item needs to be created or updated, we end up with a tiny race window that results in attempts to update the item to fail because the item was not yet created: CPU 1 CPU 2 btrfs_sync_log() lock root->log_mutex set log root's log_transid to 1 unlock root->log_mutex btrfs_sync_log() lock root->log_mutex sets log root's log_transid to 2 unlock root->log_mutex update_log_root() sees log root's log_transid with a value of 2 calls btrfs_update_root(), which fails with -EUCLEAN and causes transaction abort Until recently the race lead to a BUG_ON at btrfs_update_root(), but after the recent commit 7ac1e464c4d47 ("btrfs: Don't panic when we can't find a root key") we just abort the current transaction. A sample trace of the BUG_ON() on a SLE12 kernel: ------------[ cut here ]------------ kernel BUG at ../fs/btrfs/root-tree.c:157! Oops: Exception in kernel mode, sig: 5 [#1] SMP NR_CPUS=2048 NUMA pSeries (...) Supported: Yes, External CPU: 78 PID: 76303 Comm: rtas_errd Tainted: G X 4.4.156-94.57-default #1 task: c00000ffa906d010 ti: c00000ff42b08000 task.ti: c00000ff42b08000 NIP: d000000036ae5cdc LR: d000000036ae5cd8 CTR: 0000000000000000 REGS: c00000ff42b0b860 TRAP: 0700 Tainted: G X (4.4.156-94.57-default) MSR: 8000000002029033 <SF,VEC,EE,ME,IR,DR,RI,LE> CR: 22444484 XER: 20000000 CFAR: d000000036aba66c SOFTE: 1 GPR00: d000000036ae5cd8 c00000ff42b0bae0 d000000036bda220 0000000000000054 GPR04: 0000000000000001 0000000000000000 c00007ffff8d37c8 0000000000000000 GPR08: c000000000e19c00 0000000000000000 0000000000000000 3736343438312079 GPR12: 3930373337303434 c000000007a3a800 00000000007fffff 0000000000000023 GPR16: c00000ffa9d26028 c00000ffa9d261f8 0000000000000010 c00000ffa9d2ab28 GPR20: c00000ff42b0bc48 0000000000000001 c00000ff9f0d9888 0000000000000001 GPR24: c00000ffa9d26000 c00000ffa9d261e8 c00000ffa9d2a800 c00000ff9f0d9888 GPR28: c00000ffa9d26028 c00000ffa9d2aa98 0000000000000001 c00000ffa98f5b20 NIP [d000000036ae5cdc] btrfs_update_root+0x25c/0x4e0 [btrfs] LR [d000000036ae5cd8] btrfs_update_root+0x258/0x4e0 [btrfs] Call Trace: [c00000ff42b0bae0] [d000000036ae5cd8] btrfs_update_root+0x258/0x4e0 [btrfs] (unreliable) [c00000ff42b0bba0] [d000000036b53610] btrfs_sync_log+0x2d0/0xc60 [btrfs] [c00000ff42b0bce0] [d000000036b1785c] btrfs_sync_file+0x44c/0x4e0 [btrfs] [c00000ff42b0bd80] [c00000000032e300] vfs_fsync_range+0x70/0x120 [c00000ff42b0bdd0] [c00000000032e44c] do_fsync+0x5c/0xb0 [c00000ff42b0be10] [c00000000032e8dc] SyS_fdatasync+0x2c/0x40 [c00000ff42b0be30] [c000000000009488] system_call+0x3c/0x100 Instruction dump: 7f43d378 4bffebb9 60000000 88d90008 3d220000 e8b90000 3b390009 e87a01f0 e8898e08 e8f90000 4bfd48e5 60000000 <0fe00000> e95b0060 39200004 394a0ea0 ---[ end trace 8f2dc8f919cabab8 ]--- So fix this by doing the check of log_transid and updating or creating the log root's item while holding the root's log_mutex. Fixes: 7237f1833601d ("Btrfs: fix tree logs parallel sync") CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2019-09-12Btrfs: fix wrong ctime and mtime of a directory after log replayFilipe Manana1-2/+12
commit 5338e43abbab13791144d37fd8846847062351c6 upstream. When replaying a log that contains a new file or directory name that needs to be added to its parent directory, we end up updating the mtime and the ctime of the parent directory to the current time after we have set their values to the correct ones (set at fsync time), efectivelly losing them. Sample reproducer: $ mkfs.btrfs -f /dev/sdb $ mount /dev/sdb /mnt $ mkdir /mnt/dir $ touch /mnt/dir/file # fsync of the directory is optional, not needed $ xfs_io -c fsync /mnt/dir $ xfs_io -c fsync /mnt/dir/file $ stat -c %Y /mnt/dir 1557856079 <power failure> $ sleep 3 $ mount /dev/sdb /mnt $ stat -c %Y /mnt/dir 1557856082 --> should have been 1557856079, the mtime is updated to the current time when replaying the log Fix this by not updating the mtime and ctime to the current time at btrfs_add_link() when we are replaying a log tree. This could be triggered by my recent fsync fuzz tester for fstests, for which an fstests patch exists titled "fstests: generic, fsync fuzz tester with fsstress". Fixes: e02119d5a7b43 ("Btrfs: Add a write ahead tree log to optimize synchronous operations") CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2019-08-25Merge tag 'v4.18.42' into v4.18/standard/baseBruce Ashfield2-1/+7
This is the 4.18.42 stable release
2019-08-17NFS: Fix a double unlock from nfs_match,get_clientBenjamin Coddington1-1/+1
commit c260121a97a3e4df6536edbc2f26e166eff370ce upstream. Now that nfs_match_client drops the nfs_client_lock, we should be careful to always return it in the same condition: locked. Fixes: 950a578c6128 ("NFS: make nfs_match_client killable") Reported-by: syzbot+228a82b263b5da91883d@syzkaller.appspotmail.com Signed-off-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2019-08-17chardev: add additional check for minor range overlapChengguang Xu1-0/+6
commit de36e16d1557a0b6eb328bc3516359a12ba5c25c upstream. Current overlap checking cannot correctly handle a case which is baseminor < existing baseminor && baseminor + minorct > existing baseminor + minorct. Signed-off-by: Chengguang Xu <cgxu519@gmx.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2019-08-07Merge tag 'v4.18.41' into v4.18/standard/baseBruce Ashfield28-116/+179
This is the 4.18.41 stable release
2019-08-02NFS: Cleanup if nfs_match_client is interruptedBenjamin Coddington1-2/+2
commit 9f7761cf0409465075dadb875d5d4b8ef2f890c8 upstream. Don't bail out before cleaning up a new allocation if the wait for searching for a matching nfs client is interrupted. Memory leaks. Reported-by: syzbot+7fe11b49c1cc30e3fce2@syzkaller.appspotmail.com Fixes: 950a578c6128 ("NFS: make nfs_match_client killable") Signed-off-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2019-08-02fuse: fallocate: fix return with locked inodeMiklos Szeredi1-1/+1
commit 35d6fcbb7c3e296a52136347346a698a35af3fda upstream. Do the proper cleanup in case the size check fails. Tested with xfstests:generic/228 Reported-by: kbuild test robot <lkp@intel.com> Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Fixes: 0cbade024ba5 ("fuse: honor RLIMIT_FSIZE in fuse_file_fallocate") Cc: Liu Bo <bo.liu@linux.alibaba.com> Cc: <stable@vger.kernel.org> # v3.5 Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2019-08-02btrfs: Don't panic when we can't find a root keyQu Wenruo1-5/+8
commit 7ac1e464c4d473b517bb784f30d40da1f842482e upstream. When we failed to find a root key in btrfs_update_root(), we just panic. That's definitely not cool, fix it by outputting an unique error message, aborting current transaction and return -EUCLEAN. This should not normally happen as the root has been used by the callers in some way. Reviewed-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2019-08-02btrfs: fix panic during relocation after ENOSPC before writeback happensJosef Bacik1-11/+20
commit ff612ba7849964b1898fd3ccd1f56941129c6aab upstream. We've been seeing the following sporadically throughout our fleet panic: kernel BUG at fs/btrfs/relocation.c:4584! netversion: 5.0-0 Backtrace: #0 [ffffc90003adb880] machine_kexec at ffffffff81041da8 #1 [ffffc90003adb8c8] __crash_kexec at ffffffff8110396c #2 [ffffc90003adb988] crash_kexec at ffffffff811048ad #3 [ffffc90003adb9a0] oops_end at ffffffff8101c19a #4 [ffffc90003adb9c0] do_trap at ffffffff81019114 #5 [ffffc90003adba00] do_error_trap at ffffffff810195d0 #6 [ffffc90003adbab0] invalid_op at ffffffff81a00a9b [exception RIP: btrfs_reloc_cow_block+692] RIP: ffffffff8143b614 RSP: ffffc90003adbb68 RFLAGS: 00010246 RAX: fffffffffffffff7 RBX: ffff8806b9c32000 RCX: ffff8806aad00690 RDX: ffff880850b295e0 RSI: ffff8806b9c32000 RDI: ffff88084f205bd0 RBP: ffff880849415000 R8: ffffc90003adbbe0 R9: ffff88085ac90000 R10: ffff8805f7369140 R11: 0000000000000000 R12: ffff880850b295e0 R13: ffff88084f205bd0 R14: 0000000000000000 R15: 0000000000000000 ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 #7 [ffffc90003adbbb0] __btrfs_cow_block at ffffffff813bf1cd #8 [ffffc90003adbc28] btrfs_cow_block at ffffffff813bf4b3 #9 [ffffc90003adbc78] btrfs_search_slot at ffffffff813c2e6c The way relocation moves data extents is by creating a reloc inode and preallocating extents in this inode and then copying the data into these preallocated extents. Once we've done this for all of our extents, we'll write out these dirty pages, which marks the extent written, and goes into btrfs_reloc_cow_block(). From here we get our current reloc_control, which _should_ match the reloc_control for the current block group we're relocating. However if we get an ENOSPC in this path at some point we'll bail out, never initiating writeback on this inode. Not a huge deal, unless we happen to be doing relocation on a different block group, and this block group is now rc->stage == UPDATE_DATA_PTRS. This trips the BUG_ON() in btrfs_reloc_cow_block(), because we expect to be done modifying the data inode. We are in fact done modifying the metadata for the data inode we're currently using, but not the one from the failed block group, and thus we BUG_ON(). (This happens when writeback finishes for extents from the previous group, when we are at btrfs_finish_ordered_io() which updates the data reloc tree (inode item, drops/adds extent items, etc).) Fix this by writing out the reloc data inode always, and then breaking out of the loop after that point to keep from tripping this BUG_ON() later. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> [ add note from Filipe ] Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2019-08-02Btrfs: fix data bytes_may_use underflow with fallocate due to failed quota ↵Robbie Ko1-1/+2
reserve commit 39ad317315887c2cb9a4347a93a8859326ddf136 upstream. When doing fallocate, we first add the range to the reserve_list and then reserve the quota. If quota reservation fails, we'll release all reserved parts of reserve_list. However, cur_offset is not updated to indicate that this range is already been inserted into the list. Therefore, the same range is freed twice. Once at list_for_each_entry loop, and once at the end of the function. This will result in WARN_ON on bytes_may_use when we free the remaining space. At the end, under the 'out' label we have a call to: btrfs_free_reserved_data_space(inode, data_reserved, alloc_start, alloc_end - cur_offset); The start offset, third argument, should be cur_offset. Everything from alloc_start to cur_offset was freed by the list_for_each_entry_safe_loop. Fixes: 18513091af94 ("btrfs: update btrfs_space_info's bytes_may_use timely") Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Robbie Ko <robbieko@synology.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2019-08-02gfs2: Fix occasional glock use-after-freeAndreas Gruenbacher3-3/+7
commit 9287c6452d2b1f24ea8e84bd3cf6f3c6f267f712 upstream. This patch has to do with the life cycle of glocks and buffers. When gfs2 metadata or journaled data is queued to be written, a gfs2_bufdata object is assigned to track the buffer, and that is queued to various lists, including the glock's gl_ail_list to indicate it's on the active items list. Once the page associated with the buffer has been written, it is removed from the ail list, but its life isn't over until a revoke has been successfully written. So after the block is written, its bufdata object is moved from the glock's gl_ail_list to a file-system-wide list of pending revokes, sd_log_le_revoke. At that point the glock still needs to track how many revokes it contributed to that list (in gl_revokes) so that things like glock go_sync can ensure all the metadata has been not only written, but also revoked before the glock is granted to a different node. This is to guarantee journal replay doesn't replay the block once the glock has been granted to another node. Ross Lagerwall recently discovered a race in which an inode could be evicted, and its glock freed after its ail list had been synced, but while it still had unwritten revokes on the sd_log_le_revoke list. The evict decremented the glock reference count to zero, which allowed the glock to be freed. After the revoke was written, function revoke_lo_after_commit tried to adjust the glock's gl_revokes counter and clear its GLF_LFLUSH flag, at which time it referenced the freed glock. This patch fixes the problem by incrementing the glock reference count in gfs2_add_revoke when the glock's first bufdata object is moved from the glock to the global revokes list. Later, when the glock's last such bufdata object is freed, the reference count is decremented. This guarantees that whichever process finishes last (the revoke writing or the evict) will properly free the glock, and neither will reference the glock after it has been freed. Reported-by: Ross Lagerwall <ross.lagerwall@citrix.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2019-08-02NFS: make nfs_match_client killableRoberto Bergantinos Corpas1-1/+6
commit 950a578c6128c2886e295b9c7ecb0b6b22fcc92b upstream. Actually we don't do anything with return value from nfs_wait_client_init_complete in nfs_match_client, as a consequence if we get a fatal signal and client is not fully initialised, we'll loop to "again" label This has been proven to cause soft lockups on some scenarios (no-carrier but configured network interfaces) Signed-off-by: Roberto Bergantinos Corpas <rbergant@redhat.com> Reviewed-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2019-08-02gfs2: Fix lru_count going negativeRoss Lagerwall1-9/+13
commit 7881ef3f33bb80f459ea6020d1e021fc524a6348 upstream. Under certain conditions, lru_count may drop below zero resulting in a large amount of log spam like this: vmscan: shrink_slab: gfs2_dump_glock+0x3b0/0x630 [gfs2] \ negative objects to delete nr=-1 This happens as follows: 1) A glock is moved from lru_list to the dispose list and lru_count is decremented. 2) The dispose function calls cond_resched() and drops the lru lock. 3) Another thread takes the lru lock and tries to add the same glock to lru_list, checking if the glock is on an lru list. 4) It is on a list (actually the dispose list) and so it avoids incrementing lru_count. 5) The glock is moved to lru_list. 5) The original thread doesn't dispose it because it has been re-added to the lru list but the lru_count has still decreased by one. Fix by checking if the LRU flag is set on the glock rather than checking if the glock is on some list and rearrange the code so that the LRU flag is added/removed precisely when the glock is added/removed from lru_list. Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2019-08-02acct_on(): don't mess with freeze protectionAl Viro1-2/+0
commit 9419a3191dcb27f24478d288abaab697228d28e6 upstream. What happens there is that we are replacing file->path.mnt of a file we'd just opened with a clone and we need the write count contribution to be transferred from original mount to new one. That's it. We do *NOT* want any kind of freeze protection for the duration of switchover. IOW, we should just use __mnt_{want,drop}_write() for that switchover; no need to bother with mnt_{want,drop}_write() there. Tested-by: Amir Goldstein <amir73il@gmail.com> Reported-by: syzbot+2a73a6ea9507b7112141@syzkaller.appspotmail.com Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2019-08-02NFSv4.1 fix incorrect return value in copy_file_rangeOlga Kornievskaia2-4/+3
commit 0769663b4f580566ef6cdf366f3073dbe8022c39 upstream. According to the NFSv4.2 spec if the input and output file is the same file, operation should fail with EINVAL. However, linux copy_file_range() system call has no such restrictions. Therefore, in such case let's return EOPNOTSUPP and allow VFS to fallback to doing do_splice_direct(). Also when copy_file_range is called on an NFSv4.0 or 4.1 mount (ie., a server that doesn't support COPY functionality), we also need to return EOPNOTSUPP and fallback to a regular copy. Fixes xfstest generic/075, generic/091, generic/112, generic/263 for all NFSv4.x versions. Signed-off-by: Olga Kornievskaia <kolga@netapp.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2019-08-02btrfs: sysfs: don't leak memory when failing add fsidTobin C. Harding1-1/+6
commit e32773357d5cc271b1d23550b3ed026eb5c2a468 upstream. A failed call to kobject_init_and_add() must be followed by a call to kobject_put(). Currently in the error path when adding fs_devices we are missing this call. This could be fixed by calling btrfs_sysfs_remove_fsid() if btrfs_sysfs_add_fsid() returns an error or by adding a call to kobject_put() directly in btrfs_sysfs_add_fsid(). Here we choose the second option because it prevents the slightly unusual error path handling requirements of kobject from leaking out into btrfs functions. Add a call to kobject_put() in the error path of kobject_add_and_init(). This causes the release method to be called if kobject_init_and_add() fails. open_tree() is the function that calls btrfs_sysfs_add_fsid() and the error code in this function is already written with the assumption that the release method is called during the error path of open_tree() (as seen by the call to btrfs_sysfs_remove_fsid() under the fail_fsdev_sysfs label). Cc: stable@vger.kernel.org # v4.4+ Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Tobin C. Harding <tobin@kernel.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2019-08-02btrfs: sysfs: Fix error path kobject memory leakTobin C. Harding1-2/+1
commit 450ff8348808a89cc27436771aa05c2b90c0eef1 upstream. If a call to kobject_init_and_add() fails we must call kobject_put() otherwise we leak memory. Calling kobject_put() when kobject_init_and_add() fails drops the refcount back to 0 and calls the ktype release method (which in turn calls the percpu destroy and kfree). Add call to kobject_put() in the error path of call to kobject_init_and_add(). Cc: stable@vger.kernel.org # v4.4+ Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Tobin C. Harding <tobin@kernel.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2019-08-02Btrfs: fix race between ranged fsync and writeback of adjacent rangesFilipe Manana1-0/+12
commit 0c713cbab6200b0ab6473b50435e450a6e1de85d upstream. When we do a full fsync (the bit BTRFS_INODE_NEEDS_FULL_SYNC is set in the inode) that happens to be ranged, which happens during a msync() or writes for files opened with O_SYNC for example, we can end up with a corrupt log, due to different file extent items representing ranges that overlap with each other, or hit some assertion failures. When doing a ranged fsync we only flush delalloc and wait for ordered exents within that range. If while we are logging items from our inode ordered extents for adjacent ranges complete, we end up in a race that can make us insert the file extent items that overlap with others we logged previously and the assertion failures. For example, if tree-log.c:copy_items() receives a leaf that has the following file extents items, all with a length of 4K and therefore there is an implicit hole in the range 68K to 72K - 1: (257 EXTENT_ITEM 64K), (257 EXTENT_ITEM 72K), (257 EXTENT_ITEM 76K), ... It copies them to the log tree. However due to the need to detect implicit holes, it may release the path, in order to look at the previous leaf to detect an implicit hole, and then later it will search again in the tree for the first file extent item key, with the goal of locking again the leaf (which might have changed due to concurrent changes to other inodes). However when it locks again the leaf containing the first key, the key corresponding to the extent at offset 72K may not be there anymore since there is an ordered extent for that range that is finishing (that is, somewhere in the middle of btrfs_finish_ordered_io()), and it just removed the file extent item but has not yet replaced it with a new file extent item, so the part of copy_items() that does hole detection will decide that there is a hole in the range starting from 68K to 76K - 1, and therefore insert a file extent item to represent that hole, having a key offset of 68K. After that we now have a log tree with 2 different extent items that have overlapping ranges: 1) The file extent item copied before copy_items() released the path, which has a key offset of 72K and a length of 4K, representing the file range 72K to 76K - 1. 2) And a file extent item representing a hole that has a key offset of 68K and a length of 8K, representing the range 68K to 76K - 1. This item was inserted after releasing the path, and overlaps with the extent item inserted before. The overlapping extent items can cause all sorts of unpredictable and incorrect behaviour, either when replayed or if a fast (non full) fsync happens later, which can trigger a BUG_ON() when calling btrfs_set_item_key_safe() through __btrfs_drop_extents(), producing a trace like the following: [61666.783269] ------------[ cut here ]------------ [61666.783943] kernel BUG at fs/btrfs/ctree.c:3182! [61666.784644] invalid opcode: 0000 [#1] PREEMPT SMP (...) [61666.786253] task: ffff880117b88c40 task.stack: ffffc90008168000 [61666.786253] RIP: 0010:btrfs_set_item_key_safe+0x7c/0xd2 [btrfs] [61666.786253] RSP: 0018:ffffc9000816b958 EFLAGS: 00010246 [61666.786253] RAX: 0000000000000000 RBX: 000000000000000f RCX: 0000000000030000 [61666.786253] RDX: 0000000000000000 RSI: ffffc9000816ba4f RDI: ffffc9000816b937 [61666.786253] RBP: ffffc9000816b998 R08: ffff88011dae2428 R09: 0000000000001000 [61666.786253] R10: 0000160000000000 R11: 6db6db6db6db6db7 R12: ffff88011dae2418 [61666.786253] R13: ffffc9000816ba4f R14: ffff8801e10c4118 R15: ffff8801e715c000 [61666.786253] FS: 00007f6060a18700(0000) GS:ffff88023f5c0000(0000) knlGS:0000000000000000 [61666.786253] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [61666.786253] CR2: 00007f6060a28000 CR3: 0000000213e69000 CR4: 00000000000006e0 [61666.786253] Call Trace: [61666.786253] __btrfs_drop_extents+0x5e3/0xaad [btrfs] [61666.786253] ? time_hardirqs_on+0x9/0x14 [61666.786253] btrfs_log_changed_extents+0x294/0x4e0 [btrfs] [61666.786253] ? release_extent_buffer+0x38/0xb4 [btrfs] [61666.786253] btrfs_log_inode+0xb6e/0xcdc [btrfs] [61666.786253] ? lock_acquire+0x131/0x1c5 [61666.786253] ? btrfs_log_inode_parent+0xee/0x659 [btrfs] [61666.786253] ? arch_local_irq_save+0x9/0xc [61666.786253] ? btrfs_log_inode_parent+0x1f5/0x659 [btrfs] [61666.786253] btrfs_log_inode_parent+0x223/0x659 [btrfs] [61666.786253] ? arch_local_irq_save+0x9/0xc [61666.786253] ? lockref_get_not_zero+0x2c/0x34 [61666.786253] ? rcu_read_unlock+0x3e/0x5d [61666.786253] btrfs_log_dentry_safe+0x60/0x7b [btrfs] [61666.786253] btrfs_sync_file+0x317/0x42c [btrfs] [61666.786253] vfs_fsync_range+0x8c/0x9e [61666.786253] SyS_msync+0x13c/0x1c9 [61666.786253] entry_SYSCALL_64_fastpath+0x18/0xad A sample of a corrupt log tree leaf with overlapping extents I got from running btrfs/072: item 14 key (295 108 200704) itemoff 2599 itemsize 53 extent data disk bytenr 0 nr 0 extent data offset 0 nr 458752 ram 458752 item 15 key (295 108 659456) itemoff 2546 itemsize 53 extent data disk bytenr 4343541760 nr 770048 extent data offset 606208 nr 163840 ram 770048 item 16 key (295 108 663552) itemoff 2493 itemsize 53 extent data disk bytenr 4343541760 nr 770048 extent data offset 610304 nr 155648 ram 770048 item 17 key (295 108 819200) itemoff 2440 itemsize 53 extent data disk bytenr 4334788608 nr 4096 extent data offset 0 nr 4096 ram 4096 The file extent item at offset 659456 (item 15) ends at offset 823296 (659456 + 163840) while the next file extent item (item 16) starts at offset 663552. Another different problem that the race can trigger is a failure in the assertions at tree-log.c:copy_items(), which expect that the first file extent item key we found before releasing the path exists after we have released path and that the last key we found before releasing the path also exists after releasing the path: $ cat -n fs/btrfs/tree-log.c 4080 if (need_find_last_extent) { 4081 /* btrfs_prev_leaf could return 1 without releasing the path */ 4082 btrfs_release_path(src_path); 4083 ret = btrfs_search_slot(NULL, inode->root, &first_key, 4084 src_path, 0, 0); 4085 if (ret < 0) 4086 return ret; 4087 ASSERT(ret == 0); (...) 4103 if (i >= btrfs_header_nritems(src_path->nodes[0])) { 4104 ret = btrfs_next_leaf(inode->root, src_path); 4105 if (ret < 0) 4106 return ret; 4107 ASSERT(ret == 0); 4108 src = src_path->nodes[0]; 4109 i = 0; 4110 need_find_last_extent = true; 4111 } (...) The second assertion implicitly expects that the last key before the path release still exists, because the surrounding while loop only stops after we have found that key. When this assertion fails it produces a stack like this: [139590.037075] assertion failed: ret == 0, file: fs/btrfs/tree-log.c, line: 4107 [139590.037406] ------------[ cut here ]------------ [139590.037707] kernel BUG at fs/btrfs/ctree.h:3546! [139590.038034] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC PTI [139590.038340] CPU: 1 PID: 31841 Comm: fsstress Tainted: G W 5.0.0-btrfs-next-46 #1 (...) [139590.039354] RIP: 0010:assfail.constprop.24+0x18/0x1a [btrfs] (...) [139590.040397] RSP: 0018:ffffa27f48f2b9b0 EFLAGS: 00010282 [139590.040730] RAX: 0000000000000041 RBX: ffff897c635d92c8 RCX: 0000000000000000 [139590.041105] RDX: 0000000000000000 RSI: ffff897d36a96868 RDI: ffff897d36a96868 [139590.041470] RBP: ffff897d1b9a0708 R08: 0000000000000000 R09: 0000000000000000 [139590.041815] R10: 0000000000000008 R11: 0000000000000000 R12: 0000000000000013 [139590.042159] R13: 0000000000000227 R14: ffff897cffcbba88 R15: 0000000000000001 [139590.042501] FS: 00007f2efc8dee80(0000) GS:ffff897d36a80000(0000) knlGS:0000000000000000 [139590.042847] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [139590.043199] CR2: 00007f8c064935e0 CR3: 0000000232252002 CR4: 00000000003606e0 [139590.043547] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [139590.043899] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [139590.044250] Call Trace: [139590.044631] copy_items+0xa3f/0x1000 [btrfs] [139590.045009] ? generic_bin_search.constprop.32+0x61/0x200 [btrfs] [139590.045396] btrfs_log_inode+0x7b3/0xd70 [btrfs] [139590.045773] btrfs_log_inode_parent+0x2b3/0xce0 [btrfs] [139590.046143] ? do_raw_spin_unlock+0x49/0xc0 [139590.046510] btrfs_log_dentry_safe+0x4a/0x70 [btrfs] [139590.046872] btrfs_sync_file+0x3b6/0x440 [btrfs] [139590.047243] btrfs_file_write_iter+0x45b/0x5c0 [btrfs] [139590.047592] __vfs_write+0x129/0x1c0 [139590.047932] vfs_write+0xc2/0x1b0 [139590.048270] ksys_write+0x55/0xc0 [139590.048608] do_syscall_64+0x60/0x1b0 [139590.048946] entry_SYSCALL_64_after_hwframe+0x49/0xbe [139590.049287] RIP: 0033:0x7f2efc4be190 (...) [139590.050342] RSP: 002b:00007ffe743243a8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001 [139590.050701] RAX: ffffffffffffffda RBX: 0000000000008d58 RCX: 00007f2efc4be190 [139590.051067] RDX: 0000000000008d58 RSI: 00005567eca0f370 RDI: 0000000000000003 [139590.051459] RBP: 0000000000000024 R08: 0000000000000003 R09: 0000000000008d60 [139590.051863] R10: 0000000000000078 R11: 0000000000000246 R12: 0000000000000003 [139590.052252] R13: 00000000003d3507 R14: 00005567eca0f370 R15: 0000000000000000 (...) [139590.055128] ---[ end trace 193f35d0215cdeeb ]--- So fix this race between a full ranged fsync and writeback of adjacent ranges by flushing all delalloc and waiting for all ordered extents to complete before logging the inode. This is the simplest way to solve the problem because currently the full fsync path does not deal with ranges at all (it assumes a full range from 0 to LLONG_MAX) and it always needs to look at adjacent ranges for hole detection. For use cases of ranged fsyncs this can make a few fsyncs slower but on the other hand it can make some following fsyncs to other ranges do less work or no need to do anything at all. A full fsync is rare anyway and happens only once after loading/creating an inode and once after less common operations such as a shrinking truncate. This is an issue that exists for a long time, and was often triggered by generic/127, because it does mmap'ed writes and msync (which triggers a ranged fsync). Adding support for the tree checker to detect overlapping extents (next patch in the series) and trigger a WARN() when such cases are found, and then calling btrfs_check_leaf_full() at the end of btrfs_insert_file_extent() made the issue much easier to detect. Running btrfs/072 with that change to the tree checker and making fsstress open files always with O_SYNC made it much easier to trigger the issue (as triggering it with generic/127 is very rare). CC: stable@vger.kernel.org # 3.16+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2019-08-02Btrfs: avoid fallback to transaction commit during fsync of files with holesFilipe Manana1-0/+1
commit ebb929060aeb162417b4c1307e63daee47b208d9 upstream. When we are doing a full fsync (bit BTRFS_INODE_NEEDS_FULL_SYNC set) of a file that has holes and has file extent items spanning two or more leafs, we can end up falling to back to a full transaction commit due to a logic bug that leads to failure to insert a duplicate file extent item that is meant to represent a hole between the last file extent item of a leaf and the first file extent item in the next leaf. The failure (EEXIST error) leads to a transaction commit (as most errors when logging an inode do). For example, we have the two following leafs: Leaf N: ----------------------------------------------- | ..., ..., ..., (257, FILE_EXTENT_ITEM, 64K) | ----------------------------------------------- The file extent item at the end of leaf N has a length of 4Kb, representing the file range from 64K to 68K - 1. Leaf N + 1: ----------------------------------------------- | (257, FILE_EXTENT_ITEM, 72K), ..., ..., ... | ----------------------------------------------- The file extent item at the first slot of leaf N + 1 has a length of 4Kb too, representing the file range from 72K to 76K - 1. During the full fsync path, when we are at tree-log.c:copy_items() with leaf N as a parameter, after processing the last file extent item, that represents the extent at offset 64K, we take a look at the first file extent item at the next leaf (leaf N + 1), and notice there's a 4K hole between the two extents, and therefore we insert a file extent item representing that hole, starting at file offset 68K and ending at offset 72K - 1. However we don't update the value of *last_extent, which is used to represent the end offset (plus 1, non-inclusive end) of the last file extent item inserted in the log, so it stays with a value of 68K and not with a value of 72K. Then, when copy_items() is called for leaf N + 1, because the value of *last_extent is smaller then the offset of the first extent item in the leaf (68K < 72K), we look at the last file extent item in the previous leaf (leaf N) and see it there's a 4K gap between it and our first file extent item (again, 68K < 72K), so we decide to insert a file extent item representing the hole, starting at file offset 68K and ending at offset 72K - 1, this insertion will fail with -EEXIST being returned from btrfs_insert_file_extent() because we already inserted a file extent item representing a hole for this offset (68K) in the previous call to copy_items(), when processing leaf N. The -EEXIST error gets propagated to the fsync callback, btrfs_sync_file(), which falls back to a full transaction commit. Fix this by adjusting *last_extent after inserting a hole when we had to look at the next leaf. Fixes: 4ee3fad34a9c ("Btrfs: fix fsync after hole punching when using no-holes feature") Cc: stable@vger.kernel.org # 4.14+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2019-08-02Btrfs: do not abort transaction at btrfs_update_root() after failure to COW pathFilipe Manana1-3/+1
commit 72bd2323ec87722c115a5906bc6a1b31d11e8f54 upstream. Currently when we fail to COW a path at btrfs_update_root() we end up always aborting the transaction. However all the current callers of btrfs_update_root() are able to deal with errors returned from it, many do end up aborting the transaction themselves (directly or not, such as the transaction commit path), other BUG_ON() or just gracefully cancel whatever they were doing. When syncing the fsync log, we call btrfs_update_root() through tree-log.c:update_log_root(), and if it returns an -ENOSPC error, the log sync code does not abort the transaction, instead it gracefully handles the error and returns -EAGAIN to the fsync handler, so that it falls back to a transaction commit. Any other error different from -ENOSPC, makes the log sync code abort the transaction. So remove the transaction abort from btrfs_update_log() when we fail to COW a path to update the root item, so that if an -ENOSPC failure happens we avoid aborting the current transaction and have a chance of the fsync succeeding after falling back to a transaction commit. Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=203413 Fixes: 79787eaab46121 ("btrfs: replace many BUG_ONs with proper error handling") Cc: stable@vger.kernel.org # 4.4+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>