summaryrefslogtreecommitdiffstats
path: root/fs
AgeCommit message (Collapse)Author
2019-12-10NFS: Fix an RCU lock leak in nfs4_refresh_delegation_stateid()Trond Myklebust
commit 79cc55422ce99be5964bde208ba8557174720893 upstream. A typo in nfs4_refresh_delegation_stateid() means we're leaking an RCU lock, and always returning a value of 'false'. As the function description states, we were always supposed to return 'true' if a matching delegation was found. Fixes: 12f275cdd163 ("NFSv4: Retry CLOSE and DELEGRETURN on NFS4ERR_OLD_STATEID.") Cc: stable@vger.kernel.org # v4.15+ Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2019-12-10fuse: truncate pending writes on O_TRUNCMiklos Szeredi
commit e4648309b85a78f8c787457832269a8712a8673e upstream. Make sure cached writes are not reordered around open(..., O_TRUNC), with the obvious wrong results. Fixes: 4d99ff8f12eb ("fuse: Turn writeback cache on") Cc: <stable@vger.kernel.org> # v3.15+ Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2019-12-10fuse: flush dirty data/metadata before non-truncate setattrMiklos Szeredi
commit b24e7598db62386a95a3c8b9c75630c5d56fe077 upstream. If writeback cache is enabled, then writes might get reordered with chmod/chown/utimes. The problem with this is that performing the write in the fuse daemon might itself change some of these attributes. In such case the following sequence of operations will result in file ending up with the wrong mode, for example: int fd = open ("suid", O_WRONLY|O_CREAT|O_EXCL); write (fd, "1", 1); fchown (fd, 0, 0); fchmod (fd, 04755); close (fd); This patch fixes this by flushing pending writes before performing chown/chmod/utimes. Reported-by: Giuseppe Scrivano <gscrivan@redhat.com> Tested-by: Giuseppe Scrivano <gscrivan@redhat.com> Fixes: 4d99ff8f12eb ("fuse: Turn writeback cache on") Cc: <stable@vger.kernel.org> # v3.15+ Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2019-12-10NFSv4: Fix leak of clp->cl_acceptor stringChuck Lever
commit 1047ec868332034d1fbcb2fae19fe6d4cb869ff2 upstream. Our client can issue multiple SETCLIENTID operations to the same server in some circumstances. Ensure that calls to nfs4_proc_setclientid() after the first one do not overwrite the previously allocated cl_acceptor string. unreferenced object 0xffff888461031800 (size 32): comm "mount.nfs", pid 2227, jiffies 4294822467 (age 1407.749s) hex dump (first 32 bytes): 6e 66 73 40 6b 6c 69 6d 74 2e 69 62 2e 31 30 31 nfs@klimt.ib.101 35 67 72 61 6e 67 65 72 2e 6e 65 74 00 00 00 00 5granger.net.... backtrace: [<00000000ab820188>] __kmalloc+0x128/0x176 [<00000000eeaf4ec8>] gss_stringify_acceptor+0xbd/0x1a7 [auth_rpcgss] [<00000000e85e3382>] nfs4_proc_setclientid+0x34e/0x46c [nfsv4] [<000000003d9cf1fa>] nfs40_discover_server_trunking+0x7a/0xed [nfsv4] [<00000000b81c3787>] nfs4_discover_server_trunking+0x81/0x244 [nfsv4] [<000000000801b55f>] nfs4_init_client+0x1b0/0x238 [nfsv4] [<00000000977daf7f>] nfs4_set_client+0xfe/0x14d [nfsv4] [<0000000053a68a2a>] nfs4_create_server+0x107/0x1db [nfsv4] [<0000000088262019>] nfs4_remote_mount+0x2c/0x59 [nfsv4] [<00000000e84a2fd0>] legacy_get_tree+0x2d/0x4c [<00000000797e947c>] vfs_get_tree+0x20/0xc7 [<00000000ecabaaa8>] fc_mount+0xe/0x36 [<00000000f15fafc2>] vfs_kern_mount+0x74/0x8d [<00000000a3ff4e26>] nfs_do_root_mount+0x8a/0xa3 [nfsv4] [<00000000d1c2b337>] nfs4_try_mount+0x58/0xad [nfsv4] [<000000004c9bddee>] nfs_fs_mount+0x820/0x869 [nfs] Fixes: f11b2a1cfbf5 ("nfs4: copy acceptor name from context ... ") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2019-12-10btrfs: silence maybe-uninitialized warning in clone_rangeAustin Kim
commit 431d39887d6273d6d84edf3c2eab09f4200e788a upstream. GCC throws warning message as below: ‘clone_src_i_size’ may be used uninitialized in this function [-Wmaybe-uninitialized] #define IS_ALIGNED(x, a) (((x) & ((typeof(x))(a) - 1)) == 0) ^ fs/btrfs/send.c:5088:6: note: ‘clone_src_i_size’ was declared here u64 clone_src_i_size; ^ The clone_src_i_size is only used as call-by-reference in a call to get_inode_info(). Silence the warning by initializing clone_src_i_size to 0. Note that the warning is a false positive and reported by older versions of GCC (eg. 7.x) but not eg 9.x. As there have been numerous people, the patch is applied. Setting clone_src_i_size to 0 does not otherwise make sense and would not do any action in case the code changes in the future. Signed-off-by: Austin Kim <austindh.kim@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> [ add note ] Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2019-12-10fs: ocfs2: fix a possible null-pointer dereference in ocfs2_write_end_nolock()Jia-Ju Bai
commit 583fee3e12df0e6f1f66f063b989d8e7fed0e65a upstream. In ocfs2_write_end_nolock(), there are an if statement on lines 1976, 2047 and 2058, to check whether handle is NULL: if (handle) When handle is NULL, it is used on line 2045: ocfs2_update_inode_fsync_trans(handle, inode, 1); oi->i_sync_tid = handle->h_transaction->t_tid; Thus, a possible null-pointer dereference may occur. To fix this bug, handle is checked before calling ocfs2_update_inode_fsync_trans(). This bug is found by a static analysis tool STCheck written by us. Link: http://lkml.kernel.org/r/20190726033705.32307-1-baijiaju1990@gmail.com Signed-off-by: Jia-Ju Bai <baijiaju1990@gmail.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Gang He <ghe@suse.com> Cc: Jun Piao <piaojun@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2019-12-10fs: ocfs2: fix a possible null-pointer dereference in ↵Jia-Ju Bai
ocfs2_info_scan_inode_alloc() commit 2abb7d3b12d007c30193f48bebed781009bebdd2 upstream. In ocfs2_info_scan_inode_alloc(), there is an if statement on line 283 to check whether inode_alloc is NULL: if (inode_alloc) When inode_alloc is NULL, it is used on line 287: ocfs2_inode_lock(inode_alloc, &bh, 0); ocfs2_inode_lock_full_nested(inode, ...) struct ocfs2_super *osb = OCFS2_SB(inode->i_sb); Thus, a possible null-pointer dereference may occur. To fix this bug, inode_alloc is checked on line 286. This bug is found by a static analysis tool STCheck written by us. Link: http://lkml.kernel.org/r/20190726033717.32359-1-baijiaju1990@gmail.com Signed-off-by: Jia-Ju Bai <baijiaju1990@gmail.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Gang He <ghe@suse.com> Cc: Jun Piao <piaojun@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2019-12-10ocfs2: clear zero in unaligned direct IOJia Guo
commit 7a243c82ea527cd1da47381ad9cd646844f3b693 upstream. Unused portion of a part-written fs-block-sized block is not set to zero in unaligned append direct write.This can lead to serious data inconsistencies. Ocfs2 manage disk with cluster size(for example, 1M), part-written in one cluster will change the cluster state from UN-WRITTEN to WRITTEN, VFS(function dio_zero_block) doesn't do the cleaning because bh's state is not set to NEW in function ocfs2_dio_wr_get_block when we write a WRITTEN cluster. For example, the cluster size is 1M, file size is 8k and we direct write from 14k to 15k, then 12k~14k and 15k~16k will contain dirty data. We have to deal with two cases: 1.The starting position of direct write is outside the file. 2.The starting position of direct write is located in the file. We need set bh's state to NEW in the first case. In the second case, we need mapped twice because bh's state of area out file should be set to NEW while area in file not. [akpm@linux-foundation.org: coding style fixes] Link: http://lkml.kernel.org/r/5292e287-8f1a-fd4a-1a14-661e555e0bed@huawei.com Signed-off-by: Jia Guo <guojia12@huawei.com> Reviewed-by: Yiwen Jiang <jiangyiwen@huawei.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <joseph.qi@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2019-12-10fs: cifs: mute -Wunused-const-variable messageAustin Kim
commit dd19c106a36690b47bb1acc68372f2b472b495b8 upstream. After 'Initial git repository build' commit, 'mapping_table_ERRHRD' variable has not been used. So 'mapping_table_ERRHRD' const variable could be removed to mute below warning message: fs/cifs/netmisc.c:120:40: warning: unused variable 'mapping_table_ERRHRD' [-Wunused-const-variable] static const struct smb_to_posix_error mapping_table_ERRHRD[] = { ^ Signed-off-by: Austin Kim <austindh.kim@gmail.com> Signed-off-by: Steve French <stfrench@microsoft.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2019-12-10nfs: Fix nfsi->nrequests count error on nfs_inode_remove_requestZhangXiaoxu
commit 33ea5aaa87cdae0f9af4d6b7ee4f650a1a36fd1d upstream. When xfstests testing, there are some WARNING as below: WARNING: CPU: 0 PID: 6235 at fs/nfs/inode.c:122 nfs_clear_inode+0x9c/0xd8 Modules linked in: CPU: 0 PID: 6235 Comm: umount.nfs Hardware name: linux,dummy-virt (DT) pstate: 60000005 (nZCv daif -PAN -UAO) pc : nfs_clear_inode+0x9c/0xd8 lr : nfs_evict_inode+0x60/0x78 sp : fffffc000f68fc00 x29: fffffc000f68fc00 x28: fffffe00c53155c0 x27: fffffe00c5315000 x26: fffffc0009a63748 x25: fffffc000f68fd18 x24: fffffc000bfaaf40 x23: fffffc000936d3c0 x22: fffffe00c4ff5e20 x21: fffffc000bfaaf40 x20: fffffe00c4ff5d10 x19: fffffc000c056000 x18: 000000000000003c x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000040 x14: 0000000000000228 x13: fffffc000c3a2000 x12: 0000000000000045 x11: 0000000000000000 x10: 0000000000000000 x9 : 0000000000000000 x8 : 0000000000000000 x7 : 0000000000000000 x6 : fffffc00084b027c x5 : fffffc0009a64000 x4 : fffffe00c0e77400 x3 : fffffc000c0563a8 x2 : fffffffffffffffb x1 : 000000000000764e x0 : 0000000000000001 Call trace: nfs_clear_inode+0x9c/0xd8 nfs_evict_inode+0x60/0x78 evict+0x108/0x380 dispose_list+0x70/0xa0 evict_inodes+0x194/0x210 generic_shutdown_super+0xb0/0x220 nfs_kill_super+0x40/0x88 deactivate_locked_super+0xb4/0x120 deactivate_super+0x144/0x160 cleanup_mnt+0x98/0x148 __cleanup_mnt+0x38/0x50 task_work_run+0x114/0x160 do_notify_resume+0x2f8/0x308 work_pending+0x8/0x14 The nrequest should be increased/decreased only if PG_INODE_REF flag was setted. But in the nfs_inode_remove_request function, it maybe decrease when no PG_INODE_REF flag, this maybe lead nrequests count error. Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: ZhangXiaoxu <zhangxiaoxu5@huawei.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2019-12-10btrfs: qgroup: Always free PREALLOC META reserve in ↵Qu Wenruo
btrfs_delalloc_release_extents() commit 8702ba9396bf7bbae2ab93c94acd4bd37cfa4f09 upstream. [Background] Btrfs qgroup uses two types of reserved space for METADATA space, PERTRANS and PREALLOC. PERTRANS is metadata space reserved for each transaction started by btrfs_start_transaction(). While PREALLOC is for delalloc, where we reserve space before joining a transaction, and finally it will be converted to PERTRANS after the writeback is done. [Inconsistency] However there is inconsistency in how we handle PREALLOC metadata space. The most obvious one is: In btrfs_buffered_write(): btrfs_delalloc_release_extents(BTRFS_I(inode), reserve_bytes, true); We always free qgroup PREALLOC meta space. While in btrfs_truncate_block(): btrfs_delalloc_release_extents(BTRFS_I(inode), blocksize, (ret != 0)); We only free qgroup PREALLOC meta space when something went wrong. [The Correct Behavior] The correct behavior should be the one in btrfs_buffered_write(), we should always free PREALLOC metadata space. The reason is, the btrfs_delalloc_* mechanism works by: - Reserve metadata first, even it's not necessary In btrfs_delalloc_reserve_metadata() - Free the unused metadata space Normally in: btrfs_delalloc_release_extents() |- btrfs_inode_rsv_release() Here we do calculation on whether we should release or not. E.g. for 64K buffered write, the metadata rsv works like: /* The first page */ reserve_meta: num_bytes=calc_inode_reservations() free_meta: num_bytes=0 total: num_bytes=calc_inode_reservations() /* The first page caused one outstanding extent, thus needs metadata rsv */ /* The 2nd page */ reserve_meta: num_bytes=calc_inode_reservations() free_meta: num_bytes=calc_inode_reservations() total: not changed /* The 2nd page doesn't cause new outstanding extent, needs no new meta rsv, so we free what we have reserved */ /* The 3rd~16th pages */ reserve_meta: num_bytes=calc_inode_reservations() free_meta: num_bytes=calc_inode_reservations() total: not changed (still space for one outstanding extent) This means, if btrfs_delalloc_release_extents() determines to free some space, then those space should be freed NOW. So for qgroup, we should call btrfs_qgroup_free_meta_prealloc() other than btrfs_qgroup_convert_reserved_meta(). The good news is: - The callers are not that hot The hottest caller is in btrfs_buffered_write(), which is already fixed by commit 336a8bb8e36a ("btrfs: Fix wrong btrfs_delalloc_release_extents parameter"). Thus it's not that easy to cause false EDQUOT. - The trans commit in advance for qgroup would hide the bug Since commit f5fef4593653 ("btrfs: qgroup: Make qgroup async transaction commit more aggressive"), when btrfs qgroup metadata free space is slow, it will try to commit transaction and free the wrongly converted PERTRANS space, so it's not that easy to hit such bug. [FIX] So to fix the problem, remove the @qgroup_free parameter for btrfs_delalloc_release_extents(), and always pass true to btrfs_inode_rsv_release(). Reported-by: Filipe Manana <fdmanana@suse.com> Fixes: 43b18595d660 ("btrfs: qgroup: Use separate meta reservation type for delalloc") CC: stable@vger.kernel.org # 4.19+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2019-12-10Btrfs: fix inode cache block reserve leak on failure to allocate data spaceFilipe Manana
commit 29d47d00e0ae61668ee0c5d90bef2893c8abbafa upstream. If we failed to allocate the data extent(s) for the inode space cache, we were bailing out without releasing the previously reserved metadata. This was triggering the following warnings when unmounting a filesystem: $ cat -n fs/btrfs/inode.c (...) 9268 void btrfs_destroy_inode(struct inode *inode) 9269 { (...) 9276 WARN_ON(BTRFS_I(inode)->block_rsv.reserved); 9277 WARN_ON(BTRFS_I(inode)->block_rsv.size); (...) 9281 WARN_ON(BTRFS_I(inode)->csum_bytes); 9282 WARN_ON(BTRFS_I(inode)->defrag_bytes); (...) Several fstests test cases triggered this often, such as generic/083, generic/102, generic/172, generic/269 and generic/300 at least, producing stack traces like the following in dmesg/syslog: [82039.079546] WARNING: CPU: 2 PID: 13167 at fs/btrfs/inode.c:9276 btrfs_destroy_inode+0x203/0x270 [btrfs] (...) [82039.081543] CPU: 2 PID: 13167 Comm: umount Tainted: G W 5.2.0-rc4-btrfs-next-50 #1 [82039.081912] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014 [82039.082673] RIP: 0010:btrfs_destroy_inode+0x203/0x270 [btrfs] (...) [82039.083913] RSP: 0018:ffffac0b426a7d30 EFLAGS: 00010206 [82039.084320] RAX: ffff8ddf77691158 RBX: ffff8dde29b34660 RCX: 0000000000000002 [82039.084736] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff8dde29b34660 [82039.085156] RBP: ffff8ddf5fbec000 R08: 0000000000000000 R09: 0000000000000000 [82039.085578] R10: ffffac0b426a7c90 R11: ffffffffb9aad768 R12: ffffac0b426a7db0 [82039.086000] R13: ffff8ddf5fbec0a0 R14: dead000000000100 R15: 0000000000000000 [82039.086416] FS: 00007f8db96d12c0(0000) GS:ffff8de036b00000(0000) knlGS:0000000000000000 [82039.086837] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [82039.087253] CR2: 0000000001416108 CR3: 00000002315cc001 CR4: 00000000003606e0 [82039.087672] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [82039.088089] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [82039.088504] Call Trace: [82039.088918] destroy_inode+0x3b/0x70 [82039.089340] btrfs_free_fs_root+0x16/0xa0 [btrfs] [82039.089768] btrfs_free_fs_roots+0xd8/0x160 [btrfs] [82039.090183] ? wait_for_completion+0x65/0x1a0 [82039.090607] close_ctree+0x172/0x370 [btrfs] [82039.091021] generic_shutdown_super+0x6c/0x110 [82039.091427] kill_anon_super+0xe/0x30 [82039.091832] btrfs_kill_super+0x12/0xa0 [btrfs] [82039.092233] deactivate_locked_super+0x3a/0x70 [82039.092636] cleanup_mnt+0x3b/0x80 [82039.093039] task_work_run+0x93/0xc0 [82039.093457] exit_to_usermode_loop+0xfa/0x100 [82039.093856] do_syscall_64+0x162/0x1d0 [82039.094244] entry_SYSCALL_64_after_hwframe+0x49/0xbe [82039.094634] RIP: 0033:0x7f8db8fbab37 (...) [82039.095876] RSP: 002b:00007ffdce35b468 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6 [82039.096290] RAX: 0000000000000000 RBX: 0000560d20b00060 RCX: 00007f8db8fbab37 [82039.096700] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000560d20b00240 [82039.097110] RBP: 0000560d20b00240 R08: 0000560d20b00270 R09: 0000000000000015 [82039.097522] R10: 00000000000006b4 R11: 0000000000000246 R12: 00007f8db94bce64 [82039.097937] R13: 0000000000000000 R14: 0000000000000000 R15: 00007ffdce35b6f0 [82039.098350] irq event stamp: 0 [82039.098750] hardirqs last enabled at (0): [<0000000000000000>] 0x0 [82039.099150] hardirqs last disabled at (0): [<ffffffffb7884ff2>] copy_process.part.33+0x7f2/0x1f00 [82039.099545] softirqs last enabled at (0): [<ffffffffb7884ff2>] copy_process.part.33+0x7f2/0x1f00 [82039.099925] softirqs last disabled at (0): [<0000000000000000>] 0x0 [82039.100292] ---[ end trace f2521afa616ddccc ]--- [82039.100707] WARNING: CPU: 2 PID: 13167 at fs/btrfs/inode.c:9277 btrfs_destroy_inode+0x1ac/0x270 [btrfs] (...) [82039.103050] CPU: 2 PID: 13167 Comm: umount Tainted: G W 5.2.0-rc4-btrfs-next-50 #1 [82039.103428] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014 [82039.104203] RIP: 0010:btrfs_destroy_inode+0x1ac/0x270 [btrfs] (...) [82039.105461] RSP: 0018:ffffac0b426a7d30 EFLAGS: 00010206 [82039.105866] RAX: ffff8ddf77691158 RBX: ffff8dde29b34660 RCX: 0000000000000002 [82039.106270] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff8dde29b34660 [82039.106673] RBP: ffff8ddf5fbec000 R08: 0000000000000000 R09: 0000000000000000 [82039.107078] R10: ffffac0b426a7c90 R11: ffffffffb9aad768 R12: ffffac0b426a7db0 [82039.107487] R13: ffff8ddf5fbec0a0 R14: dead000000000100 R15: 0000000000000000 [82039.107894] FS: 00007f8db96d12c0(0000) GS:ffff8de036b00000(0000) knlGS:0000000000000000 [82039.108309] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [82039.108723] CR2: 0000000001416108 CR3: 00000002315cc001 CR4: 00000000003606e0 [82039.109146] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [82039.109567] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [82039.109989] Call Trace: [82039.110405] destroy_inode+0x3b/0x70 [82039.110830] btrfs_free_fs_root+0x16/0xa0 [btrfs] [82039.111257] btrfs_free_fs_roots+0xd8/0x160 [btrfs] [82039.111675] ? wait_for_completion+0x65/0x1a0 [82039.112101] close_ctree+0x172/0x370 [btrfs] [82039.112519] generic_shutdown_super+0x6c/0x110 [82039.112988] kill_anon_super+0xe/0x30 [82039.113439] btrfs_kill_super+0x12/0xa0 [btrfs] [82039.113861] deactivate_locked_super+0x3a/0x70 [82039.114278] cleanup_mnt+0x3b/0x80 [82039.114685] task_work_run+0x93/0xc0 [82039.115083] exit_to_usermode_loop+0xfa/0x100 [82039.115476] do_syscall_64+0x162/0x1d0 [82039.115863] entry_SYSCALL_64_after_hwframe+0x49/0xbe [82039.116254] RIP: 0033:0x7f8db8fbab37 (...) [82039.117463] RSP: 002b:00007ffdce35b468 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6 [82039.117882] RAX: 0000000000000000 RBX: 0000560d20b00060 RCX: 00007f8db8fbab37 [82039.118330] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000560d20b00240 [82039.118743] RBP: 0000560d20b00240 R08: 0000560d20b00270 R09: 0000000000000015 [82039.119159] R10: 00000000000006b4 R11: 0000000000000246 R12: 00007f8db94bce64 [82039.119574] R13: 0000000000000000 R14: 0000000000000000 R15: 00007ffdce35b6f0 [82039.119987] irq event stamp: 0 [82039.120387] hardirqs last enabled at (0): [<0000000000000000>] 0x0 [82039.120787] hardirqs last disabled at (0): [<ffffffffb7884ff2>] copy_process.part.33+0x7f2/0x1f00 [82039.121182] softirqs last enabled at (0): [<ffffffffb7884ff2>] copy_process.part.33+0x7f2/0x1f00 [82039.121563] softirqs last disabled at (0): [<0000000000000000>] 0x0 [82039.121933] ---[ end trace f2521afa616ddccd ]--- [82039.122353] WARNING: CPU: 2 PID: 13167 at fs/btrfs/inode.c:9278 btrfs_destroy_inode+0x1bc/0x270 [btrfs] (...) [82039.124606] CPU: 2 PID: 13167 Comm: umount Tainted: G W 5.2.0-rc4-btrfs-next-50 #1 [82039.125008] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014 [82039.125801] RIP: 0010:btrfs_destroy_inode+0x1bc/0x270 [btrfs] (...) [82039.126998] RSP: 0018:ffffac0b426a7d30 EFLAGS: 00010202 [82039.127399] RAX: ffff8ddf77691158 RBX: ffff8dde29b34660 RCX: 0000000000000002 [82039.127803] RDX: 0000000000000001 RSI: 0000000000000001 RDI: ffff8dde29b34660 [82039.128206] RBP: ffff8ddf5fbec000 R08: 0000000000000000 R09: 0000000000000000 [82039.128611] R10: ffffac0b426a7c90 R11: ffffffffb9aad768 R12: ffffac0b426a7db0 [82039.129020] R13: ffff8ddf5fbec0a0 R14: dead000000000100 R15: 0000000000000000 [82039.129428] FS: 00007f8db96d12c0(0000) GS:ffff8de036b00000(0000) knlGS:0000000000000000 [82039.129846] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [82039.130261] CR2: 0000000001416108 CR3: 00000002315cc001 CR4: 00000000003606e0 [82039.130684] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [82039.131142] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [82039.131561] Call Trace: [82039.131990] destroy_inode+0x3b/0x70 [82039.132417] btrfs_free_fs_root+0x16/0xa0 [btrfs] [82039.132844] btrfs_free_fs_roots+0xd8/0x160 [btrfs] [82039.133262] ? wait_for_completion+0x65/0x1a0 [82039.133688] close_ctree+0x172/0x370 [btrfs] [82039.134157] generic_shutdown_super+0x6c/0x110 [82039.134575] kill_anon_super+0xe/0x30 [82039.134997] btrfs_kill_super+0x12/0xa0 [btrfs] [82039.135415] deactivate_locked_super+0x3a/0x70 [82039.135832] cleanup_mnt+0x3b/0x80 [82039.136239] task_work_run+0x93/0xc0 [82039.136637] exit_to_usermode_loop+0xfa/0x100 [82039.137029] do_syscall_64+0x162/0x1d0 [82039.137418] entry_SYSCALL_64_after_hwframe+0x49/0xbe [82039.137812] RIP: 0033:0x7f8db8fbab37 (...) [82039.139059] RSP: 002b:00007ffdce35b468 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6 [82039.139475] RAX: 0000000000000000 RBX: 0000560d20b00060 RCX: 00007f8db8fbab37 [82039.139890] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000560d20b00240 [82039.140302] RBP: 0000560d20b00240 R08: 0000560d20b00270 R09: 0000000000000015 [82039.140719] R10: 00000000000006b4 R11: 0000000000000246 R12: 00007f8db94bce64 [82039.141138] R13: 0000000000000000 R14: 0000000000000000 R15: 00007ffdce35b6f0 [82039.141597] irq event stamp: 0 [82039.142043] hardirqs last enabled at (0): [<0000000000000000>] 0x0 [82039.142443] hardirqs last disabled at (0): [<ffffffffb7884ff2>] copy_process.part.33+0x7f2/0x1f00 [82039.142839] softirqs last enabled at (0): [<ffffffffb7884ff2>] copy_process.part.33+0x7f2/0x1f00 [82039.143220] softirqs last disabled at (0): [<0000000000000000>] 0x0 [82039.143588] ---[ end trace f2521afa616ddcce ]--- [82039.167472] WARNING: CPU: 3 PID: 13167 at fs/btrfs/extent-tree.c:10120 btrfs_free_block_groups+0x30d/0x460 [btrfs] (...) [82039.173800] CPU: 3 PID: 13167 Comm: umount Tainted: G W 5.2.0-rc4-btrfs-next-50 #1 [82039.174847] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014 [82039.177031] RIP: 0010:btrfs_free_block_groups+0x30d/0x460 [btrfs] (...) [82039.180397] RSP: 0018:ffffac0b426a7dd8 EFLAGS: 00010206 [82039.181574] RAX: ffff8de010a1db40 RBX: ffff8de010a1db40 RCX: 0000000000170014 [82039.182711] RDX: ffff8ddff4380040 RSI: ffff8de010a1da58 RDI: 0000000000000246 [82039.183817] RBP: ffff8ddf5fbec000 R08: 0000000000000000 R09: 0000000000000000 [82039.184925] R10: ffff8de036404380 R11: ffffffffb8a5ea00 R12: ffff8de010a1b2b8 [82039.186090] R13: ffff8de010a1b2b8 R14: 0000000000000000 R15: dead000000000100 [82039.187208] FS: 00007f8db96d12c0(0000) GS:ffff8de036b80000(0000) knlGS:0000000000000000 [82039.188345] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [82039.189481] CR2: 00007fb044005170 CR3: 00000002315cc006 CR4: 00000000003606e0 [82039.190674] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [82039.191829] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [82039.192978] Call Trace: [82039.194160] close_ctree+0x19a/0x370 [btrfs] [82039.195315] generic_shutdown_super+0x6c/0x110 [82039.196486] kill_anon_super+0xe/0x30 [82039.197645] btrfs_kill_super+0x12/0xa0 [btrfs] [82039.198696] deactivate_locked_super+0x3a/0x70 [82039.199619] cleanup_mnt+0x3b/0x80 [82039.200559] task_work_run+0x93/0xc0 [82039.201505] exit_to_usermode_loop+0xfa/0x100 [82039.202436] do_syscall_64+0x162/0x1d0 [82039.203339] entry_SYSCALL_64_after_hwframe+0x49/0xbe [82039.204091] RIP: 0033:0x7f8db8fbab37 (...) [82039.206360] RSP: 002b:00007ffdce35b468 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6 [82039.207132] RAX: 0000000000000000 RBX: 0000560d20b00060 RCX: 00007f8db8fbab37 [82039.207906] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000560d20b00240 [82039.208621] RBP: 0000560d20b00240 R08: 0000560d20b00270 R09: 0000000000000015 [82039.209285] R10: 00000000000006b4 R11: 0000000000000246 R12: 00007f8db94bce64 [82039.209984] R13: 0000000000000000 R14: 0000000000000000 R15: 00007ffdce35b6f0 [82039.210642] irq event stamp: 0 [82039.211306] hardirqs last enabled at (0): [<0000000000000000>] 0x0 [82039.211971] hardirqs last disabled at (0): [<ffffffffb7884ff2>] copy_process.part.33+0x7f2/0x1f00 [82039.212643] softirqs last enabled at (0): [<ffffffffb7884ff2>] copy_process.part.33+0x7f2/0x1f00 [82039.213304] softirqs last disabled at (0): [<0000000000000000>] 0x0 [82039.213875] ---[ end trace f2521afa616ddccf ]--- Fix this by releasing the reserved metadata on failure to allocate data extent(s) for the inode cache. Fixes: 69fe2d75dd91d0 ("btrfs: make the delalloc block rsv per inode") Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2019-11-25ceph: just skip unrecognized info in ceph_reply_info_extraJeff Layton
commit 1d3f87233e26362fc3d4e59f0f31a71b570f90b9 upstream. In the future, we're going to want to extend the ceph_reply_info_extra for create replies. Currently though, the kernel code doesn't accept an extra blob that is larger than the expected data. Change the code to skip over any unrecognized fields at the end of the extra blob, rather than returning -EIO. Cc: stable@vger.kernel.org Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2019-11-25btrfs: tracepoints: Fix wrong parameter order for qgroup eventsQu Wenruo
commit fd2b007eaec898564e269d1f478a2da0380ecf51 upstream. [BUG] For btrfs:qgroup_meta_reserve event, the trace event can output garbage: qgroup_meta_reserve: 9c7f6acc-b342-4037-bc47-7f6e4d2232d7: refroot=5(FS_TREE) type=DATA diff=2 The diff should always be alinged to sector size (4k), so there is definitely something wrong. [CAUSE] For the wrong @diff, it's caused by wrong parameter order. The correct parameters are: struct btrfs_root, s64 diff, int type. However the parameters used are: struct btrfs_root, int type, s64 diff. Fixes: 4ee0d8832c2e ("btrfs: qgroup: Update trace events for metadata reservation") CC: stable@vger.kernel.org # 4.19+ Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2019-11-25Btrfs: check for the full sync flag while holding the inode lock during fsyncFilipe Manana
commit ba0b084ac309283db6e329785c1dc4f45fdbd379 upstream. We were checking for the full fsync flag in the inode before locking the inode, which is racy, since at that that time it might not be set but after we acquire the inode lock some other task set it. One case where this can happen is on a system low on memory and some concurrent task failed to allocate an extent map and therefore set the full sync flag on the inode, to force the next fsync to work in full mode. A consequence of missing the full fsync flag set is hitting the problems fixed by commit 0c713cbab620 ("Btrfs: fix race between ranged fsync and writeback of adjacent ranges"), BUG_ON() when dropping extents from a log tree, hitting assertion failures at tree-log.c:copy_items() or all sorts of weird inconsistencies after replaying a log due to file extents items representing ranges that overlap. So just move the check such that it's done after locking the inode and before starting writeback again. Fixes: 0c713cbab620 ("Btrfs: fix race between ranged fsync and writeback of adjacent ranges") CC: stable@vger.kernel.org # 5.2+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2019-11-25Btrfs: fix qgroup double free after failure to reserve metadata for delallocFilipe Manana
commit c7967fc1499beb9b70bb9d33525fb0b384af8883 upstream. If we fail to reserve metadata for delalloc operations we end up releasing the previously reserved qgroup amount twice, once explicitly under the 'out_qgroup' label by calling btrfs_qgroup_free_meta_prealloc() and once again, under label 'out_fail', by calling btrfs_inode_rsv_release() with a value of 'true' for its 'qgroup_free' argument, which results in btrfs_qgroup_free_meta_prealloc() being called again, so we end up having a double free. Also if we fail to reserve the necessary qgroup amount, we jump to the label 'out_fail', which calls btrfs_inode_rsv_release() and that in turns calls btrfs_qgroup_free_meta_prealloc(), even though we weren't able to reserve any qgroup amount. So we freed some amount we never reserved. So fix this by removing the call to btrfs_inode_rsv_release() in the failure path, since it's not necessary at all as we haven't changed the inode's block reserve in any way at this point. Fixes: c8eaeac7b73434 ("btrfs: reserve delalloc metadata differently") CC: stable@vger.kernel.org # 5.2+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> [PG: adapt for different file name in older code base.] Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2019-11-25btrfs: don't needlessly create extent-refs kernel threadDavid Sterba
commit 80ed4548d0711d15ca51be5dee0ff813051cfc90 upstream. The patch 32b593bfcb58 ("Btrfs: remove no longer used function to run delayed refs asynchronously") removed the async delayed refs but the thread has been created, without any use. Remove it to avoid resource consumption. Fixes: 32b593bfcb58 ("Btrfs: remove no longer used function to run delayed refs asynchronously") CC: stable@vger.kernel.org # 5.2+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2019-11-25Btrfs: add missing extents release on file extent cluster relocation errorFilipe Manana
commit 44db1216efe37bf670f8d1019cdc41658d84baf5 upstream. If we error out when finding a page at relocate_file_extent_cluster(), we need to release the outstanding extents counter on the relocation inode, set by the previous call to btrfs_delalloc_reserve_metadata(), otherwise the inode's block reserve size can never decrease to zero and metadata space is leaked. Therefore add a call to btrfs_delalloc_release_extents() in case we can't find the target page. Fixes: 8b62f87bad9c ("Btrfs: rework outstanding_extents") CC: stable@vger.kernel.org # 4.19+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2019-11-25btrfs: block-group: Fix a memory leak due to missing btrfs_put_block_group()Qu Wenruo
commit 4b654acdae850f48b8250b9a578a4eaa518c7a6f upstream. In btrfs_read_block_groups(), if we have an invalid block group which has mixed type (DATA|METADATA) while the fs doesn't have MIXED_GROUPS feature, we error out without freeing the block group cache. This patch will add the missing btrfs_put_block_group() to prevent memory leak. Note for stable backports: the file to patch in versions <= 5.3 is fs/btrfs/extent-tree.c Fixes: 49303381f19a ("Btrfs: bail out if block group has different mixed flag") CC: stable@vger.kernel.org # 4.9+ Reviewed-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> [PG: adapt for different file name in older code base.] Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2019-11-25fs/dax: Fix pmd vs pte conflict detectionDan Williams
commit 6370740e5f8ef12de7f9a9bf48a0393d202cd827 upstream. Users reported a v5.3 performance regression and inability to establish huge page mappings. A revised version of the ndctl "dax.sh" huge page unit test identifies commit 23c84eb78375 "dax: Fix missed wakeup with PMD faults" as the source. Update get_unlocked_entry() to check for NULL entries before checking the entry order, otherwise NULL is misinterpreted as a present pte conflict. The 'order' check needs to happen before the locked check as an unlocked entry at the wrong order must fallback to lookup the correct order. Reported-by: Jeff Smits <jeff.smits@intel.com> Reported-by: Doug Nelson <doug.nelson@intel.com> Cc: <stable@vger.kernel.org> Fixes: 23c84eb78375 ("dax: Fix missed wakeup with PMD faults") Reviewed-by: Jan Kara <jack@suse.cz> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Link: https://lore.kernel.org/r/157167532455.3945484.11971474077040503994.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2019-11-25cifs: Fix missed free operationsChuhong Yuan
commit 783bf7b8b641167fb6f3f4f787f60ae62bad41b3 upstream. cifs_setattr_nounix has two paths which miss free operations for xid and fullpath. Use goto cifs_setattr_exit like other paths to fix them. CC: Stable <stable@vger.kernel.org> Fixes: aa081859b10c ("cifs: flush before set-info if we have writeable handles") Signed-off-by: Chuhong Yuan <hslester96@gmail.com> Signed-off-by: Steve French <stfrench@microsoft.com> Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2019-11-25CIFS: avoid using MID 0xFFFFRoberto Bergantinos Corpas
commit 03d9a9fe3f3aec508e485dd3dcfa1e99933b4bdb upstream. According to MS-CIFS specification MID 0xFFFF should not be used by the CIFS client, but we actually do. Besides, this has proven to cause races leading to oops between SendReceive2/cifs_demultiplex_thread. On SMB1, MID is a 2 byte value easy to reach in CurrentMid which may conflict with an oplock break notification request coming from server Signed-off-by: Roberto Bergantinos Corpas <rbergant@redhat.com> Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com> Reviewed-by: Aurelien Aptel <aaptel@suse.com> Signed-off-by: Steve French <stfrench@microsoft.com> CC: Stable <stable@vger.kernel.org> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2019-11-25fs/proc/page.c: don't access uninitialized memmaps in fs/proc/page.cDavid Hildenbrand
commit aad5f69bc161af489dbb5934868bd347282f0764 upstream. There are three places where we access uninitialized memmaps, namely: - /proc/kpagecount - /proc/kpageflags - /proc/kpagecgroup We have initialized memmaps either when the section is online or when the page was initialized to the ZONE_DEVICE. Uninitialized memmaps contain garbage and in the worst case trigger kernel BUGs, especially with CONFIG_PAGE_POISONING. For example, not onlining a DIMM during boot and calling /proc/kpagecount with CONFIG_PAGE_POISONING: :/# cat /proc/kpagecount > tmp.test BUG: unable to handle page fault for address: fffffffffffffffe #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 114616067 P4D 114616067 PUD 114618067 PMD 0 Oops: 0000 [#1] SMP NOPTI CPU: 0 PID: 469 Comm: cat Not tainted 5.4.0-rc1-next-20191004+ #11 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.4 RIP: 0010:kpagecount_read+0xce/0x1e0 Code: e8 09 83 e0 3f 48 0f a3 02 73 2d 4c 89 e7 48 c1 e7 06 48 03 3d ab 51 01 01 74 1d 48 8b 57 08 480 RSP: 0018:ffffa14e409b7e78 EFLAGS: 00010202 RAX: fffffffffffffffe RBX: 0000000000020000 RCX: 0000000000000000 RDX: 0000000000000001 RSI: 00007f76b5595000 RDI: fffff35645000000 RBP: 00007f76b5595000 R08: 0000000000000001 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000140000 R13: 0000000000020000 R14: 00007f76b5595000 R15: ffffa14e409b7f08 FS: 00007f76b577d580(0000) GS:ffff8f41bd400000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: fffffffffffffffe CR3: 0000000078960000 CR4: 00000000000006f0 Call Trace: proc_reg_read+0x3c/0x60 vfs_read+0xc5/0x180 ksys_read+0x68/0xe0 do_syscall_64+0x5c/0xa0 entry_SYSCALL_64_after_hwframe+0x49/0xbe For now, let's drop support for ZONE_DEVICE from the three pseudo files in order to fix this. To distinguish offline memory (with garbage memmap) from ZONE_DEVICE memory with properly initialized memmaps, we would have to check get_dev_pagemap() and pfn_zone_device_reserved() right now. The usage of both (especially, special casing devmem) is frowned upon and needs to be reworked. The fundamental issue we have is: if (pfn_to_online_page(pfn)) { /* memmap initialized */ } else if (pfn_valid(pfn)) { /* * ??? * a) offline memory. memmap garbage. * b) devmem: memmap initialized to ZONE_DEVICE. * c) devmem: reserved for driver. memmap garbage. * (d) devmem: memmap currently initializing - garbage) */ } We'll leave the pfn_zone_device_reserved() check in stable_page_flags() in place as that function is also used from memory failure. We now no longer dump information about pages that are not in use anymore - offline. Link: http://lkml.kernel.org/r/20191009142435.3975-2-david@redhat.com Fixes: f1dd2cd13c4b ("mm, memory_hotplug: do not associate hotadded memory to zones until online") [visible after d0dc12e86b319] Signed-off-by: David Hildenbrand <david@redhat.com> Reported-by: Qian Cai <cai@lca.pw> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Toshiki Fukasawa <t-fukasawa@vx.jp.nec.com> Cc: Pankaj gupta <pagupta@redhat.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Anthony Yznaga <anthony.yznaga@oracle.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> Cc: <stable@vger.kernel.org> [4.13+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2019-11-25io_uring: fix bad inflight accounting for SETUP_IOPOLL|SETUP_SQTHREADJens Axboe
commit 2b2ed9750fc9d040b9f6d076afcef6f00b6f1f7c upstream. We currently assume that submissions from the sqthread are successful, and if IO polling is enabled, we use that value for knowing how many completions to look for. But if we overflowed the CQ ring or some requests simply got errored and already completed, they won't be available for polling. For the case of IO polling and SQTHREAD usage, look at the pending poll list. If it ever hits empty then we know that we don't have anymore pollable requests inflight. For that case, simply reset the inflight count to zero. Reported-by: Pavel Begunkov <asml.silence@gmail.com> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2019-11-25ocfs2: fix panic due to ocfs2_wq is nullYi Li
commit b918c43021baaa3648de09e19a4a3dd555a45f40 upstream. mount.ocfs2 failed when reading ocfs2 filesystem superblock encounters an error. ocfs2_initialize_super() returns before allocating ocfs2_wq. ocfs2_dismount_volume() triggers the following panic. Oct 15 16:09:27 cnwarekv-205120 kernel: On-disk corruption discovered.Please run fsck.ocfs2 once the filesystem is unmounted. Oct 15 16:09:27 cnwarekv-205120 kernel: (mount.ocfs2,22804,44): ocfs2_read_locked_inode:537 ERROR: status = -30 Oct 15 16:09:27 cnwarekv-205120 kernel: (mount.ocfs2,22804,44): ocfs2_init_global_system_inodes:458 ERROR: status = -30 Oct 15 16:09:27 cnwarekv-205120 kernel: (mount.ocfs2,22804,44): ocfs2_init_global_system_inodes:491 ERROR: status = -30 Oct 15 16:09:27 cnwarekv-205120 kernel: (mount.ocfs2,22804,44): ocfs2_initialize_super:2313 ERROR: status = -30 Oct 15 16:09:27 cnwarekv-205120 kernel: (mount.ocfs2,22804,44): ocfs2_fill_super:1033 ERROR: status = -30 ------------[ cut here ]------------ Oops: 0002 [#1] SMP NOPTI CPU: 1 PID: 11753 Comm: mount.ocfs2 Tainted: G E 4.14.148-200.ckv.x86_64 #1 Hardware name: Sugon H320-G30/35N16-US, BIOS 0SSDX017 12/21/2018 task: ffff967af0520000 task.stack: ffffa5f05484000 RIP: 0010:mutex_lock+0x19/0x20 Call Trace: flush_workqueue+0x81/0x460 ocfs2_shutdown_local_alloc+0x47/0x440 [ocfs2] ocfs2_dismount_volume+0x84/0x400 [ocfs2] ocfs2_fill_super+0xa4/0x1270 [ocfs2] ? ocfs2_initialize_super.isa.211+0xf20/0xf20 [ocfs2] mount_bdev+0x17f/0x1c0 mount_fs+0x3a/0x160 Link: http://lkml.kernel.org/r/1571139611-24107-1-git-send-email-yili@winhong.com Signed-off-by: Yi Li <yilikernel@gmail.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Gang He <ghe@suse.com> Cc: Jun Piao <piaojun@huawei.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2019-11-25filldir[64]: remove WARN_ON_ONCE() for bad directory entriesLinus Torvalds
commit b9959c7a347d6adbb558fba7e36e9fef3cba3b07 upstream. This was always meant to be a temporary thing, just for testing and to see if it actually ever triggered. The only thing that reported it was syzbot doing disk image fuzzing, and then that warning is expected. So let's just remove it before -rc4, because the extra sanity testing should probably go to -stable, but we don't want the warning to do so. Reported-by: syzbot+3031f712c7ad5dd4d926@syzkaller.appspotmail.com Fixes: 8a23eb804ca4 ("Make filldir[64]() verify the directory entry filename is valid") Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2019-11-25uaccess: implement a proper unsafe_copy_to_user() and switch filldir over to itLinus Torvalds
commit c512c69187197fe08026cb5bbe7b9709f4f89b73 upstream. In commit 9f79b78ef744 ("Convert filldir[64]() from __put_user() to unsafe_put_user()") I made filldir() use unsafe_put_user(), which improves code generation on x86 enormously. But because we didn't have a "unsafe_copy_to_user()", the dirent name copy was also done by hand with unsafe_put_user() in a loop, and it turns out that a lot of other architectures didn't like that, because unlike x86, they have various alignment issues. Most non-x86 architectures trap and fix it up, and some (like xtensa) will just fail unaligned put_user() accesses unconditionally. Which makes that "copy using put_user() in a loop" not work for them at all. I could make that code do explicit alignment etc, but the architectures that don't like unaligned accesses also don't really use the fancy "user_access_begin/end()" model, so they might just use the regular old __copy_to_user() interface. So this commit takes that looping implementation, turns it into the x86 version of "unsafe_copy_to_user()", and makes other architectures implement the unsafe copy version as __copy_to_user() (the same way they do for the other unsafe_xyz() accessor functions). Note that it only does this for the copying _to_ user space, and we still don't have a unsafe version of copy_from_user(). That's partly because we have no current users of it, but also partly because the copy_from_user() case is slightly different and cannot efficiently be implemented in terms of a unsafe_get_user() loop (because gcc can't do asm goto with outputs). It would be trivial to do this using "rep movsb", which would work really nicely on newer x86 cores, but really badly on some older ones. Al Viro is looking at cleaning up all our user copy routines to make this all a non-issue, but for now we have this simple-but-stupid version for x86 that works fine for the dirent name copy case because those names are short strings and we simply don't need anything fancier. Fixes: 9f79b78ef744 ("Convert filldir[64]() from __put_user() to unsafe_put_user()") Reported-by: Guenter Roeck <linux@roeck-us.net> Reported-and-tested-by: Tony Luck <tony.luck@intel.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Max Filippov <jcmvbkbc@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2019-11-25Make filldir[64]() verify the directory entry filename is validLinus Torvalds
commit 8a23eb804ca4f2be909e372cf5a9e7b30ae476cd upstream. This has been discussed several times, and now filesystem people are talking about doing it individually at the filesystem layer, so head that off at the pass and just do it in getdents{64}(). This is partially based on a patch by Jann Horn, but checks for NUL bytes as well, and somewhat simplified. There's also commentary about how it might be better if invalid names due to filesystem corruption don't cause an immediate failure, but only an error at the end of the readdir(), so that people can still see the filenames that are ok. There's also been discussion about just how much POSIX strictly speaking requires this since it's about filesystem corruption. It's really more "protect user space from bad behavior" as pointed out by Jann. But since Eric Biederman looked up the POSIX wording, here it is for context: "From readdir: The readdir() function shall return a pointer to a structure representing the directory entry at the current position in the directory stream specified by the argument dirp, and position the directory stream at the next entry. It shall return a null pointer upon reaching the end of the directory stream. The structure dirent defined in the <dirent.h> header describes a directory entry. From definitions: 3.129 Directory Entry (or Link) An object that associates a filename with a file. Several directory entries can associate names with the same file. ... 3.169 Filename A name consisting of 1 to {NAME_MAX} bytes used to name a file. The characters composing the name may be selected from the set of all character values excluding the slash character and the null byte. The filenames dot and dot-dot have special meaning. A filename is sometimes referred to as a 'pathname component'." Note that I didn't bother adding the checks to any legacy interfaces that nobody uses. Also note that if this ends up being noticeable as a performance regression, we can fix that to do a much more optimized model that checks for both NUL and '/' at the same time one word at a time. We haven't really tended to optimize 'memchr()', and it only checks for one pattern at a time anyway, and we really _should_ check for NUL too (but see the comment about "soft errors" in the code about why it currently only checks for '/') See the CONFIG_DCACHE_WORD_ACCESS case of hash_name() for how the name lookup code looks for pathname terminating characters in parallel. Link: https://lore.kernel.org/lkml/20190118161440.220134-2-jannh@google.com/ Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Jann Horn <jannh@google.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2019-11-25elf: don't use MAP_FIXED_NOREPLACE for elf executable mappingsLinus Torvalds
commit b212921b13bda088a004328457c5c21458262fe2 upstream. In commit 4ed28639519c ("fs, elf: drop MAP_FIXED usage from elf_map") we changed elf to use MAP_FIXED_NOREPLACE instead of MAP_FIXED for the executable mappings. Then, people reported that it broke some binaries that had overlapping segments from the same file, and commit ad55eac74f20 ("elf: enforce MAP_FIXED on overlaying elf segments") re-instated MAP_FIXED for some overlaying elf segment cases. But only some - despite the summary line of that commit, it only did it when it also does a temporary brk vma for one obvious overlapping case. Now Russell King reports another overlapping case with old 32-bit x86 binaries, which doesn't trigger that limited case. End result: we had better just drop MAP_FIXED_NOREPLACE entirely, and go back to MAP_FIXED. Yes, it's a sign of old binaries generated with old tool-chains, but we do pride ourselves on not breaking existing setups. This still leaves MAP_FIXED_NOREPLACE in place for the load_elf_interp() and the old load_elf_library() use-cases, because nobody has reported breakage for those. Yet. Note that in all the cases seen so far, the overlapping elf sections seem to be just re-mapping of the same executable with different section attributes. We could possibly introduce a new MAP_FIXED_NOFILECHANGE flag or similar, which acts like NOREPLACE, but allows just remapping the same executable file using different protection flags. It's not clear that would make a huge difference to anything, but if people really hate that "elf remaps over previous maps" behavior, maybe at least a more limited form of remapping would alleviate some concerns. Alternatively, we should take a look at our elf_map() logic to see if we end up not mapping things properly the first time. In the meantime, this is the minimal "don't do that then" patch while people hopefully think about it more. Reported-by: Russell King <linux@armlinux.org.uk> Fixes: 4ed28639519c ("fs, elf: drop MAP_FIXED usage from elf_map") Fixes: ad55eac74f20 ("elf: enforce MAP_FIXED on overlaying elf segments") Cc: Michal Hocko <mhocko@suse.com> Cc: Kees Cook <keescook@chromium.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2019-11-25Convert filldir[64]() from __put_user() to unsafe_put_user()Linus Torvalds
commit 9f79b78ef74436c7507bac6bfb7b8b989263bccb upstream. We really should avoid the "__{get,put}_user()" functions entirely, because they can easily be mis-used and the original intent of being used for simple direct user accesses no longer holds in a post-SMAP/PAN world. Manually optimizing away the user access range check makes no sense any more, when the range check is generally much cheaper than the "enable user accesses" code that the __{get,put}_user() functions still need. So instead of __put_user(), use the unsafe_put_user() interface with user_access_{begin,end}() that really does generate better code these days, and which is generally a nicer interface. Under some loads, the multiple user writes that filldir() does are actually quite noticeable. This also makes the dirent name copy use unsafe_put_user() with a couple of macros. We do not want to make function calls with SMAP/PAN disabled, and the code this generates is quite good when the architecture uses "asm goto" for unsafe_put_user() like x86 does. Note that this doesn't bother with the legacy cases. Nobody should use them anyway, so performance doesn't really matter there. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2019-11-09CIFS: Fix use after free of file info structuresPavel Shilovsky
commit 1a67c415965752879e2e9fad407bc44fc7f25f23 upstream. Currently the code assumes that if a file info entry belongs to lists of open file handles of an inode and a tcon then it has non-zero reference. The recent changes broke that assumption when putting the last reference of the file info. There may be a situation when a file is being deleted but nothing prevents another thread to reference it again and start using it. This happens because we do not hold the inode list lock while checking the number of references of the file info structure. Fix this by doing the proper locking when doing the check. Fixes: 487317c99477d ("cifs: add spinlock for the openFileList to cifsInodeInfo") Fixes: cb248819d209d ("cifs: use cifsInodeInfo->open_file_lock while iterating to avoid a panic") Cc: Stable <stable@vger.kernel.org> Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2019-11-09io_uring: only flush workqueues on fileset removalJens Axboe
commit 8a99734081775c012a4a6c442fdef0379fe52bdf upstream. We should not remove the workqueue, we just need to ensure that the workqueues are synced. The workqueues are torn down on ctx removal. Cc: stable@vger.kernel.org Fixes: 6b06314c47e1 ("io_uring: add file set registration") Reported-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> [PG: use 5.3.x-stable version for this v5.2.x codebase.] Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2019-11-09Fix the locking in dcache_readdir() and friendsAl Viro
commit d4f4de5e5ef8efde85febb6876cd3c8ab1631999 upstream. There are two problems in dcache_readdir() - one is that lockless traversal of the list needs non-trivial cooperation of d_alloc() (at least a switch to list_add_rcu(), and probably more than just that) and another is that it assumes that no removal will happen without the directory locked exclusive. Said assumption had always been there, never had been stated explicitly and is violated by several places in the kernel (devpts and selinuxfs). * replacement of next_positive() with different calling conventions: it returns struct list_head * instead of struct dentry *; the latter is passed in and out by reference, grabbing the result and dropping the original value. * scan is under ->d_lock. If we run out of timeslice, cursor is moved after the last position we'd reached and we reschedule; then the scan continues from that place. To avoid livelocks between multiple lseek() (with cursors getting moved past each other, never reaching the real entries) we always skip the cursors, need_resched() or not. * returned list_head * is either ->d_child of dentry we'd found or ->d_subdirs of parent (if we got to the end of the list). * dcache_readdir() and dcache_dir_lseek() switched to new helper. dcache_readdir() always holds a reference to dentry passed to dir_emit() now. Cursor is moved to just before the entry where dir_emit() has failed or into the very end of the list, if we'd run out. * move_cursor() eliminated - it had sucky calling conventions and after fixing that it became simply list_move() (in lseek and scan_positives) or list_move_tail() (in readdir). All operations with the list are under ->d_lock now, and we do not depend upon having all file removals done with parent locked exclusive anymore. Cc: stable@vger.kernel.org Reported-by: "zhengbin (A)" <zhengbin13@huawei.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2019-11-09NFS: Fix O_DIRECT accounting of number of bytes read/writtenTrond Myklebust
commit 031d73ed768a40684f3ca21992265ffdb6a270bf upstream. When a series of O_DIRECT reads or writes are truncated, either due to eof or due to an error, then we should return the number of contiguous bytes that were received/sent starting at the offset specified by the application. Currently, we are failing to correctly check contiguity, and so we're failing the generic/465 in xfstests when the race between the read and write RPCs causes the file to get extended while the 2 reads are outstanding. If the first read RPC call wins the race and returns with eof set, we should treat the second read RPC as being truncated. Reported-by: Su Yanjun <suyj.fnst@cn.fujitsu.com> Fixes: 1ccbad9f9f9bd ("nfs: fix DIO good bytes calculation") Cc: stable@vger.kernel.org # 4.1+ Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2019-11-09btrfs: fix uninitialized ret in ref-verifyJosef Bacik
commit c5f4987e86f6692fdb12533ea1fc7a7bb98e555a upstream. Coverity caught a case where we could return with a uninitialized value in ret in process_leaf. This is actually pretty likely because we could very easily run into a block group item key and have a garbage value in ret and think there was an errror. Fix this by initializing ret to 0. Reported-by: Colin Ian King <colin.king@canonical.com> Fixes: fd708b81d972 ("Btrfs: add a extent ref verify tool") CC: stable@vger.kernel.org # 4.19+ Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2019-11-09btrfs: fix incorrect updating of log root treeJosef Bacik
commit 4203e968947071586a98b5314fd7ffdea3b4f971 upstream. We've historically had reports of being unable to mount file systems because the tree log root couldn't be read. Usually this is the "parent transid failure", but could be any of the related errors, including "fsid mismatch" or "bad tree block", depending on which block got allocated. The modification of the individual log root items are serialized on the per-log root root_mutex. This means that any modification to the per-subvol log root_item is completely protected. However we update the root item in the log root tree outside of the log root tree log_mutex. We do this in order to allow multiple subvolumes to be updated in each log transaction. This is problematic however because when we are writing the log root tree out we update the super block with the _current_ log root node information. Since these two operations happen independently of each other, you can end up updating the log root tree in between writing out the dirty blocks and setting the super block to point at the current root. This means we'll point at the new root node that hasn't been written out, instead of the one we should be pointing at. Thus whatever garbage or old block we end up pointing at complains when we mount the file system later and try to replay the log. Fix this by copying the log's root item into a local root item copy. Then once we're safely under the log_root_tree->log_mutex we update the root item in the log_root_tree. This way we do not modify the log_root_tree while we're committing it, fixing the problem. CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Chris Mason <clm@fb.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2019-11-09Btrfs: fix memory leak due to concurrent append writes with fiemapFilipe Manana
commit c67d970f0ea8dcc423e112137d34334fa0abb8ec upstream. When we have a buffered write that starts at an offset greater than or equals to the file's size happening concurrently with a full ranged fiemap, we can end up leaking an extent state structure. Suppose we have a file with a size of 1Mb, and before the buffered write and fiemap are performed, it has a single extent state in its io tree representing the range from 0 to 1Mb, with the EXTENT_DELALLOC bit set. The following sequence diagram shows how the memory leak happens if a fiemap a buffered write, starting at offset 1Mb and with a length of 4Kb, are performed concurrently. CPU 1 CPU 2 extent_fiemap() --> it's a full ranged fiemap range from 0 to LLONG_MAX - 1 (9223372036854775807) --> locks range in the inode's io tree --> after this we have 2 extent states in the io tree: --> 1 for range [0, 1Mb[ with the bits EXTENT_LOCKED and EXTENT_DELALLOC_BITS set --> 1 for the range [1Mb, LLONG_MAX[ with the EXTENT_LOCKED bit set --> start buffered write at offset 1Mb with a length of 4Kb btrfs_file_write_iter() btrfs_buffered_write() --> cached_state is NULL lock_and_cleanup_extent_if_need() --> returns 0 and does not lock range because it starts at current i_size / eof --> cached_state remains NULL btrfs_dirty_pages() btrfs_set_extent_delalloc() (...) __set_extent_bit() --> splits extent state for range [1Mb, LLONG_MAX[ and now we have 2 extent states: --> one for the range [1Mb, 1Mb + 4Kb[ with EXTENT_LOCKED set --> another one for the range [1Mb + 4Kb, LLONG_MAX[ with EXTENT_LOCKED set as well --> sets EXTENT_DELALLOC on the extent state for the range [1Mb, 1Mb + 4Kb[ --> caches extent state [1Mb, 1Mb + 4Kb[ into @cached_state because it has the bit EXTENT_LOCKED set --> btrfs_buffered_write() ends up with a non-NULL cached_state and never calls anything to release its reference on it, resulting in a memory leak Fix this by calling free_extent_state() on cached_state if the range was not locked by lock_and_cleanup_extent_if_need(). The same issue can happen if anything else other than fiemap locks a range that covers eof and beyond. This could be triggered, sporadically, by test case generic/561 from the fstests suite, which makes duperemove run concurrently with fsstress, and duperemove does plenty of calls to fiemap. When CONFIG_BTRFS_DEBUG is set the leak is reported in dmesg/syslog when removing the btrfs module with a message like the following: [77100.039461] BTRFS: state leak: start 6574080 end 6582271 state 16402 in tree 0 refs 1 Otherwise (CONFIG_BTRFS_DEBUG not set) detectable with kmemleak. CC: stable@vger.kernel.org # 4.16+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2019-11-09btrfs: fix balance convert to single on 32-bit host CPUsZygo Blaxell
commit 7a54789074a54f64addf5b49bf1994f478337a83 upstream. Currently, the command: btrfs balance start -dconvert=single,soft . on a Raspberry Pi produces the following kernel message: BTRFS error (device mmcblk0p2): balance: invalid convert data profile single This fails because we use is_power_of_2(unsigned long) to validate the new data profile, the constant for 'single' profile uses bit 48, and there are only 32 bits in a long on ARM. Fix by open-coding the check using u64 variables. Tested by completing the original balance command on several Raspberry Pis. Fixes: 818255feece6 ("btrfs: use common helper instead of open coding a bit test") CC: stable@vger.kernel.org # 4.20+ Signed-off-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2019-11-09btrfs: allocate new inode in NOFS contextJosef Bacik
commit 11a19a90870ea5496a8ded69b86f5b476b6d3355 upstream. A user reported a lockdep splat ====================================================== WARNING: possible circular locking dependency detected 5.2.11-gentoo #2 Not tainted ------------------------------------------------------ kswapd0/711 is trying to acquire lock: 000000007777a663 (sb_internal){.+.+}, at: start_transaction+0x3a8/0x500 but task is already holding lock: 000000000ba86300 (fs_reclaim){+.+.}, at: __fs_reclaim_acquire+0x0/0x30 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #1 (fs_reclaim){+.+.}: kmem_cache_alloc+0x1f/0x1c0 btrfs_alloc_inode+0x1f/0x260 alloc_inode+0x16/0xa0 new_inode+0xe/0xb0 btrfs_new_inode+0x70/0x610 btrfs_symlink+0xd0/0x420 vfs_symlink+0x9c/0x100 do_symlinkat+0x66/0xe0 do_syscall_64+0x55/0x1c0 entry_SYSCALL_64_after_hwframe+0x49/0xbe -> #0 (sb_internal){.+.+}: __sb_start_write+0xf6/0x150 start_transaction+0x3a8/0x500 btrfs_commit_inode_delayed_inode+0x59/0x110 btrfs_evict_inode+0x19e/0x4c0 evict+0xbc/0x1f0 inode_lru_isolate+0x113/0x190 __list_lru_walk_one.isra.4+0x5c/0x100 list_lru_walk_one+0x32/0x50 prune_icache_sb+0x36/0x80 super_cache_scan+0x14a/0x1d0 do_shrink_slab+0x131/0x320 shrink_node+0xf7/0x380 balance_pgdat+0x2d5/0x640 kswapd+0x2ba/0x5e0 kthread+0x147/0x160 ret_from_fork+0x24/0x30 other info that might help us debug this: Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(fs_reclaim); lock(sb_internal); lock(fs_reclaim); lock(sb_internal); *** DEADLOCK *** 3 locks held by kswapd0/711: #0: 000000000ba86300 (fs_reclaim){+.+.}, at: __fs_reclaim_acquire+0x0/0x30 #1: 000000004a5100f8 (shrinker_rwsem){++++}, at: shrink_node+0x9a/0x380 #2: 00000000f956fa46 (&type->s_umount_key#30){++++}, at: super_cache_scan+0x35/0x1d0 stack backtrace: CPU: 7 PID: 711 Comm: kswapd0 Not tainted 5.2.11-gentoo #2 Hardware name: Dell Inc. Precision Tower 3620/0MWYPT, BIOS 2.4.2 09/29/2017 Call Trace: dump_stack+0x85/0xc7 print_circular_bug.cold.40+0x1d9/0x235 __lock_acquire+0x18b1/0x1f00 lock_acquire+0xa6/0x170 ? start_transaction+0x3a8/0x500 __sb_start_write+0xf6/0x150 ? start_transaction+0x3a8/0x500 start_transaction+0x3a8/0x500 btrfs_commit_inode_delayed_inode+0x59/0x110 btrfs_evict_inode+0x19e/0x4c0 ? var_wake_function+0x20/0x20 evict+0xbc/0x1f0 inode_lru_isolate+0x113/0x190 ? discard_new_inode+0xc0/0xc0 __list_lru_walk_one.isra.4+0x5c/0x100 ? discard_new_inode+0xc0/0xc0 list_lru_walk_one+0x32/0x50 prune_icache_sb+0x36/0x80 super_cache_scan+0x14a/0x1d0 do_shrink_slab+0x131/0x320 shrink_node+0xf7/0x380 balance_pgdat+0x2d5/0x640 kswapd+0x2ba/0x5e0 ? __wake_up_common_lock+0x90/0x90 kthread+0x147/0x160 ? balance_pgdat+0x640/0x640 ? __kthread_create_on_node+0x160/0x160 ret_from_fork+0x24/0x30 This is because btrfs_new_inode() calls new_inode() under the transaction. We could probably move the new_inode() outside of this but for now just wrap it in memalloc_nofs_save(). Reported-by: Zdenek Sojka <zsojka@seznam.cz> Fixes: 712e36c5f2a7 ("btrfs: use GFP_KERNEL in btrfs_alloc_inode") CC: stable@vger.kernel.org # 4.16+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2019-11-09btrfs: relocation: fix use-after-free on dead relocation rootsQu Wenruo
commit 1fac4a54374f7ef385938f3c6cf7649c0fe4f6cd upstream. [BUG] One user reported a reproducible KASAN report about use-after-free: BTRFS info (device sdi1): balance: start -dvrange=1256811659264..1256811659265 BTRFS info (device sdi1): relocating block group 1256811659264 flags data|raid0 ================================================================== BUG: KASAN: use-after-free in btrfs_init_reloc_root+0x2cd/0x340 [btrfs] Write of size 8 at addr ffff88856f671710 by task kworker/u24:10/261579 CPU: 2 PID: 261579 Comm: kworker/u24:10 Tainted: P OE 5.2.11-arch1-1-kasan #4 Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./X99 Extreme4, BIOS P3.80 04/06/2018 Workqueue: btrfs-endio-write btrfs_endio_write_helper [btrfs] Call Trace: dump_stack+0x7b/0xba print_address_description+0x6c/0x22e ? btrfs_init_reloc_root+0x2cd/0x340 [btrfs] __kasan_report.cold+0x1b/0x3b ? btrfs_init_reloc_root+0x2cd/0x340 [btrfs] kasan_report+0x12/0x17 __asan_report_store8_noabort+0x17/0x20 btrfs_init_reloc_root+0x2cd/0x340 [btrfs] record_root_in_trans+0x2a0/0x370 [btrfs] btrfs_record_root_in_trans+0xf4/0x140 [btrfs] start_transaction+0x1ab/0xe90 [btrfs] btrfs_join_transaction+0x1d/0x20 [btrfs] btrfs_finish_ordered_io+0x7bf/0x18a0 [btrfs] ? lock_repin_lock+0x400/0x400 ? __kmem_cache_shutdown.cold+0x140/0x1ad ? btrfs_unlink_subvol+0x9b0/0x9b0 [btrfs] finish_ordered_fn+0x15/0x20 [btrfs] normal_work_helper+0x1bd/0xca0 [btrfs] ? process_one_work+0x819/0x1720 ? kasan_check_read+0x11/0x20 btrfs_endio_write_helper+0x12/0x20 [btrfs] process_one_work+0x8c9/0x1720 ? pwq_dec_nr_in_flight+0x2f0/0x2f0 ? worker_thread+0x1d9/0x1030 worker_thread+0x98/0x1030 kthread+0x2bb/0x3b0 ? process_one_work+0x1720/0x1720 ? kthread_park+0x120/0x120 ret_from_fork+0x35/0x40 Allocated by task 369692: __kasan_kmalloc.part.0+0x44/0xc0 __kasan_kmalloc.constprop.0+0xba/0xc0 kasan_kmalloc+0x9/0x10 kmem_cache_alloc_trace+0x138/0x260 btrfs_read_tree_root+0x92/0x360 [btrfs] btrfs_read_fs_root+0x10/0xb0 [btrfs] create_reloc_root+0x47d/0xa10 [btrfs] btrfs_init_reloc_root+0x1e2/0x340 [btrfs] record_root_in_trans+0x2a0/0x370 [btrfs] btrfs_record_root_in_trans+0xf4/0x140 [btrfs] start_transaction+0x1ab/0xe90 [btrfs] btrfs_start_transaction+0x1e/0x20 [btrfs] __btrfs_prealloc_file_range+0x1c2/0xa00 [btrfs] btrfs_prealloc_file_range+0x13/0x20 [btrfs] prealloc_file_extent_cluster+0x29f/0x570 [btrfs] relocate_file_extent_cluster+0x193/0xc30 [btrfs] relocate_data_extent+0x1f8/0x490 [btrfs] relocate_block_group+0x600/0x1060 [btrfs] btrfs_relocate_block_group+0x3a0/0xa00 [btrfs] btrfs_relocate_chunk+0x9e/0x180 [btrfs] btrfs_balance+0x14e4/0x2fc0 [btrfs] btrfs_ioctl_balance+0x47f/0x640 [btrfs] btrfs_ioctl+0x119d/0x8380 [btrfs] do_vfs_ioctl+0x9f5/0x1060 ksys_ioctl+0x67/0x90 __x64_sys_ioctl+0x73/0xb0 do_syscall_64+0xa5/0x370 entry_SYSCALL_64_after_hwframe+0x44/0xa9 Freed by task 369692: __kasan_slab_free+0x14f/0x210 kasan_slab_free+0xe/0x10 kfree+0xd8/0x270 btrfs_drop_snapshot+0x154c/0x1eb0 [btrfs] clean_dirty_subvols+0x227/0x340 [btrfs] relocate_block_group+0x972/0x1060 [btrfs] btrfs_relocate_block_group+0x3a0/0xa00 [btrfs] btrfs_relocate_chunk+0x9e/0x180 [btrfs] btrfs_balance+0x14e4/0x2fc0 [btrfs] btrfs_ioctl_balance+0x47f/0x640 [btrfs] btrfs_ioctl+0x119d/0x8380 [btrfs] do_vfs_ioctl+0x9f5/0x1060 ksys_ioctl+0x67/0x90 __x64_sys_ioctl+0x73/0xb0 do_syscall_64+0xa5/0x370 entry_SYSCALL_64_after_hwframe+0x44/0xa9 The buggy address belongs to the object at ffff88856f671100 which belongs to the cache kmalloc-4k of size 4096 The buggy address is located 1552 bytes inside of 4096-byte region [ffff88856f671100, ffff88856f672100) The buggy address belongs to the page: page:ffffea0015bd9c00 refcount:1 mapcount:0 mapping:ffff88864400e600 index:0x0 compound_mapcount: 0 flags: 0x2ffff0000010200(slab|head) raw: 02ffff0000010200 dead000000000100 dead000000000200 ffff88864400e600 raw: 0000000000000000 0000000000070007 00000001ffffffff 0000000000000000 page dumped because: kasan: bad access detected Memory state around the buggy address: ffff88856f671600: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ffff88856f671680: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb >ffff88856f671700: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ^ ffff88856f671780: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ffff88856f671800: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ================================================================== BTRFS info (device sdi1): 1 enospc errors during balance BTRFS info (device sdi1): balance: ended with status: -28 [CAUSE] The problem happens when finish_ordered_io() get called with balance still running, while the reloc root of that subvolume is already dead. (Tree is swap already done, but tree not yet deleted for possible qgroup usage.) That means root->reloc_root still exists, but that reloc_root can be under btrfs_drop_snapshot(), thus we shouldn't access it. The following race could cause the use-after-free problem: CPU1 | CPU2 -------------------------------------------------------------------------- | relocate_block_group() | |- unset_reloc_control(rc) | |- btrfs_commit_transaction() btrfs_finish_ordered_io() | |- clean_dirty_subvols() |- btrfs_join_transaction() | | |- record_root_in_trans() | | |- btrfs_init_reloc_root() | | |- if (root->reloc_root) | | | | |- root->reloc_root = NULL | | |- btrfs_drop_snapshot(reloc_root); |- reloc_root->last_trans| = trans->transid | ^^^^^^^^^^^^^^^^^^^^^^ Use after free [FIX] Fix it by the following modifications: - Test if the root has dead reloc tree before accessing root->reloc_root If the root has BTRFS_ROOT_DEAD_RELOC_TREE, then we don't need to create or update root->reloc_tree - Clear the BTRFS_ROOT_DEAD_RELOC_TREE flag until we have fully dropped reloc tree To co-operate with above modification, so as long as BTRFS_ROOT_DEAD_RELOC_TREE is still set, we won't try to re-create reloc tree at record_root_in_trans(). Reported-by: Cebtenzzre <cebtenzzre@gmail.com> Fixes: d2311e698578 ("btrfs: relocation: Delay reloc tree deletion after merge_reloc_roots") CC: stable@vger.kernel.org # 5.1+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2019-11-09cifs: use cifsInodeInfo->open_file_lock while iterating to avoid a panicDave Wysochanski
commit cb248819d209d113e45fed459773991518e8e80b upstream. Commit 487317c99477 ("cifs: add spinlock for the openFileList to cifsInodeInfo") added cifsInodeInfo->open_file_lock spin_lock to protect the openFileList, but missed a few places where cifs_inode->openFileList was enumerated. Change these remaining tcon->open_file_lock to cifsInodeInfo->open_file_lock to avoid panic in is_size_safe_to_change. [17313.245641] RIP: 0010:is_size_safe_to_change+0x57/0xb0 [cifs] [17313.245645] Code: 68 40 48 89 ef e8 19 67 b7 f1 48 8b 43 40 48 8d 4b 40 48 8d 50 f0 48 39 c1 75 0f eb 47 48 8b 42 10 48 8d 50 f0 48 39 c1 74 3a <8b> 80 88 00 00 00 83 c0 01 a8 02 74 e6 48 89 ef c6 07 00 0f 1f 40 [17313.245649] RSP: 0018:ffff94ae1baefa30 EFLAGS: 00010202 [17313.245654] RAX: dead000000000100 RBX: ffff88dc72243300 RCX: ffff88dc72243340 [17313.245657] RDX: dead0000000000f0 RSI: 00000000098f7940 RDI: ffff88dd3102f040 [17313.245659] RBP: ffff88dd3102f040 R08: 0000000000000000 R09: ffff94ae1baefc40 [17313.245661] R10: ffffcdc8bb1c4e80 R11: ffffcdc8b50adb08 R12: 00000000098f7940 [17313.245663] R13: ffff88dc72243300 R14: ffff88dbc8f19600 R15: ffff88dc72243428 [17313.245667] FS: 00007fb145485700(0000) GS:ffff88dd3e000000(0000) knlGS:0000000000000000 [17313.245670] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [17313.245672] CR2: 0000026bb46c6000 CR3: 0000004edb110003 CR4: 00000000007606e0 [17313.245753] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [17313.245756] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [17313.245759] PKRU: 55555554 [17313.245761] Call Trace: [17313.245803] cifs_fattr_to_inode+0x16b/0x580 [cifs] [17313.245838] cifs_get_inode_info+0x35c/0xa60 [cifs] [17313.245852] ? kmem_cache_alloc_trace+0x151/0x1d0 [17313.245885] cifs_open+0x38f/0x990 [cifs] [17313.245921] ? cifs_revalidate_dentry_attr+0x3e/0x350 [cifs] [17313.245953] ? cifsFileInfo_get+0x30/0x30 [cifs] [17313.245960] ? do_dentry_open+0x132/0x330 [17313.245963] do_dentry_open+0x132/0x330 [17313.245969] path_openat+0x573/0x14d0 [17313.245974] do_filp_open+0x93/0x100 [17313.245979] ? __check_object_size+0xa3/0x181 [17313.245986] ? audit_alloc_name+0x7e/0xd0 [17313.245992] do_sys_open+0x184/0x220 [17313.245999] do_syscall_64+0x5b/0x1b0 Fixes: 487317c99477 ("cifs: add spinlock for the openFileList to cifsInodeInfo") CC: Stable <stable@vger.kernel.org> Signed-off-by: Dave Wysochanski <dwysocha@redhat.com> Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2019-11-09CIFS: Force reval dentry if LOOKUP_REVAL flag is setPavel Shilovsky
commit 0b3d0ef9840f7be202393ca9116b857f6f793715 upstream. Mark inode for force revalidation if LOOKUP_REVAL flag is set. This tells the client to actually send a QueryInfo request to the server to obtain the latest metadata in case a directory or a file were changed remotely. Only do that if the client doesn't have a lease for the file to avoid unneeded round trips to the server. Cc: <stable@vger.kernel.org> Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2019-11-09CIFS: Force revalidate inode when dentry is stalePavel Shilovsky
commit c82e5ac7fe3570a269c0929bf7899f62048e7dbc upstream. Currently the client indicates that a dentry is stale when inode numbers or type types between a local inode and a remote file don't match. If this is the case attributes is not being copied from remote to local, so, it is already known that the local copy has stale metadata. That's why the inode needs to be marked for revalidation in order to tell the VFS to lookup the dentry again before openning a file. This prevents unexpected stale errors to be returned to the user space when openning a file. Cc: <stable@vger.kernel.org> Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2019-11-09CIFS: Gracefully handle QueryInfo errors during openPavel Shilovsky
commit 30573a82fb179420b8aac30a3a3595aa96a93156 upstream. Currently if the client identifies problems when processing metadata returned in CREATE response, the open handle is being leaked. This causes multiple problems like a file missing a lease break by that client which causes high latencies to other clients accessing the file. Another side-effect of this is that the file can't be deleted. Fix this by closing the file after the client hits an error after the file was opened and the open descriptor wasn't returned to the user space. Also convert -ESTALE to -EOPENSTALE to allow the VFS to revalidate a dentry and retry the open. Cc: <stable@vger.kernel.org> Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2019-11-09vfs: Fix EOVERFLOW testing in put_compat_statfs64Eric Sandeen
commit cc3a7bfe62b947b423fcb2cfe89fcba92bf48fa3 upstream. Today, put_compat_statfs64() disallows nearly any field value over 2^32 if f_bsize is only 32 bits, but that makes no sense. compat_statfs64 is there for the explicit purpose of providing 64-bit fields for f_files, f_ffree, etc. And f_bsize is always only 32 bits. As a result, 32-bit userspace gets -EOVERFLOW for i.e. large file counts even with -D_FILE_OFFSET_BITS=64 set. In reality, only f_bsize and f_frsize can legitimately overflow (fields like f_type and f_namelen should never be large), so test only those fields. This bug was discussed at length some time ago, and this is the proposal Al suggested at https://lkml.org/lkml/2018/8/6/640. It seemed to get dropped amid the discussion of other related changes, but this part seems obviously correct on its own, so I've picked it up and sent it, for expediency. Fixes: 64d2ab32efe3 ("vfs: fix put_compat_statfs64() does not handle errors") Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2019-11-09Btrfs: fix selftests failure due to uninitialized i_mode in test inodesFilipe Manana
commit 9f7fec0ba89108b9385f1b9fb167861224912a4a upstream. Some of the self tests create a test inode, setup some extents and then do calls to btrfs_get_extent() to test that the corresponding extent maps exist and are correct. However btrfs_get_extent(), since the 5.2 merge window, now errors out when it finds a regular or prealloc extent for an inode that does not correspond to a regular file (its ->i_mode is not S_IFREG). This causes the self tests to fail sometimes, specially when KASAN, slub_debug and page poisoning are enabled: $ modprobe btrfs modprobe: ERROR: could not insert 'btrfs': Invalid argument $ dmesg [ 9414.691648] Btrfs loaded, crc32c=crc32c-intel, debug=on, assert=on, integrity-checker=on, ref-verify=on [ 9414.692655] BTRFS: selftest: sectorsize: 4096 nodesize: 4096 [ 9414.692658] BTRFS: selftest: running btrfs free space cache tests [ 9414.692918] BTRFS: selftest: running extent only tests [ 9414.693061] BTRFS: selftest: running bitmap only tests [ 9414.693366] BTRFS: selftest: running bitmap and extent tests [ 9414.696455] BTRFS: selftest: running space stealing from bitmap to extent tests [ 9414.697131] BTRFS: selftest: running extent buffer operation tests [ 9414.697133] BTRFS: selftest: running btrfs_split_item tests [ 9414.697564] BTRFS: selftest: running extent I/O tests [ 9414.697583] BTRFS: selftest: running find delalloc tests [ 9415.081125] BTRFS: selftest: running find_first_clear_extent_bit test [ 9415.081278] BTRFS: selftest: running extent buffer bitmap tests [ 9415.124192] BTRFS: selftest: running inode tests [ 9415.124195] BTRFS: selftest: running btrfs_get_extent tests [ 9415.127909] BTRFS: selftest: running hole first btrfs_get_extent test [ 9415.128343] BTRFS critical (device (efault)): regular/prealloc extent found for non-regular inode 256 [ 9415.131428] BTRFS: selftest: fs/btrfs/tests/inode-tests.c:904 expected a real extent, got 0 This happens because the test inodes are created without ever initializing the i_mode field of the inode, and neither VFS's new_inode() nor the btrfs callback btrfs_alloc_inode() initialize the i_mode. Initialization of the i_mode is done through the various callbacks used by the VFS to create new inodes (regular files, directories, symlinks, tmpfiles, etc), which all call btrfs_new_inode() which in turn calls inode_init_owner(), which sets the inode's i_mode. Since the tests only uses new_inode() to create the test inodes, the i_mode was never initialized. This always happens on a VM I used with kasan, slub_debug and many other debug facilities enabled. It also happened to someone who reported this on bugzilla (on a 5.3-rc). Fix this by setting i_mode to S_IFREG at btrfs_new_test_inode(). Fixes: 6bf9e4bd6a2778 ("btrfs: inode: Verify inode mode to avoid NULL pointer dereference") Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=204397 Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2019-11-09fuse: fix memleak in cuse_channel_openzhengbin
commit 9ad09b1976c562061636ff1e01bfc3a57aebe56b upstream. If cuse_send_init fails, need to fuse_conn_put cc->fc. cuse_channel_open->fuse_conn_init->refcount_set(&fc->count, 1) ->fuse_dev_alloc->fuse_conn_get ->fuse_dev_free->fuse_conn_put Fixes: cc080e9e9be1 ("fuse: introduce per-instance fuse_dev structure") Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: zhengbin <zhengbin13@huawei.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2019-11-09pNFS: Ensure we do clear the return-on-close layout stateid on fatal errorsTrond Myklebust
commit 9c47b18cf722184f32148784189fca945a7d0561 upstream. IF the server rejected our layout return with a state error such as NFS4ERR_BAD_STATEID, or even a stale inode error, then we do want to clear out all the remaining layout segments and mark that stateid as invalid. Fixes: 1c5bd76d17cca ("pNFS: Enable layoutreturn operation for...") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2019-11-09ceph: reconnect connection if session hang in opening stateErqi Chen
commit 71a228bc8d65900179e37ac309e678f8c523f133 upstream. If client mds session is evicted in CEPH_MDS_SESSION_OPENING state, mds won't send session msg to client, and delayed_work skip CEPH_MDS_SESSION_OPENING state session, the session hang forever. Allow ceph_con_keepalive to reconnect a session in OPENING to avoid session hang. Also, ensure that we skip sessions in RESTARTING and REJECTED states since those states can't be resurrected by issuing a keepalive. Link: https://tracker.ceph.com/issues/41551 Signed-off-by: Erqi Chen chenerqi@gmail.com Reviewed-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2019-11-09ceph: fetch cap_gen under spinlock in ceph_add_capJeff Layton
commit 606d102327a45a49d293557527802ee7fbfd7af1 upstream. It's protected by the s_gen_ttl_lock, so we should fetch under it and ensure that we're using the same generation in both places. Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>