summaryrefslogtreecommitdiffstats
path: root/fs
AgeCommit message (Collapse)Author
2020-12-02efivarfs: revert "fix memory leak in efivarfs_create()"Ard Biesheuvel
[ Upstream commit ff04f3b6f2e27f8ae28a498416af2a8dd5072b43 ] The memory leak addressed by commit fe5186cf12e3 is a false positive: all allocations are recorded in a linked list, and freed when the filesystem is unmounted. This leads to double frees, and as reported by David, leads to crashes if SLUB is configured to self destruct when double frees occur. So drop the redundant kfree() again, and instead, mark the offending pointer variable so the allocation is ignored by kmemleak. Cc: Vamshi K Sthambamkadi <vamshi.k.sthambamkadi@gmail.com> Fixes: fe5186cf12e3 ("efivarfs: fix memory leak in efivarfs_create()") Reported-by: David Laight <David.Laight@aculab.com> Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-12-02proc: don't allow async path resolution of /proc/self componentsJens Axboe
[ Upstream commit 8d4c3e76e3be11a64df95ddee52e99092d42fc19 ] If this is attempted by a kthread, then return -EOPNOTSUPP as we don't currently support that. Once we can get task_pid_ptr() doing the right thing, then this can go away again. Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-12-02btrfs: fix lockdep splat when reading qgroup config on mountFilipe Manana
commit 3d05cad3c357a2b749912914356072b38435edfa upstream. Lockdep reported the following splat when running test btrfs/190 from fstests: [ 9482.126098] ====================================================== [ 9482.126184] WARNING: possible circular locking dependency detected [ 9482.126281] 5.10.0-rc4-btrfs-next-73 #1 Not tainted [ 9482.126365] ------------------------------------------------------ [ 9482.126456] mount/24187 is trying to acquire lock: [ 9482.126534] ffffa0c869a7dac0 (&fs_info->qgroup_rescan_lock){+.+.}-{3:3}, at: qgroup_rescan_init+0x43/0xf0 [btrfs] [ 9482.126647] but task is already holding lock: [ 9482.126777] ffffa0c892ebd3a0 (btrfs-quota-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x27/0x120 [btrfs] [ 9482.126886] which lock already depends on the new lock. [ 9482.127078] the existing dependency chain (in reverse order) is: [ 9482.127213] -> #1 (btrfs-quota-00){++++}-{3:3}: [ 9482.127366] lock_acquire+0xd8/0x490 [ 9482.127436] down_read_nested+0x45/0x220 [ 9482.127528] __btrfs_tree_read_lock+0x27/0x120 [btrfs] [ 9482.127613] btrfs_read_lock_root_node+0x41/0x130 [btrfs] [ 9482.127702] btrfs_search_slot+0x514/0xc30 [btrfs] [ 9482.127788] update_qgroup_status_item+0x72/0x140 [btrfs] [ 9482.127877] btrfs_qgroup_rescan_worker+0xde/0x680 [btrfs] [ 9482.127964] btrfs_work_helper+0xf1/0x600 [btrfs] [ 9482.128039] process_one_work+0x24e/0x5e0 [ 9482.128110] worker_thread+0x50/0x3b0 [ 9482.128181] kthread+0x153/0x170 [ 9482.128256] ret_from_fork+0x22/0x30 [ 9482.128327] -> #0 (&fs_info->qgroup_rescan_lock){+.+.}-{3:3}: [ 9482.128464] check_prev_add+0x91/0xc60 [ 9482.128551] __lock_acquire+0x1740/0x3110 [ 9482.128623] lock_acquire+0xd8/0x490 [ 9482.130029] __mutex_lock+0xa3/0xb30 [ 9482.130590] qgroup_rescan_init+0x43/0xf0 [btrfs] [ 9482.131577] btrfs_read_qgroup_config+0x43a/0x550 [btrfs] [ 9482.132175] open_ctree+0x1228/0x18a0 [btrfs] [ 9482.132756] btrfs_mount_root.cold+0x13/0xed [btrfs] [ 9482.133325] legacy_get_tree+0x30/0x60 [ 9482.133866] vfs_get_tree+0x28/0xe0 [ 9482.134392] fc_mount+0xe/0x40 [ 9482.134908] vfs_kern_mount.part.0+0x71/0x90 [ 9482.135428] btrfs_mount+0x13b/0x3e0 [btrfs] [ 9482.135942] legacy_get_tree+0x30/0x60 [ 9482.136444] vfs_get_tree+0x28/0xe0 [ 9482.136949] path_mount+0x2d7/0xa70 [ 9482.137438] do_mount+0x75/0x90 [ 9482.137923] __x64_sys_mount+0x8e/0xd0 [ 9482.138400] do_syscall_64+0x33/0x80 [ 9482.138873] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [ 9482.139346] other info that might help us debug this: [ 9482.140735] Possible unsafe locking scenario: [ 9482.141594] CPU0 CPU1 [ 9482.142011] ---- ---- [ 9482.142411] lock(btrfs-quota-00); [ 9482.142806] lock(&fs_info->qgroup_rescan_lock); [ 9482.143216] lock(btrfs-quota-00); [ 9482.143629] lock(&fs_info->qgroup_rescan_lock); [ 9482.144056] *** DEADLOCK *** [ 9482.145242] 2 locks held by mount/24187: [ 9482.145637] #0: ffffa0c8411c40e8 (&type->s_umount_key#44/1){+.+.}-{3:3}, at: alloc_super+0xb9/0x400 [ 9482.146061] #1: ffffa0c892ebd3a0 (btrfs-quota-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x27/0x120 [btrfs] [ 9482.146509] stack backtrace: [ 9482.147350] CPU: 1 PID: 24187 Comm: mount Not tainted 5.10.0-rc4-btrfs-next-73 #1 [ 9482.147788] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 [ 9482.148709] Call Trace: [ 9482.149169] dump_stack+0x8d/0xb5 [ 9482.149628] check_noncircular+0xff/0x110 [ 9482.150090] check_prev_add+0x91/0xc60 [ 9482.150561] ? kvm_clock_read+0x14/0x30 [ 9482.151017] ? kvm_sched_clock_read+0x5/0x10 [ 9482.151470] __lock_acquire+0x1740/0x3110 [ 9482.151941] ? __btrfs_tree_read_lock+0x27/0x120 [btrfs] [ 9482.152402] lock_acquire+0xd8/0x490 [ 9482.152887] ? qgroup_rescan_init+0x43/0xf0 [btrfs] [ 9482.153354] __mutex_lock+0xa3/0xb30 [ 9482.153826] ? qgroup_rescan_init+0x43/0xf0 [btrfs] [ 9482.154301] ? qgroup_rescan_init+0x43/0xf0 [btrfs] [ 9482.154768] ? qgroup_rescan_init+0x43/0xf0 [btrfs] [ 9482.155226] qgroup_rescan_init+0x43/0xf0 [btrfs] [ 9482.155690] btrfs_read_qgroup_config+0x43a/0x550 [btrfs] [ 9482.156160] open_ctree+0x1228/0x18a0 [btrfs] [ 9482.156643] btrfs_mount_root.cold+0x13/0xed [btrfs] [ 9482.157108] ? rcu_read_lock_sched_held+0x5d/0x90 [ 9482.157567] ? kfree+0x31f/0x3e0 [ 9482.158030] legacy_get_tree+0x30/0x60 [ 9482.158489] vfs_get_tree+0x28/0xe0 [ 9482.158947] fc_mount+0xe/0x40 [ 9482.159403] vfs_kern_mount.part.0+0x71/0x90 [ 9482.159875] btrfs_mount+0x13b/0x3e0 [btrfs] [ 9482.160335] ? rcu_read_lock_sched_held+0x5d/0x90 [ 9482.160805] ? kfree+0x31f/0x3e0 [ 9482.161260] ? legacy_get_tree+0x30/0x60 [ 9482.161714] legacy_get_tree+0x30/0x60 [ 9482.162166] vfs_get_tree+0x28/0xe0 [ 9482.162616] path_mount+0x2d7/0xa70 [ 9482.163070] do_mount+0x75/0x90 [ 9482.163525] __x64_sys_mount+0x8e/0xd0 [ 9482.163986] do_syscall_64+0x33/0x80 [ 9482.164437] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [ 9482.164902] RIP: 0033:0x7f51e907caaa This happens because at btrfs_read_qgroup_config() we can call qgroup_rescan_init() while holding a read lock on a quota btree leaf, acquired by the previous call to btrfs_search_slot_for_read(), and qgroup_rescan_init() acquires the mutex qgroup_rescan_lock. A qgroup rescan worker does the opposite: it acquires the mutex qgroup_rescan_lock, at btrfs_qgroup_rescan_worker(), and then tries to update the qgroup status item in the quota btree through the call to update_qgroup_status_item(). This inversion of locking order between the qgroup_rescan_lock mutex and quota btree locks causes the splat. Fix this simply by releasing and freeing the path before calling qgroup_rescan_init() at btrfs_read_qgroup_config(). CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-12-02btrfs: don't access possibly stale fs_info data for printing duplicate deviceJohannes Thumshirn
commit 0697d9a610998b8bdee6b2390836cb2391d8fd1a upstream. Syzbot reported a possible use-after-free when printing a duplicate device warning device_list_add(). At this point it can happen that a btrfs_device::fs_info is not correctly setup yet, so we're accessing stale data, when printing the warning message using the btrfs_printk() wrappers. ================================================================== BUG: KASAN: use-after-free in btrfs_printk+0x3eb/0x435 fs/btrfs/super.c:245 Read of size 8 at addr ffff8880878e06a8 by task syz-executor225/7068 CPU: 1 PID: 7068 Comm: syz-executor225 Not tainted 5.9.0-rc5-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x1d6/0x29e lib/dump_stack.c:118 print_address_description+0x66/0x620 mm/kasan/report.c:383 __kasan_report mm/kasan/report.c:513 [inline] kasan_report+0x132/0x1d0 mm/kasan/report.c:530 btrfs_printk+0x3eb/0x435 fs/btrfs/super.c:245 device_list_add+0x1a88/0x1d60 fs/btrfs/volumes.c:943 btrfs_scan_one_device+0x196/0x490 fs/btrfs/volumes.c:1359 btrfs_mount_root+0x48f/0xb60 fs/btrfs/super.c:1634 legacy_get_tree+0xea/0x180 fs/fs_context.c:592 vfs_get_tree+0x88/0x270 fs/super.c:1547 fc_mount fs/namespace.c:978 [inline] vfs_kern_mount+0xc9/0x160 fs/namespace.c:1008 btrfs_mount+0x33c/0xae0 fs/btrfs/super.c:1732 legacy_get_tree+0xea/0x180 fs/fs_context.c:592 vfs_get_tree+0x88/0x270 fs/super.c:1547 do_new_mount fs/namespace.c:2875 [inline] path_mount+0x179d/0x29e0 fs/namespace.c:3192 do_mount fs/namespace.c:3205 [inline] __do_sys_mount fs/namespace.c:3413 [inline] __se_sys_mount+0x126/0x180 fs/namespace.c:3390 do_syscall_64+0x31/0x70 arch/x86/entry/common.c:46 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x44840a RSP: 002b:00007ffedfffd608 EFLAGS: 00000293 ORIG_RAX: 00000000000000a5 RAX: ffffffffffffffda RBX: 00007ffedfffd670 RCX: 000000000044840a RDX: 0000000020000000 RSI: 0000000020000100 RDI: 00007ffedfffd630 RBP: 00007ffedfffd630 R08: 00007ffedfffd670 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000293 R12: 000000000000001a R13: 0000000000000004 R14: 0000000000000003 R15: 0000000000000003 Allocated by task 6945: kasan_save_stack mm/kasan/common.c:48 [inline] kasan_set_track mm/kasan/common.c:56 [inline] __kasan_kmalloc+0x100/0x130 mm/kasan/common.c:461 kmalloc_node include/linux/slab.h:577 [inline] kvmalloc_node+0x81/0x110 mm/util.c:574 kvmalloc include/linux/mm.h:757 [inline] kvzalloc include/linux/mm.h:765 [inline] btrfs_mount_root+0xd0/0xb60 fs/btrfs/super.c:1613 legacy_get_tree+0xea/0x180 fs/fs_context.c:592 vfs_get_tree+0x88/0x270 fs/super.c:1547 fc_mount fs/namespace.c:978 [inline] vfs_kern_mount+0xc9/0x160 fs/namespace.c:1008 btrfs_mount+0x33c/0xae0 fs/btrfs/super.c:1732 legacy_get_tree+0xea/0x180 fs/fs_context.c:592 vfs_get_tree+0x88/0x270 fs/super.c:1547 do_new_mount fs/namespace.c:2875 [inline] path_mount+0x179d/0x29e0 fs/namespace.c:3192 do_mount fs/namespace.c:3205 [inline] __do_sys_mount fs/namespace.c:3413 [inline] __se_sys_mount+0x126/0x180 fs/namespace.c:3390 do_syscall_64+0x31/0x70 arch/x86/entry/common.c:46 entry_SYSCALL_64_after_hwframe+0x44/0xa9 Freed by task 6945: kasan_save_stack mm/kasan/common.c:48 [inline] kasan_set_track+0x3d/0x70 mm/kasan/common.c:56 kasan_set_free_info+0x17/0x30 mm/kasan/generic.c:355 __kasan_slab_free+0xdd/0x110 mm/kasan/common.c:422 __cache_free mm/slab.c:3418 [inline] kfree+0x113/0x200 mm/slab.c:3756 deactivate_locked_super+0xa7/0xf0 fs/super.c:335 btrfs_mount_root+0x72b/0xb60 fs/btrfs/super.c:1678 legacy_get_tree+0xea/0x180 fs/fs_context.c:592 vfs_get_tree+0x88/0x270 fs/super.c:1547 fc_mount fs/namespace.c:978 [inline] vfs_kern_mount+0xc9/0x160 fs/namespace.c:1008 btrfs_mount+0x33c/0xae0 fs/btrfs/super.c:1732 legacy_get_tree+0xea/0x180 fs/fs_context.c:592 vfs_get_tree+0x88/0x270 fs/super.c:1547 do_new_mount fs/namespace.c:2875 [inline] path_mount+0x179d/0x29e0 fs/namespace.c:3192 do_mount fs/namespace.c:3205 [inline] __do_sys_mount fs/namespace.c:3413 [inline] __se_sys_mount+0x126/0x180 fs/namespace.c:3390 do_syscall_64+0x31/0x70 arch/x86/entry/common.c:46 entry_SYSCALL_64_after_hwframe+0x44/0xa9 The buggy address belongs to the object at ffff8880878e0000 which belongs to the cache kmalloc-16k of size 16384 The buggy address is located 1704 bytes inside of 16384-byte region [ffff8880878e0000, ffff8880878e4000) The buggy address belongs to the page: page:0000000060704f30 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x878e0 head:0000000060704f30 order:3 compound_mapcount:0 compound_pincount:0 flags: 0xfffe0000010200(slab|head) raw: 00fffe0000010200 ffffea00028e9a08 ffffea00021e3608 ffff8880aa440b00 raw: 0000000000000000 ffff8880878e0000 0000000100000001 0000000000000000 page dumped because: kasan: bad access detected Memory state around the buggy address: ffff8880878e0580: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ffff8880878e0600: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb >ffff8880878e0680: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ^ ffff8880878e0700: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ffff8880878e0780: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ================================================================== The syzkaller reproducer for this use-after-free crafts a filesystem image and loop mounts it twice in a loop. The mount will fail as the crafted image has an invalid chunk tree. When this happens btrfs_mount_root() will call deactivate_locked_super(), which then cleans up fs_info and fs_info::sb. If a second thread now adds the same block-device to the filesystem, it will get detected as a duplicate device and device_list_add() will reject the duplicate and print a warning. But as the fs_info pointer passed in is non-NULL this will result in a use-after-free. Instead of printing possibly uninitialized or already freed memory in btrfs_printk(), explicitly pass in a NULL fs_info so the printing of the device name will be skipped altogether. There was a slightly different approach discussed in https://lore.kernel.org/linux-btrfs/20200114060920.4527-1-anand.jain@oracle.com/t/#u Link: https://lore.kernel.org/linux-btrfs/000000000000c9e14b05afcc41ba@google.com Reported-by: syzbot+582e66e5edf36a22c7b0@syzkaller.appspotmail.com CC: stable@vger.kernel.org # 4.19+ Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-11-24ext4: fix bogus warning in ext4_update_dx_flag()Jan Kara
commit f902b216501094495ff75834035656e8119c537f upstream. The idea of the warning in ext4_update_dx_flag() is that we should warn when we are clearing EXT4_INODE_INDEX on a filesystem with metadata checksums enabled since after clearing the flag, checksums for internal htree nodes will become invalid. So there's no need to warn (or actually do anything) when EXT4_INODE_INDEX is not set. Link: https://lore.kernel.org/r/20201118153032.17281-1-jack@suse.cz Fixes: 48a34311953d ("ext4: fix checksum errors with indexed dirs") Reported-by: Eric Biggers <ebiggers@kernel.org> Reviewed-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-11-24efivarfs: fix memory leak in efivarfs_create()Vamshi K Sthambamkadi
commit fe5186cf12e30facfe261e9be6c7904a170bd822 upstream. kmemleak report: unreferenced object 0xffff9b8915fcb000 (size 4096): comm "efivarfs.sh", pid 2360, jiffies 4294920096 (age 48.264s) hex dump (first 32 bytes): 2d 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 -............... 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ backtrace: [<00000000cc4d897c>] kmem_cache_alloc_trace+0x155/0x4b0 [<000000007d1dfa72>] efivarfs_create+0x6e/0x1a0 [<00000000e6ee18fc>] path_openat+0xe4b/0x1120 [<000000000ad0414f>] do_filp_open+0x91/0x100 [<00000000ce93a198>] do_sys_openat2+0x20c/0x2d0 [<000000002a91be6d>] do_sys_open+0x46/0x80 [<000000000a854999>] __x64_sys_openat+0x20/0x30 [<00000000c50d89c9>] do_syscall_64+0x38/0x90 [<00000000cecd6b5f>] entry_SYSCALL_64_after_hwframe+0x44/0xa9 In efivarfs_create(), inode->i_private is setup with efivar_entry object which is never freed. Cc: <stable@vger.kernel.org> Signed-off-by: Vamshi K Sthambamkadi <vamshi.k.sthambamkadi@gmail.com> Link: https://lore.kernel.org/r/20201023115429.GA2479@cosmos Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-11-24libfs: fix error cast of negative value in simple_attr_write()Yicong Yang
[ Upstream commit 488dac0c9237647e9b8f788b6a342595bfa40bda ] The attr->set() receive a value of u64, but simple_strtoll() is used for doing the conversion. It will lead to the error cast if user inputs a negative value. Use kstrtoull() instead of simple_strtoll() to convert a string got from the user to an unsigned value. The former will return '-EINVAL' if it gets a negetive value, but the latter can't handle the situation correctly. Make 'val' unsigned long long as what kstrtoull() takes, this will eliminate the compile warning on no 64-bit architectures. Fixes: f7b88631a897 ("fs/libfs.c: fix simple_attr_write() on 32bit machines") Signed-off-by: Yicong Yang <yangyicong@hisilicon.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Link: https://lkml.kernel.org/r/1605341356-11872-1-git-send-email-yangyicong@hisilicon.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-11-24xfs: revert "xfs: fix rmap key and record comparison functions"Darrick J. Wong
[ Upstream commit eb8409071a1d47e3593cfe077107ac46853182ab ] This reverts commit 6ff646b2ceb0eec916101877f38da0b73e3a5b7f. Your maintainer committed a major braino in the rmap code by adding the attr fork, bmbt, and unwritten extent usage bits into rmap record key comparisons. While XFS uses the usage bits *in the rmap records* for cross-referencing metadata in xfs_scrub and xfs_repair, it only needs the owner and offset information to distinguish between reverse mappings of the same physical extent into the data fork of a file at multiple offsets. The other bits are not important for key comparisons for index lookups, and never have been. Eric Sandeen reports that this causes regressions in generic/299, so undo this patch before it does more damage. Reported-by: Eric Sandeen <sandeen@sandeen.net> Fixes: 6ff646b2ceb0 ("xfs: fix rmap key and record comparison functions") Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-11-24xfs: strengthen rmap record flags checkingDarrick J. Wong
[ Upstream commit 498fe261f0d6d5189f8e11d283705dd97b474b54 ] We always know the correct state of the rmap record flags (attr, bmbt, unwritten) so check them by direct comparison. Fixes: d852657ccfc0 ("xfs: cross-reference reverse-mapping btree") Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-11-24xfs: fix the minrecs logic when dealing with inode root child blocksDarrick J. Wong
[ Upstream commit e95b6c3ef1311dd7b20467d932a24b6d0fd88395 ] The comment and logic in xchk_btree_check_minrecs for dealing with inode-rooted btrees isn't quite correct. While the direct children of the inode root are allowed to have fewer records than what would normally be allowed for a regular ondisk btree block, this is only true if there is only one child block and the number of records don't fit in the inode root. Fixes: 08a3a692ef58 ("xfs: btree scrub should check minrecs") Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-11-24vfs: remove lockdep bogosity in __sb_start_writeDarrick J. Wong
[ Upstream commit 22843291efc986ce7722610073fcf85a39b4cb13 ] __sb_start_write has some weird looking lockdep code that claims to exist to handle nested freeze locking requests from xfs. The code as written seems broken -- if we think we hold a read lock on any of the higher freeze levels (e.g. we hold SB_FREEZE_WRITE and are trying to lock SB_FREEZE_PAGEFAULT), it converts a blocking lock attempt into a trylock. However, it's not correct to downgrade a blocking lock attempt to a trylock unless the downgrading code or the callers are prepared to deal with that situation. Neither __sb_start_write nor its callers handle this at all. For example: sb_start_pagefault ignores the return value completely, with the result that if xfs_filemap_fault loses a race with a different thread trying to fsfreeze, it will proceed without pagefault freeze protection (thereby breaking locking rules) and then unlocks the pagefault freeze lock that it doesn't own on its way out (thereby corrupting the lock state), which leads to a system hang shortly afterwards. Normally, this won't happen because our ownership of a read lock on a higher freeze protection level blocks fsfreeze from grabbing a write lock on that higher level. *However*, if lockdep is offline, lock_is_held_type unconditionally returns 1, which means that percpu_rwsem_is_held returns 1, which means that __sb_start_write unconditionally converts blocking freeze lock attempts into trylocks, even when we *don't* hold anything that would block a fsfreeze. Apparently this all held together until 5.10-rc1, when bugs in lockdep caused lockdep to shut itself off early in an fstests run, and once fstests gets to the "race writes with freezer" tests, kaboom. This might explain the long trail of vanishingly infrequent livelocks in fstests after lockdep goes offline that I've never been able to diagnose. We could fix it by spinning on the trylock if wait==true, but AFAICT the locking works fine if lockdep is not built at all (and I didn't see any complaints running fstests overnight), so remove this snippet entirely. NOTE: Commit f4b554af9931 in 2015 created the current weird logic (which used to exist in a different form in commit 5accdf82ba25c from 2012) in __sb_start_write. XFS solved this whole problem in the late 2.6 era by creating a variant of transactions (XFS_TRANS_NO_WRITECOUNT) that don't grab intwrite freeze protection, thus making lockdep's solution unnecessary. The commit claims that Dave Chinner explained that the trylock hack + comment could be removed, but nobody ever did. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-11-18Convert trailing spaces and periods in path componentsBoris Protopopov
commit 57c176074057531b249cf522d90c22313fa74b0b upstream. When converting trailing spaces and periods in paths, do so for every component of the path, not just the last component. If the conversion is not done for every path component, then subsequent operations in directories with trailing spaces or periods (e.g. create(), mkdir()) will fail with ENOENT. This is because on the server, the directory will have a special symbol in its name, and the client needs to provide the same. Signed-off-by: Boris Protopopov <pboris@amazon.com> Acked-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-11-18btrfs: fix potential overflow in cluster_pages_for_defrag on 32bit archMatthew Wilcox (Oracle)
commit a1fbc6750e212c5675a4e48d7f51d44607eb8756 upstream. On 32-bit systems, this shift will overflow for files larger than 4GB as start_index is unsigned long while the calls to btrfs_delalloc_*_space expect u64. CC: stable@vger.kernel.org # 4.4+ Fixes: df480633b891 ("btrfs: extent-tree: Switch to new delalloc space reserve and release") Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: David Sterba <dsterba@suse.com> [ define the variable instead of repeating the shift ] Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-11-18ocfs2: initialize ip_next_orphanWengang Wang
commit f5785283dd64867a711ca1fb1f5bb172f252ecdf upstream. Though problem if found on a lower 4.1.12 kernel, I think upstream has same issue. In one node in the cluster, there is the following callback trace: # cat /proc/21473/stack __ocfs2_cluster_lock.isra.36+0x336/0x9e0 [ocfs2] ocfs2_inode_lock_full_nested+0x121/0x520 [ocfs2] ocfs2_evict_inode+0x152/0x820 [ocfs2] evict+0xae/0x1a0 iput+0x1c6/0x230 ocfs2_orphan_filldir+0x5d/0x100 [ocfs2] ocfs2_dir_foreach_blk+0x490/0x4f0 [ocfs2] ocfs2_dir_foreach+0x29/0x30 [ocfs2] ocfs2_recover_orphans+0x1b6/0x9a0 [ocfs2] ocfs2_complete_recovery+0x1de/0x5c0 [ocfs2] process_one_work+0x169/0x4a0 worker_thread+0x5b/0x560 kthread+0xcb/0xf0 ret_from_fork+0x61/0x90 The above stack is not reasonable, the final iput shouldn't happen in ocfs2_orphan_filldir() function. Looking at the code, 2067 /* Skip inodes which are already added to recover list, since dio may 2068 * happen concurrently with unlink/rename */ 2069 if (OCFS2_I(iter)->ip_next_orphan) { 2070 iput(iter); 2071 return 0; 2072 } 2073 The logic thinks the inode is already in recover list on seeing ip_next_orphan is non-NULL, so it skip this inode after dropping a reference which incremented in ocfs2_iget(). While, if the inode is already in recover list, it should have another reference and the iput() at line 2070 should not be the final iput (dropping the last reference). So I don't think the inode is really in the recover list (no vmcore to confirm). Note that ocfs2_queue_orphans(), though not shown up in the call back trace, is holding cluster lock on the orphan directory when looking up for unlinked inodes. The on disk inode eviction could involve a lot of IOs which may need long time to finish. That means this node could hold the cluster lock for very long time, that can lead to the lock requests (from other nodes) to the orhpan directory hang for long time. Looking at more on ip_next_orphan, I found it's not initialized when allocating a new ocfs2_inode_info structure. This causes te reflink operations from some nodes hang for very long time waiting for the cluster lock on the orphan directory. Fix: initialize ip_next_orphan as NULL. Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Gang He <ghe@suse.com> Cc: Jun Piao <piaojun@huawei.com> Cc: <stable@vger.kernel.org> Link: https://lkml.kernel.org/r/20201109171746.27884-1-wen.gang.wang@oracle.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-11-18btrfs: dev-replace: fail mount if we don't have replace item with target deviceAnand Jain
commit cf89af146b7e62af55470cf5f3ec3c56ec144a5e upstream. If there is a device BTRFS_DEV_REPLACE_DEVID without the device replace item, then it means the filesystem is inconsistent state. This is either corruption or a crafted image. Fail the mount as this needs a closer look what is actually wrong. As of now if BTRFS_DEV_REPLACE_DEVID is present without the replace item, in __btrfs_free_extra_devids() we determine that there is an extra device, and free those extra devices but continue to mount the device. However, we were wrong in keeping tack of the rw_devices so the syzbot testcase failed: WARNING: CPU: 1 PID: 3612 at fs/btrfs/volumes.c:1166 close_fs_devices.part.0+0x607/0x800 fs/btrfs/volumes.c:1166 Kernel panic - not syncing: panic_on_warn set ... CPU: 1 PID: 3612 Comm: syz-executor.2 Not tainted 5.9.0-rc4-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x198/0x1fd lib/dump_stack.c:118 panic+0x347/0x7c0 kernel/panic.c:231 __warn.cold+0x20/0x46 kernel/panic.c:600 report_bug+0x1bd/0x210 lib/bug.c:198 handle_bug+0x38/0x90 arch/x86/kernel/traps.c:234 exc_invalid_op+0x14/0x40 arch/x86/kernel/traps.c:254 asm_exc_invalid_op+0x12/0x20 arch/x86/include/asm/idtentry.h:536 RIP: 0010:close_fs_devices.part.0+0x607/0x800 fs/btrfs/volumes.c:1166 RSP: 0018:ffffc900091777e0 EFLAGS: 00010246 RAX: 0000000000040000 RBX: ffffffffffffffff RCX: ffffc9000c8b7000 RDX: 0000000000040000 RSI: ffffffff83097f47 RDI: 0000000000000007 RBP: dffffc0000000000 R08: 0000000000000001 R09: ffff8880988a187f R10: 0000000000000000 R11: 0000000000000001 R12: ffff88809593a130 R13: ffff88809593a1ec R14: ffff8880988a1908 R15: ffff88809593a050 close_fs_devices fs/btrfs/volumes.c:1193 [inline] btrfs_close_devices+0x95/0x1f0 fs/btrfs/volumes.c:1179 open_ctree+0x4984/0x4a2d fs/btrfs/disk-io.c:3434 btrfs_fill_super fs/btrfs/super.c:1316 [inline] btrfs_mount_root.cold+0x14/0x165 fs/btrfs/super.c:1672 The fix here is, when we determine that there isn't a replace item then fail the mount if there is a replace target device (devid 0). CC: stable@vger.kernel.org # 4.19+ Reported-by: syzbot+4cfe71a4da060be47502@syzkaller.appspotmail.com Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-11-18btrfs: ref-verify: fix memory leak in btrfs_ref_tree_modDinghao Liu
commit 468600c6ec28613b756193c5f780aac062f1acdf upstream. There is one error handling path that does not free ref, which may cause a minor memory leak. CC: stable@vger.kernel.org # 4.19+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Dinghao Liu <dinghao.liu@zju.edu.cn> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-11-18ext4: unlock xattr_sem properly in ext4_inline_data_truncate()Joseph Qi
commit 7067b2619017d51e71686ca9756b454de0e5826a upstream. It takes xattr_sem to check inline data again but without unlock it in case not have. So unlock it before return. Fixes: aef1c8513c1f ("ext4: let ext4_truncate handle inline data correctly") Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Cc: Tao Ma <boyu.mt@taobao.com> Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: Andreas Dilger <adilger@dilger.ca> Link: https://lore.kernel.org/r/1604370542-124630-1-git-send-email-joseph.qi@linux.alibaba.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-11-18ext4: correctly report "not supported" for {usr,grp}jquota when !CONFIG_QUOTAKaixu Xia
commit 174fe5ba2d1ea0d6c5ab2a7d4aa058d6d497ae4d upstream. The macro MOPT_Q is used to indicates the mount option is related to quota stuff and is defined to be MOPT_NOSUPPORT when CONFIG_QUOTA is disabled. Normally the quota options are handled explicitly, so it didn't matter that the MOPT_STRING flag was missing, even though the usrjquota and grpjquota mount options take a string argument. It's important that's present in the !CONFIG_QUOTA case, since without MOPT_STRING, the mount option matcher will match usrjquota= followed by an integer, and will otherwise skip the table entry, and so "mount option not supported" error message is never reported. [ Fixed up the commit description to better explain why the fix works. --TYT ] Fixes: 26092bf52478 ("ext4: use a table-driven handler for mount options") Signed-off-by: Kaixu Xia <kaixuxia@tencent.com> Link: https://lore.kernel.org/r/1603986396-28917-1-git-send-email-kaixuxia@tencent.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-11-18xfs: fix a missing unlock on error in xfs_fs_map_blocksChristoph Hellwig
[ Upstream commit 2bd3fa793aaa7e98b74e3653fdcc72fa753913b5 ] We also need to drop the iolock when invalidate_inode_pages2 fails, not only on all other error or successful cases. Fixes: 527851124d10 ("xfs: implement pNFS export operations") Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-11-18xfs: fix brainos in the refcount scrubber's rmap fragment processorDarrick J. Wong
[ Upstream commit 54e9b09e153842ab5adb8a460b891e11b39e9c3d ] Fix some serious WTF in the reference count scrubber's rmap fragment processing. The code comment says that this loop is supposed to move all fragment records starting at or before bno onto the worklist, but there's no obvious reason why nr (the number of items added) should increment starting from 1, and breaking the loop when we've added the target number seems dubious since we could have more rmap fragments that should have been added to the worklist. This seems to manifest in xfs/411 when adding one to the refcount field. Fixes: dbde19da9637 ("xfs: cross-reference the rmapbt data with the refcountbt") Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-11-18xfs: fix rmap key and record comparison functionsDarrick J. Wong
[ Upstream commit 6ff646b2ceb0eec916101877f38da0b73e3a5b7f ] Keys for extent interval records in the reverse mapping btree are supposed to be computed as follows: (physical block, owner, fork, is_btree, is_unwritten, offset) This provides users the ability to look up a reverse mapping from a bmbt record -- start with the physical block; then if there are multiple records for the same block, move on to the owner; then the inode fork type; and so on to the file offset. However, the key comparison functions incorrectly remove the fork/btree/unwritten information that's encoded in the on-disk offset. This means that lookup comparisons are only done with: (physical block, owner, offset) This means that queries can return incorrect results. On consistent filesystems this hasn't been an issue because blocks are never shared between forks or with bmbt blocks; and are never unwritten. However, this bug means that online repair cannot always detect corruption in the key information in internal rmapbt nodes. Found by fuzzing keys[1].attrfork = ones on xfs/371. Fixes: 4b8ed67794fe ("xfs: add rmap btree operations") Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-11-18xfs: set the unwritten bit in rmap lookup flags in xchk_bmap_get_rmapextentsDarrick J. Wong
[ Upstream commit 5dda3897fd90783358c4c6115ef86047d8c8f503 ] When the bmbt scrubber is looking up rmap extents, we need to set the extent flags from the bmbt record fully. This will matter once we fix the rmap btree comparison functions to check those flags correctly. Fixes: d852657ccfc0 ("xfs: cross-reference reverse-mapping btree") Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-11-18xfs: fix flags argument to rmap lookup when converting shared file rmapsDarrick J. Wong
[ Upstream commit ea8439899c0b15a176664df62aff928010fad276 ] Pass the same oldext argument (which contains the existing rmapping's unwritten state) to xfs_rmap_lookup_le_range at the start of xfs_rmap_convert_shared. At this point in the code, flags is zero, which means that we perform lookups using the wrong key. Fixes: 3f165b334e51 ("xfs: convert unwritten status of reverse mappings for shared files") Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-11-18gfs2: check for live vs. read-only file system in gfs2_fitrimBob Peterson
[ Upstream commit c5c68724696e7d2f8db58a5fce3673208d35c485 ] Before this patch, gfs2_fitrim was not properly checking for a "live" file system. If the file system had something to trim and the file system was read-only (or spectator) it would start the trim, but when it starts the transaction, gfs2_trans_begin returns -EROFS (read-only file system) and it errors out. However, if the file system was already trimmed so there's no work to do, it never called gfs2_trans_begin. That code is bypassed so it never returns the error. Instead, it returns a good return code with 0 work. All this makes for inconsistent behavior: The same fstrim command can return -EROFS in one case and 0 in another. This tripped up xfstests generic/537 which reports the error as: +fstrim with unrecovered metadata just ate your filesystem This patch adds a check for a "live" (iow, active journal, iow, RW) file system, and if not, returns the error properly. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-11-18gfs2: Add missing truncate_inode_pages_final for sd_aspaceBob Peterson
[ Upstream commit a9dd945ccef07a904e412f208f8de708a3d7159e ] Gfs2 creates an address space for its rgrps called sd_aspace, but it never called truncate_inode_pages_final on it. This confused vfs greatly which tried to reference the address space after gfs2 had freed the superblock that contained it. This patch adds a call to truncate_inode_pages_final for sd_aspace, thus avoiding the use-after-free. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-11-18gfs2: Free rd_bits later in gfs2_clear_rgrpd to fix use-after-freeBob Peterson
[ Upstream commit d0f17d3883f1e3f085d38572c2ea8edbd5150172 ] Function gfs2_clear_rgrpd calls kfree(rgd->rd_bits) before calling return_all_reservations, but return_all_reservations still dereferences rgd->rd_bits in __rs_deltree. Fix that by moving the call to kfree below the call to return_all_reservations. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-11-18Btrfs: fix missing error return if writeback for extent buffer never startedFilipe Manana
[ Upstream commit 0607eb1d452d45c5ac4c745a9e9e0d95152ea9d0 ] If lock_extent_buffer_for_io() fails, it returns a negative value, but its caller btree_write_cache_pages() ignores such error. This means that a call to flush_write_bio(), from lock_extent_buffer_for_io(), might have failed. We should make btree_write_cache_pages() notice such error values and stop immediatelly, making sure filemap_fdatawrite_range() returns an error to the transaction commit path. A failure from flush_write_bio() should also result in the endio callback end_bio_extent_buffer_writepage() being invoked, which sets the BTRFS_FS_*_ERR bits appropriately, so that there's no risk a transaction or log commit doesn't catch a writeback failure. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-11-18xfs: fix scrub flagging rtinherit even if there is no rt deviceDarrick J. Wong
[ Upstream commit c1f6b1ac00756a7108e5fcb849a2f8230c0b62a5 ] The kernel has always allowed directories to have the rtinherit flag set, even if there is no rt device, so this check is wrong. Fixes: 80e4e1268802 ("xfs: scrub inodes") Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-11-18xfs: flush new eof page on truncate to avoid post-eof corruptionBrian Foster
[ Upstream commit 869ae85dae64b5540e4362d7fe4cd520e10ec05c ] It is possible to expose non-zeroed post-EOF data in XFS if the new EOF page is dirty, backed by an unwritten block and the truncate happens to race with writeback. iomap_truncate_page() will not zero the post-EOF portion of the page if the underlying block is unwritten. The subsequent call to truncate_setsize() will, but doesn't dirty the page. Therefore, if writeback happens to complete after iomap_truncate_page() (so it still sees the unwritten block) but before truncate_setsize(), the cached page becomes inconsistent with the on-disk block. A mapped read after the associated page is reclaimed or invalidated exposes non-zero post-EOF data. For example, consider the following sequence when run on a kernel modified to explicitly flush the new EOF page within the race window: $ xfs_io -fc "falloc 0 4k" -c fsync /mnt/file $ xfs_io -c "pwrite 0 4k" -c "truncate 1k" /mnt/file ... $ xfs_io -c "mmap 0 4k" -c "mread -v 1k 8" /mnt/file 00000400: 00 00 00 00 00 00 00 00 ........ $ umount /mnt/; mount <dev> /mnt/ $ xfs_io -c "mmap 0 4k" -c "mread -v 1k 8" /mnt/file 00000400: cd cd cd cd cd cd cd cd ........ Update xfs_setattr_size() to explicitly flush the new EOF page prior to the page truncate to ensure iomap has the latest state of the underlying block. Fixes: 68a9f5e7007c ("xfs: implement iomap based buffered write path") Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-11-18xfs: set xefi_discard when creating a deferred agfl free log intent itemDarrick J. Wong
[ Upstream commit 2c334e12f957cd8c6bb66b4aa3f79848b7c33cab ] Make sure that we actually initialize xefi_discard when we're scheduling a deferred free of an AGFL block. This was (eventually) found by the UBSAN while I was banging on realtime rmap problems, but it exists in the upstream codebase. While we're at it, rearrange the structure to reduce the struct size from 64 to 56 bytes. Fixes: fcb762f5de2e ("xfs: add bmapi nodiscard flag") Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-11-18btrfs: reschedule when cloning lots of extentsJohannes Thumshirn
[ Upstream commit 6b613cc97f0ace77f92f7bc112b8f6ad3f52baf8 ] We have several occurrences of a soft lockup from fstest's generic/175 testcase, which look more or less like this one: watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [xfs_io:10030] Kernel panic - not syncing: softlockup: hung tasks CPU: 0 PID: 10030 Comm: xfs_io Tainted: G L 5.9.0-rc5+ #768 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4-rebuilt.opensuse.org 04/01/2014 Call Trace: <IRQ> dump_stack+0x77/0xa0 panic+0xfa/0x2cb watchdog_timer_fn.cold+0x85/0xa5 ? lockup_detector_update_enable+0x50/0x50 __hrtimer_run_queues+0x99/0x4c0 ? recalibrate_cpu_khz+0x10/0x10 hrtimer_run_queues+0x9f/0xb0 update_process_times+0x28/0x80 tick_handle_periodic+0x1b/0x60 __sysvec_apic_timer_interrupt+0x76/0x210 asm_call_on_stack+0x12/0x20 </IRQ> sysvec_apic_timer_interrupt+0x7f/0x90 asm_sysvec_apic_timer_interrupt+0x12/0x20 RIP: 0010:btrfs_tree_unlock+0x91/0x1a0 [btrfs] RSP: 0018:ffffc90007123a58 EFLAGS: 00000282 RAX: ffff8881cea2fbe0 RBX: ffff8881cea2fbe0 RCX: 0000000000000000 RDX: ffff8881d23fd200 RSI: ffffffff82045220 RDI: ffff8881cea2fba0 RBP: 0000000000000001 R08: 0000000000000000 R09: 0000000000000032 R10: 0000160000000000 R11: 0000000000001000 R12: 0000000000001000 R13: ffff8882357fd5b0 R14: ffff88816fa76e70 R15: ffff8881cea2fad0 ? btrfs_tree_unlock+0x15b/0x1a0 [btrfs] btrfs_release_path+0x67/0x80 [btrfs] btrfs_insert_replace_extent+0x177/0x2c0 [btrfs] btrfs_replace_file_extents+0x472/0x7c0 [btrfs] btrfs_clone+0x9ba/0xbd0 [btrfs] btrfs_clone_files.isra.0+0xeb/0x140 [btrfs] ? file_update_time+0xcd/0x120 btrfs_remap_file_range+0x322/0x3b0 [btrfs] do_clone_file_range+0xb7/0x1e0 vfs_clone_file_range+0x30/0xa0 ioctl_file_clone+0x8a/0xc0 do_vfs_ioctl+0x5b2/0x6f0 __x64_sys_ioctl+0x37/0xa0 do_syscall_64+0x33/0x40 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x7f87977fc247 RSP: 002b:00007ffd51a2f6d8 EFLAGS: 00000206 ORIG_RAX: 0000000000000010 RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f87977fc247 RDX: 00007ffd51a2f710 RSI: 000000004020940d RDI: 0000000000000003 RBP: 0000000000000004 R08: 00007ffd51a79080 R09: 0000000000000000 R10: 00005621f11352f2 R11: 0000000000000206 R12: 0000000000000000 R13: 0000000000000000 R14: 00005621f128b958 R15: 0000000080000000 Kernel Offset: disabled ---[ end Kernel panic - not syncing: softlockup: hung tasks ]--- All of these lockup reports have the call chain btrfs_clone_files() -> btrfs_clone() in common. btrfs_clone_files() calls btrfs_clone() with both source and destination extents locked and loops over the source extent to create the clones. Conditionally reschedule in the btrfs_clone() loop, to give some time back to other processes. CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-11-18btrfs: sysfs: init devices outside of the chunk_mutexJosef Bacik
[ Upstream commit ca10845a56856fff4de3804c85e6424d0f6d0cde ] While running btrfs/061, btrfs/073, btrfs/078, or btrfs/178 we hit the following lockdep splat: ====================================================== WARNING: possible circular locking dependency detected 5.9.0-rc3+ #4 Not tainted ------------------------------------------------------ kswapd0/100 is trying to acquire lock: ffff96ecc22ef4a0 (&delayed_node->mutex){+.+.}-{3:3}, at: __btrfs_release_delayed_node.part.0+0x3f/0x330 but task is already holding lock: ffffffff8dd74700 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x5/0x30 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #3 (fs_reclaim){+.+.}-{0:0}: fs_reclaim_acquire+0x65/0x80 slab_pre_alloc_hook.constprop.0+0x20/0x200 kmem_cache_alloc+0x37/0x270 alloc_inode+0x82/0xb0 iget_locked+0x10d/0x2c0 kernfs_get_inode+0x1b/0x130 kernfs_get_tree+0x136/0x240 sysfs_get_tree+0x16/0x40 vfs_get_tree+0x28/0xc0 path_mount+0x434/0xc00 __x64_sys_mount+0xe3/0x120 do_syscall_64+0x33/0x40 entry_SYSCALL_64_after_hwframe+0x44/0xa9 -> #2 (kernfs_mutex){+.+.}-{3:3}: __mutex_lock+0x7e/0x7e0 kernfs_add_one+0x23/0x150 kernfs_create_link+0x63/0xa0 sysfs_do_create_link_sd+0x5e/0xd0 btrfs_sysfs_add_devices_dir+0x81/0x130 btrfs_init_new_device+0x67f/0x1250 btrfs_ioctl+0x1ef/0x2e20 __x64_sys_ioctl+0x83/0xb0 do_syscall_64+0x33/0x40 entry_SYSCALL_64_after_hwframe+0x44/0xa9 -> #1 (&fs_info->chunk_mutex){+.+.}-{3:3}: __mutex_lock+0x7e/0x7e0 btrfs_chunk_alloc+0x125/0x3a0 find_free_extent+0xdf6/0x1210 btrfs_reserve_extent+0xb3/0x1b0 btrfs_alloc_tree_block+0xb0/0x310 alloc_tree_block_no_bg_flush+0x4a/0x60 __btrfs_cow_block+0x11a/0x530 btrfs_cow_block+0x104/0x220 btrfs_search_slot+0x52e/0x9d0 btrfs_insert_empty_items+0x64/0xb0 btrfs_insert_delayed_items+0x90/0x4f0 btrfs_commit_inode_delayed_items+0x93/0x140 btrfs_log_inode+0x5de/0x2020 btrfs_log_inode_parent+0x429/0xc90 btrfs_log_new_name+0x95/0x9b btrfs_rename2+0xbb9/0x1800 vfs_rename+0x64f/0x9f0 do_renameat2+0x320/0x4e0 __x64_sys_rename+0x1f/0x30 do_syscall_64+0x33/0x40 entry_SYSCALL_64_after_hwframe+0x44/0xa9 -> #0 (&delayed_node->mutex){+.+.}-{3:3}: __lock_acquire+0x119c/0x1fc0 lock_acquire+0xa7/0x3d0 __mutex_lock+0x7e/0x7e0 __btrfs_release_delayed_node.part.0+0x3f/0x330 btrfs_evict_inode+0x24c/0x500 evict+0xcf/0x1f0 dispose_list+0x48/0x70 prune_icache_sb+0x44/0x50 super_cache_scan+0x161/0x1e0 do_shrink_slab+0x178/0x3c0 shrink_slab+0x17c/0x290 shrink_node+0x2b2/0x6d0 balance_pgdat+0x30a/0x670 kswapd+0x213/0x4c0 kthread+0x138/0x160 ret_from_fork+0x1f/0x30 other info that might help us debug this: Chain exists of: &delayed_node->mutex --> kernfs_mutex --> fs_reclaim Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(fs_reclaim); lock(kernfs_mutex); lock(fs_reclaim); lock(&delayed_node->mutex); *** DEADLOCK *** 3 locks held by kswapd0/100: #0: ffffffff8dd74700 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x5/0x30 #1: ffffffff8dd65c50 (shrinker_rwsem){++++}-{3:3}, at: shrink_slab+0x115/0x290 #2: ffff96ed2ade30e0 (&type->s_umount_key#36){++++}-{3:3}, at: super_cache_scan+0x38/0x1e0 stack backtrace: CPU: 0 PID: 100 Comm: kswapd0 Not tainted 5.9.0-rc3+ #4 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014 Call Trace: dump_stack+0x8b/0xb8 check_noncircular+0x12d/0x150 __lock_acquire+0x119c/0x1fc0 lock_acquire+0xa7/0x3d0 ? __btrfs_release_delayed_node.part.0+0x3f/0x330 __mutex_lock+0x7e/0x7e0 ? __btrfs_release_delayed_node.part.0+0x3f/0x330 ? __btrfs_release_delayed_node.part.0+0x3f/0x330 ? lock_acquire+0xa7/0x3d0 ? find_held_lock+0x2b/0x80 __btrfs_release_delayed_node.part.0+0x3f/0x330 btrfs_evict_inode+0x24c/0x500 evict+0xcf/0x1f0 dispose_list+0x48/0x70 prune_icache_sb+0x44/0x50 super_cache_scan+0x161/0x1e0 do_shrink_slab+0x178/0x3c0 shrink_slab+0x17c/0x290 shrink_node+0x2b2/0x6d0 balance_pgdat+0x30a/0x670 kswapd+0x213/0x4c0 ? _raw_spin_unlock_irqrestore+0x41/0x50 ? add_wait_queue_exclusive+0x70/0x70 ? balance_pgdat+0x670/0x670 kthread+0x138/0x160 ? kthread_create_worker_on_cpu+0x40/0x40 ret_from_fork+0x1f/0x30 This happens because we are holding the chunk_mutex at the time of adding in a new device. However we only need to hold the device_list_mutex, as we're going to iterate over the fs_devices devices. Move the sysfs init stuff outside of the chunk_mutex to get rid of this lockdep splat. CC: stable@vger.kernel.org # 4.4.x: f3cd2c58110dad14e: btrfs: sysfs, rename device_link add/remove functions CC: stable@vger.kernel.org # 4.4.x Reported-by: David Sterba <dsterba@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-11-10gfs2: Wake up when sd_glock_disposal becomes zeroAlexander Aring
commit da7d554f7c62d0c17c1ac3cc2586473c2d99f0bd upstream. Commit fc0e38dae645 ("GFS2: Fix glock deallocation race") fixed a sd_glock_disposal accounting bug by adding a missing atomic_dec statement, but it failed to wake up sd_glock_wait when that decrement causes sd_glock_disposal to reach zero. As a consequence, gfs2_gl_hash_clear can now run into a 10-minute timeout instead of being woken up. Add the missing wakeup. Fixes: fc0e38dae645 ("GFS2: Fix glock deallocation race") Cc: stable@vger.kernel.org # v2.6.39+ Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-11-10btrfs: tree-checker: fix the error message for transid errorQu Wenruo
commit f96d6960abbc52e26ad124e69e6815283d3e1674 upstream. The error message for inode transid is the same as for inode generation, which makes us unable to detect the real problem. Reported-by: Tyler Richmond <t.d.richmond@gmail.com> Fixes: 496245cac57e ("btrfs: tree-checker: Verify inode item") CC: stable@vger.kernel.org # 5.4+ Reviewed-by: Marcos Paulo de Souza <mpdesouza@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> [bwh: Backported to 4.19: adjust context] Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-11-10btrfs: tree-checker: Verify inode itemQu Wenruo
commit 496245cac57e26d8b738d85c7a29cf9a47610f3f upstream. There is a report in kernel bugzilla about mismatch file type in dir item and inode item. This inspires us to check inode mode in inode item. This patch will check the following members: - inode key objectid Should be ROOT_DIR_DIR or [256, (u64)-256] or FREE_INO. - inode key offset Should be 0 - inode item generation - inode item transid No newer than sb generation + 1. The +1 is for log tree. - inode item mode No unknown bits. No invalid S_IF* bit. NOTE: S_IFMT check is not enough, need to check every know type. - inode item nlink Dir should have no more link than 1. - inode item flags Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-11-10btrfs: tree-checker: Enhance chunk checker to validate chunk profileQu Wenruo
commit 80e46cf22ba0bcb57b39c7c3b52961ab3a0fd5f2 upstream. Btrfs-progs already have a comprehensive type checker, to ensure there is only 0 (SINGLE profile) or 1 (DUP/RAID0/1/5/6/10) bit set for chunk profile bits. Do the same work for kernel. Reported-by: Yoon Jungyeon <jungyeon@gatech.edu> Link: https://bugzilla.kernel.org/show_bug.cgi?id=202765 Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-11-10btrfs: tree-checker: Fix wrong check on max devidQu Wenruo
commit 8bb177d18f114358a57d8ae7e206861b48b8b4de upstream. [BUG] The following script will cause false alert on devid check. #!/bin/bash dev1=/dev/test/test dev2=/dev/test/scratch1 mnt=/mnt/btrfs umount $dev1 &> /dev/null umount $dev2 &> /dev/null umount $mnt &> /dev/null mkfs.btrfs -f $dev1 mount $dev1 $mnt _fail() { echo "!!! FAILED !!!" exit 1 } for ((i = 0; i < 4096; i++)); do btrfs dev add -f $dev2 $mnt || _fail btrfs dev del $dev1 $mnt || _fail dev_tmp=$dev1 dev1=$dev2 dev2=$dev_tmp done [CAUSE] Tree-checker uses BTRFS_MAX_DEVS() and BTRFS_MAX_DEVS_SYS_CHUNK() as upper limit for devid. But we can have devid holes just like above script. So the check for devid is incorrect and could cause false alert. [FIX] Just remove the whole devid check. We don't have any hard requirement for devid assignment. Furthermore, even devid could get corrupted by a bitflip, we still have dev extents verification at mount time, so corrupted data won't sneak in. This fixes fstests btrfs/194. Reported-by: Anand Jain <anand.jain@oracle.com> Fixes: ab4ba2e13346 ("btrfs: tree-checker: Verify dev item") CC: stable@vger.kernel.org # 5.2+ Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> [bwh: Backported to 4.19: adjust context] Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-11-10btrfs: tree-checker: Verify dev itemQu Wenruo
commit ab4ba2e133463c702b37242560d7fabedd2dc750 upstream. [BUG] For fuzzed image whose DEV_ITEM has invalid total_bytes as 0, then kernel will just panic: BUG: unable to handle kernel NULL pointer dereference at 0000000000000098 #PF error: [normal kernel read fault] PGD 800000022b2bd067 P4D 800000022b2bd067 PUD 22b2bc067 PMD 0 Oops: 0000 [#1] SMP PTI CPU: 0 PID: 1106 Comm: mount Not tainted 5.0.0-rc8+ #9 RIP: 0010:btrfs_verify_dev_extents+0x2a5/0x5a0 Call Trace: open_ctree+0x160d/0x2149 btrfs_mount_root+0x5b2/0x680 [CAUSE] If device extent verification finds a deivce with 0 total_bytes, then it assumes it's a seed dummy, then search for seed devices. But in this case, there is no seed device at all, causing NULL pointer. [FIX] Since this is caused by fuzzed image, let's go the tree-check way, just add a new verification for device item. Reported-by: Yoon Jungyeon <jungyeon@gatech.edu> Link: https://bugzilla.kernel.org/show_bug.cgi?id=202691 Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-11-10btrfs: tree-checker: Check chunk item at tree block read timeQu Wenruo
commit 075cb3c78fe7976c9f29ca1fa23f9728634ecefc upstream. Since we have btrfs_check_chunk_valid() in tree-checker, let's do chunk item verification in tree-checker too. Since the tree-checker is run at endio time, if one chunk leaf fails chunk verification, we can still retry the other copy, making btrfs more robust to fuzzed image as we may still get a good chunk item. Also since we have done chunk verification in tree block read time, skip the btrfs_check_chunk_valid() call in read_one_chunk() if we're reading chunk items from leaf. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-11-10btrfs: tree-checker: Make btrfs_check_chunk_valid() return EUCLEAN instead ↵Qu Wenruo
of EIO commit bf871c3b43b1dcc3f2a076ff39a8f1ce7959d958 upstream. To follow the standard behavior of tree-checker. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> [bwh: Cherry-picked for 4.19 to ease backporting later fixes] Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-11-10btrfs: tree-checker: Make chunk item checker messages more readableQu Wenruo
commit f114024376bceb1c0f61a7bad4a72a0f978767af upstream. Old error message would be something like: BTRFS error (device dm-3): invalid chunk num_stipres: 0 New error message would be: Btrfs critical (device dm-3): corrupt superblock syschunk array: chunk_start=2097152, invalid chunk num_stripes: 0 Or Btrfs critical (device dm-3): corrupt leaf: root=3 block=8388608 slot=3 chunk_start=2097152, invalid chunk num_stripes: 0 And for certain error message, also output expected value. The error message levels are changed from error to critical. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> [bwh: Cherry-picked for 4.19 to ease backporting later fixes] Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-11-10btrfs: Move btrfs_check_chunk_valid() to tree-check.[ch] and export itQu Wenruo
commit 82fc28fbedbb59642f05215db3b0ef4eb91aa31d upstream. By function, chunk item verification is more suitable to be done inside tree-checker. So move btrfs_check_chunk_valid() to tree-checker.c and export it. And since it's now moved to tree-checker, also add a better comment for what this function is doing. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> [bwh: Cherry-picked for 4.19 to ease backporting later fixes] Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-11-10btrfs: Don't submit any btree write bio if the fs has errorsQu Wenruo
commit b3ff8f1d380e65dddd772542aa9bff6c86bf715a upstream. [BUG] There is a fuzzed image which could cause KASAN report at unmount time. BUG: KASAN: use-after-free in btrfs_queue_work+0x2c1/0x390 Read of size 8 at addr ffff888067cf6848 by task umount/1922 CPU: 0 PID: 1922 Comm: umount Tainted: G W 5.0.21 #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014 Call Trace: dump_stack+0x5b/0x8b print_address_description+0x70/0x280 kasan_report+0x13a/0x19b btrfs_queue_work+0x2c1/0x390 btrfs_wq_submit_bio+0x1cd/0x240 btree_submit_bio_hook+0x18c/0x2a0 submit_one_bio+0x1be/0x320 flush_write_bio.isra.41+0x2c/0x70 btree_write_cache_pages+0x3bb/0x7f0 do_writepages+0x5c/0x130 __writeback_single_inode+0xa3/0x9a0 writeback_single_inode+0x23d/0x390 write_inode_now+0x1b5/0x280 iput+0x2ef/0x600 close_ctree+0x341/0x750 generic_shutdown_super+0x126/0x370 kill_anon_super+0x31/0x50 btrfs_kill_super+0x36/0x2b0 deactivate_locked_super+0x80/0xc0 deactivate_super+0x13c/0x150 cleanup_mnt+0x9a/0x130 task_work_run+0x11a/0x1b0 exit_to_usermode_loop+0x107/0x130 do_syscall_64+0x1e5/0x280 entry_SYSCALL_64_after_hwframe+0x44/0xa9 [CAUSE] The fuzzed image has a completely screwd up extent tree: leaf 29421568 gen 8 total ptrs 6 free space 3587 owner EXTENT_TREE refs 2 lock (w:0 r:0 bw:0 br:0 sw:0 sr:0) lock_owner 0 current 5938 item 0 key (12587008 168 4096) itemoff 3942 itemsize 53 extent refs 1 gen 9 flags 1 ref#0: extent data backref root 5 objectid 259 offset 0 count 1 item 1 key (12591104 168 8192) itemoff 3889 itemsize 53 extent refs 1 gen 9 flags 1 ref#0: extent data backref root 5 objectid 271 offset 0 count 1 item 2 key (12599296 168 4096) itemoff 3836 itemsize 53 extent refs 1 gen 9 flags 1 ref#0: extent data backref root 5 objectid 259 offset 4096 count 1 item 3 key (29360128 169 0) itemoff 3803 itemsize 33 extent refs 1 gen 9 flags 2 ref#0: tree block backref root 5 item 4 key (29368320 169 1) itemoff 3770 itemsize 33 extent refs 1 gen 9 flags 2 ref#0: tree block backref root 5 item 5 key (29372416 169 0) itemoff 3737 itemsize 33 extent refs 1 gen 9 flags 2 ref#0: tree block backref root 5 Note that leaf 29421568 doesn't have its backref in the extent tree. Thus extent allocator can re-allocate leaf 29421568 for other trees. In short, the bug is caused by: - Existing tree block gets allocated to log tree This got its generation bumped. - Log tree balance cleaned dirty bit of offending tree block It will not be written back to disk, thus no WRITTEN flag. - Original owner of the tree block gets COWed Since the tree block has higher transid, no WRITTEN flag, it's reused, and not traced by transaction::dirty_pages. - Transaction aborted Tree blocks get cleaned according to transaction::dirty_pages. But the offending tree block is not recorded at all. - Filesystem unmount All pages are assumed to be are clean, destroying all workqueue, then call iput(btree_inode). But offending tree block is still dirty, which triggers writeback, and causes use-after-free bug. The detailed sequence looks like this: - Initial status eb: 29421568, header=WRITTEN bflags_dirty=0, page_dirty=0, gen=8, not traced by any dirty extent_iot_tree. - New tree block is allocated Since there is no backref for 29421568, it's re-allocated as new tree block. Keep in mind that tree block 29421568 is still referred by extent tree. - Tree block 29421568 is filled for log tree eb: 29421568, header=0 bflags_dirty=1, page_dirty=1, gen=9 << (gen bumped) traced by btrfs_root::dirty_log_pages - Some log tree operations Since the fs is using node size 4096, the log tree can easily go a level higher. - Log tree needs balance Tree block 29421568 gets all its content pushed to right, thus now it is empty, and we don't need it. btrfs_clean_tree_block() from __push_leaf_right() get called. eb: 29421568, header=0 bflags_dirty=0, page_dirty=0, gen=9 traced by btrfs_root::dirty_log_pages - Log tree write back btree_write_cache_pages() goes through dirty pages ranges, but since page of tree block 29421568 gets cleaned already, it's not written back to disk. Thus it doesn't have WRITTEN bit set. But ranges in dirty_log_pages are cleared. eb: 29421568, header=0 bflags_dirty=0, page_dirty=0, gen=9 not traced by any dirty extent_iot_tree. - Extent tree update when committing transaction Since tree block 29421568 has transid equal to running trans, and has no WRITTEN bit, should_cow_block() will use it directly without adding it to btrfs_transaction::dirty_pages. eb: 29421568, header=0 bflags_dirty=1, page_dirty=1, gen=9 not traced by any dirty extent_iot_tree. At this stage, we're doomed. We have a dirty eb not tracked by any extent io tree. - Transaction gets aborted due to corrupted extent tree Btrfs cleans up dirty pages according to transaction::dirty_pages and btrfs_root::dirty_log_pages. But since tree block 29421568 is not tracked by neither of them, it's still dirty. eb: 29421568, header=0 bflags_dirty=1, page_dirty=1, gen=9 not traced by any dirty extent_iot_tree. - Filesystem unmount Since all cleanup is assumed to be done, all workqueus are destroyed. Then iput(btree_inode) is called, expecting no dirty pages. But tree 29421568 is still dirty, thus triggering writeback. Since all workqueues are already freed, we cause use-after-free. This shows us that, log tree blocks + bad extent tree can cause wild dirty pages. [FIX] To fix the problem, don't submit any btree write bio if the filesytem has any error. This is the last safe net, just in case other cleanup haven't caught catch it. Link: https://github.com/bobfuzzer/CVE/tree/master/CVE-2019-19377 CC: stable@vger.kernel.org # 5.4+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> [bwh: Backported to 4.19: fs_info variable already exists in btree_write_cache_pages()] Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-11-10Btrfs: fix unwritten extent buffers and hangs on future writeback attemptsFilipe Manana
commit 18dfa7117a3f379862dcd3f67cadd678013bb9dd upstream. The lock_extent_buffer_io() returns 1 to the caller to tell it everything went fine and the callers needs to start writeback for the extent buffer (submit a bio, etc), 0 to tell the caller everything went fine but it does not need to start writeback for the extent buffer, and a negative value if some error happened. When it's about to return 1 it tries to lock all pages, and if a try lock on a page fails, and we didn't flush any existing bio in our "epd", it calls flush_write_bio(epd) and overwrites the return value of 1 to 0 or an error. The page might have been locked elsewhere, not with the goal of starting writeback of the extent buffer, and even by some code other than btrfs, like page migration for example, so it does not mean the writeback of the extent buffer was already started by some other task, so returning a 0 tells the caller (btree_write_cache_pages()) to not start writeback for the extent buffer. Note that epd might currently have either no bio, so flush_write_bio() returns 0 (success) or it might have a bio for another extent buffer with a lower index (logical address). Since we return 0 with the EXTENT_BUFFER_WRITEBACK bit set on the extent buffer and writeback is never started for the extent buffer, future attempts to writeback the extent buffer will hang forever waiting on that bit to be cleared, since it can only be cleared after writeback completes. Such hang is reported with a trace like the following: [49887.347053] INFO: task btrfs-transacti:1752 blocked for more than 122 seconds. [49887.347059] Not tainted 5.2.13-gentoo #2 [49887.347060] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [49887.347062] btrfs-transacti D 0 1752 2 0x80004000 [49887.347064] Call Trace: [49887.347069] ? __schedule+0x265/0x830 [49887.347071] ? bit_wait+0x50/0x50 [49887.347072] ? bit_wait+0x50/0x50 [49887.347074] schedule+0x24/0x90 [49887.347075] io_schedule+0x3c/0x60 [49887.347077] bit_wait_io+0x8/0x50 [49887.347079] __wait_on_bit+0x6c/0x80 [49887.347081] ? __lock_release.isra.29+0x155/0x2d0 [49887.347083] out_of_line_wait_on_bit+0x7b/0x80 [49887.347084] ? var_wake_function+0x20/0x20 [49887.347087] lock_extent_buffer_for_io+0x28c/0x390 [49887.347089] btree_write_cache_pages+0x18e/0x340 [49887.347091] do_writepages+0x29/0xb0 [49887.347093] ? kmem_cache_free+0x132/0x160 [49887.347095] ? convert_extent_bit+0x544/0x680 [49887.347097] filemap_fdatawrite_range+0x70/0x90 [49887.347099] btrfs_write_marked_extents+0x53/0x120 [49887.347100] btrfs_write_and_wait_transaction.isra.4+0x38/0xa0 [49887.347102] btrfs_commit_transaction+0x6bb/0x990 [49887.347103] ? start_transaction+0x33e/0x500 [49887.347105] transaction_kthread+0x139/0x15c So fix this by not overwriting the return value (ret) with the result from flush_write_bio(). We also need to clear the EXTENT_BUFFER_WRITEBACK bit in case flush_write_bio() returns an error, otherwise it will hang any future attempts to writeback the extent buffer, and undo all work done before (set back EXTENT_BUFFER_DIRTY, etc). This is a regression introduced in the 5.2 kernel. Fixes: 2e3c25136adfb ("btrfs: extent_io: add proper error handling to lock_extent_buffer_for_io()") Fixes: f4340622e0226 ("btrfs: extent_io: Move the BUG_ON() in flush_write_bio() one level up") Reported-by: Zdenek Sojka <zsojka@seznam.cz> Link: https://lore.kernel.org/linux-btrfs/GpO.2yos.3WGDOLpx6t%7D.1TUDYM@seznam.cz/T/#u Reported-by: Stefan Priebe - Profihost AG <s.priebe@profihost.ag> Link: https://lore.kernel.org/linux-btrfs/5c4688ac-10a7-fb07-70e8-c5d31a3fbb38@profihost.ag/T/#t Reported-by: Drazen Kacar <drazen.kacar@oradian.com> Link: https://lore.kernel.org/linux-btrfs/DB8PR03MB562876ECE2319B3E579590F799C80@DB8PR03MB5628.eurprd03.prod.outlook.com/ Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=204377 Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-11-10btrfs: extent_io: add proper error handling to lock_extent_buffer_for_io()Qu Wenruo
commit 2e3c25136adfb293d517e17f761d3b8a43a8fc22 upstream. This function needs some extra checks on locked pages and eb. For error handling we need to unlock locked pages and the eb. There is a rare >0 return value branch, where all pages get locked while write bio is not flushed. Thankfully it's handled by the only caller, btree_write_cache_pages(), as later write_one_eb() call will trigger submit_one_bio(). So there shouldn't be any problem. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-11-10btrfs: extent_io: Handle errors better in btree_write_cache_pages()Qu Wenruo
commit 2b952eea813b1f7e7d4b9782271acd91625b9bb9 upstream. In btree_write_cache_pages(), we can only get @ret <= 0. Add an ASSERT() for it just in case. Then instead of submitting the write bio even we got some error, check the return value first. If we have already hit some error, just clean up the corrupted or half-baked bio, and return error. If there is no error so far, then call flush_write_bio() and return the result. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-11-10btrfs: extent_io: Handle errors better in extent_write_full_page()Qu Wenruo
commit 3065976b045f77a910809fa7699f99a1e7c0dbbb upstream. Since now flush_write_bio() could return error, kill the BUG_ON() first. Then don't call flush_write_bio() unconditionally, instead we check the return value from __extent_writepage() first. If __extent_writepage() fails, we do cleanup, and return error without submitting the possible corrupted or half-baked bio. If __extent_writepage() successes, then we call flush_write_bio() and return the result. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-11-10btrfs: flush write bio if we loop in extent_write_cache_pagesJosef Bacik
commit 42ffb0bf584ae5b6b38f72259af1e0ee417ac77f upstream. There exists a deadlock with range_cyclic that has existed forever. If we loop around with a bio already built we could deadlock with a writer who has the page locked that we're attempting to write but is waiting on a page in our bio to be written out. The task traces are as follows PID: 1329874 TASK: ffff889ebcdf3800 CPU: 33 COMMAND: "kworker/u113:5" #0 [ffffc900297bb658] __schedule at ffffffff81a4c33f #1 [ffffc900297bb6e0] schedule at ffffffff81a4c6e3 #2 [ffffc900297bb6f8] io_schedule at ffffffff81a4ca42 #3 [ffffc900297bb708] __lock_page at ffffffff811f145b #4 [ffffc900297bb798] __process_pages_contig at ffffffff814bc502 #5 [ffffc900297bb8c8] lock_delalloc_pages at ffffffff814bc684 #6 [ffffc900297bb900] find_lock_delalloc_range at ffffffff814be9ff #7 [ffffc900297bb9a0] writepage_delalloc at ffffffff814bebd0 #8 [ffffc900297bba18] __extent_writepage at ffffffff814bfbf2 #9 [ffffc900297bba98] extent_write_cache_pages at ffffffff814bffbd PID: 2167901 TASK: ffff889dc6a59c00 CPU: 14 COMMAND: "aio-dio-invalid" #0 [ffffc9003b50bb18] __schedule at ffffffff81a4c33f #1 [ffffc9003b50bba0] schedule at ffffffff81a4c6e3 #2 [ffffc9003b50bbb8] io_schedule at ffffffff81a4ca42 #3 [ffffc9003b50bbc8] wait_on_page_bit at ffffffff811f24d6 #4 [ffffc9003b50bc60] prepare_pages at ffffffff814b05a7 #5 [ffffc9003b50bcd8] btrfs_buffered_write at ffffffff814b1359 #6 [ffffc9003b50bdb0] btrfs_file_write_iter at ffffffff814b5933 #7 [ffffc9003b50be38] new_sync_write at ffffffff8128f6a8 #8 [ffffc9003b50bec8] vfs_write at ffffffff81292b9d #9 [ffffc9003b50bf00] ksys_pwrite64 at ffffffff81293032 I used drgn to find the respective pages we were stuck on page_entry.page 0xffffea00fbfc7500 index 8148 bit 15 pid 2167901 page_entry.page 0xffffea00f9bb7400 index 7680 bit 0 pid 1329874 As you can see the kworker is waiting for bit 0 (PG_locked) on index 7680, and aio-dio-invalid is waiting for bit 15 (PG_writeback) on index 8148. aio-dio-invalid has 7680, and the kworker epd looks like the following crash> struct extent_page_data ffffc900297bbbb0 struct extent_page_data { bio = 0xffff889f747ed830, tree = 0xffff889eed6ba448, extent_locked = 0, sync_io = 0 } Probably worth mentioning as well that it waits for writeback of the page to complete while holding a lock on it (at prepare_pages()). Using drgn I walked the bio pages looking for page 0xffffea00fbfc7500 which is the one we're waiting for writeback on bio = Object(prog, 'struct bio', address=0xffff889f747ed830) for i in range(0, bio.bi_vcnt.value_()): bv = bio.bi_io_vec[i] if bv.bv_page.value_() == 0xffffea00fbfc7500: print("FOUND IT") which validated what I suspected. The fix for this is simple, flush the epd before we loop back around to the beginning of the file during writeout. Fixes: b293f02e1423 ("Btrfs: Add writepages support") CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-11-10Revert "btrfs: flush write bio if we loop in extent_write_cache_pages"Ben Hutchings
This reverts commit 860473714cbe7fbedcf92bfe3eb6d69fae8c74ff. That has an incorrect upstream commit reference, and was modified in a way that conflicts with some older fixes. We can cleanly cherry-pick the upstream commit *after* those fixes. Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-11-10btrfs: extent_io: Move the BUG_ON() in flush_write_bio() one level upQu Wenruo
commit f4340622e02261fae599e3da936ff4808b418173 upstream. We have a BUG_ON() in flush_write_bio() to handle the return value of submit_one_bio(). Move the BUG_ON() one level up to all its callers. This patch will introduce temporary variable, @flush_ret to keep code change minimal in this patch. That variable will be cleaned up when enhancing the error handling later. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> [bwh: Cherry-picked for 4.19 to ease backporting later fixes] Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>