summaryrefslogtreecommitdiffstats
path: root/fs/btrfs/disk-io.c
AgeCommit message (Collapse)Author
2021-06-16btrfs: promote debugging asserts to full-fledged checks in validate_superNikolay Borisov
commit aefd7f7065567a4666f42c0fc8cdb379d2e036bf upstream. Syzbot managed to trigger this assert while performing its fuzzing. Turns out it's better to have those asserts turned into full-fledged checks so that in case buggy btrfs images are mounted the users gets an error and mounting is stopped. Alternatively with CONFIG_BTRFS_ASSERT disabled such image would have been erroneously allowed to be mounted. Reported-by: syzbot+a6bf271c02e4fe66b4e4@syzkaller.appspotmail.com CC: stable@vger.kernel.org # 5.4+ Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> [ add uuids to the messages ] Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-10-01btrfs: fix overflow when copying corrupt csums for a messageJohannes Thumshirn
commit 35be8851d172c6e3db836c0f28c19087b10c9e00 upstream. Syzkaller reported a buffer overflow in btree_readpage_end_io_hook() when loop mounting a crafted image: detected buffer overflow in memcpy ------------[ cut here ]------------ kernel BUG at lib/string.c:1129! invalid opcode: 0000 [#1] PREEMPT SMP KASAN CPU: 1 PID: 26 Comm: kworker/u4:2 Not tainted 5.9.0-rc4-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Workqueue: btrfs-endio-meta btrfs_work_helper RIP: 0010:fortify_panic+0xf/0x20 lib/string.c:1129 RSP: 0018:ffffc90000e27980 EFLAGS: 00010286 RAX: 0000000000000022 RBX: ffff8880a80dca64 RCX: 0000000000000000 RDX: ffff8880a90860c0 RSI: ffffffff815dba07 RDI: fffff520001c4f22 RBP: ffff8880a80dca00 R08: 0000000000000022 R09: ffff8880ae7318e7 R10: 0000000000000000 R11: 0000000000077578 R12: 00000000ffffff6e R13: 0000000000000008 R14: ffffc90000e27a40 R15: 1ffff920001c4f3c FS: 0000000000000000(0000) GS:ffff8880ae700000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000557335f440d0 CR3: 000000009647d000 CR4: 00000000001506e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: memcpy include/linux/string.h:405 [inline] btree_readpage_end_io_hook.cold+0x206/0x221 fs/btrfs/disk-io.c:642 end_bio_extent_readpage+0x4de/0x10c0 fs/btrfs/extent_io.c:2854 bio_endio+0x3cf/0x7f0 block/bio.c:1449 end_workqueue_fn+0x114/0x170 fs/btrfs/disk-io.c:1695 btrfs_work_helper+0x221/0xe20 fs/btrfs/async-thread.c:318 process_one_work+0x94c/0x1670 kernel/workqueue.c:2269 worker_thread+0x64c/0x1120 kernel/workqueue.c:2415 kthread+0x3b5/0x4a0 kernel/kthread.c:292 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:294 Modules linked in: ---[ end trace b68924293169feef ]--- RIP: 0010:fortify_panic+0xf/0x20 lib/string.c:1129 RSP: 0018:ffffc90000e27980 EFLAGS: 00010286 RAX: 0000000000000022 RBX: ffff8880a80dca64 RCX: 0000000000000000 RDX: ffff8880a90860c0 RSI: ffffffff815dba07 RDI: fffff520001c4f22 RBP: ffff8880a80dca00 R08: 0000000000000022 R09: ffff8880ae7318e7 R10: 0000000000000000 R11: 0000000000077578 R12: 00000000ffffff6e R13: 0000000000000008 R14: ffffc90000e27a40 R15: 1ffff920001c4f3c FS: 0000000000000000(0000) GS:ffff8880ae700000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f95b7c4d008 CR3: 000000009647d000 CR4: 00000000001506e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 The overflow happens, because in btree_readpage_end_io_hook() we assume that we have found a 4 byte checksum instead of the real possible 32 bytes we have for the checksums. With the fix applied: [ 35.726623] BTRFS: device fsid 815caf9a-dc43-4d2a-ac54-764b8333d765 devid 1 transid 5 /dev/loop0 scanned by syz-repro (215) [ 35.738994] BTRFS info (device loop0): disk space caching is enabled [ 35.738998] BTRFS info (device loop0): has skinny extents [ 35.743337] BTRFS warning (device loop0): loop0 checksum verify failed on 1052672 wanted 0xf9c035fc8d239a54 found 0x67a25c14b7eabcf9 level 0 [ 35.743420] BTRFS error (device loop0): failed to read chunk root [ 35.745899] BTRFS error (device loop0): open_ctree failed Reported-by: syzbot+e864a35d361e1d4e29a5@syzkaller.appspotmail.com Fixes: d5178578bcd4 ("btrfs: directly call into crypto framework for checksumming") CC: stable@vger.kernel.org # 5.4+ Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-09-03btrfs: fix space cache memory leak after transaction abortFilipe Manana
commit bbc37d6e475eee8ffa2156ec813efc6bbb43c06d upstream. If a transaction aborts it can cause a memory leak of the pages array of a block group's io_ctl structure. The following steps explain how that can happen: 1) Transaction N is committing, currently in state TRANS_STATE_UNBLOCKED and it's about to start writing out dirty extent buffers; 2) Transaction N + 1 already started and another task, task A, just called btrfs_commit_transaction() on it; 3) Block group B was dirtied (extents allocated from it) by transaction N + 1, so when task A calls btrfs_start_dirty_block_groups(), at the very beginning of the transaction commit, it starts writeback for the block group's space cache by calling btrfs_write_out_cache(), which allocates the pages array for the block group's io_ctl with a call to io_ctl_init(). Block group A is added to the io_list of transaction N + 1 by btrfs_start_dirty_block_groups(); 4) While transaction N's commit is writing out the extent buffers, it gets an IO error and aborts transaction N, also setting the file system to RO mode; 5) Task A has already returned from btrfs_start_dirty_block_groups(), is at btrfs_commit_transaction() and has set transaction N + 1 state to TRANS_STATE_COMMIT_START. Immediately after that it checks that the filesystem was turned to RO mode, due to transaction N's abort, and jumps to the "cleanup_transaction" label. After that we end up at btrfs_cleanup_one_transaction() which calls btrfs_cleanup_dirty_bgs(). That helper finds block group B in the transaction's io_list but it never releases the pages array of the block group's io_ctl, resulting in a memory leak. In fact at the point when we are at btrfs_cleanup_dirty_bgs(), the pages array points to pages that were already released by us at __btrfs_write_out_cache() through the call to io_ctl_drop_pages(). We end up freeing the pages array only after waiting for the ordered extent to complete through btrfs_wait_cache_io(), which calls io_ctl_free() to do that. But in the transaction abort case we don't wait for the space cache's ordered extent to complete through a call to btrfs_wait_cache_io(), so that's why we end up with a memory leak - we wait for the ordered extent to complete indirectly by shutting down the work queues and waiting for any jobs in them to complete before returning from close_ctree(). We can solve the leak simply by freeing the pages array right after releasing the pages (with the call to io_ctl_drop_pages()) at __btrfs_write_out_cache(), since we will never use it anymore after that and the pages array points to already released pages at that point, which is currently not a problem since no one will use it after that, but not a good practice anyway since it can easily lead to use-after-free issues. So fix this by freeing the pages array right after releasing the pages at __btrfs_write_out_cache(). This issue can often be reproduced with test case generic/475 from fstests and kmemleak can detect it and reports it with the following trace: unreferenced object 0xffff9bbf009fa600 (size 512): comm "fsstress", pid 38807, jiffies 4298504428 (age 22.028s) hex dump (first 32 bytes): 00 a0 7c 4d 3d ed ff ff 40 a0 7c 4d 3d ed ff ff ..|M=...@.|M=... 80 a0 7c 4d 3d ed ff ff c0 a0 7c 4d 3d ed ff ff ..|M=.....|M=... backtrace: [<00000000f4b5cfe2>] __kmalloc+0x1a8/0x3e0 [<0000000028665e7f>] io_ctl_init+0xa7/0x120 [btrfs] [<00000000a1f95b2d>] __btrfs_write_out_cache+0x86/0x4a0 [btrfs] [<00000000207ea1b0>] btrfs_write_out_cache+0x7f/0xf0 [btrfs] [<00000000af21f534>] btrfs_start_dirty_block_groups+0x27b/0x580 [btrfs] [<00000000c3c23d44>] btrfs_commit_transaction+0xa6f/0xe70 [btrfs] [<000000009588930c>] create_subvol+0x581/0x9a0 [btrfs] [<000000009ef2fd7f>] btrfs_mksubvol+0x3fb/0x4a0 [btrfs] [<00000000474e5187>] __btrfs_ioctl_snap_create+0x119/0x1a0 [btrfs] [<00000000708ee349>] btrfs_ioctl_snap_create_v2+0xb0/0xf0 [btrfs] [<00000000ea60106f>] btrfs_ioctl+0x12c/0x3130 [btrfs] [<000000005c923d6d>] __x64_sys_ioctl+0x83/0xb0 [<0000000043ace2c9>] do_syscall_64+0x33/0x80 [<00000000904efbce>] entry_SYSCALL_64_after_hwframe+0x44/0xa9 CC: stable@vger.kernel.org # 4.9+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-08-21btrfs: don't allocate anonymous block device for user invisible rootsQu Wenruo
commit 851fd730a743e072badaf67caf39883e32439431 upstream. [BUG] When a lot of subvolumes are created, there is a user report about transaction aborted: BTRFS: Transaction aborted (error -24) WARNING: CPU: 17 PID: 17041 at fs/btrfs/transaction.c:1576 create_pending_snapshot+0xbc4/0xd10 [btrfs] RIP: 0010:create_pending_snapshot+0xbc4/0xd10 [btrfs] Call Trace: create_pending_snapshots+0x82/0xa0 [btrfs] btrfs_commit_transaction+0x275/0x8c0 [btrfs] btrfs_mksubvol+0x4b9/0x500 [btrfs] btrfs_ioctl_snap_create_transid+0x174/0x180 [btrfs] btrfs_ioctl_snap_create_v2+0x11c/0x180 [btrfs] btrfs_ioctl+0x11a4/0x2da0 [btrfs] do_vfs_ioctl+0xa9/0x640 ksys_ioctl+0x67/0x90 __x64_sys_ioctl+0x1a/0x20 do_syscall_64+0x5a/0x110 entry_SYSCALL_64_after_hwframe+0x44/0xa9 ---[ end trace 33f2f83f3d5250e9 ]--- BTRFS: error (device sda1) in create_pending_snapshot:1576: errno=-24 unknown BTRFS info (device sda1): forced readonly BTRFS warning (device sda1): Skipping commit of aborted transaction. BTRFS: error (device sda1) in cleanup_transaction:1831: errno=-24 unknown [CAUSE] The error is EMFILE (Too many files open) and comes from the anonymous block device allocation. The ids are in a shared pool of size 1<<20. The ids are assigned to live subvolumes, ie. the root structure exists in memory (eg. after creation or after the root appears in some path). The pool could be exhausted if the numbers are not reclaimed fast enough, after subvolume deletion or if other system component uses the anon block devices. [WORKAROUND] Since it's not possible to completely solve the problem, we can only minimize the time the id is allocated to a subvolume root. Firstly, we can reduce the use of anon_dev by trees that are not subvolume roots, like data reloc tree. This patch will do extra check on root objectid, to skip roots that don't need anon_dev. Currently it's only data reloc tree and orphan roots. Reported-by: Greed Rong <greedrong@gmail.com> Link: https://lore.kernel.org/linux-btrfs/CA+UqX+NTrZ6boGnWHhSeZmEY5J76CTqmYjO2S+=tHJX7nb9DPw@mail.gmail.com/ CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-04-17btrfs: set update the uuid generation as soon as possibleJosef Bacik
commit 75ec1db8717a8f0a9d9c8d033e542fdaa7b73898 upstream. In my EIO stress testing I noticed I was getting forced to rescan the uuid tree pretty often, which was weird. This is because my error injection stuff would sometimes inject an error after log replay but before we loaded the UUID tree. If log replay committed the transaction it wouldn't have updated the uuid tree generation, but the tree was valid and didn't change, so there's no reason to not update the generation here. Fix this by setting the BTRFS_FS_UPDATE_UUID_TREE_GEN bit immediately after reading all the fs roots if the uuid tree generation matches the fs generation. Then any transaction commits that happen during mount won't screw up our uuid tree state, forcing us to do needless uuid rescans. Fixes: 70f801754728 ("Btrfs: check UUID tree during mount if required") CC: stable@vger.kernel.org # 4.19+ Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-04-17Btrfs: fix crash during unmount due to race with delayed inode workersFilipe Manana
commit f0cc2cd70164efe8f75c5d99560f0f69969c72e4 upstream. During unmount we can have a job from the delayed inode items work queue still running, that can lead to at least two bad things: 1) A crash, because the worker can try to create a transaction just after the fs roots were freed; 2) A transaction leak, because the worker can create a transaction before the fs roots are freed and just after we committed the last transaction and after we stopped the transaction kthread. A stack trace example of the crash: [79011.691214] kernel BUG at lib/radix-tree.c:982! [79011.692056] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI [79011.693180] CPU: 3 PID: 1394 Comm: kworker/u8:2 Tainted: G W 5.6.0-rc2-btrfs-next-54 #2 (...) [79011.696789] Workqueue: btrfs-delayed-meta btrfs_work_helper [btrfs] [79011.697904] RIP: 0010:radix_tree_tag_set+0xe7/0x170 (...) [79011.702014] RSP: 0018:ffffb3c84a317ca0 EFLAGS: 00010293 [79011.702949] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000 [79011.704202] RDX: ffffb3c84a317cb0 RSI: ffffb3c84a317ca8 RDI: ffff8db3931340a0 [79011.705463] RBP: 0000000000000005 R08: 0000000000000005 R09: ffffffff974629d0 [79011.706756] R10: ffffb3c84a317bc0 R11: 0000000000000001 R12: ffff8db393134000 [79011.708010] R13: ffff8db3931340a0 R14: ffff8db393134068 R15: 0000000000000001 [79011.709270] FS: 0000000000000000(0000) GS:ffff8db3b6a00000(0000) knlGS:0000000000000000 [79011.710699] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [79011.711710] CR2: 00007f22c2a0a000 CR3: 0000000232ad4005 CR4: 00000000003606e0 [79011.712958] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [79011.714205] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [79011.715448] Call Trace: [79011.715925] record_root_in_trans+0x72/0xf0 [btrfs] [79011.716819] btrfs_record_root_in_trans+0x4b/0x70 [btrfs] [79011.717925] start_transaction+0xdd/0x5c0 [btrfs] [79011.718829] btrfs_async_run_delayed_root+0x17e/0x2b0 [btrfs] [79011.719915] btrfs_work_helper+0xaa/0x720 [btrfs] [79011.720773] process_one_work+0x26d/0x6a0 [79011.721497] worker_thread+0x4f/0x3e0 [79011.722153] ? process_one_work+0x6a0/0x6a0 [79011.722901] kthread+0x103/0x140 [79011.723481] ? kthread_create_worker_on_cpu+0x70/0x70 [79011.724379] ret_from_fork+0x3a/0x50 (...) The following diagram shows a sequence of steps that lead to the crash during ummount of the filesystem: CPU 1 CPU 2 CPU 3 btrfs_punch_hole() btrfs_btree_balance_dirty() btrfs_balance_delayed_items() --> sees fs_info->delayed_root->items with value 200, which is greater than BTRFS_DELAYED_BACKGROUND (128) and smaller than BTRFS_DELAYED_WRITEBACK (512) btrfs_wq_run_delayed_node() --> queues a job for fs_info->delayed_workers to run btrfs_async_run_delayed_root() btrfs_async_run_delayed_root() --> job queued by CPU 1 --> starts picking and running delayed nodes from the prepare_list list close_ctree() btrfs_delete_unused_bgs() btrfs_commit_super() btrfs_join_transaction() --> gets transaction N btrfs_commit_transaction(N) --> set transaction state to TRANTS_STATE_COMMIT_START btrfs_first_prepared_delayed_node() --> picks delayed node X through the prepared_list list btrfs_run_delayed_items() btrfs_first_delayed_node() --> also picks delayed node X but through the node_list list __btrfs_commit_inode_delayed_items() --> runs all delayed items from this node and drops the node's item count to 0 through call to btrfs_release_delayed_inode() --> finishes running any remaining delayed nodes --> finishes transaction commit --> stops cleaner and transaction threads btrfs_free_fs_roots() --> frees all roots and removes them from the radix tree fs_info->fs_roots_radix btrfs_join_transaction() start_transaction() btrfs_record_root_in_trans() record_root_in_trans() radix_tree_tag_set() --> crashes because the root is not in the radix tree anymore If the worker is able to call btrfs_join_transaction() before the unmount task frees the fs roots, we end up leaking a transaction and all its resources, since after the call to btrfs_commit_super() and stopping the transaction kthread, we don't expect to have any transaction open anymore. When this situation happens the worker has a delayed node that has no more items to run, since the task calling btrfs_run_delayed_items(), which is doing a transaction commit, picks the same node and runs all its items first. We can not wait for the worker to complete when running delayed items through btrfs_run_delayed_items(), because we call that function in several phases of a transaction commit, and that could cause a deadlock because the worker calls btrfs_join_transaction() and the task doing the transaction commit may have already set the transaction state to TRANS_STATE_COMMIT_DOING. Also it's not possible to get into a situation where only some of the items of a delayed node are added to the fs/subvolume tree in the current transaction and the remaining ones in the next transaction, because when running the items of a delayed inode we lock its mutex, effectively waiting for the worker if the worker is running the items of the delayed node already. Since this can only cause issues when unmounting a filesystem, fix it in a simple way by waiting for any jobs on the delayed workers queue before calling btrfs_commit_supper() at close_ctree(). This works because at this point no one can call btrfs_btree_balance_dirty() or btrfs_balance_delayed_items(), and if we end up waiting for any worker to complete, btrfs_commit_super() will commit the transaction created by the worker. CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-02-28btrfs: do not check delayed items are empty for single transaction cleanupJosef Bacik
commit 1e90315149f3fe148e114a5de86f0196d1c21fa5 upstream. btrfs_assert_delayed_root_empty() will check if the delayed root is completely empty, but this is a filesystem-wide check. On cleanup we may have allowed other transactions to begin, for whatever reason, and thus the delayed root is not empty. So remove this check from cleanup_one_transation(). This however can stay in btrfs_cleanup_transaction(), because it checks only after all of the transactions have been properly cleaned up, and thus is valid. CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-02-28btrfs: reset fs_root to NULL on error in open_ctreeJosef Bacik
commit 315bf8ef914f31d51d084af950703aa1e09a728c upstream. While running my error injection script I hit a panic when we tried to clean up the fs_root when freeing the fs_root. This is because fs_info->fs_root == PTR_ERR(-EIO), which isn't great. Fix this by setting fs_info->fs_root = NULL; if we fail to read the root. CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-02-28btrfs: destroy qgroup extent records on transaction abortJeff Mahoney
commit 81f7eb00ff5bb8326e82503a32809421d14abb8a upstream. We clean up the delayed references when we abort a transaction but we leave the pending qgroup extent records behind, leaking memory. This patch destroys the extent records when we destroy the delayed refs and makes sure ensure they're gone before releasing the transaction. Fixes: 3368d001ba5d ("btrfs: qgroup: Record possible quota-related extent for qgroup.") CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Jeff Mahoney <jeffm@suse.com> [ Rebased to latest upstream, remove to_qgroup() helper, use rbtree_postorder_for_each_entry_safe() wrapper ] Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-02-19btrfs: print message when tree-log replay startsDavid Sterba
commit e8294f2f6aa6208ed0923aa6d70cea3be178309a upstream. There's no logged information about tree-log replay although this is something that points to previous unclean unmount. Other filesystems report that as well. Suggested-by: Chris Murphy <lists@colorremedies.com> CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-02-11btrfs: free block groups after free'ing fs treesJosef Bacik
[ Upstream commit 4e19443da1941050b346f8fc4c368aa68413bc88 ] Sometimes when running generic/475 we would trip the WARN_ON(cache->reserved) check when free'ing the block groups on umount. This is because sometimes we don't commit the transaction because of IO errors and thus do not cleanup the tree logs until at umount time. These blocks are still reserved until they are cleaned up, but they aren't cleaned up until _after_ we do the free block groups work. Fix this by moving the free after free'ing the fs roots, that way all of the tree logs are cleaned up and we have a properly cleaned fs. A bunch of loops of generic/475 confirmed this fixes the problem. CC: stable@vger.kernel.org # 4.9+ Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-02-11btrfs: use bool argument in free_root_pointers()Anand Jain
[ Upstream commit 4273eaff9b8d5e141113a5bdf9628c02acf3afe5 ] We don't need int argument bool shall do in free_root_pointers(). And rename the argument as it confused two people. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-02-11Btrfs: fix race between adding and putting tree mod seq elements and nodesFilipe Manana
commit 7227ff4de55d931bbdc156c8ef0ce4f100c78a5b upstream. There is a race between adding and removing elements to the tree mod log list and rbtree that can lead to use-after-free problems. Consider the following example that explains how/why the problems happens: 1) Task A has mod log element with sequence number 200. It currently is the only element in the mod log list; 2) Task A calls btrfs_put_tree_mod_seq() because it no longer needs to access the tree mod log. When it enters the function, it initializes 'min_seq' to (u64)-1. Then it acquires the lock 'tree_mod_seq_lock' before checking if there are other elements in the mod seq list. Since the list it empty, 'min_seq' remains set to (u64)-1. Then it unlocks the lock 'tree_mod_seq_lock'; 3) Before task A acquires the lock 'tree_mod_log_lock', task B adds itself to the mod seq list through btrfs_get_tree_mod_seq() and gets a sequence number of 201; 4) Some other task, name it task C, modifies a btree and because there elements in the mod seq list, it adds a tree mod elem to the tree mod log rbtree. That node added to the mod log rbtree is assigned a sequence number of 202; 5) Task B, which is doing fiemap and resolving indirect back references, calls btrfs get_old_root(), with 'time_seq' == 201, which in turn calls tree_mod_log_search() - the search returns the mod log node from the rbtree with sequence number 202, created by task C; 6) Task A now acquires the lock 'tree_mod_log_lock', starts iterating the mod log rbtree and finds the node with sequence number 202. Since 202 is less than the previously computed 'min_seq', (u64)-1, it removes the node and frees it; 7) Task B still has a pointer to the node with sequence number 202, and it dereferences the pointer itself and through the call to __tree_mod_log_rewind(), resulting in a use-after-free problem. This issue can be triggered sporadically with the test case generic/561 from fstests, and it happens more frequently with a higher number of duperemove processes. When it happens to me, it either freezes the VM or it produces a trace like the following before crashing: [ 1245.321140] general protection fault: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI [ 1245.321200] CPU: 1 PID: 26997 Comm: pool Not tainted 5.5.0-rc6-btrfs-next-52 #1 [ 1245.321235] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-0-ga698c8995f-prebuilt.qemu.org 04/01/2014 [ 1245.321287] RIP: 0010:rb_next+0x16/0x50 [ 1245.321307] Code: .... [ 1245.321372] RSP: 0018:ffffa151c4d039b0 EFLAGS: 00010202 [ 1245.321388] RAX: 6b6b6b6b6b6b6b6b RBX: ffff8ae221363c80 RCX: 6b6b6b6b6b6b6b6b [ 1245.321409] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff8ae221363c80 [ 1245.321439] RBP: ffff8ae20fcc4688 R08: 0000000000000002 R09: 0000000000000000 [ 1245.321475] R10: ffff8ae20b120910 R11: 00000000243f8bb1 R12: 0000000000000038 [ 1245.321506] R13: ffff8ae221363c80 R14: 000000000000075f R15: ffff8ae223f762b8 [ 1245.321539] FS: 00007fdee1ec7700(0000) GS:ffff8ae236c80000(0000) knlGS:0000000000000000 [ 1245.321591] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 1245.321614] CR2: 00007fded4030c48 CR3: 000000021da16003 CR4: 00000000003606e0 [ 1245.321642] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 1245.321668] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 1245.321706] Call Trace: [ 1245.321798] __tree_mod_log_rewind+0xbf/0x280 [btrfs] [ 1245.321841] btrfs_search_old_slot+0x105/0xd00 [btrfs] [ 1245.321877] resolve_indirect_refs+0x1eb/0xc60 [btrfs] [ 1245.321912] find_parent_nodes+0x3dc/0x11b0 [btrfs] [ 1245.321947] btrfs_check_shared+0x115/0x1c0 [btrfs] [ 1245.321980] ? extent_fiemap+0x59d/0x6d0 [btrfs] [ 1245.322029] extent_fiemap+0x59d/0x6d0 [btrfs] [ 1245.322066] do_vfs_ioctl+0x45a/0x750 [ 1245.322081] ksys_ioctl+0x70/0x80 [ 1245.322092] ? trace_hardirqs_off_thunk+0x1a/0x1c [ 1245.322113] __x64_sys_ioctl+0x16/0x20 [ 1245.322126] do_syscall_64+0x5c/0x280 [ 1245.322139] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 1245.322155] RIP: 0033:0x7fdee3942dd7 [ 1245.322177] Code: .... [ 1245.322258] RSP: 002b:00007fdee1ec6c88 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 [ 1245.322294] RAX: ffffffffffffffda RBX: 00007fded40210d8 RCX: 00007fdee3942dd7 [ 1245.322314] RDX: 00007fded40210d8 RSI: 00000000c020660b RDI: 0000000000000004 [ 1245.322337] RBP: 0000562aa89e7510 R08: 0000000000000000 R09: 00007fdee1ec6d44 [ 1245.322369] R10: 0000000000000073 R11: 0000000000000246 R12: 00007fdee1ec6d48 [ 1245.322390] R13: 00007fdee1ec6d40 R14: 00007fded40210d0 R15: 00007fdee1ec6d50 [ 1245.322423] Modules linked in: .... [ 1245.323443] ---[ end trace 01de1e9ec5dff3cd ]--- Fix this by ensuring that btrfs_put_tree_mod_seq() computes the minimum sequence number and iterates the rbtree while holding the lock 'tree_mod_log_lock' in write mode. Also get rid of the 'tree_mod_seq_lock' lock, since it is now redundant. Fixes: bd989ba359f2ac ("Btrfs: add tree modification log functions") Fixes: 097b8a7c9e48e2 ("Btrfs: join tree mod log code with the code holding back delayed refs") CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-01-09btrfs: get rid of unique workqueue helper functionsOmar Sandoval
[ Upstream commit a0cac0ec961f0d42828eeef196ac2246a2f07659 ] Commit 9e0af2376434 ("Btrfs: fix task hang under heavy compressed write") worked around the issue that a recycled work item could get a false dependency on the original work item due to how the workqueue code guarantees non-reentrancy. It did so by giving different work functions to different types of work. However, the fixes in the previous few patches are more complete, as they prevent a work item from being recycled at all (except for a tiny window that the kernel workqueue code handles for us). This obsoletes the previous fix, so we don't need the unique helpers for correctness. The only other reason to keep them would be so they show up in stack traces, but they always seem to be optimized to a tail call, so they don't show up anyways. So, let's just get rid of the extra indirection. While we're here, rename normal_work_helper() to the more informative btrfs_work_helper(). Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-12-31btrfs: don't prematurely free work in end_workqueue_fn()Omar Sandoval
[ Upstream commit 9be490f1e15c34193b1aae17da58e14dd9f55a95 ] Currently, end_workqueue_fn() frees the end_io_wq entry (which embeds the work item) and then calls bio_endio(). This is another potential instance of the bug in "btrfs: don't prematurely free work in run_ordered_work()". In particular, the endio call may depend on other work items. For example, btrfs_end_dio_bio() can call btrfs_subio_endio_read() -> __btrfs_correct_data_nocsum() -> dio_read_error() -> submit_dio_repair_bio(), which submits a bio that is also completed through a end_workqueue_fn() work item. However, __btrfs_correct_data_nocsum() waits for the newly submitted bio to complete, thus it depends on another work item. This example currently usually works because we use different workqueue helper functions for BTRFS_WQ_ENDIO_DATA and BTRFS_WQ_ENDIO_DIO_REPAIR. However, it may deadlock with stacked filesystems and is fragile overall. The proper fix is to free the work item at the very end of the work function, so let's do that. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-10-15btrfs: don't needlessly create extent-refs kernel threadDavid Sterba
The patch 32b593bfcb58 ("Btrfs: remove no longer used function to run delayed refs asynchronously") removed the async delayed refs but the thread has been created, without any use. Remove it to avoid resource consumption. Fixes: 32b593bfcb58 ("Btrfs: remove no longer used function to run delayed refs asynchronously") CC: stable@vger.kernel.org # 5.2+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-09-09btrfs: Detect unbalanced tree with empty leaf before crashing btree operationsQu Wenruo
[BUG] With crafted image, btrfs will panic at btree operations: kernel BUG at fs/btrfs/ctree.c:3894! invalid opcode: 0000 [#1] SMP PTI CPU: 0 PID: 1138 Comm: btrfs-transacti Not tainted 5.0.0-rc8+ #9 RIP: 0010:__push_leaf_left+0x6b6/0x6e0 RSP: 0018:ffffc0bd4128b990 EFLAGS: 00010246 RAX: 0000000000000000 RBX: ffffa0a4ab8f0e38 RCX: 0000000000000000 RDX: ffffa0a280000000 RSI: 0000000000000000 RDI: ffffa0a4b3814000 RBP: ffffc0bd4128ba38 R08: 0000000000001000 R09: ffffc0bd4128b948 R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000240 R13: ffffa0a4b556fb60 R14: ffffa0a4ab8f0af0 R15: ffffa0a4ab8f0af0 FS: 0000000000000000(0000) GS:ffffa0a4b7a00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f2461c80020 CR3: 000000022b32a006 CR4: 00000000000206f0 Call Trace: ? _cond_resched+0x1a/0x50 push_leaf_left+0x179/0x190 btrfs_del_items+0x316/0x470 btrfs_del_csums+0x215/0x3a0 __btrfs_free_extent.isra.72+0x5a7/0xbe0 __btrfs_run_delayed_refs+0x539/0x1120 btrfs_run_delayed_refs+0xdb/0x1b0 btrfs_commit_transaction+0x52/0x950 ? start_transaction+0x94/0x450 transaction_kthread+0x163/0x190 kthread+0x105/0x140 ? btrfs_cleanup_transaction+0x560/0x560 ? kthread_destroy_worker+0x50/0x50 ret_from_fork+0x35/0x40 Modules linked in: ---[ end trace c2425e6e89b5558f ]--- [CAUSE] The offending csum tree looks like this: checksum tree key (CSUM_TREE ROOT_ITEM 0) node 29741056 level 1 items 14 free 107 generation 19 owner CSUM_TREE ... key (EXTENT_CSUM EXTENT_CSUM 85975040) block 29630464 gen 17 key (EXTENT_CSUM EXTENT_CSUM 89911296) block 29642752 gen 17 <<< key (EXTENT_CSUM EXTENT_CSUM 92274688) block 29646848 gen 17 ... leaf 29630464 items 6 free space 1 generation 17 owner CSUM_TREE item 0 key (EXTENT_CSUM EXTENT_CSUM 85975040) itemoff 3987 itemsize 8 range start 85975040 end 85983232 length 8192 ... leaf 29642752 items 0 free space 3995 generation 17 owner 0 ^ empty leaf invalid owner ^ leaf 29646848 items 1 free space 602 generation 17 owner CSUM_TREE item 0 key (EXTENT_CSUM EXTENT_CSUM 92274688) itemoff 627 itemsize 3368 range start 92274688 end 95723520 length 3448832 So we have a corrupted csum tree where one tree leaf is completely empty, causing unbalanced btree, thus leading to unexpected btree balance error. [FIX] For this particular case, we handle it in two directions to catch it: - Check if the tree block is empty through btrfs_verify_level_key() So that invalid tree blocks won't be read out through btrfs_search_slot() and its variants. - Check 0 tree owner in tree checker NO tree is using 0 as its tree owner, detect it and reject at tree block read time. Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=202821 Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-09-09btrfs: Make reada_tree_block_flagged privateNikolay Borisov
This function is used only for the readahead machinery. It makes no sense to keep it external to reada.c file. Place it above its sole caller and make it static. No functional changes. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-09-09btrfs: move basic block_group definitions to their own headerJosef Bacik
This is prep work for moving all of the block group cache code into its own file. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> [ minor comment updates ] Signed-off-by: David Sterba <dsterba@suse.com>
2019-08-07Btrfs: fix sysfs warning and missing raid sysfs directoriesFilipe Manana
In the 5.3 merge window, commit 7c7e301406d0a9 ("btrfs: sysfs: Replace default_attrs in ktypes with groups"), we started using the member "defaults_groups" for the kobject type "btrfs_raid_ktype". That leads to a series of warnings when running some test cases of fstests, such as btrfs/027, btrfs/124 and btrfs/176. The traces produced by those warnings are like the following: [116648.059212] kernfs: can not remove 'total_bytes', no directory [116648.060112] WARNING: CPU: 3 PID: 28500 at fs/kernfs/dir.c:1504 kernfs_remove_by_name_ns+0x75/0x80 (...) [116648.066482] CPU: 3 PID: 28500 Comm: umount Tainted: G W 5.3.0-rc3-btrfs-next-54 #1 (...) [116648.069376] RIP: 0010:kernfs_remove_by_name_ns+0x75/0x80 (...) [116648.072385] RSP: 0018:ffffabfd0090bd08 EFLAGS: 00010282 [116648.073437] RAX: 0000000000000000 RBX: ffffffffc0c11998 RCX: 0000000000000000 [116648.074201] RDX: ffff9fff603a7a00 RSI: ffff9fff603978a8 RDI: ffff9fff603978a8 [116648.074956] RBP: ffffffffc0b9ca2f R08: 0000000000000000 R09: 0000000000000001 [116648.075708] R10: ffff9ffe1f72e1c0 R11: 0000000000000000 R12: ffffffffc0b94120 [116648.076434] R13: ffffffffb3d9b4e0 R14: 0000000000000000 R15: dead000000000100 [116648.077143] FS: 00007f9cdc78a2c0(0000) GS:ffff9fff60380000(0000) knlGS:0000000000000000 [116648.077852] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [116648.078546] CR2: 00007f9fc4747ab4 CR3: 00000005c7832003 CR4: 00000000003606e0 [116648.079235] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [116648.079907] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [116648.080585] Call Trace: [116648.081262] remove_files+0x31/0x70 [116648.081929] sysfs_remove_group+0x38/0x80 [116648.082596] sysfs_remove_groups+0x34/0x70 [116648.083258] kobject_del+0x20/0x60 [116648.083933] btrfs_free_block_groups+0x405/0x430 [btrfs] [116648.084608] close_ctree+0x19a/0x380 [btrfs] [116648.085278] generic_shutdown_super+0x6c/0x110 [116648.085951] kill_anon_super+0xe/0x30 [116648.086621] btrfs_kill_super+0x12/0xa0 [btrfs] [116648.087289] deactivate_locked_super+0x3a/0x70 [116648.087956] cleanup_mnt+0xb4/0x160 [116648.088620] task_work_run+0x7e/0xc0 [116648.089285] exit_to_usermode_loop+0xfa/0x100 [116648.089933] do_syscall_64+0x1cb/0x220 [116648.090567] entry_SYSCALL_64_after_hwframe+0x49/0xbe [116648.091197] RIP: 0033:0x7f9cdc073b37 (...) [116648.100046] ---[ end trace 22e24db328ccadf8 ]--- [116648.100618] ------------[ cut here ]------------ [116648.101175] kernfs: can not remove 'used_bytes', no directory [116648.101731] WARNING: CPU: 3 PID: 28500 at fs/kernfs/dir.c:1504 kernfs_remove_by_name_ns+0x75/0x80 (...) [116648.105649] CPU: 3 PID: 28500 Comm: umount Tainted: G W 5.3.0-rc3-btrfs-next-54 #1 (...) [116648.107461] RIP: 0010:kernfs_remove_by_name_ns+0x75/0x80 (...) [116648.109336] RSP: 0018:ffffabfd0090bd08 EFLAGS: 00010282 [116648.109979] RAX: 0000000000000000 RBX: ffffffffc0c119a0 RCX: 0000000000000000 [116648.110625] RDX: ffff9fff603a7a00 RSI: ffff9fff603978a8 RDI: ffff9fff603978a8 [116648.111283] RBP: ffffffffc0b9ca41 R08: 0000000000000000 R09: 0000000000000001 [116648.111940] R10: ffff9ffe1f72e1c0 R11: 0000000000000000 R12: ffffffffc0b94120 [116648.112603] R13: ffffffffb3d9b4e0 R14: 0000000000000000 R15: dead000000000100 [116648.113268] FS: 00007f9cdc78a2c0(0000) GS:ffff9fff60380000(0000) knlGS:0000000000000000 [116648.113939] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [116648.114607] CR2: 00007f9fc4747ab4 CR3: 00000005c7832003 CR4: 00000000003606e0 [116648.115286] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [116648.115966] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [116648.116649] Call Trace: [116648.117326] remove_files+0x31/0x70 [116648.117997] sysfs_remove_group+0x38/0x80 [116648.118671] sysfs_remove_groups+0x34/0x70 [116648.119342] kobject_del+0x20/0x60 [116648.120022] btrfs_free_block_groups+0x405/0x430 [btrfs] [116648.120707] close_ctree+0x19a/0x380 [btrfs] [116648.121396] generic_shutdown_super+0x6c/0x110 [116648.122057] kill_anon_super+0xe/0x30 [116648.122702] btrfs_kill_super+0x12/0xa0 [btrfs] [116648.123335] deactivate_locked_super+0x3a/0x70 [116648.123961] cleanup_mnt+0xb4/0x160 [116648.124586] task_work_run+0x7e/0xc0 [116648.125210] exit_to_usermode_loop+0xfa/0x100 [116648.125830] do_syscall_64+0x1cb/0x220 [116648.126463] entry_SYSCALL_64_after_hwframe+0x49/0xbe [116648.127080] RIP: 0033:0x7f9cdc073b37 (...) [116648.135923] ---[ end trace 22e24db328ccadf9 ]--- These happen because, during the unmount path, we call kobject_del() for raid kobjects that are not fully initialized, meaning that we set their ktype (as btrfs_raid_ktype) through link_block_group() but we didn't set their parent kobject, which is done through btrfs_add_raid_kobjects(). We have this split raid kobject setup since commit 75cb379d263521 ("btrfs: defer adding raid type kobject until after chunk relocation") in order to avoid triggering reclaim during contextes where we can not (either we are holding a transaction handle or some lock required by the transaction commit path), so that we do the calls to kobject_add(), which triggers GFP_KERNEL allocations, through btrfs_add_raid_kobjects() in contextes where it is safe to trigger reclaim. That change expected that a new raid kobject can only be created either when mounting the filesystem or after raid profile conversion through the relocation path. However, we can have new raid kobject created in other two cases at least: 1) During device replace (or scrub) after adding a device a to the filesystem. The replace procedure (and scrub) do calls to btrfs_inc_block_group_ro() which can allocate a new block group with a new raid profile (because we now have more devices). This can be triggered by test cases btrfs/027 and btrfs/176. 2) During a degraded mount trough any write path. This can be triggered by test case btrfs/124. Fixing this by adding extra calls to btrfs_add_raid_kobjects(), not only makes things more complex and fragile, can also introduce deadlocks with reclaim the following way: 1) Calling btrfs_add_raid_kobjects() at btrfs_inc_block_group_ro() or anywhere in the replace/scrub path will cause a deadlock with reclaim because if reclaim happens and a transaction commit is triggered, the transaction commit path will block at btrfs_scrub_pause(). 2) During degraded mounts it is essentially impossible to figure out where to add extra calls to btrfs_add_raid_kobjects(), because allocation of a block group with a new raid profile can happen anywhere, which means we can't safely figure out which contextes are safe for reclaim, as we can either hold a transaction handle or some lock needed by the transaction commit path. So it is too complex and error prone to have this split setup of raid kobjects. So fix the issue by consolidating the setup of the kobjects in a single place, at link_block_group(), and setup a nofs context there in order to prevent reclaim being triggered by the memory allocations done through the call chain of kobject_add(). Besides fixing the sysfs warnings during kobject_del(), this also ensures the sysfs directories for the new raid profiles end up created and visible to users (a bug that existed before the 5.3 commit 7c7e301406d0a9 ("btrfs: sysfs: Replace default_attrs in ktypes with groups")). Fixes: 75cb379d263521 ("btrfs: defer adding raid type kobject until after chunk relocation") Fixes: 7c7e301406d0a9 ("btrfs: sysfs: Replace default_attrs in ktypes with groups") Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-17btrfs: free checksum hash on in close_ctreeJohannes Thumshirn
fs_info::csum_hash gets initialized in btrfs_init_csum_hash() which is called by open_ctree(). But it only gets freed if open_ctree() fails, not on normal operation. This leads to a memory leak like the following found by kmemleak: unreferenced object 0xffff888132cb8720 (size 96): comm "mount", pid 450, jiffies 4294912436 (age 17.584s) hex dump (first 32 bytes): 04 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ backtrace: [<000000000c9643d4>] crypto_create_tfm+0x2d/0xd0 [<00000000ae577f68>] crypto_alloc_tfm+0x4b/0xb0 [<000000002b5cdf30>] open_ctree+0xb84/0x2060 [btrfs] [<0000000043204297>] btrfs_mount_root+0x552/0x640 [btrfs] [<00000000c99b10ea>] legacy_get_tree+0x22/0x40 [<0000000071a6495f>] vfs_get_tree+0x1f/0xc0 [<00000000f180080e>] fc_mount+0x9/0x30 [<000000009e36cebd>] vfs_kern_mount.part.11+0x6a/0x80 [<0000000004594c05>] btrfs_mount+0x174/0x910 [btrfs] [<00000000c99b10ea>] legacy_get_tree+0x22/0x40 [<0000000071a6495f>] vfs_get_tree+0x1f/0xc0 [<00000000b86e92c5>] do_mount+0x6b0/0x940 [<0000000097464494>] ksys_mount+0x7b/0xd0 [<0000000057213c80>] __x64_sys_mount+0x1c/0x20 [<00000000cb689b5e>] do_syscall_64+0x43/0x130 [<000000002194e289>] entry_SYSCALL_64_after_hwframe+0x44/0xa9 Free fs_info::csum_hash in close_ctree() to avoid the memory leak. Fixes: 6d97c6e31b55 ("btrfs: add boilerplate code for directly including the crypto framework") Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-02Btrfs: prevent send failures and crashes due to concurrent relocationFilipe Manana
Send always operates on read-only trees and always expected that while it is in progress, nothing changes in those trees. Due to that expectation and the fact that send is a read-only operation, it operates on commit roots and does not hold transaction handles. However relocation can COW nodes and leafs from read-only trees, which can cause unexpected failures and crashes (hitting BUG_ONs). while send using a node/leaf, it gets COWed, the transaction used to COW it is committed, a new transaction starts, the extent previously used for that node/leaf gets allocated, possibly for another tree, and the respective extent buffer' content changes while send is still using it. When this happens send normally fails with EIO being returned to user space and messages like the following are found in dmesg/syslog: [ 3408.699121] BTRFS error (device sdc): parent transid verify failed on 58703872 wanted 250 found 253 [ 3441.523123] BTRFS error (device sdc): did not find backref in send_root. inode=63211, offset=0, disk_byte=5222825984 found extent=5222825984 Other times, less often, we hit a BUG_ON() because an extent buffer that send is using used to be a node, and while send is still using it, it got COWed and got reused as a leaf while send is still using, producing the following trace: [ 3478.466280] ------------[ cut here ]------------ [ 3478.466282] kernel BUG at fs/btrfs/ctree.c:1806! [ 3478.466965] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC PTI [ 3478.467635] CPU: 0 PID: 2165 Comm: btrfs Not tainted 5.0.0-btrfs-next-46 #1 [ 3478.468311] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014 [ 3478.469681] RIP: 0010:read_node_slot+0x122/0x130 [btrfs] (...) [ 3478.471758] RSP: 0018:ffffa437826bfaa0 EFLAGS: 00010246 [ 3478.472457] RAX: ffff961416ed7000 RBX: 000000000000003d RCX: 0000000000000002 [ 3478.473151] RDX: 000000000000003d RSI: ffff96141e387408 RDI: ffff961599b30000 [ 3478.473837] RBP: ffffa437826bfb8e R08: 0000000000000001 R09: ffffa437826bfb8e [ 3478.474515] R10: ffffa437826bfa70 R11: 0000000000000000 R12: ffff9614385c8708 [ 3478.475186] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 [ 3478.475840] FS: 00007f8e0e9cc8c0(0000) GS:ffff9615b6a00000(0000) knlGS:0000000000000000 [ 3478.476489] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 3478.477127] CR2: 00007f98b67a056e CR3: 0000000005df6005 CR4: 00000000003606f0 [ 3478.477762] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 3478.478385] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 3478.479003] Call Trace: [ 3478.479600] ? do_raw_spin_unlock+0x49/0xc0 [ 3478.480202] tree_advance+0x173/0x1d0 [btrfs] [ 3478.480810] btrfs_compare_trees+0x30c/0x690 [btrfs] [ 3478.481388] ? process_extent+0x1280/0x1280 [btrfs] [ 3478.481954] btrfs_ioctl_send+0x1037/0x1270 [btrfs] [ 3478.482510] _btrfs_ioctl_send+0x80/0x110 [btrfs] [ 3478.483062] btrfs_ioctl+0x13fe/0x3120 [btrfs] [ 3478.483581] ? rq_clock_task+0x2e/0x60 [ 3478.484086] ? wake_up_new_task+0x1f3/0x370 [ 3478.484582] ? do_vfs_ioctl+0xa2/0x6f0 [ 3478.485075] ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs] [ 3478.485552] do_vfs_ioctl+0xa2/0x6f0 [ 3478.486016] ? __fget+0x113/0x200 [ 3478.486467] ksys_ioctl+0x70/0x80 [ 3478.486911] __x64_sys_ioctl+0x16/0x20 [ 3478.487337] do_syscall_64+0x60/0x1b0 [ 3478.487751] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 3478.488159] RIP: 0033:0x7f8e0d7d4dd7 (...) [ 3478.489349] RSP: 002b:00007ffcf6fb4908 EFLAGS: 00000202 ORIG_RAX: 0000000000000010 [ 3478.489742] RAX: ffffffffffffffda RBX: 0000000000000105 RCX: 00007f8e0d7d4dd7 [ 3478.490142] RDX: 00007ffcf6fb4990 RSI: 0000000040489426 RDI: 0000000000000005 [ 3478.490548] RBP: 0000000000000005 R08: 00007f8e0d6f3700 R09: 00007f8e0d6f3700 [ 3478.490953] R10: 00007f8e0d6f39d0 R11: 0000000000000202 R12: 0000000000000005 [ 3478.491343] R13: 00005624e0780020 R14: 0000000000000000 R15: 0000000000000001 (...) [ 3478.493352] ---[ end trace d5f537302be4f8c8 ]--- Another possibility, much less likely to happen, is that send will not fail but the contents of the stream it produces may not be correct. To avoid this, do not allow send and relocation (balance) to run in parallel. In the long term the goal is to allow for both to be able to run concurrently without any problems, but that will take a significant effort in development and testing. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-01btrfs: directly call into crypto framework for checksummingJohannes Thumshirn
Currently btrfs_csum_data() relied on the crc32c() wrapper around the crypto framework for calculating the CRCs. As we have our own crypto_shash structure in the fs_info now, we can directly call into the crypto framework without going trough the wrapper. This way we can even remove the btrfs_csum_data() and btrfs_csum_final() wrappers. The module dependency on crc32c is preserved via MODULE_SOFTDEP("pre: crc32c"), which was previously provided by LIBCRC32C config option doing the same. Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-01btrfs: add boilerplate code for directly including the crypto frameworkJohannes Thumshirn
Add boilerplate code for directly including the crypto framework. This helps us flipping the switch for new algorithms. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-01btrfs: Simplify btrfs_check_super_csum() and get rid of size assumptionsJohannes Thumshirn
Now that we have already checked for a valid checksum type before calling btrfs_check_super_csum(), it can be simplified even further. While at it get rid of the implicit size assumption of the resulting checksum as well. This is a preparation for changing all checksum functionality to use the crypto layer later. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-01btrfs: check for supported superblock checksum type before checksum validationJohannes Thumshirn
Now that we have factorerd out the superblock checksum type validation, we can check for supported superblock checksum types before doing the actual validation of the superblock read from disk. This leads the path to further simplifications of btrfs_check_super_csum() later on. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: David Sterba <dsterba@suse.com> [ add comment ] Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-01btrfs: add common checksum type validationJohannes Thumshirn
Currently btrfs is only supporting CRC32C as checksumming algorithm. As this is about to change provide a function to validate the checksum type in the superblock against all possible algorithms. This makes adding new algorithms easier as there are fewer places to adjust when adding new algorithms. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-01btrfs: use u8 for raid_array membersDavid Sterba
The raid_attr table is now 7 * 56 = 392 bytes long, consisting of just small numbers so we don't have to use ints. New size is 7 * 32 = 224, saving 3 cachelines. Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-01btrfs: remove mapping tree structures indirectionDavid Sterba
fs_info::mapping_tree is the physical<->logical mapping tree and uses the same underlying structure as extents, but is embedded to another structure. There are no other members and this indirection is useless. No functional change. Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-01btrfs: detect fast implementation of crc32c on all architecturesDavid Sterba
Currently, there's only check for fast crc32c implementation on X86, based on the CPU flags. This is used to decide if checksumming should be offloaded to worker threads or can be calculated by the caller. As there are more architectures that implement a faster version of crc32c (ARM, SPARC, s390, MIPS, PowerPC), also there are specialized hw cards. The detection is based on driver name, all generic C implementations contain 'generic', while the specialized versions do not. Alternatively the priority could be used, but this is not currently provided by the crypto API. The flag is set per-filesystem at mount time and used for the offloading decisions. Signed-off-by: David Sterba <dsterba@suse.com>
2019-05-07Merge tag 'for-5.2/block-20190507' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block updates from Jens Axboe: "Nothing major in this series, just fixes and improvements all over the map. This contains: - Series of fixes for sed-opal (David, Jonas) - Fixes and performance tweaks for BFQ (via Paolo) - Set of fixes for bcache (via Coly) - Set of fixes for md (via Song) - Enabling multi-page for passthrough requests (Ming) - Queue release fix series (Ming) - Device notification improvements (Martin) - Propagate underlying device rotational status in loop (Holger) - Removal of mtip32xx trim support, which has been disabled for years (Christoph) - Improvement and cleanup of nvme command handling (Christoph) - Add block SPDX tags (Christoph) - Cleanup/hardening of bio/bvec iteration (Christoph) - A few NVMe pull requests (Christoph) - Removal of CONFIG_LBDAF (Christoph) - Various little fixes here and there" * tag 'for-5.2/block-20190507' of git://git.kernel.dk/linux-block: (164 commits) block: fix mismerge in bvec_advance block: don't drain in-progress dispatch in blk_cleanup_queue() blk-mq: move cancel of hctx->run_work into blk_mq_hw_sysfs_release blk-mq: always free hctx after request queue is freed blk-mq: split blk_mq_alloc_and_init_hctx into two parts blk-mq: free hw queue's resource in hctx's release handler blk-mq: move cancel of requeue_work into blk_mq_release blk-mq: grab .q_usage_counter when queuing request from plug code path block: fix function name in comment nvmet: protect discovery change log event list iteration nvme: mark nvme_core_init and nvme_core_exit static nvme: move command size checks to the core nvme-fabrics: check more command sizes nvme-pci: check more command sizes nvme-pci: remove an unneeded variable initialization nvme-pci: unquiesce admin queue on shutdown nvme-pci: shutdown on timeout during deletion nvme-pci: fix psdt field for single segment sgls nvme-multipath: don't print ANA group state by default nvme-multipath: split bios with the ns_head bio_set before submitting ...
2019-04-30block: remove the i argument to bio_for_each_segment_allChristoph Hellwig
We only have two callers that need the integer loop iterator, and they can easily maintain it themselves. Suggested-by: Matthew Wilcox <willy@infradead.org> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Acked-by: David Sterba <dsterba@suse.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Acked-by: Coly Li <colyli@suse.de> Reviewed-by: Matthew Wilcox <willy@infradead.org> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-29btrfs: track DIO bytes in flightJosef Bacik
When diagnosing a slowdown of generic/224 I noticed we were not doing anything when calling into shrink_delalloc(). This is because all writes in 224 are O_DIRECT, not delalloc, and thus our delalloc_bytes counter is 0, which short circuits most of the work inside of shrink_delalloc(). However O_DIRECT writes still consume metadata resources and generate ordered extents, which we can still wait on. Fix this by tracking outstanding DIO write bytes, and use this as well as the delalloc bytes counter to decide if we need to lookup and wait on any ordered extents. If we have more DIO writes than delalloc bytes we'll go ahead and wait on any ordered extents regardless of our flush state as flushing delalloc is likely to not gain us anything. Signed-off-by: Josef Bacik <josef@toxicpanda.com> [ use dio instead of odirect in identifiers ] Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29btrfs: Remove bio_offset argument from submit_bio_hookNikolay Borisov
None of the implementers of the submit_bio_hook use the bio_offset parameter, simply remove it. No functional changes. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29btrfs: Always pass 0 bio_offset for btree_submit_bio_startNikolay Borisov
The btree submit hook queues the async csum and forwards the bio_offset parameter passed to btree_submit_bio_hook. This is redundant since btree_submit_bio_start calls btree_csum_one_bio which doesn't use the offset at all. No functional changes. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29btrfs: Remove 'tree' argument from read_extent_buffer_pagesNikolay Borisov
This function always uses the btree inode's io_tree. Stop taking the tree as a function argument and instead access it internally from read_extent_buffer_pages. No functional changes. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29btrfs: Change submit_bio_hook to taking an inode directlyNikolay Borisov
The only possible 'private_data' that is passed to this function is actually an inode. Make that explicit by changing the signature of the call back. No functional changes. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29btrfs: get fs_info from trans in btrfs_create_treeDavid Sterba
We can read fs_info from the transaction and can drop it from the parameters. Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29btrfs: Do mandatory tree block check before submitting bioQu Wenruo
There are at least 2 reports about a memory bit flip sneaking into on-disk data. Currently we only have a relaxed check triggered at btrfs_mark_buffer_dirty() time, as it's not mandatory and only for CONFIG_BTRFS_FS_CHECK_INTEGRITY enabled build, it doesn't help users to detect such problem. This patch will address the hole by triggering comprehensive check on tree blocks before writing it back to disk. The design points are: - Timing of the check: Tree block write hook This timing is chosen to reduce the overhead. The comprehensive check should be as expensive as a checksum calculation. Doing full check at btrfs_mark_buffer_dirty() is too expensive for end user. - Loose empty leaf check Originally for an empty leaf, tree-checker will report error if it's not a tree root. The problem for such check at write time is: * False alert for tree root created in current transaction In that case, the commit root still needs to be written to disk. And since current root can differ from commit root, then it will cause false alert. This happens for log tree. * False alert for relocated tree block Relocated tree block can be written to disk due to memory pressure, in that case an empty csum tree root can be written to disk and cause false alert, since csum root node hasn't been updated. Previous patch of removing comprehensive empty leaf owner check has paved the way for this patch. The example error output will be something like: BTRFS critical (device dm-3): corrupt leaf: root=2 block=1350630375424 slot=68, bad key order, prev (10510212874240 169 0) current (1714119868416 169 0) BTRFS error (device dm-3): block=1350630375424 write time tree block corruption detected BTRFS: error (device dm-3) in btrfs_commit_transaction:2220: errno=-5 IO failure (Error while writing out transaction) BTRFS info (device dm-3): forced readonly BTRFS warning (device dm-3): Skipping commit of aborted transaction. BTRFS: error (device dm-3) in cleanup_transaction:1839: errno=-5 IO failure BTRFS info (device dm-3): delayed_refs has NO entry Reported-by: Leonard Lausen <leonard@lausen.nl> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29btrfs: get fs_info from eb in btrfs_check_nodeDavid Sterba
We can read fs_info from extent buffer and can drop it from the parameters. Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29btrfs: get fs_info from eb in btrfs_check_leaf_relaxedDavid Sterba
We can read fs_info from extent buffer and can drop it from the parameters. Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29btrfs: get fs_info from eb in btrfs_check_leaf_fullDavid Sterba
We can read fs_info from extent buffer and can drop it from the parameters. Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29btrfs: replace pending/pinned chunks lists with io treeJeff Mahoney
The pending chunks list contains chunks that are allocated in the current transaction but haven't been created yet. The pinned chunks list contains chunks that are being released in the current transaction. Both describe chunks that are not reflected on disk as in use but are unavailable just the same. The pending chunks list is anchored by the transaction handle, which means that we need to hold a reference to a transaction when working with the list. The way we use them is by iterating over both lists to perform comparisons on the stripes they describe for each device. This is backwards and requires that we keep a transaction handle open while we're trimming. This patchset adds an extent_io_tree to btrfs_device that maintains the allocation state of the device. Extents are set dirty when chunks are first allocated -- when the extent maps are added to the mapping tree. They're cleared when last removed -- when the extent maps are removed from the mapping tree. This matches the lifespan of the pending and pinned chunks list and allows us to do trims on unallocated space safely without pinning the transaction for what may be a lengthy operation. We can also use this io tree to mark which chunks have already been trimmed so we don't repeat the operation. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29btrfs: Transpose btrfs_close_devices/btrfs_mapping_tree_free in close_ctreeNikolay Borisov
Following the introduction of the alloc_state tree, some of the callees of btrfs_mapping_tree_free will have to interact with the btrfs_device of the constituent devices. Enable this by moving the code responsible for freeing devices after the last user (btrfs_mapping_tree_free). Otherwise the kernel could crash due to use-after-free. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29btrfs: combine device update operations during transaction commitNikolay Borisov
We currently overload the pending_chunks list to handle updating btrfs_device->commit_bytes used. We don't actually care about the extent mapping or even the device mapping for the chunk - we just need the device, and we can end up processing it multiple times. The fs_devices->resized_list does more or less the same thing, but with the disk size. They are called consecutively during commit and have more or less the same purpose. We can combine the two lists into a single list that attaches to the transaction and contains a list of devices that need updating. Since we always add the device to a list when we change bytes_used or disk_total_size, there's no harm in copying both values at once. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29btrfs: qgroup: remove obsolete fs_info membersDavid Sterba
The commit fcebe4562dec ("Btrfs: rework qgroup accounting") reworked qgroups and added some new structures. Another rework of qgroup mechanics e69bcee37692 ("btrfs: qgroup: Cleanup the old ref_node-oriented mechanism.") stopped using them and left uncleaned. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29btrfs: get fs_info from eb in btrfs_verify_level_keyDavid Sterba
We can read fs_info from extent buffer and can drop it from the parameters. Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29btrfs: get fs_info from eb in btree_read_extent_buffer_pagesDavid Sterba
We can read fs_info from extent buffer and can drop it from the parameters. Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29btrfs: get fs_info from eb in clean_tree_blockDavid Sterba
We can read fs_info from extent buffer and can drop it from the parameters. Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29btrfs: get fs_info from eb in check_tree_block_fsidDavid Sterba
We can read fs_info from extent buffer and can drop it from the parameters. Signed-off-by: David Sterba <dsterba@suse.com>