path: root/fs
Age  Commit message  Author
2020-10-16  Merge branch 'v5.4/standard/base' into v5.4/standard/preempt-rt/bcm-2xxx-rpi (Bruce Ashfield)
2020-10-16  Merge tag 'v5.4.71' into v5.4/standard/base (Bruce Ashfield)
This is the 5.4.71 stable release # gpg: Signature made Wed 14 Oct 2020 04:33:29 AM EDT # gpg: using RSA key 647F28654894E3BD457199BE38DBBDC86092693E # gpg: Can't check signature: No public key
2020-10-16  Merge branch 'v5.4/standard/base' into v5.4/standard/preempt-rt/base (Bruce Ashfield)
Signed-off-by: Bruce Ashfield <bruce.ashfield@gmail.com>
2020-10-16  Merge tag 'v5.4.70' into v5.4/standard/base (Bruce Ashfield)
This is the 5.4.70 stable release # gpg: Signature made Wed 07 Oct 2020 02:02:04 AM EDT # gpg: using RSA key 647F28654894E3BD457199BE38DBBDC86092693E # gpg: Can't check signature: No public key
2020-10-14  cifs: Fix incomplete memory allocation on setxattr path (Vladimir Zapolskiy)
commit 64b7f674c292207624b3d788eda2dde3dc1415df upstream. On setxattr() syscall path due to an apprent typo the size of a dynamically allocated memory chunk for storing struct smb2_file_full_ea_info object is computed incorrectly, to be more precise the first addend is the size of a pointer instead of the wanted object size. Coincidentally it makes no difference on 64-bit platforms, however on 32-bit targets the following memcpy() writes 4 bytes of data outside of the dynamically allocated memory. ============================================================================= BUG kmalloc-16 (Not tainted): Redzone overwritten ----------------------------------------------------------------------------- Disabling lock debugging due to kernel taint INFO: 0x79e69a6f-0x9e5cdecf @offset=368. First byte 0x73 instead of 0xcc INFO: Slab 0xd36d2454 objects=85 used=51 fp=0xf7d0fc7a flags=0x35000201 INFO: Object 0x6f171df3 @offset=352 fp=0x00000000 Redzone 5d4ff02d: cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc ................ Object 6f171df3: 00 00 00 00 00 05 06 00 73 6e 72 75 62 00 66 69 ........snrub.fi Redzone 79e69a6f: 73 68 32 0a sh2. Padding 56254d82: 5a 5a 5a 5a 5a 5a 5a 5a ZZZZZZZZ CPU: 0 PID: 8196 Comm: attr Tainted: G B 5.9.0-rc8+ #3 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1 04/01/2014 Call Trace: dump_stack+0x54/0x6e print_trailer+0x12c/0x134 check_bytes_and_report.cold+0x3e/0x69 check_object+0x18c/0x250 free_debug_processing+0xfe/0x230 __slab_free+0x1c0/0x300 kfree+0x1d3/0x220 smb2_set_ea+0x27d/0x540 cifs_xattr_set+0x57f/0x620 __vfs_setxattr+0x4e/0x60 __vfs_setxattr_noperm+0x4e/0x100 __vfs_setxattr_locked+0xae/0xd0 vfs_setxattr+0x4e/0xe0 setxattr+0x12c/0x1a0 path_setxattr+0xa4/0xc0 __ia32_sys_lsetxattr+0x1d/0x20 __do_fast_syscall_32+0x40/0x70 do_fast_syscall_32+0x29/0x60 do_SYSENTER_32+0x15/0x20 entry_SYSENTER_32+0x9f/0xf2 Fixes: 5517554e4313 ("cifs: Add support for writing attributes on SMB2+") Signed-off-by: Vladimir Zapolskiy <vladimir@tuxera.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
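For illustration, here is a minimal user-space sketch of the allocation-size bug described above; the struct layout and field names are invented stand-ins, not the real smb2_file_full_ea_info definition:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Illustrative stand-in for the on-the-wire EA header. */
struct ea_info {
    unsigned int next_entry_offset;
    unsigned char flags;
    unsigned char ea_name_length;
    unsigned short ea_value_length;
    char ea_data[];             /* name + value follow */
};

int main(void)
{
    const char *name = "user.attr", *value = "fish2";
    size_t payload = strlen(name) + 1 + strlen(value);
    struct ea_info *ea;

    /* Buggy pattern: sizeof(ea) is the size of a *pointer* (4 or 8 bytes). */
    size_t bad_len = sizeof(ea) + payload;

    /* Fixed pattern: size of the pointed-to structure. */
    size_t good_len = sizeof(*ea) + payload;

    printf("pointer-based length: %zu, struct-based length: %zu\n",
           bad_len, good_len);
    /* On a 32-bit build bad_len comes up short, so the later memcpy()
     * of the name/value overruns the allocation; on 64-bit the two
     * sizes coincide and the bug stays hidden. */
    return 0;
}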
2020-10-14  btrfs: allow btrfs_truncate_block() to fallback to nocow for data space reservation (Qu Wenruo)
commit 6d4572a9d71d5fc2affee0258d8582d39859188c upstream. [BUG] When the data space is exhausted, even if the inode has the NOCOW attribute, we will still refuse to truncate an unaligned range due to ENOSPC. The following script can reproduce it pretty easily: #!/bin/bash dev=/dev/test/test mnt=/mnt/btrfs umount $dev &> /dev/null umount $mnt &> /dev/null mkfs.btrfs -f $dev -b 1G mount -o nospace_cache $dev $mnt touch $mnt/foobar chattr +C $mnt/foobar xfs_io -f -c "pwrite -b 4k 0 4k" $mnt/foobar > /dev/null xfs_io -f -c "pwrite -b 4k 0 1G" $mnt/padding &> /dev/null sync xfs_io -c "fpunch 0 2k" $mnt/foobar umount $mnt Currently this will fail at the fpunch part. [CAUSE] btrfs_truncate_block() always reserves space without checking the NOCOW attribute. Since the writeback path follows the NOCOW bit, we only need to bother the space reservation code in btrfs_truncate_block(). [FIX] Make btrfs_truncate_block() follow btrfs_buffered_write() and try to reserve data space first, falling back to the NOCOW check only when we don't have enough space. This always-try-reserve approach is an optimization introduced in btrfs_buffered_write() to avoid the expensive btrfs_check_can_nocow() call. This patch exports check_can_nocow() as btrfs_check_can_nocow() and uses it in btrfs_truncate_block() to fix the problem. Reported-by: Martin Doucha <martin.doucha@suse.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
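The shape of the fix (try the cheap reservation first, fall back to the NOCOW check only on failure) can be sketched in plain C; the helper names below echo the ones mentioned above, but their bodies are placeholders, not the btrfs implementation:

#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

/* Placeholder helpers standing in for the real btrfs routines. */
static int reserve_data_space(size_t len)
{
    (void)len;
    return -ENOSPC;             /* pretend data space is exhausted */
}

static long check_can_nocow(size_t len)
{
    return (long)len;           /* pretend the range is fully NOCOW-able */
}

/* Shape of the fix: try the cheap reservation first, and only fall
 * back to the expensive NOCOW check when reservation fails. */
static int truncate_block_reserve(size_t blocksize)
{
    int ret = reserve_data_space(blocksize);
    bool nocow = false;

    if (ret < 0) {
        long nocow_bytes = check_can_nocow(blocksize);
        if (nocow_bytes >= (long)blocksize)
            nocow = true;       /* no reservation needed, write in place */
        else
            return ret;         /* genuinely out of space */
    }
    printf("reserved=%d nocow=%d\n", ret == 0, nocow);
    return 0;
}

int main(void)
{
    return truncate_block_reserve(4096);
}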
2020-10-14  btrfs: fix RWF_NOWAIT write not failling when we need to cow (Filipe Manana)
commit 260a63395f90f67d6ab89e4266af9e3dc34a77e9 upstream. If we attempt to do a RWF_NOWAIT write against a file range for which we can only do NOCOW for a part of it, due to the existence of holes or shared extents for example, we proceed with the write as if it were possible to NOCOW the whole range. Example: $ mkfs.btrfs -f /dev/sdb $ mount /dev/sdb /mnt $ touch /mnt/sdj/bar $ chattr +C /mnt/sdj/bar $ xfs_io -d -c "pwrite -S 0xab -b 256K 0 256K" /mnt/bar wrote 262144/262144 bytes at offset 0 256 KiB, 1 ops; 0.0003 sec (694.444 MiB/sec and 2777.7778 ops/sec) $ xfs_io -c "fpunch 64K 64K" /mnt/bar $ sync $ xfs_io -d -c "pwrite -N -V 1 -b 128K -S 0xfe 0 128K" /mnt/bar wrote 131072/131072 bytes at offset 0 128 KiB, 1 ops; 0.0007 sec (160.051 MiB/sec and 1280.4097 ops/sec) This last write should fail with -EAGAIN since the file range from 64K to 128K is a hole. On xfs it fails, as expected, but on ext4 it currently succeeds because apparently it is expensive to check if there are extents allocated for the whole range, but I'll check with the ext4 people. Fix the issue by checking if check_can_nocow() returns a number of NOCOW'able bytes smaller then the requested number of bytes, and if it does return -EAGAIN. Fixes: edf064e7c6fec3 ("btrfs: nowait aio support") CC: stable@vger.kernel.org # 4.14+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-10-14  btrfs: Ensure we trim ranges across block group boundary (Qu Wenruo)
commit 6b7faadd985c990324b5b5bd18cc4ba5c395eb65 upstream. [BUG] When deleting large files (which cross block group boundary) with discard mount option, we find some btrfs_discard_extent() calls only trimmed part of its space, not the whole range: btrfs_discard_extent: type=0x1 start=19626196992 len=2144530432 trimmed=1073741824 ratio=50% type: bbio->map_type, in above case, it's SINGLE DATA. start: Logical address of this trim len: Logical length of this trim trimmed: Physically trimmed bytes ratio: trimmed / len Thus leaving some unused space not discarded. [CAUSE] When discard mount option is specified, after a transaction is fully committed (super block written to disk), we begin to cleanup pinned extents in the following call chain: btrfs_commit_transaction() |- btrfs_finish_extent_commit() |- find_first_extent_bit(unpin, 0, &start, &end, EXTENT_DIRTY); |- btrfs_discard_extent() However, pinned extents are recorded in an extent_io_tree, which can merge adjacent extent states. When a large file gets deleted and it has adjacent file extents across block group boundary, we will get a large merged range like this: |<--- BG1 --->|<--- BG2 --->| |//////|<-- Range to discard --->|/////| To discard that range, we have the following calls: btrfs_discard_extent() |- btrfs_map_block() | Returned bbio will end at BG1's end. As btrfs_map_block() | never returns result across block group boundary. |- btrfs_issuse_discard() Issue discard for each stripe. So we will only discard the range in BG1, not the remaining part in BG2. Furthermore, this bug is not that reliably observed, for above case, if there is no other extent in BG2, BG2 will be empty and btrfs will trim all space of BG2, covering up the bug. [FIX] - Allow __btrfs_map_block_for_discard() to modify @length parameter btrfs_map_block() uses its @length paramter to notify the caller how many bytes are mapped in current call. With __btrfs_map_block_for_discard() also modifing the @length, btrfs_discard_extent() now understands when to do extra trim. - Call btrfs_map_block() in a loop until we hit the range end Since we now know how many bytes are mapped each time, we can iterate through each block group boundary and issue correct trim for each range. Reviewed-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Tested-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
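The core of the fix is a loop that keeps mapping and trimming until the whole logical range is covered. A hypothetical plain-C sketch of that loop, with map_one_chunk() standing in for btrfs_map_block() and reporting only how many bytes it covered this time:

#include <stdint.h>
#include <stdio.h>

/* Stand-in for btrfs_map_block(): maps at most one block group's worth
 * of the request and reports how many bytes it actually covered. */
static uint64_t map_one_chunk(uint64_t start, uint64_t len)
{
    const uint64_t bg_size = 1024ULL * 1024 * 1024;   /* pretend 1 GiB block groups */
    uint64_t bg_end = (start / bg_size + 1) * bg_size;
    uint64_t mapped = bg_end - start;
    return mapped < len ? mapped : len;
}

static void discard_extent(uint64_t start, uint64_t len)
{
    uint64_t end = start + len;

    /* Keep mapping and trimming until the end of the range is reached,
     * instead of assuming one mapping covers everything. */
    while (start < end) {
        uint64_t mapped = map_one_chunk(start, end - start);
        printf("trim [%llu, %llu)\n",
               (unsigned long long)start,
               (unsigned long long)(start + mapped));
        start += mapped;
    }
}

int main(void)
{
    discard_extent(19626196992ULL, 2144530432ULL);
    return 0;
}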
2020-10-14  btrfs: volumes: Use more straightforward way to calculate map length (Qu Wenruo)
commit 2d974619a77f106f3d1341686dea95c0eaad601f upstream. The old code goes: offset = logical - em->start; length = min_t(u64, em->len - offset, length); Because the @length calculation depends on @offset, it can take a reader several extra seconds to see that it is just the same as: offset = logical - em->start; length = min_t(u64, em->start + em->len - logical, length); Use the latter form to make the length calculation independent of other variables, which slightly improves readability. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-10-14  Btrfs: send, fix emission of invalid clone operations within the same file (Filipe Manana)
commit 9722b10148504c4153a74a9c89725af271e490fc upstream. When doing an incremental send and a file has extents shared with itself at different file offsets, it's possible for send to emit clone operations that will fail at the destination because the source range goes beyond the file's current size. This happens when the file size has increased in the send snapshot, there is a hole between the shared extents and both shared extents are at file offsets which are greater the file's size in the parent snapshot. Example: $ mkfs.btrfs -f /dev/sdb $ mount /dev/sdb /mnt/sdb $ xfs_io -f -c "pwrite -S 0xf1 0 64K" /mnt/sdb/foobar $ btrfs subvolume snapshot -r /mnt/sdb /mnt/sdb/base $ btrfs send -f /tmp/1.snap /mnt/sdb/base # Create a 320K extent at file offset 512K. $ xfs_io -c "pwrite -S 0xab 512K 64K" /mnt/sdb/foobar $ xfs_io -c "pwrite -S 0xcd 576K 64K" /mnt/sdb/foobar $ xfs_io -c "pwrite -S 0xef 640K 64K" /mnt/sdb/foobar $ xfs_io -c "pwrite -S 0x64 704K 64K" /mnt/sdb/foobar $ xfs_io -c "pwrite -S 0x73 768K 64K" /mnt/sdb/foobar # Clone part of that 320K extent into a lower file offset (192K). # This file offset is greater than the file's size in the parent # snapshot (64K). Also the clone range is a bit behind the offset of # the 320K extent so that we leave a hole between the shared extents. $ xfs_io -c "reflink /mnt/sdb/foobar 448K 192K 192K" /mnt/sdb/foobar $ btrfs subvolume snapshot -r /mnt/sdb /mnt/sdb/incr $ btrfs send -p /mnt/sdb/base -f /tmp/2.snap /mnt/sdb/incr $ mkfs.btrfs -f /dev/sdc $ mount /dev/sdc /mnt/sdc $ btrfs receive -f /tmp/1.snap /mnt/sdc $ btrfs receive -f /tmp/2.snap /mnt/sdc ERROR: failed to clone extents to foobar: Invalid argument The problem is that after processing the extent at file offset 256K, which refers to the first 128K of the 320K extent created by the buffered write operations, we have 'cur_inode_next_write_offset' set to 384K, which corresponds to the end offset of the partially shared extent (256K + 128K) and to the current file size in the receiver. Then when we process the extent at offset 512K, we do extent backreference iteration to figure out if we can clone the extent from some other inode or from the same inode, and we consider the extent at offset 256K of the same inode as a valid source for a clone operation, which is not correct because at that point the current file size in the receiver is 384K, which corresponds to the end of last processed extent (at file offset 256K), so using a clone source range from 256K to 256K + 320K is invalid because that goes past the current size of the file (384K) - this makes the receiver get an -EINVAL error when attempting the clone operation. So fix this by excluding clone sources that have a range that goes beyond the current file size in the receiver when iterating extent backreferences. A test case for fstests follows soon. Fixes: 11f2069c113e02 ("Btrfs: send, allow clone operations within the same file") CC: stable@vger.kernel.org # 5.5+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-10-14  Btrfs: send, allow clone operations within the same file (Filipe Manana)
commit 11f2069c113e02971b8db6fda62f9b9cd31a030f upstream. For send we currently skip clone operations when the source and destination files are the same. This is so because clone didn't support this case in its early days, but support for it was added back in May 2013 by commit a96fbc72884fcb ("Btrfs: allow file data clone within a file"). This change adds support for it. Example: $ mkfs.btrfs -f /dev/sdd $ mount /dev/sdd /mnt/sdd $ xfs_io -f -c "pwrite -S 0xab -b 64K 0 64K" /mnt/sdd/foobar $ xfs_io -c "reflink /mnt/sdd/foobar 0 64K 64K" /mnt/sdd/foobar $ btrfs subvolume snapshot -r /mnt/sdd /mnt/sdd/snap $ mkfs.btrfs -f /dev/sde $ mount /dev/sde /mnt/sde $ btrfs send /mnt/sdd/snap | btrfs receive /mnt/sde Without this change file foobar at the destination has a single 128Kb extent: $ filefrag -v /mnt/sde/snap/foobar Filesystem type is: 9123683e File size of /mnt/sde/snap/foobar is 131072 (32 blocks of 4096 bytes) ext: logical_offset: physical_offset: length: expected: flags: 0: 0.. 31: 0.. 31: 32: last,unknown_loc,delalloc,eof /mnt/sde/snap/foobar: 1 extent found With this we get a single 64Kb extent that is shared at file offsets 0 and 64K, just like in the source filesystem: $ filefrag -v /mnt/sde/snap/foobar Filesystem type is: 9123683e File size of /mnt/sde/snap/foobar is 131072 (32 blocks of 4096 bytes) ext: logical_offset: physical_offset: length: expected: flags: 0: 0.. 15: 3328.. 3343: 16: shared 1: 16.. 31: 3328.. 3343: 16: 3344: last,shared,eof /mnt/sde/snap/foobar: 2 extents found Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-10-14  io_uring: Fix double list add in io_queue_async_work() (Muchun Song)
If we queue work in io_poll_wake(), it will lead to a double list add. So we should only add to the list when the callback function is io_sq_wq_submit_work. The following oops was seen: list_add double add: new=ffff9ca6a8f1b0e0, prev=ffff9ca62001cee8, next=ffff9ca6a8f1b0e0. ------------[ cut here ]------------ kernel BUG at lib/list_debug.c:31! Call Trace: <IRQ> io_poll_wake+0xf3/0x230 __wake_up_common+0x91/0x170 __wake_up_common_lock+0x7a/0xc0 io_commit_cqring+0xea/0x280 ? blkcg_iolatency_done_bio+0x2b/0x610 io_cqring_add_event+0x3e/0x60 io_complete_rw+0x58/0x80 dio_complete+0x106/0x250 blk_update_request+0xa0/0x3b0 blk_mq_end_request+0x1a/0x110 blk_mq_complete_request+0xd0/0xe0 nvme_irq+0x129/0x270 [nvme] __handle_irq_event_percpu+0x7b/0x190 handle_irq_event_percpu+0x30/0x80 handle_irq_event+0x3c/0x60 handle_edge_irq+0x91/0x1e0 do_IRQ+0x4d/0xd0 common_interrupt+0xf/0xf Fixes: 1c4404efcf2c ("io_uring: make sure async workqueue is canceled on exit") Reported-by: Jiachen Zhang <zhangjiachen.jaycee@bytedance.com> Signed-off-by: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-10-14  io_uring: Fix remove irrelevant req from the task_list (Muchun Song)
Suppose process 0 has finished initializing io_uring and then forks process 1. If process 1 exits, its exit deletes all reqs from the task_list. If we then kill process 0, we will not send a SIGINT signal to the kworker, so we cannot remove the req from the task_list there; io_sq_wq_submit_work() can do that for us. Fixes: 1c4404efcf2c ("io_uring: make sure async workqueue is canceled on exit") Signed-off-by: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-10-14  io_uring: Fix missing smp_mb() in io_cancel_async_work() (Muchun Song)
The store to req->flags and the load of req->work_task must not be reordered in io_cancel_async_work(). We should make sure that either we store the REQ_F_CANCEL flag to req->flags or we see req->work_task set in io_sq_wq_submit_work(). Fixes: 1c4404efcf2c ("io_uring: make sure async workqueue is canceled on exit") Signed-off-by: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
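The pattern being enforced is the classic store/barrier/load pairing: each side publishes its own flag, issues a full barrier, then reads the other side's flag, so at least one side is guaranteed to see the other. A user-space illustration with C11 atomics (not the io_uring code itself; variable names are invented):

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

static atomic_bool cancel_flag;   /* stands in for the cancel flag in req->flags */
static atomic_bool task_recorded; /* stands in for req->work_task being set      */

/* Cancelling side: publish the flag, full barrier, then look for the worker. */
static bool cancel_side(void)
{
    atomic_store_explicit(&cancel_flag, true, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);      /* the missing smp_mb() */
    return atomic_load_explicit(&task_recorded, memory_order_relaxed);
}

/* Worker side: record itself, full barrier, then check for cancellation. */
static bool worker_side(void)
{
    atomic_store_explicit(&task_recorded, true, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);      /* pairs with the one above */
    return atomic_load_explicit(&cancel_flag, memory_order_relaxed);
}

int main(void)
{
    /* With both fences in place it is impossible for cancel_side() and
     * worker_side() to both return false when run concurrently. */
    printf("%d %d\n", cancel_side(), worker_side());
    return 0;
}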
2020-10-14  io_uring: Fix resource leaking when kill the process (Yinyin Zhu)
Commit 1c4404efcf2c ("io_uring: make sure async workqueue is canceled on exit") doesn't completely solve the resource leak problem. When a kworker is doing an io task for io_uring and the process which submitted the io task has received a SIGKILL signal from the user, the io_cancel_async_work() function should send a SIGINT signal to the kworker, but its check is wrong, so it doesn't send the signal and the resource is leaked. Why is the check wrong? The process is multi-threaded; call the thread which submitted the io task Thread1, so req->task is the current macro of Thread1. When all the threads of the process have run their exit procedure, the last thread calls io_cancel_async_work(), but the last thread may not be Thread1, so the tasks are not equal and the SIGINT signal is not sent. To fix this bug, replace the task attribute of the req with a struct files_struct and check the files instead. Fixes: 1c4404efcf2c ("io_uring: make sure async workqueue is canceled on exit") Signed-off-by: Yinyin Zhu <zhuyinyin@bytedance.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-10-07  ep_create_wakeup_source(): dentry name can change under you... (Al Viro)
commit 3701cb59d892b88d569427586f01491552f377b1 upstream. or get freed, for that matter, if it's a long (separately stored) name. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-10-07  epoll: EPOLL_CTL_ADD: close the race in decision to take fast path (Al Viro)
commit fe0a916c1eae8e17e86c3753d13919177d63ed7e upstream. Checking for the lack of epitems refering to the epoll we want to insert into is not enough; we might have an insertion of that epoll into another one that has already collected the set of files to recheck for excessive reverse paths, but hasn't gotten to creating/inserting the epitem for it. However, any such insertion in progress can be detected - it will update the generation count in our epoll when it's done looking through it for files to check. That gets done under ->mtx of our epoll and that allows us to detect that safely. We are *not* holding epmutex here, so the generation count is not stable. However, since both the update of ep->gen by loop check and (later) insertion into ->f_ep_link are done with ep->mtx held, we are fine - the sequence is grab epmutex bump loop_check_gen ... grab tep->mtx // 1 tep->gen = loop_check_gen ... drop tep->mtx // 2 ... grab tep->mtx // 3 ... insert into ->f_ep_link ... drop tep->mtx // 4 bump loop_check_gen drop epmutex and if the fastpath check in another thread happens for that eventpoll, it can come * before (1) - in that case fastpath is just fine * after (4) - we'll see non-empty ->f_ep_link, slow path taken * between (2) and (3) - loop_check_gen is stable, with ->mtx providing barriers and we end up taking slow path. Note that ->f_ep_link emptiness check is slightly racy - we are protected against insertions into that list, but removals can happen right under us. Not a problem - in the worst case we'll end up taking a slow path for no good reason. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-10-07  epoll: replace ->visited/visited_list with generation count (Al Viro)
commit 18306c404abe18a0972587a6266830583c60c928 upstream. removes the need to clear it, along with the races. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-10-07  epoll: do not insert into poll queues until all sanity checks are done (Al Viro)
commit f8d4f44df056c5b504b0d49683fb7279218fd207 upstream. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-10-07  nfs: Fix security label length not being reset (Jeffrey Mitchell)
[ Upstream commit d33030e2ee3508d65db5644551435310df86010e ] nfs_readdir_page_filler() iterates over entries in a directory, reusing the same security label buffer, but does not reset the buffer's length. This causes decode_attr_security_label() to return -ERANGE if an entry's security label is longer than the previous one's. This error, in nfs4_decode_dirent(), only gets passed up as -EAGAIN, which causes another failed attempt to copy into the buffer. The second error is ignored and the remaining entries do not show up in ls, specifically the getdents64() syscall. Reproduce by creating multiple files in NFS and giving one of the later files a longer security label. ls will not see that file nor any that are added afterwards, though they will exist on the backend. In nfs_readdir_page_filler(), reset security label buffer length before every reuse Signed-off-by: Jeffrey Mitchell <jeffrey.mitchell@starlab.io> Fixes: b4487b935452 ("nfs: Fix getxattr kernel panic and memory overflow") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
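The fix amounts to re-arming a reused output buffer before every decode. A small illustrative sketch of that pattern; the buffer type, limits and names are invented, not the NFS client structures:

#include <stdio.h>
#include <string.h>

#define LABEL_MAX 4096

struct label_buf {
    size_t len;                 /* capacity on input, actual length on output */
    char data[LABEL_MAX];
};

/* Stand-in for the decoder: fails if the entry needs more room than len. */
static int decode_label(const char *entry, struct label_buf *label)
{
    size_t need = strlen(entry);
    if (need > label->len)
        return -1;              /* would be -ERANGE in the kernel */
    memcpy(label->data, entry, need);
    label->len = need;          /* decoder shrinks len to the decoded size */
    return 0;
}

int main(void)
{
    const char *entries[] = { "short", "a-much-longer-security-label" };
    struct label_buf label;

    for (size_t i = 0; i < 2; i++) {
        label.len = LABEL_MAX;  /* the fix: reset the capacity on every pass */
        if (decode_label(entries[i], &label))
            fprintf(stderr, "entry %zu dropped\n", i);
    }
    return 0;
}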
2020-10-07  fuse: fix the ->direct_IO() treatment of iov_iter (Al Viro)
[ Upstream commit 933a3752babcf6513117d5773d2b70782d6ad149 ] the callers rely upon having any iov_iter_truncate() done inside ->direct_IO() countered by iov_iter_reexpand(). Reported-by: Qian Cai <cai@redhat.com> Tested-by: Qian Cai <cai@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-10-07  xfs: trim IO to found COW extent limit (Eric Sandeen)
A bug existed in the XFS reflink code between v5.1 and v5.5 in which the mapping for a COW IO was not trimmed to the mapping of the COW extent that was found. This resulted in a too-short copy, and corruption of other files which shared the original extent. (This happened only when extent size hints were set, which bypasses delalloc and led to this code path.) This was (inadvertently) fixed upstream with 36adcbace24e "xfs: fill out the srcmap in iomap_begin" and related patches which moved lots of this functionality to the iomap subsystem. Hence, this is a -stable only patch, targeted to fix this corruption vector without other major code changes. Fixes: 78f0cc9d55cb ("xfs: don't use delalloc extents for COW on files with extsize hints") Cc: <stable@vger.kernel.org> # 5.4.x Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-10-07  btrfs: fix filesystem corruption after a device replace (Filipe Manana)
commit 4c8f353272dd1262013873990c0fafd0e3c8f274 upstream. We use a device's allocation state tree to track ranges in a device used for allocated chunks, and we set ranges in this tree when allocating a new chunk. However after a device replace operation, we were not setting the allocated ranges in the new device's allocation state tree, so that tree is empty after a device replace. This means that a fitrim operation after a device replace will trim the device ranges that have allocated chunks and extents, as we trim every range for which there is not a range marked in the device's allocation state tree. It is also important during chunk allocation, since the device's allocation state is used to determine if a range is already allocated when allocating a new chunk. This is trivial to reproduce and the following script triggers the bug: $ cat reproducer.sh #!/bin/bash DEV1="/dev/sdg" DEV2="/dev/sdh" DEV3="/dev/sdi" wipefs -a $DEV1 $DEV2 $DEV3 &> /dev/null # Create a raid1 test fs on 2 devices. mkfs.btrfs -f -m raid1 -d raid1 $DEV1 $DEV2 > /dev/null mount $DEV1 /mnt/btrfs xfs_io -f -c "pwrite -S 0xab 0 10M" /mnt/btrfs/foo echo "Starting to replace $DEV1 with $DEV3" btrfs replace start -B $DEV1 $DEV3 /mnt/btrfs echo echo "Running fstrim" fstrim /mnt/btrfs echo echo "Unmounting filesystem" umount /mnt/btrfs echo "Mounting filesystem in degraded mode using $DEV3 only" wipefs -a $DEV1 $DEV2 &> /dev/null mount -o degraded $DEV3 /mnt/btrfs if [ $? -ne 0 ]; then dmesg | tail echo echo "Failed to mount in degraded mode" exit 1 fi echo echo "File foo data (expected all bytes = 0xab):" od -A d -t x1 /mnt/btrfs/foo umount /mnt/btrfs When running the reproducer: $ ./replace-test.sh wrote 10485760/10485760 bytes at offset 0 10 MiB, 2560 ops; 0.0901 sec (110.877 MiB/sec and 28384.5216 ops/sec) Starting to replace /dev/sdg with /dev/sdi Running fstrim Unmounting filesystem Mounting filesystem in degraded mode using /dev/sdi only mount: /mnt/btrfs: wrong fs type, bad option, bad superblock on /dev/sdi, missing codepage or helper program, or other error. [19581.748641] BTRFS info (device sdg): dev_replace from /dev/sdg (devid 1) to /dev/sdi started [19581.803842] BTRFS info (device sdg): dev_replace from /dev/sdg (devid 1) to /dev/sdi finished [19582.208293] BTRFS info (device sdi): allowing degraded mounts [19582.208298] BTRFS info (device sdi): disk space caching is enabled [19582.208301] BTRFS info (device sdi): has skinny extents [19582.212853] BTRFS warning (device sdi): devid 2 uuid 1f731f47-e1bb-4f00-bfbb-9e5a0cb4ba9f is missing [19582.213904] btree_readpage_end_io_hook: 25839 callbacks suppressed [19582.213907] BTRFS error (device sdi): bad tree block start, want 30490624 have 0 [19582.214780] BTRFS warning (device sdi): failed to read root (objectid=7): -5 [19582.231576] BTRFS error (device sdi): open_ctree failed Failed to mount in degraded mode So fix by setting all allocated ranges in the replace target device when the replace operation is finishing, when we are holding the chunk mutex and we can not race with new chunk allocations. A test case for fstests follows soon. Fixes: 1c11b63eff2a67 ("btrfs: replace pending/pinned chunks lists with io tree") CC: stable@vger.kernel.org # 5.2+ Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-10-05  Merge branch 'v5.4/standard/base' into v5.4/standard/preempt-rt/base (Bruce Ashfield)
2020-10-05  Merge tag 'v5.4.69' into v5.4/standard/base (Bruce Ashfield)
This is the 5.4.69 stable release # gpg: Signature made Thu 01 Oct 2020 07:19:35 AM EDT # gpg: using RSA key 647F28654894E3BD457199BE38DBBDC86092693E # gpg: Can't check signature: No public key
2020-10-01  btrfs: fix overflow when copying corrupt csums for a message (Johannes Thumshirn)
commit 35be8851d172c6e3db836c0f28c19087b10c9e00 upstream. Syzkaller reported a buffer overflow in btree_readpage_end_io_hook() when loop mounting a crafted image: detected buffer overflow in memcpy ------------[ cut here ]------------ kernel BUG at lib/string.c:1129! invalid opcode: 0000 [#1] PREEMPT SMP KASAN CPU: 1 PID: 26 Comm: kworker/u4:2 Not tainted 5.9.0-rc4-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Workqueue: btrfs-endio-meta btrfs_work_helper RIP: 0010:fortify_panic+0xf/0x20 lib/string.c:1129 RSP: 0018:ffffc90000e27980 EFLAGS: 00010286 RAX: 0000000000000022 RBX: ffff8880a80dca64 RCX: 0000000000000000 RDX: ffff8880a90860c0 RSI: ffffffff815dba07 RDI: fffff520001c4f22 RBP: ffff8880a80dca00 R08: 0000000000000022 R09: ffff8880ae7318e7 R10: 0000000000000000 R11: 0000000000077578 R12: 00000000ffffff6e R13: 0000000000000008 R14: ffffc90000e27a40 R15: 1ffff920001c4f3c FS: 0000000000000000(0000) GS:ffff8880ae700000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000557335f440d0 CR3: 000000009647d000 CR4: 00000000001506e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: memcpy include/linux/string.h:405 [inline] btree_readpage_end_io_hook.cold+0x206/0x221 fs/btrfs/disk-io.c:642 end_bio_extent_readpage+0x4de/0x10c0 fs/btrfs/extent_io.c:2854 bio_endio+0x3cf/0x7f0 block/bio.c:1449 end_workqueue_fn+0x114/0x170 fs/btrfs/disk-io.c:1695 btrfs_work_helper+0x221/0xe20 fs/btrfs/async-thread.c:318 process_one_work+0x94c/0x1670 kernel/workqueue.c:2269 worker_thread+0x64c/0x1120 kernel/workqueue.c:2415 kthread+0x3b5/0x4a0 kernel/kthread.c:292 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:294 Modules linked in: ---[ end trace b68924293169feef ]--- RIP: 0010:fortify_panic+0xf/0x20 lib/string.c:1129 RSP: 0018:ffffc90000e27980 EFLAGS: 00010286 RAX: 0000000000000022 RBX: ffff8880a80dca64 RCX: 0000000000000000 RDX: ffff8880a90860c0 RSI: ffffffff815dba07 RDI: fffff520001c4f22 RBP: ffff8880a80dca00 R08: 0000000000000022 R09: ffff8880ae7318e7 R10: 0000000000000000 R11: 0000000000077578 R12: 00000000ffffff6e R13: 0000000000000008 R14: ffffc90000e27a40 R15: 1ffff920001c4f3c FS: 0000000000000000(0000) GS:ffff8880ae700000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f95b7c4d008 CR3: 000000009647d000 CR4: 00000000001506e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 The overflow happens, because in btree_readpage_end_io_hook() we assume that we have found a 4 byte checksum instead of the real possible 32 bytes we have for the checksums. 
With the fix applied: [ 35.726623] BTRFS: device fsid 815caf9a-dc43-4d2a-ac54-764b8333d765 devid 1 transid 5 /dev/loop0 scanned by syz-repro (215) [ 35.738994] BTRFS info (device loop0): disk space caching is enabled [ 35.738998] BTRFS info (device loop0): has skinny extents [ 35.743337] BTRFS warning (device loop0): loop0 checksum verify failed on 1052672 wanted 0xf9c035fc8d239a54 found 0x67a25c14b7eabcf9 level 0 [ 35.743420] BTRFS error (device loop0): failed to read chunk root [ 35.745899] BTRFS error (device loop0): open_ctree failed Reported-by: syzbot+e864a35d361e1d4e29a5@syzkaller.appspotmail.com Fixes: d5178578bcd4 ("btrfs: directly call into crypto framework for checksumming") CC: stable@vger.kernel.org # 5.4+ Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
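The overflow is the familiar fixed-buffer-versus-variable-data mismatch: a scratch buffer sized for the smallest (4-byte crc32c) checksum is fed a checksum that can be up to 32 bytes. A generic sketch of the safe pattern (sizes and names are assumptions, not the btrfs code):

#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define CSUM_SIZE_MAX 32        /* largest supported checksum, e.g. sha256/blake2b */

static void report_bad_csum(const uint8_t *found, size_t csum_size)
{
    uint8_t buf[CSUM_SIZE_MAX];

    /* Size the scratch buffer for the largest possible checksum and
     * never copy more than it can hold; a 4-byte buffer here overflows
     * as soon as a 32-byte checksum type is in use. */
    if (csum_size > sizeof(buf))
        csum_size = sizeof(buf);
    memcpy(buf, found, csum_size);

    for (size_t i = 0; i < csum_size; i++)
        printf("%02x", buf[i]);
    printf("\n");
}

int main(void)
{
    uint8_t sha256_csum[32] = { 0xde, 0xad, 0xbe, 0xef };
    report_bad_csum(sha256_csum, sizeof(sha256_csum));
    return 0;
}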
2020-10-01  btrfs: qgroup: fix data leak caused by race between writeback and truncate (Qu Wenruo)
[ Upstream commit fa91e4aa1716004ea8096d5185ec0451e206aea0 ] [BUG] When running tests like generic/013 on test device with btrfs quota enabled, it can normally lead to data leak, detected at unmount time: BTRFS warning (device dm-3): qgroup 0/5 has unreleased space, type 0 rsv 4096 ------------[ cut here ]------------ WARNING: CPU: 11 PID: 16386 at fs/btrfs/disk-io.c:4142 close_ctree+0x1dc/0x323 [btrfs] RIP: 0010:close_ctree+0x1dc/0x323 [btrfs] Call Trace: btrfs_put_super+0x15/0x17 [btrfs] generic_shutdown_super+0x72/0x110 kill_anon_super+0x18/0x30 btrfs_kill_super+0x17/0x30 [btrfs] deactivate_locked_super+0x3b/0xa0 deactivate_super+0x40/0x50 cleanup_mnt+0x135/0x190 __cleanup_mnt+0x12/0x20 task_work_run+0x64/0xb0 __prepare_exit_to_usermode+0x1bc/0x1c0 __syscall_return_slowpath+0x47/0x230 do_syscall_64+0x64/0xb0 entry_SYSCALL_64_after_hwframe+0x44/0xa9 ---[ end trace caf08beafeca2392 ]--- BTRFS error (device dm-3): qgroup reserved space leaked [CAUSE] In the offending case, the offending operations are: 2/6: writev f2X[269 1 0 0 0 0] [1006997,67,288] 0 2/7: truncate f2X[269 1 0 0 48 1026293] 18388 0 The following sequence of events could happen after the writev(): CPU1 (writeback) | CPU2 (truncate) ----------------------------------------------------------------- btrfs_writepages() | |- extent_write_cache_pages() | |- Got page for 1003520 | | 1003520 is Dirty, no writeback | | So (!clear_page_dirty_for_io()) | | gets called for it | |- Now page 1003520 is Clean. | | | btrfs_setattr() | | |- btrfs_setsize() | | |- truncate_setsize() | | New i_size is 18388 |- __extent_writepage() | | |- page_offset() > i_size | |- btrfs_invalidatepage() | |- Page is clean, so no qgroup | callback executed This means, the qgroup reserved data space is not properly released in btrfs_invalidatepage() as the page is Clean. [FIX] Instead of checking the dirty bit of a page, call btrfs_qgroup_free_data() unconditionally in btrfs_invalidatepage(). As qgroup rsv are completely bound to the QGROUP_RESERVED bit of io_tree, not bound to page status, thus we won't cause double freeing anyway. Fixes: 0b34c261e235 ("btrfs: qgroup: Prevent qgroup->reserved from going subzero") CC: stable@vger.kernel.org # 4.14+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-10-01  cifs: Fix double add page to memcg when cifs_readpages (Zhang Xiaoxu)
[ Upstream commit 95a3d8f3af9b0d63b43f221b630beaab9739d13a ] When xfstests generic/451, there is an BUG at mm/memcontrol.c: page:ffffea000560f2c0 refcount:2 mapcount:0 mapping:000000008544e0ea index:0xf mapping->aops:cifs_addr_ops dentry name:"tst-aio-dio-cycle-write.451" flags: 0x2fffff80000001(locked) raw: 002fffff80000001 ffffc90002023c50 ffffea0005280088 ffff88815cda0210 raw: 000000000000000f 0000000000000000 00000002ffffffff ffff88817287d000 page dumped because: VM_BUG_ON_PAGE(page->mem_cgroup) page->mem_cgroup:ffff88817287d000 ------------[ cut here ]------------ kernel BUG at mm/memcontrol.c:2659! invalid opcode: 0000 [#1] SMP CPU: 2 PID: 2038 Comm: xfs_io Not tainted 5.8.0-rc1 #44 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20190727_ 073836-buildvm-ppc64le-16.ppc.4 RIP: 0010:commit_charge+0x35/0x50 Code: 0d 48 83 05 54 b2 02 05 01 48 89 77 38 c3 48 c7 c6 78 4a ea ba 48 83 05 38 b2 02 05 01 e8 63 0d9 RSP: 0018:ffffc90002023a50 EFLAGS: 00010202 RAX: 0000000000000000 RBX: ffff88817287d000 RCX: 0000000000000000 RDX: 0000000000000000 RSI: ffff88817ac97ea0 RDI: ffff88817ac97ea0 RBP: ffffea000560f2c0 R08: 0000000000000203 R09: 0000000000000005 R10: 0000000000000030 R11: ffffc900020237a8 R12: 0000000000000000 R13: 0000000000000001 R14: 0000000000000001 R15: ffff88815a1272c0 FS: 00007f5071ab0800(0000) GS:ffff88817ac80000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000055efcd5ca000 CR3: 000000015d312000 CR4: 00000000000006e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: mem_cgroup_charge+0x166/0x4f0 __add_to_page_cache_locked+0x4a9/0x710 add_to_page_cache_locked+0x15/0x20 cifs_readpages+0x217/0x1270 read_pages+0x29a/0x670 page_cache_readahead_unbounded+0x24f/0x390 __do_page_cache_readahead+0x3f/0x60 ondemand_readahead+0x1f1/0x470 page_cache_async_readahead+0x14c/0x170 generic_file_buffered_read+0x5df/0x1100 generic_file_read_iter+0x10c/0x1d0 cifs_strict_readv+0x139/0x170 new_sync_read+0x164/0x250 __vfs_read+0x39/0x60 vfs_read+0xb5/0x1e0 ksys_pread64+0x85/0xf0 __x64_sys_pread64+0x22/0x30 do_syscall_64+0x69/0x150 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x7f5071fcb1af Code: Bad RIP value. RSP: 002b:00007ffde2cdb8e0 EFLAGS: 00000293 ORIG_RAX: 0000000000000011 RAX: ffffffffffffffda RBX: 00007ffde2cdb990 RCX: 00007f5071fcb1af RDX: 0000000000001000 RSI: 000055efcd5ca000 RDI: 0000000000000003 RBP: 0000000000000003 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000001000 R11: 0000000000000293 R12: 0000000000000001 R13: 000000000009f000 R14: 0000000000000000 R15: 0000000000001000 Modules linked in: ---[ end trace 725fa14a3e1af65c ]--- Since commit 3fea5a499d57 ("mm: memcontrol: convert page cache to a new mem_cgroup_charge() API") not cancel the page charge, the pages maybe double add to pagecache: thread1 | thread2 cifs_readpages readpages_get_pages add_to_page_cache_locked(head,index=n)=0 | readpages_get_pages | add_to_page_cache_locked(head,index=n+1)=0 add_to_page_cache_locked(head, index=n+1)=-EEXIST then, will next loop with list head page's index=n+1 and the page->mapping not NULL readpages_get_pages add_to_page_cache_locked(head, index=n+1) commit_charge VM_BUG_ON_PAGE So, we should not do the next loop when any page add to page cache failed. 
Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Zhang Xiaoxu <zhangxiaoxu5@huawei.com> Signed-off-by: Steve French <stfrench@microsoft.com> Acked-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
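The conclusion above boils down to stopping the read-ahead loop as soon as a page-cache insertion fails. A hypothetical sketch of that loop shape in plain C (stand-in helpers, not the cifs code):

#include <stdio.h>

/* Stand-in for add_to_page_cache_locked(): fails with -EEXIST for index 3. */
static int add_to_cache(int index)
{
    return index == 3 ? -17 /* -EEXIST */ : 0;
}

static int readpages(int first_index, int nr_pages)
{
    int added = 0;

    for (int i = 0; i < nr_pages; i++) {
        int rc = add_to_cache(first_index + i);
        if (rc) {
            /* The fix: do not keep looping with a page that is already
             * (or partially) in the cache; bail out and let the caller
             * retry from the failed index. */
            fprintf(stderr, "stop at index %d (rc=%d)\n", first_index + i, rc);
            break;
        }
        added++;
    }
    return added;
}

int main(void)
{
    printf("added %d pages\n", readpages(0, 8));
    return 0;
}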
2020-10-01  NFS: nfs_xdr_status should record the procedure name (Chuck Lever)
[ Upstream commit 5be5945864ea143fda628e8179c8474457af1f43 ] When sunrpc trace points are not enabled, the recorded task ID information alone is not helpful. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-10-01  ceph: fix potential race in ceph_check_caps (Jeff Layton)
[ Upstream commit dc3da0461cc4b76f2d0c5b12247fcb3b520edbbf ] Nothing ensures that session will still be valid by the time we dereference the pointer. Take and put a reference. In principle, we should always be able to get a reference here, but throw a warning if that's ever not the case. Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
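The fix is the standard take-a-reference-before-dereferencing pattern. A generic user-space illustration using a simple atomic refcount (names invented; not the ceph session code):

#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>

struct session {
    atomic_int refcount;
    int id;
};

/* Only succeed if the object is still live (refcount > 0). */
static struct session *session_get(struct session *s)
{
    int old = atomic_load(&s->refcount);
    while (old > 0) {
        if (atomic_compare_exchange_weak(&s->refcount, &old, old + 1))
            return s;
    }
    return NULL;                /* object already on its way out */
}

static void session_put(struct session *s)
{
    if (atomic_fetch_sub(&s->refcount, 1) == 1)
        free(s);                /* last reference dropped */
}

int main(void)
{
    struct session *s = calloc(1, sizeof(*s));
    atomic_init(&s->refcount, 1);
    s->id = 42;

    struct session *ref = session_get(s);   /* pin before dereferencing */
    if (ref) {
        printf("session %d still valid\n", ref->id);
        session_put(ref);
    }
    session_put(s);             /* drop the original reference */
    return 0;
}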
2020-10-01  btrfs: fix double __endio_write_update_ordered in direct I/O (Omar Sandoval)
[ Upstream commit c36cac28cb94e58f7e21ff43bdc6064346dab32c ] In btrfs_submit_direct(), if we fail to allocate the btrfs_dio_private, we complete the ordered extent range. However, we don't mark that the range doesn't need to be cleaned up from btrfs_direct_IO() until later. Therefore, if we fail to allocate the btrfs_dio_private, we complete the ordered extent range twice. We could fix this by updating unsubmitted_oe_range earlier, but it's cleaner to reorganize the code so that creating the btrfs_dio_private and submitting the bios are separate, and once the btrfs_dio_private is created, cleanup always happens through the btrfs_dio_private. The logic around unsubmitted_oe_range_end and unsubmitted_oe_range_start is really subtle. We have the following: 1. btrfs_direct_IO sets those two to the same value. 2. When we call __blockdev_direct_IO unless btrfs_get_blocks_direct->btrfs_get_blocks_direct_write is called to modify unsubmitted_oe_range_start so that start < end. Cleanup won't happen. 3. We come into btrfs_submit_direct - if it dip allocation fails we'd return with oe_range_end now modified so cleanup will happen. 4. If we manage to allocate the dip we reset the unsubmitted range members to be equal so that cleanup happens from btrfs_endio_direct_write. This 4-step logic is not really obvious, especially given it's scattered across 3 functions. Fixes: f28a49287817 ("Btrfs: fix leaking of ordered extents after direct IO write error") Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Omar Sandoval <osandov@fb.com> [ add range start/end logic explanation from Nikolay ] Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-10-01  btrfs: don't force read-only after error in drop snapshot (David Sterba)
[ Upstream commit 7c09c03091ac562ddca2b393e5d65c1d37da79f1 ] Deleting a subvolume on a full filesystem leads to ENOSPC followed by a forced read-only. This is not a transaction abort and the filesystem is otherwise ok, so the error should be just propagated to the callers. This is caused by unnecessary call to btrfs_handle_fs_error for all errors, except EAGAIN. This does not make sense as the standard transaction abort mechanism is in btrfs_drop_snapshot so all relevant failures are handled. Originally in commit cb1b69f4508a ("Btrfs: forced readonly when btrfs_drop_snapshot() fails") there was no return value at all, so the btrfs_std_error made some sense but once the error handling and propagation has been implemented we don't need it anymore. Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-10-01  fuse: update attr_version counter on fuse_notify_inval_inode() (Miklos Szeredi)
[ Upstream commit 5ddd9ced9aef6cfa76af27d384c17c9e2d610ce8 ] A GETATTR request can race with FUSE_NOTIFY_INVAL_INODE, resulting in the attribute cache being updated with stale information after the invalidation. Fix this by bumping the attribute version in fuse_reverse_inval_inode(). Reported-by: Krzysztof Rusek <rusek@9livesdata.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-10-01  fuse: don't check refcount after stealing page (Miklos Szeredi)
[ Upstream commit 32f98877c57bee6bc27f443a96f49678a2cd6a50 ] page_count() is unstable. Unless there has been an RCU grace period between when the page was removed from the page cache and now, a speculative reference may exist from the page cache. Reported-by: Matthew Wilcox <willy@infradead.org> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-10-01  bdev: Reduce time holding bd_mutex in sync in blkdev_close() (Douglas Anderson)
[ Upstream commit b849dd84b6ccfe32622988b79b7b073861fcf9f7 ] While trying to "dd" to the block device for a USB stick, I encountered a hung task warning (blocked for > 120 seconds). I managed to come up with an easy way to reproduce this on my system (where /dev/sdb is the block device for my USB stick) with: while true; do dd if=/dev/zero of=/dev/sdb bs=4M; done With my reproduction here are the relevant bits from the hung task detector: INFO: task udevd:294 blocked for more than 122 seconds. ... udevd D 0 294 1 0x00400008 Call trace: ... mutex_lock_nested+0x40/0x50 __blkdev_get+0x7c/0x3d4 blkdev_get+0x118/0x138 blkdev_open+0x94/0xa8 do_dentry_open+0x268/0x3a0 vfs_open+0x34/0x40 path_openat+0x39c/0xdf4 do_filp_open+0x90/0x10c do_sys_open+0x150/0x3c8 ... ... Showing all locks held in the system: ... 1 lock held by dd/2798: #0: ffffff814ac1a3b8 (&bdev->bd_mutex){+.+.}, at: __blkdev_put+0x50/0x204 ... dd D 0 2798 2764 0x00400208 Call trace: ... schedule+0x8c/0xbc io_schedule+0x1c/0x40 wait_on_page_bit_common+0x238/0x338 __lock_page+0x5c/0x68 write_cache_pages+0x194/0x500 generic_writepages+0x64/0xa4 blkdev_writepages+0x24/0x30 do_writepages+0x48/0xa8 __filemap_fdatawrite_range+0xac/0xd8 filemap_write_and_wait+0x30/0x84 __blkdev_put+0x88/0x204 blkdev_put+0xc4/0xe4 blkdev_close+0x28/0x38 __fput+0xe0/0x238 ____fput+0x1c/0x28 task_work_run+0xb0/0xe4 do_notify_resume+0xfc0/0x14bc work_pending+0x8/0x14 The problem appears related to the fact that my USB disk is terribly slow and that I have a lot of RAM in my system to cache things. Specifically my writes seem to be happening at ~15 MB/s and I've got ~4 GB of RAM in my system that can be used for buffering. To write 4 GB of buffer to disk thus takes ~4000 MB / ~15 MB/s = ~267 seconds. The 267 second number is a problem because in __blkdev_put() we call sync_blockdev() while holding the bd_mutex. Any other callers who want the bd_mutex will be blocked for the whole time. The problem is made worse because I believe blkdev_put() specifically tells other tasks (namely udev) to go try to access the device at right around the same time we're going to hold the mutex for a long time. Putting some traces around this (after disabling the hung task detector), I could confirm: dd: 437.608600: __blkdev_put() right before sync_blockdev() for sdb udevd: 437.623901: blkdev_open() right before blkdev_get() for sdb dd: 661.468451: __blkdev_put() right after sync_blockdev() for sdb udevd: 663.820426: blkdev_open() right after blkdev_get() for sdb A simple fix for this is to realize that sync_blockdev() works fine if you're not holding the mutex. Also, it's not the end of the world if you sync a little early (though it can have performance impacts). Thus we can make a guess that we're going to need to do the sync and then do it without holding the mutex. We still do one last sync with the mutex but it should be much, much faster. With this, my hung task warnings for my test case are gone. Signed-off-by: Douglas Anderson <dianders@chromium.org> Reviewed-by: Guenter Roeck <groeck@chromium.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
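The shape of the fix is: do the slow work optimistically outside the lock, then repeat the now-cheap work under the lock. A hedged pthread sketch of that pattern (not the block-layer code; the sync helper is a stub):

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t bd_mutex = PTHREAD_MUTEX_INITIALIZER;

/* Stand-in for sync_blockdev(): slow when there is a lot of dirty data. */
static void sync_dirty_data(const char *when)
{
    printf("syncing (%s)...\n", when);
    usleep(1000);               /* pretend this can take minutes on a slow device */
}

static void blkdev_close(void)
{
    /* Optimistic sync without the lock: other openers are not blocked
     * while the bulk of the dirty data is written out. */
    sync_dirty_data("unlocked, slow");

    pthread_mutex_lock(&bd_mutex);
    /* A little more data may have been dirtied meanwhile; this second
     * pass under the lock is expected to be fast. */
    sync_dirty_data("locked, fast");
    pthread_mutex_unlock(&bd_mutex);
}

int main(void)
{
    blkdev_close();
    return 0;
}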
2020-10-01  btrfs: fix setting last_trans for reloc roots (Josef Bacik)
[ Upstream commit aec7db3b13a07d515c15ada752a7287a44a79ea0 ] I made a mistake with my previous fix, I assumed that we didn't need to mess with the reloc roots once we were out of the part of relocation where we are actually moving the extents. The subtle thing that I missed is that btrfs_init_reloc_root() also updates the last_trans for the reloc root when we do btrfs_record_root_in_trans() for the corresponding fs_root. I've added a comment to make sure future me doesn't make this mistake again. This showed up as a WARN_ON() in btrfs_copy_root() because our last_trans didn't == the current transid. This could happen if we snapshotted a fs root with a reloc root after we set rc->create_reloc_tree = 0, but before we actually merge the reloc root. Worth mentioning that the regression produced the following warning when running snapshot creation and balance in parallel: BTRFS info (device sdc): relocating block group 30408704 flags metadata|dup ------------[ cut here ]------------ WARNING: CPU: 0 PID: 12823 at fs/btrfs/ctree.c:191 btrfs_copy_root+0x26f/0x430 [btrfs] CPU: 0 PID: 12823 Comm: btrfs Tainted: G W 5.6.0-rc7-btrfs-next-58 #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014 RIP: 0010:btrfs_copy_root+0x26f/0x430 [btrfs] RSP: 0018:ffffb96e044279b8 EFLAGS: 00010202 RAX: 0000000000000009 RBX: ffff9da70bf61000 RCX: ffffb96e04427a48 RDX: ffff9da733a770c8 RSI: ffff9da70bf61000 RDI: ffff9da694163818 RBP: ffff9da733a770c8 R08: fffffffffffffff8 R09: 0000000000000002 R10: ffffb96e044279a0 R11: 0000000000000000 R12: ffff9da694163818 R13: fffffffffffffff8 R14: ffff9da6d2512000 R15: ffff9da714cdac00 FS: 00007fdeacf328c0(0000) GS:ffff9da735e00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000055a2a5b8a118 CR3: 00000001eed78002 CR4: 00000000003606f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: ? create_reloc_root+0x49/0x2b0 [btrfs] ? kmem_cache_alloc_trace+0xe5/0x200 create_reloc_root+0x8b/0x2b0 [btrfs] btrfs_reloc_post_snapshot+0x96/0x5b0 [btrfs] create_pending_snapshot+0x610/0x1010 [btrfs] create_pending_snapshots+0xa8/0xd0 [btrfs] btrfs_commit_transaction+0x4c7/0xc50 [btrfs] ? btrfs_mksubvol+0x3cd/0x560 [btrfs] btrfs_mksubvol+0x455/0x560 [btrfs] __btrfs_ioctl_snap_create+0x15f/0x190 [btrfs] btrfs_ioctl_snap_create_v2+0xa4/0xf0 [btrfs] ? mem_cgroup_commit_charge+0x6e/0x540 btrfs_ioctl+0x12d8/0x3760 [btrfs] ? do_raw_spin_unlock+0x49/0xc0 ? _raw_spin_unlock+0x29/0x40 ? __handle_mm_fault+0x11b3/0x14b0 ? ksys_ioctl+0x92/0xb0 ksys_ioctl+0x92/0xb0 ? trace_hardirqs_off_thunk+0x1a/0x1c __x64_sys_ioctl+0x16/0x20 do_syscall_64+0x5c/0x280 entry_SYSCALL_64_after_hwframe+0x49/0xbe RIP: 0033:0x7fdeabd3bdd7 Fixes: 2abc726ab4b8 ("btrfs: do not init a reloc root if we aren't relocating") Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-10-01  NFS: Fix races nfs_page_group_destroy() vs nfs_destroy_unlinked_subrequests() (Trond Myklebust)
[ Upstream commit 08ca8b21f760c0ed5034a5c122092eec22ccf8f4 ] When a subrequest is being detached from the subgroup, we want to ensure that it is not holding the group lock, or in the process of waiting for the group lock. Fixes: 5b2b5187fa85 ("NFS: Fix nfs_page_group_destroy() and nfs_lock_and_join_requests() race cases") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-10-01  ubifs: Fix out-of-bounds memory access caused by abnormal value of node_len (Liu Song)
[ Upstream commit acc5af3efa303d5f36cc8c0f61716161f6ca1384 ] In ubifs_check_node(), when the value of "node_len" is abnormal, the code jumps to the "out_len" label. Then, in the following ubifs_dump_node(), if the node type is "UBIFS_DATA_NODE", an out-of-bounds access may occur in print_hex_dump() due to the wrong "ch->len". Therefore, when the value of "node_len" is abnormal, the data length should be clamped to a reasonably safe range. At that point the structured data is not credible, so dump the corrupted data directly for analysis. Signed-off-by: Liu Song <liu.song11@zte.com.cn> Signed-off-by: Richard Weinberger <richard@nod.at> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-10-01  ubifs: ubifs_add_orphan: Fix a memory leak bug (Zhihao Cheng)
[ Upstream commit 927cc5cec35f01fe4f8af0ba80830a90b0533983 ] Memory leak occurs when files with extended attributes are added to orphan list. Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com> Fixes: 988bec41318f3fa897e2f8 ("ubifs: orphan: Handle xattrs like files") Signed-off-by: Richard Weinberger <richard@nod.at> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-10-01  ubifs: ubifs_jnl_write_inode: Fix a memory leak bug (Zhihao Cheng)
[ Upstream commit 81423c78551654953d746250f1721300b470be0e ] When inodes with extended attributes are evicted, xent is not freed in one exit branch. Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com> Fixes: 9ca2d732644484488db3112 ("ubifs: Limit number of xattrs per inode") Signed-off-by: Richard Weinberger <richard@nod.at> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-10-01  xfs: prohibit fs freezing when using empty transactions (Darrick J. Wong)
[ Upstream commit 27fb5a72f50aa770dd38b0478c07acacef97e3e7 ] I noticed that fsfreeze can take a very long time to freeze an XFS if there happens to be a GETFSMAP caller running in the background. I also happened to notice the following in dmesg: ------------[ cut here ]------------ WARNING: CPU: 2 PID: 43492 at fs/xfs/xfs_super.c:853 xfs_quiesce_attr+0x83/0x90 [xfs] Modules linked in: xfs libcrc32c ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 ip_set_hash_ip ip_set_hash_net xt_tcpudp xt_set ip_set_hash_mac ip_set nfnetlink ip6table_filter ip6_tables bfq iptable_filter sch_fq_codel ip_tables x_tables nfsv4 af_packet [last unloaded: xfs] CPU: 2 PID: 43492 Comm: xfs_io Not tainted 5.6.0-rc4-djw #rc4 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.10.2-1ubuntu1 04/01/2014 RIP: 0010:xfs_quiesce_attr+0x83/0x90 [xfs] Code: 7c 07 00 00 85 c0 75 22 48 89 df 5b e9 96 c1 00 00 48 c7 c6 b0 2d 38 a0 48 89 df e8 57 64 ff ff 8b 83 7c 07 00 00 85 c0 74 de <0f> 0b 48 89 df 5b e9 72 c1 00 00 66 90 0f 1f 44 00 00 41 55 41 54 RSP: 0018:ffffc900030f3e28 EFLAGS: 00010202 RAX: 0000000000000001 RBX: ffff88802ac54000 RCX: 0000000000000000 RDX: 0000000000000000 RSI: ffffffff81e4a6f0 RDI: 00000000ffffffff RBP: ffff88807859f070 R08: 0000000000000001 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000010 R12: 0000000000000000 R13: ffff88807859f388 R14: ffff88807859f4b8 R15: ffff88807859f5e8 FS: 00007fad1c6c0fc0(0000) GS:ffff88807e000000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f0c7d237000 CR3: 0000000077f01003 CR4: 00000000001606a0 Call Trace: xfs_fs_freeze+0x25/0x40 [xfs] freeze_super+0xc8/0x180 do_vfs_ioctl+0x70b/0x750 ? __fget_files+0x135/0x210 ksys_ioctl+0x3a/0xb0 __x64_sys_ioctl+0x16/0x20 do_syscall_64+0x50/0x1a0 entry_SYSCALL_64_after_hwframe+0x49/0xbe These two things appear to be related. The assertion trips when another thread initiates a fsmap request (which uses an empty transaction) after the freezer waited for m_active_trans to hit zero but before the the freezer executes the WARN_ON just prior to calling xfs_log_quiesce. The lengthy delays in freezing happen because the freezer calls xfs_wait_buftarg to clean out the buffer lru list. Meanwhile, the GETFSMAP caller is continuing to grab and release buffers, which means that it can take a very long time for the buffer lru list to empty out. We fix both of these races by calling sb_start_write to obtain freeze protection while using empty transactions for GETFSMAP and for metadata scrubbing. The other two users occur during mount, during which time we cannot fs freeze. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
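Conceptually, freeze protection behaves like a shared/exclusive lock: each protected operation takes it shared, the freezer takes it exclusive. A rough pthread rwlock analogy of what sb_start_write()/sb_end_write() provide (purely illustrative; the real kernel uses per-cpu rwsems and multiple freeze levels):

#include <pthread.h>
#include <stdio.h>

static pthread_rwlock_t freeze_lock = PTHREAD_RWLOCK_INITIALIZER;

/* GETFSMAP-style operation: harmless, but must not run concurrently
 * with a freeze, so it takes freeze protection in shared mode. */
static void getfsmap(void)
{
    pthread_rwlock_rdlock(&freeze_lock);    /* ~ sb_start_write() */
    printf("walking rmap with an empty transaction\n");
    pthread_rwlock_unlock(&freeze_lock);    /* ~ sb_end_write() */
}

/* fsfreeze: waits for all in-flight protected operations, then blocks
 * new ones until thawed. */
static void freeze_fs(void)
{
    pthread_rwlock_wrlock(&freeze_lock);
    printf("filesystem frozen\n");
    pthread_rwlock_unlock(&freeze_lock);    /* thaw */
}

int main(void)
{
    getfsmap();
    freeze_fs();
    return 0;
}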
2020-10-01  proc: io_accounting: Use new infrastructure to fix deadlocks in execve (Bernd Edlinger)
[ Upstream commit 76518d3798855242817e8a8ed76b2d72f4415624 ] This changes do_io_accounting to use the new exec_update_mutex instead of cred_guard_mutex. This fixes possible deadlocks when the trace is accessing /proc/$pid/io for instance. This should be safe, as the credentials are only used for reading. Signed-off-by: Bernd Edlinger <bernd.edlinger@hotmail.de> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-10-01  proc: Use new infrastructure to fix deadlocks in execve (Bernd Edlinger)
[ Upstream commit 2db9dbf71bf98d02a0bf33e798e5bfd2a9944696 ] This changes lock_trace to use the new exec_update_mutex instead of cred_guard_mutex. This fixes possible deadlocks when the trace is accessing /proc/$pid/stack for instance. This should be safe, as the credentials are only used for reading, and task->mm is updated on execve under the new exec_update_mutex. Signed-off-by: Bernd Edlinger <bernd.edlinger@hotmail.de> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-10-01  exec: Add exec_update_mutex to replace cred_guard_mutex (Eric W. Biederman)
[ Upstream commit eea9673250db4e854e9998ef9da6d4584857f0ea ] The cred_guard_mutex is problematic as it is held over possibly indefinite waits for userspace. The possible indefinite waits for userspace that I have identified are: The cred_guard_mutex is held in PTRACE_EVENT_EXIT waiting for the tracer. The cred_guard_mutex is held over "put_user(0, tsk->clear_child_tid)" in exit_mm(). The cred_guard_mutex is held over "get_user(futex_offset, ...") in exit_robust_list. The cred_guard_mutex held over copy_strings. The functions get_user and put_user can trigger a page fault which can potentially wait indefinitely in the case of userfaultfd or if userspace implements part of the page fault path. In any of those cases the userspace process that the kernel is waiting for might make a different system call that winds up taking the cred_guard_mutex and result in deadlock. Holding a mutex over any of those possibly indefinite waits for userspace does not appear necessary. Add exec_update_mutex that will just cover updating the process during exec where the permissions and the objects pointed to by the task struct may be out of sync. The plan is to switch the users of cred_guard_mutex to exec_update_mutex one by one. This lets us move forward while still being careful and not introducing any regressions. Link: https://lore.kernel.org/lkml/20160921152946.GA24210@dhcp22.suse.cz/ Link: https://lore.kernel.org/lkml/AM6PR03MB5170B06F3A2B75EFB98D071AE4E60@AM6PR03MB5170.eurprd03.prod.outlook.com/ Link: https://lore.kernel.org/linux-fsdevel/20161102181806.GB1112@redhat.com/ Link: https://lore.kernel.org/lkml/20160923095031.GA14923@redhat.com/ Link: https://lore.kernel.org/lkml/20170213141452.GA30203@redhat.com/ Ref: 45c1a159b85b ("Add PTRACE_O_TRACEVFORKDONE and PTRACE_O_TRACEEXIT facilities.") Ref: 456f17cd1a28 ("[PATCH] user-vm-unlock-2.5.31-A2") Reviewed-by: Kirill Tkhai <ktkhai@virtuozzo.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Bernd Edlinger <bernd.edlinger@hotmail.de> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
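The design is easiest to see as two locks with different scopes: the old broad lock is held across the whole, possibly unbounded, exec path, while the new narrow lock covers only the update of the task's exec state, so readers can take the narrow lock without risking deadlock. A hypothetical pthread sketch of that split (names and fields invented):

#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t cred_guard_mutex  = PTHREAD_MUTEX_INITIALIZER; /* broad, old */
static pthread_mutex_t exec_update_mutex = PTHREAD_MUTEX_INITIALIZER; /* narrow, new */

static struct { const char *comm; long mm_id; } task = { "old", 1 };

static void do_execve(void)
{
    pthread_mutex_lock(&cred_guard_mutex);   /* may be held across long waits */

    /* ... ptrace notification, copying argv, page faults, etc. ... */

    pthread_mutex_lock(&exec_update_mutex);  /* only around the state switch */
    task.comm = "new";
    task.mm_id = 2;
    pthread_mutex_unlock(&exec_update_mutex);

    pthread_mutex_unlock(&cred_guard_mutex);
}

/* A /proc-style reader only needs a consistent snapshot of the exec state,
 * so it takes the narrow lock and cannot deadlock against the long waits
 * done under the broad one. */
static void proc_reader(void)
{
    pthread_mutex_lock(&exec_update_mutex);
    printf("%s mm=%ld\n", task.comm, task.mm_id);
    pthread_mutex_unlock(&exec_update_mutex);
}

int main(void)
{
    do_execve();
    proc_reader();
    return 0;
}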
2020-10-01  btrfs: free the reloc_control in a consistent way (Josef Bacik)
[ Upstream commit 1a0afa0ecfc4dbc8d7583d03cafd3f68f781df0c ] If we have an error while processing the reloc roots we could leak roots that were added to rc->reloc_roots before we hit the error. We could have also not removed the reloc tree mapping from our rb_tree, so clean up any remaining nodes in the reloc root rb_tree. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> [ use rbtree_postorder_for_each_entry_safe ] Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-10-01  btrfs: do not init a reloc root if we aren't relocating (Josef Bacik)
[ Upstream commit 2abc726ab4b83db774e315c660ab8da21477092f ] We previously were checking if the root had a dead root before accessing root->reloc_root in order to avoid a use-after-free type bug. However this scenario happens after we've unset the reloc control, so we would have been saved if we'd simply checked for fs_info->reloc_control. At this point during relocation we no longer need to be creating new reloc roots, so simply move this check above the reloc_root checks to avoid any future races and confusion. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-10-01  nfsd: Don't add locks to closed or closing open stateids (Trond Myklebust)
[ Upstream commit a451b12311aa8c96c6f6e01c783a86995dc3ec6b ] In NFSv4, the lock stateids are tied to the lockowner, and the open stateid, so that the action of closing the file also results in either an automatic loss of the locks, or an error of the form NFS4ERR_LOCKS_HELD. In practice this means we must not add new locks to the open stateid after the close process has been invoked. In fact doing so, can result in the following panic: kernel BUG at lib/list_debug.c:51! invalid opcode: 0000 [#1] SMP NOPTI CPU: 2 PID: 1085 Comm: nfsd Not tainted 5.6.0-rc3+ #2 Hardware name: VMware, Inc. VMware7,1/440BX Desktop Reference Platform, BIOS VMW71.00V.14410784.B64.1908150010 08/15/2019 RIP: 0010:__list_del_entry_valid.cold+0x31/0x55 Code: 1a 3d 9b e8 74 10 c2 ff 0f 0b 48 c7 c7 f0 1a 3d 9b e8 66 10 c2 ff 0f 0b 48 89 f2 48 89 fe 48 c7 c7 b0 1a 3d 9b e8 52 10 c2 ff <0f> 0b 48 89 fe 4c 89 c2 48 c7 c7 78 1a 3d 9b e8 3e 10 c2 ff 0f 0b RSP: 0018:ffffb296c1d47d90 EFLAGS: 00010246 RAX: 0000000000000054 RBX: ffff8ba032456ec8 RCX: 0000000000000000 RDX: 0000000000000000 RSI: ffff8ba039e99cc8 RDI: ffff8ba039e99cc8 RBP: ffff8ba032456e60 R08: 0000000000000781 R09: 0000000000000003 R10: 0000000000000000 R11: 0000000000000001 R12: ffff8ba009a4abe0 R13: ffff8ba032456e8c R14: 0000000000000000 R15: ffff8ba00adb01d8 FS: 0000000000000000(0000) GS:ffff8ba039e80000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fb213f0b008 CR3: 00000001347de006 CR4: 00000000003606e0 Call Trace: release_lock_stateid+0x2b/0x80 [nfsd] nfsd4_free_stateid+0x1e9/0x210 [nfsd] nfsd4_proc_compound+0x414/0x700 [nfsd] ? nfs4svc_decode_compoundargs+0x407/0x4c0 [nfsd] nfsd_dispatch+0xc1/0x200 [nfsd] svc_process_common+0x476/0x6f0 [sunrpc] ? svc_sock_secure_port+0x12/0x30 [sunrpc] ? svc_recv+0x313/0x9c0 [sunrpc] ? nfsd_svc+0x2d0/0x2d0 [nfsd] svc_process+0xd4/0x110 [sunrpc] nfsd+0xe3/0x140 [nfsd] kthread+0xf9/0x130 ? nfsd_destroy+0x50/0x50 [nfsd] ? kthread_park+0x90/0x90 ret_from_fork+0x1f/0x40 The fix is to ensure that lock creation tests for whether or not the open stateid is unhashed, and to fail if that is the case. Fixes: 659aefb68eca ("nfsd: Ensure we don't recognise lock stateids after freeing them") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-10-01  ext4: mark block bitmap corrupted when found instead of BUGON (Dmitry Monakhov)
[ Upstream commit eb5760863fc28feab28b567ddcda7e667e638da0 ] We already have similar code in ext4_mb_complex_scan_group(), but ext4_mb_simple_scan_group() is still affected. Other reports: https://www.spinics.net/lists/linux-ext4/msg60231.html Reviewed-by: Andreas Dilger <adilger@dilger.ca> Signed-off-by: Dmitry Monakhov <dmonakhov@gmail.com> Link: https://lore.kernel.org/r/20200310150156.641-1-dmonakhov@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-10-01  xfs: mark dir corrupt when lookup-by-hash fails (Darrick J. Wong)
[ Upstream commit 2e107cf869eecc770e3f630060bb4e5f547d0fd8 ] In xchk_dir_actor, we attempt to validate the directory hash structures by performing a directory entry lookup by (hashed) name. If the lookup returns ENOENT, that means that the hash information is corrupt. The _process_error functions don't catch this, so we have to add that explicitly. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-10-01  xfs: don't ever return a stale pointer from __xfs_dir3_free_read (Darrick J. Wong)
[ Upstream commit 1cb5deb5bc095c070c09a4540c45f9c9ba24be43 ] If we decide that a directory free block is corrupt, we must take care not to leak a buffer pointer to the caller. After xfs_trans_brelse returns, the buffer can be freed or reused, which means that we have to set *bpp back to NULL. Callers are supposed to notice the nonzero return value and not use the buffer pointer, but we should code more defensively, even if all current callers handle this situation correctly. Fixes: de14c5f541e7 ("xfs: verify free block header fields") Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
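The defensive rule here is: once a callee has released a resource, it must also clear the caller-visible pointer to it. A small generic sketch (not the xfs code; error values are illustrative):

#include <stdio.h>
#include <stdlib.h>

struct buf { int data; };

/* Returns 0 and a valid *bpp, or a negative error with *bpp == NULL. */
static int read_free_block(int corrupt, struct buf **bpp)
{
    struct buf *bp = malloc(sizeof(*bp));
    if (!bp)
        return -12;             /* -ENOMEM */
    bp->data = 42;

    if (corrupt) {
        free(bp);               /* ~ xfs_trans_brelse() */
        *bpp = NULL;            /* the fix: never leak a stale pointer */
        return -117;            /* -EFSCORRUPTED */
    }
    *bpp = bp;
    return 0;
}

int main(void)
{
    struct buf *bp = NULL;
    int err = read_free_block(1, &bp);

    /* Even a sloppy caller that peeks at bp after an error now sees
     * NULL instead of freed memory. */
    printf("err=%d bp=%p\n", err, (void *)bp);
    return 0;
}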