summaryrefslogtreecommitdiffstats
path: root/fs
AgeCommit message (Collapse)Author
2020-01-12fs: avoid softlockups in s_inodes iteratorsEric Sandeen
[ Upstream commit 04646aebd30b99f2cfa0182435a2ec252fcb16d0 ] Anything that walks all inodes on sb->s_inodes list without rescheduling risks softlockups. Previous efforts were made in 2 functions, see: c27d82f fs/drop_caches.c: avoid softlockups in drop_pagecache_sb() ac05fbb inode: don't softlockup when evicting inodes but there hasn't been an audit of all walkers, so do that now. This also consistently moves the cond_resched() calls to the bottom of each loop in cases where it already exists. One loop remains: remove_dquot_ref(), because I'm not quite sure how to deal with that one w/o taking the i_lock. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-01-12btrfs: Fix error messages in qgroup_rescan_initNikolay Borisov
[ Upstream commit 37d02592f11bb76e4ab1dcaa5b8a2a0715403207 ] The branch of qgroup_rescan_init which is executed from the mount path prints wrong errors messages. The textual print out in case BTRFS_QGROUP_STATUS_FLAG_RESCAN/BTRFS_QGROUP_STATUS_FLAG_ON are not set are transposed. Fix it by exchanging their place. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-01-09ubifs: ubifs_tnc_start_commit: Fix OOB in layout_in_gapsZhihao Cheng
[ Upstream commit 6abf57262166b4f4294667fb5206ae7ba1ba96f5 ] Running stress-test test_2 in mtd-utils on ubi device, sometimes we can get following oops message: BUG: unable to handle page fault for address: ffffffff00000140 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 280a067 P4D 280a067 PUD 0 Oops: 0000 [#1] SMP CPU: 0 PID: 60 Comm: kworker/u16:1 Kdump: loaded Not tainted 5.2.0 #13 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0 -0-ga698c8995f-prebuilt.qemu.org 04/01/2014 Workqueue: writeback wb_workfn (flush-ubifs_0_0) RIP: 0010:rb_next_postorder+0x2e/0xb0 Code: 80 db 03 01 48 85 ff 0f 84 97 00 00 00 48 8b 17 48 83 05 bc 80 db 03 01 48 83 e2 fc 0f 84 82 00 00 00 48 83 05 b2 80 db 03 01 <48> 3b 7a 10 48 89 d0 74 02 f3 c3 48 8b 52 08 48 83 05 a3 80 db 03 RSP: 0018:ffffc90000887758 EFLAGS: 00010202 RAX: ffff888129ae4700 RBX: ffff888138b08400 RCX: 0000000080800001 RDX: ffffffff00000130 RSI: 0000000080800024 RDI: ffff888138b08400 RBP: ffff888138b08400 R08: ffffea0004a6b920 R09: 0000000000000000 R10: ffffc90000887740 R11: 0000000000000001 R12: ffff888128d48000 R13: 0000000000000800 R14: 000000000000011e R15: 00000000000007c8 FS: 0000000000000000(0000) GS:ffff88813ba00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffffffff00000140 CR3: 000000013789d000 CR4: 00000000000006f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: destroy_old_idx+0x5d/0xa0 [ubifs] ubifs_tnc_start_commit+0x4fe/0x1380 [ubifs] do_commit+0x3eb/0x830 [ubifs] ubifs_run_commit+0xdc/0x1c0 [ubifs] Above Oops are due to the slab-out-of-bounds happened in do-while of function layout_in_gaps indirectly called by ubifs_tnc_start_commit. In function layout_in_gaps, there is a do-while loop placing index nodes into the gaps created by obsolete index nodes in non-empty index LEBs until rest index nodes can totally be placed into pre-allocated empty LEBs. @c->gap_lebs points to a memory area(integer array) which records LEB numbers used by 'in-the-gaps' method. Whenever a fitable index LEB is found, corresponding lnum will be incrementally written into the memory area pointed by @c->gap_lebs. The size ((@c->lst.idx_lebs + 1) * sizeof(int)) of memory area is allocated before do-while loop and can not be changed in the loop. But @c->lst.idx_lebs could be increased by function ubifs_change_lp (called by layout_leb_in_gaps->ubifs_find_dirty_idx_leb->get_idx_gc_leb) during the loop. So, sometimes oob happens when number of cycles in do-while loop exceeds the original value of @c->lst.idx_lebs. See detail in https://bugzilla.kernel.org/show_bug.cgi?id=204229. This patch fixes oob in layout_in_gaps. Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com> Signed-off-by: Richard Weinberger <richard@nod.at> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-01-09xfs: periodically yield scrub threads to the schedulerDarrick J. Wong
[ Upstream commit 5d1116d4c6af3e580f1ed0382ca5a94bd65a34cf ] Christoph Hellwig complained about the following soft lockup warning when running scrub after generic/175 when preemption is disabled and slub debugging is enabled: watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [xfs_scrub:161] Modules linked in: irq event stamp: 41692326 hardirqs last enabled at (41692325): [<ffffffff8232c3b7>] _raw_0 hardirqs last disabled at (41692326): [<ffffffff81001c5a>] trace0 softirqs last enabled at (41684994): [<ffffffff8260031f>] __do_e softirqs last disabled at (41684987): [<ffffffff81127d8c>] irq_e0 CPU: 3 PID: 16189 Comm: xfs_scrub Not tainted 5.4.0-rc3+ #30 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.124 RIP: 0010:_raw_spin_unlock_irqrestore+0x39/0x40 Code: 89 f3 be 01 00 00 00 e8 d5 3a e5 fe 48 89 ef e8 ed 87 e5 f2 RSP: 0018:ffffc9000233f970 EFLAGS: 00000286 ORIG_RAX: ffffffffff3 RAX: ffff88813b398040 RBX: 0000000000000286 RCX: 0000000000000006 RDX: 0000000000000006 RSI: ffff88813b3988c0 RDI: ffff88813b398040 RBP: ffff888137958640 R08: 0000000000000001 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000000 R12: ffffea00042b0c00 R13: 0000000000000001 R14: ffff88810ac32308 R15: ffff8881376fc040 FS: 00007f6113dea700(0000) GS:ffff88813bb80000(0000) knlGS:00000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f6113de8ff8 CR3: 000000012f290000 CR4: 00000000000006e0 Call Trace: free_debug_processing+0x1dd/0x240 __slab_free+0x231/0x410 kmem_cache_free+0x30e/0x360 xchk_ag_btcur_free+0x76/0xb0 xchk_ag_free+0x10/0x80 xchk_bmap_iextent_xref.isra.14+0xd9/0x120 xchk_bmap_iextent+0x187/0x210 xchk_bmap+0x2e0/0x3b0 xfs_scrub_metadata+0x2e7/0x500 xfs_ioc_scrub_metadata+0x4a/0xa0 xfs_file_ioctl+0x58a/0xcd0 do_vfs_ioctl+0xa0/0x6f0 ksys_ioctl+0x5b/0x90 __x64_sys_ioctl+0x11/0x20 do_syscall_64+0x4b/0x1a0 entry_SYSCALL_64_after_hwframe+0x49/0xbe If preemption is disabled, all metadata buffers needed to perform the scrub are already in memory, and there are a lot of records to check, it's possible that the scrub thread will run for an extended period of time without sleeping for IO or any other reason. Then the watchdog timer or the RCU stall timeout can trigger, producing the backtrace above. To fix this problem, call cond_resched() from the scrub thread so that we back out to the scheduler whenever necessary. Reported-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-01-09bdev: Refresh bdev size for disks without partitioningJan Kara
commit cba22d86e0a10b7070d2e6a7379dbea51aa0883c upstream. Currently, block device size in not updated on second and further open for block devices where partition scan is disabled. This is particularly annoying for example for DVD drives as that means block device size does not get updated once the media is inserted into a drive if the device is already open when inserting the media. This is actually always the case for example when pktcdvd is in use. Fix the problem by revalidating block device size on every open even for devices with partition scan disabled. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-01-09bdev: Factor out bdev revalidation into a common helperJan Kara
commit 731dc4868311ee097757b8746eaa1b4f8b2b4f1c upstream. Factor out code handling revalidation of bdev on disk change into a common helper. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-01-09fix compat handling of FICLONERANGE, FIDEDUPERANGE and FS_IOC_FIEMAPAl Viro
commit 6b2daec19094a90435abe67d16fb43b1a5527254 upstream. Unlike FICLONE, all of those take a pointer argument; they do need compat_ptr() applied to arg. Fixes: d79bdd52d8be ("vfs: wire up compat ioctl for CLONE/CLONE_RANGE") Fixes: 54dbc1517237 ("vfs: hoist the btrfs deduplication ioctl to the vfs") Fixes: ceac204e1da9 ("fs: make fiemap work from compat_ioctl") Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-01-09xfs: don't check for AG deadlock for realtime files in bunmapiOmar Sandoval
commit 69ffe5960df16938bccfe1b65382af0b3de51265 upstream. Commit 5b094d6dac04 ("xfs: fix multi-AG deadlock in xfs_bunmapi") added a check in __xfs_bunmapi() to stop early if we would touch multiple AGs in the wrong order. However, this check isn't applicable for realtime files. In most cases, it just makes us do unnecessary commits. However, without the fix from the previous commit ("xfs: fix realtime file data space leak"), if the last and second-to-last extents also happen to have different "AG numbers", then the break actually causes __xfs_bunmapi() to return without making any progress, which sends xfs_itruncate_extents_flags() into an infinite loop. Fixes: 5b094d6dac04 ("xfs: fix multi-AG deadlock in xfs_bunmapi") Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-01-09nfsd4: fix up replay_matches_cache()Scott Mayhew
commit 6e73e92b155c868ff7fce9d108839668caf1d9be upstream. When running an nfs stress test, I see quite a few cached replies that don't match up with the actual request. The first comment in replay_matches_cache() makes sense, but the code doesn't seem to match... fix it. This isn't exactly a bugfix, as the server isn't required to catch every case of a false retry. So, we may as well do this, but if this is fixing a problem then that suggests there's a client bug. Fixes: 53da6a53e1d4 ("nfsd4: catch some false session retries") Signed-off-by: Scott Mayhew <smayhew@redhat.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-01-09locks: print unsigned ino in /proc/locksAmir Goldstein
commit 98ca480a8f22fdbd768e3dad07024c8d4856576c upstream. An ino is unsigned, so display it as such in /proc/locks. Cc: stable@vger.kernel.org Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-01-09pstore/ram: Write new dumps to start of recycled zonesAleksandr Yashkin
commit 9e5f1c19800b808a37fb9815a26d382132c26c3d upstream. The ram_core.c routines treat przs as circular buffers. When writing a new crash dump, the old buffer needs to be cleared so that the new dump doesn't end up in the wrong place (i.e. at the end). The solution to this problem is to reset the circular buffer state before writing a new Oops dump. Signed-off-by: Aleksandr Yashkin <a.yashkin@inango-systems.com> Signed-off-by: Nikolay Merinov <n.merinov@inango-systems.com> Signed-off-by: Ariel Gilman <a.gilman@inango-systems.com> Link: https://lore.kernel.org/r/20191223133816.28155-1-n.merinov@inango-systems.com Fixes: 896fc1f0c4c6 ("pstore/ram: Switch to persistent_ram routines") Cc: stable@vger.kernel.org Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-01-09xfs: fix mount failure crash on invalid iclog memory accessBrian Foster
[ Upstream commit 798a9cada4694ca8d970259f216cec47e675bfd5 ] syzbot (via KASAN) reports a use-after-free in the error path of xlog_alloc_log(). Specifically, the iclog freeing loop doesn't handle the case of a fully initialized ->l_iclog linked list. Instead, it assumes that the list is partially constructed and NULL terminated. This bug manifested because there was no possible error scenario after iclog list setup when the original code was added. Subsequent code and associated error conditions were added some time later, while the original error handling code was never updated. Fix up the error loop to terminate either on a NULL iclog or reaching the end of the list. Reported-by: syzbot+c732f8644185de340492@syzkaller.appspotmail.com Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-01-09afs: Fix creation calls in the dynamic root to fail with EOPNOTSUPPDavid Howells
[ Upstream commit 1da4bd9f9d187f53618890d7b66b9628bbec3c70 ] Fix the lookup method on the dynamic root directory such that creation calls, such as mkdir, open(O_CREAT), symlink, etc. fail with EOPNOTSUPP rather than failing with some odd error (such as EEXIST). lookup() itself tries to create automount directories when it is invoked. These are cached locally in RAM and not committed to storage. Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Marc Dionne <marc.dionne@auristor.com> Tested-by: Jonathan Billings <jsbillings@jsbillings.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-01-09afs: Fix SELinux setting security label on /afsDavid Howells
[ Upstream commit bcbccaf2edcf1b76f73f890e968babef446151a4 ] Make the AFS dynamic root superblock R/W so that SELinux can set the security label on it. Without this, upgrades to, say, the Fedora filesystem-afs RPM fail if afs is mounted on it because the SELinux label can't be (re-)applied. It might be better to make it possible to bypass the R/O check for LSM label application through setxattr. Fixes: 4d673da14533 ("afs: Support the AFS dynamic root") Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Marc Dionne <marc.dionne@auristor.com> cc: selinux@vger.kernel.org cc: linux-security-module@vger.kernel.org Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-01-09afs: Fix afs_find_server lookups for ipv4 peersMarc Dionne
[ Upstream commit 9bd0160d12370a076e44f8d1320cde9c83f2c647 ] afs_find_server tries to find a server that has an address that matches the transport address of an rxrpc peer. The code assumes that the transport address is always ipv6, with ipv4 represented as ipv4 mapped addresses, but that's not the case. If the transport family is AF_INET, srx->transport.sin6.sin6_addr.s6_addr32[] will be beyond the actual ipv4 address and will always be 0, and all ipv4 addresses will be seen as matching. As a result, the first ipv4 address seen on any server will be considered a match, and the server returned may be the wrong one. One of the consequences is that callbacks received over ipv4 will only be correctly applied for the server that happens to have the first ipv4 address on the fs_addresses4 list. Callbacks over ipv4 from all other servers are dropped, causing the client to serve stale data. This is fixed by looking at the transport family, and comparing ipv4 addresses based on a sockaddr_in structure rather than a sockaddr_in6. Fixes: d2ddc776a458 ("afs: Overhaul volume and server record caching and fileserver rotation") Signed-off-by: Marc Dionne <marc.dionne@auristor.com> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-01-04filldir[64]: remove WARN_ON_ONCE() for bad directory entriesLinus Torvalds
commit b9959c7a347d6adbb558fba7e36e9fef3cba3b07 upstream. This was always meant to be a temporary thing, just for testing and to see if it actually ever triggered. The only thing that reported it was syzbot doing disk image fuzzing, and then that warning is expected. So let's just remove it before -rc4, because the extra sanity testing should probably go to -stable, but we don't want the warning to do so. Reported-by: syzbot+3031f712c7ad5dd4d926@syzkaller.appspotmail.com Fixes: 8a23eb804ca4 ("Make filldir[64]() verify the directory entry filename is valid") Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Siddharth Chandrasekaran <csiddharth@vmware.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-01-04Make filldir[64]() verify the directory entry filename is validLinus Torvalds
commit 8a23eb804ca4f2be909e372cf5a9e7b30ae476cd upstream. This has been discussed several times, and now filesystem people are talking about doing it individually at the filesystem layer, so head that off at the pass and just do it in getdents{64}(). This is partially based on a patch by Jann Horn, but checks for NUL bytes as well, and somewhat simplified. There's also commentary about how it might be better if invalid names due to filesystem corruption don't cause an immediate failure, but only an error at the end of the readdir(), so that people can still see the filenames that are ok. There's also been discussion about just how much POSIX strictly speaking requires this since it's about filesystem corruption. It's really more "protect user space from bad behavior" as pointed out by Jann. But since Eric Biederman looked up the POSIX wording, here it is for context: "From readdir: The readdir() function shall return a pointer to a structure representing the directory entry at the current position in the directory stream specified by the argument dirp, and position the directory stream at the next entry. It shall return a null pointer upon reaching the end of the directory stream. The structure dirent defined in the <dirent.h> header describes a directory entry. From definitions: 3.129 Directory Entry (or Link) An object that associates a filename with a file. Several directory entries can associate names with the same file. ... 3.169 Filename A name consisting of 1 to {NAME_MAX} bytes used to name a file. The characters composing the name may be selected from the set of all character values excluding the slash character and the null byte. The filenames dot and dot-dot have special meaning. A filename is sometimes referred to as a 'pathname component'." Note that I didn't bother adding the checks to any legacy interfaces that nobody uses. Also note that if this ends up being noticeable as a performance regression, we can fix that to do a much more optimized model that checks for both NUL and '/' at the same time one word at a time. We haven't really tended to optimize 'memchr()', and it only checks for one pattern at a time anyway, and we really _should_ check for NUL too (but see the comment about "soft errors" in the code about why it currently only checks for '/') See the CONFIG_DCACHE_WORD_ACCESS case of hash_name() for how the name lookup code looks for pathname terminating characters in parallel. Link: https://lore.kernel.org/lkml/20190118161440.220134-2-jannh@google.com/ Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Jann Horn <jannh@google.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Siddharth Chandrasekaran <csiddharth@vmware.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-01-04userfaultfd: require CAP_SYS_PTRACE for UFFD_FEATURE_EVENT_FORKMike Rapoport
[ Upstream commit 3c1c24d91ffd536de0a64688a9df7f49e58fadbc ] A while ago Andy noticed (http://lkml.kernel.org/r/CALCETrWY+5ynDct7eU_nDUqx=okQvjm=Y5wJvA4ahBja=CQXGw@mail.gmail.com) that UFFD_FEATURE_EVENT_FORK used by an unprivileged user may have security implications. As the first step of the solution the following patch limits the availably of UFFD_FEATURE_EVENT_FORK only for those having CAP_SYS_PTRACE. The usage of CAP_SYS_PTRACE ensures compatibility with CRIU. Yet, if there are other users of non-cooperative userfaultfd that run without CAP_SYS_PTRACE, they would be broken :( Current implementation of UFFD_FEATURE_EVENT_FORK modifies the file descriptor table from the read() implementation of uffd, which may have security implications for unprivileged use of the userfaultfd. Limit availability of UFFD_FEATURE_EVENT_FORK only for callers that have CAP_SYS_PTRACE. Link: http://lkml.kernel.org/r/1572967777-8812-2-git-send-email-rppt@linux.ibm.com Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Reviewed-by: Andrea Arcangeli <aarcange@redhat.com> Cc: Daniel Colascione <dancol@google.com> Cc: Jann Horn <jannh@google.com> Cc: Lokesh Gidra <lokeshgidra@google.com> Cc: Nick Kralevich <nnk@google.com> Cc: Nosh Minwalla <nosh@google.com> Cc: Pavel Emelyanov <ovzxemul@gmail.com> Cc: Tim Murray <timmurray@google.com> Cc: Aleksa Sarai <cyphar@cyphar.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-01-04ocfs2: fix passing zero to 'PTR_ERR' warningDing Xiang
[ Upstream commit 188c523e1c271d537f3c9f55b6b65bf4476de32f ] Fix a static code checker warning: fs/ocfs2/acl.c:331 ocfs2_acl_chmod() warn: passing zero to 'PTR_ERR' Link: http://lkml.kernel.org/r/1dee278b-6c96-eec2-ce76-fe6e07c6e20f@linux.alibaba.com Fixes: 5ee0fbd50fd ("ocfs2: revert using ocfs2_acl_chmod to avoid inode cluster lock hang") Signed-off-by: Ding Xiang <dingxiang@cmss.chinamobile.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Gang He <ghe@suse.com> Cc: Jun Piao <piaojun@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-01-04fs/quota: handle overflows of sysctl fs.quota.* and report as unsigned longKonstantin Khlebnikov
[ Upstream commit 6fcbcec9cfc7b3c6a2c1f1a23ebacedff7073e0a ] Quota statistics counted as 64-bit per-cpu counter. Reading sums per-cpu fractions as signed 64-bit int, filters negative values and then reports lower half as signed 32-bit int. Result may looks like: fs.quota.allocated_dquots = 22327 fs.quota.cache_hits = -489852115 fs.quota.drops = -487288718 fs.quota.free_dquots = 22083 fs.quota.lookups = -486883485 fs.quota.reads = 22327 fs.quota.syncs = 335064 fs.quota.writes = 3088689 Values bigger than 2^31-1 reported as negative. All counters except "allocated_dquots" and "free_dquots" are monotonic, thus they should be reported as is without filtering negative values. Kernel doesn't have generic helper for 64-bit sysctl yet, let's use at least unsigned long. Link: https://lore.kernel.org/r/157337934693.2078.9842146413181153727.stgit@buzz Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-01-04f2fs: fix to update dir's i_pino during cross_renameChao Yu
[ Upstream commit 2a60637f06ac94869b2e630eaf837110d39bf291 ] As Eric reported: RENAME_EXCHANGE support was just added to fsstress in xfstests: commit 65dfd40a97b6bbbd2a22538977bab355c5bc0f06 Author: kaixuxia <xiakaixu1987@gmail.com> Date: Thu Oct 31 14:41:48 2019 +0800 fsstress: add EXCHANGE renameat2 support This is causing xfstest generic/579 to fail due to fsck.f2fs reporting errors. I'm not sure what the problem is, but it still happens even with all the fs-verity stuff in the test commented out, so that the test just runs fsstress. generic/579 23s ... [10:02:25] [ 7.745370] run fstests generic/579 at 2019-11-04 10:02:25 _check_generic_filesystem: filesystem on /dev/vdc is inconsistent (see /results/f2fs/results-default/generic/579.full for details) [10:02:47] Ran: generic/579 Failures: generic/579 Failed 1 of 1 tests Xunit report: /results/f2fs/results-default/result.xml Here's the contents of 579.full: _check_generic_filesystem: filesystem on /dev/vdc is inconsistent *** fsck.f2fs output *** [ASSERT] (__chk_dots_dentries:1378) --> Bad inode number[0x24] for '..', parent parent ino is [0xd10] The root cause is that we forgot to update directory's i_pino during cross_rename, fix it. Fixes: 32f9bc25cbda0 ("f2fs: support ->rename2()") Signed-off-by: Chao Yu <yuchao0@huawei.com> Tested-by: Eric Biggers <ebiggers@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-01-04jbd2: Fix statistics for the number of logged blocksJan Kara
[ Upstream commit 015c6033068208d6227612c878877919f3fcf6b6 ] jbd2 statistics counting number of blocks logged in a transaction was wrong. It didn't count the commit block and more importantly it didn't count revoke descriptor blocks. Make sure these get properly counted. Reviewed-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20191105164437.32602-13-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-01-04ext4: iomap that extends beyond EOF should be marked dirtyMatthew Bobrowski
[ Upstream commit 2e9b51d78229d5145725a481bb5464ebc0a3f9b2 ] This patch addresses what Dave Chinner had discovered and fixed within commit: 7684e2c4384d. This changes does not have any user visible impact for ext4 as none of the current users of ext4_iomap_begin() that extend files depend on IOMAP_F_DIRTY. When doing a direct IO that spans the current EOF, and there are written blocks beyond EOF that extend beyond the current write, the only metadata update that needs to be done is a file size extension. However, we don't mark such iomaps as IOMAP_F_DIRTY to indicate that there is IO completion metadata updates required, and hence we may fail to correctly sync file size extensions made in IO completion when O_DSYNC writes are being used and the hardware supports FUA. Hence when setting IOMAP_F_DIRTY, we need to also take into account whether the iomap spans the current EOF. If it does, then we need to mark it dirty so that IO completion will call generic_write_sync() to flush the inode size update to stable storage correctly. Signed-off-by: Matthew Bobrowski <mbobrowski@mbobrowski.org> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com> Link: https://lore.kernel.org/r/8b43ee9ee94bee5328da56ba0909b7d2229ef150.1572949325.git.mbobrowski@mbobrowski.org Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-01-04ext4: update direct I/O read lock pattern for IOCB_NOWAITMatthew Bobrowski
[ Upstream commit 548feebec7e93e58b647dba70b3303dcb569c914 ] This patch updates the lock pattern in ext4_direct_IO_read() to not block on inode lock in cases of IOCB_NOWAIT direct I/O reads. The locking condition implemented here is similar to that of 942491c9e6d6 ("xfs: fix AIM7 regression"). Fixes: 16c54688592c ("ext4: Allow parallel DIO reads") Signed-off-by: Matthew Bobrowski <mbobrowski@mbobrowski.org> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com> Link: https://lore.kernel.org/r/c5d5e759f91747359fbd2c6f9a36240cf75ad79f.1572949325.git.mbobrowski@mbobrowski.org Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-01-04f2fs: fix to update time in lazytime modeChao Yu
[ Upstream commit fe1897eaa6646f5a64a4cee0e6473ed9887d324b ] generic/018 reports an inconsistent status of atime, the testcase is as below: - open file with O_SYNC - write file to construct fraged space - calc md5 of file - record {a,c,m}time - defrag file --- do nothing - umount & mount - check {a,c,m}time The root cause is, as f2fs enables lazytime by default, atime update will dirty vfs inode, rather than dirtying f2fs inode (by set with FI_DIRTY_INODE), so later f2fs_write_inode() called from VFS will fail to update inode page due to our skip: f2fs_write_inode() if (is_inode_flag_set(inode, FI_DIRTY_INODE)) return 0; So eventually, after evict(), we lose last atime for ever. To fix this issue, we need to check whether {a,c,m,cr}time is consistent in between inode cache and inode page, and only skip f2fs_update_inode() if f2fs inode is not dirty and time is consistent as well. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-12-31ext4: unlock on error in ext4_expand_extra_isize()Dan Carpenter
commit 7f420d64a08c1dcd65b27be82a27cf2bdb2e7847 upstream. We need to unlock the xattr before returning on this error path. Cc: stable@kernel.org # 4.13 Fixes: c03b45b853f5 ("ext4, project: expand inode extra size if possible") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Link: https://lore.kernel.org/r/20191213185010.6k7yl2tck3wlsdkt@kili.mountain Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-12-31ext4: check for directory entries too close to block endJan Kara
commit 109ba779d6cca2d519c5dd624a3276d03e21948e upstream. ext4_check_dir_entry() currently does not catch a case when a directory entry ends so close to the block end that the header of the next directory entry would not fit in the remaining space. This can lead to directory iteration code trying to access address beyond end of current buffer head leading to oops. CC: stable@vger.kernel.org Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20191202170213.4761-3-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-12-31ext4: fix ext4_empty_dir() for directories with holesJan Kara
commit 64d4ce892383b2ad6d782e080d25502f91bf2a38 upstream. Function ext4_empty_dir() doesn't correctly handle directories with holes and crashes on bh->b_data dereference when bh is NULL. Reorganize the loop to use 'offset' variable all the times instead of comparing pointers to current direntry with bh->b_data pointer. Also add more strict checking of '.' and '..' directory entries to avoid entering loop in possibly invalid state on corrupted filesystems. References: CVE-2019-19037 CC: stable@vger.kernel.org Fixes: 4e19d6b65fb4 ("ext4: allow directory holes") Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20191202170213.4761-2-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-12-31btrfs: return error pointer from alloc_test_extent_bufferDan Carpenter
[ Upstream commit b6293c821ea8fa2a631a2112cd86cd435effeb8b ] Callers of alloc_test_extent_buffer have not correctly interpreted the return value as error pointer, as alloc_test_extent_buffer should behave as alloc_extent_buffer. The self-tests were unaffected but btrfs_find_create_tree_block could call both functions and that would cause problems up in the call chain. Fixes: faa2dbf004e8 ("Btrfs: add sanity tests for new qgroup accounting code") CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-12-31btrfs: don't prematurely free work in scrub_missing_raid56_worker()Omar Sandoval
[ Upstream commit 57d4f0b863272ba04ba85f86bfdc0f976f0af91c ] Currently, scrub_missing_raid56_worker() puts and potentially frees sblock (which embeds the work item) and then submits a bio through scrub_wr_submit(). This is another potential instance of the bug in "btrfs: don't prematurely free work in run_ordered_work()". Fix it by dropping the reference after we submit the bio. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-12-31btrfs: don't prematurely free work in reada_start_machine_worker()Omar Sandoval
[ Upstream commit e732fe95e4cad35fc1df278c23a32903341b08b3 ] Currently, reada_start_machine_worker() frees the reada_machine_work and then calls __reada_start_machine() to do readahead. This is another potential instance of the bug in "btrfs: don't prematurely free work in run_ordered_work()". There _might_ already be a deadlock here: reada_start_machine_worker() can depend on itself through stacked filesystems (__read_start_machine() -> reada_start_machine_dev() -> reada_tree_block_flagged() -> read_extent_buffer_pages() -> submit_one_bio() -> btree_submit_bio_hook() -> btrfs_map_bio() -> submit_stripe_bio() -> submit_bio() onto a loop device can trigger readahead on the lower filesystem). Either way, let's fix it by freeing the work at the end. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-12-31btrfs: don't prematurely free work in run_ordered_work()Omar Sandoval
[ Upstream commit c495dcd6fbe1dce51811a76bb85b4675f6494938 ] We hit the following very strange deadlock on a system with Btrfs on a loop device backed by another Btrfs filesystem: 1. The top (loop device) filesystem queues an async_cow work item from cow_file_range_async(). We'll call this work X. 2. Worker thread A starts work X (normal_work_helper()). 3. Worker thread A executes the ordered work for the top filesystem (run_ordered_work()). 4. Worker thread A finishes the ordered work for work X and frees X (work->ordered_free()). 5. Worker thread A executes another ordered work and gets blocked on I/O to the bottom filesystem (still in run_ordered_work()). 6. Meanwhile, the bottom filesystem allocates and queues an async_cow work item which happens to be the recently-freed X. 7. The workqueue code sees that X is already being executed by worker thread A, so it schedules X to be executed _after_ worker thread A finishes (see the find_worker_executing_work() call in process_one_work()). Now, the top filesystem is waiting for I/O on the bottom filesystem, but the bottom filesystem is waiting for the top filesystem to finish, so we deadlock. This happens because we are breaking the workqueue assumption that a work item cannot be recycled while it still depends on other work. Fix it by waiting to free the work item until we are done with all of the related ordered work. P.S.: One might ask why the workqueue code doesn't try to detect a recycled work item. It actually does try by checking whether the work item has the same work function (find_worker_executing_work()), but in our case the function is the same. This is the only key that the workqueue code has available to compare, short of adding an additional, layer-violating "custom key". Considering that we're the only ones that have ever hit this, we should just play by the rules. Unfortunately, we haven't been able to create a minimal reproducer other than our full container setup using a compress-force=zstd filesystem on top of another compress-force=zstd filesystem. Suggested-by: Tejun Heo <tj@kernel.org> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-12-31btrfs: don't prematurely free work in end_workqueue_fn()Omar Sandoval
[ Upstream commit 9be490f1e15c34193b1aae17da58e14dd9f55a95 ] Currently, end_workqueue_fn() frees the end_io_wq entry (which embeds the work item) and then calls bio_endio(). This is another potential instance of the bug in "btrfs: don't prematurely free work in run_ordered_work()". In particular, the endio call may depend on other work items. For example, btrfs_end_dio_bio() can call btrfs_subio_endio_read() -> __btrfs_correct_data_nocsum() -> dio_read_error() -> submit_dio_repair_bio(), which submits a bio that is also completed through a end_workqueue_fn() work item. However, __btrfs_correct_data_nocsum() waits for the newly submitted bio to complete, thus it depends on another work item. This example currently usually works because we use different workqueue helper functions for BTRFS_WQ_ENDIO_DATA and BTRFS_WQ_ENDIO_DIO_REPAIR. However, it may deadlock with stacked filesystems and is fragile overall. The proper fix is to free the work item at the very end of the work function, so let's do that. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-12-31Btrfs: fix removal logic of the tree mod log that leads to use-after-free issuesFilipe Manana
commit 6609fee8897ac475378388238456c84298bff802 upstream. When a tree mod log user no longer needs to use the tree it calls btrfs_put_tree_mod_seq() to remove itself from the list of users and delete all no longer used elements of the tree's red black tree, which should be all elements with a sequence number less then our equals to the caller's sequence number. However the logic is broken because it can delete and free elements from the red black tree that have a sequence number greater then the caller's sequence number: 1) At a point in time we have sequence numbers 1, 2, 3 and 4 in the tree mod log; 2) The task which got assigned the sequence number 1 calls btrfs_put_tree_mod_seq(); 3) Sequence number 1 is deleted from the list of sequence numbers; 4) The current minimum sequence number is computed to be the sequence number 2; 5) A task using sequence number 2 is at tree_mod_log_rewind() and gets a pointer to one of its elements from the red black tree through a call to tree_mod_log_search(); 6) The task with sequence number 1 iterates the red black tree of tree modification elements and deletes (and frees) all elements with a sequence number less then or equals to 2 (the computed minimum sequence number) - it ends up only leaving elements with sequence numbers of 3 and 4; 7) The task with sequence number 2 now uses the pointer to its element, already freed by the other task, at __tree_mod_log_rewind(), resulting in a use-after-free issue. When CONFIG_DEBUG_PAGEALLOC=y it produces a trace like the following: [16804.546854] general protection fault: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI [16804.547451] CPU: 0 PID: 28257 Comm: pool Tainted: G W 5.4.0-rc8-btrfs-next-51 #1 [16804.548059] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-0-ga698c8995f-prebuilt.qemu.org 04/01/2014 [16804.548666] RIP: 0010:rb_next+0x16/0x50 (...) [16804.550581] RSP: 0018:ffffb948418ef9b0 EFLAGS: 00010202 [16804.551227] RAX: 6b6b6b6b6b6b6b6b RBX: ffff90e0247f6600 RCX: 6b6b6b6b6b6b6b6b [16804.551873] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff90e0247f6600 [16804.552504] RBP: ffff90dffe0d4688 R08: 0000000000000001 R09: 0000000000000000 [16804.553136] R10: ffff90dffa4a0040 R11: 0000000000000000 R12: 000000000000002e [16804.553768] R13: ffff90e0247f6600 R14: 0000000000001663 R15: ffff90dff77862b8 [16804.554399] FS: 00007f4b197ae700(0000) GS:ffff90e036a00000(0000) knlGS:0000000000000000 [16804.555039] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [16804.555683] CR2: 00007f4b10022000 CR3: 00000002060e2004 CR4: 00000000003606f0 [16804.556336] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [16804.556968] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [16804.557583] Call Trace: [16804.558207] __tree_mod_log_rewind+0xbf/0x280 [btrfs] [16804.558835] btrfs_search_old_slot+0x105/0xd00 [btrfs] [16804.559468] resolve_indirect_refs+0x1eb/0xc70 [btrfs] [16804.560087] ? free_extent_buffer.part.19+0x5a/0xc0 [btrfs] [16804.560700] find_parent_nodes+0x388/0x1120 [btrfs] [16804.561310] btrfs_check_shared+0x115/0x1c0 [btrfs] [16804.561916] ? extent_fiemap+0x59d/0x6d0 [btrfs] [16804.562518] extent_fiemap+0x59d/0x6d0 [btrfs] [16804.563112] ? __might_fault+0x11/0x90 [16804.563706] do_vfs_ioctl+0x45a/0x700 [16804.564299] ksys_ioctl+0x70/0x80 [16804.564885] ? trace_hardirqs_off_thunk+0x1a/0x20 [16804.565461] __x64_sys_ioctl+0x16/0x20 [16804.566020] do_syscall_64+0x5c/0x250 [16804.566580] entry_SYSCALL_64_after_hwframe+0x49/0xbe [16804.567153] RIP: 0033:0x7f4b1ba2add7 (...) [16804.568907] RSP: 002b:00007f4b197adc88 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 [16804.569513] RAX: ffffffffffffffda RBX: 00007f4b100210d8 RCX: 00007f4b1ba2add7 [16804.570133] RDX: 00007f4b100210d8 RSI: 00000000c020660b RDI: 0000000000000003 [16804.570726] RBP: 000055de05a6cfe0 R08: 0000000000000000 R09: 00007f4b197add44 [16804.571314] R10: 0000000000000000 R11: 0000000000000246 R12: 00007f4b197add48 [16804.571905] R13: 00007f4b197add40 R14: 00007f4b100210d0 R15: 00007f4b197add50 (...) [16804.575623] ---[ end trace 87317359aad4ba50 ]--- Fix this by making btrfs_put_tree_mod_seq() skip deletion of elements that have a sequence number equals to the computed minimum sequence number, and not just elements with a sequence number greater then that minimum. Fixes: bd989ba359f2ac ("Btrfs: add tree modification log functions") CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-12-31btrfs: handle ENOENT in btrfs_uuid_tree_iterateJosef Bacik
commit 714cd3e8cba6841220dce9063a7388a81de03825 upstream. If we get an -ENOENT back from btrfs_uuid_iter_rem when iterating the uuid tree we'll just continue and do btrfs_next_item(). However we've done a btrfs_release_path() at this point and no longer have a valid path. So increment the key and go back and do a normal search. CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-12-31btrfs: do not leak reloc root if we fail to read the fs rootJosef Bacik
commit ca1aa2818a53875cfdd175fb5e9a2984e997cce9 upstream. If we fail to read the fs root corresponding with a reloc root we'll just break out and free the reloc roots. But we remove our current reloc_root from this list higher up, which means we'll leak this reloc_root. Fix this by adding ourselves back to the reloc_roots list so we are properly cleaned up. CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-12-31btrfs: skip log replay on orphaned rootsJosef Bacik
commit 9bc574de590510eff899c3ca8dbaf013566b5efe upstream. My fsstress modifications coupled with generic/475 uncovered a failure to mount and replay the log if we hit a orphaned root. We do not want to replay the log for an orphan root, but it's completely legitimate to have an orphaned root with a log attached. Fix this by simply skipping replaying the log. We still need to pin it's root node so that we do not overwrite it while replaying other logs, as we re-read the log root at every stage of the replay. CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-12-31btrfs: abort transaction after failed inode updates in create_subvolJosef Bacik
commit c7e54b5102bf3614cadb9ca32d7be73bad6cecf0 upstream. We can just abort the transaction here, and in fact do that for every other failure in this function except these two cases. CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-12-31btrfs: send: remove WARN_ON for readonly mountAnand Jain
commit fbd542971aa1e9ec33212afe1d9b4f1106cd85a1 upstream. We log warning if root::orphan_cleanup_state is not set to ORPHAN_CLEANUP_DONE in btrfs_ioctl_send(). However if the filesystem is mounted as readonly we skip the orphan item cleanup during the lookup and root::orphan_cleanup_state remains at the init state 0 instead of ORPHAN_CLEANUP_DONE (2). So during send in btrfs_ioctl_send() we hit the warning as below. WARN_ON(send_root->orphan_cleanup_state != ORPHAN_CLEANUP_DONE); WARNING: CPU: 0 PID: 2616 at /Volumes/ws/btrfs-devel/fs/btrfs/send.c:7090 btrfs_ioctl_send+0xb2f/0x18c0 [btrfs] :: RIP: 0010:btrfs_ioctl_send+0xb2f/0x18c0 [btrfs] :: Call Trace: :: _btrfs_ioctl_send+0x7b/0x110 [btrfs] btrfs_ioctl+0x150a/0x2b00 [btrfs] :: do_vfs_ioctl+0xa9/0x620 ? __fget+0xac/0xe0 ksys_ioctl+0x60/0x90 __x64_sys_ioctl+0x16/0x20 do_syscall_64+0x49/0x130 entry_SYSCALL_64_after_hwframe+0x44/0xa9 Reproducer: mkfs.btrfs -fq /dev/sdb mount /dev/sdb /btrfs btrfs subvolume create /btrfs/sv1 btrfs subvolume snapshot -r /btrfs/sv1 /btrfs/ss1 umount /btrfs mount -o ro /dev/sdb /btrfs btrfs send /btrfs/ss1 -f /tmp/f The warning exists because having orphan inodes could confuse send and cause it to fail or produce incorrect streams. The two cases that would cause such send failures, which are already fixed are: 1) Inodes that were unlinked - these are orphanized and remain with a link count of 0. These caused send operations to fail because it expected to always find at least one path for an inode. However this is no longer a problem since send is now able to deal with such inodes since commit 46b2f4590aab ("Btrfs: fix send failure when root has deleted files still open") and treats them as having been completely removed (the state after an orphan cleanup is performed). 2) Inodes that were in the process of being truncated. These resulted in send not knowing about the truncation and potentially issue write operations full of zeroes for the range from the new file size to the old file size. This is no longer a problem because we no longer create orphan items for truncation since commit f7e9e8fc792f ("Btrfs: stop creating orphan items for truncate"). As such before these commits, the WARN_ON here provided a clue in case something went wrong. Instead of being a warning against the root::orphan_cleanup_state value, it could have been more accurate by checking if there were actually any orphan items, and then issue a warning only if any exists, but that would be more expensive to check. Since orphanized inodes no longer cause problems for send, just remove the warning. Reported-by: Christoph Anton Mitterer <calestyo@scientia.net> Link: https://lore.kernel.org/linux-btrfs/21cb5e8d059f6e1496a903fa7bfc0a297e2f5370.camel@scientia.net/ CC: stable@vger.kernel.org # 4.19+ Suggested-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-12-31Btrfs: fix missing data checksums after replaying a log treeFilipe Manana
commit 40e046acbd2f369cfbf93c3413639c66514cec2d upstream. When logging a file that has shared extents (reflinked with other files or with itself), we can end up logging multiple checksum items that cover overlapping ranges. This confuses the search for checksums at log replay time causing some checksums to never be added to the fs/subvolume tree. Consider the following example of a file that shares the same extent at offsets 0 and 256Kb: [ bytenr 13893632, offset 64Kb, len 64Kb ] 0 64Kb [ bytenr 13631488, offset 64Kb, len 192Kb ] 64Kb 256Kb [ bytenr 13893632, offset 0, len 256Kb ] 256Kb 512Kb When logging the inode, at tree-log.c:copy_items(), when processing the file extent item at offset 0, we log a checksum item covering the range 13959168 to 14024704, which corresponds to 13893632 + 64Kb and 13893632 + 64Kb + 64Kb, respectively. Later when processing the extent item at offset 256K, we log the checksums for the range from 13893632 to 14155776 (which corresponds to 13893632 + 256Kb). These checksums get merged with the checksum item for the range from 13631488 to 13893632 (13631488 + 256Kb), logged by a previous fsync. So after this we get the two following checksum items in the log tree: (...) item 6 key (EXTENT_CSUM EXTENT_CSUM 13631488) itemoff 3095 itemsize 512 range start 13631488 end 14155776 length 524288 item 7 key (EXTENT_CSUM EXTENT_CSUM 13959168) itemoff 3031 itemsize 64 range start 13959168 end 14024704 length 65536 The first one covers the range from the second one, they overlap. So far this does not cause a problem after replaying the log, because when replaying the file extent item for offset 256K, we copy all the checksums for the extent 13893632 from the log tree to the fs/subvolume tree, since searching for an checksum item for bytenr 13893632 leaves us at the first checksum item, which covers the whole range of the extent. However if we write 64Kb to file offset 256Kb for example, we will not be able to find and copy the checksums for the last 128Kb of the extent at bytenr 13893632, referenced by the file range 384Kb to 512Kb. After writing 64Kb into file offset 256Kb we get the following extent layout for our file: [ bytenr 13893632, offset 64K, len 64Kb ] 0 64Kb [ bytenr 13631488, offset 64Kb, len 192Kb ] 64Kb 256Kb [ bytenr 14155776, offset 0, len 64Kb ] 256Kb 320Kb [ bytenr 13893632, offset 64Kb, len 192Kb ] 320Kb 512Kb After fsync'ing the file, if we have a power failure and then mount the filesystem to replay the log, the following happens: 1) When replaying the file extent item for file offset 320Kb, we lookup for the checksums for the extent range from 13959168 (13893632 + 64Kb) to 14155776 (13893632 + 256Kb), through a call to btrfs_lookup_csums_range(); 2) btrfs_lookup_csums_range() finds the checksum item that starts precisely at offset 13959168 (item 7 in the log tree, shown before); 3) However that checksum item only covers 64Kb of data, and not 192Kb of data; 4) As a result only the checksums for the first 64Kb of data referenced by the file extent item are found and copied to the fs/subvolume tree. The remaining 128Kb of data, file range 384Kb to 512Kb, doesn't get the corresponding data checksums found and copied to the fs/subvolume tree. 5) After replaying the log userspace will not be able to read the file range from 384Kb to 512Kb, because the checksums are missing and resulting in an -EIO error. The following steps reproduce this scenario: $ mkfs.btrfs -f /dev/sdc $ mount /dev/sdc /mnt/sdc $ xfs_io -f -c "pwrite -S 0xa3 0 256K" /mnt/sdc/foobar $ xfs_io -c "fsync" /mnt/sdc/foobar $ xfs_io -c "pwrite -S 0xc7 256K 256K" /mnt/sdc/foobar $ xfs_io -c "reflink /mnt/sdc/foobar 320K 0 64K" /mnt/sdc/foobar $ xfs_io -c "fsync" /mnt/sdc/foobar $ xfs_io -c "pwrite -S 0xe5 256K 64K" /mnt/sdc/foobar $ xfs_io -c "fsync" /mnt/sdc/foobar <power failure> $ mount /dev/sdc /mnt/sdc $ md5sum /mnt/sdc/foobar md5sum: /mnt/sdc/foobar: Input/output error $ dmesg | tail [165305.003464] BTRFS info (device sdc): no csum found for inode 257 start 401408 [165305.004014] BTRFS info (device sdc): no csum found for inode 257 start 405504 [165305.004559] BTRFS info (device sdc): no csum found for inode 257 start 409600 [165305.005101] BTRFS info (device sdc): no csum found for inode 257 start 413696 [165305.005627] BTRFS info (device sdc): no csum found for inode 257 start 417792 [165305.006134] BTRFS info (device sdc): no csum found for inode 257 start 421888 [165305.006625] BTRFS info (device sdc): no csum found for inode 257 start 425984 [165305.007278] BTRFS info (device sdc): no csum found for inode 257 start 430080 [165305.008248] BTRFS warning (device sdc): csum failed root 5 ino 257 off 393216 csum 0x1337385e expected csum 0x00000000 mirror 1 [165305.009550] BTRFS warning (device sdc): csum failed root 5 ino 257 off 393216 csum 0x1337385e expected csum 0x00000000 mirror 1 Fix this simply by deleting first any checksums, from the log tree, for the range of the extent we are logging at copy_items(). This ensures we do not get checksum items in the log tree that have overlapping ranges. This is a long time issue that has been present since we have the clone (and deduplication) ioctl, and can happen both when an extent is shared between different files and within the same file. A test case for fstests follows soon. CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-12-31btrfs: do not call synchronize_srcu() in inode_tree_delJosef Bacik
commit f72ff01df9cf5db25c76674cac16605992d15467 upstream. Testing with the new fsstress uncovered a pretty nasty deadlock with lookup and snapshot deletion. Process A unlink -> final iput -> inode_tree_del -> synchronize_srcu(subvol_srcu) Process B btrfs_lookup <- srcu_read_lock() acquired here -> btrfs_iget -> find inode that has I_FREEING set -> __wait_on_freeing_inode() We're holding the srcu_read_lock() while doing the iget in order to make sure our fs root doesn't go away, and then we are waiting for the inode to finish freeing. However because the free'ing process is doing a synchronize_srcu() we deadlock. Fix this by dropping the synchronize_srcu() in inode_tree_del(). We don't need people to stop accessing the fs root at this point, we're only adding our empty root to the dead roots list. A larger much more invasive fix is forthcoming to address how we deal with fs roots, but this fixes the immediate problem. Fixes: 76dda93c6ae2 ("Btrfs: add snapshot/subvolume destroy ioctl") CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-12-31btrfs: don't double lock the subvol_sem for rename exchangeJosef Bacik
commit 943eb3bf25f4a7b745dd799e031be276aa104d82 upstream. If we're rename exchanging two subvols we'll try to lock this lock twice, which is bad. Just lock once if either of the ino's are subvols. Fixes: cdd1fedf8261 ("btrfs: add support for RENAME_EXCHANGE and RENAME_WHITEOUT") CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-12-21CIFS: Close open handle after interrupted closePavel Shilovsky
commit 9150c3adbf24d77cfba37f03639d4a908ca4ac25 upstream. If Close command is interrupted before sending a request to the server the client ends up leaking an open file handle. This wastes server resources and can potentially block applications that try to remove the file or any directory containing this file. Fix this by putting the close command into a worker queue, so another thread retries it later. Cc: Stable <stable@vger.kernel.org> Tested-by: Frank Sorenson <sorenson@redhat.com> Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-12-21CIFS: Respect O_SYNC and O_DIRECT flags during reconnectPavel Shilovsky
commit 44805b0e62f15e90d233485420e1847133716bdc upstream. Currently the client translates O_SYNC and O_DIRECT flags into corresponding SMB create options when openning a file. The problem is that on reconnect when the file is being re-opened the client doesn't set those flags and it causes a server to reject re-open requests because create options don't match. The latter means that any subsequent system call against that open file fail until a share is re-mounted. Fix this by properly setting SMB create options when re-openning files after reconnects. Fixes: 1013e760d10e6: ("SMB3: Don't ignore O_SYNC/O_DSYNC and O_DIRECT flags") Cc: Stable <stable@vger.kernel.org> Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-12-21cifs: Don't display RDMA transport on reconnectLong Li
commit 14cc639c17ab0b6671526a7459087352507609e4 upstream. On reconnect, the transport data structure is NULL and its information is not available. Signed-off-by: Long Li <longli@microsoft.com> Cc: stable@vger.kernel.org Signed-off-by: Steve French <stfrench@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-12-21cifs: smbd: Return -EINVAL when the number of iovs exceeds SMBDIRECT_MAX_SGELong Li
commit 37941ea17d3f8eb2f5ac2f59346fab9e8439271a upstream. While it's not friendly to fail user processes that issue more iovs than we support, at least we should return the correct error code so the user process gets a chance to retry with smaller number of iovs. Signed-off-by: Long Li <longli@microsoft.com> Cc: stable@vger.kernel.org Signed-off-by: Steve French <stfrench@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-12-21cifs: smbd: Add messages on RDMA session destroy and reconnectionLong Li
commit d63cdbae60ac6fbb2864bd3d8df7404f12b7407d upstream. Log these activities to help production support. Signed-off-by: Long Li <longli@microsoft.com> Cc: stable@vger.kernel.org Signed-off-by: Steve French <stfrench@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-12-21cifs: smbd: Return -EAGAIN when transport is reconnectingLong Li
commit 4357d45f50e58672e1d17648d792f27df01dfccd upstream. During reconnecting, the transport may have already been destroyed and is in the process being reconnected. In this case, return -EAGAIN to not fail and to retry this I/O. Signed-off-by: Long Li <longli@microsoft.com> Cc: stable@vger.kernel.org Signed-off-by: Steve French <stfrench@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-12-17cifs: Fix potential softlockups while refreshing DFS cachePaulo Alcantara (SUSE)
[ Upstream commit 84a1f5b1cc6fd7f6cd99fc5630c36f631b19fa60 ] We used to skip reconnects on all SMB2_IOCTL commands due to SMB3+ FSCTL_VALIDATE_NEGOTIATE_INFO - which made sense since we're still establishing a SMB session. However, when refresh_cache_worker() calls smb2_get_dfs_refer() and we're under reconnect, SMB2_ioctl() will not be able to get a proper status error (e.g. -EHOSTDOWN in case we failed to reconnect) but an -EAGAIN from cifs_send_recv() thus looping forever in refresh_cache_worker(). Fixes: e99c63e4d86d ("SMB3: Fix deadlock in validate negotiate hits reconnect") Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz> Suggested-by: Aurelien Aptel <aaptel@suse.com> Reviewed-by: Aurelien Aptel <aaptel@suse.com> Signed-off-by: Steve French <stfrench@microsoft.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-12-17gfs2: fix glock reference problem in gfs2_trans_remove_revokeBob Peterson
[ Upstream commit fe5e7ba11fcf1d75af8173836309e8562aefedef ] Commit 9287c6452d2b fixed a situation in which gfs2 could use a glock after it had been freed. To do that, it temporarily added a new glock reference by calling gfs2_glock_hold in function gfs2_add_revoke. However, if the bd element was removed by gfs2_trans_remove_revoke, it failed to drop the additional reference. This patch adds logic to gfs2_trans_remove_revoke to properly drop the additional glock reference. Fixes: 9287c6452d2b ("gfs2: Fix occasional glock use-after-free") Cc: stable@vger.kernel.org # v5.2+ Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>