summaryrefslogtreecommitdiffstats
path: root/fs
AgeCommit message (Collapse)Author
2019-08-29io_uring: add need_resched() check in inner poll loopJens Axboe
[ Upstream commit 08f5439f1df25a6cf6cf4c72cf6c13025599ce67 ] The outer poll loop checks for whether we need to reschedule, and returns to userspace if we do. However, it's possible to get stuck in the inner loop as well, if the CPU we are running on needs to reschedule to finish the IO work. Add the need_resched() check in the inner loop as well. This fixes a potential hang if the kernel is configured with CONFIG_PREEMPT_VOLUNTARY=y. Reported-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Tested-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-08-29io_uring: don't enter poll loop if we have CQEs pendingJens Axboe
[ Upstream commit a3a0e43fd77013819e4b6f55e37e0efe8e35d805 ] We need to check if we have CQEs pending before starting a poll loop, as those could be the events we will be spinning for (and hence we'll find none). This can happen if a CQE triggers an error, or if it is found by eg an IRQ before we get a chance to find it through polling. Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-08-29io_uring: fix potential hang with polled IOJens Axboe
[ Upstream commit 500f9fbadef86466a435726192f4ca4df7d94236 ] If a request issue ends up being punted to async context to avoid blocking, we can get into a situation where the original application enters the poll loop for that very request before it has been issued. This should not be an issue, except that the polling will hold the io_uring uring_ctx mutex for the duration of the poll. When the async worker has actually issued the request, it needs to acquire this mutex to add the request to the poll issued list. Since the application polling is already holding this mutex, the workqueue sleeps on the mutex forever, and the application thus never gets a chance to poll for the very request it was interested in. Fix this by ensuring that the polling drops the uring_ctx occasionally if it's not making any progress. Reported-by: Jeffrey M. Birnbaum <jmbnyc@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-08-29xfs: fix missing ILOCK unlock when xfs_setattr_nonsize fails due to EDQUOTDarrick J. Wong
commit 1fb254aa983bf190cfd685d40c64a480a9bafaee upstream. Benjamin Moody reported to Debian that XFS partially wedges when a chgrp fails on account of being out of disk quota. I ran his reproducer script: # adduser dummy # adduser dummy plugdev # dd if=/dev/zero bs=1M count=100 of=test.img # mkfs.xfs test.img # mount -t xfs -o gquota test.img /mnt # mkdir -p /mnt/dummy # chown -c dummy /mnt/dummy # xfs_quota -xc 'limit -g bsoft=100k bhard=100k plugdev' /mnt (and then as user dummy) $ dd if=/dev/urandom bs=1M count=50 of=/mnt/dummy/foo $ chgrp plugdev /mnt/dummy/foo and saw: ================================================ WARNING: lock held when returning to user space! 5.3.0-rc5 #rc5 Tainted: G W ------------------------------------------------ chgrp/47006 is leaving the kernel with locks still held! 1 lock held by chgrp/47006: #0: 000000006664ea2d (&xfs_nondir_ilock_class){++++}, at: xfs_ilock+0xd2/0x290 [xfs] ...which is clearly caused by xfs_setattr_nonsize failing to unlock the ILOCK after the xfs_qm_vop_chown_reserve call fails. Add the missing unlock. Reported-by: benjamin.moody@gmail.com Fixes: 253f4911f297 ("xfs: better xfs_trans_alloc interface") Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Tested-by: Salvatore Bonaccorso <carnil@debian.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-08-29userfaultfd_release: always remove uffd flags and clear vm_userfaultfd_ctxOleg Nesterov
commit 46d0b24c5ee10a15dfb25e20642f5a5ed59c5003 upstream. userfaultfd_release() should clear vm_flags/vm_userfaultfd_ctx even if mm->core_state != NULL. Otherwise a page fault can see userfaultfd_missing() == T and use an already freed userfaultfd_ctx. Link: http://lkml.kernel.org/r/20190820160237.GB4983@redhat.com Fixes: 04f5866e41fb ("coredump: fix race condition between mmget_not_zero()/get_task_mm() and core dumping") Signed-off-by: Oleg Nesterov <oleg@redhat.com> Reported-by: Kefeng Wang <wangkefeng.wang@huawei.com> Reviewed-by: Andrea Arcangeli <aarcange@redhat.com> Tested-by: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Peter Xu <peterx@redhat.com> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: Jann Horn <jannh@google.com> Cc: Jason Gunthorpe <jgg@mellanox.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-08-29ceph: don't try fill file_lock on unsuccessful GETFILELOCK replyJeff Layton
commit 28a282616f56990547b9dcd5c6fbd2001344664c upstream. When ceph_mdsc_do_request returns an error, we can't assume that the filelock_reply pointer will be set. Only try to fetch fields out of the r_reply_info when it returns success. Cc: stable@vger.kernel.org Reported-by: Hector Martin <hector@marcansoft.com> Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-08-29ceph: clear page dirty before invalidate pageErqi Chen
commit c95f1c5f436badb9bb87e9b30fd573f6b3d59423 upstream. clear_page_dirty_for_io(page) before mapping->a_ops->invalidatepage(). invalidatepage() clears page's private flag, if dirty flag is not cleared, the page may cause BUG_ON failure in ceph_set_page_dirty(). Cc: stable@vger.kernel.org Link: https://tracker.ceph.com/issues/40862 Signed-off-by: Erqi Chen <chenerqi@gmail.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-08-29NFSv4: Ensure state recovery handles ETIMEDOUT correctlyTrond Myklebust
[ Upstream commit 67e7b52d44e3d539dfbfcd866c3d3d69da23a909 ] Ensure that the state recovery code handles ETIMEDOUT correctly, and also that we set RPC_TASK_TIMEOUT when recovering open state. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-08-29SMB3: Kernel oops mounting a encryptData share with CONFIG_DEBUG_VIRTUALSebastien Tisserant
[ Upstream commit ee9d66182392695535cc9fccfcb40c16f72de2a9 ] Fix kernel oops when mounting a encryptData CIFS share with CONFIG_DEBUG_VIRTUAL Signed-off-by: Sebastien Tisserant <stisserant@wallix.com> Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-08-29SMB3: Fix potential memory leak when processing compound chainPavel Shilovsky
[ Upstream commit 3edeb4a4146dc3b54d6fa71b7ee0585cb52ebfdf ] When a reconnect happens in the middle of processing a compound chain the code leaks a buffer from the memory pool. Fix this by properly checking for a return code and freeing buffers in case of error. Also maintain a buf variable to be equal to either smallbuf or bigbuf depending on a response buffer size while parsing a chain and when returning to the caller. Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com> Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-08-29NFS: Fix regression whereby fscache errors are appearing on 'nofsc' mountsTrond Myklebust
[ Upstream commit dea1bb35c5f35e0577cfc61f79261d80b8715221 ] People are reporing seeing fscache errors being reported concerning duplicate cookies even in cases where they are not setting up fscache at all. The rule needs to be that if fscache is not enabled, then it should have no side effects at all. To ensure this is the case, we disable fscache completely on all superblocks for which the 'fsc' mount option was not set. In order to avoid issues with '-oremount', we also disable the ability to turn fscache on via remount. Fixes: f1fe29b4a02d ("NFS: Use i_writecount to control whether...") Link: https://bugzilla.kernel.org/show_bug.cgi?id=200145 Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Cc: Steve Dickson <steved@redhat.com> Cc: David Howells <dhowells@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-08-29NFSv4: Fix a potential sleep while atomic in nfs4_do_reclaim()Trond Myklebust
[ Upstream commit c77e22834ae9a11891cb613bd9a551be1b94f2bc ] John Hubbard reports seeing the following stack trace: nfs4_do_reclaim rcu_read_lock /* we are now in_atomic() and must not sleep */ nfs4_purge_state_owners nfs4_free_state_owner nfs4_destroy_seqid_counter rpc_destroy_wait_queue cancel_delayed_work_sync __cancel_work_timer __flush_work start_flush_work might_sleep: (kernel/workqueue.c:2975: BUG) The solution is to separate out the freeing of the state owners from nfs4_purge_state_owners(), and perform that outside the atomic context. Reported-by: John Hubbard <jhubbard@nvidia.com> Fixes: 0aaaf5c424c7f ("NFS: Cache state owners after files are closed") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-08-29NFSv4.1: Only reap expired delegationsTrond Myklebust
[ Upstream commit ad11408970df79d5f481aa9964e91f183133424c ] Fix nfs_reap_expired_delegations() to ensure that we only reap delegations that are actually expired, rather than triggering on random errors. Fixes: 45870d6909d5a ("NFSv4.1: Test delegation stateids when server...") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-08-29NFSv4.1: Fix open stateid recoveryTrond Myklebust
[ Upstream commit 27a30cf64a5cbe2105e4ff9613246b32d584766a ] The logic for checking in nfs41_check_open_stateid() whether the state is supported by a delegation is inverted. In addition, it makes more sense to perform that check before we check for expired locks. Fixes: 8a64c4ef106d1 ("NFSv4.1: Even if the stateid is OK,...") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-08-29NFSv4: When recovering state fails with EAGAIN, retry the same recoveryTrond Myklebust
[ Upstream commit c34fae003c79570b6c930b425fea3f0b7b1e7056 ] If the server returns with EAGAIN when we're trying to recover from a server reboot, we currently delay for 1 second, but then mark the stateid as needing recovery after the grace period has expired. Instead, we should just retry the same recovery process immediately after the 1 second delay. Break out of the loop after 10 retries. Fixes: 35a61606a612 ("NFS: Reduce indentation of the switch statement...") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-08-29NFSv4: Fix a credential refcount leak in nfs41_check_delegation_stateidTrond Myklebust
[ Upstream commit 8c39a39e28b86a4021d9be314ce01019bafa5fdc ] It is unsafe to dereference delegation outside the rcu lock, and in any case, the refcount is guaranteed held if cred is non-zero. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-08-25ocfs2: remove set but not used variable 'last_hash'YueHaibing
[ Upstream commit 7bc36e3ce91471b6377c8eadc0a2f220a2280083 ] Fixes gcc '-Wunused-but-set-variable' warning: fs/ocfs2/xattr.c: In function ocfs2_xattr_bucket_find: fs/ocfs2/xattr.c:3828:6: warning: variable last_hash set but not used [-Wunused-but-set-variable] It's never used and can be removed. Link: http://lkml.kernel.org/r/20190716132110.34836-1-yuehaibing@huawei.com Signed-off-by: YueHaibing <yuehaibing@huawei.com> Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Gang He <ghe@suse.com> Cc: Jun Piao <piaojun@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-08-25Btrfs: fix deadlock between fiemap and transaction commitsFilipe Manana
[ Upstream commit a6d155d2e363f26290ffd50591169cb96c2a609e ] The fiemap handler locks a file range that can have unflushed delalloc, and after locking the range, it tries to attach to a running transaction. If the running transaction started its commit, that is, it is in state TRANS_STATE_COMMIT_START, and either the filesystem was mounted with the flushoncommit option or the transaction is creating a snapshot for the subvolume that contains the file that fiemap is operating on, we end up deadlocking. This happens because fiemap is blocked on the transaction, waiting for it to complete, and the transaction is waiting for the flushed dealloc to complete, which requires locking the file range that the fiemap task already locked. The following stack traces serve as an example of when this deadlock happens: (...) [404571.515510] Workqueue: btrfs-endio-write btrfs_endio_write_helper [btrfs] [404571.515956] Call Trace: [404571.516360] ? __schedule+0x3ae/0x7b0 [404571.516730] schedule+0x3a/0xb0 [404571.517104] lock_extent_bits+0x1ec/0x2a0 [btrfs] [404571.517465] ? remove_wait_queue+0x60/0x60 [404571.517832] btrfs_finish_ordered_io+0x292/0x800 [btrfs] [404571.518202] normal_work_helper+0xea/0x530 [btrfs] [404571.518566] process_one_work+0x21e/0x5c0 [404571.518990] worker_thread+0x4f/0x3b0 [404571.519413] ? process_one_work+0x5c0/0x5c0 [404571.519829] kthread+0x103/0x140 [404571.520191] ? kthread_create_worker_on_cpu+0x70/0x70 [404571.520565] ret_from_fork+0x3a/0x50 [404571.520915] kworker/u8:6 D 0 31651 2 0x80004000 [404571.521290] Workqueue: btrfs-flush_delalloc btrfs_flush_delalloc_helper [btrfs] (...) [404571.537000] fsstress D 0 13117 13115 0x00004000 [404571.537263] Call Trace: [404571.537524] ? __schedule+0x3ae/0x7b0 [404571.537788] schedule+0x3a/0xb0 [404571.538066] wait_current_trans+0xc8/0x100 [btrfs] [404571.538349] ? remove_wait_queue+0x60/0x60 [404571.538680] start_transaction+0x33c/0x500 [btrfs] [404571.539076] btrfs_check_shared+0xa3/0x1f0 [btrfs] [404571.539513] ? extent_fiemap+0x2ce/0x650 [btrfs] [404571.539866] extent_fiemap+0x2ce/0x650 [btrfs] [404571.540170] do_vfs_ioctl+0x526/0x6f0 [404571.540436] ksys_ioctl+0x70/0x80 [404571.540734] __x64_sys_ioctl+0x16/0x20 [404571.540997] do_syscall_64+0x60/0x1d0 [404571.541279] entry_SYSCALL_64_after_hwframe+0x49/0xbe (...) [404571.543729] btrfs D 0 14210 14208 0x00004000 [404571.544023] Call Trace: [404571.544275] ? __schedule+0x3ae/0x7b0 [404571.544526] ? wait_for_completion+0x112/0x1a0 [404571.544795] schedule+0x3a/0xb0 [404571.545064] schedule_timeout+0x1ff/0x390 [404571.545351] ? lock_acquire+0xa6/0x190 [404571.545638] ? wait_for_completion+0x49/0x1a0 [404571.545890] ? wait_for_completion+0x112/0x1a0 [404571.546228] wait_for_completion+0x131/0x1a0 [404571.546503] ? wake_up_q+0x70/0x70 [404571.546775] btrfs_wait_ordered_extents+0x27c/0x400 [btrfs] [404571.547159] btrfs_commit_transaction+0x3b0/0xae0 [btrfs] [404571.547449] ? btrfs_mksubvol+0x4a4/0x640 [btrfs] [404571.547703] ? remove_wait_queue+0x60/0x60 [404571.547969] btrfs_mksubvol+0x605/0x640 [btrfs] [404571.548226] ? __sb_start_write+0xd4/0x1c0 [404571.548512] ? mnt_want_write_file+0x24/0x50 [404571.548789] btrfs_ioctl_snap_create_transid+0x169/0x1a0 [btrfs] [404571.549048] btrfs_ioctl_snap_create_v2+0x11d/0x170 [btrfs] [404571.549307] btrfs_ioctl+0x133f/0x3150 [btrfs] [404571.549549] ? mem_cgroup_charge_statistics+0x4c/0xd0 [404571.549792] ? mem_cgroup_commit_charge+0x84/0x4b0 [404571.550064] ? __handle_mm_fault+0xe3e/0x11f0 [404571.550306] ? do_raw_spin_unlock+0x49/0xc0 [404571.550608] ? _raw_spin_unlock+0x24/0x30 [404571.550976] ? __handle_mm_fault+0xedf/0x11f0 [404571.551319] ? do_vfs_ioctl+0xa2/0x6f0 [404571.551659] ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs] [404571.552087] do_vfs_ioctl+0xa2/0x6f0 [404571.552355] ksys_ioctl+0x70/0x80 [404571.552621] __x64_sys_ioctl+0x16/0x20 [404571.552864] do_syscall_64+0x60/0x1d0 [404571.553104] entry_SYSCALL_64_after_hwframe+0x49/0xbe (...) If we were joining the transaction instead of attaching to it, we would not risk a deadlock because a join only blocks if the transaction is in a state greater then or equals to TRANS_STATE_COMMIT_DOING, and the delalloc flush performed by a transaction is done before it reaches that state, when it is in the state TRANS_STATE_COMMIT_START. However a transaction join is intended for use cases where we do modify the filesystem, and fiemap only needs to peek at delayed references from the current transaction in order to determine if extents are shared, and, besides that, when there is no current transaction or when it blocks to wait for a current committing transaction to complete, it creates a new transaction without reserving any space. Such unnecessary transactions, besides doing unnecessary IO, can cause transaction aborts (-ENOSPC) and unnecessary rotation of the precious backup roots. So fix this by adding a new transaction join variant, named join_nostart, which behaves like the regular join, but it does not create a transaction when none currently exists or after waiting for a committing transaction to complete. Fixes: 03628cdbc64db6 ("Btrfs: do not start a transaction during fiemap") Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-08-25f2fs: fix to read source block before invalidating itJaegeuk Kim
[ Upstream commit 543b8c468f55f27f3c0178a22a91a51aabbbc428 ] f2fs_allocate_data_block() invalidates old block address and enable new block address. Then, if we try to read old block by f2fs_submit_page_bio(), it will give WARN due to reading invalid blocks. Let's make the order sanely back. Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-08-25io_uring: fix manual setup of iov_iter for fixed buffersAleix Roca Nonell
commit 99c79f6692ccdc42e04deea8a36e22bb48168a62 upstream. Commit bd11b3a391e3 ("io_uring: don't use iov_iter_advance() for fixed buffers") introduced an optimization to avoid using the slow iov_iter_advance by manually populating the iov_iter iterator in some cases. However, the computation of the iterator count field was erroneous: The first bvec was always accounted for an extent of page size even if the bvec length was smaller. In consequence, some I/O operations on fixed buffers were unable to operate on the full extent of the buffer, consistently skipping some bytes at the end of it. Fixes: bd11b3a391e3 ("io_uring: don't use iov_iter_advance() for fixed buffers") Cc: stable@vger.kernel.org Signed-off-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-08-25seq_file: fix problem when seeking mid-recordNeilBrown
commit 6a2aeab59e97101b4001bac84388fc49a992f87e upstream. If you use lseek or similar (e.g. pread) to access a location in a seq_file file that is within a record, rather than at a record boundary, then the first read will return the remainder of the record, and the second read will return the whole of that same record (instead of the next record). When seeking to a record boundary, the next record is correctly returned. This bug was introduced by a recent patch (identified below). Before that patch, seq_read() would increment m->index when the last of the buffer was returned (m->count == 0). After that patch, we rely on ->next to increment m->index after filling the buffer - but there was one place where that didn't happen. Link: https://lkml.kernel.org/lkml/877e7xl029.fsf@notabene.neil.brown.name/ Fixes: 1f4aace60b0e ("fs/seq_file.c: simplify seq_file iteration code and interface") Signed-off-by: NeilBrown <neilb@suse.com> Reported-by: Sergei Turchanov <turchanov@farpost.com> Tested-by: Sergei Turchanov <turchanov@farpost.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Markus Elfring <Markus.Elfring@web.de> Cc: <stable@vger.kernel.org> [4.19+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-08-16NFSv4: Fix an Oops in nfs4_do_setattrTrond Myklebust
commit 09a54f0ebfe263bc27c90bbd80187b9a93283887 upstream. If the user specifies an open mode of 3, then we don't have a NFSv4 state attached to the context, and so we Oops when we try to dereference it. Reported-by: Olga Kornievskaia <aglo@umich.edu> Fixes: 29b59f9416937 ("NFSv4: change nfs4_do_setattr to take...") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Cc: stable@vger.kernel.org # v4.10: 991eedb1371dc: NFSv4: Only pass the... Cc: stable@vger.kernel.org # v4.10+ Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-08-16NFSv4: Check the return value of update_open_stateid()Trond Myklebust
commit e3c8dc761ead061da2220ee8f8132f729ac3ddfe upstream. Ensure that we always check the return value of update_open_stateid() so that we can retry if the update of local state failed. This fixes infinite looping on state recovery. Fixes: e23008ec81ef3 ("NFSv4 reduce attribute requests for open reclaim") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Cc: stable@vger.kernel.org # v3.7+ Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-08-16NFSv4: Fix delegation state recoveryTrond Myklebust
commit 5eb8d18ca0e001c6055da2b7f30d8f6dca23a44f upstream. Once we clear the NFS_DELEGATED_STATE flag, we're telling nfs_delegation_claim_opens() that we're done recovering all open state for that stateid, so we really need to ensure that we test for all open modes that are currently cached and recover them before exiting nfs4_open_delegation_recall(). Fixes: 24311f884189d ("NFSv4: Recovery of recalled read delegations...") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Cc: stable@vger.kernel.org # v4.3+ Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-08-16smb3: send CAP_DFS capability during session setupSteve French
commit 8d33096a460d5b9bd13300f01615df5bb454db10 upstream. We had a report of a server which did not do a DFS referral because the session setup Capabilities field was set to 0 (unlike negotiate protocol where we set CAP_DFS). Better to send it session setup in the capabilities as well (this also more closely matches Windows client behavior). Signed-off-by: Steve French <stfrench@microsoft.com> Reviewed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com> CC: Stable <stable@vger.kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-08-16SMB3: Fix deadlock in validate negotiate hits reconnectPavel Shilovsky
commit e99c63e4d86d3a94818693147b469fa70de6f945 upstream. Currently we skip SMB2_TREE_CONNECT command when checking during reconnect because Tree Connect happens when establishing an SMB session. For SMB 3.0 protocol version the code also calls validate negotiate which results in SMB2_IOCL command being sent over the wire. This may deadlock on trying to acquire a mutex when checking for reconnect. Fix this by skipping SMB2_IOCL command when doing the reconnect check. Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com> Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com> CC: Stable <stable@vger.kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-08-16dax: dax_layout_busy_page() should not unmap cow pagesVivek Goyal
commit d75996dd022b6d83bd14af59b2775b1aa639e4b9 upstream. Vivek: "As of now dax_layout_busy_page() calls unmap_mapping_range() with last argument as 1, which says even unmap cow pages. I am wondering who needs to get rid of cow pages as well. I noticed one interesting side affect of this. I mount xfs with -o dax and mmaped a file with MAP_PRIVATE and wrote some data to a page which created cow page. Then I called fallocate() on that file to zero a page of file. fallocate() called dax_layout_busy_page() which unmapped cow pages as well and then I tried to read back the data I wrote and what I get is old data from persistent memory. I lost the data I had written. This read basically resulted in new fault and read back the data from persistent memory. This sounds wrong. Are there any users which need to unmap cow pages as well? If not, I am proposing changing it to not unmap cow pages. I noticed this while while writing virtio_fs code where when I tried to reclaim a memory range and that corrupted the executable and I was running from virtio-fs and program got segment violation." Dan: "In fact the unmap_mapping_range() in this path is only to synchronize against get_user_pages_fast() and force it to call back into the filesystem to re-establish the mapping. COW pages should be left untouched by dax_layout_busy_page()." Cc: <stable@vger.kernel.org> Fixes: 5fac7408d828 ("mm, fs, dax: handle layout changes to pinned dax mappings") Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Link: https://lore.kernel.org/r/20190802192956.GA3032@redhat.com Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-08-16gfs2: gfs2_walk_metadata fixAndreas Gruenbacher
commit a27a0c9b6a208722016c8ec5ad31ec96082b91ec upstream. It turns out that the current version of gfs2_metadata_walker suffers from multiple problems that can cause gfs2_hole_size to report an incorrect size. This will confuse fiemap as well as lseek with the SEEK_DATA flag. Fix that by changing gfs2_hole_walker to compute the metapath to the first data block after the hole (if any), and compute the hole size based on that. Fixes xfstest generic/490. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Reviewed-by: Bob Peterson <rpeterso@redhat.com> Cc: stable@vger.kernel.org # v4.18+ Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-08-16bdev: Fixup error handling in blkdev_get()Jan Kara
commit e91455bad5cff40a8c232f2204a5104127e3fec2 upstream. Commit 89e524c04fa9 ("loop: Fix mount(2) failure due to race with LOOP_SET_FD") converted blkdev_get() to use the new helpers for finishing claiming of a block device. However the conversion botched the error handling in blkdev_get() and thus the bdev has been marked as held even in case __blkdev_get() returned error. This led to occasional warnings with block/001 test from blktests like: kernel: WARNING: CPU: 5 PID: 907 at fs/block_dev.c:1899 __blkdev_put+0x396/0x3a0 Correct the error handling. CC: stable@vger.kernel.org Fixes: 89e524c04fa9 ("loop: Fix mount(2) failure due to race with LOOP_SET_FD") Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-08-09compat_ioctl: pppoe: fix PPPOEIOCSFWD handlingArnd Bergmann
[ Upstream commit 055d88242a6046a1ceac3167290f054c72571cd9 ] Support for handling the PPPOEIOCSFWD ioctl in compat mode was added in linux-2.5.69 along with hundreds of other commands, but was always broken sincen only the structure is compatible, but the command number is not, due to the size being sizeof(size_t), or at first sizeof(sizeof((struct sockaddr_pppox)), which is different on 64-bit architectures. Guillaume Nault adds: And the implementation was broken until 2016 (see 29e73269aa4d ("pppoe: fix reference counting in PPPoE proxy")), and nobody ever noticed. I should probably have removed this ioctl entirely instead of fixing it. Clearly, it has never been used. Fix it by adding a compat_ioctl handler for all pppoe variants that translates the command number and then calls the regular ioctl function. All other ioctl commands handled by pppoe are compatible between 32-bit and 64-bit, and require compat_ptr() conversion. This should apply to all stable kernels. Acked-by: Guillaume Nault <g.nault@alphalink.fr> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-08-06io_uring: fix KASAN use after free in io_sq_wq_submit_workJackie Liu
commit d0ee879187df966ef638031b5f5183078d672141 upstream. [root@localhost ~]# ./liburing/test/link QEMU Standard PC report that: [ 29.379892] CPU: 0 PID: 84 Comm: kworker/u2:2 Not tainted 5.3.0-rc2-00051-g4010b622f1d2-dirty #86 [ 29.379902] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014 [ 29.379913] Workqueue: io_ring-wq io_sq_wq_submit_work [ 29.379929] Call Trace: [ 29.379953] dump_stack+0xa9/0x10e [ 29.379970] ? io_sq_wq_submit_work+0xbf4/0xe90 [ 29.379986] print_address_description.cold.6+0x9/0x317 [ 29.379999] ? io_sq_wq_submit_work+0xbf4/0xe90 [ 29.380010] ? io_sq_wq_submit_work+0xbf4/0xe90 [ 29.380026] __kasan_report.cold.7+0x1a/0x34 [ 29.380044] ? io_sq_wq_submit_work+0xbf4/0xe90 [ 29.380061] kasan_report+0xe/0x12 [ 29.380076] io_sq_wq_submit_work+0xbf4/0xe90 [ 29.380104] ? io_sq_thread+0xaf0/0xaf0 [ 29.380152] process_one_work+0xb59/0x19e0 [ 29.380184] ? pwq_dec_nr_in_flight+0x2c0/0x2c0 [ 29.380221] worker_thread+0x8c/0xf40 [ 29.380248] ? __kthread_parkme+0xab/0x110 [ 29.380265] ? process_one_work+0x19e0/0x19e0 [ 29.380278] kthread+0x30b/0x3d0 [ 29.380292] ? kthread_create_on_node+0xe0/0xe0 [ 29.380311] ret_from_fork+0x3a/0x50 [ 29.380635] Allocated by task 209: [ 29.381255] save_stack+0x19/0x80 [ 29.381268] __kasan_kmalloc.constprop.6+0xc1/0xd0 [ 29.381279] kmem_cache_alloc+0xc0/0x240 [ 29.381289] io_submit_sqe+0x11bc/0x1c70 [ 29.381300] io_ring_submit+0x174/0x3c0 [ 29.381311] __x64_sys_io_uring_enter+0x601/0x780 [ 29.381322] do_syscall_64+0x9f/0x4d0 [ 29.381336] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 29.381633] Freed by task 84: [ 29.382186] save_stack+0x19/0x80 [ 29.382198] __kasan_slab_free+0x11d/0x160 [ 29.382210] kmem_cache_free+0x8c/0x2f0 [ 29.382220] io_put_req+0x22/0x30 [ 29.382230] io_sq_wq_submit_work+0x28b/0xe90 [ 29.382241] process_one_work+0xb59/0x19e0 [ 29.382251] worker_thread+0x8c/0xf40 [ 29.382262] kthread+0x30b/0x3d0 [ 29.382272] ret_from_fork+0x3a/0x50 [ 29.382569] The buggy address belongs to the object at ffff888067172140 which belongs to the cache io_kiocb of size 224 [ 29.384692] The buggy address is located 120 bytes inside of 224-byte region [ffff888067172140, ffff888067172220) [ 29.386723] The buggy address belongs to the page: [ 29.387575] page:ffffea00019c5c80 refcount:1 mapcount:0 mapping:ffff88806ace5180 index:0x0 [ 29.387587] flags: 0x100000000000200(slab) [ 29.387603] raw: 0100000000000200 dead000000000100 dead000000000122 ffff88806ace5180 [ 29.387617] raw: 0000000000000000 00000000800c000c 00000001ffffffff 0000000000000000 [ 29.387624] page dumped because: kasan: bad access detected [ 29.387920] Memory state around the buggy address: [ 29.388771] ffff888067172080: fb fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc [ 29.390062] ffff888067172100: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb [ 29.391325] >ffff888067172180: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ 29.392578] ^ [ 29.393480] ffff888067172200: fb fb fb fb fc fc fc fc fc fc fc fc fc fc fc fc [ 29.394744] ffff888067172280: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc [ 29.396003] ================================================================== [ 29.397260] Disabling lock debugging due to kernel taint io_sq_wq_submit_work free and read req again. Cc: Zhengyuan Liu <liuzhengyuan@kylinos.cn> Cc: linux-block@vger.kernel.org Cc: stable@vger.kernel.org Fixes: f7b76ac9d17e ("io_uring: fix counter inc/dec mismatch in async_list") Signed-off-by: Jackie Liu <liuyun01@kylinos.cn> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-08-06loop: Fix mount(2) failure due to race with LOOP_SET_FDJan Kara
commit 89e524c04fa966330e2e80ab2bc50b9944c5847a upstream. Commit 33ec3e53e7b1 ("loop: Don't change loop device under exclusive opener") made LOOP_SET_FD ioctl acquire exclusive block device reference while it updates loop device binding. However this can make perfectly valid mount(2) fail with EBUSY due to racing LOOP_SET_FD holding temporarily the exclusive bdev reference in cases like this: for i in {a..z}{a..z}; do dd if=/dev/zero of=$i.image bs=1k count=0 seek=1024 mkfs.ext2 $i.image mkdir mnt$i done echo "Run" for i in {a..z}{a..z}; do mount -o loop -t ext2 $i.image mnt$i & done Fix the problem by not getting full exclusive bdev reference in LOOP_SET_FD but instead just mark the bdev as being claimed while we update the binding information. This just blocks new exclusive openers instead of failing them with EBUSY thus fixing the problem. Fixes: 33ec3e53e7b1 ("loop: Don't change loop device under exclusive opener") Cc: stable@vger.kernel.org Tested-by: Kai-Heng Feng <kai.heng.feng@canonical.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-08-06dax: Fix missed wakeup in put_unlocked_entry()Jan Kara
commit 61c30c98ef17e5a330d7bb8494b78b3d6dffe9b8 upstream. The condition checking whether put_unlocked_entry() needs to wake up following waiter got broken by commit 23c84eb78375 ("dax: Fix missed wakeup with PMD faults"). We need to wake the waiter whenever the passed entry is valid (i.e., non-NULL and not special conflict entry). This could lead to processes never being woken up when waiting for entry lock. Fix the condition. Cc: <stable@vger.kernel.org> Link: http://lore.kernel.org/r/20190729120228.GC17833@quack2.suse.cz Fixes: 23c84eb78375 ("dax: Fix missed wakeup with PMD faults") Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-08-06Btrfs: fix race leading to fs corruption after transaction abortFilipe Manana
commit cb2d3daddbfb6318d170e79aac1f7d5e4d49f0d7 upstream. When one transaction is finishing its commit, it is possible for another transaction to start and enter its initial commit phase as well. If the first ends up getting aborted, we have a small time window where the second transaction commit does not notice that the previous transaction aborted and ends up committing, writing a superblock that points to btrees that reference extent buffers (nodes and leafs) that were not persisted to disk. The consequence is that after mounting the filesystem again, we will be unable to load some btree nodes/leafs, either because the content on disk is either garbage (or just zeroes) or corresponds to the old content of a previouly COWed or deleted node/leaf, resulting in the well known error messages "parent transid verify failed on ...". The following sequence diagram illustrates how this can happen. CPU 1 CPU 2 <at transaction N> btrfs_commit_transaction() (...) --> sets transaction state to TRANS_STATE_UNBLOCKED --> sets fs_info->running_transaction to NULL (...) btrfs_start_transaction() start_transaction() wait_current_trans() --> returns immediately because fs_info->running_transaction is NULL join_transaction() --> creates transaction N + 1 --> sets fs_info->running_transaction to transaction N + 1 --> adds transaction N + 1 to the fs_info->trans_list list --> returns transaction handle pointing to the new transaction N + 1 (...) btrfs_sync_file() btrfs_start_transaction() --> returns handle to transaction N + 1 (...) btrfs_write_and_wait_transaction() --> writeback of some extent buffer fails, returns an error btrfs_handle_fs_error() --> sets BTRFS_FS_STATE_ERROR in fs_info->fs_state --> jumps to label "scrub_continue" cleanup_transaction() btrfs_abort_transaction(N) --> sets BTRFS_FS_STATE_TRANS_ABORTED flag in fs_info->fs_state --> sets aborted field in the transaction and transaction handle structures, for transaction N only --> removes transaction from the list fs_info->trans_list btrfs_commit_transaction(N + 1) --> transaction N + 1 was not aborted, so it proceeds (...) --> sets the transaction's state to TRANS_STATE_COMMIT_START --> does not find the previous transaction (N) in the fs_info->trans_list, so it doesn't know that transaction was aborted, and the commit of transaction N + 1 proceeds (...) --> sets transaction N + 1 state to TRANS_STATE_UNBLOCKED btrfs_write_and_wait_transaction() --> succeeds writing all extent buffers created in the transaction N + 1 write_all_supers() --> succeeds --> we now have a superblock on disk that points to trees that refer to at least one extent buffer that was never persisted So fix this by updating the transaction commit path to check if the flag BTRFS_FS_STATE_TRANS_ABORTED is set on fs_info->fs_state if after setting the transaction to the TRANS_STATE_COMMIT_START we do not find any previous transaction in the fs_info->trans_list. If the flag is set, just fail the transaction commit with -EROFS, as we do in other places. The exact error code for the previous transaction abort was already logged and reported. Fixes: 49b25e0540904b ("btrfs: enhance transaction abort infrastructure") CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-08-06Btrfs: fix incremental send failure after deduplicationFilipe Manana
commit b4f9a1a87a48c255bb90d8a6c3d555a1abb88130 upstream. When doing an incremental send operation we can fail if we previously did deduplication operations against a file that exists in both snapshots. In that case we will fail the send operation with -EIO and print a message to dmesg/syslog like the following: BTRFS error (device sdc): Send: inconsistent snapshot, found updated \ extent for inode 257 without updated inode item, send root is 258, \ parent root is 257 This requires that we deduplicate to the same file in both snapshots for the same amount of times on each snapshot. The issue happens because a deduplication only updates the iversion of an inode and does not update any other field of the inode, therefore if we deduplicate the file on each snapshot for the same amount of time, the inode will have the same iversion value (stored as the "sequence" field on the inode item) on both snapshots, therefore it will be seen as unchanged between in the send snapshot while there are new/updated/deleted extent items when comparing to the parent snapshot. This makes the send operation return -EIO and print an error message. Example reproducer: $ mkfs.btrfs -f /dev/sdb $ mount /dev/sdb /mnt # Create our first file. The first half of the file has several 64Kb # extents while the second half as a single 512Kb extent. $ xfs_io -f -s -c "pwrite -S 0xb8 -b 64K 0 512K" /mnt/foo $ xfs_io -c "pwrite -S 0xb8 512K 512K" /mnt/foo # Create the base snapshot and the parent send stream from it. $ btrfs subvolume snapshot -r /mnt /mnt/mysnap1 $ btrfs send -f /tmp/1.snap /mnt/mysnap1 # Create our second file, that has exactly the same data as the first # file. $ xfs_io -f -c "pwrite -S 0xb8 0 1M" /mnt/bar # Create the second snapshot, used for the incremental send, before # doing the file deduplication. $ btrfs subvolume snapshot -r /mnt /mnt/mysnap2 # Now before creating the incremental send stream: # # 1) Deduplicate into a subrange of file foo in snapshot mysnap1. This # will drop several extent items and add a new one, also updating # the inode's iversion (sequence field in inode item) by 1, but not # any other field of the inode; # # 2) Deduplicate into a different subrange of file foo in snapshot # mysnap2. This will replace an extent item with a new one, also # updating the inode's iversion by 1 but not any other field of the # inode. # # After these two deduplication operations, the inode items, for file # foo, are identical in both snapshots, but we have different extent # items for this inode in both snapshots. We want to check this doesn't # cause send to fail with an error or produce an incorrect stream. $ xfs_io -r -c "dedupe /mnt/bar 0 0 512K" /mnt/mysnap1/foo $ xfs_io -r -c "dedupe /mnt/bar 512K 512K 512K" /mnt/mysnap2/foo # Create the incremental send stream. $ btrfs send -p /mnt/mysnap1 -f /tmp/2.snap /mnt/mysnap2 ERROR: send ioctl failed with -5: Input/output error This issue started happening back in 2015 when deduplication was updated to not update the inode's ctime and mtime and update only the iversion. Back then we would hit a BUG_ON() in send, but later in 2016 send was updated to return -EIO and print the error message instead of doing the BUG_ON(). A test case for fstests follows soon. Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=203933 Fixes: 1c919a5e13702c ("btrfs: don't update mtime/ctime on deduped inodes") CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-08-06coda: add error handling for fgetZhouyang Jia
[ Upstream commit 02551c23bcd85f0c68a8259c7b953d49d44f86af ] When fget fails, the lack of error-handling code may cause unexpected results. This patch adds error-handling code after calling fget. Link: http://lkml.kernel.org/r/2514ec03df9c33b86e56748513267a80dd8004d9.1558117389.git.jaharkes@cs.cmu.edu Signed-off-by: Zhouyang Jia <jiazhouyang09@gmail.com> Signed-off-by: Jan Harkes <jaharkes@cs.cmu.edu> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Colin Ian King <colin.king@canonical.com> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: David Howells <dhowells@redhat.com> Cc: Fabian Frederick <fabf@skynet.be> Cc: Mikko Rapeli <mikko.rapeli@iki.fi> Cc: Sam Protsenko <semen.protsenko@linaro.org> Cc: Yann Droneaud <ydroneaud@opteya.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-08-06cifs: fix crash in cifs_dfs_do_automountRonnie Sahlberg
[ Upstream commit ce465bf94b70f03136171a62b607864f00093b19 ] RHBZ: 1649907 Fix a crash that happens while attempting to mount a DFS referral from the same server on the root of a filesystem. Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-08-06ceph: return -ERANGE if virtual xattr value didn't fit in bufferJeff Layton
[ Upstream commit 3b421018f48c482bdc9650f894aa1747cf90e51d ] The getxattr manpage states that we should return ERANGE if the destination buffer size is too small to hold the value. ceph_vxattrcb_layout does this internally, but we should be doing this for all vxattrs. Fix the only caller of getxattr_cb to check the returned size against the buffer length and return -ERANGE if it doesn't fit. Drop the same check in ceph_vxattrcb_layout and just rely on the caller to handle it. Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: "Yan, Zheng" <zyan@redhat.com> Acked-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-08-06ceph: fix dir_lease_is_valid()Yan, Zheng
[ Upstream commit feab6ac25dbfe3ab96299cb741925dc8d2da0caf ] It should call __ceph_dentry_dir_lease_touch() under dentry->d_lock. Besides, ceph_dentry(dentry) can be NULL when called by LOOKUP_RCU d_revalidate() Signed-off-by: "Yan, Zheng" <zyan@redhat.com> Reviewed-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-08-06ceph: fix improper use of smp_mb__before_atomic()Andrea Parri
[ Upstream commit 749607731e26dfb2558118038c40e9c0c80d23b5 ] This barrier only applies to the read-modify-write operations; in particular, it does not apply to the atomic64_set() primitive. Replace the barrier with an smp_mb(). Fixes: fdd4e15838e59 ("ceph: rework dcache readdir") Reported-by: "Paul E. McKenney" <paulmck@linux.ibm.com> Reported-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Andrea Parri <andrea.parri@amarulasolutions.com> Reviewed-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-08-06cifs: Fix a race condition with cifs_echo_requestRonnie Sahlberg
[ Upstream commit f2caf901c1b7ce65f9e6aef4217e3241039db768 ] There is a race condition with how we send (or supress and don't send) smb echos that will cause the client to incorrectly think the server is unresponsive and thus needs to be reconnected. Summary of the race condition: 1) Daisy chaining scheduling creates a gap. 2) If traffic comes unfortunate shortly after the last echo, the planned echo is suppressed. 3) Due to the gap, the next echo transmission is delayed until after the timeout, which is set hard to twice the echo interval. This is fixed by changing the timeouts from 2 to three times the echo interval. Detailed description of the bug: https://lutz.donnerhacke.de/eng/Blog/Groundhog-Day-with-SMB-remount Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-08-06btrfs: qgroup: Don't hold qgroup_ioctl_lock in btrfs_qgroup_inherit()Qu Wenruo
[ Upstream commit e88439debd0a7f969b3ddba6f147152cd0732676 ] [BUG] Lockdep will report the following circular locking dependency: WARNING: possible circular locking dependency detected 5.2.0-rc2-custom #24 Tainted: G O ------------------------------------------------------ btrfs/8631 is trying to acquire lock: 000000002536438c (&fs_info->qgroup_ioctl_lock#2){+.+.}, at: btrfs_qgroup_inherit+0x40/0x620 [btrfs] but task is already holding lock: 000000003d52cc23 (&fs_info->tree_log_mutex){+.+.}, at: create_pending_snapshot+0x8b6/0xe60 [btrfs] which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #2 (&fs_info->tree_log_mutex){+.+.}: __mutex_lock+0x76/0x940 mutex_lock_nested+0x1b/0x20 btrfs_commit_transaction+0x475/0xa00 [btrfs] btrfs_commit_super+0x71/0x80 [btrfs] close_ctree+0x2bd/0x320 [btrfs] btrfs_put_super+0x15/0x20 [btrfs] generic_shutdown_super+0x72/0x110 kill_anon_super+0x18/0x30 btrfs_kill_super+0x16/0xa0 [btrfs] deactivate_locked_super+0x3a/0x80 deactivate_super+0x51/0x60 cleanup_mnt+0x3f/0x80 __cleanup_mnt+0x12/0x20 task_work_run+0x94/0xb0 exit_to_usermode_loop+0xd8/0xe0 do_syscall_64+0x210/0x240 entry_SYSCALL_64_after_hwframe+0x49/0xbe -> #1 (&fs_info->reloc_mutex){+.+.}: __mutex_lock+0x76/0x940 mutex_lock_nested+0x1b/0x20 btrfs_commit_transaction+0x40d/0xa00 [btrfs] btrfs_quota_enable+0x2da/0x730 [btrfs] btrfs_ioctl+0x2691/0x2b40 [btrfs] do_vfs_ioctl+0xa9/0x6d0 ksys_ioctl+0x67/0x90 __x64_sys_ioctl+0x1a/0x20 do_syscall_64+0x65/0x240 entry_SYSCALL_64_after_hwframe+0x49/0xbe -> #0 (&fs_info->qgroup_ioctl_lock#2){+.+.}: lock_acquire+0xa7/0x190 __mutex_lock+0x76/0x940 mutex_lock_nested+0x1b/0x20 btrfs_qgroup_inherit+0x40/0x620 [btrfs] create_pending_snapshot+0x9d7/0xe60 [btrfs] create_pending_snapshots+0x94/0xb0 [btrfs] btrfs_commit_transaction+0x415/0xa00 [btrfs] btrfs_mksubvol+0x496/0x4e0 [btrfs] btrfs_ioctl_snap_create_transid+0x174/0x180 [btrfs] btrfs_ioctl_snap_create_v2+0x11c/0x180 [btrfs] btrfs_ioctl+0xa90/0x2b40 [btrfs] do_vfs_ioctl+0xa9/0x6d0 ksys_ioctl+0x67/0x90 __x64_sys_ioctl+0x1a/0x20 do_syscall_64+0x65/0x240 entry_SYSCALL_64_after_hwframe+0x49/0xbe other info that might help us debug this: Chain exists of: &fs_info->qgroup_ioctl_lock#2 --> &fs_info->reloc_mutex --> &fs_info->tree_log_mutex Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(&fs_info->tree_log_mutex); lock(&fs_info->reloc_mutex); lock(&fs_info->tree_log_mutex); lock(&fs_info->qgroup_ioctl_lock#2); *** DEADLOCK *** 6 locks held by btrfs/8631: #0: 00000000ed8f23f6 (sb_writers#12){.+.+}, at: mnt_want_write_file+0x28/0x60 #1: 000000009fb1597a (&type->i_mutex_dir_key#10/1){+.+.}, at: btrfs_mksubvol+0x70/0x4e0 [btrfs] #2: 0000000088c5ad88 (&fs_info->subvol_sem){++++}, at: btrfs_mksubvol+0x128/0x4e0 [btrfs] #3: 000000009606fc3e (sb_internal#2){.+.+}, at: start_transaction+0x37a/0x520 [btrfs] #4: 00000000f82bbdf5 (&fs_info->reloc_mutex){+.+.}, at: btrfs_commit_transaction+0x40d/0xa00 [btrfs] #5: 000000003d52cc23 (&fs_info->tree_log_mutex){+.+.}, at: create_pending_snapshot+0x8b6/0xe60 [btrfs] [CAUSE] Due to the delayed subvolume creation, we need to call btrfs_qgroup_inherit() inside commit transaction code, with a lot of other mutex hold. This hell of lock chain can lead to above problem. [FIX] On the other hand, we don't really need to hold qgroup_ioctl_lock if we're in the context of create_pending_snapshot(). As in that context, we're the only one being able to modify qgroup. All other qgroup functions which needs qgroup_ioctl_lock are either holding a transaction handle, or will start a new transaction: Functions will start a new transaction(): * btrfs_quota_enable() * btrfs_quota_disable() Functions hold a transaction handler: * btrfs_add_qgroup_relation() * btrfs_del_qgroup_relation() * btrfs_create_qgroup() * btrfs_remove_qgroup() * btrfs_limit_qgroup() * btrfs_qgroup_inherit() call inside create_subvol() So we have a higher level protection provided by transaction, thus we don't need to always hold qgroup_ioctl_lock in btrfs_qgroup_inherit(). Only the btrfs_qgroup_inherit() call in create_subvol() needs to hold qgroup_ioctl_lock, while the btrfs_qgroup_inherit() call in create_pending_snapshot() is already protected by transaction. So the fix is to detect the context by checking trans->transaction->state. If we're at TRANS_STATE_COMMIT_DOING, then we're in commit transaction context and no need to get the mutex. Reported-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-08-06btrfs: Flush before reflinking any extent to prevent NOCOW write falling ↵Qu Wenruo
back to COW without data reservation [ Upstream commit a94d1d0cb3bf1983fcdf05b59d914dbff4f1f52c ] [BUG] The following script can cause unexpected fsync failure: #!/bin/bash dev=/dev/test/test mnt=/mnt/btrfs mkfs.btrfs -f $dev -b 512M > /dev/null mount $dev $mnt -o nospace_cache # Prealloc one extent xfs_io -f -c "falloc 8k 64m" $mnt/file1 # Fill the remaining data space xfs_io -f -c "pwrite 0 -b 4k 512M" $mnt/padding sync # Write into the prealloc extent xfs_io -c "pwrite 1m 16m" $mnt/file1 # Reflink then fsync, fsync would fail due to ENOSPC xfs_io -c "reflink $mnt/file1 8k 0 4k" -c "fsync" $mnt/file1 umount $dev The fsync fails with ENOSPC, and the last page of the buffered write is lost. [CAUSE] This is caused by: - Btrfs' back reference only has extent level granularity So write into shared extent must be COWed even only part of the extent is shared. So for above script we have: - fallocate Create a preallocated extent where we can do NOCOW write. - fill all the remaining data and unallocated space - buffered write into preallocated space As we have not enough space available for data and the extent is not shared (yet) we fall into NOCOW mode. - reflink Now part of the large preallocated extent is shared, later write into that extent must be COWed. - fsync triggers writeback But now the extent is shared and therefore we must fallback into COW mode, which fails with ENOSPC since there's not enough space to allocate data extents. [WORKAROUND] The workaround is to ensure any buffered write in the related extents (not just the reflink source range) get flushed before reflink/dedupe, so that NOCOW writes succeed that happened before reflinking succeed. The workaround is expensive, we could do it better by only flushing NOCOW range, but that needs extra accounting for NOCOW range. For now, fix the possible data loss first. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-08-06btrfs: fix minimum number of chunk errors for DUPDavid Sterba
[ Upstream commit 0ee5f8ae082e1f675a2fb6db601c31ac9958a134 ] The list of profiles in btrfs_chunk_max_errors lists DUP as a profile DUP able to tolerate 1 device missing. Though this profile is special with 2 copies, it still needs the device, unlike the others. Looking at the history of changes, thre's no clear reason why DUP is there, functions were refactored and blocks of code merged to one helper. d20983b40e828 Btrfs: fix writing data into the seed filesystem - factor code to a helper de11cc12df173 Btrfs: don't pre-allocate btrfs bio - unrelated change, DUP still in the list with max errors 1 a236aed14ccb0 Btrfs: Deal with failed writes in mirrored configurations - introduced the max errors, leaves DUP and RAID1 in the same group Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-08-06btrfs: tree-checker: Check if the file extent end overflowsQu Wenruo
[ Upstream commit 4c094c33c9ed4b8d0d814bd1d7ff78e123d15d00 ] Under certain conditions, we could have strange file extent item in log tree like: item 18 key (69599 108 397312) itemoff 15208 itemsize 53 extent data disk bytenr 0 nr 0 extent data offset 0 nr 18446744073709547520 ram 18446744073709547520 The num_bytes + ram_bytes overflow 64 bit type. For num_bytes part, we can detect such overflow along with file offset (key->offset), as file_offset + num_bytes should never go beyond u64. For ram_bytes part, it's about the decompressed size of the extent, not directly related to the size. In theory it is OK to have a large value, and put extra limitation on RAM bytes may cause unexpected false alerts. So in tree-checker, we only check if the file offset and num bytes overflow. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-08-06fs/adfs: super: fix use-after-free bugRussell King
[ Upstream commit 5808b14a1f52554de612fee85ef517199855e310 ] Fix a use-after-free bug during filesystem initialisation, where we access the disc record (which is stored in a buffer) after we have released the buffer. Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-08-04ceph: hold i_ceph_lock when removing caps for freeing inodeYan, Zheng
commit d6e47819721ae2d9d090058ad5570a66f3c42e39 upstream. ceph_d_revalidate(, LOOKUP_RCU) may call __ceph_caps_issued_mask() on a freeing inode. Signed-off-by: "Yan, Zheng" <zyan@redhat.com> Reviewed-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-08-04/proc/<pid>/cmdline: add back the setproctitle() special caseLinus Torvalds
commit d26d0cd97c88eb1a5704b42e41ab443406807810 upstream. This makes the setproctitle() special case very explicit indeed, and handles it with a separate helper function entirely. In the process, it re-instates the original semantics of simply stopping at the first NUL character when the original last NUL character is no longer there. [ The original semantics can still be seen in mm/util.c: get_cmdline() that is limited to a fixed-size buffer ] This makes the logic about when we use the string lengths etc much more obvious, and makes it easier to see what we do and what the two very different cases are. Note that even when we allow walking past the end of the argument array (because the setproctitle() might have overwritten and overflowed the original argv[] strings), we only allow it when it overflows into the environment region if it is immediately adjacent. [ Fixed for missing 'count' checks noted by Alexey Izbyshev ] Link: https://lore.kernel.org/lkml/alpine.LNX.2.21.1904052326230.3249@kich.toxcorp.com/ Fixes: 5ab827189965 ("fs/proc: simplify and clarify get_mm_cmdline() function") Cc: Jakub Jankowski <shasta@toxcorp.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Alexey Izbyshev <izbyshev@ispras.ru> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-08-04/proc/<pid>/cmdline: remove all the special casesLinus Torvalds
commit 3d712546d8ba9f25cdf080d79f90482aa4231ed4 upstream. Start off with a clean slate that only reads exactly from arg_start to arg_end, without any oddities. This simplifies the code and in the process removes the case that caused us to potentially leak an uninitialized byte from the temporary kernel buffer. Note that in order to start from scratch with an understandable base, this simplifies things _too_ much, and removes all the legacy logic to handle setproctitle() having changed the argument strings. We'll add back those special cases very differently in the next commit. Link: https://lore.kernel.org/lkml/20190712160913.17727-1-izbyshev@ispras.ru/ Fixes: f5b65348fd77 ("proc: fix missing final NUL in get_mm_cmdline() rewrite") Cc: Alexey Izbyshev <izbyshev@ispras.ru> Cc: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-08-04sched/fair: Don't free p->numa_faults with concurrent readersJann Horn
commit 16d51a590a8ce3befb1308e0e7ab77f3b661af33 upstream. When going through execve(), zero out the NUMA fault statistics instead of freeing them. During execve, the task is reachable through procfs and the scheduler. A concurrent /proc/*/sched reader can read data from a freed ->numa_faults allocation (confirmed by KASAN) and write it back to userspace. I believe that it would also be possible for a use-after-free read to occur through a race between a NUMA fault and execve(): task_numa_fault() can lead to task_numa_compare(), which invokes task_weight() on the currently running task of a different CPU. Another way to fix this would be to make ->numa_faults RCU-managed or add extra locking, but it seems easier to wipe the NUMA fault statistics on execve. Signed-off-by: Jann Horn <jannh@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Petr Mladek <pmladek@suse.com> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will@kernel.org> Fixes: 82727018b0d3 ("sched/numa: Call task_numa_free() from do_execve()") Link: https://lkml.kernel.org/r/20190716152047.14424-1-jannh@google.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>