path: root/fs
Age | Commit message | Author
2020-05-11nfs: fix NULL dereference in nfs4_get_valid_delegationJ. Bruce Fields
We add the new state to the nfsi->open_states list, making it potentially visible to other threads, before we've finished initializing it. That wasn't a problem when all the readers were also taking the i_lock (as we do here), but since we switched to RCU, there's now a possibility that a reader could see the partially initialized state. Symptoms observed were a crash when another thread called nfs4_get_valid_delegation() on a NULL inode, resulting in an oops like: BUG: unable to handle page fault for address: ffffffffffffffb0 ... RIP: 0010:nfs4_get_valid_delegation+0x6/0x30 [nfsv4] ... Call Trace: nfs4_open_prepare+0x80/0x1c0 [nfsv4] __rpc_execute+0x75/0x390 [sunrpc] ? finish_task_switch+0x75/0x260 rpc_async_schedule+0x29/0x40 [sunrpc] process_one_work+0x1ad/0x370 worker_thread+0x30/0x390 ? create_worker+0x1a0/0x1a0 kthread+0x10c/0x130 ? kthread_park+0x80/0x80 ret_from_fork+0x22/0x30 Fixes: 9ae075fdd190 "NFSv4: Convert open state lookup to use RCU" Reviewed-by: Seiichi Ikarashi <s.ikarashi@fujitsu.com> Tested-by: Daisuke Matsuda <matsuda-daisuke@fujitsu.com> Tested-by: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com> Cc: stable@vger.kernel.org # v4.20+ Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
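The race described above is the classic publish-before-initialize pattern under RCU. A minimal sketch of the safe ordering, using hypothetical names rather than the actual NFSv4 open-state code:

    #include <linux/fs.h>
    #include <linux/rculist.h>
    #include <linux/slab.h>
    #include <linux/spinlock.h>

    struct demo_state {
            struct inode     *inode;
            struct list_head  node;
    };

    /* Fully initialize the object *before* linking it where RCU readers can see it. */
    static int demo_publish(struct list_head *head, spinlock_t *lock,
                            struct inode *inode)
    {
            struct demo_state *st = kzalloc(sizeof(*st), GFP_KERNEL);

            if (!st)
                    return -ENOMEM;
            st->inode = inode;              /* initialize every field first ...     */

            spin_lock(lock);
            list_add_rcu(&st->node, head);  /* ... then publish to lockless readers */
            spin_unlock(lock);
            return 0;
    }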
2020-05-11rxrpc: Fix the excessive initial retransmission timeoutDavid Howells
rxrpc currently uses a fixed 4s retransmission timeout until the RTT is sufficiently sampled. This can cause problems with some fileservers with calls to the cache manager in the afs filesystem being dropped from the fileserver because a packet goes missing and the retransmission timeout is greater than the call expiry timeout. Fix this by: (1) Copying the RTT/RTO calculation code from Linux's TCP implementation and altering it to fit rxrpc. (2) Altering the various users of the RTT to make use of the new SRTT value. (3) Replacing the use of rxrpc_resend_timeout to use the calculated RTO value instead (which is needed in jiffies), along with a backoff. Notes: (1) rxrpc provides RTT samples by matching the serial numbers on outgoing DATA packets that have the RXRPC_REQUEST_ACK set and PING ACK packets against the reference serial number in incoming REQUESTED ACK and PING-RESPONSE ACK packets. (2) Each packet that is transmitted on an rxrpc connection gets a new per-connection serial number, even for retransmissions, so an ACK can be cross-referenced to a specific trigger packet. This allows RTT information to be drawn from retransmitted DATA packets also. (3) rxrpc maintains the RTT/RTO state on the rxrpc_peer record rather than on an rxrpc_call because many RPC calls won't live long enough to generate more than one sample. (4) The calculated SRTT value is in units of 8ths of a microsecond rather than nanoseconds. The (S)RTT and RTO values are displayed in /proc/net/rxrpc/peers. Fixes: 17926a79320a ([AF_RXRPC]: Provide secure RxRPC sockets for use by userspace and kernel both"") Signed-off-by: David Howells <dhowells@redhat.com>
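The RTT/RTO machinery borrowed from TCP follows the usual RFC 6298 rules: SRTT/RTTVAR smoothing per sample plus a clamped RTO with backoff. A stripped-down, plain-C rendering of the per-sample update, in microseconds and without rxrpc's 1/8th-microsecond fixed-point scaling (the clamp values below are illustrative, not rxrpc's):

    #include <stdint.h>

    /* Illustrative RFC 6298-style smoothing; clamps are examples, not rxrpc's. */
    #define DEMO_RTO_MIN_US     200000ULL
    #define DEMO_RTO_MAX_US  120000000ULL

    struct demo_rto {
            uint64_t srtt_us;       /* smoothed round-trip time */
            uint64_t rttvar_us;     /* round-trip time variance */
            uint64_t rto_us;        /* retransmission timeout   */
            int      have_sample;
    };

    static void demo_rto_sample(struct demo_rto *s, uint64_t rtt_us)
    {
            if (!s->have_sample) {
                    s->srtt_us = rtt_us;
                    s->rttvar_us = rtt_us / 2;
                    s->have_sample = 1;
            } else {
                    uint64_t delta = rtt_us > s->srtt_us ? rtt_us - s->srtt_us
                                                         : s->srtt_us - rtt_us;
                    s->rttvar_us = (3 * s->rttvar_us + delta) / 4;   /* RTTVAR */
                    s->srtt_us   = (7 * s->srtt_us + rtt_us) / 8;    /* SRTT   */
            }

            s->rto_us = s->srtt_us + 4 * s->rttvar_us;   /* RTO = SRTT + 4*RTTVAR */
            if (s->rto_us < DEMO_RTO_MIN_US)
                    s->rto_us = DEMO_RTO_MIN_US;
            if (s->rto_us > DEMO_RTO_MAX_US)
                    s->rto_us = DEMO_RTO_MAX_US;
    }

On retransmission the RTO is then typically doubled (exponential backoff) until a fresh sample arrives.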
2020-05-10Merge tag 'block-5.7-2020-05-09' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block fixes from Jens Axboe: - a small series fixing a use-after-free of bdi name (Christoph, Yufen) - NVMe fix for a regression with the smaller CQ update (Alexey) - NVMe fix for a hang at namespace scanning error recovery (Sagi) - fix race with blk-iocost iocg->abs_vdebt updates (Tejun) * tag 'block-5.7-2020-05-09' of git://git.kernel.dk/linux-block: nvme: fix possible hang when ns scanning fails during error recovery nvme-pci: fix "slimmer CQ head update" bdi: add a ->dev_name field to struct backing_dev_info bdi: use bdi_dev_name() to get device name bdi: move bdi_dev_name out of line vboxsf: don't use the source name in the bdi name iocost: protect iocg->abs_vdebt with iocg->waitq.lock
2020-05-09bdi: use bdi_dev_name() to get device nameYufen Yu
Use the common interface bdi_dev_name() to get the device name. Signed-off-by: Yufen Yu <yuyufen@huawei.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Add missing <linux/backing-dev.h> include to BFQ. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-09Merge tag 'io_uring-5.7-2020-05-08' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull io_uring fixes from Jens Axboe: - Fix finish_wait() balancing in file cancelation (Xiaoguang) - Ensure early cleanup of resources in ring map failure (Xiaoguang) - Ensure IORING_OP_SPLICE does the right file mode checks (Pavel) - Remove file opening from openat/openat2/statx, it's not needed and messes with O_PATH * tag 'io_uring-5.7-2020-05-08' of git://git.kernel.dk/linux-block: io_uring: don't use 'fd' for openat/openat2/statx splice: move f_mode checks to do_{splice,tee}() io_uring: handle -EFAULT properly in io_uring_setup() io_uring: fix mismatched finish_wait() calls in io_uring_cancel_files()
2020-05-09io_uring: fix zero len do_splice()Pavel Begunkov
do_splice() doesn't expect len to be 0. Just always return 0 in this case as splice(2) does. Fixes: 7d67af2c0134 ("io_uring: add splice(2) support") Reported-by: Jann Horn <jannh@google.com> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
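A minimal sketch of the intended behaviour (hypothetical wrapper, not the actual fs/splice.c change): a zero-length request short-circuits to 0, matching what splice(2) returns.

    #include <linux/fs.h>

    static long demo_splice(struct file *in, struct file *out, size_t len,
                            unsigned int flags)
    {
            if (!len)
                    return 0;       /* mirror splice(2): zero length is a successful no-op */

            /* ... otherwise fall through to the normal splice path ... */
            return -EINVAL;         /* placeholder for the real implementation */
    }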
2020-05-08cachefiles: Fix race between read_waiter and read_copier involving op->to_doLei Xue
There is a potential race in fscache operation enqueuing for reading and copying multiple pages from cachefiles to netfs. The problem can be seen easily on a heavy loaded system (for example many processes reading files continually on an NFS share covered by fscache triggered this problem within a few minutes). The race is due to cachefiles_read_waiter() adding the op to the monitor to_do list and then then drop the object->work_lock spinlock before completing fscache_enqueue_operation(). Once the lock is dropped, cachefiles_read_copier() grabs the op, completes processing it, and makes it through fscache_retrieval_complete() which sets the op->state to the final state of FSCACHE_OP_ST_COMPLETE(4). When cachefiles_read_waiter() finally gets through the remainder of fscache_enqueue_operation() it sees the invalid state, and hits the ASSERTCMP and the following oops is seen: [ 2259.612361] FS-Cache: [ 2259.614785] FS-Cache: Assertion failed [ 2259.618639] FS-Cache: 4 == 5 is false [ 2259.622456] ------------[ cut here ]------------ [ 2259.627190] kernel BUG at fs/fscache/operation.c:70! ... [ 2259.791675] RIP: 0010:[<ffffffffc061b4cf>] [<ffffffffc061b4cf>] fscache_enqueue_operation+0xff/0x170 [fscache] [ 2259.802059] RSP: 0000:ffffa0263d543be0 EFLAGS: 00010046 [ 2259.807521] RAX: 0000000000000019 RBX: ffffa01a4d390480 RCX: 0000000000000006 [ 2259.814847] RDX: 0000000000000000 RSI: 0000000000000046 RDI: ffffa0263d553890 [ 2259.822176] RBP: ffffa0263d543be8 R08: 0000000000000000 R09: ffffa0263c2d8708 [ 2259.829502] R10: 0000000000001e7f R11: 0000000000000000 R12: ffffa01a4d390480 [ 2259.844483] R13: ffff9fa9546c5920 R14: ffffa0263d543c80 R15: ffffa0293ff9bf10 [ 2259.859554] FS: 00007f4b6efbd700(0000) GS:ffffa0263d540000(0000) knlGS:0000000000000000 [ 2259.875571] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 2259.889117] CR2: 00007f49e1624ff0 CR3: 0000012b38b38000 CR4: 00000000007607e0 [ 2259.904015] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 2259.918764] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 2259.933449] PKRU: 55555554 [ 2259.943654] Call Trace: [ 2259.953592] <IRQ> [ 2259.955577] [<ffffffffc03a7c12>] cachefiles_read_waiter+0x92/0xf0 [cachefiles] [ 2259.978039] [<ffffffffa34d3942>] __wake_up_common+0x82/0x120 [ 2259.991392] [<ffffffffa34d3a63>] __wake_up_common_lock+0x83/0xc0 [ 2260.004930] [<ffffffffa34d3510>] ? task_rq_unlock+0x20/0x20 [ 2260.017863] [<ffffffffa34d3ab3>] __wake_up+0x13/0x20 [ 2260.030230] [<ffffffffa34c72a0>] __wake_up_bit+0x50/0x70 [ 2260.042535] [<ffffffffa35bdcdb>] unlock_page+0x2b/0x30 [ 2260.054495] [<ffffffffa35bdd09>] page_endio+0x29/0x90 [ 2260.066184] [<ffffffffa368fc81>] mpage_end_io+0x51/0x80 CPU1 cachefiles_read_waiter() 20 static int cachefiles_read_waiter(wait_queue_entry_t *wait, unsigned mode, 21 int sync, void *_key) 22 { ... 
    61   spin_lock(&object->work_lock);
    62   list_add_tail(&monitor->op_link, &op->to_do);
    63   spin_unlock(&object->work_lock);              <begin race window>
    64
    65   fscache_enqueue_retrieval(op);

   182   static inline void fscache_enqueue_retrieval(struct fscache_retrieval *op)
   183   {
   184           fscache_enqueue_operation(&op->op);
   185   }

    58   void fscache_enqueue_operation(struct fscache_operation *op)
    59   {
    60           struct fscache_cookie *cookie = op->object->cookie;
    61
    62           _enter("{OBJ%x OP%x,%u}",
    63                  op->object->debug_id, op->debug_id, atomic_read(&op->usage));
    64
    65           ASSERT(list_empty(&op->pend_link));
    66           ASSERT(op->processor != NULL);
    67           ASSERT(fscache_object_is_available(op->object));
    68           ASSERTCMP(atomic_read(&op->usage), >, 0);    <end race window>

CPU2 cachefiles_read_copier()

   168           while (!list_empty(&op->to_do)) {
   ...
   202                   fscache_end_io(op, monitor->netfs_page, error);
   203                   put_page(monitor->netfs_page);
   204                   fscache_retrieval_complete(op, 1);

CPU1

    58   void fscache_enqueue_operation(struct fscache_operation *op)
    59   {
   ...
    69           ASSERTIFCMP(op->state != FSCACHE_OP_ST_IN_PROGRESS,
    70                       op->state, ==, FSCACHE_OP_ST_CANCELLED);

Signed-off-by: Lei Xue <carmark.dlut@gmail.com> Signed-off-by: Dave Wysochanski <dwysocha@redhat.com> Signed-off-by: David Howells <dhowells@redhat.com>
2020-05-08NFSv4: Fix fscache cookie aux_data to ensure change_attr is includedDave Wysochanski
Commit 402cb8dda949 ("fscache: Attach the index key and aux data to the cookie") added the aux_data and aux_data_len to parameters to fscache_acquire_cookie(), and updated the callers in the NFS client. In the process it modified the aux_data to include the change_attr, but missed adding change_attr to a couple places where aux_data was used. Specifically, when opening a file and the change_attr is not added, the following attempt to lookup an object will fail inside cachefiles_check_object_xattr() = -116 due to nfs_fscache_inode_check_aux() failing memcmp on auxdata and returning FSCACHE_CHECKAUX_OBSOLETE. Fix this by adding nfs_fscache_update_auxdata() to set the auxdata from all relevant fields in the inode, including the change_attr. Fixes: 402cb8dda949 ("fscache: Attach the index key and aux data to the cookie") Signed-off-by: Dave Wysochanski <dwysocha@redhat.com> Signed-off-by: David Howells <dhowells@redhat.com>
2020-05-08NFS: Fix fscache super_cookie allocationDave Wysochanski
Commit f2aedb713c28 ("NFS: Add fs_context support.") reworked NFS mount code paths for fs_context support which included super_block initialization. In the process there was an extra return left in the code and so we never call nfs_fscache_get_super_cookie even if 'fsc' is given on as mount option. In addition, there is an extra check inside nfs_fscache_get_super_cookie for the NFS_OPTION_FSCACHE which is unnecessary since the only caller nfs_get_cache_cookie checks this flag. Fixes: f2aedb713c28 ("NFS: Add fs_context support.") Signed-off-by: Dave Wysochanski <dwysocha@redhat.com> Signed-off-by: David Howells <dhowells@redhat.com>
2020-05-08NFS: Fix fscache super_cookie index_key from changing after umountDave Wysochanski
Commit 402cb8dda949 ("fscache: Attach the index key and aux data to the cookie") added the index_key and index_key_len parameters to fscache_acquire_cookie(), and updated the callers in the NFS client. One of the callers was inside nfs_fscache_get_super_cookie() and was changed to use the full struct nfs_fscache_key as the index_key. However, a couple members of this structure contain pointers and thus will change each time the same NFS share is remounted. Since index_key is used for fscache_cookie->key_hash and this subsequently is used to compare cookies, the effectiveness of fscache with NFS is reduced to the point at which a umount occurs. Any subsequent remount of the same share will cause a unique NFS super_block index_key and key_hash to be generated for the same data, rendering any prior fscache data unable to be found. A simple reproducer demonstrates the problem. 1. Mount share with 'fsc', create a file, drop page cache systemctl start cachefilesd mount -o vers=3,fsc 127.0.0.1:/export /mnt dd if=/dev/zero of=/mnt/file1.bin bs=4096 count=1 echo 3 > /proc/sys/vm/drop_caches 2. Read file into page cache and fscache, then unmount dd if=/mnt/file1.bin of=/dev/null bs=4096 count=1 umount /mnt 3. Remount and re-read which should come from fscache mount -o vers=3,fsc 127.0.0.1:/export /mnt echo 3 > /proc/sys/vm/drop_caches dd if=/mnt/file1.bin of=/dev/null bs=4096 count=1 4. Check for READ ops in mountstats - there should be none grep READ: /proc/self/mountstats Looking at the history and the removed function, nfs_super_get_key(), we should only use nfs_fscache_key.key plus any uniquifier, for the fscache index_key. Fixes: 402cb8dda949 ("fscache: Attach the index key and aux data to the cookie") Signed-off-by: Dave Wysochanski <dwysocha@redhat.com> Signed-off-by: David Howells <dhowells@redhat.com>
2020-05-08Revert "gfs2: Don't demote a glock until its revokes are written"Bob Peterson
This reverts commit df5db5f9ee112e76b5202fbc331f990a0fc316d6. This patch fixes a regression: patch df5db5f9ee112 allowed function run_queue() to bypass its call to do_xmote() if revokes were queued for the glock. That's wrong because its call to do_xmote() is what is responsible for calling the go_sync() glops functions to sync both the ail list and any revokes queued for it. By bypassing the call, gfs2 could get into a stand-off where the glock could not be demoted until its revokes are written back, but the revokes would not be written back because do_xmote() was never called. It "sort of" works, however, because there are other mechanisms like the log flush daemon (logd) that can sync the ail items and revokes, if it deems it necessary. The problem is: without file system pressure, it might never deem it necessary. Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2020-05-08gfs2: If go_sync returns error, withdraw but skip invalidateBob Peterson
Before this patch, if the go_sync operation returned an error during the do_xmote process (such as being unable to sync metadata to the journal), the code did a goto out. That kept the glock locked, so it could not be given away, which correctly avoids file system corruption. However, it never set the withdraw bit or requeued the glock work, so it would hang forever, unable to ever demote the glock. This patch changes the goto to a new label, skip_inval, so that errors from go_sync are treated the same way as errors from go_inval: the delayed withdraw bit is set and the work is requeued. That way, the logd should eventually figure out there's a problem and withdraw properly there. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
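In outline, the change makes a go_sync failure fall through to the same withdraw-and-requeue handling as a go_inval failure instead of bailing out with the glock stuck. A simplified control-flow sketch; the demo_* names are hypothetical stand-ins, not the actual fs/gfs2/glock.c code:

    struct gfs2_glock;

    int  demo_go_sync(struct gfs2_glock *gl);              /* stand-in for glops->go_sync  */
    void demo_go_inval(struct gfs2_glock *gl);              /* stand-in for glops->go_inval */
    void demo_set_delayed_withdraw(struct gfs2_glock *gl);  /* hypothetical helpers         */
    void demo_requeue_glock_work(struct gfs2_glock *gl);

    static void demo_do_xmote(struct gfs2_glock *gl)
    {
            int ret = demo_go_sync(gl);

            if (ret)
                    goto skip_inval;        /* previously: goto out, leaving the glock stuck */

            demo_go_inval(gl);

    skip_inval:
            if (ret) {
                    demo_set_delayed_withdraw(gl);
                    demo_requeue_glock_work(gl);
            }
            /* ... continue handing over / demoting the glock as usual ... */
    }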
2020-05-08Merge tag 'ceph-for-5.7-rc5' of git://github.com/ceph/ceph-clientLinus Torvalds
Pull ceph fixes from Ilya Dryomov: "Fixes for an endianness handling bug that prevented mounts on big-endian arches, a spammy log message and a couple error paths. Also included a MAINTAINERS update" * tag 'ceph-for-5.7-rc5' of git://github.com/ceph/ceph-client: ceph: demote quotarealm lookup warning to a debug message MAINTAINERS: remove myself as ceph co-maintainer ceph: fix double unlock in handle_cap_export() ceph: fix special error code in ceph_try_get_caps() ceph: fix endianness bug when handling MDS session feature bits
2020-05-08gfs2: Grab glock reference sooner in gfs2_add_revokeAndreas Gruenbacher
This patch rearranges gfs2_add_revoke so that the extra glock reference is added earlier on in the function to avoid races in which the glock is freed before the new reference is taken. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2020-05-08gfs2: don't call quota_unhold if quotas are not lockedBob Peterson
Before this patch, function gfs2_quota_unlock checked if quotas are turned off, and if so, it branched to label out, which called gfs2_quota_unhold. With the new system of gfs2_qa_get and put, we no longer want to call gfs2_quota_unhold or we won't balance our gets and puts. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2020-05-08gfs2: move privileged user check to gfs2_quota_lock_checkBob Peterson
Before this patch, function gfs2_quota_lock checked if it was called from a privileged user, and if so, it bypassed the quota check: superuser can operate outside the quotas. That's the wrong place for the check because the lock/unlock functions are separate from the lock_check function, and you can do lock and unlock without actually checking the quotas. This patch moves the check to gfs2_quota_lock_check. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2020-05-08gfs2: remove check for quotas on in gfs2_quota_checkBob Peterson
This patch removes a check from gfs2_quota_check for whether quotas are enabled by the superblock. There is a test just prior for the GIF_QD_LOCKED bit in the inode, and that can only be set by functions that already check that quotas are enabled in the superblock. Therefore, the check is redundant. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2020-05-08gfs2: Change BUG_ON to an assert_withdraw in gfs2_quota_changeBob Peterson
Before this patch, gfs2_quota_change() would BUG_ON if the qa_ref counter was not a positive number. This patch changes it to be a withdraw instead. That way we can debug things more easily. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2020-05-08gfs2: Fix problems regarding gfs2_qa_get and _putBob Peterson
This patch fixes a couple of places in which gfs2_qa_get and gfs2_qa_put are not balanced: we now keep references around whenever a file is open for writing (see gfs2_open_common and gfs2_release), so we need to put all references we grab in function gfs2_create_inode. This was broken in the successful case and on one error path. This also means that we don't have a reference to put in gfs2_evict_inode. In addition, gfs2_qa_put was called for the wrong inode in gfs2_link. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2020-05-08ceph: demote quotarealm lookup warning to a debug messageLuis Henriques
A misconfigured cephx can easily result in having the kernel client flooding the logs with: ceph: Can't lookup inode 1 (err: -13) Change this message to debug level. Cc: stable@vger.kernel.org URL: https://tracker.ceph.com/issues/44546 Signed-off-by: Luis Henriques <lhenriques@suse.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-05-08Merge tag 'driver-core-5.7-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-coreLinus Torvalds
Pull driver core fixes from Greg KH: "Here are a number of small driver core fixes for 5.7-rc5 to resolve a bunch of reported issues with the current tree. Biggest here are the reverts and patches from John Stultz to resolve a bunch of deferred probe regressions we have been seeing in 5.7-rc right now. Along with those are some other smaller fixes: - coredump crash fix - devlink fix for when permissive mode was enabled - amba and platform device dma_parms fixes - component error silenced for when deferred probe happens All of these have been in linux-next for a while with no reported issues" * tag 'driver-core-5.7-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: regulator: Revert "Use driver_deferred_probe_timeout for regulator_init_complete_work" driver core: Ensure wait_for_device_probe() waits until the deferred_probe_timeout fires driver core: Use dev_warn() instead of dev_WARN() for deferred_probe_timeout warnings driver core: Revert default driver_deferred_probe_timeout value to 0 component: Silence bind error on -EPROBE_DEFER driver core: Fix handling of fw_devlink=permissive coredump: fix crash when umh is disabled amba: Initialize dma_parms for amba devices driver core: platform: Initialize dma_parms for platform devices
2020-05-08Merge branch 'akpm' (patches from Andrew)Linus Torvalds
Merge misc fixes from Andrew Morton: "14 fixes and one selftest to verify the ipc fixes herein" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: mm: limit boost_watermark on small zones ubsan: disable UBSAN_ALIGNMENT under COMPILE_TEST mm/vmscan: remove unnecessary argument description of isolate_lru_pages() epoll: atomically remove wait entry on wake up kselftests: introduce new epoll60 testcase for catching lost wakeups percpu: make pcpu_alloc() aware of current gfp context mm/slub: fix incorrect interpretation of s->offset scripts/gdb: repair rb_first() and rb_last() eventpoll: fix missing wakeup for ovflist in ep_poll_callback arch/x86/kvm/svm/sev.c: change flag passed to GUP fast in sev_pin_memory() scripts/decodecode: fix trapping instruction formatting kernel/kcov.c: fix typos in kcov_remote_start documentation mm/page_alloc: fix watchdog soft lockups during set_zone_contiguous() mm, memcg: fix error return value of mem_cgroup_css_alloc() ipc/mqueue.c: change __do_notify() to bypass check_kill_permission()
2020-05-08gfs2: More gfs2_find_jhead fixesAndreas Gruenbacher
It turns out that when extending an existing bio, gfs2_find_jhead fails to check if the block number is consecutive, which leads to incorrect reads for fragmented journals. In addition, limit the maximum bio size to an arbitrary value of 2 megabytes: since commit 07173c3ec276 ("block: enable multipage bvecs"), if we just keep adding pages until bio_add_page fails, bios will grow much larger than useful, which pins more memory than necessary with barely any additional performance gains. Fixes: f4686c26ecc3 ("gfs2: read journal in large chunks") Cc: stable@vger.kernel.org # v5.2+ Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Bob Peterson <rpeterso@redhat.com>
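The first half of the fix amounts to refusing to extend a bio with a block that is not contiguous with where the bio currently ends; the second half caps the bio size. A hedged sketch of that check (illustrative names, not the literal gfs2_find_jhead() code):

    #include <linux/bio.h>

    #define DEMO_BIO_MAX_SIZE (2 * 1024 * 1024)     /* arbitrary 2 MiB cap */

    /* Only extend @bio if @block_sector starts exactly where the bio ends. */
    static bool demo_can_extend_bio(struct bio *bio, sector_t block_sector)
    {
            if (!bio)
                    return false;
            if (bio_end_sector(bio) != block_sector)        /* journal is fragmented here */
                    return false;
            if (bio->bi_iter.bi_size >= DEMO_BIO_MAX_SIZE)
                    return false;
            return true;
    }

When the check fails, the existing bio is submitted and a new one is started at the non-contiguous block.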
2020-05-08gfs2: Another gfs2_walk_metadata fixAndreas Gruenbacher
Make sure we don't walk past the end of the metadata in gfs2_walk_metadata: the inode holds fewer pointers than indirect blocks. Slightly clean up gfs2_iomap_get. Fixes: a27a0c9b6a20 ("gfs2: gfs2_walk_metadata fix") Cc: stable@vger.kernel.org # v5.3+ Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2020-05-08gfs2: Fix use-after-free in gfs2_logd after withdrawBob Peterson
When the gfs2_logd daemon withdrew, the withdraw sequence called into make_fs_ro() to make the file system read-only. That caused the journal descriptors to be freed. However, those journal descriptors were used by gfs2_logd's call to gfs2_ail_flush_reqd(). This caused a use-after free and NULL pointer dereference. This patch changes function gfs2_logd() so that it stops all logd work until the thread is told to stop. Once a withdraw is done, it only does an interruptible sleep. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2020-05-08gfs2: Fix BUG during unmount after file system withdrawBob Peterson
Before this patch, when the logd daemon was forced to withdraw, it would try to request its journal be recovered by another cluster node. However, in single-user cases with lock_nolock, there are no other nodes to recover the journal. Function signal_our_withdraw() was recognizing the lock_nolock situation, but not until after it had evicted its journal inode. Since the journal descriptor that points to the inode was never removed from the master list, when the unmount occurred, it did another iput on the evicted inode, which resulted in a BUG_ON(inode->i_state & I_CLEAR). This patch moves the check for this situation earlier in function signal_our_withdraw(), which avoids the extra iput, so the unmount may happen normally. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2020-05-08gfs2: Fix error exit in do_xmoteBob Peterson
Before this patch, if an error was detected from glock function go_sync by function do_xmote, it would return. But the function had temporarily unlocked the gl_lockref spin_lock, and it never re-locked it. When the caller of do_xmote tried to unlock it again, it was already unlocked, which resulted in a corrupted spin_lock value. This patch makes sure the gl_lockref spin_lock is re-locked after it is unlocked. Thanks to Wu Bo <wubo40@huawei.com> for reporting this problem. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
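The invariant being restored is that do_xmote() must return with the gl_lockref spin_lock held again, since that is what its caller expects to unlock. A simplified sketch with a hypothetical lock holder and stand-in sync call, not the actual gfs2 code:

    #include <linux/spinlock.h>

    struct demo_glock {
            spinlock_t lock;        /* stands in for gl->gl_lockref.lock */
    };

    int demo_go_sync(struct demo_glock *gl);    /* hypothetical, may fail */

    static void demo_do_xmote(struct demo_glock *gl)
    {
            /* entered with gl->lock held */
            spin_unlock(&gl->lock);             /* dropped around the sync work */

            if (demo_go_sync(gl)) {
                    spin_lock(&gl->lock);       /* the missing re-lock on the error exit */
                    return;                     /* caller unlocks, as it expects to */
            }

            /* ... rest of do_xmote ... */
            spin_lock(&gl->lock);               /* normal path re-locks too */
    }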
2020-05-07epoll: atomically remove wait entry on wake upRoman Penyaev
This patch does two things: - fixes a lost wakeup introduced by commit 339ddb53d373 ("fs/epoll: remove unnecessary wakeups of nested epoll") - improves performance for events delivery. The description of the problem is the following: if N (>1) threads are waiting on ep->wq for new events and M (>1) events come, it is quite likely that >1 wakeups hit the same wait queue entry, because there is quite a big window between __add_wait_queue_exclusive() and the following __remove_wait_queue() calls in the ep_poll() function. This can lead to lost wakeups, because the thread which was woken up may not handle all the events in ->rdllist (the problem is described in more detail here: https://lkml.org/lkml/2019/10/7/905). The idea of the current patch is to use init_wait() instead of init_waitqueue_entry(). Internally, init_wait() sets autoremove_wake_function as the callback, which removes the wait entry atomically (under the wq locks) from the list, so the next incoming wakeup hits the next wait entry in the wait queue, thus preventing lost wakeups. The problem is reproduced very well by the epoll60 test case [1]. Wait entry removal on wakeup also has performance benefits, because there is no need to take ep->lock and remove the wait entry from the queue after a successful wakeup. Here is the timing output of the epoll60 test case:

With explicit wakeup from ep_scan_ready_list() (the state of the code prior to 339ddb53d373):
    real 0m6.970s
    user 0m49.786s
    sys  0m0.113s

After this patch:
    real 0m5.220s
    user 0m36.879s
    sys  0m0.019s

The other test case is stress-epoll [2], where one thread consumes all the events and other threads produce many events:

With explicit wakeup from ep_scan_ready_list() (the state of the code prior to 339ddb53d373):
    threads  events/ms  run-time ms
          8       5427         1474
         16       6163         2596
         32       6824         4689
         64       7060         9064
        128       6991        18309

After this patch:
    threads  events/ms  run-time ms
          8       5598         1429
         16       7073         2262
         32       7502         4265
         64       7640         8376
        128       7634        16767

("events/ms" represents event bandwidth, so higher is better; "run-time ms" represents the overall time spent doing the benchmark, so lower is better.)

[1] tools/testing/selftests/filesystems/epoll/epoll_wakeup_test.c
[2] https://github.com/rouming/test-tools/blob/master/stress-epoll.c

Signed-off-by: Roman Penyaev <rpenyaev@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Jason Baron <jbaron@akamai.com> Cc: Khazhismel Kumykov <khazhy@google.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Heiher <r@hev.cc> Cc: <stable@vger.kernel.org> Link: http://lkml.kernel.org/r/20200430130326.1368509-2-rpenyaev@suse.de Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
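The key difference is which wake function the wait entry carries: init_waitqueue_entry() leaves the woken entry on the queue, while init_wait() (and DEFINE_WAIT()) install autoremove_wake_function, which unlinks the entry under the waitqueue lock as part of the wakeup. A condensed sketch of the resulting wait loop, illustrative rather than the full ep_poll():

    #include <linux/sched.h>
    #include <linux/sched/signal.h>
    #include <linux/wait.h>

    static int demo_wait_for_events(wait_queue_head_t *wq,
                                    bool (*ready)(void *), void *arg)
    {
            DEFINE_WAIT(wait);      /* uses autoremove_wake_function */
            int ret = 0;

            for (;;) {
                    prepare_to_wait_exclusive(wq, &wait, TASK_INTERRUPTIBLE);
                    if (ready(arg))
                            break;
                    if (signal_pending(current)) {
                            ret = -EINTR;
                            break;
                    }
                    schedule();     /* a wakeup removes @wait from @wq atomically */
            }
            finish_wait(wq, &wait); /* nothing left to unlink if we were woken */
            return ret;
    }

Because the woken entry is already off the queue, the next wakeup targets the next waiter instead of being absorbed by the same entry.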
2020-05-07eventpoll: fix missing wakeup for ovflist in ep_poll_callbackKhazhismel Kumykov
In the event that we add to ovflist, before commit 339ddb53d373 ("fs/epoll: remove unnecessary wakeups of nested epoll") we would be woken up by ep_scan_ready_list, and did no wakeup in ep_poll_callback. With that wakeup removed, if we add to ovflist here, we may never wake up. Rather than adding back the ep_scan_ready_list wakeup - which was resulting in unnecessary wakeups, trigger a wake-up in ep_poll_callback. We noticed that one of our workloads was missing wakeups starting with 339ddb53d373 and upon manual inspection, this wakeup seemed missing to me. With this patch added, we no longer see missing wakeups. I haven't yet tried to make a small reproducer, but the existing kselftests in filesystem/epoll passed for me with this patch. [khazhy@google.com: use if/elif instead of goto + cleanup suggested by Roman] Link: http://lkml.kernel.org/r/20200424190039.192373-1-khazhy@google.com Fixes: 339ddb53d373 ("fs/epoll: remove unnecessary wakeups of nested epoll") Signed-off-by: Khazhismel Kumykov <khazhy@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Roman Penyaev <rpenyaev@suse.de> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Roman Penyaev <rpenyaev@suse.de> Cc: Heiher <r@hev.cc> Cc: Jason Baron <jbaron@akamai.com> Cc: <stable@vger.kernel.org> Link: http://lkml.kernel.org/r/20200424025057.118641-1-khazhy@google.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-05-07io_uring: don't use 'fd' for openat/openat2/statxJens Axboe
We currently make some guesses as when to open this fd, but in reality we have no business (or need) to do so at all. In fact, it makes certain things fail, like O_PATH. Remove the fd lookup from these opcodes, we're just passing the 'fd' to generic helpers anyway. With that, we can also remove the special casing of fd values in io_req_needs_file(), and the 'fd_non_neg' check that we have. And we can ensure that we only read sqe->fd once. This fixes O_PATH usage with openat/openat2, and ditto statx path side oddities. Cc: stable@vger.kernel.org: # v5.6 Reported-by: Max Kellermann <mk@cm4all.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-07Merge tag 'configfs-for-5.7' of git://git.infradead.org/users/hch/configfsLinus Torvalds
Pull configfs fix from Christoph Hellwig: "Fix a refcount leak in configfs_rmdir (Xiyu Yang)" * tag 'configfs-for-5.7' of git://git.infradead.org/users/hch/configfs: configfs: fix config_item refcnt leak in configfs_rmdir()
2020-05-07splice: move f_mode checks to do_{splice,tee}()Pavel Begunkov
do_splice() is used by io_uring, as will be do_tee(). Move f_mode checks from sys_{splice,tee}() to do_{splice,tee}(), so they're enforced for io_uring as well. Fixes: 7d67af2c0134 ("io_uring: add splice(2) support") Reported-by: Jann Horn <jannh@google.com> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
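With the checks living in do_splice()/do_tee(), every path that reaches them, including io_uring, now enforces the same access-mode rules as the syscall entry points did. A hedged sketch of the check itself (illustrative helper, not the exact fs/splice.c code):

    #include <linux/fs.h>

    /* Both files must have been opened for the required access. */
    static long demo_splice_check_modes(struct file *in, struct file *out)
    {
            if (unlikely(!(in->f_mode & FMODE_READ) ||
                         !(out->f_mode & FMODE_WRITE)))
                    return -EBADF;
            return 0;
    }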
2020-05-07vboxsf: don't use the source name in the bdi nameChristoph Hellwig
Simplify the bdi name to mirror what we are doing elsewhere, and drop the name in favor of just using a number. This avoids a potentially very long bdi name. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hans de Goede <hdegoede@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-06gfs2: fix withdraw sequence deadlockBob Peterson
After a gfs2 file system withdraw, any attempt to read metadata is automatically rejected by function gfs2_meta_read() except for reads of the journal inode. This turns out to be a problem because function signal_our_withdraw() repeatedly calls check_journal_clean() which reads the metadata (both its dinode and indirect blocks) to see if the entire journal is mapped. The dinode read works, but reading the indirect blocks returns -EIO which gets sent back up and causes a consistency error. This results in withdraw-from-withdraw, which becomes a deadlock. This patch changes the test in gfs2_meta_read() to allow all metadata reads for the journal. Instead of checking the journal block, it now checks for the journal inode glock which is the same for all blocks in the journal. This allows check_journal_clean() to properly check the journal without trying to withdraw recursively. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2020-05-06CIFS: Spelling s/EACCESS/EACCES/Geert Uytterhoeven
As per POSIX, the correct spelling is EACCES: include/uapi/asm-generic/errno-base.h:#define EACCES 13 /* Permission denied */ Fixes: b8f7442bc46e48fb ("CIFS: refactor cifs_get_inode_info()") Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be> Signed-off-by: Steve French <stfrench@microsoft.com>
2020-05-05io_uring: handle -EFAULT properly in io_uring_setup()Xiaoguang Wang
If copy_to_user() in io_uring_setup() fails, we leak many kernel resources, which will not be reclaimed until the process terminates. This bug can be reproduced by using mprotect to set params to PROT_READ. To fix this issue, refactor io_uring_create() a bit to add a new 'struct io_uring_params __user *params' parameter and move the copy_to_user() from io_uring_setup() into io_uring_create(), so that if copy_to_user() fails we can free the kernel resources properly. Suggested-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
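The essence of the refactor is to copy the params back to userspace while the creation path still owns the resources, so a failed copy can unwind them. A simplified sketch with hypothetical demo_* helpers and a cut-down params struct, not the real io_uring internals:

    #include <linux/err.h>
    #include <linux/types.h>
    #include <linux/uaccess.h>

    struct demo_uring_params {
            u32 sq_entries;
            u32 cq_entries;
    };

    struct demo_ctx;
    struct demo_ctx *demo_create_ctx(unsigned int entries, struct demo_uring_params *p);
    void demo_free_ctx(struct demo_ctx *ctx);
    long demo_install_fd(struct demo_ctx *ctx);

    static long demo_uring_create(unsigned int entries, struct demo_uring_params *p,
                                  struct demo_uring_params __user *uparams)
    {
            struct demo_ctx *ctx = demo_create_ctx(entries, p);
            long fd;

            if (IS_ERR(ctx))
                    return PTR_ERR(ctx);

            if (copy_to_user(uparams, p, sizeof(*p))) {
                    demo_free_ctx(ctx);     /* unwind while we still own everything */
                    return -EFAULT;
            }

            fd = demo_install_fd(ctx);      /* publish the fd only after the copy succeeds */
            if (fd < 0)
                    demo_free_ctx(ctx);
            return fd;
    }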
2020-05-04ceph: fix double unlock in handle_cap_export()Wu Bo
If the call to ceph_mdsc_open_export_target_session() fails, it will do a "goto retry", but the session mutex has already been unlocked. Re-lock the mutex in that case to ensure that we don't unlock it twice. Signed-off-by: Wu Bo <wubo40@huawei.com> Reviewed-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-05-04ceph: fix special error code in ceph_try_get_caps()Wu Bo
There are 3 special error codes: -EAGAIN/-EFBIG/-ESTALE. After calling try_get_cap_refs, ceph_try_get_caps tests for -EAGAIN twice. Ensure that it tests for -ESTALE instead. Signed-off-by: Wu Bo <wubo40@huawei.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-05-04ceph: fix endianness bug when handling MDS session feature bitsJeff Layton
Eduard reported a problem mounting cephfs on s390 arch. The feature mask sent by the MDS is little-endian, so we need to convert it before storing and testing against it. Cc: stable@vger.kernel.org Reported-and-Tested-by: Eduard Shishkin <edward6@linux.ibm.com> Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
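The fix boils down to byte-swapping the little-endian feature mask from the wire before storing it or testing bits in it; on little-endian machines the missing conversion is a no-op, which is why the bug only showed up on big-endian arches like s390. A small illustrative sketch (not the exact handle_session() code):

    #include <asm/byteorder.h>
    #include <linux/kernel.h>
    #include <linux/string.h>

    /* The MDS sends the feature mask as a little-endian, variable-length blob. */
    static u64 demo_parse_feature_bits(const void *wire, size_t len)
    {
            __le64 wire_features = 0;

            memcpy(&wire_features, wire, min_t(size_t, len, sizeof(wire_features)));
            return le64_to_cpu(wire_features);      /* convert before testing bits */
    }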
2020-05-04cachefiles: Fix corruption of the return value in cachefiles_read_or_alloc_pages()David Howells
The patch which changed cachefiles from calling ->bmap() to using the bmap() wrapper overwrote the running return value with the result of calling bmap(). This causes an assertion failure elsewhere in the code. Fix this by using ret2 rather than ret to hold the return value. The oops looks like: kernel BUG at fs/nfs/fscache.c:468! ... RIP: 0010:__nfs_readpages_from_fscache+0x18b/0x190 [nfs] ... Call Trace: nfs_readpages+0xbf/0x1c0 [nfs] ? __alloc_pages_nodemask+0x16c/0x320 read_pages+0x67/0x1a0 __do_page_cache_readahead+0x1cf/0x1f0 ondemand_readahead+0x172/0x2b0 page_cache_async_readahead+0xaa/0xe0 generic_file_buffered_read+0x852/0xd50 ? mem_cgroup_commit_charge+0x6e/0x140 ? nfs4_have_delegation+0x19/0x30 [nfsv4] generic_file_read_iter+0x100/0x140 ? nfs_revalidate_mapping+0x176/0x2b0 [nfs] nfs_file_read+0x6d/0xc0 [nfs] new_sync_read+0x11a/0x1c0 __vfs_read+0x29/0x40 vfs_read+0x8e/0x140 ksys_read+0x61/0xd0 __x64_sys_read+0x1a/0x20 do_syscall_64+0x60/0x1e0 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x7f5d148267e0 Fixes: 10d83e11a582 ("cachefiles: drop direct usage of ->bmap method.") Reported-by: David Wysochanski <dwysocha@redhat.com> Signed-off-by: David Howells <dhowells@redhat.com> Tested-by: David Wysochanski <dwysocha@redhat.com> cc: Carlos Maiolino <cmaiolino@redhat.com>
2020-05-04io_uring: fix mismatched finish_wait() calls in io_uring_cancel_files()Xiaoguang Wang
The prepare_to_wait() and finish_wait() calls in io_uring_cancel_files() are mismatched. I don't currently see any issues caused by this bug; it was found by reading the code. Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-03Merge tag 'for-5.7-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linuxLinus Torvalds
Pull more btrfs fixes from David Sterba: "A few more stability fixes, minor build warning fixes and git url fixup: - fix partial loss of prealloc extent past i_size after fsync - fix potential deadlock due to wrong transaction handle passing via journal_info - fix gcc 4.8 struct initialization warning - update git URL in MAINTAINERS entry" * tag 'for-5.7-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: MAINTAINERS: btrfs: fix git repo URL btrfs: fix gcc-4.8 build warning for struct initializer btrfs: transaction: Avoid deadlock due to bad initialization timing of fs_info::journal_info btrfs: fix partial loss of prealloc extent past i_size after fsync
2020-05-02Merge tag 'iomap-5.7-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds
Pull iomap fix from Darrick Wong: "Hoist the check for an unrepresentable FIBMAP return value into ioctl_fibmap. The internal kernel function can handle 64-bit values (and is needed to fix a regression on ext4 + jbd2). It is only the userspace ioctl that is so old that it cannot deal" * tag 'iomap-5.7-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: fibmap: Warn and return an error in case of block > INT_MAX
2020-05-02Merge tag 'nfs-for-5.7-4' of git://git.linux-nfs.org/projects/trondmy/linux-nfsLinus Torvalds
Pull NFS client bugfixes from Trond Myklebust: "Highlights include: Stable fixes: - fix handling of backchannel binding in BIND_CONN_TO_SESSION Bugfixes: - Fix a credential use-after-free issue in pnfs_roc() - Fix potential posix_acl refcnt leak in nfs3_set_acl - defer slow parts of rpc_free_client() to a workqueue - Fix an Oopsable race in __nfs_list_for_each_server() - Fix trace point use-after-free race - Regression: the RDMA client no longer responds to server disconnect requests - Fix return values of xdr_stream_encode_item_{present, absent} - _pnfs_return_layout() must always wait for layoutreturn completion Cleanups: - Remove unreachable error conditions" * tag 'nfs-for-5.7-4' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: NFS: Fix a race in __nfs_list_for_each_server() NFSv4.1: fix handling of backchannel binding in BIND_CONN_TO_SESSION SUNRPC: defer slow parts of rpc_free_client() to a workqueue. NFSv4: Remove unreachable error condition due to rpc_run_task() SUNRPC: Remove unreachable error condition xprtrdma: Fix use of xdr_stream_encode_item_{present, absent} xprtrdma: Fix trace point use-after-free race xprtrdma: Restore wake-up-all to rpcrdma_cm_event_handler() nfs: Fix potential posix_acl refcnt leak in nfs3_set_acl NFS/pnfs: Fix a credential use-after-free issue in pnfs_roc() NFS/pnfs: Ensure that _pnfs_return_layout() waits for layoutreturn completion
2020-05-01Merge tag 'io_uring-5.7-2020-05-01' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull io_uring fixes from Jens Axboe: - Fix for statx not grabbing the file table, making AT_EMPTY_PATH fail - Cover a few cases where async poll can handle retry, eliminating the need for an async thread - fallback request busy/free fix (Bijan) - syzbot reported SQPOLL thread exit fix for non-preempt (Xiaoguang) - Fix extra put of req for sync_file_range (Pavel) - Always punt splice async. We'll improve this for 5.8, but wanted to eliminate the inode mutex lock from the non-blocking path for 5.7 (Pavel) * tag 'io_uring-5.7-2020-05-01' of git://git.kernel.dk/linux-block: io_uring: punt splice async because of inode mutex io_uring: check non-sync defer_list carefully io_uring: fix extra put in sync_file_range() io_uring: use cond_resched() in io_ring_ctx_wait_and_kill() io_uring: use proper references for fallback_req locking io_uring: only force async punt if poll based retry can't handle it io_uring: enable poll retry for any file with ->read_iter / ->write_iter io_uring: statx must grab the file table for valid fd
2020-05-01io_uring: punt splice async because of inode mutexPavel Begunkov
Nonblocking do_splice() still may wait for some time on an inode mutex. Let's play safe and always punt it async. Reported-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-01io_uring: check non-sync defer_list carefullyPavel Begunkov
io_req_defer() does double-checked locking. Use the proper helper for that, i.e. list_empty_careful(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
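list_empty_careful() exists precisely for the lock-free half of a double-checked pattern: it tolerates a concurrent removal, and the result is then confirmed under the lock. A sketch of the pattern (illustrative, not the io_uring code):

    #include <linux/list.h>
    #include <linux/spinlock.h>

    static bool demo_defer_pending(struct list_head *defer_list, spinlock_t *lock)
    {
            bool pending;

            if (list_empty_careful(defer_list))     /* cheap lockless fast path */
                    return false;

            spin_lock_irq(lock);
            pending = !list_empty(defer_list);      /* recheck under the lock */
            spin_unlock_irq(lock);
            return pending;
    }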
2020-05-01io_uring: fix extra put in sync_file_range()Pavel Begunkov
[ 40.179474] refcount_t: underflow; use-after-free. [ 40.179499] WARNING: CPU: 6 PID: 1848 at lib/refcount.c:28 refcount_warn_saturate+0xae/0xf0 ... [ 40.179612] RIP: 0010:refcount_warn_saturate+0xae/0xf0 [ 40.179617] Code: 28 44 0a 01 01 e8 d7 01 c2 ff 0f 0b 5d c3 80 3d 15 44 0a 01 00 75 91 48 c7 c7 b8 f5 75 be c6 05 05 44 0a 01 01 e8 b7 01 c2 ff <0f> 0b 5d c3 80 3d f3 43 0a 01 00 0f 85 6d ff ff ff 48 c7 c7 10 f6 [ 40.179619] RSP: 0018:ffffb252423ebe18 EFLAGS: 00010286 [ 40.179623] RAX: 0000000000000000 RBX: ffff98d65e929400 RCX: 0000000000000000 [ 40.179625] RDX: 0000000000000001 RSI: 0000000000000086 RDI: 00000000ffffffff [ 40.179627] RBP: ffffb252423ebe18 R08: 0000000000000001 R09: 000000000000055d [ 40.179629] R10: 0000000000000c8c R11: 0000000000000001 R12: 0000000000000000 [ 40.179631] R13: ffff98d68c434400 R14: ffff98d6a9cbaa20 R15: ffff98d6a609ccb8 [ 40.179634] FS: 0000000000000000(0000) GS:ffff98d6af580000(0000) knlGS:0000000000000000 [ 40.179636] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 40.179638] CR2: 00000000033e3194 CR3: 000000006480a003 CR4: 00000000003606e0 [ 40.179641] Call Trace: [ 40.179652] io_put_req+0x36/0x40 [ 40.179657] io_free_work+0x15/0x20 [ 40.179661] io_worker_handle_work+0x2f5/0x480 [ 40.179667] io_wqe_worker+0x2a9/0x360 [ 40.179674] ? _raw_spin_unlock_irqrestore+0x24/0x40 [ 40.179681] kthread+0x12c/0x170 [ 40.179685] ? io_worker_handle_work+0x480/0x480 [ 40.179690] ? kthread_park+0x90/0x90 [ 40.179695] ret_from_fork+0x35/0x40 [ 40.179702] ---[ end trace 85027405f00110aa ]--- Opcode handler must never put submission ref, but that's what io_sync_file_range_finish() do. use io_steal_work() there. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-04-30io_uring: use cond_resched() in io_ring_ctx_wait_and_kill()Xiaoguang Wang
While working on making io_uring sqpoll mode support syscalls that need struct files_struct, I got a CPU soft lockup in io_ring_ctx_wait_and_kill():

    while (ctx->sqo_thread && !wq_has_sleeper(&ctx->sqo_wait))
        cpu_relax();

The above loop never gets a chance to exit: preemption isn't enabled in the kernel, and the context calling io_ring_ctx_wait_and_kill() and io_sq_thread() run on the same CPU. If io_sq_thread() calls cond_resched() to yield the CPU and another context then enters the above loop, io_sq_thread() will sit in the runqueue forever and never get to exit. Using cond_resched() in this loop fixes the issue. Reported-by: syzbot+66243bb7126c410cefe6@syzkaller.appspotmail.com Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-04-30io_uring: use proper references for fallback_req lockingBijan Mottahedeh
Use ctx->fallback_req address for test_and_set_bit_lock() and clear_bit_unlock(). Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>