summaryrefslogtreecommitdiffstats
path: root/fs
AgeCommit message (Collapse)Author
2020-06-17fat: don't allow to mount if the FAT length == 0OGAWA Hirofumi
commit b1b65750b8db67834482f758fc385bfa7560d228 upstream. If FAT length == 0, the image doesn't have any data. And it can be the cause of overlapping the root dir and FAT entries. Also Windows treats it as invalid format. Reported-by: syzbot+6f1624f937d9d6911e2d@syzkaller.appspotmail.com Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Marco Elver <elver@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Link: http://lkml.kernel.org/r/87r1wz8mrd.fsf@mail.parknet.co.jp Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-06-17proc: Use new_inode not new_inode_pseudoEric W. Biederman
commit ef1548adada51a2f32ed7faef50aa465e1b4c5da upstream. Recently syzbot reported that unmounting proc when there is an ongoing inotify watch on the root directory of proc could result in a use after free when the watch is removed after the unmount of proc when the watcher exits. Commit 69879c01a0c3 ("proc: Remove the now unnecessary internal mount of proc") made it easier to unmount proc and allowed syzbot to see the problem, but looking at the code it has been around for a long time. Looking at the code the fsnotify watch should have been removed by fsnotify_sb_delete in generic_shutdown_super. Unfortunately the inode was allocated with new_inode_pseudo instead of new_inode so the inode was not on the sb->s_inodes list. Which prevented fsnotify_unmount_inodes from finding the inode and removing the watch as well as made it so the "VFS: Busy inodes after unmount" warning could not find the inodes to warn about them. Make all of the inodes in proc visible to generic_shutdown_super, and fsnotify_sb_delete by using new_inode instead of new_inode_pseudo. The only functional difference is that new_inode places the inodes on the sb->s_inodes list. I wrote a small test program and I can verify that without changes it can trigger this issue, and by replacing new_inode_pseudo with new_inode the issues goes away. Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/000000000000d788c905a7dfa3f4@google.com Reported-by: syzbot+7d2debdcdb3cb93c1e5e@syzkaller.appspotmail.com Fixes: 0097875bd415 ("proc: Implement /proc/thread-self to point at the directory of the current thread") Fixes: 021ada7dff22 ("procfs: switch /proc/self away from proc_dir_entry") Fixes: 51f0885e5415 ("vfs,proc: guarantee unique inodes in /proc") Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-06-17ovl: initialize error in ovl_copy_xattrYuxuan Shui
commit 520da69d265a91c6536c63851cbb8a53946974f0 upstream. In ovl_copy_xattr, if all the xattrs to be copied are overlayfs private xattrs, the copy loop will terminate without assigning anything to the error variable, thus returning an uninitialized value. If ovl_copy_xattr is called from ovl_clear_empty, this uninitialized error value is put into a pointer by ERR_PTR(), causing potential invalid memory accesses down the line. This commit initialize error with 0. This is the correct value because when there's no xattr to copy, because all xattrs are private, ovl_copy_xattr should succeed. This bug is discovered with the help of INIT_STACK_ALL and clang. Signed-off-by: Yuxuan Shui <yshuiv7@gmail.com> Link: https://bugs.chromium.org/p/chromium/issues/detail?id=1050405 Fixes: 0956254a2d5b ("ovl: don't copy up opaqueness") Cc: stable@vger.kernel.org # v4.8 Signed-off-by: Alexander Potapenko <glider@google.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-06-17nilfs2: fix null pointer dereference at nilfs_segctor_do_construct()Ryusuke Konishi
commit 8301c719a2bd131436438e49130ee381d30933f5 upstream. After commit c3aab9a0bd91 ("mm/filemap.c: don't initiate writeback if mapping has no dirty pages"), the following null pointer dereference has been reported on nilfs2: BUG: kernel NULL pointer dereference, address: 00000000000000a8 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 0 P4D 0 Oops: 0000 [#1] SMP PTI ... RIP: 0010:percpu_counter_add_batch+0xa/0x60 ... Call Trace: __test_set_page_writeback+0x2d3/0x330 nilfs_segctor_do_construct+0x10d3/0x2110 [nilfs2] nilfs_segctor_construct+0x168/0x260 [nilfs2] nilfs_segctor_thread+0x127/0x3b0 [nilfs2] kthread+0xf8/0x130 ... This crash turned out to be caused by set_page_writeback() call for segment summary buffers at nilfs_segctor_prepare_write(). set_page_writeback() can call inc_wb_stat(inode_to_wb(inode), WB_WRITEBACK) where inode_to_wb(inode) is NULL if the inode of underlying block device does not have an associated wb. This fixes the issue by calling inode_attach_wb() in advance to ensure to associate the bdev inode with its wb. Fixes: c3aab9a0bd91 ("mm/filemap.c: don't initiate writeback if mapping has no dirty pages") Reported-by: Walton Hoops <me@waltonhoops.com> Reported-by: Tomas Hlavaty <tom@logand.com> Reported-by: ARAI Shun-ichi <hermes@ceres.dti.ne.jp> Reported-by: Hideki EIRAKU <hdk1983@gmail.com> Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Tested-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Cc: <stable@vger.kernel.org> [5.4+] Link: http://lkml.kernel.org/r/20200608.011819.1399059588922299158.konishi.ryusuke@gmail.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-06-17smb3: add indatalen that can be a non-zero value to calculation of credit ↵Namjae Jeon
charge in smb2 ioctl commit ebf57440ec59a36e1fc5fe91e31d66ae0d1662d0 upstream. Some of tests in xfstests failed with cifsd kernel server since commit e80ddeb2f70e. cifsd kernel server validates credit charge from client by calculating it base on max((InputCount + OutputCount) and (MaxInputResponse + MaxOutputResponse)) according to specification. MS-SMB2 specification describe credit charge calculation of smb2 ioctl : If Connection.SupportsMultiCredit is TRUE, the server MUST validate CreditCharge based on the maximum of (InputCount + OutputCount) and (MaxInputResponse + MaxOutputResponse), as specified in section 3.3.5.2.5. If the validation fails, it MUST fail the IOCTL request with STATUS_INVALID_PARAMETER. This patch add indatalen that can be a non-zero value to calculation of credit charge in SMB2_ioctl_init(). Fixes: e80ddeb2f70e ("smb3: fix incorrect number of credits when ioctl MaxOutputResponse > 64K") Cc: Stable <stable@vger.kernel.org> Reviewed-by: Aurelien Aptel <aaptel@suse.com> Cc: Steve French <smfrench@gmail.com> Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com> Signed-off-by: Steve French <stfrench@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-06-17smb3: fix incorrect number of credits when ioctl MaxOutputResponse > 64KSteve French
commit e80ddeb2f70ebd0786aa7cdba3e58bc931fa0bb5 upstream. We were not checking to see if ioctl requests asked for more than 64K (ie when CIFSMaxBufSize was > 64K) so when setting larger CIFSMaxBufSize then ioctls would fail with invalid parameter errors. When requests ask for more than 64K in MaxOutputResponse then we need to ask for more than 1 credit. Signed-off-by: Steve French <stfrench@microsoft.com> CC: Stable <stable@vger.kernel.org> Reviewed-by: Aurelien Aptel <aaptel@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-06-17io_uring: use kvfree() in io_sqe_buffer_register()Denis Efremov
commit a8c73c1a614f6da6c0b04c393f87447e28cb6de4 upstream. Use kvfree() to free the pages and vmas, since they are allocated by kvmalloc_array() in a loop. Fixes: d4ef647510b1 ("io_uring: avoid page allocation warnings") Signed-off-by: Denis Efremov <efremov@linux.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20200605093203.40087-1-efremov@linux.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-06-17aio: fix async fsync credsMiklos Szeredi
commit 530f32fc370fd1431ea9802dbc53ab5601dfccdb upstream. Avi Kivity reports that on fuse filesystems running in a user namespace asyncronous fsync fails with EOVERFLOW. The reason is that f_ops->fsync() is called with the creds of the kthread performing aio work instead of the creds of the process originally submitting IOCB_CMD_FSYNC. Fuse sends the creds of the caller in the request header and it needs to translate the uid and gid into the server's user namespace. Since the kthread is running in init_user_ns, the translation will fail and the operation returns an error. It can be argued that fsync doesn't actually need any creds, but just zeroing out those fields in the header (as with requests that currently don't take creds) is a backward compatibility risk. Instead of working around this issue in fuse, solve the core of the problem by calling the filesystem with the proper creds. Reported-by: Avi Kivity <avi@scylladb.com> Tested-by: Giuseppe Scrivano <gscrivan@redhat.com> Fixes: c9582eb0ff7d ("fuse: Fail all requests with invalid uids or gids") Cc: stable@vger.kernel.org # 4.18+ Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-06-17fanotify: fix ignore mask logic for events on child and on dirAmir Goldstein
[ Upstream commit 2f02fd3fa13e51713b630164f8a8e5b42de8283b ] The comments in fanotify_group_event_mask() say: "If the event is on dir/child and this mark doesn't care about events on dir/child, don't send it!" Specifically, mount and filesystem marks do not care about events on child, but they can still specify an ignore mask for those events. For example, a group that has: - A mount mark with mask 0 and ignore_mask FAN_OPEN - An inode mark on a directory with mask FAN_OPEN | FAN_OPEN_EXEC with flag FAN_EVENT_ON_CHILD A child file open for exec would be reported to group with the FAN_OPEN event despite the fact that FAN_OPEN is in ignore mask of mount mark, because the mark iteration loop skips over non-inode marks for events on child when calculating the ignore mask. Move ignore mask calculation to the top of the iteration loop block before excluding marks for events on dir/child. Link: https://lore.kernel.org/r/20200524072441.18258-1-amir73il@gmail.com Reported-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/linux-fsdevel/20200521162443.GA26052@quack2.suse.cz/ Fixes: 55bf882c7f13 "fanotify: fix merging marks masks with FAN_ONDIR" Fixes: b469e7e47c8a "fanotify: fix handling of events on child..." Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-06-17gfs2: Even more gfs2_find_jhead fixesAndreas Gruenbacher
[ Upstream commit 20be493b787cd581c9fffad7fcd6bfbe6af1050c ] Fix several issues in the previous gfs2_find_jhead fix: * When updating @blocks_submitted, @block refers to the first block block not submitted yet, not the last block submitted, so fix an off-by-one error. * We want to ensure that @blocks_submitted is far enough ahead of @blocks_read to guarantee that there is in-flight I/O. Otherwise, we'll eventually end up waiting for pages that haven't been submitted, yet. * It's much easier to compare the number of blocks added with the number of blocks submitted to limit the maximum bio size. * Even with bio chaining, we can keep adding blocks until we reach the maximum bio size, as long as we stop at a page boundary. This simplifies the logic. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Reviewed-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-06-07io_uring: initialize ctx->sqo_wait earlierJens Axboe
[ Upstream commit 583863ed918136412ddf14de2e12534f17cfdc6f ] Ensure that ctx->sqo_wait is initialized as soon as the ctx is allocated, instead of deferring it to the offload setup. This fixes a syzbot reported lockdep complaint, which is really due to trying to wake_up on an uninitialized wait queue: RSP: 002b:00007fffb1fb9aa8 EFLAGS: 00000246 ORIG_RAX: 00000000000001a9 RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 0000000000441319 RDX: 0000000000000001 RSI: 0000000020000140 RDI: 000000000000047b RBP: 0000000000010475 R08: 0000000000000001 R09: 00000000004002c8 R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000402260 R13: 00000000004022f0 R14: 0000000000000000 R15: 0000000000000000 INFO: trying to register non-static key. the code is fine but needs lockdep annotation. turning off the locking correctness validator. CPU: 1 PID: 7090 Comm: syz-executor222 Not tainted 5.7.0-rc1-next-20200415-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x188/0x20d lib/dump_stack.c:118 assign_lock_key kernel/locking/lockdep.c:913 [inline] register_lock_class+0x1664/0x1760 kernel/locking/lockdep.c:1225 __lock_acquire+0x104/0x4c50 kernel/locking/lockdep.c:4234 lock_acquire+0x1f2/0x8f0 kernel/locking/lockdep.c:4934 __raw_spin_lock_irqsave include/linux/spinlock_api_smp.h:110 [inline] _raw_spin_lock_irqsave+0x8c/0xbf kernel/locking/spinlock.c:159 __wake_up_common_lock+0xb4/0x130 kernel/sched/wait.c:122 io_cqring_ev_posted+0xa5/0x1e0 fs/io_uring.c:1160 io_poll_remove_all fs/io_uring.c:4357 [inline] io_ring_ctx_wait_and_kill+0x2bc/0x5a0 fs/io_uring.c:7305 io_uring_create fs/io_uring.c:7843 [inline] io_uring_setup+0x115e/0x22b0 fs/io_uring.c:7870 do_syscall_64+0xf6/0x7d0 arch/x86/entry/common.c:295 entry_SYSCALL_64_after_hwframe+0x49/0xb3 RIP: 0033:0x441319 Code: e8 5c ae 02 00 48 83 c4 18 c3 0f 1f 80 00 00 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 bb 0a fc ff c3 66 2e 0f 1f 84 00 00 00 00 RSP: 002b:00007fffb1fb9aa8 EFLAGS: 00000246 ORIG_RAX: 00000000000001a9 Reported-by: syzbot+8c91f5d054e998721c57@syzkaller.appspotmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-06-03fs/binfmt_elf.c: allocate initialized memory in fill_thread_core_info()Alexander Potapenko
[ Upstream commit 1d605416fb7175e1adf094251466caa52093b413 ] KMSAN reported uninitialized data being written to disk when dumping core. As a result, several kilobytes of kmalloc memory may be written to the core file and then read by a non-privileged user. Reported-by: sam <sunhaoyl@outlook.com> Signed-off-by: Alexander Potapenko <glider@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Kees Cook <keescook@chromium.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: <stable@vger.kernel.org> Link: http://lkml.kernel.org/r/20200419100848.63472-1-glider@google.com Link: https://github.com/google/kmsan/issues/76 Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-06-03ceph: flush release queue when handling caps for unknown inodeJeff Layton
[ Upstream commit fb33c114d3ed5bdac230716f5b0a93b56b92a90d ] It's possible for the VFS to completely forget about an inode, but for it to still be sitting on the cap release queue. If the MDS sends the client a cap message for such an inode, it just ignores it today, which can lead to a stall of up to 5s until the cap release queue is flushed. If we get a cap message for an inode that can't be located, then go ahead and flush the cap release queue. Cc: stable@vger.kernel.org URL: https://tracker.ceph.com/issues/45532 Fixes: 1e9c2eb6811e ("ceph: delete stale dentry when last reference is dropped") Reported-and-Tested-by: Andrej Filipčič <andrej.filipcic@ijs.si> Suggested-by: Yan, Zheng <zyan@redhat.com> Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-06-03cifs: Fix null pointer check in cifs_readSteve French
[ Upstream commit 9bd21d4b1a767c3abebec203342f3820dcb84662 ] Coverity scan noted a redundant null check Coverity-id: 728517 Reported-by: Coverity <scan-admin@coverity.com> Signed-off-by: Steve French <stfrench@microsoft.com> Reviewed-by: Shyam Prasad N <nspmangalore@gmail.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-06-03cachefiles: Fix race between read_waiter and read_copier involving op->to_doLei Xue
[ Upstream commit 7bb0c5338436dae953622470d52689265867f032 ] There is a potential race in fscache operation enqueuing for reading and copying multiple pages from cachefiles to netfs. The problem can be seen easily on a heavy loaded system (for example many processes reading files continually on an NFS share covered by fscache triggered this problem within a few minutes). The race is due to cachefiles_read_waiter() adding the op to the monitor to_do list and then then drop the object->work_lock spinlock before completing fscache_enqueue_operation(). Once the lock is dropped, cachefiles_read_copier() grabs the op, completes processing it, and makes it through fscache_retrieval_complete() which sets the op->state to the final state of FSCACHE_OP_ST_COMPLETE(4). When cachefiles_read_waiter() finally gets through the remainder of fscache_enqueue_operation() it sees the invalid state, and hits the ASSERTCMP and the following oops is seen: [ 2259.612361] FS-Cache: [ 2259.614785] FS-Cache: Assertion failed [ 2259.618639] FS-Cache: 4 == 5 is false [ 2259.622456] ------------[ cut here ]------------ [ 2259.627190] kernel BUG at fs/fscache/operation.c:70! ... [ 2259.791675] RIP: 0010:[<ffffffffc061b4cf>] [<ffffffffc061b4cf>] fscache_enqueue_operation+0xff/0x170 [fscache] [ 2259.802059] RSP: 0000:ffffa0263d543be0 EFLAGS: 00010046 [ 2259.807521] RAX: 0000000000000019 RBX: ffffa01a4d390480 RCX: 0000000000000006 [ 2259.814847] RDX: 0000000000000000 RSI: 0000000000000046 RDI: ffffa0263d553890 [ 2259.822176] RBP: ffffa0263d543be8 R08: 0000000000000000 R09: ffffa0263c2d8708 [ 2259.829502] R10: 0000000000001e7f R11: 0000000000000000 R12: ffffa01a4d390480 [ 2259.844483] R13: ffff9fa9546c5920 R14: ffffa0263d543c80 R15: ffffa0293ff9bf10 [ 2259.859554] FS: 00007f4b6efbd700(0000) GS:ffffa0263d540000(0000) knlGS:0000000000000000 [ 2259.875571] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 2259.889117] CR2: 00007f49e1624ff0 CR3: 0000012b38b38000 CR4: 00000000007607e0 [ 2259.904015] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 2259.918764] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 2259.933449] PKRU: 55555554 [ 2259.943654] Call Trace: [ 2259.953592] <IRQ> [ 2259.955577] [<ffffffffc03a7c12>] cachefiles_read_waiter+0x92/0xf0 [cachefiles] [ 2259.978039] [<ffffffffa34d3942>] __wake_up_common+0x82/0x120 [ 2259.991392] [<ffffffffa34d3a63>] __wake_up_common_lock+0x83/0xc0 [ 2260.004930] [<ffffffffa34d3510>] ? task_rq_unlock+0x20/0x20 [ 2260.017863] [<ffffffffa34d3ab3>] __wake_up+0x13/0x20 [ 2260.030230] [<ffffffffa34c72a0>] __wake_up_bit+0x50/0x70 [ 2260.042535] [<ffffffffa35bdcdb>] unlock_page+0x2b/0x30 [ 2260.054495] [<ffffffffa35bdd09>] page_endio+0x29/0x90 [ 2260.066184] [<ffffffffa368fc81>] mpage_end_io+0x51/0x80 CPU1 cachefiles_read_waiter() 20 static int cachefiles_read_waiter(wait_queue_entry_t *wait, unsigned mode, 21 int sync, void *_key) 22 { ... 61 spin_lock(&object->work_lock); 62 list_add_tail(&monitor->op_link, &op->to_do); 63 spin_unlock(&object->work_lock); <begin race window> 64 65 fscache_enqueue_retrieval(op); 182 static inline void fscache_enqueue_retrieval(struct fscache_retrieval *op) 183 { 184 fscache_enqueue_operation(&op->op); 185 } 58 void fscache_enqueue_operation(struct fscache_operation *op) 59 { 60 struct fscache_cookie *cookie = op->object->cookie; 61 62 _enter("{OBJ%x OP%x,%u}", 63 op->object->debug_id, op->debug_id, atomic_read(&op->usage)); 64 65 ASSERT(list_empty(&op->pend_link)); 66 ASSERT(op->processor != NULL); 67 ASSERT(fscache_object_is_available(op->object)); 68 ASSERTCMP(atomic_read(&op->usage), >, 0); <end race window> CPU2 cachefiles_read_copier() 168 while (!list_empty(&op->to_do)) { ... 202 fscache_end_io(op, monitor->netfs_page, error); 203 put_page(monitor->netfs_page); 204 fscache_retrieval_complete(op, 1); CPU1 58 void fscache_enqueue_operation(struct fscache_operation *op) 59 { ... 69 ASSERTIFCMP(op->state != FSCACHE_OP_ST_IN_PROGRESS, 70 op->state, ==, FSCACHE_OP_ST_CANCELLED); Signed-off-by: Lei Xue <carmark.dlut@gmail.com> Signed-off-by: Dave Wysochanski <dwysocha@redhat.com> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-06-03gfs2: Grab glock reference sooner in gfs2_add_revokeAndreas Gruenbacher
[ Upstream commit f4e2f5e1a527ce58fc9f85145b03704779a3123e ] This patch rearranges gfs2_add_revoke so that the extra glock reference is added earlier on in the function to avoid races in which the glock is freed before the new reference is taken. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-06-03gfs2: move privileged user check to gfs2_quota_lock_checkBob Peterson
[ Upstream commit 4ed0c30811cb4d30ef89850b787a53a84d5d2bcb ] Before this patch, function gfs2_quota_lock checked if it was called from a privileged user, and if so, it bypassed the quota check: superuser can operate outside the quotas. That's the wrong place for the check because the lock/unlock functions are separate from the lock_check function, and you can do lock and unlock without actually checking the quotas. This patch moves the check to gfs2_quota_lock_check. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-05-27rxrpc: Fix the excessive initial retransmission timeoutDavid Howells
commit c410bf01933e5e09d142c66c3df9ad470a7eec13 upstream. rxrpc currently uses a fixed 4s retransmission timeout until the RTT is sufficiently sampled. This can cause problems with some fileservers with calls to the cache manager in the afs filesystem being dropped from the fileserver because a packet goes missing and the retransmission timeout is greater than the call expiry timeout. Fix this by: (1) Copying the RTT/RTO calculation code from Linux's TCP implementation and altering it to fit rxrpc. (2) Altering the various users of the RTT to make use of the new SRTT value. (3) Replacing the use of rxrpc_resend_timeout to use the calculated RTO value instead (which is needed in jiffies), along with a backoff. Notes: (1) rxrpc provides RTT samples by matching the serial numbers on outgoing DATA packets that have the RXRPC_REQUEST_ACK set and PING ACK packets against the reference serial number in incoming REQUESTED ACK and PING-RESPONSE ACK packets. (2) Each packet that is transmitted on an rxrpc connection gets a new per-connection serial number, even for retransmissions, so an ACK can be cross-referenced to a specific trigger packet. This allows RTT information to be drawn from retransmitted DATA packets also. (3) rxrpc maintains the RTT/RTO state on the rxrpc_peer record rather than on an rxrpc_call because many RPC calls won't live long enough to generate more than one sample. (4) The calculated SRTT value is in units of 8ths of a microsecond rather than nanoseconds. The (S)RTT and RTO values are displayed in /proc/net/rxrpc/peers. Fixes: 17926a79320a ([AF_RXRPC]: Provide secure RxRPC sockets for use by userspace and kernel both"") Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-05-27Revert "gfs2: Don't demote a glock until its revokes are written"Bob Peterson
[ Upstream commit b14c94908b1b884276a6608dea3d0b1b510338b7 ] This reverts commit df5db5f9ee112e76b5202fbc331f990a0fc316d6. This patch fixes a regression: patch df5db5f9ee112 allowed function run_queue() to bypass its call to do_xmote() if revokes were queued for the glock. That's wrong because its call to do_xmote() is what is responsible for calling the go_sync() glops functions to sync both the ail list and any revokes queued for it. By bypassing the call, gfs2 could get into a stand-off where the glock could not be demoted until its revokes are written back, but the revokes would not be written back because do_xmote() was never called. It "sort of" works, however, because there are other mechanisms like the log flush daemon (logd) that can sync the ail items and revokes, if it deems it necessary. The problem is: without file system pressure, it might never deem it necessary. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-05-27ceph: fix double unlock in handle_cap_export()Wu Bo
[ Upstream commit 4d8e28ff3106b093d98bfd2eceb9b430c70a8758 ] If the ceph_mdsc_open_export_target_session() return fails, it will do a "goto retry", but the session mutex has already been unlocked. Re-lock the mutex in that case to ensure that we don't unlock it twice. Signed-off-by: Wu Bo <wubo40@huawei.com> Reviewed-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-05-27configfs: fix config_item refcnt leak in configfs_rmdir()Xiyu Yang
[ Upstream commit 8aebfffacfa379ba400da573a5bf9e49634e38cb ] configfs_rmdir() invokes configfs_get_config_item(), which returns a reference of the specified config_item object to "parent_item" with increased refcnt. When configfs_rmdir() returns, local variable "parent_item" becomes invalid, so the refcount should be decreased to keep refcount balanced. The reference counting issue happens in one exception handling path of configfs_rmdir(). When down_write_killable() fails, the function forgets to decrease the refcnt increased by configfs_get_config_item(), causing a refcnt leak. Fix this issue by calling config_item_put() when down_write_killable() fails. Signed-off-by: Xiyu Yang <xiyuyang19@fudan.edu.cn> Signed-off-by: Xin Tan <tanxin.ctf@gmail.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-05-27afs: Don't unlock fetched data pages until the op completes successfullyDavid Howells
[ Upstream commit 9d1be4f4dc5ff1c66c86acfd2c35765d9e3776b3 ] Don't call req->page_done() on each page as we finish filling it with the data coming from the network. Whilst this might speed up the application a bit, it's a problem if there's a network failure and the operation has to be reissued. If this happens, an oops occurs because afs_readpages_page_done() clears the pointer to each page it unlocks and when a retry happens, the pointers to the pages it wants to fill are now NULL (and the pages have been unlocked anyway). Instead, wait till the operation completes successfully and only then release all the pages after clearing any terminal gap (the server can give us less data than we requested as we're allowed to ask for more than is available). KASAN produces a bug like the following, and even without KASAN, it can oops and panic. BUG: KASAN: wild-memory-access in _copy_to_iter+0x323/0x5f4 Write of size 1404 at addr 0005088000000000 by task md5sum/5235 CPU: 0 PID: 5235 Comm: md5sum Not tainted 5.7.0-rc3-fscache+ #250 Hardware name: ASUS All Series/H97-PLUS, BIOS 2306 10/09/2014 Call Trace: memcpy+0x39/0x58 _copy_to_iter+0x323/0x5f4 __skb_datagram_iter+0x89/0x2a6 skb_copy_datagram_iter+0x129/0x135 rxrpc_recvmsg_data.isra.0+0x615/0xd42 rxrpc_kernel_recv_data+0x1e9/0x3ae afs_extract_data+0x139/0x33a yfs_deliver_fs_fetch_data64+0x47a/0x91b afs_deliver_to_call+0x304/0x709 afs_wait_for_call_to_complete+0x1cc/0x4ad yfs_fs_fetch_data+0x279/0x288 afs_fetch_data+0x1e1/0x38d afs_readpages+0x593/0x72e read_pages+0xf5/0x21e __do_page_cache_readahead+0x128/0x23f ondemand_readahead+0x36e/0x37f generic_file_buffered_read+0x234/0x680 new_sync_read+0x109/0x17e vfs_read+0xe6/0x138 ksys_read+0xd8/0x14d do_syscall_64+0x6e/0x8a entry_SYSCALL_64_after_hwframe+0x49/0xb3 Fixes: 196ee9cd2d04 ("afs: Make afs_fs_fetch_data() take a list of pages") Fixes: 30062bd13e36 ("afs: Implement YFS support in the fs client") Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-05-27ubifs: remove broken lazytime supportChristoph Hellwig
[ Upstream commit ecf84096a526f2632ee85c32a3d05de3fa60ce80 ] When "ubifs: introduce UBIFS_ATIME_SUPPORT to ubifs" introduced atime support to ubifs, it also added lazytime support. As far as I can tell the lazytime support is terminally broken, as it causes mark_inode_dirty_sync to be called from __writeback_single_inode, which will then trigger the locking assert in ubifs_dirty_inode. Just remove the broken lazytime support for now, it can be added back later, especially as some infrastructure changes should make that easier soon. Fixes: 8c1c5f263833 ("ubifs: introduce UBIFS_ATIME_SUPPORT to ubifs") Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Richard Weinberger <richard@nod.at> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-05-27fix multiplication overflow in copy_fdtable()Al Viro
[ Upstream commit 4e89b7210403fa4a8acafe7c602b6212b7af6c3b ] cpy and set really should be size_t; we won't get an overflow on that, since sysctl_nr_open can't be set above ~(size_t)0 / sizeof(void *), so nr that would've managed to overflow size_t on that multiplication won't get anywhere near copy_fdtable() - we'll fail with EMFILE before that. Cc: stable@kernel.org # v2.6.25+ Fixes: 9cfe015aa424 (get rid of NR_OPEN and introduce a sysctl_nr_open) Reported-by: Thiago Macieira <thiago.macieira@intel.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-05-27ubifs: fix wrong use of crypto_shash_descsize()Eric Biggers
[ Upstream commit 3c3c32f85b6cc05e5db78693457deff03ac0f434 ] crypto_shash_descsize() returns the size of the shash_desc context needed to compute the hash, not the size of the hash itself. crypto_shash_digestsize() would be correct, or alternatively using c->hash_len and c->hmac_desc_len which already store the correct values. But actually it's simpler to just use stack arrays, so do that instead. Fixes: 49525e5eecca ("ubifs: Add helper functions for authentication support") Fixes: da8ef65f9573 ("ubifs: Authenticate replayed journal") Cc: <stable@vger.kernel.org> # v4.20+ Cc: Sascha Hauer <s.hauer@pengutronix.de> Signed-off-by: Eric Biggers <ebiggers@google.com> Acked-by: Sascha Hauer <s.hauer@pengutronix.de> Signed-off-by: Richard Weinberger <richard@nod.at> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-05-20fanotify: fix merging marks masks with FAN_ONDIRAmir Goldstein
commit 55bf882c7f13dda8bbe624040c6d5b4fbb812d16 upstream. Change the logic of FAN_ONDIR in two ways that are similar to the logic of FAN_EVENT_ON_CHILD, that was fixed in commit 54a307ba8d3c ("fanotify: fix logic of events on child"): 1. The flag is meaningless in ignore mask 2. The flag refers only to events in the mask of the mark where it is set This is what the fanotify_mark.2 man page says about FAN_ONDIR: "Without this flag, only events for files are created." It doesn't say anything about setting this flag in ignore mask to stop getting events on directories nor can I think of any setup where this capability would be useful. Currently, when marks masks are merged, the FAN_ONDIR flag set in one mark affects the events that are set in another mark's mask and this behavior causes unexpected results. For example, a user adds a mark on a directory with mask FAN_ATTRIB | FAN_ONDIR and a mount mark with mask FAN_OPEN (without FAN_ONDIR). An opendir() of that directory (which is inside that mount) generates a FAN_OPEN event even though neither of the marks requested to get open events on directories. Link: https://lore.kernel.org/r/20200319151022.31456-10-amir73il@gmail.com Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz> Cc: Rachel Sibley <rasibley@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-05-20exec: Move would_dump into flush_old_execEric W. Biederman
commit f87d1c9559164294040e58f5e3b74a162bf7c6e8 upstream. I goofed when I added mm->user_ns support to would_dump. I missed the fact that in the case of binfmt_loader, binfmt_em86, binfmt_misc, and binfmt_script bprm->file is reassigned. Which made the move of would_dump from setup_new_exec to __do_execve_file before exec_binprm incorrect as it can result in would_dump running on the script instead of the interpreter of the script. The net result is that the code stopped making unreadable interpreters undumpable. Which allows them to be ptraced and written to disk without special permissions. Oops. The move was necessary because the call in set_new_exec was after bprm->mm was no longer valid. To correct this mistake move the misplaced would_dump from __do_execve_file into flos_old_exec, before exec_mmap is called. I tested and confirmed that without this fix I can attach with gdb to a script with an unreadable interpreter, and with this fix I can not. Cc: stable@vger.kernel.org Fixes: f84df2a6f268 ("exec: Ensure mm->user_ns contains the execed files") Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-05-20cifs: fix leaked reference on requeued writeAdam McCoy
commit a48137996063d22ffba77e077425f49873856ca5 upstream. Failed async writes that are requeued may not clean up a refcount on the file, which can result in a leaked open. This scenario arises very reliably when using persistent handles and a reconnect occurs while writing. cifs_writev_requeue only releases the reference if the write fails (rc != 0). The server->ops->async_writev operation will take its own reference, so the initial reference can always be released. Signed-off-by: Adam McCoy <adam@forsedomani.com> Signed-off-by: Steve French <stfrench@microsoft.com> CC: Stable <stable@vger.kernel.org> Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-05-20NFSv3: fix rpc receive buffer size for MOUNT callOlga Kornievskaia
[ Upstream commit 8eed292bc8cbf737e46fb1c119d4c8f6dcb00650 ] Prior to commit e3d3ab64dd66 ("SUNRPC: Use au_rslack when computing reply buffer size"), there was enough slack in the reply buffer to commodate filehandles of size 60bytes. However, the real problem was that the reply buffer size for the MOUNT operation was not correctly calculated. Received buffer size used the filehandle size for NFSv2 (32bytes) which is much smaller than the allowed filehandle size for the v3 mounts. Fix the reply buffer size (decode arguments size) for the MNT command. Fixes: 2c94b8eca1a2 ("SUNRPC: Use au_rslack when computing reply buffer size") Signed-off-by: Olga Kornievskaia <kolga@netapp.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-05-20nfs: fix NULL deference in nfs4_get_valid_delegationJ. Bruce Fields
[ Upstream commit 29fe839976266bc7c55b927360a1daae57477723 ] We add the new state to the nfsi->open_states list, making it potentially visible to other threads, before we've finished initializing it. That wasn't a problem when all the readers were also taking the i_lock (as we do here), but since we switched to RCU, there's now a possibility that a reader could see the partially initialized state. Symptoms observed were a crash when another thread called nfs4_get_valid_delegation() on a NULL inode, resulting in an oops like: BUG: unable to handle page fault for address: ffffffffffffffb0 ... RIP: 0010:nfs4_get_valid_delegation+0x6/0x30 [nfsv4] ... Call Trace: nfs4_open_prepare+0x80/0x1c0 [nfsv4] __rpc_execute+0x75/0x390 [sunrpc] ? finish_task_switch+0x75/0x260 rpc_async_schedule+0x29/0x40 [sunrpc] process_one_work+0x1ad/0x370 worker_thread+0x30/0x390 ? create_worker+0x1a0/0x1a0 kthread+0x10c/0x130 ? kthread_park+0x80/0x80 ret_from_fork+0x22/0x30 Fixes: 9ae075fdd190 "NFSv4: Convert open state lookup to use RCU" Reviewed-by: Seiichi Ikarashi <s.ikarashi@fujitsu.com> Tested-by: Daisuke Matsuda <matsuda-daisuke@fujitsu.com> Tested-by: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com> Cc: stable@vger.kernel.org # v4.20+ Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-05-20NFSv4: Fix fscache cookie aux_data to ensure change_attr is includedDave Wysochanski
[ Upstream commit 50eaa652b54df1e2b48dc398d9e6114c9ed080eb ] Commit 402cb8dda949 ("fscache: Attach the index key and aux data to the cookie") added the aux_data and aux_data_len to parameters to fscache_acquire_cookie(), and updated the callers in the NFS client. In the process it modified the aux_data to include the change_attr, but missed adding change_attr to a couple places where aux_data was used. Specifically, when opening a file and the change_attr is not added, the following attempt to lookup an object will fail inside cachefiles_check_object_xattr() = -116 due to nfs_fscache_inode_check_aux() failing memcmp on auxdata and returning FSCACHE_CHECKAUX_OBSOLETE. Fix this by adding nfs_fscache_update_auxdata() to set the auxdata from all relevant fields in the inode, including the change_attr. Fixes: 402cb8dda949 ("fscache: Attach the index key and aux data to the cookie") Signed-off-by: Dave Wysochanski <dwysocha@redhat.com> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-05-20nfs: fscache: use timespec64 in inode auxdataArnd Bergmann
[ Upstream commit 6e31ded6895adfca97211118cc9b72236e8f6d53 ] nfs currently behaves differently on 32-bit and 64-bit kernels regarding the on-disk format of nfs_fscache_inode_auxdata. That format should really be the same on any kernel, and we should avoid the 'timespec' type in order to remove that from the kernel later on. Using plain 'timespec64' would not be good here, since that includes implied padding and would possibly leak kernel stack data to the on-disk format on 32-bit architectures. struct __kernel_timespec would work as a replacement, but open-coding the two struct members in nfs_fscache_inode_auxdata makes it more obvious what's going on here, and keeps the current format for 64-bit architectures. Cc: David Howells <dhowells@redhat.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-05-20NFS: Fix fscache super_cookie index_key from changing after umountDave Wysochanski
[ Upstream commit d9bfced1fbcb35b28d8fbed4e785d2807055ed2b ] Commit 402cb8dda949 ("fscache: Attach the index key and aux data to the cookie") added the index_key and index_key_len parameters to fscache_acquire_cookie(), and updated the callers in the NFS client. One of the callers was inside nfs_fscache_get_super_cookie() and was changed to use the full struct nfs_fscache_key as the index_key. However, a couple members of this structure contain pointers and thus will change each time the same NFS share is remounted. Since index_key is used for fscache_cookie->key_hash and this subsequently is used to compare cookies, the effectiveness of fscache with NFS is reduced to the point at which a umount occurs. Any subsequent remount of the same share will cause a unique NFS super_block index_key and key_hash to be generated for the same data, rendering any prior fscache data unable to be found. A simple reproducer demonstrates the problem. 1. Mount share with 'fsc', create a file, drop page cache systemctl start cachefilesd mount -o vers=3,fsc 127.0.0.1:/export /mnt dd if=/dev/zero of=/mnt/file1.bin bs=4096 count=1 echo 3 > /proc/sys/vm/drop_caches 2. Read file into page cache and fscache, then unmount dd if=/mnt/file1.bin of=/dev/null bs=4096 count=1 umount /mnt 3. Remount and re-read which should come from fscache mount -o vers=3,fsc 127.0.0.1:/export /mnt echo 3 > /proc/sys/vm/drop_caches dd if=/mnt/file1.bin of=/dev/null bs=4096 count=1 4. Check for READ ops in mountstats - there should be none grep READ: /proc/self/mountstats Looking at the history and the removed function, nfs_super_get_key(), we should only use nfs_fscache_key.key plus any uniquifier, for the fscache index_key. Fixes: 402cb8dda949 ("fscache: Attach the index key and aux data to the cookie") Signed-off-by: Dave Wysochanski <dwysocha@redhat.com> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-05-20gfs2: More gfs2_find_jhead fixesAndreas Gruenbacher
[ Upstream commit aa83da7f47b26c9587bade6c4bc4736ffa308f0a ] It turns out that when extending an existing bio, gfs2_find_jhead fails to check if the block number is consecutive, which leads to incorrect reads for fragmented journals. In addition, limit the maximum bio size to an arbitrary value of 2 megabytes: since commit 07173c3ec276 ("block: enable multipage bvecs"), if we just keep adding pages until bio_add_page fails, bios will grow much larger than useful, which pins more memory than necessary with barely any additional performance gains. Fixes: f4686c26ecc3 ("gfs2: read journal in large chunks") Cc: stable@vger.kernel.org # v5.2+ Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-05-20gfs2: Another gfs2_walk_metadata fixAndreas Gruenbacher
[ Upstream commit 566a2ab3c9005f62e784bd39022d58d34ef4365c ] Make sure we don't walk past the end of the metadata in gfs2_walk_metadata: the inode holds fewer pointers than indirect blocks. Slightly clean up gfs2_iomap_get. Fixes: a27a0c9b6a20 ("gfs2: gfs2_walk_metadata fix") Cc: stable@vger.kernel.org # v5.3+ Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-05-14fanotify: merge duplicate events on parent and childAmir Goldstein
[ Upstream commit f367a62a7cad2447d835a9f14fc63997a9137246 ] With inotify, when a watch is set on a directory and on its child, an event on the child is reported twice, once with wd of the parent watch and once with wd of the child watch without the filename. With fanotify, when a watch is set on a directory and on its child, an event on the child is reported twice, but it has the exact same information - either an open file descriptor of the child or an encoded fid of the child. The reason that the two identical events are not merged is because the object id used for merging events in the queue is the child inode in one event and parent inode in the other. For events with path or dentry data, use the victim inode instead of the watched inode as the object id for event merging, so that the event reported on parent will be merged with the event reported on the child. Link: https://lore.kernel.org/r/20200319151022.31456-9-amir73il@gmail.com Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-05-14fsnotify: replace inode pointer with an object idAmir Goldstein
[ Upstream commit dfc2d2594e4a79204a3967585245f00644b8f838 ] The event inode field is used only for comparison in queue merges and cannot be dereferenced after handle_event(), because it does not hold a refcount on the inode. Replace it with an abstract id to do the same thing. Link: https://lore.kernel.org/r/20200319151022.31456-8-amir73il@gmail.com Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-05-14coredump: fix crash when umh is disabledLuis Chamberlain
commit 3740d93e37902b31159a82da2d5c8812ed825404 upstream. Commit 64e90a8acb859 ("Introduce STATIC_USERMODEHELPER to mediate call_usermodehelper()") added the optiont to disable all call_usermodehelper() calls by setting STATIC_USERMODEHELPER_PATH to an empty string. When this is done, and crashdump is triggered, it will crash on null pointer dereference, since we make assumptions over what call_usermodehelper_exec() did. This has been reported by Sergey when one triggers a a coredump with the following configuration: ``` CONFIG_STATIC_USERMODEHELPER=y CONFIG_STATIC_USERMODEHELPER_PATH="" kernel.core_pattern = |/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %h %e ``` The way disabling the umh was designed was that call_usermodehelper_exec() would just return early, without an error. But coredump assumes certain variables are set up for us when this happens, and calls ile_start_write(cprm.file) with a NULL file. [ 2.819676] BUG: kernel NULL pointer dereference, address: 0000000000000020 [ 2.819859] #PF: supervisor read access in kernel mode [ 2.820035] #PF: error_code(0x0000) - not-present page [ 2.820188] PGD 0 P4D 0 [ 2.820305] Oops: 0000 [#1] SMP PTI [ 2.820436] CPU: 2 PID: 89 Comm: a Not tainted 5.7.0-rc1+ #7 [ 2.820680] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20190711_202441-buildvm-armv7-10.arm.fedoraproject.org-2.fc31 04/01/2014 [ 2.821150] RIP: 0010:do_coredump+0xd80/0x1060 [ 2.821385] Code: e8 95 11 ed ff 48 c7 c6 cc a7 b4 81 48 8d bd 28 ff ff ff 89 c2 e8 70 f1 ff ff 41 89 c2 85 c0 0f 84 72 f7 ff ff e9 b4 fe ff ff <48> 8b 57 20 0f b7 02 66 25 00 f0 66 3d 00 8 0 0f 84 9c 01 00 00 44 [ 2.822014] RSP: 0000:ffffc9000029bcb8 EFLAGS: 00010246 [ 2.822339] RAX: 0000000000000000 RBX: ffff88803f860000 RCX: 000000000000000a [ 2.822746] RDX: 0000000000000009 RSI: 0000000000000282 RDI: 0000000000000000 [ 2.823141] RBP: ffffc9000029bde8 R08: 0000000000000000 R09: ffffc9000029bc00 [ 2.823508] R10: 0000000000000001 R11: ffff88803dec90be R12: ffffffff81c39da0 [ 2.823902] R13: ffff88803de84400 R14: 0000000000000000 R15: 0000000000000000 [ 2.824285] FS: 00007fee08183540(0000) GS:ffff88803e480000(0000) knlGS:0000000000000000 [ 2.824767] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 2.825111] CR2: 0000000000000020 CR3: 000000003f856005 CR4: 0000000000060ea0 [ 2.825479] Call Trace: [ 2.825790] get_signal+0x11e/0x720 [ 2.826087] do_signal+0x1d/0x670 [ 2.826361] ? force_sig_info_to_task+0xc1/0xf0 [ 2.826691] ? force_sig_fault+0x3c/0x40 [ 2.826996] ? do_trap+0xc9/0x100 [ 2.827179] exit_to_usermode_loop+0x49/0x90 [ 2.827359] prepare_exit_to_usermode+0x77/0xb0 [ 2.827559] ? invalid_op+0xa/0x30 [ 2.827747] ret_from_intr+0x20/0x20 [ 2.827921] RIP: 0033:0x55e2c76d2129 [ 2.828107] Code: 2d ff ff ff e8 68 ff ff ff 5d c6 05 18 2f 00 00 01 c3 0f 1f 80 00 00 00 00 c3 0f 1f 80 00 00 00 00 e9 7b ff ff ff 55 48 89 e5 <0f> 0b b8 00 00 00 00 5d c3 66 2e 0f 1f 84 0 0 00 00 00 00 0f 1f 40 [ 2.828603] RSP: 002b:00007fffeba5e080 EFLAGS: 00010246 [ 2.828801] RAX: 000055e2c76d2125 RBX: 0000000000000000 RCX: 00007fee0817c718 [ 2.829034] RDX: 00007fffeba5e188 RSI: 00007fffeba5e178 RDI: 0000000000000001 [ 2.829257] RBP: 00007fffeba5e080 R08: 0000000000000000 R09: 00007fee08193c00 [ 2.829482] R10: 0000000000000009 R11: 0000000000000000 R12: 000055e2c76d2040 [ 2.829727] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 [ 2.829964] CR2: 0000000000000020 [ 2.830149] ---[ end trace ceed83d8c68a1bf1 ]--- ``` Cc: <stable@vger.kernel.org> # v4.11+ Fixes: 64e90a8acb85 ("Introduce STATIC_USERMODEHELPER to mediate call_usermodehelper()") BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=199795 Reported-by: Tony Vroon <chainsaw@gentoo.org> Reported-by: Sergey Kvachonok <ravenexp@gmail.com> Tested-by: Sergei Trofimovich <slyfox@gentoo.org> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org> Link: https://lore.kernel.org/r/20200416162859.26518-1-mcgrof@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-05-14ceph: demote quotarealm lookup warning to a debug messageLuis Henriques
commit 12ae44a40a1be891bdc6463f8c7072b4ede746ef upstream. A misconfigured cephx can easily result in having the kernel client flooding the logs with: ceph: Can't lookup inode 1 (err: -13) Change this message to debug level. Cc: stable@vger.kernel.org URL: https://tracker.ceph.com/issues/44546 Signed-off-by: Luis Henriques <lhenriques@suse.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-05-14ceph: fix endianness bug when handling MDS session feature bitsJeff Layton
commit 0fa8263367db9287aa0632f96c1a5f93cc478150 upstream. Eduard reported a problem mounting cephfs on s390 arch. The feature mask sent by the MDS is little-endian, so we need to convert it before storing and testing against it. Cc: stable@vger.kernel.org Reported-and-Tested-by: Eduard Shishkin <edward6@linux.ibm.com> Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-05-14eventpoll: fix missing wakeup for ovflist in ep_poll_callbackKhazhismel Kumykov
commit 0c54a6a44bf3d41e76ce3f583a6ece267618df2e upstream. In the event that we add to ovflist, before commit 339ddb53d373 ("fs/epoll: remove unnecessary wakeups of nested epoll") we would be woken up by ep_scan_ready_list, and did no wakeup in ep_poll_callback. With that wakeup removed, if we add to ovflist here, we may never wake up. Rather than adding back the ep_scan_ready_list wakeup - which was resulting in unnecessary wakeups, trigger a wake-up in ep_poll_callback. We noticed that one of our workloads was missing wakeups starting with 339ddb53d373 and upon manual inspection, this wakeup seemed missing to me. With this patch added, we no longer see missing wakeups. I haven't yet tried to make a small reproducer, but the existing kselftests in filesystem/epoll passed for me with this patch. [khazhy@google.com: use if/elif instead of goto + cleanup suggested by Roman] Link: http://lkml.kernel.org/r/20200424190039.192373-1-khazhy@google.com Fixes: 339ddb53d373 ("fs/epoll: remove unnecessary wakeups of nested epoll") Signed-off-by: Khazhismel Kumykov <khazhy@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Roman Penyaev <rpenyaev@suse.de> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Roman Penyaev <rpenyaev@suse.de> Cc: Heiher <r@hev.cc> Cc: Jason Baron <jbaron@akamai.com> Cc: <stable@vger.kernel.org> Link: http://lkml.kernel.org/r/20200424025057.118641-1-khazhy@google.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-05-14epoll: atomically remove wait entry on wake upRoman Penyaev
commit 412895f03cbf9633298111cb4dfde13b7720e2c5 upstream. This patch does two things: - fixes a lost wakeup introduced by commit 339ddb53d373 ("fs/epoll: remove unnecessary wakeups of nested epoll") - improves performance for events delivery. The description of the problem is the following: if N (>1) threads are waiting on ep->wq for new events and M (>1) events come, it is quite likely that >1 wakeups hit the same wait queue entry, because there is quite a big window between __add_wait_queue_exclusive() and the following __remove_wait_queue() calls in ep_poll() function. This can lead to lost wakeups, because thread, which was woken up, can handle not all the events in ->rdllist. (in better words the problem is described here: https://lkml.org/lkml/2019/10/7/905) The idea of the current patch is to use init_wait() instead of init_waitqueue_entry(). Internally init_wait() sets autoremove_wake_function as a callback, which removes the wait entry atomically (under the wq locks) from the list, thus the next coming wakeup hits the next wait entry in the wait queue, thus preventing lost wakeups. Problem is very well reproduced by the epoll60 test case [1]. Wait entry removal on wakeup has also performance benefits, because there is no need to take a ep->lock and remove wait entry from the queue after the successful wakeup. Here is the timing output of the epoll60 test case: With explicit wakeup from ep_scan_ready_list() (the state of the code prior 339ddb53d373): real 0m6.970s user 0m49.786s sys 0m0.113s After this patch: real 0m5.220s user 0m36.879s sys 0m0.019s The other testcase is the stress-epoll [2], where one thread consumes all the events and other threads produce many events: With explicit wakeup from ep_scan_ready_list() (the state of the code prior 339ddb53d373): threads events/ms run-time ms 8 5427 1474 16 6163 2596 32 6824 4689 64 7060 9064 128 6991 18309 After this patch: threads events/ms run-time ms 8 5598 1429 16 7073 2262 32 7502 4265 64 7640 8376 128 7634 16767 (number of "events/ms" represents event bandwidth, thus higher is better; number of "run-time ms" represents overall time spent doing the benchmark, thus lower is better) [1] tools/testing/selftests/filesystems/epoll/epoll_wakeup_test.c [2] https://github.com/rouming/test-tools/blob/master/stress-epoll.c Signed-off-by: Roman Penyaev <rpenyaev@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Jason Baron <jbaron@akamai.com> Cc: Khazhismel Kumykov <khazhy@google.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Heiher <r@hev.cc> Cc: <stable@vger.kernel.org> Link: http://lkml.kernel.org/r/20200430130326.1368509-2-rpenyaev@suse.de Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-05-10cifs: do not share tcons with DFSPaulo Alcantara
[ Upstream commit 65303de829dd6d291a4947c1a31de31896f8a060 ] This disables tcon re-use for DFS shares. tcon->dfs_path stores the path that the tcon should connect to when doing failing over. If that tcon is used multiple times e.g. 2 mounts using it with different prefixpath, each will need a different dfs_path but there is only one tcon. The other solution would be to split the tcon in 2 tcons during failover but that is much harder. tcons could not be shared with DFS in cifs.ko because in a DFS namespace like: //domain/dfsroot -> /serverA/dfsroot, /serverB/dfsroot //serverA/dfsroot/link -> /serverA/target1/aa/bb //serverA/dfsroot/link2 -> /serverA/target1/cc/dd you can see that link and link2 are two DFS links that both resolve to the same target share (/serverA/target1), so cifs.ko will only contain a single tcon for both link and link2. The problem with that is, if we (auto)mount "link" and "link2", cifs.ko will only contain a single tcon for both DFS links so we couldn't perform failover or refresh the DFS cache for both links because tcon->dfs_path was set to either "link" or "link2", but not both -- which is wrong. Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz> Reviewed-by: Aurelien Aptel <aaptel@suse.com> Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-05-10cifs: protect updating server->dstaddr with a spinlockRonnie Sahlberg
[ Upstream commit fada37f6f62995cc449b36ebba1220594bfe55fe ] We use a spinlock while we are reading and accessing the destination address for a server. We need to also use this spinlock to protect when we are modifying this address from reconn_set_ipaddr(). Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-05-06nfs: Fix potential posix_acl refcnt leak in nfs3_set_aclAndreas Gruenbacher
commit 7648f939cb919b9d15c21fff8cd9eba908d595dc upstream. nfs3_set_acl keeps track of the acl it allocated locally to determine if an acl needs to be released at the end. This results in a memory leak when the function allocates an acl as well as a default acl. Fix by releasing acls that differ from the acl originally passed into nfs3_set_acl. Fixes: b7fa0554cf1b ("[PATCH] NFS: Add support for NFSv3 ACLs") Reported-by: Xiyu Yang <xiyuyang19@fudan.edu.cn> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-05-06Fix use after free in get_tree_bdev()David Howells
commit dd7bc8158b413e0b580c491e8bd18cb91057c7c2 upstream. Commit 6fcf0c72e4b9, a fix to get_tree_bdev() put a missing blkdev_put() in the wrong place, before a warnf() that displays the bdev under consideration rather after it. This results in a silent lockup in printk("%pg") called via warnf() from get_tree_bdev() under some circumstances when there's a race with the blockdev being frozen. This can be caused by xfstests/tests/generic/085 in combination with Lukas Czerner's ext4 mount API conversion patchset. It looks like it ought to occur with other users of get_tree_bdev() such as XFS, but apparently doesn't. Fix this by switching the order of the lines. Fixes: 6fcf0c72e4b9 ("vfs: add missing blkdev_put() in get_tree_bdev()") Reported-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: David Howells <dhowells@redhat.com> cc: Ian Kent <raven@themaw.net> cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-05-06dlmfs_file_write(): fix the bogosity in handling non-zero *pposAl Viro
commit 3815f1be546e752327b5868af103ccdddcc4db77 upstream. 'count' is how much you want written, not the final position. Moreover, it can legitimately be less than the current position... Cc: stable@vger.kernel.org Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-05-06btrfs: transaction: Avoid deadlock due to bad initialization timing of ↵Qu Wenruo
fs_info::journal_info commit fcc99734d1d4ced30167eb02e17f656735cb9928 upstream. [BUG] One run of btrfs/063 triggered the following lockdep warning: ============================================ WARNING: possible recursive locking detected 5.6.0-rc7-custom+ #48 Not tainted -------------------------------------------- kworker/u24:0/7 is trying to acquire lock: ffff88817d3a46e0 (sb_internal#2){.+.+}, at: start_transaction+0x66c/0x890 [btrfs] but task is already holding lock: ffff88817d3a46e0 (sb_internal#2){.+.+}, at: start_transaction+0x66c/0x890 [btrfs] other info that might help us debug this: Possible unsafe locking scenario: CPU0 ---- lock(sb_internal#2); lock(sb_internal#2); *** DEADLOCK *** May be due to missing lock nesting notation 4 locks held by kworker/u24:0/7: #0: ffff88817b495948 ((wq_completion)btrfs-endio-write){+.+.}, at: process_one_work+0x557/0xb80 #1: ffff888189ea7db8 ((work_completion)(&work->normal_work)){+.+.}, at: process_one_work+0x557/0xb80 #2: ffff88817d3a46e0 (sb_internal#2){.+.+}, at: start_transaction+0x66c/0x890 [btrfs] #3: ffff888174ca4da8 (&fs_info->reloc_mutex){+.+.}, at: btrfs_record_root_in_trans+0x83/0xd0 [btrfs] stack backtrace: CPU: 0 PID: 7 Comm: kworker/u24:0 Not tainted 5.6.0-rc7-custom+ #48 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 Workqueue: btrfs-endio-write btrfs_work_helper [btrfs] Call Trace: dump_stack+0xc2/0x11a __lock_acquire.cold+0xce/0x214 lock_acquire+0xe6/0x210 __sb_start_write+0x14e/0x290 start_transaction+0x66c/0x890 [btrfs] btrfs_join_transaction+0x1d/0x20 [btrfs] find_free_extent+0x1504/0x1a50 [btrfs] btrfs_reserve_extent+0xd5/0x1f0 [btrfs] btrfs_alloc_tree_block+0x1ac/0x570 [btrfs] btrfs_copy_root+0x213/0x580 [btrfs] create_reloc_root+0x3bd/0x470 [btrfs] btrfs_init_reloc_root+0x2d2/0x310 [btrfs] record_root_in_trans+0x191/0x1d0 [btrfs] btrfs_record_root_in_trans+0x90/0xd0 [btrfs] start_transaction+0x16e/0x890 [btrfs] btrfs_join_transaction+0x1d/0x20 [btrfs] btrfs_finish_ordered_io+0x55d/0xcd0 [btrfs] finish_ordered_fn+0x15/0x20 [btrfs] btrfs_work_helper+0x116/0x9a0 [btrfs] process_one_work+0x632/0xb80 worker_thread+0x80/0x690 kthread+0x1a3/0x1f0 ret_from_fork+0x27/0x50 It's pretty hard to reproduce, only one hit so far. [CAUSE] This is because we're calling btrfs_join_transaction() without re-using the current running one: btrfs_finish_ordered_io() |- btrfs_join_transaction() <<< Call #1 |- btrfs_record_root_in_trans() |- btrfs_reserve_extent() |- btrfs_join_transaction() <<< Call #2 Normally such btrfs_join_transaction() call should re-use the existing one, without trying to re-start a transaction. But the problem is, in btrfs_join_transaction() call #1, we call btrfs_record_root_in_trans() before initializing current::journal_info. And in btrfs_join_transaction() call #2, we're relying on current::journal_info to avoid such deadlock. [FIX] Call btrfs_record_root_in_trans() after we have initialized current::journal_info. CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-05-06btrfs: fix partial loss of prealloc extent past i_size after fsyncFilipe Manana
commit f135cea30de5f74d5bfb5116682073841fb4af8f upstream. When we have an inode with a prealloc extent that starts at an offset lower than the i_size and there is another prealloc extent that starts at an offset beyond i_size, we can end up losing part of the first prealloc extent (the part that starts at i_size) and have an implicit hole if we fsync the file and then have a power failure. Consider the following example with comments explaining how and why it happens. $ mkfs.btrfs -f /dev/sdb $ mount /dev/sdb /mnt # Create our test file with 2 consecutive prealloc extents, each with a # size of 128Kb, and covering the range from 0 to 256Kb, with a file # size of 0. $ xfs_io -f -c "falloc -k 0 128K" /mnt/foo $ xfs_io -c "falloc -k 128K 128K" /mnt/foo # Fsync the file to record both extents in the log tree. $ xfs_io -c "fsync" /mnt/foo # Now do a redudant extent allocation for the range from 0 to 64Kb. # This will merely increase the file size from 0 to 64Kb. Instead we # could also do a truncate to set the file size to 64Kb. $ xfs_io -c "falloc 0 64K" /mnt/foo # Fsync the file, so we update the inode item in the log tree with the # new file size (64Kb). This also ends up setting the number of bytes # for the first prealloc extent to 64Kb. This is done by the truncation # at btrfs_log_prealloc_extents(). # This means that if a power failure happens after this, a write into # the file range 64Kb to 128Kb will not use the prealloc extent and # will result in allocation of a new extent. $ xfs_io -c "fsync" /mnt/foo # Now set the file size to 256K with a truncate and then fsync the file. # Since no changes happened to the extents, the fsync only updates the # i_size in the inode item at the log tree. This results in an implicit # hole for the file range from 64Kb to 128Kb, something which fsck will # complain when not using the NO_HOLES feature if we replay the log # after a power failure. $ xfs_io -c "truncate 256K" -c "fsync" /mnt/foo So instead of always truncating the log to the inode's current i_size at btrfs_log_prealloc_extents(), check first if there's a prealloc extent that starts at an offset lower than the i_size and with a length that crosses the i_size - if there is one, just make sure we truncate to a size that corresponds to the end offset of that prealloc extent, so that we don't lose the part of that extent that starts at i_size if a power failure happens. A test case for fstests follows soon. Fixes: 31d11b83b96f ("Btrfs: fix duplicate extents after fsync of file with prealloc extents") CC: stable@vger.kernel.org # 4.14+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-05-06btrfs: fix block group leak when removing failsXiyu Yang
commit f6033c5e333238f299c3ae03fac8cc1365b23b77 upstream. btrfs_remove_block_group() invokes btrfs_lookup_block_group(), which returns a local reference of the block group that contains the given bytenr to "block_group" with increased refcount. When btrfs_remove_block_group() returns, "block_group" becomes invalid, so the refcount should be decreased to keep refcount balanced. The reference counting issue happens in several exception handling paths of btrfs_remove_block_group(). When those error scenarios occur such as btrfs_alloc_path() returns NULL, the function forgets to decrease its refcnt increased by btrfs_lookup_block_group() and will cause a refcnt leak. Fix this issue by jumping to "out_put_group" label and calling btrfs_put_block_group() when those error scenarios occur. CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Xiyu Yang <xiyuyang19@fudan.edu.cn> Signed-off-by: Xin Tan <tanxin.ctf@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>