aboutsummaryrefslogtreecommitdiffstats
path: root/fs
AgeCommit message (Collapse)Author
2020-11-19ext4: fix -Wstringop-truncation warningsWenlin Kang
The strncpy() function may create a unterminated string, use strscpy_pad() instead. This fixes the following warning: fs/ext4/super.c: In function '__save_error_info': fs/ext4/super.c:368:2: warning: 'strncpy' specified bound 32 equals destination size [-Wstringop-truncation] strncpy(es->s_last_error_func, func, sizeof(es->s_last_error_func)); ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ fs/ext4/super.c:373:3: warning: 'strncpy' specified bound 32 equals destination size [-Wstringop-truncation] strncpy(es->s_first_error_func, func, ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ sizeof(es->s_first_error_func)); ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Signed-off-by: Wenlin Kang <wenlin.kang@windriver.com> Signed-off-by: Bruce Ashfield <bruce.ashfield@gmail.com>
2020-11-10Merge tag 'v5.8.18' into v5.8/standard/baseBruce Ashfield
This is the 5.8.18 stable release # gpg: Signature made Sun 01 Nov 2020 06:46:02 AM EST # gpg: using RSA key 647F28654894E3BD457199BE38DBBDC86092693E # gpg: Can't check signature: No public key
2020-11-02Merge tag 'v5.8.17' into v5.8/standard/baseBruce Ashfield
This is the 5.8.17 stable release Signed-off-by: Bruce Ashfield <bruce.ashfield@gmail.com>
2020-11-01fuse: fix page dereference after freeMiklos Szeredi
commit d78092e4937de9ce55edcb4ee4c5e3c707be0190 upstream. After unlock_request() pages from the ap->pages[] array may be put (e.g. by aborting the connection) and the pages can be freed. Prevent use after free by grabbing a reference to the page before calling unlock_request(). The original patch was created by Pradeep P V K. Reported-by: Pradeep P V K <ppvk@codeaurora.org> Cc: <stable@vger.kernel.org> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-11-01erofs: avoid duplicated permission check for "trusted." xattrsGao Xiang
commit d578b46db69d125a654f509bdc9091d84e924dc8 upstream. Don't recheck it since xattr_permission() already checks CAP_SYS_ADMIN capability. Just follow 5d3ce4f70172 ("f2fs: avoid duplicated permission check for "trusted." xattrs") Reported-by: Hongyu Jin <hongyu.jin@unisoc.com> [ Gao Xiang: since it could cause some complex Android overlay permission issue as well on android-5.4+, it'd be better to backport to 5.4+ rather than pure cleanup on mainline. ] Cc: <stable@vger.kernel.org> # 5.4+ Link: https://lore.kernel.org/r/20200811070020.6339-1-hsiangkao@redhat.com Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Gao Xiang <hsiangkao@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-11-01efivarfs: Replace invalid slashes with exclamation marks in dentries.Michael Schaller
commit 336af6a4686d885a067ecea8c3c3dd129ba4fc75 upstream. Without this patch efivarfs_alloc_dentry creates dentries with slashes in their name if the respective EFI variable has slashes in its name. This in turn causes EIO on getdents64, which prevents a complete directory listing of /sys/firmware/efi/efivars/. This patch replaces the invalid shlashes with exclamation marks like kobject_set_name_vargs does for /sys/firmware/efi/vars/ to have consistently named dentries under /sys/firmware/efi/vars/ and /sys/firmware/efi/efivars/. Signed-off-by: Michael Schaller <misch@google.com> Link: https://lore.kernel.org/r/20200925074502.150448-1-misch@google.com Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: dann frazier <dann.frazier@canonical.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-11-01io_uring: Convert advanced XArray uses to the normal APIMatthew Wilcox (Oracle)
commit 5e2ed8c4f45093698855b1f45cdf43efbf6dd498 upstream. There are no bugs here that I've spotted, it's just easier to use the normal API and there are no performance advantages to using the more verbose advanced API. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-11-01io_uring: Fix XArray usage in io_uring_add_task_fileMatthew Wilcox (Oracle)
commit 236434c3438c4da3dfbd6aeeab807577b85e951a upstream. The xas_store() wasn't paired with an xas_nomem() loop, so if it couldn't allocate memory using GFP_NOWAIT, it would leak the reference to the file descriptor. Also the node pointed to by the xas could be freed between the call to xas_load() under the rcu_read_lock() and the acquisition of the xa_lock. It's easier to just use the normal xa_load/xa_store interface here. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> [axboe: fix missing assign after alloc, cur_uring -> tctx rename] Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-11-01io_uring: Fix use of XArray in __io_uring_files_cancelMatthew Wilcox (Oracle)
commit ce765372bc443573d1d339a2bf4995de385dea3a upstream. We have to drop the lock during each iteration, so there's no advantage to using the advanced API. Convert this to a standard xa_for_each() loop. Reported-by: syzbot+27c12725d8ff0bfe1a13@syzkaller.appspotmail.com Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-11-01io_uring: no need to call xa_destroy() on empty xarrayJens Axboe
commit ca6484cd308a671811bf39f3119e81966eb476e3 upstream. The kernel test robot reports this lockdep issue: [child1:659] mbind (274) returned ENOSYS, marking as inactive. [child1:659] mq_timedsend (279) returned ENOSYS, marking as inactive. [main] 10175 iterations. [F:7781 S:2344 HI:2397] [ 24.610601] [ 24.610743] ================================ [ 24.611083] WARNING: inconsistent lock state [ 24.611437] 5.9.0-rc7-00017-g0f2122045b9462 #5 Not tainted [ 24.611861] -------------------------------- [ 24.612193] inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage. [ 24.612660] ksoftirqd/0/7 [HC0[0]:SC1[3]:HE0:SE0] takes: [ 24.613086] f00ed998 (&xa->xa_lock#4){+.?.}-{2:2}, at: xa_destroy+0x43/0xc1 [ 24.613642] {SOFTIRQ-ON-W} state was registered at: [ 24.614024] lock_acquire+0x20c/0x29b [ 24.614341] _raw_spin_lock+0x21/0x30 [ 24.614636] io_uring_add_task_file+0xe8/0x13a [ 24.614987] io_uring_create+0x535/0x6bd [ 24.615297] io_uring_setup+0x11d/0x136 [ 24.615606] __ia32_sys_io_uring_setup+0xd/0xf [ 24.615977] do_int80_syscall_32+0x53/0x6c [ 24.616306] restore_all_switch_stack+0x0/0xb1 [ 24.616677] irq event stamp: 939881 [ 24.616968] hardirqs last enabled at (939880): [<8105592d>] __local_bh_enable_ip+0x13c/0x145 [ 24.617642] hardirqs last disabled at (939881): [<81b6ace3>] _raw_spin_lock_irqsave+0x1b/0x4e [ 24.618321] softirqs last enabled at (939738): [<81b6c7c8>] __do_softirq+0x3f0/0x45a [ 24.618924] softirqs last disabled at (939743): [<81055741>] run_ksoftirqd+0x35/0x61 [ 24.619521] [ 24.619521] other info that might help us debug this: [ 24.620028] Possible unsafe locking scenario: [ 24.620028] [ 24.620492] CPU0 [ 24.620685] ---- [ 24.620894] lock(&xa->xa_lock#4); [ 24.621168] <Interrupt> [ 24.621381] lock(&xa->xa_lock#4); [ 24.621695] [ 24.621695] *** DEADLOCK *** [ 24.621695] [ 24.622154] 1 lock held by ksoftirqd/0/7: [ 24.622468] #0: 823bfb94 (rcu_callback){....}-{0:0}, at: rcu_process_callbacks+0xc0/0x155 [ 24.623106] [ 24.623106] stack backtrace: [ 24.623454] CPU: 0 PID: 7 Comm: ksoftirqd/0 Not tainted 5.9.0-rc7-00017-g0f2122045b9462 #5 [ 24.624090] Call Trace: [ 24.624284] ? show_stack+0x40/0x46 [ 24.624551] dump_stack+0x1b/0x1d [ 24.624809] print_usage_bug+0x17a/0x185 [ 24.625142] mark_lock+0x11d/0x1db [ 24.625474] ? print_shortest_lock_dependencies+0x121/0x121 [ 24.625905] __lock_acquire+0x41e/0x7bf [ 24.626206] lock_acquire+0x20c/0x29b [ 24.626517] ? xa_destroy+0x43/0xc1 [ 24.626810] ? lock_acquire+0x20c/0x29b [ 24.627110] _raw_spin_lock_irqsave+0x3e/0x4e [ 24.627450] ? xa_destroy+0x43/0xc1 [ 24.627725] xa_destroy+0x43/0xc1 [ 24.627989] __io_uring_free+0x57/0x71 [ 24.628286] ? get_pid+0x22/0x22 [ 24.628544] __put_task_struct+0xf2/0x163 [ 24.628865] put_task_struct+0x1f/0x2a [ 24.629161] delayed_put_task_struct+0xe2/0xe9 [ 24.629509] rcu_process_callbacks+0x128/0x155 [ 24.629860] __do_softirq+0x1a3/0x45a [ 24.630151] run_ksoftirqd+0x35/0x61 [ 24.630443] smpboot_thread_fn+0x304/0x31a [ 24.630763] kthread+0x124/0x139 [ 24.631016] ? sort_range+0x18/0x18 [ 24.631290] ? kthread_create_worker_on_cpu+0x17/0x17 [ 24.631682] ret_from_fork+0x1c/0x28 which is complaining about xa_destroy() grabbing the xa lock in an IRQ disabling fashion, whereas the io_uring uses cases aren't interrupt safe. This is really an xarray issue, since it should not assume the lock type. But for our use case, since we know the xarray is empty at this point, there's no need to actually call xa_destroy(). So just get rid of it. Fixes: 0f2122045b94 ("io_uring: don't rely on weak ->files references") Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-11-01io-wq: fix use-after-free in io_wq_worker_runningHillf Danton
commit c4068bf898ddaef791049a366828d9b84b467bda upstream. The smart syzbot has found a reproducer for the following issue: ================================================================== BUG: KASAN: use-after-free in instrument_atomic_write include/linux/instrumented.h:71 [inline] BUG: KASAN: use-after-free in atomic_inc include/asm-generic/atomic-instrumented.h:240 [inline] BUG: KASAN: use-after-free in io_wqe_inc_running fs/io-wq.c:301 [inline] BUG: KASAN: use-after-free in io_wq_worker_running+0xde/0x110 fs/io-wq.c:613 Write of size 4 at addr ffff8882183db08c by task io_wqe_worker-0/7771 CPU: 0 PID: 7771 Comm: io_wqe_worker-0 Not tainted 5.9.0-rc4-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x198/0x1fd lib/dump_stack.c:118 print_address_description.constprop.0.cold+0xae/0x497 mm/kasan/report.c:383 __kasan_report mm/kasan/report.c:513 [inline] kasan_report.cold+0x1f/0x37 mm/kasan/report.c:530 check_memory_region_inline mm/kasan/generic.c:186 [inline] check_memory_region+0x13d/0x180 mm/kasan/generic.c:192 instrument_atomic_write include/linux/instrumented.h:71 [inline] atomic_inc include/asm-generic/atomic-instrumented.h:240 [inline] io_wqe_inc_running fs/io-wq.c:301 [inline] io_wq_worker_running+0xde/0x110 fs/io-wq.c:613 schedule_timeout+0x148/0x250 kernel/time/timer.c:1879 io_wqe_worker+0x517/0x10e0 fs/io-wq.c:580 kthread+0x3b5/0x4a0 kernel/kthread.c:292 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:294 Allocated by task 7768: kasan_save_stack+0x1b/0x40 mm/kasan/common.c:48 kasan_set_track mm/kasan/common.c:56 [inline] __kasan_kmalloc.constprop.0+0xbf/0xd0 mm/kasan/common.c:461 kmem_cache_alloc_node_trace+0x17b/0x3f0 mm/slab.c:3594 kmalloc_node include/linux/slab.h:572 [inline] kzalloc_node include/linux/slab.h:677 [inline] io_wq_create+0x57b/0xa10 fs/io-wq.c:1064 io_init_wq_offload fs/io_uring.c:7432 [inline] io_sq_offload_start fs/io_uring.c:7504 [inline] io_uring_create fs/io_uring.c:8625 [inline] io_uring_setup+0x1836/0x28e0 fs/io_uring.c:8694 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46 entry_SYSCALL_64_after_hwframe+0x44/0xa9 Freed by task 21: kasan_save_stack+0x1b/0x40 mm/kasan/common.c:48 kasan_set_track+0x1c/0x30 mm/kasan/common.c:56 kasan_set_free_info+0x1b/0x30 mm/kasan/generic.c:355 __kasan_slab_free+0xd8/0x120 mm/kasan/common.c:422 __cache_free mm/slab.c:3418 [inline] kfree+0x10e/0x2b0 mm/slab.c:3756 __io_wq_destroy fs/io-wq.c:1138 [inline] io_wq_destroy+0x2af/0x460 fs/io-wq.c:1146 io_finish_async fs/io_uring.c:6836 [inline] io_ring_ctx_free fs/io_uring.c:7870 [inline] io_ring_exit_work+0x1e4/0x6d0 fs/io_uring.c:7954 process_one_work+0x94c/0x1670 kernel/workqueue.c:2269 worker_thread+0x64c/0x1120 kernel/workqueue.c:2415 kthread+0x3b5/0x4a0 kernel/kthread.c:292 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:294 The buggy address belongs to the object at ffff8882183db000 which belongs to the cache kmalloc-1k of size 1024 The buggy address is located 140 bytes inside of 1024-byte region [ffff8882183db000, ffff8882183db400) The buggy address belongs to the page: page:000000009bada22b refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x2183db flags: 0x57ffe0000000200(slab) raw: 057ffe0000000200 ffffea0008604c48 ffffea00086a8648 ffff8880aa040700 raw: 0000000000000000 ffff8882183db000 0000000100000002 0000000000000000 page dumped because: kasan: bad access detected Memory state around the buggy address: ffff8882183daf80: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ffff8882183db000: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb >ffff8882183db080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ^ ffff8882183db100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ffff8882183db180: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ================================================================== which is down to the comment below, /* all workers gone, wq exit can proceed */ if (!nr_workers && refcount_dec_and_test(&wqe->wq->refs)) complete(&wqe->wq->done); because there might be multiple cases of wqe in a wq and we would wait for every worker in every wqe to go home before releasing wq's resources on destroying. To that end, rework wq's refcount by making it independent of the tracking of workers because after all they are two different things, and keeping it balanced when workers come and go. Note the manager kthread, like other workers, now holds a grab to wq during its lifetime. Finally to help destroy wq, check IO_WQ_BIT_EXIT upon creating worker and do nothing for exiting wq. Cc: stable@vger.kernel.org # v5.5+ Reported-by: syzbot+45fa0a195b941764e0f0@syzkaller.appspotmail.com Reported-by: syzbot+9af99580130003da82b1@syzkaller.appspotmail.com Cc: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Hillf Danton <hdanton@sina.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-11-01io_wq: Make io_wqe::lock a raw_spinlock_tSebastian Andrzej Siewior
commit 95da84659226d75698a1ab958be0af21d9cc2a9c upstream. During a context switch the scheduler invokes wq_worker_sleeping() with disabled preemption. Disabling preemption is needed because it protects access to `worker->sleeping'. As an optimisation it avoids invoking schedule() within the schedule path as part of possible wake up (thus preempt_enable_no_resched() afterwards). The io-wq has been added to the mix in the same section with disabled preemption. This breaks on PREEMPT_RT because io_wq_worker_sleeping() acquires a spinlock_t. Also within the schedule() the spinlock_t must be acquired after tsk_is_pi_blocked() otherwise it will block on the sleeping lock again while scheduling out. While playing with `io_uring-bench' I didn't notice a significant latency spike after converting io_wqe::lock to a raw_spinlock_t. The latency was more or less the same. In order to keep the spinlock_t it would have to be moved after the tsk_is_pi_blocked() check which would introduce a branch instruction into the hot path. The lock is used to maintain the `work_list' and wakes one task up at most. Should io_wqe_cancel_pending_work() cause latency spikes, while searching for a specific item, then it would need to drop the lock during iterations. revert_creds() is also invoked under the lock. According to debug cred::non_rcu is 0. Otherwise it should be moved outside of the locked section because put_cred_rcu()->free_uid() acquires a sleeping lock. Convert io_wqe::lock to a raw_spinlock_t.c Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-11-01io_uring: reference ->nsproxy for file table commandsJens Axboe
commit 9b8284921513fc1ea57d87777283a59b05862f03 upstream. If we don't get and assign the namespace for the async work, then certain paths just don't work properly (like /dev/stdin, /proc/mounts, etc). Anything that references the current namespace of the given task should be assigned for async work on behalf of that task. Cc: stable@vger.kernel.org # v5.5+ Reported-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-11-01io_uring: don't rely on weak ->files referencesJens Axboe
commit 0f2122045b946241a9e549c2a76cea54fa58a7ff upstream. Grab actual references to the files_struct. To avoid circular references issues due to this, we add a per-task note that keeps track of what io_uring contexts a task has used. When the tasks execs or exits its assigned files, we cancel requests based on this tracking. With that, we can grab proper references to the files table, and no longer need to rely on stashing away ring_fd and ring_file to check if the ring_fd may have been closed. Cc: stable@vger.kernel.org # v5.5+ Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-11-01io_uring: enable task/files specific overflow flushingJens Axboe
commit e6c8aa9ac33bd7c968af7816240fc081401fddcd upstream. This allows us to selectively flush out pending overflows, depending on the task and/or files_struct being passed in. No intended functional changes in this patch. Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-11-01io_uring: return cancelation status from poll/timeout/files handlersJens Axboe
commit 76e1b6427fd8246376a97e3227049d49188dfb9c upstream. Return whether we found and canceled requests or not. This is in preparation for using this information, no functional changes in this patch. Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-11-01io_uring: unconditionally grab req->taskJens Axboe
commit e3bc8e9dad7f2f83cc807111d4472164c9210153 upstream. Sometimes we assign a weak reference to it, sometimes we grab a reference to it. Clean this up and make it unconditional, and drop the flag related to tracking this state. Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-11-01io_uring: stash ctx task reference for SQPOLLJens Axboe
commit 2aede0e417db846793c276c7a1bbf7262c8349b0 upstream. We can grab a reference to the task instead of stashing away the task files_struct. This is doable without creating a circular reference between the ring fd and the task itself. Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-11-01io_uring: move dropping of files into separate helperJens Axboe
commit f573d384456b3025d3f8e58b3eafaeeb0f510784 upstream. No functional changes in this patch, prep patch for grabbing references to the files_struct. Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-11-01io_uring: allow timeout/poll/files killing to take task into accountJens Axboe
commit f3606e3a92ddd36299642c78592fc87609abb1f6 upstream. We currently cancel these when the ring exits, and we cancel all of them. This is in preparation for killing only the ones associated with a given task. Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-11-01io_uring: don't run task work on an exiting taskJens Axboe
commit 6200b0ae4ea28a4bfd8eb434e33e6201b7a6a282 upstream. This isn't safe, and isn't needed either. We are guaranteed that any work we queue is on a live task (and will be run), or it goes to our backup io-wq threads if the task is exiting. Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-10-29reiserfs: Fix memory leak in reiserfs_parse_options()Jan Kara
[ Upstream commit e9d4709fcc26353df12070566970f080e651f0c9 ] When a usrjquota or grpjquota mount option is used multiple times, we will leak memory allocated for the file name. Make sure the last setting is used and all the previous ones are properly freed. Reported-by: syzbot+c9e294bbe0333a6b7640@syzkaller.appspotmail.com Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-10-29iomap: fix WARN_ON_ONCE() from unprivileged usersQian Cai
[ Upstream commit a805c111650cdba6ee880f528abdd03c1af82089 ] It is trivial to trigger a WARN_ON_ONCE(1) in iomap_dio_actor() by unprivileged users which would taint the kernel, or worse - panic if panic_on_warn or panic_on_taint is set. Hence, just convert it to pr_warn_ratelimited() to let users know their workloads are racing. Thank Dave Chinner for the initial analysis of the racing reproducers. Signed-off-by: Qian Cai <cai@lca.pw> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-10-29xfs: make sure the rt allocator doesn't run off the endDarrick J. Wong
[ Upstream commit 2a6ca4baed620303d414934aa1b7b0a8e7bab05f ] There's an overflow bug in the realtime allocator. If the rt volume is large enough to handle a single allocation request that is larger than the maximum bmap extent length and the rt bitmap ends exactly on a bitmap block boundary, it's possible that the near allocator will try to check the freeness of a range that extends past the end of the bitmap. This fails with a corruption error and shuts down the fs. Therefore, constrain maxlen so that the range scan cannot run off the end of the rt bitmap. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-10-29reiserfs: only call unlock_new_inode() if I_NEWEric Biggers
[ Upstream commit 8859bf2b1278d064a139e3031451524a49a56bd0 ] unlock_new_inode() is only meant to be called after a new inode has already been inserted into the hash table. But reiserfs_new_inode() can call it even before it has inserted the inode, triggering the WARNING in unlock_new_inode(). Fix this by only calling unlock_new_inode() if the inode has the I_NEW flag set, indicating that it's in the table. This addresses the syzbot report "WARNING in unlock_new_inode" (https://syzkaller.appspot.com/bug?extid=187510916eb6a14598f7). Link: https://lore.kernel.org/r/20200628070057.820213-1-ebiggers@kernel.org Reported-by: syzbot+187510916eb6a14598f7@syzkaller.appspotmail.com Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-10-29udf: Avoid accessing uninitialized data on failed inode readJan Kara
[ Upstream commit 044e2e26f214e5ab26af85faffd8d1e4ec066931 ] When we fail to read inode, some data accessed in udf_evict_inode() may be uninitialized. Move the accesses to !is_bad_inode() branch. Reported-by: syzbot+91f02b28f9bb5f5f1341@syzkaller.appspotmail.com Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-10-29udf: Limit sparing table sizeJan Kara
[ Upstream commit 44ac6b829c4e173fdf6df18e6dd86aecf9a3dc99 ] Although UDF standard allows it, we don't support sparing table larger than a single block. Check it during mount so that we don't try to access memory beyond end of buffer. Reported-by: syzbot+9991561e714f597095da@syzkaller.appspotmail.com Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-10-29ntfs: add check for mft record size in superblockRustam Kovhaev
[ Upstream commit 4f8c94022f0bc3babd0a124c0a7dcdd7547bd94e ] Number of bytes allocated for mft record should be equal to the mft record size stored in ntfs superblock as reported by syzbot, userspace might trigger out-of-bounds read by dereferencing ctx->attr in ntfs_attr_find() Reported-by: syzbot+aed06913f36eff9b544e@syzkaller.appspotmail.com Signed-off-by: Rustam Kovhaev <rkovhaev@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Tested-by: syzbot+aed06913f36eff9b544e@syzkaller.appspotmail.com Acked-by: Anton Altaparmakov <anton@tuxera.com> Link: https://syzkaller.appspot.com/bug?extid=aed06913f36eff9b544e Link: https://lkml.kernel.org/r/20200824022804.226242-1-rkovhaev@gmail.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-10-29fs: dlm: fix configfs memory leakAlexander Aring
[ Upstream commit 3d2825c8c6105b0f36f3ff72760799fa2e71420e ] This patch fixes the following memory detected by kmemleak and umount gfs2 filesystem which removed the last lockspace: unreferenced object 0xffff9264f482f600 (size 192): comm "dlm_controld", pid 325, jiffies 4294690276 (age 48.136s) hex dump (first 32 bytes): 00 00 00 00 00 00 00 00 6e 6f 64 65 73 00 00 00 ........nodes... 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ backtrace: [<00000000060481d7>] make_space+0x41/0x130 [<000000008d905d46>] configfs_mkdir+0x1a2/0x5f0 [<00000000729502cf>] vfs_mkdir+0x155/0x210 [<000000000369bcf1>] do_mkdirat+0x6d/0x110 [<00000000cc478a33>] do_syscall_64+0x33/0x40 [<00000000ce9ccf01>] entry_SYSCALL_64_after_hwframe+0x44/0xa9 The patch just remembers the "nodes" entry pointer in space as I think it's created as subdirectory when parent "spaces" is created. In function drop_space() we will lost the pointer reference to nds because configfs_remove_default_groups(). However as this subdirectory is always available when "spaces" exists it will just be freed when "spaces" will be freed. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-10-29ext4: limit entries returned when counting fsmap recordsDarrick J. Wong
[ Upstream commit af8c53c8bc087459b1aadd4c94805d8272358d79 ] If userspace asked fsmap to try to count the number of entries, we cannot return more than UINT_MAX entries because fmh_entries is u32. Therefore, stop counting if we hit this limit or else we will waste time to return truncated results. Fixes: 0c9ec4beecac ("ext4: support GETFSMAP ioctls") Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Link: https://lore.kernel.org/r/20201001222148.GA49520@magnolia Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-10-29ext4: disallow modifying DAX inode flag if inline_data has been setXiao Yang
[ Upstream commit aa2f77920b743c44e02e2dc8474bbf8bd30007a2 ] inline_data is mutually exclusive to DAX so enabling both of them triggers the following issue: ------------------------------------------ # mkfs.ext4 -F -O inline_data /dev/pmem1 ... # mount /dev/pmem1 /mnt # echo 'test' >/mnt/file # lsattr -l /mnt/file /mnt/file Inline_Data # xfs_io -c "chattr +x" /mnt/file # xfs_io -c "lsattr -v" /mnt/file [dax] /mnt/file # umount /mnt # mount /dev/pmem1 /mnt # cat /mnt/file cat: /mnt/file: Numerical result out of range ------------------------------------------ Fixes: b383a73f2b83 ("fs/ext4: Introduce DAX inode flag") Signed-off-by: Xiao Yang <yangx.jy@cn.fujitsu.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Ira Weiny <ira.weiny@intel.com> Reviewed-by: Andreas Dilger <adilger@dilger.ca> Link: https://lore.kernel.org/r/20200828084330.15776-1-yangx.jy@cn.fujitsu.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-10-29ext4: discard preallocations before releasing group lockJan Kara
[ Upstream commit 5b3dc19dda6691e8ab574e8eede1aef6f02a4f1c ] ext4_mb_discard_group_preallocations() can be releasing group lock with preallocations accumulated on its local list. Thus although discard_pa_seq was incremented and concurrent allocating processes will be retrying allocations, it can happen that premature ENOSPC error is returned because blocks used for preallocations are not available for reuse yet. Make sure we always free locally accumulated preallocations before releasing group lock. Fixes: 07b5b8e1ac40 ("ext4: mballoc: introduce pcpu seqcnt for freeing PA to improve ENOSPC handling") Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20200924150959.4335-1-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-10-29ext4: fix dead loop in ext4_mb_new_blocksYe Bin
[ Upstream commit 70022da804f0f3f152115688885608c39182082e ] As we test disk offline/online with running fsstress, we find fsstress process is keeping running state. kworker/u32:3-262 [004] ...1 140.787471: ext4_mb_discard_preallocations: dev 8,32 needed 114 .... kworker/u32:3-262 [004] ...1 140.787471: ext4_mb_discard_preallocations: dev 8,32 needed 114 ext4_mb_new_blocks repeat: ext4_mb_discard_preallocations_should_retry(sb, ac, &seq) freed = ext4_mb_discard_preallocations ext4_mb_discard_group_preallocations this_cpu_inc(discard_pa_seq); ---> freed == 0 seq_retry = ext4_get_discard_pa_seq_sum for_each_possible_cpu(__cpu) __seq += per_cpu(discard_pa_seq, __cpu); if (seq_retry != *seq) { *seq = seq_retry; ret = true; } As we see seq_retry is sum of discard_pa_seq every cpu, if ext4_mb_discard_group_preallocations return zero discard_pa_seq in this cpu maybe increase one, so condition "seq_retry != *seq" have always been met. Ritesh Harjani suggest to in ext4_mb_discard_group_preallocations function we only increase discard_pa_seq when there is some PA to free. Fixes: 07b5b8e1ac40 ("ext4: mballoc: introduce pcpu seqcnt for freeing PA to improve ENOSPC handling") Signed-off-by: Ye Bin <yebin10@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com> Link: https://lore.kernel.org/r/20200916113859.1556397-3-yebin10@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-10-29ramfs: fix nommu mmap with gaps in the page cacheMatthew Wilcox (Oracle)
[ Upstream commit 50b7d85680086126d7bd91dae81d57d4cb1ab6b7 ] ramfs needs to check that pages are both physically contiguous and contiguous in the file. If the page cache happens to have, eg, page A for index 0 of the file, no page for index 1, and page A+1 for index 2, then an mmap of the first two pages of the file will succeed when it should fail. Fixes: 642fb4d1f1dd ("[PATCH] NOMMU: Provide shared-writable mmap support on ramfs") Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: David Howells <dhowells@redhat.com> Link: https://lkml.kernel.org/r/20200914122239.GO6583@casper.infradead.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-10-29afs: Fix cell removalDavid Howells
[ Upstream commit 1d0e850a49a5b56f8f3cb51e74a11e2fedb96be6 ] Fix cell removal by inserting a more final state than AFS_CELL_FAILED that indicates that the cell has been unpublished in case the manager is already requeued and will go through again. The new AFS_CELL_REMOVED state will just immediately leave the manager function. Going through a second time in the AFS_CELL_FAILED state will cause it to try to remove the cell again, potentially leading to the proc list being removed. Fixes: 989782dcdc91 ("afs: Overhaul cell database management") Reported-by: syzbot+b994ecf2b023f14832c1@syzkaller.appspotmail.com Reported-by: syzbot+0e0db88e1eb44a91ae8d@syzkaller.appspotmail.com Reported-by: syzbot+2d0585e5efcd43d113c2@syzkaller.appspotmail.com Reported-by: syzbot+1ecc2f9d3387f1d79d42@syzkaller.appspotmail.com Reported-by: syzbot+18d51774588492bf3f69@syzkaller.appspotmail.com Reported-by: syzbot+a5e4946b04d6ca8fa5f3@syzkaller.appspotmail.com Suggested-by: Hillf Danton <hdanton@sina.com> Signed-off-by: David Howells <dhowells@redhat.com> cc: Hillf Danton <hdanton@sina.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-10-29afs: Fix cell purging with aliasesDavid Howells
[ Upstream commit 286377f6bdf71568a4cf07104fe44006ae0dba6d ] When the afs module is removed, one of the things that has to be done is to purge the cell database. afs_cell_purge() cancels the management timer and then starts the cell manager work item to do the purging. This does a single run through and then assumes that all cells are now purged - but this is no longer the case. With the introduction of alias detection, a later cell in the database can now be holding an active count on an earlier cell (cell->alias_of). The purge scan passes by the earlier cell first, but this can't be got rid of until it has discarded the alias. Ordinarily, afs_unuse_cell() would handle this by setting the management timer to trigger another pass - but afs_set_cell_timer() doesn't do anything if the namespace is being removed (net->live == false). rmmod then hangs in the wait on cells_outstanding in afs_cell_purge(). Fix this by making afs_set_cell_timer() directly queue the cell manager if net->live is false. This causes additional management passes. Queueing the cell manager increments cells_outstanding to make sure the wait won't complete until all cells are destroyed. Fixes: 8a070a964877 ("afs: Detect cell aliases 1 - Cells with root volumes") Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-10-29afs: Fix cell refcounting by splitting the usage counterDavid Howells
[ Upstream commit 88c853c3f5c0a07c5db61b494ee25152535cfeee ] Management of the lifetime of afs_cell struct has some problems due to the usage counter being used to determine whether objects of that type are in use in addition to whether anyone might be interested in the structure. This is made trickier by cell objects being cached for a period of time in case they're quickly reused as they hold the result of a setup process that may be slow (DNS lookups, AFS RPC ops). Problems include the cached root volume from alias resolution pinning its parent cell record, rmmod occasionally hanging and occasionally producing assertion failures. Fix this by splitting the count of active users from the struct reference count. Things then work as follows: (1) The cell cache keeps +1 on the cell's activity count and this has to be dropped before the cell can be removed. afs_manage_cell() tries to exchange the 1 to a 0 with the cells_lock write-locked, and if successful, the record is removed from the net->cells. (2) One struct ref is 'owned' by the activity count. That is put when the active count is reduced to 0 (final_destruction label). (3) A ref can be held on a cell whilst it is queued for management on a work queue without confusing the active count. afs_queue_cell() is added to wrap this. (4) The queue's ref is dropped at the end of the management. This is split out into a separate function, afs_manage_cell_work(). (5) The root volume record is put after a cell is removed (at the final_destruction label) rather then in the RCU destruction routine. (6) Volumes hold struct refs, but aren't active users. (7) Both counts are displayed in /proc/net/afs/cells. There are some management function changes: (*) afs_put_cell() now just decrements the refcount and triggers the RCU destruction if it becomes 0. It no longer sets a timer to have the manager do this. (*) afs_use_cell() and afs_unuse_cell() are added to increase and decrease the active count. afs_unuse_cell() sets the management timer. (*) afs_queue_cell() is added to queue a cell with approprate refs. There are also some other fixes: (*) Don't let /proc/net/afs/cells access a cell's vllist if it's NULL. (*) Make sure that candidate cells in lookups are properly destroyed rather than being simply kfree'd. This ensures the bits it points to are destroyed also. (*) afs_dec_cells_outstanding() is now called in cell destruction rather than at "final_destruction". This ensures that cell->net is still valid to the end of the destructor. (*) As a consequence of the previous two changes, move the increment of net->cells_outstanding that was at the point of insertion into the tree to the allocation routine to correctly balance things. Fixes: 989782dcdc91 ("afs: Overhaul cell database management") Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-10-29afs: Fix rapid cell addition/removal by not using RCU on cells treeDavid Howells
[ Upstream commit 92e3cc91d8f51ce64a8b7c696377180953dd316e ] There are a number of problems that are being seen by the rapidly mounting and unmounting an afs dynamic root with an explicit cell and volume specified (which should probably be rejected, but that's a separate issue): What the tests are doing is to look up/create a cell record for the name given and then tear it down again without actually using it to try to talk to a server. This is repeated endlessly, very fast, and the new cell collides with the old one if it's not quick enough to reuse it. It appears (as suggested by Hillf Danton) that the search through the RB tree under a read_seqbegin_or_lock() under RCU conditions isn't safe and that it's not blocking the write_seqlock(), despite taking two passes at it. He suggested that the code should take a ref on the cell it's attempting to look at - but this shouldn't be necessary until we've compared the cell names. It's possible that I'm missing a barrier somewhere. However, using an RCU search for this is overkill, really - we only need to access the cell name in a few places, and they're places where we're may end up sleeping anyway. Fix this by switching to an R/W semaphore instead. Additionally, draw the down_read() call inside the function (renamed to afs_find_cell()) since all the callers were taking the RCU read lock (or should've been[*]). [*] afs_probe_cell_name() should have been, but that doesn't appear to be involved in the bug reports. The symptoms of this look like: general protection fault, probably for non-canonical address 0xf27d208691691fdb: 0000 [#1] PREEMPT SMP KASAN KASAN: maybe wild-memory-access in range [0x93e924348b48fed8-0x93e924348b48fedf] ... RIP: 0010:strncasecmp lib/string.c:52 [inline] RIP: 0010:strncasecmp+0x5f/0x240 lib/string.c:43 afs_lookup_cell_rcu+0x313/0x720 fs/afs/cell.c:88 afs_lookup_cell+0x2ee/0x1440 fs/afs/cell.c:249 afs_parse_source fs/afs/super.c:290 [inline] ... Fixes: 989782dcdc91 ("afs: Overhaul cell database management") Reported-by: syzbot+459a5dce0b4cb70fd076@syzkaller.appspotmail.com Signed-off-by: David Howells <dhowells@redhat.com> cc: Hillf Danton <hdanton@sina.com> cc: syzkaller-bugs@googlegroups.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-10-29f2fs: wait for sysfs kobject removal before freeing f2fs_sb_infoJamie Iles
[ Upstream commit ae284d87abade58c8db7760c808f311ef1ce693c ] syzkaller found that with CONFIG_DEBUG_KOBJECT_RELEASE=y, unmounting an f2fs filesystem could result in the following splat: kobject: 'loop5' ((____ptrval____)): kobject_release, parent 0000000000000000 (delayed 250) kobject: 'f2fs_xattr_entry-7:5' ((____ptrval____)): kobject_release, parent 0000000000000000 (delayed 750) ------------[ cut here ]------------ ODEBUG: free active (active state 0) object type: timer_list hint: delayed_work_timer_fn+0x0/0x98 WARNING: CPU: 0 PID: 699 at lib/debugobjects.c:485 debug_print_object+0x180/0x240 Kernel panic - not syncing: panic_on_warn set ... CPU: 0 PID: 699 Comm: syz-executor.5 Tainted: G S 5.9.0-rc8+ #101 Hardware name: linux,dummy-virt (DT) Call trace: dump_backtrace+0x0/0x4d8 show_stack+0x34/0x48 dump_stack+0x174/0x1f8 panic+0x360/0x7a0 __warn+0x244/0x2ec report_bug+0x240/0x398 bug_handler+0x50/0xc0 call_break_hook+0x160/0x1d8 brk_handler+0x30/0xc0 do_debug_exception+0x184/0x340 el1_dbg+0x48/0xb0 el1_sync_handler+0x170/0x1c8 el1_sync+0x80/0x100 debug_print_object+0x180/0x240 debug_check_no_obj_freed+0x200/0x430 slab_free_freelist_hook+0x190/0x210 kfree+0x13c/0x460 f2fs_put_super+0x624/0xa58 generic_shutdown_super+0x120/0x300 kill_block_super+0x94/0xf8 kill_f2fs_super+0x244/0x308 deactivate_locked_super+0x104/0x150 deactivate_super+0x118/0x148 cleanup_mnt+0x27c/0x3c0 __cleanup_mnt+0x28/0x38 task_work_run+0x10c/0x248 do_notify_resume+0x9d4/0x1188 work_pending+0x8/0x34c Like the error handling for f2fs_register_sysfs(), we need to wait for the kobject to be destroyed before returning to prevent a potential use-after-free. Fixes: bf9e697ecd42 ("f2fs: expose features to sysfs entry") Cc: Jaegeuk Kim <jaegeuk@kernel.org> Cc: Chao Yu <chao@kernel.org> Signed-off-by: Jamie Iles <jamie@nuviainc.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-10-29f2fs: reject CASEFOLD inode flag without casefold featureEric Biggers
[ Upstream commit f6322f3f1212e005e7e6aa82ceb62be53030a64b ] syzbot reported: general protection fault, probably for non-canonical address 0xdffffc0000000001: 0000 [#1] PREEMPT SMP KASAN KASAN: null-ptr-deref in range [0x0000000000000008-0x000000000000000f] CPU: 0 PID: 6860 Comm: syz-executor835 Not tainted 5.9.0-rc8-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 RIP: 0010:utf8_casefold+0x43/0x1b0 fs/unicode/utf8-core.c:107 [...] Call Trace: f2fs_init_casefolded_name fs/f2fs/dir.c:85 [inline] __f2fs_setup_filename fs/f2fs/dir.c:118 [inline] f2fs_prepare_lookup+0x3bf/0x640 fs/f2fs/dir.c:163 f2fs_lookup+0x10d/0x920 fs/f2fs/namei.c:494 __lookup_hash+0x115/0x240 fs/namei.c:1445 filename_create+0x14b/0x630 fs/namei.c:3467 user_path_create fs/namei.c:3524 [inline] do_mkdirat+0x56/0x310 fs/namei.c:3664 do_syscall_64+0x31/0x70 arch/x86/entry/common.c:46 entry_SYSCALL_64_after_hwframe+0x44/0xa9 [...] The problem is that an inode has F2FS_CASEFOLD_FL set, but the filesystem doesn't have the casefold feature flag set, and therefore super_block::s_encoding is NULL. Fix this by making sanity_check_inode() reject inodes that have F2FS_CASEFOLD_FL when the filesystem doesn't have the casefold feature. Reported-by: syzbot+05139c4039d0679e19ff@syzkaller.appspotmail.com Fixes: 2c2eb7a300cd ("f2fs: Support case-insensitive file name lookups") Signed-off-by: Eric Biggers <ebiggers@google.com> Reviewed-by: Gabriel Krisman Bertazi <krisman@collabora.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-10-29xfs: fix high key handling in the rt allocator's query_range functionDarrick J. Wong
[ Upstream commit d88850bd5516a77c6f727e8b6cefb64e0cc929c7 ] Fix some off-by-one errors in xfs_rtalloc_query_range. The highest key in the realtime bitmap is always one less than the number of rt extents, which means that the key clamp at the start of the function is wrong. The 4th argument to xfs_rtfind_forw is the highest rt extent that we want to probe, which means that passing 1 less than the high key is wrong. Finally, drop the rem variable that controls the loop because we can compare the iteration point (rtstart) against the high key directly. The sordid history of this function is that the original commit (fb3c3) incorrectly passed (high_rec->ar_startblock - 1) as the 'limit' parameter to xfs_rtfind_forw. This was wrong because the "high key" is supposed to be the largest key for which the caller wants result rows, not the key for the first row that could possibly be outside the range that the caller wants to see. A subsequent attempt (8ad56) to strengthen the parameter checking added incorrect clamping of the parameters to the number of rt blocks in the system (despite the bitmap functions all taking units of rt extents) to avoid querying ranges past the end of rt bitmap file but failed to fix the incorrect _rtfind_forw parameter. The original _rtfind_forw parameter error then survived the conversion of the startblock and blockcount fields to rt extents (a0e5c), and the most recent off-by-one fix (a3a37) thought it was patching a problem when the end of the rt volume is not in use, but none of these fixes actually solved the original problem that the author was confused about the "limit" argument to xfs_rtfind_forw. Sadly, all four of these patches were written by this author and even his own usage of this function and rt testing were inadequate to get this fixed quickly. Original-problem: fb3c3de2f65c ("xfs: add a couple of queries to iterate free extents in the rtbitmap") Not-fixed-by: 8ad560d2565e ("xfs: strengthen rtalloc query range checks") Not-fixed-by: a0e5c435babd ("xfs: fix xfs_rtalloc_rec units") Fixes: a3a374bf1889 ("xfs: fix off-by-one error in xfs_rtalloc_query_range") Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-10-29nfs: add missing "posix" local_lock constant table definitionScott Mayhew
[ Upstream commit a2d24bcb97dc7b0be1cb891e60ae133bdf36c786 ] "mount -o local_lock=posix..." was broken by the mount API conversion due to the missing constant. Fixes: e38bb238ed8c ("NFS: Convert mount option parsing to use functionality from fs_parser.h") Signed-off-by: Scott Mayhew <smayhew@redhat.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-10-29xfs: fix deadlock and streamline xfs_getfsmap performanceDarrick J. Wong
[ Upstream commit 8ffa90e1145c70c7ac47f14b59583b2296d89e72 ] Refactor xfs_getfsmap to improve its performance: instead of indirectly calling a function that copies one record to userspace at a time, create a shadow buffer in the kernel and copy the whole array once at the end. On the author's computer, this reduces the runtime on his /home by ~20%. This also eliminates a deadlock when running GETFSMAP against the realtime device. The current code locks the rtbitmap to create fsmappings and copies them into userspace, having not released the rtbitmap lock. If the userspace buffer is an mmap of a sparse file that itself resides on the realtime device, the write page fault will recurse into the fs for allocation, which will deadlock on the rtbitmap lock. Fixes: 4c934c7dd60c ("xfs: report realtime space information via the rtbitmap") Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-10-29xfs: limit entries returned when counting fsmap recordsDarrick J. Wong
[ Upstream commit acd1ac3aa22fd58803a12d26b1ab7f70232f8d8d ] If userspace asked fsmap to count the number of entries, we cannot return more than UINT_MAX entries because fmh_entries is u32. Therefore, stop counting if we hit this limit or else we will waste time to return truncated results. Fixes: e89c041338ed ("xfs: implement the GETFSMAP ioctl") Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-10-29xfs: fix finobt btree block recovery orderingDave Chinner
[ Upstream commit 671459676ab0e1d371c8d6b184ad1faa05b6941e ] Nathan popped up on #xfs and pointed out that we fail to handle finobt btree blocks in xlog_recover_get_buf_lsn(). This means they always fall through the entire magic number matching code to "recover immediately". Whilst most of the time this is the correct behaviour, occasionally it will be incorrect and could potentially overwrite more recent metadata because we don't check the LSN in the on disk metadata at all. This bug has been present since the finobt was first introduced, and is a potential cause of the occasional xfs_iget_check_free_state() failures we see that indicate that the inode btree state does not match the on disk inode state. Fixes: aafc3c246529 ("xfs: support the XFS_BTNUM_FINOBT free inode btree type") Reported-by: Nathan Scott <nathans@redhat.com> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-10-29fs: fix NULL dereference due to data race in prepend_path()Andrii Nakryiko
[ Upstream commit 09cad07547445bf3a41683e4d3abcd154c123ef5 ] Fix data race in prepend_path() with re-reading mnt->mnt_ns twice without holding the lock. is_mounted() does check for NULL, but is_anon_ns(mnt->mnt_ns) might re-read the pointer again which could be NULL already, if in between reads one of kern_unmount()/kern_unmount_array()/umount_tree() sets mnt->mnt_ns to NULL. This is seen in production with the following stack trace: BUG: kernel NULL pointer dereference, address: 0000000000000048 ... RIP: 0010:prepend_path.isra.4+0x1ce/0x2e0 Call Trace: d_path+0xe6/0x150 proc_pid_readlink+0x8f/0x100 vfs_readlink+0xf8/0x110 do_readlinkat+0xfd/0x120 __x64_sys_readlinkat+0x1a/0x20 do_syscall_64+0x42/0x110 entry_SYSCALL_64_after_hwframe+0x44/0xa9 Fixes: f2683bd8d5bd ("[PATCH] fix d_absolute_path() interplay with fsmount()") Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-10-29mm, oom_adj: don't loop through tasks in __set_oom_adj when not necessarySuren Baghdasaryan
[ Upstream commit 67197a4f28d28d0b073ab0427b03cb2ee5382578 ] Currently __set_oom_adj loops through all processes in the system to keep oom_score_adj and oom_score_adj_min in sync between processes sharing their mm. This is done for any task with more that one mm_users, which includes processes with multiple threads (sharing mm and signals). However for such processes the loop is unnecessary because their signal structure is shared as well. Android updates oom_score_adj whenever a tasks changes its role (background/foreground/...) or binds to/unbinds from a service, making it more/less important. Such operation can happen frequently. We noticed that updates to oom_score_adj became more expensive and after further investigation found out that the patch mentioned in "Fixes" introduced a regression. Using Pixel 4 with a typical Android workload, write time to oom_score_adj increased from ~3.57us to ~362us. Moreover this regression linearly depends on the number of multi-threaded processes running on the system. Mark the mm with a new MMF_MULTIPROCESS flag bit when task is created with (CLONE_VM && !CLONE_THREAD && !CLONE_VFORK). Change __set_oom_adj to use MMF_MULTIPROCESS instead of mm_users to decide whether oom_score_adj update should be synchronized between multiple processes. To prevent races between clone() and __set_oom_adj(), when oom_score_adj of the process being cloned might be modified from userspace, we use oom_adj_mutex. Its scope is changed to global. The combination of (CLONE_VM && !CLONE_THREAD) is rarely used except for the case of vfork(). To prevent performance regressions of vfork(), we skip taking oom_adj_mutex and setting MMF_MULTIPROCESS when CLONE_VFORK is specified. Clearing the MMF_MULTIPROCESS flag (when the last process sharing the mm exits) is left out of this patch to keep it simple and because it is believed that this threading model is rare. Should there ever be a need for optimizing that case as well, it can be done by hooking into the exit path, likely following the mm_update_next_owner pattern. With the combination of (CLONE_VM && !CLONE_THREAD && !CLONE_VFORK) being quite rare, the regression is gone after the change is applied. [surenb@google.com: v3] Link: https://lkml.kernel.org/r/20200902012558.2335613-1-surenb@google.com Fixes: 44a70adec910 ("mm, oom_adj: make sure processes sharing mm have same view of oom_score_adj") Reported-by: Tim Murray <timmurray@google.com> Suggested-by: Michal Hocko <mhocko@kernel.org> Signed-off-by: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Christian Brauner <christian.brauner@ubuntu.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Eugene Syromiatnikov <esyr@redhat.com> Cc: Christian Kellner <christian@kellner.me> Cc: Adrian Reber <areber@redhat.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Aleksa Sarai <cyphar@cyphar.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Alexey Gladkov <gladkov.alexey@gmail.com> Cc: Michel Lespinasse <walken@google.com> Cc: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: Andrei Vagin <avagin@gmail.com> Cc: Bernd Edlinger <bernd.edlinger@hotmail.de> Cc: John Johansen <john.johansen@canonical.com> Cc: Yafang Shao <laoar.shao@gmail.com> Link: https://lkml.kernel.org/r/20200824153036.3201505-1-surenb@google.com Debugged-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-10-29iomap: Use kzalloc to allocate iomap_pageMatthew Wilcox (Oracle)
[ Upstream commit a6901d4d148dcbad7efb3174afbdf68c995618c2 ] We can skip most of the initialisation, although spinlocks still need explicit initialisation as architectures may use a non-zero value to indicate unlocked. The comment is no longer useful as attach_page_private() handles the refcount now. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-10-29quota: clear padding in v2r1_mem2diskdqb()Eric Dumazet
[ Upstream commit 3d3dc274ce736227e3197868ff749cff2f175f63 ] Freshly allocated memory contains garbage, better make sure to init all struct v2r1_disk_dqblk fields to avoid KMSAN report: BUG: KMSAN: uninit-value in qtree_entry_unused+0x137/0x1b0 fs/quota/quota_tree.c:218 CPU: 0 PID: 23373 Comm: syz-executor.1 Not tainted 5.9.0-rc4-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x21c/0x280 lib/dump_stack.c:118 kmsan_report+0xf7/0x1e0 mm/kmsan/kmsan_report.c:122 __msan_warning+0x58/0xa0 mm/kmsan/kmsan_instr.c:219 qtree_entry_unused+0x137/0x1b0 fs/quota/quota_tree.c:218 v2r1_mem2diskdqb+0x43d/0x710 fs/quota/quota_v2.c:285 qtree_write_dquot+0x226/0x870 fs/quota/quota_tree.c:394 v2_write_dquot+0x1ad/0x280 fs/quota/quota_v2.c:333 dquot_commit+0x4af/0x600 fs/quota/dquot.c:482 ext4_write_dquot fs/ext4/super.c:5934 [inline] ext4_mark_dquot_dirty+0x4d8/0x6a0 fs/ext4/super.c:5985 mark_dquot_dirty fs/quota/dquot.c:347 [inline] mark_all_dquot_dirty fs/quota/dquot.c:385 [inline] dquot_alloc_inode+0xc05/0x12b0 fs/quota/dquot.c:1755 __ext4_new_inode+0x8204/0x9d70 fs/ext4/ialloc.c:1155 ext4_tmpfile+0x41a/0x850 fs/ext4/namei.c:2686 vfs_tmpfile+0x2a2/0x570 fs/namei.c:3283 do_tmpfile fs/namei.c:3316 [inline] path_openat+0x4035/0x6a90 fs/namei.c:3359 do_filp_open+0x2b8/0x710 fs/namei.c:3395 do_sys_openat2+0xa88/0x1140 fs/open.c:1168 do_sys_open fs/open.c:1184 [inline] __do_compat_sys_openat fs/open.c:1242 [inline] __se_compat_sys_openat+0x2a4/0x310 fs/open.c:1240 __ia32_compat_sys_openat+0x56/0x70 fs/open.c:1240 do_syscall_32_irqs_on arch/x86/entry/common.c:80 [inline] __do_fast_syscall_32+0x129/0x180 arch/x86/entry/common.c:139 do_fast_syscall_32+0x6a/0xc0 arch/x86/entry/common.c:162 do_SYSENTER_32+0x73/0x90 arch/x86/entry/common.c:205 entry_SYSENTER_compat_after_hwframe+0x4d/0x5c RIP: 0023:0xf7ff4549 Code: b8 01 10 06 03 74 b4 01 10 07 03 74 b0 01 10 08 03 74 d8 01 00 00 00 00 00 00 00 00 00 00 00 00 00 51 52 55 89 e5 0f 34 cd 80 <5d> 5a 59 c3 90 90 90 90 eb 0d 90 90 90 90 90 90 90 90 90 90 90 90 RSP: 002b:00000000f55cd0cc EFLAGS: 00000296 ORIG_RAX: 0000000000000127 RAX: ffffffffffffffda RBX: 00000000ffffff9c RCX: 0000000020000000 RDX: 0000000000410481 RSI: 0000000000000000 RDI: 0000000000000000 RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000 R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 Uninit was created at: kmsan_save_stack_with_flags mm/kmsan/kmsan.c:143 [inline] kmsan_internal_poison_shadow+0x66/0xd0 mm/kmsan/kmsan.c:126 kmsan_slab_alloc+0x8a/0xe0 mm/kmsan/kmsan_hooks.c:80 slab_alloc_node mm/slub.c:2907 [inline] slab_alloc mm/slub.c:2916 [inline] __kmalloc+0x2bb/0x4b0 mm/slub.c:3982 kmalloc include/linux/slab.h:559 [inline] getdqbuf+0x56/0x150 fs/quota/quota_tree.c:52 qtree_write_dquot+0xf2/0x870 fs/quota/quota_tree.c:378 v2_write_dquot+0x1ad/0x280 fs/quota/quota_v2.c:333 dquot_commit+0x4af/0x600 fs/quota/dquot.c:482 ext4_write_dquot fs/ext4/super.c:5934 [inline] ext4_mark_dquot_dirty+0x4d8/0x6a0 fs/ext4/super.c:5985 mark_dquot_dirty fs/quota/dquot.c:347 [inline] mark_all_dquot_dirty fs/quota/dquot.c:385 [inline] dquot_alloc_inode+0xc05/0x12b0 fs/quota/dquot.c:1755 __ext4_new_inode+0x8204/0x9d70 fs/ext4/ialloc.c:1155 ext4_tmpfile+0x41a/0x850 fs/ext4/namei.c:2686 vfs_tmpfile+0x2a2/0x570 fs/namei.c:3283 do_tmpfile fs/namei.c:3316 [inline] path_openat+0x4035/0x6a90 fs/namei.c:3359 do_filp_open+0x2b8/0x710 fs/namei.c:3395 do_sys_openat2+0xa88/0x1140 fs/open.c:1168 do_sys_open fs/open.c:1184 [inline] __do_compat_sys_openat fs/open.c:1242 [inline] __se_compat_sys_openat+0x2a4/0x310 fs/open.c:1240 __ia32_compat_sys_openat+0x56/0x70 fs/open.c:1240 do_syscall_32_irqs_on arch/x86/entry/common.c:80 [inline] __do_fast_syscall_32+0x129/0x180 arch/x86/entry/common.c:139 do_fast_syscall_32+0x6a/0xc0 arch/x86/entry/common.c:162 do_SYSENTER_32+0x73/0x90 arch/x86/entry/common.c:205 entry_SYSENTER_compat_after_hwframe+0x4d/0x5c Fixes: 498c60153ebb ("quota: Implement quota format with 64-bit space and inode limits") Link: https://lore.kernel.org/r/20200924183619.4176790-1-edumazet@google.com Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Jan Kara <jack@suse.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-10-29xfs: force the log after remapping a synchronous-writes fileDarrick J. Wong
[ Upstream commit 5ffce3cc22a0e89813ed0c7162a68b639aef9ab6 ] Commit 5833112df7e9 tried to make it so that a remap operation would force the log out to disk if the filesystem is mounted with mandatory synchronous writes. Unfortunately, that commit failed to handle the case where the inode or the file descriptor require mandatory synchronous writes. Refactor the check into into a helper that will look for all three conditions, and now we can treat reflink just like any other synchronous write. Fixes: 5833112df7e9 ("xfs: reflink should force the log out if mounted with wsync") Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sasha Levin <sashal@kernel.org>