summaryrefslogtreecommitdiffstats
path: root/fs
AgeCommit message (Collapse)Author
2019-08-06io_uring: fix KASAN use after free in io_sq_wq_submit_workJackie Liu
commit d0ee879187df966ef638031b5f5183078d672141 upstream. [root@localhost ~]# ./liburing/test/link QEMU Standard PC report that: [ 29.379892] CPU: 0 PID: 84 Comm: kworker/u2:2 Not tainted 5.3.0-rc2-00051-g4010b622f1d2-dirty #86 [ 29.379902] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014 [ 29.379913] Workqueue: io_ring-wq io_sq_wq_submit_work [ 29.379929] Call Trace: [ 29.379953] dump_stack+0xa9/0x10e [ 29.379970] ? io_sq_wq_submit_work+0xbf4/0xe90 [ 29.379986] print_address_description.cold.6+0x9/0x317 [ 29.379999] ? io_sq_wq_submit_work+0xbf4/0xe90 [ 29.380010] ? io_sq_wq_submit_work+0xbf4/0xe90 [ 29.380026] __kasan_report.cold.7+0x1a/0x34 [ 29.380044] ? io_sq_wq_submit_work+0xbf4/0xe90 [ 29.380061] kasan_report+0xe/0x12 [ 29.380076] io_sq_wq_submit_work+0xbf4/0xe90 [ 29.380104] ? io_sq_thread+0xaf0/0xaf0 [ 29.380152] process_one_work+0xb59/0x19e0 [ 29.380184] ? pwq_dec_nr_in_flight+0x2c0/0x2c0 [ 29.380221] worker_thread+0x8c/0xf40 [ 29.380248] ? __kthread_parkme+0xab/0x110 [ 29.380265] ? process_one_work+0x19e0/0x19e0 [ 29.380278] kthread+0x30b/0x3d0 [ 29.380292] ? kthread_create_on_node+0xe0/0xe0 [ 29.380311] ret_from_fork+0x3a/0x50 [ 29.380635] Allocated by task 209: [ 29.381255] save_stack+0x19/0x80 [ 29.381268] __kasan_kmalloc.constprop.6+0xc1/0xd0 [ 29.381279] kmem_cache_alloc+0xc0/0x240 [ 29.381289] io_submit_sqe+0x11bc/0x1c70 [ 29.381300] io_ring_submit+0x174/0x3c0 [ 29.381311] __x64_sys_io_uring_enter+0x601/0x780 [ 29.381322] do_syscall_64+0x9f/0x4d0 [ 29.381336] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 29.381633] Freed by task 84: [ 29.382186] save_stack+0x19/0x80 [ 29.382198] __kasan_slab_free+0x11d/0x160 [ 29.382210] kmem_cache_free+0x8c/0x2f0 [ 29.382220] io_put_req+0x22/0x30 [ 29.382230] io_sq_wq_submit_work+0x28b/0xe90 [ 29.382241] process_one_work+0xb59/0x19e0 [ 29.382251] worker_thread+0x8c/0xf40 [ 29.382262] kthread+0x30b/0x3d0 [ 29.382272] ret_from_fork+0x3a/0x50 [ 29.382569] The buggy address belongs to the object at ffff888067172140 which belongs to the cache io_kiocb of size 224 [ 29.384692] The buggy address is located 120 bytes inside of 224-byte region [ffff888067172140, ffff888067172220) [ 29.386723] The buggy address belongs to the page: [ 29.387575] page:ffffea00019c5c80 refcount:1 mapcount:0 mapping:ffff88806ace5180 index:0x0 [ 29.387587] flags: 0x100000000000200(slab) [ 29.387603] raw: 0100000000000200 dead000000000100 dead000000000122 ffff88806ace5180 [ 29.387617] raw: 0000000000000000 00000000800c000c 00000001ffffffff 0000000000000000 [ 29.387624] page dumped because: kasan: bad access detected [ 29.387920] Memory state around the buggy address: [ 29.388771] ffff888067172080: fb fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc [ 29.390062] ffff888067172100: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb [ 29.391325] >ffff888067172180: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ 29.392578] ^ [ 29.393480] ffff888067172200: fb fb fb fb fc fc fc fc fc fc fc fc fc fc fc fc [ 29.394744] ffff888067172280: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc [ 29.396003] ================================================================== [ 29.397260] Disabling lock debugging due to kernel taint io_sq_wq_submit_work free and read req again. Cc: Zhengyuan Liu <liuzhengyuan@kylinos.cn> Cc: linux-block@vger.kernel.org Cc: stable@vger.kernel.org Fixes: f7b76ac9d17e ("io_uring: fix counter inc/dec mismatch in async_list") Signed-off-by: Jackie Liu <liuyun01@kylinos.cn> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-08-06loop: Fix mount(2) failure due to race with LOOP_SET_FDJan Kara
commit 89e524c04fa966330e2e80ab2bc50b9944c5847a upstream. Commit 33ec3e53e7b1 ("loop: Don't change loop device under exclusive opener") made LOOP_SET_FD ioctl acquire exclusive block device reference while it updates loop device binding. However this can make perfectly valid mount(2) fail with EBUSY due to racing LOOP_SET_FD holding temporarily the exclusive bdev reference in cases like this: for i in {a..z}{a..z}; do dd if=/dev/zero of=$i.image bs=1k count=0 seek=1024 mkfs.ext2 $i.image mkdir mnt$i done echo "Run" for i in {a..z}{a..z}; do mount -o loop -t ext2 $i.image mnt$i & done Fix the problem by not getting full exclusive bdev reference in LOOP_SET_FD but instead just mark the bdev as being claimed while we update the binding information. This just blocks new exclusive openers instead of failing them with EBUSY thus fixing the problem. Fixes: 33ec3e53e7b1 ("loop: Don't change loop device under exclusive opener") Cc: stable@vger.kernel.org Tested-by: Kai-Heng Feng <kai.heng.feng@canonical.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-08-06dax: Fix missed wakeup in put_unlocked_entry()Jan Kara
commit 61c30c98ef17e5a330d7bb8494b78b3d6dffe9b8 upstream. The condition checking whether put_unlocked_entry() needs to wake up following waiter got broken by commit 23c84eb78375 ("dax: Fix missed wakeup with PMD faults"). We need to wake the waiter whenever the passed entry is valid (i.e., non-NULL and not special conflict entry). This could lead to processes never being woken up when waiting for entry lock. Fix the condition. Cc: <stable@vger.kernel.org> Link: http://lore.kernel.org/r/20190729120228.GC17833@quack2.suse.cz Fixes: 23c84eb78375 ("dax: Fix missed wakeup with PMD faults") Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-08-06Btrfs: fix race leading to fs corruption after transaction abortFilipe Manana
commit cb2d3daddbfb6318d170e79aac1f7d5e4d49f0d7 upstream. When one transaction is finishing its commit, it is possible for another transaction to start and enter its initial commit phase as well. If the first ends up getting aborted, we have a small time window where the second transaction commit does not notice that the previous transaction aborted and ends up committing, writing a superblock that points to btrees that reference extent buffers (nodes and leafs) that were not persisted to disk. The consequence is that after mounting the filesystem again, we will be unable to load some btree nodes/leafs, either because the content on disk is either garbage (or just zeroes) or corresponds to the old content of a previouly COWed or deleted node/leaf, resulting in the well known error messages "parent transid verify failed on ...". The following sequence diagram illustrates how this can happen. CPU 1 CPU 2 <at transaction N> btrfs_commit_transaction() (...) --> sets transaction state to TRANS_STATE_UNBLOCKED --> sets fs_info->running_transaction to NULL (...) btrfs_start_transaction() start_transaction() wait_current_trans() --> returns immediately because fs_info->running_transaction is NULL join_transaction() --> creates transaction N + 1 --> sets fs_info->running_transaction to transaction N + 1 --> adds transaction N + 1 to the fs_info->trans_list list --> returns transaction handle pointing to the new transaction N + 1 (...) btrfs_sync_file() btrfs_start_transaction() --> returns handle to transaction N + 1 (...) btrfs_write_and_wait_transaction() --> writeback of some extent buffer fails, returns an error btrfs_handle_fs_error() --> sets BTRFS_FS_STATE_ERROR in fs_info->fs_state --> jumps to label "scrub_continue" cleanup_transaction() btrfs_abort_transaction(N) --> sets BTRFS_FS_STATE_TRANS_ABORTED flag in fs_info->fs_state --> sets aborted field in the transaction and transaction handle structures, for transaction N only --> removes transaction from the list fs_info->trans_list btrfs_commit_transaction(N + 1) --> transaction N + 1 was not aborted, so it proceeds (...) --> sets the transaction's state to TRANS_STATE_COMMIT_START --> does not find the previous transaction (N) in the fs_info->trans_list, so it doesn't know that transaction was aborted, and the commit of transaction N + 1 proceeds (...) --> sets transaction N + 1 state to TRANS_STATE_UNBLOCKED btrfs_write_and_wait_transaction() --> succeeds writing all extent buffers created in the transaction N + 1 write_all_supers() --> succeeds --> we now have a superblock on disk that points to trees that refer to at least one extent buffer that was never persisted So fix this by updating the transaction commit path to check if the flag BTRFS_FS_STATE_TRANS_ABORTED is set on fs_info->fs_state if after setting the transaction to the TRANS_STATE_COMMIT_START we do not find any previous transaction in the fs_info->trans_list. If the flag is set, just fail the transaction commit with -EROFS, as we do in other places. The exact error code for the previous transaction abort was already logged and reported. Fixes: 49b25e0540904b ("btrfs: enhance transaction abort infrastructure") CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-08-06Btrfs: fix incremental send failure after deduplicationFilipe Manana
commit b4f9a1a87a48c255bb90d8a6c3d555a1abb88130 upstream. When doing an incremental send operation we can fail if we previously did deduplication operations against a file that exists in both snapshots. In that case we will fail the send operation with -EIO and print a message to dmesg/syslog like the following: BTRFS error (device sdc): Send: inconsistent snapshot, found updated \ extent for inode 257 without updated inode item, send root is 258, \ parent root is 257 This requires that we deduplicate to the same file in both snapshots for the same amount of times on each snapshot. The issue happens because a deduplication only updates the iversion of an inode and does not update any other field of the inode, therefore if we deduplicate the file on each snapshot for the same amount of time, the inode will have the same iversion value (stored as the "sequence" field on the inode item) on both snapshots, therefore it will be seen as unchanged between in the send snapshot while there are new/updated/deleted extent items when comparing to the parent snapshot. This makes the send operation return -EIO and print an error message. Example reproducer: $ mkfs.btrfs -f /dev/sdb $ mount /dev/sdb /mnt # Create our first file. The first half of the file has several 64Kb # extents while the second half as a single 512Kb extent. $ xfs_io -f -s -c "pwrite -S 0xb8 -b 64K 0 512K" /mnt/foo $ xfs_io -c "pwrite -S 0xb8 512K 512K" /mnt/foo # Create the base snapshot and the parent send stream from it. $ btrfs subvolume snapshot -r /mnt /mnt/mysnap1 $ btrfs send -f /tmp/1.snap /mnt/mysnap1 # Create our second file, that has exactly the same data as the first # file. $ xfs_io -f -c "pwrite -S 0xb8 0 1M" /mnt/bar # Create the second snapshot, used for the incremental send, before # doing the file deduplication. $ btrfs subvolume snapshot -r /mnt /mnt/mysnap2 # Now before creating the incremental send stream: # # 1) Deduplicate into a subrange of file foo in snapshot mysnap1. This # will drop several extent items and add a new one, also updating # the inode's iversion (sequence field in inode item) by 1, but not # any other field of the inode; # # 2) Deduplicate into a different subrange of file foo in snapshot # mysnap2. This will replace an extent item with a new one, also # updating the inode's iversion by 1 but not any other field of the # inode. # # After these two deduplication operations, the inode items, for file # foo, are identical in both snapshots, but we have different extent # items for this inode in both snapshots. We want to check this doesn't # cause send to fail with an error or produce an incorrect stream. $ xfs_io -r -c "dedupe /mnt/bar 0 0 512K" /mnt/mysnap1/foo $ xfs_io -r -c "dedupe /mnt/bar 512K 512K 512K" /mnt/mysnap2/foo # Create the incremental send stream. $ btrfs send -p /mnt/mysnap1 -f /tmp/2.snap /mnt/mysnap2 ERROR: send ioctl failed with -5: Input/output error This issue started happening back in 2015 when deduplication was updated to not update the inode's ctime and mtime and update only the iversion. Back then we would hit a BUG_ON() in send, but later in 2016 send was updated to return -EIO and print the error message instead of doing the BUG_ON(). A test case for fstests follows soon. Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=203933 Fixes: 1c919a5e13702c ("btrfs: don't update mtime/ctime on deduped inodes") CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-08-06coda: add error handling for fgetZhouyang Jia
[ Upstream commit 02551c23bcd85f0c68a8259c7b953d49d44f86af ] When fget fails, the lack of error-handling code may cause unexpected results. This patch adds error-handling code after calling fget. Link: http://lkml.kernel.org/r/2514ec03df9c33b86e56748513267a80dd8004d9.1558117389.git.jaharkes@cs.cmu.edu Signed-off-by: Zhouyang Jia <jiazhouyang09@gmail.com> Signed-off-by: Jan Harkes <jaharkes@cs.cmu.edu> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Colin Ian King <colin.king@canonical.com> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: David Howells <dhowells@redhat.com> Cc: Fabian Frederick <fabf@skynet.be> Cc: Mikko Rapeli <mikko.rapeli@iki.fi> Cc: Sam Protsenko <semen.protsenko@linaro.org> Cc: Yann Droneaud <ydroneaud@opteya.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-08-06cifs: fix crash in cifs_dfs_do_automountRonnie Sahlberg
[ Upstream commit ce465bf94b70f03136171a62b607864f00093b19 ] RHBZ: 1649907 Fix a crash that happens while attempting to mount a DFS referral from the same server on the root of a filesystem. Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-08-06ceph: return -ERANGE if virtual xattr value didn't fit in bufferJeff Layton
[ Upstream commit 3b421018f48c482bdc9650f894aa1747cf90e51d ] The getxattr manpage states that we should return ERANGE if the destination buffer size is too small to hold the value. ceph_vxattrcb_layout does this internally, but we should be doing this for all vxattrs. Fix the only caller of getxattr_cb to check the returned size against the buffer length and return -ERANGE if it doesn't fit. Drop the same check in ceph_vxattrcb_layout and just rely on the caller to handle it. Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: "Yan, Zheng" <zyan@redhat.com> Acked-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-08-06ceph: fix dir_lease_is_valid()Yan, Zheng
[ Upstream commit feab6ac25dbfe3ab96299cb741925dc8d2da0caf ] It should call __ceph_dentry_dir_lease_touch() under dentry->d_lock. Besides, ceph_dentry(dentry) can be NULL when called by LOOKUP_RCU d_revalidate() Signed-off-by: "Yan, Zheng" <zyan@redhat.com> Reviewed-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-08-06ceph: fix improper use of smp_mb__before_atomic()Andrea Parri
[ Upstream commit 749607731e26dfb2558118038c40e9c0c80d23b5 ] This barrier only applies to the read-modify-write operations; in particular, it does not apply to the atomic64_set() primitive. Replace the barrier with an smp_mb(). Fixes: fdd4e15838e59 ("ceph: rework dcache readdir") Reported-by: "Paul E. McKenney" <paulmck@linux.ibm.com> Reported-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Andrea Parri <andrea.parri@amarulasolutions.com> Reviewed-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-08-06cifs: Fix a race condition with cifs_echo_requestRonnie Sahlberg
[ Upstream commit f2caf901c1b7ce65f9e6aef4217e3241039db768 ] There is a race condition with how we send (or supress and don't send) smb echos that will cause the client to incorrectly think the server is unresponsive and thus needs to be reconnected. Summary of the race condition: 1) Daisy chaining scheduling creates a gap. 2) If traffic comes unfortunate shortly after the last echo, the planned echo is suppressed. 3) Due to the gap, the next echo transmission is delayed until after the timeout, which is set hard to twice the echo interval. This is fixed by changing the timeouts from 2 to three times the echo interval. Detailed description of the bug: https://lutz.donnerhacke.de/eng/Blog/Groundhog-Day-with-SMB-remount Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-08-06btrfs: qgroup: Don't hold qgroup_ioctl_lock in btrfs_qgroup_inherit()Qu Wenruo
[ Upstream commit e88439debd0a7f969b3ddba6f147152cd0732676 ] [BUG] Lockdep will report the following circular locking dependency: WARNING: possible circular locking dependency detected 5.2.0-rc2-custom #24 Tainted: G O ------------------------------------------------------ btrfs/8631 is trying to acquire lock: 000000002536438c (&fs_info->qgroup_ioctl_lock#2){+.+.}, at: btrfs_qgroup_inherit+0x40/0x620 [btrfs] but task is already holding lock: 000000003d52cc23 (&fs_info->tree_log_mutex){+.+.}, at: create_pending_snapshot+0x8b6/0xe60 [btrfs] which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #2 (&fs_info->tree_log_mutex){+.+.}: __mutex_lock+0x76/0x940 mutex_lock_nested+0x1b/0x20 btrfs_commit_transaction+0x475/0xa00 [btrfs] btrfs_commit_super+0x71/0x80 [btrfs] close_ctree+0x2bd/0x320 [btrfs] btrfs_put_super+0x15/0x20 [btrfs] generic_shutdown_super+0x72/0x110 kill_anon_super+0x18/0x30 btrfs_kill_super+0x16/0xa0 [btrfs] deactivate_locked_super+0x3a/0x80 deactivate_super+0x51/0x60 cleanup_mnt+0x3f/0x80 __cleanup_mnt+0x12/0x20 task_work_run+0x94/0xb0 exit_to_usermode_loop+0xd8/0xe0 do_syscall_64+0x210/0x240 entry_SYSCALL_64_after_hwframe+0x49/0xbe -> #1 (&fs_info->reloc_mutex){+.+.}: __mutex_lock+0x76/0x940 mutex_lock_nested+0x1b/0x20 btrfs_commit_transaction+0x40d/0xa00 [btrfs] btrfs_quota_enable+0x2da/0x730 [btrfs] btrfs_ioctl+0x2691/0x2b40 [btrfs] do_vfs_ioctl+0xa9/0x6d0 ksys_ioctl+0x67/0x90 __x64_sys_ioctl+0x1a/0x20 do_syscall_64+0x65/0x240 entry_SYSCALL_64_after_hwframe+0x49/0xbe -> #0 (&fs_info->qgroup_ioctl_lock#2){+.+.}: lock_acquire+0xa7/0x190 __mutex_lock+0x76/0x940 mutex_lock_nested+0x1b/0x20 btrfs_qgroup_inherit+0x40/0x620 [btrfs] create_pending_snapshot+0x9d7/0xe60 [btrfs] create_pending_snapshots+0x94/0xb0 [btrfs] btrfs_commit_transaction+0x415/0xa00 [btrfs] btrfs_mksubvol+0x496/0x4e0 [btrfs] btrfs_ioctl_snap_create_transid+0x174/0x180 [btrfs] btrfs_ioctl_snap_create_v2+0x11c/0x180 [btrfs] btrfs_ioctl+0xa90/0x2b40 [btrfs] do_vfs_ioctl+0xa9/0x6d0 ksys_ioctl+0x67/0x90 __x64_sys_ioctl+0x1a/0x20 do_syscall_64+0x65/0x240 entry_SYSCALL_64_after_hwframe+0x49/0xbe other info that might help us debug this: Chain exists of: &fs_info->qgroup_ioctl_lock#2 --> &fs_info->reloc_mutex --> &fs_info->tree_log_mutex Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(&fs_info->tree_log_mutex); lock(&fs_info->reloc_mutex); lock(&fs_info->tree_log_mutex); lock(&fs_info->qgroup_ioctl_lock#2); *** DEADLOCK *** 6 locks held by btrfs/8631: #0: 00000000ed8f23f6 (sb_writers#12){.+.+}, at: mnt_want_write_file+0x28/0x60 #1: 000000009fb1597a (&type->i_mutex_dir_key#10/1){+.+.}, at: btrfs_mksubvol+0x70/0x4e0 [btrfs] #2: 0000000088c5ad88 (&fs_info->subvol_sem){++++}, at: btrfs_mksubvol+0x128/0x4e0 [btrfs] #3: 000000009606fc3e (sb_internal#2){.+.+}, at: start_transaction+0x37a/0x520 [btrfs] #4: 00000000f82bbdf5 (&fs_info->reloc_mutex){+.+.}, at: btrfs_commit_transaction+0x40d/0xa00 [btrfs] #5: 000000003d52cc23 (&fs_info->tree_log_mutex){+.+.}, at: create_pending_snapshot+0x8b6/0xe60 [btrfs] [CAUSE] Due to the delayed subvolume creation, we need to call btrfs_qgroup_inherit() inside commit transaction code, with a lot of other mutex hold. This hell of lock chain can lead to above problem. [FIX] On the other hand, we don't really need to hold qgroup_ioctl_lock if we're in the context of create_pending_snapshot(). As in that context, we're the only one being able to modify qgroup. All other qgroup functions which needs qgroup_ioctl_lock are either holding a transaction handle, or will start a new transaction: Functions will start a new transaction(): * btrfs_quota_enable() * btrfs_quota_disable() Functions hold a transaction handler: * btrfs_add_qgroup_relation() * btrfs_del_qgroup_relation() * btrfs_create_qgroup() * btrfs_remove_qgroup() * btrfs_limit_qgroup() * btrfs_qgroup_inherit() call inside create_subvol() So we have a higher level protection provided by transaction, thus we don't need to always hold qgroup_ioctl_lock in btrfs_qgroup_inherit(). Only the btrfs_qgroup_inherit() call in create_subvol() needs to hold qgroup_ioctl_lock, while the btrfs_qgroup_inherit() call in create_pending_snapshot() is already protected by transaction. So the fix is to detect the context by checking trans->transaction->state. If we're at TRANS_STATE_COMMIT_DOING, then we're in commit transaction context and no need to get the mutex. Reported-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-08-06btrfs: Flush before reflinking any extent to prevent NOCOW write falling ↵Qu Wenruo
back to COW without data reservation [ Upstream commit a94d1d0cb3bf1983fcdf05b59d914dbff4f1f52c ] [BUG] The following script can cause unexpected fsync failure: #!/bin/bash dev=/dev/test/test mnt=/mnt/btrfs mkfs.btrfs -f $dev -b 512M > /dev/null mount $dev $mnt -o nospace_cache # Prealloc one extent xfs_io -f -c "falloc 8k 64m" $mnt/file1 # Fill the remaining data space xfs_io -f -c "pwrite 0 -b 4k 512M" $mnt/padding sync # Write into the prealloc extent xfs_io -c "pwrite 1m 16m" $mnt/file1 # Reflink then fsync, fsync would fail due to ENOSPC xfs_io -c "reflink $mnt/file1 8k 0 4k" -c "fsync" $mnt/file1 umount $dev The fsync fails with ENOSPC, and the last page of the buffered write is lost. [CAUSE] This is caused by: - Btrfs' back reference only has extent level granularity So write into shared extent must be COWed even only part of the extent is shared. So for above script we have: - fallocate Create a preallocated extent where we can do NOCOW write. - fill all the remaining data and unallocated space - buffered write into preallocated space As we have not enough space available for data and the extent is not shared (yet) we fall into NOCOW mode. - reflink Now part of the large preallocated extent is shared, later write into that extent must be COWed. - fsync triggers writeback But now the extent is shared and therefore we must fallback into COW mode, which fails with ENOSPC since there's not enough space to allocate data extents. [WORKAROUND] The workaround is to ensure any buffered write in the related extents (not just the reflink source range) get flushed before reflink/dedupe, so that NOCOW writes succeed that happened before reflinking succeed. The workaround is expensive, we could do it better by only flushing NOCOW range, but that needs extra accounting for NOCOW range. For now, fix the possible data loss first. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-08-06btrfs: fix minimum number of chunk errors for DUPDavid Sterba
[ Upstream commit 0ee5f8ae082e1f675a2fb6db601c31ac9958a134 ] The list of profiles in btrfs_chunk_max_errors lists DUP as a profile DUP able to tolerate 1 device missing. Though this profile is special with 2 copies, it still needs the device, unlike the others. Looking at the history of changes, thre's no clear reason why DUP is there, functions were refactored and blocks of code merged to one helper. d20983b40e828 Btrfs: fix writing data into the seed filesystem - factor code to a helper de11cc12df173 Btrfs: don't pre-allocate btrfs bio - unrelated change, DUP still in the list with max errors 1 a236aed14ccb0 Btrfs: Deal with failed writes in mirrored configurations - introduced the max errors, leaves DUP and RAID1 in the same group Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-08-06btrfs: tree-checker: Check if the file extent end overflowsQu Wenruo
[ Upstream commit 4c094c33c9ed4b8d0d814bd1d7ff78e123d15d00 ] Under certain conditions, we could have strange file extent item in log tree like: item 18 key (69599 108 397312) itemoff 15208 itemsize 53 extent data disk bytenr 0 nr 0 extent data offset 0 nr 18446744073709547520 ram 18446744073709547520 The num_bytes + ram_bytes overflow 64 bit type. For num_bytes part, we can detect such overflow along with file offset (key->offset), as file_offset + num_bytes should never go beyond u64. For ram_bytes part, it's about the decompressed size of the extent, not directly related to the size. In theory it is OK to have a large value, and put extra limitation on RAM bytes may cause unexpected false alerts. So in tree-checker, we only check if the file offset and num bytes overflow. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-08-06fs/adfs: super: fix use-after-free bugRussell King
[ Upstream commit 5808b14a1f52554de612fee85ef517199855e310 ] Fix a use-after-free bug during filesystem initialisation, where we access the disc record (which is stored in a buffer) after we have released the buffer. Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-08-04ceph: hold i_ceph_lock when removing caps for freeing inodeYan, Zheng
commit d6e47819721ae2d9d090058ad5570a66f3c42e39 upstream. ceph_d_revalidate(, LOOKUP_RCU) may call __ceph_caps_issued_mask() on a freeing inode. Signed-off-by: "Yan, Zheng" <zyan@redhat.com> Reviewed-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-08-04/proc/<pid>/cmdline: add back the setproctitle() special caseLinus Torvalds
commit d26d0cd97c88eb1a5704b42e41ab443406807810 upstream. This makes the setproctitle() special case very explicit indeed, and handles it with a separate helper function entirely. In the process, it re-instates the original semantics of simply stopping at the first NUL character when the original last NUL character is no longer there. [ The original semantics can still be seen in mm/util.c: get_cmdline() that is limited to a fixed-size buffer ] This makes the logic about when we use the string lengths etc much more obvious, and makes it easier to see what we do and what the two very different cases are. Note that even when we allow walking past the end of the argument array (because the setproctitle() might have overwritten and overflowed the original argv[] strings), we only allow it when it overflows into the environment region if it is immediately adjacent. [ Fixed for missing 'count' checks noted by Alexey Izbyshev ] Link: https://lore.kernel.org/lkml/alpine.LNX.2.21.1904052326230.3249@kich.toxcorp.com/ Fixes: 5ab827189965 ("fs/proc: simplify and clarify get_mm_cmdline() function") Cc: Jakub Jankowski <shasta@toxcorp.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Alexey Izbyshev <izbyshev@ispras.ru> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-08-04/proc/<pid>/cmdline: remove all the special casesLinus Torvalds
commit 3d712546d8ba9f25cdf080d79f90482aa4231ed4 upstream. Start off with a clean slate that only reads exactly from arg_start to arg_end, without any oddities. This simplifies the code and in the process removes the case that caused us to potentially leak an uninitialized byte from the temporary kernel buffer. Note that in order to start from scratch with an understandable base, this simplifies things _too_ much, and removes all the legacy logic to handle setproctitle() having changed the argument strings. We'll add back those special cases very differently in the next commit. Link: https://lore.kernel.org/lkml/20190712160913.17727-1-izbyshev@ispras.ru/ Fixes: f5b65348fd77 ("proc: fix missing final NUL in get_mm_cmdline() rewrite") Cc: Alexey Izbyshev <izbyshev@ispras.ru> Cc: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-08-04sched/fair: Don't free p->numa_faults with concurrent readersJann Horn
commit 16d51a590a8ce3befb1308e0e7ab77f3b661af33 upstream. When going through execve(), zero out the NUMA fault statistics instead of freeing them. During execve, the task is reachable through procfs and the scheduler. A concurrent /proc/*/sched reader can read data from a freed ->numa_faults allocation (confirmed by KASAN) and write it back to userspace. I believe that it would also be possible for a use-after-free read to occur through a race between a NUMA fault and execve(): task_numa_fault() can lead to task_numa_compare(), which invokes task_weight() on the currently running task of a different CPU. Another way to fix this would be to make ->numa_faults RCU-managed or add extra locking, but it seems easier to wipe the NUMA fault statistics on execve. Signed-off-by: Jann Horn <jannh@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Petr Mladek <pmladek@suse.com> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will@kernel.org> Fixes: 82727018b0d3 ("sched/numa: Call task_numa_free() from do_execve()") Link: https://lkml.kernel.org/r/20190716152047.14424-1-jannh@google.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-08-04NFS: Cleanup if nfs_match_client is interruptedBenjamin Coddington
commit 9f7761cf0409465075dadb875d5d4b8ef2f890c8 upstream. Don't bail out before cleaning up a new allocation if the wait for searching for a matching nfs client is interrupted. Memory leaks. Reported-by: syzbot+7fe11b49c1cc30e3fce2@syzkaller.appspotmail.com Fixes: 950a578c6128 ("NFS: make nfs_match_client killable") Signed-off-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-07-31io_uring: don't use iov_iter_advance() for fixed buffersJens Axboe
commit bd11b3a391e3df6fa958facbe4b3f9f4cca9bd49 upstream. Hrvoje reports that when a large fixed buffer is registered and IO is being done to the latter pages of said buffer, the IO submission time is much worse: reading to the start of the buffer: 11238 ns reading to the end of the buffer: 1039879 ns In fact, it's worse by two orders of magnitude. The reason for that is how io_uring figures out how to setup the iov_iter. We point the iter at the first bvec, and then use iov_iter_advance() to fast-forward to the offset within that buffer we need. However, that is abysmally slow, as it entails iterating the bvecs that we setup as part of buffer registration. There's really no need to use this generic helper, as we know it's a BVEC type iterator, and we also know that each bvec is PAGE_SIZE in size, apart from possibly the first and last. Hence we can just use a shift on the offset to find the right index, and then adjust the iov_iter appropriately. After this fix, the timings are: reading to the start of the buffer: 10135 ns reading to the end of the buffer: 1377 ns Or about an 755x improvement for the tail page. Reported-by: Hrvoje Zeba <zeba.hrvoje@gmail.com> Tested-by: Hrvoje Zeba <zeba.hrvoje@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-07-31io_uring: fix counter inc/dec mismatch in async_listZhengyuan Liu
commit f7b76ac9d17e16e44feebb6d2749fec92bfd6dd4 upstream. We could queue a work for each req in defer and link list without increasing async_list->cnt, so we shouldn't decrease it while exiting from workqueue as well if we didn't process the req in async list. Thanks to Jens Axboe <axboe@kernel.dk> for his guidance. Fixes: 31b515106428 ("io_uring: allow workqueue item to handle multiple buffered requests") Signed-off-by: Zhengyuan Liu <liuzhengyuan@kylinos.cn> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-07-31io_uring: ensure ->list is initialized for poll commandsJens Axboe
commit 36703247d5f52a679df9da51192b6950fe81689f upstream. Daniel reports that when testing an http server that uses io_uring to poll for incoming connections, sometimes it hard crashes. This is due to an uninitialized list member for the io_uring request. Normally this doesn't trigger and none of the test cases caught it. Reported-by: Daniel Kozak <kozzi11@gmail.com> Tested-by: Daniel Kozak <kozzi11@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-07-31io_uring: add a memory barrier before atomic_readZhengyuan Liu
commit c0e48f9dea9129aa11bec3ed13803bcc26e96e49 upstream. There is a hang issue while using fio to do some basic test. The issue can be easily reproduced using the below script: while true do fio --ioengine=io_uring -rw=write -bs=4k -numjobs=1 \ -size=1G -iodepth=64 -name=uring --filename=/dev/zero done After several minutes (or more), fio would block at io_uring_enter->io_cqring_wait in order to waiting for previously committed sqes to be completed and can't return to user anymore until we send a SIGTERM to fio. After receiving SIGTERM, fio hangs at io_ring_ctx_wait_and_kill with a backtrace like this: [54133.243816] Call Trace: [54133.243842] __schedule+0x3a0/0x790 [54133.243868] schedule+0x38/0xa0 [54133.243880] schedule_timeout+0x218/0x3b0 [54133.243891] ? sched_clock+0x9/0x10 [54133.243903] ? wait_for_completion+0xa3/0x130 [54133.243916] ? _raw_spin_unlock_irq+0x2c/0x40 [54133.243930] ? trace_hardirqs_on+0x3f/0xe0 [54133.243951] wait_for_completion+0xab/0x130 [54133.243962] ? wake_up_q+0x70/0x70 [54133.243984] io_ring_ctx_wait_and_kill+0xa0/0x1d0 [54133.243998] io_uring_release+0x20/0x30 [54133.244008] __fput+0xcf/0x270 [54133.244029] ____fput+0xe/0x10 [54133.244040] task_work_run+0x7f/0xa0 [54133.244056] do_exit+0x305/0xc40 [54133.244067] ? get_signal+0x13b/0xbd0 [54133.244088] do_group_exit+0x50/0xd0 [54133.244103] get_signal+0x18d/0xbd0 [54133.244112] ? _raw_spin_unlock_irqrestore+0x36/0x60 [54133.244142] do_signal+0x34/0x720 [54133.244171] ? exit_to_usermode_loop+0x7e/0x130 [54133.244190] exit_to_usermode_loop+0xc0/0x130 [54133.244209] do_syscall_64+0x16b/0x1d0 [54133.244221] entry_SYSCALL_64_after_hwframe+0x49/0xbe The reason is that we had added a req to ctx->pending_async at the very end, but it didn't get a chance to be processed. How could this happen? fio#cpu0 wq#cpu1 io_add_to_prev_work io_sq_wq_submit_work atomic_read() <<< 1 atomic_dec_return() << 1->0 list_empty(); <<< true; list_add_tail() atomic_read() << 0 or 1? As atomic_ops.rst states, atomic_read does not guarantee that the runtime modification by any other thread is visible yet, so we must take care of that with a proper implicit or explicit memory barrier. This issue was detected with the help of Jackie's <liuyun01@kylinos.cn> Fixes: 31b515106428 ("io_uring: allow workqueue item to handle multiple buffered requests") Signed-off-by: Zhengyuan Liu <liuzhengyuan@kylinos.cn> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-07-31access: avoid the RCU grace period for the temporary subjective credentialsLinus Torvalds
commit d7852fbd0f0423937fa287a598bfde188bb68c22 upstream. It turns out that 'access()' (and 'faccessat()') can cause a lot of RCU work because it installs a temporary credential that gets allocated and freed for each system call. The allocation and freeing overhead is mostly benign, but because credentials can be accessed under the RCU read lock, the freeing involves a RCU grace period. Which is not a huge deal normally, but if you have a lot of access() calls, this causes a fair amount of seconday damage: instead of having a nice alloc/free patterns that hits in hot per-CPU slab caches, you have all those delayed free's, and on big machines with hundreds of cores, the RCU overhead can end up being enormous. But it turns out that all of this is entirely unnecessary. Exactly because access() only installs the credential as the thread-local subjective credential, the temporary cred pointer doesn't actually need to be RCU free'd at all. Once we're done using it, we can just free it synchronously and avoid all the RCU overhead. So add a 'non_rcu' flag to 'struct cred', which can be set by users that know they only use it in non-RCU context (there are other potential users for this). We can make it a union with the rcu freeing list head that we need for the RCU case, so this doesn't need any extra storage. Note that this also makes 'get_current_cred()' clear the new non_rcu flag, in case we have filesystems that take a long-term reference to the cred and then expect the RCU delayed freeing afterwards. It's not entirely clear that this is required, but it makes for clear semantics: the subjective cred remains non-RCU as long as you only access it synchronously using the thread-local accessors, but you _can_ use it as a generic cred if you want to. It is possible that we should just remove the whole RCU markings for ->cred entirely. Only ->real_cred is really supposed to be accessed through RCU, and the long-term cred copies that nfs uses might want to explicitly re-enable RCU freeing if required, rather than have get_current_cred() do it implicitly. But this is a "minimal semantic changes" change for the immediate problem. Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Eric Dumazet <edumazet@google.com> Acked-by: Paul E. McKenney <paulmck@linux.ibm.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Jan Glauber <jglauber@marvell.com> Cc: Jiri Kosina <jikos@kernel.org> Cc: Jayachandran Chandrasekharan Nair <jnair@marvell.com> Cc: Greg KH <greg@kroah.com> Cc: Kees Cook <keescook@chromium.org> Cc: David Howells <dhowells@redhat.com> Cc: Miklos Szeredi <miklos@szeredi.hu> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-07-31io_uring: fix the sequence comparison in io_sequence_deferZhengyuan Liu
commit dbd0f6d6c2a11eb9c31ca9cd454f95bb5713e92e upstream. sq->cached_sq_head and cq->cached_cq_tail are both unsigned int. If cached_sq_head overflows before cached_cq_tail, then we may miss a barrier req. As cached_cq_tail always follows cached_sq_head, the NQ should be enough. Cc: stable@vger.kernel.org Fixes: de0617e46717 ("io_uring: add support for marking commands as draining") Signed-off-by: Zhengyuan Liu <liuzhengyuan@kylinos.cn> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-07-31btrfs: inode: Don't compress if NODATASUM or NODATACOW setQu Wenruo
commit 42c16da6d684391db83788eb680accd84f6c2083 upstream. As btrfs(5) specified: Note If nodatacow or nodatasum are enabled, compression is disabled. If NODATASUM or NODATACOW set, we should not compress the extent. Normally NODATACOW is detected properly in run_delalloc_range() so compression won't happen for NODATACOW. However for NODATASUM we don't have any check, and it can cause compressed extent without csum pretty easily, just by: mkfs.btrfs -f $dev mount $dev $mnt -o nodatasum touch $mnt/foobar mount -o remount,datasum,compress $mnt xfs_io -f -c "pwrite 0 128K" $mnt/foobar And in fact, we have a bug report about corrupted compressed extent without proper data checksum so even RAID1 can't recover the corruption. (https://bugzilla.kernel.org/show_bug.cgi?id=199707) Running compression without proper checksum could cause more damage when corruption happens, as compressed data could make the whole extent unreadable, so there is no need to allow compression for NODATACSUM. The fix will refactor the inode compression check into two parts: - inode_can_compress() As the hard requirement, checked at btrfs_run_delalloc_range(), so no compression will happen for NODATASUM inode at all. - inode_need_compress() As the soft requirement, checked at btrfs_run_delalloc_range() and compress_file_range(). Reported-by: James Harvey <jamespharvey20@gmail.com> CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-07-31proc: use down_read_killable mmap_sem for /proc/pid/mapsKonstantin Khlebnikov
[ Upstream commit 8a713e7df3352b8d9392476e9cf29e4e185dac32 ] Do not remain stuck forever if something goes wrong. Using a killable lock permits cleanup of stuck tasks and simplifies investigation. This function is also used for /proc/pid/smaps. Link: http://lkml.kernel.org/r/156007493160.3335.14447544314127417266.stgit@buzz Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Reviewed-by: Roman Gushchin <guro@fb.com> Reviewed-by: Cyrill Gorcunov <gorcunov@gmail.com> Reviewed-by: Kirill Tkhai <ktkhai@virtuozzo.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Koutný <mkoutny@suse.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-07-31proc: use down_read_killable mmap_sem for /proc/pid/map_filesKonstantin Khlebnikov
[ Upstream commit cd9e2bb8271c971d9f37c722be2616c7f8ba0664 ] Do not remain stuck forever if something goes wrong. Using a killable lock permits cleanup of stuck tasks and simplifies investigation. It seems ->d_revalidate() could return any error (except ECHILD) to abort validation and pass error as result of lookup sequence. [akpm@linux-foundation.org: fix proc_map_files_lookup() return value, per Andrei] Link: http://lkml.kernel.org/r/156007493995.3335.9595044802115356911.stgit@buzz Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Reviewed-by: Roman Gushchin <guro@fb.com> Reviewed-by: Cyrill Gorcunov <gorcunov@gmail.com> Reviewed-by: Kirill Tkhai <ktkhai@virtuozzo.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Koutný <mkoutny@suse.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-07-31proc: use down_read_killable mmap_sem for /proc/pid/clear_refsKonstantin Khlebnikov
[ Upstream commit c46038017fbdcac627b670c9d4176f1d0c2f5fa3 ] Do not remain stuck forever if something goes wrong. Using a killable lock permits cleanup of stuck tasks and simplifies investigation. Replace the only unkillable mmap_sem lock in clear_refs_write(). Link: http://lkml.kernel.org/r/156007493826.3335.5424884725467456239.stgit@buzz Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Reviewed-by: Roman Gushchin <guro@fb.com> Reviewed-by: Cyrill Gorcunov <gorcunov@gmail.com> Reviewed-by: Kirill Tkhai <ktkhai@virtuozzo.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Koutný <mkoutny@suse.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-07-31proc: use down_read_killable mmap_sem for /proc/pid/pagemapKonstantin Khlebnikov
[ Upstream commit ad80b932c57d85fd6377f97f359b025baf179a87 ] Do not remain stuck forever if something goes wrong. Using a killable lock permits cleanup of stuck tasks and simplifies investigation. Link: http://lkml.kernel.org/r/156007493638.3335.4872164955523928492.stgit@buzz Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Reviewed-by: Roman Gushchin <guro@fb.com> Reviewed-by: Cyrill Gorcunov <gorcunov@gmail.com> Reviewed-by: Kirill Tkhai <ktkhai@virtuozzo.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Koutný <mkoutny@suse.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-07-31proc: use down_read_killable mmap_sem for /proc/pid/smaps_rollupKonstantin Khlebnikov
[ Upstream commit a26a97815548574213fd37f29b4b78ccc6d9ed20 ] Do not remain stuck forever if something goes wrong. Using a killable lock permits cleanup of stuck tasks and simplifies investigation. Link: http://lkml.kernel.org/r/156007493429.3335.14666825072272692455.stgit@buzz Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Reviewed-by: Roman Gushchin <guro@fb.com> Reviewed-by: Cyrill Gorcunov <gorcunov@gmail.com> Reviewed-by: Kirill Tkhai <ktkhai@virtuozzo.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Koutný <mkoutny@suse.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-07-31memcg, fsnotify: no oom-kill for remote memcg chargingShakeel Butt
[ Upstream commit ec165450968b26298bd1c373de37b0ab6d826b33 ] Commit d46eb14b735b ("fs: fsnotify: account fsnotify metadata to kmemcg") added remote memcg charging for fanotify and inotify event objects. The aim was to charge the memory to the listener who is interested in the events but without triggering the OOM killer. Otherwise there would be security concerns for the listener. At the time, oom-kill trigger was not in the charging path. A parallel work added the oom-kill back to charging path i.e. commit 29ef680ae7c2 ("memcg, oom: move out_of_memory back to the charge path"). So to not trigger oom-killer in the remote memcg, explicitly add __GFP_RETRY_MAYFAIL to the fanotigy and inotify event allocations. Link: http://lkml.kernel.org/r/20190514212259.156585-2-shakeelb@google.com Signed-off-by: Shakeel Butt <shakeelb@google.com> Reviewed-by: Roman Gushchin <guro@fb.com> Acked-by: Jan Kara <jack@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-07-319p: pass the correct prototype to read_cache_pageChristoph Hellwig
[ Upstream commit f053cbd4366051d7eb6ba1b8d529d20f719c2963 ] Fix the callback 9p passes to read_cache_page to actually have the proper type expected. Casting around function pointers can easily hide typing bugs, and defeats control flow protection. Link: http://lkml.kernel.org/r/20190520055731.24538-5-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Kees Cook <keescook@chromium.org> Cc: Sami Tolvanen <samitolvanen@google.com> Cc: Nick Desaulniers <ndesaulniers@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-07-31dlm: check if workqueues are NULL before flushing/destroyingDavid Windsor
[ Upstream commit b355516f450703c9015316e429b66a93dfff0e6f ] If the DLM lowcomms stack is shut down before any DLM traffic can be generated, flush_workqueue() and destroy_workqueue() can be called on empty send and/or recv workqueues. Insert guard conditionals to only call flush_workqueue() and destroy_workqueue() on workqueues that are not NULL. Signed-off-by: David Windsor <dwindsor@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-07-31f2fs: avoid out-of-range memory accessOcean Chen
[ Upstream commit 56f3ce675103e3fb9e631cfb4131fc768bc23e9a ] blkoff_off might over 512 due to fs corrupt or security vulnerability. That should be checked before being using. Use ENTRIES_IN_SUM to protect invalid value in cur_data_blkoff. Signed-off-by: Ocean Chen <oceanchen@google.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-07-31f2fs: fix to avoid long latency during umountHeng Xiao
[ Upstream commit 6e0cd4a9dd4df1a0afcb454f1e654b5c80685913 ] In umount, we give an constand time to handle pending discard, previously, in __issue_discard_cmd() we missed to check timeout condition in loop, result in delaying long time, fix it. Signed-off-by: Heng Xiao <heng.xiao@unisoc.com> [Chao Yu: add commit message] Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-07-31io_uring: fix io_sq_thread_stop running in front of io_sq_threadJackie Liu
[ Upstream commit a4c0b3decb33fb4a2b5ecc6234a50680f0b21e7d ] INFO: task syz-executor.5:8634 blocked for more than 143 seconds. Not tainted 5.2.0-rc5+ #3 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. syz-executor.5 D25632 8634 8224 0x00004004 Call Trace: context_switch kernel/sched/core.c:2818 [inline] __schedule+0x658/0x9e0 kernel/sched/core.c:3445 schedule+0x131/0x1d0 kernel/sched/core.c:3509 schedule_timeout+0x9a/0x2b0 kernel/time/timer.c:1783 do_wait_for_common+0x35e/0x5a0 kernel/sched/completion.c:83 __wait_for_common kernel/sched/completion.c:104 [inline] wait_for_common kernel/sched/completion.c:115 [inline] wait_for_completion+0x47/0x60 kernel/sched/completion.c:136 kthread_stop+0xb4/0x150 kernel/kthread.c:559 io_sq_thread_stop fs/io_uring.c:2252 [inline] io_finish_async fs/io_uring.c:2259 [inline] io_ring_ctx_free fs/io_uring.c:2770 [inline] io_ring_ctx_wait_and_kill+0x268/0x880 fs/io_uring.c:2834 io_uring_release+0x5d/0x70 fs/io_uring.c:2842 __fput+0x2e4/0x740 fs/file_table.c:280 ____fput+0x15/0x20 fs/file_table.c:313 task_work_run+0x17e/0x1b0 kernel/task_work.c:113 tracehook_notify_resume include/linux/tracehook.h:185 [inline] exit_to_usermode_loop arch/x86/entry/common.c:168 [inline] prepare_exit_to_usermode+0x402/0x4f0 arch/x86/entry/common.c:199 syscall_return_slowpath+0x110/0x440 arch/x86/entry/common.c:279 do_syscall_64+0x126/0x140 arch/x86/entry/common.c:304 entry_SYSCALL_64_after_hwframe+0x49/0xbe RIP: 0033:0x412fb1 Code: 80 3b 7c 0f 84 c7 02 00 00 c7 85 d0 00 00 00 00 00 00 00 48 8b 05 cf a6 24 00 49 8b 14 24 41 b9 cb 2a 44 00 48 89 ee 48 89 df <48> 85 c0 4c 0f 45 c8 45 31 c0 31 c9 e8 0e 5b 00 00 85 c0 41 89 c7 RSP: 002b:00007ffe7ee6a180 EFLAGS: 00000293 ORIG_RAX: 0000000000000003 RAX: 0000000000000000 RBX: 0000000000000004 RCX: 0000000000412fb1 RDX: 0000001b2d920000 RSI: 0000000000000000 RDI: 0000000000000003 RBP: 0000000000000001 R08: 00000000f3a3e1f8 R09: 00000000f3a3e1fc R10: 00007ffe7ee6a260 R11: 0000000000000293 R12: 000000000075c9a0 R13: 000000000075c9a0 R14: 0000000000024c00 R15: 000000000075bf2c ============================================= There is an wrong logic, when kthread_park running in front of io_sq_thread. CPU#0 CPU#1 io_sq_thread_stop: int kthread(void *_create): kthread_park() __kthread_parkme(self); <<< Wrong kthread_stop() << wait for self->exited << clear_bit KTHREAD_SHOULD_PARK ret = threadfn(data); | |- io_sq_thread |- kthread_should_park() << false |- schedule() <<< nobody wake up stuck CPU#0 stuck CPU#1 So, use a new variable sqo_thread_started to ensure that io_sq_thread run first, then io_sq_thread_stop. Reported-by: syzbot+94324416c485d422fe15@syzkaller.appspotmail.com Suggested-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Jackie Liu <liuyun01@kylinos.cn> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-07-31f2fs: fix is_idle() check for discard typeSahitya Tummala
[ Upstream commit 56659ce838456c6f2315ce8a4bd686ac4b23e9d1 ] The discard thread should issue upto dpolicy->max_requests at once and wait for all those discard requests at once it reaches dpolicy->max_requests. It should then sleep for dpolicy->min_interval timeout before issuing the next batch of discard requests. But in the current code of is_idle(), it checks for dcc_info->queued_discard and aborts issuing the discard batch of max_requests. This dcc_info->queued_discard will be true always once one discard command is issued. It is thus resulting into this type of discard request pattern - - Issue discard request#1 - is_idle() returns false, discard thread waits for request#1 and then sleeps for min_interval 50ms. - Issue discard request#2 - is_idle() returns false, discard thread waits for request#2 and then sleeps for min_interval 50ms. - and so on for all other discard requests, assuming f2fs is idle w.r.t other conditions. With this fix, the pattern will look like this - - Issue discard request#1 - Issue discard request#2 and so on upto max_requests of 8 - Issue discard request#8 - wait for min_interval 50ms. Signed-off-by: Sahitya Tummala <stummala@codeaurora.org> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-07-31f2fs: Lower threshold for disable_cp_againDaniel Rosenberg
[ Upstream commit ae4ad7ea09d32ff1b6fb908ff12f8c1bd5241b29 ] The existing threshold for allowable holes at checkpoint=disable time is too high. The OVP space contains reserved segments, which are always in the form of free segments. These must be subtracted from the OVP value. The current threshold is meant to be the maximum value of holes of a single type we can have and still guarantee that we can fill the disk without failing to find space for a block of a given type. If the disk is full, ignoring current reserved, which only helps us, the amount of unused blocks is equal to the OVP area. Of that, there are reserved segments, which must be free segments, and the rest of the ovp area, which can come from either free segments or holes. The maximum possible amount of holes is OVP-reserved. Now, consider the disk when mounting with checkpoint=disable. We must be able to fill all available free space with either data or node blocks. When we start with checkpoint=disable, holes are locked to their current type. Say we have H of one type of hole, and H+X of the other. We can fill H of that space with arbitrary typed blocks via SSR. For the remaining H+X blocks, we may not have any of a given block type left at all. For instance, if we were to fill the disk entirely with blocks of the type with fewer holes, the H+X blocks of the opposite type would not be used. If H+X > OVP-reserved, there would be more holes than could possibly exist, and we would have failed to find a suitable block earlier on, leading to a crash in update_sit_entry. If H+X <= OVP-reserved, then the holes end up effectively masked by the OVP region in this case. Signed-off-by: Daniel Rosenberg <drosen@google.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-07-31f2fs: Fix accounting for unusable blocksDaniel Rosenberg
[ Upstream commit a4c3ecaaadac5693f555cfef1c9eecf4c39df818 ] Fixes possible underflows when dealing with unusable blocks. Signed-off-by: Daniel Rosenberg <drosen@google.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-07-31f2fs: fix to avoid deadloop if data_flush is onChao Yu
[ Upstream commit 040d2bb318d1aea4f28cc22504b44e446666c86e ] As Hagbard Celine reported: [ 615.697824] INFO: task kworker/u16:5:344 blocked for more than 120 seconds. [ 615.697825] Not tainted 5.0.15-gentoo-f2fslog #4 [ 615.697826] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 615.697827] kworker/u16:5 D 0 344 2 0x80000000 [ 615.697831] Workqueue: writeback wb_workfn (flush-259:0) [ 615.697832] Call Trace: [ 615.697836] ? __schedule+0x2c5/0x8b0 [ 615.697839] schedule+0x32/0x80 [ 615.697841] schedule_preempt_disabled+0x14/0x20 [ 615.697842] __mutex_lock.isra.8+0x2ba/0x4d0 [ 615.697845] ? log_store+0xf5/0x260 [ 615.697848] f2fs_write_data_pages+0x133/0x320 [ 615.697851] ? trace_hardirqs_on+0x2c/0xe0 [ 615.697854] do_writepages+0x41/0xd0 [ 615.697857] __filemap_fdatawrite_range+0x81/0xb0 [ 615.697859] f2fs_sync_dirty_inodes+0x1dd/0x200 [ 615.697861] f2fs_balance_fs_bg+0x2a7/0x2c0 [ 615.697863] ? up_read+0x5/0x20 [ 615.697865] ? f2fs_do_write_data_page+0x2cb/0x940 [ 615.697867] f2fs_balance_fs+0xe5/0x2c0 [ 615.697869] __write_data_page+0x1c8/0x6e0 [ 615.697873] f2fs_write_cache_pages+0x1e0/0x450 [ 615.697878] f2fs_write_data_pages+0x14b/0x320 [ 615.697880] ? trace_hardirqs_on+0x2c/0xe0 [ 615.697883] do_writepages+0x41/0xd0 [ 615.697885] __filemap_fdatawrite_range+0x81/0xb0 [ 615.697887] f2fs_sync_dirty_inodes+0x1dd/0x200 [ 615.697889] f2fs_balance_fs_bg+0x2a7/0x2c0 [ 615.697891] f2fs_write_node_pages+0x51/0x220 [ 615.697894] do_writepages+0x41/0xd0 [ 615.697897] __writeback_single_inode+0x3d/0x3d0 [ 615.697899] writeback_sb_inodes+0x1e8/0x410 [ 615.697902] __writeback_inodes_wb+0x5d/0xb0 [ 615.697904] wb_writeback+0x28f/0x340 [ 615.697906] ? cpumask_next+0x16/0x20 [ 615.697908] wb_workfn+0x33e/0x420 [ 615.697911] process_one_work+0x1a1/0x3d0 [ 615.697913] worker_thread+0x30/0x380 [ 615.697915] ? process_one_work+0x3d0/0x3d0 [ 615.697916] kthread+0x116/0x130 [ 615.697918] ? kthread_create_worker_on_cpu+0x70/0x70 [ 615.697921] ret_from_fork+0x3a/0x50 There is still deadloop in below condition: d A - do_writepages - f2fs_write_node_pages - f2fs_balance_fs_bg - f2fs_sync_dirty_inodes - f2fs_write_cache_pages - mutex_lock(&sbi->writepages) -- lock once - __write_data_page - f2fs_balance_fs_bg - f2fs_sync_dirty_inodes - f2fs_write_data_pages - mutex_lock(&sbi->writepages) -- lock again Thread A Thread B - do_writepages - f2fs_write_node_pages - f2fs_balance_fs_bg - f2fs_sync_dirty_inodes - .cp_task = current - f2fs_sync_dirty_inodes - .cp_task = current - filemap_fdatawrite - .cp_task = NULL - filemap_fdatawrite - f2fs_write_cache_pages - enter f2fs_balance_fs_bg since .cp_task is NULL - .cp_task = NULL Change as below to avoid this: - add condition to avoid holding .writepages mutex lock in path of data flush - introduce mutex lock sbi.flush_lock to exclude concurrent data flush in background. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-07-31f2fs: fix to check layout on last valid checkpoint parkChao Yu
[ Upstream commit 5dae2d39074dde941cc3150dcbb7840d88179743 ] As Ju Hyung reported: " I was semi-forced today to use the new kernel and test f2fs. My Ubuntu initramfs got a bit wonky and I had to boot into live CD and fix some stuffs. The live CD was using 4.15 kernel, and just mounting the f2fs partition there corrupted f2fs and my 4.19(with 5.1-rc1-4.19 f2fs-stable merged) refused to mount with "SIT is corrupted node" message. I used the latest f2fs-tools sent by Chao including "fsck.f2fs: fix to repair cp_loads blocks at correct position" It spit out 140M worth of output, but at least I didn't have to run it twice. Everything returned "Ok" in the 2nd run. The new log is at http://arter97.com/f2fs/final After fixing the image, I used my 4.19 kernel with 5.2-rc1-4.19 f2fs-stable merged and it mounted. But, I got this: [ 1.047791] F2FS-fs (nvme0n1p3): layout of large_nat_bitmap is deprecated, run fsck to repair, chksum_offset: 4092 [ 1.081307] F2FS-fs (nvme0n1p3): Found nat_bits in checkpoint [ 1.161520] F2FS-fs (nvme0n1p3): recover fsync data on readonly fs [ 1.162418] F2FS-fs (nvme0n1p3): Mounted with checkpoint version = 761c7e00 But after doing a reboot, the message is gone: [ 1.098423] F2FS-fs (nvme0n1p3): Found nat_bits in checkpoint [ 1.177771] F2FS-fs (nvme0n1p3): recover fsync data on readonly fs [ 1.178365] F2FS-fs (nvme0n1p3): Mounted with checkpoint version = 761c7eda I'm not exactly sure why the kernel detected that I'm still using the old layout on the first boot. Maybe fsck didn't fix it properly, or the check from the kernel is improper. " Although we have rebuild the old deprecated checkpoint with new layout during repair, we only repair last checkpoint park, the other old one is remained. Once the image was mounted, we will 1) sanity check layout and 2) decide which checkpoint park to use according to cp_ver. So that we will print reported message unnecessarily at step 1), to avoid it, we simply move layout check into f2fs_sanity_check_ckpt() after step 2). Reported-by: Park Ju Hyung <qkrwngud825@gmail.com> Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-07-31btrfs: shut up bogus -Wmaybe-uninitialized warningArnd Bergmann
commit 6c64460cdc8be5fa074aa8fe2ae8736d5792bdc5 upstream. gcc sometimes can't determine whether a variable has been initialized when both the initialization and the use are conditional: fs/btrfs/props.c: In function 'inherit_props': fs/btrfs/props.c:389:4: error: 'num_bytes' may be used uninitialized in this function [-Werror=maybe-uninitialized] btrfs_block_rsv_release(fs_info, trans->block_rsv, This code is fine. Unfortunately, I cannot think of a good way to rephrase it in a way that makes gcc understand this, so I add a bogus initialization the way one should not. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: David Sterba <dsterba@suse.com> [ gcc 8 and 9 don't emit the warning ] Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-07-28ext4: allow directory holesTheodore Ts'o
commit 4e19d6b65fb4fc42e352ce9883649e049da14743 upstream. The largedir feature was intended to allow ext4 directories to have unmapped directory blocks (e.g., directory holes). And so the released e2fsprogs no longer enforces this for largedir file systems; however, the corresponding change to the kernel-side code was not made. This commit fixes this oversight. Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-07-28ext4: use jbd2_inode dirty range scopingRoss Zwisler
commit 73131fbb003b3691cfcf9656f234b00da497fcd6 upstream. Use the newly introduced jbd2_inode dirty range scoping to prevent us from waiting forever when trying to complete a journal transaction. Signed-off-by: Ross Zwisler <zwisler@google.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz> Cc: stable@vger.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-07-28jbd2: introduce jbd2_inode dirty range scopingRoss Zwisler
commit 6ba0e7dc64a5adcda2fbe65adc466891795d639e upstream. Currently both journal_submit_inode_data_buffers() and journal_finish_inode_data_buffers() operate on the entire address space of each of the inodes associated with a given journal entry. The consequence of this is that if we have an inode where we are constantly appending dirty pages we can end up waiting for an indefinite amount of time in journal_finish_inode_data_buffers() while we wait for all the pages under writeback to be written out. The easiest way to cause this type of workload is do just dd from /dev/zero to a file until it fills the entire filesystem. This can cause journal_finish_inode_data_buffers() to wait for the duration of the entire dd operation. We can improve this situation by scoping each of the inode dirty ranges associated with a given transaction. We do this via the jbd2_inode structure so that the scoping is contained within jbd2 and so that it follows the lifetime and locking rules for that structure. This allows us to limit the writeback & wait in journal_submit_inode_data_buffers() and journal_finish_inode_data_buffers() respectively to the dirty range for a given struct jdb2_inode, keeping us from waiting forever if the inode in question is still being appended to. Signed-off-by: Ross Zwisler <zwisler@google.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz> Cc: stable@vger.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-07-28ext4: enforce the immutable flag on open filesTheodore Ts'o
commit 02b016ca7f99229ae6227e7b2fc950c4e140d74a upstream. According to the chattr man page, "a file with the 'i' attribute cannot be modified..." Historically, this was only enforced when the file was opened, per the rest of the description, "... and the file can not be opened in write mode". There is general agreement that we should standardize all file systems to prevent modifications even for files that were opened at the time the immutable flag is set. Eventually, a change to enforce this at the VFS layer should be landing in mainline. Until then, enforce this at the ext4 level to prevent xfstests generic/553 from failing. Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: "Darrick J. Wong" <darrick.wong@oracle.com> Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-07-28ext4: don't allow any modifications to an immutable fileDarrick J. Wong
commit 2e53840362771c73eb0a5ff71611507e64e8eecd upstream. Don't allow any modifications to a file that's marked immutable, which means that we have to flush all the writable pages to make the readonly and we have to check the setattr/setflags parameters more closely. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>