summaryrefslogtreecommitdiffstats
path: root/fs
AgeCommit message (Collapse)Author
2020-04-16btrfs: do not zero f_bavail if we have available spaceJosef Bacik
commit d55966c4279bfc6a0cf0b32bf13f5df228a1eeb6 upstream. There was some logic added a while ago to clear out f_bavail in statfs() if we did not have enough free metadata space to satisfy our global reserve. This was incorrect at the time, however didn't really pose a problem for normal file systems because we would often allocate chunks if we got this low on free metadata space, and thus wouldn't really hit this case unless we were actually full. Fast forward to today and now we are much better about not allocating metadata chunks all of the time. Couple this with d792b0f19711 ("btrfs: always reserve our entire size for the global reserve") which now means we'll easily have a larger global reserve than our free space, we are now more likely to trip over this while still having plenty of space. Fix this by skipping this logic if the global rsv's space_info is not full. space_info->full is 0 unless we've attempted to allocate a chunk for that space_info and that has failed. If this happens then the space for the global reserve is definitely sacred and we need to report b_avail == 0, but before then we can just use our calculated b_avail. Reported-by: Martin Steigerwald <martin@lichtvoll.de> Fixes: ca8a51b3a979 ("btrfs: statfs: report zero available if metadata are exhausted") CC: stable@vger.kernel.org # 4.5+ Reviewed-by: Qu Wenruo <wqu@suse.com> Tested-By: Martin Steigerwald <martin@lichtvoll.de> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2020-04-16reiserfs: Fix memory leak of journal device stringJan Kara
commit 5474ca7da6f34fa95e82edc747d5faa19cbdfb5c upstream. When a filesystem is mounted with jdev mount option, we store the journal device name in an allocated string in superblock. However we fail to ever free that string. Fix it. Reported-by: syzbot+1c6756baf4b16b94d2a6@syzkaller.appspotmail.com Fixes: c3aa077648e1 ("reiserfs: Properly display mount options in /proc/mounts") CC: stable@vger.kernel.org Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2020-04-16gfs2: Another gfs2_find_jhead fixAndreas Gruenbacher
commit eed0f953b90e86e765197a1dad06bb48aedc27fe upstream. On filesystems with a block size smaller than the page size, gfs2_find_jhead can split a page across two bios (for example, when blocks are not allocated consecutively). When that happens, the first bio that completes will unlock the page in its bi_end_io handler even though the page hasn't been read completely yet. Fix that by using a chained bio for the rest of the page. While at it, clean up the sector calculation logic in gfs2_log_alloc_bio. In gfs2_find_jhead, simplify the disk block and offset calculation logic and fix a variable name. Fixes: f4686c26ecc3 ("gfs2: read journal in large chunks") Cc: stable@vger.kernel.org # v5.2+ Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2020-04-16cifs: fix soft mounts hanging in the reconnect codeRonnie Sahlberg
commit c54849ddd832ae0a45cab16bcd1ed2db7da090d7 upstream. RHBZ: 1795429 In recent DFS updates we have a new variable controlling how many times we will retry to reconnect the share. If DFS is not used, then this variable is initialized to 0 in: static inline int dfs_cache_get_nr_tgts(const struct dfs_cache_tgt_list *tl) { return tl ? tl->tl_numtgts : 0; } This means that in the reconnect loop in smb2_reconnect() we will immediately wrap retries to -1 and never actually get to pass this conditional: if (--retries) continue; The effect is that we no longer reach the point where we fail the commands with -EHOSTDOWN and basically the kernel threads are virtually hung and unkillable. Fixes: a3a53b7603798fd8 (cifs: Add support for failover in smb2_reconnect()) Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com> Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz> CC: Stable <stable@vger.kernel.org> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2020-04-16cifs: set correct max-buffer-size for smb2_ioctl_init()Ronnie Sahlberg
commit 731b82bb1750a906c1e7f070aedf5505995ebea7 upstream. Fix two places where we need to adjust down the max response size for ioctl when it is used together with compounding. Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com> Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com> CC: Stable <stable@vger.kernel.org> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2020-04-16CIFS: Fix task struct use-after-free on reconnectVincent Whitchurch
commit f1f27ad74557e39f67a8331a808b860f89254f2d upstream. The task which created the MID may be gone by the time cifsd attempts to call the callbacks on MIDs from cifs_reconnect(). This leads to a use-after-free of the task struct in cifs_wake_up_task: ================================================================== BUG: KASAN: use-after-free in __lock_acquire+0x31a0/0x3270 Read of size 8 at addr ffff8880103e3a68 by task cifsd/630 CPU: 0 PID: 630 Comm: cifsd Not tainted 5.5.0-rc6+ #119 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014 Call Trace: dump_stack+0x8e/0xcb print_address_description.constprop.5+0x1d3/0x3c0 ? __lock_acquire+0x31a0/0x3270 __kasan_report+0x152/0x1aa ? __lock_acquire+0x31a0/0x3270 ? __lock_acquire+0x31a0/0x3270 kasan_report+0xe/0x20 __lock_acquire+0x31a0/0x3270 ? __wake_up_common+0x1dc/0x630 ? _raw_spin_unlock_irqrestore+0x4c/0x60 ? mark_held_locks+0xf0/0xf0 ? _raw_spin_unlock_irqrestore+0x39/0x60 ? __wake_up_common_lock+0xd5/0x130 ? __wake_up_common+0x630/0x630 lock_acquire+0x13f/0x330 ? try_to_wake_up+0xa3/0x19e0 _raw_spin_lock_irqsave+0x38/0x50 ? try_to_wake_up+0xa3/0x19e0 try_to_wake_up+0xa3/0x19e0 ? cifs_compound_callback+0x178/0x210 ? set_cpus_allowed_ptr+0x10/0x10 cifs_reconnect+0xa1c/0x15d0 ? generic_ip_connect+0x1860/0x1860 ? rwlock_bug.part.0+0x90/0x90 cifs_readv_from_socket+0x479/0x690 cifs_read_from_socket+0x9d/0xe0 ? cifs_readv_from_socket+0x690/0x690 ? mempool_resize+0x690/0x690 ? rwlock_bug.part.0+0x90/0x90 ? memset+0x1f/0x40 ? allocate_buffers+0xff/0x340 cifs_demultiplex_thread+0x388/0x2a50 ? cifs_handle_standard+0x610/0x610 ? rcu_read_lock_held_common+0x120/0x120 ? mark_lock+0x11b/0xc00 ? __lock_acquire+0x14ed/0x3270 ? __kthread_parkme+0x78/0x100 ? lockdep_hardirqs_on+0x3e8/0x560 ? lock_downgrade+0x6a0/0x6a0 ? lockdep_hardirqs_on+0x3e8/0x560 ? _raw_spin_unlock_irqrestore+0x39/0x60 ? cifs_handle_standard+0x610/0x610 kthread+0x2bb/0x3a0 ? kthread_create_worker_on_cpu+0xc0/0xc0 ret_from_fork+0x3a/0x50 Allocated by task 649: save_stack+0x19/0x70 __kasan_kmalloc.constprop.5+0xa6/0xf0 kmem_cache_alloc+0x107/0x320 copy_process+0x17bc/0x5370 _do_fork+0x103/0xbf0 __x64_sys_clone+0x168/0x1e0 do_syscall_64+0x9b/0xec0 entry_SYSCALL_64_after_hwframe+0x49/0xbe Freed by task 0: save_stack+0x19/0x70 __kasan_slab_free+0x11d/0x160 kmem_cache_free+0xb5/0x3d0 rcu_core+0x52f/0x1230 __do_softirq+0x24d/0x962 The buggy address belongs to the object at ffff8880103e32c0 which belongs to the cache task_struct of size 6016 The buggy address is located 1960 bytes inside of 6016-byte region [ffff8880103e32c0, ffff8880103e4a40) The buggy address belongs to the page: page:ffffea000040f800 refcount:1 mapcount:0 mapping:ffff8880108da5c0 index:0xffff8880103e4c00 compound_mapcount: 0 raw: 4000000000010200 ffffea00001f2208 ffffea00001e3408 ffff8880108da5c0 raw: ffff8880103e4c00 0000000000050003 00000001ffffffff 0000000000000000 page dumped because: kasan: bad access detected Memory state around the buggy address: ffff8880103e3900: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ffff8880103e3980: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb >ffff8880103e3a00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ^ ffff8880103e3a80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ffff8880103e3b00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ================================================================== This can be reliably reproduced by adding the below delay to cifs_reconnect(), running find(1) on the mount, restarting the samba server while find is running, and killing find during the delay: spin_unlock(&GlobalMid_Lock); mutex_unlock(&server->srv_mutex); + msleep(10000); + cifs_dbg(FYI, "%s: issuing mid callbacks\n", __func__); list_for_each_safe(tmp, tmp2, &retry_list) { mid_entry = list_entry(tmp, struct mid_q_entry, qhead); Fix this by holding a reference to the task struct until the MID is freed. Signed-off-by: Vincent Whitchurch <vincent.whitchurch@axis.com> Signed-off-by: Steve French <stfrench@microsoft.com> CC: Stable <stable@vger.kernel.org> Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz> Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2020-04-16readdir: be more conservative with directory entry namesLinus Torvalds
commit 2c6b7bcd747201441923a0d3062577a8d1fbd8f8 upstream. Commit 8a23eb804ca4 ("Make filldir[64]() verify the directory entry filename is valid") added some minimal validity checks on the directory entries passed to filldir[64](). But they really were pretty minimal. This fleshes out at least the name length check: we used to disallow zero-length names, but really, negative lengths or oevr-long names aren't ok either. Both could happen if there is some filesystem corruption going on. Now, most filesystems tend to use just an "unsigned char" or similar for the length of a directory entry name, so even with a corrupt filesystem you should never see anything odd like that. But since we then use the name length to create the directory entry record length, let's make sure it actually is half-way sensible. Note how POSIX states that the size of a path component is limited by NAME_MAX, but we actually use PATH_MAX for the check here. That's because while NAME_MAX is generally the correct maximum name length (it's 255, for the same old "name length is usually just a byte on disk"), there's nothing in the VFS layer that really cares. So the real limitation at a VFS layer is the total pathname length you can pass as a filename: PATH_MAX. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2020-04-16ceph: hold extra reference to r_parent over life of requestJeff Layton
commit 9c1c2b35f1d94de8325344c2777d7ee67492db3b upstream. Currently, we just assume that it will stick around by virtue of the submitter's reference, but later patches will allow the syscall to return early and we can't rely on that reference at that point. While I'm not aware of any reports of it, Xiubo pointed out that this may fix a use-after-free. If the wait for a reply times out or is canceled via signal, and then the reply comes in after the syscall returns, the client can end up trying to access r_parent without a reference. Take an extra reference to the inode when setting r_parent and release it when releasing the request. Cc: stable@vger.kernel.org Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2020-04-16afs: Fix characters allowed into cell namesDavid Howells
commit a45ea48e2bcd92c1f678b794f488ca0bda9835b8 upstream. The afs filesystem needs to prohibit certain characters from cell names, such as '/', as these are used to form filenames in procfs, leading to the following warning being generated: WARNING: CPU: 0 PID: 3489 at fs/proc/generic.c:178 Fix afs_alloc_cell() to disallow nonprintable characters, '/', '@' and names that begin with a dot. Remove the check for "@cell" as that is then redundant. This can be tested by running: echo add foo/.bar 1.2.3.4 >/proc/fs/afs/cells Note that we will also need to deal with: - Names ending in ".invalid" shouldn't be passed to the DNS. - Names that contain non-valid domainname chars shouldn't be passed to the DNS. - DNS replies that say "your-dns-needs-immediate-attention.<gTLD>" and replies containing A records that say 127.0.53.53 should be considered invalid. [https://www.icann.org/en/system/files/files/name-collision-mitigation-01aug14-en.pdf] but these need to be dealt with by the kafs-client DNS program rather than the kernel. Reported-by: syzbot+b904ba7c947a37b4b291@syzkaller.appspotmail.com Cc: stable@kernel.org Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2020-04-12afs: Remove set but not used variables 'before', 'after'zhengbin
commit 51590df4f3306cb1f43dca54e3ccdd121ab89594 upstream. Fixes gcc '-Wunused-but-set-variable' warning: fs/afs/dir_edit.c: In function afs_set_contig_bits: fs/afs/dir_edit.c:75:20: warning: variable after set but not used [-Wunused-but-set-variable] fs/afs/dir_edit.c: In function afs_set_contig_bits: fs/afs/dir_edit.c:75:12: warning: variable before set but not used [-Wunused-but-set-variable] fs/afs/dir_edit.c: In function afs_clear_contig_bits: fs/afs/dir_edit.c:100:20: warning: variable after set but not used [-Wunused-but-set-variable] fs/afs/dir_edit.c: In function afs_clear_contig_bits: fs/afs/dir_edit.c:100:12: warning: variable before set but not used [-Wunused-but-set-variable] They are never used since commit 63a4681ff39c. Fixes: 63a4681ff39c ("afs: Locally edit directory data for mkdir/create/unlink/...") Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: zhengbin <zhengbin13@huawei.com> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2020-04-12nfsd: depend on CRYPTO_MD5 for legacy client trackingPatrick Steinhardt
commit 38a2204f5298620e8a1c3b1dc7b831425106dbc0 upstream. The legacy client tracking infrastructure of nfsd makes use of MD5 to derive a client's recovery directory name. As the nfsd module doesn't declare any dependency on CRYPTO_MD5, though, it may fail to allocate the hash if the kernel was compiled without it. As a result, generation of client recovery directories will fail with the following error: NFSD: unable to generate recoverydir name The explicit dependency on CRYPTO_MD5 was removed as redundant back in 6aaa67b5f3b9 (NFSD: Remove redundant "select" clauses in fs/Kconfig 2008-02-11) as it was already implicitly selected via RPCSEC_GSS_KRB5. This broke when RPCSEC_GSS_KRB5 was made optional for NFSv4 in commit df486a25900f (NFS: Fix the selection of security flavours in Kconfig) at a later point. Fix the issue by adding back an explicit dependency on CRYPTO_MD5. Fixes: df486a25900f (NFS: Fix the selection of security flavours in Kconfig) Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2020-04-12xfs: Sanity check flags of Q_XQUOTARM callJan Kara
commit 3dd4d40b420846dd35869ccc8f8627feef2cff32 upstream. Flags passed to Q_XQUOTARM were not sanity checked for invalid values. Fix that. Fixes: 9da93f9b7cdf ("xfs: fix Q_XQUOTARM ioctl") Reported-by: Yang Xu <xuyang2018.jy@cn.fujitsu.com> Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2020-04-12reiserfs: fix handling of -EOPNOTSUPP in reiserfs_for_each_xattrJeff Mahoney
commit 394440d469413fa9b74f88a11f144d76017221f2 upstream. Commit 60e4cf67a58 (reiserfs: fix extended attributes on the root directory) introduced a regression open_xa_root started returning -EOPNOTSUPP but it was not handled properly in reiserfs_for_each_xattr. When the reiserfs module is built without CONFIG_REISERFS_FS_XATTR, deleting an inode would result in a warning and chowning an inode would also result in a warning and then fail to complete. With CONFIG_REISERFS_FS_XATTR enabled, the xattr root would always be present for read-write operations. This commit handles -EOPNOSUPP in the same way -ENODATA is handled. Fixes: 60e4cf67a582 ("reiserfs: fix extended attributes on the root directory") CC: stable@vger.kernel.org # Commit 60e4cf67a58 was picked up by stable Link: https://lore.kernel.org/r/20200115180059.6935-1-jeffm@suse.com Reported-by: Michael Brunnbauer <brunni@netestate.de> Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2020-04-12Btrfs: always copy scrub arguments back to user spaceFilipe Manana
commit 5afe6ce748c1ea99e0d648153c05075e1ab93afb upstream. If scrub returns an error we are not copying back the scrub arguments structure to user space. This prevents user space to know how much progress scrub has done if an error happened - this includes -ECANCELED which is returned when users ask for scrub to stop. A particular use case, which is used in btrfs-progs, is to resume scrub after it is canceled, in that case it relies on checking the progress from the scrub arguments structure and then use that progress in a call to resume scrub. So fix this by always copying the scrub arguments structure to user space, overwriting the value returned to user space with -EFAULT only if copying the structure failed to let user space know that either that copying did not happen, and therefore the structure is stale, or it happened partially and the structure is probably not valid and corrupt due to the partial copy. Reported-by: Graham Cobb <g.btrfs@cobb.uk.net> Link: https://lore.kernel.org/linux-btrfs/d0a97688-78be-08de-ca7d-bcb4c7fb397e@cobb.uk.net/ Fixes: 06fe39ab15a6a4 ("Btrfs: do not overwrite scrub error with fault error in scrub ioctl") CC: stable@vger.kernel.org # 5.1+ Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Tested-by: Graham Cobb <g.btrfs@cobb.uk.net> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2020-04-12btrfs: check rw_devices, not num_devices for balanceJosef Bacik
commit b35cf1f0bf1f2b0b193093338414b9bd63b29015 upstream. The fstest btrfs/154 reports [ 8675.381709] BTRFS: Transaction aborted (error -28) [ 8675.383302] WARNING: CPU: 1 PID: 31900 at fs/btrfs/block-group.c:2038 btrfs_create_pending_block_groups+0x1e0/0x1f0 [btrfs] [ 8675.390925] CPU: 1 PID: 31900 Comm: btrfs Not tainted 5.5.0-rc6-default+ #935 [ 8675.392780] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba527-rebuilt.opensuse.org 04/01/2014 [ 8675.395452] RIP: 0010:btrfs_create_pending_block_groups+0x1e0/0x1f0 [btrfs] [ 8675.402672] RSP: 0018:ffffb2090888fb00 EFLAGS: 00010286 [ 8675.404413] RAX: 0000000000000000 RBX: ffff92026dfa91c8 RCX: 0000000000000001 [ 8675.406609] RDX: 0000000000000000 RSI: ffffffff8e100899 RDI: ffffffff8e100971 [ 8675.408775] RBP: ffff920247c61660 R08: 0000000000000000 R09: 0000000000000000 [ 8675.410978] R10: 0000000000000000 R11: 0000000000000000 R12: 00000000ffffffe4 [ 8675.412647] R13: ffff92026db74000 R14: ffff920247c616b8 R15: ffff92026dfbc000 [ 8675.413994] FS: 00007fd5e57248c0(0000) GS:ffff92027d800000(0000) knlGS:0000000000000000 [ 8675.416146] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 8675.417833] CR2: 0000564aa51682d8 CR3: 000000006dcbc004 CR4: 0000000000160ee0 [ 8675.419801] Call Trace: [ 8675.420742] btrfs_start_dirty_block_groups+0x355/0x480 [btrfs] [ 8675.422600] btrfs_commit_transaction+0xc8/0xaf0 [btrfs] [ 8675.424335] reset_balance_state+0x14a/0x190 [btrfs] [ 8675.425824] btrfs_balance.cold+0xe7/0x154 [btrfs] [ 8675.427313] ? kmem_cache_alloc_trace+0x235/0x2c0 [ 8675.428663] btrfs_ioctl_balance+0x298/0x350 [btrfs] [ 8675.430285] btrfs_ioctl+0x466/0x2550 [btrfs] [ 8675.431788] ? mem_cgroup_charge_statistics+0x51/0xf0 [ 8675.433487] ? mem_cgroup_commit_charge+0x56/0x400 [ 8675.435122] ? do_raw_spin_unlock+0x4b/0xc0 [ 8675.436618] ? _raw_spin_unlock+0x1f/0x30 [ 8675.438093] ? __handle_mm_fault+0x499/0x740 [ 8675.439619] ? do_vfs_ioctl+0x56e/0x770 [ 8675.441034] do_vfs_ioctl+0x56e/0x770 [ 8675.442411] ksys_ioctl+0x3a/0x70 [ 8675.443718] ? trace_hardirqs_off_thunk+0x1a/0x1c [ 8675.445333] __x64_sys_ioctl+0x16/0x20 [ 8675.446705] do_syscall_64+0x50/0x210 [ 8675.448059] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 8675.479187] BTRFS: error (device vdb) in btrfs_create_pending_block_groups:2038: errno=-28 No space left We now use btrfs_can_overcommit() to see if we can flip a block group read only. Before this would fail because we weren't taking into account the usable un-allocated space for allocating chunks. With my patches we were allowed to do the balance, which is technically correct. The test is trying to start balance on degraded mount. So now we're trying to allocate a chunk and cannot because we want to allocate a RAID1 chunk, but there's only 1 device that's available for usage. This results in an ENOSPC. But we shouldn't even be making it this far, we don't have enough devices to restripe. The problem is we're using btrfs_num_devices(), that also includes missing devices. That's not actually what we want, we need to use rw_devices. The chunk_mutex is not needed here, rw_devices changes only in device add, remove or replace, all are excluded by EXCL_OP mechanism. Fixes: e4d8ec0f65b9 ("Btrfs: implement online profile changing") CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> [ add stacktrace, update changelog, drop chunk_mutex ] Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2020-04-12btrfs: fix memory leak in qgroup accountingJohannes Thumshirn
commit 26ef8493e1ab771cb01d27defca2fa1315dc3980 upstream. When running xfstests on the current btrfs I get the following splat from kmemleak: unreferenced object 0xffff88821b2404e0 (size 32): comm "kworker/u4:7", pid 26663, jiffies 4295283698 (age 8.776s) hex dump (first 32 bytes): 01 00 00 00 00 00 00 00 10 ff fd 26 82 88 ff ff ...........&.... 10 ff fd 26 82 88 ff ff 20 ff fd 26 82 88 ff ff ...&.... ..&.... backtrace: [<00000000f94fd43f>] ulist_alloc+0x25/0x60 [btrfs] [<00000000fd023d99>] btrfs_find_all_roots_safe+0x41/0x100 [btrfs] [<000000008f17bd32>] btrfs_find_all_roots+0x52/0x70 [btrfs] [<00000000b7660afb>] btrfs_qgroup_rescan_worker+0x343/0x680 [btrfs] [<0000000058e66778>] btrfs_work_helper+0xac/0x1e0 [btrfs] [<00000000f0188930>] process_one_work+0x1cf/0x350 [<00000000af5f2f8e>] worker_thread+0x28/0x3c0 [<00000000b55a1add>] kthread+0x109/0x120 [<00000000f88cbd17>] ret_from_fork+0x35/0x40 This corresponds to: (gdb) l *(btrfs_find_all_roots_safe+0x41) 0x8d7e1 is in btrfs_find_all_roots_safe (fs/btrfs/backref.c:1413). 1408 1409 tmp = ulist_alloc(GFP_NOFS); 1410 if (!tmp) 1411 return -ENOMEM; 1412 *roots = ulist_alloc(GFP_NOFS); 1413 if (!*roots) { 1414 ulist_free(tmp); 1415 return -ENOMEM; 1416 } 1417 Following the lifetime of the allocated 'roots' ulist, it gets freed again in btrfs_qgroup_account_extent(). But this does not happen if the function is called with the 'BTRFS_FS_QUOTA_ENABLED' flag cleared, then btrfs_qgroup_account_extent() does a short leave and directly returns. Instead of directly returning we should jump to the 'out_free' in order to free all resources as expected. CC: stable@vger.kernel.org # 4.14+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> [ add comment ] Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2020-04-12btrfs: do not delete mismatched root refsJosef Bacik
commit 423a716cd7be16fb08690760691befe3be97d3fc upstream. btrfs_del_root_ref() will simply WARN_ON() if the ref doesn't match in any way, and then continue to delete the reference. This shouldn't happen, we have these values because there's more to the reference than the original root and the sub root. If any of these checks fail, return -ENOENT. CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2020-04-12btrfs: fix invalid removal of root refJosef Bacik
commit d49d3287e74ffe55ae7430d1e795e5f9bf7359ea upstream. If we have the following sequence of events btrfs sub create A btrfs sub create A/B btrfs sub snap A C mkdir C/foo mv A/B C/foo rm -rf * We will end up with a transaction abort. The reason for this is because we create a root ref for B pointing to A. When we create a snapshot of C we still have B in our tree, but because the root ref points to A and not C we will make it appear to be empty. The problem happens when we move B into C. This removes the root ref for B pointing to A and adds a ref of B pointing to C. When we rmdir C we'll see that we have a ref to our root and remove the root ref, despite not actually matching our reference name. Now btrfs_del_root_ref() allowing this to work is a bug as well, however we know that this inode does not actually point to a root ref in the first place, so we shouldn't be calling btrfs_del_root_ref() in the first place and instead simply look up our dir index for this item and do the rest of the removal. CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2020-04-12btrfs: rework arguments of btrfs_unlink_subvolJosef Bacik
commit 045d3967b6920b663fc010ad414ade1b24143bd1 upstream. btrfs_unlink_subvol takes the name of the dentry and the root objectid based on what kind of inode this is, either a real subvolume link or a empty one that we inherited as a snapshot. We need to fix how we unlink in the case for BTRFS_EMPTY_SUBVOL_DIR_OBJECTID in the future, so rework btrfs_unlink_subvol to just take the dentry and handle getting the right objectid given the type of inode this is. There is no functional change here, simply pushing the work into btrfs_unlink_subvol() proper. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2020-04-12smb3: fix performance regression with setting mtimeSteve French
commit cf5371ae460eb8e484e4884747af270c86c3c469 upstream. There are cases when we don't want to send the SMB2 flush operation (e.g. when user specifies mount parm "nostrictsync") and it can be a very expensive operation on the server. In most cases in order to set mtime, we simply need to flush (write) the dirtry pages from the client and send the writes to the server not also send a flush protocol operation to the server. Fixes: aa081859b10c ("cifs: flush before set-info if we have writeable handles") CC: Stable <stable@vger.kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2020-04-12locks: reinstate locks_delete_block optimizationLinus Torvalds
commit dcf23ac3e846ca0cf626c155a0e3fcbbcf4fae8a upstream. There is measurable performance impact in some synthetic tests due to commit 6d390e4b5d48 (locks: fix a potential use-after-free problem when wakeup a waiter). Fix the race condition instead by clearing the fl_blocker pointer after the wake_up, using explicit acquire/release semantics. This does mean that we can no longer use the clearing of fl_blocker as the wait condition, so switch the waiters over to checking whether the fl_blocked_member list_head is empty. Reviewed-by: yangerkun <yangerkun@huawei.com> Reviewed-by: NeilBrown <neilb@suse.de> Fixes: 6d390e4b5d48 (locks: fix a potential use-after-free problem when wakeup a waiter) Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2020-04-12ocfs2: fix oops when writing cloned fileGang He
commit 2d797e9ff95ecbcf0a83d657928ed20579444857 upstream. Writing a cloned file triggers a kernel oops and the user-space command process is also killed by the system. The bug can be reproduced stably via: 1) create a file under ocfs2 file system directory. journalctl -b > aa.txt 2) create a cloned file for this file. reflink aa.txt bb.txt 3) write the cloned file with dd command. dd if=/dev/zero of=bb.txt bs=512 count=1 conv=notrunc The dd command is killed by the kernel, then you can see the oops message via dmesg command. [ 463.875404] BUG: kernel NULL pointer dereference, address: 0000000000000028 [ 463.875413] #PF: supervisor read access in kernel mode [ 463.875416] #PF: error_code(0x0000) - not-present page [ 463.875418] PGD 0 P4D 0 [ 463.875425] Oops: 0000 [#1] SMP PTI [ 463.875431] CPU: 1 PID: 2291 Comm: dd Tainted: G OE 5.3.16-2-default [ 463.875433] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011 [ 463.875500] RIP: 0010:ocfs2_refcount_cow+0xa4/0x5d0 [ocfs2] [ 463.875505] Code: 06 89 6c 24 38 89 eb f6 44 24 3c 02 74 be 49 8b 47 28 [ 463.875508] RSP: 0018:ffffa2cb409dfce8 EFLAGS: 00010202 [ 463.875512] RAX: ffff8b1ebdca8000 RBX: 0000000000000001 RCX: ffff8b1eb73a9df0 [ 463.875515] RDX: 0000000000056a01 RSI: 0000000000000000 RDI: 0000000000000000 [ 463.875517] RBP: 0000000000000001 R08: ffff8b1eb73a9de0 R09: 0000000000000000 [ 463.875520] R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000000 [ 463.875522] R13: ffff8b1eb922f048 R14: 0000000000000000 R15: ffff8b1eb922f048 [ 463.875526] FS: 00007f8f44d15540(0000) GS:ffff8b1ebeb00000(0000) knlGS:0000000000000000 [ 463.875529] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 463.875532] CR2: 0000000000000028 CR3: 000000003c17a000 CR4: 00000000000006e0 [ 463.875546] Call Trace: [ 463.875596] ? ocfs2_inode_lock_full_nested+0x18b/0x960 [ocfs2] [ 463.875648] ocfs2_file_write_iter+0xaf8/0xc70 [ocfs2] [ 463.875672] new_sync_write+0x12d/0x1d0 [ 463.875688] vfs_write+0xad/0x1a0 [ 463.875697] ksys_write+0xa1/0xe0 [ 463.875710] do_syscall_64+0x60/0x1f0 [ 463.875743] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 463.875758] RIP: 0033:0x7f8f4482ed44 [ 463.875762] Code: 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 80 00 00 00 [ 463.875765] RSP: 002b:00007fff300a79d8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001 [ 463.875769] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f8f4482ed44 [ 463.875771] RDX: 0000000000000200 RSI: 000055f771b5c000 RDI: 0000000000000001 [ 463.875774] RBP: 0000000000000200 R08: 00007f8f44af9c78 R09: 0000000000000003 [ 463.875776] R10: 000000000000089f R11: 0000000000000246 R12: 000055f771b5c000 [ 463.875779] R13: 0000000000000200 R14: 0000000000000000 R15: 000055f771b5c000 This regression problem was introduced by commit e74540b28556 ("ocfs2: protect extent tree in ocfs2_prepare_inode_for_write()"). Link: http://lkml.kernel.org/r/20200121050153.13290-1-ghe@suse.com Fixes: e74540b28556 ("ocfs2: protect extent tree in ocfs2_prepare_inode_for_write()"). Signed-off-by: Gang He <ghe@suse.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Jun Piao <piaojun@huawei.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2020-04-12readdir: make user_access_begin() use the real access rangeLinus Torvalds
commit 3c2659bd1db81ed6a264a9fc6262d51667d655ad upstream. In commit 9f79b78ef744 ("Convert filldir[64]() from __put_user() to unsafe_put_user()") I changed filldir to not do individual __put_user() accesses, but instead use unsafe_put_user() surrounded by the proper user_access_begin/end() pair. That make them enormously faster on modern x86, where the STAC/CLAC games make individual user accesses fairly heavy-weight. However, the user_access_begin() range was not really the exact right one, since filldir() has the unfortunate problem that it needs to not only fill out the new directory entry, it also needs to fix up the previous one to contain the proper file offset. It's unfortunate, but the "d_off" field in "struct dirent" is _not_ the file offset of the directory entry itself - it's the offset of the next one. So we end up backfilling the offset in the previous entry as we walk along. But since x86 didn't really care about the exact range, and used to be the only architecture that did anything fancy in user_access_begin() to begin with, the filldir[64]() changes did something lazy, and even commented on it: /* * Note! This range-checks 'previous' (which may be NULL). * The real range was checked in getdents */ if (!user_access_begin(dirent, sizeof(*dirent))) goto efault; and it all worked fine. But now 32-bit ppc is starting to also implement user_access_begin(), and the fact that we faked the range to only be the (possibly not even valid) previous directory entry becomes a problem, because ppc32 will actually be using the range that is passed in for more than just "check that it's user space". This is a complete rewrite of Christophe's original patch. By saving off the record length of the previous entry instead of a pointer to it in the filldir data structures, we can simplify the range check and the writing of the previous entry d_off field. No need for any conditionals in the user accesses themselves, although we retain the conditional EINTR checking for the "was this the first directory entry" signal handling latency logic. Fixes: 9f79b78ef744 ("Convert filldir[64]() from __put_user() to unsafe_put_user()") Link: https://lore.kernel.org/lkml/a02d3426f93f7eb04960a4d9140902d278cab0bb.1579697910.git.christophe.leroy@c-s.fr/ Link: https://lore.kernel.org/lkml/408c90c4068b00ea8f1c41cca45b84ec23d4946b.1579783936.git.christophe.leroy@c-s.fr/ Reported-and-tested-by: Christophe Leroy <christophe.leroy@c-s.fr> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2020-04-12f2fs: use EINVAL for superblock with invalid magicIcenowy Zheng
commit 38fb6d0ea34299d97b031ed64fe994158b6f8eb3 upstream. The kernel mount_block_root() function expects -EACESS or -EINVAL for a unmountable filesystem when trying to mount the root with different filesystem types. However, in 5.3-rc1 the behavior when F2FS code cannot find valid block changed to return -EFSCORRUPTED(-EUCLEAN), and this error code makes mount_block_root() fail when trying to probe F2FS. When the magic number of the superblock mismatches, it has a high probability that it's just not a F2FS. In this case return -EINVAL seems to be a better result, and this return value can make mount_block_root() probing work again. Return -EINVAL when the superblock has magic mismatch, -EFSCORRUPTED in other cases (the magic matches but the superblock cannot be recognized). Fixes: 10f966bbf521 ("f2fs: use generic EFSBADCRC/EFSCORRUPTED") Signed-off-by: Icenowy Zheng <icenowy@aosc.io> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2020-04-12NFS: Revalidate the file mapping on all fatal writeback errorsTrond Myklebust
commit b8946d7bfb9417ec171693d4478a831420aead5f upstream. If a write or commit failed, and the mapping sees a fatal error, we need to revalidate the contents of that mapping. Fixes: 06c9fdf3b9f1 ("NFS: On fatal writeback errors, we need to call nfs_inode_remove_request()") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2020-04-12afs: Fix handling of an abort from a service handlerDavid Howells
commit dde9f095583b3f375ba23979045ee10dfcebec2f upstream. When an AFS service handler function aborts a call, AF_RXRPC marks the call as complete - which means that it's not going to get any more packets from the receiver. This is a problem because reception of the final ACK is what triggers afs_deliver_to_call() to drop the final ref on the afs_call object. Instead, aborted AFS service calls may then just sit around waiting for ever or until they're displaced by a new call on the same connection channel or a connection-level abort. Fix this by calling afs_set_call_complete() to finalise the afs_call struct representing the call. However, we then need to drop the ref that stops the call from being deallocated. We can do this in afs_set_call_complete(), as the work queue is holding a separate ref of its own, but then we shouldn't do it in afs_process_async_call() and afs_delete_async_call(). call->drop_ref is set to indicate that a ref needs dropping for a call and this is dealt with when we transition a call to AFS_CALL_COMPLETE. But then we also need to get rid of the ref that pins an asynchronous client call. We can do this by the same mechanism, setting call->drop_ref for an async client call too. We can also get rid of call->incoming since nothing ever sets it and only one thing ever checks it (futilely). A trace of the rxrpc_call and afs_call struct ref counting looks like: <idle>-0 [001] ..s5 164.764892: rxrpc_call: c=00000002 SEE u=3 sp=rxrpc_new_incoming_call+0x473/0xb34 a=00000000442095b5 <idle>-0 [001] .Ns5 164.766001: rxrpc_call: c=00000002 QUE u=4 sp=rxrpc_propose_ACK+0xbe/0x551 a=00000000442095b5 <idle>-0 [001] .Ns4 164.766005: rxrpc_call: c=00000002 PUT u=3 sp=rxrpc_new_incoming_call+0xa3f/0xb34 a=00000000442095b5 <idle>-0 [001] .Ns7 164.766433: afs_call: c=00000002 WAKE u=2 o=11 sp=rxrpc_notify_socket+0x196/0x33c kworker/1:2-1810 [001] ...1 164.768409: rxrpc_call: c=00000002 SEE u=3 sp=rxrpc_process_call+0x25/0x7ae a=00000000442095b5 kworker/1:2-1810 [001] ...1 164.769439: rxrpc_tx_packet: c=00000002 e9f1a7a8:95786a88:00000008:09c5 00000001 00000000 02 22 ACK CallAck kworker/1:2-1810 [001] ...1 164.769459: rxrpc_call: c=00000002 PUT u=2 sp=rxrpc_process_call+0x74f/0x7ae a=00000000442095b5 kworker/1:2-1810 [001] ...1 164.770794: afs_call: c=00000002 QUEUE u=3 o=12 sp=afs_deliver_to_call+0x449/0x72c kworker/1:2-1810 [001] ...1 164.770829: afs_call: c=00000002 PUT u=2 o=12 sp=afs_process_async_call+0xdb/0x11e kworker/1:2-1810 [001] ...2 164.771084: rxrpc_abort: c=00000002 95786a88:00000008 s=0 a=1 e=1 K-1 kworker/1:2-1810 [001] ...1 164.771461: rxrpc_tx_packet: c=00000002 e9f1a7a8:95786a88:00000008:09c5 00000002 00000000 04 00 ABORT CallAbort kworker/1:2-1810 [001] ...1 164.771466: afs_call: c=00000002 PUT u=1 o=12 sp=SRXAFSCB_ProbeUuid+0xc1/0x106 The abort generated in SRXAFSCB_ProbeUuid(), labelled "K-1", indicates that the local filesystem/cache manager didn't recognise the UUID as its own. Fixes: 2067b2b3f484 ("afs: Fix the CB.ProbeUuid service handler to reply correctly") Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2020-04-12afs: Fix afs_lookup() to not clobber the version on a new dentryDavid Howells
commit f52b83b0b1c40ada38df917973ab719a4a753951 upstream. Fix afs_lookup() to not clobber the version set on a new dentry by afs_do_lookup() - especially as it's using the wrong version of the version (we need to use the one given to us by whatever op the dir contents correspond to rather than what's in the afs_vnode). Fixes: 9dd0b82ef530 ("afs: Fix missing dentry data version updating") Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2020-04-12io_uring: fix __io_iopoll_check deadlock in io_sq_threadXiaoguang Wang
commit c7849be9cc2dd2754c48ddbaca27c2de6d80a95d upstream. Since commit a3a0e43fd770 ("io_uring: don't enter poll loop if we have CQEs pending"), if we already events pending, we won't enter poll loop. In case SETUP_IOPOLL and SETUP_SQPOLL are both enabled, if app has been terminated and don't reap pending events which are already in cq ring, and there are some reqs in poll_list, io_sq_thread will enter __io_iopoll_check(), and find pending events, then return, this loop will never have a chance to exit. I have seen this issue in fio stress tests, to fix this issue, let io_sq_thread call io_iopoll_getevents() with argument 'min' being zero, and remove __io_iopoll_check(). Fixes: a3a0e43fd770 ("io_uring: don't enter poll loop if we have CQEs pending") Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2020-03-30ocfs2: call journal flush to mark journal as empty after journal recovery ↵Kai Li
when mount commit 397eac17f86f404f5ba31d8c3e39ec3124b39fd3 upstream. If journal is dirty when mount, it will be replayed but jbd2 sb log tail cannot be updated to mark a new start because journal->j_flag has already been set with JBD2_ABORT first in journal_init_common. When a new transaction is committed, it will be recored in block 1 first(journal->j_tail is set to 1 in journal_reset). If emergency restart happens again before journal super block is updated unfortunately, the new recorded trans will not be replayed in the next mount. The following steps describe this procedure in detail. 1. mount and touch some files 2. these transactions are committed to journal area but not checkpointed 3. emergency restart 4. mount again and its journals are replayed 5. journal super block's first s_start is 1, but its s_seq is not updated 6. touch a new file and its trans is committed but not checkpointed 7. emergency restart again 8. mount and journal is dirty, but trans committed in 6 will not be replayed. This exception happens easily when this lun is used by only one node. If it is used by multi-nodes, other node will replay its journal and its journal super block will be updated after recovery like what this patch does. ocfs2_recover_node->ocfs2_replay_journal. The following jbd2 journal can be generated by touching a new file after journal is replayed, and seq 15 is the first valid commit, but first seq is 13 in journal super block. logdump: Block 0: Journal Superblock Seq: 0 Type: 4 (JBD2_SUPERBLOCK_V2) Blocksize: 4096 Total Blocks: 32768 First Block: 1 First Commit ID: 13 Start Log Blknum: 1 Error: 0 Feature Compat: 0 Feature Incompat: 2 block64 Feature RO compat: 0 Journal UUID: 4ED3822C54294467A4F8E87D2BA4BC36 FS Share Cnt: 1 Dynamic Superblk Blknum: 0 Per Txn Block Limit Journal: 0 Data: 0 Block 1: Journal Commit Block Seq: 14 Type: 2 (JBD2_COMMIT_BLOCK) Block 2: Journal Descriptor Seq: 15 Type: 1 (JBD2_DESCRIPTOR_BLOCK) No. Blocknum Flags 0. 587 none UUID: 00000000000000000000000000000000 1. 8257792 JBD2_FLAG_SAME_UUID 2. 619 JBD2_FLAG_SAME_UUID 3. 24772864 JBD2_FLAG_SAME_UUID 4. 8257802 JBD2_FLAG_SAME_UUID 5. 513 JBD2_FLAG_SAME_UUID JBD2_FLAG_LAST_TAG ... Block 7: Inode Inode: 8257802 Mode: 0640 Generation: 57157641 (0x3682809) FS Generation: 2839773110 (0xa9437fb6) CRC32: 00000000 ECC: 0000 Type: Regular Attr: 0x0 Flags: Valid Dynamic Features: (0x1) InlineData User: 0 (root) Group: 0 (root) Size: 7 Links: 1 Clusters: 0 ctime: 0x5de5d870 0x11104c61 -- Tue Dec 3 11:37:20.286280801 2019 atime: 0x5de5d870 0x113181a1 -- Tue Dec 3 11:37:20.288457121 2019 mtime: 0x5de5d870 0x11104c61 -- Tue Dec 3 11:37:20.286280801 2019 dtime: 0x0 -- Thu Jan 1 08:00:00 1970 ... Block 9: Journal Commit Block Seq: 15 Type: 2 (JBD2_COMMIT_BLOCK) The following is journal recovery log when recovering the upper jbd2 journal when mount again. syslog: ocfs2: File system on device (252,1) was not unmounted cleanly, recovering it. fs/jbd2/recovery.c:(do_one_pass, 449): Starting recovery pass 0 fs/jbd2/recovery.c:(do_one_pass, 449): Starting recovery pass 1 fs/jbd2/recovery.c:(do_one_pass, 449): Starting recovery pass 2 fs/jbd2/recovery.c:(jbd2_journal_recover, 278): JBD2: recovery, exit status 0, recovered transactions 13 to 13 Due to first commit seq 13 recorded in journal super is not consistent with the value recorded in block 1(seq is 14), journal recovery will be terminated before seq 15 even though it is an unbroken commit, inode 8257802 is a new file and it will be lost. Link: http://lkml.kernel.org/r/20191217020140.2197-1-li.kai4@h3c.com Signed-off-by: Kai Li <li.kai4@h3c.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: Changwei Ge <gechangwei@live.cn> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Gang He <ghe@suse.com> Cc: Jun Piao <piaojun@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2020-03-30NFSD fixing possible null pointer derefering in copy offloadOlga Kornievskaia
commit 18f428d4e2f7eff162d80b2b21689496c4e82afd upstream. Static checker revealed possible error path leading to possible NULL pointer dereferencing. Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Fixes: e0639dc5805a: ("NFSD introduce async copy feature") Signed-off-by: Olga Kornievskaia <kolga@netapp.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2020-03-30f2fs: fix potential overflowChao Yu
commit 1f0d5c911b64165c9754139a26c8c2fad352c132 upstream. We expect 64-bit calculation result from below statement, however in 32-bit machine, looped left shift operation on pgoff_t type variable may cause overflow issue, fix it by forcing type cast. page->index << PAGE_SHIFT; Fixes: 26de9b117130 ("f2fs: avoid unnecessary updating inode during fsync") Fixes: 0a2aa8fbb969 ("f2fs: refactor __exchange_data_block for speed up") Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2020-03-30ubifs: Fixed missed le64_to_cpu() in journalBen Dooks (Codethink)
commit df22b5b3ecc6233e33bd27f67f14c0cd1b5a5897 upstream. In the ubifs_jnl_write_inode() functon, it calls ubifs_iget() with xent->inum. The xent->inum is __le64, but the ubifs_iget() takes native cpu endian. I think that this should be changed to passing le64_to_cpu(xent->inum) to fix the following sparse warning: fs/ubifs/journal.c:902:58: warning: incorrect type in argument 2 (different base types) fs/ubifs/journal.c:902:58: expected unsigned long inum fs/ubifs/journal.c:902:58: got restricted __le64 [usertype] inum Fixes: 7959cf3a7506 ("ubifs: journal: Handle xattrs like files") Signed-off-by: Ben Dooks <ben.dooks@codethink.co.uk> Signed-off-by: Richard Weinberger <richard@nod.at> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2020-03-30gfs2: add compat_ioctl supportArnd Bergmann
commit 8d0980704842e8a68df2c3164c1c165e5c7ebc08 upstream. Out of the four ioctl commands supported on gfs2, only FITRIM works in compat mode. Add a proper handler based on the ext4 implementation. Fixes: 6ddc5c3ddf25 ("gfs2: getlabel support") Reviewed-by: Bob Peterson <rpeterso@redhat.com> Cc: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2020-03-30affs: fix a memory leak in affs_remountNavid Emamdoost
commit 450c3d4166837c496ebce03650c08800991f2150 upstream. In affs_remount if data is provided it is duplicated into new_opts. The allocated memory for new_opts is only released if parse_options fails. There's a bit of history behind new_options, originally there was save/replace options on the VFS layer so the 'data' passed must not change (thus strdup), this got cleaned up in later patches. But not completely. There's no reason to do the strdup in cases where the filesystem does not need to reuse the 'data' again, because strsep would modify it directly. Fixes: c8f33d0bec99 ("affs: kstrdup() memory handling") Signed-off-by: Navid Emamdoost <navid.emamdoost@gmail.com> [ update changelog ] Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2020-03-30NFSv4.x: Drop the slot if nfs4_delegreturn_prepare waits for layoutreturnTrond Myklebust
commit 5326de9e94bedcf7366e7e7625d4deb8c1f1ca8a upstream. If nfs4_delegreturn_prepare needs to wait for a layoutreturn to complete then make sure we drop the sequence slot if we hold it. Fixes: 1c5bd76d17cc ("pNFS: Enable layoutreturn operation for return-on-close") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2020-03-30NFSv4.x: Handle bad/dead sessions correctly in nfs41_sequence_process()Trond Myklebust
commit 5c441544f045e679afd6c3c6d9f7aaf5fa5f37b0 upstream. If the server returns a bad or dead session error, the we don't want to update the session slot number, but just immediately schedule recovery and allow it to proceed. We can/should then remove handling in other places Fixes: 3453d5708b33 ("NFSv4.1: Avoid false retries when RPC calls are interrupted") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2020-03-30NFSv2: Fix a typo in encode_sattr()Trond Myklebust
commit ad97a995d8edff820d4238bd0dfc69f440031ae6 upstream. Encode the mtime correctly. Fixes: 95582b0083883 ("vfs: change inode times to use struct timespec64") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2020-03-30afs: Fix use-after-loss-of-refDavid Howells
commit 40a708bd622b78582ae3d280de29b09b50bd04c0 upstream. afs_lookup() has a tracepoint to indicate the outcome of d_splice_alias(), passing it the inode to retrieve the fid from. However, the function gave up its ref on that inode when it called d_splice_alias(), which may have failed and dropped the inode. Fix this by caching the fid. Fixes: 80548b03991f ("afs: Add more tracepoints") Reported-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2020-03-30btrfs: simplify inode locking for RWF_NOWAITGoldwyn Rodrigues
commit 9cf35f673583ccc9f3e2507498b3079d56614ad3 upstream. This is similar to 942491c9e6d6 ("xfs: fix AIM7 regression"). Apparently our current rwsem code doesn't like doing the trylock, then lock for real scheme. This causes extra contention on the lock and can be measured eg. by AIM7 benchmark. So change our read/write methods to just do the trylock for the RWF_NOWAIT case. Fixes: edf064e7c6fe ("btrfs: nowait aio support") Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> [ update changelog ] Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2020-03-30afs: Fix missing cell comparison in afs_test_super()David Howells
commit 106bc79843c3c6f4f00753d1f46e54e815f99377 upstream. Fix missing cell comparison in afs_test_super(). Without this, any pair volumes that have the same volume ID will share a superblock, no matter the cell, unless they're in different network namespaces. Normally, most users will only deal with a single cell and so they won't see this. Even if they do look into a second cell, they won't see a problem unless they happen to hit a volume with the same ID as one they've already got mounted. Before the patch: # ls /afs/grand.central.org/archive linuxdev/ mailman/ moin/ mysql/ pipermail/ stage/ twiki/ # ls /afs/kth.se/ linuxdev/ mailman/ moin/ mysql/ pipermail/ stage/ twiki/ # cat /proc/mounts | grep afs none /afs afs rw,relatime,dyn,autocell 0 0 #grand.central.org:root.cell /afs/grand.central.org afs ro,relatime 0 0 #grand.central.org:root.archive /afs/grand.central.org/archive afs ro,relatime 0 0 #grand.central.org:root.archive /afs/kth.se afs ro,relatime 0 0 After the patch: # ls /afs/grand.central.org/archive linuxdev/ mailman/ moin/ mysql/ pipermail/ stage/ twiki/ # ls /afs/kth.se/ admin/ common/ install/ OldFiles/ service/ system/ bakrestores/ home/ misc/ pkg/ src/ wsadmin/ # cat /proc/mounts | grep afs none /afs afs rw,relatime,dyn,autocell 0 0 #grand.central.org:root.cell /afs/grand.central.org afs ro,relatime 0 0 #grand.central.org:root.archive /afs/grand.central.org/archive afs ro,relatime 0 0 #kth.se:root.cell /afs/kth.se afs ro,relatime 0 0 Fixes: ^1da177e4c3f4 ("Linux-2.6.12-rc2") Reported-by: Carsten Jacobi <jacobi@de.ibm.com> Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Marc Dionne <marc.dionne@auristor.com> Tested-by: Jonathan Billings <jsbillings@jsbillings.org> cc: Todd DeSantis <atd@us.ibm.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2020-03-30cifs: Adjust indentation in smb2_open_fileNathan Chancellor
commit 7935799e041ae10d380d04ea23868240f082bd11 upstream. Clang warns: ../fs/cifs/smb2file.c:70:3: warning: misleading indentation; statement is not part of the previous 'if' [-Wmisleading-indentation] if (oparms->tcon->use_resilient) { ^ ../fs/cifs/smb2file.c:66:2: note: previous statement is here if (rc) ^ 1 warning generated. This warning occurs because there is a space after the tab on this line. Remove it so that the indentation is consistent with the Linux kernel coding style and clang no longer warns. Fixes: 592fafe644bf ("Add resilienthandles mount parm") Link: https://github.com/ClangBuiltLinux/linux/issues/826 Signed-off-by: Nathan Chancellor <natechancellor@gmail.com> Signed-off-by: Steve French <stfrench@microsoft.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2020-03-30chardev: Avoid potential use-after-free in 'chrdev_open()'Will Deacon
commit 68faa679b8be1a74e6663c21c3a9d25d32f1c079 upstream. 'chrdev_open()' calls 'cdev_get()' to obtain a reference to the 'struct cdev *' stashed in the 'i_cdev' field of the target inode structure. If the pointer is NULL, then it is initialised lazily by looking up the kobject in the 'cdev_map' and so the whole procedure is protected by the 'cdev_lock' spinlock to serialise initialisation of the shared pointer. Unfortunately, it is possible for the initialising thread to fail *after* installing the new pointer, for example if the subsequent '->open()' call on the file fails. In this case, 'cdev_put()' is called, the reference count on the kobject is dropped and, if nobody else has taken a reference, the release function is called which finally clears 'inode->i_cdev' from 'cdev_purge()' before potentially freeing the object. The problem here is that a racing thread can happily take the 'cdev_lock' and see the non-NULL pointer in the inode, which can result in a refcount increment from zero and a warning: | ------------[ cut here ]------------ | refcount_t: addition on 0; use-after-free. | WARNING: CPU: 2 PID: 6385 at lib/refcount.c:25 refcount_warn_saturate+0x6d/0xf0 | Modules linked in: | CPU: 2 PID: 6385 Comm: repro Not tainted 5.5.0-rc2+ #22 | Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014 | RIP: 0010:refcount_warn_saturate+0x6d/0xf0 | Code: 05 55 9a 15 01 01 e8 9d aa c8 ff 0f 0b c3 80 3d 45 9a 15 01 00 75 ce 48 c7 c7 00 9c 62 b3 c6 08 | RSP: 0018:ffffb524c1b9bc70 EFLAGS: 00010282 | RAX: 0000000000000000 RBX: ffff9e9da1f71390 RCX: 0000000000000000 | RDX: ffff9e9dbbd27618 RSI: ffff9e9dbbd18798 RDI: ffff9e9dbbd18798 | RBP: 0000000000000000 R08: 000000000000095f R09: 0000000000000039 | R10: 0000000000000000 R11: ffffb524c1b9bb20 R12: ffff9e9da1e8c700 | R13: ffffffffb25ee8b0 R14: 0000000000000000 R15: ffff9e9da1e8c700 | FS: 00007f3b87d26700(0000) GS:ffff9e9dbbd00000(0000) knlGS:0000000000000000 | CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 | CR2: 00007fc16909c000 CR3: 000000012df9c000 CR4: 00000000000006e0 | DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 | DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 | Call Trace: | kobject_get+0x5c/0x60 | cdev_get+0x2b/0x60 | chrdev_open+0x55/0x220 | ? cdev_put.part.3+0x20/0x20 | do_dentry_open+0x13a/0x390 | path_openat+0x2c8/0x1470 | do_filp_open+0x93/0x100 | ? selinux_file_ioctl+0x17f/0x220 | do_sys_open+0x186/0x220 | do_syscall_64+0x48/0x150 | entry_SYSCALL_64_after_hwframe+0x44/0xa9 | RIP: 0033:0x7f3b87efcd0e | Code: 89 54 24 08 e8 a3 f4 ff ff 8b 74 24 0c 48 8b 3c 24 41 89 c0 44 8b 54 24 08 b8 01 01 00 00 89 f4 | RSP: 002b:00007f3b87d259f0 EFLAGS: 00000293 ORIG_RAX: 0000000000000101 | RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f3b87efcd0e | RDX: 0000000000000000 RSI: 00007f3b87d25a80 RDI: 00000000ffffff9c | RBP: 00007f3b87d25e90 R08: 0000000000000000 R09: 0000000000000000 | R10: 0000000000000000 R11: 0000000000000293 R12: 00007ffe188f504e | R13: 00007ffe188f504f R14: 00007f3b87d26700 R15: 0000000000000000 | ---[ end trace 24f53ca58db8180a ]--- Since 'cdev_get()' can already fail to obtain a reference, simply move it over to use 'kobject_get_unless_zero()' instead of 'kobject_get()', which will cause the racing thread to return -ENXIO if the initialising thread fails unexpectedly. Cc: Hillf Danton <hdanton@sina.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Reported-by: syzbot+82defefbbd8527e1c2cb@syzkaller.appspotmail.com Signed-off-by: Will Deacon <will@kernel.org> Cc: stable <stable@vger.kernel.org> Link: https://lore.kernel.org/r/20191219120203.32691-1-will@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2020-03-20pstore/ram: Regularize prz label allocation lifetimeKees Cook
commit e163fdb3f7f8c62dccf194f3f37a7bcb3c333aa8 upstream. In my attempt to fix a memory leak, I introduced a double-free in the pstore error path. Instead of trying to manage the allocation lifetime between persistent_ram_new() and its callers, adjust the logic so persistent_ram_new() always takes a kstrdup() copy, and leaves the caller's allocation lifetime up to the caller. Therefore callers are _always_ responsible for freeing their label. Before, it only needed freeing when the prz itself failed to allocate, and not in any of the other prz failure cases, which callers would have no visibility into, which is the root design problem that lead to both the leak and now double-free bugs. Reported-by: Cengiz Can <cengiz@kernel.wtf> Link: https://lore.kernel.org/lkml/d4ec59002ede4aaf9928c7f7526da87c@kernel.wtf Fixes: 8df955a32a73 ("pstore/ram: Fix error-path memory leak in persistent_ram_new() callers") Cc: stable@vger.kernel.org Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2020-03-20locks: fix a potential use-after-free problem when wakeup a waiteryangerkun
commit 6d390e4b5d48ec03bb87e63cf0a2bff5f4e116da upstream. '16306a61d3b7 ("fs/locks: always delete_block after waiting.")' add the logic to check waiter->fl_blocker without blocked_lock_lock. And it will trigger a UAF when we try to wakeup some waiter: Thread 1 has create a write flock a on file, and now thread 2 try to unlock and delete flock a, thread 3 try to add flock b on the same file. Thread2 Thread3 flock syscall(create flock b) ...flock_lock_inode_wait flock_lock_inode(will insert our fl_blocked_member list to flock a's fl_blocked_requests) sleep flock syscall(unlock) ...flock_lock_inode_wait locks_delete_lock_ctx ...__locks_wake_up_blocks __locks_delete_blocks( b->fl_blocker = NULL) ... break by a signal locks_delete_block b->fl_blocker == NULL && list_empty(&b->fl_blocked_requests) success, return directly locks_free_lock b wake_up(&b->fl_waiter) trigger UAF Fix it by remove this logic, and this patch may also fix CVE-2019-19769. Cc: stable@vger.kernel.org Fixes: 16306a61d3b7 ("fs/locks: always delete_block after waiting.") Signed-off-by: yangerkun <yangerkun@huawei.com> Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2020-03-20fs: move guard_bio_eod() after bio_set_op_attrsMing Lei
commit 83c9c547168e8b914ea6398430473a4de68c52cc upstream. Commit 85a8ce62c2ea ("block: add bio_truncate to fix guard_bio_eod") adds bio_truncate() for handling bio EOD. However, bio_truncate() doesn't use the passed 'op' parameter from guard_bio_eod's callers. So bio_trunacate() may retrieve wrong 'op', and zering pages may not be done for READ bio. Fixes this issue by moving guard_bio_eod() after bio_set_op_attrs() in submit_bh_wbc() so that bio_truncate() can always retrieve correct op info. Meantime remove the 'op' parameter from guard_bio_eod() because it isn't used any more. Cc: Carlos Maiolino <cmaiolino@redhat.com> Cc: linux-fsdevel@vger.kernel.org Fixes: 85a8ce62c2ea ("block: add bio_truncate to fix guard_bio_eod") Signed-off-by: Ming Lei <ming.lei@redhat.com> Fold in kerneldoc and bio_op() change. Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2020-03-20fs: call fsnotify_sb_delete after evict_inodesEric Sandeen
commit 1edc8eb2e93130e36ac74ac9c80913815a57d413 upstream. When a filesystem is unmounted, we currently call fsnotify_sb_delete() before evict_inodes(), which means that fsnotify_unmount_inodes() must iterate over all inodes on the superblock looking for any inodes with watches. This is inefficient and can lead to livelocks as it iterates over many unwatched inodes. At this point, SB_ACTIVE is gone and dropping refcount to zero kicks the inode out out immediately, so anything processed by fsnotify_sb_delete / fsnotify_unmount_inodes gets evicted in that loop. After that, the call to evict_inodes will evict everything else with a zero refcount. This should speed things up overall, and avoid livelocks in fsnotify_unmount_inodes(). Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2020-03-20fs: avoid softlockups in s_inodes iteratorsEric Sandeen
commit 04646aebd30b99f2cfa0182435a2ec252fcb16d0 upstream. Anything that walks all inodes on sb->s_inodes list without rescheduling risks softlockups. Previous efforts were made in 2 functions, see: c27d82f fs/drop_caches.c: avoid softlockups in drop_pagecache_sb() ac05fbb inode: don't softlockup when evicting inodes but there hasn't been an audit of all walkers, so do that now. This also consistently moves the cond_resched() calls to the bottom of each loop in cases where it already exists. One loop remains: remove_dquot_ref(), because I'm not quite sure how to deal with that one w/o taking the i_lock. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2020-03-20btrfs: handle error in btrfs_cache_block_groupJosef Bacik
commit db8fe64f9ce61d1d89d3c3c34d111a43afb9f053 upstream. We have a BUG_ON(ret < 0) in find_free_extent from btrfs_cache_block_group. If we fail to allocate our ctl we'll just panic, which is not good. Instead just go on to another block group. If we fail to find a block group we don't want to return ENOSPC, because really we got a ENOMEM and that's the root of the problem. Save our return from btrfs_cache_block_group(), and then if we still fail to make our allocation return that ret so we get the right error back. Tested with inject-error.py from bcc. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2020-03-20btrfs: Fix error messages in qgroup_rescan_initNikolay Borisov
commit 37d02592f11bb76e4ab1dcaa5b8a2a0715403207 upstream. The branch of qgroup_rescan_init which is executed from the mount path prints wrong errors messages. The textual print out in case BTRFS_QGROUP_STATUS_FLAG_RESCAN/BTRFS_QGROUP_STATUS_FLAG_ON are not set are transposed. Fix it by exchanging their place. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2020-03-20Btrfs: only associate the locked page with one async_chunk structChris Mason
commit 1d53c9e6723022b12e4a5ed4b141f67c834b7f6f upstream. The btrfs writepages function collects a large range of pages flagged for delayed allocation, and then sends them down through the COW code for processing. When compression is on, we allocate one async_chunk structure for every 512K, and then run those pages through the compression code for IO submission. writepages starts all of this off with a single page, locked by the original call to extent_write_cache_pages(), and it's important to keep track of this page because it has already been through clear_page_dirty_for_io(). The btrfs async_chunk struct has a pointer to the locked_page, and when we're redirtying the page because compression had to fallback to uncompressed IO, we use page->index to decide if a given async_chunk struct really owns that page. But, this is racey. If a given delalloc range is broken up into two async_chunks (chunkA and chunkB), we can end up with something like this: compress_file_range(chunkA) submit_compress_extents(chunkA) submit compressed bios(chunkA) put_page(locked_page) compress_file_range(chunkB) ... Or: async_cow_submit submit_compressed_extents <--- falls back to buffered writeout cow_file_range extent_clear_unlock_delalloc __process_pages_contig put_page(locked_pages) async_cow_submit The end result is that chunkA is completed and cleaned up before chunkB even starts processing. This means we can free locked_page() and reuse it elsewhere. If we get really lucky, it'll have the same page->index in its new home as it did before. While we're processing chunkB, we might decide we need to fall back to uncompressed IO, and so compress_file_range() will call __set_page_dirty_nobufers() on chunkB->locked_page. Without cgroups in use, this creates as a phantom dirty page, which isn't great but isn't the end of the world. What can happen, it can go through the fixup worker and the whole COW machinery again: in submit_compressed_extents(): while (async extents) { ... cow_file_range if (!page_started ...) extent_write_locked_range else if (...) unlock_page continue; This hasn't been observed in practice but is still possible. With cgroups in use, we might crash in the accounting code because page->mapping->i_wb isn't set. BUG: unable to handle kernel NULL pointer dereference at 00000000000000d0 IP: percpu_counter_add_batch+0x11/0x70 PGD 66534e067 P4D 66534e067 PUD 66534f067 PMD 0 Oops: 0000 [#1] SMP DEBUG_PAGEALLOC CPU: 16 PID: 2172 Comm: rm Not tainted RIP: 0010:percpu_counter_add_batch+0x11/0x70 RSP: 0018:ffffc9000a97bbe0 EFLAGS: 00010286 RAX: 0000000000000005 RBX: 0000000000000090 RCX: 0000000000026115 RDX: 0000000000000030 RSI: ffffffffffffffff RDI: 0000000000000090 RBP: 0000000000000000 R08: fffffffffffffff5 R09: 0000000000000000 R10: 00000000000260c0 R11: ffff881037fc26c0 R12: ffffffffffffffff R13: ffff880fe4111548 R14: ffffc9000a97bc90 R15: 0000000000000001 FS: 00007f5503ced480(0000) GS:ffff880ff7200000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00000000000000d0 CR3: 00000001e0459005 CR4: 0000000000360ee0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: account_page_cleaned+0x15b/0x1f0 __cancel_dirty_page+0x146/0x200 truncate_cleanup_page+0x92/0xb0 truncate_inode_pages_range+0x202/0x7d0 btrfs_evict_inode+0x92/0x5a0 evict+0xc1/0x190 do_unlinkat+0x176/0x280 do_syscall_64+0x63/0x1a0 entry_SYSCALL_64_after_hwframe+0x42/0xb7 The fix here is to make asyc_chunk->locked_page NULL everywhere but the one async_chunk struct that's allowed to do things to the locked page. Link: https://lore.kernel.org/linux-btrfs/c2419d01-5c84-3fb4-189e-4db519d08796@suse.com/ Fixes: 771ed689d2cd ("Btrfs: Optimize compressed writeback and reads") Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Chris Mason <clm@fb.com> [ update changelog from mail thread discussion ] Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>