summaryrefslogtreecommitdiffstats
path: root/fs/ext4
AgeCommit message (Collapse)Author
2020-10-29ext4: limit entries returned when counting fsmap recordsDarrick J. Wong
[ Upstream commit af8c53c8bc087459b1aadd4c94805d8272358d79 ] If userspace asked fsmap to try to count the number of entries, we cannot return more than UINT_MAX entries because fmh_entries is u32. Therefore, stop counting if we hit this limit or else we will waste time to return truncated results. Fixes: 0c9ec4beecac ("ext4: support GETFSMAP ioctls") Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Link: https://lore.kernel.org/r/20201001222148.GA49520@magnolia Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-10-01ext4: mark block bitmap corrupted when found instead of BUGONDmitry Monakhov
[ Upstream commit eb5760863fc28feab28b567ddcda7e667e638da0 ] We already has similar code in ext4_mb_complex_scan_group(), but ext4_mb_simple_scan_group() still affected. Other reports: https://www.spinics.net/lists/linux-ext4/msg60231.html Reviewed-by: Andreas Dilger <adilger@dilger.ca> Signed-off-by: Dmitry Monakhov <dmonakhov@gmail.com> Link: https://lore.kernel.org/r/20200310150156.641-1-dmonakhov@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-10-01ext4: fix a data race at inode->i_disksizeQiujun Huang
[ Upstream commit dce8e237100f60c28cc66effb526ba65a01d8cb3 ] KCSAN find inode->i_disksize could be accessed concurrently. BUG: KCSAN: data-race in ext4_mark_iloc_dirty / ext4_write_end write (marked) to 0xffff8b8932f40090 of 8 bytes by task 66792 on cpu 0: ext4_write_end+0x53f/0x5b0 ext4_da_write_end+0x237/0x510 generic_perform_write+0x1c4/0x2a0 ext4_buffered_write_iter+0x13a/0x210 ext4_file_write_iter+0xe2/0x9b0 new_sync_write+0x29c/0x3a0 __vfs_write+0x92/0xa0 vfs_write+0xfc/0x2a0 ksys_write+0xe8/0x140 __x64_sys_write+0x4c/0x60 do_syscall_64+0x8a/0x2a0 entry_SYSCALL_64_after_hwframe+0x44/0xa9 read to 0xffff8b8932f40090 of 8 bytes by task 14414 on cpu 1: ext4_mark_iloc_dirty+0x716/0x1190 ext4_mark_inode_dirty+0xc9/0x360 ext4_convert_unwritten_extents+0x1bc/0x2a0 ext4_convert_unwritten_io_end_vec+0xc5/0x150 ext4_put_io_end+0x82/0x130 ext4_writepages+0xae7/0x16f0 do_writepages+0x64/0x120 __writeback_single_inode+0x7d/0x650 writeback_sb_inodes+0x3a4/0x860 __writeback_inodes_wb+0xc4/0x150 wb_writeback+0x43f/0x510 wb_workfn+0x3b2/0x8a0 process_one_work+0x39b/0x7e0 worker_thread+0x88/0x650 kthread+0x1d4/0x1f0 ret_from_fork+0x35/0x40 The plain read is outside of inode->i_data_sem critical section which results in a data race. Fix it by adding READ_ONCE(). Signed-off-by: Qiujun Huang <hqjagain@gmail.com> Link: https://lore.kernel.org/r/1582556566-3909-1-git-send-email-hqjagain@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-09-03fs: prevent BUG_ON in submit_bh_wbc()Xianting Tian
[ Upstream commit 377254b2cd2252c7c3151b113cbdf93a7736c2e9 ] If a device is hot-removed --- for example, when a physical device is unplugged from pcie slot or a nbd device's network is shutdown --- this can result in a BUG_ON() crash in submit_bh_wbc(). This is because the when the block device dies, the buffer heads will have their Buffer_Mapped flag get cleared, leading to the crash in submit_bh_wbc. We had attempted to work around this problem in commit a17712c8 ("ext4: check superblock mapped prior to committing"). Unfortunately, it's still possible to hit the BUG_ON(!buffer_mapped(bh)) if the device dies between when the work-around check in ext4_commit_super() and when submit_bh_wbh() is finally called: Code path: ext4_commit_super judge if 'buffer_mapped(sbh)' is false, return <== commit a17712c8 lock_buffer(sbh) ... unlock_buffer(sbh) __sync_dirty_buffer(sbh,... lock_buffer(sbh) judge if 'buffer_mapped(sbh))' is false, return <== added by this patch submit_bh(...,sbh) submit_bh_wbc(...,sbh,...) [100722.966497] kernel BUG at fs/buffer.c:3095! <== BUG_ON(!buffer_mapped(bh))' in submit_bh_wbc() [100722.966503] invalid opcode: 0000 [#1] SMP [100722.966566] task: ffff8817e15a9e40 task.stack: ffffc90024744000 [100722.966574] RIP: 0010:submit_bh_wbc+0x180/0x190 [100722.966575] RSP: 0018:ffffc90024747a90 EFLAGS: 00010246 [100722.966576] RAX: 0000000000620005 RBX: ffff8818a80603a8 RCX: 0000000000000000 [100722.966576] RDX: ffff8818a80603a8 RSI: 0000000000020800 RDI: 0000000000000001 [100722.966577] RBP: ffffc90024747ac0 R08: 0000000000000000 R09: ffff88207f94170d [100722.966578] R10: 00000000000437c8 R11: 0000000000000001 R12: 0000000000020800 [100722.966578] R13: 0000000000000001 R14: 000000000bf9a438 R15: ffff88195f333000 [100722.966580] FS: 00007fa2eee27700(0000) GS:ffff88203d840000(0000) knlGS:0000000000000000 [100722.966580] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [100722.966581] CR2: 0000000000f0b008 CR3: 000000201a622003 CR4: 00000000007606e0 [100722.966582] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [100722.966583] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [100722.966583] PKRU: 55555554 [100722.966583] Call Trace: [100722.966588] __sync_dirty_buffer+0x6e/0xd0 [100722.966614] ext4_commit_super+0x1d8/0x290 [ext4] [100722.966626] __ext4_std_error+0x78/0x100 [ext4] [100722.966635] ? __ext4_journal_get_write_access+0xca/0x120 [ext4] [100722.966646] ext4_reserve_inode_write+0x58/0xb0 [ext4] [100722.966655] ? ext4_dirty_inode+0x48/0x70 [ext4] [100722.966663] ext4_mark_inode_dirty+0x53/0x1e0 [ext4] [100722.966671] ? __ext4_journal_start_sb+0x6d/0xf0 [ext4] [100722.966679] ext4_dirty_inode+0x48/0x70 [ext4] [100722.966682] __mark_inode_dirty+0x17f/0x350 [100722.966686] generic_update_time+0x87/0xd0 [100722.966687] touch_atime+0xa9/0xd0 [100722.966690] generic_file_read_iter+0xa09/0xcd0 [100722.966694] ? page_cache_tree_insert+0xb0/0xb0 [100722.966704] ext4_file_read_iter+0x4a/0x100 [ext4] [100722.966707] ? __inode_security_revalidate+0x4f/0x60 [100722.966709] __vfs_read+0xec/0x160 [100722.966711] vfs_read+0x8c/0x130 [100722.966712] SyS_pread64+0x87/0xb0 [100722.966716] do_syscall_64+0x67/0x1b0 [100722.966719] entry_SYSCALL64_slow_path+0x25/0x25 To address this, add the check of 'buffer_mapped(bh)' to __sync_dirty_buffer(). This also has the benefit of fixing this for other file systems. With this addition, we can drop the workaround in ext4_commit_supper(). [ Commit description rewritten by tytso. ] Signed-off-by: Xianting Tian <xianting_tian@126.com> Link: https://lore.kernel.org/r/1596211825-8750-1-git-send-email-xianting_tian@126.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-09-03ext4: correctly restore system zone info when remount failsJan Kara
[ Upstream commit 0f5bde1db174f6c471f0bd27198575719dabe3e5 ] When remounting filesystem fails late during remount handling and block_validity mount option is also changed during the remount, we fail to restore system zone information to a state matching the mount option. This is mostly harmless, just the block validity checking will not match the situation described by the mount option. Make sure these two are always consistent. Reported-by: Lukas Czerner <lczerner@redhat.com> Reviewed-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20200728130437.7804-7-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-09-03ext4: handle error of ext4_setup_system_zone() on remountJan Kara
[ Upstream commit d176b1f62f242ab259ff665a26fbac69db1aecba ] ext4_setup_system_zone() can fail. Handle the failure in ext4_remount(). Reviewed-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20200728130437.7804-2-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-09-03ext4: handle option set by mount flags correctlyLukas Czerner
[ Upstream commit f25391ebb475d3ffb3aa61bb90e3594c841749ef ] Currently there is a problem with mount options that can be both set by vfs using mount flags or by a string parsing in ext4. i_version/iversion options gets lost after remount, for example $ mount -o i_version /dev/pmem0 /mnt $ grep pmem0 /proc/self/mountinfo | grep i_version 310 95 259:0 / /mnt rw,relatime shared:163 - ext4 /dev/pmem0 rw,seclabel,i_version $ mount -o remount,ro /mnt $ grep pmem0 /proc/self/mountinfo | grep i_version nolazytime gets ignored by ext4 on remount, for example $ mount -o lazytime /dev/pmem0 /mnt $ grep pmem0 /proc/self/mountinfo | grep lazytime 310 95 259:0 / /mnt rw,relatime shared:163 - ext4 /dev/pmem0 rw,lazytime,seclabel $ mount -o remount,nolazytime /mnt $ grep pmem0 /proc/self/mountinfo | grep lazytime 310 95 259:0 / /mnt rw,relatime shared:163 - ext4 /dev/pmem0 rw,lazytime,seclabel Fix it by applying the SB_LAZYTIME and SB_I_VERSION flags from *flags to s_flags before we parse the option and use the resulting state of the same flags in *flags at the end of successful remount. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com> Link: https://lore.kernel.org/r/20200723150526.19931-1-lczerner@redhat.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-09-03ext4: handle read only external journal deviceLukas Czerner
[ Upstream commit 273108fa5015eeffc4bacfa5ce272af3434b96e4 ] Ext4 uses blkdev_get_by_dev() to get the block_device for journal device which does check to see if the read-only block device was opened read-only. As a result ext4 will hapily proceed mounting the file system with external journal on read-only device. This is bad as we would not be able to use the journal leading to errors later on. Instead of simply failing to mount file system in this case, treat it in a similar way we treat internal journal on read-only device. Allow to mount with -o noload in read-only mode. This can be reproduced easily like this: mke2fs -F -O journal_dev $JOURNAL_DEV 100M mkfs.$FSTYPE -F -J device=$JOURNAL_DEV $FS_DEV blockdev --setro $JOURNAL_DEV mount $FS_DEV $MNT touch $MNT/file umount $MNT leading to error like this [ 1307.318713] ------------[ cut here ]------------ [ 1307.323362] generic_make_request: Trying to write to read-only block-device dm-2 (partno 0) [ 1307.331741] WARNING: CPU: 36 PID: 3224 at block/blk-core.c:855 generic_make_request_checks+0x2c3/0x580 [ 1307.341041] Modules linked in: ext4 mbcache jbd2 rfkill intel_rapl_msr intel_rapl_common isst_if_commd [ 1307.419445] CPU: 36 PID: 3224 Comm: jbd2/dm-2 Tainted: G W I 5.8.0-rc5 #2 [ 1307.427359] Hardware name: Dell Inc. PowerEdge R740/01KPX8, BIOS 2.3.10 08/15/2019 [ 1307.434932] RIP: 0010:generic_make_request_checks+0x2c3/0x580 [ 1307.440676] Code: 94 03 00 00 48 89 df 48 8d 74 24 08 c6 05 cf 2b 18 01 01 e8 7f a4 ff ff 48 c7 c7 50e [ 1307.459420] RSP: 0018:ffffc0d70eb5fb48 EFLAGS: 00010286 [ 1307.464646] RAX: 0000000000000000 RBX: ffff9b33b2978300 RCX: 0000000000000000 [ 1307.471780] RDX: ffff9b33e12a81e0 RSI: ffff9b33e1298000 RDI: ffff9b33e1298000 [ 1307.478913] RBP: ffff9b7b9679e0c0 R08: 0000000000000837 R09: 0000000000000024 [ 1307.486044] R10: 0000000000000000 R11: ffffc0d70eb5f9f0 R12: 0000000000000400 [ 1307.493177] R13: 0000000000000000 R14: 0000000000000001 R15: 0000000000000000 [ 1307.500308] FS: 0000000000000000(0000) GS:ffff9b33e1280000(0000) knlGS:0000000000000000 [ 1307.508396] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 1307.514142] CR2: 000055eaf4109000 CR3: 0000003dee40a006 CR4: 00000000007606e0 [ 1307.521273] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 1307.528407] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 1307.535538] PKRU: 55555554 [ 1307.538250] Call Trace: [ 1307.540708] generic_make_request+0x30/0x340 [ 1307.544985] submit_bio+0x43/0x190 [ 1307.548393] ? bio_add_page+0x62/0x90 [ 1307.552068] submit_bh_wbc+0x16a/0x190 [ 1307.555833] jbd2_write_superblock+0xec/0x200 [jbd2] [ 1307.560803] jbd2_journal_update_sb_log_tail+0x65/0xc0 [jbd2] [ 1307.566557] jbd2_journal_commit_transaction+0x2ae/0x1860 [jbd2] [ 1307.572566] ? check_preempt_curr+0x7a/0x90 [ 1307.576756] ? update_curr+0xe1/0x1d0 [ 1307.580421] ? account_entity_dequeue+0x7b/0xb0 [ 1307.584955] ? newidle_balance+0x231/0x3d0 [ 1307.589056] ? __switch_to_asm+0x42/0x70 [ 1307.592986] ? __switch_to_asm+0x36/0x70 [ 1307.596918] ? lock_timer_base+0x67/0x80 [ 1307.600851] kjournald2+0xbd/0x270 [jbd2] [ 1307.604873] ? finish_wait+0x80/0x80 [ 1307.608460] ? commit_timeout+0x10/0x10 [jbd2] [ 1307.612915] kthread+0x114/0x130 [ 1307.616152] ? kthread_park+0x80/0x80 [ 1307.619816] ret_from_fork+0x22/0x30 [ 1307.623400] ---[ end trace 27490236265b1630 ]--- Signed-off-by: Lukas Czerner <lczerner@redhat.com> Reviewed-by: Andreas Dilger <adilger@dilger.ca> Link: https://lore.kernel.org/r/20200717090605.2612-1-lczerner@redhat.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-09-03ext4: don't BUG on inconsistent journal featureJan Kara
[ Upstream commit 11215630aada28307ba555a43138db6ac54fa825 ] A customer has reported a BUG_ON in ext4_clear_journal_err() hitting during an LTP testing. Either this has been caused by a test setup issue where the filesystem was being overwritten while LTP was mounting it or the journal replay has overwritten the superblock with invalid data. In either case it is preferable we don't take the machine down with a BUG_ON. So handle the situation of unexpectedly missing has_journal feature more gracefully. We issue warning and fail the mount in the cases where the race window is narrow and the failed check is most likely a programming error. In cases where fs corruption is more likely, we do full ext4_error() handling before failing mount / remount. Reviewed-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20200710140759.18031-1-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-08-26ext4: don't allow overlapping system zonesJan Kara
[ Upstream commit bf9a379d0980e7413d94cb18dac73db2bfc5f470 ] Currently, add_system_zone() just silently merges two added system zones that overlap. However the overlap should not happen and it generally suggests that some unrelated metadata overlap which indicates the fs is corrupted. We should have caught such problems earlier (e.g. in ext4_check_descriptors()) but add this check as another line of defense. In later patch we also use this for stricter checking of journal inode extent tree. Reviewed-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20200728130437.7804-3-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-08-26ext4: fix potential negative array index in do_split()Eric Sandeen
[ Upstream commit 5872331b3d91820e14716632ebb56b1399b34fe1 ] If for any reason a directory passed to do_split() does not have enough active entries to exceed half the size of the block, we can end up iterating over all "count" entries without finding a split point. In this case, count == move, and split will be zero, and we will attempt a negative index into map[]. Guard against this by detecting this case, and falling back to split-to-half-of-count instead; in this case we will still have plenty of space (> half blocksize) in each split block. Fixes: ef2b02d3e617 ("ext34: ensure do_split leaves enough free space in both blocks") Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Andreas Dilger <adilger@dilger.ca> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/f53e246b-647c-64bb-16ec-135383c70ad7@redhat.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-08-26ext4: fix checking of directory entry validity for inline directoriesJan Kara
commit 7303cb5bfe845f7d43cd9b2dbd37dbb266efda9b upstream. ext4_search_dir() and ext4_generic_delete_entry() can be called both for standard director blocks and for inline directories stored inside inode or inline xattr space. For the second case we didn't call ext4_check_dir_entry() with proper constraints that could result in accepting corrupted directory entry as well as false positive filesystem errors like: EXT4-fs error (device dm-0): ext4_search_dir:1395: inode #28320400: block 113246792: comm dockerd: bad entry in directory: directory entry too close to block end - offset=0, inode=28320403, rec_len=32, name_len=8, size=4096 Fix the arguments passed to ext4_check_dir_entry(). Fixes: 109ba779d6cc ("ext4: check for directory entries too close to block end") CC: stable@vger.kernel.org Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20200731162135.8080-1-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-08-07ext4: fix direct I/O read errorJiang Ying
This patch is used to fix ext4 direct I/O read error when the read size is not aligned with block size. Then, I will use a test to explain the error. (1) Make a file that is not aligned with block size: $dd if=/dev/zero of=./test.jar bs=1000 count=3 (2) I wrote a source file named "direct_io_read_file.c" as following: #include <stdio.h> #include <stdlib.h> #include <unistd.h> #include <sys/file.h> #include <sys/types.h> #include <sys/stat.h> #include <string.h> #define BUF_SIZE 1024 int main() { int fd; int ret; unsigned char *buf; ret = posix_memalign((void **)&buf, 512, BUF_SIZE); if (ret) { perror("posix_memalign failed"); exit(1); } fd = open("./test.jar", O_RDONLY | O_DIRECT, 0755); if (fd < 0){ perror("open ./test.jar failed"); exit(1); } do { ret = read(fd, buf, BUF_SIZE); printf("ret=%d\n",ret); if (ret < 0) { perror("write test.jar failed"); } } while (ret > 0); free(buf); close(fd); } (3) Compile the source file: $gcc direct_io_read_file.c -D_GNU_SOURCE (4) Run the test program: $./a.out The result is as following: ret=1024 ret=1024 ret=952 ret=-1 write test.jar failed: Invalid argument. I have tested this program on XFS filesystem, XFS does not have this problem, because XFS use iomap_dio_rw() to do direct I/O read. And the comparing between read offset and file size is done in iomap_dio_rw(), the code is as following: if (pos < size) { retval = filemap_write_and_wait_range(mapping, pos, pos + iov_length(iov, nr_segs) - 1); if (!retval) { retval = mapping->a_ops->direct_IO(READ, iocb, iov, pos, nr_segs); } ... } ...only when "pos < size", direct I/O can be done, or 0 will be return. I have tested the fix patch on Ext4, it is up to the mustard of EINVAL in man2(read) as following: #include <unistd.h> ssize_t read(int fd, void *buf, size_t count); EINVAL fd is attached to an object which is unsuitable for reading; or the file was opened with the O_DIRECT flag, and either the address specified in buf, the value specified in count, or the current file offset is not suitably aligned. So I think this patch can be applied to fix ext4 direct I/O error. However Ext4 introduces direct I/O read using iomap infrastructure on kernel 5.5, the patch is commit <b1b4705d54ab> ("ext4: introduce direct I/O read using iomap infrastructure"), then Ext4 will be the same as XFS, they all use iomap_dio_rw() to do direct I/O read. So this problem does not exist on kernel 5.5 for Ext4. >From above description, we can see this problem exists on all the kernel versions between kernel 3.14 and kernel 5.4. It will cause the Applications to fail to read. For example, when the search service downloads a new full index file, the search engine is loading the previous index file and is processing the search request, it can not use buffer io that may squeeze the previous index file in use from pagecache, so the serch service must use direct I/O read. Please apply this patch on these kernel versions, or please use the method on kernel 5.5 to fix this problem. Fixes: 9fe55eea7e4b ("Fix race when checking i_size on direct i/o read") Reviewed-by: Jan Kara <jack@suse.cz> Co-developed-by: Wang Long <wanglong19@meituan.com> Signed-off-by: Wang Long <wanglong19@meituan.com> Signed-off-by: Jiang Ying <jiangying8582@126.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-06-24ext4: avoid race conditions when remounting with options that change daxTheodore Ts'o
[ Upstream commit 829b37b8cddb1db75c1b7905505b90e593b15db1 ] Trying to change dax mount options when remounting could allow mount options to be enabled for a small amount of time, and then the mount option change would be reverted. In the case of "mount -o remount,dax", this can cause a race where files would temporarily treated as DAX --- and then not. Cc: stable@kernel.org Reported-by: syzbot+bca9799bf129256190da@syzkaller.appspotmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-06-24ext4: avoid utf8_strncasecmp() with unstable nameEric Biggers
commit 2ce3ee931a097e9720310db3f09c01c825a4580c upstream. If the dentry name passed to ->d_compare() fits in dentry::d_iname, then it may be concurrently modified by a rename. This can cause undefined behavior (possibly out-of-bounds memory accesses or crashes) in utf8_strncasecmp(), since fs/unicode/ isn't written to handle strings that may be concurrently modified. Fix this by first copying the filename to a stack buffer if needed. This way we get a stable snapshot of the filename. Fixes: b886ee3e778e ("ext4: Support case-insensitive file name lookups") Cc: <stable@vger.kernel.org> # v5.2+ Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Daniel Rosenberg <drosen@google.com> Cc: Gabriel Krisman Bertazi <krisman@collabora.co.uk> Signed-off-by: Eric Biggers <ebiggers@google.com> Reviewed-by: Andreas Dilger <adilger@dilger.ca> Link: https://lore.kernel.org/r/20200601200543.59417-1-ebiggers@kernel.org Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-06-24ext4: fix partial cluster initialization when splitting extentJeffle Xu
commit cfb3c85a600c6aa25a2581b3c1c4db3460f14e46 upstream. Fix the bug when calculating the physical block number of the first block in the split extent. This bug will cause xfstests shared/298 failure on ext4 with bigalloc enabled occasionally. Ext4 error messages indicate that previously freed blocks are being freed again, and the following fsck will fail due to the inconsistency of block bitmap and bg descriptor. The following is an example case: 1. First, Initialize a ext4 filesystem with cluster size '16K', block size '4K', in which case, one cluster contains four blocks. 2. Create one file (e.g., xxx.img) on this ext4 filesystem. Now the extent tree of this file is like: ... 36864:[0]4:220160 36868:[0]14332:145408 51200:[0]2:231424 ... 3. Then execute PUNCH_HOLE fallocate on this file. The hole range is like: .. ext4_ext_remove_space: dev 254,16 ino 12 since 49506 end 49506 depth 1 ext4_ext_remove_space: dev 254,16 ino 12 since 49544 end 49546 depth 1 ext4_ext_remove_space: dev 254,16 ino 12 since 49605 end 49607 depth 1 ... 4. Then the extent tree of this file after punching is like ... 49507:[0]37:158047 49547:[0]58:158087 ... 5. Detailed procedure of punching hole [49544, 49546] 5.1. The block address space: ``` lblk ~49505 49506 49507~49543 49544~49546 49547~ ---------+------+-------------+----------------+-------- extent | hole | extent | hole | extent ---------+------+-------------+----------------+-------- pblk ~158045 158046 158047~158083 158084~158086 158087~ ``` 5.2. The detailed layout of cluster 39521: ``` cluster 39521 <-------------------------------> hole extent <----------------------><-------- lblk 49544 49545 49546 49547 +-------+-------+-------+-------+ | | | | | +-------+-------+-------+-------+ pblk 158084 1580845 158086 158087 ``` 5.3. The ftrace output when punching hole [49544, 49546]: - ext4_ext_remove_space (start 49544, end 49546) - ext4_ext_rm_leaf (start 49544, end 49546, last_extent [49507(158047), 40], partial [pclu 39522 lblk 0 state 2]) - ext4_remove_blocks (extent [49507(158047), 40], from 49544 to 49546, partial [pclu 39522 lblk 0 state 2] - ext4_free_blocks: (block 158084 count 4) - ext4_mballoc_free (extent 1/6753/1) 5.4. Ext4 error message in dmesg: EXT4-fs error (device vdb): mb_free_blocks:1457: group 1, block 158084:freeing already freed block (bit 6753); block bitmap corrupt. EXT4-fs error (device vdb): ext4_mb_generate_buddy:747: group 1, block bitmap and bg descriptor inconsistent: 19550 vs 19551 free clusters In this case, the whole cluster 39521 is freed mistakenly when freeing pblock 158084~158086 (i.e., the first three blocks of this cluster), although pblock 158087 (the last remaining block of this cluster) has not been freed yet. The root cause of this isuue is that, the pclu of the partial cluster is calculated mistakenly in ext4_ext_remove_space(). The correct partial_cluster.pclu (i.e., the cluster number of the first block in the next extent, that is, lblock 49597 (pblock 158086)) should be 39521 rather than 39522. Fixes: f4226d9ea400 ("ext4: fix partial cluster initialization") Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com> Reviewed-by: Eric Whitney <enwlinux@gmail.com> Cc: stable@kernel.org # v3.19+ Link: https://lore.kernel.org/r/1590121124-37096-1-git-send-email-jefflexu@linux.alibaba.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-06-24ext4: stop overwrite the errcode in ext4_setup_superyangerkun
[ Upstream commit 5adaccac46ea79008d7b75f47913f1a00f91d0ce ] Now the errcode from ext4_commit_super will overwrite EROFS exists in ext4_setup_super. Actually, no need to call ext4_commit_super since we will return EROFS. Fix it by goto done directly. Fixes: c89128a00838 ("ext4: handle errors on ext4_commit_super") Signed-off-by: yangerkun <yangerkun@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20200601073404.3712492-1-yangerkun@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-06-22ext4: fix race between ext4_sync_parent() and rename()Eric Biggers
commit 08adf452e628b0e2ce9a01048cfbec52353703d7 upstream. 'igrab(d_inode(dentry->d_parent))' without holding dentry->d_lock is broken because without d_lock, d_parent can be concurrently changed due to a rename(). Then if the old directory is immediately deleted, old d_parent->inode can be NULL. That causes a NULL dereference in igrab(). To fix this, use dget_parent() to safely grab a reference to the parent dentry, which pins the inode. This also eliminates the need to use d_find_any_alias() other than for the initial inode, as we no longer throw away the dentry at each step. This is an extremely hard race to hit, but it is possible. Adding a udelay() in between the reads of ->d_parent and its ->d_inode makes it reproducible on a no-journal filesystem using the following program: #include <fcntl.h> #include <unistd.h> int main() { if (fork()) { for (;;) { mkdir("dir1", 0700); int fd = open("dir1/file", O_RDWR|O_CREAT|O_SYNC); write(fd, "X", 1); close(fd); } } else { mkdir("dir2", 0700); for (;;) { rename("dir1/file", "dir2/file"); rmdir("dir1"); } } } Fixes: d59729f4e794 ("ext4: fix races in ext4_sync_parent()") Cc: stable@vger.kernel.org Signed-off-by: Eric Biggers <ebiggers@google.com> Link: https://lore.kernel.org/r/20200506183140.541194-1-ebiggers@kernel.org Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-06-22ext4: fix error pointer dereferenceJeffle Xu
commit 8418897f1bf87da0cb6936489d57a4320c32c0af upstream. Don't pass error pointers to brelse(). commit 7159a986b420 ("ext4: fix some error pointer dereferences") has fixed some cases, fix the remaining one case. Once ext4_xattr_block_find()->ext4_sb_bread() failed, error pointer is stored in @bs->bh, which will be passed to brelse() in the cleanup routine of ext4_xattr_set_handle(). This will then cause a NULL panic crash in __brelse(). BUG: unable to handle kernel NULL pointer dereference at 000000000000005b RIP: 0010:__brelse+0x1b/0x50 Call Trace: ext4_xattr_set_handle+0x163/0x5d0 ext4_xattr_set+0x95/0x110 __vfs_setxattr+0x6b/0x80 __vfs_setxattr_noperm+0x68/0x1b0 vfs_setxattr+0xa0/0xb0 setxattr+0x12c/0x1a0 path_setxattr+0x8d/0xc0 __x64_sys_setxattr+0x27/0x30 do_syscall_64+0x60/0x250 entry_SYSCALL_64_after_hwframe+0x49/0xbe In this case, @bs->bh stores '-EIO' actually. Fixes: fb265c9cb49e ("ext4: add ext4_sb_bread() to disambiguate ENOMEM cases") Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: stable@kernel.org # 2.6.19 Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/1587628004-95123-1-git-send-email-jefflexu@linux.alibaba.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-06-22ext4: fix EXT_MAX_EXTENT/INDEX to check for zeroed eh_maxHarshad Shirwadkar
commit c36a71b4e35ab35340facdd6964a00956b9fef0a upstream. If eh->eh_max is 0, EXT_MAX_EXTENT/INDEX would evaluate to unsigned (-1) resulting in illegal memory accesses. Although there is no consistent repro, we see that generic/019 sometimes crashes because of this bug. Ran gce-xfstests smoke and verified that there were no regressions. Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com> Link: https://lore.kernel.org/r/20200421023959.20879-2-harshadshirwadkar@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-05-02ext4: check for non-zero journal inum in ext4_calculate_overheadRitesh Harjani
commit f1eec3b0d0a849996ebee733b053efa71803dad5 upstream. While calculating overhead for internal journal, also check that j_inum shouldn't be 0. Otherwise we get below error with xfstests generic/050 with external journal (XXX_LOGDEV config) enabled. It could be simply reproduced with loop device with an external journal and marking blockdev as RO before mounting. [ 3337.146838] EXT4-fs error (device pmem1p2): ext4_get_journal_inode:4634: comm mount: inode #0: comm mount: iget: illegal inode # ------------[ cut here ]------------ generic_make_request: Trying to write to read-only block-device pmem1p2 (partno 2) WARNING: CPU: 107 PID: 115347 at block/blk-core.c:788 generic_make_request_checks+0x6b4/0x7d0 CPU: 107 PID: 115347 Comm: mount Tainted: G L --------- -t - 4.18.0-167.el8.ppc64le #1 NIP: c0000000006f6d44 LR: c0000000006f6d40 CTR: 0000000030041dd4 <...> NIP [c0000000006f6d44] generic_make_request_checks+0x6b4/0x7d0 LR [c0000000006f6d40] generic_make_request_checks+0x6b0/0x7d0 <...> Call Trace: generic_make_request_checks+0x6b0/0x7d0 (unreliable) generic_make_request+0x3c/0x420 submit_bio+0xd8/0x200 submit_bh_wbc+0x1e8/0x250 __sync_dirty_buffer+0xd0/0x210 ext4_commit_super+0x310/0x420 [ext4] __ext4_error+0xa4/0x1e0 [ext4] __ext4_iget+0x388/0xe10 [ext4] ext4_get_journal_inode+0x40/0x150 [ext4] ext4_calculate_overhead+0x5a8/0x610 [ext4] ext4_fill_super+0x3188/0x3260 [ext4] mount_bdev+0x778/0x8f0 ext4_mount+0x28/0x50 [ext4] mount_fs+0x74/0x230 vfs_kern_mount.part.6+0x6c/0x250 do_mount+0x2fc/0x1280 sys_mount+0x158/0x180 system_call+0x5c/0x70 EXT4-fs (pmem1p2): no journal found EXT4-fs (pmem1p2): can't get journal size EXT4-fs (pmem1p2): mounted filesystem without journal. Opts: dax,norecovery Fixes: 3c816ded78bb ("ext4: use journal inode to determine journal overhead") Reported-by: Harish Sriram <harish@linux.ibm.com> Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20200316093038.25485-1-riteshh@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-05-02ext4: convert BUG_ON's to WARN_ON's in mballoc.cTheodore Ts'o
[ Upstream commit 907ea529fc4c3296701d2bfc8b831dd2a8121a34 ] If the in-core buddy bitmap gets corrupted (or out of sync with the block bitmap), issue a WARN_ON and try to recover. In most cases this involves skipping trying to allocate out of a particular block group. We can end up declaring the file system corrupted, which is fair, since the file system probably should be checked before we proceed any further. Link: https://lore.kernel.org/r/20200414035649.293164-1-tytso@mit.edu Google-Bug-Id: 34811296 Google-Bug-Id: 34639169 Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-05-02ext4: increase wait time needed before reuse of deleted inode numbersTheodore Ts'o
[ Upstream commit a17a9d935dc4a50acefaf319d58030f1da7f115a ] Current wait times have proven to be too short to protect against inode reuses that lead to metadata inconsistencies. Now that we will retry the inode allocation if we can't find any recently deleted inodes, it's a lot safer to increase the recently deleted time from 5 seconds to a minute. Link: https://lore.kernel.org/r/20200414023925.273867-1-tytso@mit.edu Google-Bug-Id: 36602237 Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-05-02ext4: use matching invalidatepage in ext4_writepageyangerkun
[ Upstream commit c2a559bc0e7ed5a715ad6b947025b33cb7c05ea7 ] Run generic/388 with journal data mode sometimes may trigger the warning in ext4_invalidatepage. Actually, we should use the matching invalidatepage in ext4_writepage. Signed-off-by: yangerkun <yangerkun@huawei.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20200226041002.13914-1-yangerkun@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-04-29ext4: fix extent_status fragmentation for plain filesDmitry Monakhov
[ Upstream commit 4068664e3cd2312610ceac05b74c4cf1853b8325 ] Extents are cached in read_extent_tree_block(); as a result, extents are not cached for inodes with depth == 0 when we try to find the extent using ext4_find_extent(). The result of the lookup is cached in ext4_map_blocks() but is only a subset of the extent on disk. As a result, the contents of extents status cache can get very badly fragmented for certain workloads, such as a random 4k read workload. File size of /mnt/test is 33554432 (8192 blocks of 4096 bytes) ext: logical_offset: physical_offset: length: expected: flags: 0: 0.. 8191: 40960.. 49151: 8192: last,eof $ perf record -e 'ext4:ext4_es_*' /root/bin/fio --name=t --direct=0 --rw=randread --bs=4k --filesize=32M --size=32M --filename=/mnt/test $ perf script | grep ext4_es_insert_extent | head -n 10 fio 131 [000] 13.975421: ext4:ext4_es_insert_extent: dev 253,0 ino 12 es [494/1) mapped 41454 status W fio 131 [000] 13.975939: ext4:ext4_es_insert_extent: dev 253,0 ino 12 es [6064/1) mapped 47024 status W fio 131 [000] 13.976467: ext4:ext4_es_insert_extent: dev 253,0 ino 12 es [6907/1) mapped 47867 status W fio 131 [000] 13.976937: ext4:ext4_es_insert_extent: dev 253,0 ino 12 es [3850/1) mapped 44810 status W fio 131 [000] 13.977440: ext4:ext4_es_insert_extent: dev 253,0 ino 12 es [3292/1) mapped 44252 status W fio 131 [000] 13.977931: ext4:ext4_es_insert_extent: dev 253,0 ino 12 es [6882/1) mapped 47842 status W fio 131 [000] 13.978376: ext4:ext4_es_insert_extent: dev 253,0 ino 12 es [3117/1) mapped 44077 status W fio 131 [000] 13.978957: ext4:ext4_es_insert_extent: dev 253,0 ino 12 es [2896/1) mapped 43856 status W fio 131 [000] 13.979474: ext4:ext4_es_insert_extent: dev 253,0 ino 12 es [7479/1) mapped 48439 status W Fix this by caching the extents for inodes with depth == 0 in ext4_find_extent(). [ Renamed ext4_es_cache_extents() to ext4_cache_extents() since this newly added function is not in extents_cache.c, and to avoid potential visual confusion with ext4_es_cache_extent(). -TYT ] Signed-off-by: Dmitry Monakhov <dmonakhov@gmail.com> Link: https://lore.kernel.org/r/20191106122502.19986-1-dmonakhov@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-04-23ext4: do not commit super on read-only bdevEric Sandeen
[ Upstream commit c96e2b8564adfb8ac14469ebc51ddc1bfecb3ae2 ] Under some circumstances we may encounter a filesystem error on a read-only block device, and if we try to save the error info to the superblock and commit it, we'll wind up with a noisy error and backtrace, i.e.: [ 3337.146838] EXT4-fs error (device pmem1p2): ext4_get_journal_inode:4634: comm mount: inode #0: comm mount: iget: illegal inode # ------------[ cut here ]------------ generic_make_request: Trying to write to read-only block-device pmem1p2 (partno 2) WARNING: CPU: 107 PID: 115347 at block/blk-core.c:788 generic_make_request_checks+0x6b4/0x7d0 ... To avoid this, commit the error info in the superblock only if the block device is writable. Reported-by: Ritesh Harjani <riteshh@linux.ibm.com> Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Andreas Dilger <adilger@dilger.ca> Link: https://lore.kernel.org/r/4b6e774d-cc00-3469-7abb-108eb151071a@sandeen.net Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-04-23ext4: use non-movable memory for superblock readaheadRoman Gushchin
commit d87f639258a6a5980183f11876c884931ad93da2 upstream. Since commit a8ac900b8163 ("ext4: use non-movable memory for the superblock") buffers for ext4 superblock were allocated using the sb_bread_unmovable() helper which allocated buffer heads out of non-movable memory blocks. It was necessarily to not block page migrations and do not cause cma allocation failures. However commit 85c8f176a611 ("ext4: preload block group descriptors") broke this by introducing pre-reading of the ext4 superblock. The problem is that __breadahead() is using __getblk() underneath, which allocates buffer heads out of movable memory. It resulted in page migration failures I've seen on a machine with an ext4 partition and a preallocated cma area. Fix this by introducing sb_breadahead_unmovable() and __breadahead_gfp() helpers which use non-movable memory for buffer head allocations and use them for the ext4 superblock readahead. Reviewed-by: Andreas Dilger <adilger@dilger.ca> Fixes: 85c8f176a611 ("ext4: preload block group descriptors") Signed-off-by: Roman Gushchin <guro@fb.com> Link: https://lore.kernel.org/r/20200229001411.128010-1-guro@fb.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-04-21ext4: do not zeroout extents beyond i_disksizeJan Kara
commit 801674f34ecfed033b062a0f217506b93c8d5e8a upstream. We do not want to create initialized extents beyond end of file because for e2fsck it is impossible to distinguish them from a case of corrupted file size / extent tree and so it complains like: Inode 12, i_size is 147456, should be 163840. Fix? no Code in ext4_ext_convert_to_initialized() and ext4_split_convert_extents() try to make sure it does not create initialized extents beyond inode size however they check against inode->i_size which is wrong. They should instead check against EXT4_I(inode)->i_disksize which is the current inode size on disk. That's what e2fsck is going to see in case of crash before all dirty data is written. This bug manifests as generic/456 test failure (with recent enough fstests where fsx got fixed to properly pass FALLOC_KEEP_SIZE_FL flags to the kernel) when run with dioread_lock mount option. CC: stable@vger.kernel.org Fixes: 21ca087a3891 ("ext4: Do not zero out uninitialized extents beyond i_size") Reviewed-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Link: https://lore.kernel.org/r/20200331105016.8674-1-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-04-21ext4: fix incorrect inodes per group in error messageJosh Triplett
commit b9c538da4e52a7b79dfcf4cfa487c46125066dfb upstream. If ext4_fill_super detects an invalid number of inodes per group, the resulting error message printed the number of blocks per group, rather than the number of inodes per group. Fix it to print the correct value. Fixes: cd6bb35bf7f6d ("ext4: use more strict checks for inodes_per_block on mount") Link: https://lore.kernel.org/r/8be03355983a08e5d4eed480944613454d7e2550.1585434649.git.josh@joshtriplett.org Reviewed-by: Andreas Dilger <adilger@dilger.ca> Signed-off-by: Josh Triplett <josh@joshtriplett.org> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-04-21ext4: fix incorrect group count in ext4_fill_super error messageJosh Triplett
commit df41460a21b06a76437af040d90ccee03888e8e5 upstream. ext4_fill_super doublechecks the number of groups before mounting; if that check fails, the resulting error message prints the group count from the ext4_sb_info sbi, which hasn't been set yet. Print the freshly computed group count instead (which at that point has just been computed in "blocks_count"). Signed-off-by: Josh Triplett <josh@joshtriplett.org> Fixes: 4ec1102813798 ("ext4: Add sanity checks for the superblock before mounting the filesystem") Link: https://lore.kernel.org/r/8b957cd1513fcc4550fe675c10bcce2175c33a49.1585431964.git.josh@joshtriplett.org Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-04-17ext4: fix a data race at inode->i_blocksQian Cai
commit 28936b62e71e41600bab319f262ea9f9b1027629 upstream. inode->i_blocks could be accessed concurrently as noticed by KCSAN, BUG: KCSAN: data-race in ext4_do_update_inode [ext4] / inode_add_bytes write to 0xffff9a00d4b982d0 of 8 bytes by task 22100 on cpu 118: inode_add_bytes+0x65/0xf0 __inode_add_bytes at fs/stat.c:689 (inlined by) inode_add_bytes at fs/stat.c:702 ext4_mb_new_blocks+0x418/0xca0 [ext4] ext4_ext_map_blocks+0x1a6b/0x27b0 [ext4] ext4_map_blocks+0x1a9/0x950 [ext4] _ext4_get_block+0xfc/0x270 [ext4] ext4_get_block_unwritten+0x33/0x50 [ext4] __block_write_begin_int+0x22e/0xae0 __block_write_begin+0x39/0x50 ext4_write_begin+0x388/0xb50 [ext4] ext4_da_write_begin+0x35f/0x8f0 [ext4] generic_perform_write+0x15d/0x290 ext4_buffered_write_iter+0x11f/0x210 [ext4] ext4_file_write_iter+0xce/0x9e0 [ext4] new_sync_write+0x29c/0x3b0 __vfs_write+0x92/0xa0 vfs_write+0x103/0x260 ksys_write+0x9d/0x130 __x64_sys_write+0x4c/0x60 do_syscall_64+0x91/0xb05 entry_SYSCALL_64_after_hwframe+0x49/0xbe read to 0xffff9a00d4b982d0 of 8 bytes by task 8 on cpu 65: ext4_do_update_inode+0x4a0/0xf60 [ext4] ext4_inode_blocks_set at fs/ext4/inode.c:4815 ext4_mark_iloc_dirty+0xaf/0x160 [ext4] ext4_mark_inode_dirty+0x129/0x3e0 [ext4] ext4_convert_unwritten_extents+0x253/0x2d0 [ext4] ext4_convert_unwritten_io_end_vec+0xc5/0x150 [ext4] ext4_end_io_rsv_work+0x22c/0x350 [ext4] process_one_work+0x54f/0xb90 worker_thread+0x80/0x5f0 kthread+0x1cd/0x1f0 ret_from_fork+0x27/0x50 4 locks held by kworker/u256:0/8: #0: ffff9a025abc4328 ((wq_completion)ext4-rsv-conversion){+.+.}, at: process_one_work+0x443/0xb90 #1: ffffab5a862dbe20 ((work_completion)(&ei->i_rsv_conversion_work)){+.+.}, at: process_one_work+0x443/0xb90 #2: ffff9a025a9d0f58 (jbd2_handle){++++}, at: start_this_handle+0x1c1/0x9d0 [jbd2] #3: ffff9a00d4b985d8 (&(&ei->i_raw_lock)->rlock){+.+.}, at: ext4_do_update_inode+0xaa/0xf60 [ext4] irq event stamp: 3009267 hardirqs last enabled at (3009267): [<ffffffff980da9b7>] __find_get_block+0x107/0x790 hardirqs last disabled at (3009266): [<ffffffff980da8f9>] __find_get_block+0x49/0x790 softirqs last enabled at (3009230): [<ffffffff98a0034c>] __do_softirq+0x34c/0x57c softirqs last disabled at (3009223): [<ffffffff97cc67a2>] irq_exit+0xa2/0xc0 Reported by Kernel Concurrency Sanitizer on: CPU: 65 PID: 8 Comm: kworker/u256:0 Tainted: G L 5.6.0-rc2-next-20200221+ #7 Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019 Workqueue: ext4-rsv-conversion ext4_end_io_rsv_work [ext4] The plain read is outside of inode->i_lock critical section which results in a data race. Fix it by adding READ_ONCE() there. Link: https://lore.kernel.org/r/20200222043258.2279-1-cai@lca.pw Signed-off-by: Qian Cai <cai@lca.pw> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-03-05ext4: potential crash on allocation error in ext4_alloc_flex_bg_array()Dan Carpenter
commit 37b0b6b8b99c0e1c1f11abbe7cf49b6d03795b3f upstream. If sbi->s_flex_groups_allocated is zero and the first allocation fails then this code will crash. The problem is that "i--" will set "i" to -1 but when we compare "i >= sbi->s_flex_groups_allocated" then the -1 is type promoted to unsigned and becomes UINT_MAX. Since UINT_MAX is more than zero, the condition is true so we call kvfree(new_groups[-1]). The loop will carry on freeing invalid memory until it crashes. Fixes: 7c990728b99e ("ext4: fix potential race between s_flex_groups online resizing and access") Reviewed-by: Suraj Jitindar Singh <surajjs@amazon.com> Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Cc: stable@kernel.org Link: https://lore.kernel.org/r/20200228092142.7irbc44yaz3by7nb@kili.mountain Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-02-28ext4: fix race between writepages and enabling EXT4_EXTENTS_FLEric Biggers
commit cb85f4d23f794e24127f3e562cb3b54b0803f456 upstream. If EXT4_EXTENTS_FL is set on an inode while ext4_writepages() is running on it, the following warning in ext4_add_complete_io() can be hit: WARNING: CPU: 1 PID: 0 at fs/ext4/page-io.c:234 ext4_put_io_end_defer+0xf0/0x120 Here's a minimal reproducer (not 100% reliable) (root isn't required): while true; do sync done & while true; do rm -f file touch file chattr -e file echo X >> file chattr +e file done The problem is that in ext4_writepages(), ext4_should_dioread_nolock() (which only returns true on extent-based files) is checked once to set the number of reserved journal credits, and also again later to select the flags for ext4_map_blocks() and copy the reserved journal handle to ext4_io_end::handle. But if EXT4_EXTENTS_FL is being concurrently set, the first check can see dioread_nolock disabled while the later one can see it enabled, causing the reserved handle to unexpectedly be NULL. Since changing EXT4_EXTENTS_FL is uncommon, and there may be other races related to doing so as well, fix this by synchronizing changing EXT4_EXTENTS_FL with ext4_writepages() via the existing s_writepages_rwsem (previously called s_journal_flag_rwsem). This was originally reported by syzbot without a reproducer at https://syzkaller.appspot.com/bug?extid=2202a584a00fffd19fbf, but now that dioread_nolock is the default I also started seeing this when running syzkaller locally. Link: https://lore.kernel.org/r/20200219183047.47417-3-ebiggers@kernel.org Reported-by: syzbot+2202a584a00fffd19fbf@syzkaller.appspotmail.com Fixes: 6b523df4fb5a ("ext4: use transaction reservation for extent conversion in ext4_end_io") Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz> Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-02-28ext4: rename s_journal_flag_rwsem to s_writepages_rwsemEric Biggers
commit bbd55937de8f2754adc5792b0f8e5ff7d9c0420e upstream. In preparation for making s_journal_flag_rwsem synchronize ext4_writepages() with changes to both the EXTENTS and JOURNAL_DATA flags (rather than just JOURNAL_DATA as it does currently), rename it to s_writepages_rwsem. Link: https://lore.kernel.org/r/20200219183047.47417-2-ebiggers@kernel.org Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz> Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-02-28ext4: fix mount failure with quota configured as moduleJan Kara
commit 9db176bceb5c5df4990486709da386edadc6bd1d upstream. When CONFIG_QFMT_V2 is configured as a module, the test in ext4_feature_set_ok() fails and so mount of filesystems with quota or project features fails. Fix the test to use IS_ENABLED macro which works properly even for modules. Link: https://lore.kernel.org/r/20200221100835.9332-1-jack@suse.cz Fixes: d65d87a07476 ("ext4: improve explanation of a mount failure caused by a misconfigured kernel") Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-02-28ext4: fix potential race between s_flex_groups online resizing and accessSuraj Jitindar Singh
commit 7c990728b99ed6fbe9c75fc202fce1172d9916da upstream. During an online resize an array of s_flex_groups structures gets replaced so it can get enlarged. If there is a concurrent access to the array and this memory has been reused then this can lead to an invalid memory access. The s_flex_group array has been converted into an array of pointers rather than an array of structures. This is to ensure that the information contained in the structures cannot get out of sync during a resize due to an accessor updating the value in the old structure after it has been copied but before the array pointer is updated. Since the structures them- selves are no longer copied but only the pointers to them this case is mitigated. Link: https://bugzilla.kernel.org/show_bug.cgi?id=206443 Link: https://lore.kernel.org/r/20200221053458.730016-4-tytso@mit.edu Signed-off-by: Suraj Jitindar Singh <surajjs@amazon.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-02-28ext4: fix potential race between s_group_info online resizing and accessSuraj Jitindar Singh
commit df3da4ea5a0fc5d115c90d5aa6caa4dd433750a7 upstream. During an online resize an array of pointers to s_group_info gets replaced so it can get enlarged. If there is a concurrent access to the array in ext4_get_group_info() and this memory has been reused then this can lead to an invalid memory access. Link: https://bugzilla.kernel.org/show_bug.cgi?id=206443 Link: https://lore.kernel.org/r/20200221053458.730016-3-tytso@mit.edu Signed-off-by: Suraj Jitindar Singh <surajjs@amazon.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Balbir Singh <sblbir@amazon.com> Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-02-28ext4: fix potential race between online resizing and write operationsTheodore Ts'o
commit 1d0c3924a92e69bfa91163bda83c12a994b4d106 upstream. During an online resize an array of pointers to buffer heads gets replaced so it can get enlarged. If there is a racing block allocation or deallocation which uses the old array, and the old array has gotten reused this can lead to a GPF or some other random kernel memory getting modified. Link: https://bugzilla.kernel.org/show_bug.cgi?id=206443 Link: https://lore.kernel.org/r/20200221053458.730016-2-tytso@mit.edu Reported-by: Suraj Jitindar Singh <surajjs@amazon.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-02-28ext4: add cond_resched() to __ext4_find_entry()Shijie Luo
commit 9424ef56e13a1f14c57ea161eed3ecfdc7b2770e upstream. We tested a soft lockup problem in linux 4.19 which could also be found in linux 5.x. When dir inode takes up a large number of blocks, and if the directory is growing when we are searching, it's possible the restart branch could be called many times, and the do while loop could hold cpu a long time. Here is the call trace in linux 4.19. [ 473.756186] Call trace: [ 473.756196] dump_backtrace+0x0/0x198 [ 473.756199] show_stack+0x24/0x30 [ 473.756205] dump_stack+0xa4/0xcc [ 473.756210] watchdog_timer_fn+0x300/0x3e8 [ 473.756215] __hrtimer_run_queues+0x114/0x358 [ 473.756217] hrtimer_interrupt+0x104/0x2d8 [ 473.756222] arch_timer_handler_virt+0x38/0x58 [ 473.756226] handle_percpu_devid_irq+0x90/0x248 [ 473.756231] generic_handle_irq+0x34/0x50 [ 473.756234] __handle_domain_irq+0x68/0xc0 [ 473.756236] gic_handle_irq+0x6c/0x150 [ 473.756238] el1_irq+0xb8/0x140 [ 473.756286] ext4_es_lookup_extent+0xdc/0x258 [ext4] [ 473.756310] ext4_map_blocks+0x64/0x5c0 [ext4] [ 473.756333] ext4_getblk+0x6c/0x1d0 [ext4] [ 473.756356] ext4_bread_batch+0x7c/0x1f8 [ext4] [ 473.756379] ext4_find_entry+0x124/0x3f8 [ext4] [ 473.756402] ext4_lookup+0x8c/0x258 [ext4] [ 473.756407] __lookup_hash+0x8c/0xe8 [ 473.756411] filename_create+0xa0/0x170 [ 473.756413] do_mkdirat+0x6c/0x140 [ 473.756415] __arm64_sys_mkdirat+0x28/0x38 [ 473.756419] el0_svc_common+0x78/0x130 [ 473.756421] el0_svc_handler+0x38/0x78 [ 473.756423] el0_svc+0x8/0xc [ 485.755156] watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [tmp:5149] Add cond_resched() to avoid soft lockup and to provide a better system responding. Link: https://lore.kernel.org/r/20200215080206.13293-1-luoshijie1@huawei.com Signed-off-by: Shijie Luo <luoshijie1@huawei.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz> Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-02-28ext4: fix a data race in EXT4_I(inode)->i_disksizeQian Cai
commit 35df4299a6487f323b0aca120ea3f485dfee2ae3 upstream. EXT4_I(inode)->i_disksize could be accessed concurrently as noticed by KCSAN, BUG: KCSAN: data-race in ext4_write_end [ext4] / ext4_writepages [ext4] write to 0xffff91c6713b00f8 of 8 bytes by task 49268 on cpu 127: ext4_write_end+0x4e3/0x750 [ext4] ext4_update_i_disksize at fs/ext4/ext4.h:3032 (inlined by) ext4_update_inode_size at fs/ext4/ext4.h:3046 (inlined by) ext4_write_end at fs/ext4/inode.c:1287 generic_perform_write+0x208/0x2a0 ext4_buffered_write_iter+0x11f/0x210 [ext4] ext4_file_write_iter+0xce/0x9e0 [ext4] new_sync_write+0x29c/0x3b0 __vfs_write+0x92/0xa0 vfs_write+0x103/0x260 ksys_write+0x9d/0x130 __x64_sys_write+0x4c/0x60 do_syscall_64+0x91/0xb47 entry_SYSCALL_64_after_hwframe+0x49/0xbe read to 0xffff91c6713b00f8 of 8 bytes by task 24872 on cpu 37: ext4_writepages+0x10ac/0x1d00 [ext4] mpage_map_and_submit_extent at fs/ext4/inode.c:2468 (inlined by) ext4_writepages at fs/ext4/inode.c:2772 do_writepages+0x5e/0x130 __writeback_single_inode+0xeb/0xb20 writeback_sb_inodes+0x429/0x900 __writeback_inodes_wb+0xc4/0x150 wb_writeback+0x4bd/0x870 wb_workfn+0x6b4/0x960 process_one_work+0x54c/0xbe0 worker_thread+0x80/0x650 kthread+0x1e0/0x200 ret_from_fork+0x27/0x50 Reported by Kernel Concurrency Sanitizer on: CPU: 37 PID: 24872 Comm: kworker/u261:2 Tainted: G W O L 5.5.0-next-20200204+ #5 Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019 Workqueue: writeback wb_workfn (flush-7:0) Since only the read is operating as lockless (outside of the "i_data_sem"), load tearing could introduce a logic bug. Fix it by adding READ_ONCE() for the read and WRITE_ONCE() for the write. Signed-off-by: Qian Cai <cai@lca.pw> Link: https://lore.kernel.org/r/1581085751-31793-1-git-send-email-cai@lca.pw Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-02-24ext4: fix deadlock allocating bio_post_read_ctx from mempoolEric Biggers
[ Upstream commit 68e45330e341dad2d3a0a3f8ef2ec46a2a0a3bbc ] Without any form of coordination, any case where multiple allocations from the same mempool are needed at a time to make forward progress can deadlock under memory pressure. This is the case for struct bio_post_read_ctx, as one can be allocated to decrypt a Merkle tree page during fsverity_verify_bio(), which itself is running from a post-read callback for a data bio which has its own struct bio_post_read_ctx. Fix this by freeing the first bio_post_read_ctx before calling fsverity_verify_bio(). This works because verity (if enabled) is always the last post-read step. This deadlock can be reproduced by trying to read from an encrypted verity file after reducing NUM_PREALLOC_POST_READ_CTXS to 1 and patching mempool_alloc() to pretend that pool->alloc() always fails. Note that since NUM_PREALLOC_POST_READ_CTXS is actually 128, to actually hit this bug in practice would require reading from lots of encrypted verity files at the same time. But it's theoretically possible, as N available objects isn't enough to guarantee forward progress when > N/2 threads each need 2 objects at a time. Fixes: 22cfe4b48ccb ("ext4: add fs-verity read support") Signed-off-by: Eric Biggers <ebiggers@google.com> Link: https://lore.kernel.org/r/20191231181222.47684-1-ebiggers@kernel.org Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-02-24ext4: fix ext4_dax_read/write inode locking sequence for IOCB_NOWAITRitesh Harjani
[ Upstream commit f629afe3369e9885fd6e9cc7a4f514b6a65cf9e9 ] Apparently our current rwsem code doesn't like doing the trylock, then lock for real scheme. So change our dax read/write methods to just do the trylock for the RWF_NOWAIT case. This seems to fix AIM7 regression in some scalable filesystems upto ~25% in some cases. Claimed in commit 942491c9e6d6 ("xfs: fix AIM7 regression") Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Matthew Bobrowski <mbobrowski@mbobrowski.org> Tested-by: Joseph Qi <joseph.qi@linux.alibaba.com> Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com> Link: https://lore.kernel.org/r/20191212055557.11151-2-riteshh@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-02-19ext4: choose hardlimit when softlimit is larger than hardlimit in ↵Chengguang Xu
ext4_statfs_project() [ Upstream commit 57c32ea42f8e802bda47010418e25043e0c9337f ] Setting softlimit larger than hardlimit seems meaningless for disk quota but currently it is allowed. In this case, there may be a bit of comfusion for users when they run df comamnd to directory which has project quota. For example, we set 20M softlimit and 10M hardlimit of block usage limit for project quota of test_dir(project id 123). [root@hades mnt_ext4]# repquota -P -a *** Report for project quotas on device /dev/loop0 Block grace time: 7days; Inode grace time: 7days Block limits File limits Project used soft hard grace used soft hard grace ---------------------------------------------------------------------- 0 -- 13 0 0 2 0 0 123 -- 10237 20480 10240 5 200 100 The result of df command as below: [root@hades mnt_ext4]# df -h test_dir Filesystem Size Used Avail Use% Mounted on /dev/loop0 20M 10M 10M 50% /home/cgxu/test/mnt_ext4 Even though it looks like there is another 10M free space to use, if we write new data to diretory test_dir(inherit project id), the write will fail with errno(-EDQUOT). After this patch, the df result looks like below. [root@hades mnt_ext4]# df -h test_dir Filesystem Size Used Avail Use% Mounted on /dev/loop0 10M 10M 3.0K 100% /home/cgxu/test/mnt_ext4 Signed-off-by: Chengguang Xu <cgxu519@mykernel.net> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20191016022501.760-1-cgxu519@mykernel.net Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-02-19ext4: improve explanation of a mount failure caused by a misconfigured kernelTheodore Ts'o
commit d65d87a07476aa17df2dcb3ad18c22c154315bec upstream. If CONFIG_QFMT_V2 is not enabled, but CONFIG_QUOTA is enabled, when a user tries to mount a file system with the quota or project quota enabled, the kernel will emit a very confusing messsage: EXT4-fs warning (device vdc): ext4_enable_quotas:5914: Failed to enable quota tracking (type=0, err=-3). Please run e2fsck to fix. EXT4-fs (vdc): mount failed We will now report an explanatory message indicating which kernel configuration options have to be enabled, to avoid customer/sysadmin confusion. Link: https://lore.kernel.org/r/20200215012738.565735-1-tytso@mit.edu Google-Bug-Id: 149093531 Fixes: 7c319d328505b778 ("ext4: make quota as first class supported feature") Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-02-19ext4: add cond_resched() to ext4_protect_reserved_inodeShijie Luo
commit af133ade9a40794a37104ecbcc2827c0ea373a3c upstream. When journal size is set too big by "mkfs.ext4 -J size=", or when we mount a crafted image to make journal inode->i_size too big, the loop, "while (i < num)", holds cpu too long. This could cause soft lockup. [ 529.357541] Call trace: [ 529.357551] dump_backtrace+0x0/0x198 [ 529.357555] show_stack+0x24/0x30 [ 529.357562] dump_stack+0xa4/0xcc [ 529.357568] watchdog_timer_fn+0x300/0x3e8 [ 529.357574] __hrtimer_run_queues+0x114/0x358 [ 529.357576] hrtimer_interrupt+0x104/0x2d8 [ 529.357580] arch_timer_handler_virt+0x38/0x58 [ 529.357584] handle_percpu_devid_irq+0x90/0x248 [ 529.357588] generic_handle_irq+0x34/0x50 [ 529.357590] __handle_domain_irq+0x68/0xc0 [ 529.357593] gic_handle_irq+0x6c/0x150 [ 529.357595] el1_irq+0xb8/0x140 [ 529.357599] __ll_sc_atomic_add_return_acquire+0x14/0x20 [ 529.357668] ext4_map_blocks+0x64/0x5c0 [ext4] [ 529.357693] ext4_setup_system_zone+0x330/0x458 [ext4] [ 529.357717] ext4_fill_super+0x2170/0x2ba8 [ext4] [ 529.357722] mount_bdev+0x1a8/0x1e8 [ 529.357746] ext4_mount+0x44/0x58 [ext4] [ 529.357748] mount_fs+0x50/0x170 [ 529.357752] vfs_kern_mount.part.9+0x54/0x188 [ 529.357755] do_mount+0x5ac/0xd78 [ 529.357758] ksys_mount+0x9c/0x118 [ 529.357760] __arm64_sys_mount+0x28/0x38 [ 529.357764] el0_svc_common+0x78/0x130 [ 529.357766] el0_svc_handler+0x38/0x78 [ 529.357769] el0_svc+0x8/0xc [ 541.356516] watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [mount:18674] Link: https://lore.kernel.org/r/20200211011752.29242-1-luoshijie1@huawei.com Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Shijie Luo <luoshijie1@huawei.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-02-19ext4: fix checksum errors with indexed dirsJan Kara
commit 48a34311953d921235f4d7bbd2111690d2e469cf upstream. DIR_INDEX has been introduced as a compat ext4 feature. That means that even kernels / tools that don't understand the feature may modify the filesystem. This works because for kernels not understanding indexed dir format, internal htree nodes appear just as empty directory entries. Index dir aware kernels then check the htree structure is still consistent before using the data. This all worked reasonably well until metadata checksums were introduced. The problem is that these effectively made DIR_INDEX only ro-compatible because internal htree nodes store checksums in a different place than normal directory blocks. Thus any modification ignorant to DIR_INDEX (or just clearing EXT4_INDEX_FL from the inode) will effectively cause checksum mismatch and trigger kernel errors. So we have to be more careful when dealing with indexed directories on filesystems with checksumming enabled. 1) We just disallow loading any directory inodes with EXT4_INDEX_FL when DIR_INDEX is not enabled. This is harsh but it should be very rare (it means someone disabled DIR_INDEX on existing filesystem and didn't run e2fsck), e2fsck can fix the problem, and we don't want to answer the difficult question: "Should we rather corrupt the directory more or should we ignore that DIR_INDEX feature is not set?" 2) When we find out htree structure is corrupted (but the filesystem and the directory should in support htrees), we continue just ignoring htree information for reading but we refuse to add new entries to the directory to avoid corrupting it more. Link: https://lore.kernel.org/r/20200210144316.22081-1-jack@suse.cz Fixes: dbe89444042a ("ext4: Calculate and verify checksums for htree nodes") Reviewed-by: Andreas Dilger <adilger@dilger.ca> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-02-19ext4: fix support for inode sizes > 1024 bytesTheodore Ts'o
commit 4f97a68192bd33b9963b400759cef0ca5963af00 upstream. A recent commit, 9803387c55f7 ("ext4: validate the debug_want_extra_isize mount option at parse time"), moved mount-time checks around. One of those changes moved the inode size check before the blocksize variable was set to the blocksize of the file system. After 9803387c55f7 was set to the minimum allowable blocksize, which in practice on most systems would be 1024 bytes. This cuased file systems with inode sizes larger than 1024 bytes to be rejected with a message: EXT4-fs (sdXX): unsupported inode size: 4096 Fixes: 9803387c55f7 ("ext4: validate the debug_want_extra_isize mount option at parse time") Link: https://lore.kernel.org/r/20200206225252.GA3673@mit.edu Reported-by: Herbert Poetzl <herbert@13thfloor.at> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-02-19ext4: don't assume that mmp_nodename/bdevname have NULAndreas Dilger
commit 14c9ca0583eee8df285d68a0e6ec71053efd2228 upstream. Don't assume that the mmp_nodename and mmp_bdevname strings are NUL terminated, since they are filled in by snprintf(), which is not guaranteed to do so. Link: https://lore.kernel.org/r/1580076215-1048-1-git-send-email-adilger@dilger.ca Signed-off-by: Andreas Dilger <adilger@dilger.ca> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-02-11ext4: fix race conditions in ->d_compare() and ->d_hash()Eric Biggers
commit ec772f01307a2c06ebf6cdd221e6b518a71ddae7 upstream. Since ->d_compare() and ->d_hash() can be called in RCU-walk mode, ->d_parent and ->d_inode can be concurrently modified, and in particular, ->d_inode may be changed to NULL. For ext4_d_hash() this resulted in a reproducible NULL dereference if a lookup is done in a directory being deleted, e.g. with: int main() { if (fork()) { for (;;) { mkdir("subdir", 0700); rmdir("subdir"); } } else { for (;;) access("subdir/file", 0); } } ... or by running the 't_encrypted_d_revalidate' program from xfstests. Both repros work in any directory on a filesystem with the encoding feature, even if the directory doesn't actually have the casefold flag. I couldn't reproduce a crash in ext4_d_compare(), but it appears that a similar crash is possible there. Fix these bugs by reading ->d_parent and ->d_inode using READ_ONCE() and falling back to the case sensitive behavior if the inode is NULL. Reported-by: Al Viro <viro@zeniv.linux.org.uk> Fixes: b886ee3e778e ("ext4: Support case-insensitive file name lookups") Cc: <stable@vger.kernel.org> # v5.2+ Signed-off-by: Eric Biggers <ebiggers@google.com> Link: https://lore.kernel.org/r/20200124041234.159740-1-ebiggers@kernel.org Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-02-11ext4: fix deadlock allocating crypto bounce page from mempoolEric Biggers
commit 547c556f4db7c09447ecf5f833ab6aaae0c5ab58 upstream. ext4_writepages() on an encrypted file has to encrypt the data, but it can't modify the pagecache pages in-place, so it encrypts the data into bounce pages and writes those instead. All bounce pages are allocated from a mempool using GFP_NOFS. This is not correct use of a mempool, and it can deadlock. This is because GFP_NOFS includes __GFP_DIRECT_RECLAIM, which enables the "never fail" mode for mempool_alloc() where a failed allocation will fall back to waiting for one of the preallocated elements in the pool. But since this mode is used for all a bio's pages and not just the first, it can deadlock waiting for pages already in the bio to be freed. This deadlock can be reproduced by patching mempool_alloc() to pretend that pool->alloc() always fails (so that it always falls back to the preallocations), and then creating an encrypted file of size > 128 KiB. Fix it by only using GFP_NOFS for the first page in the bio. For subsequent pages just use GFP_NOWAIT, and if any of those fail, just submit the bio and start a new one. This will need to be fixed in f2fs too, but that's less straightforward. Fixes: c9af28fdd449 ("ext4 crypto: don't let data integrity writebacks fail with ENOMEM") Cc: stable@vger.kernel.org Signed-off-by: Eric Biggers <ebiggers@google.com> Link: https://lore.kernel.org/r/20191231181149.47619-1-ebiggers@kernel.org Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>