aboutsummaryrefslogtreecommitdiffstats
path: root/fs/ext4
AgeCommit message (Collapse)Author
2017-07-15ext4: check return value of kstrtoull correctly in reserved_clusters_storeChao Yu
commit 1ea1516fbbab2b30bf98c534ecaacba579a35208 upstream. kstrtoull returns 0 on success, however, in reserved_clusters_store we will return -EINVAL if kstrtoull returns 0, it makes us fail to update reserved_clusters value through sysfs. Fixes: 76d33bca5581b1dd5c3157fa168db849a784ada4 Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Miao Xie <miaoxie@huawei.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-05-29ext4: fix fdatasync(2) after extent manipulation operationsJan Kara
Currently, extent manipulation operations such as hole punch, range zeroing, or extent shifting do not record the fact that file data has changed and thus fdatasync(2) has a work to do. As a result if we crash e.g. after a punch hole and fdatasync, user can still possibly see the punched out data after journal replay. Test generic/392 fails due to these problems. Fix the problem by properly marking that file data has changed in these operations. CC: stable@vger.kernel.org Fixes: a4bb6b64e39abc0e41ca077725f2a72c868e7622 Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-05-26ext4: fix data corruption for mmap writesJan Kara
mpage_submit_page() can race with another process growing i_size and writing data via mmap to the written-back page. As mpage_submit_page() samples i_size too early, it may happen that ext4_bio_write_page() zeroes out too large tail of the page and thus corrupts user data. Fix the problem by sampling i_size only after the page has been write-protected in page tables by clear_page_dirty_for_io() call. Reported-by: Michael Zimmer <michael@swarm64.com> CC: stable@vger.kernel.org Fixes: cb20d5188366f04d96d2e07b1240cc92170ade40 Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-05-26ext4: fix data corruption with EXT4_GET_BLOCKS_ZEROJan Kara
When ext4_map_blocks() is called with EXT4_GET_BLOCKS_ZERO to zero-out allocated blocks and these blocks are actually converted from unwritten extent the following race can happen: CPU0 CPU1 page fault page fault ... ... ext4_map_blocks() ext4_ext_map_blocks() ext4_ext_handle_unwritten_extents() ext4_ext_convert_to_initialized() - zero out converted extent ext4_zeroout_es() - inserts extent as initialized in status tree ext4_map_blocks() ext4_es_lookup_extent() - finds initialized extent write data ext4_issue_zeroout() - zeroes out new extent overwriting data This problem can be reproduced by generic/340 for the fallocated case for the last block in the file. Fix the problem by avoiding zeroing out the area we are mapping with ext4_map_blocks() in ext4_ext_convert_to_initialized(). It is pointless to zero out this area in the first place as the caller asked us to convert the area to initialized because he is just going to write data there before the transaction finishes. To achieve this we delete the special case of zeroing out full extent as that will be handled by the cases below zeroing only the part of the extent that needs it. We also instruct ext4_split_extent() that the middle of extent being split contains data so that ext4_split_extent_at() cannot zero out full extent in case of ENOSPC. CC: stable@vger.kernel.org Fixes: 12735f881952c32b31bc4e433768f18489f79ec9 Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-05-24ext4: fix quota charging for shared xattr blocksTahsin Erdogan
ext4_xattr_block_set() calls dquot_alloc_block() to charge for an xattr block when new references are made. However if dquot_initialize() hasn't been called on an inode, request for charging is effectively ignored because ext4_inode_info->i_dquot is not initialized yet. Add dquot_initialize() to call paths that lead to ext4_xattr_block_set(). Signed-off-by: Tahsin Erdogan <tahsin@google.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz>
2017-05-24ext4: remove redundant check for encrypted file on dio write pathEric Biggers
Currently we don't allow direct I/O on encrypted regular files, so in such cases we return 0 early in ext4_direct_IO(). There was also an additional BUG_ON() check in ext4_direct_IO_write(), but it can never be hit because of the earlier check for the exact same condition in ext4_direct_IO(). There was also no matching check on the read path, which made the write path specific check seem very ad-hoc. Just remove the unnecessary BUG_ON(). Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: David Gstir <david@sigma-star.at> Reviewed-by: Jan Kara <jack@suse.cz>
2017-05-24ext4: remove unused d_name argument from ext4_search_dir() et al.Eric Biggers
Now that we are passing a struct ext4_filename, we do not need to pass around the original struct qstr too. Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz>
2017-05-24ext4: fix off-by-one error when writing back pages before dio readEric Biggers
The 'lend' argument of filemap_write_and_wait_range() is inclusive, so we need to subtract 1 from pos + count. Note that 'count' is guaranteed to be nonzero since ext4_file_read_iter() returns early when given a 0 count. Fixes: 16c54688592c ("ext4: Allow parallel DIO reads") Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz>
2017-05-24ext4: fix off-by-one on max nr_pages in ext4_find_unwritten_pgoff()Eryu Guan
ext4_find_unwritten_pgoff() is used to search for offset of hole or data in page range [index, end] (both inclusive), and the max number of pages to search should be at least one, if end == index. Otherwise the only page is missed and no hole or data is found, which is not correct. When block size is smaller than page size, this can be demonstrated by preallocating a file with size smaller than page size and writing data to the last block. E.g. run this xfs_io command on a 1k block size ext4 on x86_64 host. # xfs_io -fc "falloc 0 3k" -c "pwrite 2k 1k" \ -c "seek -d 0" /mnt/ext4/testfile wrote 1024/1024 bytes at offset 2048 1 KiB, 1 ops; 0.0000 sec (42.459 MiB/sec and 43478.2609 ops/sec) Whence Result DATA EOF Data at offset 2k was missed, and lseek(2) returned ENXIO. This is unconvered by generic/285 subtest 07 and 08 on ppc64 host, where pagesize is 64k. Because a recent change to generic/285 reduced the preallocated file size to smaller than 64k. Signed-off-by: Eryu Guan <eguan@redhat.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz>
2017-05-21ext4: keep existing extra fields when inode expandsKonstantin Khlebnikov
ext4_expand_extra_isize() should clear only space between old and new size. Fixes: 6dd4ee7cab7e # v2.6.23 Cc: stable@vger.kernel.org Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-05-21ext4: handle the rest of ext4_mb_load_buddy() ENOMEM errorsKonstantin Khlebnikov
I've got another report about breaking ext4 by ENOMEM error returned from ext4_mb_load_buddy() caused by memory shortage in memory cgroup. This time inside ext4_discard_preallocations(). This patch replaces ext4_error() with ext4_warning() where errors returned from ext4_mb_load_buddy() are not fatal and handled by caller: * ext4_mb_discard_group_preallocations() - called before generating ENOSPC, we'll try to discard other group or return ENOSPC into user-space. * ext4_trim_all_free() - just stop trimming and return ENOMEM from ioctl. Some callers cannot handle errors, thus __GFP_NOFAIL is used for them: * ext4_discard_preallocations() * ext4_mb_discard_lg_preallocations() Fixes: adb7ef600cc9 ("ext4: use __GFP_NOFAIL in ext4_free_blocks()") Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-05-21ext4: fix off-by-in in loop termination in ext4_find_unwritten_pgoff()Jan Kara
There is an off-by-one error in loop termination conditions in ext4_find_unwritten_pgoff() since 'end' may index a page beyond end of desired range if 'endoff' is page aligned. It doesn't have any visible effects but still it is good to fix it. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-05-21ext4: fix SEEK_HOLEJan Kara
Currently, SEEK_HOLE implementation in ext4 may both return that there's a hole at some offset although that offset already has data and skip some holes during a search for the next hole. The first problem is demostrated by: xfs_io -c "falloc 0 256k" -c "pwrite 0 56k" -c "seek -h 0" file wrote 57344/57344 bytes at offset 0 56 KiB, 14 ops; 0.0000 sec (2.054 GiB/sec and 538461.5385 ops/sec) Whence Result HOLE 0 Where we can see that SEEK_HOLE wrongly returned offset 0 as containing a hole although we have written data there. The second problem can be demonstrated by: xfs_io -c "falloc 0 256k" -c "pwrite 0 56k" -c "pwrite 128k 8k" -c "seek -h 0" file wrote 57344/57344 bytes at offset 0 56 KiB, 14 ops; 0.0000 sec (1.978 GiB/sec and 518518.5185 ops/sec) wrote 8192/8192 bytes at offset 131072 8 KiB, 2 ops; 0.0000 sec (2 GiB/sec and 500000.0000 ops/sec) Whence Result HOLE 139264 Where we can see that hole at offsets 56k..128k has been ignored by the SEEK_HOLE call. The underlying problem is in the ext4_find_unwritten_pgoff() which is just buggy. In some cases it fails to update returned offset when it finds a hole (when no pages are found or when the first found page has higher index than expected), in some cases conditions for detecting hole are just missing (we fail to detect a situation where indices of returned pages are not contiguous). Fix ext4_find_unwritten_pgoff() to properly detect non-contiguous page indices and also handle all cases where we got less pages then expected in one place and handle it properly there. CC: stable@vger.kernel.org Fixes: c8c0df241cc2719b1262e627f999638411934f60 CC: Zheng Liu <wenqing.lz@taobao.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-05-21ext4: clear lockdep subtype for quota files on quota offJan Kara
Quota files have special ranking of i_data_sem lock. We inform lockdep about it when turning on quotas however when turning quotas off, we don't clear the lockdep subclass from i_data_sem lock and thus when the inode gets later reused for a normal file or directory, lockdep gets confused and complains about possible deadlocks. Fix the problem by resetting lockdep subclass of i_data_sem on quota off. Cc: stable@vger.kernel.org Fixes: daf647d2dd58cec59570d7698a45b98e580f2076 Reported-and-tested-by: Ross Zwisler <ross.zwisler@linux.intel.com> Reviewed-by: Andreas Dilger <adilger@dilger.ca> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-05-13dax, xfs, ext4: compile out iomap-dax paths in the FS_DAX=n caseDan Williams
Tetsuo reports: fs/built-in.o: In function `xfs_file_iomap_end': xfs_iomap.c:(.text+0xe0ef9): undefined reference to `put_dax' fs/built-in.o: In function `xfs_file_iomap_begin': xfs_iomap.c:(.text+0xe1a7f): undefined reference to `dax_get_by_host' make: *** [vmlinux] Error 1 $ grep DAX .config CONFIG_DAX=m # CONFIG_DEV_DAX is not set # CONFIG_FS_DAX is not set When FS_DAX=n we can/must throw away the dax code in filesystems. Implement 'fs_' versions of dax_get_by_host() and put_dax() that are nops in the FS_DAX=n case. Cc: <linux-xfs@vger.kernel.org> Cc: <linux-ext4@vger.kernel.org> Cc: Jan Kara <jack@suse.com> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: "Darrick J. Wong" <darrick.wong@oracle.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Tested-by: Tony Luck <tony.luck@intel.com> Fixes: ef51042472f5 ("block, dax: move 'select DAX' from BLOCK to FS_DAX") Reported-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-05-13Merge branch 'akpm' (patches from Andrew)Linus Torvalds
Merge misc fixes from Andrew Morton: "15 fixes" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: mm, docs: update memory.stat description with workingset* entries mm: vmscan: scan until it finds eligible pages mm, thp: copying user pages must schedule on collapse dax: fix PMD data corruption when fault races with write dax: fix data corruption when fault races with write ext4: return to starting transaction in ext4_dax_huge_fault() mm: fix data corruption due to stale mmap reads dax: prevent invalidation of mapped DAX entries Tigran has moved mm, vmalloc: fix vmalloc users tracking properly mm/khugepaged: add missed tracepoint for collapse_huge_page_swapin gcov: support GCC 7.1 mm, vmstat: Remove spurious WARN() during zoneinfo print time: delete current_fs_time() hwpoison, memcg: forcibly uncharge LRU pages
2017-05-12ext4: return to starting transaction in ext4_dax_huge_fault()Jan Kara
DAX will return to locking exceptional entry before mapping blocks for a page fault to fix possible races with concurrent writes. To avoid lock inversion between exceptional entry lock and transaction start, start the transaction already in ext4_dax_huge_fault(). Fixes: 9f141d6ef6258a3a37a045842d9ba7e68f368956 Link: http://lkml.kernel.org/r/20170510085419.27601-4-jack@suse.cz Signed-off-by: Jan Kara <jack@suse.cz> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-12Merge branch 'libnvdimm-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm Pull libnvdimm fixes from Dan Williams: "Incremental fixes and a small feature addition on top of the main libnvdimm 4.12 pull request: - Geert noticed that tinyconfig was bloated by BLOCK selecting DAX. The size regression is fixed by moving all dax helpers into the dax-core and only specifying "select DAX" for FS_DAX and dax-capable drivers. He also asked for clarification of the NR_DEV_DAX config option which, on closer look, does not need to be a config option at all. Mike also throws in a DEV_DAX_PMEM fixup for good measure. - Ben's attention to detail on -stable patch submissions caught a case where the recent fixes to arch_copy_from_iter_pmem() missed a condition where we strand dirty data in the cache. This is tagged for -stable and will also be included in the rework of the pmem api to a proposed {memcpy,copy_user}_flushcache() interface for 4.13. - Vishal adds a feature that missed the initial pull due to pending review feedback. It allows the kernel to clear media errors when initializing a BTT (atomic sector update driver) instance on a pmem namespace. - Ross noticed that the dax_device + dax_operations conversion broke __dax_zero_page_range(). The nvdimm unit tests fail to check this path, but xfstests immediately trips over it. No excuse for missing this before submitting the 4.12 pull request. These all pass the nvdimm unit tests and an xfstests spot check. The set has received a build success notification from the kbuild robot" * 'libnvdimm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: filesystem-dax: fix broken __dax_zero_page_range() conversion libnvdimm, btt: ensure that initializing metadata clears poison libnvdimm: add an atomic vs process context flag to rw_bytes x86, pmem: Fix cache flushing for iovec write < 8 bytes device-dax: kill NR_DEV_DAX block, dax: move "select DAX" from BLOCK to FS_DAX device-dax: Tell kbuild DEV_DAX_PMEM depends on DEV_DAX
2017-05-08Merge branch 'akpm' (patches from Andrew)Linus Torvalds
Merge more updates from Andrew Morton: - the rest of MM - various misc things - procfs updates - lib/ updates - checkpatch updates - kdump/kexec updates - add kvmalloc helpers, use them - time helper updates for Y2038 issues. We're almost ready to remove current_fs_time() but that awaits a btrfs merge. - add tracepoints to DAX * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (114 commits) drivers/staging/ccree/ssi_hash.c: fix build with gcc-4.4.4 selftests/vm: add a test for virtual address range mapping dax: add tracepoint to dax_insert_mapping() dax: add tracepoint to dax_writeback_one() dax: add tracepoints to dax_writeback_mapping_range() dax: add tracepoints to dax_load_hole() dax: add tracepoints to dax_pfn_mkwrite() dax: add tracepoints to dax_iomap_pte_fault() mtd: nand: nandsim: convert to memalloc_noreclaim_*() treewide: convert PF_MEMALLOC manipulations to new helpers mm: introduce memalloc_noreclaim_{save,restore} mm: prevent potential recursive reclaim due to clearing PF_MEMALLOC mm/huge_memory.c: deposit a pgtable for DAX PMD faults when required mm/huge_memory.c: use zap_deposited_table() more time: delete CURRENT_TIME_SEC and CURRENT_TIME gfs2: replace CURRENT_TIME with current_time apparmorfs: replace CURRENT_TIME with current_time() lustre: replace CURRENT_TIME macro fs: ubifs: replace CURRENT_TIME_SEC with current_time fs: ufs: use ktime_get_real_ts64() for birthtime ...
2017-05-08mm: introduce kv[mz]alloc helpersMichal Hocko
Patch series "kvmalloc", v5. There are many open coded kmalloc with vmalloc fallback instances in the tree. Most of them are not careful enough or simply do not care about the underlying semantic of the kmalloc/page allocator which means that a) some vmalloc fallbacks are basically unreachable because the kmalloc part will keep retrying until it succeeds b) the page allocator can invoke a really disruptive steps like the OOM killer to move forward which doesn't sound appropriate when we consider that the vmalloc fallback is available. As it can be seen implementing kvmalloc requires quite an intimate knowledge if the page allocator and the memory reclaim internals which strongly suggests that a helper should be implemented in the memory subsystem proper. Most callers, I could find, have been converted to use the helper instead. This is patch 6. There are some more relying on __GFP_REPEAT in the networking stack which I have converted as well and Eric Dumazet was not opposed [2] to convert them as well. [1] http://lkml.kernel.org/r/20170130094940.13546-1-mhocko@kernel.org [2] http://lkml.kernel.org/r/1485273626.16328.301.camel@edumazet-glaptop3.roam.corp.google.com This patch (of 9): Using kmalloc with the vmalloc fallback for larger allocations is a common pattern in the kernel code. Yet we do not have any common helper for that and so users have invented their own helpers. Some of them are really creative when doing so. Let's just add kv[mz]alloc and make sure it is implemented properly. This implementation makes sure to not make a large memory pressure for > PAGE_SZE requests (__GFP_NORETRY) and also to not warn about allocation failures. This also rules out the OOM killer as the vmalloc is a more approapriate fallback than a disruptive user visible action. This patch also changes some existing users and removes helpers which are specific for them. In some cases this is not possible (e.g. ext4_kvmalloc, libcfs_kvzalloc) because those seems to be broken and require GFP_NO{FS,IO} context which is not vmalloc compatible in general (note that the page table allocation is GFP_KERNEL). Those need to be fixed separately. While we are at it, document that __vmalloc{_node} about unsupported gfp mask because there seems to be a lot of confusion out there. kvmalloc_node will warn about GFP_KERNEL incompatible (which are not superset) flags to catch new abusers. Existing ones would have to die slowly. [sfr@canb.auug.org.au: f2fs fixup] Link: http://lkml.kernel.org/r/20170320163735.332e64b7@canb.auug.org.au Link: http://lkml.kernel.org/r/20170306103032.2540-2-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Reviewed-by: Andreas Dilger <adilger@dilger.ca> [ext4 part] Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: John Hubbard <jhubbard@nvidia.com> Cc: David Miller <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-08Merge tag 'fscrypt_for_linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/fscrypt Pull fscrypt updates from Ted Ts'o: "Only bug fixes and cleanups for this merge window" * tag 'fscrypt_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/fscrypt: fscrypt: correct collision claim for digested names MAINTAINERS: fscrypt: update mailing list, patchwork, and git ext4: clean up ext4_match() and callers f2fs: switch to using fscrypt_match_name() ext4: switch to using fscrypt_match_name() fscrypt: introduce helper function for filename matching fscrypt: avoid collisions when presenting long encrypted filenames f2fs: check entire encrypted bigname when finding a dentry ubifs: check for consistent encryption contexts in ubifs_lookup() f2fs: sync f2fs_lookup() with ext4_lookup() ext4: remove "nokey" check from ext4_lookup() fscrypt: fix context consistency check when key(s) unavailable fscrypt: Remove __packed from fscrypt_policy fscrypt: Move key structure and constants to uapi fscrypt: remove fscrypt_symlink_data_len() fscrypt: remove unnecessary checks for NULL operations
2017-05-08Merge tag 'ext4_for_linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 updates from Ted Ts'o: - add GETFSMAP support - some performance improvements for very large file systems and for random write workloads into a preallocated file - bug fixes and cleanups. * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: jbd2: cleanup write flags handling from jbd2_write_superblock() ext4: mark superblock writes synchronous for nobarrier mounts ext4: inherit encryption xattr before other xattrs ext4: replace BUG_ON with WARN_ONCE in ext4_end_bio() ext4: avoid unnecessary transaction stalls during writeback ext4: preload block group descriptors ext4: make ext4_shutdown() static ext4: support GETFSMAP ioctls vfs: add common GETFSMAP ioctl definitions ext4: evict inline data when writing to memory map ext4: remove ext4_xattr_check_entry() ext4: rename ext4_xattr_check_names() to ext4_xattr_check_entries() ext4: merge ext4_xattr_list() into ext4_listxattr() ext4: constify static data that is never modified ext4: trim return value and 'dir' argument from ext4_insert_dentry() jbd2: fix dbench4 performance regression for 'nobarrier' mounts jbd2: Fix lockdep splat with generic/270 test mm: retry writepages() on ENOMEM when doing an data integrity writeback
2017-05-08block, dax: move "select DAX" from BLOCK to FS_DAXDan Williams
For configurations that do not enable DAX filesystems or drivers, do not require the DAX core to be built. Given that the 'direct_access' method has been removed from 'block_device_operations', we can also go ahead and remove the block-related dax helper functions from fs/block_dev.c to drivers/dax/super.c. This keeps dax details out of the block layer and lets the DAX core be built as a module in the FS_DAX=n case. Filesystems need to include dax.h to call bdev_dax_supported(). Cc: linux-xfs@vger.kernel.org Cc: Jens Axboe <axboe@kernel.dk> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: Matthew Wilcox <mawilcox@microsoft.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: "Darrick J. Wong" <darrick.wong@oracle.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Reviewed-by: Jan Kara <jack@suse.com> Reported-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-05-05Merge tag 'libnvdimm-for-4.12' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm Pull libnvdimm updates from Dan Williams: "The bulk of this has been in multiple -next releases. There were a few late breaking fixes and small features that got added in the last couple days, but the whole set has received a build success notification from the kbuild robot. Change summary: - Region media error reporting: A libnvdimm region device is the parent to one or more namespaces. To date, media errors have been reported via the "badblocks" attribute attached to pmem block devices for namespaces in "raw" or "memory" mode. Given that namespaces can be in "device-dax" or "btt-sector" mode this new interface reports media errors generically, i.e. independent of namespace modes or state. This subsequently allows userspace tooling to craft "ACPI 6.1 Section 9.20.7.6 Function Index 4 - Clear Uncorrectable Error" requests and submit them via the ioctl path for NVDIMM root bus devices. - Introduce 'struct dax_device' and 'struct dax_operations': Prompted by a request from Linus and feedback from Christoph this allows for dax capable drivers to publish their own custom dax operations. This fixes the broken assumption that all dax operations are related to a persistent memory device, and makes it easier for other architectures and platforms to add customized persistent memory support. - 'libnvdimm' core updates: A new "deep_flush" sysfs attribute is available for storage appliance applications to manually trigger memory controllers to drain write-pending buffers that would otherwise be flushed automatically by the platform ADR (asynchronous-DRAM-refresh) mechanism at a power loss event. Support for "locked" DIMMs is included to prevent namespaces from surfacing when the namespace label data area is locked. Finally, fixes for various reported deadlocks and crashes, also tagged for -stable. - ACPI / nfit driver updates: General updates of the nfit driver to add DSM command overrides, ACPI 6.1 health state flags support, DSM payload debug available by default, and various fixes. Acknowledgements that came after the branch was pushed: - commmit 565851c972b5 "device-dax: fix sysfs attribute deadlock": Tested-by: Yi Zhang <yizhan@redhat.com> - commit 23f498448362 "libnvdimm: rework region badblocks clearing" Tested-by: Toshi Kani <toshi.kani@hpe.com>" * tag 'libnvdimm-for-4.12' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (52 commits) libnvdimm, pfn: fix 'npfns' vs section alignment libnvdimm: handle locked label storage areas libnvdimm: convert NDD_ flags to use bitops, introduce NDD_LOCKED brd: fix uninitialized use of brd->dax_dev block, dax: use correct format string in bdev_dax_supported device-dax: fix sysfs attribute deadlock libnvdimm: restore "libnvdimm: band aid btt vs clear poison locking" libnvdimm: fix nvdimm_bus_lock() vs device_lock() ordering libnvdimm: rework region badblocks clearing acpi, nfit: kill ACPI_NFIT_DEBUG libnvdimm: fix clear length of nvdimm_forget_poison() libnvdimm, pmem: fix a NULL pointer BUG in nd_pmem_notify libnvdimm, region: sysfs trigger for nvdimm_flush() libnvdimm: fix phys_addr for nvdimm_clear_poison x86, dax, pmem: remove indirection around memcpy_from_pmem() block: remove block_device_operations ->direct_access() block, dax: convert bdev_dax_supported() to dax_direct_access() filesystem-dax: convert to dax_direct_access() Revert "block: use DAX for partition table reads" ext2, ext4, xfs: retrieve dax_device for iomap operations ...
2017-05-04ext4: clean up ext4_match() and callersEric Biggers
When ext4 encryption was originally merged, we were encrypting the user-specified filename in ext4_match(), introducing a lot of additional complexity into ext4_match() and its callers. This has since been changed to encrypt the filename earlier, so we can remove the gunk that's no longer needed. This more or less reverts ext4_search_dir() and ext4_find_dest_de() to the way they were in the v4.0 kernel. Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-05-04ext4: switch to using fscrypt_match_name()Eric Biggers
Switch ext4 directory searches to use the fscrypt_match_name() helper function. There should be no functional change. Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-05-04fscrypt: avoid collisions when presenting long encrypted filenamesEric Biggers
When accessing an encrypted directory without the key, userspace must operate on filenames derived from the ciphertext names, which contain arbitrary bytes. Since we must support filenames as long as NAME_MAX, we can't always just base64-encode the ciphertext, since that may make it too long. Currently, this is solved by presenting long names in an abbreviated form containing any needed filesystem-specific hashes (e.g. to identify a directory block), then the last 16 bytes of ciphertext. This needs to be sufficient to identify the actual name on lookup. However, there is a bug. It seems to have been assumed that due to the use of a CBC (ciphertext block chaining)-based encryption mode, the last 16 bytes (i.e. the AES block size) of ciphertext would depend on the full plaintext, preventing collisions. However, we actually use CBC with ciphertext stealing (CTS), which handles the last two blocks specially, causing them to appear "flipped". Thus, it's actually the second-to-last block which depends on the full plaintext. This caused long filenames that differ only near the end of their plaintexts to, when observed without the key, point to the wrong inode and be undeletable. For example, with ext4: # echo pass | e4crypt add_key -p 16 edir/ # seq -f "edir/abcdefghijklmnopqrstuvwxyz012345%.0f" 100000 | xargs touch # find edir/ -type f | xargs stat -c %i | sort | uniq | wc -l 100000 # sync # echo 3 > /proc/sys/vm/drop_caches # keyctl new_session # find edir/ -type f | xargs stat -c %i | sort | uniq | wc -l 2004 # rm -rf edir/ rm: cannot remove 'edir/_A7nNFi3rhkEQlJ6P,hdzluhODKOeWx5V': Structure needs cleaning ... To fix this, when presenting long encrypted filenames, encode the second-to-last block of ciphertext rather than the last 16 bytes. Although it would be nice to solve this without depending on a specific encryption mode, that would mean doing a cryptographic hash like SHA-256 which would be much less efficient. This way is sufficient for now, and it's still compatible with encryption modes like HEH which are strong pseudorandom permutations. Also, changing the presented names is still allowed at any time because they are only provided to allow applications to do things like delete encrypted directories. They're not designed to be used to persistently identify files --- which would be hard to do anyway, given that they're encrypted after all. For ease of backports, this patch only makes the minimal fix to both ext4 and f2fs. It leaves ubifs as-is, since ubifs doesn't compare the ciphertext block yet. Follow-on patches will clean things up properly and make the filesystems use a shared helper function. Fixes: 5de0b4d0cd15 ("ext4 crypto: simplify and speed up filename encryption") Reported-by: Gwendal Grignou <gwendal@chromium.org> Cc: stable@vger.kernel.org Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-05-04ext4: remove "nokey" check from ext4_lookup()Eric Biggers
Now that fscrypt_has_permitted_context() correctly handles the case where we have the key for the parent directory but not the child, we don't need to try to work around this in ext4_lookup(). Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-05-04ext4: mark superblock writes synchronous for nobarrier mountsJan Kara
Commit b685d3d65ac7 "block: treat REQ_FUA and REQ_PREFLUSH as synchronous" removed REQ_SYNC flag from WRITE_FUA implementation. generic_make_request_checks() however strips REQ_FUA flag from a bio when the storage doesn't report volatile write cache and thus write effectively becomes asynchronous which can lead to performance regressions. This affects superblock writes for ext4. Fix the problem by marking superblock writes always as synchronous. Fixes: b685d3d65ac791406e0dfd8779cc9b3707fea5a3 CC: linux-ext4@vger.kernel.org Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-05-03Merge branch 'generic' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs Pull quota, reiserfs, udf and ext2 updates from Jan Kara: "The branch contains changes to quota code so that it does not modify persistent flags in inode->i_flags (it was the only place in kernel doing that) and handle it inside filesystem's quotaon/off handlers instead. The branch also contains two UDF cleanups, a couple of reiserfs fixes and one fix for ext2 quota locking" * 'generic' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: ext4: Improve comments in ext4_quota_{on|off}() udf: use kmap_atomic for memcpy copying udf: use octal for permissions quota: Remove dquot_quotactl_ops reiserfs: Remove i_attrs_to_sd_attrs() reiserfs: Remove useless setting of i_flags jfs: Remove jfs_get_inode_flags() ext2: Remove ext2_get_inode_flags() ext4: Remove ext4_get_inode_flags() quota: Stop setting IMMUTABLE and NOATIME flags on quota files jfs: Set flags on quota files directly ext2: Set flags on quota files directly reiserfs: Set flags on quota files directly ext4: Set flags on quota files directly reiserfs: Protect dquot_writeback_dquots() by s_umount semaphore reiserfs: Make cancel_old_flush() reliable ext2: Call dquot_writeback_dquots() with s_umount held reiserfs: avoid a -Wmaybe-uninitialized warning
2017-05-02ext4: inherit encryption xattr before other xattrsEric Biggers
When using both encryption and SELinux (or another feature that requires an xattr per file) on a filesystem with 256-byte inodes, each file's xattrs usually spill into an external xattr block. Currently, the xattrs are inherited in the order ACL, security, then encryption. Therefore, if spillage occurs, the encryption xattr will always end up in the external block. This is not ideal because the encryption xattrs contain a nonce, so they will always be unique and will prevent the external xattr blocks from being deduplicated. To improve the situation, change the inheritance order to encryption, ACL, then security. This gives the encryption xattr a better chance to be stored in-inode, allowing the other xattr(s) to be deduplicated. Note that it may be better for userspace to format the filesystem with 512-byte inodes in this case. However, it's not the default. Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-04-30ext4: replace BUG_ON with WARN_ONCE in ext4_end_bio()Theodore Ts'o
Add fallback code and a WARN_ONCE() call instead of a BUG_ON() in the ext4_end_bio() function. Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-04-30ext4: avoid unnecessary transaction stalls during writebackJan Kara
Currently ext4_writepages() submits all pages with transaction started. When no page needs block allocation or extent conversion we can submit all dirty pages in the inode while holding a single transaction handle and when device is congested this can take significant amount of time. Thus ext4_writepages() can block transaction commits for extended periods of time. Take for example a simple benchmark simulating PostgreSQL database (pgioperf in mmtest). The benchmark runs 16 processes doing random reads from a huge file, one process doing random writes to the huge file, and one process doing sequential writes to a small files and frequently running fsync. With unpatched kernel transaction commits take on average ~18s with standard deviation of ~41s, top 5 commit times are: 274.466639s, 126.467347s, 86.992429s, 34.351563s, 31.517653s. After this patch transaction commits take on average 0.1s with standard deviation of 0.15s, top 5 commit times are: 0.563792s, 0.519980s, 0.509841s, 0.471700s, 0.469899s [ Modified so we use an explicit do_map flag instead of relying on io_end not being allocated, the since io_end->inode is needed for I/O error handling. -- tytso ] Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-04-30ext4: preload block group descriptorsAndrew Perepechko
With enabled meta_bg option block group descriptors reading IO is not sequential and requires optimization. Signed-off-by: Andrew Perepechko <andrew.perepechko@seagate.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-04-30ext4: make ext4_shutdown() staticEric Biggers
Make the ext4_shutdown() function static, as suggested by running sparse ('make C=2 fs/ext4/'). This was the only such warning in fs/ext4/. Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-04-30ext4: support GETFSMAP ioctlsDarrick J. Wong
Support the GETFSMAP ioctls so that we can use the xfs free space management tools to probe ext4 as well. Note that this is a partial implementation -- we only report fixed-location metadata and free space; everything else is reported as "unknown". Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-04-30ext4: evict inline data when writing to memory mapEric Biggers
Currently the case of writing via mmap to a file with inline data is not handled. This is maybe a rare case since it requires a writable memory map of a very small file, but it is trivial to trigger with on inline_data filesystem, and it causes the 'BUG_ON(ext4_test_inode_state(inode, EXT4_STATE_MAY_INLINE_DATA));' in ext4_writepages() to be hit: mkfs.ext4 -O inline_data /dev/vdb mount /dev/vdb /mnt xfs_io -f /mnt/file \ -c 'pwrite 0 1' \ -c 'mmap -w 0 1m' \ -c 'mwrite 0 1' \ -c 'fsync' kernel BUG at fs/ext4/inode.c:2723! invalid opcode: 0000 [#1] SMP CPU: 1 PID: 2532 Comm: xfs_io Not tainted 4.11.0-rc1-xfstests-00301-g071d9acf3d1f #633 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-20170228_101828-anatol 04/01/2014 task: ffff88003d3a8040 task.stack: ffffc90000300000 RIP: 0010:ext4_writepages+0xc89/0xf8a RSP: 0018:ffffc90000303ca0 EFLAGS: 00010283 RAX: 0000028410000000 RBX: ffff8800383fa3b0 RCX: ffffffff812afcdc RDX: 00000a9d00000246 RSI: ffffffff81e660e0 RDI: 0000000000000246 RBP: ffffc90000303dc0 R08: 0000000000000002 R09: 869618e8f99b4fa5 R10: 00000000852287a2 R11: 00000000a03b49f4 R12: ffff88003808e698 R13: 0000000000000000 R14: 7fffffffffffffff R15: 7fffffffffffffff FS: 00007fd3e53094c0(0000) GS:ffff88003e400000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fd3e4c51000 CR3: 000000003d554000 CR4: 00000000003406e0 Call Trace: ? _raw_spin_unlock+0x27/0x2a ? kvm_clock_read+0x1e/0x20 do_writepages+0x23/0x2c ? do_writepages+0x23/0x2c __filemap_fdatawrite_range+0x80/0x87 filemap_write_and_wait_range+0x67/0x8c ext4_sync_file+0x20e/0x472 vfs_fsync_range+0x8e/0x9f ? syscall_trace_enter+0x25b/0x2d0 vfs_fsync+0x1c/0x1e do_fsync+0x31/0x4a SyS_fsync+0x10/0x14 do_syscall_64+0x69/0x131 entry_SYSCALL64_slow_path+0x25/0x25 We could try to be smart and keep the inline data in this case, or at least support delayed allocation when allocating the block, but these solutions would be more complicated and don't seem worthwhile given how rare this case seems to be. So just fix the bug by calling ext4_convert_inline_data() when we're asked to make a page writable, so that any inline data gets evicted, with the block allocated immediately. Reported-by: Nick Alcock <nick.alcock@oracle.com> Cc: stable@vger.kernel.org Reviewed-by: Andreas Dilger <adilger@dilger.ca> Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-04-30ext4: remove ext4_xattr_check_entry()Eric Biggers
ext4_xattr_check_entry() was redundant with validation of the full xattr entries list in ext4_xattr_check_entries(), which all callers also did. ext4_xattr_check_entry() also didn't actually do correct validation; specifically, it never checked that the value doesn't overlap the xattr names, nor did it account for padding when checking whether the xattr value overflows the available space. So remove it to eliminate any potential confusion. Signed-off-by: Eric Biggers <ebiggers@google.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-04-29ext4: rename ext4_xattr_check_names() to ext4_xattr_check_entries()Eric Biggers
ext4_xattr_check_names() actually validates both the xattr names and values, not just the names. So rename it to ext4_xattr_check_entries() to avoid confusion. Signed-off-by: Eric Biggers <ebiggers@google.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-04-29ext4: merge ext4_xattr_list() into ext4_listxattr()Eric Biggers
There's no difference between ext4_xattr_list() and ext4_listxattr(), so merge them together and just have ext4_listxattr(). Some years ago they took different arguments, but that's no longer the case. Signed-off-by: Eric Biggers <ebiggers@google.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-04-29ext4: constify static data that is never modifiedEric Biggers
Constify static data in ext4 that is never (intentionally) modified so that it is placed in .rodata and benefits from memory protection. Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-04-29ext4: trim return value and 'dir' argument from ext4_insert_dentry()Eric Biggers
In the initial implementation of ext4 encryption, the filename was encrypted in ext4_insert_dentry(), which could fail and also required access to the 'dir' inode. Since then ext4 filename encryption has been changed to encrypt the filename earlier, so we can revert the additions to ext4_insert_dentry(). Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-04-25ext2, ext4, xfs: retrieve dax_device for iomap operationsDan Williams
In preparation for converting fs/dax.c to use dax_direct_access() instead of bdev_direct_access(), add the plumbing to retrieve the dax_device associated with a given block_device. Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-04-24ext4: Improve comments in ext4_quota_{on|off}()Jan Kara
Improve comments in ext4_quota_{on|off}() to explain that returning success despite ext4_journal_start() failing is deliberate. Signed-off-by: Jan Kara <jack@suse.cz>
2017-04-19ext4: Remove ext4_get_inode_flags()Jan Kara
Now that all places setting inode->i_flags that should be reflected in on-disk flags are gone, we can remove ext4_get_inode_flags() call. Reviewed-by: Andreas Dilger <adilger@dilger.ca> Signed-off-by: Jan Kara <jack@suse.cz>
2017-04-19ext4: Set flags on quota files directlyJan Kara
Currently immutable and noatime flags on quota files are set by quota code which requires us to copy inode->i_flags to our on disk version of quota flags in GETFLAGS ioctl and ext4_do_update_inode(). Move to setting / clearing these on-disk flags directly to save that copying. Signed-off-by: Jan Kara <jack@suse.cz>
2017-04-03statx: Include a mask for stx_attributes in struct statxDavid Howells
Include a mask in struct stat to indicate which bits of stx_attributes the filesystem actually supports. This would also be useful if we add another system call that allows you to do a 'bulk attribute set' and pass in a statx struct with the masks appropriately set to say what you want to set. Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-04-03ext4: Add statx supportDavid Howells
Return enhanced file attributes from the Ext4 filesystem. This includes the following: (1) The inode creation time (i_crtime) as stx_btime, setting STATX_BTIME. (2) Certain FS_xxx_FL flags are mapped to stx_attribute flags. This requires that all ext4 inodes have a getattr call, not just some of them, so to this end, split the ext4_getattr() function and only call part of it where appropriate. Example output: [root@andromeda ~]# touch foo [root@andromeda ~]# chattr +ai foo [root@andromeda ~]# /tmp/test-statx foo statx(foo) = 0 results=fff Size: 0 Blocks: 0 IO Block: 4096 regular file Device: 08:12 Inode: 2101950 Links: 1 Access: (0644/-rw-r--r--) Uid: 0 Gid: 0 Access: 2016-02-11 17:08:29.031795451+0000 Modify: 2016-02-11 17:08:29.031795451+0000 Change: 2016-02-11 17:11:11.987790114+0000 Birth: 2016-02-11 17:08:29.031795451+0000 Attributes: 0000000000000030 (-------- -------- -------- -------- -------- -------- -------- --ai----) Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-03-26Merge tag 'ext4_for_linus_stable' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 fixes from Ted Ts'o: "Fix a memory leak on an error path, and two races when modifying inodes relating to the inline_data and metadata checksum features" * tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: ext4: fix two spelling nits ext4: lock the xattr block before checksuming it jbd2: don't leak memory if setting up journal fails ext4: mark inode dirty after converting inline directory
2017-03-25Merge tag 'fscrypt-for-linus_stable' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/fscrypt Pull fscrypto fixes from Ted Ts'o: "A code cleanup and bugfix for fs/crypto" * tag 'fscrypt-for-linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/fscrypt: fscrypt: eliminate ->prepare_context() operation fscrypt: remove broken support for detecting keyring key revocation