aboutsummaryrefslogtreecommitdiffstats
path: root/fs/xfs/xfs_reflink.c
AgeCommit message (Collapse)Author
2022-07-07xfs: fix xfs_reflink_unshare usage of filemap_write_and_wait_rangeDarrick J. Wong
commit d4f74e162d238ce00a640af5f0611c3f51dad70e upstream. The final parameter of filemap_write_and_wait_range is the end of the range to flush, not the length of the range to flush. Fixes: 46afb0628b86 ("xfs: only flush the unshared range in xfs_reflink_unshare") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Amir Goldstein <amir73il@gmail.com> Acked-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-11-04xfs: only flush the unshared range in xfs_reflink_unshareDarrick J. Wong
There's no reason to flush an entire file when we're unsharing part of a file. Therefore, only initiate writeback on the selected range. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
2020-08-05xfs: delete duplicated words + other fixesRandy Dunlap
Delete repeated words in fs/xfs/. {we, that, the, a, to, fork} Change "it it" to "it is" in one location. Signed-off-by: Randy Dunlap <rdunlap@infradead.org> To: linux-fsdevel@vger.kernel.org Cc: Darrick J. Wong <darrick.wong@oracle.com> Cc: linux-xfs@vger.kernel.org Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-07-06xfs: move helpers that lock and unlock two inodes against userspace IODarrick J. Wong
Move the double-inode locking helpers to xfs_inode.c since they're not specific to reflink. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
2020-07-06xfs: refactor locking and unlocking two inodes against userspace IODarrick J. Wong
Refactor the two functions that we use to lock and unlock two inodes to block userspace from initiating IO against a file, whether via system calls or mmap activity. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
2020-07-06xfs: fix xfs_reflink_remap_prep calling conventionsDarrick J. Wong
Fix the return value of xfs_reflink_remap_prep so that its return value conventions match the rest of xfs. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
2020-07-06xfs: reflink can skip remap existing mappingsDarrick J. Wong
If the source and destination map are identical, we can skip the remap step to save some time. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
2020-07-06xfs: only reserve quota blocks if we're mapping into a holeDarrick J. Wong
When logging quota block count updates during a reflink operation, we only log the /delta/ of the block count changes to the dquot. Since we now know ahead of time the extent type of both dmap and smap (and that they have the same length), we know that we only need to reserve quota blocks for dmap's blockcount if we're mapping it into a hole. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
2020-07-06xfs: only reserve quota blocks for bmbt changes if we're changing the data forkDarrick J. Wong
Now that we've reworked xfs_reflink_remap_extent to remap only one extent per transaction, we actually know if the extent being removed is an allocated mapping. This means that we now know ahead of time if we're going to be touching the data fork. Since we only need blocks for a bmbt split if we're going to update the data fork, we only need to get quota reservation if we know we're going to touch the data fork. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
2020-07-06xfs: redesign the reflink remap loop to fix blkres depletion crashDarrick J. Wong
The existing reflink remapping loop has some structural problems that need addressing: The biggest problem is that we create one transaction for each extent in the source file without accounting for the number of mappings there are for the same range in the destination file. In other words, we don't know the number of remap operations that will be necessary and we therefore cannot guess the block reservation required. On highly fragmented filesystems (e.g. ones with active dedupe) we guess wrong, run out of block reservation, and fail. The second problem is that we don't actually use the bmap intents to their full potential -- instead of calling bunmapi directly and having to deal with its backwards operation, we could call the deferred ops xfs_bmap_unmap_extent and xfs_refcount_decrease_extent instead. This makes the frontend loop much simpler. Solve all of these problems by refactoring the remapping loops so that we only perform one remapping operation per transaction, and each operation only tries to remap a single extent from source to dest. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reported-by: Edwin Török <edwin@etorok.net> Tested-by: Edwin Török <edwin@etorok.net>
2020-07-06xfs: rename xfs_bmap_is_real_extent to is_written_extentDarrick J. Wong
The name of this predicate is a little misleading -- it decides if the extent mapping is allocated and written. Change the name to be more direct, as we're going to add a new predicate in the next patch. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
2020-07-06xfs: fix reflink quota reservation accounting errorDarrick J. Wong
Quota reservations are supposed to account for the blocks that might be allocated due to a bmap btree split. Reflink doesn't do this, so fix this to make the quota accounting more accurate before we start rearranging things. Fixes: 862bb360ef56 ("xfs: reflink extents from one file to another") Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
2020-04-13xfs: fix partially uninitialized structure in xfs_reflink_remap_extentDarrick J. Wong
In the reflink extent remap function, it turns out that uirec (the block mapping corresponding only to the part of the passed-in mapping that got unmapped) was not fully initialized. Specifically, br_state was not being copied from the passed-in struct to the uirec. This could lead to unpredictable results such as the reflinked mapping being marked unwritten in the destination file. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
2020-01-26xfs: remove unnecessary null pointer checks from _read_agf callersDarrick J. Wong
Drop the null buffer pointer checks in all code that calls xfs_alloc_read_agf and doesn't pass XFS_ALLOC_FLAG_TRYLOCK because they're no longer necessary. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2020-01-20xfs: change return value of xfs_inode_need_cow to intzhengbin
Fixes coccicheck warning: fs/xfs/xfs_reflink.c:236:9-10: WARNING: return of 0/1 in function 'xfs_inode_need_cow' with return type bool Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: zhengbin <zhengbin13@huawei.com> [darrick: rename the function so it doesn't sound like a predicate] Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-01-14xfs: introduce XFS_MAX_FILEOFFDarrick J. Wong
Introduce a new #define for the maximum supported file block offset. We'll use this in the next patch to make it more obvious that we're doing some operation for all possible inode fork mappings after a given offset. We can't use ULLONG_MAX here because bunmapi uses that to detect when it's done. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2019-10-23xfs: don't set bmapi total block req where minleft isBrian Foster
xfs_bmapi_write() takes a total block requirement parameter that is passed down to the block allocation code and is used to specify the total block requirement of the associated transaction. This is used to try and select an AG that can not only satisfy the requested extent allocation, but can also accommodate subsequent allocations that might be required to complete the transaction. For example, additional bmbt block allocations may be required on insertion of the resulting extent to an inode data fork. While it's important for callers to calculate and reserve such extra blocks in the transaction, it is not necessary to pass the total value to xfs_bmapi_write() in all cases. The latter automatically sets minleft to ensure that sufficient free blocks remain after the allocation attempt to expand the format of the associated inode (i.e., such as extent to btree conversion, btree splits, etc). Therefore, any callers that pass a total block requirement of the bmap mapping length plus worst case bmbt expansion essentially specify the additional reservation requirement twice. These callers can pass a total of zero to rely on the bmapi minleft policy. Beyond being superfluous, the primary motivation for this change is that the total reservation logic in the bmbt code is dubious in scenarios where minlen < maxlen and a maxlen extent cannot be allocated (which is more common for data extent allocations where contiguity is not required). The total value is based on maxlen in the xfs_bmapi_write() caller. If the bmbt code falls back to an allocation between minlen and maxlen, that allocation will not succeed until total is reset to minlen, which essentially throws away any additional reservation included in total by the caller. In addition, the total value is not reset until after alignment is dropped, which means that such callers drop alignment far too aggressively than necessary. Update all callers of xfs_bmapi_write() that pass a total block value of the mapping length plus bmbt reservation to instead pass zero and rely on xfs_bmapi_minleft() to enforce the bmbt reservation requirement. This trades off slightly less conservative AG selection for the ability to preserve alignment in more scenarios. xfs_bmapi_write() callers that incorporate unrelated or additional reservations in total beyond what is already included in minleft must continue to use the former. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-10-21xfs: split the iomap ops for buffered vs direct writesChristoph Hellwig
Instead of lots of magic conditionals in the main write_begin handler this make the intent very clear. Thing will become even better once we support delayed allocations for extent size hints and realtime allocations. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-10-21xfs: pass two imaps to xfs_reflink_allocate_cowChristoph Hellwig
xfs_reflink_allocate_cow consumes the source data fork imap, and potentially returns the COW fork imap. Split the arguments in two to clear up the calling conventions and to prepare for returning a source iomap from ->iomap_begin. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-10-21xfs: remove xfs_reflink_dirty_extentsChristoph Hellwig
Now that xfs_file_unshare is not completely dumb we can just call it directly without iterating the extent and reflink btrees ourselves. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-10-21iomap: ignore non-shared or non-data blocks in xfs_file_dirtyChristoph Hellwig
xfs_file_dirty is used to unshare reflink blocks. Rename the function to xfs_file_unshare to better document that purpose, and skip iomaps that are not shared and don't need zeroing. This will allow to simplify the caller. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-08-28xfs: remove unnecessary int returns from deferred bmap functionsDarrick J. Wong
Remove the return value from the functions that schedule deferred bmap operations since they never fail and do not return status. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2019-08-28xfs: remove unnecessary int returns from deferred refcount functionsDarrick J. Wong
Remove the return value from the functions that schedule deferred refcount operations since they never fail and do not return status. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2019-08-18xfs: fix reflink source file racing with directio writesDarrick J. Wong
While trawling through the dedupe file comparison code trying to fix page deadlocking problems, Dave Chinner noticed that the reflink code only takes shared IOLOCK/MMAPLOCKs on the source file. Because page_mkwrite and directio writes do not take the EXCL versions of those locks, this means that reflink can race with writer processes. For pure remapping this can lead to undefined behavior and file corruption; for dedupe this means that we cannot be sure that the contents are identical when we decide to go ahead with the remapping. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2019-06-30xfs: remove XFS_TRANS_NOFSChristoph Hellwig
Instead of a magic flag for xfs_trans_alloc, just ensure all callers that can't relclaim through the file system use memalloc_nofs_save to set the per-task nofs flag. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-06-28xfs: remove unused header filesEric Sandeen
There are many, many xfs header files which are included but unneeded (or included twice) in the xfs code, so remove them. nb: xfs_linux.h includes about 9 headers for everyone, so those explicit includes get removed by this. I'm not sure what the preference is, but if we wanted explicit includes everywhere, a followup patch could remove those xfs_*.h includes from xfs_linux.h and move them into the files that need them. Or it could be left as-is. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-02-25xfs: fix uninitialized error variablesDarrick J. Wong
smatch complained about some uninitialized error returns, so fix those. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
2019-02-25xfs: don't pass iomap flags to xfs_reflink_allocate_cowDarrick J. Wong
Don't pass raw iomap flags to xfs_reflink_allocate_cow; signal our intention with a boolean argument. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
2019-02-21xfs: introduce an always_cow modeChristoph Hellwig
Add a mode where XFS never overwrites existing blocks in place. This is to aid debugging our COW code, and also put infatructure in place for things like possible future support for zoned block devices, which can't support overwrites. This mode is enabled globally by doing a: echo 1 > /sys/fs/xfs/debug/always_cow Note that the parameter is global to allow running all tests in xfstests easily in this mode, which would not easily be possible with a per-fs sysfs file. In always_cow mode persistent preallocations are disabled, and fallocate will fail when called with a 0 mode (with our without FALLOC_FL_KEEP_SIZE), and not create unwritten extent for zeroed space when called with FALLOC_FL_ZERO_RANGE or FALLOC_FL_UNSHARE_RANGE. There are a few interesting xfstests failures when run in always_cow mode: - generic/392 fails because the bytes used in the file used to test hole punch recovery are less after the log replay. This is because the blocks written and then punched out are only freed with a delay due to the logging mechanism. - xfs/170 will fail as the already fragile file streams mechanism doesn't seem to interact well with the COW allocator - xfs/180 xfs/182 xfs/192 xfs/198 xfs/204 and xfs/208 will claim the file system is badly fragmented, but there is not much we can do to avoid that when always writing out of place - xfs/205 fails because overwriting a file in always_cow mode will require new space allocation and the assumption in the test thus don't work anymore. - xfs/326 fails to modify the file at all in always_cow mode after injecting the refcount error, leading to an unexpected md5sum after the remount, but that again is expected Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-02-21xfs: make COW fork unwritten extent conversions more robustChristoph Hellwig
If we have racing buffered and direct I/O COW fork extents under writeback can have been moved to the data fork by the time we call xfs_reflink_convert_cow from xfs_submit_ioend. This would be mostly harmless as the block numbers don't change by this move, except for the fact that xfs_bmapi_write will crash or trigger asserts when not finding existing extents, even despite trying to paper over this with the XFS_BMAPI_CONVERT_ONLY flag. Instead of special casing non-transaction conversions in the already way too complicated xfs_bmapi_write just add a new helper for the much simpler non-transactional COW fork case, which simplify ignores not found extents. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-02-21xfs: merge COW handling into xfs_file_iomap_begin_delayChristoph Hellwig
Besides simplifying the code a bit this allows to actually implement the behavior of using COW preallocation for non-COW data mentioned in the current comments. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-02-21xfs: don't use delalloc extents for COW on files with extsize hintsChristoph Hellwig
While using delalloc for extsize hints is generally a good idea, the current code that does so only for COW doesn't help us much and creates a lot of special cases. Switch it to use real allocations like we do for direct I/O. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-02-17xfs: remove the io_type field from the writeback context and ioendChristoph Hellwig
The io_type field contains what is basically a summary of information from the inode fork and the imap. But we can just as easily use that information directly, simplifying a few bits here and there and improving the trace points. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-12-12xfs: split up the xfs_reflink_end_cow work into smaller transactionsDarrick J. Wong
In xfs_reflink_end_cow, we allocate a single transaction for the entire end_cow operation and then loop the CoW fork mappings to move them to the data fork. This design fails on a heavily fragmented filesystem where an inode's data fork has exactly one more extent than would fit in an extents-format fork, because the unmap can collapse the data fork into extents format (freeing the bmbt block) but the remap can expand the data fork back into a (newly allocated) bmbt block. If the number of extents we end up remapping is large, we can overflow the block reservation because we reserved blocks assuming that we were adding mappings into an already-cleared area of the data fork. Let's say we have 8 extents in the data fork, 8 extents in the CoW fork, and the data fork can hold at most 7 extents before needing to convert to btree format; and that blocks A-P are discontiguous single-block extents: 0......7 D: ABCDEFGH C: IJKLMNOP When a write to file blocks 0-7 completes, we must remap I-P into the data fork. We start by removing H from the btree-format data fork. Now we have 7 extents, so we convert the fork to extents format, freeing the bmbt block. We then move P into the data fork and it now has 8 extents again. We must convert the data fork back to btree format, requiring a block allocation. If we repeat this sequence for blocks 6-5-4-3-2-1-0, we'll need a total of 8 block allocations to remap all 8 blocks. We reserved only enough blocks to handle one btree split (5 blocks on a 4k block filesystem), which means we overflow the block reservation. To fix this issue, create a separate helper function to remap a single extent, and change _reflink_end_cow to call it in a tight loop over the entire range we're completing. As a side effect this also removes the size restrictions on how many extents we can end_cow at a time, though nobody ever hit that. It is not reasonable to reserve N blocks to remap N blocks. Note that this can be reproduced after ~320 million fsx ops while running generic/938 (long soak directio fsx exerciser): XFS: Assertion failed: tp->t_blk_res >= tp->t_blk_res_used, file: fs/xfs/xfs_trans.c, line: 116 <machine registers snipped> Call Trace: xfs_trans_dup+0x211/0x250 [xfs] xfs_trans_roll+0x6d/0x180 [xfs] xfs_defer_trans_roll+0x10c/0x3b0 [xfs] xfs_defer_finish_noroll+0xdf/0x740 [xfs] xfs_defer_finish+0x13/0x70 [xfs] xfs_reflink_end_cow+0x2c6/0x680 [xfs] xfs_dio_write_end_io+0x115/0x220 [xfs] iomap_dio_complete+0x3f/0x130 iomap_dio_rw+0x3c3/0x420 xfs_file_dio_aio_write+0x132/0x3c0 [xfs] xfs_file_write_iter+0x8b/0xc0 [xfs] __vfs_write+0x193/0x1f0 vfs_write+0xba/0x1c0 ksys_write+0x52/0xc0 do_syscall_64+0x50/0x160 entry_SYSCALL_64_after_hwframe+0x49/0xbe Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
2018-11-21xfs: flush removing page cache in xfs_reflink_remap_prepDave Chinner
On a sub-page block size filesystem, fsx is failing with a data corruption after a series of operations involving copying a file with the destination offset beyond EOF of the destination of the file: 8093(157 mod 256): TRUNCATE DOWN from 0x7a120 to 0x50000 ******WWWW 8094(158 mod 256): INSERT 0x25000 thru 0x25fff (0x1000 bytes) 8095(159 mod 256): COPY 0x18000 thru 0x1afff (0x3000 bytes) to 0x2f400 8096(160 mod 256): WRITE 0x5da00 thru 0x651ff (0x7800 bytes) HOLE 8097(161 mod 256): COPY 0x2000 thru 0x5fff (0x4000 bytes) to 0x6fc00 The second copy here is beyond EOF, and it is to sub-page (4k) but block aligned (1k) offset. The clone runs the EOF zeroing, landing in a pre-existing post-eof delalloc extent. This zeroes the post-eof extents in the page cache just fine, dirtying the pages correctly. The problem is that xfs_reflink_remap_prep() now truncates the page cache over the range that it is copying it to, and rounds that down to cover the entire start page. This removes the dirty page over the delalloc extent from the page cache without having written it back. Hence later, when the page cache is flushed, the page at offset 0x6f000 has not been written back and hence exposes stale data, which fsx trips over less than 10 operations later. Fix this by changing xfs_reflink_remap_prep() to use xfs_flush_unmap_range(). Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-11-19xfs: fix shared extent data corruption due to missing cow reservationBrian Foster
Page writeback indirectly handles shared extents via the existence of overlapping COW fork blocks. If COW fork blocks exist, writeback always performs the associated copy-on-write regardless if the underlying blocks are actually shared. If the blocks are shared, then overlapping COW fork blocks must always exist. fstests shared/010 reproduces a case where a buffered write occurs over a shared block without performing the requisite COW fork reservation. This ultimately causes writeback to the shared extent and data corruption that is detected across md5 checks of the filesystem across a mount cycle. The problem occurs when a buffered write lands over a shared extent that crosses an extent size hint boundary and that also happens to have a partial COW reservation that doesn't cover the start and end blocks of the data fork extent. For example, a buffered write occurs across the file offset (in FSB units) range of [29, 57]. A shared extent exists at blocks [29, 35] and COW reservation already exists at blocks [32, 34]. After accommodating a COW extent size hint of 32 blocks and the existing reservation at offset 32, xfs_reflink_reserve_cow() allocates 32 blocks of reservation at offset 0 and returns with COW reservation across the range of [0, 34]. The associated data fork extent is still [29, 35], however, which isn't fully covered by the COW reservation. This leads to a buffered write at file offset 35 over a shared extent without associated COW reservation. Writeback eventually kicks in, performs an overwrite of the underlying shared block and causes the associated data corruption. Update xfs_reflink_reserve_cow() to accommodate the fact that a delalloc allocation request may not fully cover the extent in the data fork. Trim the data fork extent appropriately, just as is done for shared extent boundaries and/or existing COW reservations that happen to overlap the start of the data fork extent. This prevents shared/010 failures due to data corruption on reflink enabled filesystems. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-11-02Merge tag 'xfs-4.20-merge-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds
Pull vfs dedup fixes from Dave Chinner: "This reworks the vfs data cloning infrastructure. We discovered many issues with these interfaces late in the 4.19 cycle - the worst of them (data corruption, setuid stripping) were fixed for XFS in 4.19-rc8, but a larger rework of the infrastructure fixing all the problems was needed. That rework is the contents of this pull request. Rework the vfs_clone_file_range and vfs_dedupe_file_range infrastructure to use a common .remap_file_range method and supply generic bounds and sanity checking functions that are shared with the data write path. The current VFS infrastructure has problems with rlimit, LFS file sizes, file time stamps, maximum filesystem file sizes, stripping setuid bits, etc and so they are addressed in these commits. We also introduce the ability for the ->remap_file_range methods to return short clones so that clones for vfs_copy_file_range() don't get rejected if the entire range can't be cloned. It also allows filesystems to sliently skip deduplication of partial EOF blocks if they are not capable of doing so without requiring errors to be thrown to userspace. Existing filesystems are converted to user the new remap_file_range method, and both XFS and ocfs2 are modified to make use of the new generic checking infrastructure" * tag 'xfs-4.20-merge-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (28 commits) xfs: remove [cm]time update from reflink calls xfs: remove xfs_reflink_remap_range xfs: remove redundant remap partial EOF block checks xfs: support returning partial reflink results xfs: clean up xfs_reflink_remap_blocks call site xfs: fix pagecache truncation prior to reflink ocfs2: remove ocfs2_reflink_remap_range ocfs2: support partial clone range and dedupe range ocfs2: fix pagecache truncation prior to reflink ocfs2: truncate page cache for clone destination file before remapping vfs: clean up generic_remap_file_range_prep return value vfs: hide file range comparison function vfs: enable remap callers that can handle short operations vfs: plumb remap flags through the vfs dedupe functions vfs: plumb remap flags through the vfs clone functions vfs: make remap_file_range functions take and return bytes completed vfs: remap helper should update destination inode metadata vfs: pass remap flags to generic_remap_checks vfs: pass remap flags to generic_remap_file_range_prep vfs: combine the clone and dedupe into a single remap_file_range ...
2018-10-30xfs: remove [cm]time update from reflink callsDarrick J. Wong
Now that the vfs remap helper dirties the inode [cm]time for us, xfs no longer needs to do that on its own. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2018-10-30xfs: remove xfs_reflink_remap_rangeDarrick J. Wong
Since xfs_file_remap_range is a thin wrapper, move the contents of xfs_reflink_remap_range into the shell. This cuts down on the vfs calls being made from internal xfs code. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2018-10-30xfs: remove redundant remap partial EOF block checksDarrick J. Wong
Now that we've moved the partial EOF block checks to the VFS helpers, we can remove the redundant functionality from XFS. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2018-10-30xfs: support returning partial reflink resultsDarrick J. Wong
Back when the XFS reflink code only supported clone_file_range, we were only able to return zero or negative error codes to userspace. However, now that copy_file_range (which returns bytes copied) can use XFS' clone_file_range, we have the opportunity to return partial results. For example, if userspace sends a 1GB clone request and we run out of space halfway through, we at least can tell userspace that we completed 512M of that request like a regular write. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2018-10-30xfs: clean up xfs_reflink_remap_blocks call siteDarrick J. Wong
Move the offset <-> blocks unit conversions into xfs_reflink_remap_blocks to make the call site less ugly. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2018-10-30xfs: fix pagecache truncation prior to reflinkDarrick J. Wong
Prior to remapping blocks, it is necessary to remove pages from the destination file's page cache. Unfortunately, the truncation is not aggressive enough -- if page size > block size, we'll end up zeroing subpage blocks instead of removing them. So, round the start offset down and the end offset up to page boundaries. We already wrote all the dirty data so the larger range shouldn't be a problem. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
2018-10-30vfs: clean up generic_remap_file_range_prep return valueDarrick J. Wong
Since the remap prep function can update the length of the remap request, we can change this function to return the usual return status instead of the odd behavior it has now. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
2018-10-30vfs: make remap_file_range functions take and return bytes completedDarrick J. Wong
Change the remap_file_range functions to take a number of bytes to operate upon and return the number of bytes they operated on. This is a requirement for allowing fs implementations to return short clone/dedupe results to the user, which will enable us to obey resource limits in a graceful manner. A subsequent patch will enable copy_file_range to signal to the ->clone_file_range implementation that it can handle a short length, which will be returned in the function's return value. For now the short return is not implemented anywhere so the behavior won't change -- either copy_file_range manages to clone the entire range or it tries an alternative. Neither clone ioctl can take advantage of this, alas. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2018-10-30vfs: remap helper should update destination inode metadataDarrick J. Wong
Extend generic_remap_file_range_prep to handle inode metadata updates when remapping into a file. If the operation can possibly alter the file contents, we must update the ctime and mtime and remove security privileges, just like we do for regular file writes. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2018-10-30vfs: pass remap flags to generic_remap_file_range_prepDarrick J. Wong
Plumb the remap flags through the filesystem from the vfs function dispatcher all the way to the prep function to prepare for behavior changes in subsequent patches. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
2018-10-30vfs: rename vfs_clone_file_prep to be more descriptiveDarrick J. Wong
The vfs_clone_file_prep is a generic function to be called by filesystem implementations only. Rename the prefix to generic_ and make it more clear that it applies to remap operations, not just clones. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2018-10-30vfs: check file ranges before cloning filesDarrick J. Wong
Move the file range checks from vfs_clone_file_prep into a separate generic_remap_checks function so that all the checks are collected in a central location. This forms the basis for adding more checks from generic_write_checks that will make cloning's input checking more consistent with write input checking. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2018-10-18xfs: fix fork selection in xfs_find_trim_cow_extentChristoph Hellwig
We should want to write directly into the data fork for blocks that don't have an extent in the COW fork covering them yet. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>