aboutsummaryrefslogtreecommitdiffstats
path: root/fs/zonefs
AgeCommit message (Collapse)Author
2024-03-01zonefs: Improve error handlingDamien Le Moal
commit 14db5f64a971fce3d8ea35de4dfc7f443a3efb92 upstream. Write error handling is racy and can sometime lead to the error recovery path wrongly changing the inode size of a sequential zone file to an incorrect value which results in garbage data being readable at the end of a file. There are 2 problems: 1) zonefs_file_dio_write() updates a zone file write pointer offset after issuing a direct IO with iomap_dio_rw(). This update is done only if the IO succeed for synchronous direct writes. However, for asynchronous direct writes, the update is done without waiting for the IO completion so that the next asynchronous IO can be immediately issued. However, if an asynchronous IO completes with a failure right before the i_truncate_mutex lock protecting the update, the update may change the value of the inode write pointer offset that was corrected by the error path (zonefs_io_error() function). 2) zonefs_io_error() is called when a read or write error occurs. This function executes a report zone operation using the callback function zonefs_io_error_cb(), which does all the error recovery handling based on the current zone condition, write pointer position and according to the mount options being used. However, depending on the zoned device being used, a report zone callback may be executed in a context that is different from the context of __zonefs_io_error(). As a result, zonefs_io_error_cb() may be executed without the inode truncate mutex lock held, which can lead to invalid error processing. Fix both problems as follows: - Problem 1: Perform the inode write pointer offset update before a direct write is issued with iomap_dio_rw(). This is safe to do as partial direct writes are not supported (IOMAP_DIO_PARTIAL is not set) and any failed IO will trigger the execution of zonefs_io_error() which will correct the inode write pointer offset to reflect the current state of the one on the device. - Problem 2: Change zonefs_io_error_cb() into zonefs_handle_io_error() and call this function directly from __zonefs_io_error() after obtaining the zone information using blkdev_report_zones() with a simple callback function that copies to a local stack variable the struct blk_zone obtained from the device. This ensures that error handling is performed holding the inode truncate mutex. This change also simplifies error handling for conventional zone files by bypassing the execution of report zones entirely. This is safe to do because the condition of conventional zones cannot be read-only or offline and conventional zone files are always fully mapped with a constant file size. Reported-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> Fixes: 8dcc1a9d90c1 ("fs: New zonefs file system") Cc: stable@vger.kernel.org Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-04-05zonefs: Fix error message in zonefs_file_dio_append()Damien Le Moal
commit 88b170088ad2c3e27086fe35769aa49f8a512564 upstream. Since the expected write location in a sequential file is always at the end of the file (append write), when an invalid write append location is detected in zonefs_file_dio_append(), print the invalid written location instead of the expected write location. Fixes: a608da3bd730 ("zonefs: Detect append writes at invalid locations") Cc: stable@vger.kernel.org Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-04-05zonefs: Always invalidate last cached page on append writeDamien Le Moal
commit c1976bd8f23016d8706973908f2bb0ac0d852a8f upstream. When a direct append write is executed, the append offset may correspond to the last page of a sequential file inode which might have been cached already by buffered reads, page faults with mmap-read or non-direct readahead. To ensure that the on-disk and cached data is consistant for such last cached page, make sure to always invalidate it in zonefs_file_dio_append(). If the invalidation fails, return -EBUSY to userspace to differentiate from IO errors. This invalidation will always be a no-op when the FS block size (device zone write granularity) is equal to the page size (e.g. 4K). Reported-by: Hans Holmberg <Hans.Holmberg@wdc.com> Fixes: 02ef12a663c7 ("zonefs: use REQ_OP_ZONE_APPEND for sync DIO") Cc: stable@vger.kernel.org Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Tested-by: Hans Holmberg <hans.holmberg@wdc.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-01-24zonefs: Detect append writes at invalid locationsDamien Le Moal
commit a608da3bd730d718f2d3ebec1c26f9865f8f17ce upstream. Using REQ_OP_ZONE_APPEND operations for synchronous writes to sequential files succeeds regardless of the zone write pointer position, as long as the target zone is not full. This means that if an external (buggy) application writes to the zone of a sequential file underneath the file system, subsequent file write() operation will succeed but the file size will not be correct and the file will contain invalid data written by another application. Modify zonefs_file_dio_append() to check the written sector of an append write (returned in bio->bi_iter.bi_sector) and return -EIO if there is a mismatch with the file zone wp offset field. This change triggers a call to zonefs_io_error() and a zone check. Modify zonefs_io_error_cb() to not expose the unexpected data after the current inode size when the errors=remount-ro mode is used. Other error modes are correctly handled already. Fixes: 02ef12a663c7 ("zonefs: use REQ_OP_ZONE_APPEND for sync DIO") Cc: stable@vger.kernel.org Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-12-02zonefs: fix zone report size in __zonefs_io_error()Damien Le Moal
[ Upstream commit 7dd12d65ac646046a3fe0bbf9a4e86f4514207b3 ] When an IO error occurs, the function __zonefs_io_error() is used to issue a zone report to obtain the latest zone information from the device. This function gets a zone report for all zones used as storage for a file, which is always 1 zone except for files representing aggregated conventional zones. The number of zones of a zone report for a file is calculated in __zonefs_io_error() by doing a bit-shift of the inode i_zone_size field, which is equal to or larger than the device zone size. However, this calculation does not take into account that the last zone of a zoned device may be smaller than the zone size reported by bdev_zone_sectors() (which is used to set the bit shift size). As a result, if an error occurs for an IO targetting such last smaller zone, the zone report will ask for 0 zones, leading to an invalid zone report. Fix this by using the fact that all files require a 1 zone report, except if the inode i_zone_size field indicates a zone size larger than the device zone size. This exception case corresponds to a mount with aggregated conventional zones. A check for this exception is added to the file inode initialization during mount. If an invalid setup is detected, emit an error and fail the mount (check contributed by Johannes Thumshirn). Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-08-31block: add a bdev_max_zone_append_sectors helperChristoph Hellwig
commit 2aba0d19f4d8c8929b4b3b94a9cfde2aa20e6ee2 upstream Add a helper to check the max supported sectors for zone append based on the block_device instead of having to poke into the block layer internal request_queue. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20220415045258.199825-16-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-06-25zonefs: fix zonefs_iomap_begin() for readsDamien Le Moal
commit c1c1204c0d0c1dccc1310b9277fb2bd8b663d8fe upstream. If a readahead is issued to a sequential zone file with an offset exactly equal to the current file size, the iomap type is set to IOMAP_UNWRITTEN, which will prevent an IO, but the iomap length is calculated as 0. This causes a WARN_ON() in iomap_iter(): [17309.548939] WARNING: CPU: 3 PID: 2137 at fs/iomap/iter.c:34 iomap_iter+0x9cf/0xe80 [...] [17309.650907] RIP: 0010:iomap_iter+0x9cf/0xe80 [...] [17309.754560] Call Trace: [17309.757078] <TASK> [17309.759240] ? lock_is_held_type+0xd8/0x130 [17309.763531] iomap_readahead+0x1a8/0x870 [17309.767550] ? iomap_read_folio+0x4c0/0x4c0 [17309.771817] ? lockdep_hardirqs_on_prepare+0x400/0x400 [17309.778848] ? lock_release+0x370/0x750 [17309.784462] ? folio_add_lru+0x217/0x3f0 [17309.790220] ? reacquire_held_locks+0x4e0/0x4e0 [17309.796543] read_pages+0x17d/0xb60 [17309.801854] ? folio_add_lru+0x238/0x3f0 [17309.807573] ? readahead_expand+0x5f0/0x5f0 [17309.813554] ? policy_node+0xb5/0x140 [17309.819018] page_cache_ra_unbounded+0x27d/0x450 [17309.825439] filemap_get_pages+0x500/0x1450 [17309.831444] ? filemap_add_folio+0x140/0x140 [17309.837519] ? lock_is_held_type+0xd8/0x130 [17309.843509] filemap_read+0x28c/0x9f0 [17309.848953] ? zonefs_file_read_iter+0x1ea/0x4d0 [zonefs] [17309.856162] ? trace_contention_end+0xd6/0x130 [17309.862416] ? __mutex_lock+0x221/0x1480 [17309.868151] ? zonefs_file_read_iter+0x166/0x4d0 [zonefs] [17309.875364] ? filemap_get_pages+0x1450/0x1450 [17309.881647] ? __mutex_unlock_slowpath+0x15e/0x620 [17309.888248] ? wait_for_completion_io_timeout+0x20/0x20 [17309.895231] ? lock_is_held_type+0xd8/0x130 [17309.901115] ? lock_is_held_type+0xd8/0x130 [17309.906934] zonefs_file_read_iter+0x356/0x4d0 [zonefs] [17309.913750] new_sync_read+0x2d8/0x520 [17309.919035] ? __x64_sys_lseek+0x1d0/0x1d0 Furthermore, this causes iomap_readahead() to loop forever as iomap_readahead_iter() always returns 0, making no progress. Fix this by treating reads after the file size as access to holes, setting the iomap type to IOMAP_HOLE, the iomap addr to IOMAP_NULL_ADDR and using the length argument as is for the iomap length. To simplify the code with this change, zonefs_iomap_begin() is split into the read variant, zonefs_read_iomap_begin() and zonefs_read_iomap_ops, and the write variant, zonefs_write_iomap_begin() and zonefs_write_iomap_ops. Reported-by: Jorgen Hansen <Jorgen.Hansen@wdc.com> Fixes: 8dcc1a9d90c1 ("fs: New zonefs file system") Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Jorgen Hansen <Jorgen.Hansen@wdc.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-06-14zonefs: fix handling of explicit_open option on mountDamien Le Moal
commit a2a513be7139b279f1b5b2cee59c6c4950c34346 upstream. Ignoring the explicit_open mount option on mount for devices that do not have a limit on the number of open zones must be done after the mount options are parsed and set in s_mount_opts. Move the check to ignore the explicit_open option after the call to zonefs_parse_options() in zonefs_fill_super(). Fixes: b5c00e975779 ("zonefs: open/close zone on file open/close") Cc: <stable@vger.kernel.org> Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-05-09zonefs: Clear inode information flags on inode creationDamien Le Moal
commit 694852ead287a3433126e7ebda397b242dc99624 upstream. Ensure that the i_flags field of struct zonefs_inode_info is cleared to 0 when initializing a zone file inode, avoiding seeing the flag ZONEFS_ZONE_OPEN being incorrectly set. Fixes: b5c00e975779 ("zonefs: open/close zone on file open/close") Cc: <stable@vger.kernel.org> Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Hans Holmberg <hans.holmberg@wdc.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-05-09zonefs: Fix management of open zonesDamien Le Moal
commit 1da18a296f5ba4f99429e62a7cf4fdbefa598902 upstream. The mount option "explicit_open" manages the device open zone resources to ensure that if an application opens a sequential file for writing, the file zone can always be written by explicitly opening the zone and accounting for that state with the s_open_zones counter. However, if some zones are already open when mounting, the device open zone resource usage status will be larger than the initial s_open_zones value of 0. Ensure that this inconsistency does not happen by closing any sequential zone that is open when mounting. Furthermore, with ZNS drives, closing an explicitly open zone that has not been written will change the zone state to "closed", that is, the zone will remain in an active state. Since this can then cause failures of explicit open operations on other zones if the drive active zone resources are exceeded, we need to make sure that the zone is not active anymore by resetting it instead of closing it. To address this, zonefs_zone_mgmt() is modified to change a REQ_OP_ZONE_CLOSE request into a REQ_OP_ZONE_RESET for sequential zones that have not been written. Fixes: b5c00e975779 ("zonefs: open/close zone on file open/close") Cc: <stable@vger.kernel.org> Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Hans Holmberg <hans.holmberg@wdc.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-05-01iomap: Add done_before argument to iomap_dio_rwAndreas Gruenbacher
commit 4fdccaa0d184c202f98d73b24e3ec8eeee88ab8d upstream Add a done_before argument to iomap_dio_rw that indicates how much of the request has already been transferred. When the request succeeds, we report that done_before additional bytes were tranferred. This is useful for finishing a request asynchronously when part of the request has already been completed synchronously. We'll use that to allow iomap_dio_rw to be used with page faults disabled: when a page fault occurs while submitting a request, we synchronously complete the part of the request that has already been submitted. The caller can then take care of the page fault and call iomap_dio_rw again for the rest of the request, passing in the number of bytes already tranferred. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-12-22zonefs: add MODULE_ALIAS_FSNaohiro Aota
commit 8ffea2599f63fdbee968b894eab78170abf3ec2c upstream. Add MODULE_ALIAS_FS() to load the module automatically when you do "mount -t zonefs". Fixes: 8dcc1a9d90c1 ("fs: New zonefs file system") Cc: stable <stable@vger.kernel.org> # 5.6+ Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: Johannes Thumshirn <jth@kernel.org> Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-08-30Merge tag 'hole_punch_for_v5.15-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs Pull fs hole punching vs cache filling race fixes from Jan Kara: "Fix races leading to possible data corruption or stale data exposure in multiple filesystems when hole punching races with operations such as readahead. This is the series I was sending for the last merge window but with your objection fixed - now filemap_fault() has been modified to take invalidate_lock only when we need to create new page in the page cache and / or bring it uptodate" * tag 'hole_punch_for_v5.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: filesystems/locking: fix Malformed table warning cifs: Fix race between hole punch and page fault ceph: Fix race between hole punch and page fault fuse: Convert to using invalidate_lock f2fs: Convert to using invalidate_lock zonefs: Convert to using invalidate_lock xfs: Convert double locking of MMAPLOCK to use VFS helpers xfs: Convert to use invalidate_lock xfs: Refactor xfs_isilocked() ext2: Convert to using invalidate_lock ext4: Convert to use mapping->invalidate_lock mm: Add functions to lock invalidate_lock for two mappings mm: Protect operations adding pages to page cache with invalidate_lock documentation: Sync file_operations members with reality mm: Fix comments mentioning i_mutex
2021-07-16zonefs: remove redundant null bio checkXianting Tian
bio_alloc() with __GFP_DIRECT_RECLAIM, which is included in GFP_NOFS, never fails, see comments in bio_alloc_bioset(). Signed-off-by: Xianting Tian <xianting.tian@linux.alibaba.com> Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
2021-07-13zonefs: Convert to using invalidate_lockJan Kara
Use invalidate_lock instead of zonefs' private i_mmap_sem. The intended purpose is exactly the same. CC: Damien Le Moal <damien.lemoal@wdc.com> CC: Johannes Thumshirn <jth@kernel.org> CC: <linux-fsdevel@vger.kernel.org> Acked-by: Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jan Kara <jack@suse.cz>
2021-06-29mm: move page dirtying prototypes from mm.hMatthew Wilcox (Oracle)
These functions implement the address_space ->set_page_dirty operation and should live in pagemap.h, not mm.h so that the rest of the kernel doesn't get funny ideas about calling them directly. Link: https://lkml.kernel.org/r/20210615162342.1669332-7-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29iomap: use __set_page_dirty_nobuffersMatthew Wilcox (Oracle)
The only difference between iomap_set_page_dirty() and __set_page_dirty_nobuffers() is that the latter includes a debugging check that a !Uptodate page has private data. Link: https://lkml.kernel.org/r/20210615162342.1669332-4-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-04-29Merge tag 'fsnotify_for_v5.13-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs Pull fsnotify updates from Jan Kara: - support for limited fanotify functionality for unpriviledged users - faster merging of fanotify events - a few smaller fsnotify improvements * tag 'fsnotify_for_v5.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: shmem: allow reporting fanotify events with file handles on tmpfs fs: introduce a wrapper uuid_to_fsid() fanotify_user: use upper_32_bits() to verify mask fanotify: support limited functionality for unprivileged users fanotify: configurable limits via sysfs fanotify: limit number of event merge attempts fsnotify: use hash table for faster events merge fanotify: mix event info and pid into merge key hash fanotify: reduce event objectid to 29-bit hash fsnotify: allow fsnotify_{peek,remove}_first_event with empty queue
2021-04-19fs: introduce a wrapper uuid_to_fsid()Amir Goldstein
Some filesystem's use a digest of their uuid for f_fsid. Create a simple wrapper for this open coded folding. Filesystems that have a non null uuid but use the block device number for f_fsid may also consider using this helper. [JK: Added missing asm/byteorder.h include] Link: https://lore.kernel.org/r/20210322173944.449469-2-amir73il@gmail.com Acked-by: Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by: Christian Brauner <christian.brauner@ubuntu.com> Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz>
2021-03-17zonefs: fix to update .i_wr_refcnt correctly in zonefs_open_zone()Chao Yu
In zonefs_open_zone(), if opened zone count is larger than .s_max_open_zones threshold, we missed to recover .i_wr_refcnt, fix this. Fixes: b5c00e975779 ("zonefs: open/close zone on file open/close") Cc: <stable@vger.kernel.org> Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
2021-03-16zonefs: Fix O_APPEND async write handlingDamien Le Moal
zonefs updates the size of a sequential zone file inode only on completion of direct writes. When executing asynchronous append writes (with a file open with O_APPEND or using RWF_APPEND), the use of the current inode size in generic_write_checks() to set an iocb offset thus leads to unaligned write if an application issues an append write operation with another write already being executed. Fix this problem by introducing zonefs_write_checks() as a modified version of generic_write_checks() using the file inode wp_offset for an append write iocb offset. Also introduce zonefs_write_check_limits() to replace generic_write_check_limits() call. This zonefs special helper makes sure that the maximum file limit used is the maximum size of the file being accessed. Since zonefs_write_checks() already truncates the iov_iter, the calls to iov_iter_truncate() in zonefs_file_dio_write() and zonefs_file_buffered_write() are removed. Fixes: 8dcc1a9d90c1 ("fs: New zonefs file system") Cc: <stable@vger.kernel.org> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
2021-03-16zonefs: prevent use of seq files as swap fileDamien Le Moal
The sequential write constraint of sequential zone file prevent their use as swap files. Only allow conventional zone files to be used as swap files. Fixes: 8dcc1a9d90c1 ("fs: New zonefs file system") Cc: <stable@vger.kernel.org> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
2021-03-11block: rename BIO_MAX_PAGES to BIO_MAX_VECSChristoph Hellwig
Ever since the addition of multipage bio_vecs BIO_MAX_PAGES has been horribly confusingly misnamed. Rename it to BIO_MAX_VECS to stop confusing users of the bio API. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20210311110137.1132391-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-23Merge tag 'idmapped-mounts-v5.12' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux Pull idmapped mounts from Christian Brauner: "This introduces idmapped mounts which has been in the making for some time. Simply put, different mounts can expose the same file or directory with different ownership. This initial implementation comes with ports for fat, ext4 and with Christoph's port for xfs with more filesystems being actively worked on by independent people and maintainers. Idmapping mounts handle a wide range of long standing use-cases. Here are just a few: - Idmapped mounts make it possible to easily share files between multiple users or multiple machines especially in complex scenarios. For example, idmapped mounts will be used in the implementation of portable home directories in systemd-homed.service(8) where they allow users to move their home directory to an external storage device and use it on multiple computers where they are assigned different uids and gids. This effectively makes it possible to assign random uids and gids at login time. - It is possible to share files from the host with unprivileged containers without having to change ownership permanently through chown(2). - It is possible to idmap a container's rootfs and without having to mangle every file. For example, Chromebooks use it to share the user's Download folder with their unprivileged containers in their Linux subsystem. - It is possible to share files between containers with non-overlapping idmappings. - Filesystem that lack a proper concept of ownership such as fat can use idmapped mounts to implement discretionary access (DAC) permission checking. - They allow users to efficiently changing ownership on a per-mount basis without having to (recursively) chown(2) all files. In contrast to chown (2) changing ownership of large sets of files is instantenous with idmapped mounts. This is especially useful when ownership of a whole root filesystem of a virtual machine or container is changed. With idmapped mounts a single syscall mount_setattr syscall will be sufficient to change the ownership of all files. - Idmapped mounts always take the current ownership into account as idmappings specify what a given uid or gid is supposed to be mapped to. This contrasts with the chown(2) syscall which cannot by itself take the current ownership of the files it changes into account. It simply changes the ownership to the specified uid and gid. This is especially problematic when recursively chown(2)ing a large set of files which is commong with the aforementioned portable home directory and container and vm scenario. - Idmapped mounts allow to change ownership locally, restricting it to specific mounts, and temporarily as the ownership changes only apply as long as the mount exists. Several userspace projects have either already put up patches and pull-requests for this feature or will do so should you decide to pull this: - systemd: In a wide variety of scenarios but especially right away in their implementation of portable home directories. https://systemd.io/HOME_DIRECTORY/ - container runtimes: containerd, runC, LXD:To share data between host and unprivileged containers, unprivileged and privileged containers, etc. The pull request for idmapped mounts support in containerd, the default Kubernetes runtime is already up for quite a while now: https://github.com/containerd/containerd/pull/4734 - The virtio-fs developers and several users have expressed interest in using this feature with virtual machines once virtio-fs is ported. - ChromeOS: Sharing host-directories with unprivileged containers. I've tightly synced with all those projects and all of those listed here have also expressed their need/desire for this feature on the mailing list. For more info on how people use this there's a bunch of talks about this too. Here's just two recent ones: https://www.cncf.io/wp-content/uploads/2020/12/Rootless-Containers-in-Gitpod.pdf https://fosdem.org/2021/schedule/event/containers_idmap/ This comes with an extensive xfstests suite covering both ext4 and xfs: https://git.kernel.org/brauner/xfstests-dev/h/idmapped_mounts It covers truncation, creation, opening, xattrs, vfscaps, setid execution, setgid inheritance and more both with idmapped and non-idmapped mounts. It already helped to discover an unrelated xfs setgid inheritance bug which has since been fixed in mainline. It will be sent for inclusion with the xfstests project should you decide to merge this. In order to support per-mount idmappings vfsmounts are marked with user namespaces. The idmapping of the user namespace will be used to map the ids of vfs objects when they are accessed through that mount. By default all vfsmounts are marked with the initial user namespace. The initial user namespace is used to indicate that a mount is not idmapped. All operations behave as before and this is verified in the testsuite. Based on prior discussions we want to attach the whole user namespace and not just a dedicated idmapping struct. This allows us to reuse all the helpers that already exist for dealing with idmappings instead of introducing a whole new range of helpers. In addition, if we decide in the future that we are confident enough to enable unprivileged users to setup idmapped mounts the permission checking can take into account whether the caller is privileged in the user namespace the mount is currently marked with. The user namespace the mount will be marked with can be specified by passing a file descriptor refering to the user namespace as an argument to the new mount_setattr() syscall together with the new MOUNT_ATTR_IDMAP flag. The system call follows the openat2() pattern of extensibility. The following conditions must be met in order to create an idmapped mount: - The caller must currently have the CAP_SYS_ADMIN capability in the user namespace the underlying filesystem has been mounted in. - The underlying filesystem must support idmapped mounts. - The mount must not already be idmapped. This also implies that the idmapping of a mount cannot be altered once it has been idmapped. - The mount must be a detached/anonymous mount, i.e. it must have been created by calling open_tree() with the OPEN_TREE_CLONE flag and it must not already have been visible in the filesystem. The last two points guarantee easier semantics for userspace and the kernel and make the implementation significantly simpler. By default vfsmounts are marked with the initial user namespace and no behavioral or performance changes are observed. The manpage with a detailed description can be found here: https://git.kernel.org/brauner/man-pages/c/1d7b902e2875a1ff342e036a9f866a995640aea8 In order to support idmapped mounts, filesystems need to be changed and mark themselves with the FS_ALLOW_IDMAP flag in fs_flags. The patches to convert individual filesystem are not very large or complicated overall as can be seen from the included fat, ext4, and xfs ports. Patches for other filesystems are actively worked on and will be sent out separately. The xfstestsuite can be used to verify that port has been done correctly. The mount_setattr() syscall is motivated independent of the idmapped mounts patches and it's been around since July 2019. One of the most valuable features of the new mount api is the ability to perform mounts based on file descriptors only. Together with the lookup restrictions available in the openat2() RESOLVE_* flag namespace which we added in v5.6 this is the first time we are close to hardened and race-free (e.g. symlinks) mounting and path resolution. While userspace has started porting to the new mount api to mount proper filesystems and create new bind-mounts it is currently not possible to change mount options of an already existing bind mount in the new mount api since the mount_setattr() syscall is missing. With the addition of the mount_setattr() syscall we remove this last restriction and userspace can now fully port to the new mount api, covering every use-case the old mount api could. We also add the crucial ability to recursively change mount options for a whole mount tree, both removing and adding mount options at the same time. This syscall has been requested multiple times by various people and projects. There is a simple tool available at https://github.com/brauner/mount-idmapped that allows to create idmapped mounts so people can play with this patch series. I'll add support for the regular mount binary should you decide to pull this in the following weeks: Here's an example to a simple idmapped mount of another user's home directory: u1001@f2-vm:/$ sudo ./mount --idmap both:1000:1001:1 /home/ubuntu/ /mnt u1001@f2-vm:/$ ls -al /home/ubuntu/ total 28 drwxr-xr-x 2 ubuntu ubuntu 4096 Oct 28 22:07 . drwxr-xr-x 4 root root 4096 Oct 28 04:00 .. -rw------- 1 ubuntu ubuntu 3154 Oct 28 22:12 .bash_history -rw-r--r-- 1 ubuntu ubuntu 220 Feb 25 2020 .bash_logout -rw-r--r-- 1 ubuntu ubuntu 3771 Feb 25 2020 .bashrc -rw-r--r-- 1 ubuntu ubuntu 807 Feb 25 2020 .profile -rw-r--r-- 1 ubuntu ubuntu 0 Oct 16 16:11 .sudo_as_admin_successful -rw------- 1 ubuntu ubuntu 1144 Oct 28 00:43 .viminfo u1001@f2-vm:/$ ls -al /mnt/ total 28 drwxr-xr-x 2 u1001 u1001 4096 Oct 28 22:07 . drwxr-xr-x 29 root root 4096 Oct 28 22:01 .. -rw------- 1 u1001 u1001 3154 Oct 28 22:12 .bash_history -rw-r--r-- 1 u1001 u1001 220 Feb 25 2020 .bash_logout -rw-r--r-- 1 u1001 u1001 3771 Feb 25 2020 .bashrc -rw-r--r-- 1 u1001 u1001 807 Feb 25 2020 .profile -rw-r--r-- 1 u1001 u1001 0 Oct 16 16:11 .sudo_as_admin_successful -rw------- 1 u1001 u1001 1144 Oct 28 00:43 .viminfo u1001@f2-vm:/$ touch /mnt/my-file u1001@f2-vm:/$ setfacl -m u:1001:rwx /mnt/my-file u1001@f2-vm:/$ sudo setcap -n 1001 cap_net_raw+ep /mnt/my-file u1001@f2-vm:/$ ls -al /mnt/my-file -rw-rwxr--+ 1 u1001 u1001 0 Oct 28 22:14 /mnt/my-file u1001@f2-vm:/$ ls -al /home/ubuntu/my-file -rw-rwxr--+ 1 ubuntu ubuntu 0 Oct 28 22:14 /home/ubuntu/my-file u1001@f2-vm:/$ getfacl /mnt/my-file getfacl: Removing leading '/' from absolute path names # file: mnt/my-file # owner: u1001 # group: u1001 user::rw- user:u1001:rwx group::rw- mask::rwx other::r-- u1001@f2-vm:/$ getfacl /home/ubuntu/my-file getfacl: Removing leading '/' from absolute path names # file: home/ubuntu/my-file # owner: ubuntu # group: ubuntu user::rw- user:ubuntu:rwx group::rw- mask::rwx other::r--" * tag 'idmapped-mounts-v5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux: (41 commits) xfs: remove the possibly unused mp variable in xfs_file_compat_ioctl xfs: support idmapped mounts ext4: support idmapped mounts fat: handle idmapped mounts tests: add mount_setattr() selftests fs: introduce MOUNT_ATTR_IDMAP fs: add mount_setattr() fs: add attr_flags_to_mnt_flags helper fs: split out functions to hold writers namespace: only take read lock in do_reconfigure_mnt() mount: make {lock,unlock}_mount_hash() static namespace: take lock_mount_hash() directly when changing flags nfs: do not export idmapped mounts overlayfs: do not mount on top of idmapped mounts ecryptfs: do not mount on top of idmapped mounts ima: handle idmapped mounts apparmor: handle idmapped mounts fs: make helpers idmap mount aware exec: handle idmapped mounts would_dump: handle idmapped mounts ...
2021-02-22Merge tag 'zonefs-5.12-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/zonefs Pull zonefs updates from Damien Le Moal: "Two changes: - A fix that did not make it in time for 5.11, to correct the file size initialization of full sequential zone, from Shin'ichiro - Add file operation tracepoints to help with debugging, from Johannes" * tag 'zonefs-5.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/zonefs: zonefs: Fix file size of zones in full condition zonefs: add tracepoints for file operations
2021-02-21Merge tag 'for-5.12/block-2021-02-17' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull core block updates from Jens Axboe: "Another nice round of removing more code than what is added, mostly due to Christoph's relentless pursuit of tech debt removal/cleanups. This pull request contains: - Two series of BFQ improvements (Paolo, Jan, Jia) - Block iov_iter improvements (Pavel) - bsg error path fix (Pan) - blk-mq scheduler improvements (Jan) - -EBUSY discard fix (Jan) - bvec allocation improvements (Ming, Christoph) - bio allocation and init improvements (Christoph) - Store bdev pointer in bio instead of gendisk + partno (Christoph) - Block trace point cleanups (Christoph) - hard read-only vs read-only split (Christoph) - Block based swap cleanups (Christoph) - Zoned write granularity support (Damien) - Various fixes/tweaks (Chunguang, Guoqing, Lei, Lukas, Huhai)" * tag 'for-5.12/block-2021-02-17' of git://git.kernel.dk/linux-block: (104 commits) mm: simplify swapdev_block sd_zbc: clear zone resources for non-zoned case block: introduce blk_queue_clear_zone_settings() zonefs: use zone write granularity as block size block: introduce zone_write_granularity limit block: use blk_queue_set_zoned in add_partition() nullb: use blk_queue_set_zoned() to setup zoned devices nvme: cleanup zone information initialization block: document zone_append_max_bytes attribute block: use bi_max_vecs to find the bvec pool md/raid10: remove dead code in reshape_request block: mark the bio as cloned in bio_iov_bvec_set block: set BIO_NO_PAGE_REF in bio_iov_bvec_set block: remove a layer of indentation in bio_iov_iter_get_pages block: turn the nr_iovecs argument to bio_alloc* into an unsigned short block: remove the 1 and 4 vec bvec_slabs entries block: streamline bvec_alloc block: factor out a bvec_alloc_gfp helper block: move struct biovec_slab to bio.c block: reuse BIO_INLINE_VECS for integrity bvecs ...
2021-02-18zonefs: Fix file size of zones in full conditionShin'ichiro Kawasaki
Per ZBC/ZAC/ZNS specifications, write pointers may not have valid values when zones are in full condition. However, when zonefs mounts a zoned block device, zonefs refers write pointers to set file size even when the zones are in full condition. This results in wrong file size. To fix this, refer maximum file size in place of write pointers for zones in full condition. Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> Fixes: 8dcc1a9d90c1 ("fs: New zonefs file system") Cc: <stable@vger.kernel.org> # 5.6+ Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
2021-02-16zonefs: add tracepoints for file operationsJohannes Thumshirn
Add tracepoints for file I/O operations to aid in debugging of I/O errors with zonefs. The added tracepoints are in: - zonefs_zone_mgmt() for tracing zone management operations - zonefs_iomap_begin() for tracing regular file I/O - zonefs_file_dio_append() for tracing zone-append operations Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
2021-02-10zonefs: use zone write granularity as block sizeDamien Le Moal
Zoned block devices have different granularity constraints for write operations into sequential zones. E.g. ZBC and ZAC devices require that writes be aligned to the device physical block size while NVMe ZNS devices allow logical block size aligned write operations. To correctly handle such difference, use the device zone write granularity limit to set the block size of a zonefs volume, thus allowing the smallest possible write unit for all zoned device types. Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@edc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-27block: use an on-stack bio in blkdev_issue_flushChristoph Hellwig
There is no point in allocating memory for a synchronous flush. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Acked-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-27zonefs: use bio_alloc in zonefs_file_dio_appendChristoph Hellwig
Use bio_alloc instead of open coding it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Acked-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-24fs: make helpers idmap mount awareChristian Brauner
Extend some inode methods with an additional user namespace argument. A filesystem that is aware of idmapped mounts will receive the user namespace the mount has been marked with. This can be used for additional permission checking and also to enable filesystems to translate between uids and gids if they need to. We have implemented all relevant helpers in earlier patches. As requested we simply extend the exisiting inode method instead of introducing new ones. This is a little more code churn but it's mostly mechanical and doesnt't leave us with additional inode methods. Link: https://lore.kernel.org/r/20210121131959.646623-25-christian.brauner@ubuntu.com Cc: Christoph Hellwig <hch@lst.de> Cc: David Howells <dhowells@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-01-24attr: handle idmapped mountsChristian Brauner
When file attributes are changed most filesystems rely on the setattr_prepare(), setattr_copy(), and notify_change() helpers for initialization and permission checking. Let them handle idmapped mounts. If the inode is accessed through an idmapped mount map it into the mount's user namespace. Afterwards the checks are identical to non-idmapped mounts. If the initial user namespace is passed nothing changes so non-idmapped mounts will see identical behavior as before. Helpers that perform checks on the ia_uid and ia_gid fields in struct iattr assume that ia_uid and ia_gid are intended values and have already been mapped correctly at the userspace-kernelspace boundary as we already do today. If the initial user namespace is passed nothing changes so non-idmapped mounts will see identical behavior as before. Link: https://lore.kernel.org/r/20210121131959.646623-8-christian.brauner@ubuntu.com Cc: Christoph Hellwig <hch@lst.de> Cc: David Howells <dhowells@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-01-24inode: make init and permission helpers idmapped mount awareChristian Brauner
The inode_owner_or_capable() helper determines whether the caller is the owner of the inode or is capable with respect to that inode. Allow it to handle idmapped mounts. If the inode is accessed through an idmapped mount it according to the mount's user namespace. Afterwards the checks are identical to non-idmapped mounts. If the initial user namespace is passed nothing changes so non-idmapped mounts will see identical behavior as before. Similarly, allow the inode_init_owner() helper to handle idmapped mounts. It initializes a new inode on idmapped mounts by mapping the fsuid and fsgid of the caller from the mount's user namespace. If the initial user namespace is passed nothing changes so non-idmapped mounts will see identical behavior as before. Link: https://lore.kernel.org/r/20210121131959.646623-7-christian.brauner@ubuntu.com Cc: Christoph Hellwig <hch@lst.de> Cc: David Howells <dhowells@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: James Morris <jamorris@linux.microsoft.com> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-01-23iomap: pass a flags argument to iomap_dio_rwChristoph Hellwig
Pass a set of flags to iomap_dio_rw instead of the boolean wait_for_completion argument. The IOMAP_DIO_FORCE_WAIT flag replaces the wait_for_completion, but only needs to be passed when the iocb isn't synchronous to start with to simplify the callers. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> [djwong: rework xfs_file.c so that we can push iomap changes separately] Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-01-04zonefs: select CONFIG_CRC32Arnd Bergmann
When CRC32 is disabled, zonefs cannot be linked: ld: fs/zonefs/super.o: in function `zonefs_fill_super': Add a Kconfig 'select' statement for it. Fixes: 8dcc1a9d90c1 ("fs: New zonefs file system") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
2020-12-10zonefs: fix page reference and BIO leakDamien Le Moal
In zonefs_file_dio_append(), the pages obtained using bio_iov_iter_get_pages() are not released on completion of the REQ_OP_APPEND BIO, nor when bio_iov_iter_get_pages() fails. Furthermore, a call to bio_put() is missing when bio_iov_iter_get_pages() fails. Fix these resource leaks by adding BIO resource release code (bio_put()i and bio_release_pages()) at the end of the function after the BIO execution and add a jump to this resource cleanup code in case of bio_iov_iter_get_pages() failure. While at it, also fix the call to task_io_account_write() to be passed the correct BIO size instead of bio_iov_iter_get_pages() return value. Reported-by: Christoph Hellwig <hch@lst.de> Fixes: 02ef12a663c7 ("zonefs: use REQ_OP_ZONE_APPEND for sync DIO") Cc: stable@vger.kernel.org Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2020-10-24Merge branch 'work.misc' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull misc vfs updates from Al Viro: "Assorted stuff all over the place (the largest group here is Christoph's stat cleanups)" * 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: fs: remove KSTAT_QUERY_FLAGS fs: remove vfs_stat_set_lookup_flags fs: move vfs_fstatat out of line fs: implement vfs_stat and vfs_lstat in terms of vfs_fstatat fs: remove vfs_statx_fd fs: omfs: use kmemdup() rather than kmalloc+memcpy [PATCH] reduce boilerplate in fsid handling fs: Remove duplicated flag O_NDELAY occurring twice in VALID_OPEN_FLAGS selftests: mount: add nosymfollow tests Add a "nosymfollow" mount option.
2020-09-18[PATCH] reduce boilerplate in fsid handlingAl Viro
Get rid of boilerplate in most of ->statfs() instances... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-09-15zonefs: open/close zone on file open/closeJohannes Thumshirn
NVMe Zoned Namespace introduced the concept of active zones, which are zones in the implicit open, explicit open or closed condition. Drives may have a limit on the number of zones that can be simultaneously active. This potential limitation translate into a risk for applications to see write IO errors due to this limit if the zone of a file being written to is not already active when a write request is issued. To avoid these potential errors, the zone of a file can explicitly be made active using an open zone command when the file is open for the first time. If the zone open command succeeds, the application is then guaranteed that write requests can be processed. This indirect management of active zones relies on the maximum number of open zones of a drive, which is always lower or equal to the maximum number of active zones. On the first open of a sequential zone file, send a REQ_OP_ZONE_OPEN command to the block device. Conversely, on the last release of a zone file and send a REQ_OP_ZONE_CLOSE to the device if the zone is not full or empty. As truncating a zone file to 0 or max can deactivate a zone as well, we need to serialize against truncates and also be careful not to close a zone as the file may still be open for writing, e.g. the user called ftruncate(). If the zone file is not open and a process does a truncate(), then no close operation is needed. Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
2020-09-15zonefs: provide no-lock zonefs_io_error variantJohannes Thumshirn
Subsequent patches need to call zonefs_io_error() with the i_truncate_mutex already held, so factor out the body of zonefs_io_error() into __zonefs_io_error() which can be called from with the i_truncate_mutex held. Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
2020-09-15zonefs: introduce helper for zone managementJohannes Thumshirn
Introduce a helper function for sending zone management commands to the block device. As zone management commands can change a zone write pointer position reflected in the size of the zone file, this function expects the truncate mutex to be held. Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
2020-08-11zonefs: add zone-capacity supportJohannes Thumshirn
In the zoned storage model, the sectors within a zone are typically all writeable. With the introduction of the Zoned Namespace (ZNS) Command Set in the NVM Express organization, the model was extended to have a specific writeable capacity. This zone capacity can be less than the overall zone size for a NVMe ZNS device or null_blk in zoned-mode. For other ZBC/ZAC devices the zone capacity is always equal to the zone size. Use the zone capacity field instead from blk_zone for determining the maximum inode size and inode blocks in zonefs. Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
2020-08-06Merge tag 'iomap-5.9-merge-5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds
Pull iomap updates from Darrick Wong: "The most notable changes are: - iomap no longer invalidates the page cache when performing a direct read, since doing so is unnecessary and the old directio code doesn't do that either. - iomap embraced the use of returning ENOTBLK from a direct write to trigger falling back to a buffered write since ext4 already did this and btrfs wants it for their port. - iomap falls back to buffered writes if we're doing a direct write and the page cache invalidation after the flush fails; this was necessary to handle a corner case in the btrfs port. - Remove email virus scanner detritus that was accidentally included in yesterday's pull request. Clearly I need(ed) to update my git branch checker scripts. :(" * tag 'iomap-5.9-merge-5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: iomap: fall back to buffered writes for invalidation failures xfs: use ENOTBLK for direct I/O to buffered I/O fallback iomap: Only invalidate page cache pages on direct IO writes iomap: Make sure iomap_end is called after iomap_begin
2020-08-05iomap: fall back to buffered writes for invalidation failuresChristoph Hellwig
Failing to invalid the page cache means data in incoherent, which is a very bad state for the system. Always fall back to buffered I/O through the page cache if we can't invalidate mappings. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Acked-by: Bob Peterson <rpeterso@redhat.com> Acked-by: Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by: Theodore Ts'o <tytso@mit.edu> # for ext4 Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com> # for gfs2 Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
2020-07-20zonefs: count pages after truncating the iteratorJohannes Thumshirn
Count pages after possibly truncating the iterator to the maximum zone append size, not before. Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
2020-07-20zonefs: Fix compilation warningDamien Le Moal
Avoid the compilation warning "Variable 'ret' is reassigned a value before the old one has been used." in zonefs_create_zgroup() by setting ret for the error path only if an error happens. Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
2020-06-04Merge tag 'zonefs-5.8-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/zonefs Pull zonefs update from Damien Le Moal: "Only one patch in this pull request to cleanup handling of uuid using the import_uuid() helper, from Andy" * tag 'zonefs-5.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/zonefs: zonefs: Replace uuid_copy() with import_uuid()
2020-06-02Merge tag 'for-5.8/block-2020-06-01' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block updates from Jens Axboe: "Core block changes that have been queued up for this release: - Remove dead blk-throttle and blk-wbt code (Guoqing) - Include pid in blktrace note traces (Jan) - Don't spew I/O errors on wouldblock termination (me) - Zone append addition (Johannes, Keith, Damien) - IO accounting improvements (Konstantin, Christoph) - blk-mq hardware map update improvements (Ming) - Scheduler dispatch improvement (Salman) - Inline block encryption support (Satya) - Request map fixes and improvements (Weiping) - blk-iocost tweaks (Tejun) - Fix for timeout failing with error injection (Keith) - Queue re-run fixes (Douglas) - CPU hotplug improvements (Christoph) - Queue entry/exit improvements (Christoph) - Move DMA drain handling to the few drivers that use it (Christoph) - Partition handling cleanups (Christoph)" * tag 'for-5.8/block-2020-06-01' of git://git.kernel.dk/linux-block: (127 commits) block: mark bio_wouldblock_error() bio with BIO_QUIET blk-wbt: rename __wbt_update_limits to wbt_update_limits blk-wbt: remove wbt_update_limits blk-throttle: remove tg_drain_bios blk-throttle: remove blk_throtl_drain null_blk: force complete for timeout request blk-mq: drain I/O when all CPUs in a hctx are offline blk-mq: add blk_mq_all_tag_iter blk-mq: open code __blk_mq_alloc_request in blk_mq_alloc_request_hctx blk-mq: use BLK_MQ_NO_TAG in more places blk-mq: rename BLK_MQ_TAG_FAIL to BLK_MQ_NO_TAG blk-mq: move more request initialization to blk_mq_rq_ctx_init blk-mq: simplify the blk_mq_get_request calling convention blk-mq: remove the bio argument to ->prepare_request nvme: force complete cancelled requests blk-mq: blk-mq: provide forced completion method block: fix a warning when blkdev.h is included for !CONFIG_BLOCK builds block: blk-crypto-fallback: remove redundant initialization of variable err block: reduce part_stat_lock() scope block: use __this_cpu_add() instead of access by smp_processor_id() ...
2020-06-02iomap: convert from readpages to readaheadMatthew Wilcox (Oracle)
Use the new readahead operation in iomap. Convert XFS and ZoneFS to use it. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: William Kucharski <william.kucharski@oracle.com> Cc: Chao Yu <yuchao0@huawei.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: Dave Chinner <dchinner@redhat.com> Cc: Eric Biggers <ebiggers@google.com> Cc: Gao Xiang <gaoxiang25@huawei.com> Cc: Jaegeuk Kim <jaegeuk@kernel.org> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Zi Yan <ziy@nvidia.com> Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com> Cc: Miklos Szeredi <mszeredi@redhat.com> Link: http://lkml.kernel.org/r/20200414150233.24495-26-willy@infradead.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>