aboutsummaryrefslogtreecommitdiffstats
path: root/block
AgeCommit message (Collapse)Author
2023-09-27Merge branch 'master-6.5'HEADmasterBruce Ashfield
2023-09-23blk-mq: fix tags UAF when shrinking q->nr_hw_queuesChengming Zhou
[ Upstream commit 6be6d112419713334ddd9c01f219ca16adaa4c76 ] When nr_hw_queues shrink, we free the excess tags before realloc'ing hw_ctxs for each queue. During that resize, we may need to access those tags, like blk_mq_tag_idle(hctx) will access queue shared tags. This can cause a slab use-after-free, as reported by KASAN. Fix it by moving the releasing of excess tags to the end. Fixes: e1dd7bc93029 ("blk-mq: fix tags leak when shrink nr_hw_queues") Reported-by: Yi Zhang <yi.zhang@redhat.com> Closes: https://lore.kernel.org/all/CAHj4cs_CK63uoDpGBGZ6DN4OCTpzkR3UaVgK=LX8Owr8ej2ieQ@mail.gmail.com/ Cc: Ming Lei <ming.lei@redhat.com> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://lore.kernel.org/r/20230908005702.2183908-1-chengming.zhou@linux.dev Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-09-23blk-mq: prealloc tags when increase tagset nr_hw_queuesChengming Zhou
commit 7222657e51b5626d10154b3e48ad441c33b5da96 upstream. Just like blk_mq_alloc_tag_set(), it's better to prepare all tags before using to map to queue ctxs in blk_mq_map_swqueue(), which now have to consider empty set->tags[]. The good point is that we can fallback easily if increasing nr_hw_queues fail, instead of just mapping to hctx[0] when fail in blk_mq_map_swqueue(). And the fallback path already has tags free & clean handling, so all is good. Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20230821095602.70742-3-chengming.zhou@linux.dev Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-09-23blk-mq: fix tags leak when shrink nr_hw_queuesChengming Zhou
[ Upstream commit e1dd7bc93029024af5688253b0c05181d6e01f8e ] Although we don't need to realloc set->tags[] when shrink nr_hw_queues, we need to free them. Or these tags will be leaked. How to reproduce: 1. mount -t configfs configfs /mnt 2. modprobe null_blk nr_devices=0 submit_queues=8 3. mkdir /mnt/nullb/nullb0 4. echo 1 > /mnt/nullb/nullb0/power 5. echo 4 > /mnt/nullb/nullb0/submit_queues 6. rmdir /mnt/nullb/nullb0 In step 4, will alloc 9 tags (8 submit queues and 1 poll queue), then in step 5, new_nr_hw_queues = 5 (4 submit queues and 1 poll queue). At last in step 6, only these 5 tags are freed, the other 4 tags leaked. Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20230821095602.70742-1-chengming.zhou@linux.dev Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-09-19blk-throttle: consider 'carryover_ios/bytes' in throtl_trim_slice()Yu Kuai
[ Upstream commit eead0056648cef49d7b15c07ae612fa217083165 ] Currently, 'carryover_ios/bytes' is not handled in throtl_trim_slice(), for consequence, 'carryover_ios/bytes' will be used to throttle bio multiple times, for example: 1) set iops limit to 100, and slice start is 0, slice end is 100ms; 2) current time is 0, and 10 ios are dispatched, those io won't be throttled and io_disp is 10; 3) still at current time 0, update iops limit to 1000, carryover_ios is updated to (0 - 10) = -10; 4) in this slice(0 - 100ms), io_allowed = 100 + (-10) = 90, which means only 90 ios can be dispatched without waiting; 5) assume that io is throttled in slice(0 - 100ms), and throtl_trim_slice() update silce to (100ms - 200ms). In this case, 'carryover_ios/bytes' is not cleared and still only 90 ios can be dispatched between 100ms - 200ms. Fix this problem by updating 'carryover_ios/bytes' in throtl_trim_slice(). Fixes: a880ae93e5b5 ("blk-throttle: fix io hung due to configuration updates") Reported-by: zhuxiaohui <zhuxiaohui.400@bytedance.com> Link: https://lore.kernel.org/all/20230812072116.42321-1-zhuxiaohui.400@bytedance.com/ Signed-off-by: Yu Kuai <yukuai3@huawei.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20230816012708.1193747-5-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-09-19blk-throttle: use calculate_io/bytes_allowed() for throtl_trim_slice()Yu Kuai
[ Upstream commit e8368b57c006dc0e02dcd8a9dc9f2060ff5476fe ] There are no functional changes, just make the code cleaner. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20230816012708.1193747-4-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Stable-dep-of: eead0056648c ("blk-throttle: consider 'carryover_ios/bytes' in throtl_trim_slice()") Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-09-13block: don't add or resize partition on the disk with GENHD_FL_NO_PARTLi Lingfeng
commit 1a721de8489fa559ff4471f73c58bb74ac5580d3 upstream. Commit a33df75c6328 ("block: use an xarray for disk->part_tbl") remove disk_expand_part_tbl() in add_partition(), which means all kinds of devices will support extended dynamic `dev_t`. However, some devices with GENHD_FL_NO_PART are not expected to add or resize partition. Fix this by adding check of GENHD_FL_NO_PART before add or resize partition. Fixes: a33df75c6328 ("block: use an xarray for disk->part_tbl") Signed-off-by: Li Lingfeng <lilingfeng3@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230831075900.1725842-1-lilingfeng@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-09-13block: fix pin count management when merging same-page segmentsChristoph Hellwig
commit 5905afc2c7bb713d52c7c7585565feecbb686b44 upstream. There is no need to unpin the added page when adding it to the bio fails as that is done by the loop below. Instead we want to unpin it when adding a single page to the bio more than once as bio_release_pages will only unpin it once. Fixes: d1916c86ccdc ("block: move same page handling from __bio_add_pc_page to the callers") Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20230905124731.328255-1-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-09-13blk-cgroup: Fix NULL deref caused by blkg_policy_data being installed before ↵Tejun Heo
init [ Upstream commit ec14a87ee1999b19d8b7ed0fa95fea80644624ae ] blk-iocost sometimes causes the following crash: BUG: kernel NULL pointer dereference, address: 00000000000000e0 ... RIP: 0010:_raw_spin_lock+0x17/0x30 Code: be 01 02 00 00 e8 79 38 39 ff 31 d2 89 d0 5d c3 0f 1f 00 0f 1f 44 00 00 55 48 89 e5 65 ff 05 48 d0 34 7e b9 01 00 00 00 31 c0 <f0> 0f b1 0f 75 02 5d c3 89 c6 e8 ea 04 00 00 5d c3 0f 1f 84 00 00 RSP: 0018:ffffc900023b3d40 EFLAGS: 00010046 RAX: 0000000000000000 RBX: 00000000000000e0 RCX: 0000000000000001 RDX: ffffc900023b3d20 RSI: ffffc900023b3cf0 RDI: 00000000000000e0 RBP: ffffc900023b3d40 R08: ffffc900023b3c10 R09: 0000000000000003 R10: 0000000000000064 R11: 000000000000000a R12: ffff888102337000 R13: fffffffffffffff2 R14: ffff88810af408c8 R15: ffff8881070c3600 FS: 00007faaaf364fc0(0000) GS:ffff88842fdc0000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00000000000000e0 CR3: 00000001097b1000 CR4: 0000000000350ea0 Call Trace: <TASK> ioc_weight_write+0x13d/0x410 cgroup_file_write+0x7a/0x130 kernfs_fop_write_iter+0xf5/0x170 vfs_write+0x298/0x370 ksys_write+0x5f/0xb0 __x64_sys_write+0x1b/0x20 do_syscall_64+0x3d/0x80 entry_SYSCALL_64_after_hwframe+0x46/0xb0 This happens because iocg->ioc is NULL. The field is initialized by ioc_pd_init() and never cleared. The NULL deref is caused by blkcg_activate_policy() installing blkg_policy_data before initializing it. blkcg_activate_policy() was doing the following: 1. Allocate pd's for all existing blkg's and install them in blkg->pd[]. 2. Initialize all pd's. 3. Online all pd's. blkcg_activate_policy() only grabs the queue_lock and may release and re-acquire the lock as allocation may need to sleep. ioc_weight_write() grabs blkcg->lock and iterates all its blkg's. The two can race and if ioc_weight_write() runs during #1 or between #1 and #2, it can encounter a pd which is not initialized yet, leading to crash. The crash can be reproduced with the following script: #!/bin/bash echo +io > /sys/fs/cgroup/cgroup.subtree_control systemd-run --unit touch-sda --scope dd if=/dev/sda of=/dev/null bs=1M count=1 iflag=direct echo 100 > /sys/fs/cgroup/system.slice/io.weight bash -c "echo '8:0 enable=1' > /sys/fs/cgroup/io.cost.qos" & sleep .2 echo 100 > /sys/fs/cgroup/system.slice/io.weight with the following patch applied: > diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c > index fc49be622e05..38d671d5e10c 100644 > --- a/block/blk-cgroup.c > +++ b/block/blk-cgroup.c > @@ -1553,6 +1553,12 @@ int blkcg_activate_policy(struct gendisk *disk, const struct blkcg_policy *pol) > pd->online = false; > } > > + if (system_state == SYSTEM_RUNNING) { > + spin_unlock_irq(&q->queue_lock); > + ssleep(1); > + spin_lock_irq(&q->queue_lock); > + } > + > /* all allocated, init in the same order */ > if (pol->pd_init_fn) > list_for_each_entry_reverse(blkg, &q->blkg_list, q_node) I don't see a reason why all pd's should be allocated, initialized and onlined together. The only ordering requirement is that parent blkgs to be initialized and onlined before children, which is guaranteed from the walking order. Let's fix the bug by allocating, initializing and onlining pd for each blkg and holding blkcg->lock over initialization and onlining. This ensures that an installed blkg is always fully initialized and onlined removing the the race window. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Breno Leitao <leitao@debian.org> Fixes: 9d179b865449 ("blkcg: Fix multiple bugs in blkcg_activate_policy()") Link: https://lore.kernel.org/r/ZN0p5_W-Q9mAHBVY@slm.duckdns.org Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-09-13bio-integrity: create multi-page bvecs in bio_integrity_add_page()Jinyoung Choi
[ Upstream commit 0ece1d649b6dd615925a72bc1824d6b9fa5b998a ] In general, the bvec data structure consists of one for physically continuous pages. But, in the bvec configuration for bip, physically continuous integrity pages are composed of each bvec. Allow bio_integrity_add_page() to create multi-page bvecs, just like the bio payloads. This simplifies adding larger payloads, and fixes support for non-tiny workloads with nvme, which stopped using scatterlist for metadata a while ago. Cc: Christoph Hellwig <hch@lst.de> Cc: Martin K. Petersen <martin.petersen@oracle.com> Fixes: 783b94bd9250 ("nvme-pci: do not build a scatterlist to map metadata") Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jinyoung Choi <j-young.choi@samsung.com> Tested-by: "Martin K. Petersen" <martin.petersen@oracle.com> Reviewed-by: "Martin K. Petersen" <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20230803025202epcms2p82f57cbfe32195da38c776377b55aed59@epcms2p8 Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-09-13block: make bvec_try_merge_hw_page() non-staticJinyoung Choi
[ Upstream commit 7c8998f75d2d42ddefb172239b0f689392958309 ] This will be used for multi-page configuration for integrity payload. Cc: Christoph Hellwig <hch@lst.de> Cc: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jinyoung Choi <j-young.choi@samsung.com> Tested-by: "Martin K. Petersen" <martin.petersen@oracle.com> Reviewed-by: "Martin K. Petersen" <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20230803024827epcms2p838d9e9131492c86a159fff25d195658f@epcms2p8 Signed-off-by: Jens Axboe <axboe@kernel.dk> Stable-dep-of: 0ece1d649b6d ("bio-integrity: create multi-page bvecs in bio_integrity_add_page()") Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-09-13block: don't pass a bio to bio_try_merge_hw_segChristoph Hellwig
[ Upstream commit ae42f0b3bf65912e122fc2e8d5f6d94b51156dba ] There is no good reason to pass the bio to bio_try_merge_hw_seg. Just pass the current bvec and rename the function to bvec_try_merge_hw_page. This will allow reusing this function for supporting multi-page integrity payload bvecs. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jinyoung Choi <j-young.choi@samsung.com> Link: https://lore.kernel.org/r/20230724165433.117645-9-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk> Stable-dep-of: 0ece1d649b6d ("bio-integrity: create multi-page bvecs in bio_integrity_add_page()") Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-09-13block: move the bi_size update out of __bio_try_merge_pageChristoph Hellwig
[ Upstream commit 858c708d9efb7e8e5c6320793b778cc17cf8368a ] The update of bi_size is the only thing in __bio_try_merge_page that needs a bio. Move it to the callers, and merge __bio_try_merge_page and page_is_mergeable into a single bvec_try_merge_page that only takes the current bvec instead of a full bio. This will allow reusing this function for supporting multi-page integrity payload bvecs. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jinyoung Choi <j-young.choi@samsung.com> Link: https://lore.kernel.org/r/20230724165433.117645-8-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk> Stable-dep-of: 0ece1d649b6d ("bio-integrity: create multi-page bvecs in bio_integrity_add_page()") Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-09-13block: move the bi_size overflow check in __bio_try_merge_pageChristoph Hellwig
[ Upstream commit 613699050a49760f1d70c74f71bd0b013ca3c356 ] Checking for availability in bi_size in a function that attempts to merge into an existing segment is a bit odd, as the limit also applies when adding a new segment. This code works fine as we always call __bio_try_merge_page, but contributes to sub-optimal calling conventions and doesn't lead to clear code. Move it to two of the callers instead, the third one already has a more strict check that includes max_hw_segments anyway. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jinyoung Choi <j-young.choi@samsung.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20230724165433.117645-6-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk> Stable-dep-of: 0ece1d649b6d ("bio-integrity: create multi-page bvecs in bio_integrity_add_page()") Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-09-13block: move the bi_vcnt check out of __bio_try_merge_pageChristoph Hellwig
[ Upstream commit 0eca8b6f97ac705c5806f7d062207379094fb114 ] Move the bi_vcnt out of __bio_try_merge_page and into the two callers that don't already have it in preparation for additional changes to __bio_try_merge_page. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jinyoung Choi <j-young.choi@samsung.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20230724165433.117645-5-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk> Stable-dep-of: 0ece1d649b6d ("bio-integrity: create multi-page bvecs in bio_integrity_add_page()") Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-09-13block: move the BIO_CLONED checks out of __bio_try_merge_pageChristoph Hellwig
[ Upstream commit 939e1a370330841b2c0292a483d7b38f3ee45f88 ] __bio_try_merge_page is a way too low-level helper to assert that the bio is not cloned. Move the check into bio_add_page and bio_iov_iter_get_pages instead, which are the high level entry points that should enforce this variant. bio_add_hw_page already this check, coverig the third (indirect) caller of __bio_try_merge_page. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jinyoung Choi <j-young.choi@samsung.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20230724165433.117645-4-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk> Stable-dep-of: 0ece1d649b6d ("bio-integrity: create multi-page bvecs in bio_integrity_add_page()") Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-09-13block/mq-deadline: use correct way to throttling write requestsZhiguo Niu
[ Upstream commit d47f9717e5cfd0dd8c0ba2ecfa47c38d140f1bb6 ] The original formula was inaccurate: dd->async_depth = max(1UL, 3 * q->nr_requests / 4); For write requests, when we assign a tags from sched_tags, data->shallow_depth will be passed to sbitmap_find_bit, see the following code: nr = sbitmap_find_bit_in_word(&sb->map[index], min_t (unsigned int, __map_depth(sb, index), depth), alloc_hint, wrap); The smaller of data->shallow_depth and __map_depth(sb, index) will be used as the maximum range when allocating bits. For a mmc device (one hw queue, deadline I/O scheduler): q->nr_requests = sched_tags = 128, so according to the previous calculation method, dd->async_depth = data->shallow_depth = 96, and the platform is 64bits with 8 cpus, sched_tags.bitmap_tags.sb.shift=5, sb.maps[]=32/32/32/32, 32 is smaller than 96, whether it is a read or a write I/O, tags can be allocated to the maximum range each time, which has not throttling effect. In addition, refer to the methods of bfg/kyber I/O scheduler, limit ratiois are calculated base on sched_tags.bitmap_tags.sb.shift. This patch can throttle write requests really. Fixes: 07757588e507 ("block/mq-deadline: Reserve 25% of scheduler tags for synchronous requests") Signed-off-by: Zhiguo Niu <zhiguo.niu@unisoc.com> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/1691061162-22898-1-git-send-email-zhiguo.niu@unisoc.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-09-13blk-flush: fix rq->flush.seq for post-flush requestsChengming Zhou
[ Upstream commit 28b241237470981a96fbd82077c8044466b61e5f ] If the policy == (REQ_FSEQ_DATA | REQ_FSEQ_POSTFLUSH), it means that the data sequence and post-flush sequence need to be done for this request. The rq->flush.seq should record what sequences have been done (or don't need to be done). So in this case, pre-flush doesn't need to be done, we should init rq->flush.seq to REQ_FSEQ_PREFLUSH not REQ_FSEQ_POSTFLUSH. Fixes: 615939a2ae73 ("blk-mq: defer to the normal submission path for post-flush requests") Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230717040058.3993930-3-chengming.zhou@linux.dev Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-09-13block: don't allow enabling a cache on devices that don't support itChristoph Hellwig
[ Upstream commit 43c9835b144c7ce29efe142d662529662a9eb376 ] Currently the write_cache attribute allows enabling the QUEUE_FLAG_WC flag on devices that never claimed the capability. Fix that by adding a QUEUE_FLAG_HW_WC flag that is set by blk_queue_write_cache and guards re-enabling the cache through sysfs. Note that any rescan that calls blk_queue_write_cache will still re-enable the write cache as in the current code. Fixes: 93e9d8e836cb ("block: add ability to flag write back caching on a device") Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230707094239.107968-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-09-13block: cleanup queue_wc_storeChristoph Hellwig
[ Upstream commit c4e21bcd0f9d01f9c5d6c52007f5541871a5b1de ] Get rid of the local queue_wc_store variable and handling setting and clearing the QUEUE_FLAG_WC flag diretly instead the if / else if. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230707094239.107968-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk> Stable-dep-of: 43c9835b144c ("block: don't allow enabling a cache on devices that don't support it") Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-08-19Merge tag 'block-6.5-2023-08-19' of git://git.kernel.dk/linuxLinus Torvalds
Pull block fixes from Jens Axboe: "Main thing here is the fix for the regression in flush handling which caused IO hangs/stalls for a few reporters. Hopefully that should all be sorted out now. Outside of that, just a few minor fixes for issues that were introduced in this cycle" * tag 'block-6.5-2023-08-19' of git://git.kernel.dk/linux: blk-mq: release scheduler resource when request completes blk-crypto: dynamically allocate fallback profile blk-cgroup: hold queue_lock when removing blkg->q_node drivers/rnbd: restore sysfs interface to rnbd-client
2023-08-19blk-mq: release scheduler resource when request completesChengming Zhou
Chuck reported [1] an IO hang problem on NFS exports that reside on SATA devices and bisected to commit 615939a2ae73 ("blk-mq: defer to the normal submission path for post-flush requests"). We analysed the IO hang problem, found there are two postflush requests waiting for each other. The first postflush request completed the REQ_FSEQ_DATA sequence, so go to the REQ_FSEQ_POSTFLUSH sequence and added in the flush pending list, but failed to blk_kick_flush() because of the second postflush request which is inflight waiting in scheduler queue. The second postflush waiting in scheduler queue can't be dispatched because the first postflush hasn't released scheduler resource even though it has completed by itself. Fix it by releasing scheduler resource when the first postflush request completed, so the second postflush can be dispatched and completed, then make blk_kick_flush() succeed. While at it, remove the check for e->ops.finish_request, as all schedulers set that. Reaffirm this requirement by adding a WARN_ON_ONCE() at scheduler registration time, just like we do for insert_requests and dispatch_request. [1] https://lore.kernel.org/all/7A57C7AE-A51A-4254-888B-FE15CA21F9E9@oracle.com/ Link: https://lore.kernel.org/linux-block/20230819031206.2744005-1-chengming.zhou@linux.dev/ Reported-by: kernel test robot <oliver.sang@intel.com> Closes: https://lore.kernel.org/oe-lkp/202308172100.8ce4b853-oliver.sang@intel.com Fixes: 615939a2ae73 ("blk-mq: defer to the normal submission path for post-flush requests") Reported-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Tested-by: Chuck Lever <chuck.lever@oracle.com> Link: https://lore.kernel.org/r/20230813152325.3017343-1-chengming.zhou@linux.dev [axboe: folded in incremental fix and added tags] Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-18blk-crypto: dynamically allocate fallback profileSweet Tea Dorminy
blk_crypto_profile_init() calls lockdep_register_key(), which warns and does not register if the provided memory is a static object. blk-crypto-fallback currently has a static blk_crypto_profile and calls blk_crypto_profile_init() thereupon, resulting in the warning and failure to register. Fortunately it is simple enough to use a dynamically allocated profile and make lockdep function correctly. Fixes: 2fb48d88e77f ("blk-crypto: use dynamic lock class for blk_crypto_profile::lock") Cc: stable@vger.kernel.org Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Reviewed-by: Eric Biggers <ebiggers@google.com> Link: https://lore.kernel.org/r/20230817141615.15387-1-sweettea-kernel@dorminy.me Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-18blk-cgroup: hold queue_lock when removing blkg->q_nodeMing Lei
When blkg is removed from q->blkg_list from blkg_free_workfn(), queue_lock has to be held, otherwise, all kinds of bugs(list corruption, hard lockup, ..) can be triggered from blkg_destroy_all(). Fixes: f1c006f1c685 ("blk-cgroup: synchronize pd_free_fn() from blkg_free_workfn() and blkcg_deactivate_policy()") Cc: Yu Kuai <yukuai3@huawei.com> Cc: xiaoli feng <xifeng@redhat.com> Cc: Chunyu Hu <chuhu@redhat.com> Cc: Mike Snitzer <snitzer@kernel.org> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Ming Lei <ming.lei@redhat.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20230817141751.1128970-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-11Merge tag 'block-6.5-2023-08-11' of git://git.kernel.dk/linuxLinus Torvalds
Pull block fixes from Jens Axboe: - NVMe pull request via Keith: - Fixes for request_queue state (Ming) - Another uuid quirk (August) - RCU poll fix for NVMe (Ming) - Fix for an IO stall with polled IO (me) - Fix for blk-iocost stats enable/disable accounting (Chengming) - Regression fix for large pages for zram (Christoph) * tag 'block-6.5-2023-08-11' of git://git.kernel.dk/linux: nvme: core: don't hold rcu read lock in nvme_ns_chr_uring_cmd_iopoll blk-iocost: fix queue stats accounting block: don't make REQ_POLLED imply REQ_NOWAIT block: get rid of unused plug->nowait flag zram: take device and not only bvec offset into account nvme-pci: add NVME_QUIRK_BOGUS_NID for Samsung PM9B1 256G and 512G nvme-rdma: fix potential unbalanced freeze & unfreeze nvme-tcp: fix potential unbalanced freeze & unfreeze nvme: fix possible hang when removing a controller during error recovery
2023-08-09blk-iocost: fix queue stats accountingChengming Zhou
The q->stats->accounting is not only used by iocost, but iocost only increase this counter, never decrease it. So queue stats accounting will always enabled after using iocost once. Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20230804070609.31623-1-chengming.zhou@linux.dev Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-09block: don't make REQ_POLLED imply REQ_NOWAITJens Axboe
Normally these two flags do go together, as the issuer of polled IO generally cannot wait for resources that will get freed as part of IO completion. This is because that very task is the one that will complete the request and free those resources, hence that would introduce a deadlock. But it is possible to have someone else issue the polled IO, eg via io_uring if the request is punted to io-wq. For that case, it's fine to have the task block on IO submission, as it is not the same task that will be completing the IO. It's completely up to the caller to ask for both polled and nowait IO separately! If we don't allow polled IO where IOCB_NOWAIT isn't set in the kiocb, then we can run into repeated -EAGAIN submissions and not make any progress. Reviewed-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-08block: get rid of unused plug->nowait flagJens Axboe
This was introduced to add a plug based way of signaling nowait issues, but we have since moved on from that. Kill the old dead code, nobody is setting it anymore. Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-07-22Merge tag 'block-6.5-2023-07-21' of git://git.kernel.dk/linuxLinus Torvalds
Pull block fixes from Jens Axboe: - Fix for loop regressions (Mauricio) - Fix a potential stall with batched wakeups in sbitmap (David) - Fix for stall with recursive plug flushes (Ross) - Skip accounting of empty requests for blk-iocost (Chengming) - Remove a dead field in struct blk_mq_hw_ctx (Chengming) * tag 'block-6.5-2023-07-21' of git://git.kernel.dk/linux: loop: do not enforce max_loop hard limit by (new) default loop: deprecate autoloading callback loop_probe() sbitmap: fix batching wakeup blk-iocost: skip empty flush bio in iocost blk-mq: delete dead struct blk_mq_hw_ctx->queued field blk-mq: Fix stall due to recursive flush plug
2023-07-20blk-iocost: skip empty flush bio in iocostChengming Zhou
The flush bio may have data, may have no data (empty flush), we couldn't calculate cost for empty flush bio. So we'd better just skip it for now. Another side effect is that empty flush bio's bio_end_sector() is 0, cause iocg->cursor reset to 0, may break the cost calculation of other bios. This isn't good enough, since flush bio still consume the device bandwidth, but flush request is special, can be merged randomly in the flush state machine, we don't know how to calculate cost for it for now. Its completion time also has flaws, which may include the pre-flush or post-flush completion time, but I don't know if we need to fix that and how to fix it. Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20230720121441.1408522-1-chengming.zhou@linux.dev Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-07-14Merge tag 'scsi-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi Pull SCSI fixes from James Bottomley: "This is a bunch of small driver fixes and a larger rework of zone disk handling (which reaches into blk and nvme). The aacraid array-bounds fix is now critical since the security people turned on -Werror for some build tests, which now fail without it" * tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: scsi: storvsc: Handle SRB status value 0x30 scsi: block: Improve checks in blk_revalidate_disk_zones() scsi: block: virtio_blk: Set zone limits before revalidating zones scsi: block: nullblk: Set zone limits before revalidating zones scsi: nvme: zns: Set zone limits before revalidating zones scsi: sd_zbc: Set zone limits before revalidating zones scsi: ufs: core: Add support for qTimestamp attribute scsi: aacraid: Avoid -Warray-bounds warning scsi: ufs: ufs-mediatek: Add dependency for RESET_CONTROLLER scsi: ufs: core: Update contact email for monitor sysfs nodes scsi: scsi_debug: Remove dead code scsi: qla2xxx: Use vmalloc_array() and vcalloc() scsi: fnic: Use vmalloc_array() and vcalloc() scsi: qla2xxx: Fix error code in qla2x00_start_sp() scsi: qla2xxx: Silence a static checker warning scsi: lpfc: Fix a possible data race in lpfc_unregister_fcf_rescan()
2023-07-14blk-mq: Fix stall due to recursive flush plugRoss Lagerwall
We have seen rare IO stalls as follows: * blk_mq_plug_issue_direct() is entered with an mq_list containing two requests. * For the first request, it sets last == false and enters the driver's queue_rq callback. * The driver queue_rq callback indirectly calls schedule() which calls blk_flush_plug(). This may happen if the driver has the BLK_MQ_F_BLOCKING flag set and is allowed to sleep in ->queue_rq. * blk_flush_plug() handles the remaining request in the mq_list. mq_list is now empty. * The original call to queue_rq resumes (with last == false). * The loop in blk_mq_plug_issue_direct() terminates because there are no remaining requests in mq_list. The IO is now stalled because the last request submitted to the driver had last == false and there was no subsequent call to commit_rqs(). Fix this by returning early in blk_mq_flush_plug_list() if rq_count is 0 which it will be in the recursive case, rather than checking if the mq_list is empty. At the same time, adjust one of the callers to skip the mq_list empty check as it is not necessary. Fixes: dc5fc361d891 ("block: attempt direct issue of plug list") Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20230714101106.3635611-1-ross.lagerwall@citrix.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-07-14block: queue data commands from the flush state machine at the headChristoph Hellwig
We used to insert the data commands following a pre-flush to the head of the queue until commit 1e82fadfc6b ("blk-mq: do not do head insertions post-pre-flush commands"). Not doing this seems to cause hangs of such commands on NFS workloads when exported from file systems with SATA SSDs. I have no idea why this would starve these workloads, but doing a semantic revert of this patch (which looks quite different due to various other changes) fixes the hangs. Fixes: 1e82fadfc6b ("blk-mq: do not do head insertions post-pre-flush commands") Reported-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Tested-by: Chuck Lever <chuck.lever@oracle.com> Link: https://lore.kernel.org/r/20230714143014.11879-1-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-07-13blk-mq: fix start_time_ns and alloc_time_ns for pre-allocated rqChengming Zhou
The iocost rely on rq start_time_ns and alloc_time_ns to tell saturation state of the block device. Most of the time request is allocated after rq_qos_throttle() and its alloc_time_ns or start_time_ns won't be affected. But for plug batched allocation introduced by the commit 47c122e35d7e ("block: pre-allocate requests if plug is started and is a batch"), we can rq_qos_throttle() after the allocation of the request. This is what the blk_mq_get_cached_request() does. In this case, the cached request alloc_time_ns or start_time_ns is much ahead if blocked in any qos ->throttle(). Fix it by setting alloc_time_ns and start_time_ns to now when the allocated request is actually used. Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Acked-by: Tejun Heo <tj@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230710105516.2053478-1-chengming.zhou@linux.dev Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-07-12block/mq-deadline: Fix a bug in deadline_from_pos()Bart Van Assche
A bug was introduced in deadline_from_pos() while implementing the suggestion to use round_down() in the following code: pos -= bdev_offset_from_zone_start(rq->q->disk->part0, pos); This patch makes deadline_from_pos() use round_down() such that 'pos' is rounded down. Reported-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> Closes: https://lore.kernel.org/all/5zthzi3lppvcdp4nemum6qck4gpqbdhvgy4k3qwguhgzxc4quj@amulvgycq67h/ Cc: Christoph Hellwig <hch@lst.de> Cc: Damien Le Moal <dlemoal@kernel.org> Fixes: 0effb390c4ba ("block: mq-deadline: Handle requeued requests correctly") Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20230712173344.2994513-1-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-07-11Merge branch '6.5/scsi-staging' into 6.5/scsi-fixesMartin K. Petersen
Pull in the currently staged SCSI fixes for 6.5. Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2023-07-06Merge branch 'master-6.4'Bruce Ashfield
2023-07-05scsi: block: Improve checks in blk_revalidate_disk_zones()Damien Le Moal
blk_revalidate_disk_zones() implements checks of the zones of a zoned block device, verifying that the zone size is a power of 2 number of sectors, that all zones (except possibly the last one) have the same size and that zones cover the entire addressing space of the device. While these checks are appropriate to verify that well tested hardware devices have an adequate zone configurations, they lack in certain areas which may result in issues with emulated devices implemented with user drivers such as ublk or tcmu. Specifically, this function does not check if the device driver indicated support for the mandatory zone append writes, that is, if the device max_zone_append_sectors queue limit is set to a non-zero value. Additionally, invalid zones such as a zero length zone with a start sector equal to the device capacity will not be detected and result in out of bounds use of the zone bitmaps prepared with the callback function blk_revalidate_zone_cb(). Improve blk_revalidate_disk_zones() to address these inadequate checks, relying on the fact that all device drivers supporting zoned block devices must set the device zone size (chunk_sectors queue limit) and the max_zone_append_sectors queue limit before executing this function. The check for a non-zero max_zone_append_sectors value is done in blk_revalidate_disk_zones() before executing the zone report. The zone report callback function blk_revalidate_zone_cb() is also modified to add a check that a zone start is below the device capacity. The check that the zone size is a power of 2 number of sectors is moved to blk_revalidate_disk_zones() as the zone size is already known. Similarly, the number of zones of the device can be calculated in blk_revalidate_disk_zones() before executing the zone report. The kdoc comment for blk_revalidate_disk_zones() is also updated to mention that device drivers must set the device zone size and the max_zone_append_sectors queue limit before calling this function. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20230703024812.76778-6-dlemoal@kernel.org Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2023-07-05blk-crypto: use dynamic lock class for blk_crypto_profile::lockEric Biggers
When a device-mapper device is passing through the inline encryption support of an underlying device, calls to blk_crypto_evict_key() take the blk_crypto_profile::lock of the device-mapper device, then take the blk_crypto_profile::lock of the underlying device (nested). This isn't a real deadlock, but it causes a lockdep report because there is only one lock class for all instances of this lock. Lockdep subclasses don't really work here because the hierarchy of block devices is dynamic and could have more than 2 levels. Instead, register a dynamic lock class for each blk_crypto_profile, and associate that with the lock. This avoids false-positive lockdep reports like the following: ============================================ WARNING: possible recursive locking detected 6.4.0-rc5 #2 Not tainted -------------------------------------------- fscryptctl/1421 is trying to acquire lock: ffffff80829ca418 (&profile->lock){++++}-{3:3}, at: __blk_crypto_evict_key+0x44/0x1c0 but task is already holding lock: ffffff8086b68ca8 (&profile->lock){++++}-{3:3}, at: __blk_crypto_evict_key+0xc8/0x1c0 other info that might help us debug this: Possible unsafe locking scenario: CPU0 ---- lock(&profile->lock); lock(&profile->lock); *** DEADLOCK *** May be due to missing lock nesting notation Fixes: 1b2628397058 ("block: Keyslot Manager for Inline Encryption") Reported-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Eric Biggers <ebiggers@google.com> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20230610061139.212085-1-ebiggers@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-07-05block/partition: fix signedness issue for Amiga partitionsMichael Schmitz
Making 'blk' sector_t (i.e. 64 bit if LBD support is active) fails the 'blk>0' test in the partition block loop if a value of (signed int) -1 is used to mark the end of the partition block list. Explicitly cast 'blk' to signed int to allow use of -1 to terminate the partition block linked list. Fixes: b6f3f28f604b ("block: add overflow checks for Amiga partition support") Reported-by: Christian Zigotzky <chzigotzky@xenosoft.de> Link: https://lore.kernel.org/r/024ce4fa-cc6d-50a2-9aae-3701d0ebf668@xenosoft.de Signed-off-by: Michael Schmitz <schmitzmic@gmail.com> Reviewed-by: Martin Steigerwald <martin@lichtvoll.de> Tested-by: Christian Zigotzky <chzigotzky@xenosoft.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-07-03Merge tag 'block-6.5-2023-07-03' of git://git.kernel.dk/linuxLinus Torvalds
Pull more block updates from Jens Axboe: "Mostly items that came in a bit late for the initial pull request, wanted to make sure they had the appropriate amount of linux-next soak before going upstream. Outside of stragglers, just generic fixes for either merge window items, or longer standing bugs" * tag 'block-6.5-2023-07-03' of git://git.kernel.dk/linux: (25 commits) md/raid0: add discard support for the 'original' layout nvme: disable controller on reset state failure nvme: sync timeout work on failed reset nvme: ensure unquiesce on teardown cdrom/gdrom: Fix build error nvme: improved uring polling block: add request polling helper nvme-mpath: fix I/O failure with EAGAIN when failing over I/O nvme: host: fix command name spelling blk-sysfs: add a new attr_group for blk_mq blk-iocost: move wbt_enable/disable_default() out of spinlock blk-wbt: cleanup rwb_enabled() and wbt_disabled() blk-wbt: remove dead code to handle wbt enable/disable with io inflight blk-wbt: don't create wbt sysfs entry if CONFIG_BLK_WBT is disabled blk-mq: fix two misuses on RQF_USE_SCHED blk-throttle: Fix io statistics for cgroup v1 bcache: Fix bcache device claiming bcache: Alloc holder object before async registration raid10: avoid spin_lock from fastpath from raid10_unplug() md: fix 'delete_mutex' deadlock ...
2023-06-30Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsiLinus Torvalds
Pull SCSI updates from James Bottomley: "Updates to the usual drivers (ufs, pm80xx, libata-scsi, smartpqi, lpfc, qla2xxx). We have a couple of major core changes impacting other systems: - Command Duration Limits, which spills into block and ATA - block level Persistent Reservation Operations, which touches block, nvme, target and dm Both of these are added with merge commits containing a cover letter explaining what's going on" * tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (187 commits) scsi: core: Improve warning message in scsi_device_block() scsi: core: Replace scsi_target_block() with scsi_block_targets() scsi: core: Don't wait for quiesce in scsi_device_block() scsi: core: Don't wait for quiesce in scsi_stop_queue() scsi: core: Merge scsi_internal_device_block() and device_block() scsi: sg: Increase number of devices scsi: bsg: Increase number of devices scsi: qla2xxx: Remove unused nvme_ls_waitq wait queue scsi: ufs: ufs-pci: Add support for Intel Arrow Lake scsi: sd: sd_zbc: Use PAGE_SECTORS_SHIFT scsi: ufs: wb: Add explicit flush_threshold sysfs attribute scsi: ufs: ufs-qcom: Switch to the new ICE API scsi: ufs: dt-bindings: qcom: Add ICE phandle scsi: ufs: ufs-mediatek: Set UFSHCD_QUIRK_MCQ_BROKEN_RTC quirk scsi: ufs: ufs-mediatek: Set UFSHCD_QUIRK_MCQ_BROKEN_INTR quirk scsi: ufs: core: Add host quirk UFSHCD_QUIRK_MCQ_BROKEN_RTC scsi: ufs: core: Add host quirk UFSHCD_QUIRK_MCQ_BROKEN_INTR scsi: ufs: core: Remove dedicated hwq for dev command scsi: ufs: core: mcq: Fix the incorrect OCS value for the device command scsi: ufs: dt-bindings: samsung,exynos: Drop unneeded quotes ...
2023-06-28block: add request polling helperKeith Busch
Provide a direct request polling will for drivers. The interface does not require a bio, and can skip the overhead associated with polling those. The biggest gain from skipping the relatively expensive xarray lookup unnecessary when you already have the request. With this, the simple rq/qc conversion functions have only one caller each, so open code this and remove the helpers. Signed-off-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Kanchan Joshi <joshi.k@samsung.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230612190343.2087040-2-kbusch@meta.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-28Merge branch 'for-6.5/block-late' into block-6.5Jens Axboe
* for-6.5/block-late: blk-sysfs: add a new attr_group for blk_mq blk-iocost: move wbt_enable/disable_default() out of spinlock blk-wbt: cleanup rwb_enabled() and wbt_disabled() blk-wbt: remove dead code to handle wbt enable/disable with io inflight blk-wbt: don't create wbt sysfs entry if CONFIG_BLK_WBT is disabled blk-mq: fix two misuses on RQF_USE_SCHED blk-throttle: Fix io statistics for cgroup v1 bcache: Fix bcache device claiming bcache: Alloc holder object before async registration raid10: avoid spin_lock from fastpath from raid10_unplug() md: fix 'delete_mutex' deadlock md: use mddev->external to select holder in export_rdev() md/raid1-10: fix casting from randomized structure in raid1_submit_write() md/raid10: fix the condition to call bio_end_io_acct()
2023-06-28Merge tag 'mm-stable-2023-06-24-19-15' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull mm updates from Andrew Morton: - Yosry Ahmed brought back some cgroup v1 stats in OOM logs - Yosry has also eliminated cgroup's atomic rstat flushing - Nhat Pham adds the new cachestat() syscall. It provides userspace with the ability to query pagecache status - a similar concept to mincore() but more powerful and with improved usability - Mel Gorman provides more optimizations for compaction, reducing the prevalence of page rescanning - Lorenzo Stoakes has done some maintanance work on the get_user_pages() interface - Liam Howlett continues with cleanups and maintenance work to the maple tree code. Peng Zhang also does some work on maple tree - Johannes Weiner has done some cleanup work on the compaction code - David Hildenbrand has contributed additional selftests for get_user_pages() - Thomas Gleixner has contributed some maintenance and optimization work for the vmalloc code - Baolin Wang has provided some compaction cleanups, - SeongJae Park continues maintenance work on the DAMON code - Huang Ying has done some maintenance on the swap code's usage of device refcounting - Christoph Hellwig has some cleanups for the filemap/directio code - Ryan Roberts provides two patch series which yield some rationalization of the kernel's access to pte entries - use the provided APIs rather than open-coding accesses - Lorenzo Stoakes has some fixes to the interaction between pagecache and directio access to file mappings - John Hubbard has a series of fixes to the MM selftesting code - ZhangPeng continues the folio conversion campaign - Hugh Dickins has been working on the pagetable handling code, mainly with a view to reducing the load on the mmap_lock - Catalin Marinas has reduced the arm64 kmalloc() minimum alignment from 128 to 8 - Domenico Cerasuolo has improved the zswap reclaim mechanism by reorganizing the LRU management - Matthew Wilcox provides some fixups to make gfs2 work better with the buffer_head code - Vishal Moola also has done some folio conversion work - Matthew Wilcox has removed the remnants of the pagevec code - their functionality is migrated over to struct folio_batch * tag 'mm-stable-2023-06-24-19-15' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (380 commits) mm/hugetlb: remove hugetlb_set_page_subpool() mm: nommu: correct the range of mmap_sem_read_lock in task_mem() hugetlb: revert use of page_cache_next_miss() Revert "page cache: fix page_cache_next/prev_miss off by one" mm/vmscan: fix root proactive reclaim unthrottling unbalanced node mm: memcg: rename and document global_reclaim() mm: kill [add|del]_page_to_lru_list() mm: compaction: convert to use a folio in isolate_migratepages_block() mm: zswap: fix double invalidate with exclusive loads mm: remove unnecessary pagevec includes mm: remove references to pagevec mm: rename invalidate_mapping_pagevec to mapping_try_invalidate mm: remove struct pagevec net: convert sunrpc from pagevec to folio_batch i915: convert i915_gpu_error to use a folio_batch pagevec: rename fbatch_count() mm: remove check_move_unevictable_pages() drm: convert drm_gem_put_pages() to use a folio_batch i915: convert shmem_sg_free_table() to use a folio_batch scatterlist: add sg_set_folio() ...
2023-06-26Merge tag 'for-6.5/block-2023-06-23' of git://git.kernel.dk/linuxLinus Torvalds
Pull block updates from Jens Axboe: - NVMe pull request via Keith: - Various cleanups all around (Irvin, Chaitanya, Christophe) - Better struct packing (Christophe JAILLET) - Reduce controller error logs for optional commands (Keith) - Support for >=64KiB block sizes (Daniel Gomez) - Fabrics fixes and code organization (Max, Chaitanya, Daniel Wagner) - bcache updates via Coly: - Fix a race at init time (Mingzhe Zou) - Misc fixes and cleanups (Andrea, Thomas, Zheng, Ye) - use page pinning in the block layer for dio (David) - convert old block dio code to page pinning (David, Christoph) - cleanups for pktcdvd (Andy) - cleanups for rnbd (Guoqing) - use the unchecked __bio_add_page() for the initial single page additions (Johannes) - fix overflows in the Amiga partition handling code (Michael) - improve mq-deadline zoned device support (Bart) - keep passthrough requests out of the IO schedulers (Christoph, Ming) - improve support for flush requests, making them less special to deal with (Christoph) - add bdev holder ops and shutdown methods (Christoph) - fix the name_to_dev_t() situation and use cases (Christoph) - decouple the block open flags from fmode_t (Christoph) - ublk updates and cleanups, including adding user copy support (Ming) - BFQ sanity checking (Bart) - convert brd from radix to xarray (Pankaj) - constify various structures (Thomas, Ivan) - more fine grained persistent reservation ioctl capability checks (Jingbo) - misc fixes and cleanups (Arnd, Azeem, Demi, Ed, Hengqi, Hou, Jan, Jordy, Li, Min, Yu, Zhong, Waiman) * tag 'for-6.5/block-2023-06-23' of git://git.kernel.dk/linux: (266 commits) scsi/sg: don't grab scsi host module reference ext4: Fix warning in blkdev_put() block: don't return -EINVAL for not found names in devt_from_devname cdrom: Fix spectre-v1 gadget block: Improve kernel-doc headers blk-mq: don't insert passthrough request into sw queue bsg: make bsg_class a static const structure ublk: make ublk_chr_class a static const structure aoe: make aoe_class a static const structure block/rnbd: make all 'class' structures const block: fix the exclusive open mask in disk_scan_partitions block: add overflow checks for Amiga partition support block: change all __u32 annotations to __be32 in affs_hardblocks.h block: fix signed int overflow in Amiga partition support block: add capacity validation in bdev_add_partition() block: fine-granular CAP_SYS_ADMIN for Persistent Reservation block: disallow Persistent Reservation on partitions reiserfs: fix blkdev_put() warning from release_journal_dev() block: fix wrong mode for blkdev_get_by_dev() from disk_scan_partitions() block: document the holder argument to blkdev_get_by_path ...
2023-06-26Merge tag 'for-6.5/io_uring-2023-06-23' of git://git.kernel.dk/linuxLinus Torvalds
Pull io_uring updates from Jens Axboe: "Nothing major in this release, just a bunch of cleanups and some optimizations around networking mostly. - clean up file request flags handling (Christoph) - clean up request freeing and CQ locking (Pavel) - support for using pre-registering the io_uring fd at setup time (Josh) - Add support for user allocated ring memory, rather than having the kernel allocate it. Mostly for packing rings into a huge page (me) - avoid an unnecessary double retry on receive (me) - maintain ordering for task_work, which also improves performance (me) - misc cleanups/fixes (Pavel, me)" * tag 'for-6.5/io_uring-2023-06-23' of git://git.kernel.dk/linux: (39 commits) io_uring: merge conditional unlock flush helpers io_uring: make io_cq_unlock_post static io_uring: inline __io_cq_unlock io_uring: fix acquire/release annotations io_uring: kill io_cq_unlock() io_uring: remove IOU_F_TWQ_FORCE_NORMAL io_uring: don't batch task put on reqs free io_uring: move io_clean_op() io_uring: inline io_dismantle_req() io_uring: remove io_free_req_tw io_uring: open code io_put_req_find_next io_uring: add helpers to decode the fixed file file_ptr io_uring: use io_file_from_index in io_msg_grab_file io_uring: use io_file_from_index in __io_sync_cancel io_uring: return REQ_F_ flags from io_file_get_flags io_uring: remove io_req_ffs_set io_uring: remove a confusing comment above io_file_get_flags io_uring: remove the mode variable in io_file_get_flags io_uring: remove __io_file_supports_nowait io_uring: wait interruptibly for request completions on exit ...
2023-06-26Merge tag 'for-6.5/splice-2023-06-23' of git://git.kernel.dk/linuxLinus Torvalds
Pull splice updates from Jens Axboe: "This kills off ITER_PIPE to avoid a race between truncate, iov_iter_revert() on the pipe and an as-yet incomplete DMA to a bio with unpinned/unref'ed pages from an O_DIRECT splice read. This causes memory corruption. Instead, we either use (a) filemap_splice_read(), which invokes the buffered file reading code and splices from the pagecache into the pipe; (b) copy_splice_read(), which bulk-allocates a buffer, reads into it and then pushes the filled pages into the pipe; or (c) handle it in filesystem-specific code. Summary: - Rename direct_splice_read() to copy_splice_read() - Simplify the calculations for the number of pages to be reclaimed in copy_splice_read() - Turn do_splice_to() into a helper, vfs_splice_read(), so that it can be used by overlayfs and coda to perform the checks on the lower fs - Make vfs_splice_read() jump to copy_splice_read() to handle direct-I/O and DAX - Provide shmem with its own splice_read to handle non-existent pages in the pagecache. We don't want a ->read_folio() as we don't want to populate holes, but filemap_get_pages() requires it - Provide overlayfs with its own splice_read to call down to a lower layer as overlayfs doesn't provide ->read_folio() - Provide coda with its own splice_read to call down to a lower layer as coda doesn't provide ->read_folio() - Direct ->splice_read to copy_splice_read() in tty, procfs, kernfs and random files as they just copy to the output buffer and don't splice pages - Provide wrappers for afs, ceph, ecryptfs, ext4, f2fs, nfs, ntfs3, ocfs2, orangefs, xfs and zonefs to do locking and/or revalidation - Make cifs use filemap_splice_read() - Replace pointers to generic_file_splice_read() with pointers to filemap_splice_read() as DIO and DAX are handled in the caller; filesystems can still provide their own alternate ->splice_read() op - Remove generic_file_splice_read() - Remove ITER_PIPE and its paraphernalia as generic_file_splice_read was the only user" * tag 'for-6.5/splice-2023-06-23' of git://git.kernel.dk/linux: (31 commits) splice: kdoc for filemap_splice_read() and copy_splice_read() iov_iter: Kill ITER_PIPE splice: Remove generic_file_splice_read() splice: Use filemap_splice_read() instead of generic_file_splice_read() cifs: Use filemap_splice_read() trace: Convert trace/seq to use copy_splice_read() zonefs: Provide a splice-read wrapper xfs: Provide a splice-read wrapper orangefs: Provide a splice-read wrapper ocfs2: Provide a splice-read wrapper ntfs3: Provide a splice-read wrapper nfs: Provide a splice-read wrapper f2fs: Provide a splice-read wrapper ext4: Provide a splice-read wrapper ecryptfs: Provide a splice-read wrapper ceph: Provide a splice-read wrapper afs: Provide a splice-read wrapper 9p: Add splice_read wrapper net: Make sock_splice_read() use copy_splice_read() by default tty, proc, kernfs, random: Use copy_splice_read() ...
2023-06-26blk-sysfs: add a new attr_group for blk_mqYu Kuai
Currently wbt sysfs entry is created for bio based device, and wbt can be enabled for such device through sysfs while it doesn't make sense because wbt can only work for rq based device. In the meantime, there are other similar sysfs entries. Fix this by adding a new attr_group for blk_mq, and sysfs entries will only be created when the device is rq based. Suggested-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230527010644.647900-6-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-26blk-iocost: move wbt_enable/disable_default() out of spinlockYu Kuai
There are following smatch warning: block/blk-wbt.c:843 wbt_init() warn: sleeping in atomic context ioc_qos_write() <- disables preempt -> wbt_enable_default() -> wbt_init() wbt_init() will be called from wbt_enable_default() if wbt is not initialized, currently this is only possible in blk_register_queue(), hence wbt_init() will never be called from iocost and this warning is false positive. However, we might support rq_qos destruction dynamically in the future, and it's better to prevent that, hence move wbt_enable_default() outside 'ioc->lock'. This is safe because queue is still freezed. Reported-by: Dan Carpenter <error27@gmail.com> Link: https://lore.kernel.org/lkml/Y+Ja5SRs886CEz7a@kadam/ Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230527010644.647900-5-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>