aboutsummaryrefslogtreecommitdiffstats
path: root/block
AgeCommit message (Collapse)Author
2022-03-22Merge tag 'folio-5.18b' of git://git.infradead.org/users/willy/pagecacheLinus Torvalds
Pull filesystem folio updates from Matthew Wilcox: "Primarily this series converts some of the address_space operations to take a folio instead of a page. Notably: - a_ops->is_partially_uptodate() takes a folio instead of a page and changes the type of the 'from' and 'count' arguments to make it obvious they're bytes. - a_ops->invalidatepage() becomes ->invalidate_folio() and has a similar type change. - a_ops->launder_page() becomes ->launder_folio() - a_ops->set_page_dirty() becomes ->dirty_folio() and adds the address_space as an argument. There are a couple of other misc changes up front that weren't worth separating into their own pull request" * tag 'folio-5.18b' of git://git.infradead.org/users/willy/pagecache: (53 commits) fs: Remove aops ->set_page_dirty fb_defio: Use noop_dirty_folio() fs: Convert __set_page_dirty_no_writeback to noop_dirty_folio fs: Convert __set_page_dirty_buffers to block_dirty_folio nilfs: Convert nilfs_set_page_dirty() to nilfs_dirty_folio() mm: Convert swap_set_page_dirty() to swap_dirty_folio() ubifs: Convert ubifs_set_page_dirty to ubifs_dirty_folio f2fs: Convert f2fs_set_node_page_dirty to f2fs_dirty_node_folio f2fs: Convert f2fs_set_data_page_dirty to f2fs_dirty_data_folio f2fs: Convert f2fs_set_meta_page_dirty to f2fs_dirty_meta_folio afs: Convert afs_dir_set_page_dirty() to afs_dir_dirty_folio() btrfs: Convert extent_range_redirty_for_io() to use folios fs: Convert trivial uses of __set_page_dirty_nobuffers to filemap_dirty_folio btrfs: Convert from set_page_dirty to dirty_folio fscache: Convert fscache_set_page_dirty() to fscache_dirty_folio() fs: Add aops->dirty_folio fs: Remove aops->launder_page orangefs: Convert launder_page to launder_folio nfs: Convert from launder_page to launder_folio fuse: Convert from launder_page to launder_folio ...
2022-03-22Merge branch 'akpm' (patches from Andrew)Linus Torvalds
Merge updates from Andrew Morton: - A few misc subsystems: kthread, scripts, ntfs, ocfs2, block, and vfs - Most the MM patches which precede the patches in Willy's tree: kasan, pagecache, gup, swap, shmem, memcg, selftests, pagemap, mremap, sparsemem, vmalloc, pagealloc, memory-failure, mlock, hugetlb, userfaultfd, vmscan, compaction, mempolicy, oom-kill, migration, thp, cma, autonuma, psi, ksm, page-poison, madvise, memory-hotplug, rmap, zswap, uaccess, ioremap, highmem, cleanups, kfence, hmm, and damon. * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (227 commits) mm/damon/sysfs: remove repeat container_of() in damon_sysfs_kdamond_release() Docs/ABI/testing: add DAMON sysfs interface ABI document Docs/admin-guide/mm/damon/usage: document DAMON sysfs interface selftests/damon: add a test for DAMON sysfs interface mm/damon/sysfs: support DAMOS stats mm/damon/sysfs: support DAMOS watermarks mm/damon/sysfs: support schemes prioritization mm/damon/sysfs: support DAMOS quotas mm/damon/sysfs: support DAMON-based Operation Schemes mm/damon/sysfs: support the physical address space monitoring mm/damon/sysfs: link DAMON for virtual address spaces monitoring mm/damon: implement a minimal stub for sysfs-based DAMON interface mm/damon/core: add number of each enum type values mm/damon/core: allow non-exclusive DAMON start/stop Docs/damon: update outdated term 'regions update interval' Docs/vm/damon/design: update DAMON-Idle Page Tracking interference handling Docs/vm/damon: call low level monitoring primitives the operations mm/damon: remove unnecessary CONFIG_DAMON option mm/damon/paddr,vaddr: remove damon_{p,v}a_{target_valid,set_operations}() mm/damon/dbgfs-test: fix is_target_id() change ...
2022-03-22fs: allocate inode by using alloc_inode_sb()Muchun Song
The inode allocation is supposed to use alloc_inode_sb(), so convert kmem_cache_alloc() of all filesystems to alloc_inode_sb(). Link: https://lkml.kernel.org/r/20220228122126.37293-5-songmuchun@bytedance.com Signed-off-by: Muchun Song <songmuchun@bytedance.com> Acked-by: Theodore Ts'o <tytso@mit.edu> [ext4] Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Cc: Alex Shi <alexs@kernel.org> Cc: Anna Schumaker <Anna.Schumaker@Netapp.com> Cc: Chao Yu <chao@kernel.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Fam Zheng <fam.zheng@bytedance.com> Cc: Jaegeuk Kim <jaegeuk@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kari Argillander <kari.argillander@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Trond Myklebust <trond.myklebust@hammerspace.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Xiongchun Duan <duanxiongchun@bytedance.com> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-03-22block/bfq-iosched.c: use "false" rather than "BLK_RW_ASYNC"NeilBrown
bfq_get_queue() expects a "bool" for the third arg, so pass "false" rather than "BLK_RW_ASYNC" which will soon be removed. Link: https://lkml.kernel.org/r/164549983746.9187.7949730109246767909.stgit@noble.brown Signed-off-by: NeilBrown <neilb@suse.de> Acked-by: Jens Axboe <axboe@kernel.dk> Cc: Anna Schumaker <Anna.Schumaker@Netapp.com> Cc: Chao Yu <chao@kernel.org> Cc: Darrick J. Wong <djwong@kernel.org> Cc: Ilya Dryomov <idryomov@gmail.com> Cc: Jaegeuk Kim <jaegeuk@kernel.org> Cc: Jan Kara <jack@suse.cz> Cc: Jeff Layton <jlayton@kernel.org> Cc: Lars Ellenberg <lars.ellenberg@linbit.com> Cc: Miklos Szeredi <miklos@szeredi.hu> Cc: Paolo Valente <paolo.valente@linaro.org> Cc: Philipp Reisner <philipp.reisner@linbit.com> Cc: Ryusuke Konishi <konishi.ryusuke@gmail.com> Cc: Trond Myklebust <trond.myklebust@hammerspace.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-03-21Merge tag 'for-5.18/block-2022-03-18' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block updates from Jens Axboe: - BFQ cleanups and fixes (Yu, Zhang, Yahu, Paolo) - blk-rq-qos completion fix (Tejun) - blk-cgroup merge fix (Tejun) - Add offline error return value to distinguish it from an IO error on the device (Song) - IO stats fixes (Zhang, Christoph) - blkcg refcount fixes (Ming, Yu) - Fix for indefinite dispatch loop softlockup (Shin'ichiro) - blk-mq hardware queue management improvements (Ming) - sbitmap dead code removal (Ming, John) - Plugging merge improvements (me) - Show blk-crypto capabilities in sysfs (Eric) - Multiple delayed queue run improvement (David) - Block throttling fixes (Ming) - Start deprecating auto module loading based on dev_t (Christoph) - bio allocation improvements (Christoph, Chaitanya) - Get rid of bio_devname (Christoph) - bio clone improvements (Christoph) - Block plugging improvements (Christoph) - Get rid of genhd.h header (Christoph) - Ensure drivers use appropriate flush helpers (Christoph) - Refcounting improvements (Christoph) - Queue initialization and teardown improvements (Ming, Christoph) - Misc fixes/improvements (Barry, Chaitanya, Colin, Dan, Jiapeng, Lukas, Nian, Yang, Eric, Chengming) * tag 'for-5.18/block-2022-03-18' of git://git.kernel.dk/linux-block: (127 commits) block: cancel all throttled bios in del_gendisk() block: let blkcg_gq grab request queue's refcnt block: avoid use-after-free on throttle data block: limit request dispatch loop duration block/bfq-iosched: Fix spelling mistake "tenative" -> "tentative" sr: simplify the local variable initialization in sr_block_open() block: don't merge across cgroup boundaries if blkcg is enabled block: fix rq-qos breakage from skipping rq_qos_done_bio() block: flush plug based on hardware and software queue order block: ensure plug merging checks the correct queue at least once block: move rq_qos_exit() into disk_release() block: do more work in elevator_exit block: move blk_exit_queue into disk_release block: move q_usage_counter release into blk_queue_release block: don't remove hctx debugfs dir from blk_mq_exit_queue block: move blkcg initialization/destroy into disk allocation/release handler sr: implement ->free_disk to simplify refcounting sd: implement ->free_disk to simplify refcounting sd: delay calling free_opal_dev sd: call sd_zbc_release_disk before releasing the scsi_device reference ...
2022-03-18block: cancel all throttled bios in del_gendisk()Yu Kuai
Throttled bios can't be issued after del_gendisk() is done, thus it's better to cancel them immediately rather than waiting for throttle is done. For example, if user thread is throttled with low bps while it's issuing large io, and the device is deleted. The user thread will wait for a long time for io to return. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20220318130144.1066064-4-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-18block: let blkcg_gq grab request queue's refcntMing Lei
In the whole lifetime of blkcg_gq instance, ->q will be referred, such as, ->pd_free_fn() is called in blkg_free, and throtl_pd_free() still may touch the request queue via &tg->service_queue.pending_timer which is handled by throtl_pending_timer_fn(), so it is reasonable to grab request queue's refcnt by blkcg_gq instance. Previously blkcg_exit_queue() is called from blk_release_queue, and it is hard to avoid the use-after-free. But recently commit 1059699f87eb ("block: move blkcg initialization/destroy into disk allocation/release handler") is merged to for-5.18/block, it becomes simple to fix the issue by simply grabbing request queue's refcnt. Reported-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20220318130144.1066064-3-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-18block: avoid use-after-free on throttle dataMing Lei
In throtl_pending_timer_fn(), request queue is retrieved from throttle data. And tg's pending timer is deleted synchronously when releasing the associated blkg, at that time, throttle data may have been freed since commit 1059699f87eb ("block: move blkcg initialization/destroy into disk allocation/release handler") moves freeing q->td to disk_release() from blk_release_queue(). So use-after-free on q->td in throtl_pending_timer_fn can be triggered. Fixes the issue by: - do nothing in case that disk is released, when there isn't any bio to dispatch - retrieve request queue from blkg instead of throttle data for non top-level pending timer. Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20220318130144.1066064-2-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-17block: limit request dispatch loop durationShin'ichiro Kawasaki
When IO requests are made continuously and the target block device handles requests faster than request arrival, the request dispatch loop keeps on repeating to dispatch the arriving requests very long time, more than a minute. Since the loop runs as a workqueue worker task, the very long loop duration triggers workqueue watchdog timeout and BUG [1]. To avoid the very long loop duration, break the loop periodically. When opportunity to dispatch requests still exists, check need_resched(). If need_resched() returns true, the dispatch loop already consumed its time slice, then reschedule the dispatch work and break the loop. With heavy IO load, need_resched() does not return true for 20~30 seconds. To cover such case, check time spent in the dispatch loop with jiffies. If more than 1 second is spent, reschedule the dispatch work and break the loop. [1] [ 609.691437] BUG: workqueue lockup - pool cpus=10 node=1 flags=0x0 nice=-20 stuck for 35s! [ 609.701820] Showing busy workqueues and worker pools: [ 609.707915] workqueue events: flags=0x0 [ 609.712615] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256 refcnt=2 [ 609.712626] pending: drm_fb_helper_damage_work [drm_kms_helper] [ 609.712687] workqueue events_freezable: flags=0x4 [ 609.732943] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256 refcnt=2 [ 609.732952] pending: pci_pme_list_scan [ 609.732968] workqueue events_power_efficient: flags=0x80 [ 609.751947] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256 refcnt=2 [ 609.751955] pending: neigh_managed_work [ 609.752018] workqueue kblockd: flags=0x18 [ 609.769480] pwq 21: cpus=10 node=1 flags=0x0 nice=-20 active=3/256 refcnt=4 [ 609.769488] in-flight: 1020:blk_mq_run_work_fn [ 609.769498] pending: blk_mq_timeout_work, blk_mq_run_work_fn [ 609.769744] pool 21: cpus=10 node=1 flags=0x0 nice=-20 hung=35s workers=2 idle: 67 [ 639.899730] BUG: workqueue lockup - pool cpus=10 node=1 flags=0x0 nice=-20 stuck for 66s! [ 639.909513] Showing busy workqueues and worker pools: [ 639.915404] workqueue events: flags=0x0 [ 639.920197] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256 refcnt=2 [ 639.920215] pending: drm_fb_helper_damage_work [drm_kms_helper] [ 639.920365] workqueue kblockd: flags=0x18 [ 639.939932] pwq 21: cpus=10 node=1 flags=0x0 nice=-20 active=3/256 refcnt=4 [ 639.939942] in-flight: 1020:blk_mq_run_work_fn [ 639.939955] pending: blk_mq_timeout_work, blk_mq_run_work_fn [ 639.940212] pool 21: cpus=10 node=1 flags=0x0 nice=-20 hung=66s workers=2 idle: 67 Fixes: 6e6fcbc27e778 ("blk-mq: support batching dispatch in case of io") Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> Cc: stable@vger.kernel.org # v5.10+ Link: https://lore.kernel.org/linux-block/20220310091649.zypaem5lkyfadymg@shindev/ Link: https://lore.kernel.org/r/20220318022641.133484-1-shinichiro.kawasaki@wdc.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-16fs: Convert __set_page_dirty_buffers to block_dirty_folioMatthew Wilcox (Oracle)
Convert all callers; mostly this is just changing the aops to point at it, but a few implementations need a little more work. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Tested-by: Mike Marshall <hubcap@omnibond.com> # orangefs Tested-by: David Howells <dhowells@redhat.com> # afs
2022-03-16block/bfq-iosched: Fix spelling mistake "tenative" -> "tentative"Colin Ian King
There is a spelling mistake in a bfq_log_bfqq message. Fix it. Signed-off-by: Colin Ian King <colin.i.king@gmail.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20220315221539.2959167-1-colin.i.king@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-15fs: Turn block_invalidatepage into block_invalidate_folioMatthew Wilcox (Oracle)
Remove special-casing of a NULL invalidatepage, since there is no more block_invalidatepage. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Tested-by: Mike Marshall <hubcap@omnibond.com> # orangefs Tested-by: David Howells <dhowells@redhat.com> # afs
2022-03-14block: don't merge across cgroup boundaries if blkcg is enabledTejun Heo
blk-iocost and iolatency are cgroup aware rq-qos policies but they didn't disable merges across different cgroups. This obviously can lead to accounting and control errors but more importantly to priority inversions - e.g. an IO which belongs to a higher priority cgroup or IO class may end up getting throttled incorrectly because it gets merged to an IO issued from a low priority cgroup. Fix it by adding blk_cgroup_mergeable() which is called from merge paths and rejects cross-cgroup and cross-issue_as_root merges. Signed-off-by: Tejun Heo <tj@kernel.org> Fixes: d70675121546 ("block: introduce blk-iolatency io controller") Cc: stable@vger.kernel.org # v4.19+ Cc: Josef Bacik <jbacik@fb.com> Link: https://lore.kernel.org/r/Yi/eE/6zFNyWJ+qd@slm.duckdns.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-14block: fix rq-qos breakage from skipping rq_qos_done_bio()Tejun Heo
a647a524a467 ("block: don't call rq_qos_ops->done_bio if the bio isn't tracked") made bio_endio() skip rq_qos_done_bio() if BIO_TRACKED is not set. While this fixed a potential oops, it also broke blk-iocost by skipping the done_bio callback for merged bios. Before, whether a bio goes through rq_qos_throttle() or rq_qos_merge(), rq_qos_done_bio() would be called on the bio on completion with BIO_TRACKED distinguishing the former from the latter. rq_qos_done_bio() is not called for bios which wenth through rq_qos_merge(). This royally confuses blk-iocost as the merged bios never finish and are considered perpetually in-flight. One reliably reproducible failure mode is an intermediate cgroup geting stuck active preventing its children from being activated due to the leaf-only rule, leading to loss of control. The following is from resctl-bench protection scenario which emulates isolating a web server like workload from a memory bomb run on an iocost configuration which should yield a reasonable level of protection. # cat /sys/block/nvme2n1/device/model Samsung SSD 970 PRO 512GB # cat /sys/fs/cgroup/io.cost.model 259:0 ctrl=user model=linear rbps=834913556 rseqiops=93622 rrandiops=102913 wbps=618985353 wseqiops=72325 wrandiops=71025 # cat /sys/fs/cgroup/io.cost.qos 259:0 enable=1 ctrl=user rpct=95.00 rlat=18776 wpct=95.00 wlat=8897 min=60.00 max=100.00 # resctl-bench -m 29.6G -r out.json run protection::scenario=mem-hog,loops=1 ... Memory Hog Summary ================== IO Latency: R p50=242u:336u/2.5m p90=794u:1.4m/7.5m p99=2.7m:8.0m/62.5m max=8.0m:36.4m/350m W p50=221u:323u/1.5m p90=709u:1.2m/5.5m p99=1.5m:2.5m/9.5m max=6.9m:35.9m/350m Isolation and Request Latency Impact Distributions: min p01 p05 p10 p25 p50 p75 p90 p95 p99 max mean stdev isol% 15.90 15.90 15.90 40.05 57.24 59.07 60.01 74.63 74.63 90.35 90.35 58.12 15.82 lat-imp% 0 0 0 0 0 4.55 14.68 15.54 233.5 548.1 548.1 53.88 143.6 Result: isol=58.12:15.82% lat_imp=53.88%:143.6 work_csv=100.0% missing=3.96% The isolation result of 58.12% is close to what this device would show without any IO control. Fix it by introducing a new flag BIO_QOS_MERGED to mark merged bios and calling rq_qos_done_bio() on them too. For consistency and clarity, rename BIO_TRACKED to BIO_QOS_THROTTLED. The flag checks are moved into rq_qos_done_bio() so that it's next to the code paths that set the flags. With the patch applied, the above same benchmark shows: # resctl-bench -m 29.6G -r out.json run protection::scenario=mem-hog,loops=1 ... Memory Hog Summary ================== IO Latency: R p50=123u:84.4u/985u p90=322u:256u/2.5m p99=1.6m:1.4m/9.5m max=11.1m:36.0m/350m W p50=429u:274u/995u p90=1.7m:1.3m/4.5m p99=3.4m:2.7m/11.5m max=7.9m:5.9m/26.5m Isolation and Request Latency Impact Distributions: min p01 p05 p10 p25 p50 p75 p90 p95 p99 max mean stdev isol% 84.91 84.91 89.51 90.73 92.31 94.49 96.36 98.04 98.71 100.0 100.0 94.42 2.81 lat-imp% 0 0 0 0 0 2.81 5.73 11.11 13.92 17.53 22.61 4.10 4.68 Result: isol=94.42:2.81% lat_imp=4.10%:4.68 work_csv=58.34% missing=0% Signed-off-by: Tejun Heo <tj@kernel.org> Fixes: a647a524a467 ("block: don't call rq_qos_ops->done_bio if the bio isn't tracked") Cc: stable@vger.kernel.org # v5.15+ Cc: Ming Lei <ming.lei@redhat.com> Cc: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/Yi7rdrzQEHjJLGKB@slm.duckdns.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-14block: release rq qos structures for queue without diskMing Lei
blkcg_init_queue() may add rq qos structures to request queue, previously blk_cleanup_queue() calls rq_qos_exit() to release them, but commit 8e141f9eb803 ("block: drain file system I/O on del_gendisk") moves rq_qos_exit() into del_gendisk(), so memory leak is caused because queues may not have disk, such as un-present scsi luns, nvme admin queue, ... Fixes the issue by adding rq_qos_exit() to blk_cleanup_queue() back. BTW, v5.18 won't need this patch any more since we move blkcg_init_queue()/blkcg_exit_queue() into disk allocation/release handler, and patches have been in for-5.18/block. Cc: Christoph Hellwig <hch@lst.de> Cc: stable@vger.kernel.org Fixes: 8e141f9eb803 ("block: drain file system I/O on del_gendisk") Reported-by: syzbot+b42749a851a47a0f581b@syzkaller.appspotmail.com Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20220314043018.177141-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-11block: flush plug based on hardware and software queue orderJens Axboe
We used to sort the plug list if we had multiple queues before dispatching requests to the IO scheduler. This usually isn't needed, but for certain workloads that interleave requests to disks, it's a less efficient to process the plug list one-by-one if everything is interleaved. Don't sort the list, but skip through it and flush out entries that have the same target at the same time. Fixes: df87eb0fce8f ("block: get rid of plug list sorting") Reported-and-tested-by: Song Liu <song@kernel.org> Reviewed-by: Song Liu <songliubraving@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-11block: ensure plug merging checks the correct queue at least onceJens Axboe
Song reports that a RAID rebuild workload runs much slower recently, and it is seeing a lot less merging than it did previously. The reason is that a previous commit reduced the amount of work we do for plug merging. RAID rebuild interleaves requests between disks, so a last-entry check in plug merging always misses a merge opportunity since we always find a different disk than what we are looking for. Modify the logic such that it's still a one-hit cache, but ensure that we check enough to find the right target before giving up. Fixes: d38a9c04c0d5 ("block: only check previous entry for plug merge attempt") Reported-and-tested-by: Song Liu <song@kernel.org> Reviewed-by: Song Liu <songliubraving@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-08block: move rq_qos_exit() into disk_release()Ming Lei
Keep all teardown of file system I/O related functionality in one place. There can't be file system I/O in disk_release(), so it is safe to move rq_qos_exit() there. Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20220308055200.735835-15-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-08block: do more work in elevator_exitChristoph Hellwig
Move the calls to ioc_clear_queue and blk_mq_sched_free_rqs into elevator_exit. Except for one call where we know we can't have io_cq structures yet these always go together, and that extra call in an error path is harmless. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20220308055200.735835-14-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-08block: move blk_exit_queue into disk_releaseMing Lei
There can't be file system I/O in disk_release(), so move the call to blk_exit_queue() there, preparing to have the teardown of file system I/O only functionality in one place, when the gendisk that is needed for it is torn down. We still need to freeze queue here since the request is freed after the bio is completed and passthrough request rely on scheduler tags as well. The disk can be released before or after queue is cleaned up, and we have to free the scheduler request pool before blk_cleanup_queue returns, while the static request pool has to be freed before exiting the I/O scheduler. Signed-off-by: Ming Lei <ming.lei@redhat.com> [hch: rebased, updated the commit log] Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20220308055200.735835-13-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-08block: move q_usage_counter release into blk_queue_releaseMing Lei
After blk_cleanup_queue() returns, disk may not be released yet, so probably bio may still be submitted and ->q_usage_counter may be touched, so far this way seems safe, but not good from API's viewpoint. Move the release q_usage_counter into blk_queue_release(). Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20220308055200.735835-12-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-08block: don't remove hctx debugfs dir from blk_mq_exit_queueMing Lei
The queue's top debugfs dir is removed from blk_release_queue(), so all hctx's debugfs dirs are removed from there. Given blk_mq_exit_queue() is only called from blk_cleanup_queue(), it isn't necessary to remove hctx debugfs from blk_mq_exit_queue(). So remove it from blk_mq_exit_queue(). Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20220308055200.735835-11-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-08block: move blkcg initialization/destroy into disk allocation/release handlerMing Lei
blkcg works on FS bio level, so it is reasonable to make both blkcg and gendisk sharing same lifetime. Meantime there won't be any FS IO when releasing disk, so safe to move blkcg initialization/destroy into disk allocation/release handler Long term, we can move blkcg into gendisk completely. Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20220308055200.735835-10-hch@lst.de [axboe: fixup missing blk-cgroup.h include] Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-08blk-mq: handle already freed tags gracefully in blk_mq_free_rqsMing Lei
To simplify further changes allow for double calling blk_mq_free_rqs on a queue. Signed-off-by: Ming Lei <ming.lei@redhat.com> [hch: split out from a larger patch] Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20220308055200.735835-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-08blk-mq: do not include passthrough requests in I/O accountingChristoph Hellwig
I/O accounting buckets I/O into the read/write/discard categories into which passthrough I/O does not fit at all. It also accounts to the block_device, which may not even exist for passthrough I/O. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20220308055200.735835-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-08blk-mq: manage hctx map via xarrayMing Lei
First code becomes more clean by switching to xarray from plain array. Second use-after-free on q->queue_hw_ctx can be fixed because queue_for_each_hw_ctx() may be run when updating nr_hw_queues is in-progress. With this patch, q->hctx_table is defined as xarray, and this structure will share same lifetime with request queue, so queue_for_each_hw_ctx() can use q->hctx_table to lookup hctx reliably. Reported-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20220308073219.91173-7-ming.lei@redhat.com [axboe: fix blk_mq_hw_ctx forward declaration] Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-08blk-mq: prepare for implementing hctx table via xarrayMing Lei
It is inevitable to cause use-after-free on q->queue_hw_ctx between queue_for_each_hw_ctx() and blk_mq_update_nr_hw_queues(). And converting to xarray can fix the uaf, meantime code gets cleaner. Prepare for converting q->queue_hctx_ctx into xarray, one thing is that xa_for_each() can only accept 'unsigned long' as index, so changes type of hctx index of queue_for_each_hw_ctx() into 'unsigned long'. Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20220308073219.91173-6-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-08blk-mq: reconfigure poll after queue map is changedMing Lei
queue map can be changed when updating nr_hw_queues, so we need to reconfigure queue's poll capability. Add one helper for doing this job. Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20220308073219.91173-4-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-08blk-mq: simplify reallocation of hw ctxs a bitMing Lei
blk_mq_alloc_and_init_hctx() has already taken reuse into account, so no need to do it outside, then we can simplify blk_mq_realloc_hw_ctxs(). Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20220308073219.91173-3-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-08blk-mq: figure out correct numa node for hw queueMing Lei
The current code always uses default queue map and hw queue index for figuring out the numa node for hw queue, this way isn't correct because blk-mq supports three queue maps, and the correct queue map should be used for the specified hw queue. Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20220308073219.91173-2-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-08Revert "Revert "block, bfq: honor already-setup queue merges""Paolo Valente
A crash [1] happened to be triggered in conjunction with commit 2d52c58b9c9b ("block, bfq: honor already-setup queue merges"). The latter was then reverted by commit ebc69e897e17 ("Revert "block, bfq: honor already-setup queue merges""). Yet, the reverted commit was not the one introducing the bug. In fact, it actually triggered a UAF introduced by a different commit, and now fixed by commit d29bd41428cf ("block, bfq: reset last_bfqq_created on group change"). So, there is no point in keeping commit 2d52c58b9c9b ("block, bfq: honor already-setup queue merges") out. This commit restores it. [1] https://bugzilla.kernel.org/show_bug.cgi?id=214503 Reported-by: Holger Hoffstätte <holger@applied-asynchrony.com> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Link: https://lore.kernel.org/r/20211125181510.15004-1-paolo.valente@linaro.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-08block: fix blk_mq_attempt_bio_merge and rq_qos_throttle protectionShin'ichiro Kawasaki
Commit 9d497e2941c3 ("block: don't protect submit_bio_checks by q_usage_counter") moved blk_mq_attempt_bio_merge and rq_qos_throttle calls out of q_usage_counter protection. However, these functions require q_usage_counter protection. The blk_mq_attempt_bio_merge call without the protection resulted in blktests block/005 failure with KASAN null- ptr-deref or use-after-free at bio merge. The rq_qos_throttle call without the protection caused kernel hang at qos throttle. To fix the failures, move the blk_mq_attempt_bio_merge and rq_qos_throttle calls back to q_usage_counter protection. Fixes: 9d497e2941c3 ("block: don't protect submit_bio_checks by q_usage_counter") Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> Link: https://lore.kernel.org/r/20220308080915.3473689-1-shinichiro.kawasaki@wdc.com Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-07block: remove bio_devnameChristoph Hellwig
All callers are gone, so remove this wrapper. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20220304180105.409765-11-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-07block: remove handle_bad_sectorChristoph Hellwig
Use the %pg format specifier instead of the stack hungry bdevname function, and remove handle_bad_sector given that it is not pointless. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20220304180105.409765-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-07block: fix and cleanup bio_check_roChristoph Hellwig
Don't use a WARN_ON when printing a potentially user triggered condition. Also don't print the partno when the block device name already includes it, and use the %pg specifier to simplify printing the block device name. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20220304180105.409765-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-05bfq: fix use-after-free in bfq_dispatch_requestZhang Wensheng
KASAN reports a use-after-free report when doing normal scsi-mq test [69832.239032] ================================================================== [69832.241810] BUG: KASAN: use-after-free in bfq_dispatch_request+0x1045/0x44b0 [69832.243267] Read of size 8 at addr ffff88802622ba88 by task kworker/3:1H/155 [69832.244656] [69832.245007] CPU: 3 PID: 155 Comm: kworker/3:1H Not tainted 5.10.0-10295-g576c6382529e #8 [69832.246626] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014 [69832.249069] Workqueue: kblockd blk_mq_run_work_fn [69832.250022] Call Trace: [69832.250541] dump_stack+0x9b/0xce [69832.251232] ? bfq_dispatch_request+0x1045/0x44b0 [69832.252243] print_address_description.constprop.6+0x3e/0x60 [69832.253381] ? __cpuidle_text_end+0x5/0x5 [69832.254211] ? vprintk_func+0x6b/0x120 [69832.254994] ? bfq_dispatch_request+0x1045/0x44b0 [69832.255952] ? bfq_dispatch_request+0x1045/0x44b0 [69832.256914] kasan_report.cold.9+0x22/0x3a [69832.257753] ? bfq_dispatch_request+0x1045/0x44b0 [69832.258755] check_memory_region+0x1c1/0x1e0 [69832.260248] bfq_dispatch_request+0x1045/0x44b0 [69832.261181] ? bfq_bfqq_expire+0x2440/0x2440 [69832.262032] ? blk_mq_delay_run_hw_queues+0xf9/0x170 [69832.263022] __blk_mq_do_dispatch_sched+0x52f/0x830 [69832.264011] ? blk_mq_sched_request_inserted+0x100/0x100 [69832.265101] __blk_mq_sched_dispatch_requests+0x398/0x4f0 [69832.266206] ? blk_mq_do_dispatch_ctx+0x570/0x570 [69832.267147] ? __switch_to+0x5f4/0xee0 [69832.267898] blk_mq_sched_dispatch_requests+0xdf/0x140 [69832.268946] __blk_mq_run_hw_queue+0xc0/0x270 [69832.269840] blk_mq_run_work_fn+0x51/0x60 [69832.278170] process_one_work+0x6d4/0xfe0 [69832.278984] worker_thread+0x91/0xc80 [69832.279726] ? __kthread_parkme+0xb0/0x110 [69832.280554] ? process_one_work+0xfe0/0xfe0 [69832.281414] kthread+0x32d/0x3f0 [69832.282082] ? kthread_park+0x170/0x170 [69832.282849] ret_from_fork+0x1f/0x30 [69832.283573] [69832.283886] Allocated by task 7725: [69832.284599] kasan_save_stack+0x19/0x40 [69832.285385] __kasan_kmalloc.constprop.2+0xc1/0xd0 [69832.286350] kmem_cache_alloc_node+0x13f/0x460 [69832.287237] bfq_get_queue+0x3d4/0x1140 [69832.287993] bfq_get_bfqq_handle_split+0x103/0x510 [69832.289015] bfq_init_rq+0x337/0x2d50 [69832.289749] bfq_insert_requests+0x304/0x4e10 [69832.290634] blk_mq_sched_insert_requests+0x13e/0x390 [69832.291629] blk_mq_flush_plug_list+0x4b4/0x760 [69832.292538] blk_flush_plug_list+0x2c5/0x480 [69832.293392] io_schedule_prepare+0xb2/0xd0 [69832.294209] io_schedule_timeout+0x13/0x80 [69832.295014] wait_for_common_io.constprop.1+0x13c/0x270 [69832.296137] submit_bio_wait+0x103/0x1a0 [69832.296932] blkdev_issue_discard+0xe6/0x160 [69832.297794] blk_ioctl_discard+0x219/0x290 [69832.298614] blkdev_common_ioctl+0x50a/0x1750 [69832.304715] blkdev_ioctl+0x470/0x600 [69832.305474] block_ioctl+0xde/0x120 [69832.306232] vfs_ioctl+0x6c/0xc0 [69832.306877] __se_sys_ioctl+0x90/0xa0 [69832.307629] do_syscall_64+0x2d/0x40 [69832.308362] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [69832.309382] [69832.309701] Freed by task 155: [69832.310328] kasan_save_stack+0x19/0x40 [69832.311121] kasan_set_track+0x1c/0x30 [69832.311868] kasan_set_free_info+0x1b/0x30 [69832.312699] __kasan_slab_free+0x111/0x160 [69832.313524] kmem_cache_free+0x94/0x460 [69832.314367] bfq_put_queue+0x582/0x940 [69832.315112] __bfq_bfqd_reset_in_service+0x166/0x1d0 [69832.317275] bfq_bfqq_expire+0xb27/0x2440 [69832.318084] bfq_dispatch_request+0x697/0x44b0 [69832.318991] __blk_mq_do_dispatch_sched+0x52f/0x830 [69832.319984] __blk_mq_sched_dispatch_requests+0x398/0x4f0 [69832.321087] blk_mq_sched_dispatch_requests+0xdf/0x140 [69832.322225] __blk_mq_run_hw_queue+0xc0/0x270 [69832.323114] blk_mq_run_work_fn+0x51/0x60 [69832.323942] process_one_work+0x6d4/0xfe0 [69832.324772] worker_thread+0x91/0xc80 [69832.325518] kthread+0x32d/0x3f0 [69832.326205] ret_from_fork+0x1f/0x30 [69832.326932] [69832.338297] The buggy address belongs to the object at ffff88802622b968 [69832.338297] which belongs to the cache bfq_queue of size 512 [69832.340766] The buggy address is located 288 bytes inside of [69832.340766] 512-byte region [ffff88802622b968, ffff88802622bb68) [69832.343091] The buggy address belongs to the page: [69832.344097] page:ffffea0000988a00 refcount:1 mapcount:0 mapping:0000000000000000 index:0xffff88802622a528 pfn:0x26228 [69832.346214] head:ffffea0000988a00 order:2 compound_mapcount:0 compound_pincount:0 [69832.347719] flags: 0x1fffff80010200(slab|head) [69832.348625] raw: 001fffff80010200 ffffea0000dbac08 ffff888017a57650 ffff8880179fe840 [69832.354972] raw: ffff88802622a528 0000000000120008 00000001ffffffff 0000000000000000 [69832.356547] page dumped because: kasan: bad access detected [69832.357652] [69832.357970] Memory state around the buggy address: [69832.358926] ffff88802622b980: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [69832.360358] ffff88802622ba00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [69832.361810] >ffff88802622ba80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [69832.363273] ^ [69832.363975] ffff88802622bb00: fb fb fb fb fb fb fb fb fb fb fb fb fb fc fc fc [69832.375960] ffff88802622bb80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc [69832.377405] ================================================================== In bfq_dispatch_requestfunction, it may have function call: bfq_dispatch_request __bfq_dispatch_request bfq_select_queue bfq_bfqq_expire __bfq_bfqd_reset_in_service bfq_put_queue kmem_cache_free In this function call, in_serv_queue has beed expired and meet the conditions to free. In the function bfq_dispatch_request, the address of in_serv_queue pointing to has been released. For getting the value of idle_timer_disabled, it will get flags value from the address which in_serv_queue pointing to, then the problem of use-after-free happens; Fix the problem by check in_serv_queue == bfqd->in_service_queue, to get the value of idle_timer_disabled if in_serve_queue is equel to bfqd->in_service_queue. If the space of in_serv_queue pointing has been released, this judge will aviod use-after-free problem. And if in_serv_queue may be expired or finished, the idle_timer_disabled will be false which would not give effects to bfq_update_dispatch_stats. Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Zhang Wensheng <zhangwensheng5@huawei.com> Link: https://lore.kernel.org/r/20220303070334.3020168-1-zhangwensheng5@huawei.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-02-28blk-crypto: show crypto capabilities in sysfsEric Biggers
Add sysfs files that expose the inline encryption capabilities of request queues: /sys/block/$disk/queue/crypto/max_dun_bits /sys/block/$disk/queue/crypto/modes/$mode /sys/block/$disk/queue/crypto/num_keyslots Userspace can use these new files to decide what encryption settings to use, or whether to use inline encryption at all. This also brings the crypto capabilities in line with the other queue properties, which are already discoverable via the queue directory in sysfs. Design notes: - Place the new files in a new subdirectory "crypto" to group them together and to avoid complicating the main "queue" directory. This also makes it possible to replace "crypto" with a symlink later if we ever make the blk_crypto_profiles into real kobjects (see below). - It was necessary to define a new kobject that corresponds to the crypto subdirectory. For now, this kobject just contains a pointer to the blk_crypto_profile. Note that multiple queues (and hence multiple such kobjects) may refer to the same blk_crypto_profile. An alternative design would more closely match the current kernel data structures: the blk_crypto_profile could be a kobject itself, located directly under the host controller device's kobject, while /sys/block/$disk/queue/crypto would be a symlink to it. I decided not to do that for now because it would require a lot more changes, such as no longer embedding blk_crypto_profile in other structures, and also because I'm not sure we can rule out moving the crypto capabilities into 'struct queue_limits' in the future. (Even if multiple queues share the same crypto engine, maybe the supported data unit sizes could differ due to other queue properties.) It would also still be possible to switch to that design later without breaking userspace, by replacing the directory with a symlink. - Use "max_dun_bits" instead of "max_dun_bytes". Currently, the kernel internally stores this value in bytes, but that's an implementation detail. It probably makes more sense to talk about this value in bits, and choosing bits is more future-proof. - "modes" is a sub-subdirectory, since there may be multiple supported crypto modes, sysfs is supposed to have one value per file, and it makes sense to group all the mode files together. - Each mode had to be named. The crypto API names like "xts(aes)" are not appropriate because they don't specify the key size. Therefore, I assigned new names. The exact names chosen are arbitrary, but they happen to match the names used in log messages in fs/crypto/. - The "num_keyslots" file is a bit different from the others in that it is only useful to know for performance reasons. However, it's included as it can still be useful. For example, a user might not want to use inline encryption if there aren't very many keyslots. Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Eric Biggers <ebiggers@google.com> Link: https://lore.kernel.org/r/20220124215938.2769-4-ebiggers@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-02-28block: don't delete queue kobject before its childrenEric Biggers
kobjects aren't supposed to be deleted before their child kobjects are deleted. Apparently this is usually benign; however, a WARN will be triggered if one of the child kobjects has a named attribute group: sysfs group 'modes' not found for kobject 'crypto' WARNING: CPU: 0 PID: 1 at fs/sysfs/group.c:278 sysfs_remove_group+0x72/0x80 ... Call Trace: sysfs_remove_groups+0x29/0x40 fs/sysfs/group.c:312 __kobject_del+0x20/0x80 lib/kobject.c:611 kobject_cleanup+0xa4/0x140 lib/kobject.c:696 kobject_release lib/kobject.c:736 [inline] kref_put include/linux/kref.h:65 [inline] kobject_put+0x53/0x70 lib/kobject.c:753 blk_crypto_sysfs_unregister+0x10/0x20 block/blk-crypto-sysfs.c:159 blk_unregister_queue+0xb0/0x110 block/blk-sysfs.c:962 del_gendisk+0x117/0x250 block/genhd.c:610 Fix this by moving the kobject_del() and the corresponding kobject_uevent() to the correct place. Fixes: 2c2086afc2b8 ("block: Protect less code with sysfs_lock in blk_{un,}register_queue()") Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Eric Biggers <ebiggers@google.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20220124215938.2769-3-ebiggers@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-02-28block: simplify calling convention of elv_unregister_queue()Eric Biggers
Make elv_unregister_queue() a no-op if q->elevator is NULL or is not registered. This simplifies the existing callers, as well as the future caller in the error path of blk_register_queue(). Also don't bother checking whether q is NULL, since it never is. Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Eric Biggers <ebiggers@google.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20220124215938.2769-2-ebiggers@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-02-27block: remove redundant semicolonNian Yanchuan
Remove redundant semicolon from block/bdev.c Signed-off-by: Nian Yanchuan <yanchuan@nfschina.com> Link: https://lore.kernel.org/r/20220227170124.GA14658@localhost.localdomain Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-02-27block: default BLOCK_LEGACY_AUTOLOAD to yChristoph Hellwig
As Luis reported, losetup currently doesn't properly create the loop device without this if the device node already exists because old scripts created it manually. So default to y for now and remove the aggressive removal schedule. Reported-by: Luis Chamberlain <mcgrof@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20220225181440.1351591-1-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-02-22block: clear iocb->private in blkdev_bio_end_io_async()Stefano Garzarella
iocb_bio_iopoll() expects iocb->private to be cleared before releasing the bio. We already do this in blkdev_bio_end_io(), but we forgot in the recently added blkdev_bio_end_io_async(). Fixes: 54a88eb838d3 ("block: add single bio async direct IO helper") Cc: asml.silence@gmail.com Signed-off-by: Stefano Garzarella <sgarzare@redhat.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20220211090136.44471-1-sgarzare@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-02-22block: update io_ticks when io hangZhang Wensheng
When the inflight IOs are slow and no new IOs are issued, we expect iostat could manifest the IO hang problem. However after commit 5b18b5a73760 ("block: delete part_round_stats and switch to less precise counting"), io_tick and time_in_queue will not be updated until the end of IO, and the avgqu-sz and %util columns of iostat will be zero. Because it has using stat.nsecs accumulation to express time_in_queue which is not suitable to change, and may %util will express the status better when io hang occur. To fix io_ticks, we use update_io_ticks and inflight to update io_ticks when diskstats_show and part_stat_show been called. Fixes: 5b18b5a73760 ("block: delete part_round_stats and switch to less precise counting") Signed-off-by: Zhang Wensheng <zhangwensheng5@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20220217064247.4041435-1-zhangwensheng5@huawei.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-02-18block, bfq: don't move oom_bfqqYu Kuai
Our test report a UAF: [ 2073.019181] ================================================================== [ 2073.019188] BUG: KASAN: use-after-free in __bfq_put_async_bfqq+0xa0/0x168 [ 2073.019191] Write of size 8 at addr ffff8000ccf64128 by task rmmod/72584 [ 2073.019192] [ 2073.019196] CPU: 0 PID: 72584 Comm: rmmod Kdump: loaded Not tainted 4.19.90-yk #5 [ 2073.019198] Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015 [ 2073.019200] Call trace: [ 2073.019203] dump_backtrace+0x0/0x310 [ 2073.019206] show_stack+0x28/0x38 [ 2073.019210] dump_stack+0xec/0x15c [ 2073.019216] print_address_description+0x68/0x2d0 [ 2073.019220] kasan_report+0x238/0x2f0 [ 2073.019224] __asan_store8+0x88/0xb0 [ 2073.019229] __bfq_put_async_bfqq+0xa0/0x168 [ 2073.019233] bfq_put_async_queues+0xbc/0x208 [ 2073.019236] bfq_pd_offline+0x178/0x238 [ 2073.019240] blkcg_deactivate_policy+0x1f0/0x420 [ 2073.019244] bfq_exit_queue+0x128/0x178 [ 2073.019249] blk_mq_exit_sched+0x12c/0x160 [ 2073.019252] elevator_exit+0xc8/0xd0 [ 2073.019256] blk_exit_queue+0x50/0x88 [ 2073.019259] blk_cleanup_queue+0x228/0x3d8 [ 2073.019267] null_del_dev+0xfc/0x1e0 [null_blk] [ 2073.019274] null_exit+0x90/0x114 [null_blk] [ 2073.019278] __arm64_sys_delete_module+0x358/0x5a0 [ 2073.019282] el0_svc_common+0xc8/0x320 [ 2073.019287] el0_svc_handler+0xf8/0x160 [ 2073.019290] el0_svc+0x10/0x218 [ 2073.019291] [ 2073.019294] Allocated by task 14163: [ 2073.019301] kasan_kmalloc+0xe0/0x190 [ 2073.019305] kmem_cache_alloc_node_trace+0x1cc/0x418 [ 2073.019308] bfq_pd_alloc+0x54/0x118 [ 2073.019313] blkcg_activate_policy+0x250/0x460 [ 2073.019317] bfq_create_group_hierarchy+0x38/0x110 [ 2073.019321] bfq_init_queue+0x6d0/0x948 [ 2073.019325] blk_mq_init_sched+0x1d8/0x390 [ 2073.019330] elevator_switch_mq+0x88/0x170 [ 2073.019334] elevator_switch+0x140/0x270 [ 2073.019338] elv_iosched_store+0x1a4/0x2a0 [ 2073.019342] queue_attr_store+0x90/0xe0 [ 2073.019348] sysfs_kf_write+0xa8/0xe8 [ 2073.019351] kernfs_fop_write+0x1f8/0x378 [ 2073.019359] __vfs_write+0xe0/0x360 [ 2073.019363] vfs_write+0xf0/0x270 [ 2073.019367] ksys_write+0xdc/0x1b8 [ 2073.019371] __arm64_sys_write+0x50/0x60 [ 2073.019375] el0_svc_common+0xc8/0x320 [ 2073.019380] el0_svc_handler+0xf8/0x160 [ 2073.019383] el0_svc+0x10/0x218 [ 2073.019385] [ 2073.019387] Freed by task 72584: [ 2073.019391] __kasan_slab_free+0x120/0x228 [ 2073.019394] kasan_slab_free+0x10/0x18 [ 2073.019397] kfree+0x94/0x368 [ 2073.019400] bfqg_put+0x64/0xb0 [ 2073.019404] bfqg_and_blkg_put+0x90/0xb0 [ 2073.019408] bfq_put_queue+0x220/0x228 [ 2073.019413] __bfq_put_async_bfqq+0x98/0x168 [ 2073.019416] bfq_put_async_queues+0xbc/0x208 [ 2073.019420] bfq_pd_offline+0x178/0x238 [ 2073.019424] blkcg_deactivate_policy+0x1f0/0x420 [ 2073.019429] bfq_exit_queue+0x128/0x178 [ 2073.019433] blk_mq_exit_sched+0x12c/0x160 [ 2073.019437] elevator_exit+0xc8/0xd0 [ 2073.019440] blk_exit_queue+0x50/0x88 [ 2073.019443] blk_cleanup_queue+0x228/0x3d8 [ 2073.019451] null_del_dev+0xfc/0x1e0 [null_blk] [ 2073.019459] null_exit+0x90/0x114 [null_blk] [ 2073.019462] __arm64_sys_delete_module+0x358/0x5a0 [ 2073.019467] el0_svc_common+0xc8/0x320 [ 2073.019471] el0_svc_handler+0xf8/0x160 [ 2073.019474] el0_svc+0x10/0x218 [ 2073.019475] [ 2073.019479] The buggy address belongs to the object at ffff8000ccf63f00 which belongs to the cache kmalloc-1024 of size 1024 [ 2073.019484] The buggy address is located 552 bytes inside of 1024-byte region [ffff8000ccf63f00, ffff8000ccf64300) [ 2073.019486] The buggy address belongs to the page: [ 2073.019492] page:ffff7e000333d800 count:1 mapcount:0 mapping:ffff8000c0003a00 index:0x0 compound_mapcount: 0 [ 2073.020123] flags: 0x7ffff0000008100(slab|head) [ 2073.020403] raw: 07ffff0000008100 ffff7e0003334c08 ffff7e00001f5a08 ffff8000c0003a00 [ 2073.020409] raw: 0000000000000000 00000000001c001c 00000001ffffffff 0000000000000000 [ 2073.020411] page dumped because: kasan: bad access detected [ 2073.020412] [ 2073.020414] Memory state around the buggy address: [ 2073.020420] ffff8000ccf64000: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ 2073.020424] ffff8000ccf64080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ 2073.020428] >ffff8000ccf64100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ 2073.020430] ^ [ 2073.020434] ffff8000ccf64180: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ 2073.020438] ffff8000ccf64200: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ 2073.020439] ================================================================== The same problem exist in mainline as well. This is because oom_bfqq is moved to a non-root group, thus root_group is freed earlier. Thus fix the problem by don't move oom_bfqq. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Acked-by: Paolo Valente <paolo.valente@linaro.org> Link: https://lore.kernel.org/r/20220129015924.3958918-4-yukuai3@huawei.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-02-18block, bfq: avoid moving bfqq to it's parent bfqgYu Kuai
Moving bfqq to it's parent bfqg is pointless. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20220129015924.3958918-3-yukuai3@huawei.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-02-18block, bfq: cleanup bfq_bfqq_to_bfqg()Yu Kuai
Use bfq_group() instead, which do the same thing. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Acked-by: Paolo Valente <paolo.valente@linaro.org> Link: https://lore.kernel.org/r/20220129015924.3958918-2-yukuai3@huawei.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-02-17block/wbt: fix negative inflight counter when remove scsi deviceLaibin Qiu
Now that we disable wbt by set WBT_STATE_OFF_DEFAULT in wbt_disable_default() when switch elevator to bfq. And when we remove scsi device, wbt will be enabled by wbt_enable_default. If it become false positive between wbt_wait() and wbt_track() when submit write request. The following is the scenario that triggered the problem. T1 T2 T3 elevator_switch_mq bfq_init_queue wbt_disable_default <= Set rwb->enable_state (OFF) Submit_bio blk_mq_make_request rq_qos_throttle <= rwb->enable_state (OFF) scsi_remove_device sd_remove del_gendisk blk_unregister_queue elv_unregister_queue wbt_enable_default <= Set rwb->enable_state (ON) q_qos_track <= rwb->enable_state (ON) ^^^^^^ this request will mark WBT_TRACKED without inflight add and will lead to drop rqw->inflight to -1 in wbt_done() which will trigger IO hung. Fix this by move wbt_enable_default() from elv_unregister to bfq_exit_queue(). Only re-enable wbt when bfq exit. Fixes: 76a8040817b4b ("blk-wbt: make sure throttle is enabled properly") Remove oneline stale comment, and kill one oneshot local variable. Signed-off-by: Ming Lei <ming.lei@rehdat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/linux-block/20211214133103.551813-1-qiulaibin@huawei.com/ Signed-off-by: Laibin Qiu <qiulaibin@huawei.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-02-17block: fix surprise removal for drivers calling blk_set_queue_dyingChristoph Hellwig
Various block drivers call blk_set_queue_dying to mark a disk as dead due to surprise removal events, but since commit 8e141f9eb803 that doesn't work given that the GD_DEAD flag needs to be set to stop I/O. Replace the driver calls to blk_set_queue_dying with a new (and properly documented) blk_mark_disk_dead API, and fold blk_set_queue_dying into the only remaining caller. Fixes: 8e141f9eb803 ("block: drain file system I/O on del_gendisk") Reported-by: Markus Blöchl <markus.bloechl@ipetronik.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Link: https://lore.kernel.org/r/20220217075231.1140-1-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-02-17block-map: add __GFP_ZERO flag for alloc_page in function bio_copy_kernHaimin Zhang
Add __GFP_ZERO flag for alloc_page in function bio_copy_kern to initialize the buffer of a bio. Signed-off-by: Haimin Zhang <tcs.kernel@gmail.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20220216084038.15635-1-tcs.kernel@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-02-16block/bfq_wf2q: correct weight to ioprioYahu Gao
The return value is ioprio * BFQ_WEIGHT_CONVERSION_COEFF or 0. What we want is ioprio or 0. Correct this by changing the calculation. Signed-off-by: Yahu Gao <gaoyahu19@gmail.com> Acked-by: Paolo Valente <paolo.valente@linaro.org> Link: https://lore.kernel.org/r/20220107065859.25689-1-gaoyahu19@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>