summaryrefslogtreecommitdiffstats
path: root/block
AgeCommit message (Collapse)Author
2020-09-09blk-stat: make q->stats->lock irqsafeTejun Heo
commit e11d80a849e010f78243bb6f6af7dccef3a71a90 upstream. blk-iocost calls blk_stat_enable_accounting() while holding an irqsafe lock which triggers a lockdep splat because q->stats->lock isn't irqsafe. Let's make it irqsafe. Signed-off-by: Tejun Heo <tj@kernel.org> Fixes: cd006509b0a9 ("blk-iocost: account for IO size when testing latencies") Cc: stable@vger.kernel.org # v5.8+ Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-09-09blk-iocost: ioc_pd_free() shouldn't assume irq disabledTejun Heo
commit 5aeac7c4b16069aae49005f0a8d4526baa83341b upstream. ioc_pd_free() grabs irq-safe ioc->lock without ensuring that irq is disabled when it can be called with irq disabled or enabled. This has a small chance of causing A-A deadlocks and triggers lockdep splats. Use irqsave operations instead. Signed-off-by: Tejun Heo <tj@kernel.org> Fixes: 7caa47151ab2 ("blkcg: implement blk-iocost") Cc: stable@vger.kernel.org # v5.4+ Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-09-09block: ensure bdi->io_pages is always initializedJens Axboe
commit de1b0ee490eafdf65fac9eef9925391a8369f2dc upstream. If a driver leaves the limit settings as the defaults, then we don't initialize bdi->io_pages. This means that file systems may need to work around bdi->io_pages == 0, which is somewhat messy. Initialize the default value just like we do for ->ra_pages. Cc: stable@vger.kernel.org Fixes: 9491ae4aade6 ("mm: don't cap request size based on read-ahead setting") Reported-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-09-09block: fix locking in bdev_del_partitionChristoph Hellwig
[ Upstream commit 08fc1ab6d748ab1a690fd483f41e2938984ce353 ] We need to hold the whole device bd_mutex to protect against other thread concurrently deleting out partition before we get to it, and thus causing a use after free. Fixes: cddae808aeb7 ("block: pass a hd_struct to delete_partition") Reported-by: syzbot+6448f3c229bc52b82f69@syzkaller.appspotmail.com Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-09-03blk-mq: order adding requests to hctx->dispatch and checking SCHED_RESTARTMing Lei
commit d7d8535f377e9ba87edbf7fbbd634ac942f3f54f upstream. SCHED_RESTART code path is relied to re-run queue for dispatch requests in hctx->dispatch. Meantime the SCHED_RSTART flag is checked when adding requests to hctx->dispatch. memory barriers have to be used for ordering the following two pair of OPs: 1) adding requests to hctx->dispatch and checking SCHED_RESTART in blk_mq_dispatch_rq_list() 2) clearing SCHED_RESTART and checking if there is request in hctx->dispatch in blk_mq_sched_restart(). Without the added memory barrier, either: 1) blk_mq_sched_restart() may miss requests added to hctx->dispatch meantime blk_mq_dispatch_rq_list() observes SCHED_RESTART, and not run queue in dispatch side or 2) blk_mq_dispatch_rq_list still sees SCHED_RESTART, and not run queue in dispatch side, meantime checking if there is request in hctx->dispatch from blk_mq_sched_restart() is missed. IO hang in ltp/fs_fill test is reported by kernel test robot: https://lkml.org/lkml/2020/7/26/77 Turns out it is caused by the above out-of-order OPs. And the IO hang can't be observed any more after applying this patch. Fixes: bd166ef183c2 ("blk-mq-sched: add framework for MQ capable IO schedulers") Reported-by: kernel test robot <rong.a.chen@intel.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Bart Van Assche <bvanassche@acm.org> Cc: Christoph Hellwig <hch@lst.de> Cc: David Jeffery <djeffery@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-09-03block: fix get_max_io_size()Keith Busch
commit e4b469c66f3cbb81c2e94d31123d7bcdf3c1dabd upstream. A previous commit aligning splits to physical block sizes inadvertently modified one return case such that that it now returns 0 length splits when the number of sectors doesn't exceed the physical offset. This later hits a BUG in bio_split(). Restore the previous working behavior. Fixes: 9cc5169cd478b ("block: Improve physical block alignment of split bios") Reported-by: Eric Deal <eric.deal@wdc.com> Signed-off-by: Keith Busch <kbusch@kernel.org> Cc: Bart Van Assche <bvanassche@acm.org> Cc: stable@vger.kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-09-03blkcg: fix memleak for iolatencyYufen Yu
[ Upstream commit 27029b4b18aa5d3b060f0bf2c26dae254132cfce ] Normally, blkcg_iolatency_exit() will free related memory in iolatency when cleanup queue. But if blk_throtl_init() return error and queue init fail, blkcg_iolatency_exit() will not do that for us. Then it cause memory leak. Fixes: d70675121546 ("block: introduce blk-iolatency io controller") Signed-off-by: Yufen Yu <yuyufen@huawei.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-09-03blk-mq: insert request not through ->queue_rq into sw/scheduler queueMing Lei
[ Upstream commit db03f88fae8a2c8007caafa70287798817df2875 ] c616cbee97ae ("blk-mq: punt failed direct issue to dispatch list") supposed to add request which has been through ->queue_rq() to the hw queue dispatch list, however it adds request running out of budget or driver tag to hw queue too. This way basically bypasses request merge, and causes too many request dispatched to LLD, and system% is unnecessary increased. Fixes this issue by adding request not through ->queue_rq into sw/scheduler queue, and this way is safe because no ->queue_rq is called on this request yet. High %system can be observed on Azure storvsc device, and even soft lock is observed. This patch reduces %system during heavy sequential IO, meantime decreases soft lockup risk. Fixes: c616cbee97ae ("blk-mq: punt failed direct issue to dispatch list") Signed-off-by: Ming Lei <ming.lei@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Bart Van Assche <bvanassche@acm.org> Cc: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-09-03bfq: fix blkio cgroup leakage v4Dmitry Monakhov
[ Upstream commit 2de791ab4918969d8108f15238a701968375f235 ] Changes from v1: - update commit description with proper ref-accounting justification commit db37a34c563b ("block, bfq: get a ref to a group when adding it to a service tree") introduce leak forbfq_group and blkcg_gq objects because of get/put imbalance. In fact whole idea of original commit is wrong because bfq_group entity can not dissapear under us because it is referenced by child bfq_queue's entities from here: -> bfq_init_entity() ->bfqg_and_blkg_get(bfqg); ->entity->parent = bfqg->my_entity -> bfq_put_queue(bfqq) FINAL_PUT ->bfqg_and_blkg_put(bfqq_group(bfqq)) ->kmem_cache_free(bfq_pool, bfqq); So parent entity can not disappear while child entity is in tree, and child entities already has proper protection. This patch revert commit db37a34c563b ("block, bfq: get a ref to a group when adding it to a service tree") bfq_group leak trace caused by bad commit: -> blkg_alloc -> bfq_pq_alloc -> bfqg_get (+1) ->bfq_activate_bfqq ->bfq_activate_requeue_entity -> __bfq_activate_entity ->bfq_get_entity ->bfqg_and_blkg_get (+1) <==== : Note1 ->bfq_del_bfqq_busy ->bfq_deactivate_entity+0x53/0xc0 [bfq] ->__bfq_deactivate_entity+0x1b8/0x210 [bfq] -> bfq_forget_entity(is_in_service = true) entity->on_st_or_in_serv = false <=== :Note2 if (is_in_service) return; ==> do not touch reference -> blkcg_css_offline -> blkcg_destroy_blkgs -> blkg_destroy -> bfq_pd_offline -> __bfq_deactivate_entity if (!entity->on_st_or_in_serv) /* true, because (Note2) return false; -> bfq_pd_free -> bfqg_put() (-1, byt bfqg->ref == 2) because of (Note2) So bfq_group and blkcg_gq will leak forever, see test-case below. ##TESTCASE_BEGIN: #!/bin/bash max_iters=${1:-100} #prep cgroup mounts mount -t tmpfs cgroup_root /sys/fs/cgroup mkdir /sys/fs/cgroup/blkio mount -t cgroup -o blkio none /sys/fs/cgroup/blkio # Prepare blkdev grep blkio /proc/cgroups truncate -s 1M img losetup /dev/loop0 img echo bfq > /sys/block/loop0/queue/scheduler grep blkio /proc/cgroups for ((i=0;i<max_iters;i++)) do mkdir -p /sys/fs/cgroup/blkio/a echo 0 > /sys/fs/cgroup/blkio/a/cgroup.procs dd if=/dev/loop0 bs=4k count=1 of=/dev/null iflag=direct 2> /dev/null echo 0 > /sys/fs/cgroup/blkio/cgroup.procs rmdir /sys/fs/cgroup/blkio/a grep blkio /proc/cgroups done ##TESTCASE_END: Fixes: db37a34c563b ("block, bfq: get a ref to a group when adding it to a service tree") Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Signed-off-by: Dmitry Monakhov <dmtrmonakhov@yandex-team.ru> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-09-03block: Fix page_is_mergeable() for compound pagesMatthew Wilcox (Oracle)
[ Upstream commit d81665198b83e55a28339d1f3e4890ed8a434556 ] If we pass in an offset which is larger than PAGE_SIZE, then page_is_mergeable() thinks it's not mergeable with the previous bio_vec, leading to a large number of bio_vecs being used. Use a slightly more obvious test that the two pages are compatible with each other. Fixes: 52d52d1c98a9 ("block: only allow contiguous page structs in a bio_vec") Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-09-03block: respect queue limit of max discard segmentMing Lei
[ Upstream commit 943b40c832beb71115e38a1c4d99b640b5342738 ] When queue_max_discard_segments(q) is 1, blk_discard_mergable() will return false for discard request, then normal request merge is applied. However, only queue_max_segments() is checked, so max discard segment limit isn't respected. Check max discard segment limit in the request merge code for fixing the issue. Discard request failure of virtio_blk is fixed. Fixes: 69840466086d ("block: fix the DISCARD request merge") Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-08-19block: don't do revalidate zones on invalid devicesJohannes Thumshirn
[ Upstream commit 1a1206dc4cf02cee4b5cbce583ee4c22368b4c28 ] When we loose a device for whatever reason while (re)scanning zones, we trip over a NULL pointer in blk_revalidate_zone_cb, like in the following log: sd 0:0:0:0: [sda] 3418095616 4096-byte logical blocks: (14.0 TB/12.7 TiB) sd 0:0:0:0: [sda] 52156 zones of 65536 logical blocks sd 0:0:0:0: [sda] Write Protect is off sd 0:0:0:0: [sda] Mode Sense: 37 00 00 08 sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA sd 0:0:0:0: [sda] REPORT ZONES start lba 1065287680 failed sd 0:0:0:0: [sda] REPORT ZONES: Result: hostbyte=0x00 driverbyte=0x08 sd 0:0:0:0: [sda] Sense Key : 0xb [current] sd 0:0:0:0: [sda] ASC=0x0 ASCQ=0x6 sda: failed to revalidate zones sd 0:0:0:0: [sda] 0 4096-byte logical blocks: (0 B/0 B) sda: detected capacity change from 14000519643136 to 0 ================================================================== BUG: KASAN: null-ptr-deref in blk_revalidate_zone_cb+0x1b7/0x550 Write of size 8 at addr 0000000000000010 by task kworker/u4:1/58 CPU: 1 PID: 58 Comm: kworker/u4:1 Not tainted 5.8.0-rc1 #692 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4-rebuilt.opensuse.org 04/01/2014 Workqueue: events_unbound async_run_entry_fn Call Trace: dump_stack+0x7d/0xb0 ? blk_revalidate_zone_cb+0x1b7/0x550 kasan_report.cold+0x5/0x37 ? blk_revalidate_zone_cb+0x1b7/0x550 check_memory_region+0x145/0x1a0 blk_revalidate_zone_cb+0x1b7/0x550 sd_zbc_parse_report+0x1f1/0x370 ? blk_req_zone_write_trylock+0x200/0x200 ? sectors_to_logical+0x60/0x60 ? blk_req_zone_write_trylock+0x200/0x200 ? blk_req_zone_write_trylock+0x200/0x200 sd_zbc_report_zones+0x3c4/0x5e0 ? sd_dif_config_host+0x500/0x500 blk_revalidate_disk_zones+0x231/0x44d ? _raw_write_lock_irqsave+0xb0/0xb0 ? blk_queue_free_zone_bitmaps+0xd0/0xd0 sd_zbc_read_zones+0x8cf/0x11a0 sd_revalidate_disk+0x305c/0x64e0 ? __device_add_disk+0x776/0xf20 ? read_capacity_16.part.0+0x1080/0x1080 ? blk_alloc_devt+0x250/0x250 ? create_object.isra.0+0x595/0xa20 ? kasan_unpoison_shadow+0x33/0x40 sd_probe+0x8dc/0xcd2 really_probe+0x20e/0xaf0 __driver_attach_async_helper+0x249/0x2d0 async_run_entry_fn+0xbe/0x560 process_one_work+0x764/0x1290 ? _raw_read_unlock_irqrestore+0x30/0x30 worker_thread+0x598/0x12f0 ? __kthread_parkme+0xc6/0x1b0 ? schedule+0xed/0x2c0 ? process_one_work+0x1290/0x1290 kthread+0x36b/0x440 ? kthread_create_worker_on_cpu+0xa0/0xa0 ret_from_fork+0x22/0x30 ================================================================== When the device is already gone we end up with the following scenario: The device's capacity is 0 and thus the number of zones will be 0 as well. When allocating the bitmap for the conventional zones, we then trip over a NULL pointer. So if we encounter a zoned block device with a 0 capacity, don't dare to revalidate the zones sizes. Fixes: 6c6b35491422 ("block: set the zone size in blk_revalidate_disk_zones atomically") Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-08-19iocost: Fix check condition of iocg abs_vdebtChengming Zhou
[ Upstream commit d9012a59db54442d5b2fcfdfcded35cf566397d3 ] We shouldn't skip iocg when its abs_vdebt is not zero. Fixes: 0b80f9866e6b ("iocost: protect iocg->abs_vdebt with iocg->waitq.lock") Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-07-10Merge tag 'block-5.8-2020-07-10' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block fixes from Jens Axboe: - Fix for inflight accounting, which affects only dm (Ming) - Fix documentation error for bfq (Yufen) - Fix memory leak for nbd (Zheng) * tag 'block-5.8-2020-07-10' of git://git.kernel.dk/linux-block: nbd: Fix memory leak in nbd_add_socket blk-mq: consider non-idle request as "inflight" in blk_mq_rq_inflight() docs: block: update and fix tiny error for bfq
2020-07-07blk-mq: consider non-idle request as "inflight" in blk_mq_rq_inflight()Ming Lei
dm-multipath is the only user of blk_mq_queue_inflight(). When dm-multipath calls blk_mq_queue_inflight() to check if it has outstanding IO it can get a false negative. The reason for this is blk_mq_rq_inflight() doesn't consider requests that are no longer MQ_RQ_IN_FLIGHT but that are now MQ_RQ_COMPLETE (->complete isn't called or finished yet) as "inflight". This causes request-based dm-multipath's dm_wait_for_completion() to return before all outstanding dm-multipath requests have actually completed. This breaks DM multipath's suspend functionality because blk-mq requests complete after DM's suspend has finished -- which shouldn't happen. Fix this by considering any request not in the MQ_RQ_IDLE state (so either MQ_RQ_COMPLETE or MQ_RQ_IN_FLIGHT) as "inflight" in blk_mq_rq_inflight(). Fixes: 3c94d83cb3526 ("blk-mq: change blk_mq_queue_busy() to blk_mq_queue_inflight()") Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-05Merge tag 'block-5.8-2020-07-05' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block fixes from Jens Axboe: - NVMe fixes from Christoph: - Fix crash in multi-path disk add (Christoph) - Fix ignore of identify error (Sagi) - Fix a compiler complaint that a function should be static (Wei) * tag 'block-5.8-2020-07-05' of git://git.kernel.dk/linux-block: block: make function __bio_integrity_free() static nvme: fix a crash in nvme_mpath_add_disk nvme: fix identify error status silent ignore
2020-07-02Merge tag 'block-5.8-2020-07-01' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block fixes from Jens Axboe: - Use kvfree_sensitive() for the block keyslot free (Eric) - Sync blk-mq debugfs flags (Hou) - Memory leak fix in virtio-blk error path (Hou) * tag 'block-5.8-2020-07-01' of git://git.kernel.dk/linux-block: virtio-blk: free vblk-vqs in error path of virtblk_probe() block/keyslot-manager: use kvfree_sensitive() blk-mq-debugfs: update blk_queue_flag_name[] accordingly for new flags
2020-07-02block: make function __bio_integrity_free() staticWei Yongjun
Fix sparse build warning: block/bio-integrity.c:27:6: warning: symbol '__bio_integrity_free' was not declared. Should it be static? Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-29block/keyslot-manager: use kvfree_sensitive()Eric Biggers
Make blk_ksm_destroy() use the kvfree_sensitive() function (which was introduced in v5.8-rc1) instead of open-coding it. Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-29blk-mq-debugfs: update blk_queue_flag_name[] accordingly for new flagsHou Tao
Else there may be magic numbers in /sys/kernel/debug/block/*/state. Signed-off-by: Hou Tao <houtao1@huawei.com> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-27Merge tag 'block-5.8-2020-06-26' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block fixes from Jens Axboe: - NVMe pull request from Christoph: - multipath deadlock fixes (Anton) - NUMA fixes (Max) - RDMA completion vector fix (Max) - IO deadlock fix (Sagi) - multipath reference fix (Sagi) - NS mutation fix (Sagi) - Use right allocator when freeing bip in error path (Chengguang) * tag 'block-5.8-2020-06-26' of git://git.kernel.dk/linux-block: nvme-multipath: fix bogus request queue reference put nvme-multipath: fix deadlock due to head->lock nvme: don't protect ns mutation with ns->head->lock nvme-multipath: fix deadlock between ana_work and scan_work nvme: fix possible deadlock when I/O is blocked nvme-rdma: assign completion vector correctly nvme-loop: initialize tagset numa value to the value of the ctrl nvme-tcp: initialize tagset numa value to the value of the ctrl nvme-pci: initialize tagset numa value to the value of the ctrl nvme-pci: override the value of the controller's numa node nvme: set initial value for controller's numa node block: release bip in a right way in error path
2020-06-24block: release bip in a right way in error pathChengguang Xu
Release bip using kfree() in error path when that was allocated by kmalloc(). Signed-off-by: Chengguang Xu <cgxu519@mykernel.net> Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-19Merge tag 'block-5.8-2020-06-19' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block fixes from Jens Axboe: - Use import_uuid() where appropriate (Andy) - bcache fixes (Coly, Mauricio, Zhiqiang) - blktrace sparse warnings fix (Jan) - blktrace concurrent setup fix (Luis) - blkdev_get use-after-free fix (Jason) - Ensure all blk-mq maps are updated (Weiping) - Loop invalidate bdev fix (Zheng) * tag 'block-5.8-2020-06-19' of git://git.kernel.dk/linux-block: block: make function 'kill_bdev' static loop: replace kill_bdev with invalidate_bdev partitions/ldm: Replace uuid_copy() with import_uuid() where it makes sense block: update hctx map when use multiple maps blktrace: Avoid sparse warnings when assigning q->blk_trace blktrace: break out of blktrace setup on concurrent calls block: Fix use-after-free in blkdev_get() trace/events/block.h: drop kernel-doc for dropped function parameter blk-mq: Remove redundant 'return' statement bcache: pr_info() format clean up in bcache_device_init() bcache: use delayed kworker fo asynchronous devices registration bcache: check and adjust logical block size for backing devices bcache: fix potential deadlock problem in btree_gc_coalesce
2020-06-18partitions/ldm: Replace uuid_copy() with import_uuid() where it makes senseAndy Shevchenko
There is a specific API to treat raw data as UUID, i.e. import_uuid(). Use it instead of uuid_copy() with explicit casting. Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-17block: update hctx map when use multiple mapsWeiping Zhang
There is an issue when tune the number for read and write queues, if the total queue count was not changed. The hctx->type cannot be updated, since __blk_mq_update_nr_hw_queues will return directly if the total queue count has not been changed. Reproduce: dmesg | grep "default/read/poll" [ 2.607459] nvme nvme0: 48/0/0 default/read/poll queues cat /sys/kernel/debug/block/nvme0n1/hctx*/type | sort | uniq -c 48 default tune the write queues to 24: echo 24 > /sys/module/nvme/parameters/write_queues echo 1 > /sys/block/nvme0n1/device/reset_controller dmesg | grep "default/read/poll" [ 433.547235] nvme nvme0: 24/24/0 default/read/poll queues cat /sys/kernel/debug/block/nvme0n1/hctx*/type | sort | uniq -c 48 default The driver's hardware queue mapping is not same as block layer. Signed-off-by: Weiping Zhang <zhangweiping@didiglobal.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-15block: Replace zero-length array with flexible-arrayGustavo A. R. Silva
There is a regular need in the kernel to provide a way to declare having a dynamically sized set of trailing elements in a structure. Kernel code should always use “flexible array members”[1] for these cases. The older style of one-element or zero-length arrays should no longer be used[2]. [1] https://en.wikipedia.org/wiki/Flexible_array_member [2] https://github.com/KSPP/linux/issues/21 Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
2020-06-15blk-mq: Remove redundant 'return' statementBaolin Wang
The blk_mq_all_tag_iter() is a void function, thus remove the redundant 'return' statement in this function. Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-13Merge tag 'kbuild-v5.8-2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild Pull more Kbuild updates from Masahiro Yamada: - fix build rules in binderfs sample - fix build errors when Kbuild recurses to the top Makefile - covert '---help---' in Kconfig to 'help' * tag 'kbuild-v5.8-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: treewide: replace '---help---' in Kconfig files with 'help' kbuild: fix broken builds because of GZIP,BZIP2,LZOP variables samples: binderfs: really compile this sample and fix build issues
2020-06-14treewide: replace '---help---' in Kconfig files with 'help'Masahiro Yamada
Since commit 84af7a6194e4 ("checkpatch: kconfig: prefer 'help' over '---help---'"), the number of '---help---' has been gradually decreasing, but there are still more than 2400 instances. This commit finishes the conversion. While I touched the lines, I also fixed the indentation. There are a variety of indentation styles found. a) 4 spaces + '---help---' b) 7 spaces + '---help---' c) 8 spaces + '---help---' d) 1 space + 1 tab + '---help---' e) 1 tab + '---help---' (correct indentation) f) 1 tab + 1 space + '---help---' g) 1 tab + 2 spaces + '---help---' In order to convert all of them to 1 tab + 'help', I ran the following commend: $ find . -name 'Kconfig*' | xargs sed -i 's/^[[:space:]]*---help---/\thelp/' Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
2020-06-07blk-mq: fix blk_mq_all_tag_iterMing Lei
blk_mq_all_tag_iter() is added to iterate all requests, so we should fetch the request from ->static_rqs][] instead of ->rqs[] which is for holding in-flight request only. Fix it by adding flag of BT_TAG_ITER_STATIC_RQS. Fixes: bf0beec0607d ("blk-mq: drain I/O when all CPUs in a hctx are offline") Signed-off-by: Ming Lei <ming.lei@redhat.com> Tested-by: John Garry <john.garry@huawei.com> Cc: Dongli Zhang <dongli.zhang@oracle.com> Cc: Hannes Reinecke <hare@suse.de> Cc: Daniel Wagner <dwagner@suse.de> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-07blk-mq: split out a __blk_mq_get_driver_tag helperChristoph Hellwig
Allocation of the driver tag in the case of using a scheduler shares very little code with the "normal" tag allocation. Split out a new helper to streamline this path, and untangle it from the complex normal tag allocation. This way also avoids to fail driver tag allocation because of inactive hctx during cpu hotplug, and fixes potential hang risk. Fixes: bf0beec0607d ("blk-mq: drain I/O when all CPUs in a hctx are offline") Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Tested-by: John Garry <john.garry@huawei.com> Cc: Dongli Zhang <dongli.zhang@oracle.com> Cc: Hannes Reinecke <hare@suse.de> Cc: Daniel Wagner <dwagner@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-04block: nr_sects_write(): Disable preemption on seqcount writeAhmed S. Darwish
For optimized block readers not holding a mutex, the "number of sectors" 64-bit value is protected from tearing on 32-bit architectures by a sequence counter. Disable preemption before entering that sequence counter's write side critical section. Otherwise, the read side can preempt the write side section and spin for the entire scheduler tick. If the reader belongs to a real-time scheduling class, it can spin forever and the kernel will livelock. Fixes: c83f6bf98dc1 ("block: add partition resize function to blkpg ioctl") Cc: <stable@vger.kernel.org> Signed-off-by: Ahmed S. Darwish <a.darwish@linutronix.de> Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-04block: remove the error argument to the block_bio_complete tracepointChristoph Hellwig
The status can be trivially derived from the bio itself. That also avoid callers like NVMe to incorrectly pass a blk_status_t instead of the errno, and the overhead of translating the blk_status_t to the errno in the I/O completion fast path when no tracing is enabled. Fixes: 35fe0d12c8a3 ("nvme: trace bio completion") Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-02block/bio-integrity: don't free 'buf' if bio_integrity_add_page() failedyu kuai
commit e7bf90e5afe3 ("block/bio-integrity: fix a memory leak bug") added a kfree() for 'buf' if bio_integrity_add_page() returns '0'. However, the object will be freed in bio_integrity_free() since 'bio->bi_opf' and 'bio->bi_integrity' were set previousy in bio_integrity_alloc(). Fixes: commit e7bf90e5afe3 ("block/bio-integrity: fix a memory leak bug") Signed-off-by: yu kuai <yukuai3@huawei.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Bob Liu <bob.liu@oracle.com> Acked-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-02Merge tag 'for-5.8/drivers-2020-06-01' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block driver updates from Jens Axboe: "On top of the core changes, here are the block driver changes for this merge window: - NVMe changes: - NVMe over Fibre Channel protocol updates, which also reach over to drivers/scsi/lpfc (James Smart) - namespace revalidation support on the target (Anthony Iliopoulos) - gcc zero length array fix (Arnd Bergmann) - nvmet cleanups (Chaitanya Kulkarni) - misc cleanups and fixes (me, Keith Busch, Sagi Grimberg) - use a SRQ per completion vector (Max Gurtovoy) - fix handling of runtime changes to the queue count (Weiping Zhang) - t10 protection information support for nvme-rdma and nvmet-rdma (Israel Rukshin and Max Gurtovoy) - target side AEN improvements (Chaitanya Kulkarni) - various fixes and minor improvements all over, icluding the nvme part of the lpfc driver" - Floppy code cleanup series (Willy, Denis) - Floppy contention fix (Jiri) - Loop CONFIGURE support (Martijn) - bcache fixes/improvements (Coly, Joe, Colin) - q->queuedata cleanups (Christoph) - Get rid of ioctl_by_bdev (Christoph, Stefan) - md/raid5 allocation fixes (Coly) - zero length array fixes (Gustavo) - swim3 task state fix (Xu)" * tag 'for-5.8/drivers-2020-06-01' of git://git.kernel.dk/linux-block: (166 commits) bcache: configure the asynchronous registertion to be experimental bcache: asynchronous devices registration bcache: fix refcount underflow in bcache_device_free() bcache: Convert pr_<level> uses to a more typical style bcache: remove redundant variables i and n lpfc: Fix return value in __lpfc_nvme_ls_abort lpfc: fix axchg pointer reference after free and double frees lpfc: Fix pointer checks and comments in LS receive refactoring nvme: set dma alignment to qword nvmet: cleanups the loop in nvmet_async_events_process nvmet: fix memory leak when removing namespaces and controllers concurrently nvmet-rdma: add metadata/T10-PI support nvmet: add metadata support for block devices nvmet: add metadata/T10-PI support nvme: add Metadata Capabilities enumerations nvmet: rename nvmet_check_data_len to nvmet_check_transfer_len nvmet: rename nvmet_rw_len to nvmet_rw_data_len nvmet: add metadata characteristics for a namespace nvme-rdma: add metadata/T10-PI support nvme-rdma: introduce nvme_rdma_sgl structure ...
2020-06-02Merge tag 'for-5.8/block-2020-06-01' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block updates from Jens Axboe: "Core block changes that have been queued up for this release: - Remove dead blk-throttle and blk-wbt code (Guoqing) - Include pid in blktrace note traces (Jan) - Don't spew I/O errors on wouldblock termination (me) - Zone append addition (Johannes, Keith, Damien) - IO accounting improvements (Konstantin, Christoph) - blk-mq hardware map update improvements (Ming) - Scheduler dispatch improvement (Salman) - Inline block encryption support (Satya) - Request map fixes and improvements (Weiping) - blk-iocost tweaks (Tejun) - Fix for timeout failing with error injection (Keith) - Queue re-run fixes (Douglas) - CPU hotplug improvements (Christoph) - Queue entry/exit improvements (Christoph) - Move DMA drain handling to the few drivers that use it (Christoph) - Partition handling cleanups (Christoph)" * tag 'for-5.8/block-2020-06-01' of git://git.kernel.dk/linux-block: (127 commits) block: mark bio_wouldblock_error() bio with BIO_QUIET blk-wbt: rename __wbt_update_limits to wbt_update_limits blk-wbt: remove wbt_update_limits blk-throttle: remove tg_drain_bios blk-throttle: remove blk_throtl_drain null_blk: force complete for timeout request blk-mq: drain I/O when all CPUs in a hctx are offline blk-mq: add blk_mq_all_tag_iter blk-mq: open code __blk_mq_alloc_request in blk_mq_alloc_request_hctx blk-mq: use BLK_MQ_NO_TAG in more places blk-mq: rename BLK_MQ_TAG_FAIL to BLK_MQ_NO_TAG blk-mq: move more request initialization to blk_mq_rq_ctx_init blk-mq: simplify the blk_mq_get_request calling convention blk-mq: remove the bio argument to ->prepare_request nvme: force complete cancelled requests blk-mq: blk-mq: provide forced completion method block: fix a warning when blkdev.h is included for !CONFIG_BLOCK builds block: blk-crypto-fallback: remove redundant initialization of variable err block: reduce part_stat_lock() scope block: use __this_cpu_add() instead of access by smp_processor_id() ...
2020-06-02mm: move readahead prototypes from mm.hMatthew Wilcox (Oracle)
Patch series "Change readahead API", v11. This series adds a readahead address_space operation to replace the readpages operation. The key difference is that pages are added to the page cache as they are allocated (and then looked up by the filesystem) instead of passing them on a list to the readpages operation and having the filesystem add them to the page cache. It's a net reduction in code for each implementation, more efficient than walking a list, and solves the direct-write vs buffered-read problem reported by yu kuai at http://lkml.kernel.org/r/20200116063601.39201-1-yukuai3@huawei.com The only unconverted filesystems are those which use fscache. Their conversion is pending Dave Howells' rewrite which will make the conversion substantially easier. This should be completed by the end of the year. I want to thank the reviewers/testers; Dave Chinner, John Hubbard, Eric Biggers, Johannes Thumshirn, Dave Sterba, Zi Yan, Christoph Hellwig and Miklos Szeredi have done a marvellous job of providing constructive criticism. These patches pass an xfstests run on ext4, xfs & btrfs with no regressions that I can tell (some of the tests seem a little flaky before and remain flaky afterwards). This patch (of 25): The readahead code is part of the page cache so should be found in the pagemap.h file. force_page_cache_readahead is only used within mm, so move it to mm/internal.h instead. Remove the parameter names where they add no value, and rename the ones which were actively misleading. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: William Kucharski <william.kucharski@oracle.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Cc: Chao Yu <yuchao0@huawei.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: Darrick J. Wong <darrick.wong@oracle.com> Cc: Dave Chinner <dchinner@redhat.com> Cc: Eric Biggers <ebiggers@google.com> Cc: Gao Xiang <gaoxiang25@huawei.com> Cc: Jaegeuk Kim <jaegeuk@kernel.org> Cc: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Zi Yan <ziy@nvidia.com> Cc: Miklos Szeredi <mszeredi@redhat.com> Link: http://lkml.kernel.org/r/20200414150233.24495-1-willy@infradead.org Link: http://lkml.kernel.org/r/20200414150233.24495-2-willy@infradead.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-05-29blk-wbt: rename __wbt_update_limits to wbt_update_limitsGuoqing Jiang
Now let's rename __wbt_update_limits to wbt_update_limits after the previous one is deleted. Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-29blk-wbt: remove wbt_update_limitsGuoqing Jiang
No one call this function after commit 2af2783f2ea4f ("rq-qos: get rid of redundant wbt_update_limits()"), so remove it. Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-29blk-throttle: remove tg_drain_biosGuoqing Jiang
After blk_throtl_drain is removed, there is no caller of tg_drain_bios, so remove it as well. Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-29blk-throttle: remove blk_throtl_drainGuoqing Jiang
After the commit 5addeae1bedc4 ("blk-cgroup: remove blkcg_drain_queue"), there is no caller of blk_throtl_drain, so let's remove it. Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-29blk-mq: drain I/O when all CPUs in a hctx are offlineMing Lei
Most of blk-mq drivers depend on managed IRQ's auto-affinity to setup up queue mapping. Thomas mentioned the following point[1]: "That was the constraint of managed interrupts from the very beginning: The driver/subsystem has to quiesce the interrupt line and the associated queue _before_ it gets shutdown in CPU unplug and not fiddle with it until it's restarted by the core when the CPU is plugged in again." However, current blk-mq implementation doesn't quiesce hw queue before the last CPU in the hctx is shutdown. Even worse, CPUHP_BLK_MQ_DEAD is a cpuhp state handled after the CPU is down, so there isn't any chance to quiesce the hctx before shutting down the CPU. Add new CPUHP_AP_BLK_MQ_ONLINE state to stop allocating from blk-mq hctxs where the last CPU goes away, and wait for completion of in-flight requests. This guarantees that there is no inflight I/O before shutting down the managed IRQ. Add a BLK_MQ_F_STACKING and set it for dm-rq and loop, so we don't need to wait for completion of in-flight requests from these drivers to avoid a potential dead-lock. It is safe to do this for stacking drivers as those do not use interrupts at all and their I/O completions are triggered by underlying devices I/O completion. [1] https://lore.kernel.org/linux-block/alpine.DEB.2.21.1904051331270.1802@nanos.tec.linutronix.de/ [hch: different retry mechanism, merged two patches, minor cleanups] Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Daniel Wagner <dwagner@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-29blk-mq: add blk_mq_all_tag_iterMing Lei
Add a new blk_mq_all_tag_iter function to iterate over all allocated scheduler tags and driver tags. This is more flexible than the existing blk_mq_all_tag_busy_iter function as it allows the callers to do whatever they want on allocated request instead of being limited to started requests. It will be used to implement draining allocated requests on specified hctx in this patchset. [hch: switch from the two booleans to a more readable flags field and consolidate the tags iter functions] Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Daniel Wagner <dwagner@suse.de> Reviewed-by: Bart van Assche <bvanassche@acm.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-29blk-mq: open code __blk_mq_alloc_request in blk_mq_alloc_request_hctxChristoph Hellwig
blk_mq_alloc_request_hctx is only used for NVMeoF connect commands, so tailor it to the specific requirements, and don't bother the general fast path code with its special twinkles. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Hannes Reinecke <hare@suse.de Reviewed-by: Daniel Wagner <dwagner@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-29blk-mq: use BLK_MQ_NO_TAG in more placesChristoph Hellwig
Replace various magic -1 constants for tags with BLK_MQ_NO_TAG. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Daniel Wagner <dwagner@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-29blk-mq: rename BLK_MQ_TAG_FAIL to BLK_MQ_NO_TAGChristoph Hellwig
To prepare for wider use of this constant give it a more applicable name. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Daniel Wagner <dwagner@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-29blk-mq: move more request initialization to blk_mq_rq_ctx_initChristoph Hellwig
Don't split request initialization between __blk_mq_alloc_request and blk_mq_rq_ctx_init. Also remove the op argument as it can be derived from the blk_mq_alloc_data structure. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Daniel Wagner <dwagner@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-29blk-mq: simplify the blk_mq_get_request calling conventionChristoph Hellwig
The bio argument is entirely unused, and the request_queue can be passed through the alloc_data, given that it needs to be filled out for the low-level tag allocation anyway. Also rename the function to __blk_mq_alloc_request as the switch between get and alloc in the call chains is rather confusing. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Daniel Wagner <dwagner@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-29blk-mq: remove the bio argument to ->prepare_requestChristoph Hellwig
None of the I/O schedulers actually needs it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Daniel Wagner <dwagner@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-29blk-mq: blk-mq: provide forced completion methodKeith Busch
Drivers may need to bypass error injection for error recovery. Rename __blk_mq_complete_request() to blk_mq_force_complete_rq() and export that function so drivers may skip potential fake timeouts after they've reclaimed lost requests. Signed-off-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Daniel Wagner <dwagner@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>