aboutsummaryrefslogtreecommitdiffstats
path: root/block
AgeCommit message (Collapse)Author
2020-08-19block: don't do revalidate zones on invalid devicesJohannes Thumshirn
[ Upstream commit 1a1206dc4cf02cee4b5cbce583ee4c22368b4c28 ] When we loose a device for whatever reason while (re)scanning zones, we trip over a NULL pointer in blk_revalidate_zone_cb, like in the following log: sd 0:0:0:0: [sda] 3418095616 4096-byte logical blocks: (14.0 TB/12.7 TiB) sd 0:0:0:0: [sda] 52156 zones of 65536 logical blocks sd 0:0:0:0: [sda] Write Protect is off sd 0:0:0:0: [sda] Mode Sense: 37 00 00 08 sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA sd 0:0:0:0: [sda] REPORT ZONES start lba 1065287680 failed sd 0:0:0:0: [sda] REPORT ZONES: Result: hostbyte=0x00 driverbyte=0x08 sd 0:0:0:0: [sda] Sense Key : 0xb [current] sd 0:0:0:0: [sda] ASC=0x0 ASCQ=0x6 sda: failed to revalidate zones sd 0:0:0:0: [sda] 0 4096-byte logical blocks: (0 B/0 B) sda: detected capacity change from 14000519643136 to 0 ================================================================== BUG: KASAN: null-ptr-deref in blk_revalidate_zone_cb+0x1b7/0x550 Write of size 8 at addr 0000000000000010 by task kworker/u4:1/58 CPU: 1 PID: 58 Comm: kworker/u4:1 Not tainted 5.8.0-rc1 #692 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4-rebuilt.opensuse.org 04/01/2014 Workqueue: events_unbound async_run_entry_fn Call Trace: dump_stack+0x7d/0xb0 ? blk_revalidate_zone_cb+0x1b7/0x550 kasan_report.cold+0x5/0x37 ? blk_revalidate_zone_cb+0x1b7/0x550 check_memory_region+0x145/0x1a0 blk_revalidate_zone_cb+0x1b7/0x550 sd_zbc_parse_report+0x1f1/0x370 ? blk_req_zone_write_trylock+0x200/0x200 ? sectors_to_logical+0x60/0x60 ? blk_req_zone_write_trylock+0x200/0x200 ? blk_req_zone_write_trylock+0x200/0x200 sd_zbc_report_zones+0x3c4/0x5e0 ? sd_dif_config_host+0x500/0x500 blk_revalidate_disk_zones+0x231/0x44d ? _raw_write_lock_irqsave+0xb0/0xb0 ? blk_queue_free_zone_bitmaps+0xd0/0xd0 sd_zbc_read_zones+0x8cf/0x11a0 sd_revalidate_disk+0x305c/0x64e0 ? __device_add_disk+0x776/0xf20 ? read_capacity_16.part.0+0x1080/0x1080 ? blk_alloc_devt+0x250/0x250 ? create_object.isra.0+0x595/0xa20 ? kasan_unpoison_shadow+0x33/0x40 sd_probe+0x8dc/0xcd2 really_probe+0x20e/0xaf0 __driver_attach_async_helper+0x249/0x2d0 async_run_entry_fn+0xbe/0x560 process_one_work+0x764/0x1290 ? _raw_read_unlock_irqrestore+0x30/0x30 worker_thread+0x598/0x12f0 ? __kthread_parkme+0xc6/0x1b0 ? schedule+0xed/0x2c0 ? process_one_work+0x1290/0x1290 kthread+0x36b/0x440 ? kthread_create_worker_on_cpu+0xa0/0xa0 ret_from_fork+0x22/0x30 ================================================================== When the device is already gone we end up with the following scenario: The device's capacity is 0 and thus the number of zones will be 0 as well. When allocating the bitmap for the conventional zones, we then trip over a NULL pointer. So if we encounter a zoned block device with a 0 capacity, don't dare to revalidate the zones sizes. Fixes: 6c6b35491422 ("block: set the zone size in blk_revalidate_disk_zones atomically") Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-08-19iocost: Fix check condition of iocg abs_vdebtChengming Zhou
[ Upstream commit d9012a59db54442d5b2fcfdfcded35cf566397d3 ] We shouldn't skip iocg when its abs_vdebt is not zero. Fixes: 0b80f9866e6b ("iocost: protect iocg->abs_vdebt with iocg->waitq.lock") Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-07-22blk-mq-debugfs: update blk_queue_flag_name[] accordingly for new flagsHou Tao
[ Upstream commit bfe373f608cf81b7626dfeb904001b0e867c5110 ] Else there may be magic numbers in /sys/kernel/debug/block/*/state. Signed-off-by: Hou Tao <houtao1@huawei.com> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-07-16blk-mq: consider non-idle request as "inflight" in blk_mq_rq_inflight()Ming Lei
commit 05a4fed69ff00a8bd83538684cb602a4636b07a7 upstream. dm-multipath is the only user of blk_mq_queue_inflight(). When dm-multipath calls blk_mq_queue_inflight() to check if it has outstanding IO it can get a false negative. The reason for this is blk_mq_rq_inflight() doesn't consider requests that are no longer MQ_RQ_IN_FLIGHT but that are now MQ_RQ_COMPLETE (->complete isn't called or finished yet) as "inflight". This causes request-based dm-multipath's dm_wait_for_completion() to return before all outstanding dm-multipath requests have actually completed. This breaks DM multipath's suspend functionality because blk-mq requests complete after DM's suspend has finished -- which shouldn't happen. Fix this by considering any request not in the MQ_RQ_IDLE state (so either MQ_RQ_COMPLETE or MQ_RQ_IN_FLIGHT) as "inflight" in blk_mq_rq_inflight(). Fixes: 3c94d83cb3526 ("blk-mq: change blk_mq_queue_busy() to blk_mq_queue_inflight()") Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-07-16block: release bip in a right way in error pathChengguang Xu
[ Upstream commit 0b8eb629a700c0ef15a437758db8255f8444e76c ] Release bip using kfree() in error path when that was allocated by kmalloc(). Signed-off-by: Chengguang Xu <cgxu519@mykernel.net> Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-06-30block: update hctx map when use multiple mapsWeiping Zhang
[ Upstream commit fe35ec58f0d339221643287bbb7cee15c93a5389 ] There is an issue when tune the number for read and write queues, if the total queue count was not changed. The hctx->type cannot be updated, since __blk_mq_update_nr_hw_queues will return directly if the total queue count has not been changed. Reproduce: dmesg | grep "default/read/poll" [ 2.607459] nvme nvme0: 48/0/0 default/read/poll queues cat /sys/kernel/debug/block/nvme0n1/hctx*/type | sort | uniq -c 48 default tune the write queues to 24: echo 24 > /sys/module/nvme/parameters/write_queues echo 1 > /sys/block/nvme0n1/device/reset_controller dmesg | grep "default/read/poll" [ 433.547235] nvme nvme0: 24/24/0 default/read/poll queues cat /sys/kernel/debug/block/nvme0n1/hctx*/type | sort | uniq -c 48 default The driver's hardware queue mapping is not same as block layer. Signed-off-by: Weiping Zhang <zhangweiping@didiglobal.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-06-30block/bio-integrity: don't free 'buf' if bio_integrity_add_page() failedyu kuai
commit a75ca9303175d36af93c0937dd9b1a6422908b8d upstream. commit e7bf90e5afe3 ("block/bio-integrity: fix a memory leak bug") added a kfree() for 'buf' if bio_integrity_add_page() returns '0'. However, the object will be freed in bio_integrity_free() since 'bio->bi_opf' and 'bio->bi_integrity' were set previousy in bio_integrity_alloc(). Fixes: commit e7bf90e5afe3 ("block/bio-integrity: fix a memory leak bug") Signed-off-by: yu kuai <yukuai3@huawei.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Bob Liu <bob.liu@oracle.com> Acked-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Cc: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-06-22block: nr_sects_write(): Disable preemption on seqcount writeAhmed S. Darwish
commit 15b81ce5abdc4b502aa31dff2d415b79d2349d2f upstream. For optimized block readers not holding a mutex, the "number of sectors" 64-bit value is protected from tearing on 32-bit architectures by a sequence counter. Disable preemption before entering that sequence counter's write side critical section. Otherwise, the read side can preempt the write side section and spin for the entire scheduler tick. If the reader belongs to a real-time scheduling class, it can spin forever and the kernel will livelock. Fixes: c83f6bf98dc1 ("block: add partition resize function to blkpg ioctl") Cc: <stable@vger.kernel.org> Signed-off-by: Ahmed S. Darwish <a.darwish@linutronix.de> Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-06-22iocost: don't let vrate run wild while there's no saturation signalTejun Heo
[ Upstream commit 81ca627a933063fa63a6d4c66425de822a2ab7f5 ] When the QoS targets are met and nothing is being throttled, there's no way to tell how saturated the underlying device is - it could be almost entirely idle, at the cusp of saturation or anywhere inbetween. Given that there's no information, it's best to keep vrate as-is in this state. Before 7cd806a9a953 ("iocost: improve nr_lagging handling"), this was the case - if the device isn't missing QoS targets and nothing is being throttled, busy_level was reset to zero. While fixing nr_lagging handling, 7cd806a9a953 ("iocost: improve nr_lagging handling") broke this. Now, while the device is hitting QoS targets and nothing is being throttled, vrate keeps getting adjusted according to the existing busy_level. This led to vrate keeping climing till it hits max when there's an IO issuer with limited request concurrency if the vrate started low. vrate starts getting adjusted upwards until the issuer can issue IOs w/o being throttled. From then on, QoS targets keeps getting met and nothing on the system needs throttling and vrate keeps getting increased due to the existing busy_level. This patch makes the following changes to the busy_level logic. * Reset busy_level if nr_shortages is zero to avoid the above scenario. * Make non-zero nr_lagging block lowering nr_level but still clear positive busy_level if there's clear non-saturation signal - QoS targets are met and nr_shortages is non-zero. nr_lagging's role is preventing adjusting vrate upwards while there are long-running commands and it shouldn't keep busy_level positive while there's clear non-saturation signal. * Restructure code for clarity and add comments. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Andy Newell <newella@fb.com> Fixes: 7cd806a9a953 ("iocost: improve nr_lagging handling") Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-06-22block: reset mapping if failed to update hardware queue countWeiping Zhang
[ Upstream commit aa880ad690ab6d4c53934af85fb5a43e69ecb0f5 ] When we increase hardware queue count, blk_mq_update_queue_map will reset the mapping between cpu and hardware queue base on the hardware queue count(set->nr_hw_queues). The mapping cannot be reset if it encounters error in blk_mq_realloc_hw_ctxs, but the fallback flow will continue using it, then blk_mq_map_swqueue will touch a invalid memory, because the mapping points to a wrong hctx. blktest block/030: null_blk: module loaded Increasing nr_hw_queues to 8 fails, fallback to 1 ================================================================== BUG: KASAN: null-ptr-deref in blk_mq_map_swqueue+0x2f2/0x830 Read of size 8 at addr 0000000000000128 by task nproc/8541 CPU: 5 PID: 8541 Comm: nproc Not tainted 5.7.0-rc4-dbg+ #3 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4-rebuilt.opensuse.org 04/01/2014 Call Trace: dump_stack+0xa5/0xe6 __kasan_report.cold+0x65/0xbb kasan_report+0x45/0x60 check_memory_region+0x15e/0x1c0 __kasan_check_read+0x15/0x20 blk_mq_map_swqueue+0x2f2/0x830 __blk_mq_update_nr_hw_queues+0x3df/0x690 blk_mq_update_nr_hw_queues+0x32/0x50 nullb_device_submit_queues_store+0xde/0x160 [null_blk] configfs_write_file+0x1c4/0x250 [configfs] __vfs_write+0x4c/0x90 vfs_write+0x14b/0x2d0 ksys_write+0xdd/0x180 __x64_sys_write+0x47/0x50 do_syscall_64+0x6f/0x310 entry_SYSCALL_64_after_hwframe+0x49/0xb3 Signed-off-by: Weiping Zhang <zhangweiping@didiglobal.com> Tested-by: Bart van Assche <bvanassche@acm.org> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-06-22block: alloc map and request for new hardware queueMing Lei
[ Upstream commit fd689871bbfbb41cd77379d3e9e5f4def0f7d6c6 ] Alloc new map and request for new hardware queue when increse hardware queue count. Before this patch, it will show a warning for each new hardware queue, but it's not enough, these hctx have no maps and reqeust, when a bio was mapped to these hardware queue, it will trigger kernel panic when get request from these hctx. Test environment: * A NVMe disk supports 128 io queues * 96 cpus in system A corner case can always trigger this panic, there are 96 io queues allocated for HCTX_TYPE_DEFAULT type, the corresponding kernel log: nvme nvme0: 96/0/0 default/read/poll queues. Now we set nvme write queues to 96, then nvme will alloc others(32) queues for read, but blk_mq_update_nr_hw_queues does not alloc map and request for these new added io queues. So when process read nvme disk, it will trigger kernel panic when get request from these hardware context. Reproduce script: nr=$(expr `cat /sys/block/nvme0n1/device/queue_count` - 1) echo $nr > /sys/module/nvme/parameters/write_queues echo 1 > /sys/block/nvme0n1/device/reset_controller dd if=/dev/nvme0n1 of=/dev/null bs=4K count=1 [ 8040.805626] ------------[ cut here ]------------ [ 8040.805627] WARNING: CPU: 82 PID: 12921 at block/blk-mq.c:2578 blk_mq_map_swqueue+0x2b6/0x2c0 [ 8040.805627] Modules linked in: nvme nvme_core nf_conntrack_netlink xt_addrtype br_netfilter overlay xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nft_counter nf_nat_tftp nf_conntrack_tftp nft_masq nf_tables_set nft_fib_inet nft_f ib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack tun bridge nf_defrag_ipv6 nf_defrag_ipv4 stp llc ip6_tables ip_tables nft_compat rfkill ip_set nf_tables nfne tlink sunrpc intel_rapl_msr intel_rapl_common skx_edac nfit libnvdimm x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass ipmi_ssif crct10dif_pclmul crc32_pclmul iTCO_wdt iTCO_vendor_support ghash_clmulni_intel intel_ cstate intel_uncore raid0 joydev intel_rapl_perf ipmi_si pcspkr mei_me ioatdma sg ipmi_devintf mei i2c_i801 dca lpc_ich ipmi_msghandler acpi_power_meter acpi_pad xfs libcrc32c sd_mod ast i2c_algo_bit drm_vram_helper drm_ttm_helper ttm d rm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops [ 8040.805637] ahci drm i40e libahci crc32c_intel libata t10_pi wmi dm_mirror dm_region_hash dm_log dm_mod [last unloaded: nvme_core] [ 8040.805640] CPU: 82 PID: 12921 Comm: kworker/u194:2 Kdump: loaded Tainted: G W 5.6.0-rc5.78317c+ #2 [ 8040.805640] Hardware name: Inspur SA5212M5/YZMB-00882-104, BIOS 4.0.9 08/27/2019 [ 8040.805641] Workqueue: nvme-reset-wq nvme_reset_work [nvme] [ 8040.805642] RIP: 0010:blk_mq_map_swqueue+0x2b6/0x2c0 [ 8040.805643] Code: 00 00 00 00 00 41 83 c5 01 44 39 6d 50 77 b8 5b 5d 41 5c 41 5d 41 5e 41 5f c3 48 8b bb 98 00 00 00 89 d6 e8 8c 81 03 00 eb 83 <0f> 0b e9 52 ff ff ff 0f 1f 00 0f 1f 44 00 00 41 57 48 89 f1 41 56 [ 8040.805643] RSP: 0018:ffffba590d2e7d48 EFLAGS: 00010246 [ 8040.805643] RAX: 0000000000000000 RBX: ffff9f013e1ba800 RCX: 000000000000003d [ 8040.805644] RDX: ffff9f00ffff6000 RSI: 0000000000000003 RDI: ffff9ed200246d90 [ 8040.805644] RBP: ffff9f00f6a79860 R08: 0000000000000000 R09: 000000000000003d [ 8040.805645] R10: 0000000000000001 R11: ffff9f0138c3d000 R12: ffff9f00fb3a9008 [ 8040.805645] R13: 000000000000007f R14: ffffffff96822660 R15: 000000000000005f [ 8040.805645] FS: 0000000000000000(0000) GS:ffff9f013fa80000(0000) knlGS:0000000000000000 [ 8040.805646] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 8040.805646] CR2: 00007f7f397fa6f8 CR3: 0000003d8240a002 CR4: 00000000007606e0 [ 8040.805647] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 8040.805647] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 8040.805647] PKRU: 55555554 [ 8040.805647] Call Trace: [ 8040.805649] blk_mq_update_nr_hw_queues+0x31b/0x390 [ 8040.805650] nvme_reset_work+0xb4b/0xeab [nvme] [ 8040.805651] process_one_work+0x1a7/0x370 [ 8040.805652] worker_thread+0x1c9/0x380 [ 8040.805653] ? max_active_store+0x80/0x80 [ 8040.805655] kthread+0x112/0x130 [ 8040.805656] ? __kthread_parkme+0x70/0x70 [ 8040.805657] ret_from_fork+0x35/0x40 [ 8040.805658] ---[ end trace b5f13b1e73ccb5d3 ]--- [ 8229.365135] BUG: kernel NULL pointer dereference, address: 0000000000000004 [ 8229.365165] #PF: supervisor read access in kernel mode [ 8229.365178] #PF: error_code(0x0000) - not-present page [ 8229.365191] PGD 0 P4D 0 [ 8229.365201] Oops: 0000 [#1] SMP PTI [ 8229.365212] CPU: 77 PID: 13024 Comm: dd Kdump: loaded Tainted: G W 5.6.0-rc5.78317c+ #2 [ 8229.365232] Hardware name: Inspur SA5212M5/YZMB-00882-104, BIOS 4.0.9 08/27/2019 [ 8229.365253] RIP: 0010:blk_mq_get_tag+0x227/0x250 [ 8229.365265] Code: 44 24 04 44 01 e0 48 8b 74 24 38 65 48 33 34 25 28 00 00 00 75 33 48 83 c4 40 5b 5d 41 5c 41 5d 41 5e c3 48 8d 68 10 4c 89 ef <44> 8b 60 04 48 89 ee e8 dd f9 ff ff 83 f8 ff 75 c8 e9 67 fe ff ff [ 8229.365304] RSP: 0018:ffffba590e977970 EFLAGS: 00010246 [ 8229.365317] RAX: 0000000000000000 RBX: ffff9f00f6a79860 RCX: ffffba590e977998 [ 8229.365333] RDX: 0000000000000000 RSI: ffff9f012039b140 RDI: ffffba590e977a38 [ 8229.365349] RBP: 0000000000000010 R08: ffffda58ff94e190 R09: ffffda58ff94e198 [ 8229.365365] R10: 0000000000000011 R11: ffff9f00f6a79860 R12: 0000000000000000 [ 8229.365381] R13: ffffba590e977a38 R14: ffff9f012039b140 R15: 0000000000000001 [ 8229.365397] FS: 00007f481c230580(0000) GS:ffff9f013f940000(0000) knlGS:0000000000000000 [ 8229.365415] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 8229.365428] CR2: 0000000000000004 CR3: 0000005f35e26004 CR4: 00000000007606e0 [ 8229.365444] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 8229.365460] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 8229.365476] PKRU: 55555554 [ 8229.365484] Call Trace: [ 8229.365498] ? finish_wait+0x80/0x80 [ 8229.365512] blk_mq_get_request+0xcb/0x3f0 [ 8229.365525] blk_mq_make_request+0x143/0x5d0 [ 8229.365538] generic_make_request+0xcf/0x310 [ 8229.365553] ? scan_shadow_nodes+0x30/0x30 [ 8229.365564] submit_bio+0x3c/0x150 [ 8229.365576] mpage_readpages+0x163/0x1a0 [ 8229.365588] ? blkdev_direct_IO+0x490/0x490 [ 8229.365601] read_pages+0x6b/0x190 [ 8229.365612] __do_page_cache_readahead+0x1c1/0x1e0 [ 8229.365626] ondemand_readahead+0x182/0x2f0 [ 8229.365639] generic_file_buffered_read+0x590/0xab0 [ 8229.365655] new_sync_read+0x12a/0x1c0 [ 8229.365666] vfs_read+0x8a/0x140 [ 8229.365676] ksys_read+0x59/0xd0 [ 8229.365688] do_syscall_64+0x55/0x1d0 [ 8229.365700] entry_SYSCALL_64_after_hwframe+0x44/0xa9 Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Weiping Zhang <zhangweiping@didiglobal.com> Tested-by: Weiping Zhang <zhangweiping@didiglobal.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-05-28Revert "block: end bio with BLK_STS_AGAIN in case of non-mq devs and REQ_NOWAIT"Jens Axboe
This reverts commit c58c1f83436b501d45d4050fd1296d71a9760bcb. io_uring does do the right thing for this case, and we're still returning -EAGAIN to userspace for the cases we don't support. Revert this change to avoid doing endless spins of resubmits. Cc: stable@vger.kernel.org # v5.6 Reported-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-09bdi: use bdi_dev_name() to get device nameYufen Yu
Use the common interface bdi_dev_name() to get device name. Signed-off-by: Yufen Yu <yuyufen@huawei.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Add missing <linux/backing-dev.h> include BFQ Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-05iocost: protect iocg->abs_vdebt with iocg->waitq.lockTejun Heo
abs_vdebt is an atomic_64 which tracks how much over budget a given cgroup is and controls the activation of use_delay mechanism. Once a cgroup goes over budget from forced IOs, it has to pay it back with its future budget. The progress guarantee on debt paying comes from the iocg being active - active iocgs are processed by the periodic timer, which ensures that as time passes the debts dissipate and the iocg returns to normal operation. However, both iocg activation and vdebt handling are asynchronous and a sequence like the following may happen. 1. The iocg is in the process of being deactivated by the periodic timer. 2. A bio enters ioc_rqos_throttle(), calls iocg_activate() which returns without anything because it still sees that the iocg is already active. 3. The iocg is deactivated. 4. The bio from #2 is over budget but needs to be forced. It increases abs_vdebt and goes over the threshold and enables use_delay. 5. IO control is enabled for the iocg's subtree and now IOs are attributed to the descendant cgroups and the iocg itself no longer issues IOs. This leaves the iocg with stuck abs_vdebt - it has debt but inactive and no further IOs which can activate it. This can end up unduly punishing all the descendants cgroups. The usual throttling path has the same issue - the iocg must be active while throttled to ensure that future event will wake it up - and solves the problem by synchronizing the throttling path with a spinlock. abs_vdebt handling is another form of overage handling and shares a lot of characteristics including the fact that it isn't in the hottest path. This patch fixes the above and other possible races by strictly synchronizing abs_vdebt and use_delay handling with iocg->waitq.lock. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Vlad Dmitriev <vvd@fb.com> Cc: stable@vger.kernel.org # v5.4+ Fixes: e1518f63f246 ("blk-iocost: Don't let merges push vtime into the future") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-04-30block: remove the bd_openers checks in blk_drop_partitionsChristoph Hellwig
When replacing the bd_super check with a bd_openers I followed a logical conclusion, which turns out to be utterly wrong. When a block device has bd_super sets it has a mount file system on it (although not every mounted file system sets bd_super), but that also implies it doesn't even have partitions to start with. So instead of trying to come up with a logical check for all openers, just remove the check entirely. Fixes: d3ef5536274f ("block: fix busy device checking in blk_drop_partitions") Fixes: cb6b771b05c3 ("block: fix busy device checking in blk_drop_partitions again") Reported-by: Michal Koutný <mkoutny@suse.com> Reported-by: Yang Xu <xuyang2018.jy@cn.fujitsu.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-04-21blk-iocost: Fix error on iocost_ioc_vrate_adjWaiman Long
Systemtap 4.2 is unable to correctly interpret the "u32 (*missed_ppm)[2]" argument of the iocost_ioc_vrate_adj trace entry defined in include/trace/events/iocost.h leading to the following error: /tmp/stapAcz0G0/stap_c89c58b83cea1724e26395efa9ed4939_6321_aux_6.c:78:8: error: expected ‘;’, ‘,’ or ‘)’ before ‘*’ token , u32[]* __tracepoint_arg_missed_ppm That argument type is indeed rather complex and hard to read. Looking at block/blk-iocost.c. It is just a 2-entry u32 array. By simplifying the argument to a simple "u32 *missed_ppm" and adjusting the trace entry accordingly, the compilation error was gone. Fixes: 7caa47151ab2 ("blkcg: implement blk-iocost") Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-04-17blk-wbt: Use tracepoint_string() for wbt_step tracepoint string literalsTommi Rantala
Use tracepoint_string() for string literals that are used in the wbt_step tracepoint, so that userspace tools can display the string content. Signed-off-by: Tommi Rantala <tommi.t.rantala@nokia.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-04-16blk-mq: Put driver tag in blk_mq_dispatch_rq_list() when no budgetJohn Garry
If in blk_mq_dispatch_rq_list() we find no budget, then we break of the dispatch loop, but the request may keep the driver tag, evaulated in 'nxt' in the previous loop iteration. Fix by putting the driver tag for that request. Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: John Garry <john.garry@huawei.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-04-10Merge tag 'block-5.7-2020-04-10' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block fixes from Jens Axboe: "Here's a set of fixes that should go into this merge window. This contains: - NVMe pull request from Christoph with various fixes - Better discard support for loop (Evan) - Only call ->commit_rqs() if we have queued IO (Keith) - blkcg offlining fixes (Tejun) - fix (and fix the fix) for busy partitions" * tag 'block-5.7-2020-04-10' of git://git.kernel.dk/linux-block: block: fix busy device checking in blk_drop_partitions again block: fix busy device checking in blk_drop_partitions nvmet-rdma: fix double free of rdma queue blk-mq: don't commit_rqs() if none were queued nvme-fc: Revert "add module to ops template to allow module references" nvme: fix deadlock caused by ANA update wrong locking nvmet-rdma: fix bonding failover possible NULL deref loop: Better discard support for block devices loop: Report EOPNOTSUPP properly nvmet: fix NULL dereference when removing a referral nvme: inherit stable pages constraint in the mpath stack device blkcg: don't offline parent blkcg first blkcg: rename blkcg->cgwb_refcnt to ->online_pin and always use it nvme-tcp: fix possible crash in recv error flow nvme-tcp: don't poll a non-live queue nvme-tcp: fix possible crash in write_zeroes processing nvmet-fc: fix typo in comment nvme-rdma: Replace comma with a semicolon nvme-fcloop: fix deallocation of working context nvme: fix compat address handling in several ioctls
2020-04-10block: fix busy device checking in blk_drop_partitions againChristoph Hellwig
The previous fix had an off by one in the bd_openers checking, counting the callers blkdev_get. Fixes: d3ef5536274f ("block: fix busy device checking in blk_drop_partitions") Reported-by: Qian Cai <cai@lca.pw> Signed-off-by: Christoph Hellwig <hch@lst.de> Tested-by: Qian Cai <cai@lca.pw> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-04-07block: fix busy device checking in blk_drop_partitionsChristoph Hellwig
bd_super is only set by get_tree_bdev and mount_bdev, and thus not by other openers like btrfs or the XFS realtime and log devices, as well as block devices directly opened from user space. Check bd_openers instead. Fixes: 77032ca66f86 ("Return EBUSY from BLKRRPART for mounted whole-dev fs") Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-04-06blk-mq: don't commit_rqs() if none were queuedKeith Busch
Unburden the drivers from checking if a call to commit_rqs() is valid by not calling it when there are no requests to commit. Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-04-02Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsiLinus Torvalds
Pull SCSI updates from James Bottomley: "This series has a huge amount of churn because it pulls in Mauro's doc update changing all our txt files to rst ones. Excluding that, we have the usual driver updates (qla2xxx, ufs, lpfc, zfcp, ibmvfc, pm80xx, aacraid), a treewide update for scnprintf and some other minor updates. The major core change is Hannes moving functions out of the aacraid driver and into the core" * tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (223 commits) scsi: aic7xxx: aic97xx: Remove FreeBSD-specific code scsi: ufs: Do not rely on prefetched data scsi: dc395x: remove dc395x_bios_param scsi: libiscsi: Fix error count for active session scsi: hpsa: correct race condition in offload enabled scsi: message: fusion: Replace zero-length array with flexible-array member scsi: qedi: Add PCI shutdown handler support scsi: qedi: Add MFW error recovery process scsi: ufs: Enable block layer runtime PM for well-known logical units scsi: ufs-qcom: Override devfreq parameters scsi: ufshcd: Let vendor override devfreq parameters scsi: ufshcd: Update the set frequency to devfreq scsi: ufs: Resume ufs host before accessing ufs device scsi: ufs-mediatek: customize the delay for enabling host scsi: ufs: make HCE polling more compact to improve initialization latency scsi: ufs: allow custom delay prior to host enabling scsi: ufs-mediatek: use common delay function scsi: ufs: introduce common and flexible delay function scsi: ufs: use an enum for host capabilities scsi: ufs: fix uninitialized tx_lanes in ufshcd_disable_tx_lcc() ...
2020-04-01Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial Pull trivial tree updates from Jiri Kosina: "My attempt to revitalize trivial queue I've been neglecting for years (what a disaster that was for this world, right? :) ) with patches collected from backlog that were still relevant and not applied elsewhere in the meantime" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: err.h: remove deprecated PTR_RET for good blk-mq: Fix typo in comment x86/boot: Fix comment spelling sh: mach-highlander: Fix comment spelling s390/dasd: Fix comment spelling mfd: wm8994: Fix comment spelling docs: Add reference in binfmt-misc.rst genirq: fix kerneldoc comment for irq_desc drm/amdgpu: fix two documentation mismatch issues HID: fix Kconfig word ordering list/hashtable: minor documentation corrections.
2020-04-01blkcg: don't offline parent blkcg firstTejun Heo
blkcg->cgwb_refcnt is used to delay blkcg offlining so that blkgs don't get offlined while there are active cgwbs on them. However, it ends up making offlining unordered sometimes causing parents to be offlined before children. Let's fix this by making child blkcgs pin the parents' online states. Note that pin/unpin names are chosen over get/put intentionally because css uses get/put online for something different. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-04-01blkcg: rename blkcg->cgwb_refcnt to ->online_pin and always use itTejun Heo
blkcg->cgwb_refcnt is used to delay blkcg offlining so that blkgs don't get offlined while there are active cgwbs on them. However, it ends up making offlining unordered sometimes causing parents to be offlined before children. To fix it, we want child blkcgs to pin the parents' online states turning the refcnt into a more generic online pinning mechanism. In prepartion, * blkcg->cgwb_refcnt -> blkcg->online_pin * blkcg_cgwb_get/put() -> blkcg_pin/unpin_online() * Take them out of CONFIG_CGROUP_WRITEBACK Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-30Merge branch 'efi-core-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull EFI updates from Ingo Molnar: "The EFI changes in this cycle are much larger than usual, for two (positive) reasons: - The GRUB project is showing signs of life again, resulting in the introduction of the generic Linux/UEFI boot protocol, instead of x86 specific hacks which are increasingly difficult to maintain. There's hope that all future extensions will now go through that boot protocol. - Preparatory work for RISC-V EFI support. The main changes are: - Boot time GDT handling changes - Simplify handling of EFI properties table on arm64 - Generic EFI stub cleanups, to improve command line handling, file I/O, memory allocation, etc. - Introduce a generic initrd loading method based on calling back into the firmware, instead of relying on the x86 EFI handover protocol or device tree. - Introduce a mixed mode boot method that does not rely on the x86 EFI handover protocol either, and could potentially be adopted by other architectures (if another one ever surfaces where one execution mode is a superset of another) - Clean up the contents of 'struct efi', and move out everything that doesn't need to be stored there. - Incorporate support for UEFI spec v2.8A changes that permit firmware implementations to return EFI_UNSUPPORTED from UEFI runtime services at OS runtime, and expose a mask of which ones are supported or unsupported via a configuration table. - Partial fix for the lack of by-VA cache maintenance in the decompressor on 32-bit ARM. - Changes to load device firmware from EFI boot service memory regions - Various documentation updates and minor code cleanups and fixes" * 'efi-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (114 commits) efi/libstub/arm: Fix spurious message that an initrd was loaded efi/libstub/arm64: Avoid image_base value from efi_loaded_image partitions/efi: Fix partition name parsing in GUID partition entry efi/x86: Fix cast of image argument efi/libstub/x86: Use ULONG_MAX as upper bound for all allocations efi: Fix a mistype in comments mentioning efivar_entry_iter_begin() efi/libstub: Avoid linking libstub/lib-ksyms.o into vmlinux efi/x86: Preserve %ebx correctly in efi_set_virtual_address_map() efi/x86: Ignore the memory attributes table on i386 efi/x86: Don't relocate the kernel unless necessary efi/x86: Remove extra headroom for setup block efi/x86: Add kernel preferred address to PE header efi/x86: Decompress at start of PE image load address x86/boot/compressed/32: Save the output address instead of recalculating it efi/libstub/x86: Deal with exit() boot service returning x86/boot: Use unsigned comparison for addresses efi/x86: Avoid using code32_start efi/x86: Make efi32_pe_entry() more readable efi/x86: Respect 32-bit ABI in efi32_pe_entry() efi/x86: Annotate the LOADED_IMAGE_PROTOCOL_GUID with SYM_DATA ...
2020-03-30Merge tag 'for-5.7/drivers-2020-03-29' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block driver updates from Jens Axboe: - floppy driver cleanup series from Willy - NVMe updates and fixes (Various) - null_blk trace improvements (Chaitanya) - bcache fixes (Coly) - md fixes (via Song) - loop block size change optimizations (Martijn) - scnprintf() use (Takashi) * tag 'for-5.7/drivers-2020-03-29' of git://git.kernel.dk/linux-block: (81 commits) null_blk: add trace in null_blk_zoned.c null_blk: add tracepoint helpers for zoned mode block: add a zone condition debug helper nvme: cleanup namespace identifier reporting in nvme_init_ns_head nvme: rename __nvme_find_ns_head to nvme_find_ns_head nvme: refactor nvme_identify_ns_descs error handling nvme-tcp: Add warning on state change failure at nvme_tcp_setup_ctrl nvme-rdma: Add warning on state change failure at nvme_rdma_setup_ctrl nvme: Fix controller creation races with teardown flow nvme: Make nvme_uninit_ctrl symmetric to nvme_init_ctrl nvme: Fix ctrl use-after-free during sysfs deletion nvme-pci: Re-order nvme_pci_free_ctrl nvme: Remove unused return code from nvme_delete_ctrl_sync nvme: Use nvme_state_terminal helper nvme: release ida resources nvme: Add compat_ioctl handler for NVME_IOCTL_SUBMIT_IO nvmet-tcp: optimize tcp stack TX when data digest is used nvme-fabrics: Use scnprintf() for avoiding potential buffer overflow nvme-multipath: do not reset on unknown status nvmet-rdma: allocate RW ctxs according to mdts ...
2020-03-30Merge tag 'for-5.7/block-2020-03-29' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block updates from Jens Axboe: - Online capacity resizing (Balbir) - Number of hardware queue change fixes (Bart) - null_blk fault injection addition (Bart) - Cleanup of queue allocation, unifying the node/no-node API (Christoph) - Cleanup of genhd, moving code to where it makes sense (Christoph) - Cleanup of the partition handling code (Christoph) - disk stat fixes/improvements (Konstantin) - BFQ improvements (Paolo) - Various fixes and improvements * tag 'for-5.7/block-2020-03-29' of git://git.kernel.dk/linux-block: (72 commits) block: return NULL in blk_alloc_queue() on error block: move bio_map_* to blk-map.c Revert "blkdev: check for valid request queue before issuing flush" block: simplify queue allocation bcache: pass the make_request methods to blk_queue_make_request null_blk: use blk_mq_init_queue_data block: add a blk_mq_init_queue_data helper block: move the ->devnode callback to struct block_device_operations block: move the part_stat* helpers from genhd.h to a new header block: move block layer internals out of include/linux/genhd.h block: move guard_bio_eod to bio.c block: unexport get_gendisk block: unexport disk_map_sector_rcu block: unexport disk_get_part block: mark part_in_flight and part_in_flight_rw static block: mark block_depr static block: factor out requeue handling from dispatch code block/diskstats: replace time_in_queue with sum of request times block/diskstats: accumulate all per-cpu counters in one pass block/diskstats: more accurate approximation of io_ticks for slow disks ...
2020-03-29block: return NULL in blk_alloc_queue() on errorChaitanya Kulkarni
This patch fixes follwoing warning: block/blk-core.c: In function ‘blk_alloc_queue’: block/blk-core.c:558:10: warning: returning ‘int’ from a function with return type ‘struct request_queue *’ makes pointer from integer without a cast [-Wint-conversion] return -EINVAL; Fixes: 3d745ea5b095a ("block: simplify queue allocation") Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-27block: add a zone condition debug helperChaitanya Kulkarni
Add a helper to stringify the zone conditions. We use this helper in the next patch to track zone conditions in tracepoints. Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-27block: move bio_map_* to blk-map.cChristoph Hellwig
The bio_map_* helpers are just the low-level helpers for the blk_rq_map_* APIs. Move them together for better logical grouping, as no there isn't much overlap with other code in bio.c. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-27Revert "blkdev: check for valid request queue before issuing flush"Christoph Hellwig
This reverts commit f10d9f617a65905c556c3b37c9b9646ae7d04ed7. We can't have queues without a make_request_fn any more (and the loop device uses blk-mq these days anyway..). Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-27block: simplify queue allocationChristoph Hellwig
Current make_request based drivers use either blk_alloc_queue_node or blk_alloc_queue to allocate a queue, and then set up the make_request_fn function pointer and a few parameters using the blk_queue_make_request helper. Simplify this by passing the make_request pointer to blk_alloc_queue, and while at it merge the _node variant into the main helper by always passing a node_id, and remove the superfluous gfp_mask parameter. A lower-level __blk_alloc_queue is kept for the blk-mq case. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-27block: add a blk_mq_init_queue_data helperChristoph Hellwig
This allows a driver to pass a queuedata member before ->init_hctx is called. null_blk currently open codes this logic, but I'd rather have it in the core to ease future maintainance. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-27block: move the ->devnode callback to struct block_device_operationsChristoph Hellwig
There really isn't any good reason to stash a method directly into struct gendisk. Move it together with the other block device operations. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-25block: move the part_stat* helpers from genhd.h to a new headerChristoph Hellwig
These macros are just used by a few files. Move them out of genhd.h, which is included everywhere into a new standalone header. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-25block: move block layer internals out of include/linux/genhd.hChristoph Hellwig
None of this needs to be exposed to drivers. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-25block: move guard_bio_eod to bio.cChristoph Hellwig
This is bio layer functionality and not related to buffer heads. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-25block: unexport get_gendiskChristoph Hellwig
get_gendisk is not used by any modular code. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-25block: unexport disk_map_sector_rcuChristoph Hellwig
disk_map_sector_rcu is not used by any modular code. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-25block: unexport disk_get_partChristoph Hellwig
disk_get_part is not used by any modular code. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-25block: mark part_in_flight and part_in_flight_rw staticChristoph Hellwig
Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-25block: mark block_depr staticChristoph Hellwig
Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-25block: factor out requeue handling from dispatch codeJohannes Thumshirn
Factor out the requeue handling from the dispatch code, this will make subsequent addition of different requeueing schemes easier. Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-25block/diskstats: replace time_in_queue with sum of request timesKonstantin Khlebnikov
Column "time_in_queue" in diskstats is supposed to show total waiting time of all requests. I.e. value should be equal to the sum of times from other columns. But this is not true, because column "time_in_queue" is counted separately in jiffies rather than in nanoseconds as other times. This patch removes redundant counter for "time_in_queue" and shows total time of read, write, discard and flush requests. Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-25block/diskstats: accumulate all per-cpu counters in one passKonstantin Khlebnikov
Reading /proc/diskstats iterates over all cpus for summing each field. It's faster to sum all fields in one pass. Hammering /proc/diskstats with fio shows 2x performance improvement: fio --name=test --numjobs=$JOBS --filename=/proc/diskstats \ --size=1k --bs=1k --fallocate=none --create_on_open=1 \ --time_based=1 --runtime=10 --invalidate=0 --group_report JOBS=1 JOBS=10 Before: 7k iops 64k iops After: 18k iops 120k iops Also this way code is more compact: add/remove: 1/0 grow/shrink: 0/2 up/down: 194/-1540 (-1346) Function old new delta part_stat_read_all - 194 +194 diskstats_show 1344 631 -713 part_stat_show 1219 392 -827 Total: Before=14966947, After=14965601, chg -0.01% Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-25block/diskstats: more accurate approximation of io_ticks for slow disksKonstantin Khlebnikov
Currently io_ticks is approximated by adding one at each start and end of requests if jiffies counter has changed. This works perfectly for requests shorter than a jiffy or if one of requests starts/ends at each jiffy. If disk executes just one request at a time and they are longer than two jiffies then only first and last jiffies will be accounted. Fix is simple: at the end of request add up into io_ticks jiffies passed since last update rather than just one jiffy. Example: common HDD executes random read 4k requests around 12ms. fio --name=test --filename=/dev/sdb --rw=randread --direct=1 --runtime=30 & iostat -x 10 sdb Note changes of iostat's "%util" 8,43% -> 99,99% before/after patch: Before: Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util sdb 0,00 0,00 82,60 0,00 330,40 0,00 8,00 0,96 12,09 12,09 0,00 1,02 8,43 After: Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util sdb 0,00 0,00 82,50 0,00 330,00 0,00 8,00 1,00 12,10 12,10 0,00 12,12 99,99 Now io_ticks does not loose time between start and end of requests, but for queue-depth > 1 some I/O time between adjacent starts might be lost. For load estimation "%util" is not as useful as average queue length, but it clearly shows how often disk queue is completely empty. Fixes: 5b18b5a73760 ("block: delete part_round_stats and switch to less precise counting") Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-24block: merge partition-generic.c and check.cChristoph Hellwig
Merge block/partition-generic.c and block/partitions/check.c into a single block/partitions/core.c as the content is closely related and both files are tiny. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-24block: move the various x86 Unix label formats out of genhd.hChristoph Hellwig
All these are just used in block/partitions/msdos.c, so move them out of the genhd.h driver included by every driver. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>