summaryrefslogtreecommitdiffstats
path: root/block/bfq-iosched.c
AgeCommit message (Collapse)Author
2021-10-06Revert "block, bfq: honor already-setup queue merges"Jens Axboe
[ Upstream commit ebc69e897e17373fbe1daaff1debaa77583a5284 ] This reverts commit 2d52c58b9c9bdae0ca3df6a1eab5745ab3f7d80b. We have had several folks complain that this causes hangs for them, which is especially problematic as the commit has also hit stable already. As no resolution seems to be forthcoming right now, revert the patch. Link: https://bugzilla.kernel.org/show_bug.cgi?id=214503 Fixes: 2d52c58b9c9b ("block, bfq: honor already-setup queue merges") Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-09-22block, bfq: honor already-setup queue mergesPaolo Valente
[ Upstream commit 2d52c58b9c9bdae0ca3df6a1eab5745ab3f7d80b ] The function bfq_setup_merge prepares the merging between two bfq_queues, say bfqq and new_bfqq. To this goal, it assigns bfqq->new_bfqq = new_bfqq. Then, each time some I/O for bfqq arrives, the process that generated that I/O is disassociated from bfqq and associated with new_bfqq (merging is actually a redirection). In this respect, bfq_setup_merge increases new_bfqq->ref in advance, adding the number of processes that are expected to be associated with new_bfqq. Unfortunately, the stable-merging mechanism interferes with this setup. After bfqq->new_bfqq has been set by bfq_setup_merge, and before all the expected processes have been associated with bfqq->new_bfqq, bfqq may happen to be stably merged with a different queue than the current bfqq->new_bfqq. In this case, bfqq->new_bfqq gets changed. So, some of the processes that have been already accounted for in the ref counter of the previous new_bfqq will not be associated with that queue. This creates an unbalance, because those references will never be decremented. This commit fixes this issue by reestablishing the previous, natural behaviour: once bfqq->new_bfqq has been set, it will not be changed until all expected redirections have occurred. Signed-off-by: Davide Zini <davidezini2@gmail.com> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Link: https://lore.kernel.org/r/20210802141352.74353-2-paolo.valente@linaro.org Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-09-22block: bfq: fix bfq_set_next_ioprio_data()Damien Le Moal
commit a680dd72ec336b81511e3bff48efac6dbfa563e7 upstream. For a request that has a priority level equal to or larger than IOPRIO_BE_NR, bfq_set_next_ioprio_data() prints a critical warning but defaults to setting the request new_ioprio field to IOPRIO_BE_NR. This is not consistent with the warning and the allowed values for priority levels. Fix this by setting the request new_ioprio field to IOPRIO_BE_NR - 1, the lowest priority level allowed. Cc: <stable@vger.kernel.org> Fixes: aee69d78dec0 ("block, bfq: introduce the BFQ-v0 I/O scheduler as an extra scheduler") Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://lore.kernel.org/r/20210811033702.368488-2-damien.lemoal@wdc.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-05-19kyber: fix out of bounds access when preemptedOmar Sandoval
[ Upstream commit efed9a3337e341bd0989161b97453b52567bc59d ] __blk_mq_sched_bio_merge() gets the ctx and hctx for the current CPU and passes the hctx to ->bio_merge(). kyber_bio_merge() then gets the ctx for the current CPU again and uses that to get the corresponding Kyber context in the passed hctx. However, the thread may be preempted between the two calls to blk_mq_get_ctx(), and the ctx returned the second time may no longer correspond to the passed hctx. This "works" accidentally most of the time, but it can cause us to read garbage if the second ctx came from an hctx with more ctx's than the first one (i.e., if ctx->index_hw[hctx->type] > hctx->nr_ctx). This manifested as this UBSAN array index out of bounds error reported by Jakub: UBSAN: array-index-out-of-bounds in ../kernel/locking/qspinlock.c:130:9 index 13106 is out of range for type 'long unsigned int [128]' Call Trace: dump_stack+0xa4/0xe5 ubsan_epilogue+0x5/0x40 __ubsan_handle_out_of_bounds.cold.13+0x2a/0x34 queued_spin_lock_slowpath+0x476/0x480 do_raw_spin_lock+0x1c2/0x1d0 kyber_bio_merge+0x112/0x180 blk_mq_submit_bio+0x1f5/0x1100 submit_bio_noacct+0x7b0/0x870 submit_bio+0xc2/0x3a0 btrfs_map_bio+0x4f0/0x9d0 btrfs_submit_data_bio+0x24e/0x310 submit_one_bio+0x7f/0xb0 submit_extent_page+0xc4/0x440 __extent_writepage_io+0x2b8/0x5e0 __extent_writepage+0x28d/0x6e0 extent_write_cache_pages+0x4d7/0x7a0 extent_writepages+0xa2/0x110 do_writepages+0x8f/0x180 __writeback_single_inode+0x99/0x7f0 writeback_sb_inodes+0x34e/0x790 __writeback_inodes_wb+0x9e/0x120 wb_writeback+0x4d2/0x660 wb_workfn+0x64d/0xa10 process_one_work+0x53a/0xa80 worker_thread+0x69/0x5b0 kthread+0x20b/0x240 ret_from_fork+0x1f/0x30 Only Kyber uses the hctx, so fix it by passing the request_queue to ->bio_merge() instead. BFQ and mq-deadline just use that, and Kyber can map the queues itself to avoid the mismatch. Fixes: a6088845c2bf ("block: kyber: make kyber more friendly with merging") Reported-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Omar Sandoval <osandov@fb.com> Link: https://lore.kernel.org/r/c7598605401a48d5cfeadebb678abd10af22b83f.1620691329.git.osandov@fb.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-03-04bfq: Avoid false bfq queue mergingJan Kara
commit 41e76c85660c022c6bf5713bfb6c21e64a487cec upstream. bfq_setup_cooperator() uses bfqd->in_serv_last_pos so detect whether it makes sense to merge current bfq queue with the in-service queue. However if the in-service queue is freshly scheduled and didn't dispatch any requests yet, bfqd->in_serv_last_pos is stale and contains value from the previously scheduled bfq queue which can thus result in a bogus decision that the two queues should be merged. This bug can be observed for example with the following fio jobfile: [global] direct=0 ioengine=sync invalidate=1 size=1g rw=read [reader] numjobs=4 directory=/mnt where the 4 processes will end up in the one shared bfq queue although they do IO to physically very distant files (for some reason I was able to observe this only with slice_idle=1ms setting). Fix the problem by invalidating bfqd->in_serv_last_pos when switching in-service queue. Fixes: 058fdecc6de7 ("block, bfq: fix in-service-queue check for queue merging") CC: stable@vger.kernel.org Signed-off-by: Jan Kara <jack@suse.cz> Acked-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-02-17bfq-iosched: Revert "bfq: Fix computation of shallow depth"Lin Feng
[ Upstream commit 388c705b95f23f317fa43e6abf9ff07b583b721a ] This reverts commit 6d4d273588378c65915acaf7b2ee74e9dd9c130a. bfq.limit_depth passes word_depths[] as shallow_depth down to sbitmap core sbitmap_get_shallow, which uses just the number to limit the scan depth of each bitmap word, formula: scan_percentage_for_each_word = shallow_depth / (1 << sbimap->shift) * 100% That means the comments's percentiles 50%, 75%, 18%, 37% of bfq are correct. But after commit patch 'bfq: Fix computation of shallow depth', we use sbitmap.depth instead, as a example in following case: sbitmap.depth = 256, map_nr = 4, shift = 6; sbitmap_word.depth = 64. The resulsts of computed bfqd->word_depths[] are {128, 192, 48, 96}, and three of the numbers exceed core dirver's 'sbitmap_word.depth=64' limit nothing. Signed-off-by: Lin Feng <linf@wangsu.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-01-19bfq: Fix computation of shallow depthJan Kara
[ Upstream commit 6d4d273588378c65915acaf7b2ee74e9dd9c130a ] BFQ computes number of tags it allows to be allocated for each request type based on tag bitmap. However it uses 1 << bitmap.shift as number of available tags which is wrong. 'shift' is just an internal bitmap value containing logarithm of how many bits bitmap uses in each bitmap word. Thus number of tags allowed for some request types can be far to low. Use proper bitmap.depth which has the number of tags instead. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-09-23block: only call sched requeue_request() for scheduled requestsOmar Sandoval
[ Upstream commit e8a8a185051a460e3eb0617dca33f996f4e31516 ] Yang Yang reported the following crash caused by requeueing a flush request in Kyber: [ 2.517297] Unable to handle kernel paging request at virtual address ffffffd8071c0b00 ... [ 2.517468] pc : clear_bit+0x18/0x2c [ 2.517502] lr : sbitmap_queue_clear+0x40/0x228 [ 2.517503] sp : ffffff800832bc60 pstate : 00c00145 ... [ 2.517599] Process ksoftirqd/5 (pid: 51, stack limit = 0xffffff8008328000) [ 2.517602] Call trace: [ 2.517606] clear_bit+0x18/0x2c [ 2.517619] kyber_finish_request+0x74/0x80 [ 2.517627] blk_mq_requeue_request+0x3c/0xc0 [ 2.517637] __scsi_queue_insert+0x11c/0x148 [ 2.517640] scsi_softirq_done+0x114/0x130 [ 2.517643] blk_done_softirq+0x7c/0xb0 [ 2.517651] __do_softirq+0x208/0x3bc [ 2.517657] run_ksoftirqd+0x34/0x60 [ 2.517663] smpboot_thread_fn+0x1c4/0x2c0 [ 2.517667] kthread+0x110/0x120 [ 2.517669] ret_from_fork+0x10/0x18 This happens because Kyber doesn't track flush requests, so kyber_finish_request() reads a garbage domain token. Only call the scheduler's requeue_request() hook if RQF_ELVPRIV is set (like we do for the finish_request() hook in blk_mq_free_request()). Now that we're handling it in blk-mq, also remove the check from BFQ. Reported-by: Yang Yang <yang.yang@vivo.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-04-23block, bfq: turn put_queue into release_process_ref in __bfq_bic_change_cgroupPaolo Valente
commit c8997736650060594845e42c5d01d3118aec8d25 upstream. A bfq_put_queue() may be invoked in __bfq_bic_change_cgroup(). The goal of this put is to release a process reference to a bfq_queue. But process-reference releases may trigger also some extra operation, and, to this goal, are handled through bfq_release_process_ref(). So, turn the invocation of bfq_put_queue() into an invocation of bfq_release_process_ref(). Tested-by: cki-project@redhat.com Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-04-17block, bfq: fix use-after-free in bfq_idle_slice_timer_bodyZhiqiang Liu
[ Upstream commit 2f95fa5c955d0a9987ffdc3a095e2f4e62c5f2a9 ] In bfq_idle_slice_timer func, bfqq = bfqd->in_service_queue is not in bfqd-lock critical section. The bfqq, which is not equal to NULL in bfq_idle_slice_timer, may be freed after passing to bfq_idle_slice_timer_body. So we will access the freed memory. In addition, considering the bfqq may be in race, we should firstly check whether bfqq is in service before doing something on it in bfq_idle_slice_timer_body func. If the bfqq in race is not in service, it means the bfqq has been expired through __bfq_bfqq_expire func, and wait_request flags has been cleared in __bfq_bfqd_reset_in_service func. So we do not need to re-clear the wait_request of bfqq which is not in service. KASAN log is given as follows: [13058.354613] ================================================================== [13058.354640] BUG: KASAN: use-after-free in bfq_idle_slice_timer+0xac/0x290 [13058.354644] Read of size 8 at addr ffffa02cf3e63f78 by task fork13/19767 [13058.354646] [13058.354655] CPU: 96 PID: 19767 Comm: fork13 [13058.354661] Call trace: [13058.354667] dump_backtrace+0x0/0x310 [13058.354672] show_stack+0x28/0x38 [13058.354681] dump_stack+0xd8/0x108 [13058.354687] print_address_description+0x68/0x2d0 [13058.354690] kasan_report+0x124/0x2e0 [13058.354697] __asan_load8+0x88/0xb0 [13058.354702] bfq_idle_slice_timer+0xac/0x290 [13058.354707] __hrtimer_run_queues+0x298/0x8b8 [13058.354710] hrtimer_interrupt+0x1b8/0x678 [13058.354716] arch_timer_handler_phys+0x4c/0x78 [13058.354722] handle_percpu_devid_irq+0xf0/0x558 [13058.354731] generic_handle_irq+0x50/0x70 [13058.354735] __handle_domain_irq+0x94/0x110 [13058.354739] gic_handle_irq+0x8c/0x1b0 [13058.354742] el1_irq+0xb8/0x140 [13058.354748] do_wp_page+0x260/0xe28 [13058.354752] __handle_mm_fault+0x8ec/0x9b0 [13058.354756] handle_mm_fault+0x280/0x460 [13058.354762] do_page_fault+0x3ec/0x890 [13058.354765] do_mem_abort+0xc0/0x1b0 [13058.354768] el0_da+0x24/0x28 [13058.354770] [13058.354773] Allocated by task 19731: [13058.354780] kasan_kmalloc+0xe0/0x190 [13058.354784] kasan_slab_alloc+0x14/0x20 [13058.354788] kmem_cache_alloc_node+0x130/0x440 [13058.354793] bfq_get_queue+0x138/0x858 [13058.354797] bfq_get_bfqq_handle_split+0xd4/0x328 [13058.354801] bfq_init_rq+0x1f4/0x1180 [13058.354806] bfq_insert_requests+0x264/0x1c98 [13058.354811] blk_mq_sched_insert_requests+0x1c4/0x488 [13058.354818] blk_mq_flush_plug_list+0x2d4/0x6e0 [13058.354826] blk_flush_plug_list+0x230/0x548 [13058.354830] blk_finish_plug+0x60/0x80 [13058.354838] read_pages+0xec/0x2c0 [13058.354842] __do_page_cache_readahead+0x374/0x438 [13058.354846] ondemand_readahead+0x24c/0x6b0 [13058.354851] page_cache_sync_readahead+0x17c/0x2f8 [13058.354858] generic_file_buffered_read+0x588/0xc58 [13058.354862] generic_file_read_iter+0x1b4/0x278 [13058.354965] ext4_file_read_iter+0xa8/0x1d8 [ext4] [13058.354972] __vfs_read+0x238/0x320 [13058.354976] vfs_read+0xbc/0x1c0 [13058.354980] ksys_read+0xdc/0x1b8 [13058.354984] __arm64_sys_read+0x50/0x60 [13058.354990] el0_svc_common+0xb4/0x1d8 [13058.354994] el0_svc_handler+0x50/0xa8 [13058.354998] el0_svc+0x8/0xc [13058.354999] [13058.355001] Freed by task 19731: [13058.355007] __kasan_slab_free+0x120/0x228 [13058.355010] kasan_slab_free+0x10/0x18 [13058.355014] kmem_cache_free+0x288/0x3f0 [13058.355018] bfq_put_queue+0x134/0x208 [13058.355022] bfq_exit_icq_bfqq+0x164/0x348 [13058.355026] bfq_exit_icq+0x28/0x40 [13058.355030] ioc_exit_icq+0xa0/0x150 [13058.355035] put_io_context_active+0x250/0x438 [13058.355038] exit_io_context+0xd0/0x138 [13058.355045] do_exit+0x734/0xc58 [13058.355050] do_group_exit+0x78/0x220 [13058.355054] __wake_up_parent+0x0/0x50 [13058.355058] el0_svc_common+0xb4/0x1d8 [13058.355062] el0_svc_handler+0x50/0xa8 [13058.355066] el0_svc+0x8/0xc [13058.355067] [13058.355071] The buggy address belongs to the object at ffffa02cf3e63e70#012 which belongs to the cache bfq_queue of size 464 [13058.355075] The buggy address is located 264 bytes inside of#012 464-byte region [ffffa02cf3e63e70, ffffa02cf3e64040) [13058.355077] The buggy address belongs to the page: [13058.355083] page:ffff7e80b3cf9800 count:1 mapcount:0 mapping:ffff802db5c90780 index:0xffffa02cf3e606f0 compound_mapcount: 0 [13058.366175] flags: 0x2ffffe0000008100(slab|head) [13058.370781] raw: 2ffffe0000008100 ffff7e80b53b1408 ffffa02d730c1c90 ffff802db5c90780 [13058.370787] raw: ffffa02cf3e606f0 0000000000370023 00000001ffffffff 0000000000000000 [13058.370789] page dumped because: kasan: bad access detected [13058.370791] [13058.370792] Memory state around the buggy address: [13058.370797] ffffa02cf3e63e00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fb fb [13058.370801] ffffa02cf3e63e80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [13058.370805] >ffffa02cf3e63f00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [13058.370808] ^ [13058.370811] ffffa02cf3e63f80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [13058.370815] ffffa02cf3e64000: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc [13058.370817] ================================================================== [13058.370820] Disabling lock debugging due to kernel taint Here, we directly pass the bfqd to bfq_idle_slice_timer_body func. -- V2->V3: rewrite the comment as suggested by Paolo Valente V1->V2: add one comment, and add Fixes and Reported-by tag. Fixes: aee69d78d ("block, bfq: introduce the BFQ-v0 I/O scheduler as an extra scheduler") Acked-by: Paolo Valente <paolo.valente@linaro.org> Reported-by: Wang Wang <wangwang2@huawei.com> Signed-off-by: Zhiqiang Liu <liuzhiqiang26@huawei.com> Signed-off-by: Feilong Lin <linfeilong@huawei.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-03-12block, bfq: remove ifdefs from around gets/puts of bfq groupsPaolo Valente
commit 4d8340d0d4d90e7ca367d18ec16c2fefa89a339c upstream. ifdefs around gets and puts of bfq groups reduce readability, remove them. Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Reported-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-03-12block, bfq: do not insert oom queue into position treePaolo Valente
[ Upstream commit 32c59e3a9a5a0b180dd015755d6d18ca31e55935 ] BFQ maintains an ordered list, implemented with an RB tree, of head-request positions of non-empty bfq_queues. This position tree, inherited from CFQ, is used to find bfq_queues that contain I/O close to each other. BFQ merges these bfq_queues into a single shared queue, if this boosts throughput on the device at hand. There is however a special-purpose bfq_queue that does not participate in queue merging, the oom bfq_queue. Yet, also this bfq_queue could be wrongly added to the position tree. So bfqq_find_close() could return the oom bfq_queue, which is a source of further troubles in an out-of-memory situation. This commit prevents the oom bfq_queue from being inserted into the position tree. Tested-by: Patrick Dung <patdung100@gmail.com> Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-02-24block, bfq: do not plug I/O for bfq_queues with no proc refsPaolo Valente
[ Upstream commit f718b093277df582fbf8775548a4f163e664d282 ] Commit 478de3380c1c ("block, bfq: deschedule empty bfq_queues not referred by any process") fixed commit 3726112ec731 ("block, bfq: re-schedule empty queues if they deserve I/O plugging") by descheduling an empty bfq_queue when it remains with not process reference. Yet, this still left a case uncovered: an empty bfq_queue with not process reference that remains in service. This happens for an in-service sync bfq_queue that is deemed to deserve I/O-dispatch plugging when it remains empty. Yet no new requests will arrive for such a bfq_queue if no process sends requests to it any longer. Even worse, the bfq_queue may happen to be prematurely freed while still in service (because there may remain no reference to it any longer). This commit solves this problem by preventing I/O dispatch from being plugged for the in-service bfq_queue, if the latter has no process reference (the bfq_queue is then prevented from remaining in service). Fixes: 3726112ec731 ("block, bfq: re-schedule empty queues if they deserve I/O plugging") Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Reported-by: Patrick Dung <patdung100@gmail.com> Tested-by: Patrick Dung <patdung100@gmail.com> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-11-14block, bfq: deschedule empty bfq_queues not referred by any processPaolo Valente
Since commit 3726112ec731 ("block, bfq: re-schedule empty queues if they deserve I/O plugging"), to prevent the service guarantees of a bfq_queue from being violated, the bfq_queue may be left busy, i.e., scheduled for service, even if empty (see comments in __bfq_bfqq_expire() for details). But, if no process will send requests to the bfq_queue any longer, then there is no point in keeping the bfq_queue scheduled for service. In addition, keeping the bfq_queue scheduled for service, but with no process reference any longer, may cause the bfq_queue to be freed when descheduled from service. But this is assumed to never happen, and causes a UAF if it happens. This, in turn, caused crashes [1, 2]. This commit fixes this issue by descheduling an empty bfq_queue when it remains with not process reference. [1] https://bugzilla.redhat.com/show_bug.cgi?id=1767539 [2] https://bugzilla.kernel.org/show_bug.cgi?id=205447 Fixes: 3726112ec731 ("block, bfq: re-schedule empty queues if they deserve I/O plugging") Reported-by: Chris Evich <cevich@redhat.com> Reported-by: Patrick Dung <patdung100@gmail.com> Reported-by: Thorsten Schubert <tschubert@bafh.org> Tested-by: Thorsten Schubert <tschubert@bafh.org> Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-09-17block, bfq: push up injection only after setting service timePaolo Valente
If equal to 0, the injection limit for a bfq_queue is pushed to 1 after a first sample of the total service time of the I/O requests of the queue is computed (to allow injection to start). Yet, because of a mistake in the branch that performs this action, the push may happen also in some other case. This commit fixes this issue. Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-09-17block, bfq: increase update frequency of inject limitPaolo Valente
The update period of the injection limit has been tentatively set to 100 ms, to reduce fluctuations. This value however proved to cause, occasionally, the limit to be decremented for some bfq_queue only after the queue underwent excessive injection for a lot of time. This commit reduces the period to 10 ms. Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-09-17block, bfq: reduce upper bound for inject limit to max_rq_in_driver+1Paolo Valente
Upon an increment attempt of the injection limit, the latter is constrained not to become higher than twice the maximum number max_rq_in_driver of I/O requests that have happened to be in service in the drive. This high bound allows the injection limit to grow beyond max_rq_in_driver, which may then cause max_rq_in_driver itself to grow. However, since the limit is incremented by only one unit at a time, there is no need for such a high bound, and just max_rq_in_driver+1 is enough. Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-09-17block, bfq: update inject limit only after injection occurredPaolo Valente
BFQ updates the injection limit of each bfq_queue as a function of how much the limit inflates the service times experienced by the I/O requests of the queue. So only service times affected by injection must be taken into account. Unfortunately, in the current implementation of this update scheme, the service time of an I/O request rq not affected by injection may happen to be considered in the following case: there is no I/O request in service when rq arrives. This commit fixes this issue by making sure that only service times affected by injection are considered for updating the injection limit. In particular, the service time of an I/O request rq is now considered only if at least one of the following two conditions holds: - the destination bfq_queue for rq underwent injection before rq arrival, and there is still I/O in service in the drive on rq arrival (the service of such unfinished I/O may delay the service of rq); - injection occurs between the arrival and the completion time of rq. Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-08-08block, bfq: handle NULL return value by bfq_init_rq()Paolo Valente
As reported in [1], the call bfq_init_rq(rq) may return NULL in case of OOM (in particular, if rq->elv.icq is NULL because memory allocation failed in failed in ioc_create_icq()). This commit handles this circumstance. [1] https://lkml.org/lkml/2019/7/22/824 Cc: Hsin-Yi Wang <hsinyi@google.com> Cc: Nicolas Boichat <drinkcat@chromium.org> Cc: Doug Anderson <dianders@chromium.org> Reported-by: Guenter Roeck <linux@roeck-us.net> Reported-by: Hsin-Yi Wang <hsinyi@google.com> Reviewed-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-08-08block, bfq: move update of waker and woken list to queue freeingPaolo Valente
Since commit 13a857a4c4e8 ("block, bfq: detect wakers and unconditionally inject their I/O"), every bfq_queue has a pointer to a waker bfq_queue and a list of the bfq_queues it may wake. In this respect, when a bfq_queue, say Q, remains with no I/O source attached to it, Q cannot be woken by any other bfq_queue, and cannot wake any other bfq_queue. Then Q must be removed from the woken list of its possible waker bfq_queue, and all bfq_queues in the woken list of Q must stop having a waker bfq_queue. Q remains with no I/O source in two cases: when the last process associated with Q exits or when such a process gets associated with a different bfq_queue. Unfortunately, commit 13a857a4c4e8 ("block, bfq: detect wakers and unconditionally inject their I/O") performed the above updates only in the first case. This commit fixes this bug by moving these updates to when Q gets freed. This is a simple and safe way to handle all cases, as both the above events, process exit and re-association, lead to Q being freed soon, and because dangling references would come out only after Q gets freed (if no update were performed). Fixes: 13a857a4c4e8 ("block, bfq: detect wakers and unconditionally inject their I/O") Reported-by: Douglas Anderson <dianders@chromium.org> Tested-by: Douglas Anderson <dianders@chromium.org> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-08-08block, bfq: reset last_completed_rq_bfqq if the pointed queue is freedPaolo Valente
Since commit 13a857a4c4e8 ("block, bfq: detect wakers and unconditionally inject their I/O"), BFQ stores, in a per-device pointer last_completed_rq_bfqq, the last bfq_queue that had an I/O request completed. If some bfq_queue receives new I/O right after the last request of last_completed_rq_bfqq has been completed, then last_completed_rq_bfqq may be a waker bfq_queue. But if the bfq_queue last_completed_rq_bfqq points to is freed, then last_completed_rq_bfqq becomes a dangling reference. This commit resets last_completed_rq_bfqq if the pointed bfq_queue is freed. Fixes: 13a857a4c4e8 ("block, bfq: detect wakers and unconditionally inject their I/O") Reported-by: Douglas Anderson <dianders@chromium.org> Tested-by: Douglas Anderson <dianders@chromium.org> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-07-26Merge tag 'for-linus-20190726' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block fixes from Jens Axboe: - Several io_uring fixes/improvements: - Blocking fix for O_DIRECT (me) - Latter page slowness for registered buffers (me) - Fix poll hang under certain conditions (me) - Defer sequence check fix for wrapped rings (Zhengyuan) - Mismatch in async inc/dec accounting (Zhengyuan) - Memory ordering issue that could cause stall (Zhengyuan) - Track sequential defer in bytes, not pages (Zhengyuan) - NVMe pull request from Christoph - Set of hang fixes for wbt (Josef) - Redundant error message kill for libahci (Ding) - Remove unused blk_mq_sched_started_request() and related ops (Marcos) - drbd dynamic alloc shash descriptor to reduce stack use (Arnd) - blkcg ->pd_stat() non-debug print (Tejun) - bcache memory leak fix (Wei) - Comment fix (Akinobu) - BFQ perf regression fix (Paolo) * tag 'for-linus-20190726' of git://git.kernel.dk/linux-block: (24 commits) io_uring: ensure ->list is initialized for poll commands Revert "nvme-pci: don't create a read hctx mapping without read queues" nvme: fix multipath crash when ANA is deactivated nvme: fix memory leak caused by incorrect subsystem free nvme: ignore subnqn for ADATA SX6000LNP drbd: dynamically allocate shash descriptor block: blk-mq: Remove blk_mq_sched_started_request and started_request bcache: fix possible memory leak in bch_cached_dev_run() io_uring: track io length in async_list based on bytes io_uring: don't use iov_iter_advance() for fixed buffers block: properly handle IOCB_NOWAIT for async O_DIRECT IO blk-mq: allow REQ_NOWAIT to return an error inline io_uring: add a memory barrier before atomic_read rq-qos: use a mb for got_token rq-qos: set ourself TASK_UNINTERRUPTIBLE after we schedule rq-qos: don't reset has_sleepers on spurious wakeups rq-qos: fix missed wake-ups in rq_qos_throttle wait: add wq_has_single_sleeper helper block, bfq: check also in-flight I/O in dispatch plugging block: fix sysfs module parameters directory path in comment ...
2019-07-18block, bfq: check also in-flight I/O in dispatch pluggingPaolo Valente
Consider a sync bfq_queue Q that remains empty while in service, and suppose that, when this happens, there is a fair amount of already in-flight I/O not belonging to Q. In such a situation, I/O dispatching may need to be plugged (until new I/O arrives for Q), for the following reason. The drive may decide to serve in-flight non-Q's I/O requests before Q's ones, thereby delaying the arrival of new I/O requests for Q (recall that Q is sync). If I/O-dispatching is not plugged, then, while Q remains empty, a basically uncontrolled amount of I/O from other queues may be dispatched too, possibly causing the service of Q's I/O to be delayed even longer in the drive. This problem gets more and more serious as the speed and the queue depth of the drive grow, because, as these two quantities grow, the probability to find no queue busy but many requests in flight grows too. If Q has the same weight and priority as the other queues, then the above delay is unlikely to cause any issue, because all queues tend to undergo the same treatment. So, since not plugging I/O dispatching is convenient for throughput, it is better not to plug. Things change in case Q has a higher weight or priority than some other queue, because Q's service guarantees may simply be violated. For this reason, commit 1de0c4cd9ea6 ("block, bfq: reduce idling only in symmetric scenarios") does plug I/O in such an asymmetric scenario. Plugging minimizes the delay induced by already in-flight I/O, and enables Q to recover the bandwidth it may lose because of this delay. Yet the above commit does not cover the case of weight-raised queues, for efficiency concerns. For weight-raised queues, I/O-dispatch plugging is activated simply if not all bfq_queues are weight-raised. But this check does not handle the case of in-flight requests, because a bfq_queue may become non busy *before* all its in-flight requests are completed. This commit performs I/O-dispatch plugging for weight-raised queues if there are some in-flight requests. As a practical example of the resulting recover of control, under write load on a Samsung SSD 970 PRO, gnome-terminal starts in 1.5 seconds after this fix, against 15 seconds before the fix (as a reference, gnome-terminal takes about 35 seconds to start with any of the other I/O schedulers). Fixes: 1de0c4cd9ea6 ("block, bfq: reduce idling only in symmetric scenarios") Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-07-15docs: block: convert to ReSTMauro Carvalho Chehab
Rename the block documentation files to ReST, add an index for them and adjust in order to produce a nice html output via the Sphinx build system. At its new index.rst, let's add a :orphan: while this is not linked to the main index.rst file, in order to avoid build warnings. Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
2019-07-09Merge tag 'for-5.3/block-20190708' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block updates from Jens Axboe: "This is the main block updates for 5.3. Nothing earth shattering or major in here, just fixes, additions, and improvements all over the map. This contains: - Series of documentation fixes (Bart) - Optimization of the blk-mq ctx get/put (Bart) - null_blk removal race condition fix (Bob) - req/bio_op() cleanups (Chaitanya) - Series cleaning up the segment accounting, and request/bio mapping (Christoph) - Series cleaning up the page getting/putting for bios (Christoph) - block cgroup cleanups and moving it to where it is used (Christoph) - block cgroup fixes (Tejun) - Series of fixes and improvements to bcache, most notably a write deadlock fix (Coly) - blk-iolatency STS_AGAIN and accounting fixes (Dennis) - Series of improvements and fixes to BFQ (Douglas, Paolo) - debugfs_create() return value check removal for drbd (Greg) - Use struct_size(), where appropriate (Gustavo) - Two lighnvm fixes (Heiner, Geert) - MD fixes, including a read balance and corruption fix (Guoqing, Marcos, Xiao, Yufen) - block opal shadow mbr additions (Jonas, Revanth) - sbitmap compare-and-exhange improvemnts (Pavel) - Fix for potential bio->bi_size overflow (Ming) - NVMe pull requests: - improved PCIe suspent support (Keith Busch) - error injection support for the admin queue (Akinobu Mita) - Fibre Channel discovery improvements (James Smart) - tracing improvements including nvmetc tracing support (Minwoo Im) - misc fixes and cleanups (Anton Eidelman, Minwoo Im, Chaitanya Kulkarni)" - Various little fixes and improvements to drivers and core" * tag 'for-5.3/block-20190708' of git://git.kernel.dk/linux-block: (153 commits) blk-iolatency: fix STS_AGAIN handling block: nr_phys_segments needs to be zero for REQ_OP_WRITE_ZEROES blk-mq: simplify blk_mq_make_request() blk-mq: remove blk_mq_put_ctx() sbitmap: Replace cmpxchg with xchg block: fix .bi_size overflow block: sed-opal: check size of shadow mbr block: sed-opal: ioctl for writing to shadow mbr block: sed-opal: add ioctl for done-mark of shadow mbr block: never take page references for ITER_BVEC direct-io: use bio_release_pages in dio_bio_complete block_dev: use bio_release_pages in bio_unmap_user block_dev: use bio_release_pages in blkdev_bio_end_io iomap: use bio_release_pages in iomap_dio_bio_end_io block: use bio_release_pages in bio_map_user_iov block: use bio_release_pages in bio_unmap_user block: optionally mark pages dirty in bio_release_pages block: move the BIO_NO_PAGE_REF check into bio_release_pages block: skd_main.c: Remove call to memset after dma_alloc_coherent block: mtip32xx: Remove call to memset after dma_alloc_coherent ...
2019-06-28block, bfq: NULL out the bic when it's no longer validDouglas Anderson
In reboot tests on several devices we were seeing a "use after free" when slub_debug or KASAN was enabled. The kernel complained about: Unable to handle kernel paging request at virtual address 6b6b6c2b ...which is a classic sign of use after free under slub_debug. The stack crawl in kgdb looked like: 0 test_bit (addr=<optimized out>, nr=<optimized out>) 1 bfq_bfqq_busy (bfqq=<optimized out>) 2 bfq_select_queue (bfqd=<optimized out>) 3 __bfq_dispatch_request (hctx=<optimized out>) 4 bfq_dispatch_request (hctx=<optimized out>) 5 0xc056ef00 in blk_mq_do_dispatch_sched (hctx=0xed249440) 6 0xc056f728 in blk_mq_sched_dispatch_requests (hctx=0xed249440) 7 0xc0568d24 in __blk_mq_run_hw_queue (hctx=0xed249440) 8 0xc0568d94 in blk_mq_run_work_fn (work=<optimized out>) 9 0xc024c5c4 in process_one_work (worker=0xec6d4640, work=0xed249480) 10 0xc024cff4 in worker_thread (__worker=0xec6d4640) Digging in kgdb, it could be found that, though bfqq looked fine, bfqq->bic had been freed. Through further digging, I postulated that perhaps it is illegal to access a "bic" (AKA an "icq") after bfq_exit_icq() had been called because the "bic" can be freed at some point in time after this call is made. I confirmed that there certainly were cases where the exact crashing code path would access the "bic" after bfq_exit_icq() had been called. Sspecifically I set the "bfqq->bic" to (void *)0x7 and saw that the bic was 0x7 at the time of the crash. To understand a bit more about why this crash was fairly uncommon (I saw it only once in a few hundred reboots), you can see that much of the time bfq_exit_icq_fbqq() fully frees the bfqq and thus it can't access the ->bic anymore. The only case it doesn't is if bfq_put_queue() sees a reference still held. However, even in the case when bfqq isn't freed, the crash is still rare. Why? I tracked what happened to the "bic" after the exit routine. It doesn't get freed right away. Rather, put_io_context_active() eventually called put_io_context() which queued up freeing on a workqueue. The freeing then actually happened later than that through call_rcu(). Despite all these delays, some extra debugging showed that all the hoops could be jumped through in time and the memory could be freed causing the original crash. Phew! To make a long story short, assuming it truly is illegal to access an icq after the "exit_icq" callback is finished, this patch is needed. Cc: stable@vger.kernel.org Reviewed-by: Paolo Valente <paolo.valente@unimore.it> Signed-off-by: Douglas Anderson <dianders@chromium.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-26block, bfq: Init saved_wr_start_at_switch_to_srt in unlikely caseDouglas Anderson
Some debug code suggested by Paolo was tripping when I did reboot stress tests. Specifically in bfq_bfqq_resume_state() "bic->saved_wr_start_at_switch_to_srt" was later than the current value of "jiffies". A bit of debugging showed that "bic->saved_wr_start_at_switch_to_srt" was actually 0 and a bit more debugging showed that was because we had run through the "unlikely" case in the bfq_bfqq_save_state() function. Let's init "saved_wr_start_at_switch_to_srt" in the unlikely case to something sane. NOTE: this fixes no known real-world errors. Reviewed-by: Paolo Valente <paolo.valente@linaro.org> Reviewed-by: Guenter Roeck <groeck@chromium.org> Signed-off-by: Douglas Anderson <dianders@chromium.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-25block, bfq: fix operator in BFQQ_TOTALLY_SEEKYPaolo Valente
By mistake, there is a '&' instead of a '==' in the definition of the macro BFQQ_TOTALLY_SEEKY. This commit replaces the wrong operator with the correct one. Fixes: 7074f076ff15 ("block, bfq: do not tag totally seeky queues as soft rt") Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-25block, bfq: re-schedule empty queues if they deserve I/O pluggingPaolo Valente
Consider, on one side, a bfq_queue Q that remains empty while in service, and, on the other side, the pending I/O of bfq_queues that, according to their timestamps, have to be served after Q. If an uncontrolled amount of I/O from the latter bfq_queues were dispatched while Q is waiting for its new I/O to arrive, then Q's bandwidth guarantees would be violated. To prevent this, I/O dispatch is plugged until Q receives new I/O (except for a properly controlled amount of injected I/O). Unfortunately, preemption breaks I/O-dispatch plugging, for the following reason. Preemption is performed in two steps. First, Q is expired and re-scheduled. Second, the new bfq_queue to serve is chosen. The first step is needed by the second, as the second can be performed only after Q's timestamps have been properly updated (done in the expiration step), and Q has been re-queued for service. This dependency is a consequence of the way how BFQ's scheduling algorithm is currently implemented. But Q is not re-scheduled at all in the first step, because Q is empty. As a consequence, an uncontrolled amount of I/O may be dispatched until Q becomes non empty again. This breaks Q's service guarantees. This commit addresses this issue by re-scheduling Q even if it is empty. This in turn breaks the assumption that all scheduled queues are non empty. Then a few extra checks are now needed. Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-25block, bfq: preempt lower-weight or lower-priority queuesPaolo Valente
BFQ enqueues the I/O coming from each process into a separate bfq_queue, and serves bfq_queues one at a time. Each bfq_queue may be served for at most timeout_sync milliseconds (default: 125 ms). This service scheme is prone to the following inaccuracy. While a bfq_queue Q1 is in service, some empty bfq_queue Q2 may receive I/O, and, according to BFQ's scheduling policy, may become the right bfq_queue to serve, in place of the currently in-service bfq_queue. In this respect, postponing the service of Q2 to after the service of Q1 finishes may delay the completion of Q2's I/O, compared with an ideal service in which all non-empty bfq_queues are served in parallel, and every non-empty bfq_queue is served at a rate proportional to the bfq_queue's weight. This additional delay is equal at most to the time Q1 may unjustly remain in service before switching to Q2. If Q1 and Q2 have the same weight, then this time is most likely negligible compared with the completion time to be guaranteed to Q2's I/O. In addition, first, one of the reasons why BFQ may want to serve Q1 for a while is that this boosts throughput and, second, serving Q1 longer reduces BFQ's overhead. As a conclusion, it is usually better not to preempt Q1 if both Q1 and Q2 have the same weight. In contrast, as Q2's weight or priority becomes higher and higher compared with that of Q1, the above delay becomes larger and larger, compared with the I/O completion times that have to be guaranteed to Q2 according to Q2's weight. So reducing this delay may be more important than avoiding the costs of preempting Q1. Accordingly, this commit preempts Q1 if Q2 has a higher weight or a higher priority than Q1. Preemption causes Q1 to be re-scheduled, and triggers a new choice of the next bfq_queue to serve. If Q2 really is the next bfq_queue to serve, then Q2 will be set in service immediately. This change reduces the component of the I/O latency caused by the above delay by about 80%. For example, on an (old) PLEXTOR PX-256M5 SSD, the maximum latency reported by fio drops from 15.1 to 3.2 ms for a process doing sporadic random reads while another process is doing continuous sequential reads. Signed-off-by: Nicola Bottura <bottura.nicola95@gmail.com> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-25block, bfq: detect wakers and unconditionally inject their I/OPaolo Valente
A bfq_queue Q may happen to be synchronized with another bfq_queue Q2, i.e., the I/O of Q2 may need to be completed for Q to receive new I/O. We call Q2 "waker queue". If I/O plugging is being performed for Q, and Q is not receiving any more I/O because of the above synchronization, then, thanks to BFQ's injection mechanism, the waker queue is likely to get served before the I/O-plugging timeout fires. Unfortunately, this fact may not be sufficient to guarantee a high throughput during the I/O plugging, because the inject limit for Q may be too low to guarantee a lot of injected I/O. In addition, the duration of the plugging, i.e., the time before Q finally receives new I/O, may not be minimized, because the waker queue may happen to be served only after other queues. To address these issues, this commit introduces the explicit detection of the waker queue, and the unconditional injection of a pending I/O request of the waker queue on each invocation of bfq_dispatch_request(). One may be concerned that this systematic injection of I/O from the waker queue delays the service of Q's I/O. Fortunately, it doesn't. On the contrary, next Q's I/O is brought forward dramatically, for it is not blocked for milliseconds. Reported-by: Srivatsa S. Bhat (VMware) <srivatsa@csail.mit.edu> Tested-by: Srivatsa S. Bhat (VMware) <srivatsa@csail.mit.edu> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-25block, bfq: bring forward seek&think time updatePaolo Valente
Until the base value for request service times gets finally computed for a bfq_queue, the inject limit for that queue does depend on the think-time state (short|long) of the queue. A timely update of the think time then guarantees a quicker activation or deactivation of the injection. Fortunately, the think time of a bfq_queue is updated in the same code path as the inject limit; but after the inject limit. This commits moves the update of the think time before the update of the inject limit. For coherence, it moves the update of the seek time too. Reported-by: Srivatsa S. Bhat (VMware) <srivatsa@csail.mit.edu> Tested-by: Srivatsa S. Bhat (VMware) <srivatsa@csail.mit.edu> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-25block, bfq: update base request service times when possiblePaolo Valente
I/O injection gets reduced if it increases the request service times of the victim queue beyond a certain threshold. The threshold, in its turn, is computed as a function of the base service time enjoyed by the queue when it undergoes no injection. As a consequence, for injection to work properly, the above base value has to be accurate. In this respect, such a value may vary over time. For example, it varies if the size or the spatial locality of the I/O requests in the queue change. It is then important to update this value whenever possible. This commit performs this update. Reported-by: Srivatsa S. Bhat (VMware) <srivatsa@csail.mit.edu> Tested-by: Srivatsa S. Bhat (VMware) <srivatsa@csail.mit.edu> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-25block, bfq: fix rq_in_driver check in bfq_update_inject_limitPaolo Valente
One of the cases where the parameters for injection may be updated is when there are no more in-flight I/O requests. The number of in-flight requests is stored in the field bfqd->rq_in_driver of the descriptor bfqd of the device. So, the controlled condition is bfqd->rq_in_driver == 0. Unfortunately, this is wrong because, the instruction that checks this condition is in the code path that handles the completion of a request, and, in particular, the instruction is executed before bfqd->rq_in_driver is decremented in such a code path. This commit fixes this issue by just replacing 0 with 1 in the comparison. Reported-by: Srivatsa S. Bhat (VMware) <srivatsa@csail.mit.edu> Tested-by: Srivatsa S. Bhat (VMware) <srivatsa@csail.mit.edu> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-25block, bfq: reset inject limit when think-time state changesPaolo Valente
Until the base value of the request service times gets finally computed for a bfq_queue, the inject limit does depend on the think-time state (short|long). The limit must be 0 or 1 if the think time is deemed, respectively, as short or long. However, such a check and possible limit update is performed only periodically, once per second. So, to make the injection mechanism much more reactive, this commit performs the update also every time the think-time state changes. In addition, in the following special case, this commit lets the inject limit of a bfq_queue bfqq remain equal to 1 even if bfqq's think time is short: bfqq's I/O is synchronized with that of some other queue, i.e., bfqq may receive new I/O only after the I/O of the other queue is completed. Keeping the inject limit to 1 allows the blocking I/O to be served while bfqq is in service. And this is very convenient both for bfqq and for the total throughput, as explained in detail in the comments in bfq_update_has_short_ttime(). Reported-by: Srivatsa S. Bhat (VMware) <srivatsa@csail.mit.edu> Tested-by: Srivatsa S. Bhat (VMware) <srivatsa@csail.mit.edu> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-20block: rename CONFIG_DEBUG_BLK_CGROUP to CONFIG_BFQ_CGROUP_DEBUGChristoph Hellwig
This option is entirely bfq specific, give it an appropinquate name. Also make it depend on CONFIG_BFQ_GROUP_IOSCHED in Kconfig, as all the functionality already does so anyway. Acked-by: Tejun Heo <tj@kernel.org> Acked-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-20block: remove the bi_phys_segments field in struct bioChristoph Hellwig
We only need the number of segments in the blk-mq submission path. Remove the field from struct bio, and return it from a variant of blk_queue_split instead of that it can passed as an argument to those functions that need the value. This also means we stop recounting segments except for cloning and partial segments. To keep the number of arguments in this how path down remove pointless struct request_queue arguments from any of the functions that had it and grew a nr_segs argument. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-30block: switch all files cleared marked as GPLv2 or later to SPDX tagsChristoph Hellwig
All these files have some form of the usual GPLv2 or later boilerplate. Switch them to use SPDX tags instead. Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-22Merge tag 'v5.1-rc6' into for-5.2/blockJens Axboe
Pull in v5.1-rc6 to resolve two conflicts. One is in BFQ, in just a comment, and is trivial. The other one is a conflict due to a later fix in the bio multi-page work, and needs a bit more care. * tag 'v5.1-rc6': (770 commits) Linux 5.1-rc6 block: make sure that bvec length can't be overflow block: kill all_q_node in request_queue x86/cpu/intel: Lower the "ENERGY_PERF_BIAS: Set to normal" message's log priority coredump: fix race condition between mmget_not_zero()/get_task_mm() and core dumping mm/kmemleak.c: fix unused-function warning init: initialize jump labels before command line option parsing kernel/watchdog_hld.c: hard lockup message should end with a newline kcov: improve CONFIG_ARCH_HAS_KCOV help text mm: fix inactive list balancing between NUMA nodes and cgroups mm/hotplug: treat CMA pages as unmovable proc: fixup proc-pid-vm test proc: fix map_files test on F29 mm/vmstat.c: fix /proc/vmstat format for CONFIG_DEBUG_TLBFLUSH=y CONFIG_SMP=n mm/memory_hotplug: do not unlock after failing to take the device_hotplug_lock mm: swapoff: shmem_unuse() stop eviction without igrab() mm: swapoff: take notice of completion sooner mm: swapoff: remove too limiting SWAP_UNUSE_MAX_TRIES mm: swapoff: shmem_find_swap_entries() filter out other types slab: store tagged freelist for off-slab slabmgmt ... Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-13bfq: update internal depth state when queue depth changesJens Axboe
A previous commit moved the shallow depth and BFQ depth map calculations to be done at init time, moving it outside of the hotter IO path. This potentially causes hangs if the users changes the depth of the scheduler map, by writing to the 'nr_requests' sysfs file for that device. Add a blk-mq-sched hook that allows blk-mq to inform the scheduler if the depth changes, so that the scheduler can update its internal state. Tested-by: Kai Krakow <kai@kaishome.de> Reported-by: Paolo Valente <paolo.valente@linaro.org> Fixes: f0635b8a416e ("bfq: calculate shallow depths at init time") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-10block, bfq: fix use after free in bfq_bfqq_expirePaolo Valente
The function bfq_bfqq_expire() invokes the function __bfq_bfqq_expire(), and the latter may free the in-service bfq-queue. If this happens, then no other instruction of bfq_bfqq_expire() must be executed, or a use-after-free will occur. Basing on the assumption that __bfq_bfqq_expire() invokes bfq_put_queue() on the in-service bfq-queue exactly once, the queue is assumed to be freed if its refcounter is equal to one right before invoking __bfq_bfqq_expire(). But, since commit 9dee8b3b057e ("block, bfq: fix queue removal from weights tree") this assumption is false. __bfq_bfqq_expire() may also invoke bfq_weights_tree_remove() and, since commit 9dee8b3b057e ("block, bfq: fix queue removal from weights tree"), also the latter function may invoke bfq_put_queue(). So __bfq_bfqq_expire() may invoke bfq_put_queue() twice, and this is the actual case where the in-service queue may happen to be freed. To address this issue, this commit moves the check on the refcounter of the queue right around the last bfq_put_queue() that may be invoked on the queue. Fixes: 9dee8b3b057e ("block, bfq: fix queue removal from weights tree") Reported-by: Dmitrii Tcvetkov <demfloro@demfloro.ru> Reported-by: Douglas Anderson <dianders@chromium.org> Tested-by: Dmitrii Tcvetkov <demfloro@demfloro.ru> Tested-by: Douglas Anderson <dianders@chromium.org> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-08block, bfq: fix some typos in commentsAngelo Ruocco
Some of the comments in the bfq files had typos. This patch fixes them. Signed-off-by: Angelo Ruocco <angeloruocco90@gmail.com> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-01block, bfq: save & resume weight on a queue merge/splitFrancesco Pollicino
bfq saves the state of a queue each time a merge occurs, to be able to resume such a state when the queue is associated again with its original process, on a split. Unfortunately bfq does not save & restore also the weight of the queue. If the weight is not correctly resumed when the queue is recycled, then the weight of the recycled queue could differ from the weight of the original queue. This commit adds the missing save & resume of the weight. Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com> Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Signed-off-by: Francesco Pollicino <fra.fra.800@gmail.com> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-01block, bfq: print SHARED instead of pid for shared queues in logsFrancesco Pollicino
The function "bfq_log_bfqq" prints the pid of the process associated with the queue passed as input. Unfortunately, if the queue is shared, then more than one process is associated with the queue. The pid that gets printed in this case is the pid of one of the associated processes. Which process gets printed depends on the exact sequence of merge events the queue underwent. So printing such a pid is rather useless and above all is often rather confusing because it reports a random pid between those of the associated processes. This commit addresses this issue by printing SHARED instead of a pid if the queue is shared. Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com> Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Signed-off-by: Francesco Pollicino <fra.fra.800@gmail.com> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-01block, bfq: always protect newly-created queues from existing active queuesPaolo Valente
If many bfq_queues belonging to the same group happen to be created shortly after each other, then the processes associated with these queues have typically a common goal. In particular, bursts of queue creations are usually caused by services or applications that spawn many parallel threads/processes. Examples are systemd during boot, or git grep. If there are no other active queues, then, to help these processes get their job done as soon as possible, the best thing to do is to reach a high throughput. To this goal, it is usually better to not grant either weight-raising or device idling to the queues associated with these processes. And this is exactly what BFQ currently does. There is however a drawback: if, in contrast, some other queues are already active, then the newly created queues must be protected from the I/O flowing through the already existing queues. In this case, the best thing to do is the opposite as in the other case: it is much better to grant weight-raising and device idling to the newly-created queues, if they deserve it. This commit addresses this issue by doing so if there are already other active queues. This change also helps eliminating false positives, which occur when the newly-created queues do not belong to an actual large burst of creations, but some background task (e.g., a service) happens to trigger the creation of new queues in the middle, i.e., very close to when the victim queues are created. These false positive may cause total loss of control on process latencies. Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com> Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-01block, bfq: do not tag totally seeky queues as soft rtPaolo Valente
Sync random I/O is likely to be confused with soft real-time I/O, because it is characterized by limited throughput and apparently isochronous arrival pattern. To avoid false positives, this commits prevents bfq_queues containing only random (seeky) I/O from being tagged as soft real-time. Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com> Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-01block, bfq: do not merge queues on flash storage with queueingPaolo Valente
To boost throughput with a set of processes doing interleaved I/O (i.e., a set of processes whose individual I/O is random, but whose merged cumulative I/O is sequential), BFQ merges the queues associated with these processes, i.e., redirects the I/O of these processes into a common, shared queue. In the shared queue, I/O requests are ordered by their position on the medium, thus sequential I/O gets dispatched to the device when the shared queue is served. Queue merging costs execution time, because, to detect which queues to merge, BFQ must maintain a list of the head I/O requests of active queues, ordered by request positions. Measurements showed that this costs about 10% of BFQ's total per-request processing time. Request processing time becomes more and more critical as the speed of the underlying storage device grows. Yet, fortunately, queue merging is basically useless on the very devices that are so fast to make request processing time critical. To reach a high throughput, these devices must have many requests queued at the same time. But, in this configuration, the internal scheduling algorithms of these devices do also the job of queue merging: they reorder requests so as to obtain as much as possible a sequential I/O pattern. As a consequence, with processes doing interleaved I/O, the throughput reached by one such device is likely to be the same, with and without queue merging. In view of this fact, this commit disables queue merging, and all related housekeeping, for non-rotational devices with internal queueing. The total, single-lock-protected, per-request processing time of BFQ drops to, e.g., 1.9 us on an Intel Core i7-2760QM@2.40GHz (time measured with simple code instrumentation, and using the throughput-sync.sh script of the S suite [1], in performance-profiling mode). To put this result into context, the total, single-lock-protected, per-request execution time of the lightest I/O scheduler available in blk-mq, mq-deadline, is 0.7 us (mq-deadline is ~800 LOC, against ~10500 LOC for BFQ). Disabling merging provides a further, remarkable benefit in terms of throughput. Merging tends to make many workloads artificially more uneven, mainly because of shared queues remaining non empty for incomparably more time than normal queues. So, if, e.g., one of the queues in a set of merged queues has a higher weight than a normal queue, then the shared queue may inherit such a high weight and, by staying almost always active, may force BFQ to perform I/O plugging most of the time. This evidently makes it harder for BFQ to let the device reach a high throughput. As a practical example of this problem, and of the benefits of this commit, we measured again the throughput in the nasty scenario considered in previous commit messages: dbench test (in the Phoronix suite), with 6 clients, on a filesystem with journaling, and with the journaling daemon enjoying a higher weight than normal processes. With this commit, the throughput grows from ~150 MB/s to ~200 MB/s on a PLEXTOR PX-256M5 SSD. This is the same peak throughput reached by any of the other I/O schedulers. As such, this is also likely to be the maximum possible throughput reachable with this workload on this device, because I/O is mostly random, and the other schedulers basically just pass I/O requests to the drive as fast as possible. [1] https://github.com/Algodev-github/S Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com> Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Tested-by: Francesco Pollicino <fra.fra.800@gmail.com> Signed-off-by: Alessio Masola <alessio.masola@gmail.com> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-01block, bfq: tune service injection basing on request service timesPaolo Valente
The processes associated with a bfq_queue, say Q, may happen to generate their cumulative I/O at a lower rate than the rate at which the device could serve the same I/O. This is rather probable, e.g., if only one process is associated with Q and the device is an SSD. It results in Q becoming often empty while in service. If BFQ is not allowed to switch to another queue when Q becomes empty, then, during the service of Q, there will be frequent "service holes", i.e., time intervals during which Q gets empty and the device can only consume the I/O already queued in its hardware queues. This easily causes considerable losses of throughput. To counter this problem, BFQ implements a request injection mechanism, which tries to fill the above service holes with I/O requests taken from other bfq_queues. The hard part in this mechanism is finding the right amount of I/O to inject, so as to both boost throughput and not break Q's bandwidth and latency guarantees. To this goal, the current version of this mechanism measures the bandwidth enjoyed by Q while it is being served, and tries to inject the maximum possible amount of extra service that does not cause Q's bandwidth to decrease too much. This solution has an important shortcoming. For bandwidth measurements to be stable and reliable, Q must remain in service for a much longer time than that needed to serve a single I/O request. Unfortunately, this does not hold with many workloads. This commit addresses this issue by changing the way the amount of injection allowed is dynamically computed. It tunes injection as a function of the service times of single I/O requests of Q, instead of Q's bandwidth. Single-request service times are evidently meaningful even if Q gets very few I/O requests completed while it is in service. As a testbed for this new solution, we measured the throughput reached by BFQ for one of the nastiest workloads and configurations for this scheduler: the workload generated by the dbench test (in the Phoronix suite), with 6 clients, on a filesystem with journaling, and with the journaling daemon enjoying a higher weight than normal processes. With this commit, the throughput grows from ~100 MB/s to ~150 MB/s on a PLEXTOR PX-256M5. Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com> Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Tested-by: Francesco Pollicino <fra.fra.800@gmail.com> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-01block, bfq: do not idle for lowest-weight queuesPaolo Valente
In most cases, it is detrimental for throughput to plug I/O dispatch when the in-service bfq_queue becomes temporarily empty (plugging is performed to wait for the possible arrival, soon, of new I/O from the in-service queue). There is however a case where plugging is needed for service guarantees. If a bfq_queue, say Q, has a higher weight than some other active bfq_queue, and is sync, i.e., contains sync I/O, then, to guarantee that Q does receive a higher share of the throughput than other lower-weight queues, it is necessary to plug I/O dispatch when Q remains temporarily empty while being served. For this reason, BFQ performs I/O plugging when some active bfq_queue has a higher weight than some other active bfq_queue. But this is unnecessarily overkill. In fact, if the in-service bfq_queue actually has a weight lower than or equal to the other queues, then the queue *must not* be guaranteed a higher share of the throughput than the other queues. So, not plugging I/O cannot cause any harm to the queue. And can boost throughput. Taking advantage of this fact, this commit does not plug I/O for sync bfq_queues with a weight lower than or equal to the weights of the other queues. Here is an example of the resulting throughput boost with the dbench workload, which is particularly nasty for BFQ. With the dbench test in the Phoronix suite, BFQ reaches its lowest total throughput with 6 clients on a filesystem with journaling, in case the journaling daemon has a higher weight than normal processes. Before this commit, the total throughput was ~80 MB/sec on a PLEXTOR PX-256M5, after this commit it is ~100 MB/sec. Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com> Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-01block, bfq: increase idling for weight-raised queuesPaolo Valente
If a sync bfq_queue has a higher weight than some other queue, and remains temporarily empty while in service, then, to preserve the bandwidth share of the queue, it is necessary to plug I/O dispatching until a new request arrives for the queue. In addition, a timeout needs to be set, to avoid waiting for ever if the process associated with the queue has actually finished its I/O. Even with the above timeout, the device is however not fed with new I/O for a while, if the process has finished its I/O. If this happens often, then throughput drops and latencies grow. For this reason, the timeout is kept rather low: 8 ms is the current default. Unfortunately, such a low value may cause, on the opposite end, a violation of bandwidth guarantees for a process that happens to issue new I/O too late. The higher the system load, the higher the probability that this happens to some process. This is a problem in scenarios where service guarantees matter more than throughput. One important case are weight-raised queues, which need to be granted a very high fraction of the bandwidth. To address this issue, this commit lower-bounds the plugging timeout for weight-raised queues to 20 ms. This simple change provides relevant benefits. For example, on a PLEXTOR PX-256M5S, with which gnome-terminal starts in 0.6 seconds if there is no other I/O in progress, the same applications starts in - 0.8 seconds, instead of 1.2 seconds, if ten files are being read sequentially in parallel - 1 second, instead of 2 seconds, if, in parallel, five files are being read sequentially, and five more files are being written sequentially Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com> Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>