summaryrefslogtreecommitdiffstats
path: root/block
AgeCommit message (Collapse)Author
2020-01-12block: fix memleak when __blk_rq_map_user_iov() is failedYang Yingliang
[ Upstream commit 3b7995a98ad76da5597b488fa84aa5a56d43b608 ] When I doing fuzzy test, get the memleak report: BUG: memory leak unreferenced object 0xffff88837af80000 (size 4096): comm "memleak", pid 3557, jiffies 4294817681 (age 112.499s) hex dump (first 32 bytes): 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 20 00 00 00 10 01 00 00 00 00 00 00 01 00 00 00 ............... backtrace: [<000000001c894df8>] bio_alloc_bioset+0x393/0x590 [<000000008b139a3c>] bio_copy_user_iov+0x300/0xcd0 [<00000000a998bd8c>] blk_rq_map_user_iov+0x2f1/0x5f0 [<000000005ceb7f05>] blk_rq_map_user+0xf2/0x160 [<000000006454da92>] sg_common_write.isra.21+0x1094/0x1870 [<00000000064bb208>] sg_write.part.25+0x5d9/0x950 [<000000004fc670f6>] sg_write+0x5f/0x8c [<00000000b0d05c7b>] __vfs_write+0x7c/0x100 [<000000008e177714>] vfs_write+0x1c3/0x500 [<0000000087d23f34>] ksys_write+0xf9/0x200 [<000000002c8dbc9d>] do_syscall_64+0x9f/0x4f0 [<00000000678d8e9a>] entry_SYSCALL_64_after_hwframe+0x49/0xbe If __blk_rq_map_user_iov() is failed in blk_rq_map_user_iov(), the bio(s) which is allocated before this failing will leak. The refcount of the bio(s) is init to 1 and increased to 2 by calling bio_get(), but __blk_rq_unmap_user() only decrease it to 1, so the bio cannot be freed. Fix it by calling blk_rq_unmap_user(). Reviewed-by: Bob Liu <bob.liu@oracle.com> Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-01-09compat_ioctl: block: handle BLKREPORTZONE/BLKRESETZONEArnd Bergmann
commit 673bdf8ce0a387ef585c13b69a2676096c6edfe9 upstream. These were added to blkdev_ioctl() but not blkdev_compat_ioctl, so add them now. Cc: <stable@vger.kernel.org> # v4.10+ Fixes: 3ed05a987e0f ("blk-zoned: implement ioctls") Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-01-09compat_ioctl: block: handle Persistent ReservationsArnd Bergmann
commit b2c0fcd28772f99236d261509bcd242135677965 upstream. These were added to blkdev_ioctl() in linux-5.5 but not blkdev_compat_ioctl, so add them now. Cc: <stable@vger.kernel.org> # v4.4+ Fixes: bbd3e064362e ("block: add an API for Persistent Reservations") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Fold in followup patch from Arnd with missing pr.h header include. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-12-17blk-mq: make sure that line break can be printedMing Lei
commit d2c9be89f8ebe7ebcc97676ac40f8dec1cf9b43a upstream. 8962842ca5ab ("blk-mq: avoid sysfs buffer overflow with too many CPU cores") avoids sysfs buffer overflow, and reserves one character for line break. However, the last snprintf() doesn't get correct 'size' parameter passed in, so fixed it. Fixes: 8962842ca5ab ("blk-mq: avoid sysfs buffer overflow with too many CPU cores") Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Cc: Nobuhiro Iwamatsu <nobuhiro1.iwamatsu@toshiba.co.jp> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-12-17block: fix single range discard mergeMing Lei
commit 2a5cf35cd6c56b2924bce103413ad3381bdc31fa upstream. There are actually two kinds of discard merge: - one is the normal discard merge, just like normal read/write request, and call it single-range discard - another is the multi-range discard, queue_max_discard_segments(rq->q) > 1 For the former case, queue_max_discard_segments(rq->q) is 1, and we should handle this kind of discard merge like the normal read/write request. This patch fixes the following kernel panic issue[1], which is caused by not removing the single-range discard request from elevator queue. Guangwu has one raid discard test case, in which this issue is a bit easier to trigger, and I verified that this patch can fix the kernel panic issue in Guangwu's test case. [1] kernel panic log from Jens's report BUG: unable to handle kernel NULL pointer dereference at 0000000000000148 PGD 0 P4D 0. Oops: 0000 [#1] SMP PTI CPU: 37 PID: 763 Comm: kworker/37:1H Not tainted \ 4.20.0-rc3-00649-ge64d9a554a91-dirty #14 Hardware name: Wiwynn \ Leopard-Orv2/Leopard-DDR BW, BIOS LBM08 03/03/2017 Workqueue: kblockd \ blk_mq_run_work_fn RIP: \ 0010:blk_mq_get_driver_tag+0x81/0x120 Code: 24 \ 10 48 89 7c 24 20 74 21 83 fa ff 0f 95 c0 48 8b 4c 24 28 65 48 33 0c 25 28 00 00 00 \ 0f 85 96 00 00 00 48 83 c4 30 5b 5d c3 <48> 8b 87 48 01 00 00 8b 40 04 39 43 20 72 37 \ f6 87 b0 00 00 00 02 RSP: 0018:ffffc90004aabd30 EFLAGS: 00010246 \ RAX: 0000000000000003 RBX: ffff888465ea1300 RCX: ffffc90004aabde8 RDX: 00000000ffffffff RSI: ffffc90004aabde8 RDI: 0000000000000000 RBP: 0000000000000000 R08: ffff888465ea1348 R09: 0000000000000000 R10: 0000000000001000 R11: 00000000ffffffff R12: ffff888465ea1300 R13: 0000000000000000 R14: ffff888465ea1348 R15: ffff888465d10000 FS: 0000000000000000(0000) GS:ffff88846f9c0000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000148 CR3: 000000000220a003 CR4: 00000000003606e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: blk_mq_dispatch_rq_list+0xec/0x480 ? elv_rb_del+0x11/0x30 blk_mq_do_dispatch_sched+0x6e/0xf0 blk_mq_sched_dispatch_requests+0xfa/0x170 __blk_mq_run_hw_queue+0x5f/0xe0 process_one_work+0x154/0x350 worker_thread+0x46/0x3c0 kthread+0xf5/0x130 ? process_one_work+0x350/0x350 ? kthread_destroy_worker+0x50/0x50 ret_from_fork+0x1f/0x30 Modules linked in: sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel \ kvm switchtec irqbypass iTCO_wdt iTCO_vendor_support efivars cdc_ether usbnet mii \ cdc_acm i2c_i801 lpc_ich mfd_core ipmi_si ipmi_devintf ipmi_msghandler acpi_cpufreq \ button sch_fq_codel nfsd nfs_acl lockd grace auth_rpcgss oid_registry sunrpc nvme \ nvme_core fuse sg loop efivarfs autofs4 CR2: 0000000000000148 \ ---[ end trace 340a1fb996df1b9b ]--- RIP: 0010:blk_mq_get_driver_tag+0x81/0x120 Code: 24 10 48 89 7c 24 20 74 21 83 fa ff 0f 95 c0 48 8b 4c 24 28 65 48 33 0c 25 28 \ 00 00 00 0f 85 96 00 00 00 48 83 c4 30 5b 5d c3 <48> 8b 87 48 01 00 00 8b 40 04 39 43 \ 20 72 37 f6 87 b0 00 00 00 02 Fixes: 445251d0f4d329a ("blk-mq: fix discard merge with scheduler attached") Reported-by: Jens Axboe <axboe@kernel.dk> Cc: Guangwu Zhang <guazhang@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Jianchao Wang <jianchao.w.wang@oracle.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Cc: Andre Tomt <andre@tomt.net> Cc: Jack Wang <jack.wang.usish@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-12-17blk-mq: avoid sysfs buffer overflow with too many CPU coresMing Lei
commit 8962842ca5abdcf98e22ab3b2b45a103f0408b95 upstream. It is reported that sysfs buffer overflow can be triggered if the system has too many CPU cores(>841 on 4K PAGE_SIZE) when showing CPUs of hctx via /sys/block/$DEV/mq/$N/cpu_list. Use snprintf to avoid the potential buffer overflow. This version doesn't change the attribute format, and simply stops showing CPU numbers if the buffer is going to overflow. Cc: stable@vger.kernel.org Fixes: 676141e48af7("blk-mq: don't dump CPU -> hw queue map on driver load") Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-12-01block: call rq_qos_exit() after queue is frozenMing Lei
[ Upstream commit c57cdf7a9e51d97a43e29b8f4a04157875104000 ] rq_qos_exit() removes the current q->rq_qos, this action has to be done after queue is frozen, otherwise the IO queue path may never be waken up, then IO hang is caused. So fixes this issue by moving rq_qos_exit() after queue is frozen. Cc: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-12-01block: fix the DISCARD request mergeJianchao Wang
[ Upstream commit 69840466086d2248898020a08dda52732686c4e6 ] There are two cases when handle DISCARD merge. If max_discard_segments == 1, the bios/requests need to be contiguous to merge. If max_discard_segments > 1, it takes every bio as a range and different range needn't to be contiguous. But now, attempt_merge screws this up. It always consider contiguity for DISCARD for the case max_discard_segments > 1 and cannot merge contiguous DISCARD for the case max_discard_segments == 1, because rq_attempt_discard_merge always returns false in this case. This patch fixes both of the two cases above. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-11-20blok, bfq: do not plug I/O if all queues are weight-raisedPaolo Valente
[ Upstream commit c8765de0adfcaaf4ffb2d951e07444f00ffa9453 ] To reduce latency for interactive and soft real-time applications, bfq privileges the bfq_queues containing the I/O of these applications. These privileged queues, referred-to as weight-raised queues, get a much higher share of the device throughput w.r.t. non-privileged queues. To preserve this higher share, the I/O of any non-weight-raised queue must be plugged whenever a sync weight-raised queue, while being served, remains temporarily empty. To attain this goal, bfq simply plugs any I/O (from any queue), if a sync weight-raised queue remains empty while in service. Unfortunately, this plugging typically lowers throughput with random I/O, on devices with internal queueing (because it reduces the filling level of the internal queues of the device). This commit addresses this issue by restricting the cases where plugging is performed: if a sync weight-raised queue remains empty while in service, then I/O plugging is performed only if some of the active bfq_queues are *not* weight-raised (which is actually the only circumstance where plugging is needed to preserve the higher share of the throughput of weight-raised queues). This restriction proved able to boost throughput in really many use cases needing only maximum throughput. Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-11-20block, bfq: inject other-queue I/O into seeky idle queues on NCQ flashPaolo Valente
[ Upstream commit d0edc2473be9d70f999282e1ca7863ad6ae704dc ] The Achilles' heel of BFQ is its failing to reach a high throughput with sync random I/O on flash storage with internal queueing, in case the processes doing I/O have differentiated weights. The cause of this failure is as follows. If at least two processes do sync I/O, and have a different weight from each other, then BFQ plugs I/O dispatching every time one of these processes, while it is being served, remains temporarily without pending I/O requests. This plugging is necessary to guarantee that every process enjoys a bandwidth proportional to its weight; but it empties the internal queue(s) of the drive. And this kills throughput with random I/O. So, if some processes have differentiated weights and do both sync and random I/O, the end result is a throughput collapse. This commit tries to counter this problem by injecting the service of other processes, in a controlled way, while the process in service happens to have no I/O. This injection is performed only if the medium is non rotational and performs internal queueing, and the process in service does random I/O (service injection might be beneficial for sequential I/O too, we'll work on that). As an example of the benefits of this commit, on a PLEXTOR PX-256M5S SSD, and with five processes having differentiated weights and doing sync random 4KB I/O, this commit makes the throughput with bfq grow by 400%, from 25 to 100MB/s. This higher throughput is 10MB/s lower than that reached with none. As some less random I/O is added to the mix, the throughput becomes equal to or higher than that with none. This commit is a very first attempt to recover throughput without losing control, and certainly has many limitations. One is, e.g., that the processes whose service is injected are not chosen so as to distribute the extra bandwidth they receive in accordance to their weights. Thus there might be loss of weighted fairness in some cases. Anyway, this loss concerns extra service, which would not have been received at all without this commit. Other limitations and issues will probably show up with usage. Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-11-12blkcg: make blkcg_print_stat() print stats only for online blkgsTejun Heo
[ Upstream commit b0814361a25cba73a224548843ed92d8ea78715a ] blkcg_print_stat() iterates blkgs under RCU and doesn't test whether the blkg is online. This can call into pd_stat_fn() on a pd which is still being initialized leading to an oops. The heaviest operation - recursively summing up rwstat counters - is already done while holding the queue_lock. Expand queue_lock to cover the other operations and skip the blkg if it isn't online yet. The online state is protected by both blkcg and queue locks, so this guarantees that only online blkgs are processed. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Roman Gushchin <guro@fb.com> Cc: Josef Bacik <jbacik@fb.com> Fixes: 903d23f0a354 ("blk-cgroup: allow controllers to output their own stats") Cc: stable@vger.kernel.org # v4.19+ Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-10-29blk-rq-qos: fix first node deletion of rq_qos_del()Tejun Heo
commit 307f4065b9d7c1e887e8bdfb2487e4638559fea1 upstream. rq_qos_del() incorrectly assigns the node being deleted to the head if it was the first on the list in the !prev path. Fix it by iterating with ** instead. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Josef Bacik <josef@toxicpanda.com> Fixes: a79050434b45 ("blk-rq-qos: refactor out common elements of blk-wbt") Cc: stable@vger.kernel.org # v4.19+ Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-10-17blk-wbt: fix performance regression in wbt scale_up/scale_downHarshad Shirwadkar
commit b84477d3ebb96294f87dc3161e53fa8fe22d9bfd upstream. scale_up wakes up waiters after scaling up. But after scaling max, it should not wake up more waiters as waiters will not have anything to do. This patch fixes this by making scale_up (and also scale_down) return when threshold is reached. This bug causes increased fdatasync latency when fdatasync and dd conv=sync are performed in parallel on 4.19 compared to 4.14. This bug was introduced during refactoring of blk-wbt code. Fixes: a79050434b45 ("blk-rq-qos: refactor out common elements of blk-wbt") Cc: stable@vger.kernel.org Cc: Josef Bacik <jbacik@fb.com> Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-10-07block: mq-deadline: Fix queue restart handlingDamien Le Moal
[ Upstream commit cb8acabbe33b110157955a7425ee876fb81e6bbc ] Commit 7211aef86f79 ("block: mq-deadline: Fix write completion handling") added a call to blk_mq_sched_mark_restart_hctx() in dd_dispatch_request() to make sure that write request dispatching does not stall when all target zones are locked. This fix left a subtle race when a write completion happens during a dispatch execution on another CPU: CPU 0: Dispatch CPU1: write completion dd_dispatch_request() lock(&dd->lock); ... lock(&dd->zone_lock); dd_finish_request() rq = find request lock(&dd->zone_lock); unlock(&dd->zone_lock); zone write unlock unlock(&dd->zone_lock); ... __blk_mq_free_request check restart flag (not set) -> queue not run ... if (!rq && have writes) blk_mq_sched_mark_restart_hctx() unlock(&dd->lock) Since the dispatch context finishes after the write request completion handling, marking the queue as needing a restart is not seen from __blk_mq_free_request() and blk_mq_sched_restart() not executed leading to the dispatch stall under 100% write workloads. Fix this by moving the call to blk_mq_sched_mark_restart_hctx() from dd_dispatch_request() into dd_finish_request() under the zone lock to ensure full mutual exclusion between write request dispatch selection and zone unlock on write request completion. Fixes: 7211aef86f79 ("block: mq-deadline: Fix write completion handling") Cc: stable@vger.kernel.org Reported-by: Hans Holmberg <Hans.Holmberg@wdc.com> Reviewed-by: Hans Holmberg <hans.holmberg@wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-10-05block: fix null pointer dereference in blk_mq_rq_timed_out()Yufen Yu
commit 8d6996630c03d7ceeabe2611378fea5ca1c3f1b3 upstream. We got a null pointer deference BUG_ON in blk_mq_rq_timed_out() as following: [ 108.825472] BUG: kernel NULL pointer dereference, address: 0000000000000040 [ 108.827059] PGD 0 P4D 0 [ 108.827313] Oops: 0000 [#1] SMP PTI [ 108.827657] CPU: 6 PID: 198 Comm: kworker/6:1H Not tainted 5.3.0-rc8+ #431 [ 108.829503] Workqueue: kblockd blk_mq_timeout_work [ 108.829913] RIP: 0010:blk_mq_check_expired+0x258/0x330 [ 108.838191] Call Trace: [ 108.838406] bt_iter+0x74/0x80 [ 108.838665] blk_mq_queue_tag_busy_iter+0x204/0x450 [ 108.839074] ? __switch_to_asm+0x34/0x70 [ 108.839405] ? blk_mq_stop_hw_queue+0x40/0x40 [ 108.839823] ? blk_mq_stop_hw_queue+0x40/0x40 [ 108.840273] ? syscall_return_via_sysret+0xf/0x7f [ 108.840732] blk_mq_timeout_work+0x74/0x200 [ 108.841151] process_one_work+0x297/0x680 [ 108.841550] worker_thread+0x29c/0x6f0 [ 108.841926] ? rescuer_thread+0x580/0x580 [ 108.842344] kthread+0x16a/0x1a0 [ 108.842666] ? kthread_flush_work+0x170/0x170 [ 108.843100] ret_from_fork+0x35/0x40 The bug is caused by the race between timeout handle and completion for flush request. When timeout handle function blk_mq_rq_timed_out() try to read 'req->q->mq_ops', the 'req' have completed and reinitiated by next flush request, which would call blk_rq_init() to clear 'req' as 0. After commit 12f5b93145 ("blk-mq: Remove generation seqeunce"), normal requests lifetime are protected by refcount. Until 'rq->ref' drop to zero, the request can really be free. Thus, these requests cannot been reused before timeout handle finish. However, flush request has defined .end_io and rq->end_io() is still called even if 'rq->ref' doesn't drop to zero. After that, the 'flush_rq' can be reused by the next flush request handle, resulting in null pointer deference BUG ON. We fix this problem by covering flush request with 'rq->ref'. If the refcount is not zero, flush_end_io() return and wait the last holder recall it. To record the request status, we add a new entry 'rq_status', which will be used in flush_end_io(). Cc: Christoph Hellwig <hch@infradead.org> Cc: Keith Busch <keith.busch@intel.com> Cc: Bart Van Assche <bvanassche@acm.org> Cc: stable@vger.kernel.org # v4.18+ Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Bob Liu <bob.liu@oracle.com> Signed-off-by: Yufen Yu <yuyufen@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> ------- v2: - move rq_status from struct request to struct blk_flush_queue v3: - remove unnecessary '{}' pair. v4: - let spinlock to protect 'fq->rq_status' v5: - move rq_status after flush_running_idx member of struct blk_flush_queue Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-10-01blk-mq: move cancel of requeue_work to the front of blk_exit_queuezhengbin
[ Upstream commit e26cc08265dda37d2acc8394604f220ef412299d ] blk_exit_queue will free elevator_data, while blk_mq_requeue_work will access it. Move cancel of requeue_work to the front of blk_exit_queue to avoid use-after-free. blk_exit_queue blk_mq_requeue_work __elevator_exit blk_mq_run_hw_queues blk_mq_exit_sched blk_mq_run_hw_queue dd_exit_queue blk_mq_hctx_has_pending kfree(elevator_data) blk_mq_sched_has_work dd_has_work Fixes: fbc2a15e3433 ("blk-mq: move cancel of requeue_work into blk_mq_release") Cc: stable@vger.kernel.org Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: zhengbin <zhengbin13@huawei.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-10-01blk-mq: change gfp flags to GFP_NOIO in blk_mq_realloc_hw_ctxsJianchao Wang
[ Upstream commit 5b202853ffbc54b29f23c4b1b5f3948efab489a2 ] blk_mq_realloc_hw_ctxs could be invoked during update hw queues. At the momemt, IO is blocked. Change the gfp flags from GFP_KERNEL to GFP_NOIO to avoid forever hang during memory allocation in blk_mq_realloc_hw_ctxs. Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-09-16blk-mq: free hw queue's resource in hctx's release handlerMing Lei
[ Upstream commit c7e2d94b3d1634988a95ac4d77a72dc7487ece06 ] Once blk_cleanup_queue() returns, tags shouldn't be used any more, because blk_mq_free_tag_set() may be called. Commit 45a9c9d909b2 ("blk-mq: Fix a use-after-free") fixes this issue exactly. However, that commit introduces another issue. Before 45a9c9d909b2, we are allowed to run queue during cleaning up queue if the queue's kobj refcount is held. After that commit, queue can't be run during queue cleaning up, otherwise oops can be triggered easily because some fields of hctx are freed by blk_mq_free_queue() in blk_cleanup_queue(). We have invented ways for addressing this kind of issue before, such as: 8dc765d438f1 ("SCSI: fix queue cleanup race before queue initialization is done") c2856ae2f315 ("blk-mq: quiesce queue before freeing queue") But still can't cover all cases, recently James reports another such kind of issue: https://marc.info/?l=linux-scsi&m=155389088124782&w=2 This issue can be quite hard to address by previous way, given scsi_run_queue() may run requeues for other LUNs. Fixes the above issue by freeing hctx's resources in its release handler, and this way is safe becasue tags isn't needed for freeing such hctx resource. This approach follows typical design pattern wrt. kobject's release handler. Cc: Dongli Zhang <dongli.zhang@oracle.com> Cc: James Smart <james.smart@broadcom.com> Cc: Bart Van Assche <bart.vanassche@wdc.com> Cc: linux-scsi@vger.kernel.org, Cc: Martin K . Petersen <martin.petersen@oracle.com>, Cc: Christoph Hellwig <hch@lst.de>, Cc: James E . J . Bottomley <jejb@linux.vnet.ibm.com>, Reported-by: James Smart <james.smart@broadcom.com> Fixes: 45a9c9d909b2 ("blk-mq: Fix a use-after-free") Cc: stable@vger.kernel.org Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Tested-by: James Smart <james.smart@broadcom.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-09-16blk-iolatency: fix STS_AGAIN handlingDennis Zhou
[ Upstream commit c9b3007feca018d3f7061f5d5a14cb00766ffe9b ] The iolatency controller is based on rq_qos. It increments on rq_qos_throttle() and decrements on either rq_qos_cleanup() or rq_qos_done_bio(). a3fb01ba5af0 fixes the double accounting issue where blk_mq_make_request() may call both rq_qos_cleanup() and rq_qos_done_bio() on REQ_NO_WAIT. So checking STS_AGAIN prevents the double decrement. The above works upstream as the only way we can get STS_AGAIN is from blk_mq_get_request() failing. The STS_AGAIN handling isn't a real problem as bio_endio() skipping only happens on reserved tag allocation failures which can only be caused by driver bugs and already triggers WARN. However, the fix creates a not so great dependency on how STS_AGAIN can be propagated. Internally, we (Facebook) carry a patch that kills read ahead if a cgroup is io congested or a fatal signal is pending. This combined with chained bios progagate their bi_status to the parent is not already set can can cause the parent bio to not clean up properly even though it was successful. This consequently leaks the inflight counter and can hang all IOs under that blkg. To nip the adverse interaction early, this removes the rq_qos_cleanup() callback in iolatency in favor of cleaning up always on the rq_qos_done_bio() path. Fixes: a3fb01ba5af0 ("blk-iolatency: only account submitted bios") Debugged-by: Tejun Heo <tj@kernel.org> Debugged-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Dennis Zhou <dennis@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-09-16Blk-iolatency: warn on negative inflight IO counterLiu Bo
[ Upstream commit 391f552af213985d3d324c60004475759a7030c5 ] This is to catch any unexpected negative value of inflight IO counter. Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-08-29block, bfq: handle NULL return value by bfq_init_rq()Paolo Valente
[ Upstream commit fd03177c33b287c6541f4048f1d67b7b45a1abc9 ] As reported in [1], the call bfq_init_rq(rq) may return NULL in case of OOM (in particular, if rq->elv.icq is NULL because memory allocation failed in failed in ioc_create_icq()). This commit handles this circumstance. [1] https://lkml.org/lkml/2019/7/22/824 Cc: Hsin-Yi Wang <hsinyi@google.com> Cc: Nicolas Boichat <drinkcat@chromium.org> Cc: Doug Anderson <dianders@chromium.org> Reported-by: Guenter Roeck <linux@roeck-us.net> Reported-by: Hsin-Yi Wang <hsinyi@google.com> Reviewed-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-08-04block, scsi: Change the preempt-only flag into a counterBart Van Assche
commit cd84a62e0078dce09f4ed349bec84f86c9d54b30 upstream. The RQF_PREEMPT flag is used for three purposes: - In the SCSI core, for making sure that power management requests are executed even if a device is in the "quiesced" state. - For domain validation by SCSI drivers that use the parallel port. - In the IDE driver, for IDE preempt requests. Rename "preempt-only" into "pm-only" because the primary purpose of this mode is power management. Since the power management core may but does not have to resume a runtime suspended device before performing system-wide suspend and since a later patch will set "pm-only" mode as long as a block device is runtime suspended, make it possible to set "pm-only" mode from more than one context. Since with this change scsi_device_quiesce() is no longer idempotent, make that function return early if it is called for a quiesced queue. Signed-off-by: Bart Van Assche <bvanassche@acm.org> Acked-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ming Lei <ming.lei@redhat.com> Cc: Jianchao Wang <jianchao.w.wang@oracle.com> Cc: Johannes Thumshirn <jthumshirn@suse.de> Cc: Alan Stern <stern@rowland.harvard.edu> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-07-31block/bio-integrity: fix a memory leak bugWenwen Wang
[ Upstream commit e7bf90e5afe3aa1d1282c1635a49e17a32c4ecec ] In bio_integrity_prep(), a kernel buffer is allocated through kmalloc() to hold integrity metadata. Later on, the buffer will be attached to the bio structure through bio_integrity_add_page(), which returns the number of bytes of integrity metadata attached. Due to unexpected situations, bio_integrity_add_page() may return 0. As a result, bio_integrity_prep() needs to be terminated with 'false' returned to indicate this error. However, the allocated kernel buffer is not freed on this execution path, leading to a memory leak. To fix this issue, free the allocated buffer before returning from bio_integrity_prep(). Reviewed-by: Ming Lei <ming.lei@redhat.com> Acked-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Wenwen Wang <wenwen@cs.uga.edu> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-07-31block: init flush rq ref count to 1Josef Bacik
[ Upstream commit b554db147feea39617b533ab6bca247c91c6198a ] We discovered a problem in newer kernels where a disconnect of a NBD device while the flush request was pending would result in a hang. This is because the blk mq timeout handler does if (!refcount_inc_not_zero(&rq->ref)) return true; to determine if it's ok to run the timeout handler for the request. Flush_rq's don't have a ref count set, so we'd skip running the timeout handler for this request and it would just sit there in limbo forever. Fix this by always setting the refcount of any request going through blk_init_rq() to 1. I tested this with a nbd-server that dropped flush requests to verify that it hung, and then tested with this patch to verify I got the timeout as expected and the error handling kicked in. Thanks, Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-07-26blkcg: update blkcg_print_stat() to handle larger outputsTejun Heo
commit f539da82f2158916e154d206054e0efd5df7ab61 upstream. Depending on the number of devices, blkcg stats can go over the default seqfile buf size. seqfile normally retries with a larger buffer but since the ->pd_stat() addition, blkcg_print_stat() doesn't tell seqfile that overflow has happened and the output gets printed truncated. Fix it by calling seq_commit() w/ -1 on possible overflows. Signed-off-by: Tejun Heo <tj@kernel.org> Fixes: 903d23f0a354 ("blk-cgroup: allow controllers to output their own stats") Cc: stable@vger.kernel.org # v4.19+ Cc: Josef Bacik <jbacik@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-07-26blk-iolatency: clear use_delay when io.latency is set to zeroTejun Heo
commit 5de0073fcd50cc1f150895a7bb04d3cf8067b1d7 upstream. If use_delay was non-zero when the latency target of a cgroup was set to zero, it will stay stuck until io.latency is enabled on the cgroup again. This keeps readahead disabled for the cgroup impacting performance negatively. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Josef Bacik <jbacik@fb.com> Fixes: d70675121546 ("block: introduce blk-iolatency io controller") Cc: stable@vger.kernel.org # v4.19+ Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-07-26blk-throttle: fix zero wait time for iops throttled groupKonstantin Khlebnikov
commit 3a10f999ffd464d01c5a05592a15470a3c4bbc36 upstream. After commit 991f61fe7e1d ("Blk-throttle: reduce tail io latency when iops limit is enforced") wait time could be zero even if group is throttled and cannot issue requests right now. As a result throtl_select_dispatch() turns into busy-loop under irq-safe queue spinlock. Fix is simple: always round up target time to the next throttle slice. Fixes: 991f61fe7e1d ("Blk-throttle: reduce tail io latency when iops limit is enforced") Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Cc: stable@vger.kernel.org # v4.19+ Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-07-26blk-iolatency: only account submitted biosDennis Zhou
[ Upstream commit a3fb01ba5af066521f3f3421839e501bb2c71805 ] As is, iolatency recognizes done_bio and cleanup as ending paths. If a request is marked REQ_NOWAIT and fails to get a request, the bio is cleaned up via rq_qos_cleanup() and ended in bio_wouldblock_error(). This results in underflowing the inflight counter. Fix this by only accounting bios that were actually submitted. Signed-off-by: Dennis Zhou <dennis@kernel.org> Cc: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-07-14block, bfq: NULL out the bic when it's no longer validDouglas Anderson
commit dbc3117d4ca9e17819ac73501e914b8422686750 upstream. In reboot tests on several devices we were seeing a "use after free" when slub_debug or KASAN was enabled. The kernel complained about: Unable to handle kernel paging request at virtual address 6b6b6c2b ...which is a classic sign of use after free under slub_debug. The stack crawl in kgdb looked like: 0 test_bit (addr=<optimized out>, nr=<optimized out>) 1 bfq_bfqq_busy (bfqq=<optimized out>) 2 bfq_select_queue (bfqd=<optimized out>) 3 __bfq_dispatch_request (hctx=<optimized out>) 4 bfq_dispatch_request (hctx=<optimized out>) 5 0xc056ef00 in blk_mq_do_dispatch_sched (hctx=0xed249440) 6 0xc056f728 in blk_mq_sched_dispatch_requests (hctx=0xed249440) 7 0xc0568d24 in __blk_mq_run_hw_queue (hctx=0xed249440) 8 0xc0568d94 in blk_mq_run_work_fn (work=<optimized out>) 9 0xc024c5c4 in process_one_work (worker=0xec6d4640, work=0xed249480) 10 0xc024cff4 in worker_thread (__worker=0xec6d4640) Digging in kgdb, it could be found that, though bfqq looked fine, bfqq->bic had been freed. Through further digging, I postulated that perhaps it is illegal to access a "bic" (AKA an "icq") after bfq_exit_icq() had been called because the "bic" can be freed at some point in time after this call is made. I confirmed that there certainly were cases where the exact crashing code path would access the "bic" after bfq_exit_icq() had been called. Sspecifically I set the "bfqq->bic" to (void *)0x7 and saw that the bic was 0x7 at the time of the crash. To understand a bit more about why this crash was fairly uncommon (I saw it only once in a few hundred reboots), you can see that much of the time bfq_exit_icq_fbqq() fully frees the bfqq and thus it can't access the ->bic anymore. The only case it doesn't is if bfq_put_queue() sees a reference still held. However, even in the case when bfqq isn't freed, the crash is still rare. Why? I tracked what happened to the "bic" after the exit routine. It doesn't get freed right away. Rather, put_io_context_active() eventually called put_io_context() which queued up freeing on a workqueue. The freeing then actually happened later than that through call_rcu(). Despite all these delays, some extra debugging showed that all the hoops could be jumped through in time and the memory could be freed causing the original crash. Phew! To make a long story short, assuming it truly is illegal to access an icq after the "exit_icq" callback is finished, this patch is needed. Cc: stable@vger.kernel.org Reviewed-by: Paolo Valente <paolo.valente@unimore.it> Signed-off-by: Douglas Anderson <dianders@chromium.org> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-07-10block: Fix a NULL pointer dereference in generic_make_request()Guilherme G. Piccoli
----------------------------------------------------------------- This patch is not on mainline and is meant to 4.19 stable *only*. After the patch description there's a reasoning about that. ----------------------------------------------------------------- Commit 37f9579f4c31 ("blk-mq: Avoid that submitting a bio concurrently with device removal triggers a crash") introduced a NULL pointer dereference in generic_make_request(). The patch sets q to NULL and enter_succeeded to false; right after, there's an 'if (enter_succeeded)' which is not taken, and then the 'else' will dereference q in blk_queue_dying(q). This patch just moves the 'q = NULL' to a point in which it won't trigger the oops, although the semantics of this NULLification remains untouched. A simple test case/reproducer is as follows: a) Build kernel v4.19.56-stable with CONFIG_BLK_CGROUP=n. b) Create a raid0 md array with 2 NVMe devices as members, and mount it with an ext4 filesystem. c) Run the following oneliner (supposing the raid0 is mounted in /mnt): (dd of=/mnt/tmp if=/dev/zero bs=1M count=999 &); sleep 0.3; echo 1 > /sys/block/nvme1n1/device/device/remove (whereas nvme1n1 is the 2nd array member) This will trigger the following oops: BUG: unable to handle kernel NULL pointer dereference at 0000000000000078 PGD 0 P4D 0 Oops: 0000 [#1] SMP PTI RIP: 0010:generic_make_request+0x32b/0x400 Call Trace: submit_bio+0x73/0x140 ext4_io_submit+0x4d/0x60 ext4_writepages+0x626/0xe90 do_writepages+0x4b/0xe0 [...] This patch has no functional changes and preserves the md/raid0 behavior when a member is removed before kernel v4.17. ---------------------------- Why this is not on mainline? ---------------------------- The patch was originally submitted upstream in linux-raid and linux-block mailing-lists - it was initially accepted by Song Liu, but Christoph Hellwig[0] observed that there was a clean-up series ready to be accepted from Ming Lei[1] that fixed the same issue. The accepted patches from Ming's series in upstream are: commit 47cdee29ef9d ("block: move blk_exit_queue into __blk_release_queue") and commit fe2008640ae3 ("block: don't protect generic_make_request_checks with blk_queue_enter"). Those patches basically do a clean-up in the block layer involving: 1) Putting back blk_exit_queue() logic into __blk_release_queue(); that path was changed in the past and the logic from blk_exit_queue() was added to blk_cleanup_queue(). 2) Removing the guard/protection in generic_make_request_checks() with blk_queue_enter(). The problem with Ming's series for -stable is that it relies in the legacy request IO path removal. So it's "backport-able" to v5.0+, but doing that for early versions (like 4.19) would incur in complex code changes. Hence, it was suggested by Christoph and Song Liu that this patch was submitted to stable only; otherwise merging it upstream would add code to fix a path removed in a subsequent commit. [0] lore.kernel.org/linux-block/20190521172258.GA32702@infradead.org [1] lore.kernel.org/linux-block/20190515030310.20393-1-ming.lei@redhat.com Cc: Christoph Hellwig <hch@lst.de> Cc: Jens Axboe <axboe@kernel.dk> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Ming Lei <ming.lei@redhat.com> Tested-by: Eric Ren <renzhengeek@gmail.com> Fixes: 37f9579f4c31 ("blk-mq: Avoid that submitting a bio concurrently with device removal triggers a crash") Signed-off-by: Guilherme G. Piccoli <gpiccoli@canonical.com> Acked-by: Song Liu <songliubraving@fb.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-15block, bfq: increase idling for weight-raised queuesPaolo Valente
[ Upstream commit 778c02a236a8728bb992de10ed1f12c0be5b7b0e ] If a sync bfq_queue has a higher weight than some other queue, and remains temporarily empty while in service, then, to preserve the bandwidth share of the queue, it is necessary to plug I/O dispatching until a new request arrives for the queue. In addition, a timeout needs to be set, to avoid waiting for ever if the process associated with the queue has actually finished its I/O. Even with the above timeout, the device is however not fed with new I/O for a while, if the process has finished its I/O. If this happens often, then throughput drops and latencies grow. For this reason, the timeout is kept rather low: 8 ms is the current default. Unfortunately, such a low value may cause, on the opposite end, a violation of bandwidth guarantees for a process that happens to issue new I/O too late. The higher the system load, the higher the probability that this happens to some process. This is a problem in scenarios where service guarantees matter more than throughput. One important case are weight-raised queues, which need to be granted a very high fraction of the bandwidth. To address this issue, this commit lower-bounds the plugging timeout for weight-raised queues to 20 ms. This simple change provides relevant benefits. For example, on a PLEXTOR PX-256M5S, with which gnome-terminal starts in 0.6 seconds if there is no other I/O in progress, the same applications starts in - 0.8 seconds, instead of 1.2 seconds, if ten files are being read sequentially in parallel - 1 second, instead of 2 seconds, if, in parallel, five files are being read sequentially, and five more files are being written sequentially Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com> Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-06-15blk-mq: move cancel of requeue_work into blk_mq_releaseMing Lei
[ Upstream commit fbc2a15e3433058582e5635aabe48a3011a644a8 ] With holding queue's kobject refcount, it is safe for driver to schedule requeue. However, blk_mq_kick_requeue_list() may be called after blk_sync_queue() is done because of concurrent requeue activities, then requeue work may not be completed when freeing queue, and kernel oops is triggered. So moving the cancel of requeue_work into blk_mq_release() for avoiding race between requeue and freeing queue. Cc: Dongli Zhang <dongli.zhang@oracle.com> Cc: James Smart <james.smart@broadcom.com> Cc: Bart Van Assche <bart.vanassche@wdc.com> Cc: linux-scsi@vger.kernel.org, Cc: Martin K . Petersen <martin.petersen@oracle.com>, Cc: Christoph Hellwig <hch@lst.de>, Cc: James E . J . Bottomley <jejb@linux.vnet.ibm.com>, Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Tested-by: James Smart <james.smart@broadcom.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-05-31block: sed-opal: fix IOC_OPAL_ENABLE_DISABLE_MBRDavid Kozub
[ Upstream commit 78bf47353b0041865564deeed257a54f047c2fdc ] The implementation of IOC_OPAL_ENABLE_DISABLE_MBR handled the value opal_mbr_data.enable_disable incorrectly: enable_disable is expected to be one of OPAL_MBR_ENABLE(0) or OPAL_MBR_DISABLE(1). enable_disable was passed directly to set_mbr_done and set_mbr_enable_disable where is was interpreted as either OPAL_TRUE(1) or OPAL_FALSE(0). The end result was that calling IOC_OPAL_ENABLE_DISABLE_MBR with OPAL_MBR_ENABLE actually disabled the shadow MBR and vice versa. This patch adds correct conversion from OPAL_MBR_DISABLE/ENABLE to OPAL_FALSE/TRUE. The change affects existing programs using IOC_OPAL_ENABLE_DISABLE_MBR but this is typically used only once when setting up an Opal drive. Acked-by: Jon Derrick <jonathan.derrick@intel.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Scott Bauer <sbauer@plzdonthack.me> Signed-off-by: David Kozub <zub@linux.fjfi.cvut.cz> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-05-31block: fix use-after-free on gendiskYufen Yu
[ Upstream commit 2c88e3c7ec32d7a40cc7c9b4a487cf90e4671bdd ] commit 2da78092dda "block: Fix dev_t minor allocation lifetime" specifically moved blk_free_devt(dev->devt) call to part_release() to avoid reallocating device number before the device is fully shutdown. However, it can cause use-after-free on gendisk in get_gendisk(). We use md device as example to show the race scenes: Process1 Worker Process2 md_free blkdev_open del_gendisk add delete_partition_work_fn() to wq __blkdev_get get_gendisk put_disk disk_release kfree(disk) find part from ext_devt_idr get_disk_and_module(disk) cause use after free delete_partition_work_fn put_device(part) part_release remove part from ext_devt_idr Before <devt, hd_struct pointer> is removed from ext_devt_idr by delete_partition_work_fn(), we can find the devt and then access gendisk by hd_struct pointer. But, if we access the gendisk after it have been freed, it can cause in use-after-freeon gendisk in get_gendisk(). We fix this by adding a new helper blk_invalidate_devt() in delete_partition() and del_gendisk(). It replaces hd_struct pointer in idr with value 'NULL', and deletes the entry from idr in part_release() as we do now. Thanks to Jan Kara for providing the solution and more clear comments for the code. Fixes: 2da78092dda1 ("block: Fix dev_t minor allocation lifetime") Cc: Al Viro <viro@zeniv.linux.org.uk> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Jan Kara <jack@suse.cz> Suggested-by: Jan Kara <jack@suse.cz> Signed-off-by: Yufen Yu <yuyufen@huawei.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-05-16bfq: update internal depth state when queue depth changesJens Axboe
commit 77f1e0a52d26242b6c2dba019f6ebebfb9ff701e upstream A previous commit moved the shallow depth and BFQ depth map calculations to be done at init time, moving it outside of the hotter IO path. This potentially causes hangs if the users changes the depth of the scheduler map, by writing to the 'nr_requests' sysfs file for that device. Add a blk-mq-sched hook that allows blk-mq to inform the scheduler if the depth changes, so that the scheduler can update its internal state. Signed-off-by: Eric Wheeler <bfq@linux.ewheeler.net> Tested-by: Kai Krakow <kai@kaishome.de> Reported-by: Paolo Valente <paolo.valente@linaro.org> Fixes: f0635b8a416e ("bfq: calculate shallow depths at init time") Signed-off-by: Jens Axboe <axboe@kernel.dk> Cc: stable@vger.kernel.org Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-05-08block: pass no-op callback to INIT_WORK().Tetsuo Handa
[ Upstream commit 2e3c18d0ada16f145087b2687afcad1748c0827c ] syzbot is hitting flush_work() warning caused by commit 4d43d395fed12463 ("workqueue: Try to catch flush_work() without INIT_WORK().") [1]. Although that commit did not expect INIT_WORK(NULL) case, calling flush_work() without setting a valid callback should be avoided anyway. Fix this problem by setting a no-op callback instead of NULL. [1] https://syzkaller.appspot.com/bug?id=e390366bc48bc82a7c668326e0663be3b91cbd29 Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Reported-and-tested-by: syzbot <syzbot+ba2a929dcf8e704c180e@syzkaller.appspotmail.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk> [sl: rename blk_timeout_work] Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-05-08block: use blk_free_flush_queue() to free hctx->fq in blk_mq_init_hctxShenghui Wang
[ Upstream commit b9a1ff504b9492ad6beb7d5606e0e3365d4d8499 ] kfree() can leak the hctx->fq->flush_rq field. Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Shenghui Wang <shhuiw@foxmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-04-20blk-iolatency: #include "blk.h"Bart Van Assche
[ Upstream commit 373e915cd8e84544609eced57a44fbc084f8d60f ] This patch avoids that the following warning is reported when building with W=1: block/blk-iolatency.c:734:5: warning: no previous prototype for 'blk_iolatency_init' [-Wmissing-prototypes] Cc: Josef Bacik <jbacik@fb.com> Fixes: d70675121546 ("block: introduce blk-iolatency io controller") # v4.19 Signed-off-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-04-17block: do not leak memory in bio_copy_user_iov()Jérôme Glisse
commit a3761c3c91209b58b6f33bf69dd8bb8ec0c9d925 upstream. When bio_add_pc_page() fails in bio_copy_user_iov() we should free the page we just allocated otherwise we are leaking it. Cc: linux-block@vger.kernel.org Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: stable@vger.kernel.org Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Jérôme Glisse <jglisse@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-04-05block, bfq: fix in-service-queue check for queue mergingPaolo Valente
[ Upstream commit 058fdecc6de7cdecbf4c59b851e80eb2d6c5295f ] When a new I/O request arrives for a bfq_queue, say Q, bfq checks whether that request is close to (a) the head request of some other queue waiting to be served, or (b) the last request dispatched for the in-service queue (in case Q itself is not the in-service queue) If a queue, say Q2, is found for which the above condition holds, then bfq merges Q and Q2, to hopefully get a more sequential I/O in the resulting merged queue, and thus a possibly higher throughput. Case (b) is checked by comparing the new request for Q with the last request dispatched, assuming that the latter necessarily belonged to the in-service queue. Unfortunately, this assumption is no longer always correct, since commit d0edc2473be9 ("block, bfq: inject other-queue I/O into seeky idle queues on NCQ flash"). When the assumption does not hold, queues that must not be merged may be merged, causing unexpected loss of control on per-queue service guarantees. This commit solves this problem by adding an extra field, which stores the actual last request dispatched for the in-service queue, and by using this new field to correctly check case (b). Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-03-23blk-mq: insert rq with DONTPREP to hctx dispatch list when requeueJianchao Wang
[ Upstream commit aef1897cd36dcf5e296f1d2bae7e0d268561b685 ] When requeue, if RQF_DONTPREP, rq has contained some driver specific data, so insert it to hctx dispatch list to avoid any merge. Take scsi as example, here is the trace event log (no io scheduler, because RQF_STARTED would prevent merging), kworker/0:1H-339 [000] ...1 2037.209289: block_rq_insert: 8,0 R 4096 () 32768 + 8 [kworker/0:1H] scsi_inert_test-1987 [000] .... 2037.220465: block_bio_queue: 8,0 R 32776 + 8 [scsi_inert_test] scsi_inert_test-1987 [000] ...2 2037.220466: block_bio_backmerge: 8,0 R 32776 + 8 [scsi_inert_test] kworker/0:1H-339 [000] .... 2047.220913: block_rq_issue: 8,0 R 8192 () 32768 + 16 [kworker/0:1H] scsi_inert_test-1996 [000] ..s1 2047.221007: block_rq_complete: 8,0 R () 32768 + 8 [0] scsi_inert_test-1996 [000] .Ns1 2047.221045: block_rq_requeue: 8,0 R () 32776 + 8 [0] kworker/0:1H-339 [000] ...1 2047.221054: block_rq_insert: 8,0 R 4096 () 32776 + 8 [kworker/0:1H] kworker/0:1H-339 [000] ...1 2047.221056: block_rq_issue: 8,0 R 4096 () 32776 + 8 [kworker/0:1H] scsi_inert_test-1986 [000] ..s1 2047.221119: block_rq_complete: 8,0 R () 32776 + 8 [0] (32768 + 8) was requeued by scsi_queue_insert and had RQF_DONTPREP. Then it was merged with (32776 + 8) and issued. Due to RQF_DONTPREP, the sdb only contained the part of (32768 + 8), then only that part was completed. The lucky thing was that scsi_io_completion detected it and requeued the remaining part. So we didn't get corrupted data. However, the requeue of (32776 + 8) is not expected. Suggested-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-03-13blk-iolatency: fix IO hang due to negative inflight counterLiu Bo
[ Upstream commit 8c772a9bfc7c07c76f4a58b58910452fbb20843b ] Our test reported the following stack, and vmcore showed that ->inflight counter is -1. [ffffc9003fcc38d0] __schedule at ffffffff8173d95d [ffffc9003fcc3958] schedule at ffffffff8173de26 [ffffc9003fcc3970] io_schedule at ffffffff810bb6b6 [ffffc9003fcc3988] blkcg_iolatency_throttle at ffffffff813911cb [ffffc9003fcc3a20] rq_qos_throttle at ffffffff813847f3 [ffffc9003fcc3a48] blk_mq_make_request at ffffffff8137468a [ffffc9003fcc3b08] generic_make_request at ffffffff81368b49 [ffffc9003fcc3b68] submit_bio at ffffffff81368d7d [ffffc9003fcc3bb8] ext4_io_submit at ffffffffa031be00 [ext4] [ffffc9003fcc3c00] ext4_writepages at ffffffffa03163de [ext4] [ffffc9003fcc3d68] do_writepages at ffffffff811c49ae [ffffc9003fcc3d78] __filemap_fdatawrite_range at ffffffff811b6188 [ffffc9003fcc3e30] filemap_write_and_wait_range at ffffffff811b6301 [ffffc9003fcc3e60] ext4_sync_file at ffffffffa030cee8 [ext4] [ffffc9003fcc3ea8] vfs_fsync_range at ffffffff8128594b [ffffc9003fcc3ee8] do_fsync at ffffffff81285abd [ffffc9003fcc3f18] sys_fsync at ffffffff81285d50 [ffffc9003fcc3f28] do_syscall_64 at ffffffff81003c04 [ffffc9003fcc3f50] entry_SYSCALL_64_after_swapgs at ffffffff81742b8e The ->inflight counter may be negative (-1) if 1) blk-iolatency was disabled when the IO was issued, 2) blk-iolatency was enabled before this IO reached its endio, 3) the ->inflight counter is decreased from 0 to -1 in endio() In fact the hang can be easily reproduced by the below script, H=/sys/fs/cgroup/unified/ P=/sys/fs/cgroup/unified/test echo "+io" > $H/cgroup.subtree_control mkdir -p $P echo $$ > $P/cgroup.procs xfs_io -f -d -c "pwrite 0 4k" /dev/sdg echo "`cat /sys/block/sdg/dev` target=1000000" > $P/io.latency xfs_io -f -d -c "pwrite 0 4k" /dev/sdg This fixes the problem by freezing the queue so that while enabling/disabling iolatency, there is no inflight rq running. Note that quiesce_queue is not needed as this only updating iolatency configuration about which dispatching request_queue doesn't care. Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-02-20blk-mq: fix a hung issue when fsyncJianchao Wang
[ Upstream commit 85bd6e61f34dffa8ec2dc75ff3c02ee7b2f1cbce ] Florian reported a io hung issue when fsync(). It should be triggered by following race condition. data + post flush a flush blk_flush_complete_seq case REQ_FSEQ_DATA blk_flush_queue_rq issued to driver blk_mq_dispatch_rq_list try to issue a flush req failed due to NON-NCQ command .queue_rq return BLK_STS_DEV_RESOURCE request completion req->end_io // doesn't check RESTART mq_flush_data_end_io case REQ_FSEQ_POSTFLUSH blk_kick_flush do nothing because previous flush has not been completed blk_mq_run_hw_queue insert rq to hctx->dispatch due to RESTART is still set, do nothing To fix this, replace the blk_mq_run_hw_queue in mq_flush_data_end_io with blk_mq_sched_restart to check and clear the RESTART flag. Fixes: bd166ef1 (blk-mq-sched: add framework for MQ capable IO schedulers) Reported-by: Florian Stecker <m19@florianstecker.de> Tested-by: Florian Stecker <m19@florianstecker.de> Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-01-22block: use rcu_work instead of call_rcu to avoid sleep in softirqYufen Yu
commit 94a2c3a32b62e868dc1e3d854326745a7f1b8c7a upstream. We recently got a stack by syzkaller like this: BUG: sleeping function called from invalid context at mm/slab.h:361 in_atomic(): 1, irqs_disabled(): 0, pid: 6644, name: blkid INFO: lockdep is turned off. CPU: 1 PID: 6644 Comm: blkid Not tainted 4.4.163-514.55.6.9.x86_64+ #76 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014 0000000000000000 5ba6a6b879e50c00 ffff8801f6b07b10 ffffffff81cb2194 0000000041b58ab3 ffffffff833c7745 ffffffff81cb2080 5ba6a6b879e50c00 0000000000000000 0000000000000001 0000000000000004 0000000000000000 Call Trace: <IRQ> [<ffffffff81cb2194>] __dump_stack lib/dump_stack.c:15 [inline] <IRQ> [<ffffffff81cb2194>] dump_stack+0x114/0x1a0 lib/dump_stack.c:51 [<ffffffff8129a981>] ___might_sleep+0x291/0x490 kernel/sched/core.c:7675 [<ffffffff8129ac33>] __might_sleep+0xb3/0x270 kernel/sched/core.c:7637 [<ffffffff81794c13>] slab_pre_alloc_hook mm/slab.h:361 [inline] [<ffffffff81794c13>] slab_alloc_node mm/slub.c:2610 [inline] [<ffffffff81794c13>] slab_alloc mm/slub.c:2692 [inline] [<ffffffff81794c13>] kmem_cache_alloc_trace+0x2c3/0x5c0 mm/slub.c:2709 [<ffffffff81cbe9a7>] kmalloc include/linux/slab.h:479 [inline] [<ffffffff81cbe9a7>] kzalloc include/linux/slab.h:623 [inline] [<ffffffff81cbe9a7>] kobject_uevent_env+0x2c7/0x1150 lib/kobject_uevent.c:227 [<ffffffff81cbf84f>] kobject_uevent+0x1f/0x30 lib/kobject_uevent.c:374 [<ffffffff81cbb5b9>] kobject_cleanup lib/kobject.c:633 [inline] [<ffffffff81cbb5b9>] kobject_release+0x229/0x440 lib/kobject.c:675 [<ffffffff81cbb0a2>] kref_sub include/linux/kref.h:73 [inline] [<ffffffff81cbb0a2>] kref_put include/linux/kref.h:98 [inline] [<ffffffff81cbb0a2>] kobject_put+0x72/0xd0 lib/kobject.c:692 [<ffffffff8216f095>] put_device+0x25/0x30 drivers/base/core.c:1237 [<ffffffff81c4cc34>] delete_partition_rcu_cb+0x1d4/0x2f0 block/partition-generic.c:232 [<ffffffff813c08bc>] __rcu_reclaim kernel/rcu/rcu.h:118 [inline] [<ffffffff813c08bc>] rcu_do_batch kernel/rcu/tree.c:2705 [inline] [<ffffffff813c08bc>] invoke_rcu_callbacks kernel/rcu/tree.c:2973 [inline] [<ffffffff813c08bc>] __rcu_process_callbacks kernel/rcu/tree.c:2940 [inline] [<ffffffff813c08bc>] rcu_process_callbacks+0x59c/0x1c70 kernel/rcu/tree.c:2957 [<ffffffff8120f509>] __do_softirq+0x299/0xe20 kernel/softirq.c:273 [<ffffffff81210496>] invoke_softirq kernel/softirq.c:350 [inline] [<ffffffff81210496>] irq_exit+0x216/0x2c0 kernel/softirq.c:391 [<ffffffff82c2cd7b>] exiting_irq arch/x86/include/asm/apic.h:652 [inline] [<ffffffff82c2cd7b>] smp_apic_timer_interrupt+0x8b/0xc0 arch/x86/kernel/apic/apic.c:926 [<ffffffff82c2bc25>] apic_timer_interrupt+0xa5/0xb0 arch/x86/entry/entry_64.S:746 <EOI> [<ffffffff814cbf40>] ? audit_kill_trees+0x180/0x180 [<ffffffff8187d2f7>] fd_install+0x57/0x80 fs/file.c:626 [<ffffffff8180989e>] do_sys_open+0x45e/0x550 fs/open.c:1043 [<ffffffff818099c2>] SYSC_open fs/open.c:1055 [inline] [<ffffffff818099c2>] SyS_open+0x32/0x40 fs/open.c:1050 [<ffffffff82c299e1>] entry_SYSCALL_64_fastpath+0x1e/0x9a In softirq context, we call rcu callback function delete_partition_rcu_cb(), which may allocate memory by kzalloc with GFP_KERNEL flag. If the allocation cannot be satisfied, it may sleep. However, That is not allowed in softirq contex. Although we found this problem on linux 4.4, the latest kernel version seems to have this problem as well. And it is very similar to the previous one: https://lkml.org/lkml/2018/7/9/391 Fix it by using RCU workqueue, which allows sleep. Reviewed-by: Paul E. McKenney <paulmck@linux.ibm.com> Signed-off-by: Yufen Yu <yuyufen@huawei.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-01-13block: mq-deadline: Fix write completion handlingDamien Le Moal
commit 7211aef86f79583e59b88a0aba0bc830566f7e8e upstream. For a zoned block device using mq-deadline, if a write request for a zone is received while another write was already dispatched for the same zone, dd_dispatch_request() will return NULL and the newly inserted write request is kept in the scheduler queue waiting for the ongoing zone write to complete. With this behavior, when no other request has been dispatched, rq_list in blk_mq_sched_dispatch_requests() is empty and blk_mq_sched_mark_restart_hctx() not called. This in turn leads to __blk_mq_free_request() call of blk_mq_sched_restart() to not run the queue when the already dispatched write request completes. The newly dispatched request stays stuck in the scheduler queue until eventually another request is submitted. This problem does not affect SCSI disk as the SCSI stack handles queue restart on request completion. However, this problem is can be triggered the nullblk driver with zoned mode enabled. Fix this by always requesting a queue restart in dd_dispatch_request() if no request was dispatched while WRITE requests are queued. Fixes: 5700f69178e9 ("mq-deadline: Introduce zone locking support") Cc: <stable@vger.kernel.org> Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Add missing export of blk_mq_sched_restart() Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-01-13block: deactivate blk_stat timer in wbt_disable_default()Ming Lei
commit 544fbd16a461a318cd80537d1331c0df5c6cf930 upstream. rwb_enabled() can't be changed when there is any inflight IO. wbt_disable_default() may set rwb->wb_normal as zero, however the blk_stat timer may still be pending, and the timer function will update wrb->wb_normal again. This patch introduces blk_stat_deactivate() and applies it in wbt_disable_default(), then the following IO hang triggered when running parted & switching io scheduler can be fixed: [ 369.937806] INFO: task parted:3645 blocked for more than 120 seconds. [ 369.938941] Not tainted 4.20.0-rc6-00284-g906c801e5248 #498 [ 369.939797] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 369.940768] parted D 0 3645 3239 0x00000000 [ 369.941500] Call Trace: [ 369.941874] ? __schedule+0x6d9/0x74c [ 369.942392] ? wbt_done+0x5e/0x5e [ 369.942864] ? wbt_cleanup_cb+0x16/0x16 [ 369.943404] ? wbt_done+0x5e/0x5e [ 369.943874] schedule+0x67/0x78 [ 369.944298] io_schedule+0x12/0x33 [ 369.944771] rq_qos_wait+0xb5/0x119 [ 369.945193] ? karma_partition+0x1c2/0x1c2 [ 369.945691] ? wbt_cleanup_cb+0x16/0x16 [ 369.946151] wbt_wait+0x85/0xb6 [ 369.946540] __rq_qos_throttle+0x23/0x2f [ 369.947014] blk_mq_make_request+0xe6/0x40a [ 369.947518] generic_make_request+0x192/0x2fe [ 369.948042] ? submit_bio+0x103/0x11f [ 369.948486] ? __radix_tree_lookup+0x35/0xb5 [ 369.949011] submit_bio+0x103/0x11f [ 369.949436] ? blkg_lookup_slowpath+0x25/0x44 [ 369.949962] submit_bio_wait+0x53/0x7f [ 369.950469] blkdev_issue_flush+0x8a/0xae [ 369.951032] blkdev_fsync+0x2f/0x3a [ 369.951502] do_fsync+0x2e/0x47 [ 369.951887] __x64_sys_fsync+0x10/0x13 [ 369.952374] do_syscall_64+0x89/0x149 [ 369.952819] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 369.953492] RIP: 0033:0x7f95a1e729d4 [ 369.953996] Code: Bad RIP value. [ 369.954456] RSP: 002b:00007ffdb570dd48 EFLAGS: 00000246 ORIG_RAX: 000000000000004a [ 369.955506] RAX: ffffffffffffffda RBX: 000055c2139c6be0 RCX: 00007f95a1e729d4 [ 369.956389] RDX: 0000000000000001 RSI: 0000000000001261 RDI: 0000000000000004 [ 369.957325] RBP: 0000000000000002 R08: 0000000000000000 R09: 000055c2139c6ce0 [ 369.958199] R10: 0000000000000000 R11: 0000000000000246 R12: 000055c2139c0380 [ 369.959143] R13: 0000000000000004 R14: 0000000000000100 R15: 0000000000000008 Cc: stable@vger.kernel.org Cc: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-12-19block/bio: Do not zero user pagesKeith Busch
commit f55adad601c6a97c8c9628195453e0fb23b4a0ae upstream. We don't need to zero fill the bio if not using kernel allocated pages. Fixes: f3587d76da05 ("block: Clear kernel memory before copying to user") # v4.20-rc2 Reported-by: Todd Aiken <taiken@mvtech.ca> Cc: Laurence Oberman <loberman@redhat.com> Cc: stable@vger.kernel.org Cc: Bart Van Assche <bvanassche@acm.org> Tested-by: Laurence Oberman <loberman@redhat.com> Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-12-08blk-mq: punt failed direct issue to dispatch listJens Axboe
commit c616cbee97aed4bc6178f148a7240206dcdb85a6 upstream. After the direct dispatch corruption fix, we permanently disallow direct dispatch of non read/write requests. This works fine off the normal IO path, as they will be retried like any other failed direct dispatch request. But for the blk_insert_cloned_request() that only DM uses to bypass the bottom level scheduler, we always first attempt direct dispatch. For some types of requests, that's now a permanent failure, and no amount of retrying will make that succeed. This results in a livelock. Instead of making special cases for what we can direct issue, and now having to deal with DM solving the livelock while still retaining a BUSY condition feedback loop, always just add a request that has been through ->queue_rq() to the hardware queue dispatch list. These are safe to use as no merging can take place there. Additionally, if requests do have prepped data from drivers, we aren't dependent on them not sharing space in the request structure to safely add them to the IO scheduler lists. This basically reverts ffe81d45322c and is based on a patch from Ming, but with the list insert case covered as well. Fixes: ffe81d45322c ("blk-mq: fix corruption with direct issue") Cc: stable@vger.kernel.org Suggested-by: Ming Lei <ming.lei@redhat.com> Reported-by: Bart Van Assche <bvanassche@acm.org> Tested-by: Ming Lei <ming.lei@redhat.com> Acked-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-12-08blk-mq: fix corruption with direct issueJens Axboe
commit ffe81d45322cc3cb140f0db080a4727ea284661e upstream. If we attempt a direct issue to a SCSI device, and it returns BUSY, then we queue the request up normally. However, the SCSI layer may have already setup SG tables etc for this particular command. If we later merge with this request, then the old tables are no longer valid. Once we issue the IO, we only read/write the original part of the request, not the new state of it. This causes data corruption, and is most often noticed with the file system complaining about the just read data being invalid: [ 235.934465] EXT4-fs error (device sda1): ext4_iget:4831: inode #7142: comm dpkg-query: bad extra_isize 24937 (inode size 256) because most of it is garbage... This doesn't happen from the normal issue path, as we will simply defer the request to the hardware queue dispatch list if we fail. Once it's on the dispatch list, we never merge with it. Fix this from the direct issue path by flagging the request as REQ_NOMERGE so we don't change the size of it before issue. See also: https://bugzilla.kernel.org/show_bug.cgi?id=201685 Tested-by: Guenter Roeck <linux@roeck-us.net> Fixes: 6ce3dd6eec1 ("blk-mq: issue directly if hw queue isn't busy in case of 'none'") Cc: stable@vger.kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-12-01block: copy ioprio in __bio_clone_fast() and bounceHannes Reinecke
[ Upstream commit ca474b73896bf6e0c1eb8787eb217b0f80221610 ] We need to copy the io priority, too; otherwise the clone will run with a different priority than the original one. Fixes: 43b62ce3ff0a ("block: move bio io prio to a new field") Signed-off-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jean Delvare <jdelvare@suse.de> Fixed up subject, and ordered stores. Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>