summaryrefslogtreecommitdiffstats
path: root/block
AgeCommit message (Collapse)Author
2019-08-29block, bfq: handle NULL return value by bfq_init_rq()Paolo Valente
[ Upstream commit fd03177c33b287c6541f4048f1d67b7b45a1abc9 ] As reported in [1], the call bfq_init_rq(rq) may return NULL in case of OOM (in particular, if rq->elv.icq is NULL because memory allocation failed in failed in ioc_create_icq()). This commit handles this circumstance. [1] https://lkml.org/lkml/2019/7/22/824 Cc: Hsin-Yi Wang <hsinyi@google.com> Cc: Nicolas Boichat <drinkcat@chromium.org> Cc: Doug Anderson <dianders@chromium.org> Reported-by: Guenter Roeck <linux@roeck-us.net> Reported-by: Hsin-Yi Wang <hsinyi@google.com> Reviewed-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-08-25blk-mq: move cancel of requeue_work to the front of blk_exit_queuezhengbin
commit e26cc08265dda37d2acc8394604f220ef412299d upstream. blk_exit_queue will free elevator_data, while blk_mq_requeue_work will access it. Move cancel of requeue_work to the front of blk_exit_queue to avoid use-after-free. blk_exit_queue blk_mq_requeue_work __elevator_exit blk_mq_run_hw_queues blk_mq_exit_sched blk_mq_run_hw_queue dd_exit_queue blk_mq_hctx_has_pending kfree(elevator_data) blk_mq_sched_has_work dd_has_work Fixes: fbc2a15e3433 ("blk-mq: move cancel of requeue_work into blk_mq_release") Cc: stable@vger.kernel.org Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: zhengbin <zhengbin13@huawei.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-08-16rq-qos: use a mb for got_tokenJosef Bacik
[ Upstream commit ac38297f7038cd5b80d66f8809c7bbf5b70031f3 ] Oleg noticed that our checking of data.got_token is unsafe in the cleanup case, and should really use a memory barrier. Use a wmb on the write side, and a rmb() on the read side. We don't need one in the main loop since we're saved by set_current_state(). Reviewed-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-08-16rq-qos: set ourself TASK_UNINTERRUPTIBLE after we scheduleJosef Bacik
[ Upstream commit d14a9b389a86a5154b704bc88ce8dd37c701456a ] In case we get a spurious wakeup we need to make sure to re-set ourselves to TASK_UNINTERRUPTIBLE so we don't busy wait. Reviewed-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-08-16rq-qos: don't reset has_sleepers on spurious wakeupsJosef Bacik
[ Upstream commit 64e7ea875ef63b2801be7954cf7257d1bfccc266 ] If we raced with somebody else getting an inflight counter we could fail to get an inflight counter with no sleepers on the list, and thus need to go to sleep. In this case has_sleepers should be true because we are now relying on the waker to get our inflight counter for us. And in the case of spurious wakeups we'd still want this to be the case. So set has_sleepers to true if we went to sleep to make sure we're woken up the proper way. Reviewed-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-07-31block/bio-integrity: fix a memory leak bugWenwen Wang
[ Upstream commit e7bf90e5afe3aa1d1282c1635a49e17a32c4ecec ] In bio_integrity_prep(), a kernel buffer is allocated through kmalloc() to hold integrity metadata. Later on, the buffer will be attached to the bio structure through bio_integrity_add_page(), which returns the number of bytes of integrity metadata attached. Due to unexpected situations, bio_integrity_add_page() may return 0. As a result, bio_integrity_prep() needs to be terminated with 'false' returned to indicate this error. However, the allocated kernel buffer is not freed on this execution path, leading to a memory leak. To fix this issue, free the allocated buffer before returning from bio_integrity_prep(). Reviewed-by: Ming Lei <ming.lei@redhat.com> Acked-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Wenwen Wang <wenwen@cs.uga.edu> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-07-31block: init flush rq ref count to 1Josef Bacik
[ Upstream commit b554db147feea39617b533ab6bca247c91c6198a ] We discovered a problem in newer kernels where a disconnect of a NBD device while the flush request was pending would result in a hang. This is because the blk mq timeout handler does if (!refcount_inc_not_zero(&rq->ref)) return true; to determine if it's ok to run the timeout handler for the request. Flush_rq's don't have a ref count set, so we'd skip running the timeout handler for this request and it would just sit there in limbo forever. Fix this by always setting the refcount of any request going through blk_init_rq() to 1. I tested this with a nbd-server that dropped flush requests to verify that it hung, and then tested with this patch to verify I got the timeout as expected and the error handling kicked in. Thanks, Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-07-28block: Limit zone array allocation sizeDamien Le Moal
commit 26202928fafad8bda8b478edb7e62c885be623d7 upstream. Limit the size of the struct blk_zone array used in blk_revalidate_disk_zones() to avoid memory allocation failures leading to disk revalidation failure. Also further reduce the likelyhood of such failures by using kvcalloc() (that is vmalloc()) instead of allocating contiguous pages with alloc_pages(). Fixes: 515ce6061312 ("scsi: sd_zbc: Fix sd_zbc_report_zones() buffer allocation") Fixes: e76239a3748c ("block: add a report_zones method") Cc: stable@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-07-26blkcg: update blkcg_print_stat() to handle larger outputsTejun Heo
commit f539da82f2158916e154d206054e0efd5df7ab61 upstream. Depending on the number of devices, blkcg stats can go over the default seqfile buf size. seqfile normally retries with a larger buffer but since the ->pd_stat() addition, blkcg_print_stat() doesn't tell seqfile that overflow has happened and the output gets printed truncated. Fix it by calling seq_commit() w/ -1 on possible overflows. Signed-off-by: Tejun Heo <tj@kernel.org> Fixes: 903d23f0a354 ("blk-cgroup: allow controllers to output their own stats") Cc: stable@vger.kernel.org # v4.19+ Cc: Josef Bacik <jbacik@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-07-26blk-iolatency: clear use_delay when io.latency is set to zeroTejun Heo
commit 5de0073fcd50cc1f150895a7bb04d3cf8067b1d7 upstream. If use_delay was non-zero when the latency target of a cgroup was set to zero, it will stay stuck until io.latency is enabled on the cgroup again. This keeps readahead disabled for the cgroup impacting performance negatively. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Josef Bacik <jbacik@fb.com> Fixes: d70675121546 ("block: introduce blk-iolatency io controller") Cc: stable@vger.kernel.org # v4.19+ Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-07-26blk-throttle: fix zero wait time for iops throttled groupKonstantin Khlebnikov
commit 3a10f999ffd464d01c5a05592a15470a3c4bbc36 upstream. After commit 991f61fe7e1d ("Blk-throttle: reduce tail io latency when iops limit is enforced") wait time could be zero even if group is throttled and cannot issue requests right now. As a result throtl_select_dispatch() turns into busy-loop under irq-safe queue spinlock. Fix is simple: always round up target time to the next throttle slice. Fixes: 991f61fe7e1d ("Blk-throttle: reduce tail io latency when iops limit is enforced") Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Cc: stable@vger.kernel.org # v4.19+ Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-07-26block: Fix potential overflow in blk_report_zones()Damien Le Moal
commit 113ab72ed4794c193509a97d7c6d32a6886e1682 upstream. For large values of the number of zones reported and/or large zone sizes, the sector increment calculated with blk_queue_zone_sectors(q) * n in blk_report_zones() loop can overflow the unsigned int type used for the calculation as both "n" and blk_queue_zone_sectors() value are unsigned int. E.g. for a device with 256 MB zones (524288 sectors), overflow happens with 8192 or more zones reported. Changing the return type of blk_queue_zone_sectors() to sector_t, fixes this problem and avoids overflow problem for all other callers of this helper too. The same change is also applied to the bdev_zone_sectors() helper. Fixes: e76239a3748c ("block: add a report_zones method") Cc: stable@vger.kernel.org Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-07-26block: Allow mapping of vmalloc-ed buffersDamien Le Moal
commit b4c5875d36178e8df409bdce232f270cac89fafe upstream. To allow the SCSI subsystem scsi_execute_req() function to issue requests using large buffers that are better allocated with vmalloc() rather than kmalloc(), modify bio_map_kern() to allow passing a buffer allocated with vmalloc(). To do so, detect vmalloc-ed buffers using is_vmalloc_addr(). For vmalloc-ed buffers, flush the buffer using flush_kernel_vmap_range(), use vmalloc_to_page() instead of virt_to_page() to obtain the pages of the buffer, and invalidate the buffer addresses with invalidate_kernel_vmap_range() on completion of read BIOs. This last point is executed using the function bio_invalidate_vmalloc_pages() which is defined only if the architecture defines ARCH_HAS_FLUSH_KERNEL_DCACHE_PAGE, that is, if the architecture actually needs the invalidation done. Fixes: 515ce6061312 ("scsi: sd_zbc: Fix sd_zbc_report_zones() buffer allocation") Fixes: e76239a3748c ("block: add a report_zones method") Cc: stable@vger.kernel.org Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-07-26blk-iolatency: fix STS_AGAIN handlingDennis Zhou
[ Upstream commit c9b3007feca018d3f7061f5d5a14cb00766ffe9b ] The iolatency controller is based on rq_qos. It increments on rq_qos_throttle() and decrements on either rq_qos_cleanup() or rq_qos_done_bio(). a3fb01ba5af0 fixes the double accounting issue where blk_mq_make_request() may call both rq_qos_cleanup() and rq_qos_done_bio() on REQ_NO_WAIT. So checking STS_AGAIN prevents the double decrement. The above works upstream as the only way we can get STS_AGAIN is from blk_mq_get_request() failing. The STS_AGAIN handling isn't a real problem as bio_endio() skipping only happens on reserved tag allocation failures which can only be caused by driver bugs and already triggers WARN. However, the fix creates a not so great dependency on how STS_AGAIN can be propagated. Internally, we (Facebook) carry a patch that kills read ahead if a cgroup is io congested or a fatal signal is pending. This combined with chained bios progagate their bi_status to the parent is not already set can can cause the parent bio to not clean up properly even though it was successful. This consequently leaks the inflight counter and can hang all IOs under that blkg. To nip the adverse interaction early, this removes the rq_qos_cleanup() callback in iolatency in favor of cleaning up always on the rq_qos_done_bio() path. Fixes: a3fb01ba5af0 ("blk-iolatency: only account submitted bios") Debugged-by: Tejun Heo <tj@kernel.org> Debugged-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Dennis Zhou <dennis@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-07-26block, bfq: fix rq_in_driver check in bfq_update_inject_limitPaolo Valente
[ Upstream commit db599f9ed9bd31b018b6c48ad7c6b21d5b790ecf ] One of the cases where the parameters for injection may be updated is when there are no more in-flight I/O requests. The number of in-flight requests is stored in the field bfqd->rq_in_driver of the descriptor bfqd of the device. So, the controlled condition is bfqd->rq_in_driver == 0. Unfortunately, this is wrong because, the instruction that checks this condition is in the code path that handles the completion of a request, and, in particular, the instruction is executed before bfqd->rq_in_driver is decremented in such a code path. This commit fixes this issue by just replacing 0 with 1 in the comparison. Reported-by: Srivatsa S. Bhat (VMware) <srivatsa@csail.mit.edu> Tested-by: Srivatsa S. Bhat (VMware) <srivatsa@csail.mit.edu> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-07-26blk-iolatency: only account submitted biosDennis Zhou
[ Upstream commit a3fb01ba5af066521f3f3421839e501bb2c71805 ] As is, iolatency recognizes done_bio and cleanup as ending paths. If a request is marked REQ_NOWAIT and fails to get a request, the bio is cleaned up via rq_qos_cleanup() and ended in bio_wouldblock_error(). This results in underflowing the inflight counter. Fix this by only accounting bios that were actually submitted. Signed-off-by: Dennis Zhou <dennis@kernel.org> Cc: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-07-14block, bfq: NULL out the bic when it's no longer validDouglas Anderson
commit dbc3117d4ca9e17819ac73501e914b8422686750 upstream. In reboot tests on several devices we were seeing a "use after free" when slub_debug or KASAN was enabled. The kernel complained about: Unable to handle kernel paging request at virtual address 6b6b6c2b ...which is a classic sign of use after free under slub_debug. The stack crawl in kgdb looked like: 0 test_bit (addr=<optimized out>, nr=<optimized out>) 1 bfq_bfqq_busy (bfqq=<optimized out>) 2 bfq_select_queue (bfqd=<optimized out>) 3 __bfq_dispatch_request (hctx=<optimized out>) 4 bfq_dispatch_request (hctx=<optimized out>) 5 0xc056ef00 in blk_mq_do_dispatch_sched (hctx=0xed249440) 6 0xc056f728 in blk_mq_sched_dispatch_requests (hctx=0xed249440) 7 0xc0568d24 in __blk_mq_run_hw_queue (hctx=0xed249440) 8 0xc0568d94 in blk_mq_run_work_fn (work=<optimized out>) 9 0xc024c5c4 in process_one_work (worker=0xec6d4640, work=0xed249480) 10 0xc024cff4 in worker_thread (__worker=0xec6d4640) Digging in kgdb, it could be found that, though bfqq looked fine, bfqq->bic had been freed. Through further digging, I postulated that perhaps it is illegal to access a "bic" (AKA an "icq") after bfq_exit_icq() had been called because the "bic" can be freed at some point in time after this call is made. I confirmed that there certainly were cases where the exact crashing code path would access the "bic" after bfq_exit_icq() had been called. Sspecifically I set the "bfqq->bic" to (void *)0x7 and saw that the bic was 0x7 at the time of the crash. To understand a bit more about why this crash was fairly uncommon (I saw it only once in a few hundred reboots), you can see that much of the time bfq_exit_icq_fbqq() fully frees the bfqq and thus it can't access the ->bic anymore. The only case it doesn't is if bfq_put_queue() sees a reference still held. However, even in the case when bfqq isn't freed, the crash is still rare. Why? I tracked what happened to the "bic" after the exit routine. It doesn't get freed right away. Rather, put_io_context_active() eventually called put_io_context() which queued up freeing on a workqueue. The freeing then actually happened later than that through call_rcu(). Despite all these delays, some extra debugging showed that all the hoops could be jumped through in time and the memory could be freed causing the original crash. Phew! To make a long story short, assuming it truly is illegal to access an icq after the "exit_icq" callback is finished, this patch is needed. Cc: stable@vger.kernel.org Reviewed-by: Paolo Valente <paolo.valente@unimore.it> Signed-off-by: Douglas Anderson <dianders@chromium.org> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-07-14block: fix .bi_size overflowMing Lei
commit 79d08f89bb1b5c2c1ff90d9bb95497ab9e8aa7e0 upstream. 'bio->bi_iter.bi_size' is 'unsigned int', which at most hold 4G - 1 bytes. Before 07173c3ec276 ("block: enable multipage bvecs"), one bio can include very limited pages, and usually at most 256, so the fs bio size won't be bigger than 1M bytes most of times. Since we support multi-page bvec, in theory one fs bio really can be added > 1M pages, especially in case of hugepage, or big writeback with too many dirty pages. Then there is chance in which .bi_size is overflowed. Fixes this issue by using bio_full() to check if the added segment may overflow .bi_size. Cc: Liu Yiding <liuyd.fnst@cn.fujitsu.com> Cc: kernel test robot <rong.a.chen@intel.com> Cc: "Darrick J. Wong" <darrick.wong@oracle.com> Cc: linux-xfs@vger.kernel.org Cc: linux-fsdevel@vger.kernel.org Cc: stable@vger.kernel.org Fixes: 07173c3ec276 ("block: enable multipage bvecs") Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-07-06blk-mq: fix up placement of debugfs directory of queue filesGreg Kroah-Hartman
When the blk-mq debugfs file creation logic was "cleaned up" it was cleaned up too much, causing the queue file to not be created in the correct location. Turns out the check for the directory being present is needed as if that has not happened yet, the files should not be created, and the function will be called later on in the initialization code so that the files can be created in the correct location. Fixes: 6cfc0081b046 ("blk-mq: no need to check return value of debugfs_create functions") Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Cc: linux-block@vger.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-25block, bfq: fix operator in BFQQ_TOTALLY_SEEKYPaolo Valente
By mistake, there is a '&' instead of a '==' in the definition of the macro BFQQ_TOTALLY_SEEKY. This commit replaces the wrong operator with the correct one. Fixes: 7074f076ff15 ("block, bfq: do not tag totally seeky queues as soft rt") Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-17block: fix page leak when merging to same pageChristoph Hellwig
When multiple iovecs reference the same page, each get_user_page call will add a reference to the page. But once we've created the bio that information gets lost and only a single reference will be dropped after I/O completion. Use the same_page information returned from __bio_try_merge_page to drop additional references to pages that were already present in the bio. Based on a patch from Ming Lei. Link: https://lkml.org/lkml/2019/4/23/64 Fixes: 576ed913 ("block: use bio_add_page in bio_iov_iter_get_pages") Reported-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-17block: return from __bio_try_merge_page if merging occured in the same pageChristoph Hellwig
We currently have an input same_page parameter to __bio_try_merge_page to prohibit merging in the same page. The rationale for that is that some callers need to account for every page added to a bio. Instead of letting these callers call twice into the merge code to account for the new vs existing page cases, just turn the paramter into an output one that returns if a merge in the same page occured and let them act accordingly. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-13blk-mq: remove WARN_ON(!q->elevator) from blk_mq_sched_free_requestsMing Lei
blk_mq_sched_free_requests() may be called in failure path in which q->elevator may not be setup yet, so remove WARN_ON(!q->elevator) from blk_mq_sched_free_requests for avoiding the false positive. This function is actually safe to call in case of !q->elevator because hctx->sched_tags is checked. Cc: Bart Van Assche <bvanassche@acm.org> Cc: Christoph Hellwig <hch@lst.de> Cc: Yi Zhang <yi.zhang@redhat.com> Fixes: c3e2219216c9 ("block: free sched's request pool in blk_cleanup_queue") Reported-by: syzbot+b9d0d56867048c7bcfde@syzkaller.appspotmail.com Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-13blk-mq: no need to check return value of debugfs_create functionsGreg Kroah-Hartman
When calling debugfs functions, there is no need to ever check the return value. The function can work or not, but the code logic should never do something different based on this. When all of these checks are cleaned up, lots of the functions used in the blk-mq-debugfs code can now return void, as no need to check the return value of them either. Overall, this ends up cleaning up the code and making it smaller, always a nice win. Cc: Jens Axboe <axboe@kernel.dk> Cc: linux-block@vger.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-13block: force select mq-deadline for zoned block devicesDamien Le Moal
In most use cases of zoned block devices (aka SMR disks), the mq-deadline scheduler is mandatory as it implements sequential write command processing guarantees with zone write locking. So make sure that this scheduler is always enabled if CONFIG_BLK_DEV_ZONED is selected. Tested-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-10cgroup/bfq: revert bfq.weight symlink changeJens Axboe
There's some discussion on how to do this the best, and Tejun prefers that BFQ just create the file itself instead of having cgroups support a symlink feature. Hence revert commit 54b7b868e826 and 19e9da9e86c4 for 5.2, and this can be done properly for 5.3. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-07block, bfq: add weight symlink to the bfq.weight cgroup parameterAngelo Ruocco
Many userspace tools and services use the proportional-share policy of the blkio/io cgroups controller. The CFQ I/O scheduler implemented this policy for the legacy block layer. To modify the weight of a group in case CFQ was in charge, the 'weight' parameter of the group must be modified. On the other hand, the BFQ I/O scheduler implements the same policy in blk-mq, but, with BFQ, the parameter to modify has a different name: bfq.weight (forced choice until legacy block was present, because two different policies cannot share a common parameter in cgroups). Due to CFQ legacy, most if not all userspace configurations still use the parameter 'weight', and for the moment do not seem likely to be changed. But, when CFQ went away with legacy block, such a parameter ceased to exist. So, a simple workaround has been proposed [1] to make all configurations work: add a symlink, named weight, to bfq.weight. This commit adds such a symlink. [1] https://lkml.org/lkml/2019/4/8/555 Suggested-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Angelo Ruocco <angeloruocco90@gmail.com> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-06block: free sched's request pool in blk_cleanup_queueMing Lei
In theory, IO scheduler belongs to request queue, and the request pool of sched tags belongs to the request queue too. However, the current tags allocation interfaces are re-used for both driver tags and sched tags, and driver tags is definitely host wide, and doesn't belong to any request queue, same with its request pool. So we need tagset instance for freeing request of sched tags. Meantime, blk_mq_free_tag_set() often follows blk_cleanup_queue() in case of non-BLK_MQ_F_TAG_SHARED, this way requires that request pool of sched tags to be freed before calling blk_mq_free_tag_set(). Commit 47cdee29ef9d94e ("block: move blk_exit_queue into __blk_release_queue") moves blk_exit_queue into __blk_release_queue for simplying the fast path in generic_make_request(), then causes oops during freeing requests of sched tags in __blk_release_queue(). Fix the above issue by move freeing request pool of sched tags into blk_cleanup_queue(), this way is safe becasue queue has been frozen and no any in-queue requests at that time. Freeing sched tags has to be kept in queue's release handler becasue there might be un-completed dispatch activity which might refer to sched tags. Cc: Bart Van Assche <bvanassche@acm.org> Cc: Christoph Hellwig <hch@lst.de> Fixes: 47cdee29ef9d94e485eb08f962c74943023a5271 ("block: move blk_exit_queue into __blk_release_queue") Tested-by: Yi Zhang <yi.zhang@redhat.com> Reported-by: kernel test robot <rong.a.chen@intel.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-05block: Drop unlikely before IS_ERR(_OR_NULL)Kefeng Wang
IS_ERR(_OR_NULL) already contain an 'unlikely' compiler flag, so no need to do that again from its callers. Drop it. Cc: Jens Axboe <axboe@kernel.dk> Cc: linux-block@vger.kernel.org Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-05-31block: print offending values when cloned rq limits are exceededJohn Pittman
While troubleshooting issues where cloned request limits have been exceeded, it is often beneficial to know the actual values that have been breached. Print these values, assisting in ease of identification of root cause of the breach. Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: John Pittman <jpittman@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-05-31blk-mq: Document the blk_mq_hw_queue_to_node() argumentsBart Van Assche
Document the meaning of the blk_mq_hw_queue_to_node() arguments. Reviewed-by: Chaitanya Kulkarni <chiatanya.kulkarni@wdc.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-05-31blk-mq: Fix spelling in a source code commentBart Van Assche
Change one occurrence of 'performace' into 'performance'. Cc: Max Gurtovoy <maxg@mellanox.com> Fixes: fe631457ff3e ("blk-mq: map all HWQ also in hyperthreaded system") # v4.13. Reviewed-by: Chaitanya Kulkarni <chiatanya.kulkarni@wdc.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-05-31block: Fix bsg_setup_queue() kernel-doc headerBart Van Assche
Document all bsg_setup_queue() arguments as required. Fixes: aae3b069d5ce ("bsg: pass in desired timeout handler") # v5.0. Reviewed-by: Chaitanya Kulkarni <chiatanya.kulkarni@wdc.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-05-31block: Fix rq_qos_wait() kernel-doc headerBart Van Assche
Add documentation for the @rqw argument and change " - " into ": ". Fixes: 84f603246db9 ("block: add rq_qos_wait to rq_qos") # v5.0-rc1~52^2~140. Reviewed-by: Chaitanya Kulkarni <chiatanya.kulkarni@wdc.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-05-31block: Fix blk_mq_*_map_queues() kernel-doc headersBart Van Assche
This patch avoids that the kernel-doc script complains about these function headers when building with W=1. Cc: Hannes Reinecke <hare@suse.com> Cc: Keith Busch <keith.busch@intel.com> Fixes: ed76e329d74a ("blk-mq: abstract out queue map") # v5.0. Fixes: e42b3867de4b ("blk-mq-rdma: pass in queue map to blk_mq_rdma_map_queues") # v5.0. Reviewed-by: Chaitanya Kulkarni <chiatanya.kulkarni@wdc.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-05-31block: Fix throtl_pending_timer_fn() kernel-doc headerBart Van Assche
Commit e99e88a9d2b0 renamed a function argument without updating the corresponding kernel-doc header. Update the kernel-doc header. Reviewed-by: Chaitanya Kulkarni <chiatanya.kulkarni@wdc.com> Reviewed-by: Kees Cook <keescook@chromium.org> Fixes: e99e88a9d2b0 ("treewide: setup_timer() -> timer_setup()") # v4.15. Signed-off-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-05-31block: Convert blk_invalidate_devt() header into a non-kernel-doc headerBart Van Assche
This patch avoids that the kernel-doc tool warns about this function header when building with W=1. Reviewed-by: Chaitanya Kulkarni <chiatanya.kulkarni@wdc.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-05-31block/partitions/ldm: Convert a kernel-doc header into a non-kernel-doc headerBart Van Assche
This patch avoids that the kernel-doc tool warns about this function header when building with W=1. Reviewed-by: Chaitanya Kulkarni <chiatanya.kulkarni@wdc.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-05-29blk-mq: Fix memory leak in error handlingJes Sorensen
If blk_mq_init_allocated_queue() fails, make sure to free the poll stat callback struct allocated. Signed-off-by: Jes Sorensen <jsorensen@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-05-29block: don't protect generic_make_request_checks with blk_queue_enterMing Lei
Now a063057d7c73 ("block: Fix a race between request queue removal and the block cgroup controller") has been reverted, and blkcg_exit_queue() won't be called in blk_cleanup_queue() any more. So don't need to protect generic_make_request_checks() with blk_queue_enter(), then the total mess can be cleaned. 37f9579f4c31 ("blk-mq: Avoid that submitting a bio concurrently with device removal triggers a crash") is reverted. Cc: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-05-29block: move blk_exit_queue into __blk_release_queueMing Lei
Commit 498f6650aec8 ("block: Fix a race between the cgroup code and request queue initialization") moves what blk_exit_queue does into blk_cleanup_queue() for fixing issue caused by changing back queue lock. However, after legacy request IO path is killed, driver queue lock won't be used at all, and there isn't story for changing back queue lock. Then the issue addressed by Commit 498f6650aec8 doesn't exist any more. So move move blk_exit_queue into __blk_release_queue. This patch basically reverts the following two commits: 498f6650aec8 block: Fix a race between the cgroup code and request queue initialization 24ecc3585348 block: Ensure that a request queue is dissociated from the cgroup controller Cc: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-05-23blk-mq: fix hang caused by freeze/unfreeze sequenceBob Liu
The following is a description of a hang in blk_mq_freeze_queue_wait(). The hang happens on attempt to freeze a queue while another task does queue unfreeze. The root cause is an incorrect sequence of percpu_ref_resurrect() and percpu_ref_kill() and as a result those two can be swapped: CPU#0 CPU#1 ---------------- ----------------- q1 = blk_mq_init_queue(shared_tags) q2 = blk_mq_init_queue(shared_tags): blk_mq_add_queue_tag_set(shared_tags): blk_mq_update_tag_set_depth(shared_tags): list_for_each_entry() blk_mq_freeze_queue(q1) > percpu_ref_kill() > blk_mq_freeze_queue_wait() blk_cleanup_queue(q1) blk_mq_freeze_queue(q1) > percpu_ref_kill() ^^^^^^ freeze_depth can't guarantee the order blk_mq_unfreeze_queue() > percpu_ref_resurrect() > blk_mq_freeze_queue_wait() ^^^^^^ Hang here!!!! This wrong sequence raises kernel warning: percpu_ref_kill_and_confirm called more than once on blk_queue_usage_counter_release! WARNING: CPU: 0 PID: 11854 at lib/percpu-refcount.c:336 percpu_ref_kill_and_confirm+0x99/0xb0 But the most unpleasant effect is a hang of a blk_mq_freeze_queue_wait(), which waits for a zero of a q_usage_counter, which never happens because percpu-ref was reinited (instead of being killed) and stays in PERCPU state forever. How to reproduce: - "insmod null_blk.ko shared_tags=1 nr_devices=0 queue_mode=2" - cpu0: python Script.py 0; taskset the corresponding process running on cpu0 - cpu1: python Script.py 1; taskset the corresponding process running on cpu1 Script.py: ------ #!/usr/bin/python3 import os import sys while True: on = "echo 1 > /sys/kernel/config/nullb/%s/power" % sys.argv[1] off = "echo 0 > /sys/kernel/config/nullb/%s/power" % sys.argv[1] os.system(on) os.system(off) ------ This bug was first reported and fixed by Roman, previous discussion: [1] Message id: 1443287365-4244-7-git-send-email-akinobu.mita@gmail.com [2] Message id: 1443563240-29306-6-git-send-email-tj@kernel.org [3] https://patchwork.kernel.org/patch/9268199/ Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Roman Pen <roman.penyaev@profitbricks.com> Signed-off-by: Bob Liu <bob.liu@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-05-23block: remove the bi_seg_{front,back}_size fields in struct bioChristoph Hellwig
At this point these fields aren't used for anything, so we can remove them. Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-05-23block: remove the segment size check in bio_will_gapChristoph Hellwig
We fundamentally do not have a maximum segement size for devices with a virt boundary. So don't bother checking it, especially given that the existing checks didn't properly work to start with as we never fully update the front/back segment size and miss the bi_seg_front_size that wuld have been required for some cases. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-05-23block: force an unlimited segment size on queues with a virt boundaryChristoph Hellwig
We currently fail to update the front/back segment size in the bio when deciding to allow an otherwise gappy segement to a device with a virt boundary. The reason why this did not cause problems is that devices with a virt boundary fundamentally don't use segments as we know it and thus don't care. Make that assumption formal by forcing an unlimited segement size in this case. Fixes: f6970f83ef79 ("block: don't check if adjacent bvecs in one bio can be mergeable") Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-05-23block: don't decrement nr_phys_segments for physically contigous segmentsChristoph Hellwig
Currently ll_merge_requests_fn, unlike all other merge functions, reduces nr_phys_segments by one if the last segment of the previous, and the first segment of the next segement are contigous. While this seems like a nice solution to avoid building smaller than possible requests it causes a mismatch between the segments actually present in the request and those iterated over by the bvec iterators, including __rq_for_each_bio. This can for example mistrigger the single segment optimization in the nvme-pci driver, and might lead to mismatching nr_phys_segments number when recalculating the number of request when inserting a cloned request. We could possibly work around this by making the bvec iterators take the front and back segment size into account, but that would require moving them from the bio to the bio_iter and spreading this mess over all users of bvecs. Or we could simply remove this optimization under the assumption that most users already build good enough bvecs, and that the bio merge patch never cared about this optimization either. The latter is what this patch does. dff824b2aadb ("nvme-pci: optimize mapping of small single segment requests"). Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-05-16Merge tag 'for-5.2/block-post-20190516' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull more block updates from Jens Axboe: "This is mainly some late lightnvm changes that came in just before the merge window, as well as fixes that have been queued up since the initial pull request was frozen. This contains: - lightnvm changes, fixing race conditions, improving memory utilization, and improving pblk compatability (Chansol, Igor, Marcin) - NVMe pull request with minor fixes all over the map (via Christoph) - remove redundant error print in sata_rcar (Geert) - struct_size() cleanup (Jackie) - dasd CONFIG_LBADF warning fix (Ming) - brd cond_resched() improvement (Mikulas)" * tag 'for-5.2/block-post-20190516' of git://git.kernel.dk/linux-block: (41 commits) block/bio-integrity: use struct_size() in kmalloc() nvme: validate cntlid during controller initialisation nvme: change locking for the per-subsystem controller list nvme: trace all async notice events nvme: fix typos in nvme status code values nvme-fabrics: remove unused argument nvme-multipath: avoid crash on invalid subsystem cntlid enumeration nvme-fc: use separate work queue to avoid warning nvme-rdma: remove redundant reference between ib_device and tagset nvme-pci: mark expected switch fall-through nvme-pci: add known admin effects to augument admin effects log page nvme-pci: init shadow doorbell after each reset brd: add cond_resched to brd_free_pages sata_rcar: Remove ata_host_alloc() error printing s390/dasd: fix build warning in dasd_eckd_build_cp_raw lightnvm: pblk: use nvm_rq_to_ppa_list() lightnvm: pblk: simplify partial read path lightnvm: do not remove instance under global lock lightnvm: track inflight target creations lightnvm: pblk: recover only written metadata ...
2019-05-16block/bio-integrity: use struct_size() in kmalloc()Jackie Liu
Use the new struct_size() helper to keep code simple. Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Jackie Liu <liuyun01@kylinos.cn> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-05-07Merge tag 'for-5.2/block-20190507' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block updates from Jens Axboe: "Nothing major in this series, just fixes and improvements all over the map. This contains: - Series of fixes for sed-opal (David, Jonas) - Fixes and performance tweaks for BFQ (via Paolo) - Set of fixes for bcache (via Coly) - Set of fixes for md (via Song) - Enabling multi-page for passthrough requests (Ming) - Queue release fix series (Ming) - Device notification improvements (Martin) - Propagate underlying device rotational status in loop (Holger) - Removal of mtip32xx trim support, which has been disabled for years (Christoph) - Improvement and cleanup of nvme command handling (Christoph) - Add block SPDX tags (Christoph) - Cleanup/hardening of bio/bvec iteration (Christoph) - A few NVMe pull requests (Christoph) - Removal of CONFIG_LBDAF (Christoph) - Various little fixes here and there" * tag 'for-5.2/block-20190507' of git://git.kernel.dk/linux-block: (164 commits) block: fix mismerge in bvec_advance block: don't drain in-progress dispatch in blk_cleanup_queue() blk-mq: move cancel of hctx->run_work into blk_mq_hw_sysfs_release blk-mq: always free hctx after request queue is freed blk-mq: split blk_mq_alloc_and_init_hctx into two parts blk-mq: free hw queue's resource in hctx's release handler blk-mq: move cancel of requeue_work into blk_mq_release blk-mq: grab .q_usage_counter when queuing request from plug code path block: fix function name in comment nvmet: protect discovery change log event list iteration nvme: mark nvme_core_init and nvme_core_exit static nvme: move command size checks to the core nvme-fabrics: check more command sizes nvme-pci: check more command sizes nvme-pci: remove an unneeded variable initialization nvme-pci: unquiesce admin queue on shutdown nvme-pci: shutdown on timeout during deletion nvme-pci: fix psdt field for single segment sgls nvme-multipath: don't print ANA group state by default nvme-multipath: split bios with the ns_head bio_set before submitting ...
2019-05-07Merge tag 'driver-core-5.2-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core Pull driver core/kobject updates from Greg KH: "Here is the "big" set of driver core patches for 5.2-rc1 There are a number of ACPI patches in here as well, as Rafael said they should go through this tree due to the driver core changes they required. They have all been acked by the ACPI developers. There are also a number of small subsystem-specific changes in here, due to some changes to the kobject core code. Those too have all been acked by the various subsystem maintainers. As for content, it's pretty boring outside of the ACPI changes: - spdx cleanups - kobject documentation updates - default attribute groups for kobjects - other minor kobject/driver core fixes All have been in linux-next for a while with no reported issues" * tag 'driver-core-5.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (47 commits) kobject: clean up the kobject add documentation a bit more kobject: Fix kernel-doc comment first line kobject: Remove docstring reference to kset firmware_loader: Fix a typo ("syfs" -> "sysfs") kobject: fix dereference before null check on kobj Revert "driver core: platform: Fix the usage of platform device name(pdev->name)" init/config: Do not select BUILD_BIN2C for IKCONFIG Provide in-kernel headers to make extending kernel easier kobject: Improve doc clarity kobject_init_and_add() kobject: Improve docs for kobject_add/del driver core: platform: Fix the usage of platform device name(pdev->name) livepatch: Replace klp_ktype_patch's default_attrs with groups cpufreq: schedutil: Replace default_attrs field with groups padata: Replace padata_attr_type default_attrs field with groups irqdesc: Replace irq_kobj_type's default_attrs field with groups net-sysfs: Replace ktype default_attrs field with groups block: Replace all ktype default_attrs with groups samples/kobject: Replace foo_ktype's default_attrs field with groups kobject: Add support for default attribute groups to kobj_type driver core: Postpone DMA tear-down until after devres release for probe failure ...