summaryrefslogtreecommitdiffstats
path: root/block/blk-iolatency.c
AgeCommit message (Collapse)Author
2019-03-13blk-iolatency: fix IO hang due to negative inflight counterLiu Bo
[ Upstream commit 8c772a9bfc7c07c76f4a58b58910452fbb20843b ] Our test reported the following stack, and vmcore showed that ->inflight counter is -1. [ffffc9003fcc38d0] __schedule at ffffffff8173d95d [ffffc9003fcc3958] schedule at ffffffff8173de26 [ffffc9003fcc3970] io_schedule at ffffffff810bb6b6 [ffffc9003fcc3988] blkcg_iolatency_throttle at ffffffff813911cb [ffffc9003fcc3a20] rq_qos_throttle at ffffffff813847f3 [ffffc9003fcc3a48] blk_mq_make_request at ffffffff8137468a [ffffc9003fcc3b08] generic_make_request at ffffffff81368b49 [ffffc9003fcc3b68] submit_bio at ffffffff81368d7d [ffffc9003fcc3bb8] ext4_io_submit at ffffffffa031be00 [ext4] [ffffc9003fcc3c00] ext4_writepages at ffffffffa03163de [ext4] [ffffc9003fcc3d68] do_writepages at ffffffff811c49ae [ffffc9003fcc3d78] __filemap_fdatawrite_range at ffffffff811b6188 [ffffc9003fcc3e30] filemap_write_and_wait_range at ffffffff811b6301 [ffffc9003fcc3e60] ext4_sync_file at ffffffffa030cee8 [ext4] [ffffc9003fcc3ea8] vfs_fsync_range at ffffffff8128594b [ffffc9003fcc3ee8] do_fsync at ffffffff81285abd [ffffc9003fcc3f18] sys_fsync at ffffffff81285d50 [ffffc9003fcc3f28] do_syscall_64 at ffffffff81003c04 [ffffc9003fcc3f50] entry_SYSCALL_64_after_swapgs at ffffffff81742b8e The ->inflight counter may be negative (-1) if 1) blk-iolatency was disabled when the IO was issued, 2) blk-iolatency was enabled before this IO reached its endio, 3) the ->inflight counter is decreased from 0 to -1 in endio() In fact the hang can be easily reproduced by the below script, H=/sys/fs/cgroup/unified/ P=/sys/fs/cgroup/unified/test echo "+io" > $H/cgroup.subtree_control mkdir -p $P echo $$ > $P/cgroup.procs xfs_io -f -d -c "pwrite 0 4k" /dev/sdg echo "`cat /sys/block/sdg/dev` target=1000000" > $P/io.latency xfs_io -f -d -c "pwrite 0 4k" /dev/sdg This fixes the problem by freezing the queue so that while enabling/disabling iolatency, there is no inflight rq running. Note that quiesce_queue is not needed as this only updating iolatency configuration about which dispatching request_queue doesn't care. Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2018-08-02block: make iolatency avg_lat exponentially decayDennis Zhou (Facebook)
Currently, avg_lat is calculated by accumulating the mean of every window in a long running cumulative average. As time goes on, the metric becomes less and less useful due to the accumulated history. This patch reuses the same calculation done in load averages to make the avg_lat metric more lively. Unlike load averages, the avg only advances when a window elapses (due to an io). Idle periods extend the most recent window. Bucketing is used to limit the history of avg_lat by binding it to the window size. So, the window range for 1/exp (decay rate) is [1 min, 2.5 min) when windows elapse immediately. The current sample window size is exposed in the debug info to enable calculation of the window range. Signed-off-by: Dennis Zhou <dennisszhou@gmail.com> Acked-by: Tejun Heo <tj@kernel.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-01blk-iolatency: fix blkg leak in timer_fnJosef Bacik
At this point we have a ref on the blkg, we need to drop it if we don't have a iolat. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-16blk-iolatency: truncate our current timeJosef Bacik
In our longer tests we noticed that some boxes would degrade to the point of uselessness. This is because we truncate the current time when saving it in our bio, but I was using the raw current time to subtract from. So once the box had been up a certain amount of time it would appear as if our IO's were taking several years to complete. Fix this by truncating the current time so it matches the issue time. Verified this worked by running with this patch for a week on our test tier. Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-16blk-iolatency: don't change the latency windowJosef Bacik
Early versions of these patches had us waiting for seconds at a time during submission, so we had to adjust the timing window we monitored for latency. Now we don't do things like that so this is unnecessary code. Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-11blk-iolatency: fix max_depth comparisonsJosef Bacik
max_depth used to be a u64, but I changed it to a unsigned int but didn't convert my comparisons over everywhere. Fix by using UINT_MAX everywhere instead of (u64)-1. Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-10block: iolatency: avoid 64-bit divisionArnd Bergmann
On 32-bit architectures, dividing a 64-bit number needs to use the do_div() function or something like it to avoid a link failure: block/blk-iolatency.o: In function `iolatency_prfill_limit': blk-iolatency.c:(.text+0x8cc): undefined reference to `__aeabi_uldivmod' Using div_u64() gives us the best output and avoids the need for an explicit cast. Fixes: d70675121546 ("block: introduce blk-iolatency io controller") Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-09block: introduce blk-iolatency io controllerJosef Bacik
Current IO controllers for the block layer are less than ideal for our use case. The io.max controller is great at hard limiting, but it is not work conserving. This patch introduces io.latency. You provide a latency target for your group and we monitor the io in short windows to make sure we are not exceeding those latency targets. This makes use of the rq-qos infrastructure and works much like the wbt stuff. There are a few differences from wbt - It's bio based, so the latency covers the whole block layer in addition to the actual io. - We will throttle all IO types that comes in here if we need to. - We use the mean latency over the 100ms window. This is because writes can be particularly fast, which could give us a false sense of the impact of other workloads on our protected workload. - By default there's no throttling, we set the queue_depth to INT_MAX so that we can have as many outstanding bio's as we're allowed to. Only at throttle time do we pay attention to the actual queue depth. - We backcharge cgroups for root cg issued IO and induce artificial delays in order to deal with cases like metadata only or swap heavy workloads. In testing this has worked out relatively well. Protected workloads will throttle noisy workloads down to 1 io at time if they are doing normal IO on their own, or induce up to a 1 second delay per syscall if they are doing a lot of root issued IO (metadata/swap IO). Our testing has revolved mostly around our production web servers where we have hhvm (the web server application) in a protected group and everything else in another group. We see slightly higher requests per second (RPS) on the test tier vs the control tier, and much more stable RPS across all machines in the test tier vs the control tier. Another test we run is a slow memory allocator in the unprotected group. Before this would eventually push us into swap and cause the whole box to die and not recover at all. With these patches we see slight RPS drops (usually 10-15%) before the memory consumer is properly killed and things recover within seconds. Signed-off-by: Josef Bacik <jbacik@fb.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>