summaryrefslogtreecommitdiffstats
path: root/block
AgeCommit message (Collapse)Author
2012-10-25block: Add blk_rq_pos(rq) to sort rq when plushingJianpeng Ma
My workload is a raid5 which had 16 disks. And used our filesystem to write using direct-io mode. I used the blktrace to find those message: 8,16 0 6647 2.453665504 2579 M W 7493152 + 8 [md0_raid5] 8,16 0 6648 2.453672411 2579 Q W 7493160 + 8 [md0_raid5] 8,16 0 6649 2.453672606 2579 M W 7493160 + 8 [md0_raid5] 8,16 0 6650 2.453679255 2579 Q W 7493168 + 8 [md0_raid5] 8,16 0 6651 2.453679441 2579 M W 7493168 + 8 [md0_raid5] 8,16 0 6652 2.453685948 2579 Q W 7493176 + 8 [md0_raid5] 8,16 0 6653 2.453686149 2579 M W 7493176 + 8 [md0_raid5] 8,16 0 6654 2.453693074 2579 Q W 7493184 + 8 [md0_raid5] 8,16 0 6655 2.453693254 2579 M W 7493184 + 8 [md0_raid5] 8,16 0 6656 2.453704290 2579 Q W 7493192 + 8 [md0_raid5] 8,16 0 6657 2.453704482 2579 M W 7493192 + 8 [md0_raid5] 8,16 0 6658 2.453715016 2579 Q W 7493200 + 8 [md0_raid5] 8,16 0 6659 2.453715247 2579 M W 7493200 + 8 [md0_raid5] 8,16 0 6660 2.453721730 2579 Q W 7493208 + 8 [md0_raid5] 8,16 0 6661 2.453721974 2579 M W 7493208 + 8 [md0_raid5] 8,16 0 6662 2.453728202 2579 Q W 7493216 + 8 [md0_raid5] 8,16 0 6663 2.453728436 2579 M W 7493216 + 8 [md0_raid5] 8,16 0 6664 2.453734782 2579 Q W 7493224 + 8 [md0_raid5] 8,16 0 6665 2.453735019 2579 M W 7493224 + 8 [md0_raid5] 8,16 0 6666 2.453741401 2579 Q W 7493232 + 8 [md0_raid5] 8,16 0 6667 2.453741632 2579 M W 7493232 + 8 [md0_raid5] 8,16 0 6668 2.453748148 2579 Q W 7493240 + 8 [md0_raid5] 8,16 0 6669 2.453748386 2579 M W 7493240 + 8 [md0_raid5] 8,16 0 6670 2.453851843 2579 I W 7493144 + 104 [md0_raid5] 8,16 0 0 2.453853661 0 m N cfq2579 insert_request 8,16 0 6671 2.453854064 2579 I W 7493120 + 24 [md0_raid5] 8,16 0 0 2.453854439 0 m N cfq2579 insert_request 8,16 0 6672 2.453854793 2579 U N [md0_raid5] 2 8,16 0 0 2.453855513 0 m N cfq2579 Not idling.st->count:1 8,16 0 0 2.453855927 0 m N cfq2579 dispatch_insert 8,16 0 0 2.453861771 0 m N cfq2579 dispatched a request 8,16 0 0 2.453862248 0 m N cfq2579 activate rq,drv=1 8,16 0 6673 2.453862332 2579 D W 7493120 + 24 [md0_raid5] 8,16 0 0 2.453865957 0 m N cfq2579 Not idling.st->count:1 8,16 0 0 2.453866269 0 m N cfq2579 dispatch_insert 8,16 0 0 2.453866707 0 m N cfq2579 dispatched a request 8,16 0 0 2.453867061 0 m N cfq2579 activate rq,drv=2 8,16 0 6674 2.453867145 2579 D W 7493144 + 104 [md0_raid5] 8,16 0 6675 2.454147608 0 C W 7493120 + 24 [0] 8,16 0 0 2.454149357 0 m N cfq2579 complete rqnoidle 0 8,16 0 6676 2.454791505 0 C W 7493144 + 104 [0] 8,16 0 0 2.454794803 0 m N cfq2579 complete rqnoidle 0 8,16 0 0 2.454795160 0 m N cfq schedule dispatch From above messages,we can find rq[W 7493144 + 104] and rq[W 7493120 + 24] do not merge. Because the bio order is: 8,16 0 6638 2.453619407 2579 Q W 7493144 + 8 [md0_raid5] 8,16 0 6639 2.453620460 2579 G W 7493144 + 8 [md0_raid5] 8,16 0 6640 2.453639311 2579 Q W 7493120 + 8 [md0_raid5] 8,16 0 6641 2.453639842 2579 G W 7493120 + 8 [md0_raid5] The bio(7493144) first and bio(7493120) later.So the subsequent bios will be divided into two parts. When flushing plug-list,because elv_attempt_insert_merge only support backmerge,not supporting frontmerge. So rq[7493120 + 24] can't merge with rq[7493144 + 104]. From my test,i found those situation can count 25% in our system. Using this patch, there is no this situation. Signed-off-by: Jianpeng Ma <majianpeng@gmail.com> CC:Shaohua Li <shli@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-10-23block: remove CONFIG_EXPERIMENTALKees Cook
This config item has not carried much meaning for a while now and is almost always enabled by default. As agreed during the Linux kernel summit, remove it. CC: Jens Axboe <axboe@kernel.dk> Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-10-22blkcg: stop iteration early if root_rl is the only request listJun'ichi Nomura
__blk_queue_next_rl() finds next request list based on blkg_list while skipping root_blkg in the list. OTOH, root_rl is special as it may exist even without root_blkg. Though the later part of the function handles such a case correctly, exiting early is good for readability of the code. Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Cc: Tejun Heo <tj@kernel.org> Acked-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-10-22blkcg: Fix use-after-free of q->root_blkg and q->root_rl.blkgJun'ichi Nomura
blk_put_rl() does not call blkg_put() for q->root_rl because we don't take request list reference on q->root_blkg. However, if root_blkg is once attached then detached (freed), blk_put_rl() is confused by the bogus pointer in q->root_blkg. For example, with !CONFIG_BLK_DEV_THROTTLING && CONFIG_CFQ_GROUP_IOSCHED, switching IO scheduler from cfq to deadline will cause system stall after the following warning with 3.6: > WARNING: at /work/build/linux/block/blk-cgroup.h:250 > blk_put_rl+0x4d/0x95() > Modules linked in: bridge stp llc sunrpc acpi_cpufreq freq_table mperf > ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 > Pid: 0, comm: swapper/0 Not tainted 3.6.0 #1 > Call Trace: > <IRQ> [<ffffffff810453bd>] warn_slowpath_common+0x85/0x9d > [<ffffffff810453ef>] warn_slowpath_null+0x1a/0x1c > [<ffffffff811d5f8d>] blk_put_rl+0x4d/0x95 > [<ffffffff811d614a>] __blk_put_request+0xc3/0xcb > [<ffffffff811d71a3>] blk_finish_request+0x232/0x23f > [<ffffffff811d76c3>] ? blk_end_bidi_request+0x34/0x5d > [<ffffffff811d76d1>] blk_end_bidi_request+0x42/0x5d > [<ffffffff811d7728>] blk_end_request+0x10/0x12 > [<ffffffff812cdf16>] scsi_io_completion+0x207/0x4d5 > [<ffffffff812c6fcf>] scsi_finish_command+0xfa/0x103 > [<ffffffff812ce2f8>] scsi_softirq_done+0xff/0x108 > [<ffffffff811dcea5>] blk_done_softirq+0x8d/0xa1 > [<ffffffff810915d5>] ? > generic_smp_call_function_single_interrupt+0x9f/0xd7 > [<ffffffff8104cf5b>] __do_softirq+0x102/0x213 > [<ffffffff8108a5ec>] ? lock_release_holdtime+0xb6/0xbb > [<ffffffff8104d2b4>] ? raise_softirq_irqoff+0x9/0x3d > [<ffffffff81424dfc>] call_softirq+0x1c/0x30 > [<ffffffff81011beb>] do_softirq+0x4b/0xa3 > [<ffffffff8104cdb0>] irq_exit+0x53/0xd5 > [<ffffffff8102d865>] smp_call_function_single_interrupt+0x34/0x36 > [<ffffffff8142486f>] call_function_single_interrupt+0x6f/0x80 > <EOI> [<ffffffff8101800b>] ? mwait_idle+0x94/0xcd > [<ffffffff81018002>] ? mwait_idle+0x8b/0xcd > [<ffffffff81017811>] cpu_idle+0xbb/0x114 > [<ffffffff81401fbd>] rest_init+0xc1/0xc8 > [<ffffffff81401efc>] ? csum_partial_copy_generic+0x16c/0x16c > [<ffffffff81cdbd3d>] start_kernel+0x3d4/0x3e1 > [<ffffffff81cdb79e>] ? kernel_init+0x1f7/0x1f7 > [<ffffffff81cdb2dd>] x86_64_start_reservations+0xb8/0xbd > [<ffffffff81cdb3e3>] x86_64_start_kernel+0x101/0x110 This patch clears q->root_blkg and q->root_rl.blkg when root blkg is destroyed. Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Acked-by: Vivek Goyal <vgoyal@redhat.com> Acked-by: Tejun Heo <tj@kernel.org> Cc: stable@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-10-11Merge branch 'for-3.7/core' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block IO update from Jens Axboe: "Core block IO bits for 3.7. Not a huge round this time, it contains: - First series from Kent cleaning up and generalizing bio allocation and freeing. - WRITE_SAME support from Martin. - Mikulas patches to prevent O_DIRECT crashes when someone changes the block size of a device. - Make bio_split() work on data-less bio's (like trim/discards). - A few other minor fixups." Fixed up silent semantic mis-merge as per Mikulas Patocka and Andrew Morton. It is due to the VM no longer using a prio-tree (see commit 6b2dbba8b6ac: "mm: replace vma prio_tree with an interval tree"). So make set_blocksize() use mapping_mapped() instead of open-coding the internal VM knowledge that has changed. * 'for-3.7/core' of git://git.kernel.dk/linux-block: (26 commits) block: makes bio_split support bio without data scatterlist: refactor the sg_nents scatterlist: add sg_nents fs: fix include/percpu-rwsem.h export error percpu-rw-semaphore: fix documentation typos fs/block_dev.c:1644:5: sparse: symbol 'blkdev_mmap' was not declared blockdev: turn a rw semaphore into a percpu rw semaphore Fix a crash when block device is read and block size is changed at the same time block: fix request_queue->flags initialization block: lift the initial queue bypass mode on blk_register_queue() instead of blk_init_allocated_queue() block: ioctl to zero block ranges block: Make blkdev_issue_zeroout use WRITE SAME block: Implement support for WRITE SAME block: Consolidate command flag and queue limit checks for merges block: Clean up special command handling logic block/blk-tag.c: Remove useless kfree block: remove the duplicated setting for congestion_threshold block: reject invalid queue attribute values block: Add bio_clone_bioset(), bio_clone_kmalloc() block: Consolidate bio_alloc_bioset(), bio_kmalloc() ...
2012-10-02Merge branch 'for-3.7-hierarchy' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup hierarchy update from Tejun Heo: "Currently, different cgroup subsystems handle nested cgroups completely differently. There's no consistency among subsystems and the behaviors often are outright broken. People at least seem to agree that the broken hierarhcy behaviors need to be weeded out if any progress is gonna be made on this front and that the fallouts from deprecating the broken behaviors should be acceptable especially given that the current behaviors don't make much sense when nested. This patch makes cgroup emit warning messages if cgroups for subsystems with broken hierarchy behavior are nested to prepare for fixing them in the future. This was put in a separate branch because more related changes were expected (didn't make it this round) and the memory cgroup wanted to pull in this and make changes on top." * 'for-3.7-hierarchy' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cgroup: mark subsystems with broken hierarchy support and whine if cgroups are nested for them
2012-10-02Merge branch 'for-3.7' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wqLinus Torvalds
Pull workqueue changes from Tejun Heo: "This is workqueue updates for v3.7-rc1. A lot of activities this round including considerable API and behavior cleanups. * delayed_work combines a timer and a work item. The handling of the timer part has always been a bit clunky leading to confusing cancelation API with weird corner-case behaviors. delayed_work is updated to use new IRQ safe timer and cancelation now works as expected. * Another deficiency of delayed_work was lack of the counterpart of mod_timer() which led to cancel+queue combinations or open-coded timer+work usages. mod_delayed_work[_on]() are added. These two delayed_work changes make delayed_work provide interface and behave like timer which is executed with process context. * A work item could be executed concurrently on multiple CPUs, which is rather unintuitive and made flush_work() behavior confusing and half-broken under certain circumstances. This problem doesn't exist for non-reentrant workqueues. While non-reentrancy check isn't free, the overhead is incurred only when a work item bounces across different CPUs and even in simulated pathological scenario the overhead isn't too high. All workqueues are made non-reentrant. This removes the distinction between flush_[delayed_]work() and flush_[delayed_]_work_sync(). The former is now as strong as the latter and the specified work item is guaranteed to have finished execution of any previous queueing on return. * In addition to the various bug fixes, Lai redid and simplified CPU hotplug handling significantly. * Joonsoo introduced system_highpri_wq and used it during CPU hotplug. There are two merge commits - one to pull in IRQ safe timer from tip/timers/core and the other to pull in CPU hotplug fixes from wq/for-3.6-fixes as Lai's hotplug restructuring depended on them." Fixed a number of trivial conflicts, but the more interesting conflicts were silent ones where the deprecated interfaces had been used by new code in the merge window, and thus didn't cause any real data conflicts. Tejun pointed out a few of them, I fixed a couple more. * 'for-3.7' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: (46 commits) workqueue: remove spurious WARN_ON_ONCE(in_irq()) from try_to_grab_pending() workqueue: use cwq_set_max_active() helper for workqueue_set_max_active() workqueue: introduce cwq_set_max_active() helper for thaw_workqueues() workqueue: remove @delayed from cwq_dec_nr_in_flight() workqueue: fix possible stall on try_to_grab_pending() of a delayed work item workqueue: use hotcpu_notifier() for workqueue_cpu_down_callback() workqueue: use __cpuinit instead of __devinit for cpu callbacks workqueue: rename manager_mutex to assoc_mutex workqueue: WORKER_REBIND is no longer necessary for idle rebinding workqueue: WORKER_REBIND is no longer necessary for busy rebinding workqueue: reimplement idle worker rebinding workqueue: deprecate __cancel_delayed_work() workqueue: reimplement cancel_delayed_work() using try_to_grab_pending() workqueue: use mod_delayed_work() instead of __cancel + queue workqueue: use irqsafe timer for delayed_work workqueue: clean up delayed_work initializers and add missing one workqueue: make deferrable delayed_work initializer names consistent workqueue: cosmetic whitespace updates for macro definitions workqueue: deprecate system_nrt[_freezable]_wq workqueue: deprecate flush[_delayed]_work_sync() ...
2012-09-26s390/partitions: make partition detection independent from DASD ioctlsStefan Weinhuber
In some usage scenarios it is desireable to work with disk images or virtualized DASD devices. One problem that prevents such applications is the partition detection in ibm.c. Currently it works only for devices that support the BIODASDINFO2 ioctl, in other words, it only works for devices that belong to the DASD device driver. The information gained from the BIODASDINFO2 ioctl is only for a small set of legacy cases abolutely necessary. All current VOL1, LNX1 and CMS1 type of disk labels can be interpreted correctly without this information, as long as the generic HDIO_GETGEO ioctl works and provides a correct disk geometry. This patch makes the ibm.c partition detection as independent as possible from the BIODASDINFO2 ioctl. Only the following two cases are still restricted to real DASDs: - An FBA DASD, or LDL formatted ECKD DASD without any disk label. - An old style LNX1 label (without large volume support) on a disk with inconsistent device geometry. Signed-off-by: Stefan Weinhuber <wein@de.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2012-09-21block: fix request_queue->flags initializationTejun Heo
A queue newly allocated with blk_alloc_queue_node() has only QUEUE_FLAG_BYPASS set. For request-based drivers, blk_init_allocated_queue() is called and q->queue_flags is overwritten with QUEUE_FLAG_DEFAULT which doesn't include BYPASS even though the initial bypass is still in effect. In blk_init_allocated_queue(), or QUEUE_FLAG_DEFAULT to q->queue_flags instead of overwriting. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: stable@vger.kernel.org Acked-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-09-21block: lift the initial queue bypass mode on blk_register_queue() instead of ↵Tejun Heo
blk_init_allocated_queue() b82d4b197c ("blkcg: make request_queue bypassing on allocation") made request_queues bypassed on allocation to avoid switching on and off bypass mode on a queue being initialized. Some drivers allocate and then destroy a lot of queues without fully initializing them and incurring bypass latency overhead on each of them could add upto significant overhead. Unfortunately, blk_init_allocated_queue() is never used by queues of bio-based drivers, which means that all bio-based driver queues are in bypass mode even after initialization and registration complete successfully. Due to the limited way request_queues are used by bio drivers, this problem is hidden pretty well but it shows up when blk-throttle is used in combination with a bio-based driver. Trying to configure (echoing to cgroupfs file) blk-throttle for a bio-based driver hangs indefinitely in blkg_conf_prep() waiting for bypass mode to end. This patch moves the initial blk_queue_bypass_end() call from blk_init_allocated_queue() to blk_register_queue() which is called for any userland-visible queues regardless of its type. I believe this is correct because I don't think there is any block driver which needs or wants working elevator and blk-cgroup on a queue which isn't visible to userland. If there are such users, we need a different solution. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Joseph Glanville <joseph.glanville@orionvm.com.au> Cc: stable@vger.kernel.org Acked-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-09-20block: ioctl to zero block rangesMartin K. Petersen
Introduce a BLKZEROOUT ioctl which can be used to clear block ranges by way of blkdev_issue_zeroout(). Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Acked-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-09-20block: Make blkdev_issue_zeroout use WRITE SAMEMartin K. Petersen
If the device supports WRITE SAME, use that to optimize zeroing of blocks. If the device does not support WRITE SAME or if the operation fails, fall back to writing zeroes the old-fashioned way. Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Acked-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-09-20block: Implement support for WRITE SAMEMartin K. Petersen
The WRITE SAME command supported on some SCSI devices allows the same block to be efficiently replicated throughout a block range. Only a single logical block is transferred from the host and the storage device writes the same data to all blocks described by the I/O. This patch implements support for WRITE SAME in the block layer. The blkdev_issue_write_same() function can be used by filesystems and block drivers to replicate a buffer across a block range. This can be used to efficiently initialize software RAID devices, etc. Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Acked-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-09-20block: Consolidate command flag and queue limit checks for mergesMartin K. Petersen
- blk_check_merge_flags() verifies that cmd_flags / bi_rw are compatible. This function is called for both req-req and req-bio merging. - blk_rq_get_max_sectors() and blk_queue_get_max_sectors() can be used to query the maximum sector count for a given request or queue. The calls will return the right value from the queue limits given the type of command (RW, discard, write same, etc.) Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Acked-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-09-20block: Clean up special command handling logicMartin K. Petersen
Remove special-casing of non-rw fs style requests (discard). The nomerge flags are consolidated in blk_types.h, and rq_mergeable() and bio_mergeable() have been modified to use them. bio_is_rw() is used in place of bio_has_data() a few places. This is done to to distinguish true reads and writes from other fs type requests that carry a payload (e.g. write same). Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Acked-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-09-18blk: add an upper sanity check on partition addingAlan Cox
65536 should be ludicrous anyway but without it we overflow the memory computation doing the allocation and badness occurs. Signed-off-by: Alan Cox <alan@linux.intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-09-14cgroup: mark subsystems with broken hierarchy support and whine if cgroups ↵Tejun Heo
are nested for them Currently, cgroup hierarchy support is a mess. cpu related subsystems behave correctly - configuration, accounting and control on a parent properly cover its children. blkio and freezer completely ignore hierarchy and treat all cgroups as if they're directly under the root cgroup. Others show yet different behaviors. These differing interpretations of cgroup hierarchy make using cgroup confusing and it impossible to co-mount controllers into the same hierarchy and obtain sane behavior. Eventually, we want full hierarchy support from all subsystems and probably a unified hierarchy. Users using separate hierarchies expecting completely different behaviors depending on the mounted subsystem is deterimental to making any progress on this front. This patch adds cgroup_subsys.broken_hierarchy and sets it to %true for controllers which are lacking in hierarchy support. The goal of this patch is two-fold. * Move users away from using hierarchy on currently non-hierarchical subsystems, so that implementing proper hierarchy support on those doesn't surprise them. * Keep track of which controllers are broken how and nudge the subsystems to implement proper hierarchy support. For now, start with a single warning message. We can whine louder later on. v2: Fixed a typo spotted by Michal. Warning message updated. v3: Updated memcg part so that it doesn't generate warning in the cases where .use_hierarchy=false doesn't make the behavior different from root.use_hierarchy=true. Fixed a typo spotted by Glauber. v4: Check ->broken_hierarchy after cgroup creation is complete so that ->create() can affect the result per Michal. Dropped unnecessary memcg root handling per Michal. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: Li Zefan <lizefan@huawei.com> Acked-by: Serge E. Hallyn <serue@us.ibm.com> Cc: Glauber Costa <glommer@parallels.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Paul Turner <pjt@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Thomas Graf <tgraf@suug.ch> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net> Cc: Neil Horman <nhorman@tuxdriver.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
2012-09-12block/blk-tag.c: Remove useless kfreePeter Senna Tschudin
Remove useless kfree() and clean up code related to the removal. The semantic patch that finds this problem is as follows: (http://coccinelle.lip6.fr/) // <smpl> @r exists@ position p1,p2; expression x; @@ if (x@p1 == NULL) { ... kfree@p2(x); ... return ...; } @unchanged exists@ position r.p1,r.p2; expression e <= r.x,x,e1; iterator I; statement S; @@ if (x@p1 == NULL) { ... when != I(x,...) S when != e = e1 when != e += e1 when != e -= e1 when != ++e when != --e when != e++ when != e-- when != &e kfree@p2(x); ... return ...; } @ok depends on unchanged exists@ position any r.p1; position r.p2; expression x; @@ ... when != true x@p1 == NULL kfree@p2(x); @depends on !ok && unchanged@ position r.p2; expression x; @@ *kfree@p2(x); // </smpl> Signed-off-by: Peter Senna Tschudin <peter.senna@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-09-09block: remove the duplicated setting for congestion_thresholdJaehoon Chung
Before call the blk_queue_congestion_threshold(), the blk_queue_congestion_threshold() is already called at blk_queue_make_rquest(). Because this code is the duplicated, it has removed. Signed-off-by: Jaehoon Chung <jh80.chung@samsung.com> Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-09-09block: reject invalid queue attribute valuesDave Reisner
Instead of using simple_strtoul which "converts" invalid numbers to 0, use strict_strtoul and perform error checking to ensure that userspace passes us a valid unsigned long. This addresses problems with functions such as writev, which might want to write a trailing newline -- the newline should rightfully be rejected, but the value preceeding it should be preserved. Fixes BZ#46981. Signed-off-by: Dave Reisner <dreisner@archlinux.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-09-09block: Add bio_clone_bioset(), bio_clone_kmalloc()Kent Overstreet
Previously, there was bio_clone() but it only allocated from the fs bio set; as a result various users were open coding it and using __bio_clone(). This changes bio_clone() to become bio_clone_bioset(), and then we add bio_clone() and bio_clone_kmalloc() as wrappers around it, making use of the functionality the last patch adedd. This will also help in a later patch changing how bio cloning works. Signed-off-by: Kent Overstreet <koverstreet@google.com> CC: Jens Axboe <axboe@kernel.dk> CC: NeilBrown <neilb@suse.de> CC: Alasdair Kergon <agk@redhat.com> CC: Boaz Harrosh <bharrosh@panasas.com> CC: Jeff Garzik <jeff@garzik.org> Acked-by: Jeff Garzik <jgarzik@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-09-09block: Kill bi_destructorKent Overstreet
Now that we've got generic code for freeing bios allocated from bio pools, this isn't needed anymore. This patch also makes bio_free() static, since without bi_destructor there should be no need for it to be called anywhere else. bio_free() is now only called from bio_put, so we can refactor those a bit - move some code from bio_put() to bio_free() and kill the redundant bio->bi_next = NULL. v5: Switch to BIO_KMALLOC_POOL ((void *)~0), per Boaz v6: BIO_KMALLOC_POOL now NULL, drop bio_free's EXPORT_SYMBOL v7: No #define BIO_KMALLOC_POOL anymore Signed-off-by: Kent Overstreet <koverstreet@google.com> CC: Jens Axboe <axboe@kernel.dk> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-09-09block: Ues bi_pool for bio_integrity_alloc()Kent Overstreet
Now that bios keep track of where they were allocated from, bio_integrity_alloc_bioset() becomes redundant. Remove bio_integrity_alloc_bioset() and drop bio_set argument from the related functions and make them use bio->bi_pool. Signed-off-by: Kent Overstreet <koverstreet@google.com> CC: Jens Axboe <axboe@kernel.dk> CC: Martin K. Petersen <martin.petersen@oracle.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-08-30block: rate-limit the error message from failing commandsYi Zou
When performing a cable pull test w/ active stress I/O using fio over a dual port Intel 82599 FCoE CNA, w/ 256LUNs on one port and about 32LUNs on the other, it is observed that the system becomes not usable due to scsi-ml being busy printing the error messages for all the failing commands. I don't believe this problem is specific to FCoE and these commands are anyway failing due to link being down (DID_NO_CONNECT), just rate-limit the messages here to solve this issue. v2->v1: use __ratelimit() as Tomas Henzl mentioned as the proper way for rate-limit per function. However, in this case, the failed i/o gets to blk_end_request_err() and then blk_update_request(), which also has to be rate-limited, as added in the v2 of this patch. v3-v2: resolved conflict to apply on current 3.6-rc3 upstream tip. Signed-off-by: Yi Zou <yi.zou@intel.com> Cc: www.Open-FCoE.org <devel@open-fcoe.org> Cc: Tomas Henzl <thenzl@redhat.com> Cc: <linux-scsi@vger.kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-08-21workqueue: deprecate __cancel_delayed_work()Tejun Heo
Now that cancel_delayed_work() can be safely called from IRQ handlers, there's no reason to use __cancel_delayed_work(). Use cancel_delayed_work() instead of __cancel_delayed_work() and mark the latter deprecated. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Jens Axboe <axboe@kernel.dk> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Roland Dreier <roland@kernel.org> Cc: Tomi Valkeinen <tomi.valkeinen@ti.com>
2012-08-21workqueue: use mod_delayed_work() instead of __cancel + queueTejun Heo
Now that mod_delayed_work() is safe to call from IRQ handlers, __cancel_delayed_work() followed by queue_delayed_work() can be replaced with mod_delayed_work(). Most conversions are straight-forward except for the following. * net/core/link_watch.c: linkwatch_schedule_work() was doing a quite elaborate dancing around its delayed_work. Collapse it such that linkwatch_work is queued for immediate execution if LW_URGENT and existing timer is kept otherwise. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: "David S. Miller" <davem@davemloft.net> Cc: Tomi Valkeinen <tomi.valkeinen@ti.com>
2012-08-20workqueue: deprecate system_nrt[_freezable]_wqTejun Heo
system_nrt[_freezable]_wq are now spurious. Mark them deprecated and convert all users to system[_freezable]_wq. If you're cc'd and wondering what's going on: Now all workqueues are non-reentrant, so there's no reason to use system_nrt[_freezable]_wq. Please use system[_freezable]_wq instead. This patch doesn't make any functional difference. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-By: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: David Airlie <airlied@linux.ie> Cc: Jiri Kosina <jkosina@suse.cz> Cc: "David S. Miller" <davem@davemloft.net> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: David Howells <dhowells@redhat.com>
2012-08-13workqueue: use mod_delayed_work() instead of cancel + queueTejun Heo
Convert delayed_work users doing cancel_delayed_work() followed by queue_delayed_work() to mod_delayed_work(). Most conversions are straight-forward. Ones worth mentioning are, * drivers/edac/edac_mc.c: edac_mc_workq_setup() converted to always use mod_delayed_work() and cancel loop in edac_mc_reset_delay_period() is dropped. * drivers/platform/x86/thinkpad_acpi.c: No need to remember whether watchdog is active or not. @fan_watchdog_active and related code dropped. * drivers/power/charger-manager.c: Seemingly a lot of delayed_work_pending() abuse going on here. [delayed_]work_pending() are unsynchronized and racy when used like this. I converted one instance in fullbatt_handler(). Please conver the rest so that it invokes workqueue APIs for the intended target state rather than trying to game work item pending state transitions. e.g. if timer should be modified - call mod_delayed_work(), canceled - call cancel_delayed_work[_sync](). * drivers/thermal/thermal_sys.c: thermal_zone_device_set_polling() simplified. Note that round_jiffies() calls in this function are meaningless. round_jiffies() work on absolute jiffies not delta delay used by delayed_work. v2: Tomi pointed out that __cancel_delayed_work() users can't be safely converted to mod_delayed_work(). They could be calling it from irq context and if that happens while delayed_work_timer_fn() is running, it could deadlock. __cancel_delayed_work() users are dropped. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Henrique de Moraes Holschuh <hmh@hmh.eng.br> Acked-by: Dmitry Torokhov <dmitry.torokhov@gmail.com> Acked-by: Anton Vorontsov <cbouatmailru@gmail.com> Acked-by: David Howells <dhowells@redhat.com> Cc: Tomi Valkeinen <tomi.valkeinen@ti.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Doug Thompson <dougthompson@xmission.com> Cc: David Airlie <airlied@linux.ie> Cc: Roland Dreier <roland@kernel.org> Cc: "John W. Linville" <linville@tuxdriver.com> Cc: Zhang Rui <rui.zhang@intel.com> Cc: Len Brown <len.brown@intel.com> Cc: "J. Bruce Fields" <bfields@fieldses.org> Cc: Johannes Berg <johannes@sipsolutions.net>
2012-08-03block: Don't use static to define "void *p" in show_partition_start()Jianpeng Ma
I met a odd prblem:read /proc/partitions may return zero. I wrote a file test.c: int main() { char buff[4096]; int ret; int fd; printf("pid=%d\n",getpid()); while (1) { fd = open("/proc/partitions", O_RDONLY); if (fd < 0) { printf("open error %s\n", strerror(errno)); return 0; } ret = read(fd, buff, 4096); if (ret <= 0) printf("ret=%d, %s, %ld\n", ret, strerror(errno), lseek(fd,0,SEEK_CUR)); close(fd); } exit(0); } You can reproduce by: 1:while true;do cat /proc/partitions > /dev/null ;done 2:./test I reviewed the code and found: >> static void *show_partition_start(struct seq_file *seqf, loff_t *pos) >> { >> static void *p; >> >> p = disk_seqf_start(seqf, pos); >> if (!IS_ERR_OR_NULL(p) && !*pos) >> seq_puts(seqf, "major minor #blocks name\n\n"); >> return p; >> } test cat /proc/partitions p = disk_seqf_start()(Not NULL) p = disk_seqf_start()(NULL because pos) if (!IS_ERR_OR_NULL(p) && !*pos) Signed-off-by: Jianpeng Ma <majianpeng@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-08-02block: Add blk_bio_map_sg() helperAsias He
Add a helper to map a bio to a scatterlist, modelled after blk_rq_map_sg. This helper is useful for any driver that wants to create a scatterlist from its ->make_request_fn method. Changes in v2: - Use __blk_segment_map_sg to avoid duplicated code - Add cocbook style function comment Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Christoph Hellwig <hch@lst.de> Cc: Tejun Heo <tj@kernel.org> Cc: Shaohua Li <shli@kernel.org> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: kvm@vger.kernel.org Cc: linux-kernel@vger.kernel.org Cc: virtualization@lists.linux-foundation.org Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Minchan Kim <minchan.kim@gmail.com> Signed-off-by: Asias He <asias@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-08-02block: Introduce __blk_segment_map_sg() helperAsias He
Split the mapping code in blk_rq_map_sg() to a helper __blk_segment_map_sg(), so that other mapping function, e.g. blk_bio_map_sg(), can share the code. Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Christoph Hellwig <hch@lst.de> Cc: Tejun Heo <tj@kernel.org> Cc: Shaohua Li <shli@kernel.org> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: kvm@vger.kernel.org Cc: linux-kernel@vger.kernel.org Cc: virtualization@lists.linux-foundation.org Suggested-by: Jens Axboe <axboe@kernel.dk> Suggested-by: Tejun Heo <tj@kernel.org> Signed-off-by: Asias He <asias@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-08-02block: split discard into aligned requestsPaolo Bonzini
When a disk has large discard_granularity and small max_discard_sectors, discards are not split with optimal alignment. In the limit case of discard_granularity == max_discard_sectors, no request could be aligned correctly, so in fact you might end up with no discarded logical blocks at all. Another example that helps showing the condition in the patch is with discard_granularity == 64, max_discard_sectors == 128. A request that is submitted for 256 sectors 2..257 will be split in two: 2..129, 130..257. However, only 2 aligned blocks out of 3 are included in the request; 128..191 may be left intact and not discarded. With this patch, the first request will be truncated to ensure good alignment of what's left, and the split will be 2..127, 128..255, 256..257. The patch will also take into account the discard_alignment. At most one extra request will be introduced, because the first request will be reduced by at most granularity-1 sectors, and granularity must be less than max_discard_sectors. Subsequent requests will run on round_down(max_discard_sectors, granularity) sectors, as in the current code. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Acked-by: Vivek Goyal <vgoyal@redhat.com> Tested-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-08-02block: reorganize rounding of max_discard_sectorsPaolo Bonzini
Mostly a preparation for the next patch. In principle this fixes an infinite loop if max_discard_sectors < granularity, but that really shouldn't happen. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Acked-by: Vivek Goyal <vgoyal@redhat.com> Tested-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-08-01Merge branch 'for-3.6/drivers' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block driver changes from Jens Axboe: - Making the plugging support for drivers a bit more sane from Neil. This supersedes the plugging change from Shaohua as well. - The usual round of drbd updates. - Using a tail add instead of a head add in the request completion for ndb, making us find the most completed request more quickly. - A few floppy changes, getting rid of a duplicated flag and also running the floppy init async (since it takes forever in boot terms) from Andi. * 'for-3.6/drivers' of git://git.kernel.dk/linux-block: floppy: remove duplicated flag FD_RAW_NEED_DISK blk: pass from_schedule to non-request unplug functions. block: stack unplug blk: centralize non-request unplug handling. md: remove plug_cnt feature of plugging. block/nbd: micro-optimization in nbd request completion drbd: announce FLUSH/FUA capability to upper layers drbd: fix max_bio_size to be unsigned drbd: flush drbd work queue before invalidate/invalidate remote drbd: fix potential access after free drbd: call local-io-error handler early drbd: do not reset rs_pending_cnt too early drbd: reset congestion information before reporting it in /proc/drbd drbd: report congestion if we are waiting for some userland callback drbd: differentiate between normal and forced detach drbd: cleanup, remove two unused global flags floppy: Run floppy initialization asynchronous
2012-08-01Merge branch 'for-3.6/core' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull core block IO bits from Jens Axboe: "The most complicated part if this is the request allocation rework by Tejun, which has been queued up for a long time and has been in for-next ditto as well. There are a few commits from yesterday and today, mostly trivial and obvious fixes. So I'm pretty confident that it is sound. It's also smaller than usual." * 'for-3.6/core' of git://git.kernel.dk/linux-block: block: remove dead func declaration block: add partition resize function to blkpg ioctl block: uninitialized ioc->nr_tasks triggers WARN_ON block: do not artificially constrain max_sectors for stacking drivers blkcg: implement per-blkg request allocation block: prepare for multiple request_lists block: add q->nr_rqs[] and move q->rq.elvpriv to q->nr_rqs_elvpriv blkcg: inline bio_blkcg() and friends block: allocate io_context upfront block: refactor get_request[_wait]() block: drop custom queue draining used by scsi_transport_{iscsi|fc} mempool: add @gfp_mask to mempool_create_node() blkcg: make root blkcg allocation use %GFP_KERNEL blkcg: __blkg_lookup_create() doesn't need radix preload
2012-08-01block: remove dead func declarationYuanhan Liu
__generic_unplug_device() function is removed with commit 7eaceaccab5f40bbfda044629a6298616aeaed50, which forgot to remove the declaration at meantime. Here remove it. Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-08-01block: add partition resize function to blkpg ioctlVivek Goyal
Add a new operation code (BLKPG_RESIZE_PARTITION) to the BLKPG ioctl that allows altering the size of an existing partition, even if it is currently in use. This patch converts hd_struct->nr_sects into sequence counter because One might extend a partition while IO is happening to it and update of nr_sects can be non-atomic on 32bit machines with 64bit sector_t. This can lead to issues like reading inconsistent size of a partition. Sequence counter have been used so that readers don't have to take bdev mutex lock as we call sector_in_part() very frequently. Now all the access to hd_struct->nr_sects should happen using sequence counter read/update helper functions part_nr_sects_read/part_nr_sects_write. There is one exception though, set_capacity()/get_capacity(). I think theoritically race should exist there too but this patch does not modify set_capacity()/get_capacity() due to sheer number of call sites and I am afraid that change might break something. I have left that as a TODO item. We can handle it later if need be. This patch does not introduce any new races as such w.r.t set_capacity()/get_capacity(). v2: Add CONFIG_LBDAF test to UP preempt case as suggested by Phillip. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Phillip Susi <psusi@ubuntu.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-08-01block: uninitialized ioc->nr_tasks triggers WARN_ONOlof Johansson
Hi, I'm using the old-fashioned 'dump' backup tool, and I noticed that it spews the below warning as of 3.5-rc1 and later (3.4 is fine): [ 10.886893] ------------[ cut here ]------------ [ 10.886904] WARNING: at include/linux/iocontext.h:140 copy_process+0x1488/0x1560() [ 10.886905] Hardware name: Bochs [ 10.886906] Modules linked in: [ 10.886908] Pid: 2430, comm: dump Not tainted 3.5.0-rc7+ #27 [ 10.886908] Call Trace: [ 10.886911] [<ffffffff8107ce8a>] warn_slowpath_common+0x7a/0xb0 [ 10.886912] [<ffffffff8107ced5>] warn_slowpath_null+0x15/0x20 [ 10.886913] [<ffffffff8107c088>] copy_process+0x1488/0x1560 [ 10.886914] [<ffffffff8107c244>] do_fork+0xb4/0x340 [ 10.886918] [<ffffffff8108effa>] ? recalc_sigpending+0x1a/0x50 [ 10.886919] [<ffffffff8108f6b2>] ? __set_task_blocked+0x32/0x80 [ 10.886920] [<ffffffff81091afa>] ? __set_current_blocked+0x3a/0x60 [ 10.886923] [<ffffffff81051db3>] sys_clone+0x23/0x30 [ 10.886925] [<ffffffff8179bd73>] stub_clone+0x13/0x20 [ 10.886927] [<ffffffff8179baa2>] ? system_call_fastpath+0x16/0x1b [ 10.886928] ---[ end trace 32a14af7ee6a590b ]--- Reproducing is easy, I can hit it on a KVM system with a very basic config (x86_64 make defconfig + enable the drivers needed). To hit it, just install dump (on debian/ubuntu, not sure what the package might be called on Fedora), and: dump -o -f /tmp/foo / You'll see the warning in dmesg once it forks off the I/O process and starts dumping filesystem contents. I bisected it down to the following commit: commit f6e8d01bee036460e03bd4f6a79d014f98ba712e Author: Tejun Heo <tj@kernel.org> Date: Mon Mar 5 13:15:26 2012 -0800 block: add io_context->active_ref Currently ioc->nr_tasks is used to decide two things - whether an ioc is done issuing IOs and whether it's shared by multiple tasks. This patch separate out the first into ioc->active_ref, which is acquired and released using {get|put}_io_context_active() respectively. This will be used to associate bio's with a given task. This patch doesn't introduce any visible behavior change. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> It seems like the init of ioc->nr_tasks was removed in that patch, so it starts out at 0 instead of 1. Tejun, is the right thing here to add back the init, or should something else be done? The below patch removes the warning, but I haven't done any more extensive testing on it. Signed-off-by: Olof Johansson <olof@lixom.net> Acked-by: Tejun Heo <tj@kernel.org> Cc: stable@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-08-01block: do not artificially constrain max_sectors for stacking driversMike Snitzer
blk_set_stacking_limits is intended to allow stacking drivers to build up the limits of the stacked device based on the underlying devices' limits. But defaulting 'max_sectors' to BLK_DEF_MAX_SECTORS (1024) doesn't allow the stacking driver to inherit a max_sectors larger than 1024 -- due to blk_stack_limits' use of min_not_zero. It is now clear that this artificial limit is getting in the way so change blk_set_stacking_limits's max_sectors to UINT_MAX (which allows stacking drivers like dm-multipath to inherit 'max_sectors' from the underlying paths). Reported-by: Vijay Chauhan <vijay.chauhan@netapp.com> Tested-by: Vijay Chauhan <vijay.chauhan@netapp.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-07-31blk: pass from_schedule to non-request unplug functions.NeilBrown
This will allow md/raid to know why the unplug was called, and will be able to act according - if !from_schedule it is safe to perform tasks which could themselves schedule. Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-07-31block: stack unplugShaohua Li
MD raid1 prepares to dispatch request in unplug callback. If make_request in low level queue also uses unplug callback to dispatch request, the low level queue's unplug callback will not be called. Recheck the callback list helps this case. Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-07-31blk: centralize non-request unplug handling.NeilBrown
Both md and umem has similar code for getting notified on an blk_finish_plug event. Centralize this code in block/ and allow each driver to provide its distinctive difference. Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-07-20[SCSI] block: Fix blk_execute_rq_nowait() dead queue handlingMuthukumar Ratty
If the queue is dead blk_execute_rq_nowait() doesn't invoke the done() callback function. That will result in blk_execute_rq() being stuck in wait_for_completion(). Avoid this by initializing rq->end_io to the done() callback before we check the queue state. Also, make sure the queue lock is held around the invocation of the done() callback. Found this through source code review. Signed-off-by: Muthukumar Ratty <muthur@gmail.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Tejun Heo <tj@kernel.org> Acked-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: James Bottomley <JBottomley@Parallels.com>
2012-06-26blkcg: implement per-blkg request allocationTejun Heo
Currently, request_queue has one request_list to allocate requests from regardless of blkcg of the IO being issued. When the unified request pool is used up, cfq proportional IO limits become meaningless - whoever grabs the next request being freed wins the race regardless of the configured weights. This can be easily demonstrated by creating a blkio cgroup w/ very low weight, put a program which can issue a lot of random direct IOs there and running a sequential IO from a different cgroup. As soon as the request pool is used up, the sequential IO bandwidth crashes. This patch implements per-blkg request_list. Each blkg has its own request_list and any IO allocates its request from the matching blkg making blkcgs completely isolated in terms of request allocation. * Root blkcg uses the request_list embedded in each request_queue, which was renamed to @q->root_rl from @q->rq. While making blkcg rl handling a bit harier, this enables avoiding most overhead for root blkcg. * Queue fullness is properly per request_list but bdi isn't blkcg aware yet, so congestion state currently just follows the root blkcg. As writeback isn't aware of blkcg yet, this works okay for async congestion but readahead may get the wrong signals. It's better than blkcg completely collapsing with shared request_list but needs to be improved with future changes. * After this change, each block cgroup gets a full request pool making resource consumption of each cgroup higher. This makes allowing non-root users to create cgroups less desirable; however, note that allowing non-root users to directly manage cgroups is already severely broken regardless of this patch - each block cgroup consumes kernel memory and skews IO weight (IO weights are not hierarchical). v2: queue-sysfs.txt updated and patch description udpated as suggested by Vivek. v3: blk_get_rl() wasn't checking error return from blkg_lookup_create() and may cause oops on lookup failure. Fix it by falling back to root_rl on blkg lookup failures. This problem was spotted by Rakesh Iyer <rni@google.com>. v4: Updated to accomodate 458f27a982 "block: Avoid missed wakeup in request waitqueue". blk_drain_queue() now wakes up waiters on all blkg->rl on the target queue. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Vivek Goyal <vgoyal@redhat.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-06-25block: prepare for multiple request_listsTejun Heo
Request allocation is about to be made per-blkg meaning that there'll be multiple request lists. * Make queue full state per request_list. blk_*queue_full() functions are renamed to blk_*rl_full() and takes @rl instead of @q. * Rename blk_init_free_list() to blk_init_rl() and make it take @rl instead of @q. Also add @gfp_mask parameter. * Add blk_exit_rl() instead of destroying rl directly from blk_release_queue(). * Add request_list->q and make request alloc/free functions - blk_free_request(), [__]freed_request(), __get_request() - take @rl instead of @q. This patch doesn't introduce any functional difference. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-06-25block: add q->nr_rqs[] and move q->rq.elvpriv to q->nr_rqs_elvprivTejun Heo
Add q->nr_rqs[] which currently behaves the same as q->rq.count[] and move q->rq.elvpriv to q->nr_rqs_elvpriv. blk_drain_queue() is updated to use q->nr_rqs[] instead of q->rq.count[]. These counters separates queue-wide request statistics from the request list and allow implementation of per-queue request allocation. While at it, properly indent fields of struct request_list. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-06-25blkcg: inline bio_blkcg() and friendsTejun Heo
Make bio_blkcg() and friends inline. They all are very simple and used only in few places. This patch is to prepare for further updates to request allocation path. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-06-25block: allocate io_context upfrontTejun Heo
Block layer very lazy allocation of ioc. It waits until the moment ioc is absolutely necessary; unfortunately, that time could be inside queue lock and __get_request() performs unlock - try alloc - retry dancing. Just allocate it up-front on entry to block layer. We're not saving the rain forest by deferring it to the last possible moment and complicating things unnecessarily. This patch is to prepare for further updates to request allocation path. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-06-25block: refactor get_request[_wait]()Tejun Heo
Currently, there are two request allocation functions - get_request() and get_request_wait(). The former tries to allocate a request once and the latter keeps retrying until it succeeds. The latter wraps the former and keeps retrying until allocation succeeds. The combination of two functions deliver fallible non-wait allocation, fallible wait allocation and unfailing wait allocation. However, given that forward progress is guaranteed, fallible wait allocation isn't all that useful and in fact nobody uses it. This patch simplifies the interface as follows. * get_request() is renamed to __get_request() and is only used by the wrapper function. * get_request_wait() is renamed to get_request(). It now takes @gfp_mask and retries iff it contains %__GFP_WAIT. This patch doesn't introduce any functional change and is to prepare for further updates to request allocation path. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-06-25block: drop custom queue draining used by scsi_transport_{iscsi|fc}Tejun Heo
iscsi_remove_host() uses bsg_remove_queue() which implements custom queue draining. fc_bsg_remove() open-codes mostly identical logic. The draining logic isn't correct in that blk_stop_queue() doesn't prevent new requests from being queued - it just stops processing, so nothing prevents new requests to be queued after the logic determines that the queue is drained. blk_cleanup_queue() now implements proper queue draining and these custom draining logics aren't necessary. Drop them and use bsg_unregister_queue() + blk_cleanup_queue() instead. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Mike Christie <michaelc@cs.wisc.edu> Acked-by: Vivek Goyal <vgoyal@redhat.com> Cc: James Bottomley <James.Bottomley@HansenPartnership.com> Cc: James Smart <james.smart@emulex.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>