aboutsummaryrefslogtreecommitdiffstats
path: root/Documentation/block
diff options
context:
space:
mode:
Diffstat (limited to 'Documentation/block')
-rw-r--r--Documentation/block/00-INDEX34
-rw-r--r--Documentation/block/bfq-iosched.txt7
-rw-r--r--Documentation/block/biodoc.txt88
-rw-r--r--Documentation/block/cfq-iosched.txt291
-rw-r--r--Documentation/block/null_blk.txt3
-rw-r--r--Documentation/block/queue-sysfs.txt36
6 files changed, 44 insertions, 415 deletions
diff --git a/Documentation/block/00-INDEX b/Documentation/block/00-INDEX
deleted file mode 100644
index 8d55b4bbb5e2..000000000000
--- a/Documentation/block/00-INDEX
+++ /dev/null
@@ -1,34 +0,0 @@
-00-INDEX
- - This file
-bfq-iosched.txt
- - BFQ IO scheduler and its tunables
-biodoc.txt
- - Notes on the Generic Block Layer Rewrite in Linux 2.5
-biovecs.txt
- - Immutable biovecs and biovec iterators
-capability.txt
- - Generic Block Device Capability (/sys/block/<device>/capability)
-cfq-iosched.txt
- - CFQ IO scheduler tunables
-cmdline-partition.txt
- - how to specify block device partitions on kernel command line
-data-integrity.txt
- - Block data integrity
-deadline-iosched.txt
- - Deadline IO scheduler tunables
-ioprio.txt
- - Block io priorities (in CFQ scheduler)
-pr.txt
- - Block layer support for Persistent Reservations
-null_blk.txt
- - Null block for block-layer benchmarking.
-queue-sysfs.txt
- - Queue's sysfs entries
-request.txt
- - The members of struct request (in include/linux/blkdev.h)
-stat.txt
- - Block layer statistics in /sys/block/<device>/stat
-switching-sched.txt
- - Switching I/O schedulers at runtime
-writeback_cache_control.txt
- - Control of volatile write back caches
diff --git a/Documentation/block/bfq-iosched.txt b/Documentation/block/bfq-iosched.txt
index 8d8d8f06cab2..98a8dd5ee385 100644
--- a/Documentation/block/bfq-iosched.txt
+++ b/Documentation/block/bfq-iosched.txt
@@ -357,6 +357,13 @@ video playing/streaming, a very low drop rate may be more important
than maximum throughput. In these cases, consider setting the
strict_guarantees parameter.
+slice_idle_us
+-------------
+
+Controls the same tuning parameter as slice_idle, but in microseconds.
+Either tunable can be used to set idling behavior. Afterwards, the
+other tunable will reflect the newly set value in sysfs.
+
strict_guarantees
-----------------
diff --git a/Documentation/block/biodoc.txt b/Documentation/block/biodoc.txt
index 207eca58efaa..ac18b488cb5e 100644
--- a/Documentation/block/biodoc.txt
+++ b/Documentation/block/biodoc.txt
@@ -65,7 +65,6 @@ Description of Contents:
3.2.3 I/O completion
3.2.4 Implications for drivers that do not interpret bios (don't handle
multiple segments)
- 3.2.5 Request command tagging
3.3 I/O submission
4. The I/O scheduler
5. Scalability related changes
@@ -708,93 +707,6 @@ is crossed on completion of a transfer. (The end*request* functions should
be used if only if the request has come down from block/bio path, not for
direct access requests which only specify rq->buffer without a valid rq->bio)
-3.2.5 Generic request command tagging
-
-3.2.5.1 Tag helpers
-
-Block now offers some simple generic functionality to help support command
-queueing (typically known as tagged command queueing), ie manage more than
-one outstanding command on a queue at any given time.
-
- blk_queue_init_tags(struct request_queue *q, int depth)
-
- Initialize internal command tagging structures for a maximum
- depth of 'depth'.
-
- blk_queue_free_tags((struct request_queue *q)
-
- Teardown tag info associated with the queue. This will be done
- automatically by block if blk_queue_cleanup() is called on a queue
- that is using tagging.
-
-The above are initialization and exit management, the main helpers during
-normal operations are:
-
- blk_queue_start_tag(struct request_queue *q, struct request *rq)
-
- Start tagged operation for this request. A free tag number between
- 0 and 'depth' is assigned to the request (rq->tag holds this number),
- and 'rq' is added to the internal tag management. If the maximum depth
- for this queue is already achieved (or if the tag wasn't started for
- some other reason), 1 is returned. Otherwise 0 is returned.
-
- blk_queue_end_tag(struct request_queue *q, struct request *rq)
-
- End tagged operation on this request. 'rq' is removed from the internal
- book keeping structures.
-
-To minimize struct request and queue overhead, the tag helpers utilize some
-of the same request members that are used for normal request queue management.
-This means that a request cannot both be an active tag and be on the queue
-list at the same time. blk_queue_start_tag() will remove the request, but
-the driver must remember to call blk_queue_end_tag() before signalling
-completion of the request to the block layer. This means ending tag
-operations before calling end_that_request_last()! For an example of a user
-of these helpers, see the IDE tagged command queueing support.
-
-3.2.5.2 Tag info
-
-Some block functions exist to query current tag status or to go from a
-tag number to the associated request. These are, in no particular order:
-
- blk_queue_tagged(q)
-
- Returns 1 if the queue 'q' is using tagging, 0 if not.
-
- blk_queue_tag_request(q, tag)
-
- Returns a pointer to the request associated with tag 'tag'.
-
- blk_queue_tag_depth(q)
-
- Return current queue depth.
-
- blk_queue_tag_queue(q)
-
- Returns 1 if the queue can accept a new queued command, 0 if we are
- at the maximum depth already.
-
- blk_queue_rq_tagged(rq)
-
- Returns 1 if the request 'rq' is tagged.
-
-3.2.5.2 Internal structure
-
-Internally, block manages tags in the blk_queue_tag structure:
-
- struct blk_queue_tag {
- struct request **tag_index; /* array or pointers to rq */
- unsigned long *tag_map; /* bitmap of free tags */
- struct list_head busy_list; /* fifo list of busy tags */
- int busy; /* queue depth */
- int max_depth; /* max queue depth */
- };
-
-Most of the above is simple and straight forward, however busy_list may need
-a bit of explaining. Normally we don't care too much about request ordering,
-but in the event of any barrier requests in the tag queue we need to ensure
-that requests are restarted in the order they were queue.
-
3.3 I/O Submission
The routine submit_bio() is used to submit a single io. Higher level i/o
diff --git a/Documentation/block/cfq-iosched.txt b/Documentation/block/cfq-iosched.txt
deleted file mode 100644
index 895bd3813115..000000000000
--- a/Documentation/block/cfq-iosched.txt
+++ /dev/null
@@ -1,291 +0,0 @@
-CFQ (Complete Fairness Queueing)
-===============================
-
-The main aim of CFQ scheduler is to provide a fair allocation of the disk
-I/O bandwidth for all the processes which requests an I/O operation.
-
-CFQ maintains the per process queue for the processes which request I/O
-operation(synchronous requests). In case of asynchronous requests, all the
-requests from all the processes are batched together according to their
-process's I/O priority.
-
-CFQ ioscheduler tunables
-========================
-
-slice_idle
-----------
-This specifies how long CFQ should idle for next request on certain cfq queues
-(for sequential workloads) and service trees (for random workloads) before
-queue is expired and CFQ selects next queue to dispatch from.
-
-By default slice_idle is a non-zero value. That means by default we idle on
-queues/service trees. This can be very helpful on highly seeky media like
-single spindle SATA/SAS disks where we can cut down on overall number of
-seeks and see improved throughput.
-
-Setting slice_idle to 0 will remove all the idling on queues/service tree
-level and one should see an overall improved throughput on faster storage
-devices like multiple SATA/SAS disks in hardware RAID configuration. The down
-side is that isolation provided from WRITES also goes down and notion of
-IO priority becomes weaker.
-
-So depending on storage and workload, it might be useful to set slice_idle=0.
-In general I think for SATA/SAS disks and software RAID of SATA/SAS disks
-keeping slice_idle enabled should be useful. For any configurations where
-there are multiple spindles behind single LUN (Host based hardware RAID
-controller or for storage arrays), setting slice_idle=0 might end up in better
-throughput and acceptable latencies.
-
-back_seek_max
--------------
-This specifies, given in Kbytes, the maximum "distance" for backward seeking.
-The distance is the amount of space from the current head location to the
-sectors that are backward in terms of distance.
-
-This parameter allows the scheduler to anticipate requests in the "backward"
-direction and consider them as being the "next" if they are within this
-distance from the current head location.
-
-back_seek_penalty
------------------
-This parameter is used to compute the cost of backward seeking. If the
-backward distance of request is just 1/back_seek_penalty from a "front"
-request, then the seeking cost of two requests is considered equivalent.
-
-So scheduler will not bias toward one or the other request (otherwise scheduler
-will bias toward front request). Default value of back_seek_penalty is 2.
-
-fifo_expire_async
------------------
-This parameter is used to set the timeout of asynchronous requests. Default
-value of this is 248ms.
-
-fifo_expire_sync
-----------------
-This parameter is used to set the timeout of synchronous requests. Default
-value of this is 124ms. In case to favor synchronous requests over asynchronous
-one, this value should be decreased relative to fifo_expire_async.
-
-group_idle
------------
-This parameter forces idling at the CFQ group level instead of CFQ
-queue level. This was introduced after a bottleneck was observed
-in higher end storage due to idle on sequential queue and allow dispatch
-from a single queue. The idea with this parameter is that it can be run with
-slice_idle=0 and group_idle=8, so that idling does not happen on individual
-queues in the group but happens overall on the group and thus still keeps the
-IO controller working.
-Not idling on individual queues in the group will dispatch requests from
-multiple queues in the group at the same time and achieve higher throughput
-on higher end storage.
-
-Default value for this parameter is 8ms.
-
-low_latency
------------
-This parameter is used to enable/disable the low latency mode of the CFQ
-scheduler. If enabled, CFQ tries to recompute the slice time for each process
-based on the target_latency set for the system. This favors fairness over
-throughput. Disabling low latency (setting it to 0) ignores target latency,
-allowing each process in the system to get a full time slice.
-
-By default low latency mode is enabled.
-
-target_latency
---------------
-This parameter is used to calculate the time slice for a process if cfq's
-latency mode is enabled. It will ensure that sync requests have an estimated
-latency. But if sequential workload is higher(e.g. sequential read),
-then to meet the latency constraints, throughput may decrease because of less
-time for each process to issue I/O request before the cfq queue is switched.
-
-Though this can be overcome by disabling the latency_mode, it may increase
-the read latency for some applications. This parameter allows for changing
-target_latency through the sysfs interface which can provide the balanced
-throughput and read latency.
-
-Default value for target_latency is 300ms.
-
-slice_async
------------
-This parameter is same as of slice_sync but for asynchronous queue. The
-default value is 40ms.
-
-slice_async_rq
---------------
-This parameter is used to limit the dispatching of asynchronous request to
-device request queue in queue's slice time. The maximum number of request that
-are allowed to be dispatched also depends upon the io priority. Default value
-for this is 2.
-
-slice_sync
-----------
-When a queue is selected for execution, the queues IO requests are only
-executed for a certain amount of time(time_slice) before switching to another
-queue. This parameter is used to calculate the time slice of synchronous
-queue.
-
-time_slice is computed using the below equation:-
-time_slice = slice_sync + (slice_sync/5 * (4 - prio)). To increase the
-time_slice of synchronous queue, increase the value of slice_sync. Default
-value is 100ms.
-
-quantum
--------
-This specifies the number of request dispatched to the device queue. In a
-queue's time slice, a request will not be dispatched if the number of request
-in the device exceeds this parameter. This parameter is used for synchronous
-request.
-
-In case of storage with several disk, this setting can limit the parallel
-processing of request. Therefore, increasing the value can improve the
-performance although this can cause the latency of some I/O to increase due
-to more number of requests.
-
-CFQ Group scheduling
-====================
-
-CFQ supports blkio cgroup and has "blkio." prefixed files in each
-blkio cgroup directory. It is weight-based and there are four knobs
-for configuration - weight[_device] and leaf_weight[_device].
-Internal cgroup nodes (the ones with children) can also have tasks in
-them, so the former two configure how much proportion the cgroup as a
-whole is entitled to at its parent's level while the latter two
-configure how much proportion the tasks in the cgroup have compared to
-its direct children.
-
-Another way to think about it is assuming that each internal node has
-an implicit leaf child node which hosts all the tasks whose weight is
-configured by leaf_weight[_device]. Let's assume a blkio hierarchy
-composed of five cgroups - root, A, B, AA and AB - with the following
-weights where the names represent the hierarchy.
-
- weight leaf_weight
- root : 125 125
- A : 500 750
- B : 250 500
- AA : 500 500
- AB : 1000 500
-
-root never has a parent making its weight is meaningless. For backward
-compatibility, weight is always kept in sync with leaf_weight. B, AA
-and AB have no child and thus its tasks have no children cgroup to
-compete with. They always get 100% of what the cgroup won at the
-parent level. Considering only the weights which matter, the hierarchy
-looks like the following.
-
- root
- / | \
- A B leaf
- 500 250 125
- / | \
- AA AB leaf
- 500 1000 750
-
-If all cgroups have active IOs and competing with each other, disk
-time will be distributed like the following.
-
-Distribution below root. The total active weight at this level is
-A:500 + B:250 + C:125 = 875.
-
- root-leaf : 125 / 875 =~ 14%
- A : 500 / 875 =~ 57%
- B(-leaf) : 250 / 875 =~ 28%
-
-A has children and further distributes its 57% among the children and
-the implicit leaf node. The total active weight at this level is
-AA:500 + AB:1000 + A-leaf:750 = 2250.
-
- A-leaf : ( 750 / 2250) * A =~ 19%
- AA(-leaf) : ( 500 / 2250) * A =~ 12%
- AB(-leaf) : (1000 / 2250) * A =~ 25%
-
-CFQ IOPS Mode for group scheduling
-===================================
-Basic CFQ design is to provide priority based time slices. Higher priority
-process gets bigger time slice and lower priority process gets smaller time
-slice. Measuring time becomes harder if storage is fast and supports NCQ and
-it would be better to dispatch multiple requests from multiple cfq queues in
-request queue at a time. In such scenario, it is not possible to measure time
-consumed by single queue accurately.
-
-What is possible though is to measure number of requests dispatched from a
-single queue and also allow dispatch from multiple cfq queue at the same time.
-This effectively becomes the fairness in terms of IOPS (IO operations per
-second).
-
-If one sets slice_idle=0 and if storage supports NCQ, CFQ internally switches
-to IOPS mode and starts providing fairness in terms of number of requests
-dispatched. Note that this mode switching takes effect only for group
-scheduling. For non-cgroup users nothing should change.
-
-CFQ IO scheduler Idling Theory
-===============================
-Idling on a queue is primarily about waiting for the next request to come
-on same queue after completion of a request. In this process CFQ will not
-dispatch requests from other cfq queues even if requests are pending there.
-
-The rationale behind idling is that it can cut down on number of seeks
-on rotational media. For example, if a process is doing dependent
-sequential reads (next read will come on only after completion of previous
-one), then not dispatching request from other queue should help as we
-did not move the disk head and kept on dispatching sequential IO from
-one queue.
-
-CFQ has following service trees and various queues are put on these trees.
-
- sync-idle sync-noidle async
-
-All cfq queues doing synchronous sequential IO go on to sync-idle tree.
-On this tree we idle on each queue individually.
-
-All synchronous non-sequential queues go on sync-noidle tree. Also any
-synchronous write request which is not marked with REQ_IDLE goes on this
-service tree. On this tree we do not idle on individual queues instead idle
-on the whole group of queues or the tree. So if there are 4 queues waiting
-for IO to dispatch we will idle only once last queue has dispatched the IO
-and there is no more IO on this service tree.
-
-All async writes go on async service tree. There is no idling on async
-queues.
-
-CFQ has some optimizations for SSDs and if it detects a non-rotational
-media which can support higher queue depth (multiple requests at in
-flight at a time), then it cuts down on idling of individual queues and
-all the queues move to sync-noidle tree and only tree idle remains. This
-tree idling provides isolation with buffered write queues on async tree.
-
-FAQ
-===
-Q1. Why to idle at all on queues not marked with REQ_IDLE.
-
-A1. We only do tree idle (all queues on sync-noidle tree) on queues not marked
- with REQ_IDLE. This helps in providing isolation with all the sync-idle
- queues. Otherwise in presence of many sequential readers, other
- synchronous IO might not get fair share of disk.
-
- For example, if there are 10 sequential readers doing IO and they get
- 100ms each. If a !REQ_IDLE request comes in, it will be scheduled
- roughly after 1 second. If after completion of !REQ_IDLE request we
- do not idle, and after a couple of milli seconds a another !REQ_IDLE
- request comes in, again it will be scheduled after 1second. Repeat it
- and notice how a workload can lose its disk share and suffer due to
- multiple sequential readers.
-
- fsync can generate dependent IO where bunch of data is written in the
- context of fsync, and later some journaling data is written. Journaling
- data comes in only after fsync has finished its IO (atleast for ext4
- that seemed to be the case). Now if one decides not to idle on fsync
- thread due to !REQ_IDLE, then next journaling write will not get
- scheduled for another second. A process doing small fsync, will suffer
- badly in presence of multiple sequential readers.
-
- Hence doing tree idling on threads using !REQ_IDLE flag on requests
- provides isolation from multiple sequential readers and at the same
- time we do not idle on individual threads.
-
-Q2. When to specify REQ_IDLE
-A2. I would think whenever one is doing synchronous write and expecting
- more writes to be dispatched from same context soon, should be able
- to specify REQ_IDLE on writes and that probably should work well for
- most of the cases.
diff --git a/Documentation/block/null_blk.txt b/Documentation/block/null_blk.txt
index ea2dafe49ae8..4cad1024fff7 100644
--- a/Documentation/block/null_blk.txt
+++ b/Documentation/block/null_blk.txt
@@ -88,7 +88,8 @@ shared_tags=[0/1]: Default: 0
zoned=[0/1]: Default: 0
0: Block device is exposed as a random-access block device.
- 1: Block device is exposed as a host-managed zoned block device.
+ 1: Block device is exposed as a host-managed zoned block device. Requires
+ CONFIG_BLK_DEV_ZONED.
zone_size=[MB]: Default: 256
Per zone size when exposed as a zoned block device. Must be a power of two.
diff --git a/Documentation/block/queue-sysfs.txt b/Documentation/block/queue-sysfs.txt
index 2c1e67058fd3..83b457e24bba 100644
--- a/Documentation/block/queue-sysfs.txt
+++ b/Documentation/block/queue-sysfs.txt
@@ -64,9 +64,16 @@ guess, the kernel will put the process issuing IO to sleep for an amount
of time, before entering a classic poll loop. This mode might be a
little slower than pure classic polling, but it will be more efficient.
If set to a value larger than 0, the kernel will put the process issuing
-IO to sleep for this amont of microseconds before entering classic
+IO to sleep for this amount of microseconds before entering classic
polling.
+io_timeout (RW)
+---------------
+io_timeout is the request timeout in milliseconds. If a request does not
+complete in this time then the block driver timeout handler is invoked.
+That timeout handler can decide to retry the request, to fail it or to start
+a device recovery strategy.
+
iostats (RW)
-------------
This file is used to control (on/off) the iostats accounting of the
@@ -194,4 +201,31 @@ blk-throttle makes decision based on the samplings. Lower time means cgroups
have more smooth throughput, but higher CPU overhead. This exists only when
CONFIG_BLK_DEV_THROTTLING_LOW is enabled.
+zoned (RO)
+----------
+This indicates if the device is a zoned block device and the zone model of the
+device if it is indeed zoned. The possible values indicated by zoned are
+"none" for regular block devices and "host-aware" or "host-managed" for zoned
+block devices. The characteristics of host-aware and host-managed zoned block
+devices are described in the ZBC (Zoned Block Commands) and ZAC
+(Zoned Device ATA Command Set) standards. These standards also define the
+"drive-managed" zone model. However, since drive-managed zoned block devices
+do not support zone commands, they will be treated as regular block devices
+and zoned will report "none".
+
+nr_zones (RO)
+-------------
+For zoned block devices (zoned attribute indicating "host-managed" or
+"host-aware"), this indicates the total number of zones of the device.
+This is always 0 for regular block devices.
+
+chunk_sectors (RO)
+------------------
+This has different meaning depending on the type of the block device.
+For a RAID device (dm-raid), chunk_sectors indicates the size in 512B sectors
+of the RAID volume stripe segment. For a zoned block device, either host-aware
+or host-managed, chunk_sectors indicates the size in 512B sectors of the zones
+of the device, with the eventual exception of the last zone of the device which
+may be smaller.
+
Jens Axboe <jens.axboe@oracle.com>, February 2009