path: root/Documentation/block
diff options
authorLinus Torvalds <>2005-04-16 15:20:36 -0700
committerLinus Torvalds <>2005-04-16 15:20:36 -0700
commit1da177e4c3f41524e886b7f1b8a0c1fc7321cac2 (patch)
tree0bba044c4ce775e45a88a51686b5d9f90697ea9d /Documentation/block
Initial git repository build. I'm not bothering with the full history, even though we have it. We can create a separate "historical" git archive of that later if we want to, and in the meantime it's about 3.2GB when imported into git - space that would just make the early git days unnecessarily complicated, when we don't have a lot of good infrastructure for it. Let it rip!
Diffstat (limited to 'Documentation/block')
4 files changed, 1544 insertions, 0 deletions
diff --git a/Documentation/block/as-iosched.txt b/Documentation/block/as-iosched.txt
new file mode 100644
index 00000000000..6f47332c883
--- /dev/null
+++ b/Documentation/block/as-iosched.txt
@@ -0,0 +1,165 @@
+Anticipatory IO scheduler
+Nick Piggin <> 13 Sep 2003
+Attention! Database servers, especially those using "TCQ" disks should
+investigate performance with the 'deadline' IO scheduler. Any system with high
+disk performance requirements should do so, in fact.
+If you see unusual performance characteristics of your disk systems, or you
+see big performance regressions versus the deadline scheduler, please email
+me. Database users don't bother unless you're willing to test a lot of patches
+from me ;) its a known issue.
+Also, users with hardware RAID controllers, doing striping, may find
+highly variable performance results with using the as-iosched. The
+as-iosched anticipatory implementation is based on the notion that a disk
+device has only one physical seeking head. A striped RAID controller
+actually has a head for each physical device in the logical RAID device.
+However, setting the antic_expire (see tunable parameters below) produces
+very similar behavior to the deadline IO scheduler.
+Selecting IO schedulers
+To choose IO schedulers at boot time, use the argument 'elevator=deadline'.
+'noop' and 'as' (the default) are also available. IO schedulers are assigned
+globally at boot time only presently.
+Anticipatory IO scheduler Policies
+The as-iosched implementation implements several layers of policies
+to determine when an IO request is dispatched to the disk controller.
+Here are the policies outlined, in order of application.
+1. one-way Elevator algorithm.
+The elevator algorithm is similar to that used in deadline scheduler, with
+the addition that it allows limited backward movement of the elevator
+(i.e. seeks backwards). A seek backwards can occur when choosing between
+two IO requests where one is behind the elevator's current position, and
+the other is in front of the elevator's position. If the seek distance to
+the request in back of the elevator is less than half the seek distance to
+the request in front of the elevator, then the request in back can be chosen.
+Backward seeks are also limited to a maximum of MAXBACK (1024*1024) sectors.
+This favors forward movement of the elevator, while allowing opportunistic
+"short" backward seeks.
+2. FIFO expiration times for reads and for writes.
+This is again very similar to the deadline IO scheduler. The expiration
+times for requests on these lists is tunable using the parameters read_expire
+and write_expire discussed below. When a read or a write expires in this way,
+the IO scheduler will interrupt its current elevator sweep or read anticipation
+to service the expired request.
+3. Read and write request batching
+A batch is a collection of read requests or a collection of write
+requests. The as scheduler alternates dispatching read and write batches
+to the driver. In the case a read batch, the scheduler submits read
+requests to the driver as long as there are read requests to submit, and
+the read batch time limit has not been exceeded (read_batch_expire).
+The read batch time limit begins counting down only when there are
+competing write requests pending.
+In the case of a write batch, the scheduler submits write requests to
+the driver as long as there are write requests available, and the
+write batch time limit has not been exceeded (write_batch_expire).
+However, the length of write batches will be gradually shortened
+when read batches frequently exceed their time limit.
+When changing between batch types, the scheduler waits for all requests
+from the previous batch to complete before scheduling requests for the
+next batch.
+The read and write fifo expiration times described in policy 2 above
+are checked only when in scheduling IO of a batch for the corresponding
+(read/write) type. So for example, the read FIFO timeout values are
+tested only during read batches. Likewise, the write FIFO timeout
+values are tested only during write batches. For this reason,
+it is generally not recommended for the read batch time
+to be longer than the write expiration time, nor for the write batch
+time to exceed the read expiration time (see tunable parameters below).
+When the IO scheduler changes from a read to a write batch,
+it begins the elevator from the request that is on the head of the
+write expiration FIFO. Likewise, when changing from a write batch to
+a read batch, scheduler begins the elevator from the first entry
+on the read expiration FIFO.
+4. Read anticipation.
+Read anticipation occurs only when scheduling a read batch.
+This implementation of read anticipation allows only one read request
+to be dispatched to the disk controller at a time. In
+contrast, many write requests may be dispatched to the disk controller
+at a time during a write batch. It is this characteristic that can make
+the anticipatory scheduler perform anomalously with controllers supporting
+TCQ, or with hardware striped RAID devices. Setting the antic_expire
+queue paramter (see below) to zero disables this behavior, and the anticipatory
+scheduler behaves essentially like the deadline scheduler.
+When read anticipation is enabled (antic_expire is not zero), reads
+are dispatched to the disk controller one at a time.
+At the end of each read request, the IO scheduler examines its next
+candidate read request from its sorted read list. If that next request
+is from the same process as the request that just completed,
+or if the next request in the queue is "very close" to the
+just completed request, it is dispatched immediately. Otherwise,
+statistics (average think time, average seek distance) on the process
+that submitted the just completed request are examined. If it seems
+likely that that process will submit another request soon, and that
+request is likely to be near the just completed request, then the IO
+scheduler will stop dispatching more read requests for up time (antic_expire)
+milliseconds, hoping that process will submit a new request near the one
+that just completed. If such a request is made, then it is dispatched
+immediately. If the antic_expire wait time expires, then the IO scheduler
+will dispatch the next read request from the sorted read queue.
+To decide whether an anticipatory wait is worthwhile, the scheduler
+maintains statistics for each process that can be used to compute
+mean "think time" (the time between read requests), and mean seek
+distance for that process. One observation is that these statistics
+are associated with each process, but those statistics are not associated
+with a specific IO device. So for example, if a process is doing IO
+on several file systems on separate devices, the statistics will be
+a combination of IO behavior from all those devices.
+Tuning the anticipatory IO scheduler
+When using 'as', the anticipatory IO scheduler there are 5 parameters under
+/sys/block/*/queue/iosched/. All are units of milliseconds.
+The parameters are:
+* read_expire
+ Controls how long until a read request becomes "expired". It also controls the
+ interval between which expired requests are served, so set to 50, a request
+ might take anywhere < 100ms to be serviced _if_ it is the next on the
+ expired list. Obviously request expiration strategies won't make the disk
+ go faster. The result basically equates to the timeslice a single reader
+ gets in the presence of other IO. 100*((seek time / read_expire) + 1) is
+ very roughly the % streaming read efficiency your disk should get with
+ multiple readers.
+* read_batch_expire
+ Controls how much time a batch of reads is given before pending writes are
+ served. A higher value is more efficient. This might be set below read_expire
+ if writes are to be given higher priority than reads, but reads are to be
+ as efficient as possible when there are no writes. Generally though, it
+ should be some multiple of read_expire.
+* write_expire, and
+* write_batch_expire are equivalent to the above, for writes.
+* antic_expire
+ Controls the maximum amount of time we can anticipate a good read (one
+ with a short seek distance from the most recently completed request) before
+ giving up. Many other factors may cause anticipation to be stopped early,
+ or some processes will not be "anticipated" at all. Should be a bit higher
+ for big seek time devices though not a linear correspondence - most
+ processes have only a few ms thinktime.
diff --git a/Documentation/block/biodoc.txt b/Documentation/block/biodoc.txt
new file mode 100644
index 00000000000..6dd274d7e1c
--- /dev/null
+++ b/Documentation/block/biodoc.txt
@@ -0,0 +1,1213 @@
+ Notes on the Generic Block Layer Rewrite in Linux 2.5
+ =====================================================
+Notes Written on Jan 15, 2002:
+ Jens Axboe <>
+ Suparna Bhattacharya <>
+Last Updated May 2, 2002
+September 2003: Updated I/O Scheduler portions
+ Nick Piggin <>
+These are some notes describing some aspects of the 2.5 block layer in the
+context of the bio rewrite. The idea is to bring out some of the key
+changes and a glimpse of the rationale behind those changes.
+Please mail corrections & suggestions to
+2.5 bio rewrite:
+ Jens Axboe <>
+Many aspects of the generic block layer redesign were driven by and evolved
+over discussions, prior patches and the collective experience of several
+people. See sections 8 and 9 for a list of some related references.
+The following people helped with review comments and inputs for this
+ Christoph Hellwig <>
+ Arjan van de Ven <>
+ Randy Dunlap <>
+ Andre Hedrick <>
+The following people helped with fixes/contributions to the bio patches
+while it was still work-in-progress:
+ David S. Miller <>
+Description of Contents:
+1. Scope for tuning of logic to various needs
+ 1.1 Tuning based on device or low level driver capabilities
+ - Per-queue parameters
+ - Highmem I/O support
+ - I/O scheduler modularization
+ 1.2 Tuning based on high level requirements/capabilities
+ 1.2.1 I/O Barriers
+ 1.2.2 Request Priority/Latency
+ 1.3 Direct access/bypass to lower layers for diagnostics and special
+ device operations
+ 1.3.1 Pre-built commands
+2. New flexible and generic but minimalist i/o structure or descriptor
+ (instead of using buffer heads at the i/o layer)
+ 2.1 Requirements/Goals addressed
+ 2.2 The bio struct in detail (multi-page io unit)
+ 2.3 Changes in the request structure
+3. Using bios
+ 3.1 Setup/teardown (allocation, splitting)
+ 3.2 Generic bio helper routines
+ 3.2.1 Traversing segments and completion units in a request
+ 3.2.2 Setting up DMA scatterlists
+ 3.2.3 I/O completion
+ 3.2.4 Implications for drivers that do not interpret bios (don't handle
+ multiple segments)
+ 3.2.5 Request command tagging
+ 3.3 I/O submission
+4. The I/O scheduler
+5. Scalability related changes
+ 5.1 Granular locking: Removal of io_request_lock
+ 5.2 Prepare for transition to 64 bit sector_t
+6. Other Changes/Implications
+ 6.1 Partition re-mapping handled by the generic block layer
+7. A few tips on migration of older drivers
+8. A list of prior/related/impacted patches/ideas
+9. Other References/Discussion Threads
+Bio Notes
+Let us discuss the changes in the context of how some overall goals for the
+block layer are addressed.
+1. Scope for tuning the generic logic to satisfy various requirements
+The block layer design supports adaptable abstractions to handle common
+processing with the ability to tune the logic to an appropriate extent
+depending on the nature of the device and the requirements of the caller.
+One of the objectives of the rewrite was to increase the degree of tunability
+and to enable higher level code to utilize underlying device/driver
+capabilities to the maximum extent for better i/o performance. This is
+important especially in the light of ever improving hardware capabilities
+and application/middleware software designed to take advantage of these
+1.1 Tuning based on low level device / driver capabilities
+Sophisticated devices with large built-in caches, intelligent i/o scheduling
+optimizations, high memory DMA support, etc may find some of the
+generic processing an overhead, while for less capable devices the
+generic functionality is essential for performance or correctness reasons.
+Knowledge of some of the capabilities or parameters of the device should be
+used at the generic block layer to take the right decisions on
+behalf of the driver.
+How is this achieved ?
+Tuning at a per-queue level:
+i. Per-queue limits/values exported to the generic layer by the driver
+Various parameters that the generic i/o scheduler logic uses are set at
+a per-queue level (e.g maximum request size, maximum number of segments in
+a scatter-gather list, hardsect size)
+Some parameters that were earlier available as global arrays indexed by
+major/minor are now directly associated with the queue. Some of these may
+move into the block device structure in the future. Some characteristics
+have been incorporated into a queue flags field rather than separate fields
+in themselves. There are blk_queue_xxx functions to set the parameters,
+rather than update the fields directly
+Some new queue property settings:
+ blk_queue_bounce_limit(q, u64 dma_address)
+ Enable I/O to highmem pages, dma_address being the
+ limit. No highmem default.
+ blk_queue_max_sectors(q, max_sectors)
+ Maximum size request you can handle in units of 512 byte
+ sectors. 255 default.
+ blk_queue_max_phys_segments(q, max_segments)
+ Maximum physical segments you can handle in a request. 128
+ default (driver limit). (See 3.2.2)
+ blk_queue_max_hw_segments(q, max_segments)
+ Maximum dma segments the hardware can handle in a request. 128
+ default (host adapter limit, after dma remapping).
+ (See 3.2.2)
+ blk_queue_max_segment_size(q, max_seg_size)
+ Maximum size of a clustered segment, 64kB default.
+ blk_queue_hardsect_size(q, hardsect_size)
+ Lowest possible sector size that the hardware can operate
+ on, 512 bytes default.
+New queue flags:
+ QUEUE_FLAG_CLUSTER (see 3.2.2)
+ QUEUE_FLAG_QUEUED (see 3.2.4)
+ii. High-mem i/o capabilities are now considered the default
+The generic bounce buffer logic, present in 2.4, where the block layer would
+by default copyin/out i/o requests on high-memory buffers to low-memory buffers
+assuming that the driver wouldn't be able to handle it directly, has been
+changed in 2.5. The bounce logic is now applied only for memory ranges
+for which the device cannot handle i/o. A driver can specify this by
+setting the queue bounce limit for the request queue for the device
+(blk_queue_bounce_limit()). This avoids the inefficiencies of the copyin/out
+where a device is capable of handling high memory i/o.
+In order to enable high-memory i/o where the device is capable of supporting
+it, the pci dma mapping routines and associated data structures have now been
+modified to accomplish a direct page -> bus translation, without requiring
+a virtual address mapping (unlike the earlier scheme of virtual address
+-> bus translation). So this works uniformly for high-memory pages (which
+do not have a correponding kernel virtual address space mapping) and
+low-memory pages.
+Note: Please refer to DMA-mapping.txt for a discussion on PCI high mem DMA
+aspects and mapping of scatter gather lists, and support for 64 bit PCI.
+Special handling is required only for cases where i/o needs to happen on
+pages at physical memory addresses beyond what the device can support. In these
+cases, a bounce bio representing a buffer from the supported memory range
+is used for performing the i/o with copyin/copyout as needed depending on
+the type of the operation. For example, in case of a read operation, the
+data read has to be copied to the original buffer on i/o completion, so a
+callback routine is set up to do this, while for write, the data is copied
+from the original buffer to the bounce buffer prior to issuing the
+operation. Since an original buffer may be in a high memory area that's not
+mapped in kernel virtual addr, a kmap operation may be required for
+performing the copy, and special care may be needed in the completion path
+as it may not be in irq context. Special care is also required (by way of
+GFP flags) when allocating bounce buffers, to avoid certain highmem
+deadlock possibilities.
+It is also possible that a bounce buffer may be allocated from high-memory
+area that's not mapped in kernel virtual addr, but within the range that the
+device can use directly; so the bounce page may need to be kmapped during
+copy operations. [Note: This does not hold in the current implementation,
+There are some situations when pages from high memory may need to
+be kmapped, even if bounce buffers are not necessary. For example a device
+may need to abort DMA operations and revert to PIO for the transfer, in
+which case a virtual mapping of the page is required. For SCSI it is also
+done in some scenarios where the low level driver cannot be trusted to
+handle a single sg entry correctly. The driver is expected to perform the
+kmaps as needed on such occasions using the __bio_kmap_atomic and bio_kmap_irq
+routines as appropriate. A driver could also use the blk_queue_bounce()
+routine on its own to bounce highmem i/o to low memory for specific requests
+if so desired.
+iii. The i/o scheduler algorithm itself can be replaced/set as appropriate
+As in 2.4, it is possible to plugin a brand new i/o scheduler for a particular
+queue or pick from (copy) existing generic schedulers and replace/override
+certain portions of it. The 2.5 rewrite provides improved modularization
+of the i/o scheduler. There are more pluggable callbacks, e.g for init,
+add request, extract request, which makes it possible to abstract specific
+i/o scheduling algorithm aspects and details outside of the generic loop.
+It also makes it possible to completely hide the implementation details of
+the i/o scheduler from block drivers.
+I/O scheduler wrappers are to be used instead of accessing the queue directly.
+See section 4. The I/O scheduler for details.
+1.2 Tuning Based on High level code capabilities
+i. Application capabilities for raw i/o
+This comes from some of the high-performance database/middleware
+requirements where an application prefers to make its own i/o scheduling
+decisions based on an understanding of the access patterns and i/o
+ii. High performance filesystems or other higher level kernel code's
+Kernel components like filesystems could also take their own i/o scheduling
+decisions for optimizing performance. Journalling filesystems may need
+some control over i/o ordering.
+What kind of support exists at the generic block layer for this ?
+The flags and rw fields in the bio structure can be used for some tuning
+from above e.g indicating that an i/o is just a readahead request, or for
+marking barrier requests (discussed next), or priority settings (currently
+unused). As far as user applications are concerned they would need an
+additional mechanism either via open flags or ioctls, or some other upper
+level mechanism to communicate such settings to block.
+1.2.1 I/O Barriers
+There is a way to enforce strict ordering for i/os through barriers.
+All requests before a barrier point must be serviced before the barrier
+request and any other requests arriving after the barrier will not be
+serviced until after the barrier has completed. This is useful for higher
+level control on write ordering, e.g flushing a log of committed updates
+to disk before the corresponding updates themselves.
+A flag in the bio structure, BIO_BARRIER is used to identify a barrier i/o.
+The generic i/o scheduler would make sure that it places the barrier request and
+all other requests coming after it after all the previous requests in the
+queue. Barriers may be implemented in different ways depending on the
+driver. A SCSI driver for example could make use of ordered tags to
+preserve the necessary ordering with a lower impact on throughput. For IDE
+this might be two sync cache flush: a pre and post flush when encountering
+a barrier write.
+There is a provision for queues to indicate what kind of barriers they
+can provide. This is as of yet unmerged, details will be added here once it
+is in the kernel.
+1.2.2 Request Priority/Latency
+Todo/Under discussion:
+Arjan's proposed request priority scheme allows higher levels some broad
+ control (high/med/low) over the priority of an i/o request vs other pending
+ requests in the queue. For example it allows reads for bringing in an
+ executable page on demand to be given a higher priority over pending write
+ requests which haven't aged too much on the queue. Potentially this priority
+ could even be exposed to applications in some manner, providing higher level
+ tunability. Time based aging avoids starvation of lower priority
+ requests. Some bits in the bi_rw flags field in the bio structure are
+ intended to be used for this priority information.
+1.3 Direct Access to Low level Device/Driver Capabilities (Bypass mode)
+ (e.g Diagnostics, Systems Management)
+There are situations where high-level code needs to have direct access to
+the low level device capabilities or requires the ability to issue commands
+to the device bypassing some of the intermediate i/o layers.
+These could, for example, be special control commands issued through ioctl
+interfaces, or could be raw read/write commands that stress the drive's
+capabilities for certain kinds of fitness tests. Having direct interfaces at
+multiple levels without having to pass through upper layers makes
+it possible to perform bottom up validation of the i/o path, layer by
+layer, starting from the media.
+The normal i/o submission interfaces, e.g submit_bio, could be bypassed
+for specially crafted requests which such ioctl or diagnostics
+interfaces would typically use, and the elevator add_request routine
+can instead be used to directly insert such requests in the queue or preferably
+the blk_do_rq routine can be used to place the request on the queue and
+wait for completion. Alternatively, sometimes the caller might just
+invoke a lower level driver specific interface with the request as a
+If the request is a means for passing on special information associated with
+the command, then such information is associated with the request->special
+field (rather than misuse the request->buffer field which is meant for the
+request data buffer's virtual mapping).
+For passing request data, the caller must build up a bio descriptor
+representing the concerned memory buffer if the underlying driver interprets
+bio segments or uses the block layer end*request* functions for i/o
+completion. Alternatively one could directly use the request->buffer field to
+specify the virtual address of the buffer, if the driver expects buffer
+addresses passed in this way and ignores bio entries for the request type
+involved. In the latter case, the driver would modify and manage the
+request->buffer, request->sector and request->nr_sectors or
+request->current_nr_sectors fields itself rather than using the block layer
+end_request or end_that_request_first completion interfaces.
+(See 2.3 or Documentation/block/request.txt for a brief explanation of
+the request structure fields)
+[TBD: end_that_request_last should be usable even in this case;
+Perhaps an end_that_direct_request_first routine could be implemented to make
+handling direct requests easier for such drivers; Also for drivers that
+expect bios, a helper function could be provided for setting up a bio
+corresponding to a data buffer]
+<JENS: I dont understand the above, why is end_that_request_first() not
+usable? Or _last for that matter. I must be missing something>
+<SUP: What I meant here was that if the request doesn't have a bio, then
+ end_that_request_first doesn't modify nr_sectors or current_nr_sectors,
+ and hence can't be used for advancing request state settings on the
+ completion of partial transfers. The driver has to modify these fields
+ directly by hand.
+ This is because end_that_request_first only iterates over the bio list,
+ and always returns 0 if there are none associated with the request.
+ _last works OK in this case, and is not a problem, as I mentioned earlier
+1.3.1 Pre-built Commands
+A request can be created with a pre-built custom command to be sent directly
+to the device. The cmd block in the request structure has room for filling
+in the command bytes. (i.e rq->cmd is now 16 bytes in size, and meant for
+command pre-building, and the type of the request is now indicated
+through rq->flags instead of via rq->cmd)
+The request structure flags can be set up to indicate the type of request
+in such cases (REQ_PC: direct packet command passed to driver, REQ_BLOCK_PC:
+packet command issued via blk_do_rq, REQ_SPECIAL: special request).
+It can help to pre-build device commands for requests in advance.
+Drivers can now specify a request prepare function (q->prep_rq_fn) that the
+block layer would invoke to pre-build device commands for a given request,
+or perform other preparatory processing for the request. This is routine is
+called by elv_next_request(), i.e. typically just before servicing a request.
+(The prepare function would not be called for requests that have REQ_DONTPREP
+ Pre-building could possibly even be done early, i.e before placing the
+ request on the queue, rather than construct the command on the fly in the
+ driver while servicing the request queue when it may affect latencies in
+ interrupt context or responsiveness in general. One way to add early
+ pre-building would be to do it whenever we fail to merge on a request.
+ Now REQ_NOMERGE is set in the request flags to skip this one in the future,
+ which means that it will not change before we feed it to the device. So
+ the pre-builder hook can be invoked there.
+2. Flexible and generic but minimalist i/o structure/descriptor.
+2.1 Reason for a new structure and requirements addressed
+Prior to 2.5, buffer heads were used as the unit of i/o at the generic block
+layer, and the low level request structure was associated with a chain of
+buffer heads for a contiguous i/o request. This led to certain inefficiencies
+when it came to large i/o requests and readv/writev style operations, as it
+forced such requests to be broken up into small chunks before being passed
+on to the generic block layer, only to be merged by the i/o scheduler
+when the underlying device was capable of handling the i/o in one shot.
+Also, using the buffer head as an i/o structure for i/os that didn't originate
+from the buffer cache unecessarily added to the weight of the descriptors
+which were generated for each such chunk.
+The following were some of the goals and expectations considered in the
+redesign of the block i/o data structure in 2.5.
+i. Should be appropriate as a descriptor for both raw and buffered i/o -
+ avoid cache related fields which are irrelevant in the direct/page i/o path,
+ or filesystem block size alignment restrictions which may not be relevant
+ for raw i/o.
+ii. Ability to represent high-memory buffers (which do not have a virtual
+ address mapping in kernel address space).
+iii.Ability to represent large i/os w/o unecessarily breaking them up (i.e
+ greater than PAGE_SIZE chunks in one shot)
+iv. At the same time, ability to retain independent identity of i/os from
+ different sources or i/o units requiring individual completion (e.g. for
+ latency reasons)
+v. Ability to represent an i/o involving multiple physical memory segments
+ (including non-page aligned page fragments, as specified via readv/writev)
+ without unecessarily breaking it up, if the underlying device is capable of
+ handling it.
+vi. Preferably should be based on a memory descriptor structure that can be
+ passed around different types of subsystems or layers, maybe even
+ networking, without duplication or extra copies of data/descriptor fields
+ themselves in the process
+vii.Ability to handle the possibility of splits/merges as the structure passes
+ through layered drivers (lvm, md, evms), with minimal overhead.
+The solution was to define a new structure (bio) for the block layer,
+instead of using the buffer head structure (bh) directly, the idea being
+avoidance of some associated baggage and limitations. The bio structure
+is uniformly used for all i/o at the block layer ; it forms a part of the
+bh structure for buffered i/o, and in the case of raw/direct i/o kiobufs are
+mapped to bio structures.
+2.2 The bio struct
+The bio structure uses a vector representation pointing to an array of tuples
+of <page, offset, len> to describe the i/o buffer, and has various other
+fields describing i/o parameters and state that needs to be maintained for
+performing the i/o.
+Notice that this representation means that a bio has no virtual address
+mapping at all (unlike buffer heads).
+struct bio_vec {
+ struct page *bv_page;
+ unsigned short bv_len;
+ unsigned short bv_offset;
+ * main unit of I/O for the block layer and lower layers (ie drivers)
+ */
+struct bio {
+ sector_t bi_sector;
+ struct bio *bi_next; /* request queue link */
+ struct block_device *bi_bdev; /* target device */
+ unsigned long bi_flags; /* status, command, etc */
+ unsigned long bi_rw; /* low bits: r/w, high: priority */
+ unsigned int bi_vcnt; /* how may bio_vec's */
+ unsigned int bi_idx; /* current index into bio_vec array */
+ unsigned int bi_size; /* total size in bytes */
+ unsigned short bi_phys_segments; /* segments after physaddr coalesce*/
+ unsigned short bi_hw_segments; /* segments after DMA remapping */
+ unsigned int bi_max; /* max bio_vecs we can hold
+ used as index into pool */
+ struct bio_vec *bi_io_vec; /* the actual vec list */
+ bio_end_io_t *bi_end_io; /* bi_end_io (bio) */
+ atomic_t bi_cnt; /* pin count: free when it hits zero */
+ void *bi_private;
+ bio_destructor_t *bi_destructor; /* bi_destructor (bio) */
+With this multipage bio design:
+- Large i/os can be sent down in one go using a bio_vec list consisting
+ of an array of <page, offset, len> fragments (similar to the way fragments
+ are represented in the zero-copy network code)
+- Splitting of an i/o request across multiple devices (as in the case of
+ lvm or raid) is achieved by cloning the bio (where the clone points to
+ the same bi_io_vec array, but with the index and size accordingly modified)
+- A linked list of bios is used as before for unrelated merges (*) - this
+ avoids reallocs and makes independent completions easier to handle.
+- Code that traverses the req list needs to make a distinction between
+ segments of a request (bio_for_each_segment) and the distinct completion
+ units/bios (rq_for_each_bio).
+- Drivers which can't process a large bio in one shot can use the bi_idx
+ field to keep track of the next bio_vec entry to process.
+ (e.g a 1MB bio_vec needs to be handled in max 128kB chunks for IDE)
+ [TBD: Should preferably also have a bi_voffset and bi_vlen to avoid modifying
+ bi_offset an len fields]
+(*) unrelated merges -- a request ends up containing two or more bios that
+ didn't originate from the same place.
+bi_end_io() i/o callback gets called on i/o completion of the entire bio.
+At a lower level, drivers build a scatter gather list from the merged bios.
+The scatter gather list is in the form of an array of <page, offset, len>
+entries with their corresponding dma address mappings filled in at the
+appropriate time. As an optimization, contiguous physical pages can be
+covered by a single entry where <page> refers to the first page and <len>
+covers the range of pages (upto 16 contiguous pages could be covered this
+way). There is a helper routine (blk_rq_map_sg) which drivers can use to build
+the sg list.
+Note: Right now the only user of bios with more than one page is ll_rw_kio,
+which in turn means that only raw I/O uses it (direct i/o may not work
+right now). The intent however is to enable clustering of pages etc to
+become possible. The pagebuf abstraction layer from SGI also uses multi-page
+bios, but that is currently not included in the stock development kernels.
+The same is true of Andrew Morton's work-in-progress multipage bio writeout
+and readahead patches.
+2.3 Changes in the Request Structure
+The request structure is the structure that gets passed down to low level
+drivers. The block layer make_request function builds up a request structure,
+places it on the queue and invokes the drivers request_fn. The driver makes
+use of block layer helper routine elv_next_request to pull the next request
+off the queue. Control or diagnostic functions might bypass block and directly
+invoke underlying driver entry points passing in a specially constructed
+request structure.
+Only some relevant fields (mainly those which changed or may be referred
+to in some of the discussion here) are listed below, not necessarily in
+the order in which they occur in the structure (see include/linux/blkdev.h)
+Refer to Documentation/block/request.txt for details about all the request
+structure fields and a quick reference about the layers which are
+supposed to use or modify those fields.
+struct request {
+ struct list_head queuelist; /* Not meant to be directly accessed by
+ the driver.
+ Used by q->elv_next_request_fn
+ rq->queue is gone
+ */
+ .
+ .
+ unsigned char cmd[16]; /* prebuilt command data block */
+ unsigned long flags; /* also includes earlier rq->cmd settings */
+ .
+ .
+ sector_t sector; /* this field is now of type sector_t instead of int
+ preparation for 64 bit sectors */
+ .
+ .
+ /* Number of scatter-gather DMA addr+len pairs after
+ * physical address coalescing is performed.
+ */
+ unsigned short nr_phys_segments;
+ /* Number of scatter-gather addr+len pairs after
+ * physical and DMA remapping hardware coalescing is performed.
+ * This is the number of scatter-gather entries the driver
+ * will actually have to deal with after DMA mapping is done.
+ */
+ unsigned short nr_hw_segments;
+ /* Various sector counts */
+ unsigned long nr_sectors; /* no. of sectors left: driver modifiable */
+ unsigned long hard_nr_sectors; /* block internal copy of above */
+ unsigned int current_nr_sectors; /* no. of sectors left in the
+ current segment:driver modifiable */
+ unsigned long hard_cur_sectors; /* block internal copy of the above */
+ .
+ .
+ int tag; /* command tag associated with request */
+ void *special; /* same as before */
+ char *buffer; /* valid only for low memory buffers upto
+ current_nr_sectors */
+ .
+ .
+ struct bio *bio, *biotail; /* bio list instead of bh */
+ struct request_list *rl;
+See the rq_flag_bits definitions for an explanation of the various flags
+available. Some bits are used by the block layer or i/o scheduler.
+The behaviour of the various sector counts are almost the same as before,
+except that since we have multi-segment bios, current_nr_sectors refers
+to the numbers of sectors in the current segment being processed which could
+be one of the many segments in the current bio (i.e i/o completion unit).
+The nr_sectors value refers to the total number of sectors in the whole
+request that remain to be transferred (no change). The purpose of the
+hard_xxx values is for block to remember these counts every time it hands
+over the request to the driver. These values are updated by block on
+end_that_request_first, i.e. every time the driver completes a part of the
+transfer and invokes block end*request helpers to mark this. The
+driver should not modify these values. The block layer sets up the
+nr_sectors and current_nr_sectors fields (based on the corresponding
+hard_xxx values and the number of bytes transferred) and updates it on
+every transfer that invokes end_that_request_first. It does the same for the
+buffer, bio, bio->bi_idx fields too.
+The buffer field is just a virtual address mapping of the current segment
+of the i/o buffer in cases where the buffer resides in low-memory. For high
+memory i/o, this field is not valid and must not be used by drivers.
+Code that sets up its own request structures and passes them down to
+a driver needs to be careful about interoperation with the block layer helper
+functions which the driver uses. (Section 1.3)
+3. Using bios
+3.1 Setup/Teardown
+There are routines for managing the allocation, and reference counting, and
+freeing of bios (bio_alloc, bio_get, bio_put).
+This makes use of Ingo Molnar's mempool implementation, which enables
+subsystems like bio to maintain their own reserve memory pools for guaranteed
+deadlock-free allocations during extreme VM load. For example, the VM
+subsystem makes use of the block layer to writeout dirty pages in order to be
+able to free up memory space, a case which needs careful handling. The
+allocation logic draws from the preallocated emergency reserve in situations
+where it cannot allocate through normal means. If the pool is empty and it
+can wait, then it would trigger action that would help free up memory or
+replenish the pool (without deadlocking) and wait for availability in the pool.
+If it is in IRQ context, and hence not in a position to do this, allocation
+could fail if the pool is empty. In general mempool always first tries to
+perform allocation without having to wait, even if it means digging into the
+pool as long it is not less that 50% full.
+On a free, memory is released to the pool or directly freed depending on
+the current availability in the pool. The mempool interface lets the
+subsystem specify the routines to be used for normal alloc and free. In the
+case of bio, these routines make use of the standard slab allocator.
+The caller of bio_alloc is expected to taken certain steps to avoid
+deadlocks, e.g. avoid trying to allocate more memory from the pool while
+already holding memory obtained from the pool.
+[TBD: This is a potential issue, though a rare possibility
+ in the bounce bio allocation that happens in the current code, since
+ it ends up allocating a second bio from the same pool while
+ holding the original bio ]
+Memory allocated from the pool should be released back within a limited
+amount of time (in the case of bio, that would be after the i/o is completed).
+This ensures that if part of the pool has been used up, some work (in this
+case i/o) must already be in progress and memory would be available when it
+is over. If allocating from multiple pools in the same code path, the order
+or hierarchy of allocation needs to be consistent, just the way one deals
+with multiple locks.
+The bio_alloc routine also needs to allocate the bio_vec_list (bvec_alloc())
+for a non-clone bio. There are the 6 pools setup for different size biovecs,
+so bio_alloc(gfp_mask, nr_iovecs) will allocate a vec_list of the
+given size from these slabs.
+The bi_destructor() routine takes into account the possibility of the bio
+having originated from a different source (see later discussions on
+n/w to block transfers and kvec_cb)
+The bio_get() routine may be used to hold an extra reference on a bio prior
+to i/o submission, if the bio fields are likely to be accessed after the
+i/o is issued (since the bio may otherwise get freed in case i/o completion
+happens in the meantime).
+The bio_clone() routine may be used to duplicate a bio, where the clone
+shares the bio_vec_list with the original bio (i.e. both point to the
+same bio_vec_list). This would typically be used for splitting i/o requests
+in lvm or md.
+3.2 Generic bio helper Routines
+3.2.1 Traversing segments and completion units in a request
+The macros bio_for_each_segment() and rq_for_each_bio() should be used for
+traversing the bios in the request list (drivers should avoid directly
+trying to do it themselves). Using these helpers should also make it easier
+to cope with block changes in the future.
+ rq_for_each_bio(bio, rq)
+ bio_for_each_segment(bio_vec, bio, i)
+ /* bio_vec is now current segment */
+I/O completion callbacks are per-bio rather than per-segment, so drivers
+that traverse bio chains on completion need to keep that in mind. Drivers
+which don't make a distinction between segments and completion units would
+need to be reorganized to support multi-segment bios.
+3.2.2 Setting up DMA scatterlists
+The blk_rq_map_sg() helper routine would be used for setting up scatter
+gather lists from a request, so a driver need not do it on its own.
+ nr_segments = blk_rq_map_sg(q, rq, scatterlist);
+The helper routine provides a level of abstraction which makes it easier
+to modify the internals of request to scatterlist conversion down the line
+without breaking drivers. The blk_rq_map_sg routine takes care of several
+things like collapsing physically contiguous segments (if QUEUE_FLAG_CLUSTER
+is set) and correct segment accounting to avoid exceeding the limits which
+the i/o hardware can handle, based on various queue properties.
+- Prevents a clustered segment from crossing a 4GB mem boundary
+- Avoids building segments that would exceed the number of physical
+ memory segments that the driver can handle (phys_segments) and the
+ number that the underlying hardware can handle at once, accounting for
+ DMA remapping (hw_segments) (i.e. IOMMU aware limits).
+Routines which the low level driver can use to set up the segment limits:
+blk_queue_max_hw_segments() : Sets an upper limit of the maximum number of
+hw data segments in a request (i.e. the maximum number of address/length
+pairs the host adapter can actually hand to the device at once)
+blk_queue_max_phys_segments() : Sets an upper limit on the maximum number
+of physical data segments in a request (i.e. the largest sized scatter list
+a driver could handle)
+3.2.3 I/O completion
+The existing generic block layer helper routines end_request,
+end_that_request_first and end_that_request_last can be used for i/o
+completion (and setting things up so the rest of the i/o or the next
+request can be kicked of) as before. With the introduction of multi-page
+bio support, end_that_request_first requires an additional argument indicating
+the number of sectors completed.
+3.2.4 Implications for drivers that do not interpret bios (don't handle
+ multiple segments)
+Drivers that do not interpret bios e.g those which do not handle multiple
+segments and do not support i/o into high memory addresses (require bounce
+buffers) and expect only virtually mapped buffers, can access the rq->buffer
+field. As before the driver should use current_nr_sectors to determine the
+size of remaining data in the current segment (that is the maximum it can
+transfer in one go unless it interprets segments), and rely on the block layer
+end_request, or end_that_request_first/last to take care of all accounting
+and transparent mapping of the next bio segment when a segment boundary
+is crossed on completion of a transfer. (The end*request* functions should
+be used if only if the request has come down from block/bio path, not for
+direct access requests which only specify rq->buffer without a valid rq->bio)
+3.2.5 Generic request command tagging
+ Tag helpers
+Block now offers some simple generic functionality to help support command
+queueing (typically known as tagged command queueing), ie manage more than
+one outstanding command on a queue at any given time.
+ blk_queue_init_tags(request_queue_t *q, int depth)
+ Initialize internal command tagging structures for a maximum
+ depth of 'depth'.
+ blk_queue_free_tags((request_queue_t *q)
+ Teardown tag info associated with the queue. This will be done
+ automatically by block if blk_queue_cleanup() is called on a queue
+ that is using tagging.
+The above are initialization and exit management, the main helpers during
+normal operations are:
+ blk_queue_start_tag(request_queue_t *q, struct request *rq)
+ Start tagged operation for this request. A free tag number between
+ 0 and 'depth' is assigned to the request (rq->tag holds this number),
+ and 'rq' is added to the internal tag management. If the maximum depth
+ for this queue is already achieved (or if the tag wasn't started for
+ some other reason), 1 is returned. Otherwise 0 is returned.
+ blk_queue_end_tag(request_queue_t *q, struct request *rq)
+ End tagged operation on this request. 'rq' is removed from the internal
+ book keeping structures.
+To minimize struct request and queue overhead, the tag helpers utilize some
+of the same request members that are used for normal request queue management.
+This means that a request cannot both be an active tag and be on the queue
+list at the same time. blk_queue_start_tag() will remove the request, but
+the driver must remember to call blk_queue_end_tag() before signalling
+completion of the request to the block layer. This means ending tag
+operations before calling end_that_request_last()! For an example of a user
+of these helpers, see the IDE tagged command queueing support.
+Certain hardware conditions may dictate a need to invalidate the block tag
+queue. For instance, on IDE any tagged request error needs to clear both
+the hardware and software block queue and enable the driver to sanely restart
+all the outstanding requests. There's a third helper to do that:
+ blk_queue_invalidate_tags(request_queue_t *q)
+ Clear the internal block tag queue and readd all the pending requests
+ to the request queue. The driver will receive them again on the
+ next request_fn run, just like it did the first time it encountered
+ them.
+ Tag info
+Some block functions exist to query current tag status or to go from a
+tag number to the associated request. These are, in no particular order:
+ blk_queue_tagged(q)
+ Returns 1 if the queue 'q' is using tagging, 0 if not.
+ blk_queue_tag_request(q, tag)
+ Returns a pointer to the request associated with tag 'tag'.
+ blk_queue_tag_depth(q)
+ Return current queue depth.
+ blk_queue_tag_queue(q)
+ Returns 1 if the queue can accept a new queued command, 0 if we are
+ at the maximum depth already.
+ blk_queue_rq_tagged(rq)
+ Returns 1 if the request 'rq' is tagged.
+ Internal structure
+Internally, block manages tags in the blk_queue_tag structure:
+ struct blk_queue_tag {
+ struct request **tag_index; /* array or pointers to rq */
+ unsigned long *tag_map; /* bitmap of free tags */
+ struct list_head busy_list; /* fifo list of busy tags */
+ int busy; /* queue depth */
+ int max_depth; /* max queue depth */
+ };
+Most of the above is simple and straight forward, however busy_list may need
+a bit of explaining. Normally we don't care too much about request ordering,
+but in the event of any barrier requests in the tag queue we need to ensure
+that requests are restarted in the order they were queue. This may happen
+if the driver needs to use blk_queue_invalidate_tags().
+Tagging also defines a new request flag, REQ_QUEUED. This is set whenever
+a request is currently tagged. You should not use this flag directly,
+blk_rq_tagged(rq) is the portable way to do so.
+3.3 I/O Submission
+The routine submit_bio() is used to submit a single io. Higher level i/o
+routines make use of this:
+(a) Buffered i/o:
+The routine submit_bh() invokes submit_bio() on a bio corresponding to the
+bh, allocating the bio if required. ll_rw_block() uses submit_bh() as before.
+(b) Kiobuf i/o (for raw/direct i/o):
+The ll_rw_kio() routine breaks up the kiobuf into page sized chunks and
+maps the array to one or more multi-page bios, issuing submit_bio() to
+perform the i/o on each of these.
+The embedded bh array in the kiobuf structure has been removed and no
+preallocation of bios is done for kiobufs. [The intent is to remove the
+blocks array as well, but it's currently in there to kludge around direct i/o.]
+Thus kiobuf allocation has switched back to using kmalloc rather than vmalloc.
+ A single kiobuf structure is assumed to correspond to a contiguous range
+ of data, so brw_kiovec() invokes ll_rw_kio for each kiobuf in a kiovec.
+ So right now it wouldn't work for direct i/o on non-contiguous blocks.
+ This is to be resolved. The eventual direction is to replace kiobuf
+ by kvec's.
+ Badari Pulavarty has a patch to implement direct i/o correctly using
+ bio and kvec.
+(c) Page i/o:
+Todo/Under discussion:
+ Andrew Morton's multi-page bio patches attempt to issue multi-page
+ writeouts (and reads) from the page cache, by directly building up
+ large bios for submission completely bypassing the usage of buffer
+ heads. This work is still in progress.
+ Christoph Hellwig had some code that uses bios for page-io (rather than
+ bh). This isn't included in bio as yet. Christoph was also working on a
+ design for representing virtual/real extents as an entity and modifying
+ some of the address space ops interfaces to utilize this abstraction rather
+ than buffer_heads. (This is somewhat along the lines of the SGI XFS pagebuf
+ abstraction, but intended to be as lightweight as possible).
+(d) Direct access i/o:
+Direct access requests that do not contain bios would be submitted differently
+as discussed earlier in section 1.3.
+ Kvec i/o:
+ Ben LaHaise's aio code uses a slighly different structure instead
+ of kiobufs, called a kvec_cb. This contains an array of <page, offset, len>
+ tuples (very much like the networking code), together with a callback function
+ and data pointer. This is embedded into a brw_cb structure when passed
+ to brw_kvec_async().
+ Now it should be possible to directly map these kvecs to a bio. Just as while
+ cloning, in this case rather than PRE_BUILT bio_vecs, we set the bi_io_vec
+ array pointer to point to the veclet array in kvecs.
+ TBD: In order for this to work, some changes are needed in the way multi-page
+ bios are handled today. The values of the tuples in such a vector passed in
+ from higher level code should not be modified by the block layer in the course
+ of its request processing, since that would make it hard for the higher layer
+ to continue to use the vector descriptor (kvec) after i/o completes. Instead,
+ all such transient state should either be maintained in the request structure,
+ and passed on in some way to the endio completion routine.
+4. The I/O scheduler
+I/O schedulers are now per queue. They should be runtime switchable and modular
+but aren't yet. Jens has most bits to do this, but the sysfs implementation is
+A block layer call to the i/o scheduler follows the convention elv_xxx(). This
+calls elevator_xxx_fn in the elevator switch (drivers/block/elevator.c). Oh,
+xxx and xxx might not match exactly, but use your imagination. If an elevator
+doesn't implement a function, the switch does nothing or some minimal house
+keeping work.
+4.1. I/O scheduler API
+The functions an elevator may implement are: (* are mandatory)
+elevator_merge_fn called to query requests for merge with a bio
+elevator_merge_req_fn " " " with another request
+elevator_merged_fn called when a request in the scheduler has been
+ involved in a merge. It is used in the deadline
+ scheduler for example, to reposition the request
+ if its sorting order has changed.
+*elevator_next_req_fn returns the next scheduled request, or NULL
+ if there are none (or none are ready).
+*elevator_add_req_fn called to add a new request into the scheduler
+elevator_queue_empty_fn returns true if the merge queue is empty.
+ Drivers shouldn't use this, but rather check
+ if elv_next_request is NULL (without losing the
+ request if one exists!)
+elevator_remove_req_fn This is called when a driver claims ownership of
+ the target request - it now belongs to the
+ driver. It must not be modified or merged.
+ Drivers must not lose the request! A subsequent
+ call of elevator_next_req_fn must return the
+ _next_ request.
+elevator_requeue_req_fn called to add a request to the scheduler. This
+ is used when the request has alrnadebeen
+ returned by elv_next_request, but hasn't
+ completed. If this is not implemented then
+ elevator_add_req_fn is called instead.
+elevator_latter_req_fn These return the request before or after the
+ one specified in disk sort order. Used by the
+ block layer to find merge possibilities.
+elevator_completed_req_fn called when a request is completed. This might
+ come about due to being merged with another or
+ when the device completes the request.
+elevator_may_queue_fn returns true if the scheduler wants to allow the
+ current context to queue a new request even if
+ it is over the queue limit. This must be used
+ very carefully!!
+elevator_put_req_fn Must be used to allocate and free any elevator
+ specific storate for a request.
+elevator_exit_fn Allocate and free any elevator specific storage
+ for a queue.
+4.2 I/O scheduler implementation
+The generic i/o scheduler algorithm attempts to sort/merge/batch requests for
+optimal disk scan and request servicing performance (based on generic
+principles and device capabilities), optimized for:
+i. improved throughput
+ii. improved latency
+iii. better utilization of h/w & CPU time
+i. Binary tree
+AS and deadline i/o schedulers use red black binary trees for disk position
+sorting and searching, and a fifo linked list for time-based searching. This
+gives good scalability and good availablility of information. Requests are
+almost always dispatched in disk sort order, so a cache is kept of the next
+request in sort order to prevent binary tree lookups.
+This arrangement is not a generic block layer characteristic however, so
+elevators may implement queues as they please.
+ii. Last merge hint
+The last merge hint is part of the generic queue layer. I/O schedulers must do
+some management on it. For the most part, the most important thing is to make
+sure q->last_merge is cleared (set to NULL) when the request on it is no longer
+a candidate for merging (for example if it has been sent to the driver).
+The last merge performed is cached as a hint for the subsequent request. If
+sequential data is being submitted, the hint is used to perform merges without
+any scanning. This is not sufficient when there are multiple processes doing
+I/O though, so a "merge hash" is used by some schedulers.
+iii. Merge hash
+AS and deadline use a hash table indexed by the last sector of a request. This
+enables merging code to quickly look up "back merge" candidates, even when
+multiple I/O streams are being performed at once on one disk.
+"Front merges", a new request being merged at the front of an existing request,
+are far less common than "back merges" due to the nature of most I/O patterns.
+Front merges are handled by the binary trees in AS and deadline schedulers.
+iv. Handling barrier cases
+A request with flags REQ_HARDBARRIER or REQ_SOFTBARRIER must not be ordered
+around. That is, they must be processed after all older requests, and before
+any newer ones. This includes merges!
+In AS and deadline schedulers, barriers have the effect of flushing the reorder
+queue. The performance cost of this will vary from nothing to a lot depending
+on i/o patterns and device characteristics. Obviously they won't improve
+performance, so their use should be kept to a minimum.
+v. Handling insertion position directives
+A request may be inserted with a position directive. The directives are one of
+ELEVATOR_INSERT_SORT is a general directive for non-barrier requests.
+ELEVATOR_INSERT_BACK is used to insert a barrier to the back of the queue.
+ELEVATOR_INSERT_FRONT is used to insert a barrier to the front of the queue, and
+overrides the ordering requested by any previous barriers. In practice this is
+harmless and required, because it is used for SCSI requeueing. This does not
+require flushing the reorder queue, so does not impose a performance penalty.
+vi. Plugging the queue to batch requests in anticipation of opportunities for
+ merge/sort optimizations
+This is just the same as in 2.4 so far, though per-device unplugging
+support is anticipated for 2.5. Also with a priority-based i/o scheduler,
+such decisions could be based on request priorities.
+Plugging is an approach that the current i/o scheduling algorithm resorts to so
+that it collects up enough requests in the queue to be able to take
+advantage of the sorting/merging logic in the elevator. If the
+queue is empty when a request comes in, then it plugs the request queue
+(sort of like plugging the bottom of a vessel to get fluid to build up)
+till it fills up with a few more requests, before starting to service
+the requests. This provides an opportunity to merge/sort the requests before
+passing them down to the device. There are various conditions when the queue is
+unplugged (to open up the flow again), either through a scheduled task or
+could be on demand. For example wait_on_buffer sets the unplugging going
+(by running tq_disk) so the read gets satisfied soon. So in the read case,
+the queue gets explicitly unplugged as part of waiting for completion,
+in fact all queues get unplugged as a side-effect.
+ This is kind of controversial territory, as it's not clear if plugging is
+ always the right thing to do. Devices typically have their own queues,
+ and allowing a big queue to build up in software, while letting the device be
+ idle for a while may not always make sense. The trick is to handle the fine
+ balance between when to plug and when to open up. Also now that we have
+ multi-page bios being queued in one shot, we may not need to wait to merge
+ a big request from the broken up pieces coming by.
+ Per-queue granularity unplugging (still a Todo) may help reduce some of the
+ concerns with just a single tq_disk flush approach. Something like
+ blk_kick_queue() to unplug a specific queue (right away ?)
+ or optionally, all queues, is in the plan.
+4.3 I/O contexts
+I/O contexts provide a dynamically allocated per process data area. They may
+be used in I/O schedulers, and in the block layer (could be used for IO statis,
+priorities for example). See *io_context in drivers/block/ll_rw_blk.c, and
+as-iosched.c for an example of usage in an i/o scheduler.
+5. Scalability related changes
+5.1 Granular Locking: io_request_lock replaced by a per-queue lock
+The global io_request_lock has been removed as of 2.5, to avoid
+the scalability bottleneck it was causing, and has been replaced by more
+granular locking. The request queue structure has a pointer to the
+lock to be used for that queue. As a result, locking can now be
+per-queue, with a provision for sharing a lock across queues if
+necessary (e.g the scsi layer sets the queue lock pointers to the
+corresponding adapter lock, which results in a per host locking
+granularity). The locking semantics are the same, i.e. locking is
+still imposed by the block layer, grabbing the lock before
+request_fn execution which it means that lots of older drivers
+should still be SMP safe. Drivers are free to drop the queue
+lock themselves, if required. Drivers that explicitly used the
+io_request_lock for serialization need to be modified accordingly.
+Usually it's as easy as adding a global lock:
+ static spinlock_t my_driver_lock = SPIN_LOCK_UNLOCKED;
+and passing the address to that lock to blk_init_queue().
+5.2 64 bit sector numbers (sector_t prepares for 64 bit support)
+The sector number used in the bio structure has been changed to sector_t,
+which could be defined as 64 bit in preparation for 64 bit sector support.
+6. Other Changes/Implications
+6.1 Partition re-mapping handled by the generic block layer
+In 2.5 some of the gendisk/partition related code has been reorganized.
+Now the generic block layer performs partition-remapping early and thus
+provides drivers with a sector number relative to whole device, rather than
+having to take partition number into account in order to arrive at the true
+sector number. The routine blk_partition_remap() is invoked by
+generic_make_request even before invoking the queue specific make_request_fn,
+so the i/o scheduler also gets to operate on whole disk sector numbers. This
+should typically not require changes to block drivers, it just never gets
+to invoke its own partition sector offset calculations since all bios
+sent are offset from the beginning of the device.
+7. A Few Tips on Migration of older drivers
+Old-style drivers that just use CURRENT and ignores clustered requests,
+may not need much change. The generic layer will automatically handle
+clustered requests, multi-page bios, etc for the driver.
+For a low performance driver or hardware that is PIO driven or just doesn't
+support scatter-gather changes should be minimal too.
+The following are some points to keep in mind when converting old drivers
+to bio.
+Drivers should use elv_next_request to pick up requests and are no longer
+supposed to handle looping directly over the request list.
+(struct request->queue has been removed)
+Now end_that_request_first takes an additional number_of_sectors argument.
+It used to handle always just the first buffer_head in a request, now
+it will loop and handle as many sectors (on a bio-segment granularity)
+as specified.
+Now bh->b_end_io is replaced by bio->bi_end_io, but most of the time the
+right thing to use is bio_endio(bio, uptodate) instead.
+If the driver is dropping the io_request_lock from its request_fn strategy,
+then it just needs to replace that with q->queue_lock instead.
+As described in Sec 1.1, drivers can set max sector size, max segment size
+etc per queue now. Drivers that used to define their own merge functions i
+to handle things like this can now just use the blk_queue_* functions at
+blk_init_queue time.
+Drivers no longer have to map a {partition, sector offset} into the
+correct absolute location anymore, this is done by the block layer, so
+where a driver received a request ala this before:
+ rq->rq_dev = mk_kdev(3, 5); /* /dev/hda5 */
+ rq->sector = 0; /* first sector on hda5 */
+ it will now see
+ rq->rq_dev = mk_kdev(3, 0); /* /dev/hda */
+ rq->sector = 123128; /* offset from start of disk */
+As mentioned, there is no virtual mapping of a bio. For DMA, this is
+not a problem as the driver probably never will need a virtual mapping.
+Instead it needs a bus mapping (pci_map_page for a single segment or
+use blk_rq_map_sg for scatter gather) to be able to ship it to the driver. For
+PIO drivers (or drivers that need to revert to PIO transfer once in a
+while (IDE for example)), where the CPU is doing the actual data
+transfer a virtual mapping is needed. If the driver supports highmem I/O,
+(Sec 1.1, (ii) ) it needs to use __bio_kmap_atomic and bio_kmap_irq to
+temporarily map a bio into the virtual address space.
+8. Prior/Related/Impacted patches
+8.1. Earlier kiobuf patches (sct/axboe/chait/hch/mkp)
+- orig kiobuf & raw i/o patches (now in 2.4 tree)
+- direct kiobuf based i/o to devices (no intermediate bh's)
+- page i/o using kiobuf
+- kiobuf splitting for lvm (mkp)
+- elevator support for kiobuf request merging (axboe)
+8.2. Zero-copy networking (Dave Miller)
+8.3. SGI XFS - pagebuf patches - use of kiobufs
+8.4. Multi-page pioent patch for bio (Christoph Hellwig)
+8.5. Direct i/o implementation (Andrea Arcangeli) since 2.4.10-pre11
+8.6. Async i/o implementation patch (Ben LaHaise)
+8.7. EVMS layering design (IBM EVMS team)
+8.8. Larger page cache size patch (Ben LaHaise) and
+ Large page size (Daniel Phillips)
+ => larger contiguous physical memory buffers
+8.9. VM reservations patch (Ben LaHaise)
+8.10. Write clustering patches ? (Marcelo/Quintela/Riel ?)
+8.11. Block device in page cache patch (Andrea Archangeli) - now in 2.4.10+
+8.12. Multiple block-size transfers for faster raw i/o (Shailabh Nagar,
+ Badari)
+8.13 Priority based i/o scheduler - prepatches (Arjan van de Ven)
+8.14 IDE Taskfile i/o patch (Andre Hedrick)
+8.15 Multi-page writeout and readahead patches (Andrew Morton)
+8.16 Direct i/o patches for 2.5 using kvec and bio (Badari Pulavarthy)
+9. Other References:
+9.1 The Splice I/O Model - Larry McVoy (and subsequent discussions on lkml,
+and Linus' comments - Jan 2001)
+9.2 Discussions about kiobuf and bh design on lkml between sct, linus, alan
+et al - Feb-March 2001 (many of the initial thoughts that led to bio were
+brought up in this discusion thread)
+9.3 Discussions on mempool on lkml - Dec 2001.
diff --git a/Documentation/block/deadline-iosched.txt b/Documentation/block/deadline-iosched.txt
new file mode 100644
index 00000000000..c918b3a6022
--- /dev/null
+++ b/Documentation/block/deadline-iosched.txt
@@ -0,0 +1,78 @@
+Deadline IO scheduler tunables
+This little file attempts to document how the deadline io scheduler works.
+In particular, it will clarify the meaning of the exposed tunables that may be
+of interest to power users.
+Each io queue has a set of io scheduler tunables associated with it. These
+tunables control how the io scheduler works. You can find these entries
+assuming that you have sysfs mounted on /sys. If you don't have sysfs mounted,
+you can do so by typing:
+# mount none /sys -t sysfs
+read_expire (in ms)
+The goal of the deadline io scheduler is to attempt to guarentee a start
+service time for a request. As we focus mainly on read latencies, this is
+tunable. When a read request first enters the io scheduler, it is assigned
+a deadline that is the current time + the read_expire value in units of
+write_expire (in ms)
+Similar to read_expire mentioned above, but for writes.
+When a read request expires its deadline, we must move some requests from
+the sorted io scheduler list to the block device dispatch queue. fifo_batch
+controls how many requests we move, based on the cost of each request. A
+request is either qualified as a seek or a stream. The io scheduler knows
+the last request that was serviced by the drive (or will be serviced right
+before this one). See seek_cost and stream_unit.
+write_starved (number of dispatches)
+When we have to move requests from the io scheduler queue to the block
+device dispatch queue, we always give a preference to reads. However, we
+don't want to starve writes indefinitely either. So writes_starved controls
+how many times we give preference to reads over writes. When that has been
+done writes_starved number of times, we dispatch some writes based on the
+same criteria as reads.
+front_merges (bool)
+Sometimes it happens that a request enters the io scheduler that is contigious
+with a request that is already on the queue. Either it fits in the back of that
+request, or it fits at the front. That is called either a back merge candidate
+or a front merge candidate. Due to the way files are typically laid out,
+back merges are much more common than front merges. For some work loads, you
+may even know that it is a waste of time to spend any time attempting to
+front merge requests. Setting front_merges to 0 disables this functionality.
+Front merges may still occur due to the cached last_merge hint, but since
+that comes at basically 0 cost we leave that on. We simply disable the
+rbtree front sector lookup when the io scheduler merge function is called.
+Nov 11 2002, Jens Axboe <>
diff --git a/Documentation/block/request.txt b/Documentation/block/request.txt
new file mode 100644
index 00000000000..75924e2a697
--- /dev/null
+++ b/Documentation/block/request.txt
@@ -0,0 +1,88 @@
+struct request documentation
+Jens Axboe <> 27/05/02
+2.0 Struct request members classification
+ 2.1 struct request members explanation
+Short explanation of request members
+Classification flags:
+ D driver member
+ B block layer member
+ I I/O scheduler member
+Unless an entry contains a D classification, a device driver must not access
+this member. Some members may contain D classifications, but should only be
+access through certain macros or functions (eg ->flags).
+Member Flag Comment
+------ ---- -------
+struct list_head queuelist BI Organization on various internal
+ queues
+void *elevator_private I I/O scheduler private data
+unsigned char cmd[16] D Driver can use this for setting up
+ a cdb before execution, see
+ blk_queue_prep_rq
+unsigned long flags DBI Contains info about data direction,
+ request type, etc.
+int rq_status D Request status bits
+kdev_t rq_dev DBI Target device
+int errors DB Error counts
+sector_t sector DBI Target location
+unsigned long hard_nr_sectors B Used to keep sector sane
+unsigned long nr_sectors DBI Total number of sectors in request
+unsigned long hard_nr_sectors B Used to keep nr_sectors sane
+unsigned short nr_phys_segments DB Number of physical scatter gather
+ segments in a request
+unsigned short nr_hw_segments DB Number of hardware scatter gather
+ segments in a request
+unsigned int current_nr_sectors DB Number of sectors in first segment
+ of request
+unsigned int hard_cur_sectors B Used to keep current_nr_sectors sane
+int tag DB TCQ tag, if assigned
+void *special D Free to be used by driver
+char *buffer D Map of first segment, also see
+ section on bouncing SECTION
+struct completion *waiting D Can be used by driver to get signalled
+ on request completion
+struct bio *bio DBI First bio in request
+struct bio *biotail DBI Last bio in request
+request_queue_t *q DB Request queue this request belongs to
+struct request_list *rl B Request list this request came from