aboutsummaryrefslogtreecommitdiff
path: root/block
AgeCommit message (Collapse)Author
2007-04-30Merge branch 'cfq' into for-linusJens Axboe
2007-04-30[PATCH] elevator: elv_list_lock does not need irq disablingJens Axboe
It's never grabbed from irq context, so just make it plain spin_lock(). Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-04-30cfq-iosched: speedup cic rb lookupJens Axboe
We often lookup the same queue many times in succession, so cache the last looked up queue to avoid browsing the rbtree. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-04-30ll_rw_blk: add io_context private pointerJens Axboe
To be used by as/cfq as they see fit. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-04-30cfq-iosched: get rid of cfqq hashVasily Tarasov
cfq hash is no more necessary. We always can get cfqq from io context. cfq_get_io_context_noalloc() function is introduced, because we don't want to allocate cic on merging and checking may_queue. In order to identify sync queue we've used hash key = CFQ_KEY_ASYNC. Since hash is eliminated we need to use other criterion: sync flag for queue is added. In all places where we dig in rb_tree we're in current context, so no additional locking is required. Advantages of this patch: no additional memory for hash, no seeking in hash, code is cleaner. But it is necessary now to seek cic in per-ioc rbtree, but it is faster: - most processes work only with few devices - most systems have only few block devices - it is a rb-tree Signed-off-by: Vasily Tarasov <vtaras@openvz.org> Changes by me: - Merge into CFQ devel branch - Get rid of cfq_get_io_context_noalloc() - Fix various bugs with dereferencing cic->cfqq[] with offset other than 0 or 1. - Fix bug in cfqq setup, is_sync condition was reversed. - Fix bug where only bio_sync() is used, we need to check for a READ too Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-04-30cfq-iosched: tighten queue request overlap conditionJens Axboe
For tagged devices, allow overlap of requests if the idle window isn't enabled on the current active queue. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-04-30cfq-iosched: improve sync vs async workloadsJens Axboe
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-04-30cfq-iosched: never allow an async queue idlingJens Axboe
We don't enable it by default, don't let it get enabled during runtime. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-04-30cfq-iosched: get rid of ->dispatch_sliceJens Axboe
We can track it fairly accurately locally, let the slice handling take care of the rest. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-04-30cfq-iosched: don't pass unused preemption variable aroundJens Axboe
We don't use it anymore in the slice expiry handling. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-04-30cfq-iosched: get rid of ->cur_rr and ->cfq_listJens Axboe
It's only used for preemption now that the IDLE and RT queues also use the rbtree. If we pass an 'add_front' variable to cfq_service_tree_add(), we can set ->rb_key to 0 to force insertion at the front of the tree. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-04-30cfq-iosched: slice offset should take ioprio into accountJens Axboe
Use the max_slice-cur_slice as the multipler for the insertion offset. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-04-30[PATCH] cfq-iosched: style cleanups and commentsJens Axboe
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-04-30cfq-iosched: sort IDLE queues into the rbtreeJens Axboe
Same treatment as the RT conversion, just put the sorted idle branch at the end of the tree. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-04-30cfq-iosched: sort RT queues into the rbtreeJens Axboe
Currently CFQ does a linked insert into the current list for RT queues. We can just factor the class into the rb insertion, and then we don't have to treat RT queues in a special way. It's faster, too. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-04-30[PATCH] cfq-iosched: speed up rbtree handlingJens Axboe
For cases where the rbtree is mainly used for sorting and min retrieval, a nice speedup of the rbtree code is to maintain a cache of the leftmost node in the tree. Also spotted in the CFS CPU scheduler code. Improved by Alan D. Brunelle <Alan.Brunelle@hp.com> by updating the leftmost hint in cfq_rb_first() if it isn't set, instead of only updating it on insert. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-04-30cfq-iosched: rework the whole round-robin list conceptJens Axboe
Drawing on some inspiration from the CFS CPU scheduler design, overhaul the pending cfq_queue concept list management. Currently CFQ uses a doubly linked list per priority level for sorting and service uses. Kill those lists and maintain an rbtree of cfq_queue's, sorted by when to service them. This unfortunately means that the ionice levels aren't as strong anymore, will work on improving those later. We only scale the slice time now, not the number of times we service. This means that latency is better (for all priority levels), but that the distinction between the highest and lower levels aren't as big. The diffstat speaks for itself. cfq-iosched.c | 363 +++++++++++++++++--------------------------------- 1 file changed, 125 insertions(+), 238 deletions(-) Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-04-30cfq-iosched: minor updatesJens Axboe
- Move the queue_new flag clear to when the queue is selected - Only select the non-first queue in cfq_get_best_queue(), if there's a substantial difference between the best and first. - Get rid of ->busy_rr - Only select a close cooperator, if the current queue is known to take a while to "think". Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-04-30cfq-iosched: development updateJens Axboe
- Implement logic for detecting cooperating processes, so we choose the best available queue whenever possible. - Improve residual slice time accounting. - Remove dead code: we no longer see async requests coming in on sync queues. That part was removed a long time ago. That means that we can also remove the difference between cfq_cfqq_sync() and cfq_cfqq_class_sync(), they are now indentical. And we can kill the on_dispatch array, just make it a counter. - Allow a process to go into the current list, if it hasn't been serviced in this scheduler tick yet. Possible future improvements including caching the cfqq lookup in cfq_close_cooperator(), so we don't have to look it up twice. cfq_get_best_queue() should just use that last decision instead of doing it again. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-04-30cfq-iosched: improve preemption for cooperating tasksJens Axboe
When testing the syslet async io approach, I discovered that CFQ sometimes didn't perform as well as expected. cfq_should_preempt() needs to better check for cooperating tasks, so fix that by allowing preemption of an equal priority queue if the recently queued request is as good a candidate for IO as the one we are currently waiting for. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-04-25cfq-iosched: fix alias + front merge bugJens Axboe
There's a really rare and obscure bug in CFQ, that causes a crash in cfq_dispatch_insert() due to rq == NULL. One example of the resulting oops is seen here: http://lkml.org/lkml/2007/4/15/41 Neil correctly diagnosed the situation for how this can happen: if two concurrent requests with the exact same sector number (due to direct IO or aliasing between MD and the raw device access), the alias handling will add the request to the sortlist, but next_rq remains NULL. Read the more complete analysis at: http://lkml.org/lkml/2007/4/25/57 This looks like it requires md to trigger, even though it should potentially be possible to due with O_DIRECT (at least if you edit the kernel and doctor some of the unplug calls). The fix is to move the ->next_rq update to when we add a request to the rbtree. Then we remove the possibility for a request to exist in the rbtree code, but not have ->next_rq correctly updated. Signed-off-by: Jens Axboe <jens.axboe@oracle.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-04-20cfq-iosched: fix sequential write regressionJens Axboe
We have a 10-15% performance regression for sequential writes on TCQ/NCQ enabled drives in 2.6.21-rcX after the CFQ update went in. It has been reported by Valerie Clement <valerie.clement@bull.net> and the Intel testing folks. The regression is because of CFQ's now more aggressive queue control, limiting the depth available to the device. This patches fixes that regression by allowing a greater depth when only one queue is busy. It has been tested to not impact sync-vs-async workloads too much - we still do a lot better than 2.6.20. Signed-off-by: Jens Axboe <jens.axboe@oracle.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-04-04[PATCH] remove protection of LANANA-reserved majorsAndrew Morton
Revert all this. It can cause device-mapper to receive a different major from earlier kernels and it turns out that the Amanda backup program (via GNU tar, apparently) checks major numbers on files when performing incremental backups. Which is a bit broken of Amanda (or tar), but this feature isn't important enough to justify the churn. Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-03-27make elv_register() output atomicThibaut VARENE
Booting 2.6.21-rc3-g45592145 I noticed the following on one of my machines in the bootlog: io scheduler noop registered<6>Time: jiffies clocksource has been installed. io scheduler deadline registered (default) Looking at block/elevator.c, it appears that elv_register() uses two consecutive printks in a non-atomic way, leading to the above glitch. The attached trivial patch fixes this issue, by using a single printk. Signed-off-by: Thibaut VARENE <varenet@parisc-linux.org> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-03-27block: blk_max_pfn is somtimes wrongVasily Tarasov
There is a small problem in handling page bounce. At the moment blk_max_pfn equals max_pfn, which is in fact not maximum possible _number_ of a page frame, but the _amount_ of page frames. For example for the 32bit x86 node with 4Gb RAM, max_pfn = 0x100000, but not 0xFFFF. request_queue structure has a member q->bounce_pfn and queue needs bounce pages for the pages _above_ this limit. This routine is handled by blk_queue_bounce(), where the following check is produced: if (q->bounce_pfn >= blk_max_pfn) return; Assume, that a driver has set q->bounce_pfn to 0xFFFF, but blk_max_pfn equals 0x10000. In such situation the check above fails and for each bio we always fall down for iterating over pages tied to the bio. I want to notice, that for quite a big range of device drivers (ide, md, ...) such problem doesn't happen because they use BLK_BOUNCE_ANY for bounce_pfn. BLK_BOUNCE_ANY is defined as blk_max_pfn << PAGE_SHIFT, and then the check above doesn't fail. But for other drivers, which obtain reuired value from drivers, it fails. For example sata_nv uses ATA_DMA_MASK or dev->dma_mask. I propose to use (max_pfn - 1) for blk_max_pfn. And the same for blk_max_low_pfn. The patch also cleanses some checks related with bounce_pfn. Signed-off-by: Vasily Tarasov <vtaras@openvz.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-02-20[PATCH] lockdep: annotate BLKPG_DEL_PARTITIONPeter Zijlstra
>============================================= >[ INFO: possible recursive locking detected ] >2.6.19-1.2909.fc7 #1 >--------------------------------------------- >anaconda/587 is trying to acquire lock: > (&bdev->bd_mutex){--..}, at: [<c05fb380>] mutex_lock+0x21/0x24 > >but task is already holding lock: > (&bdev->bd_mutex){--..}, at: [<c05fb380>] mutex_lock+0x21/0x24 > >other info that might help us debug this: >1 lock held by anaconda/587: > #0: (&bdev->bd_mutex){--..}, at: [<c05fb380>] mutex_lock+0x21/0x24 > >stack backtrace: > [<c0405812>] show_trace_log_lvl+0x1a/0x2f > [<c0405db2>] show_trace+0x12/0x14 > [<c0405e36>] dump_stack+0x16/0x18 > [<c043bd84>] __lock_acquire+0x116/0xa09 > [<c043c960>] lock_acquire+0x56/0x6f > [<c05fb1fa>] __mutex_lock_slowpath+0xe5/0x24a > [<c05fb380>] mutex_lock+0x21/0x24 > [<c04d82fb>] blkdev_ioctl+0x600/0x76d > [<c04946b1>] block_ioctl+0x1b/0x1f > [<c047ed5a>] do_ioctl+0x22/0x68 > [<c047eff2>] vfs_ioctl+0x252/0x265 > [<c047f04e>] sys_ioctl+0x49/0x63 > [<c0404070>] syscall_call+0x7/0xb Annotate BLKPG_DEL_PARTITION's bd_mutex locking and add a little comment clarifying the bd_mutex locking, because I confused myself and initially thought the lock order was wrong too. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-20[PATCH] rework reserved major handlingAndrew Morton
Several people have reported failures in dynamic major device number handling due to the recent changes in there to avoid handing out the local/experimental majors. Rolf reports that this is due to a gcc-4.1.0 bug. The patch refactors that code a lot in an attempt to provoke the compiler into behaving. Cc: Rolf Eike Beer <eike-kernel@sf-tec.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-17update I/O sched Kconfig help texts - CFQ is now default, not AS.Jesper Juhl
Change I/O scheduler description to correctly show CFQ as being the default scheduler and not the anticipatory scheduler that previously was default. Signed-off-by: Jesper Juhl <jesper.juhl@gmail.com> Signed-off-by: Adrian Bunk <bunk@stusta.de>
2007-02-12[PATCH] mark struct file_operations const 3Arjan van de Ven
Many struct file_operations in the kernel can be "const". Marking them const moves these to the .rodata section, which avoids false sharing with potential dirty data. In addition it'll catch accidental writes at compile time to these shared resources. Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-12[PATCH] register_blkdev(): don't hand out the LOCAL/EXPERIMENTAL majorsAndrew Morton
As pointed out in http://bugzilla.kernel.org/show_bug.cgi?id=7922, dynamic blockdev major allocation can hand out majors which LANANA has defined as being for local/experimental use. Cc: Torben Mathiasen <device@lanana.org> Cc: Greg KH <greg@kroah.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Tomas Klas <tomas.klas@mepatek.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-11cfq-iosched: improve continue or break logic in cfq_dispatchJens Axboe
This improves performance considerably for sync requests when you have command queuing enabled. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-02-11cfq-iosched: remove the implicit queue kicking in slice expireJens Axboe
We only really need it for a process going away, so move it to those locations. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-02-11cfq-iosched: check whether a queue timed out in accountingJens Axboe
Makes it more fair for the residual slice count. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-02-11cfq-iosched: tweak the FIFO checkingJens Axboe
We currently check the FIFO once per slice. Optimize that a bit and only do it as the first thing for a new slice, so we don't end up doing a single request and then seek to the FIFO requests. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-02-11cfq-iosched: don't pass in queue for cfq_arm_slice_timer()Jens Axboe
It must always be the active queue, otherwise it's a bug. So just use the active_queue, don't pass it in explicitly. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-02-11cfq-iosched: account for slice over/under timeJens Axboe
If a slice uses less than it is entitled to (or perhaps more), include that in the decision on how much time to give it the next time it gets serviced. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-02-11cfq-iosched: defer slice activation to first request being activeJens Axboe
This better matches what time the queue is actually spending doing IO. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-02-11[PATCH] cfq-iosched: use last service point as the fairness criteriaJens Axboe
Right now we use slice_start, which gives async queues an unfair advantage. Chance that to service_last, and base the resorter on that. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-02-11cfq-iosched: document the cfqq flagsJens Axboe
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-02-11[PATCH] cfq-iosched: move on_rr check into cfq_resort_rr_list()Jens Axboe
Move the on_rr check into cfq_resort_rr_list(), every call site needs to check it anyway. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-02-11cfq-iosched: remove cfq_io_context last_queueJens Axboe
It hasn't been used for a while, kill it off and remove the old if 0 code chunk. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-02-11elevator: don't sort reads between writesJens Axboe
Don't allow elv_dispatch_sort() to mix reads and writes together, it's rarely a good idea. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-02-11elevator: abstract out the activate and deactivate functionsJens Axboe
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-02-11Merge master.kernel.org:/pub/scm/linux/kernel/git/davem/sparc-2.6Linus Torvalds
* master.kernel.org:/pub/scm/linux/kernel/git/davem/sparc-2.6: [SPARC64]: Update defconfig. [SPARC64]: Add PCI MSI support on Niagara. [SPARC64] IRQ: Use irq_desc->chip_data instead of irq_desc->handler_data [SPARC64]: Add obppath sysfs attribute for SBUS and PCI devices. [PARTITION]: Add whole_disk attribute.
2007-02-11[PATCH] Relay: add CPU hotplug supportMathieu Desnoyers
Mathieu originally needed to add this for tracing Xen, but it's something that's needed for any application that can be tracing while cpus are added. unplug isn't supported by this patch. The thought was that at minumum a new buffer needs to be added when a cpu comes up, but it wasn't worth the effort to remove buffers on cpu down since they'd be freed soon anyway when the channel was closed. [zanussi@us.ibm.com: avoid lock_cpu_hotplug deadlock] Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> Cc: Tom Zanussi <zanussi@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-10[PARTITION]: Add whole_disk attribute.Fabio Massimo Di Nitto
Some partitioning systems create special partitions that span the entire disk. One example are Sun partitions, and this whole-disk partition exists to tell the firmware the extent of the entire device so it can load the boot block and do other things. Such partitions should not be treated as normal partitions, because all the other partitions overlap this whole-disk one. So we'd see multiple instances of the same UUID etc. which we do not want. udev and friends can thus search for this 'whole_disk' attribute and use it to decide to ignore the partition. Signed-off-by: Fabio Massimo Di Nitto <fabbione@ubuntu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2007-02-09[PATCH] md: fix various bugs with aligned reads in RAID5Neil Brown
It is possible for raid5 to be sent a bio that is too big for an underlying device. So if it is a READ that we pass stright down to a device, it will fail and confuse RAID5. So in 'chunk_aligned_read' we check that the bio fits within the parameters for the target device and if it doesn't fit, fall back on reading through the stripe cache and making lots of one-page requests. Note that this is the earliest time we can check against the device because earlier we don't have a lock on the device, so it could change underneath us. Also, the code for handling a retry through the cache when a read fails has not been tested and was badly broken. This patch fixes that code. Signed-off-by: Neil Brown <neilb@suse.de> Cc: "Kai" <epimetreus@fastmail.fm> Cc: <stable@suse.de> Cc: <org@suse.de> Cc: Jens Axboe <jens.axboe@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-01-29[PATCH] Fix SG_IO timeout jiffy conversionMike Christie
Commit 85e04e371b5a321b5df2bc3f8e0099a64fb087d7 cleaned up the timeout conversion, but did it exactly the wrong way. We get msecs from user space, and should convert them into jiffies. Not the other way around. Here is a fix with the overflow check sg.c has added in. This fixes DVD burnign with Nero. Signed-off-by: Mike Christie <michaelc@cs.wisc.edu> [ "you'll be wanting a comma there" - Andrew ] Cc: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-01-23[PATCH] elevator: move clearing of unplug flag earlierLinas Vepstas
A flag was recently added to the elevator code to avoid performing an unplug when reuests are being re-queued. The goal of this flag was to avoid a deep recursion that can occur when re-queueing requests after a SCSI device/host reset. See http://lkml.org/lkml/2006/5/17/254 However, that fix added the flag near the bottom of a case statement, where an earlier break (in an if statement) could transport one out of the case, without setting the flag. This patch sets the flag earlier in the case statement. I re-discovered the deep recursion recently during testing; I was told that it was a known problem, and the fix to it was in the kernel I was testing. Indeed it was ... but it didn't fix the bug. With the patch below, I no longer see the bug. Signed-off by: Linas Vepstas <linas@austin.ibm.com> Signed-off-by: Jens Axboe <axboe@suse.de> Cc: Chris Wright <chrisw@sous-sol.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-01-02[PATCH] cfq-iosched: merging problemJens Axboe
Two issues: - The final return 1 should be a return 0, otherwise comparing cfqq is a noop. - bio_sync() only checks the sync flag, while rq_is_sync() checks both for READ and sync. The latter is what we want. Expand the bio check to include reads, and relax the restriction to allow merging of async io into sync requests. In the future we want to clean up the SYNC logic, right now it means both sync request (such as READ and O_DIRECT WRITE) and unplug-on-issue. Leave that for later. Signed-off-by: Jens Axboe <jens.axboe@oracle.com> Signed-off-by: Linus Torvalds <torvalds@osdl.org>