aboutsummaryrefslogtreecommitdiff
path: root/block/blk-ioc.c
AgeCommit message (Collapse)Author
2015-11-06mm, page_alloc: distinguish between being unable to sleep, unwilling to ↵Mel Gorman
sleep and avoiding waking kswapd __GFP_WAIT has been used to identify atomic context in callers that hold spinlocks or are in interrupts. They are expected to be high priority and have access one of two watermarks lower than "min" which can be referred to as the "atomic reserve". __GFP_HIGH users get access to the first lower watermark and can be called the "high priority reserve". Over time, callers had a requirement to not block when fallback options were available. Some have abused __GFP_WAIT leading to a situation where an optimisitic allocation with a fallback option can access atomic reserves. This patch uses __GFP_ATOMIC to identify callers that are truely atomic, cannot sleep and have no alternative. High priority users continue to use __GFP_HIGH. __GFP_DIRECT_RECLAIM identifies callers that can sleep and are willing to enter direct reclaim. __GFP_KSWAPD_RECLAIM to identify callers that want to wake kswapd for background reclaim. __GFP_WAIT is redefined as a caller that is willing to enter direct reclaim and wake kswapd for background reclaim. This patch then converts a number of sites o __GFP_ATOMIC is used by callers that are high priority and have memory pools for those requests. GFP_ATOMIC uses this flag. o Callers that have a limited mempool to guarantee forward progress clear __GFP_DIRECT_RECLAIM but keep __GFP_KSWAPD_RECLAIM. bio allocations fall into this category where kswapd will still be woken but atomic reserves are not used as there is a one-entry mempool to guarantee progress. o Callers that are checking if they are non-blocking should use the helper gfpflags_allow_blocking() where possible. This is because checking for __GFP_WAIT as was done historically now can trigger false positives. Some exceptions like dm-crypt.c exist where the code intent is clearer if __GFP_DIRECT_RECLAIM is used instead of the helper due to flag manipulations. o Callers that built their own GFP flags instead of starting with GFP_KERNEL and friends now also need to specify __GFP_KSWAPD_RECLAIM. The first key hazard to watch out for is callers that removed __GFP_WAIT and was depending on access to atomic reserves for inconspicuous reasons. In some cases it may be appropriate for them to use __GFP_HIGH. The second key hazard is callers that assembled their own combination of GFP flags instead of starting with something like GFP_KERNEL. They may now wish to specify __GFP_KSWAPD_RECLAIM. It's almost certainly harmless if it's missed in most cases as other activity will wake kswapd. Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Vitaly Wool <vitalywool@gmail.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-02-18block: Substitute rcu_access_pointer() for rcu_dereference_raw()Paul E. McKenney
(Trivial patch.) If the code is looking at the RCU-protected pointer itself, but not dereferencing it, the rcu_dereference() functions can be downgraded to rcu_access_pointer(). This commit makes this downgrade in blkg_destroy() and ioc_destroy_icq(), both of which simply compare the RCU-protected pointer against another pointer with no dereferencing. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Jens Axboe <axboe@fb.com>
2013-11-08block: cleanup removing dependency on bootmem headersGrygorii Strashko
Cc: Yinghai Lu <yinghai@kernel.org> Cc: Tejun Heo <tj@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com> Signed-off-by: Santosh Shilimkar <santosh.shilimkar@ti.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-09-11lib/radix-tree.c: make radix_tree_node_alloc() work correctly within interruptJan Kara
With users of radix_tree_preload() run from interrupt (block/blk-ioc.c is one such possible user), the following race can happen: radix_tree_preload() ... radix_tree_insert() radix_tree_node_alloc() if (rtp->nr) { ret = rtp->nodes[rtp->nr - 1]; <interrupt> ... radix_tree_preload() ... radix_tree_insert() radix_tree_node_alloc() if (rtp->nr) { ret = rtp->nodes[rtp->nr - 1]; And we give out one radix tree node twice. That clearly results in radix tree corruption with different results (usually OOPS) depending on which two users of radix tree race. We fix the problem by making radix_tree_node_alloc() always allocate fresh radix tree nodes when in interrupt. Using preloading when in interrupt doesn't make sense since all the allocations have to be atomic anyway and we cannot steal nodes from process-context users because some users rely on radix_tree_insert() succeeding after radix_tree_preload(). in_interrupt() check is somewhat ugly but we cannot simply key off passed gfp_mask as that is acquired from root_gfp_mask() and thus the same for all preload users. Another part of the fix is to avoid node preallocation in radix_tree_preload() when passed gfp_mask doesn't allow waiting. Again, preallocation in such case doesn't make sense and when preallocation would happen in interrupt we could possibly leak some allocated nodes. However, some users of radix_tree_preload() require following radix_tree_insert() to succeed. To avoid unexpected effects for these users, radix_tree_preload() only warns if passed gfp mask doesn't allow waiting and we provide a new function radix_tree_maybe_preload() for those users which get different gfp mask from different call sites and which are prepared to handle radix_tree_insert() failure. Signed-off-by: Jan Kara <jack@suse.cz> Cc: Jens Axboe <jaxboe@fusionio.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-05-14block: queue work on power efficient wqViresh Kumar
Block layer uses workqueues for multiple purposes. There is no real dependency of scheduling these on the cpu which scheduled them. On a idle system, it is observed that and idle cpu wakes up many times just to service this work. It would be better if we can schedule it on a cpu which the scheduler believes to be the most appropriate one. This patch replaces normal workqueues with power efficient versions. Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Tejun Heo <tj@kernel.org>
2013-02-27hlist: drop the node parameter from iteratorsSasha Levin
I'm not sure why, but the hlist for each entry iterators were conceived list_for_each_entry(pos, head, member) The hlist ones were greedy and wanted an extra parameter: hlist_for_each_entry(tpos, pos, head, member) Why did they need an extra pos parameter? I'm not quite sure. Not only they don't really need it, it also prevents the iterator from looking exactly like the list iterator, which is unfortunate. Besides the semantic patch, there was some manual work required: - Fix up the actual hlist iterators in linux/list.h - Fix up the declaration of other iterators based on the hlist ones. - A very small amount of places were using the 'node' parameter, this was modified to use 'obj->member' instead. - Coccinelle didn't handle the hlist_for_each_entry_safe iterator properly, so those had to be fixed up manually. The semantic patch which is mostly the work of Peter Senna Tschudin is here: @@ iterator name hlist_for_each_entry, hlist_for_each_entry_continue, hlist_for_each_entry_from, hlist_for_each_entry_rcu, hlist_for_each_entry_rcu_bh, hlist_for_each_entry_continue_rcu_bh, for_each_busy_worker, ax25_uid_for_each, ax25_for_each, inet_bind_bucket_for_each, sctp_for_each_hentry, sk_for_each, sk_for_each_rcu, sk_for_each_from, sk_for_each_safe, sk_for_each_bound, hlist_for_each_entry_safe, hlist_for_each_entry_continue_rcu, nr_neigh_for_each, nr_neigh_for_each_safe, nr_node_for_each, nr_node_for_each_safe, for_each_gfn_indirect_valid_sp, for_each_gfn_sp, for_each_host; type T; expression a,c,d,e; identifier b; statement S; @@ -T b; <+... when != b ( hlist_for_each_entry(a, - b, c, d) S | hlist_for_each_entry_continue(a, - b, c) S | hlist_for_each_entry_from(a, - b, c) S | hlist_for_each_entry_rcu(a, - b, c, d) S | hlist_for_each_entry_rcu_bh(a, - b, c, d) S | hlist_for_each_entry_continue_rcu_bh(a, - b, c) S | for_each_busy_worker(a, c, - b, d) S | ax25_uid_for_each(a, - b, c) S | ax25_for_each(a, - b, c) S | inet_bind_bucket_for_each(a, - b, c) S | sctp_for_each_hentry(a, - b, c) S | sk_for_each(a, - b, c) S | sk_for_each_rcu(a, - b, c) S | sk_for_each_from -(a, b) +(a) S + sk_for_each_from(a) S | sk_for_each_safe(a, - b, c, d) S | sk_for_each_bound(a, - b, c) S | hlist_for_each_entry_safe(a, - b, c, d, e) S | hlist_for_each_entry_continue_rcu(a, - b, c) S | nr_neigh_for_each(a, - b, c) S | nr_neigh_for_each_safe(a, - b, c, d) S | nr_node_for_each(a, - b, c) S | nr_node_for_each_safe(a, - b, c, d) S | - for_each_gfn_sp(a, c, d, b) S + for_each_gfn_sp(a, c, d) S | - for_each_gfn_indirect_valid_sp(a, c, d, b) S + for_each_gfn_indirect_valid_sp(a, c, d) S | for_each_host(a, - b, c) S | for_each_host_safe(a, - b, c, d) S | for_each_mesh_entry(a, - b, c, d) S ) ...+> [akpm@linux-foundation.org: drop bogus change from net/ipv4/raw.c] [akpm@linux-foundation.org: drop bogus hunk from net/ipv6/raw.c] [akpm@linux-foundation.org: checkpatch fixes] [akpm@linux-foundation.org: fix warnings] [akpm@linux-foudnation.org: redo intrusive kvm changes] Tested-by: Peter Senna Tschudin <peter.senna@gmail.com> Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Sasha Levin <sasha.levin@oracle.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Gleb Natapov <gleb@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-08-01block: uninitialized ioc->nr_tasks triggers WARN_ONOlof Johansson
Hi, I'm using the old-fashioned 'dump' backup tool, and I noticed that it spews the below warning as of 3.5-rc1 and later (3.4 is fine): [ 10.886893] ------------[ cut here ]------------ [ 10.886904] WARNING: at include/linux/iocontext.h:140 copy_process+0x1488/0x1560() [ 10.886905] Hardware name: Bochs [ 10.886906] Modules linked in: [ 10.886908] Pid: 2430, comm: dump Not tainted 3.5.0-rc7+ #27 [ 10.886908] Call Trace: [ 10.886911] [<ffffffff8107ce8a>] warn_slowpath_common+0x7a/0xb0 [ 10.886912] [<ffffffff8107ced5>] warn_slowpath_null+0x15/0x20 [ 10.886913] [<ffffffff8107c088>] copy_process+0x1488/0x1560 [ 10.886914] [<ffffffff8107c244>] do_fork+0xb4/0x340 [ 10.886918] [<ffffffff8108effa>] ? recalc_sigpending+0x1a/0x50 [ 10.886919] [<ffffffff8108f6b2>] ? __set_task_blocked+0x32/0x80 [ 10.886920] [<ffffffff81091afa>] ? __set_current_blocked+0x3a/0x60 [ 10.886923] [<ffffffff81051db3>] sys_clone+0x23/0x30 [ 10.886925] [<ffffffff8179bd73>] stub_clone+0x13/0x20 [ 10.886927] [<ffffffff8179baa2>] ? system_call_fastpath+0x16/0x1b [ 10.886928] ---[ end trace 32a14af7ee6a590b ]--- Reproducing is easy, I can hit it on a KVM system with a very basic config (x86_64 make defconfig + enable the drivers needed). To hit it, just install dump (on debian/ubuntu, not sure what the package might be called on Fedora), and: dump -o -f /tmp/foo / You'll see the warning in dmesg once it forks off the I/O process and starts dumping filesystem contents. I bisected it down to the following commit: commit f6e8d01bee036460e03bd4f6a79d014f98ba712e Author: Tejun Heo <tj@kernel.org> Date: Mon Mar 5 13:15:26 2012 -0800 block: add io_context->active_ref Currently ioc->nr_tasks is used to decide two things - whether an ioc is done issuing IOs and whether it's shared by multiple tasks. This patch separate out the first into ioc->active_ref, which is acquired and released using {get|put}_io_context_active() respectively. This will be used to associate bio's with a given task. This patch doesn't introduce any visible behavior change. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> It seems like the init of ioc->nr_tasks was removed in that patch, so it starts out at 0 instead of 1. Tejun, is the right thing here to add back the init, or should something else be done? The below patch removes the warning, but I haven't done any more extensive testing on it. Signed-off-by: Olof Johansson <olof@lixom.net> Acked-by: Tejun Heo <tj@kernel.org> Cc: stable@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-05-31block: avoid infinite loop in get_task_io_context()Eric Dumazet
Calling get_task_io_context() on a exiting task which isn't %current can loop forever. This triggers at boot time on my dev machine. BUG: soft lockup - CPU#3 stuck for 22s ! [mountall.1603] Fix this by making create_task_io_context() returns -EBUSY in this case to break the loop. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Alan Cox <alan@linux.intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-04-01Merge branch 'for-3.5' of ../cgroup into block/for-3.5/core-mergedTejun Heo
cgroup/for-3.5 contains the following changes which blk-cgroup needs to proceed with the on-going cleanup. * Dynamic addition and removal of cftypes to make config/stat file handling modular for policies. * cgroup removal update to not wait for css references to drain to fix blkcg removal hang caused by cfq caching cfqgs. Pull in cgroup/for-3.5 into block/for-3.5/core. This causes the following conflicts in block/blk-cgroup.c. * 761b3ef50e "cgroup: remove cgroup_subsys argument from callbacks" conflicts with blkiocg_pre_destroy() addition and blkiocg_attach() removal. Resolved by removing @subsys from all subsys methods. * 676f7c8f84 "cgroup: relocate cftype and cgroup_subsys definitions in controllers" conflicts with ->pre_destroy() and ->attach() updates and removal of modular config. Resolved by dropping forward declarations of the methods and applying updates to the relocated blkio_subsys. * 4baf6e3325 "cgroup: convert all non-memcg controllers to the new cftype interface" builds upon the previous item. Resolved by adding ->base_cftypes to the relocated blkio_subsys. Signed-off-by: Tejun Heo <tj@kernel.org>
2012-03-20block: remove ioc_*_changed()Tejun Heo
After the previous patch to cfq, there's no ioc_get_changed() user left. This patch yanks out ioc_{ioprio|cgroup|get}_changed() and all related stuff. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-14block: fix ioc leak in put_io_contextXiaotian Feng
When put_io_context is called, if ioc->icq_list is empty and refcount is 1, kernel will not free the ioc. This is caught by following kmemleak: unreferenced object 0xffff880036349fe0 (size 216): comm "sh", pid 2137, jiffies 4294931140 (age 290579.412s) hex dump (first 32 bytes): 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 01 00 01 00 ad 4e ad de ff ff ff ff 00 00 00 00 .....N.......... backtrace: [<ffffffff8169f926>] kmemleak_alloc+0x26/0x50 [<ffffffff81195a9c>] kmem_cache_alloc_node+0x1cc/0x2a0 [<ffffffff81356b67>] create_io_context_slowpath+0x27/0x130 [<ffffffff81356d2b>] get_task_io_context+0xbb/0xf0 [<ffffffff81055f0e>] copy_process+0x188e/0x18b0 [<ffffffff8105609b>] do_fork+0x11b/0x420 [<ffffffff810247f8>] sys_clone+0x28/0x30 [<ffffffff816d3373>] stub_clone+0x13/0x20 [<ffffffffffffffff>] 0xffffffffffffffff ioc should be freed if ioc->icq_list is empty. Signed-off-by: Xiaotian Feng <dannyfeng@tencent.com> Acked-by: Vivek Goyal <vgoyal@redhat.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-06block: add io_context->active_refTejun Heo
Currently ioc->nr_tasks is used to decide two things - whether an ioc is done issuing IOs and whether it's shared by multiple tasks. This patch separate out the first into ioc->active_ref, which is acquired and released using {get|put}_io_context_active() respectively. This will be used to associate bio's with a given task. This patch doesn't introduce any visible behavior change. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-06block: interface update for ioc/icq creation functionsTejun Heo
Make the following interface updates to prepare for future ioc related changes. * create_io_context() returning ioc only works for %current because it doesn't increment ref on the ioc. Drop @task parameter from it and always assume %current. * Make create_io_context_slowpath() return 0 or -errno and rename it to create_task_io_context(). * Make ioc_create_icq() take @ioc as parameter instead of assuming that of %current. The caller, get_request(), is updated to create ioc explicitly and then pass it into ioc_create_icq(). Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-02-15block: exit_io_context() should call elevator_exit_icq_fn()Tejun Heo
While updating locking, b2efa05265 "block, cfq: unlink cfq_io_context's immediately" moved elevator_exit_icq_fn() invocation from exit_io_context() to the final ioc put. While this doesn't cause catastrophic failure, it effectively removes task exit notification to elevator and cause noticeable IO performance degradation with CFQ. On task exit, CFQ used to immediately expire the slice if it was being used by the exiting task as no more IO would be issued by the task; however, after b2efa05265, the notification is lost and disk could sit idle needlessly, leading to noticeable IO performance degradation for certain workloads. This patch renames ioc_exit_icq() to ioc_destroy_icq(), separates elevator_exit_icq_fn() invocation into ioc_exit_icq() and invokes it from exit_io_context(). ICQ_EXITED flag is added to avoid invoking the callback more than once for the same icq. Walking icq_list from ioc side and invoking elevator callback requires reverse double locking. This may be better implemented using RCU; unfortunately, using RCU isn't trivial. e.g. RCU protection would need to cover request_queue and queue_lock switch on cleanup makes grabbing queue_lock from RCU unsafe. Reverse double locking should do, at least for now. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-and-bisected-by: Shaohua Li <shli@kernel.org> LKML-Reference: <CANejiEVzs=pUhQSTvUppkDcc2TNZyfohBRLygW5zFmXyk5A-xQ@mail.gmail.com> Tested-by: Shaohua Li <shaohua.li@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-02-15block: simplify ioc_release_fn()Tejun Heo
Reverse double lock dancing in ioc_release_fn() can be simplified by just using trylock on the queue_lock and back out from ioc lock on trylock failure. Simplify it. Signed-off-by: Tejun Heo <tj@kernel.org> Tested-by: Shaohua Li <shaohua.li@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-02-15block: replace icq->changed with icq->flagsTejun Heo
icq->changed was used for ICQ_*_CHANGED bits. Rename it to flags and access it under ioc->lock instead of using atomic bitops. ioc_get_changed() is added so that the changed part can be fetched and cleared as before. icq->flags will be used to carry other flags. Signed-off-by: Tejun Heo <tj@kernel.org> Tested-by: Shaohua Li <shaohua.li@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-02-11block: fix lockdep warning on io_context release put_io_context()Tejun Heo
11a3122f6c "block: strip out locking optimization in put_io_context()" removed ioc_lock depth lockdep annoation along with locking optimization; however, while recursing from put_io_context() is no longer possible, ioc_release_fn() may still end up putting the last reference of another ioc through elevator, which wlil grab ioc->lock triggering spurious (as the ioc is always different one) A-A deadlock warning. As this can only happen one time from ioc_release_fn(), using non-zero subclass from ioc_release_fn() is enough. Use subclass 1. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-02-07block: strip out locking optimization in put_io_context()Tejun Heo
put_io_context() performed a complex trylock dancing to avoid deferring ioc release to workqueue. It was also broken on UP because trylock was always assumed to succeed which resulted in unbalanced preemption count. While there are ways to fix the UP breakage, even the most pathological microbench (forced ioc allocation and tight fork/exit loop) fails to show any appreciable performance benefit of the optimization. Strip it out. If there turns out to be workloads which are affected by this change, simpler optimization from the discussion thread can be applied later. Signed-off-by: Tejun Heo <tj@kernel.org> LKML-Reference: <1328514611.21268.66.camel@sli10-conroe> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-02-06block: fix ioc locking warningShaohua Li
Meelis reported a warning: WARNING: at kernel/timer.c:1122 run_timer_softirq+0x199/0x1ec() Hardware name: 939Dual-SATA2 timer: cfq_idle_slice_timer+0x0/0xaa preempt leak: 00000102 -> 00000103 Modules linked in: sr_mod cdrom videodev media drm_kms_helper ohci_hcd ehci_hcd v4l2_compat_ioctl32 usbcore i2c_ali15x3 snd_seq drm snd_timer snd_seq Pid: 0, comm: swapper Not tainted 3.3.0-rc2-00110-gd125666 #176 Call Trace: <IRQ> [<ffffffff81022aaa>] warn_slowpath_common+0x7e/0x96 [<ffffffff8114c485>] ? cfq_slice_expired+0x1d/0x1d [<ffffffff81022b56>] warn_slowpath_fmt+0x41/0x43 [<ffffffff8114c526>] ? cfq_idle_slice_timer+0xa1/0xaa [<ffffffff8114c485>] ? cfq_slice_expired+0x1d/0x1d [<ffffffff8102c124>] run_timer_softirq+0x199/0x1ec [<ffffffff81047a53>] ? timekeeping_get_ns+0x12/0x31 [<ffffffff810145fd>] ? apic_write+0x11/0x13 [<ffffffff81027475>] __do_softirq+0x74/0xfa [<ffffffff812f337a>] call_softirq+0x1a/0x30 [<ffffffff81002ff9>] do_softirq+0x31/0x68 [<ffffffff810276cf>] irq_exit+0x3d/0xa3 [<ffffffff81014aca>] smp_apic_timer_interrupt+0x6b/0x77 [<ffffffff812f2de9>] apic_timer_interrupt+0x69/0x70 <EOI> [<ffffffff81040136>] ? sched_clock_cpu+0x73/0x7d [<ffffffff81040136>] ? sched_clock_cpu+0x73/0x7d [<ffffffff8100801f>] ? default_idle+0x1e/0x32 [<ffffffff81008019>] ? default_idle+0x18/0x32 [<ffffffff810008b1>] cpu_idle+0x87/0xd1 [<ffffffff812de861>] rest_init+0x85/0x89 [<ffffffff81659a4d>] start_kernel+0x2eb/0x2f8 [<ffffffff8165926e>] x86_64_start_reservations+0x7e/0x82 [<ffffffff81659362>] x86_64_start_kernel+0xf0/0xf7 this_q == locked_q is possible. There are two problems here: 1. In UP case, there is preemption counter issue as spin_trylock always successes. 2. In SMP case, the loop breaks too earlier. Signed-off-by: Shaohua Li <shaohua.li@intel.com> Reported-by: Meelis Roos <mroos@linux.ee> Reported-by: Knut Petersen <Knut_Petersen@t-online.de> Tested-by: Knut Petersen <Knut_Petersen@t-online.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-12-27block: remove WARN_ON_ONCE() in exit_io_context()Tejun Heo
6e736be7 "block: make ioc get/put interface more conventional and fix race on alloction" added WARN_ON_ONCE() in exit_io_context() which triggers if !PF_EXITING. All tasks hitting exit_io_context() from task exit should have PF_EXITING set but task struct tearing down after fork failure calls into the function without PF_EXITING, triggering the condition. WARNING: at block/blk-ioc.c:234 exit_io_context+0x40/0x92() Pid: 17090, comm: trinity Not tainted 3.2.0-rc6-next-20111222-sasha-dirty #77 Call Trace: [<ffffffff810b69a3>] warn_slowpath_common+0x8f/0xb2 [<ffffffff810b6a77>] warn_slowpath_null+0x18/0x1a [<ffffffff8181a7a2>] exit_io_context+0x40/0x92 [<ffffffff810b58c9>] copy_process+0x126f/0x1453 [<ffffffff810b5c1b>] do_fork+0x120/0x3e9 [<ffffffff8106242f>] sys_clone+0x26/0x28 [<ffffffff82425803>] stub_clone+0x13/0x20 ---[ end trace a2e4eb670b375238 ]--- Reported-by: Sasha Levin <levinsasha928@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-12-25block: an exiting task should be allowed to create io_contextTejun Heo
While fixing io_context creation / task exit race condition, 6e736be7f2 "block: make ioc get/put interface more conventional and fix race on alloction" also prevented an exiting (%PF_EXITING) task from creating its own io_context. This is incorrect as exit path may issue IOs, e.g. from exit_files(), and if those IOs are the first ones issued by the task, io_context needs to be created to process the IOs. Combined with the existing problem of io_context / io_cq creation failure having the possibility of stalling IO, this problem results in deterministic full IO lockup with certain workloads. Fix it by allowing io_context creation regardless of %PF_EXITING for %current. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Andrew Morton <akpm@linux-foundation.org> Reported-by: Hugh Dickins <hughd@google.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-12-19block: ioc_cgroup_changed() needs to be exportedJens Axboe
With the ioc changed, ioc_cgroup_changed() can be used by modular code. So ensure that it is exported. Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-12-14block, cfq: move icq creation and rq->elv.icq association to block coreTejun Heo
Now block layer knows everything necessary to create and associate icq's with requests. Move ioc_create_icq() to blk-ioc.c and update get_request() such that, if elevator_type->icq_size is set, requests are automatically associated with their matching icq's before elv_set_request(). io_context reference is also managed by block core on request alloc/free. * Only ioprio/cgroup changed handling remains from cfq_get_cic(). Collapsed into cfq_set_request(). * This removes queue kicking on icq allocation failure (for now). As icq allocation failure is rare and the only effect of queue kicking achieved was possibily accelerating queue processing, this change shouldn't be noticeable. There is a larger underlying problem. Unlike request allocation, icq allocation is not guaranteed to succeed eventually after retries. The number of icq is unbound and thus mempool can't be the solution either. This effectively adds allocation dependency on memory free path and thus possibility of deadlock. This usually wouldn't happen because icq allocation is not a hot path and, even when the condition triggers, it's highly unlikely that none of the writeback workers already has icq. However, this is still possible especially if elevator is being switched under high memory pressure, so we better get it fixed. Probably the only solution is just bypassing elevator and appending to dispatch queue on any elevator allocation failure. * Comment added to explain how icq's are managed and synchronized. This completes cleanup of io_context interface. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-12-14block, cfq: move io_cq exit/release to blk-ioc.cTejun Heo
With kmem_cache managed by blk-ioc, io_cq exit/release can be moved to blk-ioc too. The odd ->io_cq->exit/release() callbacks are replaced with elevator_ops->elevator_exit_icq_fn() with unlinking from both ioc and q, and freeing automatically handled by blk-ioc. The elevator operation only need to perform exit operation specific to the elevator - in cfq's case, exiting the cfqq's. Also, clearing of io_cq's on q detach is moved to block core and automatically performed on elevator switch and q release. Because the q io_cq points to might be freed before RCU callback for the io_cq runs, blk-ioc code should remember to which cache the io_cq needs to be freed when the io_cq is released. New field io_cq->__rcu_icq_cache is added for this purpose. As both the new field and rcu_head are used only after io_cq is released and the q/ioc_node fields aren't, they are put into unions. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-12-14block, cfq: move io_cq lookup to blk-ioc.cTejun Heo
Now that all io_cq related data structures are in block core layer, io_cq lookup can be moved from cfq-iosched.c to blk-ioc.c. Lookup logic from cfq_cic_lookup() is moved to ioc_lookup_icq() with parameter return type changes (cfqd -> request_queue, cfq_io_cq -> io_cq) and cfq_cic_lookup() becomes thin wrapper around cfq_cic_lookup(). Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-12-14block, cfq: reorganize cfq_io_context into generic and cfq specific partsTejun Heo
Currently io_context and cfq logics are mixed without clear boundary. Most of io_context is independent from cfq but cfq_io_context handling logic is dispersed between generic ioc code and cfq. cfq_io_context represents association between an io_context and a request_queue, which is a concept useful outside of cfq, but it also contains fields which are useful only to cfq. This patch takes out generic part and put it into io_cq (io context-queue) and the rest into cfq_io_cq (cic moniker remains the same) which contains io_cq. The following changes are made together. * cfq_ttime and cfq_io_cq now live in cfq-iosched.c. * All related fields, functions and constants are renamed accordingly. * ioc->ioc_data is now "struct io_cq *" instead of "void *" and renamed to icq_hint. This prepares for io_context API cleanup. Documentation is currently sparse. It will be added later. Changes in this patch are mechanical and don't cause functional change. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-12-14block, cfq: replace current_io_context() with create_io_context()Tejun Heo
When called under queue_lock, current_io_context() triggers lockdep warning if it hits allocation path. This is because io_context installation is protected by task_lock which is not IRQ safe, so it triggers irq-unsafe-lock -> irq -> irq-safe-lock -> irq-unsafe-lock deadlock warning. Given the restriction, accessor + creator rolled into one doesn't work too well. Drop current_io_context() and let the users access task->io_context directly inside queue_lock combined with explicit creation using create_io_context(). Future ioc updates will further consolidate ioc access and the create interface will be unexported. While at it, relocate ioc internal interface declarations in blk.h and add section comments before and after. This patch does not introduce functional change. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-12-14block, cfq: unlink cfq_io_context's immediatelyTejun Heo
cic is association between io_context and request_queue. A cic is linked from both ioc and q and should be destroyed when either one goes away. As ioc and q both have their own locks, locking becomes a bit complex - both orders work for removal from one but not from the other. Currently, cfq tries to circumvent this locking order issue with RCU. ioc->lock nests inside queue_lock but the radix tree and cic's are also protected by RCU allowing either side to walk their lists without grabbing lock. This rather unconventional use of RCU quickly devolves into extremely fragile convolution. e.g. The following is from cfqd going away too soon after ioc and q exits raced. general protection fault: 0000 [#1] PREEMPT SMP CPU 2 Modules linked in: [ 88.503444] Pid: 599, comm: hexdump Not tainted 3.1.0-rc10-work+ #158 Bochs Bochs RIP: 0010:[<ffffffff81397628>] [<ffffffff81397628>] cfq_exit_single_io_context+0x58/0xf0 ... Call Trace: [<ffffffff81395a4a>] call_for_each_cic+0x5a/0x90 [<ffffffff81395ab5>] cfq_exit_io_context+0x15/0x20 [<ffffffff81389130>] exit_io_context+0x100/0x140 [<ffffffff81098a29>] do_exit+0x579/0x850 [<ffffffff81098d5b>] do_group_exit+0x5b/0xd0 [<ffffffff81098de7>] sys_exit_group+0x17/0x20 [<ffffffff81b02f2b>] system_call_fastpath+0x16/0x1b The only real hot path here is cic lookup during request initialization and avoiding extra locking requires very confined use of RCU. This patch makes cic removal from both ioc and request_queue perform double-locking and unlink immediately. * From q side, the change is almost trivial as ioc->lock nests inside queue_lock. It just needs to grab each ioc->lock as it walks cic_list and unlink it. * From ioc side, it's a bit more difficult because of inversed lock order. ioc needs its lock to walk its cic_list but can't grab the matching queue_lock and needs to perform unlock-relock dancing. Unlinking is now wholly done from put_io_context() and fast path is optimized by using the queue_lock the caller already holds, which is by far the most common case. If the ioc accessed multiple devices, it tries with trylock. In unlikely cases of fast path failure, it falls back to full double-locking dance from workqueue. Double-locking isn't the prettiest thing in the world but it's *far* simpler and more understandable than RCU trick without adding any meaningful overhead. This still leaves a lot of now unnecessary RCU logics. Future patches will trim them. -v2: Vivek pointed out that cic->q was being dereferenced after cic->release() was called. Updated to use local variable @this_q instead. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-12-14block, cfq: move ioc ioprio/cgroup changed handling to cicTejun Heo
ioprio/cgroup change was handled by marking the changed state in ioc and, on the following access to the ioc, performing RCU-protected iteration through all cic's grabbing the matching queue_lock. This patch moves the changed state to each cic. When ioprio or cgroup changes, the respective bit is set on all cic's of the ioc and when each of those cic (not ioc) is accessed, change is applied for that specific ioc-queue pair. This also fixes the following two race conditions between setting and clearing of changed states. * Missing barrier between assign/load of ioprio and ioprio_changed allowed applying old ioprio. * Change requests could happen between application of change and clearing of changed variables. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-12-14block: make ioc get/put interface more conventional and fix race on alloctionTejun Heo
Ignoring copy_io() during fork, io_context can be allocated from two places - current_io_context() and set_task_ioprio(). The former is always called from local task while the latter can be called from different task. The synchornization between them are peculiar and dubious. * current_io_context() doesn't grab task_lock() and assumes that if it saw %NULL ->io_context, it would stay that way until allocation and assignment is complete. It has smp_wmb() between alloc/init and assignment. * set_task_ioprio() grabs task_lock() for assignment and does smp_read_barrier_depends() between "ioc = task->io_context" and "if (ioc)". Unfortunately, this doesn't achieve anything - the latter is not a dependent load of the former. ie, if ioc itself were being dereferenced "ioc->xxx", it would mean something (not sure what tho) but as the code currently stands, the dependent read barrier is noop. As only one of the the two test-assignment sequences is task_lock() protected, the task_lock() can't do much about race between the two. Nothing prevents current_io_context() and set_task_ioprio() allocating its own ioc for the same task and overwriting the other's. Also, set_task_ioprio() can race with exiting task and create a new ioc after exit_io_context() is finished. ioc get/put doesn't have any reason to be complex. The only hot path is accessing the existing ioc of %current, which is simple to achieve given that ->io_context is never destroyed as long as the task is alive. All other paths can happily go through task_lock() like all other task sub structures without impacting anything. This patch updates ioc get/put so that it becomes more conventional. * alloc_io_context() is replaced with get_task_io_context(). This is the only interface which can acquire access to ioc of another task. On return, the caller has an explicit reference to the object which should be put using put_io_context() afterwards. * The functionality of current_io_context() remains the same but when creating a new ioc, it shares the code path with get_task_io_context() and always goes through task_lock(). * get_io_context() now means incrementing ref on an ioc which the caller already has access to (be that an explicit refcnt or implicit %current one). * PF_EXITING inhibits creation of new io_context and once exit_io_context() is finished, it's guaranteed that both ioc acquisition functions return %NULL. * All users are updated. Most are trivial but smp_read_barrier_depends() removal from cfq_get_io_context() needs a bit of explanation. I suppose the original intention was to ensure ioc->ioprio is visible when set_task_ioprio() allocates new io_context and installs it; however, this wouldn't have worked because set_task_ioprio() doesn't have wmb between init and install. There are other problems with this which will be fixed in another patch. * While at it, use NUMA_NO_NODE instead of -1 for wildcard node specification. -v2: Vivek spotted contamination from debug patch. Removed. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-12-14block: misc ioc cleanupsTejun Heo
* int return from put_io_context() wasn't used by anybody. Make it return void like other put functions and docbook-fy the function comment. * Reorder dummy declarations for !CONFIG_BLOCK case a bit. * Make alloc_ioc_context() use __GFP_ZERO allocation, take init out of if block and drop 0'ing. * Docbook-fy current_io_context() comment. This patch doesn't introduce any functional change. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-07-01Merge branch 'for-linus' into for-3.1/coreJens Axboe
Conflicts: block/blk-throttle.c block/cfq-iosched.c Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-06-06block: rename the return of two functionsPaul Bolle
If we rename the return of alloc_io_context() and get_io_context() from "ret" to "ioc" the code get's (a bit) more readable and (a lot) more grepable. Signed-off-by: Paul Bolle <pebolle@tiscali.nl> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-06-02block: Use hlist_entry() for io_context.cic_list.firstPaul Bolle
list_entry() and hlist_entry() are both simply aliases for container_of(), but since io_context.cic_list.first is an hlist_node one should at least use the correct alias. Signed-off-by: Paul Bolle <pebolle@tiscali.nl> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-05-23blk-cgroup: Initialize ioc->cgroup_changed at ioc creation timeVivek Goyal
If we don't explicitly initialize it to zero, CFQ might think that cgroup of ioc has changed and it generates lots of unnecessary calls to call_for_each_cic(changed_cgroup). Fix it. cfq_get_io_context() cfq_ioc_set_cgroup() call_for_each_cic(ioc, changed_cgroup) Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-12-21block: Clean up exit_io_context() source code.Bart Van Assche
This patch fixes a spelling error in a source code comment and removes superfluous braces in the function exit_io_context(). Signed-off-by: Bart Van Assche <bvanassche@acm.org> Cc: Jens Axboe <jaxboe@fusionio.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-11-11block: remove unused copy_io_context()Jens Axboe
Reported-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-03-30include cleanup: Update gfp.h and slab.h includes to prepare for breaking ↵Tejun Heo
implicit slab.h inclusion from percpu.h percpu.h is included by sched.h and module.h and thus ends up being included when building most .c files. percpu.h includes slab.h which in turn includes gfp.h making everything defined by the two files universally available and complicating inclusion dependencies. percpu.h -> slab.h dependency is about to be removed. Prepare for this change by updating users of gfp and slab facilities include those headers directly instead of assuming availability. As this conversion needs to touch large number of source files, the following script is used as the basis of conversion. http://userweb.kernel.org/~tj/misc/slabh-sweep.py The script does the followings. * Scan files for gfp and slab usages and update includes such that only the necessary includes are there. ie. if only gfp is used, gfp.h, if slab is used, slab.h. * When the script inserts a new include, it looks at the include blocks and try to put the new include such that its order conforms to its surrounding. It's put in the include block which contains core kernel includes, in the same order that the rest are ordered - alphabetical, Christmas tree, rev-Xmas-tree or at the end if there doesn't seem to be any matching order. * If the script can't find a place to put a new include (mostly because the file doesn't have fitting include block), it prints out an error message indicating which .h file needs to be added to the file. The conversion was done in the following steps. 1. The initial automatic conversion of all .c files updated slightly over 4000 files, deleting around 700 includes and adding ~480 gfp.h and ~3000 slab.h inclusions. The script emitted errors for ~400 files. 2. Each error was manually checked. Some didn't need the inclusion, some needed manual addition while adding it to implementation .h or embedding .c file was more appropriate for others. This step added inclusions to around 150 files. 3. The script was run again and the output was compared to the edits from #2 to make sure no file was left behind. 4. Several build tests were done and a couple of problems were fixed. e.g. lib/decompress_*.c used malloc/free() wrappers around slab APIs requiring slab.h to be added manually. 5. The script was run on all .h files but without automatically editing them as sprinkling gfp.h and slab.h inclusions around .h files could easily lead to inclusion dependency hell. Most gfp.h inclusion directives were ignored as stuff from gfp.h was usually wildly available and often used in preprocessor macros. Each slab.h inclusion directive was examined and added manually as necessary. 6. percpu.h was updated not to include slab.h. 7. Build test were done on the following configurations and failures were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my distributed build env didn't work with gcov compiles) and a few more options had to be turned off depending on archs to make things build (like ipr on powerpc/64 which failed due to missing writeq). * x86 and x86_64 UP and SMP allmodconfig and a custom test config. * powerpc and powerpc64 SMP allmodconfig * sparc and sparc64 SMP allmodconfig * ia64 SMP allmodconfig * s390 SMP allmodconfig * alpha SMP allmodconfig * um on x86_64 SMP allmodconfig 8. percpu.h modifications were reverted so that it could be applied as a separate patch and serve as bisection point. Given the fact that I had only a couple of failures from tests on step 6, I'm fairly confident about the coverage of this conversion patch. If there is a breakage, it's likely to be something in one of the arch headers which should be easily discoverable easily on most builds of the specific arch. Signed-off-by: Tejun Heo <tj@kernel.org> Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
2010-03-01block: don't access jiffies when initialising io_contextRichard Kennedy
As the comment says the initial value of last_waited is never used, so there is no need to initialise it with the current jiffies. Jiffies is hot enough without accessing it for no reason. Signed-off-by: Richard Kennedy <richard@rsk.demon.co.uk> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2010-01-11block: removed unused as_io_contextKirill Afonshin
It isn't used anymore, since AS was deleted. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-12-04block: Fix io_context leak after failure of clone with CLONE_IOLouis Rilling
With CLONE_IO, parent's io_context->nr_tasks is incremented, but never decremented whenever copy_process() fails afterwards, which prevents exit_io_context() from calling IO schedulers exit functions. Give a task_struct to exit_io_context(), and call exit_io_context() instead of put_io_context() in copy_process() cleanup path. Signed-off-by: Louis Rilling <louis.rilling@kerlabs.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-12-04block: Fix io_context leak after clone with CLONE_IOLouis Rilling
With CLONE_IO, copy_io() increments both ioc->refcount and ioc->nr_tasks. However exit_io_context() only decrements ioc->refcount if ioc->nr_tasks reaches 0. Always call put_io_context() in exit_io_context(). Signed-off-by: Louis Rilling <louis.rilling@kerlabs.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-06-10block: prevent possible io_context->refcount overflowNikanth Karthikesan
Currently io_context has an atomic_t(32-bit) as refcount. In the case of cfq, for each device against whcih a task does I/O, a reference to the io_context would be taken. And when there are multiple process sharing io_contexts(CLONE_IO) would also have a reference to the same io_context. Theoretically the possible maximum number of processes sharing the same io_context + the number of disks/cfq_data referring to the same io_context can overflow the 32-bit counter on a very high-end machine. Even though it is an improbable case, let us make it atomic_long_t. Signed-off-by: Nikanth Karthikesan <knikanth@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-05-07cfq-iosched: fix RCU race in the cfq io_context destructor handlingJens Axboe
put_io_context() drops the RCU read lock before calling into cfq_dtor(), however we need to hold off freeing there before grabbing and dereferencing the first object on the list. So extend the rcu_read_lock() scope to cover the calling of cfq_dtor(), and optimize cfq_free_io_context() to use a new variant for call_for_each_cic() that assumes the RCU read lock is already held. Hit in the wild by Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-02-19cfq-iosched: add hlist for browsing parallel to the radix treeJens Axboe
It's cumbersome to browse a radix tree from start to finish, especially since we modify keys when a process exits. So add a hlist for the single purpose of browsing over all known cfq_io_contexts, used for exit, io prio change, etc. This fixes http://bugzilla.kernel.org/show_bug.cgi?id=9948 Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-02-19make blk_ioc_init() staticAdrian Bunk
blk_ioc_init() can become static. Signed-off-by: Adrian Bunk <bunk@kernel.org> Signed-off-by: Jens Axboe <axboe@carl.home.kernel.dk>
2008-02-01block: kill swap_io_context()Jens Axboe
It blindly copies everything in the io_context, including the lock. That doesn't work so well for either lock ordering or lockdep. There seems zero point in swapping io contexts on a request to request merge, so the best point of action is to just remove it. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-01-29block: continue ll_rw_blk.c splitupJens Axboe
Adds files for barrier handling, rq execution, io context handling, mapping data to requests, and queue settings. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>