aboutsummaryrefslogtreecommitdiff
path: root/kernel/sched/fair.c
AgeCommit message (Collapse)Author
2012-06-06sched: Fix domain iterationPeter Zijlstra
Weird topologies can lead to asymmetric domain setups. This needs further consideration since these setups are typically non-minimal too. For now, make it work by adding an extra mask selecting which CPUs are allowed to iterate up. The topology that triggered it is the one from David Rientjes: 10 20 20 30 20 10 20 20 20 20 10 20 30 20 20 10 resulting in boxes that wouldn't even boot. Reported-by: David Rientjes <rientjes@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/n/tip-3p86l9cuaqnxz7uxsojmz5rm@git.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-05-30sched: Move nr_cpus_allowed out of 'struct sched_rt_entity'Peter Zijlstra
Since nr_cpus_allowed is used outside of sched/rt.c and wants to be used outside of there more, move it to a more natural site. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/n/tip-kr61f02y9brwzkh6x53pdptm@git.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-05-30sched: Make sure to not re-read variables after validationPeter Zijlstra
We could re-read rq->rt_avg after we validated it was smaller than total, invalidating the check and resulting in an unintended negative. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: David Rientjes <rientjes@google.com> Link: http://lkml.kernel.org/r/1337688268.9698.29.camel@twins Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-05-30sched: Fix SD_OVERLAPPeter Zijlstra
SD_OVERLAP exists to allow overlapping groups, overlapping groups appear in NUMA topologies that aren't fully connected. The typical result of not fully connected NUMA is that each cpu (or rather node) will have different spans for a particular distance. However due to how sched domains are traversed -- only the first cpu in the mask goes one level up -- the next level only cares about the spans of the cpus that went up. Due to this two things were observed to be broken: - build_overlap_sched_groups() -- since its possible the cpu we're building the groups for exists in multiple (or all) groups, the selection criteria of the first group didn't ensure there was a cpu for which is was true that cpumask_first(span) == cpu. Thus load- balancing would terminate. - update_group_power() -- assumed that the cpu span of the first group of the domain was covered by all groups of the child domain. The above explains why this isn't true, so deal with it. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: David Rientjes <rientjes@google.com> Link: http://lkml.kernel.org/r/1337788843.9783.14.camel@laptop Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-05-17sched: Remove stale power aware scheduling remnants and dysfunctional knobsPeter Zijlstra
It's been broken forever (i.e. it's not scheduling in a power aware fashion), as reported by Suresh and others sending patches, and nobody cares enough to fix it properly ... so remove it to make space free for something better. There's various problems with the code as it stands today, first and foremost the user interface which is bound to topology levels and has multiple values per level. This results in a state explosion which the administrator or distro needs to master and almost nobody does. Furthermore large configuration state spaces aren't good, it means the thing doesn't just work right because it's either under so many impossibe to meet constraints, or even if there's an achievable state workloads have to be aware of it precisely and can never meet it for dynamic workloads. So pushing this kind of decision to user-space was a bad idea even with a single knob - it's exponentially worse with knobs on every node of the topology. There is a proposal to replace the user interface with a single 3 state knob: sched_balance_policy := { performance, power, auto } where 'auto' would be the preferred default which looks at things like Battery/AC mode and possible cpufreq state or whatever the hw exposes to show us power use expectations - but there's been no progress on it in the past many months. Aside from that, the actual implementation of the various knobs is known to be broken. There have been sporadic attempts at fixing things but these always stop short of reaching a mergable state. Therefore this wholesale removal with the hopes of spurring people who care to come forward once again and work on a coherent replacement. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Suresh Siddha <suresh.b.siddha@intel.com> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/1326104915.2442.53.camel@twins Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-05-14sched/fair: Improve the ->group_imb logicPeter Zijlstra
Group imbalance is meant to deal with situations where affinity masks and sched domains don't align well, such as 3 cpus from one group and 6 from another. In this case the domain based balancer will want to put an equal amount of tasks on each side even though they don't have equal cpus. Currently group_imb is set whenever two cpus of a group have a weight difference of at least one avg task and the heaviest cpu has at least two tasks. A group with imbalance set will always be picked as busiest and a balance pass will be forced. The problem is that even if there are no affinity masks this stuff can trigger and cause weird balancing decisions, eg. the observed behaviour was that of 6 cpus, 5 had 2 and 1 had 3 tasks, due to the difference of 1 avg load (they all had the same weight) and nr_running being >1 the group_imbalance logic triggered and did the weird thing of pulling more load instead of trying to move the 1 excess task to the other domain of 6 cpus that had 5 cpu with 2 tasks and 1 cpu with 1 task. Curb the group_imbalance stuff by making the nr_running condition weaker by also tracking the min_nr_running and using the difference in nr_running over the set instead of the absolute max nr_running. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/n/tip-9s7dedozxo8kjsb9kqlrukkf@git.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-05-14sched/nohz: Fix rq->cpu_load[] calculationsPeter Zijlstra
While investigating why the load-balancer did funny I found that the rq->cpu_load[] tables were completely screwy.. a bit more digging revealed that the updates that got through were missing ticks followed by a catchup of 2 ticks. The catchup assumes the cpu was idle during that time (since only nohz can cause missed ticks and the machine is idle etc..) this means that esp. the higher indices were significantly lower than they ought to be. The reason for this is that its not correct to compare against jiffies on every jiffy on any other cpu than the cpu that updates jiffies. This patch cludges around it by only doing the catch-up stuff from nohz_idle_balance() and doing the regular stuff unconditionally from the tick. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: pjt@google.com Cc: Venkatesh Pallipadi <venki@google.com> Link: http://lkml.kernel.org/n/tip-tp4kj18xdd5aj4vvj0qg55s2@git.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-05-14sched/fair: Revert sched-domain iteration breakagePeter Zijlstra
Patches c22402a2f ("sched/fair: Let minimally loaded cpu balance the group") and 0ce90475 ("sched/fair: Add some serialization to the sched_domain load-balance walk") are horribly broken so revert them. The problem is that while it sounds good to have the minimally loaded cpu do the pulling of more load, the way we walk the domains there is absolutely no guarantee this cpu will actually get to the domain. In fact its very likely it wont. Therefore the higher up the tree we get, the less likely it is we'll balance at all. The first of mask always walks up, while sucky in that it accumulates load on the first cpu and needs extra passes to spread it out at least guarantees a cpu gets up that far and load-balancing happens at all. Since its now always the first and idle cpus should always be able to balance so they get a task as fast as possible we can also do away with the added serialization. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/n/tip-rpuhs5s56aiv1aw7khv9zkw6@git.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-05-09sched/fair: Propagate 'struct lb_env' usage into find_busiest_groupPeter Zijlstra
More function argument passing reduction. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/n/tip-v66ivjfqdiqdso01lqgqx6qf@git.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-05-09sched/fair: Add some serialization to the sched_domain load-balance walkPeter Zijlstra
Since the sched_domain walk is completely unserialized (!SD_SERIALIZE) it is possible that multiple cpus in the group get elected to do the next level. Avoid this by adding some serialization. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/n/tip-vqh9ai6s0ewmeakjz80w4qz6@git.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-05-09sched/fair: Let minimally loaded cpu balance the groupPeter Zijlstra
Currently we let the leftmost (or first idle) cpu ascend the sched_domain tree and perform load-balancing. The result is that the busiest cpu in the group might be performing this function and pull more load to itself. The next load balance pass will then try to equalize this again. Change this to pick the least loaded cpu to perform higher domain balancing. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/n/tip-v8zlrmgmkne3bkcy9dej1fvm@git.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-05-09sched: Change rq->nr_running to unsigned intPeter Zijlstra
Since there's a PID space limit of 30bits (see futex.h:FUTEX_TID_MASK) and allocating that many tasks (assuming a lower bound of 2 pages per task) would still take 8T of memory it seems reasonable to say that unsigned int is sufficient for rq->nr_running. When we do get anywhere near that amount of tasks I suspect other things would go funny, load-balancer load computations would really need to be hoisted to 128bit etc. So save a few bytes and convert rq->nr_running and friends to unsigned int. Suggested-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/n/tip-y3tvyszjdmbibade5bw8zl81@git.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-04-26sched: Fix more load-balancing falloutPeter Zijlstra
Commits 367456c756a6 ("sched: Ditch per cgroup task lists for load-balancing") and 5d6523ebd ("sched: Fix load-balance wreckage") left some more wreckage. By setting loop_max unconditionally to ->nr_running load-balancing could take a lot of time on very long runqueues (hackbench!). So keep the sysctl as max limit of the amount of tasks we'll iterate. Furthermore, the min load filter for migration completely fails with cgroups since inequality in per-cpu state can easily lead to such small loads :/ Furthermore the change to add new tasks to the tail of the queue instead of the head seems to have some effect.. not quite sure I understand why. Combined these fixes solve the huge hackbench regression reported by Tim when hackbench is ran in a cgroup. Reported-by: Tim Chen <tim.c.chen@linux.intel.com> Acked-by: Tim Chen <tim.c.chen@linux.intel.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/1335365763.28150.267.camel@twins [ got rid of the CONFIG_PREEMPT tuning and made small readability edits ] Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-03-29Merge branch 'sched-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fixes from Ingo Molnar. * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: cpusets: Remove an unused variable sched/rt: Improve pick_next_highest_task_rt() sched: Fix select_fallback_rq() vs cpu_active/cpu_online sched/x86/smp: Do not enable IRQs over calibrate_delay() sched: Fix compiler warning about declared inline after use MAINTAINERS: Update email address for SCHEDULER and PERF EVENTS
2012-03-23sched: Fix compiler warning about declared inline after usePeter Zijlstra
kernel/sched/fair.c:420: warning: 'account_cfs_rq_runtime' declared inline after being called kernel/sched/fair.c:420: warning: previous declaration of 'account_cfs_rq_runtime' was here kernel/sched/fair.c:1165: warning: 'return_cfs_rq_runtime' declared inlineafter being called kernel/sched/fair.c:1165: warning: previous declaration of 'return_cfs_rq_runtime' was here Reported-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/20120321200717.49BB4A024E@akpm.mtv.corp.google.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-03-20Merge branch 'sched-core-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler changes for v3.4 from Ingo Molnar * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (27 commits) printk: Make it compile with !CONFIG_PRINTK sched/x86: Fix overflow in cyc2ns_offset sched: Fix nohz load accounting -- again! sched: Update yield() docs printk/sched: Introduce special printk_sched() for those awkward moments sched/nohz: Correctly initialize 'next_balance' in 'nohz' idle balancer sched: Cleanup cpu_active madness sched: Fix load-balance wreckage sched: Clean up parameter passing of proc_sched_autogroup_set_nice() sched: Ditch per cgroup task lists for load-balancing sched: Rename load-balancing fields sched: Move load-balancing arguments into helper struct sched/rt: Do not submit new work when PI-blocked sched/rt: Prevent idle task boosting sched/wait: Add __wake_up_all_locked() API sched/rt: Document scheduler related skip-resched-check sites sched/rt: Use schedule_preempt_disabled() sched/rt: Add schedule_preempt_disabled() sched/rt: Do not throttle when PI boosting sched/rt: Keep period timer ticking when rt throttling is active ...
2012-03-12sched/nohz: Correctly initialize 'next_balance' in 'nohz' idle balancerDiwakar Tundlam
The 'next_balance' field of 'nohz' idle balancer must be initialized to jiffies. Since jiffies is initialized to negative 300 seconds the 'nohz' idle balancer does not run for the first 300s (5mins) after bootup. If no new processes are spawed or no idle cycles happen, the load on the cpus will remain unbalanced for that duration. Signed-off-by: Diwakar Tundlam <dtundlam@nvidia.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1DD7BFEDD3147247B1355BEFEFE4665237994F30EF@HQMAIL04.nvidia.com Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-03-12sched: Fix load-balance wreckagePeter Zijlstra
Commit 367456c ("sched: Ditch per cgroup task lists for load-balancing") completely wrecked load-balancing due to a few silly mistakes. Correct those and remove more pointless code. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/n/tip-zk04ihygwxn7qqrlpaf73b0r@git.kernel.org Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-03-05Merge branch 'perf/urgent' into perf/coreIngo Molnar
Conflicts: tools/perf/builtin-record.c tools/perf/builtin-top.c tools/perf/perf.h tools/perf/util/top.h Merge reason: resolve these cherry-picking conflicts. Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-03-01sched: Ditch per cgroup task lists for load-balancingPeter Zijlstra
Per cgroup load-balance has numerous problems, chief amongst them that there is no real sane order in them. So stop pretending it makes sense and enqueue all tasks on a single list. This also allows us to more easily fix the fwd progress issue uncovered by the lock-break stuff. Rotate the list on failure to migreate and limit the total iterations to nr_running (which with releasing the lock isn't strictly accurate but close enough). Also add a filter that skips very light tasks on the first attempt around the list, this attempts to avoid shooting whole cgroups around without affecting over balance. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: pjt@google.com Link: http://lkml.kernel.org/n/tip-tx8yqydc7eimgq7i4rkc3a4g@git.kernel.org Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-03-01sched: Rename load-balancing fieldsPeter Zijlstra
s/env->this_/env->dst_/g s/env->busiest_/env->src_/g s/pull_task/move_task/g Makes everything clearer. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: pjt@google.com Link: http://lkml.kernel.org/n/tip-0yvgms8t8x962drpvl0fu0kk@git.kernel.org Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-03-01sched: Move load-balancing arguments into helper structPeter Zijlstra
Passing large sets of similar arguments all around the load-balancer gets tiresom when you want to modify something. Stick them all in a helper structure and pass the structure around. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: pjt@google.com Link: http://lkml.kernel.org/n/tip-5slqz0vhsdzewrfk9eza1aon@git.kernel.org Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-03-01Merge branch 'linus' into sched/coreIngo Molnar
Merge reason: we'll queue up dependent patches. Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-02-24static keys: Introduce 'struct static_key', static_key_true()/false() and ↵Ingo Molnar
static_key_slow_[inc|dec]() So here's a boot tested patch on top of Jason's series that does all the cleanups I talked about and turns jump labels into a more intuitive to use facility. It should also address the various misconceptions and confusions that surround jump labels. Typical usage scenarios: #include <linux/static_key.h> struct static_key key = STATIC_KEY_INIT_TRUE; if (static_key_false(&key)) do unlikely code else do likely code Or: if (static_key_true(&key)) do likely code else do unlikely code The static key is modified via: static_key_slow_inc(&key); ... static_key_slow_dec(&key); The 'slow' prefix makes it abundantly clear that this is an expensive operation. I've updated all in-kernel code to use this everywhere. Note that I (intentionally) have not pushed through the rename blindly through to the lowest levels: the actual jump-label patching arch facility should be named like that, so we want to decouple jump labels from the static-key facility a bit. On non-jump-label enabled architectures static keys default to likely()/unlikely() branches. Signed-off-by: Ingo Molnar <mingo@elte.hu> Acked-by: Jason Baron <jbaron@redhat.com> Acked-by: Steven Rostedt <rostedt@goodmis.org> Cc: a.p.zijlstra@chello.nl Cc: mathieu.desnoyers@efficios.com Cc: davem@davemloft.net Cc: ddaney.cavm@gmail.com Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/20120222085809.GA26397@elte.hu Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-02-22sched: Remove rcu_read_lock/unlock() from select_idle_sibling()Nikunj A. Dadhania
select_idle_sibling() is called from select_task_rq_fair(), which already has the RCU read lock held. Signed-off-by: Nikunj A. Dadhania <nikunj@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/20120217030409.11748.12491.stgit@abhimanyu Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-02-22sched/events: Revert trace_sched_stat_sleeptime()Peter Zijlstra
Commit 1ac9bc69 ("sched/tracing: Add a new tracepoint for sleeptime") added a new sched:sched_stat_sleeptime tracepoint. It's broken: the first sample we get on a task might be bad because of a stale sleep_start value that wasn't reset at the last task switch because the tracepoint was not active. It also breaks the existing schedstat samples due to the side effects of: - se->statistics.sleep_start = 0; ... - se->statistics.block_start = 0; Nor do I see means to fix it without adding overhead to the scheduler fast path, which I'm not willing to for the sake of redundant instrumentation. Most importantly, sleep time information can already be constructed by tracing context switches and wakeups, and taking the timestamp difference between the schedule-out, the wakeup and the schedule-in. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Andrew Vagin <avagin@openvz.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Link: http://lkml.kernel.org/n/tip-pc4c9qhl8q6vg3bs4j6k0rbd@git.kernel.org Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-01-31sched: Move SMP-only variable into the SMP sectionHiroshi Shimamoto
This also fixes the following compilation warning on !SMP: CC kernel/sched/fair.o kernel/sched/fair.c:218:36: warning: 'max_load_balance_interval' defined but not used [-Wunused-variable] Signed-off-by: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/4F2754A0.9090306@ct.jp.nec.com Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-01-27sched: Ensure cpu_power periodic updateVincent Guittot
With a lot of small tasks, the softirq sched is nearly never called when no_hz is enabled. In this case load_balance() is mainly called with the newly_idle mode which doesn't update the cpu_power. Add a next_update field which ensure a maximum update period when there is short activity. Having stale cpu_power information can skew the load-balancing decisions, this is cured by the guaranteed update. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1323717668-2143-1-git-send-email-vincent.guittot@linaro.org
2012-01-26sched/nohz: Fix nohz cpu idle load balancing state with cpu hotplugSuresh Siddha
With the recent nohz scheduler changes, rq's nohz flag 'NOHZ_TICK_STOPPED' and its associated state doesn't get cleared immediately after the cpu exits idle. This gets cleared as part of the next tick seen on that cpu. For the cpu offline support, we need to clear this state manually. Fix it by registering a cpu notifier, which clears the nohz idle load balance state for this rq explicitly during the CPU_DYING notification. There won't be any nohz updates for that cpu, after the CPU_DYING notification. But lets be extra paranoid and skip updating the nohz state in the select_nohz_load_balancer() if the cpu is not in active state anymore. Reported-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com> Reviewed-and-tested-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com> Tested-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1327026538.16150.40.camel@sbsiddha-desk.sc.intel.com Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-01-11sched: Fix lockup by limiting load-balance retries on lock-breakPeter Zijlstra
Eric and David reported dead machines and traced it to commit a195f004 ("sched: Fix load-balance lock-breaking"), it turns out there's still a scenario where we can end up re-trying forever. Since there is no strict forward progress guarantee in the load-balance iteration we can get stuck re-retrying the same task-set over and over. Creating a forward progress guarantee with the existing structure is somewhat non-trivial, for now simply terminate the retry loop after a few tries. Reported-by: Eric Dumazet <eric.dumazet@gmail.com> Tested-by: Eric Dumazet <eric.dumazet@gmail.com> Reported-by: David Ahern <dsahern@gmail.com> [ logic cleanup as suggested by Eric ] Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Suresh Siddha <suresh.b.siddha@intel.com> Link: http://lkml.kernel.org/r/1326297936.2442.157.camel@twins Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-12-23sched/tracing: Add a new tracepoint for sleeptimeArun Sharma
If CONFIG_SCHEDSTATS is defined, the kernel maintains information about how long the task was sleeping or in the case of iowait, blocking in the kernel before getting woken up. This will be useful for sleep time profiling. Note: this information is only provided for sched_fair. Other scheduling classes may choose to provide this in the future. Note: the delay includes the time spent on the runqueue as well. Signed-off-by: Arun Sharma <asharma@fb.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Arnaldo Carvalho de Melo <acme@infradead.org> Cc: Andrew Vagin <avagin@openvz.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Link: http://lkml.kernel.org/r/1324512940-32060-2-git-send-email-asharma@fb.com Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-12-21sched: Fix cgroup movement of waking processDaisuke Nishimura
There is a small race between try_to_wake_up() and sched_move_task(), which is trying to move the process being woken up. try_to_wake_up() on CPU0 sched_move_task() on CPU1 --------------------------------+--------------------------------- raw_spin_lock_irqsave(p->pi_lock) task_waking_fair() ->p.se.vruntime -= cfs_rq->min_vruntime ttwu_queue() ->send reschedule IPI to CPU1 raw_spin_unlock_irqsave(p->pi_lock) task_rq_lock() -> tring to aquire both p->pi_lock and rq->lock with IRQ disabled task_move_group_fair() -> p.se.vruntime -= (old)cfs_rq->min_vruntime += (new)cfs_rq->min_vruntime task_rq_unlock() (via IPI) sched_ttwu_pending() raw_spin_lock(rq->lock) ttwu_do_activate() ... enqueue_entity() child.se->vruntime += cfs_rq->min_vruntime raw_spin_unlock(rq->lock) As a result, vruntime of the process becomes far bigger than min_vruntime, if (new)cfs_rq->min_vruntime >> (old)cfs_rq->min_vruntime. This patch fixes this problem by just ignoring such process in task_move_group_fair(), because the vruntime has already been normalized in task_waking_fair(). Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Tejun Heo <tj@kernel.org> Link: http://lkml.kernel.org/r/20111215143741.df82dd50.nishimura@mxp.nes.nec.co.jp Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-12-21sched: Fix cgroup movement of newly created processDaisuke Nishimura
There is a small race between do_fork() and sched_move_task(), which is trying to move the child. do_fork() sched_move_task() --------------------------------+--------------------------------- copy_process() sched_fork() task_fork_fair() -> vruntime of the child is initialized based on that of the parent. -> we can see the child in "tasks" file now. task_rq_lock() task_move_group_fair() -> child.se.vruntime -= (old)cfs_rq->min_vruntime += (new)cfs_rq->min_vruntime task_rq_unlock() wake_up_new_task() ... enqueue_entity() child.se.vruntime += cfs_rq->min_vruntime As a result, vruntime of the child becomes far bigger than min_vruntime, if (new)cfs_rq->min_vruntime >> (old)cfs_rq->min_vruntime. This patch fixes this problem by just ignoring such process in task_move_group_fair(), because the vruntime has already been normalized in task_fork_fair(). Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Tejun Heo <tj@kernel.org> Link: http://lkml.kernel.org/r/20111215143607.2ee12c5d.nishimura@mxp.nes.nec.co.jp Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-12-21sched: Fix cgroup movement of forking processDaisuke Nishimura
There is a small race between task_fork_fair() and sched_move_task(), which is trying to move the parent. task_fork_fair() sched_move_task() --------------------------------+--------------------------------- cfs_rq = task_cfs_rq(current) -> cfs_rq is the "old" one. curr = cfs_rq->curr -> curr is set to the parent. task_rq_lock() dequeue_task() ->parent.se.vruntime -= (old)cfs_rq->min_vruntime enqueue_task() ->parent.se.vruntime += (new)cfs_rq->min_vruntime task_rq_unlock() raw_spin_lock_irqsave(rq->lock) se->vruntime = curr->vruntime -> vruntime of the child is set to that of the parent which has already been updated by sched_move_task(). se->vruntime -= (old)cfs_rq->min_vruntime. raw_spin_unlock_irqrestore(rq->lock) As a result, vruntime of the child becomes far bigger than expected, if (new)cfs_rq->min_vruntime >> (old)cfs_rq->min_vruntime. This patch fixes this problem by setting "cfs_rq" and "curr" after holding the rq->lock. Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Acked-by: Paul Turner <pjt@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Tejun Heo <tj@kernel.org> Link: http://lkml.kernel.org/r/20111215143655.662676b0.nishimura@mxp.nes.nec.co.jp Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-12-21sched: Fix load-balance lock-breakingPeter Zijlstra
The current lock break relies on contention on the rq locks, something which might never come because we've got IRQs disabled. Or will be very likely because on anything with more than 2 cpus a synchronized load-balance pass will very likely cause contention on the rq locks. Also the sched_nr_migrate thing fails when it gets trapped the loops of either the cgroup muck in load_balance_fair() or the move_tasks() load condition. Instead, use the new lb_flags field to propagate break/abort conditions for all these loops and create a new loop outside the irq disabled on the break being required. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/n/tip-tsceb6w61q0gakmsccix6xxi@git.kernel.org Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-12-21sched: Replace all_pinned with a generic flags fieldPeter Zijlstra
Replace the all_pinned argument with a flags field so that we can add some extra controls throughout that entire call chain. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/n/tip-33kevm71m924ok1gpxd720v3@git.kernel.org Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-12-21sched: Only queue remote wakeups when crossing cache boundariesPeter Zijlstra
Mike reported a 13% drop in netperf TCP_RR performance due to the new remote wakeup code. Suresh too noticed some performance issues with it. Reducing the IPIs to only cross cache domains solves the observed performance issues. Reported-by: Suresh Siddha <suresh.b.siddha@intel.com> Reported-by: Mike Galbraith <efault@gmx.de> Acked-by: Suresh Siddha <suresh.b.siddha@intel.com> Acked-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Chris Mason <chris.mason@oracle.com> Cc: Dave Kleikamp <dave.kleikamp@oracle.com> Link: http://lkml.kernel.org/r/1323338531.17673.7.camel@twins Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-12-08sched, nohz: Fix missing RCU read lockPeter Zijlstra
Yong Zhang reported: > [ INFO: suspicious RCU usage. ] > kernel/sched/fair.c:5091 suspicious rcu_dereference_check() usage! This is due to the sched_domain stuff being RCU protected and commit 0b005cf5 ("sched, nohz: Implement sched group, domain aware nohz idle load balancing") overlooking this fact. The sd variable only lives inside the for_each_domain() block, so we only need to wrap that. Reported-by: Yong Zhang <yong.zhang0@gmail.com> Tested-by: Yong Zhang <yong.zhang0@gmail.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Suresh Siddha <suresh.b.siddha@intel.com> Link: http://lkml.kernel.org/r/1323264728.32012.107.camel@twins Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-12-06sched, nohz: Set the NOHZ_BALANCE_KICK flag for idle load balancerSuresh Siddha
Intention is to set the NOHZ_BALANCE_KICK flag for the 'ilb_cpu'. Not for the 'cpu' which is the local cpu. Fix the typo. Reported-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1323199594.1984.18.camel@sbsiddha-desk.sc.intel.com Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-12-06sched, nohz: Fix the idle cpu check in nohz_idle_balanceSuresh Siddha
cpu bit in the nohz.idle_cpu_mask are reset in the first busy tick after exiting idle. So during nohz_idle_balance(), intention is to double check if the cpu that is part of the idle_cpu_mask is indeed idle before going ahead in performing idle balance for that cpu. Fix the cpu typo in the idle_cpu() check during nohz_idle_balance(). Reported-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1323199177.1984.12.camel@sbsiddha-desk.sc.intel.com Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-12-06sched: Save some hrtick_start_fair cyclesMike Galbraith
hrtick_start_fair() shows up in profiles even when disabled. v3.0.6 taskset -c 3 pipe-test PerfTop: 997 irqs/sec kernel:89.5% exact: 0.0% [1000Hz cycles], (all, CPU: 3) ------------------------------------------------------------------------------------------------ Virgin Patched samples pcnt function samples pcnt function _______ _____ ___________________________ _______ _____ ___________________________ 2880.00 10.2% __schedule 3136.00 11.3% __schedule 1634.00 5.8% pipe_read 1615.00 5.8% pipe_read 1458.00 5.2% system_call 1534.00 5.5% system_call 1382.00 4.9% _raw_spin_lock_irqsave 1412.00 5.1% _raw_spin_lock_irqsave 1202.00 4.3% pipe_write 1255.00 4.5% copy_user_generic_string 1164.00 4.1% copy_user_generic_string 1241.00 4.5% __switch_to 1097.00 3.9% __switch_to 929.00 3.3% mutex_lock 872.00 3.1% mutex_lock 846.00 3.0% mutex_unlock 687.00 2.4% mutex_unlock 804.00 2.9% pipe_write 682.00 2.4% native_sched_clock 713.00 2.6% native_sched_clock 643.00 2.3% system_call_after_swapgs 653.00 2.3% _raw_spin_unlock_irqrestore 617.00 2.2% sched_clock_local 633.00 2.3% fsnotify 612.00 2.2% fsnotify 605.00 2.2% sched_clock_local 596.00 2.1% _raw_spin_unlock_irqrestore 593.00 2.1% system_call_after_swapgs 542.00 1.9% sysret_check 559.00 2.0% sysret_check 467.00 1.7% fget_light 472.00 1.7% fget_light 462.00 1.6% finish_task_switch 461.00 1.7% finish_task_switch 437.00 1.5% vfs_write 442.00 1.6% vfs_write 431.00 1.5% do_sync_write 428.00 1.5% do_sync_write 413.00 1.5% select_task_rq_fair 404.00 1.5% _raw_spin_lock_irq 386.00 1.4% update_curr 402.00 1.4% update_curr 385.00 1.4% rw_verify_area 389.00 1.4% do_sync_read 377.00 1.3% _raw_spin_lock_irq 378.00 1.4% vfs_read 369.00 1.3% do_sync_read 340.00 1.2% pipe_iov_copy_from_user 360.00 1.3% vfs_read 316.00 1.1% __wake_up_sync_key * 342.00 1.2% hrtick_start_fair 313.00 1.1% __wake_up_common Signed-off-by: Mike Galbraith <efault@gmx.de> [ fixed !CONFIG_SCHED_HRTICK borkage ] Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1321971607.6855.17.camel@marge.simson.net Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-12-06sched, nohz: Clean up the find_new_ilb() using sched groups nr_busy_cpusSuresh Siddha
nr_busy_cpus in the sched_group_power indicates whether the group is semi idle or not. This helps remove the is_semi_idle_group() and simplify the find_new_ilb() in the context of finding an optimal cpu that can do idle load balancing. Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20111202010832.656983582@sbsiddha-desk.sc.intel.com Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-12-06sched, nohz: Implement sched group, domain aware nohz idle load balancingSuresh Siddha
When there are many logical cpu's that enter and exit idle often, members of the global nohz data structure are getting modified very frequently causing lot of cache-line contention. Make the nohz idle load balancing more scalabale by using the sched domain topology and 'nr_busy_cpu's in the struct sched_group_power. Idle load balance is kicked on one of the idle cpu's when there is atleast one idle cpu and: - a busy rq having more than one task or - a busy rq's scheduler group that share package resources (like HT/MC siblings) and has more than one member in that group busy or - for the SD_ASYM_PACKING domain, if the lower numbered cpu's in that domain are idle compared to the busy ones. This will help in kicking the idle load balancing request only when there is a potential imbalance. And once it is mostly balanced, these kicks will be minimized. These changes helped improve the workload that is context switch intensive between number of task pairs by 2x on a 8 socket NHM-EX based system. Reported-by: Tim Chen <tim.c.chen@intel.com> Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20111202010832.602203411@sbsiddha-desk.sc.intel.com Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-12-06sched, nohz: Track nr_busy_cpus in the sched_group_powerSuresh Siddha
Introduce nr_busy_cpus in the struct sched_group_power [Not in sched_group because sched groups are duplicated for the SD_OVERLAP scheduler domain] and for each cpu that enters and exits idle, this parameter will be updated in each scheduler group of the scheduler domain that this cpu belongs to. To avoid the frequent update of this state as the cpu enters and exits idle, the update of the stat during idle exit is delayed to the first timer tick that happens after the cpu becomes busy. This is done using NOHZ_IDLE flag in the struct rq's nohz_flags. Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20111202010832.555984323@sbsiddha-desk.sc.intel.com Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-12-06sched, nohz: Introduce nohz_flags in 'struct rq'Suresh Siddha
Introduce nohz_flags in the struct rq, which will track these two flags for now. NOHZ_TICK_STOPPED keeps track of the tick stopped status that gets set when the tick is stopped. It will be used to update the nohz idle load balancer data structures during the first busy tick after the tick is restarted. At this first busy tick after tickless idle, NOHZ_TICK_STOPPED flag will be reset. This will minimize the nohz idle load balancer status updates that currently happen for every tickless exit, making it more scalable when there are many logical cpu's that enter and exit idle often. NOHZ_BALANCE_KICK will track the need for nohz idle load balance on this rq. This will replace the nohz_balance_kick in the rq, which was not being updated atomically. Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20111202010832.499438999@sbsiddha-desk.sc.intel.com Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-12-06sched: Set skip_clock_update in yield_task_fair()Mike Galbraith
This is another case where we are on our way to schedule(), so can save a useless clock update and resulting microscopic vruntime update. Signed-off-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1321971686.6855.18.camel@marge.simson.net Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-12-06sched: Use rt.nr_cpus_allowed to recover select_task_rq() cyclesMike Galbraith
rt.nr_cpus_allowed is always available, use it to bail from select_task_rq() when only one cpu can be used, and saves some cycles for pinned tasks. See the line marked with '*' below: # taskset -c 3 pipe-test PerfTop: 997 irqs/sec kernel:89.5% exact: 0.0% [1000Hz cycles], (all, CPU: 3) ------------------------------------------------------------------------------------------------ Virgin Patched samples pcnt function samples pcnt function _______ _____ ___________________________ _______ _____ ___________________________ 2880.00 10.2% __schedule 3136.00 11.3% __schedule 1634.00 5.8% pipe_read 1615.00 5.8% pipe_read 1458.00 5.2% system_call 1534.00 5.5% system_call 1382.00 4.9% _raw_spin_lock_irqsave 1412.00 5.1% _raw_spin_lock_irqsave 1202.00 4.3% pipe_write 1255.00 4.5% copy_user_generic_string 1164.00 4.1% copy_user_generic_string 1241.00 4.5% __switch_to 1097.00 3.9% __switch_to 929.00 3.3% mutex_lock 872.00 3.1% mutex_lock 846.00 3.0% mutex_unlock 687.00 2.4% mutex_unlock 804.00 2.9% pipe_write 682.00 2.4% native_sched_clock 713.00 2.6% native_sched_clock 643.00 2.3% system_call_after_swapgs 653.00 2.3% _raw_spin_unlock_irqrestore 617.00 2.2% sched_clock_local 633.00 2.3% fsnotify 612.00 2.2% fsnotify 605.00 2.2% sched_clock_local 596.00 2.1% _raw_spin_unlock_irqrestore 593.00 2.1% system_call_after_swapgs 542.00 1.9% sysret_check 559.00 2.0% sysret_check 467.00 1.7% fget_light 472.00 1.7% fget_light 462.00 1.6% finish_task_switch 461.00 1.7% finish_task_switch 437.00 1.5% vfs_write 442.00 1.6% vfs_write 431.00 1.5% do_sync_write 428.00 1.5% do_sync_write * 413.00 1.5% select_task_rq_fair 404.00 1.5% _raw_spin_lock_irq 386.00 1.4% update_curr 402.00 1.4% update_curr 385.00 1.4% rw_verify_area 389.00 1.4% do_sync_read 377.00 1.3% _raw_spin_lock_irq 378.00 1.4% vfs_read 369.00 1.3% do_sync_read 340.00 1.2% pipe_iov_copy_from_user 360.00 1.3% vfs_read 316.00 1.1% __wake_up_sync_key 342.00 1.2% hrtick_start_fair 313.00 1.1% __wake_up_common Signed-off-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1321971504.6855.15.camel@marge.simson.net Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-12-06sched: Clean up domain traversal in select_idle_sibling()Suresh Siddha
Instead of going through the scheduler domain hierarchy multiple times (for giving priority to an idle core over an idle SMT sibling in a busy core), start with the highest scheduler domain with the SD_SHARE_PKG_RESOURCES flag and traverse the domain hierarchy down till we find an idle group. This cleanup also addresses an issue reported by Mike where the recent changes returned the busy thread even in the presence of an idle SMT sibling in single socket platforms. Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com> Tested-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1321556904.15339.25.camel@sbsiddha-desk.sc.intel.com Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-12-06events, sched: Add tracepoint for accounting blocked timeAndrew Vagin
This tracepoint shows how long a task is sleeping in uninterruptible state. E.g. it may show how long and where a mutex is waited for. Signed-off-by: Andrew Vagin <avagin@openvz.org> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1322471015-107825-8-git-send-email-avagin@openvz.org Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-11-17sched: Move all scheduler bits into kernel/sched/Peter Zijlstra
There's too many sched*.[ch] files in kernel/, give them their own directory. (No code changed, other than Makefile glue added.) Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>