Age | Commit message (Collapse) | Author |
|
This merge adds two fixup patches from tixy & morten on task-placement-v2
branch.
|
|
Adds an extra check intersection of the task affinity mask and the slower
hmp_domain cpumask before down migrating low priority tasks.
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
|
|
On homogeneous (non-heterogeneous) systems all CPUs will be declared
'fast' and the slow cpu list will be empty. In this situation we need to
avoid adding an empty slow HMP domain otherwise the scheduler code will
blow up when it attempts to move a task to the slow domain.
Signed-off-by: Jon Medhurst <tixy@linaro.org>
|
|
'arm-asymmetric-support-v3-v3.6-rc1', 'rcu-hotplug-v1', 'arm-multi_pmu_v2', 'scheduler-misc-v1', 'hw-bkp-v7.1-debug-v1' and 'config-fragments' into big-LITTLE-MP-v10
Updates:
-------
- Based on v3.6
- Stats:
- Total Patches: 77 (V9 had incorrect count)
- New Patches: 7
- task-placement-v2: sched: Enable HMP priority filter by default
- arm-multi_pmu_v2 updated existing patches:
http://permalink.gmane.org/gmane.linux.linaro.devel/13707
- hw-bkp-v7.1-debug-v1: new branch (1 patch)
- Dropped Patches: 2
- branch cpu-hotplug-get_online_cpus-v1 removed as patches are already there
in rcu-hotplug-v1
- Updated Patches:
- per-task-load-average-v3-fixed updated with minor fixes. from:
git://git.kernel.org/pub/scm/linux/kernel/git/pjt/sched.git
commands used for merge:
-----------------------
$ git checkout -b big-LITTLE-MP-v10 v3.6
$ git merge per-cpu-thread-hotplug-v3-fixed per-task-load-average-v3-fixed
task-placement-v2 arm-asymmetric-support-v3-v3.6-rc1 rcu-hotplug-v1
arm-multi_pmu_v2 scheduler-misc-v1 hw-bkp-v7.1-debug-v1 config-fragments
|
|
This patch introduces os save and restore mechanism for v7.1 debug and
self-hosted debuggers.
It enables the os to save DBGDSCR before powerdown and restore it when power
is restored.
The clearing of the os lock in the restore function kick-starts the debug
logic again.
The os save and restore routines are hooked into the CPU PM event notifier
chain. CPU PM events are used to save and restore per-cpu context when a
single CPU is preparing to enter or has exited a low power state.
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
|
|
This updates linaro config fragments to enable the HMP priority filter by
default.
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
|
|
This updates the ARM Kconfig to enable the HMP priority filter by default.
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
|
|
This adds core support for saving and restoring CPU PMU registers
for suspend/resume support i.e. deeper C-states in cpuidle terms.
This patch adds support only to ARMv7 PMU registers save/restore.
It needs to be extended to xscale and ARMv6 if needed.
Signed-off-by: Sudeep KarkadaNagesha <sudeep.karkadanagesha@arm.com>
|
|
The userspace perf tool provides options to specify PMU names from command
line for the event. An example of pmu event syntax would be
(<pmu_name>/<config>/<modifier>)
However the parser in the perf tool breaks the tokens at spacesand fails to
identify the PMU name with spaces correctly.
This patch removes spaces in the ARMv7 CPU PMU names.
Signed-off-by: Sudeep KarkadaNagesha <sudeep.karkadanagesha@arm.com>
|
|
This patch sets the cpu affinity for the perf IRQs in the logical order
within the cluster. However interupts are assumed to be specified in the
same logical order within the cluster.
Signed-off-by: Sudeep KarkadaNagesha <sudeep.karkadanagesha@arm.com>
|
|
In a system with multiple heterogeneous CPU PMUs and each PMUs can handle
events on a subset of CPUs, probably belonging a the same cluster.
This patch introduces a cpumask to track which CPUs each PMU supports.
It also updates armpmu_event_init to reject cpu-specific events being
initialised for unsupported CPUs. Since process-specific events can be
initialised for all the CPU PMUs,armpmu_start/stop/add are modified to
prevent from being added on unsupported CPUs.
Signed-off-by: Sudeep KarkadaNagesha <sudeep.karkadanagesha@arm.com>
|
|
In order to support multiple, heterogeneous CPU PMUs and distinguish
them, they cannot be registered as PERF_TYPE_RAW type. Instead we can
get perf core to allocate a new idr type id for each PMU.
Userspace applications can refer sysfs entried to find a PMU's type,
which can then be used in tracking events on individual PMUs.
Signed-off-by: Sudeep KarkadaNagesha <sudeep.karkadanagesha@arm.com>
|
|
A single global CPU PMU pointer is not useful in a system with multiple,
heterogeneous CPU PMUs as we need to access the relevant PMU depending
on the current CPU.
This patch replaces the single global CPU PMU pointer with per-cpu
pointers and changes the OProfile accessors to refer to the PMU affine
to CPU0.
Signed-off-by: Sudeep KarkadaNagesha <Sudeep.KarkadaNagesha@arm.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
|
|
Perf has three ways to name a PMU: either by passing an explicit char *,
reading arm_pmu->name or accessing arm_pmu->pmu.name.
Just use arm_pmu->name consistently in the ARM backend.
Signed-off-by: Will Deacon <will.deacon@arm.com>
|
|
When attempting to reset the PMU state for either a NULL PMU or a PMU
implementation without a reset function, return NOTIFY_DONE from the CPU
notifier as we don't care about the hotplug event.
Signed-off-by: Will Deacon <will.deacon@arm.com>
|
|
The current practice of registering the cpu hotplug notifier at PMU
registration time won't be safe with multiple PMUs, as we'll repeatedly
attempt to register the notifier. This has the unfortunate effect of
silently corrupting the notifier list, leading to boot stalling.
Instead, register the notifier at init time. Its sanity checks will
prevent anything bad from happening if the notifier is called before we
have any PMUs registered.
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
|
|
Multi-cluster ARMv7 systems may have CPU PMUs with different number of
counters.
This patch updates armv7_pmnc_counter_valid so that it takes a pmu
argument and checks the counter validity against that. We also remove a
number of redundant counter checks whether the current PMU is not easily
retrievable.
Signed-off-by: Sudeep KarkadaNagesha <Sudeep.KarkadaNagesha@arm.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
|
|
The arm_pmu functions have wildly varied parameters which can often be
derived from struct perf_event.
This patch changes the arm_pmu function prototypes so that struct
perf_event pointers are passed in preference to fields that can be
derived from the event.
Signed-off-by: Sudeep KarkadaNagesha <Sudeep.KarkadaNagesha@arm.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
|
|
Supporting multiple, heterogeneous CPU PMUs requires us to allocate the
arm_pmu structures dynamically as the devices are probed.
This patch removes the static structure definitions for each CPU PMU
type and instead passes pointers to the PMU-specific init functions.
Signed-off-by: Sudeep KarkadaNagesha <Sudeep.KarkadaNagesha@arm.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
|
|
Add minimal guest support to perf, so it can distinguish whether
the PMU interrupt was in the host or the guest, as well as collecting
some very basic information (guest PC, user vs kernel mode).
This is not feature complete though, as it doesn't support backtracing
in the guest.
Based on the x86 implementation, tested with KVM/ARM.
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
|
|
Some device drivers like PMU require to retrieve the logical cpu mask
that corresponds to a given cluster id. This patch provides a hook in
the topology code that, given an existing cluster id as input,
initializes the corresponding cpumask passed as a pointer, reusing all
existing topology information required by sched domains in the kernel.
Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
|
|
Include asm/pmu.h to fix below build error:
CC arch/arm/mach-ux500/cpu-db8500.o
arch/arm/mach-ux500/cpu-db8500.c:118:8: error: variable 'db8500_pmu_platdata' has initializer but incomplete type
arch/arm/mach-ux500/cpu-db8500.c:119:2: error: unknown field 'handle_irq' specified in initializer
arch/arm/mach-ux500/cpu-db8500.c:119:2: warning: excess elements in struct initializer [enabled by default]
arch/arm/mach-ux500/cpu-db8500.c:119:2: warning: (near initialization for 'db8500_pmu_platdata') [enabled by default]
make[1]: *** [arch/arm/mach-ux500/cpu-db8500.o] Error 1
make: *** [arch/arm/mach-ux500] Error 2
Signed-off-by: Axel Lin <axel.lin@gmail.com>
|
|
This patch moves the CPU-specific IRQ registration and parsing code into
the CPU PMU backend. This is required because a PMU may have more than
one interrupt, which in turn can be either PPI (per-cpu) or SPI
(requiring strict affinity setting at the interrupt distributor).
Signed-off-by: Sudeep KarkadaNagesha <Sudeep.KarkadaNagesha@arm.com>
[will: cosmetic edits and reworked interrupt dispatching]
Signed-off-by: Will Deacon <will.deacon@arm.com>
|
|
This patch moves the CPU-specific PMU handling code out of perf_event.c
and into perf_event_cpu.c.
Signed-off-by: Will Deacon <will.deacon@arm.com>
|
|
The CPU PMU code is tightly coupled with generic ARM PMU handling code.
This makes it cumbersome when trying to add support for other ARM PMUs
(e.g. interconnect, L2 cache controller, bus) as the generic parts of
the code are not readily reusable.
This patch cleans up perf_event.c so that reusable code is exposed via
header files to other potential PMU drivers. The CPU code is
consistently named to identify it as such and also to prepare for moving
it into a separate file.
Signed-off-by: Will Deacon <will.deacon@arm.com>
|
|
The CPU PMU is probed using the current cpuid information as part of the
early_initcall initialising the architecture perf backend. For
architectures without NMI (such as ARM), this does not need to be
performed early and can be deferred to the driver probe callback. This
also allows us to probe the devicetree in preference to parsing the
current cpuid, which may be invalid on a big.LITTLE multi-cluster
system.
This patch defers the PMU probing and uses the devicetree information
when available.
Signed-off-by: Will Deacon <will.deacon@arm.com>
|
|
There's a rather strange compiler barrier in the PMU disabling code
which was presumably placed there by aliens. There's no valid reason for
the barrier and one can only suspect that it's up to no good.
This patch removes it before it has a chance to spread.
Signed-off-by: Will Deacon <will.deacon@arm.com>
|
|
The arm_pmu_type enumeration was initially introduced to identify
different PMU types in the system, the usual one being that on the CPU
(ARM_PMU_DEVICE_CPU). With the removal of the PMU reservation code and
the introduction of devicetree bindings for the CPU PMU, the enumeration
is no longer required.
This patch removes the enumeration and updates the various CPU PMU
platform devices so that they no longer pass an .id field referring
to identify the PMU type.
Cc: Haojian Zhuang <haojian.zhuang@gmail.com>
Cc: Olof Johansson <olof@lixom.net>
Cc: Pawel Moll <pawel.moll@arm.com>
Acked-by: Jon Hunter <jon-hunter@ti.com>
Acked-by: Kukjin Kim <kgene.kim@samsung.com>
Acked-by: Linus Walleij <linus.walleij@linaro.org>
Acked-by: Jiandong Zheng <jdzheng@broadcom.com>
Signed-off-by: Sudeep KarkadaNagesha <Sudeep.KarkadaNagesha@arm.com>
[will: cosmetic edits and actual removal of the enum type]
Signed-off-by: Will Deacon <will.deacon@arm.com>
|
|
The PMU reservation mechanism was originally intended to allow OProfile
and perf-events to co-ordinate over access to the CPU PMU. Since then,
OProfile for ARM has moved to using perf as its backend, so the
reservation code is no longer used.
This patch removes the reservation code for the CPU PMU on ARM.
Signed-off-by: Will Deacon <will.deacon@arm.com>
|
|
This patch adds separate devicetree bindings for 11MPcore and
Cortex-{A5,A7,A15} PMUs in preparation for improved devicetree parsing
in the ARM perf-event CPU PMU driver.
Cc: Grant Likely <grant.likely@secretlab.ca>
Acked-by: Rob Herring <rob.herring@calxeda.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
|
|
Add runtime PM support to the ARM PMU driver so that devices such as OMAP
supporting dynamic PM can use the platform->runtime_* hooks to initialise
hardware at runtime. Without having these runtime PM hooks in place any
configuration of the PMU hardware would be lost when low power states are
entered and hence would prevent PMU from working.
This change also replaces the PMU platform functions enable_irq and disable_irq
added by Ming Lei with runtime_resume and runtime_suspend funtions. Ming had
added the enable_irq and disable_irq functions as a method to configure the
cross trigger interface on OMAP4 for routing the PMU interrupts. By adding
runtime PM support, we can move the code called by enable_irq and disable_irq
into the runtime PM callbacks runtime_resume and runtime_suspend.
Cc: Ming Lei <ming.lei@canonical.com>
Cc: Benoit Cousson <b-cousson@ti.com>
Cc: Paul Walmsley <paul@pwsan.com>
Cc: Kevin Hilman <khilman@ti.com>
Signed-off-by: Jon Hunter <jon-hunter@ti.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
|
|
We need a way to prevent tasks that are migrating up and down the
hmp_domains from migrating straight on through before the load has
adapted to the new compute capacity of the CPU on the new hmp_domain.
This patch adds a next up/down migration delay that prevents the task
from doing another migration in the same direction until the delay
has expired.
Signed-off-by: Morten Rasmussen <Morten.Rasmussen@arm.com>
|
|
Adds ftrace event for tracing task migrations using HMP
optimized scheduling.
Signed-off-by: Morten Rasmussen <Morten.Rasmussen@arm.com>
|
|
Adds ftrace events for key variables related to the entity
load-tracking to help debugging scheduler behaviour. Allows tracing
of load contribution and runqueue residency ratio for both entities
and runqueues as well as entity CPU usage ratio.
Signed-off-by: Morten Rasmussen <Morten.Rasmussen@arm.com>
|
|
SCHED_HMP requires the different cpu types to be represented by an
ordered list of hmp_domains. Each hmp_domain represents all cpus of
a particular type using a cpumask.
The list is platform specific and therefore must be generated by
platform code by implementing arch_get_hmp_domains().
Signed-off-by: Morten Rasmussen <Morten.Rasmussen@arm.com>
|
|
We can't rely on Kconfig options to set the fast and slow CPU lists for
HMP scheduling if we want a single kernel binary to support multiple
devices with different CPU topology. E.g. TC2 (ARM's Test-Chip-2
big.LITTLE system), Fast Models, or even non big.LITTLE devices.
This patch adds the function arch_get_fast_and_slow_cpus() to generate
the lists at run-time by parsing the CPU nodes in device-tree; it
assumes slow cores are A7s and everything else is fast. The function
still supports the old Kconfig options as this is useful for testing the
HMP scheduler on devices without big.LITTLE.
This patch is reuse of a patch by Jon Medhurst <tixy@linaro.org> with a
few bits left out.
Signed-off-by: Morten Rasmussen <Morten.Rasmussen@arm.com>
|
|
Adds Kconfig entries to enable HMP scheduling on ARM platforms.
Currently, it disables CPU level sched_domain load-balacing in order
to simplify things. This needs fixing in a later revision. HMP
scheduling will do the load-balancing at this level instead.
Signed-off-by: Morten Rasmussen <Morten.Rasmussen@arm.com>
|
|
Introduces a priority threshold which prevents low priority task
from migrating to faster hmp_domains (cpus). This is useful for
user-space software which assigns lower task priority to background
task.
Signed-off-by: Morten Rasmussen <Morten.Rasmussen@arm.com>
|
|
This patch introduces forced task migration for moving suitable
currently running tasks between hmp_domains. Task behaviour is likely
to change over time. Tasks running in a less capable hmp_domain may
change to become more demanding and should therefore be migrated up.
They are unlikely go through the select_task_rq_fair() path anytime
soon and therefore need special attention.
This patch introduces a period check (SCHED_TICK) of the currently
running task on all runqueues and sets up a forced migration using
stop_machine_no_wait() if the task needs to be migrated.
Ideally, this should not be implemented by polling all runqueues.
Signed-off-by: Morten Rasmussen <Morten.Rasmussen@arm.com>
|
|
This patch introduces the basic SCHED_HMP infrastructure. Each class of
cpus is represented by a hmp_domain and tasks will only be moved between
these domains when their load profiles suggest it is beneficial.
SCHED_HMP relies heavily on the task load-tracking introduced in Paul
Turners fair group scheduling patch set:
<https://lkml.org/lkml/2012/8/23/267>
SCHED_HMP requires that the platform implements arch_get_hmp_domains()
which should set up the platform specific list of hmp_domains. It is
also assumed that the platform disables SD_LOAD_BALANCE for the
appropriate sched_domains.
Tasks placement takes place every time a task is to be inserted into
a runqueue based on its load history. The task placement decision is
based on load thresholds.
There are no restrictions on the number of hmp_domains, however,
multiple (>2) has not been tested and the up/down migration policy is
rather simple.
Signed-off-by: Morten Rasmussen <Morten.Rasmussen@arm.com>
|
|
This patch adds load_avg_ratio to each task. The load_avg_ratio is a
variant of load_avg_contrib which is not scaled by the task priority. It
is calculated like this:
runnable_avg_sum * NICE_0_LOAD / (runnable_avg_period + 1).
Signed-off-by: Morten Rasmussen <Morten.Rasmussen@arm.com>
|
|
While per-entity load-tracking is generally useful, beyond computing shares
distribution, e.g.
runnable based load-balance (in progress), governors, power-management, etc
These facilities are not yet consumers of this data. This may be trivially
reverted when the information is required; but avoid paying the overhead for
calculations we will not use until then.
Signed-off-by: Paul Turner <pjt@google.com>
Reviewed-by: Ben Segall <bsegall@google.com>
|
|
With the frame-work for runnable tracking now fully in place. Per-entity usage
tracking is a simple and low-overhead addition.
Signed-off-by: Paul Turner <pjt@google.com>
Reviewed-by: Ben Segall <bsegall@google.com>
|
|
__update_entity_runnable_avg forms the core of maintaining an entity's runnable
load average. In this function we charge the accumulated run-time since last
update and handle appropriate decay. In some cases, e.g. a waking task, this
time interval may be much larger than our period unit.
Fortunately we can exploit some properties of our series to perform decay for a
blocked update in constant time and account the contribution for a running
update in essentially-constant* time.
[*]: For any running entity they should be performing updates at the tick which
gives us a soft limit of 1 jiffy between updates, and we can compute up to a
32 jiffy update in a single pass.
C program to generate the above:
#include <math.h>
#include <stdio.h>
#define N 32
#define WMULT_SHIFT 32
const long WMULT_CONST = ((1UL << N) - 1);
double y;
long runnable_avg_yN_inv[N];
void calc_mult_inv() {
int i;
double yn = 0;
printf("inverses\n");
for (i = 0; i < N; i++) {
yn = (double)WMULT_CONST * pow(y, i);
runnable_avg_yN_inv[i] = yn;
printf("%2d: 0x%8lx\n", i, runnable_avg_yN_inv[i]);
}
printf("\n");
}
long mult_inv(long c, int n) {
return (c * runnable_avg_yN_inv[n]) >> WMULT_SHIFT;
}
void calc_yn_sum(int n)
{
int i;
double sum = 0, sum_fl = 0, diff = 0;
/*
* We take the floored sum to ensure the sum of partial sums is never
* larger than the actual sum.
*/
printf("sum y^n\n");
printf(" %8s %8s %8s\n", "exact", "floor", "error");
for (i = 1; i <= n; i++) {
sum = (y * sum + y * 1024);
sum_fl = floor(y * sum_fl+ y * 1024);
printf("%2d: %8.0f %8.0f %8.0f\n", i, sum, sum_fl,
sum_fl - sum);
}
printf("\n");
}
void calc_conv(long n) {
long old_n;
int i = -1;
printf("convergence (LOAD_AVG_MAX, LOAD_AVG_MAX_N)\n");
do {
old_n = n;
n = mult_inv(n, 1) + 1024;
i++;
} while (n != old_n);
printf("%d> %ld\n", i - 1, n);
printf("\n");
}
void main() {
y = pow(0.5, 1/(double)N);
calc_mult_inv();
calc_conv(1024);
calc_yn_sum(N);
}
[ Compile with -lm ]
Signed-off-by: Paul Turner <pjt@google.com>
Reviewed-by: Ben Segall <bsegall@google.com>
|
|
Now that our measurement intervals are small (~1ms) we can amortize the posting
of update_shares() to be about each period overflow. This is a large cost
saving for frequently switching tasks.
Signed-off-by: Paul Turner <pjt@google.com>
Reviewed-by: Ben Segall <bsegall@google.com>
|
|
Now that running entities maintain their own load-averages the work we must do
in update_shares() is largely restricted to the periodic decay of blocked
entities. This allows us to be a little less pessimistic regarding our
occupancy on rq->lock and the associated rq->clock updates required.
Signed-off-by: Paul Turner <pjt@google.com>
Reviewed-by: Ben Segall <bsegall@google.com>
|
|
Now that the machinery in place is in place to compute contributed load in a
bottom up fashion; replace the shares distribution code within update_shares()
accordingly.
Signed-off-by: Paul Turner <pjt@google.com>
Reviewed-by: Ben Segall <bsegall@google.com>
|
|
With bandwidth control tracked entities may cease execution according to user
specified bandwidth limits. Charging this time as either throttled or blocked
however, is incorrect and would falsely skew in either direction.
What we actually want is for any throttled periods to be "invisible" to
load-tracking as they are removed from the system for that interval and
contribute normally otherwise.
Do this by moderating the progression of time to omit any periods in which the
entity belonged to a throttled hierarchy.
Signed-off-by: Paul Turner <pjt@google.com>
Reviewed-by: Ben Segall <bsegall@google.com>
|
|
Entities of equal weight should receive equitable distribution of cpu time.
This is challenging in the case of a task_group's shares as execution may be
occurring on multiple cpus simultaneously.
To handle this we divide up the shares into weights proportionate with the load
on each cfs_rq. This does not however, account for the fact that the sum of
the parts may be less than one cpu and so we need to normalize:
load(tg) = min(runnable_avg(tg), 1) * tg->shares
Where runnable_avg is the aggregate time in which the task_group had runnable
children.
Signed-off-by: Paul Turner <pjt@google.com>
Reviewed-by: Ben Segall <bsegall@google.com>.
|
|
Unlike task entities who have a fixed weight, group entities instead own a
fraction of their parenting task_group's shares as their contributed weight.
Compute this fraction so that we can correctly account hierarchies and shared
entity nodes.
Signed-off-by: Paul Turner <pjt@google.com>
Reviewed-by: Ben Segall <bsegall@google.com>
|