aboutsummaryrefslogtreecommitdiff
path: root/include/linux/perf_event.h
AgeCommit message (Collapse)Author
2015-06-26Merge tag 'trace-v4.2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull tracing updates from Steven Rostedt: "This patch series contains several clean ups and even a new trace clock "monitonic raw". Also some enhancements to make the ring buffer even faster. But the biggest and most noticeable change is the renaming of the ftrace* files, structures and variables that have to deal with trace events. Over the years I've had several developers tell me about their confusion with what ftrace is compared to events. Technically, "ftrace" is the infrastructure to do the function hooks, which include tracing and also helps with live kernel patching. But the trace events are a separate entity altogether, and the files that affect the trace events should not be named "ftrace". These include: include/trace/ftrace.h -> include/trace/trace_events.h include/linux/ftrace_event.h -> include/linux/trace_events.h Also, functions that are specific for trace events have also been renamed: ftrace_print_*() -> trace_print_*() (un)register_ftrace_event() -> (un)register_trace_event() ftrace_event_name() -> trace_event_name() ftrace_trigger_soft_disabled() -> trace_trigger_soft_disabled() ftrace_define_fields_##call() -> trace_define_fields_##call() ftrace_get_offsets_##call() -> trace_get_offsets_##call() Structures have been renamed: ftrace_event_file -> trace_event_file ftrace_event_{call,class} -> trace_event_{call,class} ftrace_event_buffer -> trace_event_buffer ftrace_subsystem_dir -> trace_subsystem_dir ftrace_event_raw_##call -> trace_event_raw_##call ftrace_event_data_offset_##call-> trace_event_data_offset_##call ftrace_event_type_funcs_##call -> trace_event_type_funcs_##call And a few various variables and flags have also been updated. This has been sitting in linux-next for some time, and I have not heard a single complaint about this rename breaking anything. Mostly because these functions, variables and structures are mostly internal to the tracing system and are seldom (if ever) used by anything external to that" * tag 'trace-v4.2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (33 commits) ring_buffer: Allow to exit the ring buffer benchmark immediately ring-buffer-benchmark: Fix the wrong type ring-buffer-benchmark: Fix the wrong param in module_param ring-buffer: Add enum names for the context levels ring-buffer: Remove useless unused tracing_off_permanent() ring-buffer: Give NMIs a chance to lock the reader_lock ring-buffer: Add trace_recursive checks to ring_buffer_write() ring-buffer: Allways do the trace_recursive checks ring-buffer: Move recursive check to per_cpu descriptor ring-buffer: Add unlikelys to make fast path the default tracing: Rename ftrace_get_offsets_##call() to trace_event_get_offsets_##call() tracing: Rename ftrace_define_fields_##call() to trace_event_define_fields_##call() tracing: Rename ftrace_event_type_funcs_##call to trace_event_type_funcs_##call tracing: Rename ftrace_data_offset_##call to trace_event_data_offset_##call tracing: Rename ftrace_raw_##call event structures to trace_event_raw_##call tracing: Rename ftrace_trigger_soft_disabled() to trace_trigger_soft_disabled() tracing: Rename FTRACE_EVENT_FL_* flags to EVENT_FILE_FL_* tracing: Rename struct ftrace_subsystem_dir to trace_subsystem_dir tracing: Rename ftrace_event_name() to trace_event_name() tracing: Rename FTRACE_MAX_EVENT to TRACE_EVENT_TYPE_MAX ...
2015-06-26Merge branch 'for-linus' of git://ftp.arm.linux.org.uk/~rmk/linux-armLinus Torvalds
Pull ARM updates from Russell King: "Bigger items included in this update are: - A series of updates from Arnd for ARM randconfig build failures - Updates from Dmitry for StrongARM SA-1100 to move IRQ handling to drivers/irqchip/ - Move ARMs SP804 timer to drivers/clocksource/ - Perf updates from Mark Rutland in preparation to move the ARM perf code into drivers/ so it can be shared with ARM64. - MCPM updates from Nicolas - Add support for taking platform serial number from DT - Re-implement Keystone2 physical address space switch to conform to architecture requirements - Clean up ARMv7 LPAE code, which goes in hand with the Keystone2 changes. - L2C cleanups to avoid unlocking caches if we're prevented by the secure support to unlock. - Avoid cleaning a potentially dirty cache containing stale data on CPU initialisation - Add ARM-only entry point for secondary startup (for machines that can only call into a Thumb kernel in ARM mode). Same thing is also done for the resume entry point. - Provide arch_irqs_disabled via asm-generic - Enlarge ARMv7M vector table - Always use BFD linker for VDSO, as gold doesn't accept some of the options we need. - Fix an incorrect BSYM (for Thumb symbols) usage, and convert all BSYM compiler macros to a "badr" (for branch address). - Shut up compiler warnings provoked by our cmpxchg() implementation. - Ensure bad xchg sizes fail to link" * 'for-linus' of git://ftp.arm.linux.org.uk/~rmk/linux-arm: (75 commits) ARM: Fix build if CLKDEV_LOOKUP is not configured ARM: fix new BSYM() usage introduced via for-arm-soc branch ARM: 8383/1: nommu: avoid deprecated source register on mov ARM: 8391/1: l2c: add options to overwrite prefetching behavior ARM: 8390/1: irqflags: Get arch_irqs_disabled from asm-generic ARM: 8387/1: arm/mm/dma-mapping.c: Add arm_coherent_dma_mmap ARM: 8388/1: tcm: Don't crash when TCM banks are protected by TrustZone ARM: 8384/1: VDSO: force use of BFD linker ARM: 8385/1: VDSO: group link options ARM: cmpxchg: avoid warnings from macro-ized cmpxchg() implementations ARM: remove __bad_xchg definition ARM: 8369/1: ARMv7M: define size of vector table for Vybrid ARM: 8382/1: clocksource: make ARM_TIMER_SP804 depend on GENERIC_SCHED_CLOCK ARM: 8366/1: move Dual-Timer SP804 driver to drivers/clocksource ARM: 8365/1: introduce sp804_timer_disable and remove arm_timer.h inclusion ARM: 8364/1: fix BE32 module loading ARM: 8360/1: add secondary_startup_arm prototype in header file ARM: 8359/1: correct secondary_startup_arm mode ARM: proc-v7: sanitise and document registers around errata ARM: proc-v7: clean up MIDR access ...
2015-06-22Merge branch 'timers-core-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer updates from Thomas Gleixner: "A rather largish update for everything time and timer related: - Cache footprint optimizations for both hrtimers and timer wheel - Lower the NOHZ impact on systems which have NOHZ or timer migration disabled at runtime. - Optimize run time overhead of hrtimer interrupt by making the clock offset updates smarter - hrtimer cleanups and removal of restrictions to tackle some problems in sched/perf - Some more leap second tweaks - Another round of changes addressing the 2038 problem - First step to change the internals of clock event devices by introducing the necessary infrastructure - Allow constant folding for usecs/msecs_to_jiffies() - The usual pile of clockevent/clocksource driver updates The hrtimer changes contain updates to sched, perf and x86 as they depend on them plus changes all over the tree to cleanup API changes and redundant code, which got copied all over the place. The y2038 changes touch s390 to remove the last non 2038 safe code related to boot/persistant clock" * 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (114 commits) clocksource: Increase dependencies of timer-stm32 to limit build wreckage timer: Minimize nohz off overhead timer: Reduce timer migration overhead if disabled timer: Stats: Simplify the flags handling timer: Replace timer base by a cpu index timer: Use hlist for the timer wheel hash buckets timer: Remove FIFO "guarantee" timers: Sanitize catchup_timer_jiffies() usage hrtimer: Allow hrtimer::function() to free the timer seqcount: Introduce raw_write_seqcount_barrier() seqcount: Rename write_seqcount_barrier() hrtimer: Fix hrtimer_is_queued() hole hrtimer: Remove HRTIMER_STATE_MIGRATE selftest: Timers: Avoid signal deadlock in leap-a-day timekeeping: Copy the shadow-timekeeper over the real timekeeper last clockevents: Check state instead of mode in suspend/resume path selftests: timers: Add leap-second timer edge testing to leap-a-day.c ntp: Do leapsecond adjustment in adjtimex read path time: Prevent early expiry of hrtimers[CLOCK_REALTIME] at the leap second edge ntp: Introduce and use SECS_PER_DAY macro instead of 86400 ...
2015-06-07perf/x86/intel: Introduce PERF_RECORD_LOST_SAMPLESKan Liang
After enlarging the PEBS interrupt threshold, there may be some mixed up PEBS samples which are discarded by the kernel. This patch makes the kernel emit a PERF_RECORD_LOST_SAMPLES record with the number of possible discarded records when it is impossible to demux the samples. It makes sure the user is not left in the dark about such discards. Signed-off-by: Kan Liang <kan.liang@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: acme@infradead.org Cc: eranian@google.com Link: http://lkml.kernel.org/r/1431285195-14269-8-git-send-email-kan.liang@intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-06-07perf/x86/intel: Handle multiple records in the PEBS bufferYan, Zheng
When the PEBS interrupt threshold is larger than one record and the machine supports multiple PEBS events, the records of these events are mixed up and we need to demultiplex them. Demuxing the records is hard because the hardware is deficient. The hardware has two issues that, when combined, create impossible scenarios to demux. The first issue is that the 'status' field of the PEBS record is a copy of the GLOBAL_STATUS MSR at PEBS assist time. To see why this is a problem let us first describe the regular PEBS cycle: A) the CTRn value reaches 0: - the corresponding bit in GLOBAL_STATUS gets set - we start arming the hardware assist < some unspecified amount of time later -- this could cover multiple events of interest > B) the hardware assist is armed, any next event will trigger it C) a matching event happens: - the hardware assist triggers and generates a PEBS record this includes a copy of GLOBAL_STATUS at this moment - if we auto-reload we (re)set CTRn - we clear the relevant bit in GLOBAL_STATUS Now consider the following chain of events: A0, B0, A1, C0 The event generated for counter 0 will include a status with counter 1 set, even though its not at all related to the record. A similar thing can happen with a !PEBS event if it just happens to overflow at the right moment. The second issue is that the hardware will only emit one record for two or more counters if the event that triggers the assist is 'close'. The 'close' can be several cycles. In some cases even the complete assist, if the event is something that doesn't need retirement. For instance, consider this chain of events: A0, B0, A1, B1, C01 Where C01 is an event that triggers both hardware assists, we will generate but a single record, but again with both counters listed in the status field. This time the record pertains to both events. Note that these two cases are different but undistinguishable with the data as generated. Therefore demuxing records with multiple PEBS bits (we can safely ignore status bits for !PEBS counters) is impossible. Furthermore we cannot emit the record to both events because that might cause a data leak -- the events might not have the same privileges -- so what this patch does is discard such events. The assumption/hope is that such discards will be rare. Here lists some possible ways you may get high discard rate. - when you count the same thing multiple times. But it is not a useful configuration. - you can be unfortunate if you measure with a userspace only PEBS event along with either a kernel or unrestricted PEBS event. Imagine the event triggering and setting the overflow flag right before entering the kernel. Then all kernel side events will end up with multiple bits set. Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com> Signed-off-by: Kan Liang <kan.liang@intel.com> [ Changelog improvements. ] Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: acme@infradead.org Cc: eranian@google.com Link: http://lkml.kernel.org/r/1430940834-8964-4-git-send-email-kan.liang@intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-05-27perf: allow for PMU-specific event filteringMark Rutland
In certain circumstances it may not be possible to schedule particular events due to constraints other than a lack of hardware counters (e.g. on big.LITTLE systems where CPUs support different events). The core perf event code does not distinguish these cases and pessimistically assumes that any failure to schedule an event means that it is not worth attempting to schedule later events, even if some hardware counters are still unused. When an event a pmu cannot schedule exists in a flexible group list it can unnecessarily prevent event groups following it in the list from being scheduled (until it is rotated to the end of the list). This means some events are scheduled for only a portion of the time they could be, and for short running programs no events may be scheduled if the list is initially sorted in an unfortunate order. This patch adds a new (optional) filter_match function pointer to struct pmu which a pmu driver can use to tell perf core when an event matches pmu-specific scheduling requirements. This plugs into the existing event_filter_match logic, and makes it possible to avoid the scheduling problem described above. When no filter is provided by the PMU, the existing behaviour is retained. Cc: Paul Mackerras <paulus@samba.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Acked-by: Will Deacon <will.deacon@arm.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Will Deacon <will.deacon@arm.com>
2015-05-27perf/x86/intel/cqm: Use proper data typesThomas Gleixner
'int' is really not a proper data type for an MSR. Use u32 to make it clear that we are dealing with a 32-bit unsigned hardware value. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Matt Fleming <matt.fleming@intel.com> Cc: Kanaka Juvva <kanaka.d.juvva@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Vikas Shivappa <vikas.shivappa@linux.intel.com> Cc: Will Auld <will.auld@intel.com> Link: http://lkml.kernel.org/r/20150518235149.919350144@linutronix.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-05-27Merge branch 'perf/urgent' into perf/core, before applying dependent patchesIngo Molnar
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-05-27perf/x86: Fix event/group validationPeter Zijlstra
Commit 43b4578071c0 ("perf/x86: Reduce stack usage of x86_schedule_events()") violated the rule that 'fake' scheduling; as used for event/group validation; should not change the event state. This went mostly un-noticed because repeated calls of x86_pmu::get_event_constraints() would give the same result. And x86_pmu::put_event_constraints() would mostly not do anything. Commit e979121b1b15 ("perf/x86/intel: Implement cross-HT corruption bug workaround") made the situation much worse by actually setting the event->hw.constraint value to NULL, so when validation and actual scheduling interact we get NULL ptr derefs. Fix it by removing the constraint pointer from the event and move it back to an array, this time in cpuc instead of on the stack. validate_group() x86_schedule_events() event->hw.constraint = c; # store <context switch> perf_task_event_sched_in() ... x86_schedule_events(); event->hw.constraint = c2; # store ... put_event_constraints(event); # assume failure to schedule intel_put_event_constraints() event->hw.constraint = NULL; <context switch end> c = event->hw.constraint; # read -> NULL if (!test_bit(hwc->idx, c->idxmsk)) # <- *BOOM* NULL deref This in particular is possible when the event in question is a cpu-wide event and group-leader, where the validate_group() tries to add an event to the group. Reported-by: Vince Weaver <vincent.weaver@maine.edu> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andrew Hunter <ahh@google.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Maria Dimakopoulou <maria.n.dimakopoulou@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Fixes: 43b4578071c0 ("perf/x86: Reduce stack usage of x86_schedule_events()") Fixes: e979121b1b15 ("perf/x86/intel: Implement cross-HT corruption bug workaround") Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-05-18sched,perf: Fix periodic timersPeter Zijlstra
In the below two commits (see Fixes) we have periodic timers that can stop themselves when they're no longer required, but need to be (re)-started when their idle condition changes. Further complications is that we want the timer handler to always do the forward such that it will always correctly deal with the overruns, and we do not want to race such that the handler has already decided to stop, but the (external) restart sees the timer still active and we end up with a 'lost' timer. The problem with the current code is that the re-start can come before the callback does the forward, at which point the forward from the callback will WARN about forwarding an enqueued timer. Now, conceptually its easy to detect if you're before or after the fwd by comparing the expiration time against the current time. Of course, that's expensive (and racy) because we don't have the current time. Alternatively one could cache this state inside the timer, but then everybody pays the overhead of maintaining this extra state, and that is undesired. The only other option that I could see is the external timer_active variable, which I tried to kill before. I would love a nicer interface for this seemingly simple 'problem' but alas. Fixes: 272325c4821f ("perf: Fix mux_interval hrtimer wreckage") Fixes: 77a4d1a1b9a1 ("sched: Cleanup bandwidth timers") Cc: pjt@google.com Cc: tglx@linutronix.de Cc: klamm@yandex-team.ru Cc: mingo@kernel.org Cc: bsegall@google.com Cc: hpa@zytor.com Cc: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: http://lkml.kernel.org/r/20150514102311.GX21418@twins.programming.kicks-ass.net
2015-05-13tracing: Rename ftrace_event_{call,class} to trace_event_{call,class}Steven Rostedt (Red Hat)
The name "ftrace" really refers to the function hook infrastructure. It is not about the trace_events. The structures ftrace_event_call and ftrace_event_class have nothing to do with the function hooks, and are really trace_event structures. Rename ftrace_event_* to trace_event_*. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-05-08perf: Fix software migrate eventsPeter Zijlstra
Stephane asked about PERF_COUNT_SW_CPU_MIGRATIONS and I realized it was borken: > The problem is that the task isn't actually scheduled while its being > migrated (obviously), and if its not scheduled, the counters aren't > scheduled either, so there's no observing of the fact. > > A further problem with migrations is that many migrations happen from > softirq context, which is nested inside the 'random' task context of > whoemever happens to run at that time, similarly for the wakeup > migrations triggered from (soft)irq context. All those end up being > accounted in the task that's currently running, eg. your 'ls'. The below cures this by marking a task as migrated and accounting it on the subsequent sched_in(). Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Borislav Petkov <bp@alien8.de> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-04-02perf: Add ITRACE_START record to indicate that tracing has startedAlexander Shishkin
For counters that generate AUX data that is bound to the context of a running task, such as instruction tracing, the decoder needs to know exactly which task is running when the event is first scheduled in, before the first sched_switch. The decoder's need to know this stems from the fact that instruction flow trace decoding will almost always require program's object code in order to reconstruct said flow and for that we need at least its pid/tid in the perf stream. To single out such instruction tracing pmus, this patch introduces ITRACE PMU capability. The reason this is not part of RECORD_AUX record is that not all pmus capable of generating AUX data need this, and the opposite is *probably* also true. While sched_switch covers for most cases, there are two problems with it: the consumer will need to process events out of order (that is, having found RECORD_AUX, it will have to skip forward to the nearest sched_switch to figure out which task it was, then go back to the actual trace to decode it) and it completely misses the case when the tracing is enabled and disabled before sched_switch, for example, via PERF_EVENT_IOC_DISABLE. Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Kaixu Xia <kaixu.xia@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Robert Richter <rric@kernel.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: acme@infradead.org Cc: adrian.hunter@intel.com Cc: kan.liang@intel.com Cc: markus.t.metzger@intel.com Cc: mathieu.poirier@linaro.org Link: http://lkml.kernel.org/r/1421237903-181015-15-git-send-email-alexander.shishkin@linux.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-04-02perf: Add API for PMUs to write to the AUX areaAlexander Shishkin
For pmus that wish to write data to ring buffer's AUX area, provide perf_aux_output_{begin,end}() calls to initiate/commit data writes, similarly to perf_output_{begin,end}. These also use the same output handle structure. Also, similarly to software counterparts, these will direct inherited events' output to parents' ring buffers. After the perf_aux_output_begin() returns successfully, handle->size is set to the maximum amount of data that can be written wrt aux_tail pointer, so that no data that the user hasn't seen will be overwritten, therefore this should always be called before hardware writing is enabled. On success, this will return the pointer to pmu driver's private structure allocated for this aux area by pmu::setup_aux. Same pointer can also be retrieved using perf_get_aux() while hardware writing is enabled. PMU driver should pass the actual amount of data written as a parameter to perf_aux_output_end(). All hardware writes should be completed and visible before this one is called. Additionally, perf_aux_output_skip() will adjust output handle and aux_head in case some part of the buffer has to be skipped over to maintain hardware's alignment constraints. Nested writers are forbidden and guards are in place to catch such attempts. Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Kaixu Xia <kaixu.xia@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Robert Richter <rric@kernel.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: acme@infradead.org Cc: adrian.hunter@intel.com Cc: kan.liang@intel.com Cc: markus.t.metzger@intel.com Cc: mathieu.poirier@linaro.org Link: http://lkml.kernel.org/r/1421237903-181015-8-git-send-email-alexander.shishkin@linux.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-04-02perf: Add a pmu capability for "exclusive" eventsAlexander Shishkin
Usually, pmus that do, for example, instruction tracing, would only ever be able to have one event per task per cpu (or per perf_event_context). For such pmus it makes sense to disallow creating conflicting events early on, so as to provide consistent behavior for the user. This patch adds a pmu capability that indicates such constraint on event creation. Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Kaixu Xia <kaixu.xia@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Robert Richter <rric@kernel.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: acme@infradead.org Cc: adrian.hunter@intel.com Cc: kan.liang@intel.com Cc: markus.t.metzger@intel.com Cc: mathieu.poirier@linaro.org Link: http://lkml.kernel.org/r/1422613866-113186-1-git-send-email-alexander.shishkin@linux.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-04-02perf: Add a capability for AUX_NO_SG pmus to do software double bufferingAlexander Shishkin
For pmus that don't support scatter-gather for AUX data in hardware, it might still make sense to implement software double buffering to avoid losing data while the user is reading data out. For this purpose, add a pmu capability that guarantees multiple high-order chunks for AUX buffer, so that the pmu driver can do switchover tricks. To make use of this feature, add PERF_PMU_CAP_AUX_SW_DOUBLEBUF to your pmu's capability mask. This will make the ring buffer AUX allocation code ensure that the biggest high order allocation for the aux buffer pages is no bigger than half of the total requested buffer size, thus making sure that the buffer has at least two high order allocations. Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Kaixu Xia <kaixu.xia@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Robert Richter <rric@kernel.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: acme@infradead.org Cc: adrian.hunter@intel.com Cc: kan.liang@intel.com Cc: markus.t.metzger@intel.com Cc: mathieu.poirier@linaro.org Link: http://lkml.kernel.org/r/1421237903-181015-5-git-send-email-alexander.shishkin@linux.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-04-02perf: Support high-order allocations for AUX spaceAlexander Shishkin
Some pmus (such as BTS or Intel PT without multiple-entry ToPA capability) don't support scatter-gather and will prefer larger contiguous areas for their output regions. This patch adds a new pmu capability to request higher order allocations. Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Kaixu Xia <kaixu.xia@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Robert Richter <rric@kernel.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: acme@infradead.org Cc: adrian.hunter@intel.com Cc: kan.liang@intel.com Cc: markus.t.metzger@intel.com Cc: mathieu.poirier@linaro.org Link: http://lkml.kernel.org/r/1421237903-181015-4-git-send-email-alexander.shishkin@linux.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-04-02perf: Add AUX area to ring buffer for raw data streamsPeter Zijlstra
This patch introduces "AUX space" in the perf mmap buffer, intended for exporting high bandwidth data streams to userspace, such as instruction flow traces. AUX space is a ring buffer, defined by aux_{offset,size} fields in the user_page structure, and read/write pointers aux_{head,tail}, which abide by the same rules as data_* counterparts of the main perf buffer. In order to allocate/mmap AUX, userspace needs to set up aux_offset to such an offset that will be greater than data_offset+data_size and aux_size to be the desired buffer size. Both need to be page aligned. Then, same aux_offset and aux_size should be passed to mmap() call and if everything adds up, you should have an AUX buffer as a result. Pages that are mapped into this buffer also come out of user's mlock rlimit plus perf_event_mlock_kb allowance. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Kaixu Xia <kaixu.xia@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Robert Richter <rric@kernel.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: acme@infradead.org Cc: adrian.hunter@intel.com Cc: kan.liang@intel.com Cc: markus.t.metzger@intel.com Cc: mathieu.poirier@linaro.org Link: http://lkml.kernel.org/r/1421237903-181015-3-git-send-email-alexander.shishkin@linux.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-03-27perf: Add per event clockid supportPeter Zijlstra
While thinking on the whole clock discussion it occurred to me we have two distinct uses of time: 1) the tracking of event/ctx/cgroup enabled/running/stopped times which includes the self-monitoring support in struct perf_event_mmap_page. 2) the actual timestamps visible in the data records. And we've been conflating them. The first is all about tracking time deltas, nobody should really care in what time base that happens, its all relative information, as long as its internally consistent it works. The second however is what people are worried about when having to merge their data with external sources. And here we have the discussion on MONOTONIC vs MONOTONIC_RAW etc.. Where MONOTONIC is good for correlating between machines (static offset), MONOTNIC_RAW is required for correlating against a fixed rate hardware clock. This means configurability; now 1) makes that hard because it needs to be internally consistent across groups of unrelated events; which is why we had to have a global perf_clock(). However, for 2) it doesn't really matter, perf itself doesn't care what it writes into the buffer. The below patch makes the distinction between these two cases by adding perf_event_clock() which is used for the second case. It further makes this configurable on a per-event basis, but adds a few sanity checks such that we cannot combine events with different clocks in confusing ways. And since we then have per-event configurability we might as well retain the 'legacy' behaviour as a default. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: David Ahern <dsahern@gmail.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: John Stultz <john.stultz@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-03-27Merge branch 'perf/x86' into perf/core, because it's readyIngo Molnar
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-03-23perf: Remove type specific target pointersPeter Zijlstra
The only reason CQM had to use a hard-coded pmu type was so it could use cqm_target in hw_perf_event. Do away with the {tp,bp,cqm}_target pointers and provide a non type specific one. This allows us to do away with that silly pmu type as well. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Vince Weaver <vince@deater.net> Cc: acme@kernel.org Cc: acme@redhat.com Cc: hpa@zytor.com Cc: jolsa@redhat.com Cc: kanaka.d.juvva@intel.com Cc: matt.fleming@intel.com Cc: tglx@linutronix.de Cc: torvalds@linux-foundation.org Cc: vikas.shivappa@linux.intel.com Link: http://lkml.kernel.org/r/20150305211019.GU21418@twins.programming.kicks-ass.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-02-26Merge tag 'v4.0-rc1' into perf/core, to refresh the treeIngo Molnar
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-02-25perf/x86/intel: Support task events with Intel CQMMatt Fleming
Add support for task events as well as system-wide events. This change has a big impact on the way that we gather LLC occupancy values in intel_cqm_event_read(). Currently, for system-wide (per-cpu) events we defer processing to userspace which knows how to discard all but one cpu result per package. Things aren't so simple for task events because we need to do the value aggregation ourselves. To do this, we defer updating the LLC occupancy value in event->count from intel_cqm_event_read() and do an SMP cross-call to read values for all packages in intel_cqm_event_count(). We need to ensure that we only do this for one task event per cache group, otherwise we'll report duplicate values. If we're a system-wide event we want to fallback to the default perf_event_count() implementation. Refactor this into a common function so that we don't duplicate the code. Also, introduce PERF_TYPE_INTEL_CQM, since we need a way to track an event's task (if the event isn't per-cpu) inside of the Intel CQM PMU driver. This task information is only availble in the upper layers of the perf infrastructure. Other perf backends stash the target task in event->hw.*target so we need to do something similar. The task is used to determine whether events should share a cache group and an RMID. Signed-off-by: Matt Fleming <matt.fleming@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Kanaka Juvva <kanaka.d.juvva@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Vikas Shivappa <vikas.shivappa@linux.intel.com> Cc: linux-api@vger.kernel.org Link: http://lkml.kernel.org/r/1422038748-21397-8-git-send-email-matt@codeblueprint.co.uk Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-02-25perf/x86/intel: Add Intel Cache QoS Monitoring supportMatt Fleming
Future Intel Xeon processors support a Cache QoS Monitoring feature that allows tracking of the LLC occupancy for a task or task group, i.e. the amount of data in pulled into the LLC for the task (group). Currently the PMU only supports per-cpu events. We create an event for each cpu and read out all the LLC occupancy values. Because this results in duplicate values being written out to userspace, we also export a .per-pkg event file so that the perf tools only accumulate values for one cpu per package. Signed-off-by: Matt Fleming <matt.fleming@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Kanaka Juvva <kanaka.d.juvva@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Vikas Shivappa <vikas.shivappa@linux.intel.com> Link: http://lkml.kernel.org/r/1422038748-21397-6-git-send-email-matt@codeblueprint.co.uk Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-02-25perf: Add ->count() function to read per-package countersMatt Fleming
For PMU drivers that record per-package counters, the ->count variable cannot be used to record an accurate aggregated value, since it's not possible to perform SMP cross-calls to cpus on other packages from the context in which we update ->count. Introduce a new optional ->count() accessor function that can be used to customize how values are collected. If a PMU driver doesn't provide a ->count() function, we fallback to the existing code. There is necessarily a window of staleness with this approach because the task that generated the counter value may not have been scheduled by the cpu recently. An alternative and more complex approach would be to use a hrtimer to periodically refresh the values from a more permissive scheduling context. So, we're trading off complexity for accuracy. Signed-off-by: Matt Fleming <matt.fleming@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Kanaka Juvva <kanaka.d.juvva@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Vikas Shivappa <vikas.shivappa@linux.intel.com> Link: http://lkml.kernel.org/r/1422038748-21397-3-git-send-email-matt@codeblueprint.co.uk Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-02-25perf: Make perf_cgroup_from_task() globalMatt Fleming
Move perf_cgroup_from_task() from kernel/events/ to include/linux/ along with the necessary struct definitions, so that it can be used by the PMU code. When the upcoming Intel Cache Monitoring PMU driver assigns monitoring IDs to perf events, it needs to be able to check whether any two monitoring events overlap (say, a cgroup and task event), which means we need to be able to lookup the cgroup associated with a task (if any). Signed-off-by: Matt Fleming <matt.fleming@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Kanaka Juvva <kanaka.d.juvva@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Vikas Shivappa <vikas.shivappa@linux.intel.com> Link: http://lkml.kernel.org/r/1422038748-21397-2-git-send-email-matt@codeblueprint.co.uk Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-02-18perf, powerpc: Fix up flush_branch_stack() usersPeter Zijlstra
The recent LBR rework for x86 left a stray flush_branch_stack() user in the PowerPC code, fix that up. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com> Cc: Anton Blanchard <anton@samba.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Christoph Lameter <cl@linux.com> Cc: Joel Stanley <joel@jms.id.au> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michael Neuling <mikey@neuling.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Tejun Heo <tj@kernel.org> Cc: linuxppc-dev@lists.ozlabs.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-02-18perf: Simplify the branch stack checkYan, Zheng
Use event->attr.branch_sample_type to replace intel_pmu_needs_lbr_smpl() for avoiding duplicated code that implicitly enables the LBR. Currently, branch stack can be enabled by user explicitly requesting branch sampling or implicit branch sampling to correct PEBS skid. For user explicitly requested branch sampling, the branch_sample_type is explicitly set by user. For PEBS case, the branch_sample_type is also implicitly set to PERF_SAMPLE_BRANCH_ANY in x86_pmu_hw_config. Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com> Signed-off-by: Kan Liang <kan.liang@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul Mackerras <paulus@samba.org> Cc: eranian@google.com Cc: jolsa@redhat.com Link: http://lkml.kernel.org/r/1415156173-10035-11-git-send-email-kan.liang@intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-02-18perf: Add pmu specific data for perf task contextYan, Zheng
Introduce a new flag PERF_ATTACH_TASK_DATA for perf event's attach stata. The flag is set by PMU's event_init() callback, it indicates that perf event needs PMU specific data. The PMU specific data are initialized to zeros. Later patches will use PMU specific data to save LBR stack. Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com> Signed-off-by: Kan Liang <kan.liang@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul Mackerras <paulus@samba.org> Cc: eranian@google.com Cc: jolsa@redhat.com Link: http://lkml.kernel.org/r/1415156173-10035-6-git-send-email-kan.liang@intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-02-18perf/x86/intel: Use context switch callback to flush LBR stackYan, Zheng
Previous commit introduces context switch callback, its function overlaps with the flush branch stack callback. So we can use the context switch callback to flush LBR stack. This patch adds code that uses the flush branch callback to flush the LBR stack when task is being scheduled in. The callback is enabled only when there are events use the LBR hardware. This patch also removes all old flush branch stack code. Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com> Signed-off-by: Kan Liang <kan.liang@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Vince Weaver <vincent.weaver@maine.edu> Cc: eranian@google.com Cc: jolsa@redhat.com Link: http://lkml.kernel.org/r/1415156173-10035-4-git-send-email-kan.liang@intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-02-18perf: Introduce pmu context switch callbackYan, Zheng
The callback is invoked when process is scheduled in or out. It provides mechanism for later patches to save/store the LBR stack. For the schedule in case, the callback is invoked at the same place that flush branch stack callback is invoked. So it also can replace the flush branch stack callback. To avoid unnecessary overhead, the callback is enabled only when there are events use the LBR stack. Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com> Signed-off-by: Kan Liang <kan.liang@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Vince Weaver <vincent.weaver@maine.edu> Cc: eranian@google.com Cc: jolsa@redhat.com Link: http://lkml.kernel.org/r/1415156173-10035-3-git-send-email-kan.liang@intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-02-16Merge branch 'perf-core-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 perf updates from Ingo Molnar: "This series tightens up RDPMC permissions: currently even highly sandboxed x86 execution environments (such as seccomp) have permission to execute RDPMC, which may leak various perf events / PMU state such as timing information and other CPU execution details. This 'all is allowed' RDPMC mode is still preserved as the (non-default) /sys/devices/cpu/rdpmc=2 setting. The new default is that RDPMC access is only allowed if a perf event is mmap-ed (which is needed to correctly interpret RDPMC counter values in any case). As a side effect of these changes CR4 handling is cleaned up in the x86 code and a shadow copy of the CR4 value is added. The extra CR4 manipulation adds ~ <50ns to the context switch cost between rdpmc-capable and rdpmc-non-capable mms" * 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf/x86: Add /sys/devices/cpu/rdpmc=2 to allow rdpmc for all tasks perf/x86: Only allow rdpmc if a perf_event is mapped perf: Pass the event to arch_perf_update_userpage() perf: Add pmu callbacks to track event mapping and unmapping x86: Add a comment clarifying LDT context switching x86: Store a per-cpu shadow copy of CR4 x86: Clean up cr4 manipulation
2015-02-11Merge tag 'powerpc-3.20-1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mpe/linux Pull powerpc updates from Michael Ellerman: - Update of all defconfigs - Addition of a bunch of config options to modernise our defconfigs - Some PS3 updates from Geoff - Optimised memcmp for 64 bit from Anton - Fix for kprobes that allows 'perf probe' to work from Naveen - Several cxl updates from Ian & Ryan - Expanded support for the '24x7' PMU from Cody & Sukadev - Freescale updates from Scott: "Highlights include 8xx optimizations, some more work on datapath device tree content, e300 machine check support, t1040 corenet error reporting, and various cleanups and fixes" * tag 'powerpc-3.20-1' of git://git.kernel.org/pub/scm/linux/kernel/git/mpe/linux: (102 commits) cxl: Add missing return statement after handling AFU errror cxl: Fail AFU initialisation if an invalid configuration record is found cxl: Export optional AFU configuration record in sysfs powerpc/mm: Warn on flushing tlb page in kernel context powerpc/powernv: Add OPAL soft-poweroff routine powerpc/perf/hv-24x7: Document sysfs event description entries powerpc/perf/hv-gpci: add the remaining gpci requests powerpc/perf/{hv-gpci, hv-common}: generate requests with counters annotated powerpc/perf/hv-24x7: parse catalog and populate sysfs with events perf: define EVENT_DEFINE_RANGE_FORMAT_LITE helper perf: add PMU_EVENT_ATTR_STRING() helper perf: provide sysfs_show for struct perf_pmu_events_attr powerpc/kernel: Avoid initializing device-tree pointer twice powerpc: Remove old compile time disabled syscall tracing code powerpc/kernel: Make syscall_exit a local label cxl: Fix device_node reference counting powerpc/mm: bail out early when flushing TLB page powerpc: defconfigs: add MTD_SPI_NOR (new dependency for M25P80) perf/powerpc: reset event hw state when adding it to the PMU powerpc/qe: Use strlcpy() ...
2015-02-04perf: Add pmu callbacks to track event mapping and unmappingAndy Lutomirski
Signed-off-by: Andy Lutomirski <luto@amacapital.net> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Kees Cook <keescook@chromium.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Vince Weaver <vince@deater.net> Cc: "hillf.zj" <hillf.zj@alibaba-inc.com> Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/266afcba1d1f91ea5501e4e16e94bbbc1a9339b6.1414190806.git.luto@amacapital.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-02-04perf: Decouple unthrottling and rotatingMark Rutland
Currently the adjusments made as part of perf_event_task_tick() use the percpu rotation lists to iterate over any active PMU contexts, but these are not used by the context rotation code, having been replaced by separate (per-context) hrtimer callbacks. However, some manipulation of the rotation lists (i.e. removal of contexts) has remained in perf_rotate_context(). This leads to the following issues: * Contexts are not always removed from the rotation lists. Removal of PMUs which have been placed in rotation lists, but have not been removed by a hrtimer callback can result in corruption of the rotation lists (when memory backing the context is freed). This has been observed to result in hangs when PMU drivers built as modules are inserted and removed around the creation of events for said PMUs. * Contexts which do not require rotation may be removed from the rotation lists as a result of a hrtimer, and will not be considered by the unthrottling code in perf_event_task_tick. This patch fixes the issue by updating the rotation ist when events are scheduled in/out, ensuring that each rotation list stays in sync with the HW state. As each event holds a refcount on the module of its PMU, this ensures that when a PMU module is unloaded none of its CPU contexts can be in a rotation list. By maintaining a list of perf_event_contexts rather than perf_event_cpu_contexts, we don't need separate paths to handle the cpu and task contexts, which also makes the code a little simpler. As the rotation_list variables are not used for rotation, these are renamed to active_ctx_list, which better matches their current function. perf_pmu_rotate_{start,stop} are renamed to perf_pmu_ctx_{activate,deactivate}. Reported-by: Johannes Jensen <johannes.jensen@arm.com> Signed-off-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Will Deacon <Will.Deacon@arm.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Fengguang Wu <fengguang.wu@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/20150129134511.GR17721@leverpostej Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-02-02perf: add PMU_EVENT_ATTR_STRING() helperCody P Schafer
Helper for constructing static struct perf_pmu_events_attr s. Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com> Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com> Acked-by: Jiri Olsa <jolsa@redhat.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-02-02perf: provide sysfs_show for struct perf_pmu_events_attrCody P Schafer
(struct perf_pmu_events_attr) is defined in include/linux/perf_event.h, but the only "show" for it is in x86 and contains x86 specific stuff. Make a generic one for those of us who are just using the event_str. Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com> Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com> Acked-by: Jiri Olsa <jolsa@redhat.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-01-28Merge branch 'perf/urgent' into perf/core, to pick up fixesIngo Molnar
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-01-28perf: Tighten (and fix) the grouping conditionPeter Zijlstra
The fix from 9fc81d87420d ("perf: Fix events installation during moving group") was incomplete in that it failed to recognise that creating a group with events for different CPUs is semantically broken -- they cannot be co-scheduled. Furthermore, it leads to real breakage where, when we create an event for CPU Y and then migrate it to form a group on CPU X, the code gets confused where the counter is programmed -- triggered in practice as well by me via the perf fuzzer. Fix this by tightening the rules for creating groups. Only allow grouping of counters that can be co-scheduled in the same context. This means for the same task and/or the same cpu. Fixes: 9fc81d87420d ("perf: Fix events installation during moving group") Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/20150123125834.090683288@infradead.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-01-14perf: Avoid horrible stack usagePeter Zijlstra (Intel)
Both Linus (most recent) and Steve (a while ago) reported that perf related callbacks have massive stack bloat. The problem is that software events need a pt_regs in order to properly report the event location and unwind stack. And because we could not assume one was present we allocated one on stack and filled it with minimal bits required for operation. Now, pt_regs is quite large, so this is undesirable. Furthermore it turns out that most sites actually have a pt_regs pointer available, making this even more onerous, as the stack space is pointless waste. This patch addresses the problem by observing that software events have well defined nesting semantics, therefore we can use static per-cpu storage instead of on-stack. Linus made the further observation that all but the scheduler callers of perf_sw_event() have a pt_regs available, so we change the regular perf_sw_event() to require a valid pt_regs (where it used to be optional) and add perf_sw_event_sched() for the scheduler. We have a scheduler specific call instead of a more generic _noregs() like construct because we can assume non-recursion from the scheduler and thereby simplify the code further (_noregs would have to put the recursion context call inline in order to assertain which __perf_regs element to use). One last note on the implementation of perf_trace_buf_prepare(); we allow .regs = NULL for those cases where we already have a pt_regs pointer available and do not need another. Reported-by: Linus Torvalds <torvalds@linux-foundation.org> Reported-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Javi Merino <javi.merino@arm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Petr Mladek <pmladek@suse.cz> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Tom Zanussi <tom.zanussi@linux.intel.com> Cc: Vaibhav Nagarnaik <vnagarnaik@google.com> Link: http://lkml.kernel.org/r/20141216115041.GW3337@twins.programming.kicks-ass.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-01-09perf: Move task_pt_regs sampling into arch codeAndy Lutomirski
On x86_64, at least, task_pt_regs may be only partially initialized in many contexts, so x86_64 should not use it without extra care from interrupt context, let alone NMI context. This will allow x86_64 to override the logic and will supply some scratch space to use to make a cleaner copy of user regs. Tested-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Andy Lutomirski <luto@amacapital.net> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: chenggang.qcg@taobao.com Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Namhyung Kim <namhyung@gmail.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: David Ahern <dsahern@gmail.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Jean Pihet <jean.pihet@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mark Salter <msalter@redhat.com> Cc: Russell King <linux@arm.linux.org.uk> Cc: Will Deacon <will.deacon@arm.com> Cc: linux-arm-kernel@lists.infradead.org Link: http://lkml.kernel.org/r/e431cd4c18c2e1c44c774f10758527fb2d1025c4.1420396372.git.luto@amacapital.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-11-16perf: Improve the perf_sample_data struct layoutPeter Zijlstra
This patch reorders fields in the perf_sample_data struct in order to minimize the number of cachelines touched in perf_sample_data_init(). It also removes some intializations which are redundant with the code in kernel/events/core.c Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: http://lkml.kernel.org/r/1411559322-16548-7-git-send-email-eranian@google.com Cc: cebbert.lkml@gmail.com Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: jolsa@redhat.com Cc: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-11-16perf: Add ability to sample machine state on interruptStephane Eranian
Enable capture of interrupted machine state for each sample. Registers to sample are passed per event in the sample_regs_intr bitmask. To sample interrupt machine state, the PERF_SAMPLE_INTR_REGS must be passed in sample_type. The list of available registers is arch dependent and provided by asm/perf_regs.h Registers are laid out as u64 in the order of the bit order of sample_intr_regs. This patch also adds a new ABI version PERF_ATTR_SIZE_VER4 because we extend the perf_event_attr struct with a new u64 field. Reviewed-by: Jiri Olsa <jolsa@redhat.com> Signed-off-by: Stephane Eranian <eranian@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: cebbert.lkml@gmail.com Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: linux-api@vger.kernel.org Link: http://lkml.kernel.org/r/1411559322-16548-2-git-send-email-eranian@google.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-08-24perf: Add PERF_EVENT_STATE_EXIT state for events with exited taskJiri Olsa
Adding new perf event state to indicate that the monitored task has exited. In this case the event stays alive until the owner task exits or close the event fd while providing the last data through the read syscall and ring buffer. Instead it needs to propagate the error info (monitored task has died) via poll and read syscalls by returning POLLHUP and 0 respectively. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Acked-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/20140811120102.GY9918@twins.programming.kicks-ass.net Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com> Cc: David Ahern <dsahern@gmail.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Jean Pihet <jean.pihet@linaro.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/n/tip-t5y3w8jjx6tfo5w8y6oajsjq@git.kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2014-08-13perf/x86: Fix data source encoding issues for load latency/precise storeStephane Eranian
This patch fixes issues introuduce by Andi's previous patch 'Revamp PEBS' series. This patch fixes the following: - precise_store_data_hsw() encode the mem op type whenever we can - precise_store_data_hsw set the default data source correctly - 0 is not a valid init value for data source. Define PERF_MEM_NA as the default value This bug was actually introduced by commit 722e76e60f2775c21b087ff12c5e678cf0ebcaaf Author: Stephane Eranian <eranian@google.com> Date: Thu May 15 17:56:44 2014 +0200 fix Haswell precise store data source encoding Signed-off-by: Stephane Eranian <eranian@google.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1407785233-32193-4-git-send-email-eranian@google.com Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: ak@linux.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-08-13perf: Add queued work to remove orphaned child eventsJiri Olsa
In cases when the owner task exits before the workload and the workload made some forks, all the events stay in until the last workload process exits. Thats' because each child event holds parent reference. We want to release all children events once the parent is gone, because at that time there's no process to read them anyway, so they're just eating resources. This removal races with process exit, which removes all events and fork, which clone events. To be clear of those two, adding work queue to remove orphaned child for context in case such event is detected. Using delayed work queue (with delay == 1), because we queue this work under perf scheduler callbacks. Normal work queue tries to wake up the queue process, which deadlocks on rq->lock in this place. Also preventing clones from abandoned parent event. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/1406896382-18404-4-git-send-email-jolsa@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-06-06perf: Differentiate exec() and non-exec() comm eventsAdrian Hunter
perf tools like 'perf report' can aggregate samples by comm strings, which generally works. However, there are other potential use-cases. For example, to pair up 'calls' with 'returns' accurately (from branch events like Intel BTS) it is necessary to identify whether the process has exec'd. Although a comm event is generated when an 'exec' happens it is also generated whenever the comm string is changed on a whim (e.g. by prctl PR_SET_NAME). This patch adds a flag to the comm event to differentiate one case from the other. In order to determine whether the kernel supports the new flag, a selection bit named 'exec' is added to struct perf_event_attr. The bit does nothing but will cause perf_event_open() to fail if the bit is set on kernels that do not have it defined. Signed-off-by: Adrian Hunter <adrian.hunter@intel.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/537D9EBE.7030806@intel.com Cc: Paul Mackerras <paulus@samba.org> Cc: Dave Jones <davej@redhat.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: David Ahern <dsahern@gmail.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: linux-fsdevel@vger.kernel.org Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-06-06Merge branch 'perf/urgent' into perf/core, to resolve conflict and to ↵Ingo Molnar
prepare for new patches Conflicts: arch/x86/kernel/traps.c Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-06-06perf: Fix perf_event_comm() vs. exec() assumptionPeter Zijlstra
perf_event_comm() assumes that set_task_comm() is only called on exec(), and in particular that its only called on current. Neither are true, as Dave reported a WARN triggered by set_task_comm() being called on !current. Separate the exec() hook from the comm hook. Reported-by: Dave Jones <davej@redhat.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: linux-fsdevel@vger.kernel.org Cc: linux-kernel@vger.kernel.org Link: http://lkml.kernel.org/r/20140521153219.GH5226@laptop.programming.kicks-ass.net [ Build fix. ] Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-06-05perf: Disable sampled events if no PMU interruptVince Weaver
Add common code to generate -ENOTSUPP at event creation time if an architecture attempts to create a sampled event and PERF_PMU_NO_INTERRUPT is set. This adds a new pmu->capabilities flag. Initially we only support PERF_PMU_NO_INTERRUPT (to indicate a PMU has no support for generating hardware interrupts) but there are other capabilities that can be added later. Signed-off-by: Vince Weaver <vincent.weaver@maine.edu> Acked-by: Will Deacon <will.deacon@arm.com> [peterz: rename to PERF_PMU_CAP_* and moved the pmu::capabilities word into a hole] Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1405161708060.11099@vincent-weaver-1.umelst.maine.edu Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org>