* [PATCH v0 0/2] perf: Allow forcing high-order allocation for AUX buffers @ 2019-02-13 11:47 Alexander Shishkin 2019-02-13 11:47 ` [PATCH v0 1/2] perf: Add an option to ask for high order allocations " Alexander Shishkin 2019-02-13 11:47 ` [PATCH v0 2/2] perf record: Add --aux-highorder Alexander Shishkin 0 siblings, 2 replies; 8+ messages in thread From: Alexander Shishkin @ 2019-02-13 11:47 UTC (permalink / raw) To: Peter Zijlstra, Arnaldo Carvalho de Melo Cc: Ingo Molnar, linux-kernel, jolsa, Alexander Shishkin Hi Peter and Arnaldo, It turns out that using high-order allocations for AUX buffers reduces the run-time performance penalty, for example, with Intel PT. The assumption is that this comes from not having to fetch the next page's address at every page boundary. Given a workload that does a lot of indirect branches (thus generating more PT data with addresses of branch targets), it takes around 6% longer to complete under PT in snapshot mode than without PT, but only around 4% if we use high-order output regions instead of single page output regions. This is measured on an Atom CPU. We already use high-order allocations for PMUs that don't do HW SG (like Intel PT on BDW). This patchset adds an attribute bit that enables the same for the PMUs that do have HW SG, and a command line option for perf record to set this bit. Alexander Shishkin (2): perf: Add an option to ask for high order allocations for AUX buffers perf record: Add --aux-highorder include/uapi/linux/perf_event.h | 3 ++- kernel/events/core.c | 3 +++ kernel/events/ring_buffer.c | 3 ++- tools/include/uapi/linux/perf_event.h | 3 ++- tools/perf/builtin-record.c | 2 ++ tools/perf/perf.h | 1 + tools/perf/util/evsel.c | 9 +++++++++ 7 files changed, 21 insertions(+), 3 deletions(-) -- 2.20.1 ^ permalink raw reply [flat|nested] 8+ messages in thread
* [PATCH v0 1/2] perf: Add an option to ask for high order allocations for AUX buffers 2019-02-13 11:47 [PATCH v0 0/2] perf: Allow forcing high-order allocation for AUX buffers Alexander Shishkin @ 2019-02-13 11:47 ` Alexander Shishkin 2019-02-13 13:07 ` Peter Zijlstra 2019-02-13 11:47 ` [PATCH v0 2/2] perf record: Add --aux-highorder Alexander Shishkin 1 sibling, 1 reply; 8+ messages in thread From: Alexander Shishkin @ 2019-02-13 11:47 UTC (permalink / raw) To: Peter Zijlstra, Arnaldo Carvalho de Melo Cc: Ingo Molnar, linux-kernel, jolsa, Alexander Shishkin Currently, the AUX buffer allocator will use high-order allocations for PMUs that don't support hardware scatter-gather chaining to ensure large contiguous blocks of pages, and always use an array of single pages otherwise. There is, however, a tangible performance benefit in using larger chunks of contiguous memory even in the latter case, that comes from not having to fetch the next page's address at every page boundary. In particular, a task running under Intel PT on an Atom CPU shows 1.5%-2% less runtime penalty with a single multi-page output region in snapshot mode (no PMI) than with multiple single-page output regions, from ~6% down to ~4%. For the snapshot mode it does make a difference as it is intended to run over long periods of time. Following the above justification, add an attribute bit to ask for a high-order AUX allocation. To prevent an unprivileged user from using up the higher orders of the page allocator, require CAP_SYS_ADMIN for this option. Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com> --- include/uapi/linux/perf_event.h | 3 ++- kernel/events/core.c | 3 +++ kernel/events/ring_buffer.c | 3 ++- 3 files changed, 7 insertions(+), 2 deletions(-) diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h index 7198ddd0c6b1..04726b5729c8 100644 --- a/include/uapi/linux/perf_event.h +++ b/include/uapi/linux/perf_event.h @@ -374,7 +374,8 @@ struct perf_event_attr { namespaces : 1, /* include namespaces data */ ksymbol : 1, /* include ksymbol events */ bpf_event : 1, /* include bpf events */ - __reserved_1 : 33; + aux_highorder : 1, /* use high order allocations for AUX data */ + __reserved_1 : 32; union { __u32 wakeup_events; /* wakeup every n events */ diff --git a/kernel/events/core.c b/kernel/events/core.c index 5aeb4c74fb99..ba95398505c5 100644 --- a/kernel/events/core.c +++ b/kernel/events/core.c @@ -10688,6 +10688,9 @@ SYSCALL_DEFINE5(perf_event_open, perf_paranoid_kernel() && !capable(CAP_SYS_ADMIN)) return -EACCES; + if (attr.aux_highorder && !capable(CAP_SYS_ADMIN)) + return -EACCES; + /* * In cgroup mode, the pid argument is used to pass the fd * opened to the cgroup directory in cgroupfs. The cpu argument diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c index 70ae2422cbaf..72b7380deb0a 100644 --- a/kernel/events/ring_buffer.c +++ b/kernel/events/ring_buffer.c @@ -603,7 +603,8 @@ int rb_alloc_aux(struct ring_buffer *rb, struct perf_event *event, if (!has_aux(event)) return -EOPNOTSUPP; - if (event->pmu->capabilities & PERF_PMU_CAP_AUX_NO_SG) { + if (event->pmu->capabilities & PERF_PMU_CAP_AUX_NO_SG || + event->attr.aux_highorder) { /* * We need to start with the max_order that fits in nr_pages, * not the other way around, hence ilog2() and not get_order. -- 2.20.1 ^ permalink raw reply related [flat|nested] 8+ messages in thread
* Re: [PATCH v0 1/2] perf: Add an option to ask for high order allocations for AUX buffers 2019-02-13 11:47 ` [PATCH v0 1/2] perf: Add an option to ask for high order allocations " Alexander Shishkin @ 2019-02-13 13:07 ` Peter Zijlstra 2019-02-13 13:49 ` Alexander Shishkin 2019-02-13 17:47 ` Mel Gorman 0 siblings, 2 replies; 8+ messages in thread From: Peter Zijlstra @ 2019-02-13 13:07 UTC (permalink / raw) To: Alexander Shishkin Cc: Arnaldo Carvalho de Melo, Ingo Molnar, linux-kernel, jolsa, mgorman On Wed, Feb 13, 2019 at 01:47:15PM +0200, Alexander Shishkin wrote: > Currently, the AUX buffer allocator will use high-order allocations > for PMUs that don't support hardware scatter-gather chaining to ensure > large contiguous blocks of pages, and always use an array of single > pages otherwise. > > There is, however, a tangible performance benefit in using larger chunks > of contiguous memory even in the latter case, that comes from not having > to fetch the next page's address at every page boundary. In particular, > a task running under Intel PT on an Atom CPU shows 1.5%-2% less runtime > penalty with a single multi-page output region in snapshot mode (no PMI) > than with multiple single-page output regions, from ~6% down to ~4%. For > the snapshot mode it does make a difference as it is intended to run over > long periods of time. > > Following the above justification, add an attribute bit to ask for a > high-order AUX allocation. To prevent an unprivileged user from using up > the higher orders of the page allocator, require CAP_SYS_ADMIN for this > option. Why do we need a knob for that? Last time I checked unpriv users could fragment the page allocator just fine. What is there to protect? Also, since we return all pages upon buffer free, the page allocator should in fact re-construct the high order stuff. So a buffer alloc + free, using high order pages, should be an effective nop on high order availability. Unlike spraying dentries or whatever works these days around the machine. ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH v0 1/2] perf: Add an option to ask for high order allocations for AUX buffers 2019-02-13 13:07 ` Peter Zijlstra @ 2019-02-13 13:49 ` Alexander Shishkin 2019-02-13 17:47 ` Mel Gorman 1 sibling, 0 replies; 8+ messages in thread From: Alexander Shishkin @ 2019-02-13 13:49 UTC (permalink / raw) To: Peter Zijlstra Cc: Arnaldo Carvalho de Melo, Ingo Molnar, linux-kernel, jolsa, mgorman Peter Zijlstra <peterz@infradead.org> writes: > On Wed, Feb 13, 2019 at 01:47:15PM +0200, Alexander Shishkin wrote: >> Currently, the AUX buffer allocator will use high-order allocations >> for PMUs that don't support hardware scatter-gather chaining to ensure >> large contiguous blocks of pages, and always use an array of single >> pages otherwise. >> >> There is, however, a tangible performance benefit in using larger chunks >> of contiguous memory even in the latter case, that comes from not having >> to fetch the next page's address at every page boundary. In particular, >> a task running under Intel PT on an Atom CPU shows 1.5%-2% less runtime >> penalty with a single multi-page output region in snapshot mode (no PMI) >> than with multiple single-page output regions, from ~6% down to ~4%. For >> the snapshot mode it does make a difference as it is intended to run over >> long periods of time. >> >> Following the above justification, add an attribute bit to ask for a >> high-order AUX allocation. To prevent an unprivileged user from using up >> the higher orders of the page allocator, require CAP_SYS_ADMIN for this >> option. > > Why do we need a knob for that? Last time I checked unpriv users could > fragment the page allocator just fine. What is there to protect? > > Also, since we return all pages upon buffer free, the page allocator > should in fact re-construct the high order stuff. > > So a buffer alloc + free, using high order pages, should be an effective > nop on high order availability. > > Unlike spraying dentries or whatever works these days around the > machine. Good point. My worry is that since it's 'snapshot' mode, the buffer(s) will be around potentially for a long time. Otherwise, I'd also prefer to drop the capability wall. Thanks, -- Alex ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH v0 1/2] perf: Add an option to ask for high order allocations for AUX buffers 2019-02-13 13:07 ` Peter Zijlstra 2019-02-13 13:49 ` Alexander Shishkin @ 2019-02-13 17:47 ` Mel Gorman 2019-02-13 17:54 ` Peter Zijlstra 1 sibling, 1 reply; 8+ messages in thread From: Mel Gorman @ 2019-02-13 17:47 UTC (permalink / raw) To: Peter Zijlstra Cc: Alexander Shishkin, Arnaldo Carvalho de Melo, Ingo Molnar, linux-kernel, jolsa On Wed, Feb 13, 2019 at 02:07:55PM +0100, Peter Zijlstra wrote: > On Wed, Feb 13, 2019 at 01:47:15PM +0200, Alexander Shishkin wrote: > > Currently, the AUX buffer allocator will use high-order allocations > > for PMUs that don't support hardware scatter-gather chaining to ensure > > large contiguous blocks of pages, and always use an array of single > > pages otherwise. > > > > There is, however, a tangible performance benefit in using larger chunks > > of contiguous memory even in the latter case, that comes from not having > > to fetch the next page's address at every page boundary. In particular, > > a task running under Intel PT on an Atom CPU shows 1.5%-2% less runtime > > penalty with a single multi-page output region in snapshot mode (no PMI) > > than with multiple single-page output regions, from ~6% down to ~4%. For > > the snapshot mode it does make a difference as it is intended to run over > > long periods of time. > > > > Following the above justification, add an attribute bit to ask for a > > high-order AUX allocation. To prevent an unprivileged user from using up > > the higher orders of the page allocator, require CAP_SYS_ADMIN for this > > option. > > Why do we need a knob for that? Last time I checked unpriv users could > fragment the page allocator just fine. What is there to protect? > It's "protected" in that we try to avoid long-lived unmovable allocations causing problems but it's internal to the page allocator. The exception is ZONE_MOVABLE and CMA which has some protection or static pools like hugetlbfs with tunables and APIs that can strict access if desired. > Also, since we return all pages upon buffer free, the page allocator > should in fact re-construct the high order stuff. > It does. At worse, other unmovable allocations may be allocated elsewhere in the meantime but that's normal activity. It can be replicated in any number of innocent ways by a normal user. It's even the basis of a test case that forces fragmentation, measures THP allocation success rate and stresses compaction. > So a buffer alloc + free, using high order pages, should be an effective > nop on high order availability. > Shouldn't even be necessary. With the possible exception of hugetlbfs, the userspace is not forced to take special action to keep the page allocator happy. If there is a tangiable performance benefit from using contiguous regions then I would suggest optimistically allocating them with appropriate GFP flags to avoid large latencies at startup time and fall back if necessary. Don't stick it behind capabilities or restrict it to privileged users. Only hugetlbfs provides restricted access and exposes an interface to userspace for applications and even that can be unprivileged. -- Mel Gorman SUSE Labs ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH v0 1/2] perf: Add an option to ask for high order allocations for AUX buffers 2019-02-13 17:47 ` Mel Gorman @ 2019-02-13 17:54 ` Peter Zijlstra 2019-02-13 19:17 ` Mel Gorman 0 siblings, 1 reply; 8+ messages in thread From: Peter Zijlstra @ 2019-02-13 17:54 UTC (permalink / raw) To: Mel Gorman Cc: Alexander Shishkin, Arnaldo Carvalho de Melo, Ingo Molnar, linux-kernel, jolsa On Wed, Feb 13, 2019 at 05:47:56PM +0000, Mel Gorman wrote: > If there is a tangiable performance benefit from using contiguous regions > then I would suggest optimistically allocating them with appropriate > GFP flags to avoid large latencies at startup time and fall back if > necessary. Right; the code does the fallback thing. It successively tries smaller order allocations until 0-order fails. It currently uses: #define PERF_AUX_GFP (GFP_KERNEL | __GFP_ZERO | __GFP_NOWARN | __GFP_NORETRY) Is that what you could consider appropriate? ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH v0 1/2] perf: Add an option to ask for high order allocations for AUX buffers 2019-02-13 17:54 ` Peter Zijlstra @ 2019-02-13 19:17 ` Mel Gorman 0 siblings, 0 replies; 8+ messages in thread From: Mel Gorman @ 2019-02-13 19:17 UTC (permalink / raw) To: Peter Zijlstra Cc: Alexander Shishkin, Arnaldo Carvalho de Melo, Ingo Molnar, linux-kernel, jolsa On Wed, Feb 13, 2019 at 06:54:34PM +0100, Peter Zijlstra wrote: > On Wed, Feb 13, 2019 at 05:47:56PM +0000, Mel Gorman wrote: > > If there is a tangiable performance benefit from using contiguous regions > > then I would suggest optimistically allocating them with appropriate > > GFP flags to avoid large latencies at startup time and fall back if > > necessary. > > Right; the code does the fallback thing. It successively tries smaller > order allocations until 0-order fails. > > It currently uses: > > #define PERF_AUX_GFP (GFP_KERNEL | __GFP_ZERO | __GFP_NOWARN | __GFP_NORETRY) > > Is that what you could consider appropriate? Yeah, spot on. Conceivably the semantics of __GFP_NORETY will change a little at some point in the next year but not in any way I'd consider harmful (depends on how THP and locality discussions go). Even *if* we did something harmful, there will be complaints before it's problematic. -- Mel Gorman SUSE Labs ^ permalink raw reply [flat|nested] 8+ messages in thread
* [PATCH v0 2/2] perf record: Add --aux-highorder 2019-02-13 11:47 [PATCH v0 0/2] perf: Allow forcing high-order allocation for AUX buffers Alexander Shishkin 2019-02-13 11:47 ` [PATCH v0 1/2] perf: Add an option to ask for high order allocations " Alexander Shishkin @ 2019-02-13 11:47 ` Alexander Shishkin 1 sibling, 0 replies; 8+ messages in thread From: Alexander Shishkin @ 2019-02-13 11:47 UTC (permalink / raw) To: Peter Zijlstra, Arnaldo Carvalho de Melo Cc: Ingo Molnar, linux-kernel, jolsa, Alexander Shishkin Allow requesting high-order AUX allocations from the perf record command line. Since this operation requires CAP_SYS_ADMIN, adjust the error message to suggest this as a potential reason of failure. Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com> --- tools/include/uapi/linux/perf_event.h | 3 ++- tools/perf/builtin-record.c | 2 ++ tools/perf/perf.h | 1 + tools/perf/util/evsel.c | 9 +++++++++ 4 files changed, 14 insertions(+), 1 deletion(-) diff --git a/tools/include/uapi/linux/perf_event.h b/tools/include/uapi/linux/perf_event.h index 7198ddd0c6b1..04726b5729c8 100644 --- a/tools/include/uapi/linux/perf_event.h +++ b/tools/include/uapi/linux/perf_event.h @@ -374,7 +374,8 @@ struct perf_event_attr { namespaces : 1, /* include namespaces data */ ksymbol : 1, /* include ksymbol events */ bpf_event : 1, /* include bpf events */ - __reserved_1 : 33; + aux_highorder : 1, /* use high order allocations for AUX data */ + __reserved_1 : 32; union { __u32 wakeup_events; /* wakeup every n events */ diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c index 3fdfbaebd95e..9f8d5eba24a8 100644 --- a/tools/perf/builtin-record.c +++ b/tools/perf/builtin-record.c @@ -1929,6 +1929,8 @@ static struct option __record_options[] = { parse_clockid), OPT_STRING_OPTARG('S', "snapshot", &record.opts.auxtrace_snapshot_opts, "opts", "AUX area tracing Snapshot Mode", ""), + OPT_BOOLEAN(0, "aux-highorder", &record.opts.aux_highorder, + "Use high-order allocation for AUX buffers"), OPT_UINTEGER(0, "proc-map-timeout", &proc_map_timeout, "per thread proc mmap processing timeout in ms"), OPT_BOOLEAN(0, "namespaces", &record.opts.record_namespaces, diff --git a/tools/perf/perf.h b/tools/perf/perf.h index b120e547ddc7..42f1fa5faeed 100644 --- a/tools/perf/perf.h +++ b/tools/perf/perf.h @@ -67,6 +67,7 @@ struct record_opts { bool strict_freq; bool sample_id; bool bpf_event; + bool aux_highorder; unsigned int freq; unsigned int mmap_pages; unsigned int auxtrace_mmap_pages; diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c index 684c893ca6bc..3f4dc8f79d2f 100644 --- a/tools/perf/util/evsel.c +++ b/tools/perf/util/evsel.c @@ -1098,6 +1098,9 @@ void perf_evsel__config(struct perf_evsel *evsel, struct record_opts *opts, if (evsel->own_cpus || evsel->unit) evsel->attr.read_format |= PERF_FORMAT_ID; + if (opts->aux_highorder) + attr->aux_highorder = 1; + /* * Apply event specific term settings, * it overloads any global configuration. @@ -1657,6 +1660,7 @@ int perf_event_attr__fprintf(FILE *fp, struct perf_event_attr *attr, PRINT_ATTRf(namespaces, p_unsigned); PRINT_ATTRf(ksymbol, p_unsigned); PRINT_ATTRf(bpf_event, p_unsigned); + PRINT_ATTRf(aux_highorder, p_unsigned); PRINT_ATTRn("{ wakeup_events, wakeup_watermark }", wakeup_events, p_unsigned); PRINT_ATTRf(bp_type, p_unsigned); @@ -2891,6 +2895,11 @@ int perf_evsel__open_strerror(struct perf_evsel *evsel, struct target *target, "No permission to enable %s event.\n\n", perf_evsel__name(evsel)); + if (evsel->attr.aux_highorder) + printed += scnprintf(msg + printed, size - printed, + "Using --aux-highorder may not be possible with your privilege level,\n" + "which requires CAP_SYS_ADMIN capability. Consider using sudo.\n\n"); + return scnprintf(msg + printed, size - printed, "You may not have permission to collect %sstats.\n\n" "Consider tweaking /proc/sys/kernel/perf_event_paranoid,\n" -- 2.20.1 ^ permalink raw reply related [flat|nested] 8+ messages in thread
end of thread, other threads:[~2019-02-13 19:17 UTC | newest] Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2019-02-13 11:47 [PATCH v0 0/2] perf: Allow forcing high-order allocation for AUX buffers Alexander Shishkin 2019-02-13 11:47 ` [PATCH v0 1/2] perf: Add an option to ask for high order allocations " Alexander Shishkin 2019-02-13 13:07 ` Peter Zijlstra 2019-02-13 13:49 ` Alexander Shishkin 2019-02-13 17:47 ` Mel Gorman 2019-02-13 17:54 ` Peter Zijlstra 2019-02-13 19:17 ` Mel Gorman 2019-02-13 11:47 ` [PATCH v0 2/2] perf record: Add --aux-highorder Alexander Shishkin
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.