* Re: [REGRESSION] perf/core: PMU interrupts dropped if we entered the kernel in the "skid" region [not found] <CAP045Ap8cMx6mzSgcQ3n3bnh_8GJuCp7_KZe_5ZTCR_K6cPTLw@mail.gmail.com> @ 2017-06-28 1:01 ` Kyle Huey 2017-06-28 2:09 ` Jin, Yao 0 siblings, 1 reply; 37+ messages in thread From: Kyle Huey @ 2017-06-28 1:01 UTC (permalink / raw) To: Jin Yao, Ingo Molnar Cc: Peter Zijlstra (Intel), stable, Alexander Shishkin, Arnaldo Carvalho de Melo, Jiri Olsa, Linus Torvalds, Namhyung Kim, Stephane Eranian, Thomas Gleixner, Vince Weaver, acme, jolsa, kan.liang, Mark Rutland, Will Deacon, gregkh, Robert O'Callahan, open list Sent again with LKML CCd, sorry for the noise. - Kyle On Tue, Jun 27, 2017 at 5:38 PM, Kyle Huey <me@kylehuey.com> wrote: > cc1582c231ea introduced a regression in v4.12.0-rc5, and appears to be > a candidate for backporting to stable branches. > > rr, a userspace record and replay debugger[0], uses the PMU interrupt > to stop a program during replay to inject asynchronous events such as > signals. We are counting retired conditional branches in userspace > only. This changeset causes the kernel to drop interrupts on the > floor if, during the PMU interrupt's "skid" region, the CPU enters > kernel mode for whatever reason. When replaying traces of complex > programs such as Firefox, we intermittently fail to deliver > asynchronous events on time, leading the replay to diverge from the > recorded state. > > It seems like this change should, at a bare minimum, be limited to > counters that actually perform sampling of register state when the > interrupt fires. In our case, with the retired conditional branches > counter restricted to counting userspace events only, it makes no > difference that the PMU interrupt happened to be delivered in the > kernel. 
> > As this makes rr unusable on complex applications and cannot be > efficiently worked around, we would appreciate this being addressed > before 4.12 is finalized, and the regression not being introduced to > stable branches. > > Thanks, > > - Kyle > > [0] http://rr-project.org/ ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [REGRESSION] perf/core: PMU interrupts dropped if we entered the kernel in the "skid" region 2017-06-28 1:01 ` [REGRESSION] perf/core: PMU interrupts dropped if we entered the kernel in the "skid" region Kyle Huey @ 2017-06-28 2:09 ` Jin, Yao 2017-06-28 4:51 ` Kyle Huey 0 siblings, 1 reply; 37+ messages in thread From: Jin, Yao @ 2017-06-28 2:09 UTC (permalink / raw) To: Kyle Huey, Ingo Molnar Cc: Peter Zijlstra (Intel), stable, Alexander Shishkin, Arnaldo Carvalho de Melo, Jiri Olsa, Linus Torvalds, Namhyung Kim, Stephane Eranian, Thomas Gleixner, Vince Weaver, acme, jolsa, kan.liang, Mark Rutland, Will Deacon, gregkh, Robert O'Callahan, open list Hi, In theory, the PMI interrupts in skid region should be dropped, right? For a userspace debugger, is it the only choice that relies on the *skid* PMI interrupt? Thanks Jin Yao On 6/28/2017 9:01 AM, Kyle Huey wrote: > Sent again with LKML CCd, sorry for the noise. > > - Kyle > > On Tue, Jun 27, 2017 at 5:38 PM, Kyle Huey <me@kylehuey.com> wrote: >> cc1582c231ea introduced a regression in v4.12.0-rc5, and appears to be >> a candidate for backporting to stable branches. >> >> rr, a userspace record and replay debugger[0], uses the PMU interrupt >> to stop a program during replay to inject asynchronous events such as >> signals. We are counting retired conditional branches in userspace >> only. This changeset causes the kernel to drop interrupts on the >> floor if, during the PMU interrupt's "skid" region, the CPU enters >> kernel mode for whatever reason. When replaying traces of complex >> programs such as Firefox, we intermittently fail to deliver >> asynchronous events on time, leading the replay to diverge from the >> recorded state. >> >> It seems like this change should, at a bare minimum, be limited to >> counters that actually perform sampling of register state when the >> interrupt fires. 
In our case, with the retired conditional branches >> counter restricted to counting userspace events only, it makes no >> difference that the PMU interrupt happened to be delivered in the >> kernel. >> >> As this makes rr unusable on complex applications and cannot be >> efficiently worked around, we would appreciate this being addressed >> before 4.12 is finalized, and the regression not being introduced to >> stable branches. >> >> Thanks, >> >> - Kyle >> >> [0] http://rr-project.org/ ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [REGRESSION] perf/core: PMU interrupts dropped if we entered the kernel in the "skid" region 2017-06-28 2:09 ` Jin, Yao @ 2017-06-28 4:51 ` Kyle Huey 2017-06-28 5:35 ` Jin, Yao 2017-06-28 10:12 ` Mark Rutland 0 siblings, 2 replies; 37+ messages in thread From: Kyle Huey @ 2017-06-28 4:51 UTC (permalink / raw) To: Jin, Yao Cc: Ingo Molnar, Peter Zijlstra (Intel), stable, Alexander Shishkin, Arnaldo Carvalho de Melo, Jiri Olsa, Linus Torvalds, Namhyung Kim, Stephane Eranian, Thomas Gleixner, Vince Weaver, acme, jolsa, kan.liang, Mark Rutland, Will Deacon, gregkh, Robert O'Callahan, open list On Tue, Jun 27, 2017 at 7:09 PM, Jin, Yao <yao.jin@linux.intel.com> wrote: > Hi, > > In theory, the PMI interrupts in skid region should be dropped, right? No, why would they be dropped? My understanding of the situation is as follows: There is some time, call it t_0, where the hardware counter overflows. The PMU triggers an interrupt, but this is not instantaneous. Call the time when the interrupt is actually delivered t_1. Then t_1 - t_0 is the "skid". Note that if the counter is `exclude_kernel`, then at t_0 the CPU *must* be running a userspace program. But by t_1, the CPU may be doing something else. Your patch changed things so that if at t_1 the CPU is in the kernel, then the interrupt is discarded. But rr has programmed the counter to deliver a signal on overflow (via F_SETSIG on the fd returned by perf_event_open). This change results in the signal never being delivered, because the interrupt was ignored. (More accurately, the signal is delivered the *next* time the counter overflows, which is far past where we wanted to inject our asynchronous event into our tracee. It seems to me that it might be reasonable to ignore the interrupt if the purpose of the interrupt is to trigger sampling of the CPUs register state. But if the interrupt will trigger some other operation, such as a signal on an fd, then there's no reason to drop it. 
> For a userspace debugger, is it the only choice that relies on the *skid* > PMI interrupt? I don't understand this question, but hopefully the above clarified things. - Kyle > Thanks > Jin Yao > > > On 6/28/2017 9:01 AM, Kyle Huey wrote: >> >> Sent again with LKML CCd, sorry for the noise. >> >> - Kyle >> >> On Tue, Jun 27, 2017 at 5:38 PM, Kyle Huey <me@kylehuey.com> wrote: >>> >>> cc1582c231ea introduced a regression in v4.12.0-rc5, and appears to be >>> a candidate for backporting to stable branches. >>> >>> rr, a userspace record and replay debugger[0], uses the PMU interrupt >>> to stop a program during replay to inject asynchronous events such as >>> signals. We are counting retired conditional branches in userspace >>> only. This changeset causes the kernel to drop interrupts on the >>> floor if, during the PMU interrupt's "skid" region, the CPU enters >>> kernel mode for whatever reason. When replaying traces of complex >>> programs such as Firefox, we intermittently fail to deliver >>> asynchronous events on time, leading the replay to diverge from the >>> recorded state. >>> >>> It seems like this change should, at a bare minimum, be limited to >>> counters that actually perform sampling of register state when the >>> interrupt fires. In our case, with the retired conditional branches >>> counter restricted to counting userspace events only, it makes no >>> difference that the PMU interrupt happened to be delivered in the >>> kernel. >>> >>> As this makes rr unusable on complex applications and cannot be >>> efficiently worked around, we would appreciate this being addressed >>> before 4.12 is finalized, and the regression not being introduced to >>> stable branches. >>> >>> Thanks, >>> >>> - Kyle >>> >>> [0] http://rr-project.org/ > > ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [REGRESSION] perf/core: PMU interrupts dropped if we entered the kernel in the "skid" region 2017-06-28 4:51 ` Kyle Huey @ 2017-06-28 5:35 ` Jin, Yao 2017-06-28 7:30 ` Kyle Huey 2017-06-28 10:12 ` Mark Rutland 1 sibling, 1 reply; 37+ messages in thread From: Jin, Yao @ 2017-06-28 5:35 UTC (permalink / raw) To: Kyle Huey Cc: Ingo Molnar, Peter Zijlstra (Intel), stable, Alexander Shishkin, Arnaldo Carvalho de Melo, Jiri Olsa, Linus Torvalds, Namhyung Kim, Stephane Eranian, Thomas Gleixner, Vince Weaver, acme, jolsa, kan.liang, Mark Rutland, Will Deacon, gregkh, Robert O'Callahan, open list, yao.jin Hi Kyle, I understand your requirement. Sorry I don't know the detail of rr debugger, but I guess if it just uses counter overflow to deliver a signal, could it set the counter without "exclude_kernel"? Another consideration is, I'm not sure if the kernel can accurately realize that if the interrupt is to trigger sampling or just deliver a signal. Userspace may know that but kernel may not. Anyway let's listen to more comments from community. Thanks Jin Yao On 6/28/2017 12:51 PM, Kyle Huey wrote: > On Tue, Jun 27, 2017 at 7:09 PM, Jin, Yao <yao.jin@linux.intel.com> wrote: >> Hi, >> >> In theory, the PMI interrupts in skid region should be dropped, right? > No, why would they be dropped? > > My understanding of the situation is as follows: > > There is some time, call it t_0, where the hardware counter overflows. > The PMU triggers an interrupt, but this is not instantaneous. Call > the time when the interrupt is actually delivered t_1. Then t_1 - t_0 > is the "skid". > > Note that if the counter is `exclude_kernel`, then at t_0 the CPU > *must* be running a userspace program. But by t_1, the CPU may be > doing something else. Your patch changed things so that if at t_1 the > CPU is in the kernel, then the interrupt is discarded. But rr has > programmed the counter to deliver a signal on overflow (via F_SETSIG > on the fd returned by perf_event_open). 
This change results in the > signal never being delivered, because the interrupt was ignored. > (More accurately, the signal is delivered the *next* time the counter > overflows, which is far past where we wanted to inject our > asynchronous event into our tracee. > > It seems to me that it might be reasonable to ignore the interrupt if > the purpose of the interrupt is to trigger sampling of the CPUs > register state. But if the interrupt will trigger some other > operation, such as a signal on an fd, then there's no reason to drop > it. > >> For a userspace debugger, is it the only choice that relies on the *skid* >> PMI interrupt? > I don't understand this question, but hopefully the above clarified things. > > - Kyle > >> Thanks >> Jin Yao >> >> >> On 6/28/2017 9:01 AM, Kyle Huey wrote: >>> Sent again with LKML CCd, sorry for the noise. >>> >>> - Kyle >>> >>> On Tue, Jun 27, 2017 at 5:38 PM, Kyle Huey <me@kylehuey.com> wrote: >>>> cc1582c231ea introduced a regression in v4.12.0-rc5, and appears to be >>>> a candidate for backporting to stable branches. >>>> >>>> rr, a userspace record and replay debugger[0], uses the PMU interrupt >>>> to stop a program during replay to inject asynchronous events such as >>>> signals. We are counting retired conditional branches in userspace >>>> only. This changeset causes the kernel to drop interrupts on the >>>> floor if, during the PMU interrupt's "skid" region, the CPU enters >>>> kernel mode for whatever reason. When replaying traces of complex >>>> programs such as Firefox, we intermittently fail to deliver >>>> asynchronous events on time, leading the replay to diverge from the >>>> recorded state. >>>> >>>> It seems like this change should, at a bare minimum, be limited to >>>> counters that actually perform sampling of register state when the >>>> interrupt fires. 
In our case, with the retired conditional branches >>>> counter restricted to counting userspace events only, it makes no >>>> difference that the PMU interrupt happened to be delivered in the >>>> kernel. >>>> >>>> As this makes rr unusable on complex applications and cannot be >>>> efficiently worked around, we would appreciate this being addressed >>>> before 4.12 is finalized, and the regression not being introduced to >>>> stable branches. >>>> >>>> Thanks, >>>> >>>> - Kyle >>>> >>>> [0] http://rr-project.org/ >> ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [REGRESSION] perf/core: PMU interrupts dropped if we entered the kernel in the "skid" region 2017-06-28 5:35 ` Jin, Yao @ 2017-06-28 7:30 ` Kyle Huey 0 siblings, 0 replies; 37+ messages in thread From: Kyle Huey @ 2017-06-28 7:30 UTC (permalink / raw) To: Jin, Yao Cc: Ingo Molnar, Peter Zijlstra (Intel), stable, Alexander Shishkin, Arnaldo Carvalho de Melo, Jiri Olsa, Linus Torvalds, Namhyung Kim, Stephane Eranian, Thomas Gleixner, Vince Weaver, acme, jolsa, kan.liang, Mark Rutland, Will Deacon, gregkh, Robert O'Callahan, open list, yao.jin On Tue, Jun 27, 2017 at 10:35 PM, Jin, Yao <yao.jin@linux.intel.com> wrote: > Hi Kyle, > > I understand your requirement. Sorry I don't know the detail of rr debugger, > but I guess if it just uses counter overflow to deliver a signal, could it > set the counter without "exclude_kernel"? Unfortunately we cannot. We depend on the counter value being exactly the same between recording and replay, and dropping `exclude_kernel` would introduce non-determinism. > Another consideration is, I'm not sure if the kernel can accurately realize > that if the interrupt is to trigger sampling or just deliver a signal. > Userspace may know that but kernel may not. After looking at this code a bit more, I think that changing the `is_sample_allowed` check from an early return to a guard around the invocation of `overflow_handler` would fix this. I believe, but have not tested, that `perf_event_fasync` is what must run to deliver our signal, while the `overflow_handler` is what copies the kernel RIP/etc into the output buffer that you want to skip. - Kyle > Anyway let's listen to more comments from community. > > Thanks > > Jin Yao > > > > On 6/28/2017 12:51 PM, Kyle Huey wrote: >> >> On Tue, Jun 27, 2017 at 7:09 PM, Jin, Yao <yao.jin@linux.intel.com> wrote: >>> >>> Hi, >>> >>> In theory, the PMI interrupts in skid region should be dropped, right? >> >> No, why would they be dropped? 
>> >> My understanding of the situation is as follows: >> >> There is some time, call it t_0, where the hardware counter overflows. >> The PMU triggers an interrupt, but this is not instantaneous. Call >> the time when the interrupt is actually delivered t_1. Then t_1 - t_0 >> is the "skid". >> >> Note that if the counter is `exclude_kernel`, then at t_0 the CPU >> *must* be running a userspace program. But by t_1, the CPU may be >> doing something else. Your patch changed things so that if at t_1 the >> CPU is in the kernel, then the interrupt is discarded. But rr has >> programmed the counter to deliver a signal on overflow (via F_SETSIG >> on the fd returned by perf_event_open). This change results in the >> signal never being delivered, because the interrupt was ignored. >> (More accurately, the signal is delivered the *next* time the counter >> overflows, which is far past where we wanted to inject our >> asynchronous event into our tracee. >> >> It seems to me that it might be reasonable to ignore the interrupt if >> the purpose of the interrupt is to trigger sampling of the CPUs >> register state. But if the interrupt will trigger some other >> operation, such as a signal on an fd, then there's no reason to drop >> it. >> >>> For a userspace debugger, is it the only choice that relies on the *skid* >>> PMI interrupt? >> >> I don't understand this question, but hopefully the above clarified >> things. >> >> - Kyle >> >>> Thanks >>> Jin Yao >>> >>> >>> On 6/28/2017 9:01 AM, Kyle Huey wrote: >>>> >>>> Sent again with LKML CCd, sorry for the noise. >>>> >>>> - Kyle >>>> >>>> On Tue, Jun 27, 2017 at 5:38 PM, Kyle Huey <me@kylehuey.com> wrote: >>>>> >>>>> cc1582c231ea introduced a regression in v4.12.0-rc5, and appears to be >>>>> a candidate for backporting to stable branches. >>>>> >>>>> rr, a userspace record and replay debugger[0], uses the PMU interrupt >>>>> to stop a program during replay to inject asynchronous events such as >>>>> signals. 
We are counting retired conditional branches in userspace >>>>> only. This changeset causes the kernel to drop interrupts on the >>>>> floor if, during the PMU interrupt's "skid" region, the CPU enters >>>>> kernel mode for whatever reason. When replaying traces of complex >>>>> programs such as Firefox, we intermittently fail to deliver >>>>> asynchronous events on time, leading the replay to diverge from the >>>>> recorded state. >>>>> >>>>> It seems like this change should, at a bare minimum, be limited to >>>>> counters that actually perform sampling of register state when the >>>>> interrupt fires. In our case, with the retired conditional branches >>>>> counter restricted to counting userspace events only, it makes no >>>>> difference that the PMU interrupt happened to be delivered in the >>>>> kernel. >>>>> >>>>> As this makes rr unusable on complex applications and cannot be >>>>> efficiently worked around, we would appreciate this being addressed >>>>> before 4.12 is finalized, and the regression not being introduced to >>>>> stable branches. >>>>> >>>>> Thanks, >>>>> >>>>> - Kyle >>>>> >>>>> [0] http://rr-project.org/ >>> >>> > ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [REGRESSION] perf/core: PMU interrupts dropped if we entered the kernel in the "skid" region 2017-06-28 4:51 ` Kyle Huey 2017-06-28 5:35 ` Jin, Yao @ 2017-06-28 10:12 ` Mark Rutland 2017-06-28 10:56 ` [PATCH] perf/core: generate overflow signal when samples are dropped (WAS: Re: [REGRESSION] perf/core: PMU interrupts dropped if we entered the kernel in the "skid" region) Mark Rutland 2017-06-28 16:46 ` [REGRESSION] perf/core: PMU interrupts dropped if we entered the kernel in the "skid" region Kyle Huey 1 sibling, 2 replies; 37+ messages in thread From: Mark Rutland @ 2017-06-28 10:12 UTC (permalink / raw) To: Kyle Huey Cc: Jin, Yao, Ingo Molnar, Peter Zijlstra (Intel), stable, Alexander Shishkin, Arnaldo Carvalho de Melo, Jiri Olsa, Linus Torvalds, Namhyung Kim, Stephane Eranian, Thomas Gleixner, Vince Weaver, acme, jolsa, kan.liang, Will Deacon, gregkh, Robert O'Callahan, open list On Tue, Jun 27, 2017 at 09:51:00PM -0700, Kyle Huey wrote: > On Tue, Jun 27, 2017 at 7:09 PM, Jin, Yao <yao.jin@linux.intel.com> wrote: > > Hi, > > > > In theory, the PMI interrupts in skid region should be dropped, right? > > No, why would they be dropped? > > My understanding of the situation is as follows: > > There is some time, call it t_0, where the hardware counter overflows. > The PMU triggers an interrupt, but this is not instantaneous. Call > the time when the interrupt is actually delivered t_1. Then t_1 - t_0 > is the "skid". > > Note that if the counter is `exclude_kernel`, then at t_0 the CPU > *must* be running a userspace program. But by t_1, the CPU may be > doing something else. Your patch changed things so that if at t_1 the > CPU is in the kernel, then the interrupt is discarded. But rr has > programmed the counter to deliver a signal on overflow (via F_SETSIG > on the fd returned by perf_event_open). This change results in the > signal never being delivered, because the interrupt was ignored. 
> (More accurately, the signal is delivered the *next* time the counter > overflows, which is far past where we wanted to inject our > asynchronous event into our tracee. Yes, this is a bug. As we're trying to avoid sampling state, I think we can move the check into perf_prepare_sample() or __perf_event_output(), where that state is actually sampled. I'll take a look at that momentarily. Just to clarify, you don't care about the sample state at all? i.e. you don't need the user program counter? Is that signal delivered to the tracee, or to a different process that traces it? If the latter, what ensures that the task is stopped sufficiently quickly? > It seems to me that it might be reasonable to ignore the interrupt if > the purpose of the interrupt is to trigger sampling of the CPUs > register state. But if the interrupt will trigger some other > operation, such as a signal on an fd, then there's no reason to drop > it. Agreed. I'll try to have a patch for this soon. I just need to figure out exactly where that overflow signal is generated by the perf core. Thanks, Mark. ^ permalink raw reply [flat|nested] 37+ messages in thread
* [PATCH] perf/core: generate overflow signal when samples are dropped (WAS: Re: [REGRESSION] perf/core: PMU interrupts dropped if we entered the kernel in the "skid" region) 2017-06-28 10:12 ` Mark Rutland @ 2017-06-28 10:56 ` Mark Rutland 2017-06-28 12:40 ` Vince Weaver ` (2 more replies) 2017-06-28 16:46 ` [REGRESSION] perf/core: PMU interrupts dropped if we entered the kernel in the "skid" region Kyle Huey 1 sibling, 3 replies; 37+ messages in thread From: Mark Rutland @ 2017-06-28 10:56 UTC (permalink / raw) To: Kyle Huey Cc: Jin, Yao, Ingo Molnar, Peter Zijlstra (Intel), stable, Alexander Shishkin, Arnaldo Carvalho de Melo, Jiri Olsa, Linus Torvalds, Namhyung Kim, Stephane Eranian, Thomas Gleixner, Vince Weaver, acme, jolsa, kan.liang, Will Deacon, gregkh, Robert O'Callahan, open list On Wed, Jun 28, 2017 at 11:12:48AM +0100, Mark Rutland wrote: > On Tue, Jun 27, 2017 at 09:51:00PM -0700, Kyle Huey wrote: > As we're trying to avoid smapling state, I think we can move the check > into perf_prepare_sample() or __perf_event_output(), where that state is > actually sampled. I'll take a look at that momentarily. > > Just to clarify, you don't care about the sample state at all? i.e. you > don't need the user program counter? > > Is that signal delivered to the tracee, or to a different process that > traces it? If the latter, what ensures that the task is stopped > sufficiently quickly? > > > It seems to me that it might be reasonable to ignore the interrupt if > > the purpose of the interrupt is to trigger sampling of the CPUs > > register state. But if the interrupt will trigger some other > > operation, such as a signal on an fd, then there's no reason to drop > > it. > > Agreed. I'll try to have a patch for this soon. > > I just need to figure out exactly where that overflow signal is > generated by the perf core. I've figured that out now. That's handled by perf_pending_event(), which is the irq_work we schedule in __perf_event_overflow(). 
Does the below patch work for you?

---->8----
From bb1f99dace508ce34ab0818f91d59e73450fa142 Mon Sep 17 00:00:00 2001
From: Mark Rutland <mark.rutland@arm.com>
Date: Wed, 28 Jun 2017 11:39:25 +0100
Subject: [PATCH] perf/core: generate overflow signal when samples are dropped

We recently tried to kill an information leak where users could receive
kernel samples due to skid on the PMU interrupt. To block this, we
bailed out early in perf_event_overflow(), as we do for non-sampling
events.

This broke rr, which uses sampling events to receive a signal on
overflow (but does not care about the contents of the sample). These
signals are critical to the correct operation of rr.

Instead of bailing out early in perf_event_overflow, we can bail prior
to performing the actual sampling in __perf_event_output(). This avoids
the information leak, but preserves the generation of the signal.

Since we don't place any sample data into the ring buffer, the signal is
arguably spurious. However, a userspace ringbuffer consumer can already
consume data prior to taking the associated signals, and therefore must
handle spurious signals to operate correctly. Thus, this signal
shouldn't be harmful.

Fixes: cc1582c231ea041f ("perf/core: Drop kernel samples even though :u is specified")
Reported-by: Kyle Huey <me@kylehuey.com>
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jin Yao <yao.jin@linux.intel.com>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/events/core.c | 46 +++++++++++++++++++++++++---------------------
 1 file changed, 25 insertions(+), 21 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 6c4e523..6b263f3 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -6090,6 +6090,21 @@ void perf_prepare_sample(struct perf_event_header *header,
 	}
 }
 
+static bool sample_is_allowed(struct perf_event *event, struct pt_regs *regs)
+{
+	/*
+	 * Due to interrupt latency (AKA "skid"), we may enter the
+	 * kernel before taking an overflow, even if the PMU is only
+	 * counting user events.
+	 * To avoid leaking information to userspace, we must always
+	 * reject kernel samples when exclude_kernel is set.
+	 */
+	if (event->attr.exclude_kernel && !user_mode(regs))
+		return false;
+
+	return true;
+}
+
 static void __always_inline
 __perf_event_output(struct perf_event *event,
 		    struct perf_sample_data *data,
@@ -6101,6 +6116,12 @@ void perf_prepare_sample(struct perf_event_header *header,
 	struct perf_output_handle handle;
 	struct perf_event_header header;
 
+	/*
+	 * For security, drop the skid kernel samples if necessary.
+	 */
+	if (!sample_is_allowed(event, regs))
+		return;
+
 	/* protect the callchain buffers */
 	rcu_read_lock();
 
@@ -7316,21 +7337,6 @@ int perf_event_account_interrupt(struct perf_event *event)
 	return __perf_event_account_interrupt(event, 1);
 }
 
-static bool sample_is_allowed(struct perf_event *event, struct pt_regs *regs)
-{
-	/*
-	 * Due to interrupt latency (AKA "skid"), we may enter the
-	 * kernel before taking an overflow, even if the PMU is only
-	 * counting user events.
-	 * To avoid leaking information to userspace, we must always
-	 * reject kernel samples when exclude_kernel is set.
-	 */
-	if (event->attr.exclude_kernel && !user_mode(regs))
-		return false;
-
-	return true;
-}
-
 /*
  * Generic event overflow handling, sampling.
  */
@@ -7352,12 +7358,6 @@ static int __perf_event_overflow(struct perf_event *event,
 	ret = __perf_event_account_interrupt(event, throttle);
 
 	/*
-	 * For security, drop the skid kernel samples if necessary.
-	 */
-	if (!sample_is_allowed(event, regs))
-		return ret;
-
-	/*
 	 * XXX event_limit might not quite work as expected on inherited
 	 * events
 	 */
@@ -7372,6 +7372,10 @@ static int __perf_event_overflow(struct perf_event *event,
 
 	READ_ONCE(event->overflow_handler)(event, data, regs);
 
+	/*
+	 * We must generate a wakeup regardless of whether we actually
+	 * generated a sample. This is relied upon by rr.
+	 */
 	if (*perf_event_fasync(event) && event->pending_kill) {
 		event->pending_wakeup = 1;
 		irq_work_queue(&event->pending);
-- 
1.9.1

^ permalink raw reply related [flat|nested] 37+ messages in thread
* Re: [PATCH] perf/core: generate overflow signal when samples are dropped (WAS: Re: [REGRESSION] perf/core: PMU interrupts dropped if we entered the kernel in the "skid" region) 2017-06-28 10:56 ` [PATCH] perf/core: generate overflow signal when samples are dropped (WAS: Re: [REGRESSION] perf/core: PMU interrupts dropped if we entered the kernel in the "skid" region) Mark Rutland @ 2017-06-28 12:40 ` Vince Weaver 2017-06-28 13:07 ` Mark Rutland 2017-06-28 16:48 ` Kyle Huey 2017-07-11 9:03 ` [tip:perf/urgent] Revert "perf/core: Drop kernel samples even though :u is specified" tip-bot for Ingo Molnar 2 siblings, 1 reply; 37+ messages in thread From: Vince Weaver @ 2017-06-28 12:40 UTC (permalink / raw) To: Mark Rutland Cc: Kyle Huey, Jin, Yao, Ingo Molnar, Peter Zijlstra (Intel), stable, Alexander Shishkin, Arnaldo Carvalho de Melo, Jiri Olsa, Linus Torvalds, Namhyung Kim, Stephane Eranian, Thomas Gleixner, Vince Weaver, acme, jolsa, kan.liang, Will Deacon, gregkh, Robert O'Callahan, open list On Wed, 28 Jun 2017, Mark Rutland wrote: > On Wed, Jun 28, 2017 at 11:12:48AM +0100, Mark Rutland wrote: > Instead of bailing out early in perf_event_overflow, we can bail prior > to performing the actual sampling in __perf_event_output(). This avoids > the information leak, but preserves the generation of the signal. > > Since we don't place any sample data into the ring buffer, the signal is > arguably spurious. However, a userspace ringbuffer consumer can already > consume data prior to taking the associated signals, and therefore must > handle spurious signals to operate correctly. Thus, this signal > shouldn't be harmful. this could still break some of my perf_event validation tests. Ones that set up a sampling event for every 1M instructions, run for 100M instructions, and expect there to be 100 samples received. 
If we're so worried about info leakage, can't we just zero-out the problem address (or randomize the kernel address) rather than just pretending the interrupt didn't happen? Vince ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH] perf/core: generate overflow signal when samples are dropped (WAS: Re: [REGRESSION] perf/core: PMU interrupts dropped if we entered the kernel in the "skid" region) 2017-06-28 12:40 ` Vince Weaver @ 2017-06-28 13:07 ` Mark Rutland 2017-06-29 8:13 ` Alexey Budankov 0 siblings, 1 reply; 37+ messages in thread From: Mark Rutland @ 2017-06-28 13:07 UTC (permalink / raw) To: Vince Weaver Cc: Kyle Huey, Jin, Yao, Ingo Molnar, Peter Zijlstra (Intel), stable, Alexander Shishkin, Arnaldo Carvalho de Melo, Jiri Olsa, Linus Torvalds, Namhyung Kim, Stephane Eranian, Thomas Gleixner, acme, jolsa, kan.liang, Will Deacon, gregkh, Robert O'Callahan, open list On Wed, Jun 28, 2017 at 08:40:30AM -0400, Vince Weaver wrote: > On Wed, 28 Jun 2017, Mark Rutland wrote: > > > On Wed, Jun 28, 2017 at 11:12:48AM +0100, Mark Rutland wrote: > > > Instead of bailing out early in perf_event_overflow, we can bail prior > > to performing the actual sampling in __perf_event_output(). This avoids > > the information leak, but preserves the generation of the signal. > > > > Since we don't place any sample data into the ring buffer, the signal is > > arguably spurious. However, a userspace ringbuffer consumer can already > > consume data prior to taking the associated signals, and therefore must > > handle spurious signals to operate correctly. Thus, this signal > > shouldn't be harmful. > > this could still break some of my perf_event validation tests. > > Ones that set up a sampling event for every 1M instructions, run for 100M > instructions, and expect there to be 100 samples received. Is that test reliable today? I'd expect that at least on ARM it's not, given that events can be counted imprecisely, and mode filters can be applied imprecisely. So you might get fewer (or more) samples. I'd imagine similar is true on other architectures. 
> If we're so worried about info leakage, can't we just zero-out the problem > address (or randomize the kernel address) rather than just pretending the > interrupt didn't happen? Making up zeroed or randomized data is going to confuse users. I can't imagine that real users are going to want bogus samples that they have to identify (somehow) in order to skip when processing the data. I can see merit in signalling "lost" samples to userspace, so long as they're easily distinguished from real samples. One option is to fake up a sample using the user regs regardless, but that's both fragile and surprising in other cases. Thanks, Mark. ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH] perf/core: generate overflow signal when samples are dropped (WAS: Re: [REGRESSION] perf/core: PMU interrupts dropped if we entered the kernel in the "skid" region) 2017-06-28 13:07 ` Mark Rutland @ 2017-06-29 8:13 ` Alexey Budankov 2017-06-29 8:25 ` Alexey Budankov 0 siblings, 1 reply; 37+ messages in thread From: Alexey Budankov @ 2017-06-29 8:13 UTC (permalink / raw) To: Mark Rutland, Vince Weaver Cc: Kyle Huey, Jin, Yao, Ingo Molnar, Peter Zijlstra (Intel), stable, Alexander Shishkin, Arnaldo Carvalho de Melo, Jiri Olsa, Linus Torvalds, Namhyung Kim, Stephane Eranian, Thomas Gleixner, acme, jolsa, kan.liang, Will Deacon, gregkh, Robert O'Callahan, open list Hi Folks, On 28.06.2017 16:07, Mark Rutland wrote: > On Wed, Jun 28, 2017 at 08:40:30AM -0400, Vince Weaver wrote: >> On Wed, 28 Jun 2017, Mark Rutland wrote: >> >>> On Wed, Jun 28, 2017 at 11:12:48AM +0100, Mark Rutland wrote: >> >>> Instead of bailing out early in perf_event_overflow, we can bail prior >>> to performing the actual sampling in __perf_event_output(). This avoids >>> the information leak, but preserves the generation of the signal. >>> >>> Since we don't place any sample data into the ring buffer, the signal is >>> arguably spurious. However, a userspace ringbuffer consumer can already >>> consume data prior to taking the associated signals, and therefore must >>> handle spurious signals to operate correctly. Thus, this signal >>> shouldn't be harmful. >> >> this could still break some of my perf_event validation tests. >> >> Ones that set up a sampling event for every 1M instructions, run for 100M >> instructions, and expect there to be 100 samples received. > > Is that test reliable today? > > I'd expect that at least on ARM it's not, given that events can be > counted imprecisely, and mode filters can be applied imprecisely. So you > might get fewer (or more) samples. I'd imagine similar is true on other > archtiectures. 
> > If sampling took long enough, the existing ratelimiting could come into > effect, too. > > Surely that already has some error margin? FYI. >From my recent experience and observation (on Intel Xeon Phi) wakeup_events_overflow and overflow_poll tests may fail if /proc/sys/kernel/perf_event_max_sample_rate is low enough: # cat /proc/sys/kernel/perf_event_max_sample_rate 6000 # abudanko/perf_event_tests/tests/overflow/wakeup_events_overflow This tests wakeup event overflows. Testing with wakeup_events=1. Counts, using mmap buffer 0x7f707b0dc000 POLL_IN : 10 POLL_OUT: 0 POLL_MSG: 0 POLL_ERR: 0 POLL_PRI: 0 POLL_HUP: 0 UNKNOWN : 0 Testing wakeup events overflow... PASSED # abudanko/perf_event_tests/tests/overflow/overflow_poll This tests using poll() to catch overflow. Monitoring pid 131412 status 1407 Child has stopped due to signal 5 (Trace/breakpoint trap) Continuing child Returned HUP! Counts, using mmap buffer 0x7fb3bfe04000 POLL_IN : 10 POLL_OUT: 0 POLL_MSG: 0 POLL_ERR: 0 POLL_PRI: 0 POLL_HUP: 1 UNKNOWN : 0 Testing catching overflow with poll()... PASSED # echo 1000 > /proc/sys/kernel/perf_event_max_sample_rate # abudanko/perf_event_tests/tests/overflow/overflow_poll This tests using poll() to catch overflow. Monitoring pid 131551 status 1407 Child has stopped due to signal 5 (Trace/breakpoint trap) Continuing child Returned HUP! Counts, using mmap buffer 0x7f80532df000 POLL_IN : 9 POLL_OUT: 0 POLL_MSG: 0 POLL_ERR: 0 POLL_PRI: 0 POLL_HUP: 1 UNKNOWN : 0 Unexpected POLL_IN interrupt. Testing catching overflow with poll()... FAILED [root@nntpdsd52-210 ~]# abudanko/perf_event_tests/tests/overflow/overflow_poll This tests using poll() to catch overflow. Monitoring pid 131553 status 1407 Child has stopped due to signal 5 (Trace/breakpoint trap) Continuing child Returned HUP! Counts, using mmap buffer 0x7f650952c000 POLL_IN : 9 POLL_OUT: 0 POLL_MSG: 0 POLL_ERR: 0 POLL_PRI: 0 POLL_HUP: 1 UNKNOWN : 0 Unexpected POLL_IN interrupt. Testing catching overflow with poll()... 
FAILED > >> If we're so worried about info leakage, can't we just zero-out the problem >> address (or randomize the kernel address) rather than just pretending the >> interrupt didn't happen? > > Making up zeroed or randomized data is going to confuse users. I can't > imagine that real users are going to want bogus samples that they have > to identify (somehow) in order to skip when processing the data. > > I can see merit in signalling "lost" samples to userspace, so long as > they're easily distinguished from real samples. > > One option is to fake up a sample using the user regs regardless, but > that's both fragile and surprising in other cases. > > Thanks, > Mark. > Thanks, Alexey ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH] perf/core: generate overflow signal when samples are dropped (WAS: Re: [REGRESSION] perf/core: PMU interrupts dropped if we entered the kernel in the "skid" region) 2017-06-29 8:13 ` Alexey Budankov @ 2017-06-29 8:25 ` Alexey Budankov 0 siblings, 0 replies; 37+ messages in thread From: Alexey Budankov @ 2017-06-29 8:25 UTC (permalink / raw) To: Mark Rutland, Vince Weaver Cc: Kyle Huey, Jin, Yao, Ingo Molnar, Peter Zijlstra (Intel), stable, Alexander Shishkin, Arnaldo Carvalho de Melo, Jiri Olsa, Linus Torvalds, Namhyung Kim, Stephane Eranian, Thomas Gleixner, acme, jolsa, kan.liang, Will Deacon, gregkh, Robert O'Callahan, open list On 29.06.2017 11:13, Alexey Budankov wrote: > Hi Folks, > > On 28.06.2017 16:07, Mark Rutland wrote: >> On Wed, Jun 28, 2017 at 08:40:30AM -0400, Vince Weaver wrote: >>> On Wed, 28 Jun 2017, Mark Rutland wrote: >>> >>>> On Wed, Jun 28, 2017 at 11:12:48AM +0100, Mark Rutland wrote: >>> >>>> Instead of bailing out early in perf_event_overflow, we can bail prior >>>> to performing the actual sampling in __perf_event_output(). This avoids >>>> the information leak, but preserves the generation of the signal. >>>> >>>> Since we don't place any sample data into the ring buffer, the signal is >>>> arguably spurious. However, a userspace ringbuffer consumer can already >>>> consume data prior to taking the associated signals, and therefore must >>>> handle spurious signals to operate correctly. Thus, this signal >>>> shouldn't be harmful. >>> >>> this could still break some of my perf_event validation tests. >>> >>> Ones that set up a sampling event for every 1M instructions, run for 100M >>> instructions, and expect there to be 100 samples received. >> >> Is that test reliable today? >> >> I'd expect that at least on ARM it's not, given that events can be >> counted imprecisely, and mode filters can be applied imprecisely. So you >> might get fewer (or more) samples. I'd imagine similar is true on other >> archtiectures. 
>> >> If sampling took long enough, the existing ratelimiting could come into >> effect, too. >> >> Surely that already has some error margin? > > FYI. > >>From my recent experience and observation (on Intel Xeon Phi) > wakeup_events_overflow and overflow_poll tests may fail if > /proc/sys/kernel/perf_event_max_sample_rate is low enough: > > # cat /proc/sys/kernel/perf_event_max_sample_rate > 6000 > # abudanko/perf_event_tests/tests/overflow/wakeup_events_overflow > This tests wakeup event overflows. > Testing with wakeup_events=1. > Counts, using mmap buffer 0x7f707b0dc000 > POLL_IN : 10 > POLL_OUT: 0 > POLL_MSG: 0 > POLL_ERR: 0 > POLL_PRI: 0 > POLL_HUP: 0 > UNKNOWN : 0 > Testing wakeup events overflow... PASSED > # abudanko/perf_event_tests/tests/overflow/overflow_poll > This tests using poll() to catch overflow. > Monitoring pid 131412 status 1407 > Child has stopped due to signal 5 (Trace/breakpoint trap) > Continuing child > Returned HUP! > Counts, using mmap buffer 0x7fb3bfe04000 > POLL_IN : 10 > POLL_OUT: 0 > POLL_MSG: 0 > POLL_ERR: 0 > POLL_PRI: 0 > POLL_HUP: 1 > UNKNOWN : 0 > Testing catching overflow with poll()... PASSED > > # echo 1000 > /proc/sys/kernel/perf_event_max_sample_rate > # abudanko/perf_event_tests/tests/overflow/overflow_poll > This tests using poll() to catch overflow. > Monitoring pid 131551 status 1407 > Child has stopped due to signal 5 (Trace/breakpoint trap) > Continuing child > Returned HUP! > Counts, using mmap buffer 0x7f80532df000 > POLL_IN : 9 > POLL_OUT: 0 > POLL_MSG: 0 > POLL_ERR: 0 > POLL_PRI: 0 > POLL_HUP: 1 > UNKNOWN : 0 > Unexpected POLL_IN interrupt. > Testing catching overflow with poll()... FAILED > [root@nntpdsd52-210 ~]# abudanko/perf_event_tests/tests/overflow/overflow_poll > This tests using poll() to catch overflow. > Monitoring pid 131553 status 1407 > Child has stopped due to signal 5 (Trace/breakpoint trap) > Continuing child > Returned HUP! 
> Counts, using mmap buffer 0x7f650952c000 > POLL_IN : 9 > POLL_OUT: 0 > POLL_MSG: 0 > POLL_ERR: 0 > POLL_PRI: 0 > POLL_HUP: 1 > UNKNOWN : 0 > Unexpected POLL_IN interrupt. > Testing catching overflow with poll()... FAILED More of the other test: # echo 1000 > /proc/sys/kernel/perf_event_max_sample_rate [ ~]# abudanko/perf_event_tests/tests/overflow/wakeup_events_overflow This tests wakeup event overflows. Testing with wakeup_events=1. Counts, using mmap buffer 0x7fe65deff000 POLL_IN : 4 POLL_OUT: 0 POLL_MSG: 0 POLL_ERR: 0 POLL_PRI: 0 POLL_HUP: 0 UNKNOWN : 0 POLL_IN value 4, expected 10. Testing wakeup events overflow... FAILED [ ~]# abudanko/perf_event_tests/tests/overflow/wakeup_events_overflow This tests wakeup event overflows. Testing with wakeup_events=1. Counts, using mmap buffer 0x7f818a732000 POLL_IN : 2 POLL_OUT: 0 POLL_MSG: 0 POLL_ERR: 0 POLL_PRI: 0 POLL_HUP: 0 UNKNOWN : 0 POLL_IN value 2, expected 10. Testing wakeup events overflow... FAILED [ ~]# abudanko/perf_event_tests/tests/overflow/wakeup_events_overflow This tests wakeup event overflows. Testing with wakeup_events=1. Counts, using mmap buffer 0x7fb3675c5000 POLL_IN : 4 POLL_OUT: 0 POLL_MSG: 0 POLL_ERR: 0 POLL_PRI: 0 POLL_HUP: 0 UNKNOWN : 0 POLL_IN value 4, expected 10. Testing wakeup events overflow... FAILED > >> >>> If we're so worried about info leakage, can't we just zero-out the problem >>> address (or randomize the kernel address) rather than just pretending the >>> interrupt didn't happen? >> >> Making up zeroed or randomized data is going to confuse users. I can't >> imagine that real users are going to want bogus samples that they have >> to identify (somehow) in order to skip when processing the data. >> >> I can see merit in signalling "lost" samples to userspace, so long as >> they're easily distinguished from real samples. >> >> One option is to fake up a sample using the user regs regardless, but >> that's both fragile and surprising in other cases. >> >> Thanks, >> Mark. 
>> > > Thanks, > Alexey > > > ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH] perf/core: generate overflow signal when samples are dropped (WAS: Re: [REGRESSION] perf/core: PMU interrupts dropped if we entered the kernel in the "skid" region) 2017-06-28 10:56 ` [PATCH] perf/core: generate overflow signal when samples are dropped (WAS: Re: [REGRESSION] perf/core: PMU interrupts dropped if we entered the kernel in the "skid" region) Mark Rutland 2017-06-28 12:40 ` Vince Weaver @ 2017-06-28 16:48 ` Kyle Huey 2017-06-28 17:49 ` Mark Rutland 2017-07-11 9:03 ` [tip:perf/urgent] Revert "perf/core: Drop kernel samples even though :u is specified" tip-bot for Ingo Molnar 2 siblings, 1 reply; 37+ messages in thread From: Kyle Huey @ 2017-06-28 16:48 UTC (permalink / raw) To: Mark Rutland Cc: Jin, Yao, Ingo Molnar, Peter Zijlstra (Intel), stable, Alexander Shishkin, Arnaldo Carvalho de Melo, Jiri Olsa, Linus Torvalds, Namhyung Kim, Stephane Eranian, Thomas Gleixner, Vince Weaver, acme, jolsa, kan.liang, Will Deacon, gregkh, Robert O'Callahan, open list On Wed, Jun 28, 2017 at 3:56 AM, Mark Rutland <mark.rutland@arm.com> wrote: > On Wed, Jun 28, 2017 at 11:12:48AM +0100, Mark Rutland wrote: >> On Tue, Jun 27, 2017 at 09:51:00PM -0700, Kyle Huey wrote: > >> As we're trying to avoid sampling state, I think we can move the check >> into perf_prepare_sample() or __perf_event_output(), where that state is >> actually sampled. I'll take a look at that momentarily. >> >> Just to clarify, you don't care about the sample state at all? i.e. you >> don't need the user program counter? >> >> Is that signal delivered to the tracee, or to a different process that >> traces it? If the latter, what ensures that the task is stopped >> sufficiently quickly? >> >> > It seems to me that it might be reasonable to ignore the interrupt if >> > the purpose of the interrupt is to trigger sampling of the CPUs >> > register state. But if the interrupt will trigger some other >> > operation, such as a signal on an fd, then there's no reason to drop >> > it. 
>> >> Agreed. I'll try to have a patch for this soon. >> >> I just need to figure out exactly where that overflow signal is >> generated by the perf core. > > I've figured that out now. That's handled by perf_pending_event(), which > is the irq_work we schedule in __perf_event_overflow(). > > Does the below patch work for you? > > ---->8---- > From bb1f99dace508ce34ab0818f91d59e73450fa142 Mon Sep 17 00:00:00 2001 > From: Mark Rutland <mark.rutland@arm.com> > Date: Wed, 28 Jun 2017 11:39:25 +0100 > Subject: [PATCH] perf/core: generate overflow signal when samples are dropped > > We recently tried to kill an information leak where users could receive > kernel samples due to skid on the PMU interrupt. To block this, we > bailed out early in perf_event_overflow(), as we do for non-sampling > events. > > This broke rr, which uses sampling events to receive a signal on > overflow (but does not care about the contents of the sample). These > signals are critical to the correct operation of rr. > > Instead of bailing out early in perf_event_overflow, we can bail prior > to performing the actual sampling in __perf_event_output(). This avoids > the information leak, but preserves the generation of the signal. > > Since we don't place any sample data into the ring buffer, the signal is > arguably spurious. However, a userspace ringbuffer consumer can already > consume data prior to taking the associated signals, and therefore must > handle spurious signals to operate correctly. Thus, this signal > shouldn't be harmful. 
> > Fixes: cc1582c231ea041f ("perf/core: Drop kernel samples even though :u is specified") > Reported-by: Kyle Huey <me@kylehuey.com> > Signed-off-by: Mark Rutland <mark.rutland@arm.com> > Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> > Cc: Arnaldo Carvalho de Melo <acme@kernel.org> > Cc: Ingo Molnar <mingo@kernel.org> > Cc: Jin Yao <yao.jin@linux.intel.com> > Cc: Peter Zijlstra (Intel) <peterz@infradead.org> > --- > kernel/events/core.c | 46 +++++++++++++++++++++++++--------------------- > 1 file changed, 25 insertions(+), 21 deletions(-) > > diff --git a/kernel/events/core.c b/kernel/events/core.c > index 6c4e523..6b263f3 100644 > --- a/kernel/events/core.c > +++ b/kernel/events/core.c > @@ -6090,6 +6090,21 @@ void perf_prepare_sample(struct perf_event_header *header, > } > } > > +static bool sample_is_allowed(struct perf_event *event, struct pt_regs *regs) > +{ > + /* > + * Due to interrupt latency (AKA "skid"), we may enter the > + * kernel before taking an overflow, even if the PMU is only > + * counting user events. > + * To avoid leaking information to userspace, we must always > + * reject kernel samples when exclude_kernel is set. > + */ > + if (event->attr.exclude_kernel && !user_mode(regs)) > + return false; > + > + return true; > +} > + > static void __always_inline > __perf_event_output(struct perf_event *event, > struct perf_sample_data *data, > @@ -6101,6 +6116,12 @@ void perf_prepare_sample(struct perf_event_header *header, > struct perf_output_handle handle; > struct perf_event_header header; > > + /* > + * For security, drop the skid kernel samples if necessary. > + */ > + if (!sample_is_allowed(event, regs)) > + return ret; Just a bare return here. 
> + > /* protect the callchain buffers */ > rcu_read_lock(); > > @@ -7316,21 +7337,6 @@ int perf_event_account_interrupt(struct perf_event *event) > return __perf_event_account_interrupt(event, 1); > } > > -static bool sample_is_allowed(struct perf_event *event, struct pt_regs *regs) > -{ > - /* > - * Due to interrupt latency (AKA "skid"), we may enter the > - * kernel before taking an overflow, even if the PMU is only > - * counting user events. > - * To avoid leaking information to userspace, we must always > - * reject kernel samples when exclude_kernel is set. > - */ > - if (event->attr.exclude_kernel && !user_mode(regs)) > - return false; > - > - return true; > -} > - > /* > * Generic event overflow handling, sampling. > */ > @@ -7352,12 +7358,6 @@ static int __perf_event_overflow(struct perf_event *event, > ret = __perf_event_account_interrupt(event, throttle); > > /* > - * For security, drop the skid kernel samples if necessary. > - */ > - if (!sample_is_allowed(event, regs)) > - return ret; > - > - /* > * XXX event_limit might not quite work as expected on inherited > * events > */ > @@ -7372,6 +7372,10 @@ static int __perf_event_overflow(struct perf_event *event, > > READ_ONCE(event->overflow_handler)(event, data, regs); > > + /* > + * We must generate a wakeup regardless of whether we actually > + * generated a sample. This is relied upon by rr. > + */ > if (*perf_event_fasync(event) && event->pending_kill) { > event->pending_wakeup = 1; > irq_work_queue(&event->pending); > -- > 1.9.1 > I can confirm that with that fixed to compile, this patch fixes rr. Thanks! - Kyle ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH] perf/core: generate overflow signal when samples are dropped (WAS: Re: [REGRESSION] perf/core: PMU interrupts dropped if we entered the kernel in the "skid" region) 2017-06-28 16:48 ` Kyle Huey @ 2017-06-28 17:49 ` Mark Rutland 2017-06-28 22:55 ` Kyle Huey 2017-06-29 8:12 ` Ingo Molnar 0 siblings, 2 replies; 37+ messages in thread From: Mark Rutland @ 2017-06-28 17:49 UTC (permalink / raw) To: Kyle Huey, Vince Weaver Cc: Jin, Yao, Ingo Molnar, Peter Zijlstra (Intel), stable, Alexander Shishkin, Arnaldo Carvalho de Melo, Jiri Olsa, Linus Torvalds, Namhyung Kim, Stephane Eranian, Thomas Gleixner, acme, jolsa, kan.liang, Will Deacon, gregkh, Robert O'Callahan, open list On Wed, Jun 28, 2017 at 09:48:27AM -0700, Kyle Huey wrote: > On Wed, Jun 28, 2017 at 3:56 AM, Mark Rutland <mark.rutland@arm.com> wrote: > > @@ -6101,6 +6116,12 @@ void perf_prepare_sample(struct perf_event_header *header, > > struct perf_output_handle handle; > > struct perf_event_header header; > > > > + /* > > + * For security, drop the skid kernel samples if necessary. > > + */ > > + if (!sample_is_allowed(event, regs)) > > + return ret; > > Just a bare return here. Ugh, yes. Sorry about that. I'll fix that up. [...] > I can confirm that with that fixed to compile, this patch fixes rr. Thanks for giving this a go. Having thought about this some more, I think Vince does make a good point that throwing away samples is liable to break stuff, e.g. that which only relies on (non-sensitive) samples. It still seems wrong to make up data, though. Maybe for exclude_kernel && !exclude_user events we can always generate samples from the user regs, rather than the exception regs. That's going to be closer to what the user wants, regardless. I'll take a look tomorrow. Thanks, Mark. ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH] perf/core: generate overflow signal when samples are dropped (WAS: Re: [REGRESSION] perf/core: PMU interrupts dropped if we entered the kernel in the "skid" region) 2017-06-28 17:49 ` Mark Rutland @ 2017-06-28 22:55 ` Kyle Huey 2017-06-29 0:27 ` Jin, Yao ` (2 more replies) 2017-06-29 8:12 ` Ingo Molnar 1 sibling, 3 replies; 37+ messages in thread From: Kyle Huey @ 2017-06-28 22:55 UTC (permalink / raw) To: Mark Rutland Cc: Vince Weaver, Jin, Yao, Ingo Molnar, Peter Zijlstra (Intel), stable, Alexander Shishkin, Arnaldo Carvalho de Melo, Jiri Olsa, Linus Torvalds, Namhyung Kim, Stephane Eranian, Thomas Gleixner, acme, jolsa, kan.liang, Will Deacon, gregkh, Robert O'Callahan, open list On Wed, Jun 28, 2017 at 10:49 AM, Mark Rutland <mark.rutland@arm.com> wrote: > On Wed, Jun 28, 2017 at 09:48:27AM -0700, Kyle Huey wrote: >> On Wed, Jun 28, 2017 at 3:56 AM, Mark Rutland <mark.rutland@arm.com> wrote: >> > @@ -6101,6 +6116,12 @@ void perf_prepare_sample(struct perf_event_header *header, >> > struct perf_output_handle handle; >> > struct perf_event_header header; >> > >> > + /* >> > + * For security, drop the skid kernel samples if necessary. >> > + */ >> > + if (!sample_is_allowed(event, regs)) >> > + return ret; >> >> Just a bare return here. > > Ugh, yes. Sorry about that. I'll fix that up. > > [...] > >> I can confirm that with that fixed to compile, this patch fixes rr. > > Thanks for giving this a go. > > Having thought about this some more, I think Vince does make a good > point that throwing away samples is liable to break stuff, e.g. that > which only relies on (non-sensitive) samples. > > It still seems wrong to make up data, though. > > Maybe for exclude_kernel && !exclude_user events we can always generate > samples from the user regs, rather than the exception regs. That's going > to be closer to what the user wants, regardless. I'll take a look > tomorrow. 
I'm not very familiar with the kernel internals, but the reason I didn't suggest this originally is it seems like it will be difficult to determine what the "correct" userspace registers are. For example, what happens if a performance counter is fixed to a given tid, the interrupt fires during a context switch from that task to another that is not being monitored, and the kernel is far enough along in the context switch that the current task struct has been switched out? Reporting the new task's registers seems as bad as reporting the kernel's registers. But maybe this is easier than I imagine for whatever reason. Something to think about. - Kyle ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH] perf/core: generate overflow signal when samples are dropped (WAS: Re: [REGRESSION] perf/core: PMU interrupts dropped if we entered the kernel in the "skid" region) 2017-06-28 22:55 ` Kyle Huey @ 2017-06-29 0:27 ` Jin, Yao 2017-06-30 17:44 ` Kyle Huey 2017-07-04 9:03 ` Peter Zijlstra 2 siblings, 0 replies; 37+ messages in thread From: Jin, Yao @ 2017-06-29 0:27 UTC (permalink / raw) To: Kyle Huey, Mark Rutland Cc: Vince Weaver, Ingo Molnar, Peter Zijlstra (Intel), stable, Alexander Shishkin, Arnaldo Carvalho de Melo, Jiri Olsa, Linus Torvalds, Namhyung Kim, Stephane Eranian, Thomas Gleixner, acme, jolsa, kan.liang, Will Deacon, gregkh, Robert O'Callahan, open list, yao.jin On 6/29/2017 6:55 AM, Kyle Huey wrote: > On Wed, Jun 28, 2017 at 10:49 AM, Mark Rutland <mark.rutland@arm.com> wrote: >> On Wed, Jun 28, 2017 at 09:48:27AM -0700, Kyle Huey wrote: >>> On Wed, Jun 28, 2017 at 3:56 AM, Mark Rutland <mark.rutland@arm.com> wrote: >>>> @@ -6101,6 +6116,12 @@ void perf_prepare_sample(struct perf_event_header *header, >>>> struct perf_output_handle handle; >>>> struct perf_event_header header; >>>> >>>> + /* >>>> + * For security, drop the skid kernel samples if necessary. >>>> + */ >>>> + if (!sample_is_allowed(event, regs)) >>>> + return ret; >>> Just a bare return here. >> Ugh, yes. Sorry about that. I'll fix that up. >> >> [...] >> >>> I can confirm that with that fixed to compile, this patch fixes rr. >> Thanks for giving this a go. >> >> Having thought about this some more, I think Vince does make a good >> point that throwing away samples is liable to break stuff, e.g. that >> which only relies on (non-sensitive) samples. >> >> It still seems wrong to make up data, though. >> >> Maybe for exclude_kernel && !exclude_user events we can always generate >> samples from the user regs, rather than the exception regs. That's going >> to be closer to what the user wants, regardless. I'll take a look >> tomorrow. 
> I'm not very familiar with the kernel internals, but the reason I > didn't suggest this originally is it seems like it will be difficult > to determine what the "correct" userspace registers are. For example, > what happens if a performance counter is fixed to a given tid, the > interrupt fires during a context switch from that task to another that > is not being monitored, and the kernel is far enough along in the > context switch that the current task struct has been switched out? > Reporting the new task's registers seems as bad as reporting the > kernel's registers. But maybe this is easier than I imagine for > whatever reason. > > Something to think about. > > - Kyle Yes, I think so. The skid interrupt may be triggered at a wrong context and return wrong indications (e.g. wrong regs) to userspace. So that's why I think the *skid* interrupt had better be dropped. Thanks Jin Yao ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH] perf/core: generate overflow signal when samples are dropped (WAS: Re: [REGRESSION] perf/core: PMU interrupts dropped if we entered the kernel in the "skid" region) 2017-06-28 22:55 ` Kyle Huey 2017-06-29 0:27 ` Jin, Yao @ 2017-06-30 17:44 ` Kyle Huey 2017-07-04 9:03 ` Peter Zijlstra 2 siblings, 0 replies; 37+ messages in thread From: Kyle Huey @ 2017-06-30 17:44 UTC (permalink / raw) To: Mark Rutland Cc: Vince Weaver, Jin, Yao, Ingo Molnar, Peter Zijlstra (Intel), stable, Alexander Shishkin, Arnaldo Carvalho de Melo, Jiri Olsa, Linus Torvalds, Namhyung Kim, Stephane Eranian, Thomas Gleixner, acme, jolsa, kan.liang, Will Deacon, gregkh, Robert O'Callahan, open list On Wed, Jun 28, 2017 at 3:55 PM, Kyle Huey <me@kylehuey.com> wrote: > On Wed, Jun 28, 2017 at 10:49 AM, Mark Rutland <mark.rutland@arm.com> wrote: >> On Wed, Jun 28, 2017 at 09:48:27AM -0700, Kyle Huey wrote: >>> On Wed, Jun 28, 2017 at 3:56 AM, Mark Rutland <mark.rutland@arm.com> wrote: >>> > @@ -6101,6 +6116,12 @@ void perf_prepare_sample(struct perf_event_header *header, >>> > struct perf_output_handle handle; >>> > struct perf_event_header header; >>> > >>> > + /* >>> > + * For security, drop the skid kernel samples if necessary. >>> > + */ >>> > + if (!sample_is_allowed(event, regs)) >>> > + return ret; >>> >>> Just a bare return here. >> >> Ugh, yes. Sorry about that. I'll fix that up. >> >> [...] >> >>> I can confirm that with that fixed to compile, this patch fixes rr. >> >> Thanks for giving this a go. >> >> Having thought about this some more, I think Vince does make a good >> point that throwing away samples is liable to break stuff, e.g. that >> which only relies on (non-sensitive) samples. >> >> It still seems wrong to make up data, though. >> >> Maybe for exclude_kernel && !exclude_user events we can always generate >> samples from the user regs, rather than the exception regs. That's going >> to be closer to what the user wants, regardless. I'll take a look >> tomorrow. 
> > I'm not very familiar with the kernel internals, but the reason I > didn't suggest this originally is it seems like it will be difficult > to determine what the "correct" userspace registers are. For example, > what happens if a performance counter is fixed to a given tid, the > interrupt fires during a context switch from that task to another that > is not being monitored, and the kernel is far enough along in the > context switch that the current task struct has been switched out? > Reporting the new task's registers seems as bad as reporting the > kernel's registers. But maybe this is easier than I imagine for > whatever reason. > > Something to think about. > > - Kyle We've noticed that the regressing changeset made it into stable branches that distros are shipping now[0], and we're starting to receive bug reports from our users. We would really like to get this patch or something similar into 4.12 and the next stable releases if at all possible. Thanks! - Kyle [0] Full list in https://mail.mozilla.org/pipermail/rr-dev/2017-June/000500.html ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH] perf/core: generate overflow signal when samples are dropped (WAS: Re: [REGRESSION] perf/core: PMU interrupts dropped if we entered the kernel in the "skid" region) 2017-06-28 22:55 ` Kyle Huey 2017-06-29 0:27 ` Jin, Yao 2017-06-30 17:44 ` Kyle Huey @ 2017-07-04 9:03 ` Peter Zijlstra 2017-07-04 9:33 ` Mark Rutland 2 siblings, 1 reply; 37+ messages in thread From: Peter Zijlstra @ 2017-07-04 9:03 UTC (permalink / raw) To: Kyle Huey Cc: Mark Rutland, Vince Weaver, Jin, Yao, Ingo Molnar, stable, Alexander Shishkin, Arnaldo Carvalho de Melo, Jiri Olsa, Linus Torvalds, Namhyung Kim, Stephane Eranian, Thomas Gleixner, acme, jolsa, kan.liang, Will Deacon, gregkh, Robert O'Callahan, open list On Wed, Jun 28, 2017 at 03:55:07PM -0700, Kyle Huey wrote: > > Having thought about this some more, I think Vince does make a good > > point that throwing away samples is liable to break stuff, e.g. that > > which only relies on (non-sensitive) samples. > > > > It still seems wrong to make up data, though. It is something we do in other places as well though. For example the printk() %pK thing fakes NULL pointers when kptr_restrict is set. Faking data gets a wee bit tricky in how much data we need to clear through, its not only IP, pretty much everything we get from the interrupt context, like the branch stack and registers is also suspect. > > Maybe for exclude_kernel && !exclude_user events we can always generate > > samples from the user regs, rather than the exception regs. That's going > > to be closer to what the user wants, regardless. I'll take a look > > tomorrow. > > I'm not very familiar with the kernel internals, but the reason I > didn't suggest this originally is it seems like it will be difficult > to determine what the "correct" userspace registers are. 
For example, > what happens if a performance counter is fixed to a given tid, the > interrupt fires during a context switch from that task to another that > is not being monitored, and the kernel is far enough along in the > context switch that the current task struct has been switched out? > Reporting the new task's registers seems as bad as reporting the > kernel's registers. But maybe this is easier than I imagine for > whatever reason. If the counter is fixed to a task then its scheduled along with the task. We'll schedule out the event before doing the actual task switch and switch in the new event after. That said, with a per-cpu event the TID sample value is indeed subject to skid like you describe. ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH] perf/core: generate overflow signal when samples are dropped (WAS: Re: [REGRESSION] perf/core: PMU interrupts dropped if we entered the kernel in the "skid" region) 2017-07-04 9:03 ` Peter Zijlstra @ 2017-07-04 9:33 ` Mark Rutland 2017-07-04 9:46 ` Peter Zijlstra 2017-07-04 10:21 ` Mark Rutland 0 siblings, 2 replies; 37+ messages in thread From: Mark Rutland @ 2017-07-04 9:33 UTC (permalink / raw) To: Peter Zijlstra Cc: Kyle Huey, Vince Weaver, Jin, Yao, Ingo Molnar, stable, Alexander Shishkin, Arnaldo Carvalho de Melo, Jiri Olsa, Linus Torvalds, Namhyung Kim, Stephane Eranian, Thomas Gleixner, acme, jolsa, kan.liang, Will Deacon, gregkh, Robert O'Callahan, open list On Tue, Jul 04, 2017 at 11:03:13AM +0200, Peter Zijlstra wrote: > On Wed, Jun 28, 2017 at 03:55:07PM -0700, Kyle Huey wrote: > > > > Having thought about this some more, I think Vince does make a good > > > point that throwing away samples is liable to break stuff, e.g. that > > > which only relies on (non-sensitive) samples. > > > > > > It still seems wrong to make up data, though. > > It is something we do in other places as well though. For example the > printk() %pK thing fakes NULL pointers when kptr_restrict is set. It looks like I'm outnumbered on that, then. :) I'd still argue it's worth somehow indicating which samples were thrown away, so that (updated) userspace can choose to ignore them, but I guess that can come later. > Faking data gets a wee bit tricky in how much data we need to clear > through, its not only IP, pretty much everything we get from the > interrupt context, like the branch stack and registers is also suspect. Indeed. I'll take a run through __perf_event_output() and callees, and see what we need to drop. > > > Maybe for exclude_kernel && !exclude_user events we can always generate > > > samples from the user regs, rather than the exception regs. That's going > > > to be closer to what the user wants, regardless. I'll take a look > > > tomorrow. 
> > > > I'm not very familiar with the kernel internals, but the reason I > > didn't suggest this originally is it seems like it will be difficult > > to determine what the "correct" userspace registers are. For example, > > what happens if a performance counter is fixed to a given tid, the > > interrupt fires during a context switch from that task to another that > > is not being monitored, and the kernel is far enough along in the > > context switch that the current task struct has been switched out? > > Reporting the new task's registers seems as bad as reporting the > > kernel's registers. But maybe this is easier than I imagine for > > whatever reason. > > If the counter is fixed to a task then its scheduled along with the > task. We'll schedule out the event before doing the actual task switch > and switch in the new event after. > > That said, with a per-cpu event the TID sample value is indeed subject > to skid like you describe. For per-cpu events, does that matter? Those don't have TID filters in the first place, no? Thanks, Mark. ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH] perf/core: generate overflow signal when samples are dropped (WAS: Re: [REGRESSION] perf/core: PMU interrupts dropped if we entered the kernel in the "skid" region) 2017-07-04 9:33 ` Mark Rutland @ 2017-07-04 9:46 ` Peter Zijlstra 2017-07-04 10:21 ` Mark Rutland 1 sibling, 0 replies; 37+ messages in thread From: Peter Zijlstra @ 2017-07-04 9:46 UTC (permalink / raw) To: Mark Rutland Cc: Kyle Huey, Vince Weaver, Jin, Yao, Ingo Molnar, stable, Alexander Shishkin, Arnaldo Carvalho de Melo, Jiri Olsa, Linus Torvalds, Namhyung Kim, Stephane Eranian, Thomas Gleixner, acme, jolsa, kan.liang, Will Deacon, gregkh, Robert O'Callahan, open list On Tue, Jul 04, 2017 at 10:33:45AM +0100, Mark Rutland wrote: > > That said, with a per-cpu event the TID sample value is indeed subject > > to skid like you describe. > > For per-cpu events, does that matter? Those don't have TID filters in > the first place, no? eBPF can do all sorts I suppose. But yes, at some point you just have to give up, there's only so much one can do. ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH] perf/core: generate overflow signal when samples are dropped (WAS: Re: [REGRESSION] perf/core: PMU interrupts dropped if we entered the kernel in the "skid" region)
  2017-07-04  9:33 ` Mark Rutland
  2017-07-04  9:46 ` Peter Zijlstra
@ 2017-07-04 10:21 ` Mark Rutland
  2017-07-06  5:07 ` Robert O'Callahan
  1 sibling, 1 reply; 37+ messages in thread
From: Mark Rutland @ 2017-07-04 10:21 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Kyle Huey, Vince Weaver, Jin, Yao, Ingo Molnar, stable,
    Alexander Shishkin, Arnaldo Carvalho de Melo, Jiri Olsa,
    Linus Torvalds, Namhyung Kim, Stephane Eranian, Thomas Gleixner,
    acme, jolsa, kan.liang, Will Deacon, gregkh, Robert O'Callahan,
    open list

On Tue, Jul 04, 2017 at 10:33:45AM +0100, Mark Rutland wrote:
> On Tue, Jul 04, 2017 at 11:03:13AM +0200, Peter Zijlstra wrote:
> > Faking data gets a wee bit tricky in how much data we need to clear
> > through, its not only IP, pretty much everything we get from the
> > interrupt context, like the branch stack and registers is also suspect.
>
> Indeed. I'll take a run through __perf_event_output() and callees, and
> see what we need to drop.

Looking at perf_event_sample_format in uapi/linux/perf_event.h, there
are samples that are obviously sensitive, and should be dropped:

* PERF_SAMPLE_IP
* PERF_SAMPLE_CALLCHAIN
* PERF_SAMPLE_BRANCH_STACK
* PERF_SAMPLE_REGS_INTR

... samples that look benign:

* PERF_SAMPLE_TID
* PERF_SAMPLE_TIME
* PERF_SAMPLE_CPU
* PERF_SAMPLE_PERIOD
* PERF_SAMPLE_REGS_USER
* PERF_SAMPLE_STACK_USER
* PERF_SAMPLE_READ
* PERF_SAMPLE_ID
* PERF_SAMPLE_STREAM_ID
* PERF_SAMPLE_IDENTIFIER

... and samples that I have no idea about:

* PERF_SAMPLE_ADDR
* PERF_SAMPLE_RAW
* PERF_SAMPLE_WEIGHT
* PERF_SAMPLE_DATA_SRC
* PERF_SAMPLE_TRANSACTION

Should any of those be moved into the "should be dropped" pile?

Thanks,
Mark.
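To make the triage above concrete, here is a small userspace model (hypothetical struct and field names, not kernel code) of what "clearing" a skid sample could look like: the fields derived from the interrupted context are zeroed, while the benign bookkeeping fields are kept.

```c
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical userspace model of the triage above -- not kernel code.
 * Fields in the "sensitive" pile are cleared; "benign" fields survive.
 */
struct skid_sample {
	uint64_t ip;            /* PERF_SAMPLE_IP: sensitive */
	uint64_t callchain[8];  /* PERF_SAMPLE_CALLCHAIN: sensitive */
	unsigned int nr_chain;
	uint64_t regs_intr[4];  /* PERF_SAMPLE_REGS_INTR: sensitive */
	uint32_t pid, tid;      /* PERF_SAMPLE_TID: benign */
	uint64_t time;          /* PERF_SAMPLE_TIME: benign */
	uint64_t period;        /* PERF_SAMPLE_PERIOD: benign */
};

/* Zero everything derived from the interrupted (kernel) context. */
void sanitize_skid_sample(struct skid_sample *s)
{
	s->ip = 0;
	memset(s->callchain, 0, sizeof(s->callchain));
	s->nr_chain = 0;
	memset(s->regs_intr, 0, sizeof(s->regs_intr));
	/*
	 * pid/tid, time and period identify the event, not the
	 * interrupted state, so they are left intact.
	 */
}
```

Whether entries such as PERF_SAMPLE_ADDR or PERF_SAMPLE_RAW belong in the cleared set is exactly the open question in the mail above.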
* Re: [PATCH] perf/core: generate overflow signal when samples are dropped (WAS: Re: [REGRESSION] perf/core: PMU interrupts dropped if we entered the kernel in the "skid" region)
  2017-07-04 10:21 ` Mark Rutland
@ 2017-07-06  5:07 ` Robert O'Callahan
  2017-07-11  2:03 ` Kyle Huey
  0 siblings, 1 reply; 37+ messages in thread
From: Robert O'Callahan @ 2017-07-06 5:07 UTC (permalink / raw)
To: Mark Rutland
Cc: Peter Zijlstra, Kyle Huey, Vince Weaver, Jin, Yao, Ingo Molnar,
    stable, Alexander Shishkin, Arnaldo Carvalho de Melo, Jiri Olsa,
    Linus Torvalds, Namhyung Kim, Stephane Eranian, Thomas Gleixner,
    acme, jolsa, kan.liang, Will Deacon, gregkh, open list

On Tue, Jul 4, 2017 at 3:21 AM, Mark Rutland <mark.rutland@arm.com> wrote:
> Should any of those be moved into the "should be dropped" pile?

Why not be conservative and clear every sample you're not sure about?

We'd appreciate a fix sooner rather than later here, since rr is
currently broken on every stable Linux kernel and our attempts to
implement a workaround have failed.

(We have separate "interrupt" and "measure" counters, and I thought we
might work around this regression by programming the "interrupt"
counter to count kernel events as well as user events (interrupting
early is OK), but that caused our (completely separate) "measure"
counter to report off-by-one results (!), which seems to be a
different bug present on a range of older kernels.)

Thanks,

Rob
* Re: [PATCH] perf/core: generate overflow signal when samples are dropped (WAS: Re: [REGRESSION] perf/core: PMU interrupts dropped if we entered the kernel in the "skid" region) 2017-07-06 5:07 ` Robert O'Callahan @ 2017-07-11 2:03 ` Kyle Huey 2017-07-11 9:03 ` Ingo Molnar 0 siblings, 1 reply; 37+ messages in thread From: Kyle Huey @ 2017-07-11 2:03 UTC (permalink / raw) To: Mark Rutland, Ingo Molnar, gregkh, Peter Zijlstra Cc: Vince Weaver, Jin, Yao, stable, Alexander Shishkin, Arnaldo Carvalho de Melo, Jiri Olsa, Linus Torvalds, Namhyung Kim, Stephane Eranian, Thomas Gleixner, acme, jolsa, kan.liang, Will Deacon, open list, Robert O'Callahan On Wed, Jul 5, 2017 at 10:07 PM, Robert O'Callahan <robert@ocallahan.org> wrote: > On Tue, Jul 4, 2017 at 3:21 AM, Mark Rutland <mark.rutland@arm.com> wrote: >> Should any of those be moved into the "should be dropped" pile? > > Why not be conservative and clear every sample you're not sure about? > > We'd appreciate a fix sooner rather than later here, since rr is > currently broken on every stable Linux kernel and our attempts to > implement a workaround have failed. > > (We have separate "interrupt" and "measure" counters, and I thought we > might work around this regression by programming the "interrupt" > counter to count kernel events as well as user events (interrupting > early is OK), but that caused our (completely separate) "measure" > counter to report off-by-one results (!), which seems to be a > different bug present on a range of older kernels.) This seems to have stalled out here unfortunately. Can we get a consensus (from ingo or peterz?) on Mark's question? Or, alternatively, can we move the patch at the top of this thread forward on the stable branches until we do reach an answer to that question? We've abandoned hope of working around this problem in rr and are currently broken for all of our users with an up-to-date kernel, so the situation for us is rather dire at the moment I'm afraid. 
Thanks,

- Kyle
* Re: [PATCH] perf/core: generate overflow signal when samples are dropped (WAS: Re: [REGRESSION] perf/core: PMU interrupts dropped if we entered the kernel in the "skid" region) 2017-07-11 2:03 ` Kyle Huey @ 2017-07-11 9:03 ` Ingo Molnar 2017-07-11 13:07 ` Jin, Yao ` (2 more replies) 0 siblings, 3 replies; 37+ messages in thread From: Ingo Molnar @ 2017-07-11 9:03 UTC (permalink / raw) To: Kyle Huey Cc: Mark Rutland, gregkh, Peter Zijlstra, Vince Weaver, Jin, Yao, stable, Alexander Shishkin, Arnaldo Carvalho de Melo, Jiri Olsa, Linus Torvalds, Namhyung Kim, Stephane Eranian, Thomas Gleixner, acme, jolsa, kan.liang, Will Deacon, open list, Robert O'Callahan * Kyle Huey <me@kylehuey.com> wrote: > On Wed, Jul 5, 2017 at 10:07 PM, Robert O'Callahan <robert@ocallahan.org> wrote: > > On Tue, Jul 4, 2017 at 3:21 AM, Mark Rutland <mark.rutland@arm.com> wrote: > >> Should any of those be moved into the "should be dropped" pile? > > > > Why not be conservative and clear every sample you're not sure about? > > > > We'd appreciate a fix sooner rather than later here, since rr is > > currently broken on every stable Linux kernel and our attempts to > > implement a workaround have failed. > > > > (We have separate "interrupt" and "measure" counters, and I thought we > > might work around this regression by programming the "interrupt" > > counter to count kernel events as well as user events (interrupting > > early is OK), but that caused our (completely separate) "measure" > > counter to report off-by-one results (!), which seems to be a > > different bug present on a range of older kernels.) > > This seems to have stalled out here unfortunately. > > Can we get a consensus (from ingo or peterz?) on Mark's question? Or, > alternatively, can we move the patch at the top of this thread forward > on the stable branches until we do reach an answer to that question? 
>
> We've abandoned hope of working around this problem in rr and are
> currently broken for all of our users with an up-to-date kernel, so
> the situation for us is rather dire at the moment I'm afraid.

Sorry about that - I've queued up a revert for the original commit and
will send the fix to Linus later today. I've added a -stable tag as
well so it can be forwarded to Greg the moment it hits upstream.

We should do the original fix as well, but in a version that does not
skip the sample but zeroes out the RIP and registers (or sets them all
to -1LL) - and also covers other possible places where skid-RIP is
exposed, such as LBR.

Thanks,

	Ingo
* Re: [PATCH] perf/core: generate overflow signal when samples are dropped (WAS: Re: [REGRESSION] perf/core: PMU interrupts dropped if we entered the kernel in the "skid" region) 2017-07-11 9:03 ` Ingo Molnar @ 2017-07-11 13:07 ` Jin, Yao 2017-07-12 7:57 ` Ingo Molnar 2017-07-11 14:26 ` Mark Rutland 2017-07-11 15:32 ` Kyle Huey 2 siblings, 1 reply; 37+ messages in thread From: Jin, Yao @ 2017-07-11 13:07 UTC (permalink / raw) To: Ingo Molnar, Kyle Huey Cc: Mark Rutland, gregkh, Peter Zijlstra, Vince Weaver, stable, Alexander Shishkin, Arnaldo Carvalho de Melo, Jiri Olsa, Linus Torvalds, Namhyung Kim, Stephane Eranian, Thomas Gleixner, acme, jolsa, kan.liang, Will Deacon, open list, Robert O'Callahan On 7/11/2017 5:03 PM, Ingo Molnar wrote: > * Kyle Huey <me@kylehuey.com> wrote: > >> On Wed, Jul 5, 2017 at 10:07 PM, Robert O'Callahan <robert@ocallahan.org> wrote: >>> On Tue, Jul 4, 2017 at 3:21 AM, Mark Rutland <mark.rutland@arm.com> wrote: >>>> Should any of those be moved into the "should be dropped" pile? >>> Why not be conservative and clear every sample you're not sure about? >>> >>> We'd appreciate a fix sooner rather than later here, since rr is >>> currently broken on every stable Linux kernel and our attempts to >>> implement a workaround have failed. >>> >>> (We have separate "interrupt" and "measure" counters, and I thought we >>> might work around this regression by programming the "interrupt" >>> counter to count kernel events as well as user events (interrupting >>> early is OK), but that caused our (completely separate) "measure" >>> counter to report off-by-one results (!), which seems to be a >>> different bug present on a range of older kernels.) >> This seems to have stalled out here unfortunately. >> >> Can we get a consensus (from ingo or peterz?) on Mark's question? Or, >> alternatively, can we move the patch at the top of this thread forward >> on the stable branches until we do reach an answer to that question? 
>>
>> We've abandoned hope of working around this problem in rr and are
>> currently broken for all of our users with an up-to-date kernel, so
>> the situation for us is rather dire at the moment I'm afraid.
> Sorry about that - I've queued up a revert for the original commit and will send
> the fix to Linus later today. I've added a -stable tag as well so it can be
> forwarded to Greg the moment it hits upstream.
>
> We should do the original fix as well, but in a version that does not skip the
> sample but zeroes out the RIP and registers (or sets them all to -1LL) - and also
> covers other possible places where skid-RIP is exposed, such as LBR.
>
> Thanks,
>
> Ingo

Could we provide two options in user space when enabling event sampling?

One option is for use cases like the rr debugger, which only cares
about the PMI interrupt and doesn't care about the skid. For this
option the skid samples don't need to be dropped.

The other option is for use cases that need accurate sample data. For
this option, the skid samples are dropped.

I would suggest letting user space decide which option to choose.

Thanks,
Jin Yao
* Re: [PATCH] perf/core: generate overflow signal when samples are dropped (WAS: Re: [REGRESSION] perf/core: PMU interrupts dropped if we entered the kernel in the "skid" region) 2017-07-11 13:07 ` Jin, Yao @ 2017-07-12 7:57 ` Ingo Molnar 0 siblings, 0 replies; 37+ messages in thread From: Ingo Molnar @ 2017-07-12 7:57 UTC (permalink / raw) To: Jin, Yao Cc: Kyle Huey, Mark Rutland, gregkh, Peter Zijlstra, Vince Weaver, stable, Alexander Shishkin, Arnaldo Carvalho de Melo, Jiri Olsa, Linus Torvalds, Namhyung Kim, Stephane Eranian, Thomas Gleixner, acme, jolsa, kan.liang, Will Deacon, open list, Robert O'Callahan * Jin, Yao <yao.jin@linux.intel.com> wrote: > Could we provide 2 options in user space when enabling the event sampling? > > One option is for the use case like rr debugger which only cares the PMI > interrupt but doesn't care the skid. The skid samples doesn't need to be > dropped. Since it's an information leak, this is not something we want to expose as an interface. It's also a very ugly interface: why not just *clear* the sample, instead of dropping it? The hardware messed up and gave us something we specifically did not permit user-space to collect. We have to fix the bad effects to the best extent we can, and not based on some knob. Thanks, Ingo ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH] perf/core: generate overflow signal when samples are dropped (WAS: Re: [REGRESSION] perf/core: PMU interrupts dropped if we entered the kernel in the "skid" region) 2017-07-11 9:03 ` Ingo Molnar 2017-07-11 13:07 ` Jin, Yao @ 2017-07-11 14:26 ` Mark Rutland 2017-07-11 15:32 ` Kyle Huey 2 siblings, 0 replies; 37+ messages in thread From: Mark Rutland @ 2017-07-11 14:26 UTC (permalink / raw) To: Ingo Molnar Cc: Kyle Huey, gregkh, Peter Zijlstra, Vince Weaver, Jin, Yao, stable, Alexander Shishkin, Arnaldo Carvalho de Melo, Jiri Olsa, Linus Torvalds, Namhyung Kim, Stephane Eranian, Thomas Gleixner, acme, jolsa, kan.liang, Will Deacon, open list, Robert O'Callahan On Tue, Jul 11, 2017 at 11:03:58AM +0200, Ingo Molnar wrote: > > * Kyle Huey <me@kylehuey.com> wrote: > > > On Wed, Jul 5, 2017 at 10:07 PM, Robert O'Callahan <robert@ocallahan.org> wrote: > > > On Tue, Jul 4, 2017 at 3:21 AM, Mark Rutland <mark.rutland@arm.com> wrote: > > >> Should any of those be moved into the "should be dropped" pile? > > > > > > Why not be conservative and clear every sample you're not sure about? > > > > > > We'd appreciate a fix sooner rather than later here, since rr is > > > currently broken on every stable Linux kernel and our attempts to > > > implement a workaround have failed. > > > > > > (We have separate "interrupt" and "measure" counters, and I thought we > > > might work around this regression by programming the "interrupt" > > > counter to count kernel events as well as user events (interrupting > > > early is OK), but that caused our (completely separate) "measure" > > > counter to report off-by-one results (!), which seems to be a > > > different bug present on a range of older kernels.) > > > > This seems to have stalled out here unfortunately. > > > > Can we get a consensus (from ingo or peterz?) on Mark's question? 
Or, > > alternatively, can we move the patch at the top of this thread forward > > on the stable branches until we do reach an answer to that question? > > > > We've abandoned hope of working around this problem in rr and are > > currently broken for all of our users with an up-to-date kernel, so > > the situation for us is rather dire at the moment I'm afraid. > > Sorry about that - I've queued up a revert for the original commit and will send > the fix to Linus later today. I've added a -stable tag as well so it can be > forwarded to Greg the moment it hits upstream. Thanks for handling this. Mark. ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH] perf/core: generate overflow signal when samples are dropped (WAS: Re: [REGRESSION] perf/core: PMU interrupts dropped if we entered the kernel in the "skid" region) 2017-07-11 9:03 ` Ingo Molnar 2017-07-11 13:07 ` Jin, Yao 2017-07-11 14:26 ` Mark Rutland @ 2017-07-11 15:32 ` Kyle Huey 2017-07-18 0:07 ` Kyle Huey 2 siblings, 1 reply; 37+ messages in thread From: Kyle Huey @ 2017-07-11 15:32 UTC (permalink / raw) To: Ingo Molnar Cc: Mark Rutland, gregkh, Peter Zijlstra, Vince Weaver, Jin, Yao, stable, Alexander Shishkin, Arnaldo Carvalho de Melo, Jiri Olsa, Linus Torvalds, Namhyung Kim, Stephane Eranian, Thomas Gleixner, acme, jolsa, kan.liang, Will Deacon, open list, Robert O'Callahan On Tue, Jul 11, 2017 at 2:03 AM, Ingo Molnar <mingo@kernel.org> wrote: > > * Kyle Huey <me@kylehuey.com> wrote: > >> On Wed, Jul 5, 2017 at 10:07 PM, Robert O'Callahan <robert@ocallahan.org> wrote: >> > On Tue, Jul 4, 2017 at 3:21 AM, Mark Rutland <mark.rutland@arm.com> wrote: >> >> Should any of those be moved into the "should be dropped" pile? >> > >> > Why not be conservative and clear every sample you're not sure about? >> > >> > We'd appreciate a fix sooner rather than later here, since rr is >> > currently broken on every stable Linux kernel and our attempts to >> > implement a workaround have failed. >> > >> > (We have separate "interrupt" and "measure" counters, and I thought we >> > might work around this regression by programming the "interrupt" >> > counter to count kernel events as well as user events (interrupting >> > early is OK), but that caused our (completely separate) "measure" >> > counter to report off-by-one results (!), which seems to be a >> > different bug present on a range of older kernels.) >> >> This seems to have stalled out here unfortunately. >> >> Can we get a consensus (from ingo or peterz?) on Mark's question? 
Or, >> alternatively, can we move the patch at the top of this thread forward >> on the stable branches until we do reach an answer to that question? >> >> We've abandoned hope of working around this problem in rr and are >> currently broken for all of our users with an up-to-date kernel, so >> the situation for us is rather dire at the moment I'm afraid. > > Sorry about that - I've queued up a revert for the original commit and will send > the fix to Linus later today. I've added a -stable tag as well so it can be > forwarded to Greg the moment it hits upstream. Great, thank you. - Kyle ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH] perf/core: generate overflow signal when samples are dropped (WAS: Re: [REGRESSION] perf/core: PMU interrupts dropped if we entered the kernel in the "skid" region) 2017-07-11 15:32 ` Kyle Huey @ 2017-07-18 0:07 ` Kyle Huey 0 siblings, 0 replies; 37+ messages in thread From: Kyle Huey @ 2017-07-18 0:07 UTC (permalink / raw) To: Ingo Molnar Cc: Mark Rutland, gregkh, Peter Zijlstra, Vince Weaver, Jin, Yao, stable, Alexander Shishkin, Arnaldo Carvalho de Melo, Jiri Olsa, Linus Torvalds, Namhyung Kim, Stephane Eranian, Thomas Gleixner, acme, jolsa, kan.liang, Will Deacon, open list, Robert O'Callahan On Tue, Jul 11, 2017 at 8:32 AM, Kyle Huey <me@kylehuey.com> wrote: > On Tue, Jul 11, 2017 at 2:03 AM, Ingo Molnar <mingo@kernel.org> wrote: >> >> * Kyle Huey <me@kylehuey.com> wrote: >> >>> On Wed, Jul 5, 2017 at 10:07 PM, Robert O'Callahan <robert@ocallahan.org> wrote: >>> > On Tue, Jul 4, 2017 at 3:21 AM, Mark Rutland <mark.rutland@arm.com> wrote: >>> >> Should any of those be moved into the "should be dropped" pile? >>> > >>> > Why not be conservative and clear every sample you're not sure about? >>> > >>> > We'd appreciate a fix sooner rather than later here, since rr is >>> > currently broken on every stable Linux kernel and our attempts to >>> > implement a workaround have failed. >>> > >>> > (We have separate "interrupt" and "measure" counters, and I thought we >>> > might work around this regression by programming the "interrupt" >>> > counter to count kernel events as well as user events (interrupting >>> > early is OK), but that caused our (completely separate) "measure" >>> > counter to report off-by-one results (!), which seems to be a >>> > different bug present on a range of older kernels.) >>> >>> This seems to have stalled out here unfortunately. >>> >>> Can we get a consensus (from ingo or peterz?) on Mark's question? 
Or, >>> alternatively, can we move the patch at the top of this thread forward >>> on the stable branches until we do reach an answer to that question? >>> >>> We've abandoned hope of working around this problem in rr and are >>> currently broken for all of our users with an up-to-date kernel, so >>> the situation for us is rather dire at the moment I'm afraid. >> >> Sorry about that - I've queued up a revert for the original commit and will send >> the fix to Linus later today. I've added a -stable tag as well so it can be >> forwarded to Greg the moment it hits upstream. > > Great, thank you. > > - Kyle Hi again, I saw that the revert landed on perf/urgent but it doesn't look like that got pulled by Linus in time for 4.13-rc1. Consider this a gentle poke to please get this merged :) - Kyle ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH] perf/core: generate overflow signal when samples are dropped (WAS: Re: [REGRESSION] perf/core: PMU interrupts dropped if we entered the kernel in the "skid" region)
  2017-06-28 17:49 ` Mark Rutland
  2017-06-28 22:55 ` Kyle Huey
@ 2017-06-29  8:12 ` Ingo Molnar
  2017-07-04  9:06 ` Peter Zijlstra
  1 sibling, 1 reply; 37+ messages in thread
From: Ingo Molnar @ 2017-06-29 8:12 UTC (permalink / raw)
To: Mark Rutland
Cc: Kyle Huey, Vince Weaver, Jin, Yao, Peter Zijlstra (Intel), stable,
    Alexander Shishkin, Arnaldo Carvalho de Melo, Jiri Olsa,
    Linus Torvalds, Namhyung Kim, Stephane Eranian, Thomas Gleixner,
    acme, jolsa, kan.liang, Will Deacon, gregkh, Robert O'Callahan,
    open list

* Mark Rutland <mark.rutland@arm.com> wrote:

> It still seems wrong to make up data, though.

So what we have here is a hardware quirk: we asked for user-space
samples, but didn't get them and we cannot expose the kernel-internal
address.

The question is, how do we handle the hardware quirk. Since we cannot
fix the hardware on existing systems there's really just two choices:

 - Lose the sample (and signal it as a lost sample)

 - Keep the sample but change the sensitive kernel-internal address to
   something that is not sensitive: 0 or -1 works, but we could perhaps
   also return a well-known user-space address such as the vDSO syscall
   trampoline or such?

there's no other option really.

I'd lean towards Vince's take: losing samples is more surprising than
getting the occasional sample with some sanitized data in it.

If we make the artificial data still a meaningful user-space address,
related to kernel entries, then it might even be a bonus, as users
would learn to recognize it as: 'oh, skid artifact, I know about that'.

Thanks,

	Ingo
* Re: [PATCH] perf/core: generate overflow signal when samples are dropped (WAS: Re: [REGRESSION] perf/core: PMU interrupts dropped if we entered the kernel in the "skid" region) 2017-06-29 8:12 ` Ingo Molnar @ 2017-07-04 9:06 ` Peter Zijlstra 2017-07-04 10:04 ` Ingo Molnar 0 siblings, 1 reply; 37+ messages in thread From: Peter Zijlstra @ 2017-07-04 9:06 UTC (permalink / raw) To: Ingo Molnar Cc: Mark Rutland, Kyle Huey, Vince Weaver, Jin, Yao, stable, Alexander Shishkin, Arnaldo Carvalho de Melo, Jiri Olsa, Linus Torvalds, Namhyung Kim, Stephane Eranian, Thomas Gleixner, acme, jolsa, kan.liang, Will Deacon, gregkh, Robert O'Callahan, open list On Thu, Jun 29, 2017 at 10:12:33AM +0200, Ingo Molnar wrote: > > * Mark Rutland <mark.rutland@arm.com> wrote: > > > It still seems wrong to make up data, though. > > So what we have here is a hardware quirk: we asked for user-space samples, but > didn't get them and we cannot expose the kernel-internal address. > > The question is, how do we handle the hardware quirk. Since we cannot fix the > hardware on existing systems there's really just two choices: > > - Lose the sample (and signal it as a lost sample) > > - Keep the sample but change the sensitive kernel-internal address to something > that is not sensitive: 0 or -1 works, but we could perhaps also return a > well-known user-space address such as the vDSO syscall trampoline or such? > > there's no other option really. > > I'd lean towards Vince's take: losing samples is more surprising than getting the > occasional sample with some sanitized data in it. > > If we make the artificial data still a meaningful user-space address, related to > kernel entries, then it might even be a bonus, as users would learn to recognize > it as: 'oh, skid artifact, I know about that'. So while we could easily fake SAMPLE_IP to do as you suggest, other entries might be much harder to fake. That said, I have no problems with just 0 stuffing them. 
The only real problem is determining how much to stuff, I suppose.
* Re: [PATCH] perf/core: generate overflow signal when samples are dropped (WAS: Re: [REGRESSION] perf/core: PMU interrupts dropped if we entered the kernel in the "skid" region) 2017-07-04 9:06 ` Peter Zijlstra @ 2017-07-04 10:04 ` Ingo Molnar 0 siblings, 0 replies; 37+ messages in thread From: Ingo Molnar @ 2017-07-04 10:04 UTC (permalink / raw) To: Peter Zijlstra Cc: Mark Rutland, Kyle Huey, Vince Weaver, Jin, Yao, stable, Alexander Shishkin, Arnaldo Carvalho de Melo, Jiri Olsa, Linus Torvalds, Namhyung Kim, Stephane Eranian, Thomas Gleixner, acme, jolsa, kan.liang, Will Deacon, gregkh, Robert O'Callahan, open list * Peter Zijlstra <peterz@infradead.org> wrote: > On Thu, Jun 29, 2017 at 10:12:33AM +0200, Ingo Molnar wrote: > > > > * Mark Rutland <mark.rutland@arm.com> wrote: > > > > > It still seems wrong to make up data, though. > > > > So what we have here is a hardware quirk: we asked for user-space samples, but > > didn't get them and we cannot expose the kernel-internal address. > > > > The question is, how do we handle the hardware quirk. Since we cannot fix the > > hardware on existing systems there's really just two choices: > > > > - Lose the sample (and signal it as a lost sample) > > > > - Keep the sample but change the sensitive kernel-internal address to something > > that is not sensitive: 0 or -1 works, but we could perhaps also return a > > well-known user-space address such as the vDSO syscall trampoline or such? > > > > there's no other option really. > > > > I'd lean towards Vince's take: losing samples is more surprising than getting the > > occasional sample with some sanitized data in it. > > > > If we make the artificial data still a meaningful user-space address, related to > > kernel entries, then it might even be a bonus, as users would learn to recognize > > it as: 'oh, skid artifact, I know about that'. > > So while we could easily fake SAMPLE_IP to do as you suggest, other > entries might be much harder to fake. 
> That said, I have no problems with just 0 stuffing them.
>
> The only real problem is determining how much to stuff I suppose.

I think the RIP is the most important one to fix up in an informative
fashion (instead of just zeroing it out), so that mainstream users of
'perf top' or 'perf report' have a chance to see that certain entries
have this skid artifact.

The other registers should be zeroed out once we stop trusting a
sample.

Thanks,

	Ingo
* [tip:perf/urgent] Revert "perf/core: Drop kernel samples even though :u is specified"
  2017-06-28 10:56 ` [PATCH] perf/core: generate overflow signal when samples are dropped (WAS: Re: [REGRESSION] perf/core: PMU interrupts dropped if we entered the kernel in the "skid" region) Mark Rutland
  2017-06-28 12:40 ` Vince Weaver
  2017-06-28 16:48 ` Kyle Huey
@ 2017-07-11  9:03 ` tip-bot for Ingo Molnar
  2 siblings, 0 replies; 37+ messages in thread
From: tip-bot for Ingo Molnar @ 2017-07-11 9:03 UTC (permalink / raw)
To: linux-tip-commits
Cc: jolsa, mingo, namhyung, me, eranian, a.p.zijlstra,
    alexander.shishkin, will.deacon, yao.jin, linux-kernel, hpa,
    vincent.weaver, torvalds, acme, stable, tglx

Commit-ID:  6a8a75f3235724c5941a33e287b2f98966ad14c5
Gitweb:     http://git.kernel.org/tip/6a8a75f3235724c5941a33e287b2f98966ad14c5
Author:     Ingo Molnar <mingo@kernel.org>
AuthorDate: Tue, 11 Jul 2017 10:56:54 +0200
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Tue, 11 Jul 2017 10:56:54 +0200

Revert "perf/core: Drop kernel samples even though :u is specified"

This reverts commit cc1582c231ea041fbc68861dfaf957eaf902b829.

This commit introduced a regression that broke rr-project, which uses
sampling events to receive a signal on overflow (but does not care
about the contents of the sample). These signals are critical to the
correct operation of rr.

There's been some back and forth about how to fix it - but to not keep
applications in limbo queue up a revert.

Reported-by: Kyle Huey <me@kylehuey.com>
Acked-by: Kyle Huey <me@kylehuey.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Jin Yao <yao.jin@linux.intel.com>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Stephane Eranian <eranian@google.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: <stable@vger.kernel.org>
Link: http://lkml.kernel.org/r/20170628105600.GC5981@leverpostej
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/events/core.c | 21 ---------------------
 1 file changed, 21 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 4d2c32f..9747e42 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -7308,21 +7308,6 @@ int perf_event_account_interrupt(struct perf_event *event)
 	return __perf_event_account_interrupt(event, 1);
 }
 
-static bool sample_is_allowed(struct perf_event *event, struct pt_regs *regs)
-{
-	/*
-	 * Due to interrupt latency (AKA "skid"), we may enter the
-	 * kernel before taking an overflow, even if the PMU is only
-	 * counting user events.
-	 * To avoid leaking information to userspace, we must always
-	 * reject kernel samples when exclude_kernel is set.
-	 */
-	if (event->attr.exclude_kernel && !user_mode(regs))
-		return false;
-
-	return true;
-}
-
 /*
  * Generic event overflow handling, sampling.
  */
@@ -7344,12 +7329,6 @@ static int __perf_event_overflow(struct perf_event *event,
 	ret = __perf_event_account_interrupt(event, throttle);
 
 	/*
-	 * For security, drop the skid kernel samples if necessary.
-	 */
-	if (!sample_is_allowed(event, regs))
-		return ret;
-
-	/*
 	 * XXX event_limit might not quite work as expected on inherited
 	 * events
 	 */

^ permalink raw reply related	[flat|nested] 37+ messages in thread
* Re: [REGRESSION] perf/core: PMU interrupts dropped if we entered the kernel in the "skid" region
  2017-06-28 10:12 ` Mark Rutland
  2017-06-28 10:56 ` [PATCH] perf/core: generate overflow signal when samples are dropped (WAS: Re: [REGRESSION] perf/core: PMU interrupts dropped if we entered the kernel in the "skid" region) Mark Rutland
@ 2017-06-28 16:46 ` Kyle Huey
  2017-06-28 17:19 ` Mark Rutland
  1 sibling, 1 reply; 37+ messages in thread
From: Kyle Huey @ 2017-06-28 16:46 UTC (permalink / raw)
To: Mark Rutland
Cc: Jin, Yao, Ingo Molnar, Peter Zijlstra (Intel), stable,
    Alexander Shishkin, Arnaldo Carvalho de Melo, Jiri Olsa,
    Linus Torvalds, Namhyung Kim, Stephane Eranian, Thomas Gleixner,
    Vince Weaver, acme, jolsa, kan.liang, Will Deacon, gregkh,
    Robert O'Callahan, open list

On Wed, Jun 28, 2017 at 3:12 AM, Mark Rutland <mark.rutland@arm.com> wrote:
> On Tue, Jun 27, 2017 at 09:51:00PM -0700, Kyle Huey wrote:
>> On Tue, Jun 27, 2017 at 7:09 PM, Jin, Yao <yao.jin@linux.intel.com> wrote:
>> > Hi,
>> >
>> > In theory, the PMI interrupts in skid region should be dropped, right?
>>
>> No, why would they be dropped?
>>
>> My understanding of the situation is as follows:
>>
>> There is some time, call it t_0, where the hardware counter overflows.
>> The PMU triggers an interrupt, but this is not instantaneous. Call
>> the time when the interrupt is actually delivered t_1. Then t_1 - t_0
>> is the "skid".
>>
>> Note that if the counter is `exclude_kernel`, then at t_0 the CPU
>> *must* be running a userspace program. But by t_1, the CPU may be
>> doing something else. Your patch changed things so that if at t_1 the
>> CPU is in the kernel, then the interrupt is discarded. But rr has
>> programmed the counter to deliver a signal on overflow (via F_SETSIG
>> on the fd returned by perf_event_open). This change results in the
>> signal never being delivered, because the interrupt was ignored.
>> (More accurately, the signal is delivered the *next* time the counter
>> overflows, which is far past where we wanted to inject our
>> asynchronous event into our tracee.)
>
> Yes, this is a bug.
>
> As we're trying to avoid sampling state, I think we can move the check
> into perf_prepare_sample() or __perf_event_output(), where that state is
> actually sampled. I'll take a look at that momentarily.
>
> Just to clarify, you don't care about the sample state at all? i.e. you
> don't need the user program counter?

Right. `sample_regs_user`, `sample_stack_user`, `branch_sample_type`,
etc are all 0.
https://github.com/mozilla/rr/blob/cf594dd01f07d96a61409e9f41a29f78c8c51693/src/PerfCounters.cc#L194
is what we do use.

> Is that signal delivered to the tracee, or to a different process that
> traces it? If the latter, what ensures that the task is stopped
> sufficiently quickly?

It's delivered to the tracee (via an F_SETOWN_EX with the tracee tid).
In practice we've found that on modern Intel hardware the interrupt
and resulting signal delivery delay is bounded by a relatively small
number of counter events.

>> It seems to me that it might be reasonable to ignore the interrupt if
>> the purpose of the interrupt is to trigger sampling of the CPU's
>> register state. But if the interrupt will trigger some other
>> operation, such as a signal on an fd, then there's no reason to drop
>> it.
>
> Agreed. I'll try to have a patch for this soon.
>
> I just need to figure out exactly where that overflow signal is
> generated by the perf core.
>
> Thanks,
> Mark.

- Kyle

^ permalink raw reply	[flat|nested] 37+ messages in thread
* Re: [REGRESSION] perf/core: PMU interrupts dropped if we entered the kernel in the "skid" region
  2017-06-28 16:46 ` [REGRESSION] perf/core: PMU interrupts dropped if we entered the kernel in the "skid" region Kyle Huey
@ 2017-06-28 17:19 ` Mark Rutland
  2017-06-28 17:36 ` Kyle Huey
  2017-06-28 17:48 ` Robert O'Callahan
  0 siblings, 2 replies; 37+ messages in thread
From: Mark Rutland @ 2017-06-28 17:19 UTC (permalink / raw)
To: Kyle Huey
Cc: Jin, Yao, Ingo Molnar, Peter Zijlstra (Intel), stable,
    Alexander Shishkin, Arnaldo Carvalho de Melo, Jiri Olsa,
    Linus Torvalds, Namhyung Kim, Stephane Eranian, Thomas Gleixner,
    Vince Weaver, acme, jolsa, kan.liang, Will Deacon, gregkh,
    Robert O'Callahan, open list

On Wed, Jun 28, 2017 at 09:46:43AM -0700, Kyle Huey wrote:
> On Wed, Jun 28, 2017 at 3:12 AM, Mark Rutland <mark.rutland@arm.com> wrote:
> > On Tue, Jun 27, 2017 at 09:51:00PM -0700, Kyle Huey wrote:
> >> My understanding of the situation is as follows:
> >>
> >> There is some time, call it t_0, where the hardware counter overflows.
> >> The PMU triggers an interrupt, but this is not instantaneous. Call
> >> the time when the interrupt is actually delivered t_1. Then t_1 - t_0
> >> is the "skid".
> >>
> >> Note that if the counter is `exclude_kernel`, then at t_0 the CPU
> >> *must* be running a userspace program. But by t_1, the CPU may be
> >> doing something else. Your patch changed things so that if at t_1 the
> >> CPU is in the kernel, then the interrupt is discarded. But rr has
> >> programmed the counter to deliver a signal on overflow (via F_SETSIG
> >> on the fd returned by perf_event_open). This change results in the
> >> signal never being delivered, because the interrupt was ignored.
> >> (More accurately, the signal is delivered the *next* time the counter
> >> overflows, which is far past where we wanted to inject our
> >> asynchronous event into our tracee.)
> >
> > Yes, this is a bug.
> >
> > As we're trying to avoid sampling state, I think we can move the check
> > into perf_prepare_sample() or __perf_event_output(), where that state is
> > actually sampled. I'll take a look at that momentarily.
> >
> > Just to clarify, you don't care about the sample state at all? i.e. you
> > don't need the user program counter?
>
> Right. `sample_regs_user`, `sample_stack_user`, `branch_sample_type`,
> etc are all 0.
> https://github.com/mozilla/rr/blob/cf594dd01f07d96a61409e9f41a29f78c8c51693/src/PerfCounters.cc#L194
> is what we do use.

Given that, I must be missing something.

In __perf_event_overflow(), we already bail out early if
!is_sampling_event(event), i.e. when the sample_period is 0.

Your attr has a sample_period of zero, so something must be initialising
that.

Do you always call PERF_EVENT_IOC_PERIOD, or is something in the core
fiddling with the sample period behind your back?

It seems odd that an event without any samples to take has a sample
period. I'm surprised that there's not *some* sample_type set.

> > Is that signal delivered to the tracee, or to a different process that
> > traces it? If the latter, what ensures that the task is stopped
> > sufficiently quickly?
>
> It's delivered to the tracee (via an F_SETOWN_EX with the tracee tid).
> In practice we've found that on modern Intel hardware the interrupt
> and resulting signal delivery delay is bounded by a relatively small
> number of counter events.

Ok.

Thanks,
Mark.

^ permalink raw reply	[flat|nested] 37+ messages in thread
* Re: [REGRESSION] perf/core: PMU interrupts dropped if we entered the kernel in the "skid" region
  2017-06-28 17:19 ` Mark Rutland
@ 2017-06-28 17:36 ` Kyle Huey
  2017-06-28 17:52 ` Mark Rutland
  0 siblings, 1 reply; 37+ messages in thread
From: Kyle Huey @ 2017-06-28 17:36 UTC (permalink / raw)
To: Mark Rutland
Cc: Jin, Yao, Ingo Molnar, Peter Zijlstra (Intel), stable,
    Alexander Shishkin, Arnaldo Carvalho de Melo, Jiri Olsa,
    Linus Torvalds, Namhyung Kim, Stephane Eranian, Thomas Gleixner,
    Vince Weaver, acme, jolsa, kan.liang, Will Deacon, gregkh,
    Robert O'Callahan, open list

On Wed, Jun 28, 2017 at 10:19 AM, Mark Rutland <mark.rutland@arm.com> wrote:
> On Wed, Jun 28, 2017 at 09:46:43AM -0700, Kyle Huey wrote:
>> On Wed, Jun 28, 2017 at 3:12 AM, Mark Rutland <mark.rutland@arm.com> wrote:
>> > On Tue, Jun 27, 2017 at 09:51:00PM -0700, Kyle Huey wrote:
>> >> My understanding of the situation is as follows:
>> >>
>> >> There is some time, call it t_0, where the hardware counter overflows.
>> >> The PMU triggers an interrupt, but this is not instantaneous. Call
>> >> the time when the interrupt is actually delivered t_1. Then t_1 - t_0
>> >> is the "skid".
>> >>
>> >> Note that if the counter is `exclude_kernel`, then at t_0 the CPU
>> >> *must* be running a userspace program. But by t_1, the CPU may be
>> >> doing something else. Your patch changed things so that if at t_1 the
>> >> CPU is in the kernel, then the interrupt is discarded. But rr has
>> >> programmed the counter to deliver a signal on overflow (via F_SETSIG
>> >> on the fd returned by perf_event_open). This change results in the
>> >> signal never being delivered, because the interrupt was ignored.
>> >> (More accurately, the signal is delivered the *next* time the counter
>> >> overflows, which is far past where we wanted to inject our
>> >> asynchronous event into our tracee.)
>> >
>> > Yes, this is a bug.
>> >
>> > As we're trying to avoid sampling state, I think we can move the check
>> > into perf_prepare_sample() or __perf_event_output(), where that state is
>> > actually sampled. I'll take a look at that momentarily.
>> >
>> > Just to clarify, you don't care about the sample state at all? i.e. you
>> > don't need the user program counter?
>>
>> Right. `sample_regs_user`, `sample_stack_user`, `branch_sample_type`,
>> etc are all 0.
>> https://github.com/mozilla/rr/blob/cf594dd01f07d96a61409e9f41a29f78c8c51693/src/PerfCounters.cc#L194
>> is what we do use.
>
> Given that, I must be missing something.
>
> In __perf_event_overflow(), we already bail out early if
> !is_sampling_event(event), i.e. when the sample_period is 0.
>
> Your attr has a sample_period of zero, so something must be initialising
> that.
>
> Do you always call PERF_EVENT_IOC_PERIOD, or is something in the core
> fiddling with the sample period behind your back?

We always either set sample_period or call PERF_EVENT_IOC_PERIOD (with
an enormous number if we don't actually want an interrupt). See
`PerfCounters::reset`, line 446.

> It seems odd that an event without any samples to take has a sample
> period. I'm surprised that there's not *some* sample_type set.

Perhaps sample_period is misleadingly named :) Alternatively, you
could imagine it as sampling where we're only interested in whether
the counter passed the sampling value or not.

>> > Is that signal delivered to the tracee, or to a different process that
>> > traces it? If the latter, what ensures that the task is stopped
>> > sufficiently quickly?
>>
>> It's delivered to the tracee (via an F_SETOWN_EX with the tracee tid).
>> In practice we've found that on modern Intel hardware the interrupt
>> and resulting signal delivery delay is bounded by a relatively small
>> number of counter events.
>
> Ok.
>
> Thanks,
> Mark.

- Kyle

^ permalink raw reply	[flat|nested] 37+ messages in thread
* Re: [REGRESSION] perf/core: PMU interrupts dropped if we entered the kernel in the "skid" region
  2017-06-28 17:36 ` Kyle Huey
@ 2017-06-28 17:52 ` Mark Rutland
  0 siblings, 0 replies; 37+ messages in thread
From: Mark Rutland @ 2017-06-28 17:52 UTC (permalink / raw)
To: Kyle Huey
Cc: Jin, Yao, Ingo Molnar, Peter Zijlstra (Intel), stable,
    Alexander Shishkin, Arnaldo Carvalho de Melo, Jiri Olsa,
    Linus Torvalds, Namhyung Kim, Stephane Eranian, Thomas Gleixner,
    Vince Weaver, acme, jolsa, kan.liang, Will Deacon, gregkh,
    Robert O'Callahan, open list

On Wed, Jun 28, 2017 at 10:36:20AM -0700, Kyle Huey wrote:
> On Wed, Jun 28, 2017 at 10:19 AM, Mark Rutland <mark.rutland@arm.com> wrote:
> > On Wed, Jun 28, 2017 at 09:46:43AM -0700, Kyle Huey wrote:
> >> On Wed, Jun 28, 2017 at 3:12 AM, Mark Rutland <mark.rutland@arm.com> wrote:
> >> > Just to clarify, you don't care about the sample state at all? i.e. you
> >> > don't need the user program counter?
> >>
> >> Right. `sample_regs_user`, `sample_stack_user`, `branch_sample_type`,
> >> etc are all 0.
> >> https://github.com/mozilla/rr/blob/cf594dd01f07d96a61409e9f41a29f78c8c51693/src/PerfCounters.cc#L194
> >> is what we do use.
> >
> > Given that, I must be missing something.
> >
> > In __perf_event_overflow(), we already bail out early if
> > !is_sampling_event(event), i.e. when the sample_period is 0.
> >
> > Your attr has a sample_period of zero, so something must be initialising
> > that.
> >
> > Do you always call PERF_EVENT_IOC_PERIOD, or is something in the core
> > fiddling with the sample period behind your back?
>
> We always either set sample_period or call PERF_EVENT_IOC_PERIOD (with
> an enormous number if we don't actually want an interrupt). See
> `PerfCounters::reset`, line 446.

Ah, thanks for the pointer.

> > It seems odd that an event without any samples to take has a sample
> > period. I'm surprised that there's not *some* sample_type set.
>
> Perhaps sample_period is misleadingly named :) Alternatively, you
> could imagine it as sampling where we're only interested in whether
> the counter passed the sampling value or not.

Sure; it's just that I suspect the existing kernel behaviour isn't
*quite* intentional, and I could easily see it getting broken in
future, e.g. if someone were to make is_sampling_event() check the
attr for sample types.

So we need to keep an eye on that, regardless.

Thanks,
Mark.

^ permalink raw reply	[flat|nested] 37+ messages in thread
* Re: [REGRESSION] perf/core: PMU interrupts dropped if we entered the kernel in the "skid" region
  2017-06-28 17:19 ` Mark Rutland
  2017-06-28 17:36 ` Kyle Huey
@ 2017-06-28 17:48 ` Robert O'Callahan
  1 sibling, 0 replies; 37+ messages in thread
From: Robert O'Callahan @ 2017-06-28 17:48 UTC (permalink / raw)
To: Mark Rutland
Cc: Kyle Huey, Jin, Yao, Ingo Molnar, Peter Zijlstra (Intel), stable,
    Alexander Shishkin, Arnaldo Carvalho de Melo, Jiri Olsa,
    Linus Torvalds, Namhyung Kim, Stephane Eranian, Thomas Gleixner,
    Vince Weaver, acme, jolsa, kan.liang, Will Deacon, gregkh,
    open list

On Wed, Jun 28, 2017 at 10:19 AM, Mark Rutland <mark.rutland@arm.com> wrote:
> It seems odd that an event without any samples to take has a sample
> period. I'm surprised that there's not *some* sample_type set.

Yes, it is a bit odd :-). Rather than trying to gather some limited
set of information about the target process, we want to interrupt its
execution after a particular number of events have occurred.

The bigger picture here is that we're replaying a previously recorded
process execution. We want the process to run until it reaches a
particular state that occurred during recording; that state is
identified by a value for the performance counter (retired conditional
branches) plus the values of the general-purpose registers. So our
first step is to run the process until the performance counter reaches
a value less than but "close to" the target value. (We know the
process will be interrupted after a number of extra events have
occurred, so we set the sample period to a value less than the target
by a heuristic "maximum skid size".)
Rob

^ permalink raw reply	[flat|nested] 37+ messages in thread
end of thread, other threads:[~2017-07-18  0:07 UTC | newest]

Thread overview: 37+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <CAP045Ap8cMx6mzSgcQ3n3bnh_8GJuCp7_KZe_5ZTCR_K6cPTLw@mail.gmail.com>
2017-06-28  1:01 ` [REGRESSION] perf/core: PMU interrupts dropped if we entered the kernel in the "skid" region Kyle Huey
2017-06-28  2:09 ` Jin, Yao
2017-06-28  4:51 ` Kyle Huey
2017-06-28  5:35 ` Jin, Yao
2017-06-28  7:30 ` Kyle Huey
2017-06-28 10:12 ` Mark Rutland
2017-06-28 10:56 ` [PATCH] perf/core: generate overflow signal when samples are dropped (WAS: Re: [REGRESSION] perf/core: PMU interrupts dropped if we entered the kernel in the "skid" region) Mark Rutland
2017-06-28 12:40 ` Vince Weaver
2017-06-28 13:07 ` Mark Rutland
2017-06-29  8:13 ` Alexey Budankov
2017-06-29  8:25 ` Alexey Budankov
2017-06-28 16:48 ` Kyle Huey
2017-06-28 17:49 ` Mark Rutland
2017-06-28 22:55 ` Kyle Huey
2017-06-29  0:27 ` Jin, Yao
2017-06-30 17:44 ` Kyle Huey
2017-07-04  9:03 ` Peter Zijlstra
2017-07-04  9:33 ` Mark Rutland
2017-07-04  9:46 ` Peter Zijlstra
2017-07-04 10:21 ` Mark Rutland
2017-07-06  5:07 ` Robert O'Callahan
2017-07-11  2:03 ` Kyle Huey
2017-07-11  9:03 ` Ingo Molnar
2017-07-11 13:07 ` Jin, Yao
2017-07-12  7:57 ` Ingo Molnar
2017-07-11 14:26 ` Mark Rutland
2017-07-11 15:32 ` Kyle Huey
2017-07-18  0:07 ` Kyle Huey
2017-06-29  8:12 ` Ingo Molnar
2017-07-04  9:06 ` Peter Zijlstra
2017-07-04 10:04 ` Ingo Molnar
2017-07-11  9:03 ` [tip:perf/urgent] Revert "perf/core: Drop kernel samples even though :u is specified" tip-bot for Ingo Molnar
2017-06-28 16:46 ` [REGRESSION] perf/core: PMU interrupts dropped if we entered the kernel in the "skid" region Kyle Huey
2017-06-28 17:19 ` Mark Rutland
2017-06-28 17:36 ` Kyle Huey
2017-06-28 17:52 ` Mark Rutland
2017-06-28 17:48 ` Robert O'Callahan