From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751597AbdF1Haf (ORCPT ); Wed, 28 Jun 2017 03:30:35 -0400 Received: from mail-yw0-f194.google.com ([209.85.161.194]:33760 "EHLO mail-yw0-f194.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751087AbdF1Ha1 (ORCPT ); Wed, 28 Jun 2017 03:30:27 -0400 MIME-Version: 1.0 In-Reply-To: References: <2256f9b5-1277-c4b1-1472-61a10cd1db9a@linux.intel.com> From: Kyle Huey Date: Wed, 28 Jun 2017 00:30:25 -0700 Message-ID: Subject: Re: [REGRESSION] perf/core: PMU interrupts dropped if we entered the kernel in the "skid" region To: "Jin, Yao" Cc: Ingo Molnar , "Peter Zijlstra (Intel)" , stable@vger.kernel.org, Alexander Shishkin , Arnaldo Carvalho de Melo , Jiri Olsa , Linus Torvalds , Namhyung Kim , Stephane Eranian , Thomas Gleixner , Vince Weaver , acme@kernel.org, jolsa@kernel.org, kan.liang@intel.com, Mark Rutland , Will Deacon , gregkh@linuxfoundation.org, "Robert O'Callahan" , open list , yao.jin@intel.com Content-Type: text/plain; charset="UTF-8" Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, Jun 27, 2017 at 10:35 PM, Jin, Yao wrote: > Hi Kyle, > > I understand your requirement. Sorry I don't know the detail of rr debugger, > but I guess if it just uses counter overflow to deliver a signal, could it > set the counter without "exclude_kernel"? Unfortunately we cannot. We depend on the counter value being exactly the same between recording and replay, and dropping `exclude_kernel` would introduce non-determinism. > Another consideration is, I'm not sure if the kernel can accurately realize > that if the interrupt is to trigger sampling or just deliver a signal. > Userspace may know that but kernel may not. After looking at this code a bit more, I think that changing the `is_sample_allowed` check from an early return to a guard around the invocation of `overflow_handler` would fix this. I believe, but have not tested, that `perf_event_fasync` is what must run to deliver our signal, while the `overflow_handler` is what copies the kernel RIP/etc into the output buffer that you want to skip. - Kyle > Anyway let's listen to more comments from community. > > Thanks > > Jin Yao > > > > On 6/28/2017 12:51 PM, Kyle Huey wrote: >> >> On Tue, Jun 27, 2017 at 7:09 PM, Jin, Yao wrote: >>> >>> Hi, >>> >>> In theory, the PMI interrupts in skid region should be dropped, right? >> >> No, why would they be dropped? >> >> My understanding of the situation is as follows: >> >> There is some time, call it t_0, where the hardware counter overflows. >> The PMU triggers an interrupt, but this is not instantaneous. Call >> the time when the interrupt is actually delivered t_1. Then t_1 - t_0 >> is the "skid". >> >> Note that if the counter is `exclude_kernel`, then at t_0 the CPU >> *must* be running a userspace program. But by t_1, the CPU may be >> doing something else. Your patch changed things so that if at t_1 the >> CPU is in the kernel, then the interrupt is discarded. But rr has >> programmed the counter to deliver a signal on overflow (via F_SETSIG >> on the fd returned by perf_event_open). This change results in the >> signal never being delivered, because the interrupt was ignored. >> (More accurately, the signal is delivered the *next* time the counter >> overflows, which is far past where we wanted to inject our >> asynchronous event into our tracee. >> >> It seems to me that it might be reasonable to ignore the interrupt if >> the purpose of the interrupt is to trigger sampling of the CPUs >> register state. But if the interrupt will trigger some other >> operation, such as a signal on an fd, then there's no reason to drop >> it. >> >>> For a userspace debugger, is it the only choice that relies on the *skid* >>> PMI interrupt? >> >> I don't understand this question, but hopefully the above clarified >> things. >> >> - Kyle >> >>> Thanks >>> Jin Yao >>> >>> >>> On 6/28/2017 9:01 AM, Kyle Huey wrote: >>>> >>>> Sent again with LKML CCd, sorry for the noise. >>>> >>>> - Kyle >>>> >>>> On Tue, Jun 27, 2017 at 5:38 PM, Kyle Huey wrote: >>>>> >>>>> cc1582c231ea introduced a regression in v4.12.0-rc5, and appears to be >>>>> a candidate for backporting to stable branches. >>>>> >>>>> rr, a userspace record and replay debugger[0], uses the PMU interrupt >>>>> to stop a program during replay to inject asynchronous events such as >>>>> signals. We are counting retired conditional branches in userspace >>>>> only. This changeset causes the kernel to drop interrupts on the >>>>> floor if, during the PMU interrupt's "skid" region, the CPU enters >>>>> kernel mode for whatever reason. When replaying traces of complex >>>>> programs such as Firefox, we intermittently fail to deliver >>>>> asynchronous events on time, leading the replay to diverge from the >>>>> recorded state. >>>>> >>>>> It seems like this change should, at a bare minimum, be limited to >>>>> counters that actually perform sampling of register state when the >>>>> interrupt fires. In our case, with the retired conditional branches >>>>> counter restricted to counting userspace events only, it makes no >>>>> difference that the PMU interrupt happened to be delivered in the >>>>> kernel. >>>>> >>>>> As this makes rr unusable on complex applications and cannot be >>>>> efficiently worked around, we would appreciate this being addressed >>>>> before 4.12 is finalized, and the regression not being introduced to >>>>> stable branches. >>>>> >>>>> Thanks, >>>>> >>>>> - Kyle >>>>> >>>>> [0] http://rr-project.org/ >>> >>> >