* Re: [PATCH v1 1/2] perf/core: Use sysctl to turn on/off dropping leaked kernel samples
2018-06-15 10:03 ` [PATCH v1 1/2] perf/core: Use sysctl to turn on/off dropping " Jin Yao
@ 2018-06-15 5:59 ` Stephane Eranian
2018-06-15 7:15 ` Jin, Yao
2018-06-15 6:02 ` Stephane Eranian
2018-06-15 11:36 ` Mark Rutland
2 siblings, 1 reply; 29+ messages in thread
From: Stephane Eranian @ 2018-06-15 5:59 UTC (permalink / raw)
To: yao.jin
Cc: Arnaldo Carvalho de Melo, Jiri Olsa, Peter Zijlstra, Ingo Molnar,
Alexander Shishkin, me, LKML, Vince Weaver, Will Deacon,
Namhyung Kim, Andi Kleen, Liang, Kan, Jin, Yao
On Thu, Jun 14, 2018 at 7:10 PM Jin Yao <yao.jin@linux.intel.com> wrote:
>
> When doing sampling, for example:
>
> perf record -e cycles:u ...
>
> On workloads that do a lot of kernel entry/exits we see kernel
> samples, even though :u is specified. This is due to skid existing.
>
> This might be a security issue because it can leak kernel addresses even
> though kernel sampling support is disabled.
>
> One patch "perf/core: Drop kernel samples even though :u is specified"
> was posted in last year but it was reverted because it introduced a
> regression issue that broke the rr-project, which used sampling
> events to receive a signal on overflow. These signals were critical
> to the correct operation of rr.
>
> See '6a8a75f32357 ("Revert "perf/core: Drop kernel samples even
> though :u is specified"")' for detail.
>
> Now the idea is to use sysctl to control the dropping of leaked
> kernel samples.
>
> /sys/devices/cpu/perf_allow_sample_leakage:
>
> 0 - default, drop the leaked kernel samples.
> 1 - don't drop the leaked kernel samples.
>
> For rr it can write 1 to /sys/devices/cpu/perf_allow_sample_leakage.
>
> For example,
>
> root@skl:/tmp# cat /sys/devices/cpu/perf_allow_sample_leakage
> 0
> root@skl:/tmp# perf record -e cycles:u ./div
> root@skl:/tmp# perf report --stdio
>
> ........ ....... ............. ................
>
> 47.01% div div [.] main
> 20.74% div libc-2.23.so [.] __random_r
> 15.59% div libc-2.23.so [.] __random
> 8.68% div div [.] compute_flag
> 4.48% div libc-2.23.so [.] rand
> 3.50% div div [.] rand@plt
> 0.00% div ld-2.23.so [.] do_lookup_x
> 0.00% div ld-2.23.so [.] memcmp
> 0.00% div ld-2.23.so [.] _dl_start
> 0.00% div ld-2.23.so [.] _start
>
> There is no kernel symbol reported.
>
> root@skl:/tmp# echo 1 > /sys/devices/cpu/perf_allow_sample_leakage
> root@skl:/tmp# cat /sys/devices/cpu/perf_allow_sample_leakage
> 1
> root@skl:/tmp# perf record -e cycles:u ./div
> root@skl:/tmp# perf report --stdio
>
> ........ ....... ................ .............
>
> 47.53% div div [.] main
> 20.62% div libc-2.23.so [.] __random_r
> 15.32% div libc-2.23.so [.] __random
> 8.66% div div [.] compute_flag
> 4.53% div libc-2.23.so [.] rand
> 3.34% div div [.] rand@plt
> 0.00% div [kernel.vmlinux] [k] apic_timer_interrupt
> 0.00% div libc-2.23.so [.] intel_check_word
> 0.00% div ld-2.23.so [.] brk
> 0.00% div [kernel.vmlinux] [k] page_fault
> 0.00% div ld-2.23.so [.] _start
>
> We can see the kernel symbols apic_timer_interrupt and page_fault.
>
These kernel symbols do not match your description here. How much skid
do you think you have here?
You're saying you are at the user level, you get a counter overflow,
and the interrupted IP lands in the kernel
because you where there by the time the interrupt is delivered. How
many instructions does it take to get
from user land to apic_timer_interrupt() or page_fault()? These
functions are not right at the kernel entry,
I believe. So how did you get there, the skid must have been VERY big
or symbolization has a problem.
> Signed-off-by: Jin Yao <yao.jin@linux.intel.com>
> ---
> kernel/events/core.c | 58 ++++++++++++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 58 insertions(+)
>
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 80cca2b..7867541 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -7721,6 +7721,28 @@ int perf_event_account_interrupt(struct perf_event *event)
> return __perf_event_account_interrupt(event, 1);
> }
>
> +static int perf_allow_sample_leakage __read_mostly;
> +
> +static bool sample_is_allowed(struct perf_event *event, struct pt_regs *regs)
> +{
> + int allow_leakage = READ_ONCE(perf_allow_sample_leakage);
> +
> + if (allow_leakage)
> + return true;
> +
> + /*
> + * Due to interrupt latency (AKA "skid"), we may enter the
> + * kernel before taking an overflow, even if the PMU is only
> + * counting user events.
> + * To avoid leaking information to userspace, we must always
> + * reject kernel samples when exclude_kernel is set.
> + */
> + if (event->attr.exclude_kernel && !user_mode(regs))
> + return false;
> +
> + return true;
> +}
> +
> /*
> * Generic event overflow handling, sampling.
> */
> @@ -7742,6 +7764,12 @@ static int __perf_event_overflow(struct perf_event *event,
> ret = __perf_event_account_interrupt(event, throttle);
>
> /*
> + * For security, drop the skid kernel samples if necessary.
> + */
> + if (!sample_is_allowed(event, regs))
> + return ret;
> +
> + /*
> * XXX event_limit might not quite work as expected on inherited
> * events
> */
> @@ -9500,9 +9528,39 @@ perf_event_mux_interval_ms_store(struct device *dev,
> }
> static DEVICE_ATTR_RW(perf_event_mux_interval_ms);
>
> +static ssize_t
> +perf_allow_sample_leakage_show(struct device *dev,
> + struct device_attribute *attr, char *page)
> +{
> + int allow_leakage = READ_ONCE(perf_allow_sample_leakage);
> +
> + return snprintf(page, PAGE_SIZE-1, "%d\n", allow_leakage);
> +}
> +
> +static ssize_t
> +perf_allow_sample_leakage_store(struct device *dev,
> + struct device_attribute *attr,
> + const char *buf, size_t count)
> +{
> + int allow_leakage, ret;
> +
> + ret = kstrtoint(buf, 0, &allow_leakage);
> + if (ret)
> + return ret;
> +
> + if (allow_leakage != 0 && allow_leakage != 1)
> + return -EINVAL;
> +
> + WRITE_ONCE(perf_allow_sample_leakage, allow_leakage);
> +
> + return count;
> +}
> +static DEVICE_ATTR_RW(perf_allow_sample_leakage);
> +
> static struct attribute *pmu_dev_attrs[] = {
> &dev_attr_type.attr,
> &dev_attr_perf_event_mux_interval_ms.attr,
> + &dev_attr_perf_allow_sample_leakage.attr,
> NULL,
> };
> ATTRIBUTE_GROUPS(pmu_dev);
> --
> 2.7.4
>
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v1 1/2] perf/core: Use sysctl to turn on/off dropping leaked kernel samples
2018-06-15 5:59 ` Stephane Eranian
@ 2018-06-15 7:15 ` Jin, Yao
2018-06-19 16:50 ` Stephane Eranian
0 siblings, 1 reply; 29+ messages in thread
From: Jin, Yao @ 2018-06-15 7:15 UTC (permalink / raw)
To: Stephane Eranian
Cc: Arnaldo Carvalho de Melo, Jiri Olsa, Peter Zijlstra, Ingo Molnar,
Alexander Shishkin, me, LKML, Vince Weaver, Will Deacon,
Namhyung Kim, Andi Kleen, Liang, Kan, Jin, Yao
On 6/15/2018 1:59 PM, Stephane Eranian wrote:
> On Thu, Jun 14, 2018 at 7:10 PM Jin Yao <yao.jin@linux.intel.com> wrote:
>>
>> When doing sampling, for example:
>>
>> perf record -e cycles:u ...
>>
>> On workloads that do a lot of kernel entry/exits we see kernel
>> samples, even though :u is specified. This is due to skid existing.
>>
>> This might be a security issue because it can leak kernel addresses even
>> though kernel sampling support is disabled.
>>
>> One patch "perf/core: Drop kernel samples even though :u is specified"
>> was posted in last year but it was reverted because it introduced a
>> regression issue that broke the rr-project, which used sampling
>> events to receive a signal on overflow. These signals were critical
>> to the correct operation of rr.
>>
>> See '6a8a75f32357 ("Revert "perf/core: Drop kernel samples even
>> though :u is specified"")' for detail.
>>
>> Now the idea is to use sysctl to control the dropping of leaked
>> kernel samples.
>>
>> /sys/devices/cpu/perf_allow_sample_leakage:
>>
>> 0 - default, drop the leaked kernel samples.
>> 1 - don't drop the leaked kernel samples.
>>
>> For rr it can write 1 to /sys/devices/cpu/perf_allow_sample_leakage.
>>
>> For example,
>>
>> root@skl:/tmp# cat /sys/devices/cpu/perf_allow_sample_leakage
>> 0
>> root@skl:/tmp# perf record -e cycles:u ./div
>> root@skl:/tmp# perf report --stdio
>>
>> ........ ....... ............. ................
>>
>> 47.01% div div [.] main
>> 20.74% div libc-2.23.so [.] __random_r
>> 15.59% div libc-2.23.so [.] __random
>> 8.68% div div [.] compute_flag
>> 4.48% div libc-2.23.so [.] rand
>> 3.50% div div [.] rand@plt
>> 0.00% div ld-2.23.so [.] do_lookup_x
>> 0.00% div ld-2.23.so [.] memcmp
>> 0.00% div ld-2.23.so [.] _dl_start
>> 0.00% div ld-2.23.so [.] _start
>>
>> There is no kernel symbol reported.
>>
>> root@skl:/tmp# echo 1 > /sys/devices/cpu/perf_allow_sample_leakage
>> root@skl:/tmp# cat /sys/devices/cpu/perf_allow_sample_leakage
>> 1
>> root@skl:/tmp# perf record -e cycles:u ./div
>> root@skl:/tmp# perf report --stdio
>>
>> ........ ....... ................ .............
>>
>> 47.53% div div [.] main
>> 20.62% div libc-2.23.so [.] __random_r
>> 15.32% div libc-2.23.so [.] __random
>> 8.66% div div [.] compute_flag
>> 4.53% div libc-2.23.so [.] rand
>> 3.34% div div [.] rand@plt
>> 0.00% div [kernel.vmlinux] [k] apic_timer_interrupt
>> 0.00% div libc-2.23.so [.] intel_check_word
>> 0.00% div ld-2.23.so [.] brk
>> 0.00% div [kernel.vmlinux] [k] page_fault
>> 0.00% div ld-2.23.so [.] _start
>>
>> We can see the kernel symbols apic_timer_interrupt and page_fault.
>>
> These kernel symbols do not match your description here. How much skid
> do you think you have here?
> You're saying you are at the user level, you get a counter overflow,
> and the interrupted IP lands in the kernel
> because you where there by the time the interrupt is delivered. How
> many instructions does it take to get
> from user land to apic_timer_interrupt() or page_fault()? These
> functions are not right at the kernel entry,
> I believe. So how did you get there, the skid must have been VERY big
> or symbolization has a problem.
>
I'm testing with the latest perf/core branch (4.17+). Again I test with
Linux's vmstat (not test with my application).
perf record -e cycles:u vmstat 1
perf script -F ip
7f84e2b0bc30
7f84e2b0bc30
7f84e2b0bc30
7f84e2b0bc30
ffffffffb7a01070
7f84e2b243f0
7f84e2b11891
7f84e2b27f5e
7f84e25a3b26
7f84e25680f5
cat /proc/kallsyms | grep page_fault
....
ffffffffb7a01070 T page_fault
ffffffffb7a010a0 T async_page_fault
....
So one sample (ip ffffffffb7a01070) hits on page_fault.
Maybe you can have a try too. :)
Thanks
Jin Yao
>> Signed-off-by: Jin Yao <yao.jin@linux.intel.com>
>> ---
>> kernel/events/core.c | 58 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>> 1 file changed, 58 insertions(+)
>>
>> diff --git a/kernel/events/core.c b/kernel/events/core.c
>> index 80cca2b..7867541 100644
>> --- a/kernel/events/core.c
>> +++ b/kernel/events/core.c
>> @@ -7721,6 +7721,28 @@ int perf_event_account_interrupt(struct perf_event *event)
>> return __perf_event_account_interrupt(event, 1);
>> }
>>
>> +static int perf_allow_sample_leakage __read_mostly;
>> +
>> +static bool sample_is_allowed(struct perf_event *event, struct pt_regs *regs)
>> +{
>> + int allow_leakage = READ_ONCE(perf_allow_sample_leakage);
>> +
>> + if (allow_leakage)
>> + return true;
>> +
>> + /*
>> + * Due to interrupt latency (AKA "skid"), we may enter the
>> + * kernel before taking an overflow, even if the PMU is only
>> + * counting user events.
>> + * To avoid leaking information to userspace, we must always
>> + * reject kernel samples when exclude_kernel is set.
>> + */
>> + if (event->attr.exclude_kernel && !user_mode(regs))
>> + return false;
>> +
>> + return true;
>> +}
>> +
>> /*
>> * Generic event overflow handling, sampling.
>> */
>> @@ -7742,6 +7764,12 @@ static int __perf_event_overflow(struct perf_event *event,
>> ret = __perf_event_account_interrupt(event, throttle);
>>
>> /*
>> + * For security, drop the skid kernel samples if necessary.
>> + */
>> + if (!sample_is_allowed(event, regs))
>> + return ret;
>> +
>> + /*
>> * XXX event_limit might not quite work as expected on inherited
>> * events
>> */
>> @@ -9500,9 +9528,39 @@ perf_event_mux_interval_ms_store(struct device *dev,
>> }
>> static DEVICE_ATTR_RW(perf_event_mux_interval_ms);
>>
>> +static ssize_t
>> +perf_allow_sample_leakage_show(struct device *dev,
>> + struct device_attribute *attr, char *page)
>> +{
>> + int allow_leakage = READ_ONCE(perf_allow_sample_leakage);
>> +
>> + return snprintf(page, PAGE_SIZE-1, "%d\n", allow_leakage);
>> +}
>> +
>> +static ssize_t
>> +perf_allow_sample_leakage_store(struct device *dev,
>> + struct device_attribute *attr,
>> + const char *buf, size_t count)
>> +{
>> + int allow_leakage, ret;
>> +
>> + ret = kstrtoint(buf, 0, &allow_leakage);
>> + if (ret)
>> + return ret;
>> +
>> + if (allow_leakage != 0 && allow_leakage != 1)
>> + return -EINVAL;
>> +
>> + WRITE_ONCE(perf_allow_sample_leakage, allow_leakage);
>> +
>> + return count;
>> +}
>> +static DEVICE_ATTR_RW(perf_allow_sample_leakage);
>> +
>> static struct attribute *pmu_dev_attrs[] = {
>> &dev_attr_type.attr,
>> &dev_attr_perf_event_mux_interval_ms.attr,
>> + &dev_attr_perf_allow_sample_leakage.attr,
>> NULL,
>> };
>> ATTRIBUTE_GROUPS(pmu_dev);
>> --
>> 2.7.4
>>
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v1 1/2] perf/core: Use sysctl to turn on/off dropping leaked kernel samples
2018-06-15 7:15 ` Jin, Yao
@ 2018-06-19 16:50 ` Stephane Eranian
0 siblings, 0 replies; 29+ messages in thread
From: Stephane Eranian @ 2018-06-19 16:50 UTC (permalink / raw)
To: yao.jin
Cc: Arnaldo Carvalho de Melo, Jiri Olsa, Peter Zijlstra, Ingo Molnar,
Alexander Shishkin, Kyle Huey, LKML, Vince Weaver, Will Deacon,
Namhyung Kim, Andi Kleen, Liang, Kan, Jin, Yao
On Fri, Jun 15, 2018 at 12:15 AM Jin, Yao <yao.jin@linux.intel.com> wrote:
>
>
>
> On 6/15/2018 1:59 PM, Stephane Eranian wrote:
> > On Thu, Jun 14, 2018 at 7:10 PM Jin Yao <yao.jin@linux.intel.com> wrote:
> >>
> >> When doing sampling, for example:
> >>
> >> perf record -e cycles:u ...
> >>
> >> On workloads that do a lot of kernel entry/exits we see kernel
> >> samples, even though :u is specified. This is due to skid existing.
> >>
> >> This might be a security issue because it can leak kernel addresses even
> >> though kernel sampling support is disabled.
> >>
> >> One patch "perf/core: Drop kernel samples even though :u is specified"
> >> was posted in last year but it was reverted because it introduced a
> >> regression issue that broke the rr-project, which used sampling
> >> events to receive a signal on overflow. These signals were critical
> >> to the correct operation of rr.
> >>
> >> See '6a8a75f32357 ("Revert "perf/core: Drop kernel samples even
> >> though :u is specified"")' for detail.
> >>
> >> Now the idea is to use sysctl to control the dropping of leaked
> >> kernel samples.
> >>
> >> /sys/devices/cpu/perf_allow_sample_leakage:
> >>
> >> 0 - default, drop the leaked kernel samples.
> >> 1 - don't drop the leaked kernel samples.
> >>
> >> For rr it can write 1 to /sys/devices/cpu/perf_allow_sample_leakage.
> >>
> >> For example,
> >>
> >> root@skl:/tmp# cat /sys/devices/cpu/perf_allow_sample_leakage
> >> 0
> >> root@skl:/tmp# perf record -e cycles:u ./div
> >> root@skl:/tmp# perf report --stdio
> >>
> >> ........ ....... ............. ................
> >>
> >> 47.01% div div [.] main
> >> 20.74% div libc-2.23.so [.] __random_r
> >> 15.59% div libc-2.23.so [.] __random
> >> 8.68% div div [.] compute_flag
> >> 4.48% div libc-2.23.so [.] rand
> >> 3.50% div div [.] rand@plt
> >> 0.00% div ld-2.23.so [.] do_lookup_x
> >> 0.00% div ld-2.23.so [.] memcmp
> >> 0.00% div ld-2.23.so [.] _dl_start
> >> 0.00% div ld-2.23.so [.] _start
> >>
> >> There is no kernel symbol reported.
> >>
> >> root@skl:/tmp# echo 1 > /sys/devices/cpu/perf_allow_sample_leakage
> >> root@skl:/tmp# cat /sys/devices/cpu/perf_allow_sample_leakage
> >> 1
> >> root@skl:/tmp# perf record -e cycles:u ./div
> >> root@skl:/tmp# perf report --stdio
> >>
> >> ........ ....... ................ .............
> >>
> >> 47.53% div div [.] main
> >> 20.62% div libc-2.23.so [.] __random_r
> >> 15.32% div libc-2.23.so [.] __random
> >> 8.66% div div [.] compute_flag
> >> 4.53% div libc-2.23.so [.] rand
> >> 3.34% div div [.] rand@plt
> >> 0.00% div [kernel.vmlinux] [k] apic_timer_interrupt
> >> 0.00% div libc-2.23.so [.] intel_check_word
> >> 0.00% div ld-2.23.so [.] brk
> >> 0.00% div [kernel.vmlinux] [k] page_fault
> >> 0.00% div ld-2.23.so [.] _start
> >>
> >> We can see the kernel symbols apic_timer_interrupt and page_fault.
> >>
> > These kernel symbols do not match your description here. How much skid
> > do you think you have here?
> > You're saying you are at the user level, you get a counter overflow,
> > and the interrupted IP lands in the kernel
> > because you where there by the time the interrupt is delivered. How
> > many instructions does it take to get
> > from user land to apic_timer_interrupt() or page_fault()? These
> > functions are not right at the kernel entry,
> > I believe. So how did you get there, the skid must have been VERY big
> > or symbolization has a problem.
> >
>
> I'm testing with the latest perf/core branch (4.17+). Again I test with
> Linux's vmstat (not test with my application).
>
> perf record -e cycles:u vmstat 1
> perf script -F ip
>
> 7f84e2b0bc30
> 7f84e2b0bc30
> 7f84e2b0bc30
> 7f84e2b0bc30
> ffffffffb7a01070
> 7f84e2b243f0
> 7f84e2b11891
> 7f84e2b27f5e
> 7f84e25a3b26
> 7f84e25680f5
>
> cat /proc/kallsyms | grep page_fault
> ....
> ffffffffb7a01070 T page_fault
> ffffffffb7a010a0 T async_page_fault
> ....
>
> So one sample (ip ffffffffb7a01070) hits on page_fault.
>
> Maybe you can have a try too. :)
>
Ok, I tried and checked! These symbols are all in entry_64.S, these
are the first instructions executed on the timer intr or fault.
So it looks normal that they show up due to the interrupt skid, even
if the skid is 1 cycle.
PEBS, especially when using precise=1 could also show these symbols.
>
> Thanks
> Jin Yao
>
> >> Signed-off-by: Jin Yao <yao.jin@linux.intel.com>
> >> ---
> >> kernel/events/core.c | 58 ++++++++++++++++++++++++++++++++++++++++++++++++++++
> >> 1 file changed, 58 insertions(+)
> >>
> >> diff --git a/kernel/events/core.c b/kernel/events/core.c
> >> index 80cca2b..7867541 100644
> >> --- a/kernel/events/core.c
> >> +++ b/kernel/events/core.c
> >> @@ -7721,6 +7721,28 @@ int perf_event_account_interrupt(struct perf_event *event)
> >> return __perf_event_account_interrupt(event, 1);
> >> }
> >>
> >> +static int perf_allow_sample_leakage __read_mostly;
> >> +
> >> +static bool sample_is_allowed(struct perf_event *event, struct pt_regs *regs)
> >> +{
> >> + int allow_leakage = READ_ONCE(perf_allow_sample_leakage);
> >> +
> >> + if (allow_leakage)
> >> + return true;
> >> +
> >> + /*
> >> + * Due to interrupt latency (AKA "skid"), we may enter the
> >> + * kernel before taking an overflow, even if the PMU is only
> >> + * counting user events.
> >> + * To avoid leaking information to userspace, we must always
> >> + * reject kernel samples when exclude_kernel is set.
> >> + */
> >> + if (event->attr.exclude_kernel && !user_mode(regs))
> >> + return false;
> >> +
> >> + return true;
> >> +}
> >> +
> >> /*
> >> * Generic event overflow handling, sampling.
> >> */
> >> @@ -7742,6 +7764,12 @@ static int __perf_event_overflow(struct perf_event *event,
> >> ret = __perf_event_account_interrupt(event, throttle);
> >>
> >> /*
> >> + * For security, drop the skid kernel samples if necessary.
> >> + */
> >> + if (!sample_is_allowed(event, regs))
> >> + return ret;
> >> +
> >> + /*
> >> * XXX event_limit might not quite work as expected on inherited
> >> * events
> >> */
> >> @@ -9500,9 +9528,39 @@ perf_event_mux_interval_ms_store(struct device *dev,
> >> }
> >> static DEVICE_ATTR_RW(perf_event_mux_interval_ms);
> >>
> >> +static ssize_t
> >> +perf_allow_sample_leakage_show(struct device *dev,
> >> + struct device_attribute *attr, char *page)
> >> +{
> >> + int allow_leakage = READ_ONCE(perf_allow_sample_leakage);
> >> +
> >> + return snprintf(page, PAGE_SIZE-1, "%d\n", allow_leakage);
> >> +}
> >> +
> >> +static ssize_t
> >> +perf_allow_sample_leakage_store(struct device *dev,
> >> + struct device_attribute *attr,
> >> + const char *buf, size_t count)
> >> +{
> >> + int allow_leakage, ret;
> >> +
> >> + ret = kstrtoint(buf, 0, &allow_leakage);
> >> + if (ret)
> >> + return ret;
> >> +
> >> + if (allow_leakage != 0 && allow_leakage != 1)
> >> + return -EINVAL;
> >> +
> >> + WRITE_ONCE(perf_allow_sample_leakage, allow_leakage);
> >> +
> >> + return count;
> >> +}
> >> +static DEVICE_ATTR_RW(perf_allow_sample_leakage);
> >> +
> >> static struct attribute *pmu_dev_attrs[] = {
> >> &dev_attr_type.attr,
> >> &dev_attr_perf_event_mux_interval_ms.attr,
> >> + &dev_attr_perf_allow_sample_leakage.attr,
> >> NULL,
> >> };
> >> ATTRIBUTE_GROUPS(pmu_dev);
> >> --
> >> 2.7.4
> >>
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v1 1/2] perf/core: Use sysctl to turn on/off dropping leaked kernel samples
2018-06-15 10:03 ` [PATCH v1 1/2] perf/core: Use sysctl to turn on/off dropping " Jin Yao
2018-06-15 5:59 ` Stephane Eranian
@ 2018-06-15 6:02 ` Stephane Eranian
2018-06-15 8:16 ` Peter Zijlstra
2018-06-15 11:36 ` Mark Rutland
2 siblings, 1 reply; 29+ messages in thread
From: Stephane Eranian @ 2018-06-15 6:02 UTC (permalink / raw)
To: yao.jin
Cc: Arnaldo Carvalho de Melo, Jiri Olsa, Peter Zijlstra, Ingo Molnar,
Alexander Shishkin, me, LKML, Vince Weaver, Will Deacon,
Namhyung Kim, Andi Kleen, Liang, Kan, Jin, Yao
On Thu, Jun 14, 2018 at 7:10 PM Jin Yao <yao.jin@linux.intel.com> wrote:
>
> When doing sampling, for example:
>
> perf record -e cycles:u ...
>
> On workloads that do a lot of kernel entry/exits we see kernel
> samples, even though :u is specified. This is due to skid existing.
>
> This might be a security issue because it can leak kernel addresses even
> though kernel sampling support is disabled.
>
> One patch "perf/core: Drop kernel samples even though :u is specified"
> was posted in last year but it was reverted because it introduced a
> regression issue that broke the rr-project, which used sampling
> events to receive a signal on overflow. These signals were critical
> to the correct operation of rr.
>
> See '6a8a75f32357 ("Revert "perf/core: Drop kernel samples even
> though :u is specified"")' for detail.
>
> Now the idea is to use sysctl to control the dropping of leaked
> kernel samples.
>
> /sys/devices/cpu/perf_allow_sample_leakage:
>
> 0 - default, drop the leaked kernel samples.
> 1 - don't drop the leaked kernel samples.
>
> For rr it can write 1 to /sys/devices/cpu/perf_allow_sample_leakage.
>
> For example,
>
> root@skl:/tmp# cat /sys/devices/cpu/perf_allow_sample_leakage
> 0
> root@skl:/tmp# perf record -e cycles:u ./div
> root@skl:/tmp# perf report --stdio
>
> ........ ....... ............. ................
>
> 47.01% div div [.] main
> 20.74% div libc-2.23.so [.] __random_r
> 15.59% div libc-2.23.so [.] __random
> 8.68% div div [.] compute_flag
> 4.48% div libc-2.23.so [.] rand
> 3.50% div div [.] rand@plt
> 0.00% div ld-2.23.so [.] do_lookup_x
> 0.00% div ld-2.23.so [.] memcmp
> 0.00% div ld-2.23.so [.] _dl_start
> 0.00% div ld-2.23.so [.] _start
>
> There is no kernel symbol reported.
>
> root@skl:/tmp# echo 1 > /sys/devices/cpu/perf_allow_sample_leakage
> root@skl:/tmp# cat /sys/devices/cpu/perf_allow_sample_leakage
> 1
> root@skl:/tmp# perf record -e cycles:u ./div
> root@skl:/tmp# perf report --stdio
>
> ........ ....... ................ .............
>
> 47.53% div div [.] main
> 20.62% div libc-2.23.so [.] __random_r
> 15.32% div libc-2.23.so [.] __random
> 8.66% div div [.] compute_flag
> 4.53% div libc-2.23.so [.] rand
> 3.34% div div [.] rand@plt
> 0.00% div [kernel.vmlinux] [k] apic_timer_interrupt
> 0.00% div libc-2.23.so [.] intel_check_word
> 0.00% div ld-2.23.so [.] brk
> 0.00% div [kernel.vmlinux] [k] page_fault
> 0.00% div ld-2.23.so [.] _start
>
> We can see the kernel symbols apic_timer_interrupt and page_fault.
>
> Signed-off-by: Jin Yao <yao.jin@linux.intel.com>
> ---
> kernel/events/core.c | 58 ++++++++++++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 58 insertions(+)
>
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 80cca2b..7867541 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -7721,6 +7721,28 @@ int perf_event_account_interrupt(struct perf_event *event)
> return __perf_event_account_interrupt(event, 1);
> }
>
> +static int perf_allow_sample_leakage __read_mostly;
> +
> +static bool sample_is_allowed(struct perf_event *event, struct pt_regs *regs)
> +{
> + int allow_leakage = READ_ONCE(perf_allow_sample_leakage);
> +
> + if (allow_leakage)
> + return true;
> +
> + /*
> + * Due to interrupt latency (AKA "skid"), we may enter the
> + * kernel before taking an overflow, even if the PMU is only
> + * counting user events.
> + * To avoid leaking information to userspace, we must always
> + * reject kernel samples when exclude_kernel is set.
> + */
> + if (event->attr.exclude_kernel && !user_mode(regs))
> + return false;
> +
And how does that filter PEBS or LBR records?
> + return true;
> +}
> +
> /*
> * Generic event overflow handling, sampling.
> */
> @@ -7742,6 +7764,12 @@ static int __perf_event_overflow(struct perf_event *event,
> ret = __perf_event_account_interrupt(event, throttle);
>
> /*
> + * For security, drop the skid kernel samples if necessary.
> + */
> + if (!sample_is_allowed(event, regs))
> + return ret;
> +
> + /*
> * XXX event_limit might not quite work as expected on inherited
> * events
> */
> @@ -9500,9 +9528,39 @@ perf_event_mux_interval_ms_store(struct device *dev,
> }
> static DEVICE_ATTR_RW(perf_event_mux_interval_ms);
>
> +static ssize_t
> +perf_allow_sample_leakage_show(struct device *dev,
> + struct device_attribute *attr, char *page)
> +{
> + int allow_leakage = READ_ONCE(perf_allow_sample_leakage);
> +
> + return snprintf(page, PAGE_SIZE-1, "%d\n", allow_leakage);
> +}
> +
> +static ssize_t
> +perf_allow_sample_leakage_store(struct device *dev,
> + struct device_attribute *attr,
> + const char *buf, size_t count)
> +{
> + int allow_leakage, ret;
> +
> + ret = kstrtoint(buf, 0, &allow_leakage);
> + if (ret)
> + return ret;
> +
> + if (allow_leakage != 0 && allow_leakage != 1)
> + return -EINVAL;
> +
> + WRITE_ONCE(perf_allow_sample_leakage, allow_leakage);
> +
> + return count;
> +}
> +static DEVICE_ATTR_RW(perf_allow_sample_leakage);
> +
> static struct attribute *pmu_dev_attrs[] = {
> &dev_attr_type.attr,
> &dev_attr_perf_event_mux_interval_ms.attr,
> + &dev_attr_perf_allow_sample_leakage.attr,
> NULL,
> };
> ATTRIBUTE_GROUPS(pmu_dev);
> --
> 2.7.4
>
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v1 1/2] perf/core: Use sysctl to turn on/off dropping leaked kernel samples
2018-06-15 6:02 ` Stephane Eranian
@ 2018-06-15 8:16 ` Peter Zijlstra
2018-06-15 13:31 ` Liang, Kan
0 siblings, 1 reply; 29+ messages in thread
From: Peter Zijlstra @ 2018-06-15 8:16 UTC (permalink / raw)
To: Stephane Eranian
Cc: yao.jin, Arnaldo Carvalho de Melo, Jiri Olsa, Ingo Molnar,
Alexander Shishkin, me, LKML, Vince Weaver, Will Deacon,
Namhyung Kim, Andi Kleen, Liang, Kan, Jin, Yao
On Thu, Jun 14, 2018 at 11:02:53PM -0700, Stephane Eranian wrote:
> On Thu, Jun 14, 2018 at 7:10 PM Jin Yao <yao.jin@linux.intel.com> wrote:
> > + /*
> > + * Due to interrupt latency (AKA "skid"), we may enter the
> > + * kernel before taking an overflow, even if the PMU is only
> > + * counting user events.
> > + * To avoid leaking information to userspace, we must always
> > + * reject kernel samples when exclude_kernel is set.
> > + */
> > + if (event->attr.exclude_kernel && !user_mode(regs))
> > + return false;
> > +
> And how does that filter PEBS or LBR records?
I suspect the user_mode() thing actually covers PEBS, but yes LBR might
need additional filtering.
^ permalink raw reply [flat|nested] 29+ messages in thread
* RE: [PATCH v1 1/2] perf/core: Use sysctl to turn on/off dropping leaked kernel samples
2018-06-15 8:16 ` Peter Zijlstra
@ 2018-06-15 13:31 ` Liang, Kan
2018-06-18 10:41 ` Peter Zijlstra
0 siblings, 1 reply; 29+ messages in thread
From: Liang, Kan @ 2018-06-15 13:31 UTC (permalink / raw)
To: Peter Zijlstra, Stephane Eranian
Cc: yao.jin, Arnaldo Carvalho de Melo, Jiri Olsa, Ingo Molnar,
Alexander Shishkin, me, LKML, Vince Weaver, Will Deacon,
Namhyung Kim, Andi Kleen, Jin, Yao
> On Thu, Jun 14, 2018 at 11:02:53PM -0700, Stephane Eranian wrote:
> > On Thu, Jun 14, 2018 at 7:10 PM Jin Yao <yao.jin@linux.intel.com> wrote:
> > > + /*
> > > + * Due to interrupt latency (AKA "skid"), we may enter the
> > > + * kernel before taking an overflow, even if the PMU is only
> > > + * counting user events.
> > > + * To avoid leaking information to userspace, we must always
> > > + * reject kernel samples when exclude_kernel is set.
> > > + */
> > > + if (event->attr.exclude_kernel && !user_mode(regs))
> > > + return false;
> > > +
> > And how does that filter PEBS or LBR records?
>
> I suspect the user_mode() thing actually covers PEBS, but yes LBR might need
> additional filtering.
I think the large PEBS still need to be specially handled.
Thanks,
Kan
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v1 1/2] perf/core: Use sysctl to turn on/off dropping leaked kernel samples
2018-06-15 13:31 ` Liang, Kan
@ 2018-06-18 10:41 ` Peter Zijlstra
0 siblings, 0 replies; 29+ messages in thread
From: Peter Zijlstra @ 2018-06-18 10:41 UTC (permalink / raw)
To: Liang, Kan
Cc: Stephane Eranian, yao.jin, Arnaldo Carvalho de Melo, Jiri Olsa,
Ingo Molnar, Alexander Shishkin, me, LKML, Vince Weaver,
Will Deacon, Namhyung Kim, Andi Kleen, Jin, Yao
On Fri, Jun 15, 2018 at 01:31:34PM +0000, Liang, Kan wrote:
> > On Thu, Jun 14, 2018 at 11:02:53PM -0700, Stephane Eranian wrote:
> > > On Thu, Jun 14, 2018 at 7:10 PM Jin Yao <yao.jin@linux.intel.com> wrote:
> > > > + /*
> > > > + * Due to interrupt latency (AKA "skid"), we may enter the
> > > > + * kernel before taking an overflow, even if the PMU is only
> > > > + * counting user events.
> > > > + * To avoid leaking information to userspace, we must always
> > > > + * reject kernel samples when exclude_kernel is set.
> > > > + */
> > > > + if (event->attr.exclude_kernel && !user_mode(regs))
> > > > + return false;
> > > > +
> > > And how does that filter PEBS or LBR records?
> >
> > I suspect the user_mode() thing actually covers PEBS, but yes LBR might need
> > additional filtering.
>
> I think the large PEBS still need to be specially handled.
large pebs should be fine too, all pebs stuff eventually calls
perf_event_output() with a synthesized register set.
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v1 1/2] perf/core: Use sysctl to turn on/off dropping leaked kernel samples
2018-06-15 10:03 ` [PATCH v1 1/2] perf/core: Use sysctl to turn on/off dropping " Jin Yao
2018-06-15 5:59 ` Stephane Eranian
2018-06-15 6:02 ` Stephane Eranian
@ 2018-06-15 11:36 ` Mark Rutland
2018-06-16 1:27 ` Linus Torvalds
2018-06-18 6:55 ` Jin, Yao
2 siblings, 2 replies; 29+ messages in thread
From: Mark Rutland @ 2018-06-15 11:36 UTC (permalink / raw)
To: Jin Yao
Cc: acme, jolsa, peterz, mingo, alexander.shishkin, me, Linux-kernel,
vincent.weaver, will.deacon, eranian, namhyung, ak, kan.liang,
yao.jin
On Fri, Jun 15, 2018 at 06:03:22PM +0800, Jin Yao wrote:
> When doing sampling, for example:
>
> perf record -e cycles:u ...
>
> On workloads that do a lot of kernel entry/exits we see kernel
> samples, even though :u is specified. This is due to skid existing.
>
> This might be a security issue because it can leak kernel addresses even
> though kernel sampling support is disabled.
>
> One patch "perf/core: Drop kernel samples even though :u is specified"
> was posted in last year but it was reverted because it introduced a
> regression issue that broke the rr-project, which used sampling
> events to receive a signal on overflow. These signals were critical
> to the correct operation of rr.
>
> See '6a8a75f32357 ("Revert "perf/core: Drop kernel samples even
> though :u is specified"")' for detail.
>
> Now the idea is to use sysctl to control the dropping of leaked
> kernel samples.
>
> /sys/devices/cpu/perf_allow_sample_leakage:
>
> 0 - default, drop the leaked kernel samples.
> 1 - don't drop the leaked kernel samples.
Does this need to be conditional at all?
At least for sampling the GPRs, we could do something like below
unconditionally, which seems sufficient for my test cases.
Mark.
---->8----
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 67612ce359ad..79a21531d57c 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -6359,6 +6359,24 @@ perf_callchain(struct perf_event *event, struct pt_regs *regs)
return callchain ?: &__empty_callchain;
}
+static struct pt_regs *perf_get_sample_regs(struct perf_event *event, struct pt_regs *regs)
+{
+ /*
+ * Due to interrupt latency (AKA "skid"), we may enter the kernel
+ * before taking an overflow, even if the PMU is only counting user
+ * events.
+ *
+ * If we're not counting kernel events, always use the user regs when
+ * sampling.
+ *
+ * TODO: how does this interact with guest sampling?
+ */
+ if (event->attr.exclude_kernel && !user_mode(regs))
+ return task_pt_regs(current);
+
+ return regs;
+}
+
void perf_prepare_sample(struct perf_event_header *header,
struct perf_sample_data *data,
struct perf_event *event,
@@ -6366,6 +6384,8 @@ void perf_prepare_sample(struct perf_event_header *header,
{
u64 sample_type = event->attr.sample_type;
+ regs = perf_get_sample_regs(event, regs);
+
header->type = PERF_RECORD_SAMPLE;
header->size = sizeof(*header) + event->header_size;
^ permalink raw reply related [flat|nested] 29+ messages in thread
* Re: [PATCH v1 1/2] perf/core: Use sysctl to turn on/off dropping leaked kernel samples
2018-06-15 11:36 ` Mark Rutland
@ 2018-06-16 1:27 ` Linus Torvalds
2018-06-18 10:51 ` Peter Zijlstra
2018-06-18 6:55 ` Jin, Yao
1 sibling, 1 reply; 29+ messages in thread
From: Linus Torvalds @ 2018-06-16 1:27 UTC (permalink / raw)
To: Mark Rutland
Cc: yao.jin, Arnaldo Carvalho de Melo, Jiri Olsa, Peter Zijlstra,
Ingo Molnar, Alexander Shishkin, Kyle Huey,
Linux Kernel Mailing List, Vince Weaver, Will Deacon,
Stephane Eranian, Namhyung Kim, Andi Kleen, Liang, Kan, yao.jin
On Fri, Jun 15, 2018 at 8:36 PM Mark Rutland <mark.rutland@arm.com> wrote:
>
> At least for sampling the GPRs, we could do something like below
> unconditionally, which seems sufficient for my test cases.
Ack.
The PEBS case may need checking, but maybe PEBS doesn't even have this issue?
Linus
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v1 1/2] perf/core: Use sysctl to turn on/off dropping leaked kernel samples
2018-06-16 1:27 ` Linus Torvalds
@ 2018-06-18 10:51 ` Peter Zijlstra
0 siblings, 0 replies; 29+ messages in thread
From: Peter Zijlstra @ 2018-06-18 10:51 UTC (permalink / raw)
To: Linus Torvalds
Cc: Mark Rutland, yao.jin, Arnaldo Carvalho de Melo, Jiri Olsa,
Ingo Molnar, Alexander Shishkin, Kyle Huey,
Linux Kernel Mailing List, Vince Weaver, Will Deacon,
Stephane Eranian, Namhyung Kim, Andi Kleen, Liang, Kan, yao.jin
On Sat, Jun 16, 2018 at 10:27:27AM +0900, Linus Torvalds wrote:
> On Fri, Jun 15, 2018 at 8:36 PM Mark Rutland <mark.rutland@arm.com> wrote:
> >
> > At least for sampling the GPRs, we could do something like below
> > unconditionally, which seems sufficient for my test cases.
>
> Ack.
>
> The PEBS case may need checking, but maybe PEBS doesn't even have this issue?
Sadly PEBS is susceptible as well, but the proposed patch actually works
for that as well.
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v1 1/2] perf/core: Use sysctl to turn on/off dropping leaked kernel samples
2018-06-15 11:36 ` Mark Rutland
2018-06-16 1:27 ` Linus Torvalds
@ 2018-06-18 6:55 ` Jin, Yao
2018-06-18 10:45 ` Peter Zijlstra
1 sibling, 1 reply; 29+ messages in thread
From: Jin, Yao @ 2018-06-18 6:55 UTC (permalink / raw)
To: Mark Rutland
Cc: acme, jolsa, peterz, mingo, alexander.shishkin, me, Linux-kernel,
vincent.weaver, will.deacon, eranian, namhyung, ak, kan.liang,
yao.jin
On 6/15/2018 7:36 PM, Mark Rutland wrote:
> On Fri, Jun 15, 2018 at 06:03:22PM +0800, Jin Yao wrote:
>> When doing sampling, for example:
>>
>> perf record -e cycles:u ...
>>
>> On workloads that do a lot of kernel entry/exits we see kernel
>> samples, even though :u is specified. This is due to skid existing.
>>
>> This might be a security issue because it can leak kernel addresses even
>> though kernel sampling support is disabled.
>>
>> One patch "perf/core: Drop kernel samples even though :u is specified"
>> was posted in last year but it was reverted because it introduced a
>> regression issue that broke the rr-project, which used sampling
>> events to receive a signal on overflow. These signals were critical
>> to the correct operation of rr.
>>
>> See '6a8a75f32357 ("Revert "perf/core: Drop kernel samples even
>> though :u is specified"")' for detail.
>>
>> Now the idea is to use sysctl to control the dropping of leaked
>> kernel samples.
>>
>> /sys/devices/cpu/perf_allow_sample_leakage:
>>
>> 0 - default, drop the leaked kernel samples.
>> 1 - don't drop the leaked kernel samples.
>
> Does this need to be conditional at all?
>
> At least for sampling the GPRs, we could do something like below
> unconditionally, which seems sufficient for my test cases.
>
> Mark.
>
> ---->8----
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 67612ce359ad..79a21531d57c 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -6359,6 +6359,24 @@ perf_callchain(struct perf_event *event, struct pt_regs *regs)
> return callchain ?: &__empty_callchain;
> }
>
> +static struct pt_regs *perf_get_sample_regs(struct perf_event *event, struct pt_regs *regs)
> +{
> + /*
> + * Due to interrupt latency (AKA "skid"), we may enter the kernel
> + * before taking an overflow, even if the PMU is only counting user
> + * events.
> + *
> + * If we're not counting kernel events, always use the user regs when
> + * sampling.
> + *
> + * TODO: how does this interact with guest sampling?
> + */
> + if (event->attr.exclude_kernel && !user_mode(regs))
> + return task_pt_regs(current);
> +
> + return regs;
> +}
> +
> void perf_prepare_sample(struct perf_event_header *header,
> struct perf_sample_data *data,
> struct perf_event *event,
> @@ -6366,6 +6384,8 @@ void perf_prepare_sample(struct perf_event_header *header,
> {
> u64 sample_type = event->attr.sample_type;
>
> + regs = perf_get_sample_regs(event, regs);
> +
> header->type = PERF_RECORD_SAMPLE;
> header->size = sizeof(*header) + event->header_size;
>
>
Hi Mark,
Thanks for providing the patch. I understand this approach.
In my opinion, the skid window is from counter overflow to interrupt
delivered. While if the skid window is too *big* (e.g. user -> kernel),
it should be not very useful. So personally, I'd prefer to drop the samples.
Thanks
Jin Yao
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v1 1/2] perf/core: Use sysctl to turn on/off dropping leaked kernel samples
2018-06-18 6:55 ` Jin, Yao
@ 2018-06-18 10:45 ` Peter Zijlstra
2018-06-19 1:39 ` Jin, Yao
0 siblings, 1 reply; 29+ messages in thread
From: Peter Zijlstra @ 2018-06-18 10:45 UTC (permalink / raw)
To: Jin, Yao
Cc: Mark Rutland, acme, jolsa, mingo, alexander.shishkin, me,
Linux-kernel, vincent.weaver, will.deacon, eranian, namhyung, ak,
kan.liang, yao.jin
On Mon, Jun 18, 2018 at 02:55:32PM +0800, Jin, Yao wrote:
> Thanks for providing the patch. I understand this approach.
>
> In my opinion, the skid window is from counter overflow to interrupt
> delivered. While if the skid window is too *big* (e.g. user -> kernel), it
> should be not very useful. So personally, I'd prefer to drop the samples.
I really don't get your insitence on dropping the sample. Dropping
samples is bad. Furthermore, doing what Mark suggests actually improves
the result by reducing the skid, if the event happened before we entered
(as it damn well should) then the user regs, which point at the entry
site, are a better approximation than our in-kernel set.
So not only do you not loose the sample, you actually get a better
sample.
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v1 1/2] perf/core: Use sysctl to turn on/off dropping leaked kernel samples
2018-06-18 10:45 ` Peter Zijlstra
@ 2018-06-19 1:39 ` Jin, Yao
2018-06-19 6:01 ` Mark Rutland
0 siblings, 1 reply; 29+ messages in thread
From: Jin, Yao @ 2018-06-19 1:39 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Mark Rutland, acme, jolsa, mingo, alexander.shishkin, me,
Linux-kernel, vincent.weaver, will.deacon, eranian, namhyung, ak,
kan.liang, yao.jin
On 6/18/2018 6:45 PM, Peter Zijlstra wrote:
> On Mon, Jun 18, 2018 at 02:55:32PM +0800, Jin, Yao wrote:
>> Thanks for providing the patch. I understand this approach.
>>
>> In my opinion, the skid window is from counter overflow to interrupt
>> delivered. While if the skid window is too *big* (e.g. user -> kernel), it
>> should be not very useful. So personally, I'd prefer to drop the samples.
>
> I really don't get your insitence on dropping the sample. Dropping
> samples is bad. Furthermore, doing what Mark suggests actually improves
> the result by reducing the skid, if the event happened before we entered
> (as it damn well should) then the user regs, which point at the entry
> site, are a better approximation than our in-kernel set.
>
> So not only do you not loose the sample, you actually get a better
> sample.
>
OK, that's fine, thanks!
I guess Mark will post this patch, right?
Anyway looks we don't need following patch (0-stuffing sample->ip to
indicate perf tool that it is a leak sample), right?
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 80cca2b..628b515 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -6361,6 +6361,21 @@ perf_callchain(struct perf_event *event, struct
pt_regs *regs)
return callchain ?: &__empty_callchain;
}
+static bool sample_is_leaked(struct perf_event *event, struct pt_regs
*regs)
+{
+ /*
+ * Due to interrupt latency (AKA "skid"), we may enter the
+ * kernel before taking an overflow, even if the PMU is only
+ * counting user events.
+ * To avoid leaking information to userspace, we must always
+ * reject kernel samples when exclude_kernel is set.
+ */
+ if (event->attr.exclude_kernel && !user_mode(regs))
+ return true;
+
+ return false;
+}
+
void perf_prepare_sample(struct perf_event_header *header,
struct perf_sample_data *data,
struct perf_event *event,
@@ -6480,6 +6495,9 @@ void perf_prepare_sample(struct perf_event_header
*header,
if (sample_type & PERF_SAMPLE_PHYS_ADDR)
data->phys_addr = perf_virt_to_phys(data->addr);
+
+ if (sample_is_leaked(event, regs))
+ data->ip = 0;
}
static void __always_inline
diff --git a/tools/perf/util/event.h b/tools/perf/util/event.h
index bfa60bc..1bfb697 100644
--- a/tools/perf/util/event.h
+++ b/tools/perf/util/event.h
@@ -404,6 +404,7 @@ struct events_stats {
u64 total_aux_lost;
u64 total_aux_partial;
u64 total_invalid_chains;
+ u64 total_dropped_samples;
u32 nr_events[PERF_RECORD_HEADER_MAX];
u32 nr_non_filtered_samples;
u32 nr_lost_warned;
diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
index 8b93693..ec923f1 100644
--- a/tools/perf/util/session.c
+++ b/tools/perf/util/session.c
@@ -1269,6 +1269,12 @@ static int machines__deliver_event(struct
machines *machines,
++evlist->stats.nr_unprocessable_samples;
return 0;
}
+
+ if (sample->ip == 0) {
+ /* Drop the leaked kernel samples */
+ ++evlist->stats.total_dropped_samples;
+ return 0;
+ }
return perf_evlist__deliver_sample(evlist, tool, event,
sample, evsel, machine);
case PERF_RECORD_MMAP:
return tool->mmap(tool, event, sample, machine);
Thanks
Jin Yao
^ permalink raw reply related [flat|nested] 29+ messages in thread
* Re: [PATCH v1 1/2] perf/core: Use sysctl to turn on/off dropping leaked kernel samples
2018-06-19 1:39 ` Jin, Yao
@ 2018-06-19 6:01 ` Mark Rutland
0 siblings, 0 replies; 29+ messages in thread
From: Mark Rutland @ 2018-06-19 6:01 UTC (permalink / raw)
To: Jin, Yao
Cc: Peter Zijlstra, acme, jolsa, mingo, alexander.shishkin, me,
Linux-kernel, vincent.weaver, will.deacon, eranian, namhyung, ak,
kan.liang, yao.jin
On Tue, Jun 19, 2018 at 09:39:02AM +0800, Jin, Yao wrote:
>
>
> On 6/18/2018 6:45 PM, Peter Zijlstra wrote:
> > On Mon, Jun 18, 2018 at 02:55:32PM +0800, Jin, Yao wrote:
> > > Thanks for providing the patch. I understand this approach.
> > >
> > > In my opinion, the skid window is from counter overflow to interrupt
> > > delivered. While if the skid window is too *big* (e.g. user -> kernel), it
> > > should be not very useful. So personally, I'd prefer to drop the samples.
> >
> > I really don't get your insitence on dropping the sample. Dropping
> > samples is bad. Furthermore, doing what Mark suggests actually improves
> > the result by reducing the skid, if the event happened before we entered
> > (as it damn well should) then the user regs, which point at the entry
> > site, are a better approximation than our in-kernel set.
> >
> > So not only do you not loose the sample, you actually get a better
> > sample.
> >
>
> OK, that's fine, thanks!
>
> I guess Mark will post this patch, right?
I'll try to spin something shortly -- I'm just figuring out how this should
work with guest sampling.
Thanks,
Mark.
^ permalink raw reply [flat|nested] 29+ messages in thread