* Perf loses events without reporting
@ 2021-06-30 17:29 Stephen Brennan
  2021-07-01 10:37 ` Jiri Olsa
  0 siblings, 1 reply; 8+ messages in thread

From: Stephen Brennan @ 2021-06-30 17:29 UTC (permalink / raw)
To: linux-perf-users

Hi all,

I've been trying to understand the behavior of the x86_64 performance
monitoring interrupt (PMI), specifically when IRQs are disabled. Since it's
an NMI, it should still trigger and record events. However, I've noticed
that when interrupts are disabled for a long time, events seem to be
silently dropped, and I'm wondering if this is expected behavior.

To test this, I created a simple kernel module "irqoff" which creates a
file /proc/irqoff_sleep_millis. On write, the module uses spin_lock_irq()
to disable interrupts, and then issues an mdelay() call for whatever
number of milliseconds was written. This allows us to busy wait with IRQs
disabled. (Source for the module is at the end of this email.)

When I use perf to record a write to this file, we see the following:

$ sudo perf record -e cycles -c 100000 -- sh -c 'echo 2000 > /proc/irqoff_sleep_millis'
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.030 MB perf.data (58 samples) ]

$ sudo perf script
# ... filtered down:
sh 62863 52318.991716:     100000 cycles:  ffffffff8a8237a9 delay_tsc+0x39 ([kernel.kallsyms])
sh 62863 52318.991740:     100000 cycles:  ffffffff8a823797 delay_tsc+0x27 ([kernel.kallsyms])
sh 62863 52318.991765:     100000 cycles:  ffffffff8a823797 delay_tsc+0x27 ([kernel.kallsyms])
# ^ v ~2 second gap!
sh 62863 52320.963900:     100000 cycles:  ffffffff8ae47417 _raw_spin_lock_irqsave+0x27 ([kernel.kallsyms])
sh 62863 52320.963923:     100000 cycles:  ffffffff8ae47417 _raw_spin_lock_irqsave+0x27 ([kernel.kallsyms])
sh 62863 52320.963948:     100000 cycles:  ffffffff8ab1db9a handle_tx_event+0x2da ([kernel.kallsyms])

perf stat shows the following counters over a similar run:

$ sudo perf stat -- sh -c 'echo 2000 > /proc/irqoff_sleep_millis'

Performance counter stats for 'sh -c echo 2000 > /proc/irqoff_sleep_millis':

          1,975.55 msec task-clock           #  0.999 CPUs utilized
                 1      context-switches     #  0.001 K/sec
                 0      cpu-migrations       #  0.000 K/sec
                61      page-faults          #  0.031 K/sec
     7,952,267,470      cycles               #  4.025 GHz
       541,904,608      instructions         #  0.07  insn per cycle
        83,406,021      branches             # 42.219 M/sec
            10,365      branch-misses        #  0.01% of all branches

       1.977234595 seconds time elapsed
       0.000000000 seconds user
       1.977162000 seconds sys

According to this, we should see roughly 79k samples (7.9 billion cycles
/ 100k sample period), but perf only gets 58. What it "looks like" to me
is that the CPU ring buffer might run out of space after several events,
and the perf process doesn't get scheduled soon enough to read the data?
But in my experience, perf usually reports that it missed some events. So
I wonder if anybody is familiar with the factors at play when IRQs are
disabled during a PMI? I'd appreciate any pointers to guide my
exploration.

My test case here ran on the Ubuntu distro kernel 5.11.0-22-generic, and
I have also tested on a 5.4-based kernel. I'm happy to reproduce this on
a mainline kernel too.
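(As a sanity check on the "roughly 79k" figure, here is the arithmetic as a tiny sketch; the values are copied from the perf stat output above, and nothing here is kernel code.)

```python
# Rough expected-sample estimate from the perf stat numbers above.
cycles = 7_952_267_470       # total cycles counted by perf stat
period = 100_000             # sampling period passed via -c
expected = cycles // period  # samples we'd naively expect
observed = 58                # samples perf record actually wrote

print(expected)              # 79522 -- the "roughly 79k" figure
print(expected / observed)   # a more-than-thousand-fold shortfall
```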
Thanks,
Stephen

Makefile:
<<<
obj-m += irqoff.o

all:
	make -C /lib/modules/$(shell uname -r)/build M=$(PWD) modules

clean:
	make -C /lib/modules/$(shell uname -r)/build M=$(PWD) clean
>>>

irqoff.c:
<<<
#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/init.h>
#include <linux/proc_fs.h>
#include <linux/spinlock.h>
#include <linux/delay.h>
#include <linux/version.h>
#include <linux/uaccess.h>

MODULE_LICENSE("GPL");
MODULE_DESCRIPTION("Test module that allows to disable IRQ for configurable time");
MODULE_AUTHOR("Stephen Brennan <stephen.s.brennan@oracle.com>");

// Store the proc dir entry we can use to check status
struct proc_dir_entry *pde = NULL;

DEFINE_SPINLOCK(irqoff_lock);

static noinline void irqsoff_inirq_delay(unsigned long millis)
{
	mdelay(millis);
}

static ssize_t irqsoff_write(struct file *f, const char __user *data, size_t amt, loff_t *off)
{
	char buf[32];
	int rv;
	unsigned long millis = 0;

	if (amt > sizeof(buf) - 1)
		return -EFBIG;

	if ((rv = copy_from_user(buf, data, amt)) != 0)
		return -EFAULT;

	buf[amt] = '\0';

	if (sscanf(buf, "%lu", &millis) != 1)
		return -EINVAL;

	/* We read a number of milliseconds.
	   Threshold it at 5 minutes for safety. */
	if (millis > 5 * 60 * 1000)
		return -EINVAL;

	pr_info("[irqoff] lock for %lu millis\n", millis);
	spin_lock_irq(&irqoff_lock);
	irqsoff_inirq_delay(millis);
	spin_unlock_irq(&irqoff_lock);

	return amt;
}

static ssize_t irqsoff_read(struct file *f, char __user *data, size_t amt, loff_t *off)
{
	return 0;
}

#if LINUX_VERSION_CODE < KERNEL_VERSION(5,6,0)
static const struct file_operations irqsoff_fops = {
	.owner = THIS_MODULE,
	.read = irqsoff_read,
	.write = irqsoff_write,
};
#else
static const struct proc_ops irqsoff_fops = {
	.proc_read = irqsoff_read,
	.proc_write = irqsoff_write,
};
#endif

static int irqoff_init(void)
{
	pde = proc_create("irqoff_sleep_millis", 0644, NULL, &irqsoff_fops);
	if (!pde)
		return -ENOENT;

	pr_info("[irqoff] successfully initialized\n");
	return 0;
}

static void irqoff_exit(void)
{
	proc_remove(pde);
	pde = NULL;
}

module_init(irqoff_init);
module_exit(irqoff_exit);
>>>
* Re: Perf loses events without reporting
  2021-06-30 17:29 Perf loses events without reporting Stephen Brennan
@ 2021-07-01 10:37 ` Jiri Olsa
  2021-07-01 16:45   ` Stephen Brennan

From: Jiri Olsa @ 2021-07-01 10:37 UTC (permalink / raw)
To: Stephen Brennan; +Cc: linux-perf-users

On Wed, Jun 30, 2021 at 10:29:13AM -0700, Stephen Brennan wrote:
> Hi all,
>
> I've been trying to understand the behavior of the x86_64 performance
> monitoring interrupt, specifically when IRQ is disabled. Since it's an
> NMI, it should still trigger and record events. However, I've noticed
> that when interrupts are disabled for a long time, events seem to be
> silently dropped, and I'm wondering if this is expected behavior.
>
> To test this, I created a simple kernel module "irqoff" which creates a
> file /proc/irqoff_sleep_millis. On write, the module uses
> "spin_lock_irq()" to disable interrupts, and then issues an mdelay()
> call for whatever number of milliseconds was written. This allows us to
> busy wait with IRQ disabled. (Source for the module at the end of this
> email).
>
> When I use perf to record a write to this file, we see the following:
>
> $ sudo perf record -e cycles -c 100000 -- sh -c 'echo 2000 > /proc/irqoff_sleep_millis'

seems strange.. I'll check

could you see that also when monitoring the cpu? like:

  $ sudo perf record -e cycles -c 100000 -C 1 -- taskset -c 1 sh -c ..

jirka

> [ perf record: Woken up 1 times to write data ]
> [ perf record: Captured and wrote 0.030 MB perf.data (58 samples) ]
>
> $ sudo perf script
> # ... filtered down:
> sh 62863 52318.991716:     100000 cycles:  ffffffff8a8237a9 delay_tsc+0x39 ([kernel.kallsyms])
> sh 62863 52318.991740:     100000 cycles:  ffffffff8a823797 delay_tsc+0x27 ([kernel.kallsyms])
> sh 62863 52318.991765:     100000 cycles:  ffffffff8a823797 delay_tsc+0x27 ([kernel.kallsyms])
> # ^ v ~2 second gap!
* Re: Perf loses events without reporting
  2021-07-01 10:37 ` Jiri Olsa
@ 2021-07-01 16:45   ` Stephen Brennan
  2021-07-01 17:05     ` Stephen Brennan

From: Stephen Brennan @ 2021-07-01 16:45 UTC (permalink / raw)
To: Jiri Olsa; +Cc: linux-perf-users

Jiri Olsa <jolsa@redhat.com> writes:
> On Wed, Jun 30, 2021 at 10:29:13AM -0700, Stephen Brennan wrote:

SNIP

>> When I use perf to record a write to this file, we see the following:
>>
>> $ sudo perf record -e cycles -c 100000 -- sh -c 'echo 2000 > /proc/irqoff_sleep_millis'
>
> seems strange.. I'll check
>
> could you see that also when monitoring the cpu? like:
>
> $ sudo perf record -e cycles -c 100000 -C 1 -- taskset -c 1 sh -c ..
>
> jirka

Thanks for taking a look. I tried monitoring the CPU instead of the task
and saw the exact same result -- about a two-second gap in the resulting
data when viewed through perf script. I tried specifying a few different
CPUs just in case.

I've reproduced this behavior on a few bare-metal (no VM) systems with
different CPUs, and my initial observation is that it seems to be Intel
specific.
Below are two Intel CPUs which show the exact same behavior: (a) perf
stat shows the cycles counter in the billions, (b) orders of magnitude
fewer samples than expected, and (c) a large ~2s gap in the data.

** Server (v5.4 based)
Vendor ID:    GenuineIntel
CPU family:   6
Model:        79
Model name:   Intel(R) Xeon(R) CPU E5-2699C v4 @ 2.20GHz
Stepping:     1

** Laptop (v5.11 based)
Vendor ID:    GenuineIntel
CPU family:   6
Model:        142
Model name:   Intel(R) Core(TM) i7-8665U CPU @ 1.90GHz
Stepping:     12

On the other hand, with AMD, I get similarly few samples, but the data
seems correct. Here is the CPU info for my AMD test system:

** AMD Server (v5.4 based)
Vendor ID:    AuthenticAMD
CPU family:   23
Model:        1
Model name:   AMD EPYC 7551 32-Core Processor
Stepping:     2

$ sudo perf stat -- sh -c 'echo 2000 >/proc/irqoff_sleep_millis'

Performance counter stats for 'sh -c echo 2000 >/proc/irqoff_sleep_millis':

          2,000.83 msec task-clock                #     1.000 CPUs utilized
                 1      context-switches          #     0.000 K/sec
                 1      cpu-migrations            #     0.000 K/sec
               130      page-faults               #     0.065 K/sec
         7,196,575      cycles                    #     0.004 GHz                    (99.75%)
     3,606,261,791      stalled-cycles-frontend   # 50110.81% frontend cycles idle   (0.29%)
         3,156,105      stalled-cycles-backend    #    43.86% backend cycles idle    (99.99%)
         3,399,963      instructions              #     0.47  insn per cycle
                                                  #  1060.68  stalled cycles per insn
           655,974      branches                  #     0.328 M/sec
            12,880      branch-misses             #     1.96% of all branches        (99.97%)

       2.001346768 seconds time elapsed
       0.396201000 seconds user
       1.605213000 seconds sys

Since the cycles counter is so low, I assume that the AMD PMU doesn't
count idle/stalled/nop cycles from the mdelay() function call, which
seems reasonable. Maybe this is an example of frequency scaling? The
recorded perf data during the mdelay() call shows no gap, just very few
samples that uniformly stretch over the runtime:

$ sudo perf record -e cycles -c 100000 -- sh -c 'echo 2000 >/proc/irqoff_sleep_millis'
...
$ sudo perf script
...
sh 14305   683.219192:     100000 cycles:  ffffffffb0b7a172 delay_mwaitx+0x72 ([kernel.kallsyms])
sh 14305   683.300989:     100000 cycles:  ffffffffb0b7a172 delay_mwaitx+0x72 ([kernel.kallsyms])
sh 14305   683.383783:     100000 cycles:  ffffffffb0b7a172 delay_mwaitx+0x72 ([kernel.kallsyms])
sh 14305   683.465580:     100000 cycles:  ffffffffb0b7a172 delay_mwaitx+0x72 ([kernel.kallsyms])
sh 14305   683.547376:     100000 cycles:  ffffffffb0b7a172 delay_mwaitx+0x72 ([kernel.kallsyms])
sh 14305   683.629173:     100000 cycles:  ffffffffb0b7a172 delay_mwaitx+0x72 ([kernel.kallsyms])
sh 14305   683.710970:     100000 cycles:  ffffffffb0b7a172 delay_mwaitx+0x72 ([kernel.kallsyms])
sh 14305   683.793764:     100000 cycles:  ffffffffb0b7a172 delay_mwaitx+0x72 ([kernel.kallsyms])
sh 14305   683.875561:     100000 cycles:  ffffffffb0b7a172 delay_mwaitx+0x72 ([kernel.kallsyms])
sh 14305   683.957357:     100000 cycles:  ffffffffb0b7a172 delay_mwaitx+0x72 ([kernel.kallsyms])
sh 14305   684.039154:     100000 cycles:  ffffffffb0b7a172 delay_mwaitx+0x72 ([kernel.kallsyms])
sh 14305   684.120951:     100000 cycles:  ffffffffb0b7a172 delay_mwaitx+0x72 ([kernel.kallsyms])
sh 14305   684.203745:     100000 cycles:  ffffffffb0b7a172 delay_mwaitx+0x72 ([kernel.kallsyms])
sh 14305   684.285542:     100000 cycles:  ffffffffb0b7a172 delay_mwaitx+0x72 ([kernel.kallsyms])
sh 14305   684.367339:     100000 cycles:  ffffffffb0b7a172 delay_mwaitx+0x72 ([kernel.kallsyms])
sh 14305   684.449135:     100000 cycles:  ffffffffb0b7a172 delay_mwaitx+0x72 ([kernel.kallsyms])
sh 14305   684.530932:     100000 cycles:  ffffffffb0b7a172 delay_mwaitx+0x72 ([kernel.kallsyms])
sh 14305   684.613726:     100000 cycles:  ffffffffb0b7a172 delay_mwaitx+0x72 ([kernel.kallsyms])
sh 14305   684.695523:     100000 cycles:  ffffffffb0b7a172 delay_mwaitx+0x72 ([kernel.kallsyms])
sh 14305   684.778317:     100000 cycles:  ffffffffb0b7a172 delay_mwaitx+0x72 ([kernel.kallsyms])
sh 14305   684.861111:     100000 cycles:  ffffffffb0b7a172 delay_mwaitx+0x72 ([kernel.kallsyms])
sh 14305   684.942908:     100000 cycles:  ffffffffb0b7a172 delay_mwaitx+0x72 ([kernel.kallsyms])
sh 14305   684.978837:     100000 cycles:  ffffffffb0203d71 syscall_slow_exit_work+0x111 ([kernel.kallsyms])
sh 14305   684.978890:     100000 cycles:  ffffffffb029d8a1 __raw_spin_unlock+0x1 ([kernel.kallsyms])
sh 14305   684.978922:     100000 cycles:  ffffffffb04502f9 unmap_page_range+0x639 ([kernel.kallsyms])
...

While this sort of data (very few data points during a long period of
time) is not ideal for my debugging, it's at least consistent, and a
well-known gotcha of perf. I also double-checked monitoring a single CPU
on AMD (rather than the task) and saw the same results.

So hopefully this information helps narrow down the exploration. It
seems to be Intel PMU specific!

Thanks,
Stephen

>> [ perf record: Woken up 1 times to write data ]
>> [ perf record: Captured and wrote 0.030 MB perf.data (58 samples) ]
>>
>> $ sudo perf script
>> # ... filtered down:
>> sh 62863 52318.991716:     100000 cycles:  ffffffff8a8237a9 delay_tsc+0x39 ([kernel.kallsyms])
>> sh 62863 52318.991740:     100000 cycles:  ffffffff8a823797 delay_tsc+0x27 ([kernel.kallsyms])
>> sh 62863 52318.991765:     100000 cycles:  ffffffff8a823797 delay_tsc+0x27 ([kernel.kallsyms])
>> # ^ v ~2 second gap!
* Re: Perf loses events without reporting
  2021-07-01 16:45 ` Stephen Brennan
@ 2021-07-01 17:05   ` Stephen Brennan
  2021-07-01 18:00     ` Stephen Brennan

From: Stephen Brennan @ 2021-07-01 17:05 UTC (permalink / raw)
To: Jiri Olsa; +Cc: linux-perf-users

Stephen Brennan <stephen.s.brennan@oracle.com> writes:
> So hopefully this information helps narrow down the exploration. It
> seems to be Intel PMU specific!

Or more accurately, we only observe it on the Intel systems I've
checked. The AMD systems record fewer cycle events in mdelay() and thus
may not have a chance to skip events. I'll try updating my kernel module
to busy wait by looping until jiffies is above a particular value. This
would hopefully avoid any CPU-specific idling which would reduce the
cycle count on AMD. I'll update this thread with my findings after
testing.

Thanks again,
Stephen
* Re: Perf loses events without reporting
  2021-07-01 17:05 ` Stephen Brennan
@ 2021-07-01 18:00   ` Stephen Brennan
  2021-07-02 16:22     ` Jiri Olsa

From: Stephen Brennan @ 2021-07-01 18:00 UTC (permalink / raw)
To: Jiri Olsa; +Cc: linux-perf-users

Stephen Brennan <stephen.s.brennan@oracle.com> writes:
> Stephen Brennan <stephen.s.brennan@oracle.com> writes:
>> So hopefully this information helps narrow down the exploration. It
>> seems to be Intel PMU specific!
>
> Or more accurately, we only observe it on the Intel systems I've
> checked. The AMD systems record fewer cycle events in mdelay() and thus
> may not have a chance to skip events. I'll try updating my kernel module
> to busy wait by looping until jiffies is above a particular value. This
> would hopefully avoid any CPU-specific idling which would reduce the
> cycle count on AMD. I'll update this thread with my findings after
> testing.

Quite sorry for the spam, but I have completed my testing on AMD. By
abandoning mdelay() and using a jiffies-based busy loop (updated module
code at the end of this message), I was able to reproduce the 2-second
gap in the perf data. It seems that the difference here has more to do
with the implementation of mdelay().

Regards,
Stephen

Updated module code:

/*
 * irqoff.c: Linux kernel module for disabling interrupts and delaying for a
 * certain amount of time. Useful for experimenting with perf, lockup detector,
 * and more.
 */
#include <linux/module.h>
#include <linux/moduleparam.h>
#include <linux/kernel.h>
#include <linux/init.h>
#include <linux/proc_fs.h>
#include <linux/spinlock.h>
#include <linux/delay.h>
#include <linux/version.h>
#include <linux/uaccess.h>
#include <linux/types.h>
#include <linux/jiffies.h>
#include <asm/msr.h>

MODULE_LICENSE("GPL");
MODULE_DESCRIPTION("Test module that allows to disable IRQ for configurable time");
MODULE_AUTHOR("Stephen Brennan <stephen.s.brennan@oracle.com>");

// Store the proc dir entry we can use to check status
struct proc_dir_entry *pde = NULL;

DEFINE_SPINLOCK(irqoff_lock);

bool use_mdelay = false;
module_param(use_mdelay, bool, false);

static noinline void irqsoff_inirq_manual_delay(unsigned long millis)
{
	uint64_t start_tsc, tsc;
	unsigned long jiffies_start = jiffies;
	unsigned long jiffies_end = jiffies + millis * HZ / 1000;

	/*
	 * With interrupts disabled, it's possible that jiffies may not get
	 * updated! To account for this, we'll use the x86_64 cycle count too.
	 * After 100 million cycles, if jiffies has not been updated, then it's
	 * likely that we've accidentally disabled the timer interrupt and we
	 * can't accurately time our sleep, so we should bail out.
	 */
#define TSC_EMERGENCY 100000000
	start_tsc = rdtsc();

	pr_info("[irqoff] manual delay %lu millis (jiffies cur=%lu target=%lu)\n",
	        millis, jiffies_start, jiffies_end);
	while (time_before(jiffies, jiffies_end)) {
		asm volatile ("nop");
		asm volatile ("nop");
		asm volatile ("nop");
		asm volatile ("nop");
		asm volatile ("nop");
		tsc = rdtsc();
		if ((tsc >= start_tsc + TSC_EMERGENCY) && (jiffies == jiffies_start)) {
			pr_info("[irqoff] looks like we disabled timer interrupt, "
			        "bailing! (TSC start %llu, cur %llu)\n",
			        start_tsc, tsc);
			break;
		}
	}
	pr_info("[irqoff] manual delay complete, jiffies: %lu\n", jiffies);
}

static void irqsoff_inirq_delay(unsigned long millis)
{
	if (use_mdelay) {
		mdelay(millis);
	} else {
		irqsoff_inirq_manual_delay(millis);
	}
}

static ssize_t irqsoff_write(struct file *f, const char __user *data, size_t amt, loff_t *off)
{
	char buf[32];
	int rv;
	unsigned long millis = 0;

	if (amt > sizeof(buf) - 1)
		return -EFBIG;

	if ((rv = copy_from_user(buf, data, amt)) != 0)
		return -EFAULT;

	buf[amt] = '\0';

	if (sscanf(buf, "%lu", &millis) != 1)
		return -EINVAL;

	/* Threshold it at 5 minutes for safety. */
	if (millis > 5 * 60 * 1000)
		return -EINVAL;

	pr_info("[irqoff] lock for %lu millis\n", millis);
	spin_lock_irq(&irqoff_lock);
	irqsoff_inirq_delay(millis);
	spin_unlock_irq(&irqoff_lock);

	return amt;
}

static ssize_t irqsoff_read(struct file *f, char __user *data, size_t amt, loff_t *off)
{
	return 0;
}

#if LINUX_VERSION_CODE < KERNEL_VERSION(5,6,0)
static const struct file_operations irqsoff_fops = {
	.owner = THIS_MODULE,
	.read = irqsoff_read,
	.write = irqsoff_write,
};
#else
static const struct proc_ops irqsoff_fops = {
	.proc_read = irqsoff_read,
	.proc_write = irqsoff_write,
};
#endif

static int irqoff_init(void)
{
	pde = proc_create("irqoff_sleep_millis", 0644, NULL, &irqsoff_fops);
	if (!pde)
		return -ENOENT;

	pr_info("[irqoff] successfully initialized\n");
	return 0;
}

static void irqoff_exit(void)
{
	proc_remove(pde);
	pde = NULL;
	pr_info("[irqoff] exited\n");
}

module_init(irqoff_init);
module_exit(irqoff_exit);
* Re: Perf loses events without reporting
  2021-07-01 18:00 ` Stephen Brennan
@ 2021-07-02 16:22   ` Jiri Olsa
  2021-07-02 20:12     ` Jiri Olsa

From: Jiri Olsa @ 2021-07-02 16:22 UTC (permalink / raw)
To: Stephen Brennan; +Cc: linux-perf-users

On Thu, Jul 01, 2021 at 11:00:24AM -0700, Stephen Brennan wrote:

SNIP

> 	pr_info("[irqoff] lock for %lu millis\n", millis);
> 	spin_lock_irq(&irqoff_lock);
> 	irqsoff_inirq_delay(millis);
> 	spin_unlock_irq(&irqoff_lock);

yea, I can see that as well.. if I take out the spin_lock_irq it works
as expected.. will check

thanks,
jirka
* Re: Perf loses events without reporting
  2021-07-02 16:22 ` Jiri Olsa
@ 2021-07-02 20:12   ` Jiri Olsa
  2021-07-02 22:48     ` Stephen Brennan

From: Jiri Olsa @ 2021-07-02 20:12 UTC (permalink / raw)
To: Stephen Brennan; +Cc: linux-perf-users, Peter Zijlstra

On Fri, Jul 02, 2021 at 06:22:10PM +0200, Jiri Olsa wrote:
> On Thu, Jul 01, 2021 at 11:00:24AM -0700, Stephen Brennan wrote:
>
> SNIP
>
>> 	pr_info("[irqoff] lock for %lu millis\n", millis);
>> 	spin_lock_irq(&irqoff_lock);
>> 	irqsoff_inirq_delay(millis);
>> 	spin_unlock_irq(&irqoff_lock);
>
> yea, I can see that as well.. if I take out the spin_lock_irq it works
> as expected.. will check

ok, so by disabling irqs you disable the task tick, which kicks off the
unthrottling

throttling is started when you cross the limit of samples per tick,
which you do real fast with busy waiting

then when the event is throttled (and no samples are generated), the
next task tick restores it.. but with disabled irqs there's no task
tick, so the event stays throttled the whole time irqs are disabled

is there any use case where you have to have irqs disabled for so long?

cc-ing peterz
initial email: https://lore.kernel.org/linux-perf-users/YN89LwSzTKc8wBar@krava/T/#m616481aa95da10e17d9b0769717f006b9ff38637

jirka
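(The throttling behavior described in this message can be sketched as a toy model; this is plain Python, not kernel code, and the per-tick limit and NMI counts are made-up numbers. The real unthrottling lives in the kernel's tick path, which is exactly what never runs while IRQs are off.)

```python
# Toy model of perf sample throttling: each PMI may record a sample unless
# the event is throttled; crossing the per-tick sample limit throttles it,
# and only a timer tick (an IRQ!) unthrottles it again.
def run(nmis, per_tick_limit, ticks_arrive):
    throttled = False
    samples_this_tick = 0
    recorded = 0
    for _ in range(nmis):
        if ticks_arrive:                         # tick between NMIs:
            throttled = False                    #   unthrottle the event
            samples_this_tick = 0                #   and reset the counter
        if throttled:
            continue                             # sample silently dropped
        recorded += 1
        samples_this_tick += 1
        if samples_this_tick > per_tick_limit:   # crossed the limit:
            throttled = True                     #   throttle until next tick
    return recorded

print(run(1000, 10, ticks_arrive=True))   # 1000: ticks keep unthrottling
print(run(1000, 10, ticks_arrive=False))  # 11: throttled early, stays throttled
```

With ticks arriving, all 1000 NMIs record a sample; with ticks suppressed (IRQs disabled), only a handful of samples land before the event stays throttled for the rest of the window, matching the 58-samples-out-of-79k observation.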
* Re: Perf loses events without reporting
  2021-07-02 20:12 ` Jiri Olsa
@ 2021-07-02 22:48   ` Stephen Brennan
  0 siblings, 0 replies

From: Stephen Brennan @ 2021-07-02 22:48 UTC (permalink / raw)
To: Jiri Olsa; +Cc: linux-perf-users, Peter Zijlstra

Jiri Olsa <jolsa@redhat.com> writes:
> ok, so by disabling irqs you disable the task tick, which kicks off the
> unthrottling
>
> throttling is started when you cross the limit of samples per tick,
> which you do real fast with busy waiting
>
> then when the event is throttled (and no samples are generated), the
> next task tick restores it.. but with disabled irqs there's no task
> tick, so the event stays throttled the whole time irqs are disabled

Ah, this makes sense; I'll take a look at the throttling code to
understand it a bit better myself.

> is there any use case where you have to have irqs disabled for so long?

I don't have a good use case for keeping IRQs disabled so long. However,
one of my use cases for perf is as a debugging tool for anything which
doesn't result in panics.
One common enough issue is spinlock contention, during which IRQs can be
disabled. Normally, if a lock is contended, IRQs are disabled and then
re-enabled very frequently. In these cases, you wouldn't necessarily
notice lost events because either (a) they wouldn't be lost, or (b) the
event could be unthrottled frequently. But sometimes a buggy lock is
held for a longer time, and the event loss can make it hard to see the
contention in perf.

For example, in a recent issue we saw a spinlock held with IRQs disabled
for several seconds (causing other threads to pile up waiting). The wait
was long enough to cause serious issues with devices due to missed
interrupts, but short enough not to trigger the configured lockup
detector. Our perf recording didn't show any particular "hot functions"
at this spinlock, since so few samples were captured there, despite the
large amount of time spent on it.

Ideally for this sort of debugging it would be best to have all the
samples. However, failing that, it would be useful to have a reliable
detection mechanism for these sorts of gaps. I have a script right now
for frequency-based sampling which detects "holes" -- whenever the
period between samples is more than 10x the expected sample period. Of
course the sample period can never be perfect, but 10x seems to be a
reasonable threshold that only really gets hit during idle (due to
frequency scaling, I'd guess). This script is decent but still has false
positives.

Thanks for this analysis, it's quite helpful!

Stephen

> cc-ing peterz
> initial email: https://lore.kernel.org/linux-perf-users/YN89LwSzTKc8wBar@krava/T/#m616481aa95da10e17d9b0769717f006b9ff38637
>
> jirka
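(The hole-detection idea is straightforward to sketch. This is a hypothetical illustration, not Stephen's actual script: it operates on timestamps parsed from `perf script` output, using the 10x threshold described above.)

```python
# Flag "holes": gaps between consecutive sample timestamps that exceed
# `factor` times the expected inter-sample period.
def find_holes(timestamps, expected_period_s, factor=10.0):
    threshold = factor * expected_period_s
    return [(a, b) for a, b in zip(timestamps, timestamps[1:])
            if b - a > threshold]

# Timestamps from the Intel trace earlier in the thread; samples arrived
# roughly every 25 microseconds before the gap.
ts = [52318.991716, 52318.991740, 52318.991765, 52320.963900, 52320.963923]
holes = find_holes(ts, expected_period_s=25e-6)
print(holes)  # one hole: the ~2 second throttled window
```

In practice the expected period would come from the sampling frequency (or, for period-based sampling, an estimate of the event rate), and as noted above, idle periods with frequency scaling can still produce false positives.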
end of thread, other threads: [~2021-07-02 22:49 UTC | newest]

Thread overview: 8+ messages -- links below jump to the message on this page:
2021-06-30 17:29 Perf loses events without reporting Stephen Brennan
2021-07-01 10:37 ` Jiri Olsa
2021-07-01 16:45   ` Stephen Brennan
2021-07-01 17:05     ` Stephen Brennan
2021-07-01 18:00       ` Stephen Brennan
2021-07-02 16:22         ` Jiri Olsa
2021-07-02 20:12           ` Jiri Olsa
2021-07-02 22:48             ` Stephen Brennan