* Regression in reading /proc/stat in the newer kernels with large SMP and NUMA configurations
@ 2011-10-13 23:40 Oberman, Laurence (HAS GSE)
2011-10-14 10:08 ` David Rientjes
2011-10-14 12:38 ` Eric Dumazet
0 siblings, 2 replies; 16+ messages in thread
From: Oberman, Laurence (HAS GSE) @ 2011-10-13 23:40 UTC (permalink / raw)
To: linux-kernel; +Cc: Seger, Mark, Cabaniols, Sebastien
Hello,
The HP-developed collectl utility depends heavily on the abstractions in /proc and /sys provided by the kernel for kernel statistics capture.
During testing on newer NUMA platforms with large numbers of CPUs, HP has observed a significant increase in the time taken to parse this data with the newer 2.6.32.x and higher kernels.
In summary:
To read the data for a DL785 8-node NUMA with 4 CPUs per node takes around 80 microseconds per read on the 2.6.18.x kernels, but when testing on the same platform with the newer 2.6.32.x and higher kernels each read takes more than 5 ms. This is significant when considering that collectl typically collects this data from the kernel every 10 seconds (default) for certain statistics.
Given that the kernel has significantly more data structure space to parse, and that there has to be some locking interaction to get this data, it is not unusual for the parsing to take longer. However, we were alarmed by the significant increase in overhead and wanted to make people aware of it.
Supporting Data
---------------
Mimicking a day of capture at 10 second intervals gives us 8640 reads of /proc/stat
I wanted to first validate whether this was architecture or kernel behavior, and it is actually kernel overhead on the 2.6.3x and higher kernels. I tested with 2.6.38, but the same applies to the RHEL 6 and SLES 11 2.6.32.x kernels.
Same architecture, just different kernels for my testing.
The newer kernel is around 100 times slower so it shows up in collectl behavior.
DL785 8-node NUMA with 128GB memory and 32 CPUs.
dl785sles:/sys/devices/system/node # ls
node0 node1 node2 node3 node4 node5 node6 node7
dl785sles:/sys/devices/system/node # pwd
/sys/devices/system/node
On the 2.6.16.x series SLES or 2.6.18.x RHEL kernels, each read takes 84 us.
dl785sles:~/wrk # cat 2.6.16.60-0.87.1-smp.log
Opened, read and closed 8640 times and read total of 13996800 bytes
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
99.41 0.725717 84 8641 read
0.38 0.002768 0 8642 open
0.21 0.001539 0 8642 close
0.00 0.000000 0 1 write
0.00 0.000000 0 3 fstat
0.00 0.000000 0 8 mmap
0.00 0.000000 0 2 mprotect
0.00 0.000000 0 1 munmap
0.00 0.000000 0 1 brk
0.00 0.000000 0 1 1 access
0.00 0.000000 0 1 madvise
0.00 0.000000 0 1 execve
0.00 0.000000 0 1 arch_prctl
------ ----------- ----------- --------- --------- ----------------
100.00 0.730024 25945 1 total
On the 2.6.38.4 kernel each read takes > 6ms
dl785sles:~/wrk # cat 2.6.38.4-smp.log
Opened, read and closed 8640 times and read total of 86235840 bytes
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
100.00 59.021650 6830 8641 read
0.00 0.000520 0 8642 open
0.00 0.000425 0 8642 close
0.00 0.000000 0 1 write
0.00 0.000000 0 3 fstat
0.00 0.000000 0 9 mmap
0.00 0.000000 0 2 mprotect
0.00 0.000000 0 1 munmap
0.00 0.000000 0 1 brk
0.00 0.000000 0 1 1 access
0.00 0.000000 0 1 madvise
0.00 0.000000 0 1 execve
0.00 0.000000 0 1 arch_prctl
------ ----------- ----------- --------- --------- ----------------
100.00 59.022595 25946 1 total
Test program
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>
#include <errno.h>

static char buffer[16384];

int main(int argc, char *argv[])
{
    int fid, i, count, total = 0;

    for (count = 0; count < 8640; count++) {
        fid = open("/proc/stat", O_RDONLY);
        if (fid < 0) {
            perror("error with file open");
            exit(1);
        }
        i = read(fid, buffer, sizeof(buffer));
        /* printf("Read %d characters\n", i); */
        if (i > 0)
            total += i;
        close(fid);
    }
    printf("Opened, read and closed %d times and read total of %d bytes\n",
           count, total);
    exit(0);
}
Thanks
Laurence Oberman
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: Regression in reading /proc/stat in the newer kernels with large SMP and NUMA configurations
2011-10-13 23:40 Regression in reading /proc/stat in the newer kernels with large SMP and NUMA configurations Oberman, Laurence (HAS GSE)
@ 2011-10-14 10:08 ` David Rientjes
2011-10-14 10:35 ` Eric Dumazet
2011-10-14 12:38 ` Eric Dumazet
1 sibling, 1 reply; 16+ messages in thread
From: David Rientjes @ 2011-10-14 10:08 UTC (permalink / raw)
To: Oberman, Laurence (HAS GSE)
Cc: linux-kernel, Seger, Mark, Cabaniols, Sebastien
On Fri, 14 Oct 2011, Oberman, Laurence (HAS GSE) wrote:
> On the 2.6.16.x series sles or 2.6.18.x RHEL kernels each read takes 84
> us.
>
> dl785sles:~/wrk # cat 2.6.16.60-0.87.1-smp.log Opened, read and closed 8640 times and read total of 13996800 bytes
> % time seconds usecs/call calls errors syscall
> ------ ----------- ----------- --------- --------- ----------------
> 99.41 0.725717 84 8641 read
> 0.38 0.002768 0 8642 open
> 0.21 0.001539 0 8642 close
> 0.00 0.000000 0 1 write
> 0.00 0.000000 0 3 fstat
> 0.00 0.000000 0 8 mmap
> 0.00 0.000000 0 2 mprotect
> 0.00 0.000000 0 1 munmap
> 0.00 0.000000 0 1 brk
> 0.00 0.000000 0 1 1 access
> 0.00 0.000000 0 1 madvise
> 0.00 0.000000 0 1 execve
> 0.00 0.000000 0 1 arch_prctl
> ------ ----------- ----------- --------- --------- ----------------
> 100.00 0.730024 25945 1 total
>
> On the 2.6.38.4 kernel each read takes > 6ms
>
> dl785sles:~/wrk # cat 2.6.38.4-smp.log
> Opened, read and closed 8640 times and read total of 86235840 bytes
> % time seconds usecs/call calls errors syscall
> ------ ----------- ----------- --------- --------- ----------------
> 100.00 59.021650 6830 8641 read
> 0.00 0.000520 0 8642 open
> 0.00 0.000425 0 8642 close
> 0.00 0.000000 0 1 write
> 0.00 0.000000 0 3 fstat
> 0.00 0.000000 0 9 mmap
> 0.00 0.000000 0 2 mprotect
> 0.00 0.000000 0 1 munmap
> 0.00 0.000000 0 1 brk
> 0.00 0.000000 0 1 1 access
> 0.00 0.000000 0 1 madvise
> 0.00 0.000000 0 1 execve
> 0.00 0.000000 0 1 arch_prctl
> ------ ----------- ----------- --------- --------- ----------------
> 100.00 59.022595 25946 1 total
>
The overhead is probably in kstat_irqs_cpu() which is called for each
possible irq for each of the 32 possible cpus, and /proc/stat actually
does the sum twice. You would see the same type of overhead with
/proc/interrupts if it wasn't masked by the locking that it requires to
safely read irq_desc. "dmesg | grep nr_irqs" will show how many percpu
variables are being read for every cpu twice.
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: Regression in reading /proc/stat in the newer kernels with large SMP and NUMA configurations
2011-10-14 10:08 ` David Rientjes
@ 2011-10-14 10:35 ` Eric Dumazet
0 siblings, 0 replies; 16+ messages in thread
From: Eric Dumazet @ 2011-10-14 10:35 UTC (permalink / raw)
To: David Rientjes
Cc: Oberman, Laurence (HAS GSE),
linux-kernel, Seger, Mark, Cabaniols, Sebastien
On Friday, 14 October 2011 at 03:08 -0700, David Rientjes wrote:
> The overhead is probably in kstat_irqs_cpu() which is called for each
> possible irq for each of the 32 possible cpus, and /proc/stat actually
> does the sum twice. You would see the same type of overhead with
> /proc/interrupts if it wasn't masked by the locking that it requires to
> safely read irq_desc. "dmesg | grep nr_irqs" will show how many percpu
> variables are being read for every cpu twice.
One annoying thing with most HP servers is they claim more possible cpus
than reality. So each time we have to loop on possible cpus, we waste
time and memory cache lines.
For example, on this ProLiant BL460c G6, with two quad core cpus
(2x4x2), 32 'possible' cpus are found, instead of 16 :
setup_percpu: NR_CPUS:64 nr_cpumask_bits:64 nr_cpu_ids:32 nr_node_ids:2
PERCPU: Embedded 26 pages/cpu @ffff88007dc00000 s76288 r8192 d22016 u131072
pcpu-alloc: s76288 r8192 d22016 u131072 alloc=1*2097152
pcpu-alloc: [0] 00 02 04 06 08 10 12 14 17 19 21 23 25 27 29 31
pcpu-alloc: [1] 01 03 05 07 09 11 13 15 16 18 20 22 24 26 28 30
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: Regression in reading /proc/stat in the newer kernels with large SMP and NUMA configurations
2011-10-13 23:40 Regression in reading /proc/stat in the newer kernels with large SMP and NUMA configurations Oberman, Laurence (HAS GSE)
2011-10-14 10:08 ` David Rientjes
@ 2011-10-14 12:38 ` Eric Dumazet
2011-10-14 13:36 ` Seger, Mark
1 sibling, 1 reply; 16+ messages in thread
From: Eric Dumazet @ 2011-10-14 12:38 UTC (permalink / raw)
To: Oberman, Laurence (HAS GSE)
Cc: linux-kernel, Seger, Mark, Cabaniols, Sebastien
On Friday, 14 October 2011 at 00:40 +0100, Oberman, Laurence (HAS GSE) wrote:
> Hello,
>
> The HP developed collectl utility depends heavily on abstractions in /proc and /sys provided by the kernel for kernel statistics capture.
> During testing on newer NUMA platforms with large numbers of CPU’S HP has observed a significant increase in the time taken to parse this data with the newer 2.6.32.x and higher kernels.
>
> In summary:
>
> To read the data for a DL785 8-node NUMA with 4 CPU’S per node takes around 80 microseconds per read on the 2.6.18.x kernels, but when testing on the same platform with the newer 2.6.32.x and higher kernels each read takes more than 5 ms. This is significant when considering that collectl typically collects the data abstraction in the kernel every 10 seconds (default) for certain statistics.
>
> Given that the kernel has significantly more data structure space to parse and that there has to be some locking interaction to get this data, it is not unusual for the parsing to take longer. However we were alarmed with the significant increase in overhead and wanted to make people aware of this.
>
> Supporting Data
> ---------------
>
> Mimicking a day of capture at 10 second intervals gives us 8640 reads of /proc/stat
>
> I wanted to first validate if this was architecture or kernel behavior and it is actually kernel overhead on the 2.6.3x and higher kernels. I tested with 2.6.38 but the same applies for the RHEL 6 and SLES 11 2.6.32.x kernels.
>
> Same architecture, just different kernels for my testing.
>
> The newer kernel is around 100 times slower so it shows up in collectl behavior.
>
> DL785 8-node NUMA with 128GB memory and 32 CPU'S.
>
> l785sles:/sys/devices/system/node # ls
> node0 node1 node2 node3 node4 node5 node6 node7 dl785sles:/sys/devices/system/node # pwd
> /sys/devices/system/node
>
> On the 2.6.16.x series sles or 2.6.18.x RHEL kernels each read takes 84 us.
>
> dl785sles:~/wrk # cat 2.6.16.60-0.87.1-smp.log Opened, read and closed 8640 times and read total of 13996800 bytes
> % time seconds usecs/call calls errors syscall
> ------ ----------- ----------- --------- --------- ----------------
> 99.41 0.725717 84 8641 read
> 0.38 0.002768 0 8642 open
> 0.21 0.001539 0 8642 close
> 0.00 0.000000 0 1 write
> 0.00 0.000000 0 3 fstat
> 0.00 0.000000 0 8 mmap
> 0.00 0.000000 0 2 mprotect
> 0.00 0.000000 0 1 munmap
> 0.00 0.000000 0 1 brk
> 0.00 0.000000 0 1 1 access
> 0.00 0.000000 0 1 madvise
> 0.00 0.000000 0 1 execve
> 0.00 0.000000 0 1 arch_prctl
> ------ ----------- ----------- --------- --------- ----------------
> 100.00 0.730024 25945 1 total
>
> On the 2.6.38.4 kernel each read takes > 6ms
>
> dl785sles:~/wrk # cat 2.6.38.4-smp.log
> Opened, read and closed 8640 times and read total of 86235840 bytes
> % time seconds usecs/call calls errors syscall
> ------ ----------- ----------- --------- --------- ----------------
> 100.00 59.021650 6830 8641 read
> 0.00 0.000520 0 8642 open
> 0.00 0.000425 0 8642 close
> 0.00 0.000000 0 1 write
> 0.00 0.000000 0 3 fstat
> 0.00 0.000000 0 9 mmap
> 0.00 0.000000 0 2 mprotect
> 0.00 0.000000 0 1 munmap
> 0.00 0.000000 0 1 brk
> 0.00 0.000000 0 1 1 access
> 0.00 0.000000 0 1 madvise
> 0.00 0.000000 0 1 execve
> 0.00 0.000000 0 1 arch_prctl
> ------ ----------- ----------- --------- --------- ----------------
> 100.00 59.022595 25946 1 total
>
> Test program
> #include <stdio.h>
> #include <stdlib.h>
> #include <sys/types.h>
> #include <sys/stat.h>
> #include <asm-generic/fcntl.h>
> #include <sys/errno.h>
>
> extern int errno;
>
> int fid,i,count,total;
> char buffer[16384];
>
> int main(int argc, char *argv[])
> {
> for (count=0;count<8640;count++) {
> fid=open("/proc/stat",O_RDONLY,660);
> if(fid <= 0) {
> perror("error with file open\n");
> }
> i=read(fid,buffer,16384);
> /* printf("Read %d characters \n",i); */
> total+=i;
> close(fid);
> }
> printf("Opened, read and closed %d times and read total of %d bytes\n",count,total);
> exit(0);
> }
Strange, here is what I have on my 2x4x2 machine.
Could you try following actions please ?
# perf record ./prog
Opened, read and closed 8640 times and read total of 29595948 bytes
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.057 MB perf.data (~2482 samples) ]
# perf report --stdio
# Events: 1K cycles
#
# Overhead Command Shared Object Symbol
# ........ ....... ................. ............................
#
22.82% prog [kernel.kallsyms] [k] format_decode
21.15% prog [kernel.kallsyms] [k] vsnprintf
13.46% prog [kernel.kallsyms] [k] number.isra.4
11.33% prog [kernel.kallsyms] [k] memcpy
8.87% prog [kernel.kallsyms] [k] kstat_irqs
3.80% prog [kernel.kallsyms] [k] radix_tree_lookup_element
3.69% prog [kernel.kallsyms] [k] seq_printf
3.63% prog [kernel.kallsyms] [k] find_next_bit
2.81% prog [kernel.kallsyms] [k] show_stat
1.18% prog [kernel.kallsyms] [k] put_dec_trunc
0.96% prog [kernel.kallsyms] [k] radix_tree_lookup
0.96% prog [kernel.kallsyms] [k] irq_to_desc
0.66% prog [kernel.kallsyms] [k] getname_flags
0.37% prog [kernel.kallsyms] [k] selinux_inode_alloc_security
0.22% prog [kernel.kallsyms] [k] jiffies_64_to_clock_t
0.22% prog [kernel.kallsyms] [k] copy_user_generic_string
0.22% prog [kernel.kallsyms] [k] __call_rcu
0.15% prog [kernel.kallsyms] [k] add_preempt_count
0.15% prog [kernel.kallsyms] [k] put_dec_full
0.15% prog [kernel.kallsyms] [k] nr_context_switches
0.15% prog [kernel.kallsyms] [k] kfree
^ permalink raw reply [flat|nested] 16+ messages in thread
* RE: Regression in reading /proc/stat in the newer kernels with large SMP and NUMA configurations
2011-10-14 12:38 ` Eric Dumazet
@ 2011-10-14 13:36 ` Seger, Mark
2011-10-14 13:47 ` Eric Dumazet
0 siblings, 1 reply; 16+ messages in thread
From: Seger, Mark @ 2011-10-14 13:36 UTC (permalink / raw)
To: Eric Dumazet, Oberman, Laurence (HAS GSE)
Cc: linux-kernel, Cabaniols, Sebastien
> > The HP developed collectl utility depends heavily on abstractions in
> /proc and /sys provided by the kernel for kernel statistics capture.
> > During testing on newer NUMA platforms with large numbers of CPUs HP
> has observed a significant increase in the time taken to parse this
> data with the newer 2.6.32.x and higher kernels.
While this does indeed affect collectl, rendering it ineffective at 0.1-second or lower monitoring intervals, it affects ANY utility that reads /proc/stat, and the more frequent the access, the heavier the load. Remember, the increase in load is by a factor of over 50, so those who like to run top or vmstat or even iostat will also see a much heavier hand. If people ran sar at a shorter monitoring interval it too would suffer.
I've also observed that reading /proc/meminfo consumes a lot more CPU as well, though not as bad as /proc/stat.
> Strange, here is what I have on my 2x4x2 machine.
Forgive a dumb question, but what does 2X4X2 mean? ;(
I've observed things are fine on 2 socket systems and are truly horrible on the few 8 socket machines I tested this on which also had higher core counts, so there were always on the order of 48+ cpus involved. I haven't done any testing on 4 socket machines and I don't even want to think about what could happen on 16 or 32 or however many socket machines are built in the future.
> Could you try following actions please ?
>
> # perf record ./prog
> Opened, read and closed 8640 times and read total of 29595948 bytes
> [ perf record: Woken up 1 times to write data ]
> [ perf record: Captured and wrote 0.057 MB perf.data (~2482 samples) ]
>
> # perf report --stdio
> # Events: 1K cycles
> #
> # Overhead Command Shared Object Symbol
> # ........ ....... ................. ............................
> #
> 22.82% prog [kernel.kallsyms] [k] format_decode
> 21.15% prog [kernel.kallsyms] [k] vsnprintf
> 13.46% prog [kernel.kallsyms] [k] number.isra.4
> 11.33% prog [kernel.kallsyms] [k] memcpy
> 8.87% prog [kernel.kallsyms] [k] kstat_irqs
> 3.80% prog [kernel.kallsyms] [k] radix_tree_lookup_element
> 3.69% prog [kernel.kallsyms] [k] seq_printf
> 3.63% prog [kernel.kallsyms] [k] find_next_bit
> 2.81% prog [kernel.kallsyms] [k] show_stat
> 1.18% prog [kernel.kallsyms] [k] put_dec_trunc
> 0.96% prog [kernel.kallsyms] [k] radix_tree_lookup
> 0.96% prog [kernel.kallsyms] [k] irq_to_desc
> 0.66% prog [kernel.kallsyms] [k] getname_flags
> 0.37% prog [kernel.kallsyms] [k]
> selinux_inode_alloc_security
> 0.22% prog [kernel.kallsyms] [k] jiffies_64_to_clock_t
> 0.22% prog [kernel.kallsyms] [k] copy_user_generic_string
> 0.22% prog [kernel.kallsyms] [k] __call_rcu
> 0.15% prog [kernel.kallsyms] [k] add_preempt_count
> 0.15% prog [kernel.kallsyms] [k] put_dec_full
> 0.15% prog [kernel.kallsyms] [k] nr_context_switches
> 0.15% prog [kernel.kallsyms] [k] kfree
For me the easiest reproducer, which admittedly doesn't show any deeper analysis, is to just run "cat /proc/stat>/dev/null" in a loop and time it.
-mark
^ permalink raw reply [flat|nested] 16+ messages in thread
* RE: Regression in reading /proc/stat in the newer kernels with large SMP and NUMA configurations
2011-10-14 13:36 ` Seger, Mark
@ 2011-10-14 13:47 ` Eric Dumazet
2011-10-14 13:48 ` Oberman, Laurence (HAS GSE)
2011-10-14 16:27 ` Oberman, Laurence (HAS GSE)
0 siblings, 2 replies; 16+ messages in thread
From: Eric Dumazet @ 2011-10-14 13:47 UTC (permalink / raw)
To: Seger, Mark
Cc: Oberman, Laurence (HAS GSE), linux-kernel, Cabaniols, Sebastien
On Friday, 14 October 2011 at 14:36 +0100, Seger, Mark wrote:
> For me the easiest reproducer, which admittedly doesn't show any
> deeper analysis, is to just run "cat /proc/stat>/dev/null" in a loop
> and time it.
OK, but you already provided a global number.
We would like to check where in kernel cpu time is consumed.
Maybe something really obvious could pop out.
"perf" is provided in kernel sources, and really is a piece of cake.
cd tools/perf ; make
2x4x2 means: two sockets, 4 cores per physical package, 2 threads per core.
^ permalink raw reply [flat|nested] 16+ messages in thread
* RE: Regression in reading /proc/stat in the newer kernels with large SMP and NUMA configurations
2011-10-14 13:47 ` Eric Dumazet
@ 2011-10-14 13:48 ` Oberman, Laurence (HAS GSE)
2011-10-14 16:27 ` Oberman, Laurence (HAS GSE)
1 sibling, 0 replies; 16+ messages in thread
From: Oberman, Laurence (HAS GSE) @ 2011-10-14 13:48 UTC (permalink / raw)
To: Eric Dumazet, Seger, Mark; +Cc: linux-kernel, Cabaniols, Sebastien
Hi Eric,
I am preparing for that test. I need to stage a full RHEL 6 install; I had only updated the kernel on an existing SLES 10.
Stand by and I will run the tests for you.
Thanks
Laurence
-----Original Message-----
From: Eric Dumazet [mailto:eric.dumazet@gmail.com]
Sent: Friday, October 14, 2011 9:47 AM
To: Seger, Mark
Cc: Oberman, Laurence (HAS GSE); linux-kernel@vger.kernel.org; Cabaniols, Sebastien
Subject: RE: Regression in reading /proc/stat in the newer kernels with large SMP and NUMA configurations
On Friday, 14 October 2011 at 14:36 +0100, Seger, Mark wrote:
> For me the easiest reproducer, which admittedly doesn't show any
> deeper analysis, is to just run "cat /proc/stat>/dev/null" in a loop
> and time it.
OK, but you already provided a global number.
We would like to check where in kernel cpu time is consumed.
Maybe something really obvious could pop out.
"perf" is provided in kernel sources, and really is a piece of cake.
cd tools/perf ; make
2x4x2 means : Two sockets, 4 cores per physical package, 2 threads per
core.
^ permalink raw reply [flat|nested] 16+ messages in thread
* RE: Regression in reading /proc/stat in the newer kernels with large SMP and NUMA configurations
2011-10-14 13:47 ` Eric Dumazet
2011-10-14 13:48 ` Oberman, Laurence (HAS GSE)
@ 2011-10-14 16:27 ` Oberman, Laurence (HAS GSE)
2011-10-14 16:40 ` Eric Dumazet
1 sibling, 1 reply; 16+ messages in thread
From: Oberman, Laurence (HAS GSE) @ 2011-10-14 16:27 UTC (permalink / raw)
To: Eric Dumazet, Seger, Mark; +Cc: linux-kernel, Cabaniols, Sebastien
Eric,
Got the system ready and ran the tests.
This is a DL980 with 128 CPUs and 256GB memory.
[root@linuxbeast ~]# cat /proc/cpuinfo | grep processor | wc -l
128
[root@linuxbeast ~]# uname -a
Linux linuxbeast.gse.mvlabs.corp.hp.com 2.6.32-131.6.1.el6.x86_64 #1 SMP Mon Jun 20 14:15:38 EDT 2011 x86_64 x86_64 x86_64 GNU/Linux
In this example each read takes >4 ms. We are spending most of the time in:
35.09% t [kernel.kallsyms] [k] kstat_irqs_cpu
32.13% t [kernel.kallsyms] [k] find_next_bit
24.32% t [kernel.kallsyms] [k] show_stat
1.98% t [kernel.kallsyms] [k] vsnprintf
1.63% t [kernel.kallsyms] [k] format_decode
1.23% t [kernel.kallsyms] [k] number
Thanks
[root@linuxbeast ~]# perf record ./t
Opened, read and closed 8640 times and read total of 72521754 bytes
[ perf record: Woken up 5 times to write data ]
[ perf record: Captured and wrote 1.422 MB perf.data (~62126 samples) ]
[root@linuxbeast ~]# perf report --stdio
# Events: 36K cycles
#
# Overhead Command Shared Object Symbol
# ........ ....... ................. .............................
#
35.09% t [kernel.kallsyms] [k] kstat_irqs_cpu
32.13% t [kernel.kallsyms] [k] find_next_bit
24.32% t [kernel.kallsyms] [k] show_stat
1.98% t [kernel.kallsyms] [k] vsnprintf
1.63% t [kernel.kallsyms] [k] format_decode
1.23% t [kernel.kallsyms] [k] number
0.92% t [kernel.kallsyms] [k] memcpy_c
0.42% t [kernel.kallsyms] [k] seq_printf
0.26% t [kernel.kallsyms] [k] put_dec_trunc
0.26% t [kernel.kallsyms] [k] _spin_lock
0.16% t [kernel.kallsyms] [k] put_dec
0.15% t [kernel.kallsyms] [k] arch_irq_stat_cpu
0.14% t [kernel.kallsyms] [k] scheduler_tick
0.12% t [kernel.kallsyms] [k] nr_running
0.10% t [kernel.kallsyms] [k] jiffies_64_to_clock_t
0.08% t [kernel.kallsyms] [k] dyntick_save_progress_counter
0.06% t [kernel.kallsyms] [k] nr_iowait
0.05% t [kernel.kallsyms] [k] nr_context_switches
0.05% t [kernel.kallsyms] [k] _spin_lock_irqsave
0.04% t [kernel.kallsyms] [k] rcu_process_gp_end
0.04% t [kernel.kallsyms] [k] account_system_time
0.04% t [kernel.kallsyms] [k] native_write_msr_safe
0.04% t [kernel.kallsyms] [k] memcpy
0.03% t [kernel.kallsyms] [k] update_vsyscall
0.03% t [kernel.kallsyms] [k] task_tick_fair
0.03% t [kernel.kallsyms] [k] __do_softirq
0.02% t [kernel.kallsyms] [k] apic_timer_interrupt
0.02% t [kernel.kallsyms] [k] ktime_get
0.02% t [kernel.kallsyms] [k] _spin_unlock_irqrestore
0.02% t [kernel.kallsyms] [k] check_for_new_grace_period
0.02% t [kernel.kallsyms] [k] __sched_text_end
0.02% t [kernel.kallsyms] [k] tick_do_update_jiffies64
0.02% t [kernel.kallsyms] [k] copy_user_generic_string
0.02% t [kernel.kallsyms] [k] run_timer_softirq
0.02% t [kernel.kallsyms] [k] rcu_process_callbacks
0.02% t [kernel.kallsyms] [k] __rcu_process_callbacks
0.02% t [kernel.kallsyms] [k] kfree
0.02% t [kernel.kallsyms] [k] __rcu_pending
0.02% t [kernel.kallsyms] [k] avtab_search_node
0.01% t [kernel.kallsyms] [k] rcu_process_dyntick
0.01% t [kernel.kallsyms] [k] force_quiescent_state
0.01% t [kernel.kallsyms] [k] irq_exit
0.01% t [kernel.kallsyms] [k] update_wall_time
0.01% t [kernel.kallsyms] [k] tg_shares_up
0.01% t [kernel.kallsyms] [k] acct_update_integrals
0.01% t [kernel.kallsyms] [k] __percpu_counter_add
0.01% t [kernel.kallsyms] [k] raise_softirq
0.01% t [kernel.kallsyms] [k] perf_event_do_pending
0.01% t [kernel.kallsyms] [k] update_curr
0.01% t [kernel.kallsyms] [k] tick_sched_timer
0.01% t [kernel.kallsyms] [k] avc_has_perm_noaudit
0.01% t [kernel.kallsyms] [k] run_posix_cpu_timers
0.01% t [kernel.kallsyms] [k] current_kernel_time
0.01% t [kernel.kallsyms] [k] update_cpu_load
0.01% t [kernel.kallsyms] [k] hrtimer_interrupt
0.01% t [kernel.kallsyms] [k] native_read_tsc
0.01% t [kernel.kallsyms] [k] vfs_read
0.01% t [kernel.kallsyms] [k] __call_rcu
0.01% t [kernel.kallsyms] [k] putname
0.01% t [kernel.kallsyms] [k] __d_lookup
0.01% t [kernel.kallsyms] [k] rb_erase
0.01% t [kernel.kallsyms] [k] rb_next
0.01% t [kernel.kallsyms] [k] rcu_check_callbacks
0.01% t [kernel.kallsyms] [k] inode_has_perm
0.01% t [kernel.kallsyms] [k] proc_reg_release
0.01% t [kernel.kallsyms] [k] __link_path_walk
0.01% t [kernel.kallsyms] [k] perf_event_task_tick
0.00% t [kernel.kallsyms] [k] __remove_hrtimer
0.00% t [kernel.kallsyms] [k] task_of
0.00% t [kernel.kallsyms] [k] irq_enter
0.00% t [kernel.kallsyms] [k] rb_insert_color
0.00% t [kernel.kallsyms] [k] rcu_start_gp
0.00% t [kernel.kallsyms] [k] alloc_fd
0.00% t [kernel.kallsyms] [k] path_init
0.00% t [kernel.kallsyms] [k] read_tsc
0.00% t [kernel.kallsyms] [k] rcu_irq_enter
0.00% t [kernel.kallsyms] [k] _read_lock
0.00% t [kernel.kallsyms] [k] sysret_check
0.00% t [kernel.kallsyms] [k] expand_files
0.00% t [kernel.kallsyms] [k] file_move
0.00% t [kernel.kallsyms] [k] update_process_times
0.00% t [kernel.kallsyms] [k] __kmalloc
0.00% t [kernel.kallsyms] [k] __run_hrtimer
0.00% t [kernel.kallsyms] [k] hweight64
0.00% t [kernel.kallsyms] [k] audit_syscall_entry
0.00% t [kernel.kallsyms] [k] lookup_mnt
0.00% t [kernel.kallsyms] [k] perf_pmu_disable
0.00% t [kernel.kallsyms] [k] kmem_cache_alloc
0.00% t [kernel.kallsyms] [k] getname
0.00% t [kernel.kallsyms] [k] mutex_lock
0.00% t [kernel.kallsyms] [k] restore_args
0.00% t [kernel.kallsyms] [k] x86_pmu_disable
0.00% t [kernel.kallsyms] [k] ima_counts_get
0.00% t [kernel.kallsyms] [k] call_rcu_sched
0.00% t [kernel.kallsyms] [k] rcu_implicit_dynticks_qs
0.00% t [kernel.kallsyms] [k] security_file_alloc
0.00% t [kernel.kallsyms] [k] memset_c
0.00% t [kernel.kallsyms] [k] lapic_next_event
0.00% t [kernel.kallsyms] [k] kmem_cache_free
0.00% t [kernel.kallsyms] [k] tick_program_event
0.00% t [kernel.kallsyms] [k] __tcp_select_window
0.00% t [kernel.kallsyms] [k] __dentry_open
0.00% t [kernel.kallsyms] [k] _local_bh_enable
0.00% t [kernel.kallsyms] [k] __smp_call_function_single
0.00% t [kernel.kallsyms] [k] sched_clock_tick
0.00% t [kernel.kallsyms] [k] call_softirq
0.00% t [kernel.kallsyms] [k] _spin_lock_irq
0.00% t [kernel.kallsyms] [k] _atomic_dec_and_lock
0.00% t [kernel.kallsyms] [k] hrtimer_forward
0.00% t [kernel.kallsyms] [k] account_process_tick
0.00% t [kernel.kallsyms] [k] clockevents_program_event
0.00% t [kernel.kallsyms] [k] sysret_careful
0.00% t [kernel.kallsyms] [k] x86_pmu_enable
#
# (For a higher level overview, try: perf report --sort comm,dso)
#
(END)
[root@linuxbeast ~]# strace -c -T ./t
Opened, read and closed 8640 times and read total of 72446400 bytes
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
99.99 35.195100 4073 8641 read
0.01 0.002183 0 8642 close
0.00 0.001260 0 8642 open
0.00 0.000032 32 1 write
0.00 0.000000 0 3 fstat
0.00 0.000000 0 9 mmap
0.00 0.000000 0 3 mprotect
0.00 0.000000 0 1 munmap
0.00 0.000000 0 1 brk
0.00 0.000000 0 1 1 access
0.00 0.000000 0 1 execve
0.00 0.000000 0 1 arch_prctl
------ ----------- ----------- --------- --------- ----------------
100.00 35.198575 25946 1 total
[root@linuxbeast ~]#
-----Original Message-----
From: Eric Dumazet [mailto:eric.dumazet@gmail.com]
Sent: Friday, October 14, 2011 9:47 AM
To: Seger, Mark
Cc: Oberman, Laurence (HAS GSE); linux-kernel@vger.kernel.org; Cabaniols, Sebastien
Subject: RE: Regression in reading /proc/stat in the newer kernels with large SMP and NUMA configurations
On Friday, 14 October 2011 at 14:36 +0100, Seger, Mark wrote:
> For me the easiest reproducer, which admittedly doesn't show any
> deeper analysis, is to just run "cat /proc/stat>/dev/null" in a loop
> and time it.
OK, but you already provided a global number.
We would like to check where in kernel cpu time is consumed.
Maybe something really obvious could pop out.
"perf" is provided in kernel sources, and really is a piece of cake.
cd tools/perf ; make
2x4x2 means : Two sockets, 4 cores per physical package, 2 threads per
core.
^ permalink raw reply [flat|nested] 16+ messages in thread
* RE: Regression in reading /proc/stat in the newer kernels with large SMP and NUMA configurations
2011-10-14 16:27 ` Oberman, Laurence (HAS GSE)
@ 2011-10-14 16:40 ` Eric Dumazet
2011-10-14 16:52 ` Oberman, Laurence (HAS GSE)
0 siblings, 1 reply; 16+ messages in thread
From: Eric Dumazet @ 2011-10-14 16:40 UTC (permalink / raw)
To: Oberman, Laurence (HAS GSE)
Cc: Seger, Mark, linux-kernel, Cabaniols, Sebastien
On Friday, 14 October 2011 at 17:27 +0100, Oberman, Laurence (HAS GSE) wrote:
> Eric,
>
> Got the system ready and ran the tests.
>
> This is a DL980 128 CPCUS, 256GB memory.
>
> [root@linuxbeast ~]# cat /proc/cpuinfo | grep processor | wc -l
> 128
>
> [root@linuxbeast ~]# uname -a
> Linux linuxbeast.gse.mvlabs.corp.hp.com 2.6.32-131.6.1.el6.x86_64 #1 SMP Mon Jun 20 14:15:38 EDT 2011 x86_64 x86_64 x86_64 GNU/Linux
>
> In this example we take >4ms per read. We are spending most of the time in
>
> 35.09% t [kernel.kallsyms] [k] kstat_irqs_cpu
> 32.13% t [kernel.kallsyms] [k] find_next_bit
> 24.32% t [kernel.kallsyms] [k] show_stat
> 1.98% t [kernel.kallsyms] [k] vsnprintf
> 1.63% t [kernel.kallsyms] [k] format_decode
> 1.23% t [kernel.kallsyms] [k] number
Hmm, have you compiled your kernel with NR_CPUS=4096 ???
^ permalink raw reply [flat|nested] 16+ messages in thread
* RE: Regression in reading /proc/stat in the newer kernels with large SMP and NUMA configurations
2011-10-14 16:40 ` Eric Dumazet
@ 2011-10-14 16:52 ` Oberman, Laurence (HAS GSE)
2011-10-14 16:59 ` Eric Dumazet
0 siblings, 1 reply; 16+ messages in thread
From: Oberman, Laurence (HAS GSE) @ 2011-10-14 16:52 UTC (permalink / raw)
To: Eric Dumazet; +Cc: Seger, Mark, linux-kernel, Cabaniols, Sebastien
Standard RHEL kernel build.
What the customer would be using.
[root@linuxbeast boot]# cat config-2.6.32-131.6.1.el6.x86_64 | grep NR
CONFIG_NR_CPUS=4096
-----Original Message-----
From: Eric Dumazet [mailto:eric.dumazet@gmail.com]
Sent: Friday, October 14, 2011 12:40 PM
To: Oberman, Laurence (HAS GSE)
Cc: Seger, Mark; linux-kernel@vger.kernel.org; Cabaniols, Sebastien
Subject: RE: Regression in reading /proc/stat in the newer kernels with large SMP and NUMA configurations
On Friday, 14 October 2011 at 17:27 +0100, Oberman, Laurence (HAS GSE) wrote:
> Eric,
>
> Got the system ready and ran the tests.
>
> This is a DL980 with 128 CPUs and 256 GB of memory.
>
> [root@linuxbeast ~]# cat /proc/cpuinfo | grep processor | wc -l
> 128
>
> [root@linuxbeast ~]# uname -a
> Linux linuxbeast.gse.mvlabs.corp.hp.com 2.6.32-131.6.1.el6.x86_64 #1 SMP Mon Jun 20 14:15:38 EDT 2011 x86_64 x86_64 x86_64 GNU/Linux
>
> In this example we take >4ms per read. We are spending most of the time in
>
> 35.09% t [kernel.kallsyms] [k] kstat_irqs_cpu
> 32.13% t [kernel.kallsyms] [k] find_next_bit
> 24.32% t [kernel.kallsyms] [k] show_stat
> 1.98% t [kernel.kallsyms] [k] vsnprintf
> 1.63% t [kernel.kallsyms] [k] format_decode
> 1.23% t [kernel.kallsyms] [k] number
Hmm, have you compiled your kernel with NR_CPUS=4096 ???
* RE: Regression in reading /proc/stat in the newer kernels with large SMP and NUMA configurations
2011-10-14 16:52 ` Oberman, Laurence (HAS GSE)
@ 2011-10-14 16:59 ` Eric Dumazet
2011-10-14 17:02 ` Oberman, Laurence (HAS GSE)
0 siblings, 1 reply; 16+ messages in thread
From: Eric Dumazet @ 2011-10-14 16:59 UTC (permalink / raw)
To: Oberman, Laurence (HAS GSE)
Cc: Seger, Mark, linux-kernel, Cabaniols, Sebastien
On Friday, 14 October 2011 at 17:52 +0100, Oberman, Laurence (HAS GSE) wrote:
> Standard RHEL kernel build.
> What the customer would be using.
>
> [root@linuxbeast boot]# cat config-2.6.32-131.6.1.el6.x86_64 | grep NR
> CONFIG_NR_CPUS=4096
Then something must be wrong.... what are your nr_cpu_ids /
nr_cpumask_bits values ?
dmesg | grep nr_cpu_ids
* RE: Regression in reading /proc/stat in the newer kernels with large SMP and NUMA configurations
2011-10-14 16:59 ` Eric Dumazet
@ 2011-10-14 17:02 ` Oberman, Laurence (HAS GSE)
2011-10-17 18:05 ` Seger, Mark
0 siblings, 1 reply; 16+ messages in thread
From: Oberman, Laurence (HAS GSE) @ 2011-10-14 17:02 UTC (permalink / raw)
To: Eric Dumazet; +Cc: Seger, Mark, linux-kernel, Cabaniols, Sebastien
Hi Eric,
Just FYI, the same behavior is seen on multiple hardware platforms, so it is definitely due to kernel changes. The earlier kernel on the DL785 platform was returning reads within 60 microseconds.
I have measured it on multiple platforms.
In this case we have a DL980 with HT enabled, so you will see 128 CPUs. If I shut off HT, it does not make a difference.
[root@linuxbeast boot]# dmesg | grep nr_cpu_ids
NR_CPUS:4096 nr_cpumask_bits:128 nr_cpu_ids:128 nr_node_ids:8
-----Original Message-----
From: Eric Dumazet [mailto:eric.dumazet@gmail.com]
Sent: Friday, October 14, 2011 12:59 PM
To: Oberman, Laurence (HAS GSE)
Cc: Seger, Mark; linux-kernel@vger.kernel.org; Cabaniols, Sebastien
Subject: RE: Regression in reading /proc/stat in the newer kernels with large SMP and NUMA configurations
On Friday, 14 October 2011 at 17:52 +0100, Oberman, Laurence (HAS GSE) wrote:
> Standard RHEL kernel build.
> What the customer would be using.
>
> [root@linuxbeast boot]# cat config-2.6.32-131.6.1.el6.x86_64 | grep NR
> CONFIG_NR_CPUS=4096
Then something must be wrong.... what are your nr_cpu_ids /
nr_cpumask_bits values ?
dmesg | grep nr_cpu_ids
* RE: Regression in reading /proc/stat in the newer kernels with large SMP and NUMA configurations
2011-10-14 17:02 ` Oberman, Laurence (HAS GSE)
@ 2011-10-17 18:05 ` Seger, Mark
2011-10-17 19:24 ` Eric Dumazet
0 siblings, 1 reply; 16+ messages in thread
From: Seger, Mark @ 2011-10-17 18:05 UTC (permalink / raw)
To: Oberman, Laurence (HAS GSE), Eric Dumazet
Cc: linux-kernel, Cabaniols, Sebastien
>Just FYI, the same behavior is seen on multiple hardware platforms so it
>is definitely kernel changes. The earlier kernel on the DL785 platform
>was returning reads within 60 microseconds.
>I have measured it on multiple platforms.
>
>In this case we have a DL980 with HT enabled so you will see 128 CPUs.
>If I shut off HT, it does not make a difference.
Given that there have only been a couple of responses to this, I can't help but wonder whether this simply hasn't gotten anyone's attention yet, or whether I'm overreacting. I'd claim this breaks key Linux utilities by giving them a significantly heavier footprint than they used to have. Historically, people would run top whenever they wanted to, knowing it had minimal impact on the system; now I'm not so sure that will be the case anymore, at least not on big NUMA boxes.
I can assure you that if someone wants to report CPU stats every tenth of a second, it will most definitely have an impact.
-mark
* RE: Regression in reading /proc/stat in the newer kernels with large SMP and NUMA configurations
2011-10-17 18:05 ` Seger, Mark
@ 2011-10-17 19:24 ` Eric Dumazet
2011-12-05 21:15 ` Laurence Oberman
0 siblings, 1 reply; 16+ messages in thread
From: Eric Dumazet @ 2011-10-17 19:24 UTC (permalink / raw)
To: Seger, Mark
Cc: Oberman, Laurence (HAS GSE), linux-kernel, Cabaniols, Sebastien
On Monday, 17 October 2011 at 19:05 +0100, Seger, Mark wrote:
> >Just FYI, the same behavior is seen on multiple hardware platforms so it
> >is definitely kernel changes. The earlier kernel on the DL785 platform
> >was returning reads within 60 microseconds.
> >I have measured it on multiple platforms.
> >
> >In this case we have a DL980 with HT enabled so you will see 128 CPUs.
> >If I shut off HT, it does not make a difference.
>
> Given that there have only been a couple of responses to this I can't
> help but wonder if this hasn't really gotten anyone's attention yet or
> if perhaps I'm just overreacting. I'd claim this breaks key Linux
> utilities by making them have a significantly heavier footprint than
> they used to have. Historically people would always run top whenever
> they wanted to knowing it had minimal impact on the system and now I'm
> not so sure that will be the case anymore, at least not on big numa
> boxes.
>
> I can assure you if someone wants to report cpu stats every tenth of a
> second it will most definitely have an impact.
Maybe, but adding an increment of a sum field adds one instruction to the
interrupt path. If you have one reader of /proc/stat every second, but
ten million interrupts per second, it might be better as is.
Reading /proc/stat is a slow path on most workloads, while handling an
interrupt is definitely a fast path on all workloads.
The per-cpu variable makes sure no cache-line bouncing occurs in the
kstat_irqs management.
kstat_irqs is needed to provide /proc/interrupts: per-cpu counts per
IRQ.
It should be OK to add another field to struct irq_desc
and increment it in kstat_incr_irqs_this_cpu(), because we already dirty at
least the desc->lock cache line there.
This field can be a plain integer, changed under desc->lock protection.
include/linux/irqdesc.h | 2 ++
include/linux/kernel_stat.h | 2 ++
kernel/irq/irqdesc.c | 10 +++-------
3 files changed, 7 insertions(+), 7 deletions(-)
diff --git a/include/linux/irqdesc.h b/include/linux/irqdesc.h
index 150134a..5b250d0 100644
--- a/include/linux/irqdesc.h
+++ b/include/linux/irqdesc.h
@@ -24,6 +24,7 @@ struct timer_rand_state;
* @depth: disable-depth, for nested irq_disable() calls
* @wake_depth: enable depth, for multiple irq_set_irq_wake() callers
* @irq_count: stats field to detect stalled irqs
+ * @stat_irq: irq stats (sum for all cpus)
* @last_unhandled: aging timer for unhandled count
* @irqs_unhandled: stats field for spurious unhandled interrupts
* @lock: locking for SMP
@@ -50,6 +51,7 @@ struct irq_desc {
unsigned int depth; /* nested irq disables */
unsigned int wake_depth; /* nested wake enables */
unsigned int irq_count; /* For detecting broken IRQs */
+ unsigned int stat_irq;
unsigned long last_unhandled; /* Aging timer for unhandled count */
unsigned int irqs_unhandled;
raw_spinlock_t lock;
diff --git a/include/linux/kernel_stat.h b/include/linux/kernel_stat.h
index 0cce2db..56af674 100644
--- a/include/linux/kernel_stat.h
+++ b/include/linux/kernel_stat.h
@@ -54,6 +54,7 @@ static inline void kstat_incr_irqs_this_cpu(unsigned int irq,
{
__this_cpu_inc(kstat.irqs[irq]);
__this_cpu_inc(kstat.irqs_sum);
+ desc->stat_irq++;
}
static inline unsigned int kstat_irqs_cpu(unsigned int irq, int cpu)
@@ -68,6 +69,7 @@ extern unsigned int kstat_irqs_cpu(unsigned int irq, int cpu);
do { \
__this_cpu_inc(*(DESC)->kstat_irqs); \
__this_cpu_inc(kstat.irqs_sum); \
+ (DESC)->stat_irq++; \
} while (0)
#endif
diff --git a/kernel/irq/irqdesc.c b/kernel/irq/irqdesc.c
index 039b889..938159b 100644
--- a/kernel/irq/irqdesc.c
+++ b/kernel/irq/irqdesc.c
@@ -467,13 +467,9 @@ unsigned int kstat_irqs_cpu(unsigned int irq, int cpu)
unsigned int kstat_irqs(unsigned int irq)
{
- struct irq_desc *desc = irq_to_desc(irq);
- int cpu;
- int sum = 0;
+ const struct irq_desc *desc = irq_to_desc(irq);
- if (!desc || !desc->kstat_irqs)
+ if (!desc)
return 0;
- for_each_possible_cpu(cpu)
- sum += *per_cpu_ptr(desc->kstat_irqs, cpu);
- return sum;
+ return desc->stat_irq;
}
* Re: Regression in reading /proc/stat in the newer kernels with large SMP and NUMA configurations
2011-10-17 19:24 ` Eric Dumazet
@ 2011-12-05 21:15 ` Laurence Oberman
2011-12-05 21:38 ` Laurence Oberman
0 siblings, 1 reply; 16+ messages in thread
From: Laurence Oberman @ 2011-12-05 21:15 UTC (permalink / raw)
To: linux-kernel
Eric Dumazet <eric.dumazet <at> gmail.com> writes:
Hi Eric,
I finally managed to get some time to test the patch. I am testing on the
generic kernel.org 2.6.39 tree and will update you as soon as possible.
Thanks
Laurence
* Re: Regression in reading /proc/stat in the newer kernels with large SMP and NUMA configurations
2011-12-05 21:15 ` Laurence Oberman
@ 2011-12-05 21:38 ` Laurence Oberman
0 siblings, 0 replies; 16+ messages in thread
From: Laurence Oberman @ 2011-12-05 21:38 UTC (permalink / raw)
To: linux-kernel
Laurence Oberman <laurence.oberman <at> hp.com> writes:
>
>
> Eric Dumazet <eric.dumazet <at> gmail.com> writes:
>
> Hi Eric,
>
> I finally managed to get some time to test the patch. I am testing on the
> generic kernel.org 2.6.39 tree and will update you as soon as possible.
>
> Thanks
> Laurence
>
>
Hello,
Results are in; the patch makes a huge difference.
We are now getting per-read times in microseconds again.
[root@o184i026 lobe]# strace -c ./t
Opened, read and closed 8640 times and read total of 69819840 bytes
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
99.86 0.964056 112 8641 read
0.08 0.000786 0 8642 open
0.06 0.000554 0 8642 close
0.00 0.000034 4 8 mmap
0.00 0.000000 0 1 write
0.00 0.000000 0 3 fstat
0.00 0.000000 0 3 mprotect
0.00 0.000000 0 1 munmap
0.00 0.000000 0 1 brk
0.00 0.000000 0 1 ioctl
0.00 0.000000 0 1 1 access
0.00 0.000000 0 1 execve
0.00 0.000000 0 1 arch_prctl
------ ----------- ----------- --------- --------- ----------------
100.00 0.965430 25946 1 total
What do you need me to do to get your patch accepted?
Thanks
Laurence
end of thread, other threads:[~2011-12-05 21:38 UTC | newest]
Thread overview: 16+ messages
-- links below jump to the message on this page --
2011-10-13 23:40 Regression in reading /proc/stat in the newer kernels with large SMP and NUMA configurations Oberman, Laurence (HAS GSE)
2011-10-14 10:08 ` David Rientjes
2011-10-14 10:35 ` Eric Dumazet
2011-10-14 12:38 ` Eric Dumazet
2011-10-14 13:36 ` Seger, Mark
2011-10-14 13:47 ` Eric Dumazet
2011-10-14 13:48 ` Oberman, Laurence (HAS GSE)
2011-10-14 16:27 ` Oberman, Laurence (HAS GSE)
2011-10-14 16:40 ` Eric Dumazet
2011-10-14 16:52 ` Oberman, Laurence (HAS GSE)
2011-10-14 16:59 ` Eric Dumazet
2011-10-14 17:02 ` Oberman, Laurence (HAS GSE)
2011-10-17 18:05 ` Seger, Mark
2011-10-17 19:24 ` Eric Dumazet
2011-12-05 21:15 ` Laurence Oberman
2011-12-05 21:38 ` Laurence Oberman