* Regression in reading /proc/stat in the newer kernels with large SMP and NUMA configurations
@ 2011-10-13 23:40 Oberman, Laurence (HAS GSE)
  2011-10-14 10:08 ` David Rientjes
  2011-10-14 12:38 ` Eric Dumazet
  0 siblings, 2 replies; 16+ messages in thread
From: Oberman, Laurence (HAS GSE) @ 2011-10-13 23:40 UTC (permalink / raw)
  To: linux-kernel; +Cc: Seger, Mark, Cabaniols, Sebastien


Hello,

The HP-developed collectl utility depends heavily on the abstractions in /proc and /sys provided by the kernel for kernel statistics capture.
During testing on newer NUMA platforms with large numbers of CPUs, HP has observed a significant increase in the time taken to read and parse this data with the newer 2.6.32.x and higher kernels.

In summary:

Reading the data on a DL785, an 8-node NUMA system with 4 CPUs per node, takes around 80 microseconds per read on the 2.6.18.x kernels, but when testing on the same platform with the newer 2.6.32.x and higher kernels each read takes more than 5 ms. This is significant when considering that collectl typically reads these kernel statistics every 10 seconds (the default) for certain collections.

Given that the kernel has significantly more data structures to walk and that some locking interaction is needed to get this data, it is not unusual for the reads to take longer. However, we were alarmed by how large the increase in overhead is and wanted to make people aware of it.

Supporting Data
---------------

Mimicking a day of capture at 10-second intervals (86,400 seconds / 10) gives us 8640 reads of /proc/stat.

I wanted to first validate whether this was architecture or kernel behavior, and it is in fact kernel overhead on the 2.6.32.x and higher kernels. I tested with 2.6.38, but the same applies to the RHEL 6 and SLES 11 2.6.32.x kernels.

Same architecture, just different kernels for my testing.

The newer kernel is around 100 times slower, so it shows up in collectl's behavior.

DL785 8-node NUMA with 128GB memory and 32 CPUs.

dl785sles:/sys/devices/system/node # ls
node0  node1  node2  node3  node4  node5  node6  node7
dl785sles:/sys/devices/system/node # pwd
/sys/devices/system/node

On the 2.6.16.x series SLES or 2.6.18.x RHEL kernels, each read takes 84 us.

dl785sles:~/wrk # cat 2.6.16.60-0.87.1-smp.log
Opened, read and closed 8640 times and read total of 13996800 bytes
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 99.41    0.725717          84      8641           read
  0.38    0.002768           0      8642           open
  0.21    0.001539           0      8642           close
  0.00    0.000000           0         1           write
  0.00    0.000000           0         3           fstat
  0.00    0.000000           0         8           mmap
  0.00    0.000000           0         2           mprotect
  0.00    0.000000           0         1           munmap
  0.00    0.000000           0         1           brk
  0.00    0.000000           0         1         1 access
  0.00    0.000000           0         1           madvise
  0.00    0.000000           0         1           execve
  0.00    0.000000           0         1           arch_prctl
------ ----------- ----------- --------- --------- ----------------
100.00    0.730024                 25945         1 total    

On the 2.6.38.4 kernel each read takes > 6ms

dl785sles:~/wrk # cat 2.6.38.4-smp.log
Opened, read and closed 8640 times and read total of 86235840 bytes
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
100.00   59.021650        6830      8641           read
  0.00    0.000520           0      8642           open
  0.00    0.000425           0      8642           close
  0.00    0.000000           0         1           write
  0.00    0.000000           0         3           fstat
  0.00    0.000000           0         9           mmap
  0.00    0.000000           0         2           mprotect
  0.00    0.000000           0         1           munmap
  0.00    0.000000           0         1           brk
  0.00    0.000000           0         1         1 access
  0.00    0.000000           0         1           madvise
  0.00    0.000000           0         1           execve
  0.00    0.000000           0         1           arch_prctl
------ ----------- ----------- --------- --------- ----------------
100.00   59.022595                 25946         1 total        

Test program:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>

int fid, i, count, total;
char buffer[16384];

/* Open, read and close /proc/stat 8640 times (one simulated day at
 * 10-second intervals) and report the total bytes read. */
int main(int argc, char *argv[])
{
	for (count = 0; count < 8640; count++) {
		fid = open("/proc/stat", O_RDONLY);
		if (fid < 0) {
			perror("error with file open");
			exit(1);
		}
		i = read(fid, buffer, sizeof(buffer));
		/* printf("Read %d characters\n", i); */
		total += i;
		close(fid);
	}
	printf("Opened, read and closed %d times and read total of %d bytes\n", count, total);
	exit(0);
}

Thanks
Laurence Oberman

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Regression in reading /proc/stat in the newer kernels with large SMP and NUMA configurations
  2011-10-13 23:40 Regression in reading /proc/stat in the newer kernels with large SMP and NUMA configurations Oberman, Laurence (HAS GSE)
@ 2011-10-14 10:08 ` David Rientjes
  2011-10-14 10:35   ` Eric Dumazet
  2011-10-14 12:38 ` Eric Dumazet
  1 sibling, 1 reply; 16+ messages in thread
From: David Rientjes @ 2011-10-14 10:08 UTC (permalink / raw)
  To: Oberman, Laurence (HAS GSE)
  Cc: linux-kernel, Seger, Mark, Cabaniols, Sebastien

On Fri, 14 Oct 2011, Oberman, Laurence (HAS GSE) wrote:

> On the 2.6.16.x series sles or 2.6.18.x RHEL kernels each read takes 84 
> us.
> 
> dl785sles:~/wrk # cat 2.6.16.60-0.87.1-smp.log Opened, read and closed 8640 times and read total of 13996800 bytes
> % time     seconds  usecs/call     calls    errors syscall
> ------ ----------- ----------- --------- --------- ----------------
>  99.41    0.725717          84      8641           read
>   0.38    0.002768           0      8642           open
>   0.21    0.001539           0      8642           close
>   0.00    0.000000           0         1           write
>   0.00    0.000000           0         3           fstat
>   0.00    0.000000           0         8           mmap
>   0.00    0.000000           0         2           mprotect
>   0.00    0.000000           0         1           munmap
>   0.00    0.000000           0         1           brk
>   0.00    0.000000           0         1         1 access
>   0.00    0.000000           0         1           madvise
>   0.00    0.000000           0         1           execve
>   0.00    0.000000           0         1           arch_prctl
> ------ ----------- ----------- --------- --------- ----------------
> 100.00    0.730024                 25945         1 total    
> 
> On the 2.6.38.4 kernel each read takes > 6ms
> 
> dl785sles:~/wrk # cat 2.6.38.4-smp.log
> Opened, read and closed 8640 times and read total of 86235840 bytes
> % time     seconds  usecs/call     calls    errors syscall
> ------ ----------- ----------- --------- --------- ----------------
> 100.00   59.021650        6830      8641           read
>   0.00    0.000520           0      8642           open
>   0.00    0.000425           0      8642           close
>   0.00    0.000000           0         1           write
>   0.00    0.000000           0         3           fstat
>   0.00    0.000000           0         9           mmap
>   0.00    0.000000           0         2           mprotect
>   0.00    0.000000           0         1           munmap
>   0.00    0.000000           0         1           brk
>   0.00    0.000000           0         1         1 access
>   0.00    0.000000           0         1           madvise
>   0.00    0.000000           0         1           execve
>   0.00    0.000000           0         1           arch_prctl
> ------ ----------- ----------- --------- --------- ----------------
> 100.00   59.022595                 25946         1 total        
> 

The overhead is probably in kstat_irqs_cpu() which is called for each 
possible irq for each of the 32 possible cpus, and /proc/stat actually 
does the sum twice.  You would see the same type of overhead with 
/proc/interrupts if it wasn't masked by the locking that it requires to 
safely read irq_desc.  "dmesg | grep nr_irqs" will show how many percpu 
variables are being read for every cpu twice.
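To make the shape of that cost concrete, here is a rough userspace model of
the walk being described (a sketch only, not the actual fs/proc/stat.c code;
the irq and cpu counts are illustrative assumptions):

#include <stdio.h>

/* Illustrative sizes only; the real values are nr_irqs and nr_cpu_ids,
 * reported at boot (dmesg | grep nr_irqs). */
#define MODEL_NR_IRQS 1024
#define MODEL_NR_CPUS 32

/* Stand-in for the per-irq, per-cpu counters behind kstat_irqs_cpu(). */
static unsigned int irq_counts[MODEL_NR_IRQS][MODEL_NR_CPUS];

int main(void)
{
	unsigned long long sum = 0;
	int irq, cpu;

	/* The expensive part: one scattered load per possible irq per
	 * possible cpu -- tens of thousands of reads per /proc/stat access
	 * on this model, and show_stat() does the walk twice. */
	for (irq = 0; irq < MODEL_NR_IRQS; irq++)
		for (cpu = 0; cpu < MODEL_NR_CPUS; cpu++)
			sum += irq_counts[irq][cpu];

	printf("walked %d counters, sum=%llu\n",
	       MODEL_NR_IRQS * MODEL_NR_CPUS, sum);
	return 0;
}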

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Regression in reading /proc/stat in the newer kernels with large SMP and NUMA configurations
  2011-10-14 10:08 ` David Rientjes
@ 2011-10-14 10:35   ` Eric Dumazet
  0 siblings, 0 replies; 16+ messages in thread
From: Eric Dumazet @ 2011-10-14 10:35 UTC (permalink / raw)
  To: David Rientjes
  Cc: Oberman, Laurence (HAS GSE),
	linux-kernel, Seger, Mark, Cabaniols, Sebastien

Le vendredi 14 octobre 2011 à 03:08 -0700, David Rientjes a écrit :

> The overhead is probably in kstat_irqs_cpu() which is called for each 
> possible irq for each of the 32 possible cpus, and /proc/stat actually 
> does the sum twice.  You would see the same type of overhead with 
> /proc/interrupts if it wasn't masked by the locking that it requires to 
> safely read irq_desc.  "dmesg | grep nr_irqs" will show how many percpu 
> variables are being read for every cpu twice.

One annoying thing with most HP servers is that they claim more possible CPUs
than really exist. So each time we have to loop over possible CPUs, we waste
time and memory cache lines.

For example, on this ProLiant BL460c G6 with two quad-core CPUs
(2x4x2), 32 'possible' CPUs are found instead of 16:

setup_percpu: NR_CPUS:64 nr_cpumask_bits:64 nr_cpu_ids:32 nr_node_ids:2
PERCPU: Embedded 26 pages/cpu @ffff88007dc00000 s76288 r8192 d22016 u131072
pcpu-alloc: s76288 r8192 d22016 u131072 alloc=1*2097152
pcpu-alloc: [0] 00 02 04 06 08 10 12 14 17 19 21 23 25 27 29 31 
pcpu-alloc: [1] 01 03 05 07 09 11 13 15 16 18 20 22 24 26 28 30 
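One quick way to see the possible-versus-online gap from userspace is to
compare the standard sysfs cpu masks; a minimal sketch, assuming only the
usual /sys/devices/system/cpu/{possible,present,online} files:

#include <stdio.h>

/* Print a sysfs cpu mask such as "0-15" or "0-31". */
static void show_mask(const char *name)
{
	char path[128], buf[256];
	FILE *f;

	snprintf(path, sizeof(path), "/sys/devices/system/cpu/%s", name);
	f = fopen(path, "r");
	if (f && fgets(buf, sizeof(buf), f))
		printf("%-8s: %s", name, buf);	/* buf keeps its newline */
	else
		printf("%-8s: <unavailable>\n", name);
	if (f)
		fclose(f);
}

int main(void)
{
	show_mask("possible");
	show_mask("present");
	show_mask("online");
	return 0;
}

On a box like the BL460c above, possible would be expected to read 0-31 while
online reads only 0-15.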




^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Regression in reading /proc/stat in the newer kernels with large SMP and NUMA configurations
  2011-10-13 23:40 Regression in reading /proc/stat in the newer kernels with large SMP and NUMA configurations Oberman, Laurence (HAS GSE)
  2011-10-14 10:08 ` David Rientjes
@ 2011-10-14 12:38 ` Eric Dumazet
  2011-10-14 13:36   ` Seger, Mark
  1 sibling, 1 reply; 16+ messages in thread
From: Eric Dumazet @ 2011-10-14 12:38 UTC (permalink / raw)
  To: Oberman, Laurence (HAS GSE)
  Cc: linux-kernel, Seger, Mark, Cabaniols, Sebastien

Le vendredi 14 octobre 2011 à 00:40 +0100, Oberman, Laurence (HAS GSE) a
écrit :
> Hello,
> 
> The HP developed collectl utility depends heavily on abstractions in /proc and /sys provided by the kernel for kernel statistics capture.
> During testing on newer NUMA platforms with large numbers of CPU’S HP has observed a significant increase in the time taken to parse this data with the newer 2.6.32.x and higher kernels.
> 
> In summary:
> 
> To read the data for a DL785 8-node NUMA with 4 CPU’S per node takes around 80 microseconds per read on the 2.6.18.x kernels, but when testing on the same platform with the newer 2.6.32.x and higher kernels each read takes more than 5 ms. This is significant when considering that collectl typically collects the data abstraction in the kernel every 10 seconds (default) for certain statistics.
> 
> Given that the kernel has significantly more data structure space to parse and that there has to be some locking interaction to get this data, it is not unusual for the parsing to take longer. However we were alarmed with the significant increase in overhead and wanted to make people aware of this.
> 
> Supporting Data
> ---------------
> 
> Mimicking a day of capture at 10 second intervals gives us 8640 reads of /proc/stat
> 
> I wanted to first validate if this was architecture or kernel behavior and it is actually kernel overhead on the 2.6.3x and higher kernels. I tested with 2.6.38 but the same applies for the RHEL 6 and SLES 11 2.6.32.x kernels.
> 
> Same architecture, just different kernels for my testing.
> 
> The newer kernel is around 100 times slower so it shows up in collectl behavior.
> 
> DL785 8-node NUMA with 128GB memory and 32 CPU'S.
> 
> l785sles:/sys/devices/system/node # ls
> node0  node1  node2  node3  node4  node5  node6  node7 dl785sles:/sys/devices/system/node # pwd
> /sys/devices/system/node   
> 
> On the 2.6.16.x series sles or 2.6.18.x RHEL kernels each read takes 84 us.
> 
> dl785sles:~/wrk # cat 2.6.16.60-0.87.1-smp.log Opened, read and closed 8640 times and read total of 13996800 bytes
> % time     seconds  usecs/call     calls    errors syscall
> ------ ----------- ----------- --------- --------- ----------------
>  99.41    0.725717          84      8641           read
>   0.38    0.002768           0      8642           open
>   0.21    0.001539           0      8642           close
>   0.00    0.000000           0         1           write
>   0.00    0.000000           0         3           fstat
>   0.00    0.000000           0         8           mmap
>   0.00    0.000000           0         2           mprotect
>   0.00    0.000000           0         1           munmap
>   0.00    0.000000           0         1           brk
>   0.00    0.000000           0         1         1 access
>   0.00    0.000000           0         1           madvise
>   0.00    0.000000           0         1           execve
>   0.00    0.000000           0         1           arch_prctl
> ------ ----------- ----------- --------- --------- ----------------
> 100.00    0.730024                 25945         1 total    
> 
> On the 2.6.38.4 kernel each read takes > 6ms
> 
> dl785sles:~/wrk # cat 2.6.38.4-smp.log
> Opened, read and closed 8640 times and read total of 86235840 bytes
> % time     seconds  usecs/call     calls    errors syscall
> ------ ----------- ----------- --------- --------- ----------------
> 100.00   59.021650        6830      8641           read
>   0.00    0.000520           0      8642           open
>   0.00    0.000425           0      8642           close
>   0.00    0.000000           0         1           write
>   0.00    0.000000           0         3           fstat
>   0.00    0.000000           0         9           mmap
>   0.00    0.000000           0         2           mprotect
>   0.00    0.000000           0         1           munmap
>   0.00    0.000000           0         1           brk
>   0.00    0.000000           0         1         1 access
>   0.00    0.000000           0         1           madvise
>   0.00    0.000000           0         1           execve
>   0.00    0.000000           0         1           arch_prctl
> ------ ----------- ----------- --------- --------- ----------------
> 100.00   59.022595                 25946         1 total        
> 
> Test program 
> #include <stdio.h>
> #include <stdlib.h>
> #include <sys/types.h>
> #include <sys/stat.h>
> #include <asm-generic/fcntl.h>
> #include <sys/errno.h>
> 
> extern int errno;
> 
> int fid,i,count,total;
> char buffer[16384];
> 
> int main(int argc, char *argv[])
> {
> for (count=0;count<8640;count++) {
>         fid=open("/proc/stat",O_RDONLY,660);
>         if(fid <= 0) {
>          perror("error with file open\n");
>         }
>         i=read(fid,buffer,16384);
>         /* printf("Read %d characters \n",i); */
>         total+=i;
>         close(fid);
> }
> printf("Opened, read and closed %d times and read total of %d bytes\n",count,total);
> exit(0);
> }

Strange, here is what I have on my 2x4x2 machine.

Could you try following actions please ?

# perf record ./prog
Opened, read and closed 8640 times and read total of 29595948 bytes
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.057 MB perf.data (~2482 samples) ]

# perf report --stdio
# Events: 1K cycles
#
# Overhead  Command      Shared Object                        Symbol
# ........  .......  .................  ............................
#
    22.82%     prog  [kernel.kallsyms]  [k] format_decode
    21.15%     prog  [kernel.kallsyms]  [k] vsnprintf
    13.46%     prog  [kernel.kallsyms]  [k] number.isra.4
    11.33%     prog  [kernel.kallsyms]  [k] memcpy
     8.87%     prog  [kernel.kallsyms]  [k] kstat_irqs
     3.80%     prog  [kernel.kallsyms]  [k] radix_tree_lookup_element
     3.69%     prog  [kernel.kallsyms]  [k] seq_printf
     3.63%     prog  [kernel.kallsyms]  [k] find_next_bit
     2.81%     prog  [kernel.kallsyms]  [k] show_stat
     1.18%     prog  [kernel.kallsyms]  [k] put_dec_trunc
     0.96%     prog  [kernel.kallsyms]  [k] radix_tree_lookup
     0.96%     prog  [kernel.kallsyms]  [k] irq_to_desc
     0.66%     prog  [kernel.kallsyms]  [k] getname_flags
     0.37%     prog  [kernel.kallsyms]  [k] selinux_inode_alloc_security
     0.22%     prog  [kernel.kallsyms]  [k] jiffies_64_to_clock_t
     0.22%     prog  [kernel.kallsyms]  [k] copy_user_generic_string
     0.22%     prog  [kernel.kallsyms]  [k] __call_rcu
     0.15%     prog  [kernel.kallsyms]  [k] add_preempt_count
     0.15%     prog  [kernel.kallsyms]  [k] put_dec_full
     0.15%     prog  [kernel.kallsyms]  [k] nr_context_switches
     0.15%     prog  [kernel.kallsyms]  [k] kfree




^ permalink raw reply	[flat|nested] 16+ messages in thread

* RE: Regression in reading /proc/stat in the newer kernels with large SMP and NUMA configurations
  2011-10-14 12:38 ` Eric Dumazet
@ 2011-10-14 13:36   ` Seger, Mark
  2011-10-14 13:47     ` Eric Dumazet
  0 siblings, 1 reply; 16+ messages in thread
From: Seger, Mark @ 2011-10-14 13:36 UTC (permalink / raw)
  To: Eric Dumazet, Oberman, Laurence (HAS GSE)
  Cc: linux-kernel, Cabaniols, Sebastien


> > The HP developed collectl utility depends heavily on abstractions in
> /proc and /sys provided by the kernel for kernel statistics capture.
> > During testing on newer NUMA platforms with large numbers of CPU’S HP
> has observed a significant increase in the time taken to parse this
> data with the newer 2.6.32.x and higher kernels.

While this does indeed affect collectl and renders it ineffective at 0.1-second or lower monitoring intervals, it affects ANY utility that reads /proc/stat, and the more frequent the access the heavier the load.  Remember, the increase in load is by a factor of over 50 times, so those who like to run top or vmstat or even iostat will see a much heavier hand too.  If people ran sar at a finer monitoring interval it too would suffer.

I've also observed that reading /proc/meminfo consumes a lot more CPU, though not as much as /proc/stat.

> Strange, here is what I have on my 2x4x2 machine.

Forgive a dumb question, but what does 2X4X2 mean?  ;(

I've observed that things are fine on 2-socket systems and truly horrible on the few 8-socket machines I tested this on, which also had higher core counts, so there were always on the order of 48+ CPUs involved.  I haven't done any testing on 4-socket machines, and I don't even want to think about what could happen on 16- or 32-socket machines, or however many sockets are built in the future.

> Could you try following actions please ?
> 
> # perf record ./prog
> Opened, read and closed 8640 times and read total of 29595948 bytes
> [ perf record: Woken up 1 times to write data ]
> [ perf record: Captured and wrote 0.057 MB perf.data (~2482 samples) ]
> 
> # perf report --stdio
> # Events: 1K cycles
> #
> # Overhead  Command      Shared Object                        Symbol
> # ........  .......  .................  ............................
> #
>     22.82%     prog  [kernel.kallsyms]  [k] format_decode
>     21.15%     prog  [kernel.kallsyms]  [k] vsnprintf
>     13.46%     prog  [kernel.kallsyms]  [k] number.isra.4
>     11.33%     prog  [kernel.kallsyms]  [k] memcpy
>      8.87%     prog  [kernel.kallsyms]  [k] kstat_irqs
>      3.80%     prog  [kernel.kallsyms]  [k] radix_tree_lookup_element
>      3.69%     prog  [kernel.kallsyms]  [k] seq_printf
>      3.63%     prog  [kernel.kallsyms]  [k] find_next_bit
>      2.81%     prog  [kernel.kallsyms]  [k] show_stat
>      1.18%     prog  [kernel.kallsyms]  [k] put_dec_trunc
>      0.96%     prog  [kernel.kallsyms]  [k] radix_tree_lookup
>      0.96%     prog  [kernel.kallsyms]  [k] irq_to_desc
>      0.66%     prog  [kernel.kallsyms]  [k] getname_flags
>      0.37%     prog  [kernel.kallsyms]  [k]
> selinux_inode_alloc_security
>      0.22%     prog  [kernel.kallsyms]  [k] jiffies_64_to_clock_t
>      0.22%     prog  [kernel.kallsyms]  [k] copy_user_generic_string
>      0.22%     prog  [kernel.kallsyms]  [k] __call_rcu
>      0.15%     prog  [kernel.kallsyms]  [k] add_preempt_count
>      0.15%     prog  [kernel.kallsyms]  [k] put_dec_full
>      0.15%     prog  [kernel.kallsyms]  [k] nr_context_switches
>      0.15%     prog  [kernel.kallsyms]  [k] kfree

For me the easiest reproducer, which admittedly doesn't show any deeper analysis, is to just run "cat /proc/stat>/dev/null" in a loop and time it.
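A slightly more self-contained variant of that reproducer, sketched along the
lines of Laurence's test program but with the timing done in-process via
clock_gettime() instead of an external strace or time:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>
#include <time.h>

/* Link with -lrt on older glibc for clock_gettime(). */
int main(void)
{
	char buf[16384];
	struct timespec t0, t1;
	double total_us = 0.0;
	int i, fd, iters = 1000;

	for (i = 0; i < iters; i++) {
		clock_gettime(CLOCK_MONOTONIC, &t0);
		fd = open("/proc/stat", O_RDONLY);
		if (fd < 0) {
			perror("open /proc/stat");
			exit(1);
		}
		if (read(fd, buf, sizeof(buf)) < 0)
			perror("read /proc/stat");
		close(fd);
		clock_gettime(CLOCK_MONOTONIC, &t1);
		total_us += (t1.tv_sec - t0.tv_sec) * 1e6 +
			    (t1.tv_nsec - t0.tv_nsec) / 1e3;
	}
	printf("average %.1f us per open/read/close of /proc/stat\n",
	       total_us / iters);
	return 0;
}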

-mark

^ permalink raw reply	[flat|nested] 16+ messages in thread

* RE: Regression in reading /proc/stat in the newer kernels with large SMP and NUMA configurations
  2011-10-14 13:36   ` Seger, Mark
@ 2011-10-14 13:47     ` Eric Dumazet
  2011-10-14 13:48       ` Oberman, Laurence (HAS GSE)
  2011-10-14 16:27       ` Oberman, Laurence (HAS GSE)
  0 siblings, 2 replies; 16+ messages in thread
From: Eric Dumazet @ 2011-10-14 13:47 UTC (permalink / raw)
  To: Seger, Mark
  Cc: Oberman, Laurence (HAS GSE), linux-kernel, Cabaniols, Sebastien

Le vendredi 14 octobre 2011 à 14:36 +0100, Seger, Mark a écrit :

> For me the easiest reproducer, which admittedly doesn't show any
> deeper analysis, is to just run "cat /proc/stat>/dev/null" in a loop
> and time it.

OK, but you already provided a global number.

We would like to check where in the kernel the CPU time is consumed.
Maybe something really obvious could pop out.

"perf" is provided in kernel sources, and really is a piece of cake.

cd tools/perf ; make 



2x4x2 means: two sockets, 4 cores per physical package, 2 threads per core.




^ permalink raw reply	[flat|nested] 16+ messages in thread

* RE: Regression in reading /proc/stat in the newer kernels with large SMP and NUMA configurations
  2011-10-14 13:47     ` Eric Dumazet
@ 2011-10-14 13:48       ` Oberman, Laurence (HAS GSE)
  2011-10-14 16:27       ` Oberman, Laurence (HAS GSE)
  1 sibling, 0 replies; 16+ messages in thread
From: Oberman, Laurence (HAS GSE) @ 2011-10-14 13:48 UTC (permalink / raw)
  To: Eric Dumazet, Seger, Mark; +Cc: linux-kernel, Cabaniols, Sebastien


Hi Eric,
I am preparing for that test. I need to stage a full RHEL 6 install; I had only updated the kernel on an existing SLES 10.
Stand by and I will run the tests for you.

Thanks
Laurence

-----Original Message-----
From: Eric Dumazet [mailto:eric.dumazet@gmail.com] 
Sent: Friday, October 14, 2011 9:47 AM
To: Seger, Mark
Cc: Oberman, Laurence (HAS GSE); linux-kernel@vger.kernel.org; Cabaniols, Sebastien
Subject: RE: Regression in reading /proc/stat in the newer kernels with large SMP and NUMA configurations

Le vendredi 14 octobre 2011 à 14:36 +0100, Seger, Mark a écrit :

> For me the easiest reproducer, which admittedly doesn't show any
> deeper analysis, is to just run "cat /proc/stat>/dev/null" in a loop
> and time it.

OK, but you already provided a global number.

We would like to check where in kernel cpu time is consumed.
Maybe something really obvious could pop out.

"perf" is provided in kernel sources, and really is a piece of cake.

cd tools/perf ; make 



2x4x2 means : Two sockets, 4 cores per physical package, 2 threads per
core.




^ permalink raw reply	[flat|nested] 16+ messages in thread

* RE: Regression in reading /proc/stat in the newer kernels with large SMP and NUMA configurations
  2011-10-14 13:47     ` Eric Dumazet
  2011-10-14 13:48       ` Oberman, Laurence (HAS GSE)
@ 2011-10-14 16:27       ` Oberman, Laurence (HAS GSE)
  2011-10-14 16:40         ` Eric Dumazet
  1 sibling, 1 reply; 16+ messages in thread
From: Oberman, Laurence (HAS GSE) @ 2011-10-14 16:27 UTC (permalink / raw)
  To: Eric Dumazet, Seger, Mark; +Cc: linux-kernel, Cabaniols, Sebastien


Eric,

Got the system ready and ran the tests.

This is a DL980 with 128 CPUs and 256GB memory.

[root@linuxbeast ~]# cat /proc/cpuinfo | grep processor | wc -l
128

[root@linuxbeast ~]# uname -a
Linux linuxbeast.gse.mvlabs.corp.hp.com 2.6.32-131.6.1.el6.x86_64 #1 SMP Mon Jun 20 14:15:38 EDT 2011 x86_64 x86_64 x86_64 GNU/Linux

In this example each read takes >4 ms. We are spending most of the time in:

    35.09%        t  [kernel.kallsyms]  [k] kstat_irqs_cpu
    32.13%        t  [kernel.kallsyms]  [k] find_next_bit
    24.32%        t  [kernel.kallsyms]  [k] show_stat
     1.98%        t  [kernel.kallsyms]  [k] vsnprintf
     1.63%        t  [kernel.kallsyms]  [k] format_decode
     1.23%        t  [kernel.kallsyms]  [k] number

Thanks

[root@linuxbeast ~]# perf record ./t
Opened, read and closed 8640 times and read total of 72521754 bytes
[ perf record: Woken up 5 times to write data ]
[ perf record: Captured and wrote 1.422 MB perf.data (~62126 samples) ]
[root@linuxbeast ~]# perf report --stdio
# Events: 36K cycles
#
# Overhead  Command      Shared Object                         Symbol
# ........  .......  .................  .............................
#
    35.09%        t  [kernel.kallsyms]  [k] kstat_irqs_cpu
    32.13%        t  [kernel.kallsyms]  [k] find_next_bit
    24.32%        t  [kernel.kallsyms]  [k] show_stat
     1.98%        t  [kernel.kallsyms]  [k] vsnprintf
     1.63%        t  [kernel.kallsyms]  [k] format_decode
     1.23%        t  [kernel.kallsyms]  [k] number
     0.92%        t  [kernel.kallsyms]  [k] memcpy_c
     0.42%        t  [kernel.kallsyms]  [k] seq_printf
     0.26%        t  [kernel.kallsyms]  [k] put_dec_trunc
     0.26%        t  [kernel.kallsyms]  [k] _spin_lock
     0.16%        t  [kernel.kallsyms]  [k] put_dec
     0.15%        t  [kernel.kallsyms]  [k] arch_irq_stat_cpu
     0.14%        t  [kernel.kallsyms]  [k] scheduler_tick
     0.12%        t  [kernel.kallsyms]  [k] nr_running
     0.10%        t  [kernel.kallsyms]  [k] jiffies_64_to_clock_t
     0.08%        t  [kernel.kallsyms]  [k] dyntick_save_progress_counter
     0.06%        t  [kernel.kallsyms]  [k] nr_iowait
     0.05%        t  [kernel.kallsyms]  [k] nr_context_switches
     0.05%        t  [kernel.kallsyms]  [k] _spin_lock_irqsave
     0.04%        t  [kernel.kallsyms]  [k] rcu_process_gp_end
     0.04%        t  [kernel.kallsyms]  [k] account_system_time
     0.04%        t  [kernel.kallsyms]  [k] native_write_msr_safe
     0.04%        t  [kernel.kallsyms]  [k] memcpy
     0.03%        t  [kernel.kallsyms]  [k] update_vsyscall
     0.03%        t  [kernel.kallsyms]  [k] task_tick_fair
     0.03%        t  [kernel.kallsyms]  [k] __do_softirq
     0.02%        t  [kernel.kallsyms]  [k] apic_timer_interrupt
     0.02%        t  [kernel.kallsyms]  [k] ktime_get
     0.02%        t  [kernel.kallsyms]  [k] _spin_unlock_irqrestore
     0.02%        t  [kernel.kallsyms]  [k] check_for_new_grace_period
     0.02%        t  [kernel.kallsyms]  [k] __sched_text_end
     0.02%        t  [kernel.kallsyms]  [k] tick_do_update_jiffies64
     0.02%        t  [kernel.kallsyms]  [k] copy_user_generic_string
     0.02%        t  [kernel.kallsyms]  [k] run_timer_softirq
     0.02%        t  [kernel.kallsyms]  [k] rcu_process_callbacks
     0.02%        t  [kernel.kallsyms]  [k] __rcu_process_callbacks
     0.02%        t  [kernel.kallsyms]  [k] kfree
     0.02%        t  [kernel.kallsyms]  [k] __rcu_pending
     0.02%        t  [kernel.kallsyms]  [k] avtab_search_node
     0.01%        t  [kernel.kallsyms]  [k] rcu_process_dyntick
     0.01%        t  [kernel.kallsyms]  [k] force_quiescent_state
     0.01%        t  [kernel.kallsyms]  [k] irq_exit
     0.01%        t  [kernel.kallsyms]  [k] update_wall_time
     0.01%        t  [kernel.kallsyms]  [k] tg_shares_up
     0.01%        t  [kernel.kallsyms]  [k] acct_update_integrals
     0.01%        t  [kernel.kallsyms]  [k] __percpu_counter_add
     0.01%        t  [kernel.kallsyms]  [k] raise_softirq
     0.01%        t  [kernel.kallsyms]  [k] perf_event_do_pending
     0.01%        t  [kernel.kallsyms]  [k] update_curr
     0.01%        t  [kernel.kallsyms]  [k] tick_sched_timer
     0.01%        t  [kernel.kallsyms]  [k] avc_has_perm_noaudit
     0.01%        t  [kernel.kallsyms]  [k] run_posix_cpu_timers
     0.01%        t  [kernel.kallsyms]  [k] current_kernel_time
     0.01%        t  [kernel.kallsyms]  [k] update_cpu_load
     0.01%        t  [kernel.kallsyms]  [k] hrtimer_interrupt
     0.01%        t  [kernel.kallsyms]  [k] native_read_tsc
     0.01%        t  [kernel.kallsyms]  [k] vfs_read
     0.01%        t  [kernel.kallsyms]  [k] __call_rcu
     0.01%        t  [kernel.kallsyms]  [k] putname
     0.01%        t  [kernel.kallsyms]  [k] __d_lookup
     0.01%        t  [kernel.kallsyms]  [k] rb_erase
     0.01%        t  [kernel.kallsyms]  [k] rb_next
     0.01%        t  [kernel.kallsyms]  [k] rcu_check_callbacks
     0.01%        t  [kernel.kallsyms]  [k] inode_has_perm
     0.01%        t  [kernel.kallsyms]  [k] proc_reg_release
     0.01%        t  [kernel.kallsyms]  [k] __link_path_walk
     0.01%        t  [kernel.kallsyms]  [k] perf_event_task_tick
     0.00%        t  [kernel.kallsyms]  [k] __remove_hrtimer
     0.00%        t  [kernel.kallsyms]  [k] task_of
     0.00%        t  [kernel.kallsyms]  [k] irq_enter
     0.00%        t  [kernel.kallsyms]  [k] rb_insert_color
     0.00%        t  [kernel.kallsyms]  [k] rcu_start_gp
     0.00%        t  [kernel.kallsyms]  [k] alloc_fd
     0.00%        t  [kernel.kallsyms]  [k] path_init
     0.00%        t  [kernel.kallsyms]  [k] read_tsc
     0.00%        t  [kernel.kallsyms]  [k] rcu_irq_enter
     0.00%        t  [kernel.kallsyms]  [k] _read_lock
     0.00%        t  [kernel.kallsyms]  [k] sysret_check
     0.00%        t  [kernel.kallsyms]  [k] expand_files
     0.00%        t  [kernel.kallsyms]  [k] file_move
     0.00%        t  [kernel.kallsyms]  [k] update_process_times
     0.00%        t  [kernel.kallsyms]  [k] __kmalloc
     0.00%        t  [kernel.kallsyms]  [k] __run_hrtimer
     0.00%        t  [kernel.kallsyms]  [k] hweight64
     0.00%        t  [kernel.kallsyms]  [k] audit_syscall_entry
     0.00%        t  [kernel.kallsyms]  [k] lookup_mnt
     0.00%        t  [kernel.kallsyms]  [k] perf_pmu_disable
     0.00%        t  [kernel.kallsyms]  [k] kmem_cache_alloc
     0.00%        t  [kernel.kallsyms]  [k] getname
     0.00%        t  [kernel.kallsyms]  [k] mutex_lock
     0.00%        t  [kernel.kallsyms]  [k] restore_args
     0.00%        t  [kernel.kallsyms]  [k] x86_pmu_disable
     0.00%        t  [kernel.kallsyms]  [k] ima_counts_get
     0.00%        t  [kernel.kallsyms]  [k] call_rcu_sched
     0.00%        t  [kernel.kallsyms]  [k] rcu_implicit_dynticks_qs
     0.00%        t  [kernel.kallsyms]  [k] security_file_alloc
     0.00%        t  [kernel.kallsyms]  [k] memset_c
     0.00%        t  [kernel.kallsyms]  [k] lapic_next_event
     0.00%        t  [kernel.kallsyms]  [k] kmem_cache_free
     0.00%        t  [kernel.kallsyms]  [k] tick_program_event
     0.00%        t  [kernel.kallsyms]  [k] __tcp_select_window
     0.00%        t  [kernel.kallsyms]  [k] __dentry_open
     0.00%        t  [kernel.kallsyms]  [k] _local_bh_enable
     0.00%        t  [kernel.kallsyms]  [k] __smp_call_function_single
     0.00%        t  [kernel.kallsyms]  [k] sched_clock_tick
     0.00%        t  [kernel.kallsyms]  [k] call_softirq
     0.00%        t  [kernel.kallsyms]  [k] _spin_lock_irq
     0.00%        t  [kernel.kallsyms]  [k] _atomic_dec_and_lock
     0.00%        t  [kernel.kallsyms]  [k] hrtimer_forward
     0.00%        t  [kernel.kallsyms]  [k] account_process_tick
     0.00%        t  [kernel.kallsyms]  [k] clockevents_program_event
     0.00%        t  [kernel.kallsyms]  [k] sysret_careful
     0.00%        t  [kernel.kallsyms]  [k] x86_pmu_enable


#
# (For a higher level overview, try: perf report --sort comm,dso)
#

[root@linuxbeast ~]# strace -c -T ./t
Opened, read and closed 8640 times and read total of 72446400 bytes
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 99.99   35.195100        4073      8641           read
  0.01    0.002183           0      8642           close
  0.00    0.001260           0      8642           open
  0.00    0.000032          32         1           write
  0.00    0.000000           0         3           fstat
  0.00    0.000000           0         9           mmap
  0.00    0.000000           0         3           mprotect
  0.00    0.000000           0         1           munmap
  0.00    0.000000           0         1           brk
  0.00    0.000000           0         1         1 access
  0.00    0.000000           0         1           execve
  0.00    0.000000           0         1           arch_prctl
------ ----------- ----------- --------- --------- ----------------
100.00   35.198575                 25946         1 total
[root@linuxbeast ~]#


-----Original Message-----
From: Eric Dumazet [mailto:eric.dumazet@gmail.com] 
Sent: Friday, October 14, 2011 9:47 AM
To: Seger, Mark
Cc: Oberman, Laurence (HAS GSE); linux-kernel@vger.kernel.org; Cabaniols, Sebastien
Subject: RE: Regression in reading /proc/stat in the newer kernels with large SMP and NUMA configurations

Le vendredi 14 octobre 2011 à 14:36 +0100, Seger, Mark a écrit :

> For me the easiest reproducer, which admittedly doesn't show any
> deeper analysis, is to just run "cat /proc/stat>/dev/null" in a loop
> and time it.

OK, but you already provided a global number.

We would like to check where in kernel cpu time is consumed.
Maybe something really obvious could pop out.

"perf" is provided in kernel sources, and really is a piece of cake.

cd tools/perf ; make 



2x4x2 means : Two sockets, 4 cores per physical package, 2 threads per
core.




^ permalink raw reply	[flat|nested] 16+ messages in thread

* RE: Regression in reading /proc/stat in the newer kernels with large SMP and NUMA configurations
  2011-10-14 16:27       ` Oberman, Laurence (HAS GSE)
@ 2011-10-14 16:40         ` Eric Dumazet
  2011-10-14 16:52           ` Oberman, Laurence (HAS GSE)
  0 siblings, 1 reply; 16+ messages in thread
From: Eric Dumazet @ 2011-10-14 16:40 UTC (permalink / raw)
  To: Oberman, Laurence (HAS GSE)
  Cc: Seger, Mark, linux-kernel, Cabaniols, Sebastien

Le vendredi 14 octobre 2011 à 17:27 +0100, Oberman, Laurence (HAS GSE) a
écrit :
> Eric,
> 
> Got the system ready and ran the tests.
> 
> This is a DL980 128 CPCUS, 256GB memory.
> 
> [root@linuxbeast ~]# cat /proc/cpuinfo | grep processor | wc -l
> 128
> 
> [root@linuxbeast ~]# uname -a
> Linux linuxbeast.gse.mvlabs.corp.hp.com 2.6.32-131.6.1.el6.x86_64 #1 SMP Mon Jun 20 14:15:38 EDT 2011 x86_64 x86_64 x86_64 GNU/Linux
> 
> In this example we take >4ms per read. We are spending most of the time in 
> 
>     35.09%        t  [kernel.kallsyms]  [k] kstat_irqs_cpu
>     32.13%        t  [kernel.kallsyms]  [k] find_next_bit
>     24.32%        t  [kernel.kallsyms]  [k] show_stat
>      1.98%        t  [kernel.kallsyms]  [k] vsnprintf
>      1.63%        t  [kernel.kallsyms]  [k] format_decode
>      1.23%        t  [kernel.kallsyms]  [k] number

Hmm, have you compiled your kernel with NR_CPUS=4096 ???


^ permalink raw reply	[flat|nested] 16+ messages in thread

* RE: Regression in reading /proc/stat in the newer kernels with large SMP and NUMA configurations
  2011-10-14 16:40         ` Eric Dumazet
@ 2011-10-14 16:52           ` Oberman, Laurence (HAS GSE)
  2011-10-14 16:59             ` Eric Dumazet
  0 siblings, 1 reply; 16+ messages in thread
From: Oberman, Laurence (HAS GSE) @ 2011-10-14 16:52 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Seger, Mark, linux-kernel, Cabaniols, Sebastien


This is the standard RHEL kernel build, which is what the customer would be using.

[root@linuxbeast boot]# cat config-2.6.32-131.6.1.el6.x86_64 | grep NR
CONFIG_NR_CPUS=4096

-----Original Message-----
From: Eric Dumazet [mailto:eric.dumazet@gmail.com] 
Sent: Friday, October 14, 2011 12:40 PM
To: Oberman, Laurence (HAS GSE)
Cc: Seger, Mark; linux-kernel@vger.kernel.org; Cabaniols, Sebastien
Subject: RE: Regression in reading /proc/stat in the newer kernels with large SMP and NUMA configurations

Le vendredi 14 octobre 2011 à 17:27 +0100, Oberman, Laurence (HAS GSE) a
écrit :
> Eric,
> 
> Got the system ready and ran the tests.
> 
> This is a DL980 128 CPCUS, 256GB memory.
> 
> [root@linuxbeast ~]# cat /proc/cpuinfo | grep processor | wc -l
> 128
> 
> [root@linuxbeast ~]# uname -a
> Linux linuxbeast.gse.mvlabs.corp.hp.com 2.6.32-131.6.1.el6.x86_64 #1 SMP Mon Jun 20 14:15:38 EDT 2011 x86_64 x86_64 x86_64 GNU/Linux
> 
> In this example we take >4ms per read. We are spending most of the time in 
> 
>     35.09%        t  [kernel.kallsyms]  [k] kstat_irqs_cpu
>     32.13%        t  [kernel.kallsyms]  [k] find_next_bit
>     24.32%        t  [kernel.kallsyms]  [k] show_stat
>      1.98%        t  [kernel.kallsyms]  [k] vsnprintf
>      1.63%        t  [kernel.kallsyms]  [k] format_decode
>      1.23%        t  [kernel.kallsyms]  [k] number

Hmm, have you compiled your kernel with NR_CPUS=4096 ???


^ permalink raw reply	[flat|nested] 16+ messages in thread

* RE: Regression in reading /proc/stat in the newer kernels with large SMP and NUMA configurations
  2011-10-14 16:52           ` Oberman, Laurence (HAS GSE)
@ 2011-10-14 16:59             ` Eric Dumazet
  2011-10-14 17:02               ` Oberman, Laurence (HAS GSE)
  0 siblings, 1 reply; 16+ messages in thread
From: Eric Dumazet @ 2011-10-14 16:59 UTC (permalink / raw)
  To: Oberman, Laurence (HAS GSE)
  Cc: Seger, Mark, linux-kernel, Cabaniols, Sebastien

Le vendredi 14 octobre 2011 à 17:52 +0100, Oberman, Laurence (HAS GSE) a
écrit :
> Standard RHEL kernel build.
> What the customer would be using.
> 
> [root@linuxbeast boot]# cat config-2.6.32-131.6.1.el6.x86_64 | grep NR
> CONFIG_NR_CPUS=4096

Then something must be wrong... what are your nr_cpu_ids /
nr_cpumask_bits values?


dmesg | grep nr_cpu_ids




^ permalink raw reply	[flat|nested] 16+ messages in thread

* RE: Regression in reading /proc/stat in the newer kernels with large SMP and NUMA configurations
  2011-10-14 16:59             ` Eric Dumazet
@ 2011-10-14 17:02               ` Oberman, Laurence (HAS GSE)
  2011-10-17 18:05                 ` Seger, Mark
  0 siblings, 1 reply; 16+ messages in thread
From: Oberman, Laurence (HAS GSE) @ 2011-10-14 17:02 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Seger, Mark, linux-kernel, Cabaniols, Sebastien


Hi Eric,

Just FYI, the same behavior is seen on multiple hardware platforms, so it is definitely due to kernel changes. The earlier kernel on the DL785 platform was returning reads within 60 microseconds.
I have measured it on multiple platforms.

In this case we have a DL980 with HT enabled, so you will see 128 CPUs. If I shut off HT, it does not make a difference.

[root@linuxbeast boot]# dmesg | grep nr_cpu_ids
NR_CPUS:4096 nr_cpumask_bits:128 nr_cpu_ids:128 nr_node_ids:8

-----Original Message-----
From: Eric Dumazet [mailto:eric.dumazet@gmail.com] 
Sent: Friday, October 14, 2011 12:59 PM
To: Oberman, Laurence (HAS GSE)
Cc: Seger, Mark; linux-kernel@vger.kernel.org; Cabaniols, Sebastien
Subject: RE: Regression in reading /proc/stat in the newer kernels with large SMP and NUMA configurations

Le vendredi 14 octobre 2011 à 17:52 +0100, Oberman, Laurence (HAS GSE) a
écrit :
> Standard RHEL kernel build.
> What the customer would be using.
> 
> [root@linuxbeast boot]# cat config-2.6.32-131.6.1.el6.x86_64 | grep NR
> CONFIG_NR_CPUS=4096

Then something must be wrong.... what are your nr_cpu_ids /
nr_cpumask_bits values ?


dmesg | grep nr_cpu_ids




^ permalink raw reply	[flat|nested] 16+ messages in thread

* RE: Regression in reading /proc/stat in the newer kernels with large SMP and NUMA configurations
  2011-10-14 17:02               ` Oberman, Laurence (HAS GSE)
@ 2011-10-17 18:05                 ` Seger, Mark
  2011-10-17 19:24                   ` Eric Dumazet
  0 siblings, 1 reply; 16+ messages in thread
From: Seger, Mark @ 2011-10-17 18:05 UTC (permalink / raw)
  To: Oberman, Laurence (HAS GSE), Eric Dumazet
  Cc: linux-kernel, Cabaniols, Sebastien


>Just FYI, the same behavior is seen on multiple hardware platforms so it
>is definitely kernel changes. The earlier kernel on the DL785 platform
>was returning reads within 60 microseconds.
>I have measured it on multiple platforms.
>
>In this case we have a DL980 with HT enabled so you will see 128 CPUS.
>If I shut off HT, it does not make a difference.

Given that there have only been a couple of responses to this, I can't help but wonder if this hasn't really gotten anyone's attention yet, or if perhaps I'm just overreacting.  I'd claim this breaks key Linux utilities by giving them a significantly heavier footprint than they used to have.  Historically people would always run top whenever they wanted to, knowing it had minimal impact on the system, and now I'm not so sure that will be the case anymore, at least not on big NUMA boxes.

I can assure you that if someone wants to report CPU stats every tenth of a second it will most definitely have an impact.

-mark


^ permalink raw reply	[flat|nested] 16+ messages in thread

* RE: Regression in reading /proc/stat in the newer kernels with large SMP and NUMA configurations
  2011-10-17 18:05                 ` Seger, Mark
@ 2011-10-17 19:24                   ` Eric Dumazet
  2011-12-05 21:15                     ` Laurence Oberman
  0 siblings, 1 reply; 16+ messages in thread
From: Eric Dumazet @ 2011-10-17 19:24 UTC (permalink / raw)
  To: Seger, Mark
  Cc: Oberman, Laurence (HAS GSE), linux-kernel, Cabaniols, Sebastien

Le lundi 17 octobre 2011 à 19:05 +0100, Seger, Mark a écrit :
> >Just FYI, the same behavior is seen on multiple hardware platforms so it
> >is definitely kernel changes. The earlier kernel on the DL785 platform
> >was returning reads within 60 microseconds.
> >I have measured it on multiple platforms.
> >
> >In this case we have a DL980 with HT enabled so you will see 128 CPUS.
> >If I shut off HT, it does not make a difference.
> 
> Given that there have only been a couple of responses to this I can't
> help but wonder if this hasn't really gotten anyone's attention yet or
> if perhaps I'm just overreacting.  I'd claim this breaks key Linux
> utilities by making them have a significantly heavier footprint than
> they used to have.  Historically people would always run top whenever
> they wanted to knowing it had minimal impact on the system and now I'm
> not so sure that will be the case anymore, at least not on big numa
> boxes.
> 
> I can assure you if someone wants to report cpu stats every tenth of a
> second it will more definitely have an impact.

Maybe, but adding an increment of a sum field adds one instruction to the
interrupt path. If you have one reader of /proc/stat every second, but
ten million interrupts per second, it might be better as is.

Reading /proc/stat is a slow path on most workloads, while handling an
interrupt is definitely a fast path on all workloads.

The percpu variable makes sure no cache line bouncing occurs in the
kstat_irqs management.

kstat_irqs is needed to provide /proc/interrupts : per cpu counts per
IRQ.

It should be OK to add another field in struct irq_desc
and increment it in kstat_incr_irqs_this_cpu(), because we already dirty
at least the desc->lock cache line there.

This field can be a plain integer, changed under desc->lock protection.

 include/linux/irqdesc.h     |    2 ++
 include/linux/kernel_stat.h |    2 ++
 kernel/irq/irqdesc.c        |   10 +++-------
 3 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/include/linux/irqdesc.h b/include/linux/irqdesc.h
index 150134a..5b250d0 100644
--- a/include/linux/irqdesc.h
+++ b/include/linux/irqdesc.h
@@ -24,6 +24,7 @@ struct timer_rand_state;
  * @depth:		disable-depth, for nested irq_disable() calls
  * @wake_depth:		enable depth, for multiple irq_set_irq_wake() callers
  * @irq_count:		stats field to detect stalled irqs
+ * @stat_irq:		irq stats (sum for all cpus)
  * @last_unhandled:	aging timer for unhandled count
  * @irqs_unhandled:	stats field for spurious unhandled interrupts
  * @lock:		locking for SMP
@@ -50,6 +51,7 @@ struct irq_desc {
 	unsigned int		depth;		/* nested irq disables */
 	unsigned int		wake_depth;	/* nested wake enables */
 	unsigned int		irq_count;	/* For detecting broken IRQs */
+	unsigned int		stat_irq;
 	unsigned long		last_unhandled;	/* Aging timer for unhandled count */
 	unsigned int		irqs_unhandled;
 	raw_spinlock_t		lock;
diff --git a/include/linux/kernel_stat.h b/include/linux/kernel_stat.h
index 0cce2db..56af674 100644
--- a/include/linux/kernel_stat.h
+++ b/include/linux/kernel_stat.h
@@ -54,6 +54,7 @@ static inline void kstat_incr_irqs_this_cpu(unsigned int irq,
 {
 	__this_cpu_inc(kstat.irqs[irq]);
 	__this_cpu_inc(kstat.irqs_sum);
+	desc->stat_irq++;
 }
 
 static inline unsigned int kstat_irqs_cpu(unsigned int irq, int cpu)
@@ -68,6 +69,7 @@ extern unsigned int kstat_irqs_cpu(unsigned int irq, int cpu);
 do {							\
 	__this_cpu_inc(*(DESC)->kstat_irqs);		\
 	__this_cpu_inc(kstat.irqs_sum);			\
+	(DESC)->stat_irq++;				\
 } while (0)
 
 #endif
diff --git a/kernel/irq/irqdesc.c b/kernel/irq/irqdesc.c
index 039b889..938159b 100644
--- a/kernel/irq/irqdesc.c
+++ b/kernel/irq/irqdesc.c
@@ -467,13 +467,9 @@ unsigned int kstat_irqs_cpu(unsigned int irq, int cpu)
 
 unsigned int kstat_irqs(unsigned int irq)
 {
-	struct irq_desc *desc = irq_to_desc(irq);
-	int cpu;
-	int sum = 0;
+	const struct irq_desc *desc = irq_to_desc(irq);
 
-	if (!desc || !desc->kstat_irqs)
+	if (!desc)
 		return 0;
-	for_each_possible_cpu(cpu)
-		sum += *per_cpu_ptr(desc->kstat_irqs, cpu);
-	return sum;
+	return desc->stat_irq;
 }



^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: Regression in reading /proc/stat in the newer kernels with large SMP and NUMA configurations
  2011-10-17 19:24                   ` Eric Dumazet
@ 2011-12-05 21:15                     ` Laurence Oberman
  2011-12-05 21:38                       ` Laurence Oberman
  0 siblings, 1 reply; 16+ messages in thread
From: Laurence Oberman @ 2011-12-05 21:15 UTC (permalink / raw)
  To: linux-kernel


Eric Dumazet <eric.dumazet <at> gmail.com> writes:


Hi Eric,

I finally managed to get some time to test the patch. I am testing on the
generic kernel.org 2.6.39 tree and will update you as soon as possible.

Thanks
Laurence





^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Regression in reading /proc/stat in the newer kernels with large SMP and NUMA configurations
  2011-12-05 21:15                     ` Laurence Oberman
@ 2011-12-05 21:38                       ` Laurence Oberman
  0 siblings, 0 replies; 16+ messages in thread
From: Laurence Oberman @ 2011-12-05 21:38 UTC (permalink / raw)
  To: linux-kernel

Laurence Oberman <laurence.oberman <at> hp.com> writes:

> 
> 
> Eric Dumazet <eric.dumazet <at> gmail.com> writes:
> 
> Hi Eric,
> 
> I finally managed to get some time to test the patch. I am testing on the
> generic kernel.org 2.6.39 tree and will update you as soon as possible.
> 
> Thanks
> Laurence
> 
> 

Hello,
Results are in, and the patch makes a huge difference:

We are now getting read times in microseconds again.


[root@o184i026 lobe]# strace -c ./t
Opened, read and closed 8640 times and read total of 69819840 bytes
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 99.86    0.964056         112      8641           read
  0.08    0.000786           0      8642           open
  0.06    0.000554           0      8642           close
  0.00    0.000034           4         8           mmap
  0.00    0.000000           0         1           write
  0.00    0.000000           0         3           fstat
  0.00    0.000000           0         3           mprotect
  0.00    0.000000           0         1           munmap
  0.00    0.000000           0         1           brk
  0.00    0.000000           0         1           ioctl
  0.00    0.000000           0         1         1 access
  0.00    0.000000           0         1           execve
  0.00    0.000000           0         1           arch_prctl
------ ----------- ----------- --------- --------- ----------------
100.00    0.965430                 25946         1 total  

What do you need me to do to get your patch accepted?

Thanks
Laurence



^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2011-12-05 21:38 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-10-13 23:40 Regression in reading /proc/stat in the newer kernels with large SMP and NUMA configurations Oberman, Laurence (HAS GSE)
2011-10-14 10:08 ` David Rientjes
2011-10-14 10:35   ` Eric Dumazet
2011-10-14 12:38 ` Eric Dumazet
2011-10-14 13:36   ` Seger, Mark
2011-10-14 13:47     ` Eric Dumazet
2011-10-14 13:48       ` Oberman, Laurence (HAS GSE)
2011-10-14 16:27       ` Oberman, Laurence (HAS GSE)
2011-10-14 16:40         ` Eric Dumazet
2011-10-14 16:52           ` Oberman, Laurence (HAS GSE)
2011-10-14 16:59             ` Eric Dumazet
2011-10-14 17:02               ` Oberman, Laurence (HAS GSE)
2011-10-17 18:05                 ` Seger, Mark
2011-10-17 19:24                   ` Eric Dumazet
2011-12-05 21:15                     ` Laurence Oberman
2011-12-05 21:38                       ` Laurence Oberman
