* Significant performance difference for postgres w/o sched_autogroup
From: Andres Freund @ 2015-07-08 15:45 UTC
  To: linux-kernel, Mike Galbraith; +Cc: Ingo Molnar, Peter Zijlstra

Hi,

while debugging a performance issue I noticed the following very odd
bit.

Kernel: 4.1.0-rc7-andres-00049-ge64f638
CPU: 2xE5520

Workload:

postgresql (multi-process via shared memory SQL server) and benchmark
client (pgbench, multi threaded) running on the same server. Connected
using unix sockets.  The statements are relatively simple (~1.5ms on
average), forcing frequent back/forth between server processes and
pgbench threads.
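
For intuition: the scheduling pattern here is essentially two sets of
tasks ping-ponging small messages over unix sockets, so every send wakes
up the peer. A stripped-down Python analogue of that back-and-forth (not
the actual benchmark) looks roughly like this:

import os
import socket
import time

ITERATIONS = 100000

# ping-pong over an AF_UNIX stream socketpair: only one message is ever
# in flight, so each send has to wake up the sleeping peer
client_sock, server_sock = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)

if os.fork() == 0:
    # child plays the "server": echo a small reply to every request
    client_sock.close()
    for _ in range(ITERATIONS):
        server_sock.recv(64)
        server_sock.send(b"pong")
    os._exit(0)

# parent plays the "client": issue a request, block until the reply arrives
server_sock.close()
start = time.time()
for _ in range(ITERATIONS):
    client_sock.send(b"ping")
    client_sock.recv(64)
os.wait()
print("%.0f round trips/sec" % (ITERATIONS / (time.time() - start)))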

I found that disabling sched_autogroup *significantly* reduces
throughput. Even when both server and client are started from the same
terminal and thus should be in the same group!
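
For reference: the feature is toggled via
/proc/sys/kernel/sched_autogroup_enabled, and which autogroup a task
belongs to is visible in /proc/<pid>/autogroup. A minimal Python sketch
of that check (the pids are placeholders, not the real ones):

def autogroup_of(pid):
    # /proc/<pid>/autogroup looks like "/autogroup-123 nice 0"
    with open("/proc/%d/autogroup" % pid) as f:
        return f.read().split()[0]

with open("/proc/sys/kernel/sched_autogroup_enabled") as f:
    print("sched_autogroup_enabled = " + f.read().strip())

postgres_pid, pgbench_pid = 12345, 12346  # placeholder pids
print(autogroup_of(postgres_pid), autogroup_of(pgbench_pid))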

There's also a significant difference in %sys time with autogroup
enabled vs. disabled: ~8% vs. ~27%. That seems like too much.

The difference in postgres performance is significant:
enabled:
progress: 1.0 s, 25857.2 tps, lat 1.171 ms stddev 3.314
progress: 2.0 s, 19582.5 tps, lat 1.586 ms stddev 5.349
progress: 3.0 s, 17796.3 tps, lat 1.804 ms stddev 7.364
progress: 4.0 s, 18736.7 tps, lat 1.696 ms stddev 7.612
progress: 5.0 s, 19269.9 tps, lat 1.640 ms stddev 6.835
progress: 6.0 s, 21389.6 tps, lat 1.467 ms stddev 5.663
progress: 7.0 s, 22153.8 tps, lat 1.417 ms stddev 8.417
progress: 8.0 s, 21550.7 tps, lat 1.487 ms stddev 7.460
progress: 9.0 s, 22602.3 tps, lat 1.401 ms stddev 6.738
progress: 10.0 s, 23475.5 tps, lat 1.327 ms stddev 7.111
tps = 21329.219141 (excluding connections establishing)

disabled:
progress: 1.0 s, 25092.5 tps, lat 1.255 ms stddev 0.511
progress: 2.0 s, 13322.0 tps, lat 2.377 ms stddev 5.435
progress: 3.0 s, 13903.0 tps, lat 2.298 ms stddev 5.075
progress: 4.0 s, 13734.0 tps, lat 2.328 ms stddev 5.146
progress: 5.0 s, 14015.0 tps, lat 2.281 ms stddev 5.099
progress: 6.0 s, 14034.0 tps, lat 2.277 ms stddev 5.002
progress: 7.0 s, 13930.0 tps, lat 2.297 ms stddev 5.136
progress: 8.0 s, 14015.0 tps, lat 2.279 ms stddev 5.161
progress: 9.0 s, 14000.0 tps, lat 2.287 ms stddev 5.117
progress: 10.0 s, 13900.6 tps, lat 2.299 ms stddev 5.142
tps = 15006.317841 (excluding connections establishing)

(Note the odd dip after the first second; that's what I was actually debugging.)

A perf diff of two profiles recorded with -g -a -e cycles:k shows:
# Baseline    Delta  Shared Object        Symbol
# ........  .......  ...................  ..............................................
#
    11.70%   +3.87%  [kernel.vmlinux]     [k] unix_stream_recvmsg
     6.74%   +2.49%  [kernel.vmlinux]     [k] __fget
     6.69%   +4.31%  [kernel.vmlinux]     [k] sys_recvfrom
     6.49%   +1.77%  [kernel.vmlinux]     [k] system_call_after_swapgs
     5.37%   +1.39%  [kernel.vmlinux]     [k] _raw_spin_lock
     4.63%   +1.76%  [kernel.vmlinux]     [k] mutex_lock_interruptible
     4.27%   +1.27%  [kernel.vmlinux]     [k] mutex_unlock
     3.43%   +1.22%  [kernel.vmlinux]     [k] fput
     3.41%   -0.21%  [kernel.vmlinux]     [k] __fget_light
     2.96%   +1.33%  libpthread-2.19.so   [.] __libc_recv
     2.56%   +2.35%  [kernel.vmlinux]     [k] system_call_fastpath
     2.36%   +0.80%  [kernel.vmlinux]     [k] import_single_range
     2.30%   +0.82%  [kernel.vmlinux]     [k] sockfd_lookup_light

Looking a bit into the profile shows an apparent difference in how often
recv goes through the fastpath:

enabled:
-    6.55%  pgbench          [kernel.vmlinux]     [k] sys_recvfrom
   - sys_recvfrom
      + 85.88% system_call_fastpath
      + 14.12% __libc_recv

disabled:
-   10.97%  pgbench          [kernel.vmlinux]    [k] sys_recvfrom
   - sys_recvfrom
      + 74.16% system_call_fastpath
      + 25.84% __libc_recv

Is that actually expected or is there a bug somewhere? The difference
seems to be a bit large to me.

If that would be helpful, I can try to script the workload in a
reproducible manner.
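
A rough sketch of what such a script might look like (scale factor,
client/thread counts and the database name are assumptions, not the
settings used for the numbers above):

import subprocess

DB = "pgbench"  # assumed database name

def set_autogroup(enabled):
    # needs root; same effect as the kernel.sched_autogroup_enabled sysctl
    with open("/proc/sys/kernel/sched_autogroup_enabled", "w") as f:
        f.write("1\n" if enabled else "0\n")

# one-time initialisation of the pgbench tables, scale factor assumed
subprocess.check_call(["pgbench", "-i", "-s", "100", DB])

for enabled in (True, False):
    set_autogroup(enabled)
    print("autogroup", "enabled" if enabled else "disabled")
    # readonly (select-only) run: 16 clients / 16 threads, 10 s,
    # progress reported every second
    subprocess.check_call(["pgbench", "-S", "-c", "16", "-j", "16",
                           "-T", "10", "-P", "1", DB])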

Regards,

Andres


* Re: Significant performance difference for postgres w/o sched_autogroup
From: Mike Galbraith @ 2015-07-09  2:45 UTC
  To: Andres Freund; +Cc: linux-kernel, Ingo Molnar, Peter Zijlstra

On Wed, 2015-07-08 at 17:45 +0200, Andres Freund wrote:

> Workload:
> 
> postgresql (multi-process via shared memory SQL server) and benchmark
> client (pgbench, multi threaded) running on the same server. Connected
> using unix sockets.  The statements are relatively simple (~1.5ms on
> average), forcing frequent back/forth between server processes and
> pgbench threads.
> 
> I found that disabling sched_autogroup *significantly* reduces
> throughput. Even when both server and client are started from the same
> terminal and thus should be in the same group!
> 
> There's also a significant difference in %sys time with autogroup
> enabled vs. disabled: ~8% vs. ~27%. That seems like too much.

Seems reasonable to me.  1(tg)/2(tgs) > 1(task)/N(tasks): throughput is
what the server can sustain on its given budget, and a larger budget
means less client blockage, and thus less %sys.

	-Mike



* Re: Significant performance difference for postgres w/o sched_autogroup
From: Andres Freund @ 2015-07-09  9:28 UTC
  To: Mike Galbraith; +Cc: linux-kernel, Ingo Molnar, Peter Zijlstra

On 2015-07-09 04:45:38 +0200, Mike Galbraith wrote:
> On Wed, 2015-07-08 at 17:45 +0200, Andres Freund wrote:
> > Workload:
> > 
> > postgresql (multi-process via shared memory SQL server) and benchmark
> > client (pgbench, multi threaded) running on the same server. Connected
> > using unix sockets.  The statements are relatively simple (~1.5ms on
> > average), forcing frequent back/forth between server processes and
> > pgbench threads.
> > 
> > I found that disabling sched_autogroup *significantly* reduces
> > throughput. Even when both server and client are started from the same
> > terminal and thus should be in the same group!
> > 
> > There's also a significant difference in %sys time with autogroup
> > enabled vs. disabled: ~8% vs. ~27%. That seems like too much.
> 
> Seems reasonable to me.  1(tg)/2(tgs) > 1(task)/N(tasks): throughput is
> what the server can sustain on its given budget, and a larger budget
> means less client blockage, and thus less %sys.

That'd make some sense if there were other stuff going on - but here the
same total budget in both cases leads to a 40% difference in throughput:

autogroup on: tps = 21329.219141; autogroup off: tps = 15006.317841

There's really nothing for the kernel or other tasks to do in the
background. It's a readonly workload, not requiring lots of memory, ...


* Re: Significant performance difference for postgres w/o sched_autogroup
From: Mike Galbraith @ 2015-07-09 10:35 UTC
  To: Andres Freund; +Cc: linux-kernel, Ingo Molnar, Peter Zijlstra

On Thu, 2015-07-09 at 11:28 +0200, Andres Freund wrote:

> That'd make some sense if there were other stuff going on - but here the
> same total budget in both cases leads to a 40% difference in throughput:

When I suspect that what I'm hearing is horse pookey, I instrument the
source and ask my silicon horse pookey detector :)

	-Mike


