* 5.4.20 - high load - lots of incoming data - small data read.
@ 2020-04-06  8:04 Jesper Krogh
  2020-04-06  8:53 ` Yan, Zheng
                   ` (2 more replies)
  0 siblings, 3 replies; 21+ messages in thread
From: Jesper Krogh @ 2020-04-06  8:04 UTC (permalink / raw)
  To: ceph-devel

This is a CephFS client - its only purpose is to run the "filedaemon" of
Bacula and transport data from CephFS to the tape library - the two threads
below are essentially doing something equivalent to

find /cephfs/ -type f | xargs cat | nc server

2 threads only, yet the load explodes and "net read vs net write" differs
by more than 100x (roughly 60MB/s received vs ~230kB/s sent in the dstat
output below).

Can anyone explain this as "normal" behaviour?
The server is a VM with 16 vCPUs and 16GB of memory running under libvirt/qemu.

jk@wombat:~$ w
 07:50:33 up 11:25,  1 user,  load average: 206.43, 76.23, 50.58
USER     TTY      FROM             LOGIN@   IDLE   JCPU   PCPU WHAT
jk       pts/0    10.194.133.42    06:54    0.00s  0.05s  0.00s w
jk@wombat:~$ dstat -ar
--total-cpu-usage-- -dsk/total- -net/total- ---paging-- ---system-- --io/total-
usr sys idl wai stl| read  writ| recv  send|  in   out | int   csw | read  writ
  0   0  98   1   0|  14k   34k|   0     0 |   3B   27B| 481   294 |0.55  0.73
  1   1   0  98   0|   0     0 |  60M  220k|   0     0 |6402  6182 |   0     0
  0   1   0  98   0|   0     0 |  69M  255k|   0     0 |7305  4339 |   0     0
  1   2   0  98   0|   0     0 |  76M  282k|   0     0 |7914  4886 |   0     0
  1   1   0  99   0|   0     0 |  70M  260k|   0     0 |7293  4444 |   0     0
  1   1   0  98   0|   0     0 |  80M  278k|   0     0 |8018  4931 |   0     0
  0   1   0  98   0|   0     0 |  60M  221k|   0     0 |6435  5951 |   0     0
  0   1   0  99   0|   0     0 |  59M  211k|   0     0 |6163  3584 |   0     0
  0   1   0  98   0|   0     0 |  64M  323k|   0     0 |6653  3881 |   0     0
  1   0   0  99   0|   0     0 |  61M  243k|   0     0 |6822  4401 |   0     0
  0   1   0  99   0|   0     0 |  55M  205k|   0     0 |5975  3518 |   0     0
  1   1   0  98   0|   0     0 |  68M  242k|   0     0 |7094  6544 |   0     0
  0   1   0  99   0|   0     0 |  58M  230k|   0     0 |6639  4178 |   0     0
  1   2   0  98   0|   0     0 |  61M  243k|   0     0 |7117  4477 |   0     0
  0   1   0  99   0|   0     0 |  61M  228k|   0     0 |6500  4078 |   0     0
  0   1   0  99   0|   0     0 |  65M  234k|   0     0 |6595  3914 |   0     0
  0   1   0  98   0|   0     0 |  64M  219k|   0     0 |6507  5755 |   0     0
  1   1   0  99   0|   0     0 |  64M  233k|   0     0 |6869  4153 |   0     0
  1   2   0  98   0|   0     0 |  63M  232k|   0     0 |6632  3907 |   0     0 ^C
jk@wombat:~$ w
 07:50:56 up 11:25,  1 user,  load average: 221.35, 88.07, 55.02
USER     TTY      FROM             LOGIN@   IDLE   JCPU   PCPU WHAT
jk       pts/0    10.194.133.42    06:54    0.00s  0.05s  0.00s w
jk@wombat:~$

Thanks.


* Re: 5.4.20 - high load - lots of incoming data - small data read.
  2020-04-06  8:04 5.4.20 - high load - lots of incoming data - small data read Jesper Krogh
@ 2020-04-06  8:53 ` Yan, Zheng
  2020-04-06  9:09   ` Jesper Krogh
  2020-04-06 11:45 ` Jeff Layton
  2020-04-09 17:50 ` Jesper Krogh
  2 siblings, 1 reply; 21+ messages in thread
From: Yan, Zheng @ 2020-04-06  8:53 UTC (permalink / raw)
  To: Jesper Krogh; +Cc: ceph-devel

On Mon, Apr 6, 2020 at 4:06 PM Jesper Krogh <jesper.krogh@gmail.com> wrote:
>
> This is a CephFS client - its only purpose is to run the "filedaemon" of bacula
> and transport data to the tape-library from CephFS - below 2 threads is
> essentially doing something equivalent to
>
> find /cephfs/ -type f | xargs cat | nc server
>
> 2 threads only, load exploding and the "net read vs net write" has
> more than 100x difference.
>
> Can anyone explain this as "normal" behaviour?
> Server is a  VM with 16 "vCPU" and 16GB memory running libvirt/qemu
>
> jk@wombat:~$ w
>  07:50:33 up 11:25,  1 user,  load average: 206.43, 76.23, 50.58
> USER     TTY      FROM             LOGIN@   IDLE   JCPU   PCPU WHAT
> jk       pts/0    10.194.133.42    06:54    0.00s  0.05s  0.00s w
> jk@wombat:~$ dstat -ar
> --total-cpu-usage-- -dsk/total- -net/total- ---paging-- ---system-- --io/total-
> usr sys idl wai stl| read  writ| recv  send|  in   out | int   csw | read  writ
>   0   0  98   1   0|  14k   34k|   0     0 |   3B   27B| 481   294 |0.55  0.73
>   1   1   0  98   0|   0     0 |  60M  220k|   0     0 |6402  6182 |   0     0
>   0   1   0  98   0|   0     0 |  69M  255k|   0     0 |7305  4339 |   0     0
>   1   2   0  98   0|   0     0 |  76M  282k|   0     0 |7914  4886 |   0     0
>   1   1   0  99   0|   0     0 |  70M  260k|   0     0 |7293  4444 |   0     0
>   1   1   0  98   0|   0     0 |  80M  278k|   0     0 |8018  4931 |   0     0
>   0   1   0  98   0|   0     0 |  60M  221k|   0     0 |6435  5951 |   0     0
>   0   1   0  99   0|   0     0 |  59M  211k|   0     0 |6163  3584 |   0     0
>   0   1   0  98   0|   0     0 |  64M  323k|   0     0 |6653  3881 |   0     0
>   1   0   0  99   0|   0     0 |  61M  243k|   0     0 |6822  4401 |   0     0
>   0   1   0  99   0|   0     0 |  55M  205k|   0     0 |5975  3518 |   0     0
>   1   1   0  98   0|   0     0 |  68M  242k|   0     0 |7094  6544 |   0     0
>   0   1   0  99   0|   0     0 |  58M  230k|   0     0 |6639  4178 |   0     0
>   1   2   0  98   0|   0     0 |  61M  243k|   0     0 |7117  4477 |   0     0
>   0   1   0  99   0|   0     0 |  61M  228k|   0     0 |6500  4078 |   0     0
>   0   1   0  99   0|   0     0 |  65M  234k|   0     0 |6595  3914 |   0     0
>   0   1   0  98   0|   0     0 |  64M  219k|   0     0 |6507  5755 |   0     0
>   1   1   0  99   0|   0     0 |  64M  233k|   0     0 |6869  4153 |   0     0
>   1   2   0  98   0|   0     0 |  63M  232k|   0     0 |6632  3907 |
> 0     0 ^C
> jk@wombat:~$ w
>  07:50:56 up 11:25,  1 user,  load average: 221.35, 88.07, 55.02
> USER     TTY      FROM             LOGIN@   IDLE   JCPU   PCPU WHAT
> jk       pts/0    10.194.133.42    06:54    0.00s  0.05s  0.00s w
> jk@wombat:~$
>
> Thanks.

How small are these files?


* Re: 5.4.20 - high load - lots of incoming data - small data read.
  2020-04-06  8:53 ` Yan, Zheng
@ 2020-04-06  9:09   ` Jesper Krogh
  2020-04-06  9:18     ` Jesper Krogh
  2020-04-06 11:49     ` Jeff Layton
  0 siblings, 2 replies; 21+ messages in thread
From: Jesper Krogh @ 2020-04-06  9:09 UTC (permalink / raw)
  To: Yan, Zheng; +Cc: ceph-devel

home-directory style - median is about 3KB - but varies greatly.

I also get this:
[41204.865818] INFO: task kworker/u16:102:21903 blocked for more than
120 seconds.
[41204.865955]       Not tainted 5.4.30 #5
[41204.866006] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[41204.866056] kworker/u16:102 D    0 21903      2 0x80004000
[41204.866119] Workqueue: ceph-inode ceph_inode_work [ceph]
[41204.866120] Call Trace:
[41204.866156]  __schedule+0x45f/0x710
[41204.866162]  ? xas_store+0x391/0x5f0
[41204.866164]  schedule+0x3e/0xa0
[41204.866166]  io_schedule+0x16/0x40
[41204.866180]  __lock_page+0x12a/0x1d0
[41204.866182]  ? file_fdatawait_range+0x30/0x30
[41204.866187]  truncate_inode_pages_range+0x52c/0x980
[41204.866191]  ? syscall_return_via_sysret+0x12/0x7f
[41204.866197]  ? drop_inode_snap_realm+0x98/0xa0 [ceph]
[41204.866207]  ? fsnotify_grab_connector+0x4d/0x90
[41204.866209]  truncate_inode_pages_final+0x4c/0x60
[41204.866214]  ceph_evict_inode+0x2d/0x210 [ceph]
[41204.866219]  evict+0xca/0x1a0
[41204.866221]  iput+0x1ba/0x210
[41204.866225]  ceph_inode_work+0x40/0x270 [ceph]
[41204.866232]  process_one_work+0x167/0x400
[41204.866233]  worker_thread+0x4d/0x460
[41204.866236]  kthread+0x105/0x140
[41204.866237]  ? rescuer_thread+0x370/0x370
[41204.866239]  ? kthread_destroy_worker+0x50/0x50
[41204.866240]  ret_from_fork+0x35/0x40

On Mon, Apr 6, 2020 at 10:53 AM Yan, Zheng <ukernel@gmail.com> wrote:
>
> On Mon, Apr 6, 2020 at 4:06 PM Jesper Krogh <jesper.krogh@gmail.com> wrote:
> >
> > This is a CephFS client - its only purpose is to run the "filedaemon" of bacula
> > and transport data to the tape-library from CephFS - below 2 threads is
> > essentially doing something equivalent to
> >
> > find /cephfs/ -type f | xargs cat | nc server
> >
> > 2 threads only, load exploding and the "net read vs net write" has
> > more than 100x difference.
> >
> > Can anyone explain this as "normal" behaviour?
> > Server is a  VM with 16 "vCPU" and 16GB memory running libvirt/qemu
> >
> > jk@wombat:~$ w
> >  07:50:33 up 11:25,  1 user,  load average: 206.43, 76.23, 50.58
> > USER     TTY      FROM             LOGIN@   IDLE   JCPU   PCPU WHAT
> > jk       pts/0    10.194.133.42    06:54    0.00s  0.05s  0.00s w
> > jk@wombat:~$ dstat -ar
> > --total-cpu-usage-- -dsk/total- -net/total- ---paging-- ---system-- --io/total-
> > usr sys idl wai stl| read  writ| recv  send|  in   out | int   csw | read  writ
> >   0   0  98   1   0|  14k   34k|   0     0 |   3B   27B| 481   294 |0.55  0.73
> >   1   1   0  98   0|   0     0 |  60M  220k|   0     0 |6402  6182 |   0     0
> >   0   1   0  98   0|   0     0 |  69M  255k|   0     0 |7305  4339 |   0     0
> >   1   2   0  98   0|   0     0 |  76M  282k|   0     0 |7914  4886 |   0     0
> >   1   1   0  99   0|   0     0 |  70M  260k|   0     0 |7293  4444 |   0     0
> >   1   1   0  98   0|   0     0 |  80M  278k|   0     0 |8018  4931 |   0     0
> >   0   1   0  98   0|   0     0 |  60M  221k|   0     0 |6435  5951 |   0     0
> >   0   1   0  99   0|   0     0 |  59M  211k|   0     0 |6163  3584 |   0     0
> >   0   1   0  98   0|   0     0 |  64M  323k|   0     0 |6653  3881 |   0     0
> >   1   0   0  99   0|   0     0 |  61M  243k|   0     0 |6822  4401 |   0     0
> >   0   1   0  99   0|   0     0 |  55M  205k|   0     0 |5975  3518 |   0     0
> >   1   1   0  98   0|   0     0 |  68M  242k|   0     0 |7094  6544 |   0     0
> >   0   1   0  99   0|   0     0 |  58M  230k|   0     0 |6639  4178 |   0     0
> >   1   2   0  98   0|   0     0 |  61M  243k|   0     0 |7117  4477 |   0     0
> >   0   1   0  99   0|   0     0 |  61M  228k|   0     0 |6500  4078 |   0     0
> >   0   1   0  99   0|   0     0 |  65M  234k|   0     0 |6595  3914 |   0     0
> >   0   1   0  98   0|   0     0 |  64M  219k|   0     0 |6507  5755 |   0     0
> >   1   1   0  99   0|   0     0 |  64M  233k|   0     0 |6869  4153 |   0     0
> >   1   2   0  98   0|   0     0 |  63M  232k|   0     0 |6632  3907 |
> > 0     0 ^C
> > jk@wombat:~$ w
> >  07:50:56 up 11:25,  1 user,  load average: 221.35, 88.07, 55.02
> > USER     TTY      FROM             LOGIN@   IDLE   JCPU   PCPU WHAT
> > jk       pts/0    10.194.133.42    06:54    0.00s  0.05s  0.00s w
> > jk@wombat:~$
> >
> > Thanks.
>
> how small these files are?


* Re: 5.4.20 - high load - lots of incoming data - small data read.
  2020-04-06  9:09   ` Jesper Krogh
@ 2020-04-06  9:18     ` Jesper Krogh
  2020-04-06 10:23       ` Jesper Krogh
  2020-04-06 11:49     ` Jeff Layton
  1 sibling, 1 reply; 21+ messages in thread
From: Jesper Krogh @ 2020-04-06  9:18 UTC (permalink / raw)
  To: Yan, Zheng; +Cc: ceph-devel

And it looks like I can reproduce this pattern - and when it has been
stuck for a "sufficient" amount of time, the client gets blacklisted by
the MDS/OSDs, and a related issue is that I cannot get the mountpoint
back without a reboot on the kernel-client side.
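
One rough way to confirm the blacklisting (and clear it once the client has
been rebooted) is via the admin CLI - just a sketch, assuming a
Nautilus-era cluster where this is still called "blacklist", and with
<addr:port/nonce> as a placeholder for whatever entry the first command shows:

# list currently blacklisted client addresses
ceph osd blacklist ls
# remove a specific entry once the client is known to be dead/rebooted
ceph osd blacklist rm <addr:port/nonce>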

On Mon, Apr 6, 2020 at 11:09 AM Jesper Krogh <jesper.krogh@gmail.com> wrote:
>
> home-directory style - median is about 3KB - but varies greatly.
>
> I also get this:
> [41204.865818] INFO: task kworker/u16:102:21903 blocked for more than
> 120 seconds.
> [41204.865955]       Not tainted 5.4.30 #5
> [41204.866006] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
> [41204.866056] kworker/u16:102 D    0 21903      2 0x80004000
> [41204.866119] Workqueue: ceph-inode ceph_inode_work [ceph]
> [41204.866120] Call Trace:
> [41204.866156]  __schedule+0x45f/0x710
> [41204.866162]  ? xas_store+0x391/0x5f0
> [41204.866164]  schedule+0x3e/0xa0
> [41204.866166]  io_schedule+0x16/0x40
> [41204.866180]  __lock_page+0x12a/0x1d0
> [41204.866182]  ? file_fdatawait_range+0x30/0x30
> [41204.866187]  truncate_inode_pages_range+0x52c/0x980
> [41204.866191]  ? syscall_return_via_sysret+0x12/0x7f
> [41204.866197]  ? drop_inode_snap_realm+0x98/0xa0 [ceph]
> [41204.866207]  ? fsnotify_grab_connector+0x4d/0x90
> [41204.866209]  truncate_inode_pages_final+0x4c/0x60
> [41204.866214]  ceph_evict_inode+0x2d/0x210 [ceph]
> [41204.866219]  evict+0xca/0x1a0
> [41204.866221]  iput+0x1ba/0x210
> [41204.866225]  ceph_inode_work+0x40/0x270 [ceph]
> [41204.866232]  process_one_work+0x167/0x400
> [41204.866233]  worker_thread+0x4d/0x460
> [41204.866236]  kthread+0x105/0x140
> [41204.866237]  ? rescuer_thread+0x370/0x370
> [41204.866239]  ? kthread_destroy_worker+0x50/0x50
> [41204.866240]  ret_from_fork+0x35/0x40
>
> On Mon, Apr 6, 2020 at 10:53 AM Yan, Zheng <ukernel@gmail.com> wrote:
> >
> > On Mon, Apr 6, 2020 at 4:06 PM Jesper Krogh <jesper.krogh@gmail.com> wrote:
> > >
> > > This is a CephFS client - its only purpose is to run the "filedaemon" of bacula
> > > and transport data to the tape-library from CephFS - below 2 threads is
> > > essentially doing something equivalent to
> > >
> > > find /cephfs/ -type f | xargs cat | nc server
> > >
> > > 2 threads only, load exploding and the "net read vs net write" has
> > > more than 100x difference.
> > >
> > > Can anyone explain this as "normal" behaviour?
> > > Server is a  VM with 16 "vCPU" and 16GB memory running libvirt/qemu
> > >
> > > jk@wombat:~$ w
> > >  07:50:33 up 11:25,  1 user,  load average: 206.43, 76.23, 50.58
> > > USER     TTY      FROM             LOGIN@   IDLE   JCPU   PCPU WHAT
> > > jk       pts/0    10.194.133.42    06:54    0.00s  0.05s  0.00s w
> > > jk@wombat:~$ dstat -ar
> > > --total-cpu-usage-- -dsk/total- -net/total- ---paging-- ---system-- --io/total-
> > > usr sys idl wai stl| read  writ| recv  send|  in   out | int   csw | read  writ
> > >   0   0  98   1   0|  14k   34k|   0     0 |   3B   27B| 481   294 |0.55  0.73
> > >   1   1   0  98   0|   0     0 |  60M  220k|   0     0 |6402  6182 |   0     0
> > >   0   1   0  98   0|   0     0 |  69M  255k|   0     0 |7305  4339 |   0     0
> > >   1   2   0  98   0|   0     0 |  76M  282k|   0     0 |7914  4886 |   0     0
> > >   1   1   0  99   0|   0     0 |  70M  260k|   0     0 |7293  4444 |   0     0
> > >   1   1   0  98   0|   0     0 |  80M  278k|   0     0 |8018  4931 |   0     0
> > >   0   1   0  98   0|   0     0 |  60M  221k|   0     0 |6435  5951 |   0     0
> > >   0   1   0  99   0|   0     0 |  59M  211k|   0     0 |6163  3584 |   0     0
> > >   0   1   0  98   0|   0     0 |  64M  323k|   0     0 |6653  3881 |   0     0
> > >   1   0   0  99   0|   0     0 |  61M  243k|   0     0 |6822  4401 |   0     0
> > >   0   1   0  99   0|   0     0 |  55M  205k|   0     0 |5975  3518 |   0     0
> > >   1   1   0  98   0|   0     0 |  68M  242k|   0     0 |7094  6544 |   0     0
> > >   0   1   0  99   0|   0     0 |  58M  230k|   0     0 |6639  4178 |   0     0
> > >   1   2   0  98   0|   0     0 |  61M  243k|   0     0 |7117  4477 |   0     0
> > >   0   1   0  99   0|   0     0 |  61M  228k|   0     0 |6500  4078 |   0     0
> > >   0   1   0  99   0|   0     0 |  65M  234k|   0     0 |6595  3914 |   0     0
> > >   0   1   0  98   0|   0     0 |  64M  219k|   0     0 |6507  5755 |   0     0
> > >   1   1   0  99   0|   0     0 |  64M  233k|   0     0 |6869  4153 |   0     0
> > >   1   2   0  98   0|   0     0 |  63M  232k|   0     0 |6632  3907 |
> > > 0     0 ^C
> > > jk@wombat:~$ w
> > >  07:50:56 up 11:25,  1 user,  load average: 221.35, 88.07, 55.02
> > > USER     TTY      FROM             LOGIN@   IDLE   JCPU   PCPU WHAT
> > > jk       pts/0    10.194.133.42    06:54    0.00s  0.05s  0.00s w
> > > jk@wombat:~$
> > >
> > > Thanks.
> >
> > how small these files are?


* Re: 5.4.20 - high load - lots of incoming data - small data read.
  2020-04-06  9:18     ` Jesper Krogh
@ 2020-04-06 10:23       ` Jesper Krogh
  2020-04-06 11:58         ` Jeff Layton
  0 siblings, 1 reply; 21+ messages in thread
From: Jesper Krogh @ 2020-04-06 10:23 UTC (permalink / raw)
  To: Yan, Zheng; +Cc: ceph-devel

It also emits this one - quite a bit later (not sure that's relevant):

[43364.349446] kworker/3:6: page allocation failure: order:0,
mode:0xa20(GFP_ATOMIC), nodemask=(null),cpuset=/,mems_allowed=0
[43364.349459] CPU: 3 PID: 23433 Comm: kworker/3:6 Not tainted 5.4.30 #5
[43364.349460] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[43364.349515] Workqueue: ceph-msgr ceph_con_workfn [libceph]
[43364.349516] Call Trace:
[43364.349523]  <IRQ>
[43364.349594]  dump_stack+0x6d/0x95
[43364.349614]  warn_alloc+0x10c/0x170
[43364.349616]  __alloc_pages_slowpath+0xe6c/0xef0
[43364.349634]  ? ip_local_deliver_finish+0x48/0x50
[43364.349635]  ? ip_local_deliver+0x6f/0xe0
[43364.349640]  ? tcp_v4_early_demux+0x11c/0x150
[43364.349641]  __alloc_pages_nodemask+0x2f3/0x360
[43364.349643]  page_frag_alloc+0x129/0x150
[43364.349654]  __napi_alloc_skb+0x86/0xd0
[43364.349670]  page_to_skb+0x67/0x350 [virtio_net]
[43364.349672]  receive_buf+0xe47/0x16c0 [virtio_net]
[43364.349675]  virtnet_poll+0xf2/0x364 [virtio_net]
[43364.349679]  net_rx_action+0x265/0x3e0
[43364.349691]  __do_softirq+0xf9/0x2aa
[43364.349701]  irq_exit+0xae/0xb0
[43364.349705]  do_IRQ+0x59/0xe0
[43364.349706]  common_interrupt+0xf/0xf
[43364.349707]  </IRQ>
[43364.349713] RIP: 0010:kvm_clock_get_cycles+0xc/0x20
[43364.349715] Code: c3 48 c1 e1 06 31 c0 48 81 c1 00 10 60 84 49 89
0c 10 eb be 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 65 48 8b 3d e4
8f 7a 7d <e8> ff 0f 00 00 5d c3 0f 1f 00 66 2e 0f 1f 84 00 00 00 00 00
55 48
[43364.349716] RSP: 0018:ffffc20941f4bab8 EFLAGS: 00000246 ORIG_RAX:
ffffffffffffffdb
[43364.349717] RAX: ffffffff8286e040 RBX: 0000277062955ccd RCX: 000000000000056a
[43364.349735] RDX: 0000000000000000 RSI: 0000000000001000 RDI: ffffffff846010c0
[43364.349750] RBP: ffffc20941f4bab8 R08: 0000000000008f3e R09: 000000000000097a
[43364.349751] R10: 0000000000000000 R11: 0000000000001000 R12: 0000000000000000
[43364.349751] R13: 00000000004a63a2 R14: ffff9c1ac83665b4 R15: 0000000000000000
[43364.349753]  ? kvmclock_setup_percpu+0x80/0x80
[43364.349760]  ktime_get+0x3e/0xa0
[43364.349763]  tcp_mstamp_refresh+0x12/0x40
[43364.349764]  tcp_rcv_space_adjust+0x22/0x1d0
[43364.349766]  tcp_recvmsg+0x28b/0xbc0
[43364.349777]  ? aa_sk_perm+0x43/0x190
[43364.349781]  inet_recvmsg+0x64/0xf0
[43364.349786]  sock_recvmsg+0x66/0x70
[43364.349791]  ceph_tcp_recvpage+0x79/0xb0 [libceph]
[43364.349796]  read_partial_message+0x3c3/0x7c0 [libceph]
[43364.349801]  ceph_con_workfn+0xa6a/0x23d0 [libceph]
[43364.349809]  process_one_work+0x167/0x400
[43364.349810]  worker_thread+0x4d/0x460
[43364.349814]  kthread+0x105/0x140
[43364.349815]  ? rescuer_thread+0x370/0x370
[43364.349816]  ? kthread_destroy_worker+0x50/0x50
[43364.349817]  ret_from_fork+0x35/0x40
[43364.349845] Mem-Info:
[43364.349849] active_anon:10675 inactive_anon:12484 isolated_anon:0
                active_file:14532 inactive_file:3900471 isolated_file:8
                unevictable:0 dirty:1 writeback:0 unstable:0
                slab_reclaimable:68339 slab_unreclaimable:36220
                mapped:11089 shmem:608 pagetables:1093 bounce:0
                free:25551 free_pcp:3443 free_cma:0
[43364.349851] Node 0 active_anon:42700kB inactive_anon:49936kB
active_file:58128kB inactive_file:15601884kB unevictable:0kB
isolated(anon):0kB isolated(file):32kB mapped:44356kB dirty:4kB
writeback:0kB shmem:2432kB shmem_thp: 0kB shmem_pmdmapped: 0kB
anon_thp: 0kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
[43364.349852] Node 0 DMA free:15908kB min:64kB low:80kB high:96kB
active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB
unevictable:0kB writepending:0kB present:15992kB managed:15908kB
mlocked:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB
local_pcp:0kB free_cma:0kB
[43364.349857] lowmem_reserve[]: 0 3444 15930 15930 15930
[43364.349859] Node 0 DMA32 free:55240kB min:14596kB low:18244kB
high:21892kB active_anon:1556kB inactive_anon:188kB active_file:48kB
inactive_file:3430152kB unevictable:0kB writepending:0kB
present:3653608kB managed:3588072kB mlocked:0kB kernel_stack:0kB
pagetables:0kB bounce:0kB free_pcp:3568kB local_pcp:248kB free_cma:0kB
[43364.349861] lowmem_reserve[]: 0 0 12485 12485 12485
[43364.349862] Node 0 Normal free:31056kB min:83636kB low:96864kB
high:110092kB active_anon:41144kB inactive_anon:49748kB
active_file:58080kB inactive_file:12171732kB unevictable:0kB
writepending:4kB present:13107200kB managed:12793540kB mlocked:0kB
kernel_stack:7472kB pagetables:4372kB bounce:0kB free_pcp:10204kB
local_pcp:1456kB free_cma:0kB
[43364.349864] lowmem_reserve[]: 0 0 0 0 0
[43364.349865] Node 0 DMA: 1*4kB (U) 0*8kB 0*16kB 1*32kB (U) 2*64kB
(U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB
(M) = 15908kB
[43364.349877] Node 0 DMA32: 22*4kB (EH) 18*8kB (UMEH) 0*16kB 13*32kB
(UME) 17*64kB (UM) 18*128kB (UMEH) 14*256kB (UEH) 5*512kB (UE)
24*1024kB (UM) 10*2048kB (M) 0*4096kB = 55240kB
[43364.349894] Node 0 Normal: 94*4kB (UMEH) 47*8kB (MEH) 60*16kB
(UMEH) 38*32kB (UMEH) 30*64kB (MEH) 21*128kB (MEH) 37*256kB (UM)
28*512kB (M) 0*1024kB 0*2048kB 0*4096kB = 31344kB
[43364.349920] Node 0 hugepages_total=0 hugepages_free=0
hugepages_surp=0 hugepages_size=1048576kB
[43364.349927] Node 0 hugepages_total=0 hugepages_free=0
hugepages_surp=0 hugepages_size=2048kB
[43364.349928] 3915698 total pagecache pages
[43364.349930] 79 pages in swap cache
[43364.349931] Swap cache stats: add 393, delete 314, find 38/80
[43364.349932] Free swap  = 4036080kB
[43364.349932] Total swap = 4038652kB
[43364.349933] 4194200 pages RAM
[43364.349933] 0 pages HighMem/MovableOnly
[43364.349934] 94820 pages reserved
[43364.349934] 0 pages cma reserved
[43364.349934] 0 pages hwpoisoned

On Mon, Apr 6, 2020 at 11:18 AM Jesper Krogh <jesper.krogh@gmail.com> wrote:
>
> And it looks like I can reproduce this pattern - and when it has been
> stuck for "sufficient" amount of time - then it gets blacklisted by
> the MDS/OSD and (a related) issue is that I cannot get the mountpoint
> back without a reboot on the kernel-client side.
>
> On Mon, Apr 6, 2020 at 11:09 AM Jesper Krogh <jesper.krogh@gmail.com> wrote:
> >
> > home-directory style - median is about 3KB - but varies greatly.
> >
> > I also get this:
> > [41204.865818] INFO: task kworker/u16:102:21903 blocked for more than
> > 120 seconds.
> > [41204.865955]       Not tainted 5.4.30 #5
> > [41204.866006] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> > disables this message.
> > [41204.866056] kworker/u16:102 D    0 21903      2 0x80004000
> > [41204.866119] Workqueue: ceph-inode ceph_inode_work [ceph]
> > [41204.866120] Call Trace:
> > [41204.866156]  __schedule+0x45f/0x710
> > [41204.866162]  ? xas_store+0x391/0x5f0
> > [41204.866164]  schedule+0x3e/0xa0
> > [41204.866166]  io_schedule+0x16/0x40
> > [41204.866180]  __lock_page+0x12a/0x1d0
> > [41204.866182]  ? file_fdatawait_range+0x30/0x30
> > [41204.866187]  truncate_inode_pages_range+0x52c/0x980
> > [41204.866191]  ? syscall_return_via_sysret+0x12/0x7f
> > [41204.866197]  ? drop_inode_snap_realm+0x98/0xa0 [ceph]
> > [41204.866207]  ? fsnotify_grab_connector+0x4d/0x90
> > [41204.866209]  truncate_inode_pages_final+0x4c/0x60
> > [41204.866214]  ceph_evict_inode+0x2d/0x210 [ceph]
> > [41204.866219]  evict+0xca/0x1a0
> > [41204.866221]  iput+0x1ba/0x210
> > [41204.866225]  ceph_inode_work+0x40/0x270 [ceph]
> > [41204.866232]  process_one_work+0x167/0x400
> > [41204.866233]  worker_thread+0x4d/0x460
> > [41204.866236]  kthread+0x105/0x140
> > [41204.866237]  ? rescuer_thread+0x370/0x370
> > [41204.866239]  ? kthread_destroy_worker+0x50/0x50
> > [41204.866240]  ret_from_fork+0x35/0x40
> >
> > On Mon, Apr 6, 2020 at 10:53 AM Yan, Zheng <ukernel@gmail.com> wrote:
> > >
> > > On Mon, Apr 6, 2020 at 4:06 PM Jesper Krogh <jesper.krogh@gmail.com> wrote:
> > > >
> > > > This is a CephFS client - its only purpose is to run the "filedaemon" of bacula
> > > > and transport data to the tape-library from CephFS - below 2 threads is
> > > > essentially doing something equivalent to
> > > >
> > > > find /cephfs/ -type f | xargs cat | nc server
> > > >
> > > > 2 threads only, load exploding and the "net read vs net write" has
> > > > more than 100x difference.
> > > >
> > > > Can anyone explain this as "normal" behaviour?
> > > > Server is a  VM with 16 "vCPU" and 16GB memory running libvirt/qemu
> > > >
> > > > jk@wombat:~$ w
> > > >  07:50:33 up 11:25,  1 user,  load average: 206.43, 76.23, 50.58
> > > > USER     TTY      FROM             LOGIN@   IDLE   JCPU   PCPU WHAT
> > > > jk       pts/0    10.194.133.42    06:54    0.00s  0.05s  0.00s w
> > > > jk@wombat:~$ dstat -ar
> > > > --total-cpu-usage-- -dsk/total- -net/total- ---paging-- ---system-- --io/total-
> > > > usr sys idl wai stl| read  writ| recv  send|  in   out | int   csw | read  writ
> > > >   0   0  98   1   0|  14k   34k|   0     0 |   3B   27B| 481   294 |0.55  0.73
> > > >   1   1   0  98   0|   0     0 |  60M  220k|   0     0 |6402  6182 |   0     0
> > > >   0   1   0  98   0|   0     0 |  69M  255k|   0     0 |7305  4339 |   0     0
> > > >   1   2   0  98   0|   0     0 |  76M  282k|   0     0 |7914  4886 |   0     0
> > > >   1   1   0  99   0|   0     0 |  70M  260k|   0     0 |7293  4444 |   0     0
> > > >   1   1   0  98   0|   0     0 |  80M  278k|   0     0 |8018  4931 |   0     0
> > > >   0   1   0  98   0|   0     0 |  60M  221k|   0     0 |6435  5951 |   0     0
> > > >   0   1   0  99   0|   0     0 |  59M  211k|   0     0 |6163  3584 |   0     0
> > > >   0   1   0  98   0|   0     0 |  64M  323k|   0     0 |6653  3881 |   0     0
> > > >   1   0   0  99   0|   0     0 |  61M  243k|   0     0 |6822  4401 |   0     0
> > > >   0   1   0  99   0|   0     0 |  55M  205k|   0     0 |5975  3518 |   0     0
> > > >   1   1   0  98   0|   0     0 |  68M  242k|   0     0 |7094  6544 |   0     0
> > > >   0   1   0  99   0|   0     0 |  58M  230k|   0     0 |6639  4178 |   0     0
> > > >   1   2   0  98   0|   0     0 |  61M  243k|   0     0 |7117  4477 |   0     0
> > > >   0   1   0  99   0|   0     0 |  61M  228k|   0     0 |6500  4078 |   0     0
> > > >   0   1   0  99   0|   0     0 |  65M  234k|   0     0 |6595  3914 |   0     0
> > > >   0   1   0  98   0|   0     0 |  64M  219k|   0     0 |6507  5755 |   0     0
> > > >   1   1   0  99   0|   0     0 |  64M  233k|   0     0 |6869  4153 |   0     0
> > > >   1   2   0  98   0|   0     0 |  63M  232k|   0     0 |6632  3907 |
> > > > 0     0 ^C
> > > > jk@wombat:~$ w
> > > >  07:50:56 up 11:25,  1 user,  load average: 221.35, 88.07, 55.02
> > > > USER     TTY      FROM             LOGIN@   IDLE   JCPU   PCPU WHAT
> > > > jk       pts/0    10.194.133.42    06:54    0.00s  0.05s  0.00s w
> > > > jk@wombat:~$
> > > >
> > > > Thanks.
> > >
> > > how small these files are?


* Re: 5.4.20 - high load - lots of incoming data - small data read.
  2020-04-06  8:04 5.4.20 - high load - lots of incoming data - small data read Jesper Krogh
  2020-04-06  8:53 ` Yan, Zheng
@ 2020-04-06 11:45 ` Jeff Layton
  2020-04-06 13:17   ` Jesper Krogh
  2020-04-09 17:50 ` Jesper Krogh
  2 siblings, 1 reply; 21+ messages in thread
From: Jeff Layton @ 2020-04-06 11:45 UTC (permalink / raw)
  To: Jesper Krogh, ceph-devel

On Mon, 2020-04-06 at 10:04 +0200, Jesper Krogh wrote:
> This is a CephFS client - its only purpose is to run the "filedaemon" of bacula
> and transport data to the tape-library from CephFS - below 2 threads is
> essentially doing something equivalent to
> 
> find /cephfs/ -type f | xargs cat | nc server
> 
> 2 threads only, load exploding and the "net read vs net write" has
> more than 100x difference.
> 

Makes sense. You're basically just reading in all of the data on this
cephfs, so the receive is going to be much larger than the send.

> Can anyone explain this as "normal" behaviour?
> Server is a  VM with 16 "vCPU" and 16GB memory running libvirt/qemu
> 
> jk@wombat:~$ w
>  07:50:33 up 11:25,  1 user,  load average: 206.43, 76.23, 50.58
> USER     TTY      FROM             LOGIN@   IDLE   JCPU   PCPU WHAT
> jk       pts/0    10.194.133.42    06:54    0.00s  0.05s  0.00s w
> jk@wombat:~$ dstat -ar
> --total-cpu-usage-- -dsk/total- -net/total- ---paging-- ---system-- --io/total-
> usr sys idl wai stl| read  writ| recv  send|  in   out | int   csw | read  writ
>   0   0  98   1   0|  14k   34k|   0     0 |   3B   27B| 481   294 |0.55  0.73
>   1   1   0  98   0|   0     0 |  60M  220k|   0     0 |6402  6182 |   0     0
>   0   1   0  98   0|   0     0 |  69M  255k|   0     0 |7305  4339 |   0     0
>   1   2   0  98   0|   0     0 |  76M  282k|   0     0 |7914  4886 |   0     0
>   1   1   0  99   0|   0     0 |  70M  260k|   0     0 |7293  4444 |   0     0
>   1   1   0  98   0|   0     0 |  80M  278k|   0     0 |8018  4931 |   0     0
>   0   1   0  98   0|   0     0 |  60M  221k|   0     0 |6435  5951 |   0     0
>   0   1   0  99   0|   0     0 |  59M  211k|   0     0 |6163  3584 |   0     0
>   0   1   0  98   0|   0     0 |  64M  323k|   0     0 |6653  3881 |   0     0
>   1   0   0  99   0|   0     0 |  61M  243k|   0     0 |6822  4401 |   0     0
>   0   1   0  99   0|   0     0 |  55M  205k|   0     0 |5975  3518 |   0     0
>   1   1   0  98   0|   0     0 |  68M  242k|   0     0 |7094  6544 |   0     0
>   0   1   0  99   0|   0     0 |  58M  230k|   0     0 |6639  4178 |   0     0
>   1   2   0  98   0|   0     0 |  61M  243k|   0     0 |7117  4477 |   0     0
>   0   1   0  99   0|   0     0 |  61M  228k|   0     0 |6500  4078 |   0     0
>   0   1   0  99   0|   0     0 |  65M  234k|   0     0 |6595  3914 |   0     0
>   0   1   0  98   0|   0     0 |  64M  219k|   0     0 |6507  5755 |   0     0
>   1   1   0  99   0|   0     0 |  64M  233k|   0     0 |6869  4153 |   0     0
>   1   2   0  98   0|   0     0 |  63M  232k|   0     0 |6632  3907 |
> 0     0 ^C

Load average is high, but it looks like it's all just waiting on I/O. 

> jk@wombat:~$ w
>  07:50:56 up 11:25,  1 user,  load average: 221.35, 88.07, 55.02
> USER     TTY      FROM             LOGIN@   IDLE   JCPU   PCPU WHAT
> jk       pts/0    10.194.133.42    06:54    0.00s  0.05s  0.00s w
> jk@wombat:~$
> 
-- 
Jeff Layton <jlayton@kernel.org>


* Re: 5.4.20 - high load - lots of incoming data - small data read.
  2020-04-06  9:09   ` Jesper Krogh
  2020-04-06  9:18     ` Jesper Krogh
@ 2020-04-06 11:49     ` Jeff Layton
  1 sibling, 0 replies; 21+ messages in thread
From: Jeff Layton @ 2020-04-06 11:49 UTC (permalink / raw)
  To: Jesper Krogh, Yan, Zheng; +Cc: ceph-devel

On Mon, 2020-04-06 at 11:09 +0200, Jesper Krogh wrote:
> home-directory style - median is about 3KB - but varies greatly.
> 
> I also get this:
> [41204.865818] INFO: task kworker/u16:102:21903 blocked for more than
> 120 seconds.
> [41204.865955]       Not tainted 5.4.30 #5
> [41204.866006] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
> [41204.866056] kworker/u16:102 D    0 21903      2 0x80004000
> [41204.866119] Workqueue: ceph-inode ceph_inode_work [ceph]
> [41204.866120] Call Trace:
> [41204.866156]  __schedule+0x45f/0x710
> [41204.866162]  ? xas_store+0x391/0x5f0
> [41204.866164]  schedule+0x3e/0xa0
> [41204.866166]  io_schedule+0x16/0x40
> [41204.866180]  __lock_page+0x12a/0x1d0
> [41204.866182]  ? file_fdatawait_range+0x30/0x30
> [41204.866187]  truncate_inode_pages_range+0x52c/0x980
> [41204.866191]  ? syscall_return_via_sysret+0x12/0x7f
> [41204.866197]  ? drop_inode_snap_realm+0x98/0xa0 [ceph]
> [41204.866207]  ? fsnotify_grab_connector+0x4d/0x90
> [41204.866209]  truncate_inode_pages_final+0x4c/0x60
> [41204.866214]  ceph_evict_inode+0x2d/0x210 [ceph]
> [41204.866219]  evict+0xca/0x1a0
> [41204.866221]  iput+0x1ba/0x210
> [41204.866225]  ceph_inode_work+0x40/0x270 [ceph]
> [41204.866232]  process_one_work+0x167/0x400
> [41204.866233]  worker_thread+0x4d/0x460
> [41204.866236]  kthread+0x105/0x140
> [41204.866237]  ? rescuer_thread+0x370/0x370
> [41204.866239]  ? kthread_destroy_worker+0x50/0x50
> [41204.866240]  ret_from_fork+0x35/0x40
> 

This task is attempting to evict an inode from the cache, but one of its
pagecache pages is locked (probably due to some other active user of
it). Note that just above the __lock_page call is the call to
io_schedule. That means that the task will be put to sleep while waiting
for the page lock, and that sleep will be counted as I/O wait (which
contributes to the load average). 
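
A quick way to see which tasks are inflating the load average in this way is
to look for processes in uninterruptible sleep ("D" state) - a rough sketch,
not specific to ceph:

# list D-state tasks with the kernel function they are waiting in
ps -eo stat,pid,wchan:32,comm | grep '^D'
# just count them
ps -eo stat | grep -c '^D'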

> On Mon, Apr 6, 2020 at 10:53 AM Yan, Zheng <ukernel@gmail.com> wrote:
> > On Mon, Apr 6, 2020 at 4:06 PM Jesper Krogh <jesper.krogh@gmail.com> wrote:
> > > This is a CephFS client - its only purpose is to run the "filedaemon" of bacula
> > > and transport data to the tape-library from CephFS - below 2 threads is
> > > essentially doing something equivalent to
> > > 
> > > find /cephfs/ -type f | xargs cat | nc server
> > > 
> > > 2 threads only, load exploding and the "net read vs net write" has
> > > more than 100x difference.
> > > 
> > > Can anyone explain this as "normal" behaviour?
> > > Server is a  VM with 16 "vCPU" and 16GB memory running libvirt/qemu
> > > 
> > > jk@wombat:~$ w
> > >  07:50:33 up 11:25,  1 user,  load average: 206.43, 76.23, 50.58
> > > USER     TTY      FROM             LOGIN@   IDLE   JCPU   PCPU WHAT
> > > jk       pts/0    10.194.133.42    06:54    0.00s  0.05s  0.00s w
> > > jk@wombat:~$ dstat -ar
> > > --total-cpu-usage-- -dsk/total- -net/total- ---paging-- ---system-- --io/total-
> > > usr sys idl wai stl| read  writ| recv  send|  in   out | int   csw | read  writ
> > >   0   0  98   1   0|  14k   34k|   0     0 |   3B   27B| 481   294 |0.55  0.73
> > >   1   1   0  98   0|   0     0 |  60M  220k|   0     0 |6402  6182 |   0     0
> > >   0   1   0  98   0|   0     0 |  69M  255k|   0     0 |7305  4339 |   0     0
> > >   1   2   0  98   0|   0     0 |  76M  282k|   0     0 |7914  4886 |   0     0
> > >   1   1   0  99   0|   0     0 |  70M  260k|   0     0 |7293  4444 |   0     0
> > >   1   1   0  98   0|   0     0 |  80M  278k|   0     0 |8018  4931 |   0     0
> > >   0   1   0  98   0|   0     0 |  60M  221k|   0     0 |6435  5951 |   0     0
> > >   0   1   0  99   0|   0     0 |  59M  211k|   0     0 |6163  3584 |   0     0
> > >   0   1   0  98   0|   0     0 |  64M  323k|   0     0 |6653  3881 |   0     0
> > >   1   0   0  99   0|   0     0 |  61M  243k|   0     0 |6822  4401 |   0     0
> > >   0   1   0  99   0|   0     0 |  55M  205k|   0     0 |5975  3518 |   0     0
> > >   1   1   0  98   0|   0     0 |  68M  242k|   0     0 |7094  6544 |   0     0
> > >   0   1   0  99   0|   0     0 |  58M  230k|   0     0 |6639  4178 |   0     0
> > >   1   2   0  98   0|   0     0 |  61M  243k|   0     0 |7117  4477 |   0     0
> > >   0   1   0  99   0|   0     0 |  61M  228k|   0     0 |6500  4078 |   0     0
> > >   0   1   0  99   0|   0     0 |  65M  234k|   0     0 |6595  3914 |   0     0
> > >   0   1   0  98   0|   0     0 |  64M  219k|   0     0 |6507  5755 |   0     0
> > >   1   1   0  99   0|   0     0 |  64M  233k|   0     0 |6869  4153 |   0     0
> > >   1   2   0  98   0|   0     0 |  63M  232k|   0     0 |6632  3907 |
> > > 0     0 ^C
> > > jk@wombat:~$ w
> > >  07:50:56 up 11:25,  1 user,  load average: 221.35, 88.07, 55.02
> > > USER     TTY      FROM             LOGIN@   IDLE   JCPU   PCPU WHAT
> > > jk       pts/0    10.194.133.42    06:54    0.00s  0.05s  0.00s w
> > > jk@wombat:~$
> > > 
> > > Thanks.
> > 
> > how small these files are?

-- 
Jeff Layton <jlayton@kernel.org>


* Re: 5.4.20 - high load - lots of incoming data - small data read.
  2020-04-06 10:23       ` Jesper Krogh
@ 2020-04-06 11:58         ` Jeff Layton
  0 siblings, 0 replies; 21+ messages in thread
From: Jeff Layton @ 2020-04-06 11:58 UTC (permalink / raw)
  To: Jesper Krogh, Yan, Zheng; +Cc: ceph-devel

On Mon, 2020-04-06 at 12:23 +0200, Jesper Krogh wrote:
> It also emits this one - quite a bit later (not sure thats relevant)
> 
> [43364.349446] kworker/3:6: page allocation failure: order:0,
> mode:0xa20(GFP_ATOMIC), nodemask=(null),cpuset=/,mems_allowed=0

It failed to allocate a single page (order:0) to handle a receive. This
is a GFP_ATOMIC allocation which means that it's not allowed to sleep
(and thus can't do any active reclaim).
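
If these atomic allocation failures keep recurring, one common mitigation
(just a sketch - the value below is only an example, not a recommendation
for this box) is to enlarge the reserve the kernel keeps for atomic
allocations:

# current reserve in kB
cat /proc/sys/vm/min_free_kbytes
# try a larger reserve; make it persistent via /etc/sysctl.conf if it helps
sudo sysctl -w vm.min_free_kbytes=262144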

> [43364.349459] CPU: 3 PID: 23433 Comm: kworker/3:6 Not tainted 5.4.30 #5
> [43364.349460] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
> [43364.349515] Workqueue: ceph-msgr ceph_con_workfn [libceph]
> [43364.349516] Call Trace:
> [43364.349523]  <IRQ>
> [43364.349594]  dump_stack+0x6d/0x95
> [43364.349614]  warn_alloc+0x10c/0x170
> [43364.349616]  __alloc_pages_slowpath+0xe6c/0xef0
> [43364.349634]  ? ip_local_deliver_finish+0x48/0x50
> [43364.349635]  ? ip_local_deliver+0x6f/0xe0
> [43364.349640]  ? tcp_v4_early_demux+0x11c/0x150
> [43364.349641]  __alloc_pages_nodemask+0x2f3/0x360
> [43364.349643]  page_frag_alloc+0x129/0x150
> [43364.349654]  __napi_alloc_skb+0x86/0xd0
> [43364.349670]  page_to_skb+0x67/0x350 [virtio_net]
> [43364.349672]  receive_buf+0xe47/0x16c0 [virtio_net]
> [43364.349675]  virtnet_poll+0xf2/0x364 [virtio_net]
> [43364.349679]  net_rx_action+0x265/0x3e0
> [43364.349691]  __do_softirq+0xf9/0x2aa
> [43364.349701]  irq_exit+0xae/0xb0
> [43364.349705]  do_IRQ+0x59/0xe0
> [43364.349706]  common_interrupt+0xf/0xf
> [43364.349707]  </IRQ>
> [43364.349713] RIP: 0010:kvm_clock_get_cycles+0xc/0x20
> [43364.349715] Code: c3 48 c1 e1 06 31 c0 48 81 c1 00 10 60 84 49 89
> 0c 10 eb be 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 65 48 8b 3d e4
> 8f 7a 7d <e8> ff 0f 00 00 5d c3 0f 1f 00 66 2e 0f 1f 84 00 00 00 00 00
> 55 48
> [43364.349716] RSP: 0018:ffffc20941f4bab8 EFLAGS: 00000246 ORIG_RAX:
> ffffffffffffffdb
> [43364.349717] RAX: ffffffff8286e040 RBX: 0000277062955ccd RCX: 000000000000056a
> [43364.349735] RDX: 0000000000000000 RSI: 0000000000001000 RDI: ffffffff846010c0
> [43364.349750] RBP: ffffc20941f4bab8 R08: 0000000000008f3e R09: 000000000000097a
> [43364.349751] R10: 0000000000000000 R11: 0000000000001000 R12: 0000000000000000
> [43364.349751] R13: 00000000004a63a2 R14: ffff9c1ac83665b4 R15: 0000000000000000
> [43364.349753]  ? kvmclock_setup_percpu+0x80/0x80
> [43364.349760]  ktime_get+0x3e/0xa0
> [43364.349763]  tcp_mstamp_refresh+0x12/0x40
> [43364.349764]  tcp_rcv_space_adjust+0x22/0x1d0
> [43364.349766]  tcp_recvmsg+0x28b/0xbc0
> [43364.349777]  ? aa_sk_perm+0x43/0x190
> [43364.349781]  inet_recvmsg+0x64/0xf0
> [43364.349786]  sock_recvmsg+0x66/0x70
> [43364.349791]  ceph_tcp_recvpage+0x79/0xb0 [libceph]
> [43364.349796]  read_partial_message+0x3c3/0x7c0 [libceph]
> [43364.349801]  ceph_con_workfn+0xa6a/0x23d0 [libceph]
> [43364.349809]  process_one_work+0x167/0x400
> [43364.349810]  worker_thread+0x4d/0x460
> [43364.349814]  kthread+0x105/0x140
> [43364.349815]  ? rescuer_thread+0x370/0x370
> [43364.349816]  ? kthread_destroy_worker+0x50/0x50
> [43364.349817]  ret_from_fork+0x35/0x40
> [43364.349845] Mem-Info:
> [43364.349849] active_anon:10675 inactive_anon:12484 isolated_anon:0
>                 active_file:14532 inactive_file:3900471 isolated_file:8
>                 unevictable:0 dirty:1 writeback:0 unstable:0
>                 slab_reclaimable:68339 slab_unreclaimable:36220
>                 mapped:11089 shmem:608 pagetables:1093 bounce:0
>                 free:25551 free_pcp:3443 free_cma:0
> [43364.349851] Node 0 active_anon:42700kB inactive_anon:49936kB
> active_file:58128kB inactive_file:15601884kB unevictable:0kB
> isolated(anon):0kB isolated(file):32kB mapped:44356kB dirty:4kB
> writeback:0kB shmem:2432kB shmem_thp: 0kB shmem_pmdmapped: 0kB
> anon_thp: 0kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
> [43364.349852] Node 0 DMA free:15908kB min:64kB low:80kB high:96kB
> active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB
> unevictable:0kB writepending:0kB present:15992kB managed:15908kB
> mlocked:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB
> local_pcp:0kB free_cma:0kB
> [43364.349857] lowmem_reserve[]: 0 3444 15930 15930 15930
> [43364.349859] Node 0 DMA32 free:55240kB min:14596kB low:18244kB
> high:21892kB active_anon:1556kB inactive_anon:188kB active_file:48kB
> inactive_file:3430152kB unevictable:0kB writepending:0kB
> present:3653608kB managed:3588072kB mlocked:0kB kernel_stack:0kB
> pagetables:0kB bounce:0kB free_pcp:3568kB local_pcp:248kB free_cma:0kB
> [43364.349861] lowmem_reserve[]: 0 0 12485 12485 12485
> [43364.349862] Node 0 Normal free:31056kB min:83636kB low:96864kB
> high:110092kB active_anon:41144kB inactive_anon:49748kB
> active_file:58080kB inactive_file:12171732kB unevictable:0kB
> writepending:4kB present:13107200kB managed:12793540kB mlocked:0kB
> kernel_stack:7472kB pagetables:4372kB bounce:0kB free_pcp:10204kB
> local_pcp:1456kB free_cma:0kB
> [43364.349864] lowmem_reserve[]: 0 0 0 0 0
> [43364.349865] Node 0 DMA: 1*4kB (U) 0*8kB 0*16kB 1*32kB (U) 2*64kB
> (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB
> (M) = 15908kB
> [43364.349877] Node 0 DMA32: 22*4kB (EH) 18*8kB (UMEH) 0*16kB 13*32kB
> (UME) 17*64kB (UM) 18*128kB (UMEH) 14*256kB (UEH) 5*512kB (UE)
> 24*1024kB (UM) 10*2048kB (M) 0*4096kB = 55240kB
> [43364.349894] Node 0 Normal: 94*4kB (UMEH) 47*8kB (MEH) 60*16kB
> (UMEH) 38*32kB (UMEH) 30*64kB (MEH) 21*128kB (MEH) 37*256kB (UM)
> 28*512kB (M) 0*1024kB 0*2048kB 0*4096kB = 31344kB

Here though, it looks like there is quite a bit of free memory, so I'm
not sure why the allocation failed. It's possible that at the time that
the allocation failed there was nothing free, but just afterward quite a
bit of memory did become free (before the kernel could print out the
summary).

It may also be indicative of something else being wrong. If this
happened much later, then it may just be fallout from whatever the
original problem was.
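
One way to see whether a zone really runs dry of order-0 pages while the
backup is running is to watch the buddy allocator and the zone watermarks -
a rough sketch:

# free-page counts per zone and order (first number after the zone name is order-0)
watch -n1 cat /proc/buddyinfo
# per-zone watermarks (min/low/high) vs. pages currently free
less /proc/zoneinfo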

> [43364.349920] Node 0 hugepages_total=0 hugepages_free=0
> hugepages_surp=0 hugepages_size=1048576kB
> [43364.349927] Node 0 hugepages_total=0 hugepages_free=0
> hugepages_surp=0 hugepages_size=2048kB
> [43364.349928] 3915698 total pagecache pages
> [43364.349930] 79 pages in swap cache
> [43364.349931] Swap cache stats: add 393, delete 314, find 38/80
> [43364.349932] Free swap  = 4036080kB
> [43364.349932] Total swap = 4038652kB
> [43364.349933] 4194200 pages RAM
> [43364.349933] 0 pages HighMem/MovableOnly
> [43364.349934] 94820 pages reserved
> [43364.349934] 0 pages cma reserved
> [43364.349934] 0 pages hwpoisoned
> 
> On Mon, Apr 6, 2020 at 11:18 AM Jesper Krogh <jesper.krogh@gmail.com> wrote:
> > And it looks like I can reproduce this pattern - and when it has been
> > stuck for "sufficient" amount of time - then it gets blacklisted by
> > the MDS/OSD and (a related) issue is that I cannot get the mountpoint
> > back without a reboot on the kernel-client side.
> > 
> > On Mon, Apr 6, 2020 at 11:09 AM Jesper Krogh <jesper.krogh@gmail.com> wrote:
> > > home-directory style - median is about 3KB - but varies greatly.
> > > 
> > > I also get this:
> > > [41204.865818] INFO: task kworker/u16:102:21903 blocked for more than
> > > 120 seconds.
> > > [41204.865955]       Not tainted 5.4.30 #5
> > > [41204.866006] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> > > disables this message.
> > > [41204.866056] kworker/u16:102 D    0 21903      2 0x80004000
> > > [41204.866119] Workqueue: ceph-inode ceph_inode_work [ceph]
> > > [41204.866120] Call Trace:
> > > [41204.866156]  __schedule+0x45f/0x710
> > > [41204.866162]  ? xas_store+0x391/0x5f0
> > > [41204.866164]  schedule+0x3e/0xa0
> > > [41204.866166]  io_schedule+0x16/0x40
> > > [41204.866180]  __lock_page+0x12a/0x1d0
> > > [41204.866182]  ? file_fdatawait_range+0x30/0x30
> > > [41204.866187]  truncate_inode_pages_range+0x52c/0x980
> > > [41204.866191]  ? syscall_return_via_sysret+0x12/0x7f
> > > [41204.866197]  ? drop_inode_snap_realm+0x98/0xa0 [ceph]
> > > [41204.866207]  ? fsnotify_grab_connector+0x4d/0x90
> > > [41204.866209]  truncate_inode_pages_final+0x4c/0x60
> > > [41204.866214]  ceph_evict_inode+0x2d/0x210 [ceph]
> > > [41204.866219]  evict+0xca/0x1a0
> > > [41204.866221]  iput+0x1ba/0x210
> > > [41204.866225]  ceph_inode_work+0x40/0x270 [ceph]
> > > [41204.866232]  process_one_work+0x167/0x400
> > > [41204.866233]  worker_thread+0x4d/0x460
> > > [41204.866236]  kthread+0x105/0x140
> > > [41204.866237]  ? rescuer_thread+0x370/0x370
> > > [41204.866239]  ? kthread_destroy_worker+0x50/0x50
> > > [41204.866240]  ret_from_fork+0x35/0x40
> > > 
> > > On Mon, Apr 6, 2020 at 10:53 AM Yan, Zheng <ukernel@gmail.com> wrote:
> > > > On Mon, Apr 6, 2020 at 4:06 PM Jesper Krogh <jesper.krogh@gmail.com> wrote:
> > > > > This is a CephFS client - its only purpose is to run the "filedaemon" of bacula
> > > > > and transport data to the tape-library from CephFS - below 2 threads is
> > > > > essentially doing something equivalent to
> > > > > 
> > > > > find /cephfs/ -type f | xargs cat | nc server
> > > > > 
> > > > > 2 threads only, load exploding and the "net read vs net write" has
> > > > > more than 100x difference.
> > > > > 
> > > > > Can anyone explain this as "normal" behaviour?
> > > > > Server is a  VM with 16 "vCPU" and 16GB memory running libvirt/qemu
> > > > > 
> > > > > jk@wombat:~$ w
> > > > >  07:50:33 up 11:25,  1 user,  load average: 206.43, 76.23, 50.58
> > > > > USER     TTY      FROM             LOGIN@   IDLE   JCPU   PCPU WHAT
> > > > > jk       pts/0    10.194.133.42    06:54    0.00s  0.05s  0.00s w
> > > > > jk@wombat:~$ dstat -ar
> > > > > --total-cpu-usage-- -dsk/total- -net/total- ---paging-- ---system-- --io/total-
> > > > > usr sys idl wai stl| read  writ| recv  send|  in   out | int   csw | read  writ
> > > > >   0   0  98   1   0|  14k   34k|   0     0 |   3B   27B| 481   294 |0.55  0.73
> > > > >   1   1   0  98   0|   0     0 |  60M  220k|   0     0 |6402  6182 |   0     0
> > > > >   0   1   0  98   0|   0     0 |  69M  255k|   0     0 |7305  4339 |   0     0
> > > > >   1   2   0  98   0|   0     0 |  76M  282k|   0     0 |7914  4886 |   0     0
> > > > >   1   1   0  99   0|   0     0 |  70M  260k|   0     0 |7293  4444 |   0     0
> > > > >   1   1   0  98   0|   0     0 |  80M  278k|   0     0 |8018  4931 |   0     0
> > > > >   0   1   0  98   0|   0     0 |  60M  221k|   0     0 |6435  5951 |   0     0
> > > > >   0   1   0  99   0|   0     0 |  59M  211k|   0     0 |6163  3584 |   0     0
> > > > >   0   1   0  98   0|   0     0 |  64M  323k|   0     0 |6653  3881 |   0     0
> > > > >   1   0   0  99   0|   0     0 |  61M  243k|   0     0 |6822  4401 |   0     0
> > > > >   0   1   0  99   0|   0     0 |  55M  205k|   0     0 |5975  3518 |   0     0
> > > > >   1   1   0  98   0|   0     0 |  68M  242k|   0     0 |7094  6544 |   0     0
> > > > >   0   1   0  99   0|   0     0 |  58M  230k|   0     0 |6639  4178 |   0     0
> > > > >   1   2   0  98   0|   0     0 |  61M  243k|   0     0 |7117  4477 |   0     0
> > > > >   0   1   0  99   0|   0     0 |  61M  228k|   0     0 |6500  4078 |   0     0
> > > > >   0   1   0  99   0|   0     0 |  65M  234k|   0     0 |6595  3914 |   0     0
> > > > >   0   1   0  98   0|   0     0 |  64M  219k|   0     0 |6507  5755 |   0     0
> > > > >   1   1   0  99   0|   0     0 |  64M  233k|   0     0 |6869  4153 |   0     0
> > > > >   1   2   0  98   0|   0     0 |  63M  232k|   0     0 |6632  3907 |
> > > > > 0     0 ^C
> > > > > jk@wombat:~$ w
> > > > >  07:50:56 up 11:25,  1 user,  load average: 221.35, 88.07, 55.02
> > > > > USER     TTY      FROM             LOGIN@   IDLE   JCPU   PCPU WHAT
> > > > > jk       pts/0    10.194.133.42    06:54    0.00s  0.05s  0.00s w
> > > > > jk@wombat:~$
> > > > > 
> > > > > Thanks.
> > > > 
> > > > how small these files are?

-- 
Jeff Layton <jlayton@kernel.org>


* Re: 5.4.20 - high load - lots of incoming data - small data read.
  2020-04-06 11:45 ` Jeff Layton
@ 2020-04-06 13:17   ` Jesper Krogh
  2020-04-06 19:45     ` Jeff Layton
  0 siblings, 1 reply; 21+ messages in thread
From: Jesper Krogh @ 2020-04-06 13:17 UTC (permalink / raw)
  To: Jeff Layton; +Cc: ceph-devel

Hi Jeff.

No, because the client "bacula-fd" is reading from the local
filesystem - here CephFS - and sending it over the network to the server
with the tape libraries attached to it.  Thus, ideally, receive == send
- which is also the pattern I see when using larger files (multiple
MB).

Is the per-file overhead many KB?

Jesper

On Mon, Apr 6, 2020 at 1:45 PM Jeff Layton <jlayton@kernel.org> wrote:
>
> On Mon, 2020-04-06 at 10:04 +0200, Jesper Krogh wrote:
> > This is a CephFS client - its only purpose is to run the "filedaemon" of bacula
> > and transport data to the tape-library from CephFS - below 2 threads is
> > essentially doing something equivalent to
> >
> > find /cephfs/ -type f | xargs cat | nc server
> >
> > 2 threads only, load exploding and the "net read vs net write" has
> > more than 100x difference.
> >
>
> Makes sense. You're basically just reading in all of the data on this
> cephfs, so the receive is going to be much larger than the send.
>
> > Can anyone explain this as "normal" behaviour?
> > Server is a  VM with 16 "vCPU" and 16GB memory running libvirt/qemu
> >
> > jk@wombat:~$ w
> >  07:50:33 up 11:25,  1 user,  load average: 206.43, 76.23, 50.58
> > USER     TTY      FROM             LOGIN@   IDLE   JCPU   PCPU WHAT
> > jk       pts/0    10.194.133.42    06:54    0.00s  0.05s  0.00s w
> > jk@wombat:~$ dstat -ar
> > --total-cpu-usage-- -dsk/total- -net/total- ---paging-- ---system-- --io/total-
> > usr sys idl wai stl| read  writ| recv  send|  in   out | int   csw | read  writ
> >   0   0  98   1   0|  14k   34k|   0     0 |   3B   27B| 481   294 |0.55  0.73
> >   1   1   0  98   0|   0     0 |  60M  220k|   0     0 |6402  6182 |   0     0
> >   0   1   0  98   0|   0     0 |  69M  255k|   0     0 |7305  4339 |   0     0
> >   1   2   0  98   0|   0     0 |  76M  282k|   0     0 |7914  4886 |   0     0
> >   1   1   0  99   0|   0     0 |  70M  260k|   0     0 |7293  4444 |   0     0
> >   1   1   0  98   0|   0     0 |  80M  278k|   0     0 |8018  4931 |   0     0
> >   0   1   0  98   0|   0     0 |  60M  221k|   0     0 |6435  5951 |   0     0
> >   0   1   0  99   0|   0     0 |  59M  211k|   0     0 |6163  3584 |   0     0
> >   0   1   0  98   0|   0     0 |  64M  323k|   0     0 |6653  3881 |   0     0
> >   1   0   0  99   0|   0     0 |  61M  243k|   0     0 |6822  4401 |   0     0
> >   0   1   0  99   0|   0     0 |  55M  205k|   0     0 |5975  3518 |   0     0
> >   1   1   0  98   0|   0     0 |  68M  242k|   0     0 |7094  6544 |   0     0
> >   0   1   0  99   0|   0     0 |  58M  230k|   0     0 |6639  4178 |   0     0
> >   1   2   0  98   0|   0     0 |  61M  243k|   0     0 |7117  4477 |   0     0
> >   0   1   0  99   0|   0     0 |  61M  228k|   0     0 |6500  4078 |   0     0
> >   0   1   0  99   0|   0     0 |  65M  234k|   0     0 |6595  3914 |   0     0
> >   0   1   0  98   0|   0     0 |  64M  219k|   0     0 |6507  5755 |   0     0
> >   1   1   0  99   0|   0     0 |  64M  233k|   0     0 |6869  4153 |   0     0
> >   1   2   0  98   0|   0     0 |  63M  232k|   0     0 |6632  3907 |
> > 0     0 ^C
>
> Load average is high, but it looks like it's all just waiting on I/O.
>
> > jk@wombat:~$ w
> >  07:50:56 up 11:25,  1 user,  load average: 221.35, 88.07, 55.02
> > USER     TTY      FROM             LOGIN@   IDLE   JCPU   PCPU WHAT
> > jk       pts/0    10.194.133.42    06:54    0.00s  0.05s  0.00s w
> > jk@wombat:~$
> >
> --
> Jeff Layton <jlayton@kernel.org>
>


* Re: 5.4.20 - high load - lots of incoming data - small data read.
  2020-04-06 13:17   ` Jesper Krogh
@ 2020-04-06 19:45     ` Jeff Layton
       [not found]       ` <CAED-sicefQuJ3GGWh2bBXz6R=BFr1EQ=h6Sz-W_mh3u+-tO0wA@mail.gmail.com>
  0 siblings, 1 reply; 21+ messages in thread
From: Jeff Layton @ 2020-04-06 19:45 UTC (permalink / raw)
  To: Jesper Krogh; +Cc: ceph-devel

On Mon, 2020-04-06 at 15:17 +0200, Jesper Krogh wrote:
> Hi Jeff.
> 
> No, because the client "bacula-fd" is reading from the local
> filesystem - here CephFS and sending it over the network to the server
> with the tape-libraries attached to it.  Thus "ideal" receive == send
> - which is also the pattern I see when using larger files (multiple
> MB).
> 
> Is the per-file overhead many KB?
> 

Maybe not "many" but "several".

CephFS is quite chatty. There can also be quite a bit of back and forth
between the client and MDS. The protocol has a lot of extraneous fields
for any given message. Writing also means cap flushes (particularly on
size changes), and those can add up.

Whether that accounts for what you're seeing though, I'm not sure.
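
One rough way to watch that chatter from the client side is the kernel
client's debugfs files - a sketch, assuming debugfs is mounted in the usual
place:

# in-flight MDS requests (lookups, getattrs, cap messages, ...)
sudo cat /sys/kernel/debug/ceph/*/mdsc
# in-flight OSD read/write requests
sudo cat /sys/kernel/debug/ceph/*/osdc
# capabilities currently held by this client
sudo cat /sys/kernel/debug/ceph/*/caps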

> 
> On Mon, Apr 6, 2020 at 1:45 PM Jeff Layton <jlayton@kernel.org> wrote:
> > On Mon, 2020-04-06 at 10:04 +0200, Jesper Krogh wrote:
> > > This is a CephFS client - its only purpose is to run the "filedaemon" of bacula
> > > and transport data to the tape-library from CephFS - below 2 threads is
> > > essentially doing something equivalent to
> > > 
> > > find /cephfs/ -type f | xargs cat | nc server
> > > 
> > > 2 threads only, load exploding and the "net read vs net write" has
> > > more than 100x difference.
> > > 
> > 
> > Makes sense. You're basically just reading in all of the data on this
> > cephfs, so the receive is going to be much larger than the send.
> > 
> > > Can anyone explain this as "normal" behaviour?
> > > Server is a  VM with 16 "vCPU" and 16GB memory running libvirt/qemu
> > > 
> > > jk@wombat:~$ w
> > >  07:50:33 up 11:25,  1 user,  load average: 206.43, 76.23, 50.58
> > > USER     TTY      FROM             LOGIN@   IDLE   JCPU   PCPU WHAT
> > > jk       pts/0    10.194.133.42    06:54    0.00s  0.05s  0.00s w
> > > jk@wombat:~$ dstat -ar
> > > --total-cpu-usage-- -dsk/total- -net/total- ---paging-- ---system-- --io/total-
> > > usr sys idl wai stl| read  writ| recv  send|  in   out | int   csw | read  writ
> > >   0   0  98   1   0|  14k   34k|   0     0 |   3B   27B| 481   294 |0.55  0.73
> > >   1   1   0  98   0|   0     0 |  60M  220k|   0     0 |6402  6182 |   0     0
> > >   0   1   0  98   0|   0     0 |  69M  255k|   0     0 |7305  4339 |   0     0
> > >   1   2   0  98   0|   0     0 |  76M  282k|   0     0 |7914  4886 |   0     0
> > >   1   1   0  99   0|   0     0 |  70M  260k|   0     0 |7293  4444 |   0     0
> > >   1   1   0  98   0|   0     0 |  80M  278k|   0     0 |8018  4931 |   0     0
> > >   0   1   0  98   0|   0     0 |  60M  221k|   0     0 |6435  5951 |   0     0
> > >   0   1   0  99   0|   0     0 |  59M  211k|   0     0 |6163  3584 |   0     0
> > >   0   1   0  98   0|   0     0 |  64M  323k|   0     0 |6653  3881 |   0     0
> > >   1   0   0  99   0|   0     0 |  61M  243k|   0     0 |6822  4401 |   0     0
> > >   0   1   0  99   0|   0     0 |  55M  205k|   0     0 |5975  3518 |   0     0
> > >   1   1   0  98   0|   0     0 |  68M  242k|   0     0 |7094  6544 |   0     0
> > >   0   1   0  99   0|   0     0 |  58M  230k|   0     0 |6639  4178 |   0     0
> > >   1   2   0  98   0|   0     0 |  61M  243k|   0     0 |7117  4477 |   0     0
> > >   0   1   0  99   0|   0     0 |  61M  228k|   0     0 |6500  4078 |   0     0
> > >   0   1   0  99   0|   0     0 |  65M  234k|   0     0 |6595  3914 |   0     0
> > >   0   1   0  98   0|   0     0 |  64M  219k|   0     0 |6507  5755 |   0     0
> > >   1   1   0  99   0|   0     0 |  64M  233k|   0     0 |6869  4153 |   0     0
> > >   1   2   0  98   0|   0     0 |  63M  232k|   0     0 |6632  3907 |
> > > 0     0 ^C
> > 
> > Load average is high, but it looks like it's all just waiting on I/O.
> > 
> > > jk@wombat:~$ w
> > >  07:50:56 up 11:25,  1 user,  load average: 221.35, 88.07, 55.02
> > > USER     TTY      FROM             LOGIN@   IDLE   JCPU   PCPU WHAT
> > > jk       pts/0    10.194.133.42    06:54    0.00s  0.05s  0.00s w
> > > jk@wombat:~$
> > > 
> > --
> > Jeff Layton <jlayton@kernel.org>
> > 

-- 
Jeff Layton <jlayton@kernel.org>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: 5.4.20 - high load - lots of incoming data - small data read.
       [not found]       ` <CAED-sicefQuJ3GGWh2bBXz6R=BFr1EQ=h6Sz-W_mh3u+-tO0wA@mail.gmail.com>
@ 2020-04-09 14:06         ` Jeff Layton
  2020-04-09 16:00           ` Jesper Krogh
  0 siblings, 1 reply; 21+ messages in thread
From: Jeff Layton @ 2020-04-09 14:06 UTC (permalink / raw)
  To: Jesper Krogh; +Cc: ceph-devel

On Thu, 2020-04-09 at 12:23 +0200, Jesper Krogh wrote:
> And now it jammed up again - stuck mountpoint - reboot needed to move
> on - load 80 - zero activity.
> Last stacktrace here (full kern.log attached). As mentioned - this
> system "only" reads from CephFS to ship data over the network to a
> tape-library system on another server. Suggestions are really,
> really welcome.
> 
> jk@wombat:~$ ls -l /ceph/cluster
> ^C
> jk@wombat:~$ sudo umount -f -l ' /ceph/cluster
> > ^C
> jk@wombat:~$ sudo umount -f -l /ceph/cluster
> ^C^C^C
> 
> - hard reboot needed.
> 
> Apr  9 05:21:26 wombat kernel: [291368.861032] warn_alloc: 122
> callbacks suppressed
> Apr  9 05:21:26 wombat kernel: [291368.861035] kworker/3:10: page
> allocation failure: order:0, mode:0xa20(GFP_ATOMIC),
> nodemask=(null),cpuset=/,mems_allowed=0
> Apr  9 05:21:26 wombat kernel: [291368.861041] CPU: 3 PID: 22346 Comm:
> kworker/3:10 Not tainted 5.4.30 #5
> Apr  9 05:21:26 wombat kernel: [291368.861042] Hardware name: Bochs
> Bochs, BIOS Bochs 01/01/2011
> Apr  9 05:21:26 wombat kernel: [291368.861058] Workqueue: ceph-msgr
> ceph_con_workfn [libceph]
> Apr  9 05:21:26 wombat kernel: [291368.861059] Call Trace:
> Apr  9 05:21:26 wombat kernel: [291368.861061]  <IRQ>
> Apr  9 05:21:26 wombat kernel: [291368.861066]  dump_stack+0x6d/0x95
> Apr  9 05:21:26 wombat kernel: [291368.861069]  warn_alloc+0x10c/0x170
> Apr  9 05:21:26 wombat kernel: [291368.861070]
> __alloc_pages_slowpath+0xe6c/0xef0
> Apr  9 05:21:26 wombat kernel: [291368.861073]  ? tcp4_gro_receive+0x114/0x1b0
> Apr  9 05:21:26 wombat kernel: [291368.861074]  ? inet_gro_receive+0x25d/0x2c0
> Apr  9 05:21:26 wombat kernel: [291368.861076]  ? dev_gro_receive+0x69d/0x6f0
> Apr  9 05:21:26 wombat kernel: [291368.861077]
> __alloc_pages_nodemask+0x2f3/0x360
> Apr  9 05:21:26 wombat kernel: [291368.861080]  alloc_pages_current+0x6a/0xe0
> Apr  9 05:21:26 wombat kernel: [291368.861082]  skb_page_frag_refill+0xda/0x100
> Apr  9 05:21:26 wombat kernel: [291368.861085]
> try_fill_recv+0x285/0x6f0 [virtio_net]

This looks like the same problem you reported before.

The network interface driver (virtio_net) took an interrupt and the
handler called skb_page_frag_refill to allocate a page (order:0) to
handle a receive. This is a GFP_ATOMIC allocation (because this is in
softirq context), which generally means the task can't do anything that
sleeps in order to satisfy it (no reclaim allowed). That allocation
failed for reasons that aren't clear.

I think this probably has very little to do with ceph. It's just that it
happened to occur while it was attempting to do a receive on behalf of
the ceph socket. I'd probably start with the virtio_net maintainers (see
the MAINTAINERS file in the kernel sources), and see if they have
thoughts, and maybe cc the netdev mailing list.
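
Just as a diagnostic aside (nothing ceph-specific), you can check the
per-order free lists and the watermark that the atomic reserves are
derived from:

  # free pages per order, per zone; an order:0 GFP_ATOMIC failure means
  # even single free pages were momentarily unavailable in softirq context
  cat /proc/buddyinfo
  # the watermark the atomic reserves are based on; raising it is a common
  # workaround for this class of failure, though untested here
  sysctl vm.min_free_kbytes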

> Apr  9 05:21:26 wombat kernel: [291368.861087]
> virtnet_poll+0x32d/0x364 [virtio_net]
> Apr  9 05:21:26 wombat kernel: [291368.861088]  net_rx_action+0x265/0x3e0
> Apr  9 05:21:26 wombat kernel: [291368.861091]  __do_softirq+0xf9/0x2aa
> Apr  9 05:21:26 wombat kernel: [291368.861095]  irq_exit+0xae/0xb0
> Apr  9 05:21:26 wombat kernel: [291368.861096]  do_IRQ+0x59/0xe0
> Apr  9 05:21:26 wombat kernel: [291368.861098]  common_interrupt+0xf/0xf
> Apr  9 05:21:26 wombat kernel: [291368.861098]  </IRQ>
> Apr  9 05:21:26 wombat kernel: [291368.861124] RIP:
> 0010:ceph_str_hash_rjenkins+0x199/0x270 [libceph]
> Apr  9 05:21:26 wombat kernel: [291368.861125] Code: 60 0f b6 57 0a c1
> e2 18 01 d6 0f b6 57 09 c1 e2 10 01 d6 0f b6 57 08 c1 e2 08 01 d6 0f
> b6 57 07 c1 e2 18 41 01 d1 0f b6 57 06 <c1> e2 10 41 01 d1 0f b6 57 05
> c1 e2 08 41 01 d1 0f b6 57 04 41 01
> Apr  9 05:21:26 wombat kernel: [291368.861126] RSP:
> 0018:ffffc2094113f890 EFLAGS: 00000206 ORIG_RAX: ffffffffffffffdb
> Apr  9 05:21:26 wombat kernel: [291368.861127] RAX: 00000000cfce7272
> RBX: ffff9c171adcd060 RCX: 0000000000000008
> Apr  9 05:21:26 wombat kernel: [291368.861128] RDX: 0000000000000030
> RSI: 00000000c35b3cfb RDI: ffff9c171adcd0c4
> Apr  9 05:21:26 wombat kernel: [291368.861128] RBP: ffffc2094113f890
> R08: 0000000000000220 R09: 0000000031103db4
> Apr  9 05:21:26 wombat kernel: [291368.861129] R10: 0000000000303031
> R11: 000000000000002c R12: ffff9c171adcd0b0
> Apr  9 05:21:26 wombat kernel: [291368.861130] R13: ffffc2094113fa28
> R14: ffff9c171adcd0b0 R15: ffff9c171adcd0f0
> Apr  9 05:21:26 wombat kernel: [291368.861137]
> ceph_str_hash+0x20/0x70 [libceph]
> Apr  9 05:21:26 wombat kernel: [291368.861143]
> __ceph_object_locator_to_pg+0x1bf/0x200 [libceph]
> Apr  9 05:21:26 wombat kernel: [291368.861149]  ?
> crush_do_rule+0x412/0x460 [libceph]
> Apr  9 05:21:26 wombat kernel: [291368.861155]  ?
> ceph_pg_to_up_acting_osds+0x547/0x8e0 [libceph]
> Apr  9 05:21:26 wombat kernel: [291368.861160]  ?
> ceph_pg_to_up_acting_osds+0x672/0x8e0 [libceph]
> Apr  9 05:21:26 wombat kernel: [291368.861166]
> calc_target+0x101/0x590 [libceph]
> Apr  9 05:21:26 wombat kernel: [291368.861172]  ?
> free_pg_mapping+0x13/0x20 [libceph]
> Apr  9 05:21:26 wombat kernel: [291368.861177]  ?
> alloc_pg_mapping+0x30/0x30 [libceph]
> Apr  9 05:21:26 wombat kernel: [291368.861184]
> scan_requests.constprop.57+0x165/0x270 [libceph]
> Apr  9 05:21:26 wombat kernel: [291368.861189]
> handle_one_map+0x198/0x200 [libceph]
> Apr  9 05:21:26 wombat kernel: [291368.861194]
> ceph_osdc_handle_map+0x22f/0x710 [libceph]
> Apr  9 05:21:26 wombat kernel: [291368.861199]  dispatch+0x3b1/0x9c0 [libceph]
> Apr  9 05:21:26 wombat kernel: [291368.861204]  ? dispatch+0x3b1/0x9c0 [libceph]
> Apr  9 05:21:26 wombat kernel: [291368.861209]
> ceph_con_workfn+0xae8/0x23d0 [libceph]
> Apr  9 05:21:26 wombat kernel: [291368.861213]  ?
> ceph_con_workfn+0xae8/0x23d0 [libceph]
> Apr  9 05:21:26 wombat kernel: [291368.861215]  process_one_work+0x167/0x400
> Apr  9 05:21:26 wombat kernel: [291368.861216]  worker_thread+0x4d/0x460
> Apr  9 05:21:26 wombat kernel: [291368.861219]  kthread+0x105/0x140
> Apr  9 05:21:26 wombat kernel: [291368.861220]  ? rescuer_thread+0x370/0x370
> Apr  9 05:21:26 wombat kernel: [291368.861221]  ?
> kthread_destroy_worker+0x50/0x50
> Apr  9 05:21:26 wombat kernel: [291368.861222]  ret_from_fork+0x35/0x40
> Apr  9 05:21:26 wombat kernel: [291368.861224] Mem-Info:
> Apr  9 05:21:26 wombat kernel: [291368.861227] active_anon:1001298
> inactive_anon:182746 isolated_anon:0
> Apr  9 05:21:26 wombat kernel: [291368.861227]  active_file:9826
> inactive_file:2765016 isolated_file:56
> Apr  9 05:21:26 wombat kernel: [291368.861227]  unevictable:0 dirty:0
> writeback:0 unstable:0
> Apr  9 05:21:26 wombat kernel: [291368.861227]  slab_reclaimable:38024
> slab_unreclaimable:39667
> Apr  9 05:21:26 wombat kernel: [291368.861227]  mapped:7274 shmem:561
> pagetables:3465 bounce:0
> Apr  9 05:21:26 wombat kernel: [291368.861227]  free:31398
> free_pcp:494 free_cma:0
> Apr  9 05:21:26 wombat kernel: [291368.861229] Node 0
> active_anon:4005192kB inactive_anon:730984kB active_file:39304kB
> inactive_file:11060064kB unevictable:0kB isolated(anon):0kB
> isolated(file):224kB mapped:29096kB dirty:0kB writeback:0kB
> shmem:2244kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB
> writeback_tmp:0kB unstable:0kB all_unreclaimable? no
> Apr  9 05:21:26 wombat kernel: [291368.861230] Node 0 DMA free:15908kB
> min:64kB low:80kB high:96kB active_anon:0kB inactive_anon:0kB
> active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB
> present:15992kB managed:15908kB mlocked:0kB kernel_stack:0kB
> pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
> Apr  9 05:21:26 wombat kernel: [291368.861232] lowmem_reserve[]: 0
> 3444 15930 15930 15930
> Apr  9 05:21:26 wombat kernel: [291368.861234] Node 0 DMA32
> free:60576kB min:28932kB low:32580kB high:36228kB active_anon:287052kB
> inactive_anon:287460kB active_file:2888kB inactive_file:2859520kB
> unevictable:0kB writepending:0kB present:3653608kB managed:3588072kB
> mlocked:0kB kernel_stack:96kB pagetables:1196kB bounce:0kB
> free_pcp:184kB local_pcp:184kB free_cma:0kB
> Apr  9 05:21:26 wombat kernel: [291368.861236] lowmem_reserve[]: 0 0
> 12485 12485 12485
> Apr  9 05:21:26 wombat kernel: [291368.861237] Node 0 Normal
> free:49108kB min:132788kB low:146016kB high:159244kB
> active_anon:3718140kB inactive_anon:443524kB active_file:36416kB
> inactive_file:8200544kB unevictable:0kB writepending:0kB
> present:13107200kB managed:12793540kB mlocked:0kB kernel_stack:7936kB
> pagetables:12664kB bounce:0kB free_pcp:1792kB local_pcp:1384kB
> free_cma:0kB
> Apr  9 05:21:26 wombat kernel: [291368.861239] lowmem_reserve[]: 0 0 0 0 0
> Apr  9 05:21:26 wombat kernel: [291368.861240] Node 0 DMA: 1*4kB (U)
> 0*8kB 0*16kB 1*32kB (U) 2*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB
> 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15908kB
> Apr  9 05:21:26 wombat kernel: [291368.861245] Node 0 DMA32: 734*4kB
> (UMEH) 699*8kB (UMEH) 373*16kB (UMEH) 46*32kB (ME) 91*64kB (UM)
> 67*128kB (UME) 54*256kB (UME) 22*512kB (U) 5*1024kB (U) 0*2048kB
> 0*4096kB = 60576kB
> Apr  9 05:21:26 wombat kernel: [291368.861250] Node 0 Normal: 5133*4kB
> (UMEH) 414*8kB (UMEH) 1005*16kB (UMH) 145*32kB (UME) 68*64kB (UME)
> 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 48916kB
> Apr  9 05:21:26 wombat kernel: [291368.861259] Node 0
> hugepages_total=0 hugepages_free=0 hugepages_surp=0
> hugepages_size=1048576kB
> Apr  9 05:21:26 wombat kernel: [291368.861260] Node 0
> hugepages_total=0 hugepages_free=0 hugepages_surp=0
> hugepages_size=2048kB
> Apr  9 05:21:26 wombat kernel: [291368.861260] 2775949 total pagecache pages
> Apr  9 05:21:26 wombat kernel: [291368.861263] 477 pages in swap cache
> Apr  9 05:21:26 wombat kernel: [291368.861263] Swap cache stats: add
> 24073, delete 23596, find 88552/91776
> Apr  9 05:21:26 wombat kernel: [291368.861264] Free swap  = 3963900kB
> Apr  9 05:21:26 wombat kernel: [291368.861264] Total swap = 4038652kB
> Apr  9 05:21:26 wombat kernel: [291368.861265] 4194200 pages RAM
> Apr  9 05:21:26 wombat kernel: [291368.861265] 0 pages HighMem/MovableOnly
> Apr  9 05:21:26 wombat kernel: [291368.861266] 94820 pages reserved
> Apr  9 05:21:26 wombat kernel: [291368.861266] 0 pages cma reserved
> Apr  9 05:21:26 wombat kernel: [291368.861267] 0 pages hwpoisoned
> 
> On Mon, Apr 6, 2020 at 9:45 PM Jeff Layton <jlayton@kernel.org> wrote:
> > On Mon, 2020-04-06 at 15:17 +0200, Jesper Krogh wrote:
> > > Hi Jeff.
> > > 
> > > No, because the client "bacula-fd" is reading from the local
> > > filesystem - here CephFS and sending it over the network to the server
> > > with the tape-libraries attached to it.  Thus "ideal" receive == send
> > > - which is also the pattern I see when using larger files (multiple
> > > MB).
> > > 
> > > Is the per-file overhead many KB?
> > > 
> > 
> > Maybe not "many" but "several".
> > 
> > CephFS is quite chatty. There can also be quite a bit of back and forth
> > between the client and MDS. The protocol has a lot of extraneous fields
> > for any given message. Writing also means cap flushes (particularly on
> > size changes), and those can add up.
> > 
> > Whether that accounts for what you're seeing though, I'm not sure.
> > 
> > > On Mon, Apr 6, 2020 at 1:45 PM Jeff Layton <jlayton@kernel.org> wrote:
> > > > On Mon, 2020-04-06 at 10:04 +0200, Jesper Krogh wrote:
> > > > > This is a CephFS client - its only purpose is to run the "filedaemon" of bacula
> > > > > and transport data to the tape-library from CephFS - below 2 threads is
> > > > > essentially doing something equivalent to
> > > > > 
> > > > > find /cephfs/ -type f | xargs cat | nc server
> > > > > 
> > > > > 2 threads only, load exploding and the "net read vs net write" has
> > > > > more than 100x difference.
> > > > > 
> > > > 
> > > > Makes sense. You're basically just reading in all of the data on this
> > > > cephfs, so the receive is going to be much larger than the send.
> > > > 
> > > > > Can anyone explain this as "normal" behaviour?
> > > > > Server is a  VM with 16 "vCPU" and 16GB memory running libvirt/qemu
> > > > > 
> > > > > jk@wombat:~$ w
> > > > >  07:50:33 up 11:25,  1 user,  load average: 206.43, 76.23, 50.58
> > > > > USER     TTY      FROM             LOGIN@   IDLE   JCPU   PCPU WHAT
> > > > > jk       pts/0    10.194.133.42    06:54    0.00s  0.05s  0.00s w
> > > > > jk@wombat:~$ dstat -ar
> > > > > --total-cpu-usage-- -dsk/total- -net/total- ---paging-- ---system-- --io/total-
> > > > > usr sys idl wai stl| read  writ| recv  send|  in   out | int   csw | read  writ
> > > > >   0   0  98   1   0|  14k   34k|   0     0 |   3B   27B| 481   294 |0.55  0.73
> > > > >   1   1   0  98   0|   0     0 |  60M  220k|   0     0 |6402  6182 |   0     0
> > > > >   0   1   0  98   0|   0     0 |  69M  255k|   0     0 |7305  4339 |   0     0
> > > > >   1   2   0  98   0|   0     0 |  76M  282k|   0     0 |7914  4886 |   0     0
> > > > >   1   1   0  99   0|   0     0 |  70M  260k|   0     0 |7293  4444 |   0     0
> > > > >   1   1   0  98   0|   0     0 |  80M  278k|   0     0 |8018  4931 |   0     0
> > > > >   0   1   0  98   0|   0     0 |  60M  221k|   0     0 |6435  5951 |   0     0
> > > > >   0   1   0  99   0|   0     0 |  59M  211k|   0     0 |6163  3584 |   0     0
> > > > >   0   1   0  98   0|   0     0 |  64M  323k|   0     0 |6653  3881 |   0     0
> > > > >   1   0   0  99   0|   0     0 |  61M  243k|   0     0 |6822  4401 |   0     0
> > > > >   0   1   0  99   0|   0     0 |  55M  205k|   0     0 |5975  3518 |   0     0
> > > > >   1   1   0  98   0|   0     0 |  68M  242k|   0     0 |7094  6544 |   0     0
> > > > >   0   1   0  99   0|   0     0 |  58M  230k|   0     0 |6639  4178 |   0     0
> > > > >   1   2   0  98   0|   0     0 |  61M  243k|   0     0 |7117  4477 |   0     0
> > > > >   0   1   0  99   0|   0     0 |  61M  228k|   0     0 |6500  4078 |   0     0
> > > > >   0   1   0  99   0|   0     0 |  65M  234k|   0     0 |6595  3914 |   0     0
> > > > >   0   1   0  98   0|   0     0 |  64M  219k|   0     0 |6507  5755 |   0     0
> > > > >   1   1   0  99   0|   0     0 |  64M  233k|   0     0 |6869  4153 |   0     0
> > > > >   1   2   0  98   0|   0     0 |  63M  232k|   0     0 |6632  3907 |
> > > > > 0     0 ^C
> > > > 
> > > > Load average is high, but it looks like it's all just waiting on I/O.
> > > > 
> > > > > jk@wombat:~$ w
> > > > >  07:50:56 up 11:25,  1 user,  load average: 221.35, 88.07, 55.02
> > > > > USER     TTY      FROM             LOGIN@   IDLE   JCPU   PCPU WHAT
> > > > > jk       pts/0    10.194.133.42    06:54    0.00s  0.05s  0.00s w
> > > > > jk@wombat:~$
> > > > > 
> > > > --
> > > > Jeff Layton <jlayton@kernel.org>
> > > > 
> > 
> > --
> > Jeff Layton <jlayton@kernel.org>
> > 

-- 
Jeff Layton <jlayton@kernel.org>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: 5.4.20 - high load - lots of incoming data - small data read.
  2020-04-09 14:06         ` Jeff Layton
@ 2020-04-09 16:00           ` Jesper Krogh
  2020-04-09 16:30             ` Jeff Layton
  0 siblings, 1 reply; 21+ messages in thread
From: Jesper Krogh @ 2020-04-09 16:00 UTC (permalink / raw)
  To: Jeff Layton; +Cc: ceph-devel

Thanks Jeff - I'll try that.

I would just add that we have had this problem on a physical machine
as well - but with too many "other" workloads running at the same time -
so we isolated it off to a VM, assuming that it was the mixed workload
situation that caused us issues. I cannot be sure that it is "exactly"
the same problem we're seeing, but the symptoms are identical.

Jesper

On Thu, Apr 9, 2020 at 4:06 PM Jeff Layton <jlayton@kernel.org> wrote:
>
> On Thu, 2020-04-09 at 12:23 +0200, Jesper Krogh wrote:
> > And now it jammed up again - stuck mountpoint - reboot needed to move
> > on - load 80 - zero activity.
> > Last stacktrace here (full kern.log attached). As mentioned - this
> > system "only" reads from CephFS to ship data over the network to a
> > tape-library system on another server. Suggestions are really,
> > really welcome.
> >
> > jk@wombat:~$ ls -l /ceph/cluster
> > ^C
> > jk@wombat:~$ sudo umount -f -l ' /ceph/cluster
> > > ^C
> > jk@wombat:~$ sudo umount -f -l /ceph/cluster
> > ^C^C^C
> >
> > - hard reboot needed.
> >
> > Apr  9 05:21:26 wombat kernel: [291368.861032] warn_alloc: 122
> > callbacks suppressed
> > Apr  9 05:21:26 wombat kernel: [291368.861035] kworker/3:10: page
> > allocation failure: order:0, mode:0xa20(GFP_ATOMIC),
> > nodemask=(null),cpuset=/,mems_allowed=0
> > Apr  9 05:21:26 wombat kernel: [291368.861041] CPU: 3 PID: 22346 Comm:
> > kworker/3:10 Not tainted 5.4.30 #5
> > Apr  9 05:21:26 wombat kernel: [291368.861042] Hardware name: Bochs
> > Bochs, BIOS Bochs 01/01/2011
> > Apr  9 05:21:26 wombat kernel: [291368.861058] Workqueue: ceph-msgr
> > ceph_con_workfn [libceph]
> > Apr  9 05:21:26 wombat kernel: [291368.861059] Call Trace:
> > Apr  9 05:21:26 wombat kernel: [291368.861061]  <IRQ>
> > Apr  9 05:21:26 wombat kernel: [291368.861066]  dump_stack+0x6d/0x95
> > Apr  9 05:21:26 wombat kernel: [291368.861069]  warn_alloc+0x10c/0x170
> > Apr  9 05:21:26 wombat kernel: [291368.861070]
> > __alloc_pages_slowpath+0xe6c/0xef0
> > Apr  9 05:21:26 wombat kernel: [291368.861073]  ? tcp4_gro_receive+0x114/0x1b0
> > Apr  9 05:21:26 wombat kernel: [291368.861074]  ? inet_gro_receive+0x25d/0x2c0
> > Apr  9 05:21:26 wombat kernel: [291368.861076]  ? dev_gro_receive+0x69d/0x6f0
> > Apr  9 05:21:26 wombat kernel: [291368.861077]
> > __alloc_pages_nodemask+0x2f3/0x360
> > Apr  9 05:21:26 wombat kernel: [291368.861080]  alloc_pages_current+0x6a/0xe0
> > Apr  9 05:21:26 wombat kernel: [291368.861082]  skb_page_frag_refill+0xda/0x100
> > Apr  9 05:21:26 wombat kernel: [291368.861085]
> > try_fill_recv+0x285/0x6f0 [virtio_net]
>
> This looks like the same problem you reported before.
>
> The network interface driver (virtio_net) took an interrupt and the
> handler called skb_page_frag_refill to allocate a page (order:0) to
> handle a receive. This is a GFP_ATOMIC allocation (because this is in
> softirq context) which is generally means that the task can't do
> activity that sleeps in order to satisfy it (no reclaim allowed). That
> allocation failed for reasons that aren't clear.
>
> I think this probably has very little to do with ceph. It's just that it
> happened to occur while it was attempting to do a receive on behalf of
> the ceph socket. I'd probably start with the virtio_net maintainers (see
> the MAINTAINERS file in the kernel sources), and see if they have
> thoughts, and maybe cc the netdev mailing list.
>
> > Apr  9 05:21:26 wombat kernel: [291368.861087]
> > virtnet_poll+0x32d/0x364 [virtio_net]
> > Apr  9 05:21:26 wombat kernel: [291368.861088]  net_rx_action+0x265/0x3e0
> > Apr  9 05:21:26 wombat kernel: [291368.861091]  __do_softirq+0xf9/0x2aa
> > Apr  9 05:21:26 wombat kernel: [291368.861095]  irq_exit+0xae/0xb0
> > Apr  9 05:21:26 wombat kernel: [291368.861096]  do_IRQ+0x59/0xe0
> > Apr  9 05:21:26 wombat kernel: [291368.861098]  common_interrupt+0xf/0xf
> > Apr  9 05:21:26 wombat kernel: [291368.861098]  </IRQ>
> > Apr  9 05:21:26 wombat kernel: [291368.861124] RIP:
> > 0010:ceph_str_hash_rjenkins+0x199/0x270 [libceph]
> > Apr  9 05:21:26 wombat kernel: [291368.861125] Code: 60 0f b6 57 0a c1
> > e2 18 01 d6 0f b6 57 09 c1 e2 10 01 d6 0f b6 57 08 c1 e2 08 01 d6 0f
> > b6 57 07 c1 e2 18 41 01 d1 0f b6 57 06 <c1> e2 10 41 01 d1 0f b6 57 05
> > c1 e2 08 41 01 d1 0f b6 57 04 41 01
> > Apr  9 05:21:26 wombat kernel: [291368.861126] RSP:
> > 0018:ffffc2094113f890 EFLAGS: 00000206 ORIG_RAX: ffffffffffffffdb
> > Apr  9 05:21:26 wombat kernel: [291368.861127] RAX: 00000000cfce7272
> > RBX: ffff9c171adcd060 RCX: 0000000000000008
> > Apr  9 05:21:26 wombat kernel: [291368.861128] RDX: 0000000000000030
> > RSI: 00000000c35b3cfb RDI: ffff9c171adcd0c4
> > Apr  9 05:21:26 wombat kernel: [291368.861128] RBP: ffffc2094113f890
> > R08: 0000000000000220 R09: 0000000031103db4
> > Apr  9 05:21:26 wombat kernel: [291368.861129] R10: 0000000000303031
> > R11: 000000000000002c R12: ffff9c171adcd0b0
> > Apr  9 05:21:26 wombat kernel: [291368.861130] R13: ffffc2094113fa28
> > R14: ffff9c171adcd0b0 R15: ffff9c171adcd0f0
> > Apr  9 05:21:26 wombat kernel: [291368.861137]
> > ceph_str_hash+0x20/0x70 [libceph]
> > Apr  9 05:21:26 wombat kernel: [291368.861143]
> > __ceph_object_locator_to_pg+0x1bf/0x200 [libceph]
> > Apr  9 05:21:26 wombat kernel: [291368.861149]  ?
> > crush_do_rule+0x412/0x460 [libceph]
> > Apr  9 05:21:26 wombat kernel: [291368.861155]  ?
> > ceph_pg_to_up_acting_osds+0x547/0x8e0 [libceph]
> > Apr  9 05:21:26 wombat kernel: [291368.861160]  ?
> > ceph_pg_to_up_acting_osds+0x672/0x8e0 [libceph]
> > Apr  9 05:21:26 wombat kernel: [291368.861166]
> > calc_target+0x101/0x590 [libceph]
> > Apr  9 05:21:26 wombat kernel: [291368.861172]  ?
> > free_pg_mapping+0x13/0x20 [libceph]
> > Apr  9 05:21:26 wombat kernel: [291368.861177]  ?
> > alloc_pg_mapping+0x30/0x30 [libceph]
> > Apr  9 05:21:26 wombat kernel: [291368.861184]
> > scan_requests.constprop.57+0x165/0x270 [libceph]
> > Apr  9 05:21:26 wombat kernel: [291368.861189]
> > handle_one_map+0x198/0x200 [libceph]
> > Apr  9 05:21:26 wombat kernel: [291368.861194]
> > ceph_osdc_handle_map+0x22f/0x710 [libceph]
> > Apr  9 05:21:26 wombat kernel: [291368.861199]  dispatch+0x3b1/0x9c0 [libceph]
> > Apr  9 05:21:26 wombat kernel: [291368.861204]  ? dispatch+0x3b1/0x9c0 [libceph]
> > Apr  9 05:21:26 wombat kernel: [291368.861209]
> > ceph_con_workfn+0xae8/0x23d0 [libceph]
> > Apr  9 05:21:26 wombat kernel: [291368.861213]  ?
> > ceph_con_workfn+0xae8/0x23d0 [libceph]
> > Apr  9 05:21:26 wombat kernel: [291368.861215]  process_one_work+0x167/0x400
> > Apr  9 05:21:26 wombat kernel: [291368.861216]  worker_thread+0x4d/0x460
> > Apr  9 05:21:26 wombat kernel: [291368.861219]  kthread+0x105/0x140
> > Apr  9 05:21:26 wombat kernel: [291368.861220]  ? rescuer_thread+0x370/0x370
> > Apr  9 05:21:26 wombat kernel: [291368.861221]  ?
> > kthread_destroy_worker+0x50/0x50
> > Apr  9 05:21:26 wombat kernel: [291368.861222]  ret_from_fork+0x35/0x40
> > Apr  9 05:21:26 wombat kernel: [291368.861224] Mem-Info:
> > Apr  9 05:21:26 wombat kernel: [291368.861227] active_anon:1001298
> > inactive_anon:182746 isolated_anon:0
> > Apr  9 05:21:26 wombat kernel: [291368.861227]  active_file:9826
> > inactive_file:2765016 isolated_file:56
> > Apr  9 05:21:26 wombat kernel: [291368.861227]  unevictable:0 dirty:0
> > writeback:0 unstable:0
> > Apr  9 05:21:26 wombat kernel: [291368.861227]  slab_reclaimable:38024
> > slab_unreclaimable:39667
> > Apr  9 05:21:26 wombat kernel: [291368.861227]  mapped:7274 shmem:561
> > pagetables:3465 bounce:0
> > Apr  9 05:21:26 wombat kernel: [291368.861227]  free:31398
> > free_pcp:494 free_cma:0
> > Apr  9 05:21:26 wombat kernel: [291368.861229] Node 0
> > active_anon:4005192kB inactive_anon:730984kB active_file:39304kB
> > inactive_file:11060064kB unevictable:0kB isolated(anon):0kB
> > isolated(file):224kB mapped:29096kB dirty:0kB writeback:0kB
> > shmem:2244kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB
> > writeback_tmp:0kB unstable:0kB all_unreclaimable? no
> > Apr  9 05:21:26 wombat kernel: [291368.861230] Node 0 DMA free:15908kB
> > min:64kB low:80kB high:96kB active_anon:0kB inactive_anon:0kB
> > active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB
> > present:15992kB managed:15908kB mlocked:0kB kernel_stack:0kB
> > pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
> > Apr  9 05:21:26 wombat kernel: [291368.861232] lowmem_reserve[]: 0
> > 3444 15930 15930 15930
> > Apr  9 05:21:26 wombat kernel: [291368.861234] Node 0 DMA32
> > free:60576kB min:28932kB low:32580kB high:36228kB active_anon:287052kB
> > inactive_anon:287460kB active_file:2888kB inactive_file:2859520kB
> > unevictable:0kB writepending:0kB present:3653608kB managed:3588072kB
> > mlocked:0kB kernel_stack:96kB pagetables:1196kB bounce:0kB
> > free_pcp:184kB local_pcp:184kB free_cma:0kB
> > Apr  9 05:21:26 wombat kernel: [291368.861236] lowmem_reserve[]: 0 0
> > 12485 12485 12485
> > Apr  9 05:21:26 wombat kernel: [291368.861237] Node 0 Normal
> > free:49108kB min:132788kB low:146016kB high:159244kB
> > active_anon:3718140kB inactive_anon:443524kB active_file:36416kB
> > inactive_file:8200544kB unevictable:0kB writepending:0kB
> > present:13107200kB managed:12793540kB mlocked:0kB kernel_stack:7936kB
> > pagetables:12664kB bounce:0kB free_pcp:1792kB local_pcp:1384kB
> > free_cma:0kB
> > Apr  9 05:21:26 wombat kernel: [291368.861239] lowmem_reserve[]: 0 0 0 0 0
> > Apr  9 05:21:26 wombat kernel: [291368.861240] Node 0 DMA: 1*4kB (U)
> > 0*8kB 0*16kB 1*32kB (U) 2*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB
> > 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15908kB
> > Apr  9 05:21:26 wombat kernel: [291368.861245] Node 0 DMA32: 734*4kB
> > (UMEH) 699*8kB (UMEH) 373*16kB (UMEH) 46*32kB (ME) 91*64kB (UM)
> > 67*128kB (UME) 54*256kB (UME) 22*512kB (U) 5*1024kB (U) 0*2048kB
> > 0*4096kB = 60576kB
> > Apr  9 05:21:26 wombat kernel: [291368.861250] Node 0 Normal: 5133*4kB
> > (UMEH) 414*8kB (UMEH) 1005*16kB (UMH) 145*32kB (UME) 68*64kB (UME)
> > 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 48916kB
> > Apr  9 05:21:26 wombat kernel: [291368.861259] Node 0
> > hugepages_total=0 hugepages_free=0 hugepages_surp=0
> > hugepages_size=1048576kB
> > Apr  9 05:21:26 wombat kernel: [291368.861260] Node 0
> > hugepages_total=0 hugepages_free=0 hugepages_surp=0
> > hugepages_size=2048kB
> > Apr  9 05:21:26 wombat kernel: [291368.861260] 2775949 total pagecache pages
> > Apr  9 05:21:26 wombat kernel: [291368.861263] 477 pages in swap cache
> > Apr  9 05:21:26 wombat kernel: [291368.861263] Swap cache stats: add
> > 24073, delete 23596, find 88552/91776
> > Apr  9 05:21:26 wombat kernel: [291368.861264] Free swap  = 3963900kB
> > Apr  9 05:21:26 wombat kernel: [291368.861264] Total swap = 4038652kB
> > Apr  9 05:21:26 wombat kernel: [291368.861265] 4194200 pages RAM
> > Apr  9 05:21:26 wombat kernel: [291368.861265] 0 pages HighMem/MovableOnly
> > Apr  9 05:21:26 wombat kernel: [291368.861266] 94820 pages reserved
> > Apr  9 05:21:26 wombat kernel: [291368.861266] 0 pages cma reserved
> > Apr  9 05:21:26 wombat kernel: [291368.861267] 0 pages hwpoisoned
> >
> > On Mon, Apr 6, 2020 at 9:45 PM Jeff Layton <jlayton@kernel.org> wrote:
> > > On Mon, 2020-04-06 at 15:17 +0200, Jesper Krogh wrote:
> > > > Hi Jeff.
> > > >
> > > > No, because the client "bacula-fd" is reading from the local
> > > > filesystem - here CephFS and sending it over the network to the server
> > > > with the tape-libraries attached to it.  Thus "ideal" receive == send
> > > > - which is also the pattern I see when using larger files (multiple
> > > > MB).
> > > >
> > > > Is the per-file overhead many KB?
> > > >
> > >
> > > Maybe not "many" but "several".
> > >
> > > CephFS is quite chatty. There can also be quite a bit of back and forth
> > > between the client and MDS. The protocol has a lot of extraneous fields
> > > for any given message. Writing also means cap flushes (particularly on
> > > size changes), and those can add up.
> > >
> > > Whether that accounts for what you're seeing though, I'm not sure.
> > >
> > > > On Mon, Apr 6, 2020 at 1:45 PM Jeff Layton <jlayton@kernel.org> wrote:
> > > > > On Mon, 2020-04-06 at 10:04 +0200, Jesper Krogh wrote:
> > > > > > This is a CephFS client - its only purpose is to run the "filedaemon" of bacula
> > > > > > and transport data to the tape-library from CephFS - below 2 threads is
> > > > > > essentially doing something equivalent to
> > > > > >
> > > > > > find /cephfs/ -type f | xargs cat | nc server
> > > > > >
> > > > > > 2 threads only, load exploding and the "net read vs net write" has
> > > > > > more than 100x difference.
> > > > > >
> > > > >
> > > > > Makes sense. You're basically just reading in all of the data on this
> > > > > cephfs, so the receive is going to be much larger than the send.
> > > > >
> > > > > > Can anyone explain this as "normal" behaviour?
> > > > > > Server is a  VM with 16 "vCPU" and 16GB memory running libvirt/qemu
> > > > > >
> > > > > > jk@wombat:~$ w
> > > > > >  07:50:33 up 11:25,  1 user,  load average: 206.43, 76.23, 50.58
> > > > > > USER     TTY      FROM             LOGIN@   IDLE   JCPU   PCPU WHAT
> > > > > > jk       pts/0    10.194.133.42    06:54    0.00s  0.05s  0.00s w
> > > > > > jk@wombat:~$ dstat -ar
> > > > > > --total-cpu-usage-- -dsk/total- -net/total- ---paging-- ---system-- --io/total-
> > > > > > usr sys idl wai stl| read  writ| recv  send|  in   out | int   csw | read  writ
> > > > > >   0   0  98   1   0|  14k   34k|   0     0 |   3B   27B| 481   294 |0.55  0.73
> > > > > >   1   1   0  98   0|   0     0 |  60M  220k|   0     0 |6402  6182 |   0     0
> > > > > >   0   1   0  98   0|   0     0 |  69M  255k|   0     0 |7305  4339 |   0     0
> > > > > >   1   2   0  98   0|   0     0 |  76M  282k|   0     0 |7914  4886 |   0     0
> > > > > >   1   1   0  99   0|   0     0 |  70M  260k|   0     0 |7293  4444 |   0     0
> > > > > >   1   1   0  98   0|   0     0 |  80M  278k|   0     0 |8018  4931 |   0     0
> > > > > >   0   1   0  98   0|   0     0 |  60M  221k|   0     0 |6435  5951 |   0     0
> > > > > >   0   1   0  99   0|   0     0 |  59M  211k|   0     0 |6163  3584 |   0     0
> > > > > >   0   1   0  98   0|   0     0 |  64M  323k|   0     0 |6653  3881 |   0     0
> > > > > >   1   0   0  99   0|   0     0 |  61M  243k|   0     0 |6822  4401 |   0     0
> > > > > >   0   1   0  99   0|   0     0 |  55M  205k|   0     0 |5975  3518 |   0     0
> > > > > >   1   1   0  98   0|   0     0 |  68M  242k|   0     0 |7094  6544 |   0     0
> > > > > >   0   1   0  99   0|   0     0 |  58M  230k|   0     0 |6639  4178 |   0     0
> > > > > >   1   2   0  98   0|   0     0 |  61M  243k|   0     0 |7117  4477 |   0     0
> > > > > >   0   1   0  99   0|   0     0 |  61M  228k|   0     0 |6500  4078 |   0     0
> > > > > >   0   1   0  99   0|   0     0 |  65M  234k|   0     0 |6595  3914 |   0     0
> > > > > >   0   1   0  98   0|   0     0 |  64M  219k|   0     0 |6507  5755 |   0     0
> > > > > >   1   1   0  99   0|   0     0 |  64M  233k|   0     0 |6869  4153 |   0     0
> > > > > >   1   2   0  98   0|   0     0 |  63M  232k|   0     0 |6632  3907 |
> > > > > > 0     0 ^C
> > > > >
> > > > > Load average is high, but it looks like it's all just waiting on I/O.
> > > > >
> > > > > > jk@wombat:~$ w
> > > > > >  07:50:56 up 11:25,  1 user,  load average: 221.35, 88.07, 55.02
> > > > > > USER     TTY      FROM             LOGIN@   IDLE   JCPU   PCPU WHAT
> > > > > > jk       pts/0    10.194.133.42    06:54    0.00s  0.05s  0.00s w
> > > > > > jk@wombat:~$
> > > > > >
> > > > > --
> > > > > Jeff Layton <jlayton@kernel.org>
> > > > >
> > >
> > > --
> > > Jeff Layton <jlayton@kernel.org>
> > >
>
> --
> Jeff Layton <jlayton@kernel.org>
>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: 5.4.20 - high load - lots of incoming data - small data read.
  2020-04-09 16:00           ` Jesper Krogh
@ 2020-04-09 16:30             ` Jeff Layton
  2020-04-09 22:47               ` Tony Lill
  0 siblings, 1 reply; 21+ messages in thread
From: Jeff Layton @ 2020-04-09 16:30 UTC (permalink / raw)
  To: Jesper Krogh; +Cc: ceph-devel

On Thu, 2020-04-09 at 18:00 +0200, Jesper Krogh wrote:
> Thanks Jeff - I'll try that.
> 
> I would just add to the case that this is a problem we have had on a
> physical machine - but too many "other" workloads at the same time -
> so we isolated it off to a VM - assuming that it was the mixed
> workload situation that did cause us issues. I cannot be sure that it
> is "excactly" the same problem we're seeing but symptoms are
> identical.
> 

Do you see the "page allocation failure" warnings on bare metal hosts
too? If so, then maybe we're dealing with a problem that isn't
virtio_net specific. In any case, let's get some folks more familiar
with that area involved first and take it from there.
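
For example, something as simple as this on the bare-metal host would
answer that (assuming the warnings are still in the ring buffer):

  dmesg -T | grep -i "page allocation failure"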

Feel free to cc me on the bug report too.

Thanks,
-- 
Jeff Layton <jlayton@kernel.org>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: 5.4.20 - high load - lots of incoming data - small data read.
  2020-04-06  8:04 5.4.20 - high load - lots of incoming data - small data read Jesper Krogh
  2020-04-06  8:53 ` Yan, Zheng
  2020-04-06 11:45 ` Jeff Layton
@ 2020-04-09 17:50 ` Jesper Krogh
  2020-04-11 17:48   ` Jesper Krogh
  2 siblings, 1 reply; 21+ messages in thread
From: Jesper Krogh @ 2020-04-09 17:50 UTC (permalink / raw)
  To: ceph-devel, Jeff Layton

Hi. I really don't know if this is related to the issue (page
allocation) or a separate one, but it really puzzles me that I can see
dstat -ar output like this:

Keep in mind that this is only a "network backup client" reading from
CephFS - ideally recv == send, as it just "transports" data through the
host.

STARTS OUT OK.
--total-cpu-usage-- -dsk/total- -net/total- ---paging-- ---system-- --io/total-
usr sys idl wai stl| read  writ| recv  send|  in   out | int   csw | read  writ
  2   2  70  25   0|   0     0 | 100M  100M|   0     0 |7690  1944 |   0     0
  4   2  74  19   0|   0     0 | 156M  154M|   0     0 |  12k 3942 |   0     0
  4   2  70  24   0|   0     0 | 214M  127M|   0     0 |  12k 3892 |   0     0
  4   2  65  29   0|   0     0 | 120M  163M|   0     0 |9763  2347 |   0     0
  5   4  77  14   0|   0     0 | 216M  242M|   0     0 |  15k 4797 |   0     0
HERE IT BALLOONS
  3  14  20  63   0|   0     0 | 912M 5970k|   0     0 |  33k   16k|   0     0
  2  14   1  83   0|   0     0 |1121M 4723k|   0     0 |  37k   14k|   0     0
  3  16   3  78   0|   0    84k|1198M 8738k|   0     0 |  39k   15k|   0  4.00
  3  14  14  69   0|   0     0 |1244M 5772k|   0     0 |  40k   14k|   0     0
  2  12  15  71   0|   0    24k|1354M |   0    24k|  41k 8241 |   0  6.00
  2   9   1  87   0|   0     0 |1271M 1540k|   0     0 |  38k 5887 |   0     0
  2   7   0  90   0|   0    52k|1222M 1609k|   0     0 |  37k 6359 |   0  6.00
  2   8   0  90   0|   0    96k|1260M 5676k|   0     0 |  39k 6589 |   0  20.0
  2   6   0  92   0|   0     0 |1043M 3002k|   0     0 |  33k 6189 |   0     0
  2   6   0  92   0|   0     0 | 946M 1223k|   0     0 |  30k 6080 |   0     0
  2   6   0  92   0|   0     0 | 908M 5331k|   0     0 |  29k 9983 |   0     0
  2   5   0  94   0|   0     0 | 773M 1067k|   0     0 |  26k 6691 |   0     0
  2   4   0  94   0|   0     0 | 626M 3190k|   0     0 |  21k 5868 |   0     0
  1   4   0  95   0|   0     0 | 505M   15M|   0     0 |  17k 4686 |   0     0
and then it moves back to normal.

But a pattern of 1000x more on the receive side than the send side is
really puzzling. This is a VM on a 25Gbit interconnect with all the ceph nodes.
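
If it helps, here is how I could grab a snapshot of what the client is
actually asking the OSDs for during one of those bursts (path from
memory; debugfs must be mounted):

  # in-flight OSD requests (object names and ops)
  sudo sh -c 'cat /sys/kernel/debug/ceph/*/osdc'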

On Mon, Apr 6, 2020 at 10:04 AM Jesper Krogh <jesper.krogh@gmail.com> wrote:
>
> This is a CephFS client - its only purpose is to run the "filedaemon" of bacula
> and transport data to the tape-library from CephFS - below 2 threads is
> essentially doing something equivalent to
>
> find /cephfs/ -type f | xargs cat | nc server
>
> 2 threads only, load exploding and the "net read vs net write" has
> more than 100x difference.
>
> Can anyone explain this as "normal" behaviour?
> Server is a  VM with 16 "vCPU" and 16GB memory running libvirt/qemu
>
> jk@wombat:~$ w
>  07:50:33 up 11:25,  1 user,  load average: 206.43, 76.23, 50.58
> USER     TTY      FROM             LOGIN@   IDLE   JCPU   PCPU WHAT
> jk       pts/0    10.194.133.42    06:54    0.00s  0.05s  0.00s w
> jk@wombat:~$ dstat -ar
> --total-cpu-usage-- -dsk/total- -net/total- ---paging-- ---system-- --io/total-
> usr sys idl wai stl| read  writ| recv  send|  in   out | int   csw | read  writ
>   0   0  98   1   0|  14k   34k|   0     0 |   3B   27B| 481   294 |0.55  0.73
>   1   1   0  98   0|   0     0 |  60M  220k|   0     0 |6402  6182 |   0     0
>   0   1   0  98   0|   0     0 |  69M  255k|   0     0 |7305  4339 |   0     0
>   1   2   0  98   0|   0     0 |  76M  282k|   0     0 |7914  4886 |   0     0
>   1   1   0  99   0|   0     0 |  70M  260k|   0     0 |7293  4444 |   0     0
>   1   1   0  98   0|   0     0 |  80M  278k|   0     0 |8018  4931 |   0     0
>   0   1   0  98   0|   0     0 |  60M  221k|   0     0 |6435  5951 |   0     0
>   0   1   0  99   0|   0     0 |  59M  211k|   0     0 |6163  3584 |   0     0
>   0   1   0  98   0|   0     0 |  64M  323k|   0     0 |6653  3881 |   0     0
>   1   0   0  99   0|   0     0 |  61M  243k|   0     0 |6822  4401 |   0     0
>   0   1   0  99   0|   0     0 |  55M  205k|   0     0 |5975  3518 |   0     0
>   1   1   0  98   0|   0     0 |  68M  242k|   0     0 |7094  6544 |   0     0
>   0   1   0  99   0|   0     0 |  58M  230k|   0     0 |6639  4178 |   0     0
>   1   2   0  98   0|   0     0 |  61M  243k|   0     0 |7117  4477 |   0     0
>   0   1   0  99   0|   0     0 |  61M  228k|   0     0 |6500  4078 |   0     0
>   0   1   0  99   0|   0     0 |  65M  234k|   0     0 |6595  3914 |   0     0
>   0   1   0  98   0|   0     0 |  64M  219k|   0     0 |6507  5755 |   0     0
>   1   1   0  99   0|   0     0 |  64M  233k|   0     0 |6869  4153 |   0     0
>   1   2   0  98   0|   0     0 |  63M  232k|   0     0 |6632  3907 |
> 0     0 ^C
> jk@wombat:~$ w
>  07:50:56 up 11:25,  1 user,  load average: 221.35, 88.07, 55.02
> USER     TTY      FROM             LOGIN@   IDLE   JCPU   PCPU WHAT
> jk       pts/0    10.194.133.42    06:54    0.00s  0.05s  0.00s w
> jk@wombat:~$
>
> Thanks.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: 5.4.20 - high load - lots of incoming data - small data read.
  2020-04-09 16:30             ` Jeff Layton
@ 2020-04-09 22:47               ` Tony Lill
  2020-04-10  7:13                 ` Jesper Krogh
  0 siblings, 1 reply; 21+ messages in thread
From: Tony Lill @ 2020-04-09 22:47 UTC (permalink / raw)
  To: Jeff Layton, Jesper Krogh; +Cc: ceph-devel


[-- Attachment #1.1: Type: text/plain, Size: 1275 bytes --]



On 4/9/20 12:30 PM, Jeff Layton wrote:
> On Thu, 2020-04-09 at 18:00 +0200, Jesper Krogh wrote:
>> Thanks Jeff - I'll try that.
>>
>> I would just add to the case that this is a problem we have had on a
>> physical machine - but too many "other" workloads at the same time -
>> so we isolated it off to a VM - assuming that it was the mixed
>> workload situation that did cause us issues. I cannot be sure that it
>> is "excactly" the same problem we're seeing but symptoms are
>> identical.
>>
> 
> Do you see the "page allocation failure" warnings on bare metal hosts
> too? If so, then maybe we're dealing with a problem that isn't
> virtio_net specific. In any case, let's get some folks more familiar
> with that area involved first and take it from there.
> 
> Feel free to cc me on the bug report too.
> 
> Thanks,
> 

In 5.4.20, the default rsize and wsize are 64M. This has caused page
allocation failures for me in a different context. Try setting them to
something sensible.
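
For example, something like this (illustrative only - the monitor
address and client name are placeholders, auth options are omitted, and
the sizes are in bytes):

  sudo mount -t ceph mon1:6789:/ /ceph/cluster \
       -o name=backup,rsize=16777216,wsize=16777216
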
-- 
Tony Lill, OCT,                     ajlill@AJLC.Waterloo.ON.CA
President, A. J. Lill Consultants               (519) 650 0660
539 Grand Valley Dr., Cambridge, Ont. N3H 2S2   (519) 241 2461
-------------- http://www.ajlc.waterloo.on.ca/ ---------------




[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: 5.4.20 - high load - lots of incoming data - small data read.
  2020-04-09 22:47               ` Tony Lill
@ 2020-04-10  7:13                 ` Jesper Krogh
  2020-04-10 16:32                   ` Tony Lill
  0 siblings, 1 reply; 21+ messages in thread
From: Jesper Krogh @ 2020-04-10  7:13 UTC (permalink / raw)
  To: Tony Lill; +Cc: Jeff Layton, ceph-devel

Hi. What is the suggested change - is it Ceph that has an rsize/wsize
of 64MB?

On Fri, Apr 10, 2020 at 12:47 AM Tony Lill <ajlill@ajlc.waterloo.on.ca> wrote:
>
>
>
> On 4/9/20 12:30 PM, Jeff Layton wrote:
> > On Thu, 2020-04-09 at 18:00 +0200, Jesper Krogh wrote:
> >> Thanks Jeff - I'll try that.
> >>
> >> I would just add to the case that this is a problem we have had on a
> >> physical machine - but too many "other" workloads at the same time -
> >> so we isolated it off to a VM - assuming that it was the mixed
> >> workload situation that did cause us issues. I cannot be sure that it
> >> is "excactly" the same problem we're seeing but symptoms are
> >> identical.
> >>
> >
> > Do you see the "page allocation failure" warnings on bare metal hosts
> > too? If so, then maybe we're dealing with a problem that isn't
> > virtio_net specific. In any case, let's get some folks more familiar
> > with that area involved first and take it from there.
> >
> > Feel free to cc me on the bug report too.
> >
> > Thanks,
> >
>
> In 5.4.20, the default rsize and wsize is 64M. This has caused me page
> allocation failures in a different context. Try setting it to something
> sensible.
> --
> Tony Lill, OCT,                     ajlill@AJLC.Waterloo.ON.CA
> President, A. J. Lill Consultants               (519) 650 0660
> 539 Grand Valley Dr., Cambridge, Ont. N3H 2S2   (519) 241 2461
> -------------- http://www.ajlc.waterloo.on.ca/ ---------------
>
>
>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: 5.4.20 - high load - lots of incoming data - small data read.
  2020-04-10  7:13                 ` Jesper Krogh
@ 2020-04-10 16:32                   ` Tony Lill
  2020-04-11 13:20                     ` Jesper Krogh
  0 siblings, 1 reply; 21+ messages in thread
From: Tony Lill @ 2020-04-10 16:32 UTC (permalink / raw)
  To: Jesper Krogh; +Cc: Jeff Layton, ceph-devel


[-- Attachment #1.1: Type: text/plain, Size: 2537 bytes --]



On 4/10/20 3:13 AM, Jesper Krogh wrote:
> Hi. What is the suggested change? - is it Ceph that has an rsize,wsize
> of 64MB ?
> 

Sorry, set rsize and wsize in the mount options for your cephfs to
something smaller.

My problem with this is that I use autofs to mount my filesystem.
Starting with 4.14.82, after a few mount/unmount cycles, the mount would
fail with an order-4 allocation error, and I'd have to reboot.

I traced it to a change that doubled CEPH_MSG_MAX_DATA_LEN from 16M to
32M. Later, in the 5-series kernels, it was doubled again, and that
caused an order-5 allocation failure. This define is used to set the max
and default rsize and wsize.

Reducing the rsize and wsize in the mount options fixed the problem for
me. This may do nothing for you, but if it clears your allocation issue...
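
If you want to confirm what your kernel was built with, the define is
in the libceph headers (path and value below are from memory, as of
5.4):

  grep -n CEPH_MSG_MAX_DATA_LEN include/linux/ceph/libceph.h
  # e.g.  #define CEPH_MSG_MAX_DATA_LEN   (64*1024*1024)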


> On Fri, Apr 10, 2020 at 12:47 AM Tony Lill <ajlill@ajlc.waterloo.on.ca> wrote:
>>
>>
>>
>> On 4/9/20 12:30 PM, Jeff Layton wrote:
>>> On Thu, 2020-04-09 at 18:00 +0200, Jesper Krogh wrote:
>>>> Thanks Jeff - I'll try that.
>>>>
>>>> I would just add to the case that this is a problem we have had on a
>>>> physical machine - but too many "other" workloads at the same time -
>>>> so we isolated it off to a VM - assuming that it was the mixed
>>>> workload situation that did cause us issues. I cannot be sure that it
>>>> is "excactly" the same problem we're seeing but symptoms are
>>>> identical.
>>>>
>>>
>>> Do you see the "page allocation failure" warnings on bare metal hosts
>>> too? If so, then maybe we're dealing with a problem that isn't
>>> virtio_net specific. In any case, let's get some folks more familiar
>>> with that area involved first and take it from there.
>>>
>>> Feel free to cc me on the bug report too.
>>>
>>> Thanks,
>>>
>>
>> In 5.4.20, the default rsize and wsize is 64M. This has caused me page
>> allocation failures in a different context. Try setting it to something
>> sensible.
>> --
>> Tony Lill, OCT,                     ajlill@AJLC.Waterloo.ON.CA
>> President, A. J. Lill Consultants               (519) 650 0660
>> 539 Grand Valley Dr., Cambridge, Ont. N3H 2S2   (519) 241 2461
>> -------------- http://www.ajlc.waterloo.on.ca/ ---------------
>>
>>
>>

-- 
Tony Lill, OCT,                     ajlill@AJLC.Waterloo.ON.CA
President, A. J. Lill Consultants               (519) 650 0660
539 Grand Valley Dr., Cambridge, Ont. N3H 2S2   (519) 241 2461
-------------- http://www.ajlc.waterloo.on.ca/ ---------------




[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: 5.4.20 - high load - lots of incoming data - small data read.
  2020-04-10 16:32                   ` Tony Lill
@ 2020-04-11 13:20                     ` Jesper Krogh
  2020-04-11 18:07                       ` Jesper Krogh
  0 siblings, 1 reply; 21+ messages in thread
From: Jesper Krogh @ 2020-04-11 13:20 UTC (permalink / raw)
  To: Tony Lill; +Cc: Jeff Layton, ceph-devel

OK, I'll change the mount options and report back on that in a few
days. This is fairly reproducible, so I would expect to see the effect
if it works.

On Fri, Apr 10, 2020 at 6:32 PM Tony Lill <ajlill@ajlc.waterloo.on.ca> wrote:
>
>
>
> On 4/10/20 3:13 AM, Jesper Krogh wrote:
> > Hi. What is the suggested change? - is it Ceph that has an rsize,wsize
> > of 64MB ?
> >
>
> Sorry, set rsize and wsize in the mount options for your cephfs to
> something smaller.
>
> My problem with this is that I use autofs to mount my filesystem.
> Starting with 4.14.82, after a few mount/unmount cycles, the mount would
> fail with order 4 allocation error, and I'd have to reboot.
>
> I traced it to a change that doubled CEPH_MSG_MAX_DATA_LEN from 16M to
> 32M. Later in the 5 series kernels, this was doubled again, and that
> caused an order 5 allocation failure. This define is used to set the max
> and default rsize and wsize.
>
> Reducing the rsize and wsize in the mount option fixed the problem for
> me. This may do nothing for you, but, if it clears your allocation issue...
>
>
> > On Fri, Apr 10, 2020 at 12:47 AM Tony Lill <ajlill@ajlc.waterloo.on.ca> wrote:
> >>
> >>
> >>
> >> On 4/9/20 12:30 PM, Jeff Layton wrote:
> >>> On Thu, 2020-04-09 at 18:00 +0200, Jesper Krogh wrote:
> >>>> Thanks Jeff - I'll try that.
> >>>>
> >>>> I would just add to the case that this is a problem we have had on a
> >>>> physical machine - but too many "other" workloads at the same time -
> >>>> so we isolated it off to a VM - assuming that it was the mixed
> >>>> workload situation that did cause us issues. I cannot be sure that it
> >>>> is "excactly" the same problem we're seeing but symptoms are
> >>>> identical.
> >>>>
> >>>
> >>> Do you see the "page allocation failure" warnings on bare metal hosts
> >>> too? If so, then maybe we're dealing with a problem that isn't
> >>> virtio_net specific. In any case, let's get some folks more familiar
> >>> with that area involved first and take it from there.
> >>>
> >>> Feel free to cc me on the bug report too.
> >>>
> >>> Thanks,
> >>>
> >>
> >> In 5.4.20, the default rsize and wsize is 64M. This has caused me page
> >> allocation failures in a different context. Try setting it to something
> >> sensible.
> >> --
> >> Tony Lill, OCT,                     ajlill@AJLC.Waterloo.ON.CA
> >> President, A. J. Lill Consultants               (519) 650 0660
> >> 539 Grand Valley Dr., Cambridge, Ont. N3H 2S2   (519) 241 2461
> >> -------------- http://www.ajlc.waterloo.on.ca/ ---------------
> >>
> >>
> >>
>
> --
> Tony Lill, OCT,                     ajlill@AJLC.Waterloo.ON.CA
> President, A. J. Lill Consultants               (519) 650 0660
> 539 Grand Valley Dr., Cambridge, Ont. N3H 2S2   (519) 241 2461
> -------------- http://www.ajlc.waterloo.on.ca/ ---------------
>
>
>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: 5.4.20 - high load - lots of incoming data - small data read.
  2020-04-09 17:50 ` Jesper Krogh
@ 2020-04-11 17:48   ` Jesper Krogh
  0 siblings, 0 replies; 21+ messages in thread
From: Jesper Krogh @ 2020-04-11 17:48 UTC (permalink / raw)
  To: ceph-devel, Jeff Layton

Pure speculation - could it be "readahead" that somehow ends up
pulling data into the system that is never needed when dealing with
small files? Any way to demystify this pattern?
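
One way I could test that theory would be to mount with a much smaller
readahead window and rerun the backup. A sketch (rasize is the cephfs
readahead size in bytes; the value, monitor address and client name are
only examples, and auth options are omitted):

  sudo mount -t ceph mon1:6789:/ /ceph/cluster \
       -o name=backup,rasize=131072

and then watch in dstat whether recv drops back toward send.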

On Thu, Apr 9, 2020 at 7:50 PM Jesper Krogh <jesper.krogh@gmail.com> wrote:
>
> Hi. I really dont know if this is related to the issue  (page
> allocation) or a separate. But It really puzzles me that I can see
> dstat -ar output like this:
>
> Keep in mind that this is only a "network backup client" reading from
> CephFS - ideally recv == send as it just "transports data through the
> host.
>
> STARTS OUT OK.
> --total-cpu-usage-- -dsk/total- -net/total- ---paging-- ---system-- --io/total-
> usr sys idl wai stl| read  writ| recv  send|  in   out | int   csw | read  writ
>   2   2  70  25   0|   0     0 | 100M  100M|   0     0 |7690  1944 |   0     0
>   4   2  74  19   0|   0     0 | 156M  154M|   0     0 |  12k 3942 |   0     0
>   4   2  70  24   0|   0     0 | 214M  127M|   0     0 |  12k 3892 |   0     0
>   4   2  65  29   0|   0     0 | 120M  163M|   0     0 |9763  2347 |   0     0
>   5   4  77  14   0|   0     0 | 216M  242M|   0     0 |  15k 4797 |   0     0
> HERE IT BALOONS
>   3  14  20  63   0|   0     0 | 912M 5970k|   0     0 |  33k   16k|   0     0
>   2  14   1  83   0|   0     0 |1121M 4723k|   0     0 |  37k   14k|   0     0
>   3  16   3  78   0|   0    84k|1198M 8738k|   0     0 |  39k   15k|   0  4.00
>   3  14  14  69   0|   0     0 |1244M 5772k|   0     0 |  40k   14k|   0     0
>   2  12  15  71   0|   0    24k|1354M |   0    24k|  41k 8241 |   0  6.00
>   2   9   1  87   0|   0     0 |1271M 1540k|   0     0 |  38k 5887 |   0     0
>   2   7   0  90   0|   0    52k|1222M 1609k|   0     0 |  37k 6359 |   0  6.00
>   2   8   0  90   0|   0    96k|1260M 5676k|   0     0 |  39k 6589 |   0  20.0
>   2   6   0  92   0|   0     0 |1043M 3002k|   0     0 |  33k 6189 |   0     0
>   2   6   0  92   0|   0     0 | 946M 1223k|   0     0 |  30k 6080 |   0     0
>   2   6   0  92   0|   0     0 | 908M 5331k|   0     0 |  29k 9983 |   0     0
>   2   5   0  94   0|   0     0 | 773M 1067k|   0     0 |  26k 6691 |   0     0
>   2   4   0  94   0|   0     0 | 626M 3190k|   0     0 |  21k 5868 |   0     0
>   1   4   0  95   0|   0     0 | 505M   15M|   0     0 |  17k 4686 |   0     0
> and then it move back to normal..
>
> But a pattern of 1000x more on the recieve side than send is really puzzling.
> A VM on 25Gbit interconnect with all ceph nodes.
>
> On Mon, Apr 6, 2020 at 10:04 AM Jesper Krogh <jesper.krogh@gmail.com> wrote:
> >
> > This is a CephFS client - its only purpose is to run the "filedaemon" of bacula
> > and transport data to the tape-library from CephFS - below 2 threads is
> > essentially doing something equivalent to
> >
> > find /cephfs/ -type f | xargs cat | nc server
> >
> > 2 threads only, load exploding and the "net read vs net write" has
> > more than 100x difference.
> >
> > Can anyone explain this as "normal" behaviour?
> > Server is a  VM with 16 "vCPU" and 16GB memory running libvirt/qemu
> >
> > jk@wombat:~$ w
> >  07:50:33 up 11:25,  1 user,  load average: 206.43, 76.23, 50.58
> > USER     TTY      FROM             LOGIN@   IDLE   JCPU   PCPU WHAT
> > jk       pts/0    10.194.133.42    06:54    0.00s  0.05s  0.00s w
> > jk@wombat:~$ dstat -ar
> > --total-cpu-usage-- -dsk/total- -net/total- ---paging-- ---system-- --io/total-
> > usr sys idl wai stl| read  writ| recv  send|  in   out | int   csw | read  writ
> >   0   0  98   1   0|  14k   34k|   0     0 |   3B   27B| 481   294 |0.55  0.73
> >   1   1   0  98   0|   0     0 |  60M  220k|   0     0 |6402  6182 |   0     0
> >   0   1   0  98   0|   0     0 |  69M  255k|   0     0 |7305  4339 |   0     0
> >   1   2   0  98   0|   0     0 |  76M  282k|   0     0 |7914  4886 |   0     0
> >   1   1   0  99   0|   0     0 |  70M  260k|   0     0 |7293  4444 |   0     0
> >   1   1   0  98   0|   0     0 |  80M  278k|   0     0 |8018  4931 |   0     0
> >   0   1   0  98   0|   0     0 |  60M  221k|   0     0 |6435  5951 |   0     0
> >   0   1   0  99   0|   0     0 |  59M  211k|   0     0 |6163  3584 |   0     0
> >   0   1   0  98   0|   0     0 |  64M  323k|   0     0 |6653  3881 |   0     0
> >   1   0   0  99   0|   0     0 |  61M  243k|   0     0 |6822  4401 |   0     0
> >   0   1   0  99   0|   0     0 |  55M  205k|   0     0 |5975  3518 |   0     0
> >   1   1   0  98   0|   0     0 |  68M  242k|   0     0 |7094  6544 |   0     0
> >   0   1   0  99   0|   0     0 |  58M  230k|   0     0 |6639  4178 |   0     0
> >   1   2   0  98   0|   0     0 |  61M  243k|   0     0 |7117  4477 |   0     0
> >   0   1   0  99   0|   0     0 |  61M  228k|   0     0 |6500  4078 |   0     0
> >   0   1   0  99   0|   0     0 |  65M  234k|   0     0 |6595  3914 |   0     0
> >   0   1   0  98   0|   0     0 |  64M  219k|   0     0 |6507  5755 |   0     0
> >   1   1   0  99   0|   0     0 |  64M  233k|   0     0 |6869  4153 |   0     0
> >   1   2   0  98   0|   0     0 |  63M  232k|   0     0 |6632  3907 |
> > 0     0 ^C
> > jk@wombat:~$ w
> >  07:50:56 up 11:25,  1 user,  load average: 221.35, 88.07, 55.02
> > USER     TTY      FROM             LOGIN@   IDLE   JCPU   PCPU WHAT
> > jk       pts/0    10.194.133.42    06:54    0.00s  0.05s  0.00s w
> > jk@wombat:~$
> >
> > Thanks.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: 5.4.20 - high load - lots of incoming data - small data read.
  2020-04-11 13:20                     ` Jesper Krogh
@ 2020-04-11 18:07                       ` Jesper Krogh
  2020-04-12 11:53                         ` Jeff Layton
  0 siblings, 1 reply; 21+ messages in thread
From: Jesper Krogh @ 2020-04-11 18:07 UTC (permalink / raw)
  To: Tony Lill; +Cc: Jeff Layton, ceph-devel

OK, I set the mount options to 32MB, and after a fresh VM reboot I got
a page-allocation failure again.
This time it was the first kernel stack trace after reboot - no "hung
task for more than 120 seconds" warnings prior to it.

[Sat Apr 11 17:52:58 2020] kworker/7:1: page allocation failure:
order:0, mode:0xa20(GFP_ATOMIC),
nodemask=(null),cpuset=/,mems_allowed=0
[Sat Apr 11 17:52:58 2020] CPU: 7 PID: 2638 Comm: kworker/7:1 Not
tainted 5.4.30 #5
[Sat Apr 11 17:52:58 2020] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[Sat Apr 11 17:52:58 2020] Workqueue: ceph-msgr ceph_con_workfn [libceph]
[Sat Apr 11 17:52:58 2020] Call Trace:
[Sat Apr 11 17:52:58 2020]  <IRQ>
[Sat Apr 11 17:52:58 2020]  dump_stack+0x6d/0x95
[Sat Apr 11 17:52:58 2020]  warn_alloc+0x10c/0x170
[Sat Apr 11 17:52:58 2020]  __alloc_pages_slowpath+0xe6c/0xef0
[Sat Apr 11 17:52:58 2020]  ? ip_local_deliver_finish+0x48/0x50
[Sat Apr 11 17:52:58 2020]  ? ip_local_deliver+0x6f/0xe0
[Sat Apr 11 17:52:58 2020]  ? tcp_v4_early_demux+0x11c/0x150
[Sat Apr 11 17:52:58 2020]  __alloc_pages_nodemask+0x2f3/0x360
[Sat Apr 11 17:52:58 2020]  page_frag_alloc+0x129/0x150
[Sat Apr 11 17:52:58 2020]  __napi_alloc_skb+0x86/0xd0
[Sat Apr 11 17:52:58 2020]  page_to_skb+0x67/0x350 [virtio_net]
[Sat Apr 11 17:52:58 2020]  receive_buf+0xe47/0x16c0 [virtio_net]
[Sat Apr 11 17:52:58 2020]  virtnet_poll+0xf2/0x364 [virtio_net]
[Sat Apr 11 17:52:58 2020]  net_rx_action+0x265/0x3e0
[Sat Apr 11 17:52:58 2020]  __do_softirq+0xf9/0x2aa
[Sat Apr 11 17:52:58 2020]  irq_exit+0xae/0xb0
[Sat Apr 11 17:52:58 2020]  do_IRQ+0x59/0xe0
[Sat Apr 11 17:52:58 2020]  common_interrupt+0xf/0xf
[Sat Apr 11 17:52:58 2020]  </IRQ>
[Sat Apr 11 17:52:58 2020] RIP: 0010:__check_object_size+0x40/0x1b0
[Sat Apr 11 17:52:58 2020] Code: 4c 8d 34 3e 41 55 41 54 44 0f b6 ea
49 8d 46 ff 53 49 89 f4 48 89 fb 48 39 c7 0f 87 48 01 00 00 48 83 ff
10 0f 86 54 01 00 00 <e8> 0b fe ff ff 85 c0 74 27 78 0f 83 f8 02 7f 0a
5b 41 5c 41 5d 41
[Sat Apr 11 17:52:58 2020] RSP: 0018:ffffb96c00c479f8 EFLAGS: 00000286
ORIG_RAX: ffffffffffffffdc
[Sat Apr 11 17:52:58 2020] RAX: ffff92b9f36a7aca RBX: ffff92b9f36a7ab6
RCX: ffffb96c00c47c58
[Sat Apr 11 17:52:58 2020] RDX: 0000000000000001 RSI: 0000000000000015
RDI: ffff92b9f36a7ab6
[Sat Apr 11 17:52:58 2020] RBP: ffffb96c00c47a18 R08: 0000000000000000
R09: 0000000000000000
[Sat Apr 11 17:52:58 2020] R10: 0000000000000015 R11: ffff92bbea8957f8
R12: 0000000000000015
[Sat Apr 11 17:52:58 2020] R13: 0000000000000001 R14: ffff92b9f36a7acb
R15: 0000000000000005
[Sat Apr 11 17:52:58 2020]  simple_copy_to_iter+0x2a/0x50
[Sat Apr 11 17:52:58 2020]  __skb_datagram_iter+0x162/0x280
[Sat Apr 11 17:52:58 2020]  ? skb_kill_datagram+0x70/0x70
[Sat Apr 11 17:52:58 2020]  skb_copy_datagram_iter+0x40/0xa0
[Sat Apr 11 17:52:58 2020]  tcp_recvmsg+0x6f7/0xbc0
[Sat Apr 11 17:52:58 2020]  ? aa_sk_perm+0x43/0x190
[Sat Apr 11 17:52:58 2020]  inet_recvmsg+0x64/0xf0
[Sat Apr 11 17:52:58 2020]  sock_recvmsg+0x66/0x70
[Sat Apr 11 17:52:58 2020]  ceph_tcp_recvmsg+0x6f/0xa0 [libceph]
[Sat Apr 11 17:52:58 2020]  read_partial.isra.26+0x50/0x80 [libceph]
[Sat Apr 11 17:52:58 2020]  read_partial_message+0x198/0x7c0 [libceph]
[Sat Apr 11 17:52:58 2020]  ceph_con_workfn+0xa6a/0x23d0 [libceph]
[Sat Apr 11 17:52:58 2020]  ? __dir_lease_try_check+0x90/0x90 [ceph]
[Sat Apr 11 17:52:58 2020]  process_one_work+0x167/0x400
[Sat Apr 11 17:52:58 2020]  worker_thread+0x4d/0x460
[Sat Apr 11 17:52:58 2020]  kthread+0x105/0x140
[Sat Apr 11 17:52:58 2020]  ? rescuer_thread+0x370/0x370
[Sat Apr 11 17:52:58 2020]  ? kthread_destroy_worker+0x50/0x50
[Sat Apr 11 17:52:58 2020]  ret_from_fork+0x35/0x40
[Sat Apr 11 17:52:58 2020] Mem-Info:
[Sat Apr 11 17:52:58 2020] active_anon:6795 inactive_anon:8914 isolated_anon:0
                            active_file:6364 inactive_file:3914973
isolated_file:0
                            unevictable:0 dirty:25 writeback:0 unstable:0
                            slab_reclaimable:52533 slab_unreclaimable:45778
                            mapped:5289 shmem:564 pagetables:1216 bounce:0
                            free:24896 free_pcp:5276 free_cma:0
[Sat Apr 11 17:52:58 2020] Node 0 active_anon:27180kB
inactive_anon:35656kB active_file:25456kB inactive_file:15659892kB
unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:21156kB
dirty:100kB writeback:0kB shmem:2256kB shmem_thp: 0kB shmem_pmdmapped:
0kB anon_thp: 0kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
[Sat Apr 11 17:52:58 2020] Node 0 DMA free:15908kB min:64kB low:80kB
high:96kB active_anon:0kB inactive_anon:0kB active_file:0kB
inactive_file:0kB unevictable:0kB writepending:0kB present:15992kB
managed:15908kB mlocked:0kB kernel_stack:0kB pagetables:0kB bounce:0kB
free_pcp:0kB local_pcp:0kB free_cma:0kB
[Sat Apr 11 17:52:58 2020] lowmem_reserve[]: 0 3444 15930 15930 15930
[Sat Apr 11 17:52:58 2020] Node 0 DMA32 free:55068kB min:14596kB
low:18244kB high:21892kB active_anon:1940kB inactive_anon:2316kB
active_file:1748kB inactive_file:3386696kB unevictable:0kB
writepending:56kB present:3653608kB managed:3588072kB mlocked:0kB
kernel_stack:16kB pagetables:28kB bounce:0kB free_pcp:11168kB
local_pcp:1552kB free_cma:0kB
[Sat Apr 11 17:52:58 2020] lowmem_reserve[]: 0 0 12485 12485 12485
[Sat Apr 11 17:52:58 2020] Node 0 Normal free:28608kB min:77492kB
low:90720kB high:103948kB active_anon:25240kB inactive_anon:33340kB
active_file:23708kB inactive_file:12273660kB unevictable:0kB
writepending:44kB present:13107200kB managed:12793540kB mlocked:0kB
kernel_stack:7504kB pagetables:4836kB bounce:0kB free_pcp:9936kB
local_pcp:1360kB free_cma:0kB
[Sat Apr 11 17:52:58 2020] lowmem_reserve[]: 0 0 0 0 0
[Sat Apr 11 17:52:58 2020] Node 0 DMA: 1*4kB (U) 0*8kB 0*16kB 1*32kB
(U) 2*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB
(M) 3*4096kB (M) = 15908kB
[Sat Apr 11 17:52:58 2020] Node 0 DMA32: 17*4kB (MEH) 22*8kB (UMH)
18*16kB (MH) 2*32kB (ME) 147*64kB (UME) 70*128kB (UME) 16*256kB (UM)
15*512kB (UME) 4*1024kB (UME) 2*2048kB (M) 4*4096kB (M) = 55316kB
[Sat Apr 11 17:52:58 2020] Node 0 Normal: 187*4kB (UMEH) 62*8kB (UMEH)
280*16kB (UMEH) 318*32kB (UMEH) 98*64kB (UMEH) 45*128kB (UMEH) 4*256kB
(UM) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 28956kB
[Sat Apr 11 17:52:58 2020] Node 0 hugepages_total=0 hugepages_free=0
hugepages_surp=0 hugepages_size=1048576kB
[Sat Apr 11 17:52:58 2020] Node 0 hugepages_total=0 hugepages_free=0
hugepages_surp=0 hugepages_size=2048kB
[Sat Apr 11 17:52:58 2020] 3922168 total pagecache pages
[Sat Apr 11 17:52:58 2020] 279 pages in swap cache
[Sat Apr 11 17:52:58 2020] 279 pages in swap cache
[Sat Apr 11 17:52:58 2020] Swap cache stats: add 4764, delete 4485, find 347/500
[Sat Apr 11 17:52:58 2020] Free swap  = 4019440kB
[Sat Apr 11 17:52:58 2020] Total swap = 4038652kB
[Sat Apr 11 17:52:58 2020] 4194200 pages RAM
[Sat Apr 11 17:52:58 2020] 0 pages HighMem/MovableOnly
[Sat Apr 11 17:52:58 2020] 94820 pages reserved
[Sat Apr 11 17:52:58 2020] 0 pages cma reserved
[Sat Apr 11 17:52:58 2020] 0 pages hwpoisoned
[Sat Apr 11 17:57:01 2020] warn_alloc: 50 callbacks suppressed

On Sat, Apr 11, 2020 at 3:20 PM Jesper Krogh <jesper.krogh@gmail.com> wrote:
>
> Ok, I'll change the mount options and report back in a few days on
> that topic. This is fairly reproducible, so I would expect to see the
> effect if it works.
>
> On Fri, Apr 10, 2020 at 6:32 PM Tony Lill <ajlill@ajlc.waterloo.on.ca> wrote:
> >
> >
> >
> > On 4/10/20 3:13 AM, Jesper Krogh wrote:
> > > > Hi. What is the suggested change? Is it Ceph that has an rsize/wsize
> > > > of 64 MB?
> > >
> >
> > Sorry, set rsize and wsize in the mount options for your cephfs to
> > something smaller.
> >
> > My problem with this is that I use autofs to mount my filesystem.
> > Starting with 4.14.82, after a few mount/unmount cycles, the mount would
> > fail with order 4 allocation error, and I'd have to reboot.
> >
> > I traced it to a change that doubled CEPH_MSG_MAX_DATA_LEN from 16M to
> > 32M. Later in the 5 series kernels, this was doubled again, and that
> > caused an order 5 allocation failure. This define is used to set the max
> > and default rsize and wsize.
> >
> > Reducing the rsize and wsize in the mount option fixed the problem for
> > me. This may do nothing for you, but, if it clears your allocation issue...
> >
> >
> > > On Fri, Apr 10, 2020 at 12:47 AM Tony Lill <ajlill@ajlc.waterloo.on.ca> wrote:
> > >>
> > >>
> > >>
> > >> On 4/9/20 12:30 PM, Jeff Layton wrote:
> > >>> On Thu, 2020-04-09 at 18:00 +0200, Jesper Krogh wrote:
> > >>>> Thanks Jeff - I'll try that.
> > >>>>
> > >>>> I would just add to the case that this is a problem we have had on a
> > >>>> physical machine - but too many "other" workloads at the same time -
> > >>>> so we isolated it off to a VM - assuming that it was the mixed
> > >>>> workload situation that did cause us issues. I cannot be sure that it
> > >>>> is "excactly" the same problem we're seeing but symptoms are
> > >>>> identical.
> > >>>>
> > >>>
> > >>> Do you see the "page allocation failure" warnings on bare metal hosts
> > >>> too? If so, then maybe we're dealing with a problem that isn't
> > >>> virtio_net specific. In any case, let's get some folks more familiar
> > >>> with that area involved first and take it from there.
> > >>>
> > >>> Feel free to cc me on the bug report too.
> > >>>
> > >>> Thanks,
> > >>>
> > >>
> > >> In 5.4.20, the default rsize and wsize are 64M. This has caused me page
> > >> allocation failures in a different context. Try setting it to something
> > >> sensible.
> > >> --
> > >> Tony Lill, OCT,                     ajlill@AJLC.Waterloo.ON.CA
> > >> President, A. J. Lill Consultants               (519) 650 0660
> > >> 539 Grand Valley Dr., Cambridge, Ont. N3H 2S2   (519) 241 2461
> > >> -------------- http://www.ajlc.waterloo.on.ca/ ---------------
> > >>
> > >>
> > >>
> >
> > --
> > Tony Lill, OCT,                     ajlill@AJLC.Waterloo.ON.CA
> > President, A. J. Lill Consultants               (519) 650 0660
> > 539 Grand Valley Dr., Cambridge, Ont. N3H 2S2   (519) 241 2461
> > -------------- http://www.ajlc.waterloo.on.ca/ ---------------
> >
> >
> >

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: 5.4.20 - high load - lots of incoming data - small data read.
  2020-04-11 18:07                       ` Jesper Krogh
@ 2020-04-12 11:53                         ` Jeff Layton
  0 siblings, 0 replies; 21+ messages in thread
From: Jeff Layton @ 2020-04-12 11:53 UTC (permalink / raw)
  To: Jesper Krogh, Tony Lill; +Cc: ceph-devel

I doubt changing rsize/wsize will make any difference here. This is an
order:0 allocation. It's trying to allocate 2^0 pages, or a single page,
and failing.

The problem that Tony described may have been due to memory
fragmentation. The kernel needed a set of contiguous pages and it
couldn't find one. That really shouldn't be a problem here since it just
needs a single page.
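
For concreteness, an order-n request asks the buddy allocator for 2^n
physically contiguous pages (4 kB each here), so the sizes in play work
out as below - just a shell sketch, nothing Ceph-specific:

  # order-n allocation = 2^n contiguous 4 kB pages
  for order in 0 4 5; do
      echo "order $order = $((4 << order)) kB ($((1 << order)) pages)"
  done
  # -> order 0: 4 kB (this report); order 4: 64 kB (Tony's 4.14.x failures);
  #    order 5: 128 kB (Tony's 5.x failures)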

Either the box was _really_ out of memory when it went to do the
allocation, and freed some up just before the Mem-Info summary could be
printed, or the memory that was available was not suitable to the
allocator for some reason.

It's not clear to me which it was, which was why I was suggesting we
involve the network developers.
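
If it's easy to reproduce, one thing that might help narrow that down is
a look at per-order free memory while the backup is running - just a
suggestion, standard procfs interfaces, nothing Ceph-specific:

  # free blocks per zone, one column per order (0 through 10)
  watch -n1 cat /proc/buddyinfo
  # per-zone free pages vs. the min/low/high watermarks the allocator checks
  grep -A5 'pages free' /proc/zoneinfo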

-- Jeff

On Sat, 2020-04-11 at 20:07 +0200, Jesper Krogh wrote:
> Ok, I set the rsize/wsize mount options to 32 MB and, after a fresh VM
> reboot, I got a page-allocation failure again. This time it was the
> first kernel stack trace after the reboot - there were no "hung task
> for more than 120 seconds" warnings beforehand.

-- 
Jeff Layton <jlayton@kernel.org>

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2020-04-12 11:53 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-04-06  8:04 5.4.20 - high load - lots of incoming data - small data read Jesper Krogh
2020-04-06  8:53 ` Yan, Zheng
2020-04-06  9:09   ` Jesper Krogh
2020-04-06  9:18     ` Jesper Krogh
2020-04-06 10:23       ` Jesper Krogh
2020-04-06 11:58         ` Jeff Layton
2020-04-06 11:49     ` Jeff Layton
2020-04-06 11:45 ` Jeff Layton
2020-04-06 13:17   ` Jesper Krogh
2020-04-06 19:45     ` Jeff Layton
     [not found]       ` <CAED-sicefQuJ3GGWh2bBXz6R=BFr1EQ=h6Sz-W_mh3u+-tO0wA@mail.gmail.com>
2020-04-09 14:06         ` Jeff Layton
2020-04-09 16:00           ` Jesper Krogh
2020-04-09 16:30             ` Jeff Layton
2020-04-09 22:47               ` Tony Lill
2020-04-10  7:13                 ` Jesper Krogh
2020-04-10 16:32                   ` Tony Lill
2020-04-11 13:20                     ` Jesper Krogh
2020-04-11 18:07                       ` Jesper Krogh
2020-04-12 11:53                         ` Jeff Layton
2020-04-09 17:50 ` Jesper Krogh
2020-04-11 17:48   ` Jesper Krogh
